Regression analysis can be used to search for relationships among observed variables. I wanted to find out whether the unemployment rate and housing prices are related. I also wanted to find out whether the number of yearly sunshine hours and housing prices are related.

I found two data files on the French government OpenData platform to start with. The first one has the number of active and unemployed people per town in France in 2017, in the columns P17_ACT1564 and P17_CHOM1564 respectively. The first column of the file contains the town code, which is similar to a zip code. The snippet below shows that town 01001 has about 380 active people, meaning people between the ages of 15 and 64 who are able to work. Among those people, about 33 are unemployed.


The second file has the housing prices in euros per square meter per town in 2017. The column with the price is the last one: Prixm2. The snippet below shows that housing prices in town 01001 average 1595 euros per square meter.

Commune simple,"01001",L'Abergement-Clémenciat,"2",AIN,"01",AUVERGNE-RHONE-ALPES,"84","200069193",L'ABERGEMENT-CLEMENCIAT,"1595"

File references are at the end of this post. The first thing I need to build is some training data to use with a linear regression model. The following Python code builds a dictionary where the key is the town code, and the value contains two elements: the unemployment rate as a percentage and the housing price per square meter. A town might be missing some data in one of the two files, so we only keep towns with complete data. The content of the dictionary is written to a file called “train.csv”.

import csv

towns = dict()

with open('population.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    for row in reader:
        try:
            town_code, unemployed, active = row[0], float(row[22]), float(row[23])
            unemployment = unemployed / active * 100
            towns[town_code] = [unemployment]
        except (ValueError, ZeroDivisionError):
            # Skip the header row and towns with missing or zero counts
            continue

with open('price.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        town_code, price = row[1], row[10]
        if town_code in towns:
            try:
                towns[town_code].append(float(price))
            except ValueError:
                # Skip towns whose price is missing or not numeric
                pass

towns = {k:v for (k, v) in towns.items() if len(v) == 2}

with open('train.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    for k, v in towns.items():
        writer.writerow([k, v[0], v[1]])

The output content stored in train.csv looks like this:

01001 8.788710157169419 1595.0
01002 8.130081300812977 1239.0
01004 14.854066827971835 1843.0
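
As a quick sanity check, the rate in the second column is simply the unemployed count divided by the active count, times 100. The sketch below uses the rounded counts quoted earlier for town 01001 (about 33 unemployed out of about 380 active). The helper name `unemployment_rate` is only illustrative. The result is close to, but not exactly, the 8.79 stored in train.csv, presumably because the counts in the source file are not rounded.

```python
def unemployment_rate(unemployed, active):
    """Return the unemployment rate as a percentage, or None if nobody is active."""
    if active == 0:
        return None
    return unemployed / active * 100


# Rounded counts for town 01001, as quoted in the text
print(round(unemployment_rate(33, 380), 2))  # prints 8.68
```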

The next step is to train our model. Once trained, our model will give us the slope of our linear function and the intercept value, which is the value of y when x = 0. Here, x is the unemployment rate, and y is the housing price.

import csv
import numpy as np
from sklearn.linear_model import LinearRegression

# Unemployment rates (x) and housing prices (y)
x, y = [], []

with open('train.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
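    for row in reader:
        unemployment, price = float(row[1]), float(row[2])
        x.append(unemployment)
        y.append(price)

# scikit-learn expects x as a 2D array: one row per town, one column per feature
x = np.array(x).reshape((-1, 1))
y = np.array(y)

model = LinearRegression().fit(x, y)
print('slope:', model.coef_[0])
print('intercept:', model.intercept_)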

The output of this script shows that the slope is approximately -9 and the intercept is approximately 1526. A slope of -9 means the predicted housing price goes down by about 9 euros per square meter each time the unemployment rate goes up by one percentage point. That is a small slope next to an intercept of about 1526 euros per square meter: even a 10-point rise in unemployment lowers the predicted price by only about 90 euros, or roughly 6%. Looking at that, it seems the unemployment rate alone does not really explain housing prices. For example, many towns close to the sea have a high unemployment rate and are still quite expensive.
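
To make the slope and intercept concrete, the fitted line can be evaluated by hand. This is a minimal sketch that hard-codes the rounded values reported above (-9 and 1526) instead of reading them from the trained model. The function name `predict_price` is only illustrative.

```python
# Rounded values reported in the text, not the exact fitted coefficients
SLOPE = -9.0
INTERCEPT = 1526.0


def predict_price(unemployment_rate):
    """Predicted price in euros per square meter for a rate given in percent."""
    return SLOPE * unemployment_rate + INTERCEPT


for rate in (5, 10, 15):
    print(rate, predict_price(rate))
# prints:
# 5 1481.0
# 10 1436.0
# 15 1391.0
```

Tripling the unemployment rate from 5% to 15% moves the prediction by only 90 euros per square meter, which is why the slope reads as small.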


Here is a plot of the linear function over the observations.

Are sunshine and housing prices related? I found the number of yearly sunshine hours per county for 2017. The number of sunshine hours is a precise measurement made with an instrument called a heliograph. A county in France is called a “département,” and there are about 100 of them. Here is a sample of data where the first column is the county ID, and the second column is the number of sunshine hours in 2017. The first row has the data for Corsica (a sunny French island in the Mediterranean Sea), where there were 3083 hours of sunshine in 2017. The last row is a county in the north of France with much less sunshine.

20 3083
06 3032
83 2932
04 2895
84 2894
59 1501

We can build our training data the same way as we did earlier, which gives the following rows: county ID, number of sunshine hours, and average housing price. The first row shows that the sunny county 06, where Nice is located, averages about 2700 euros per square meter. County 04, in the Provençal Alps, is also sunny but cheaper at about 1750 euros per square meter.

06 3032 2718.593103448276
83 2932 2407.4189189189187
04 2895 1757.1453488372092
84 2894 2223.9517241379313
30 2852 1820.7005988023952
59 1501 1457.3432601880877
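
The aggregation behind these rows can be sketched as follows. This is only one plausible way to do it, under two assumptions that the post does not state: the county ID is taken to be the first two characters of the town code, and the county price is the plain average of its town prices. The function name `county_training_rows` and the town prices in the usage example are made up for illustration.

```python
from collections import defaultdict


def county_training_rows(town_prices, sunshine_hours):
    """Join per-town prices with per-county sunshine hours.

    town_prices: {town_code: price per square meter}
    sunshine_hours: {county_id: yearly sunshine hours}
    Returns a list of (county_id, hours, average_price) tuples.
    """
    prices_by_county = defaultdict(list)
    for town_code, price in town_prices.items():
        # Assumption: the first two characters of the town code identify the county
        prices_by_county[town_code[:2]].append(price)

    rows = []
    for county_id, hours in sunshine_hours.items():
        prices = prices_by_county.get(county_id)
        if prices:  # keep only counties that have at least one priced town
            rows.append((county_id, hours, sum(prices) / len(prices)))
    return rows


# Made-up town prices, real sunshine hours from the sample above
example = county_training_rows(
    {'06001': 2000.0, '06002': 3000.0, '59001': 1500.0},
    {'06': 3032, '59': 1501, '83': 2932},
)
print(example)  # [('06', 3032, 2500.0), ('59', 1501, 1500.0)]
```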

The next step is to train the linear regression model and output the slope and intercept. The slope is small here too: an increase of 100 yearly sunshine hours corresponds to an increase of about 5 euros per square meter, that is, a slope of about 0.05 euros per square meter per hour of sunshine. Sunshine alone does not explain housing prices much. This makes sense, since a town can be expensive because of other factors and still be located in the northern part of France, where sunshine is less frequent.
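
The size of a slope depends on the units of x, so a correlation coefficient is a useful complement when judging how strongly two variables are related. The sketch below computes the Pearson correlation in plain Python on the six sample rows shown earlier. This is an extra check that is not part of the original scripts, and the helper name `pearson_r` is illustrative. Six hand-picked rows are not the full dataset, so the value it prints should not be read as the result for all counties.

```python
from math import sqrt

# Six sample rows from the text: (sunshine hours, average price per square meter)
rows = [
    (3032, 2718.593103448276),
    (2932, 2407.4189189189187),
    (2895, 1757.1453488372092),
    (2894, 2223.9517241379313),
    (2852, 1820.7005988023952),
    (1501, 1457.3432601880877),
]


def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson_r(rows)
print('r =', r, 'r squared =', r * r)
```

A value of r close to 0 means a weak linear relationship, and a value close to 1 or -1 means a strong one, whatever the units of the two variables.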


Data sources used in this post:

Last modified: January 24, 2021