Welcome to the fourth notebook of the exploratory data analysis (EDA) series! In this notebook you will use the World Happiness Report dataset, that you have already seen in the Pandas tutorial notebook. The dataset consists of 2199 rows, where each row contains various hapiness-related metrics for a certain country in a given year.
In the previous video you have learned about one of the most common applications of Maximum Likelihood Estimation (MLE), which is linear regression. Linear regression is a statistical model that is used to estimate a linear relationship between two or more variables. In case of simple linear regression you have one independent (explanatory) variable and one dependent variable (response), while in case of multiple linear regression, you have more than one explanatory variable. You can read more about linear regression on Wikipedia.
In this notebook, you will create your own linear regression model and fit it to one and more explanatory variables to predict the response. You will use an open-source, commercially usable machine learning toolkit called scikit-learn. This toolkit contains implementations of many machine learning and statistical algorithms that you can encounter as a data scientist or machine learning practitioner.
You will work with the "World happiness" dataset that you have already seen in the "Introduction to Pandas" notebook. The first thing you need to do is open the notebook and clean up the data. This code will rename columns so that there are no white spaces and drop any missing values.
Have a closer look at the output of the cell above. The dataset consists of the following columns:
country_name: Name of the country where the data was taken.
year: The year when data was taken.
life_ladder: The average of the estimates of life quality on a scale of 1 to 10 as given by a survey. In the survey people subjectively estimate the quality of their own life.
log_gdp_per_capita: Logarithm of gross domestic product (log GDP) in purchasing power parity (PPP).
social_support: National avarage of responses to the binary question: "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?".
healthy_life_expectancy_at_birth: Life expectancy at birth.
freedom_to_make_life_choices: National avarage to the binary question: "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?".
generosity: Derived from answering the question: "Have you donated money to a charity in the past month?" and GDP.
perceptions_of_corruption: Average of responses to questions about corruption.
positive_affect: Average of answers to three positive affect questions, covering laugh, enjoyment and doing interesting things
negative_affect: Average of answers to three negative affect questions, covering worry, sadness and anger.
Detailed explanations of the columns in the data can be found here.
Before you jump into fitting the linear regression model to your data, it makes sense to visualize it to get the feeling of what you are dealing with. The Seaborn pairplot is a very useful function that automatically plots the scatter plots between each pair of columns in the dataframe, as well as histograms of the values in each column. Run the cell below to visualize the data.
You can see that some of the scatter plots seem quite elongated and might show some significant correlation between the two variables. These pairs may be good candidates for independent-dependent variable pairs. But just looking at the points can be dangerous. You need to have an idea of what you would like to predict and which variables may be a good choice for explanatory variables. Take another look at the column names and think about which ones you would use as explanatory variables and what you would want to predict given this dataset. You can also get some ideas from the official report.
Now that you have a good sense of the data, you can choose your independent (X) and dependent (y) variable. Let's start with the obvious ones: use the GDP per capita to explain the value on the life ladder, which measures the happiness of the people.
In machine learning you would typically not use the same data to train and evaluate your model and that is because you want to know how well the model generalizes to new (previously unseen) data. You can run the cell below to split your data into two groups. It will create the training data, X_train and y_train, as well as the test data, X_test and y_test. Note the uppercase X and lowercase y. This is to emphasize that the X variable is two dimensional, while the y variable is one dimensional. This is for better generalization for when you in fact use more than one explanatory variable. You will see this in action later, using more variables for predicting the happiness.
Now that you have your train and test data, it is time to create your linear regression model. This is actually very simple and can be done in a single line using LinearRegression as shown below.
The first part of the statement, LinearRegression(), is instantiating a linear regression model. You could also set some parameters to this function. The fit(X_train, y_train) method is then performing the actual trainingof the model, fitting its parameters to your training data.
You can write down the equation for the line that you fit to the data as y^=Wx+b, where x is the explanatory variable (or variables) and y^ is your response, while W and b are the parameters that you fit. In the cell below, you can see how to access the parameters W and b of the model you just trained.
now that you have fit your model, it's time to start using it. In machine learning, a model like this is used not only to describe the data that you already have, but more importantly to create predictions based on new datapoints. To assess how well the model works, you have already split the dataset into the train and test sets, so that you can evaluate how the model performs on previously unseen data.
The default way to calculate the predictions on the test set is to use .predict(). You will use .predict() on the whole test set at once. To reinforce how the model actually works, you will compare the predictions given by .predict() with ones calculated "by hand", using the parameters W and b that you extracted in the previous cell.
The results show that at least for the last four rows, the predictions were reasonably close to correct. The average absolute error across the test set is also 0.57 which seems to suggest that while the model can't perfectly predict happiness, it gives a pretty good approximation.
Now you can plot the predictions together with the training data to get a visual feeling of how the model performs. The blue points on the plot are real data from the training set. The orange points on the plot have their x value taken from the test set, but their y value is a prediction created by the model.
You can see that all of the predictions lie on a straight line that fits the data best. The data, however, has a lot more variation and is not perfectly explained just by this one line. As you know there are many more variables in this dataset and some other variables may be able to explain this variation. You will see how this plays out using multiple linear regression in the next section.
Often there is more than just one variable that explains the behavior of other variables. In this case you can use multiple linear regression. In the cell below there is a function defined to make your life easier:
fit_and_plot_linear_model: This function is very similar to the work you already did above. The key difference is that it can take more than one feature at a time, so you will be able to experiment with building models with two or more explanatory variables. In addition it calculates the feature importance. We will not go into detail about it here, but all you need to know is that importance score tells you the relative importance of explanatory variables. The higher the score, the more important the variable is in predicting the outcome.
Check out the code in the cell below if you want to understand better what is happening. Note that the plotting and feature importance has been abstracted away to the utils file, so that it doesnt add unnecessary clutter the code.
Now that you have your functions defined, it is time to run them. The cell below does that in an interactive way using widgets and interact_manual. This allows you to run the function many times without typing anything in but rather changing parameters using the widgets.
In the first line of code you define all of the possible predictor variables and the rest of the code takes care for interactively running the functions. After running the cell below you can select one or more predictors from the list to perform linear regression.
The function will calculate the linear regression and plot the data and the results on a 2D plot. For the x-axis it will automatically choose the variable with the highest feature importance score.
Intrustions to use the widget
To select different variables from the menu just do Ctrl+click on the variables you want to select. Single click will only consider the variable you are clicking on.
To select consecutive variables you can also click on the first variable you want and Shift+click on the last one you want. This will select all variable in between. You can also click on the first one and select with Shift+down arrow.
You see that when you select more explanatory variables, the points do not lie on a straight line in 2D anymore. This is because another variable (which is not on the 2D plot) contributes to the move away from the line. In a higher dimensional space, the points still lie on a straight line.
Congratulations on finishing this lab. If you understand what is happening above, you are well suited to perform linear regression with scikit-learn. Later in this course you will see linear regression again using another Python library.