Exploratory Data Analysis - Intro to Pandas
Welcome to the Pandas tutorial lab. This is the first notebook of the exploratory data analysis (EDA) series, where you will get your hands dirty applying the skills you have learned in the course to an actual data problem, similar to those you might encounter in real life! Here you will see and try out some basics of Pandas and get familiar with some of the useful functions that you will use across the other labs and assignments. If you already know Pandas well, feel free to skip this notebook.
For demonstration purposes you will use the World Happiness Report dataset. The dataset consists of 2199 rows, where each row contains various happiness-related metrics for a certain country in a given year. Right now you'll just use this dataset to understand some fundamental operations in Pandas. You will see this dataset again later in week 3, where you will dig deeper into the data and explore relationships to better understand which factors seem to best predict happiness.
This notebook is not a comprehensive guide to Pandas, but rather shows and explains the functions you will use throughout this course. For a more comprehensive guide to Pandas, please see the official tutorial or check the documentation.
Importing the Libraries
The most important library you will need in this notebook is - you guessed it - Pandas. You will also use the Seaborn library for plotting the data. To import the libraries run the cell below.
Importing the Data
Now that you have the Pandas library imported, you'll need to load your dataset. The dataset you will use is saved as a .csv file, and all you need to do to load it is call the function pd.read_csv(filename). If your data is in another format, a variety of other functions can load it; you can check the documentation here.
When you load the dataset, it will be stored as a DataFrame (see the documentation here). This is the most commonly used Pandas data structure, and you will use it throughout this and other notebooks.
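As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV instead of the lab's dataset file. The country names, column names and values are made up for illustration; in the lab you would pass the dataset's filename to pd.read_csv() instead.

```python
import io

import pandas as pd

# A tiny, made-up CSV standing in for the real dataset file.
csv_text = """Country name,year,Life Ladder
Atlantis,2021,5.1
Atlantis,2022,5.3
Utopia,2022,7.2
"""

# pd.read_csv accepts a file path or any file-like object; StringIO lets
# this sketch run without a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))
print(df.shape)  # (number of rows, number of columns)
```

The first line of the CSV is used as the column names by default, which is why the result has three rows rather than four.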
Basic Operations With a Dataframe
View the Dataframe
You can use DataFrame.head() and DataFrame.tail() to view the first or last rows of the frame respectively. By default they show five rows, but you can specify the number of rows you want to see as a parameter. Technically, neither function actually displays anything; each just returns a new dataframe. The dataframe is displayed because Jupyter notebooks show the output of the last line in the cell. You can also display the contents of your dataframe by simply writing df. If your dataframe is too long, only the first and the last few rows will be displayed.
Note that all of this only works if you use it in the last line of code in the cell, because cells automatically display only the output of the last line. If you want to see more than one dataframe by running a single cell, or if you want to perform some other tasks after displaying the dataframe, you should wrap it in print() or display(). The display() function shows the dataframe in the same format as just calling df, whereas print() prints it as plain text.
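The behaviour described above can be sketched on a small made-up dataframe (the column names and values are assumptions for illustration). Because print() is used, the output appears no matter where the lines sit in the cell.

```python
import pandas as pd

# Made-up values, only to have something to display.
df = pd.DataFrame({
    "year": list(range(2010, 2020)),
    "score": [x / 10 for x in range(50, 60)],
})

first_rows = df.head()   # first five rows by default
last_rows = df.tail(3)   # last three rows

# print() works anywhere in a cell and renders plain text.
print(first_rows)
print(last_rows)

# In a Jupyter notebook, display(first_rows) would render the same rows
# as a formatted table. display() is provided by IPython, so it is left
# commented out to keep this sketch runnable as a plain script.
# display(first_rows)
```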
Try commenting and uncommenting the lines below to see how this plays out. Try different combinations of lines.
Now display the last few rows of the dataframe. Pay attention to the additional parameter that specifies the number of rows.
Index and Column Names
In a DataFrame, the data is stored in a two-dimensional grid (rows and columns). The rows are indexed and the columns are named. To see the index or the column names, you can use DataFrame.index or DataFrame.columns respectively.
As you can see, the index is a range of numbers between 0 (inclusive) and 2199 (not inclusive).
Run the cell below to see the column names.
The column names are saved as strings. As you can see, they can include spaces. This can lead to difficulties when accessing the columns (you will see this very soon), so it is a good idea to rename them to get rid of the spaces. A common practice is to replace them with underscores. To rename the columns, you can use DataFrame.rename() and pass the columns you want to rename in a dictionary.
In the next example, you will see how you can automatically replace all spaces with underscores.
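Here is a minimal sketch of inspecting the index and columns and then renaming, using a made-up two-row dataframe. Lowercasing the names in addition to replacing spaces is an assumption, not necessarily what the lab's own cell does.

```python
import pandas as pd

# Made-up data with spaces in the column names.
df = pd.DataFrame({
    "Country name": ["Atlantis", "Utopia"],
    "Life Ladder": [5.1, 7.2],
})

print(df.index)    # a RangeIndex from 0 (inclusive) to 2 (exclusive)
print(df.columns)

# Rename a single column by passing a {old_name: new_name} dictionary.
renamed_one = df.rename(columns={"Country name": "country_name"})

# Rename every column by passing a function that is applied to each name.
renamed_all = df.rename(columns=lambda name: name.replace(" ", "_").lower())
print(renamed_all.columns.tolist())
```

Note that rename() returns a new dataframe and leaves df unchanged, so you need to assign the result if you want to keep it.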
Data Types
One cool thing about the DataFrame type is that its columns can have different dtypes. This is something you simply cannot do with a regular NumPy array. You can look at the dtypes and, if needed, change them.
You can see that the columns above are of different types, and if you compare them to how the data actually looks, the types seem to be correct. Sometimes, if your data is incorrectly formatted, the imported types will be wrong. In this case you will want to change the types of the columns manually before proceeding. Check the code below to see how you can do that. Note that nothing will change after running the code below, as the data is already of the correct types.
df.info() provides some additional information. In addition to the data types, it also tells you the number of non-null values per column.
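To illustrate a case where the imported types are wrong, the sketch below uses made-up data (not the lab's dataset) in which numeric columns arrive as strings. Unlike in the lab, the conversion here does change the dtypes.

```python
import pandas as pd

# Made-up data where the numeric columns were stored as strings.
df = pd.DataFrame({
    "country_name": ["Atlantis", "Utopia"],
    "year": ["2021", "2022"],
    "life_ladder": ["5.1", "7.2"],
})
print(df.dtypes)

# astype() returns a new DataFrame with the requested column types;
# columns not listed in the dictionary keep their current dtype.
df = df.astype({"year": int, "life_ladder": float})
print(df.dtypes)

# info() prints the dtype and the non-null count for each column.
df.info()
```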
Selecting Columns
One way of selecting a single column is to use DataFrame.column_name. Here you can see why it was a good idea to rename the columns so that they do not include any whitespace. This returns a Pandas Series, which is a different data type from a DataFrame. You will see how to return a DataFrame a bit later.
Another way to do this is to use square brackets and the name of the column in quotes, much as you would do when accessing an entry in a dictionary. As with dictionaries, you can use double quotes or single quotes.
Passing a list of labels rather than a single label selects the columns and returns a DataFrame (rather than a Series), with only the selected columns. You can use it to select one or more columns.
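The three ways of selecting columns can be compared side by side. This is a sketch on a made-up dataframe; the column names are assumed to follow the underscore naming used after renaming.

```python
import pandas as pd

df = pd.DataFrame({
    "country_name": ["Atlantis", "Utopia"],
    "year": [2021, 2022],
    "life_ladder": [5.1, 7.2],
})

by_attribute = df.life_ladder            # Series
by_label = df["life_ladder"]             # the same Series
one_column_frame = df[["life_ladder"]]   # DataFrame with a single column
two_columns = df[["country_name", "year"]]  # DataFrame with two columns

print(type(by_attribute).__name__)
print(type(one_column_frame).__name__)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the more general one.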
Selecting Rows
Passing a slice (written with :) selects the matching rows and returns a DataFrame with all the columns of your original dataframe.
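A short sketch of row slicing on made-up data: the slice works by position, just like slicing a Python list, so the end position is excluded.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021, 2022],
    "life_ladder": [5.0, 5.1, 5.2, 5.3, 5.4],
})

# Rows at positions 1, 2 and 3; position 4 is not included.
subset = df[1:4]
print(subset)
```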
Iterating Over Rows
If you want to iterate over the rows, you can use the .iterrows() method. For each row it yields an (index, row) tuple, where the row is a Series object containing the data. Note that this does not preserve the data types (dtypes) across the rows (dtypes are preserved across columns for DataFrames).
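The dtype caveat is easiest to see on a small example. In this sketch (made-up data), the year column is an integer column in the dataframe, but each row Series must hold an integer and a float together, so its values are upcast to float.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2021, 2022],
    "life_ladder": [5.1, 7.2],
})

row_dtypes = []
for index, row in df.iterrows():
    # row is a Series whose labels are the column names.
    row_dtypes.append(str(row.dtype))
    print(index, row["year"], row["life_ladder"])

print(df.dtypes)    # dtypes per column in the DataFrame
print(row_dtypes)   # dtype of each row Series
```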
Boolean Indexing
Now to the more fun part. If you looked carefully at the dataset that was displayed above, you probably saw that the datapoints are available for different years. What if you are interested only in data from a certain year? Or from a certain country? Or perhaps where a value in a certain column is greater than some predetermined value? You can use boolean indexing.
Run the cell below to select rows where the year equals 2022. Try uncommenting some of the other lines to see what they do.
Note that now that you have selected only certain rows, the index does not make much sense anymore because it has a lot of gaps. While this is not a problem, in some cases you might want the index to correspond to the actual row number. To achieve this you can use reset_index(). In other cases you might want to keep the index as it is, to more easily refer back to the original dataframe. It all depends on the context of your project. Run the cell below to reset the index and take a look at the output.
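Boolean indexing and resetting the index can be sketched as follows on made-up data. Passing drop=True to reset_index() is a choice made for this sketch: it discards the old index, whereas the default keeps it as a new column named index.

```python
import pandas as pd

df = pd.DataFrame({
    "country_name": ["Atlantis", "Atlantis", "Utopia", "Utopia"],
    "year": [2021, 2022, 2021, 2022],
    "life_ladder": [5.1, 5.3, 7.0, 7.2],
})

# A comparison on a column yields a boolean Series, one value per row.
mask = df["year"] == 2022

# Indexing with the mask keeps only the rows where it is True.
df_2022 = df[mask]
print(df_2022)  # the index still carries the original labels 1 and 3

# Conditions are combined with & (and) or | (or); each needs parentheses.
happy_2022 = df[(df["year"] == 2022) & (df["life_ladder"] > 6)]

# Renumber the rows from 0 and discard the old index.
df_2022_reset = df_2022.reset_index(drop=True)
print(df_2022_reset)
```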
Summary Statistics
Later in this course you will learn about summary statistics. For now, this is just to show you that Pandas offers a very simple way to calculate all sorts of statistics using describe(). Run the cell below to see a quick statistical summary of your data. It doesn't matter if you don't know what each row means; you will learn all about it in the coming weeks.
Not all of the summary statistics always make sense. In your case, for example, you are looking at summary statistics across various columns. But are you sure you know what the final numbers actually mean? You have data for many different countries, but are you sure that you have the same number of data points for each country or for each year? Also, the countries can have vastly different populations; is it fair to just average the numbers out?
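One way to start answering those questions is to count the rows per country next to the summary itself. The sketch below uses made-up, deliberately unbalanced data; value_counts() is not mentioned in the lab text and is brought in here only as an illustration.

```python
import pandas as pd

# Made-up data: three rows for one country, one row for the other.
df = pd.DataFrame({
    "country_name": ["Atlantis", "Atlantis", "Atlantis", "Utopia"],
    "year": [2020, 2021, 2022, 2022],
    "life_ladder": [5.0, 5.5, 6.0, 7.5],
})

# describe() summarizes the numeric columns only.
summary = df.describe()
print(summary)

# How many rows does each country contribute to those statistics?
rows_per_country = df["country_name"].value_counts()
print(rows_per_country)
```

Here the overall mean of life_ladder is dominated by the country with more rows, which is exactly the kind of effect the questions above are pointing at.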
Plotting
If you want to plot the data, you can use DataFrame.plot(). By default it uses the index as the x axis and plots all the numeric columns on the y axis. Run the cell below to see the output for your dataframe.
As you can see, in this case the plot is not very useful. The index does not have any specific meaning, and the values of the various columns differ greatly (years are all around 2000, but the values in the other columns are much lower), so you cannot see much in the plot. Try setting some parameters of the .plot() method to see what it allows you to do. You can find the documentation here.
Run the cell below to see a scatter plot with specifically chosen x and y variables. On the x axis is the logarithm of the GDP (measuring wealth), while on the y axis is the life ladder. This column contains an estimate of self-assessed life quality on a scale of 0 to 10, as given by a survey among the people.
You can see that there is some sort of trend between the wealth of a country and the happiness of its population, and it looks like people in wealthier countries are to some extent happier. In week 3, you will explore this kind of relationship further.
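A scatter plot with chosen x and y columns can be sketched like this. The values are made up, the column names log_gdp_per_capita and life_ladder are assumptions about the renamed dataset, and the code requires Matplotlib, which Pandas uses for plotting.

```python
import pandas as pd

# Made-up values for illustration only; they are not taken from the report.
df = pd.DataFrame({
    "log_gdp_per_capita": [7.5, 8.5, 9.5, 10.5, 11.0],
    "life_ladder": [4.0, 4.5, 6.0, 6.5, 7.5],
})

# kind="scatter" needs explicit x and y columns. The call returns a
# Matplotlib Axes object that can be customized further.
ax = df.plot(kind="scatter", x="log_gdp_per_capita", y="life_ladder")
ax.set_title("Life ladder vs. log GDP per capita (made-up data)")
```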
Sometimes it is very insightful to separate the points by color to highlight different characteristics or the points you are most interested in. Take a look at the example below.
You can see that even though higher GDP generally means a higher value on the life ladder, this is not a universal truth. Comparing Slovenia (orange) with Brazil (green), you can see that people in Brazil earn less, but are on average happier than Slovenians through the years.
Another very useful thing you can do with plots is to visualize the distribution of your data. You will learn how to do this in more detail later, but for example you can easily plot a histogram using Pandas. Use DataFrame.hist() on the dataframe you want to plot. Note that if you have many columns in the dataframe, this command will plot a histogram for each of the columns. You can select a single column from the dataframe if you only want to plot that one.
What you see in this histogram is the distribution of values in the "life_ladder" column. What do you think about this distribution at first glance? Are people generally happy about their quality of life? Note that to answer this question properly, you need to dig a bit deeper into the data and understand where each value comes from, as the values are not single data points (single answers by people), but values already aggregated across countries and at various points in time.
You can use other external libraries to easily produce various advanced plots. One such library is Seaborn. You have already imported it at the beginning of this lab using import seaborn as sns. Run the cell below to see one of the many simple and efficient plotting possibilities (you will use this one later on in the other notebooks as well). Since the dataset has many columns, it might take a few seconds to run.
With this kind of plot, you can see pairwise scatter plots for each pair of columns. On the diagonal (where both columns are the same), you don't have a scatter plot (which would only show a line), but a histogram showing the distribution of datapoints.
You can see that both the scatter plots and histograms have very different shapes across columns. Think about various insights you could get from this kind of visualization.
Operations on Columns
Sometimes the values in the columns are not giving you the information that you need, but there is a way to calculate that information from the values you have.
For example you can create a new column, which is a sum of two columns.
Above you can see your dataframe with both new columns. The first one doesn't make much sense; it just adds the year to the life ladder. The second one, however, finds the net difference between positive and negative affect. Perhaps there's an interesting set of patterns between this new column and other columns that you'd now be able to explore. What other columns might you want to calculate? In general, the ability to create new columns using operations on existing columns can be a powerful tool.
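The two new columns described above can be sketched on made-up data as follows. The column names, including the names chosen for the new columns, are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2021, 2022],
    "life_ladder": [5.0, 7.0],
    "positive_affect": [0.75, 0.5],
    "negative_affect": [0.25, 0.25],
})

# Arithmetic on columns works element by element, row by row.
# Adding the year to the life ladder is valid code but not meaningful.
df["year_plus_ladder"] = df["year"] + df["life_ladder"]

# The net difference between positive and negative affect.
df["net_affect_difference"] = df["positive_affect"] - df["negative_affect"]

print(df)
```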
If you want to perform some more advanced operations on columns, you can use DataFrame.apply(), with which you can apply practically any function to a column. Below you can see how to use DataFrame.apply() in various ways. Try to edit my_function to perform an operation of your choice.
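A few ways of using apply() are sketched below on made-up data. The body of my_function here is a placeholder assumption, not the lab's definition. Note that selecting one column with df["name"] gives a Series, so that call is technically Series.apply(), which works element by element, while DataFrame.apply() passes whole columns (or whole rows with axis=1) to the function.

```python
import pandas as pd

df = pd.DataFrame({
    "life_ladder": [5.0, 7.0],
    "positive_affect": [0.75, 0.5],
    "negative_affect": [0.25, 0.25],
})


def my_function(x):
    """Placeholder operation: double the value."""
    return x * 2


# DataFrame.apply: the function receives each selected column as a Series.
column_ranges = df[["life_ladder", "positive_affect"]].apply(
    lambda col: col.max() - col.min()
)

# Series.apply: the function receives one value at a time.
df["ladder_doubled"] = df["life_ladder"].apply(my_function)

# DataFrame.apply with axis=1: the function receives each row as a Series.
df["affect_gap"] = df.apply(
    lambda row: row["positive_affect"] - row["negative_affect"], axis=1
)

print(column_ranges)
print(df)
```

For simple arithmetic like this, operating on the columns directly is shorter and faster; apply() is most useful when the logic cannot be expressed with column operations.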
Conclusion
Congratulations on finishing this lab. If you understand the code above, you are well suited to start working on this week's programming assignment and other labs and assignments throughout the course which use Pandas. If you need a refresher on Pandas in other Exploratory Data Analysis labs, come back to this one and review the skills taught here.