Exploratory Data Analysis - Intro to Pandas

Welcome to the Pandas tutorial lab. This is the first notebook of the exploratory data analysis (EDA) series, where you will get your hands dirty applying the skills you have learned in the course on an actual data problem, similar to those you might encouter in real life! Here you will see and try out some basics of Pandas and get familiar with some of the useful functions that you will use across the other labs and assignments. If you already know Pandas well, feel free to skip this notebook.

For the demonstration purposes you will use the World Happiness Report dataset. The dataset consists of 2199 rows, where each row contains various hapiness-related metrics for a certain country in a given year. Right now you'll just use this dataset to understand some fundamental operations in Pandas. You will see this dataset again later in week 3, where you will dig deeper into the data and explore relationships to better understand which factors seem to best predict happiness.

This notebook is not a comprehensive guide to Pandas, but rather shows and explains the functions you will use through this course. For a more comprehensive guide on Pandas, please see the official tutorial or check the documentation.

Importing the Libraries

The most important library you will need in this notebook is - you guessed it - Pandas. You will also use the Seaborn library for plotting the data. To import the libraries run the cell below.

# Import the Pandas library
import pandas as pd
# Import the Seaborn library for plotting
#!pip install seaborn
import seaborn as sns

Importing the Data

Now that you have the pandas library imported, you'll need to load your dataset. The dataset you will use is saved as a .csv file and all you need to do to load is call the function pd.read_csv(filename). If you have your data in another format, there exists a variety of functions to load it, you can check the documentation here. When you load the dataset, it will be stored as a DataFrame type (see the documentation here). This is the most commonly used Pandas datastructure that you will use throughout this and other notebooks.

# Load the dataset and save it to the df variable
df = pd.read_csv('data/world_happiness.csv')

Basic Operations With a Dataframe

View the Dataframe

You can use DataFrame.head() and DataFrame.tail() to view the first or last rows of the frame respectively. By default it will show you five rows, but you can specify the number of rows you want to see as a parameter. Technically, neither of the functions actually display anything, but just return a new dataframe. The dataframe is displayed because Jupyter notebooks show the output of the last row in the cell. You can also display the contents of your dataframe by simply writing df. If your dataframe is too long, it will then display only the first and the last few rows.

Note that all of this only works if you use it in the last line of code in the cell, because the cells automatically display the output of the last line. If you want to see more than one dataframe by running a single cell or if you want to perform some other tasks after displaying the dataframe, then you better encapsulate it with print() or display(). display() function will print the dataframe, but with the same format as just calling df, whereas print() will print as plain text.

Try commenting and uncommenting lines below, to see how this plays out. Try different combiations of rows.

# This line will display the first few rows of the dataframe if there are no lines of code after.
df.head()
 
# Try uncommenting different combinations of the lines below.
# print("Cats are cool.")
# print(df.head())
# print(df)
# print("Some more text about cats being cool.")
# display(df)

Now display the last few rows of the dataframe. Pay attention to the additional parameter that specifies the number of rows.

# This line will display only the last two rows of the dataframe.
df.tail(2)

Index and Column Names

In the DataFrame, the data is stored in a two dimensional grid (rows and columns). The rows are indexed and the columns are named. To see the index or the column names, you can use DataFrame.index or DataFrame.columns respectively.

df.index

As you can see, the index is a range of numbers between 0 (inclusive) and 2199 (not inclusive).

Run the cell below to see the column names.

df.columns

The column names are saved as strings. As you can see, they can include spaces. This can lead to difficulties when accessing the columns (you will see this very soon), so it is a good idea to rename them to get rid of the spaces. A common practice is to replace them with underscores. To rename the columns, you can use DataFrame.rename() and pass the columns you want to rename in a dictionary.

In the next example, you will see how you can automatically replace all spaces with underscores

# A dictionary mapping old column names to new column names. In addition to replacing spaces
# with underscores, you will make all of the text lowercase.
columns_to_rename = {i: "_".join(i.split(" ")).lower() for i in df.columns}
# Note that this dictionary is created automatically from the column names.
# You can also create it by hand and rename only the columns you want to rename
# For example, see the commented line below:
# columns_to_rename = {"Country name": "country_name", "Life Ladder": "life_ladder"}
 
# Rename the columns
df = df.rename(columns=columns_to_rename)
# Display the new dataframe
df.head()

Data Types

One cool thing about the DataFrame type is that the columns of the resulting DataFrame can have different dtypes. This is something you simply can not do with a Numpy array. You can look at them and if needed to you can change them.

df.dtypes

You can see that the columns above are of different types and if you compare it to how the data actually looks like, it seems that the types are correct. Sometimes if your data is incorrectly formatted, the imported types will be wrong. In this case you will want to change the types of the columns manually before proceeding. Check the code below on how you can do that. Note that nothing will change after running the code below, as the data is already of correct types.

# List all of the columns that should be floats
float_columns = [i for i in df.columns if i not in  ["country_name", "year"]]
# Change the type of all float columns to float
df = df.astype({i: float for i in float_columns})
# Show the types of all columns
df.dtypes

The df.info() provides some additional information. In addition to data types it also tells you the number of non-null values per column.

df.info()

Selecting Columns

One way of selecting a single column is to use DataFrame.column_name. Here you can see why it was a good idea that you renamed the columns to not include any whitespaces. This returns a Pandas Series, which is a different datatype from a DataFrame. You will see how to return a DataFrame a bit later.

# Select the life_ladder column and store it in x
x = df.life_ladder
 
print(f"type(x):\n {type(x)}\n")
print(f"x:\n{x}")

Another way to do this is to use square brackets and the name of the column in quortes, much as you would do when accessing an entry in a dictionary. As with dictionaries, you can use double quotes or simple quotes.

x = df["life_ladder"]
 
print(f"type(x):\n {type(x)}\n")
print(f"x:\n{x}")

Passing a list of labels rather than a single label selects the columns and returns a DataFrame (rather than a Series), with only the selected columns. You can use it to select one or more columns.

x = df[["life_ladder"]]
# x = df[["life_ladder", "year"]]
 
print(f"type(x):\n {type(x)}\n")
print(f"x:\n{x}")

Selecting Rows

Passing a slice : selects matching rows and returns a DataFrame with all columns in your original dataframe.

df[2:5]

Iterating Over Rows

If you want to iterate over the rows, you can use the .iterrows() method. For each row it yields a (index, row) tuple, where the row is a Series object containing the data. Note that this does not preserve the data types (dtypes) across the rows (dtypes are preserved across columns for DataFrames).

index, row = next(df.iterrows())
row

Boolean Indexing

Now to the more fun part. If you looked carefully at the dataset that was displayed above, you probably saw that the datapoints are available for different years. What if you are interested only in data from a certain year? Or from a certain country? Or perhaps where a value in a certain column is greater than some predetermined value? You can use boolean indexing.

Run the cell below to select rows where the year equals to 2022. Try to uncomment some other row to see what it does.

df[df["year"] == 2022]
# df[df["life_ladder"] > 5] # Select rows where life_ladder > 5
# df[df["life_ladder"] > 11] # This one should return an empty dataframe

Note that now that you selected only the certain rows, the index column does not make much sense anymore because you have a lot of gaps. While this is not a problem, in some cases you might want the index to correspond to the actual row number. To achieve this you can use reset_inex(). In other cases you might want to keep the index as it is to more easily refer back to the original dataframe. It all depends on the context of your project. Run the cell below to reset the index and take a look at the output.

new_df = df[df["year"] == 2022]
new_df = new_df.reset_index(drop=True)
new_df

Summary Statistics

Later in this course you will learn about summary statistics. For now, this is just to show you that Pandas allows for a very simple way to calculate all sorts of statistics using describe(). Run the cell below to see a quick statistical summary of your data. It doesn't matter if you don't know what each row means, you will learn all about it in the coming weeks.

df.describe()

Not all of the summary statistics always make sense. In your case, for example, you are looking at the summary statistics across various columns. But are you sure you know what the final numbers actually mean? You have data for many different countries, but are you sure that you have the same amount of datapoints for each country or for each year? Also the countries can have vastly different populations, is it fair to just average the numbers out?

Plotting

If you want to plot the data, you can use DataFrame.plot(). By default it uses the index as the x axis and plots all the numeric columns as y axes. Run the cell below to see the output for your dataframe.

# If the plot doesn’t render, first try re-running this cell. If that doesn’t work, 
# you can restart the kernel (from the Kernel menu above) and try running the notebook again
df.plot()

As you can see, in this case the plot is not very useful. The index does not have any specific meaning, and the values of various columns differ greatly (years are all around 2000, but the values in the other columns are much lower) and thus you cannot see much in the plot. Try setting some parameters of the .plot() method to see what it allows you to do. You can find the documentation here.

Run the cell below to see a scatter plot with specifically chosen x and y variables. On the x axis there is logarithm of the GDP (measuring the wealth) while on the y axis there is the life ladder. This column contains values which are an estimate of self-assessed life quality on a scale of 1 to 10 as given by a survey among the people.

df.plot(kind='scatter', x='log_gdp_per_capita', y='life_ladder')

You can see that there is some sort of trend between the wealth of the country and the happiness of the population and you can say that it looks like that wealthier people are to some extent happier. In week three, you will explore this kind of relationship further.

Sometimes it is very insightful to separate the points by colors to highlight different characteristics or some points you are most interested in. Take a look at the example below

# Create a dictionary to map the country names to colors
cmap = {
    'Brazil': 'Green',
    'Slovenia': 'Orange',
    'India': 'purple'
}
 
df.plot(
    kind='scatter',
    x='log_gdp_per_capita',
    y='life_ladder',
    c=[cmap.get(c, 'yellow') for c in df.country_name], # Set the colors
    s=2 # Set the size of the points
    )

You can see that even though in general higher GDP means higher value on the life ladder, this is not an universal truth. Comparing Slovenia (orange) with Brazil (green), you can see that people in Brazil earn less, but are on average happier than Slovenians through the years.

Another very useful task you can do with plots is to visulize the distribution of your data. You will learn how to do this in more detail later, but for example you can easily plot a histogram using Pandas. Ise DataFrame.hist() on the dataframe you want to plot. Note that if you have many columns in the dataframe, this command will plot a histogram for each of the columns. You can select a single column from the dataframe if you only want to plot that one.

df.hist("life_ladder")

What you see in this histogram is a distribution of values in the "life_ladder" column. What do you think about this distribution on the first glance? Are the people generally happy about their quality of life? Note that to answer this question properly, you need to dig a bit deeper into the data: understand where each value comes from, as the values are not single datapoints (single answers by people), but already aggregated values across countries and at various points in time.

You can use other external libraries to easily produce various advanced plots. One of such libraries is Seaborn. You have already imported it at the beginning of this lab using import seaborn as sns. Run the cell below to see one of the many simple and efficient plotting possibilities (you will use this one later on in the other notebooks as well). Since the dataset has many columns it might take a few seconds to run.

# If the plot doesn’t render, first try re-running this cell. If that doesn’t work, 
# you can restart the kernel (from the Kernel menu above) and try running the notebook again
sns.pairplot(df)

With this kind of plot, you can see pairwise scatter plots for each pair of columns. On the diagonal (where both columns are the same), you don't have a scatter plot (which would only show a line), but a histogram showing the distribution of datapoints.

You can see that both the scatter plots and histograms have very different shapes across columns. Think about various insights you could get from this kind of visualization.

Operations on Columns

Sometimes the values in the columns are not giving you the information that you need, but there is a way to calculate that information from the values you have.

For example you can create a new column, which is a sum of two columns.

# Create a new column which is the sum of the year and the value on the life ladder.
df["this_column_makes_no_sense"] = df["year"] + df["life_ladder"]
# Create a new column which is the difference of two columns.
df["net_affect_difference"] = df["positive_affect"] - df["negative_affect"]
 
df.head()

Above you can see your dataframe with both new columns. The first one doesn't make much sense, it's just adding the year to the life ladder. The second one, however, find the net difference between positive and negative affect. Perhaps there's an interesting set of patterns between this new column and other columns that you'd now be able to explore. What other columns might you want to calculate? In general, the ability to create new columns using operations on existing columns can be a powerful tool.

If you want to perform some more advanced operations on columns, you can use DataFrame.apply(), with which you can apply practically any function to a column. Below you can see how to use the DataFrame.apply() in various ways. Try to edit my_function to perform an operation of your choice.

# Using df.apply() with a lambda function
# Rescale the life_ladder column to values between 0 and 1 and save it to a new column
df['life_ladder_rescaled'] = df['life_ladder'].apply(lambda x: x / 10)
 
# Using df.apply() with your own function
# First define a function. The function can do whatever you want. This example will double the column's values
def my_function(x):
    # do stuff to x
    y = x * 2
    return y
# Apply the function.
df['my_function'] = df['life_ladder'].apply(my_function)
 
# Show the new dataframe
df.head()

Conclusion

Congratulations on finishing this lab. If you understand the code above, you are well suited to start working on this week's programming assignment and other labs and assignments throughout the course which use Pandas. If you need a refresher on Pandas in other Exploratory Data Analysis labs, come back to this one and review the skills taught here.

Exploratory Data Analysis - Intro to Pandas

On this page