(EDA) - Data Visualization and Summary Statistics
Welcome to the third notebook of the exploratory data analysis (EDA) series. This notebook is a continuation of the rideshare notebook you used last week.
For this notebook you will use the data on ridesharing in the year 2022 in the city of Chicago, which can be found here. You have already worked with this dataset in the previous week. This time you will continue working on the cleaned-up and reduced version of the dataset, which you prepared in the previous notebook.
Learning Objectives:
In this notebook you will use the following concepts from the course in a practical setting:
- Probability
- Descriptive statistics (mean, median, standard deviation and quartiles)
- Box plots
- Joint distribution
- Marginal distribution
- Correlation
Import the Python Libraries
As usual, the first thing you need to do is import the libraries that you will use in this notebook.
Loading the Dataset
The next step is to open the dataset. This is the reduced and cleaned-up version that you used in the previous notebook.
Investigate the Summary Statistics
To get a better grasp of the data it is very useful to learn a bit more about the values in each column. In the previous notebook you have already plotted some histograms of individual columns to see how the data is distributed. Now it's time to approach this more systematically. Let's look at the numeric values first. Pandas has a very useful function .describe()
, which returns a new dataframe with summary statistics for each of the columns. Run the cell below to compute and display summary statistics for your dataset. The output is a new dataframe that contains the count, mean, standard deviation, minimum value, maximum value and 25%, 50% (median) and 75% quartiles for each of the columns. By now, you should be familiar with all of these statistics. If you need a refresher, check out the Week 2 Lesson 1 videos again.
In the dataframe above, you can find a lot of useful information. Carefully inspect the contents of the table to understand the data better. The following questions may help you think about the insights you can get from the summary statistics presented in the table.
- Check the minimum and maximum value for each column. What are their values and how far apart are they? For example: What is the shortest and longest trip that was taken and what is the difference between them?
- What is mean value of each column? For example: What is the mean trip length? Is it closer to the minimum or maximum value?
- What is the standard deviation of each column? For example: how much do the trip lengths vary?
- Compare the quartiles and the mean. Is the mean above or below the median? How could you explain this (think of the shape of the histograms you saw in the previous notebook and look for long or heavy tails).
Note that the first three questions all relate to the same column trip_miles
. You can ask the same question about any column, for example: What is the mean/median/highest/lowest tip?
Visualize the Summary Statistics Using Boxplots
A great way to understand your data is to visualize it! A commonly used tool to display summary statistics is a boxplot. Fortunately, it is already integrated to Pandas
, and you can simply draw it by using DataFrame.boxplot()
. Remember that the box-plot involves infromation about all the quartiles and the maximum and minumum values of the variable, and it looks something like this
As with other plots, if you do not specify which variable, or column, you want to plot, it will plot all of them. Because the columns have very different values, it is better to plot one by one, so you can more easily see the information being communicated.
The blue box shows the interquartile range (IQR), and the horizontal blue lines in the box show the Q1, Q2 (median), and Q3 quartiles. You can see these three numbers in the dataframe above (25%, 50% and 75%). Check whether they align with the plot. The horizontal black lines outside the box show +/- 1.5 times IQR, which is the default range used to identify outliers. The individual datapoints plotted outside the lines are the outliers.
In the case of fare
you can see a lot of outlier points, can you figure out why? Remember the distribution of this variable
As you can see, this distribution is really heavy for smaller values, and has a really long tail, which is the values you are seeing as outliers.
Go ahead and change the column_to_plot
variable and plot the boxplots for other columns. Do they all have outliers only on one side? can you infer why?
Visualize the Data on different weekdays
If you want to split the data into subsets (for example for given days of the week) and plot a boxplot for each, you can easily do that using the parameter by
. You just need to set it to the column name you want to use to split the data. Suppose you want to analize the tip
variable by day of the week. Intuitively, what you are doing is creting classes according to the different days in weekday
, and then analyzing the data for each of this classes. This way what you are actually doing is exploring the conditional distributions of a variable. In this case, you are looking at the tip given that it's a Monday, the tip given that it's a Tuesday, and so on.
If you look closely, there are a couple of things that look a bit off in these boxplots. For example, what is happening with the Sunday data when you are looking at tips? It seems that there is no box in the boxplot. Also if you look more closely, the boxes that you see dont have a horizontal line in the middle. Why do you think this is happening? You can find a hint by thinking about the histogram of the tips that you saw in the previous notebook. Did the distribution have any interesting properties? If you don't remember what it looked like, feel free to create a new cell and draw the histogram.
Another, more obvious, hint is in the summary statistics. Run the cell below to display them again, but this time you will do it just for the tip
column, grouped by weekday
.
You can see that there are a lot of zeroes in this dataframe. Where do they all come from? It turns out that on Sunday more than 75% of the people did not tip. This makes the first three quartiles all show a zero, which is why you couldn't see a box on the boxplot. Remember that the size of the box is the distance between the values in the columns 25%
and 75%
. If these are both zero, the size of the box must also be zero.
On the other days, however, less than 75% of the people did not tip, thus you have a nonzero value of the tip in the third quartile (in the column with name 75%). What does it mean for the plots? You have guessed it: the boxes appear. However, you still can't see the middle line and that is because the median (50% of the data) is the same as the first quaritle (25%), and thus the two lines overlap at zero.
An interesting insight: on Sunday, all of the tips are outliers.
Now, let's repeat the process but leave out non-tippers. This should give you a better understanding of the distribution of the actual tips.
This plot gives you a better insight into the distribution of the tips that actually happened, however it misses an important piece of information: how many people do actually tip? You have calculated this already in the previous notebook.
Imagine you are a driver. Is there any day of the week that you would like to drive more, as the tips are higher?
Spliting by day of the week doesn't seem to have much of an impact to make this decision. Maybe you can have a look at the tips given different hours of day. Perhaps there is a higher chance of getitng tipped at a certain hour. Lets see if that's the case by running the cell below. For this, you will need to extract the hour of the trip for the trip_start_timestamp
column, and save it on a new column in the dataframe.
It seems like the tips are slightly higher in the morning hours. But why would the tips be higher in the early morning? Maybe this has to do with the hour of day, perhaps people feel more empathy in the morning. Before you jump to this conclusion, however, let's see if there's any other explanations worth considering.
Let's see if you can get any any extra information by looking at the the lenght of the trip. Plot the length of the trip next to see how it changes through the day:
As you can see the trips are longer in the early morning hours. Now this is getting interesting. Maybe this is the reason for the higher average tips? Run the next cell to plot a scatter plot of trip_miles
vs tip
. Remember that you are only looking at the tippers (no tippers would just contribute to many additional points at tip=0).
Can you say something about the correlation between these two variables? Remember that the correation was a scaled version of the Covariance, that was scaled to have range between -1 and 1
The plot suggests that the riders pay higher tips for longer rides, which would imply a positive correlation. However the correlation is not that obvious, so the correlation will be positive, but not close 1. This correlation makes sense, as longer rides also cost more and thus the tips are likely to be higher.
You can actually measure the correlation between the two variables using Pandas. Let's do it in the cell below!
You got a correlation of 0.637. Remember that 0 correlation means no correlation at all, while correlation 1 means a perfect linear relationship, with positive slope. In this case, as predicted you get a value somewhere in between, being slightly above 0.5. You could say this is a modeate correlation between variables.
Now try changing the variables in the plot above to see other columns. For example, you can plot tip
against fare
. How does it compare with the tip
against trip_miles
plot? Try also checking the correlation and making a comparison.
Check the Locations of Rides
Another thing you might want to know is where the rides usually start. The dataset contains the columns "Pickup Centroid Latitude", "Pickup Centroid Longitude", "Dropoff Centroid Latitude", and "Dropoff Centroid Longitude", which tell you the locations of the pickup and dropoff respectively.
Run the cell below to plot the geographical distribution of the pickup locations.
What you have just plotted is a joint distribution! Joint distributions are distributions where a variable (in your case tip
) is distributed across multiple variables (in your case latitude and longitude).
Looking at the distribution, it seems you have many rides in the middle right of the plot, which is likely to be downtown Chicago. But then there are also quite a few rides in the top left corner, very far away from everything else. What could there be at that location?
To answer this question, you can actually produce an interactive map. Run the following code to do so, plotting the same points on an actual map of Chicago. In reality this cell is only going to plot a limited number of your data points, even from this downsampled data set, to ensure the map renders quickly enough. Check the locations on the map to see where the majority of the points are and what the location in the upper left could be.
Note generating this map is a more resource intensive operation and can sometimes fail. If the map doesn't render after a short wait, you can try re-running the cell.
If you inspect the map carefully, you probably noticed that the rides from the top left corner come from the Chicago O'Hare International Airport. Run the code below to isolate these rides by their latitude and longitude and inspect them.
As you can see the percentage of the people who tip is much higher for the rides that start at the airport. Looks like this is a good place to be as a driver!
Conclusion
Congratulations on finishing this lab. You have seen the implementation of quite a few concepts covered in this course: probabilities, descriptive statistics, such as mean, median, standard deviation and quartiles, you plotted box plots and a 2D histogram to represent a joint distribution and you looked into marginal distributions. On top of that you have practiced Pandas and plotting. If you liked this exercise, look out for another similar notebook next week!