Birthday Problems
Welcome! During this lab you will reinforce the notion of how counter-intuitive probabilities can be by taking a look at the famous birthday problem. In fact you will take a look at 4 variations of this problem. You can use one you have already seen the solution for, to try and come up with the solution for the next one, the results might surprise you!
Let's get started!
Packages
Introduction to the problem
All of these problems share a similar setting. You have a classroom full of students (the number may vary) and want to know the probabilities of two students having the same birthday or of any student having a particular birthday, anything along those lines. As mentioned before, you will see 4 variations of the problem.
You can think of these problems in two ways:
- What is the minimum number of students
n
that need to be in the classroom to have a matching birthday with a given probability? - Given
n
what is the probability of having a match?
Both ways model the situation from different angles but they are essentially covering the same.
Play the game of matching your birthday
To further motivate this situation a game is presented. You can use the following cell to run an interactive game, it is very simple to use: you need to select your birthday (the year does not matter) in the dropdown widget and then you can click the Simulate!
button to randomly create students until one of them has the same birthday as you. The left plot shows you the history of the result for each simulation and the right plot shows you the same information in a histogram so you can see how this variable distributes.
You can try this for as long as you want so you get a sense of the probability distribution for this process (it is recommended to try it for at least 30 runs):
First Problem
The first problem tries to answer the question: given a pre-defined date, what is the value of n
such that the probability of having a match is greater than or equal to 0.5?
Before taking a look at the analytical solution you will try to solve it by creating simulations with Python. For this purpose you can use the simulate
helper function provided in the next cell. Run it to load this function which will be used shortly:
This function returns the simulated probability for a given problem when you pass to it the number of students and the number of simulations and it has these two properties:
- The higher the number of students the higher the probability of a match.
- The higher the number of simulations the more accurate the simulated probability will be (This fact will be discussed further in Week 3. This is known as the Central Limit Theorem).
This is pretty cool but how do you use this helper function? You need to pass another function to it that models the situation at hand. This other function should have two criteria so that simulate
works as expected:
- It should receive the number of students as input.
- It should return True if there was a match or False otherwise.
You can create a function that models problem number 1 by running the following cell:
Now you can use these two functions in conjuction to get the probability of a match for a given classroom size. Notice that you can tweak the value of the n
variable to simulate classrooms with different number of students. Also notice that this time the simulation is run 10,000 times instead of the default 1000. This gives you a more accurate simulated probability at the expense of taking longer to execute:
This is very cool but it still has one major drawback: you would need to try a bunch of values for n
before arriving at the solution. Instead of taking this approach you can generate a plot that shows the simulated probability as a function of the number of students in the classroom:
Remember that this approach is a simulation and thus you are generating simulated (or approximated) probabilities. Because of this the curve is not completely smooth and you will get slightly different values every time you run the simulation.
Analytical Solution
Now that you have built a stronger intuition, let's calculate explicitily the probability that at least one student in the room has the birthday the same as the pre-defined date. It is clear that , i.e., the value for depends on the number of students in the room and, as become large, must become closer to . With a formula for we can then find the minimum such that . Let's suppose that a year has days.
Let's consider the pre-defined birthday and suppose a student is selected at random.
Defining the event as the -th student has birthday in the day . Then , because there are equally likely possibilities for their birthday. So, using the complement rule, the probability that this student's birthday isn't day is .
Note that this probability is the same for any student and we can fairily assume that each student's birthday is independent from each other. Consider the event the desired event, i.e., at least one student has birthday in day . Note that:
is the probability that no student has birthday in day and this is the same as:
- Student has birthday in a day different than D, AND
- Student has birthday in a day different than D, AND, ...
- Student has birthday in a day different than D.
With our definitions, this is just Therefore
As you've expected, . Now, you are ready to answer the question: for wich value of , ?
Well, is equivalent to
Now, using a calculator we can easily find that and , the last inequality becomes
which is equivalent to .
Second Problem
The second problem is very similar to the first one, with the difference that the predefined value is not previously defined but it is drawn from one of the students at random so it can be worded like this: given a classroom with n
students, if you draw any student at random what is the value of n
such that the probability of having a match with another student is greater than or equal to 0.5?
You can reuse the simulate
helper function defined earlier so the only thing left is to code the function that models this particular problem. But before doing this try to come up with a hypothesis about the result. What do you think will happen? Will n
be similar to the previous one or do you need a higher value? What about a smaller value?
Run the next cells to find out!
Analytical Solution
Note that this problem is very similar to the first one. The difference is that, instead of selecting one day at random from the room, you select a student and fix their birthday . You can reduce this problem to the previous one by removing this student from the room and considering now a room with students. The problem now is analogous to the previous one but you will end up with a instead of . Therefore, it is easy to see that, in this case,
With some calculations you get that if and only if .
Third Problem
The third one is the most famous of all the birthday problems and it was covered in the lectures in a sligthly different way.
This time you don't want to find a match with a predefined value but rather to find a match between any two birthdays, it can be worded like this: given a classroom with students, what is the value of n
such that the probability of having a match is greater than or equal to 0.5 for any two students?
Note that, in the lectures, it was calculated the probability that no students share a birthday. Here, you are dealing with the case that, at least two students share a birthday, which is the complement of the question discussed in the lecture.
Before doing the simulation as with previous problems ask yourself: Do you think that the value of n
will be similar to that of the previous problems? If you have to guess would you say it needs to be greater than or lower?
To help you out run the next cell to play an interactive version of this problem. The instructions are simple:
- To start a new simulation click anywhere on the upper panel (just below where the
Figure
headline appears) - The upper panel shows randomly generated birthdays and let's you know when there is a match between two students
- The bottom left panel keeps track of the number of students required to have a match for every run
- The bottom right panel shows that same information as a histogram
- Try running the simulation at least 30 times to get a sense of how this particular problem behaves
Now you should have a hypothesis of the number of students in the classroom needed for the match. Test your intuition by generating the simulated probabilities as before:
Analytical solution
This problem is a bit different from the previous two, so you will need to make some calculations again. Now, the idea is to perform the calculation by steps, selecting one student at time. So let's define as the probability that the -th student has birthday different from the previous students. Note that , because there is no previous student. Note also that,
This is because, given that you've selected one student, the second has possible values, it has a chance of of not matching birthdays with the first. Now, for , there are selected students with different birthdays, so the probability that the third student not matches any of the previous two is:
Inductively, if we select students, then
Since we assume that we are choosing students independently from each other, then the probability of picking students that don't have their birthday in common is just the product of all . Let's call it , so:
The desired probability is, therefore, , since we want the probability of at least two students match their birthday (this is just the complement rule).
Note that we could just write a small program in Python to compute this value for every and return the first value that achieves the inequality we want (), but for sake of completion we will provide an analytic solution.
We will use the following approximation: for small and positive. We can re-write as
Thus, using the approximation:
Using the formula (this is the sum of the first terms of a arithmetic progression with first term and common difference ), we have:
So, and is equivalent to which is equivalent to , i.e., . Since , so
Solving the quadratic equation , the only positive value for is
Therefore, if then .
Fourth Problem
The fourth and final one is similar to the third problem but with the difference that you have two classrooms and want to find a match between a student in one classroom and a student in the other, presenting it like a question it will be: given two classrooms with n
students, what is the value of n
such that the probability of having a match is greater than or equal to 0.5 for any two students in each classroom?
Once again try to come up with your own hypothesis before doing the simulation!
Analytical solution
The solution to this problem is similar to the first one. Now, instead of only one date, there are dates to compare. Remember that if we have only one date to compare than the probability, let's say of having no student with birthday is (the complement of in that case). Now we proceed as problem three, by having independent samples of students. For each student sampled, the probability is , so the probability of no student matches any of the given dates is therefore
Using the approximation for small,
Therefore,
Thus, if
Conclusion
Congratulations! You have finished the ungraded lab on the Birthday problems!