Commit 7a4ac99

Added lesson 13 materials
1 parent 0652df3 commit 7a4ac99

3 files changed, +681 -0 lines changed

notebooks/notebooks.yml

Lines changed: 9 additions & 0 deletions
@@ -112,3 +112,12 @@ units:
         file: "Lesson_12_activity.ipynb"
       - name: "Anscombe's quartet demo"
         file: "anscombes_quartet.ipynb"
+
+  - number: "13"
+    title: "Probability distributions"
+    topics: "Discrete and continuous distributions, central limit theorem"
+    notebooks:
+      - name: "In class demo"
+        file: "Lesson_13_demo.ipynb"
+      - name: "Activity"
+        file: "Lesson_13_activity.ipynb"
Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b4a46963",
+   "metadata": {},
+   "source": [
+    "# Lesson 13 activity: probability distributions\n",
+    "\n",
+    "## Learning objectives\n",
+    "\n",
+    "This activity will help you to:\n",
+    "\n",
+    "1. Understand and apply binomial distributions to model discrete events\n",
+    "2. Demonstrate the Central Limit Theorem through sampling distributions\n",
+    "3. Visualize theoretical and empirical probability distributions\n",
+    "4. Connect statistical theory to real-world data analysis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "86e3eac2",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Import the required libraries and load the weather dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "117792af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from scipy import stats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "05634da2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the weather dataset\n",
+    "url = 'https://gperdrizet.github.io/FSA_devops/assets/data/unit2/weather.csv'\n",
+    "df = pd.read_csv(url)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "809f038a",
+   "metadata": {},
+   "source": [
+    "## Exercise 1: binomial distribution - modeling rainy days\n",
+    "\n",
+    "**Objective**: Understand and visualize binomial distributions using real weather data.\n",
+    "\n",
+    "The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success. Here we use it to model the number of rainy days in a fixed period, treating each day as an independent trial (a simplification, since weather on consecutive days is usually correlated).\n",
+    "\n",
+    "**Tasks**:\n",
+    "\n",
+    "1. **Calculate the probability of rain**:\n",
+    "   - Count how many days in the dataset have `rainfall_inches > 0`\n",
+    "   - Calculate the proportion of rainy days (this is your probability `p`)\n",
+    "   - Print this probability with an interpretation (e.g., \"Based on our data, there's an X% chance of rain on any given day\")\n",
+    "\n",
+    "2. **Create a theoretical binomial distribution**:\n",
+    "   - Assume you're looking at a 30-day period (like a month)\n",
+    "   - Using the probability from step 1, calculate the theoretical probability of getting exactly k rainy days for k = 0, 1, 2, ..., 30\n",
+    "   - Use `scipy.stats.binom.pmf()`\n",
+    "\n",
+    "3. **Visualize the distribution**:\n",
+    "   - Create a bar plot showing the probability of each possible number of rainy days (0 to 30)\n",
+    "   - Add a vertical line showing the expected value (mean = n × p)\n",
+    "   - Label the axes appropriately\n",
+    "   - Include a title with the probability of rain\n",
+    "\n",
+    "4. **Interpret** your findings:\n",
+    "   - What is the most likely number of rainy days in a 30-day period?\n",
+    "   - What is the expected (mean) number of rainy days?\n",
+    "   - What's the probability of having 15 or more rainy days in a month?\n",
+    "   - How does this distribution help weather forecasters make predictions?\n",
+    "   - **Bonus**: Calculate the standard deviation and explain what it tells you about the variability in monthly rainfall patterns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c5fd71d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Your code here"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "714bd825",
+   "metadata": {},
+   "source": [
+    "## Exercise 2: central limit theorem - sampling distribution of rainfall\n",
+    "\n",
+    "**Objective**: Demonstrate the Central Limit Theorem by creating and analyzing a sampling distribution.\n",
+    "\n",
+    "The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the population has finite variance. This is fundamental to statistical inference.\n",
+    "\n",
+    "**Tasks**:\n",
+    "\n",
+    "1. **Examine the population distribution**:\n",
+    "   - Create a histogram of all `rainfall_inches` values in the dataset\n",
+    "   - Calculate and print the population mean and standard deviation\n",
+    "   - Note the shape of this distribution (is it normal, skewed, etc.?)\n",
+    "\n",
+    "2. **Create a sampling distribution**:\n",
+    "   - Take 1000 random samples from the rainfall data, each of size n=30\n",
+    "   - For each sample, calculate the mean rainfall\n",
+    "   - Store all 1000 sample means in a list or array\n",
+    "   - Hint: Use `df['rainfall_inches'].sample(n=30, replace=True)` for each sample\n",
+    "\n",
+    "3. **Visualize the sampling distribution**:\n",
+    "   - Create a histogram of the 1000 sample means (use `density=True` so it shares a scale with the curve in the next step)\n",
+    "   - Overlay a normal distribution curve using the theoretical mean (μ) and standard error (σ/√n)\n",
+    "   - Add a vertical line at the population mean\n",
+    "   - You can use `scipy.stats.norm.pdf()` to create the normal curve\n",
+    "   - Label axes and add a descriptive title\n",
+    "\n",
+    "4. **Compare distributions**:\n",
+    "   - Create two side-by-side histograms:\n",
+    "      - Left: Original rainfall distribution (from step 1)\n",
+    "      - Right: Sampling distribution of means (from step 3)\n",
+    "   - Make sure both use the same y-axis scale for comparison\n",
+    "   - Include the mean and standard deviation in each subplot title\n",
+    "\n",
+    "5. **Interpret** your findings:\n",
+    "   - How does the shape of the sampling distribution compare to the original distribution?\n",
+    "   - Is the sampling distribution approximately normal? (This demonstrates the CLT!)\n",
+    "   - Calculate the standard error: population σ divided by √30. How does this compare to the standard deviation of your sample means?\n",
+    "   - What does the CLT tell us about why we can use normal-based methods (like confidence intervals) even when our data isn't normally distributed?\n",
+    "   - **Bonus**: Repeat the experiment with different sample sizes (n=5, n=10, n=50). How does sample size affect the spread and normality of the sampling distribution?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a183c001",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Your code here"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
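
The binomial quantities that Exercise 1 asks for can be sketched without the dataset. The snippet below is only an illustration, not the activity's intended solution: it uses the standard library instead of SciPy, and the rain probability `p = 0.25` is a made-up placeholder rather than a value computed from `weather.csv`. The helper name `binomial_pmf` is hypothetical.

```python
from math import comb, sqrt


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 30    # length of the period in days, as in the exercise
p = 0.25  # ASSUMPTION: placeholder rain probability, not taken from weather.csv

# Probability of exactly k rainy days for k = 0, 1, ..., 30
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = n * p                                    # expected number of rainy days
std = sqrt(n * p * (1 - p))                     # standard deviation
mode = max(range(n + 1), key=lambda k: pmf[k])  # most likely number of rainy days
p_15_or_more = sum(pmf[15:])                    # P(X >= 15)

print(f"mean = {mean:.2f}, std = {std:.2f}, mode = {mode}")
print(f"P(15 or more rainy days) = {p_15_or_more:.4f}")
```

With SciPy available, `scipy.stats.binom.pmf(k, n, p)` should return the same probabilities as the helper above, and the `pmf` list can be passed directly to a bar plot.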

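
The sampling experiment in Exercise 2 can be sketched the same way. This is a hedged illustration under stated assumptions: the "population" is synthetic, drawn from an exponential distribution as a stand-in for right-skewed rainfall data, and the standard library's `random.choices` replaces the pandas `sample(n=..., replace=True)` call from the hint. The function name `sample_means` and the `results` dictionary are hypothetical.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so repeated runs give the same numbers

# ASSUMPTION: synthetic right-skewed population standing in for rainfall_inches
population = [random.expovariate(1.0) for _ in range(5000)]
pop_mean = statistics.fmean(population)
pop_std = statistics.pstdev(population)


def sample_means(data, n, n_samples=1000):
    """Draw n_samples samples of size n with replacement; return their means."""
    return [statistics.fmean(random.choices(data, k=n)) for _ in range(n_samples)]


# Compare the spread of the sample means with the theoretical standard error
results = {}
for n in (5, 10, 30, 50):
    means = sample_means(population, n)
    results[n] = (
        statistics.fmean(means),  # should sit close to the population mean
        statistics.stdev(means),  # empirical spread of the sample means
        pop_std / sqrt(n),        # theoretical standard error, sigma / sqrt(n)
    )

print(f"population mean = {pop_mean:.3f}, std = {pop_std:.3f}")
for n, (mean_of_means, sd_of_means, std_error) in results.items():
    print(f"n={n:>2}: mean={mean_of_means:.3f}  sd={sd_of_means:.3f}  se={std_error:.3f}")
```

The empirical spread should track σ/√n at every sample size and shrink as n grows, which is the pattern the bonus question asks about. A histogram of the `means` list would show the shape approaching a normal curve.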