_posts/2019-12-29-understanding_splines_what_they_how_they_used_data_analysis.md (+22 −15)

@@ -6,7 +6,9 @@ categories:
 - Machine Learning
 classes: wide
 date: '2019-12-29'
-excerpt: Splines are powerful tools for modeling complex, nonlinear relationships in data. In this article, we'll explore what splines are, how they work, and how they are used in data analysis, statistics, and machine learning.
+excerpt: Splines are powerful tools for modeling complex, nonlinear relationships
+  in data. In this article, we'll explore what splines are, how they work, and how
+  they are used in data analysis, statistics, and machine learning.
 header:
   image: /assets/images/data_science_19.jpg
   og_image: /assets/images/data_science_19.jpg
@@ -16,25 +18,30 @@ header:
   twitter_image: /assets/images/data_science_19.jpg
 keywords:
 - Splines
-- Spline Regression
-- Nonlinear Models
-- Data Smoothing
-- Statistical Modeling
-- python
-- bash
-- go
-seo_description: Splines are flexible mathematical tools used for smoothing and modeling complex data patterns. Learn what they are, how they work, and their practical applications in regression, data smoothing, and machine learning.
+- Spline regression
+- Nonlinear models
+- Data smoothing
+- Statistical modeling
+- Python
+- Bash
+- Go
+seo_description: Splines are flexible mathematical tools used for smoothing and modeling
+  complex data patterns. Learn what they are, how they work, and their practical applications
+  in regression, data smoothing, and machine learning.
 seo_title: What Are Splines? A Deep Dive into Their Uses in Data Analysis
 seo_type: article
-summary: Splines are flexible mathematical functions used to approximate complex patterns in data. They help smooth data, model non-linear relationships, and fit curves in regression analysis. This article covers the basics of splines, their various types, and their practical applications in statistics, data science, and machine learning.
+summary: Splines are flexible mathematical functions used to approximate complex patterns
+  in data. They help smooth data, model non-linear relationships, and fit curves in
+  regression analysis. This article covers the basics of splines, their various types,
+  and their practical applications in statistics, data science, and machine learning.
 tags:
 - Splines
 - Regression
-- Data Smoothing
-- Nonlinear Models
-- python
-- bash
-- go
+- Data smoothing
+- Nonlinear models
+- Python
+- Bash
+- Go
 title: 'Understanding Splines: What They Are and How They Are Used in Data Analysis'
_posts/2019-12-30-evaluating_binary_classifiers_imbalanced_datasets.md (+22 −13)

@@ -5,7 +5,9 @@ categories:
 - Machine Learning
 classes: wide
 date: '2019-12-30'
-excerpt: AUC-ROC and Gini are popular metrics for evaluating binary classifiers, but they can be misleading on imbalanced datasets. Discover why AUC-PR, with its focus on Precision and Recall, offers a better evaluation for handling rare events.
+excerpt: AUC-ROC and Gini are popular metrics for evaluating binary classifiers, but
+  they can be misleading on imbalanced datasets. Discover why AUC-PR, with its focus
+  on Precision and Recall, offers a better evaluation for handling rare events.
 header:
   image: /assets/images/data_science_8.jpg
   og_image: /assets/images/data_science_8.jpg
@@ -14,21 +16,28 @@ header:
   teaser: /assets/images/data_science_8.jpg
   twitter_image: /assets/images/data_science_8.jpg
 keywords:
-- AUC-PR
-- Precision-Recall
-- Binary Classifiers
-- Imbalanced Data
-- Machine Learning Metrics
-seo_description: When evaluating binary classifiers on imbalanced datasets, AUC-PR is a more informative metric than AUC-ROC or Gini. Learn why Precision-Recall curves provide a clearer picture of model performance on rare events.
+- Auc-pr
+- Precision-recall
+- Binary classifiers
+- Imbalanced data
+- Machine learning metrics
+seo_description: When evaluating binary classifiers on imbalanced datasets, AUC-PR
+  is a more informative metric than AUC-ROC or Gini. Learn why Precision-Recall curves
+  provide a clearer picture of model performance on rare events.
 seo_title: 'AUC-PR vs. AUC-ROC: Evaluating Classifiers on Imbalanced Data'
 seo_type: article
-summary: In this article, we explore why AUC-PR (Area Under Precision-Recall Curve) is a superior metric for evaluating binary classifiers on imbalanced datasets compared to AUC-ROC and Gini. We discuss how class imbalance distorts performance metrics and provide real-world examples of why Precision-Recall curves give a clearer understanding of model performance on rare events.
+summary: In this article, we explore why AUC-PR (Area Under Precision-Recall Curve)
+  is a superior metric for evaluating binary classifiers on imbalanced datasets compared
+  to AUC-ROC and Gini. We discuss how class imbalance distorts performance metrics
+  and provide real-world examples of why Precision-Recall curves give a clearer understanding
+  of model performance on rare events.
 tags:
-- Binary Classifiers
-- Imbalanced Data
-- AUC-PR
-- Precision-Recall
-title: 'Evaluating Binary Classifiers on Imbalanced Datasets: Why AUC-PR Beats AUC-ROC and Gini'

When working with binary classifiers, metrics like **AUC-ROC** and **Gini** have long been the default for evaluating model performance. These metrics offer a quick way to assess how well a model discriminates between two classes, typically a **positive class** (e.g., detecting fraud or predicting defaults) and a **negative class** (e.g., non-fraudulent or non-default cases).
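The gap this post describes is easy to reproduce. Below is a small pure-Python sketch (the scores and the 10-positive / 1000-negative class ratio are illustrative assumptions, not data from the post) in which AUC-ROC looks excellent while AUC-PR, summarized here as average precision, is poor:

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AUC-PR summarized as average precision: mean precision at each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp, ap = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            ap += tp / k  # precision at the rank of this positive
    return ap / sum(labels)

# 10 rare positives among 1000 negatives; only 49 negatives outscore the positives.
labels = [1] * 10 + [0] * 49 + [0] * 951
scores = [0.5] * 10 + [0.9] * 49 + [0.1] * 951

print(f"AUC-ROC: {auc_roc(labels, scores):.3f}")            # 0.951 -- looks excellent
print(f"AUC-PR:  {average_precision(labels, scores):.3f}")  # 0.098 -- poor
```

Ranking just 49 negatives above the rare positives barely dents AUC-ROC, which is dominated by the 951 easy negatives, but it collapses precision at every recall level; that is exactly the failure mode AUC-PR surfaces.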
_posts/2019-12-31-deep_dive_into_why_multiple_imputation_indefensible.md (+15 −10)

@@ -4,7 +4,8 @@ categories:
 - Statistics
 classes: wide
 date: '2019-12-31'
-excerpt: Let's examine why multiple imputation, despite being popular, may not be as robust or interpretable as it's often considered. Is there a better approach?
+excerpt: Let's examine why multiple imputation, despite being popular, may not be
+  as robust or interpretable as it's often considered. Is there a better approach?
 header:
   image: /assets/images/data_science_20.jpg
   og_image: /assets/images/data_science_20.jpg
@@ -13,18 +14,22 @@ header:
   teaser: /assets/images/data_science_20.jpg
   twitter_image: /assets/images/data_science_20.jpg
 keywords:
-- multiple imputation
-- missing data
-- single stochastic imputation
-- deterministic sensitivity analysis
-seo_description: Exploring the issues with multiple imputation and why single stochastic imputation with deterministic sensitivity analysis is a superior alternative.
+- Multiple imputation
+- Missing data
+- Single stochastic imputation
+- Deterministic sensitivity analysis
+seo_description: Exploring the issues with multiple imputation and why single stochastic
+  imputation with deterministic sensitivity analysis is a superior alternative.
 seo_title: 'The Case Against Multiple Imputation: An In-depth Look'
 seo_type: article
-summary: Multiple imputation is widely regarded as the gold standard for handling missing data, but it carries significant conceptual and interpretative challenges. We will explore its weaknesses and propose an alternative using single stochastic imputation and deterministic sensitivity analysis.
+summary: Multiple imputation is widely regarded as the gold standard for handling
+  missing data, but it carries significant conceptual and interpretative challenges.
+  We will explore its weaknesses and propose an alternative using single stochastic
+  imputation and deterministic sensitivity analysis.
 tags:
-- Multiple Imputation
-- Missing Data
-- Data Imputation
+- Multiple imputation
+- Missing data
+- Data imputation
 title: A Deep Dive into Why Multiple Imputation is Indefensible
-excerpt: Understand how causal reasoning helps us move beyond correlation, resolving paradoxes and leading to more accurate insights from data analysis.
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
 header:
   image: /assets/images/data_science_4.jpg
   og_image: /assets/images/data_science_1.jpg
@@ -18,10 +19,14 @@ keywords:
 - Berkson's paradox
 - Correlation
 - Data science
-seo_description: Explore how causal reasoning, through paradoxes like Simpson's and Berkson's, can help us avoid the common pitfalls of interpreting data solely based on correlation.
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
 seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
 seo_type: article
-summary: An in-depth exploration of the limits of correlation in data interpretation, highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as a tool for uncovering true causal relationships.
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
 tags:
 - Simpson's paradox
 - Berkson's paradox
@@ -36,20 +41,41 @@ In today's data-driven world, we often rely on statistical correlations to make
 This article is aimed at anyone who works with data and is interested in gaining a more accurate understanding of how to interpret statistical relationships. Here, we will explore how to uncover **causal relationships** in data, how to resolve confusing situations like **Simpson's Paradox** and **Berkson's Paradox**, and how to use **causal graphs** as a tool for making better decisions. The goal is to demonstrate that by understanding causality, we can avoid the pitfalls of over-relying on correlation and make more informed decisions.

 ---
-
-## Correlation and Causation: Why the Distinction Matters
-
-In statistics, **correlation** measures the strength of a relationship between two variables. For example, if you observe that ice cream sales increase as temperatures rise, you might conclude that warmer weather causes more ice cream to be sold. This conclusion feels intuitive, but what about cases where the data is less obvious? Imagine a study finds a correlation between shark attacks and ice cream sales. Does one cause the other? Clearly not—but the correlation exists because both are influenced by a common factor: hot weather.
-
-This example underscores the central problem: **correlation does not imply causation**. Just because two variables move together doesn't mean one causes the other. Correlation can arise for several reasons:
-
-- **Direct causality**: One variable causes the other.
-- **Reverse causality**: The relationship runs in the opposite direction.
-- **Confounding variables**: A third variable influences both.
-- **Coincidence**: The relationship is due to chance.
-
-To understand the true nature of relationships in data, we need to go beyond correlation and ask **why** the variables are related. This is where **causal inference** comes in.
-
+author_profile: false
+categories:
+- Statistics
+classes: wide
+date: '2020-01-01'
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
+header:
+  image: /assets/images/data_science_4.jpg
+  og_image: /assets/images/data_science_1.jpg
+  overlay_image: /assets/images/data_science_4.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_4.jpg
+  twitter_image: /assets/images/data_science_1.jpg
+keywords:
+- Simpson's paradox
+- Causality
+- Berkson's paradox
+- Correlation
+- Data science
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
+seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
+seo_type: article
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
+tags:
+- Simpson's paradox
+- Berkson's paradox
+- Correlation
+- Data science
+- Causal inference
+title: 'Causality Beyond Correlation: Simpson''s and Berkson''s Paradoxes'
 ---

 ## The Importance of Causal Inference
@@ -61,23 +87,41 @@ In most real-world scenarios, we rely on **observational data**, which is data c
 Fortunately, researchers have developed methods to uncover causal relationships from observational data by combining **statistical reasoning** with a deep understanding of the data's context. This is where **causal graphs** and tools like **Simpson's Paradox** and **Berkson's Paradox** come into play.

 ---
-
-## Simpson's Paradox: The Danger of Aggregating Data
-
-Simpson's Paradox is a statistical phenomenon in which a trend that appears in different groups of data disappears or reverses when the groups are combined. This paradox occurs because of a **lurking confounder**, a variable that influences both the independent and dependent variables, skewing the relationship between them.
-
-### The Classic Example
-
-Imagine you're analyzing the effectiveness of a new drug across two groups: younger patients and older patients. Within each group, the drug seems to improve health outcomes. However, when you combine the two groups, the overall analysis shows that the drug is **less** effective.
-
-This reversal happens because age, a **confounding variable**, is driving the overall result. If more older patients received the drug and older patients have worse outcomes in general, it can skew the overall data. Thus, the combined analysis gives a misleading result, suggesting the drug is less effective when it actually benefits each group.
-
-### Why Does This Happen?
-
-Simpson's Paradox occurs because the relationship between variables changes when data is aggregated. In the example above, **age** confounds the relationship between the drug and health outcomes. It's important to note that combining data from different groups without accounting for confounders can hide the true relationships within each group.
-
-This paradox demonstrates why it's crucial to understand the **story behind the data**. If we simply relied on the overall correlation, we would draw the wrong conclusion about the drug's effectiveness.
-
+author_profile: false
+categories:
+- Statistics
+classes: wide
+date: '2020-01-01'
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
+header:
+  image: /assets/images/data_science_4.jpg
+  og_image: /assets/images/data_science_1.jpg
+  overlay_image: /assets/images/data_science_4.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_4.jpg
+  twitter_image: /assets/images/data_science_1.jpg
+keywords:
+- Simpson's paradox
+- Causality
+- Berkson's paradox
+- Correlation
+- Data science
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
+seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
+seo_type: article
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
+tags:
+- Simpson's paradox
+- Berkson's paradox
+- Correlation
+- Data science
+- Causal inference
+title: 'Causality Beyond Correlation: Simpson''s and Berkson''s Paradoxes'
 ---

 ## Berkson's Paradox: The Pitfall of Selection Bias
@@ -99,39 +143,41 @@ Berkson's Paradox illustrates the problem of **selection bias**—when we restri
 The key takeaway from Berkson's Paradox is that we need to be careful about **how we select data for analysis**. If we focus only on a specific group without understanding how that group was selected, we can introduce misleading correlations.

 ---
-
-## Causal Graphs: A Tool for Visualizing Relationships
-
-To avoid falling into the traps of Simpson's and Berkson's Paradoxes, it's helpful to use **causal graphs** to visualize the relationships between variables. These graphs, also known as **Directed Acyclic Graphs (DAGs)**, allow us to represent the causal structure of a system and identify which variables are influencing others.
-
-### What Are Causal Graphs?
-
-A **causal graph** is a diagram that represents variables as **nodes** and the causal relationships between them as **directed edges** (arrows). A directed edge from variable **A** to variable **B** indicates that **A** has a causal influence on **B**.
-
-Causal graphs are powerful because they help us:
-
-1. **Identify confounders**: Variables that influence both the independent and dependent variables.
-2. **Clarify causal relationships**: Show which variables are direct causes and which are effects.
-3. **Avoid incorrect controls**: Help us decide which variables to control for in statistical analysis.
-
-### Using Causal Graphs to Resolve Simpson's Paradox
-
-Let's return to the example of the drug trial. A causal graph for this scenario might look like this:
-
-- **Age** influences both **Drug Use** and **Health Outcome**.
-
-In this case, **Age** is a **confounder** because it influences both the independent variable (**Drug Use**) and the dependent variable (**Health Outcome**). When we control for **Age**, we remove its confounding effect and can properly assess the impact of the drug on health outcomes.
-
-### Using Causal Graphs to Resolve Berkson's Paradox
-
-In the case of celebrities, a causal graph might look like this:
-
-- **Talent** and **Attractiveness** are independent in the general population.
-- **Celebrity Status** depends on both **Talent** and **Attractiveness**.
-
-Here, **Celebrity Status** is a **collider**, a variable that is influenced by both **Talent** and **Attractiveness**. When we condition on a collider (i.e., focus only on celebrities), we create a spurious correlation between **Talent** and **Attractiveness**. The key is to recognize that the negative correlation between these variables only exists because we have selected a specific subset of the population (celebrities), not because there is a true relationship between talent and attractiveness.
-
+author_profile: false
+categories:
+- Statistics
+classes: wide
+date: '2020-01-01'
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
+header:
+  image: /assets/images/data_science_4.jpg
+  og_image: /assets/images/data_science_1.jpg
+  overlay_image: /assets/images/data_science_4.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_4.jpg
+  twitter_image: /assets/images/data_science_1.jpg
+keywords:
+- Simpson's paradox
+- Causality
+- Berkson's paradox
+- Correlation
+- Data science
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
+seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
+seo_type: article
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
+tags:
+- Simpson's paradox
+- Berkson's paradox
+- Correlation
+- Data science
+- Causal inference
+title: 'Causality Beyond Correlation: Simpson''s and Berkson''s Paradoxes'
 ---

 ## The Broader Implications of Causality in Data Analysis
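The drug-trial reversal described in this diff's Simpson's-paradox section can be checked with a few lines of arithmetic. The recovery counts below are illustrative assumptions chosen to exhibit the reversal, not data from a real trial:

```python
# Recovery counts as (recovered, total) per treatment arm, split by the
# confounder (age group). Counts are assumed for illustration.
data = {
    "young": {"drug": (81, 87),   "no_drug": (234, 270)},
    "old":   {"drug": (192, 263), "no_drug": (55, 80)},
}

def rate(recovered, total):
    return recovered / total

# Within each age group, the drug arm recovers more often...
for group, arms in data.items():
    assert rate(*arms["drug"]) > rate(*arms["no_drug"])
    print(f"{group}: drug {rate(*arms['drug']):.0%} vs no drug {rate(*arms['no_drug']):.0%}")

# ...but pooling the groups ignores the confounder, and the comparison flips.
pooled = {
    arm: (sum(data[g][arm][0] for g in data), sum(data[g][arm][1] for g in data))
    for arm in ("drug", "no_drug")
}
print(f"pooled: drug {rate(*pooled['drug']):.0%} vs no drug {rate(*pooled['no_drug']):.0%}")
```

Stratifying by age (controlling for the confounder) recovers the correct per-group conclusion; the pooled comparison is the misleading one.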
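The celebrity example (talent and attractiveness independent in the general population, negatively correlated among celebrities) can likewise be simulated. The selection rule `talent + looks > 1.4` is an arbitrary stand-in for "becomes a celebrity":

```python
# Berkson's paradox / collider bias: two independent traits become negatively
# correlated once we condition on a collider (selection that depends on both).
import random

random.seed(42)
n = 20_000
talent = [random.random() for _ in range(n)]
looks = [random.random() for _ in range(n)]

def corr(xs, ys):
    """Pearson correlation in pure Python."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# In the full population the traits are independent: correlation near zero.
print(f"population correlation: {corr(talent, looks):+.3f}")

# Condition on the collider: keep only "celebrities", selected because the two
# traits jointly cross a bar (1.4 is an assumed threshold).
celebs = [(t, l) for t, l in zip(talent, looks) if t + l > 1.4]
ct, cl = zip(*celebs)
print(f"celebrity correlation:  {corr(ct, cl):+.3f}")  # clearly negative
```

The negative correlation appears purely because of how the subset was selected, which is the selection-bias point the Berkson's-paradox section makes.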