Commit dc11e6d

fix: fix duplicate
1 parent f862366 commit dc11e6d

239 files changed (+5087, -3442 lines)


_posts/2019-12-29-understanding_splines_what_they_how_they_used_data_analysis.md

Lines changed: 22 additions & 15 deletions
@@ -6,7 +6,9 @@ categories:
 - Machine Learning
 classes: wide
 date: '2019-12-29'
-excerpt: Splines are powerful tools for modeling complex, nonlinear relationships in data. In this article, we'll explore what splines are, how they work, and how they are used in data analysis, statistics, and machine learning.
+excerpt: Splines are powerful tools for modeling complex, nonlinear relationships
+  in data. In this article, we'll explore what splines are, how they work, and how
+  they are used in data analysis, statistics, and machine learning.
 header:
   image: /assets/images/data_science_19.jpg
   og_image: /assets/images/data_science_19.jpg
@@ -16,25 +18,30 @@ header:
   twitter_image: /assets/images/data_science_19.jpg
 keywords:
 - Splines
-- Spline Regression
-- Nonlinear Models
-- Data Smoothing
-- Statistical Modeling
-- python
-- bash
-- go
-seo_description: Splines are flexible mathematical tools used for smoothing and modeling complex data patterns. Learn what they are, how they work, and their practical applications in regression, data smoothing, and machine learning.
+- Spline regression
+- Nonlinear models
+- Data smoothing
+- Statistical modeling
+- Python
+- Bash
+- Go
+seo_description: Splines are flexible mathematical tools used for smoothing and modeling
+  complex data patterns. Learn what they are, how they work, and their practical applications
+  in regression, data smoothing, and machine learning.
 seo_title: What Are Splines? A Deep Dive into Their Uses in Data Analysis
 seo_type: article
-summary: Splines are flexible mathematical functions used to approximate complex patterns in data. They help smooth data, model non-linear relationships, and fit curves in regression analysis. This article covers the basics of splines, their various types, and their practical applications in statistics, data science, and machine learning.
+summary: Splines are flexible mathematical functions used to approximate complex patterns
+  in data. They help smooth data, model non-linear relationships, and fit curves in
+  regression analysis. This article covers the basics of splines, their various types,
+  and their practical applications in statistics, data science, and machine learning.
 tags:
 - Splines
 - Regression
-- Data Smoothing
-- Nonlinear Models
-- python
-- bash
-- go
+- Data smoothing
+- Nonlinear models
+- Python
+- Bash
+- Go
 title: 'Understanding Splines: What They Are and How They Are Used in Data Analysis'
 ---
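The diff above only reflows this post's YAML front matter, but the excerpt it wraps is technical: splines model nonlinear relationships by joining piecewise polynomials. As a minimal illustration of that idea (not part of the commit; SciPy is an assumed tool here, not necessarily what the post uses), a cubic interpolating spline tracks a smooth curve closely between its knots:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sample a smooth function at 20 knots.
x = np.linspace(0.0, 2.0 * np.pi, 20)
y = np.sin(x)

# Fit a cubic interpolating spline: piecewise cubics joined at the knots
# with continuous first and second derivatives.
cs = CubicSpline(x, y)

# Evaluate off the knots; the spline stays close to the true curve.
grid = np.linspace(0.0, 2.0 * np.pi, 500)
approx = float(cs(1.0))
max_err = float(np.max(np.abs(cs(grid) - np.sin(grid))))
```

The continuity of first and second derivatives at the knots is what gives splines the smoothness the post's excerpt refers to.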

_posts/2019-12-30-evaluating_binary_classifiers_imbalanced_datasets.md

Lines changed: 22 additions & 13 deletions
@@ -5,7 +5,9 @@ categories:
 - Machine Learning
 classes: wide
 date: '2019-12-30'
-excerpt: AUC-ROC and Gini are popular metrics for evaluating binary classifiers, but they can be misleading on imbalanced datasets. Discover why AUC-PR, with its focus on Precision and Recall, offers a better evaluation for handling rare events.
+excerpt: AUC-ROC and Gini are popular metrics for evaluating binary classifiers, but
+  they can be misleading on imbalanced datasets. Discover why AUC-PR, with its focus
+  on Precision and Recall, offers a better evaluation for handling rare events.
 header:
   image: /assets/images/data_science_8.jpg
   og_image: /assets/images/data_science_8.jpg
@@ -14,21 +16,28 @@ header:
   teaser: /assets/images/data_science_8.jpg
   twitter_image: /assets/images/data_science_8.jpg
 keywords:
-- AUC-PR
-- Precision-Recall
-- Binary Classifiers
-- Imbalanced Data
-- Machine Learning Metrics
-seo_description: When evaluating binary classifiers on imbalanced datasets, AUC-PR is a more informative metric than AUC-ROC or Gini. Learn why Precision-Recall curves provide a clearer picture of model performance on rare events.
+- Auc-pr
+- Precision-recall
+- Binary classifiers
+- Imbalanced data
+- Machine learning metrics
+seo_description: When evaluating binary classifiers on imbalanced datasets, AUC-PR
+  is a more informative metric than AUC-ROC or Gini. Learn why Precision-Recall curves
+  provide a clearer picture of model performance on rare events.
 seo_title: 'AUC-PR vs. AUC-ROC: Evaluating Classifiers on Imbalanced Data'
 seo_type: article
-summary: In this article, we explore why AUC-PR (Area Under Precision-Recall Curve) is a superior metric for evaluating binary classifiers on imbalanced datasets compared to AUC-ROC and Gini. We discuss how class imbalance distorts performance metrics and provide real-world examples of why Precision-Recall curves give a clearer understanding of model performance on rare events.
+summary: In this article, we explore why AUC-PR (Area Under Precision-Recall Curve)
+  is a superior metric for evaluating binary classifiers on imbalanced datasets compared
+  to AUC-ROC and Gini. We discuss how class imbalance distorts performance metrics
+  and provide real-world examples of why Precision-Recall curves give a clearer understanding
+  of model performance on rare events.
 tags:
-- Binary Classifiers
-- Imbalanced Data
-- AUC-PR
-- Precision-Recall
-title: 'Evaluating Binary Classifiers on Imbalanced Datasets: Why AUC-PR Beats AUC-ROC and Gini'
+- Binary classifiers
+- Imbalanced data
+- Auc-pr
+- Precision-recall
+title: 'Evaluating Binary Classifiers on Imbalanced Datasets: Why AUC-PR Beats AUC-ROC
+  and Gini'
 ---
 
 When working with binary classifiers, metrics like **AUC-ROC** and **Gini** have long been the default for evaluating model performance. These metrics offer a quick way to assess how well a model discriminates between two classes, typically a **positive class** (e.g., detecting fraud or predicting defaults) and a **negative class** (e.g., non-fraudulent or non-default cases).
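The excerpt's claim is easy to reproduce numerically. A minimal sketch (scikit-learn assumed; the dataset below is synthetic and illustrative, not from the post): on a heavily imbalanced problem a classifier can post an excellent AUC-ROC while its Precision-Recall summary stays modest:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# A deliberately imbalanced problem: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# Negative scores spread over [0, 0.6]; positives sit near the top of
# that range, so a handful of negatives outrank two of the positives.
y_score = np.concatenate([np.linspace(0.0, 0.6, 95),
                          [0.5, 0.55, 0.65, 0.7, 0.8]])

roc_auc = roc_auc_score(y_true, y_score)           # high despite the overlap
auc_pr = average_precision_score(y_true, y_score)  # noticeably lower
```

Because the false-positive rate in ROC is diluted by the large negative class, AUC-ROC stays high; average precision, which conditions on the predicted positives, drops as soon as negatives crowd into the top-ranked scores.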

_posts/2019-12-31-deep_dive_into_why_multiple_imputation_indefensible.md

Lines changed: 15 additions & 10 deletions
@@ -4,7 +4,8 @@ categories:
 - Statistics
 classes: wide
 date: '2019-12-31'
-excerpt: Let's examine why multiple imputation, despite being popular, may not be as robust or interpretable as it's often considered. Is there a better approach?
+excerpt: Let's examine why multiple imputation, despite being popular, may not be
+  as robust or interpretable as it's often considered. Is there a better approach?
 header:
   image: /assets/images/data_science_20.jpg
   og_image: /assets/images/data_science_20.jpg
@@ -13,18 +14,22 @@ header:
   teaser: /assets/images/data_science_20.jpg
   twitter_image: /assets/images/data_science_20.jpg
 keywords:
-- multiple imputation
-- missing data
-- single stochastic imputation
-- deterministic sensitivity analysis
-seo_description: Exploring the issues with multiple imputation and why single stochastic imputation with deterministic sensitivity analysis is a superior alternative.
+- Multiple imputation
+- Missing data
+- Single stochastic imputation
+- Deterministic sensitivity analysis
+seo_description: Exploring the issues with multiple imputation and why single stochastic
+  imputation with deterministic sensitivity analysis is a superior alternative.
 seo_title: 'The Case Against Multiple Imputation: An In-depth Look'
 seo_type: article
-summary: Multiple imputation is widely regarded as the gold standard for handling missing data, but it carries significant conceptual and interpretative challenges. We will explore its weaknesses and propose an alternative using single stochastic imputation and deterministic sensitivity analysis.
+summary: Multiple imputation is widely regarded as the gold standard for handling
+  missing data, but it carries significant conceptual and interpretative challenges.
+  We will explore its weaknesses and propose an alternative using single stochastic
+  imputation and deterministic sensitivity analysis.
 tags:
-- Multiple Imputation
-- Missing Data
-- Data Imputation
+- Multiple imputation
+- Missing data
+- Data imputation
 title: A Deep Dive into Why Multiple Imputation is Indefensible
 ---
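The alternative named in this front matter, single stochastic imputation plus deterministic sensitivity analysis, can be sketched in a few lines (NumPy assumed; the data and the min/max bounding scheme are illustrative choices, not the post's prescription):

```python
import numpy as np

rng = np.random.default_rng(0)

data = np.array([2.1, np.nan, 3.5, 4.0, np.nan, 2.8, 3.1, np.nan, 3.9, 2.5])
missing = np.isnan(data)
observed = data[~missing]

# Single stochastic imputation: draw each missing value once from the
# empirical distribution of the observed values.
imputed = data.copy()
imputed[missing] = rng.choice(observed, size=missing.sum())

# Deterministic sensitivity analysis: bound the estimate by refilling
# every gap with the observed extremes.
low, high = data.copy(), data.copy()
low[missing] = observed.min()
high[missing] = observed.max()

print(f"point estimate: {imputed.mean():.3f}  "
      f"bounds: [{low.mean():.3f}, {high.mean():.3f}]")
```

One stochastic draw gives a single, interpretable point estimate; the min/max refills bracket how far the missing values could possibly move it, which is the deterministic sensitivity check the title alludes to.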

_posts/2020-01-01-causality_correlation.md

Lines changed: 113 additions & 67 deletions
@@ -4,7 +4,8 @@ categories:
 - Statistics
 classes: wide
 date: '2020-01-01'
-excerpt: Understand how causal reasoning helps us move beyond correlation, resolving paradoxes and leading to more accurate insights from data analysis.
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
 header:
   image: /assets/images/data_science_4.jpg
   og_image: /assets/images/data_science_1.jpg
@@ -18,10 +19,14 @@ keywords:
 - Berkson's paradox
 - Correlation
 - Data science
-seo_description: Explore how causal reasoning, through paradoxes like Simpson's and Berkson's, can help us avoid the common pitfalls of interpreting data solely based on correlation.
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
 seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
 seo_type: article
-summary: An in-depth exploration of the limits of correlation in data interpretation, highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as a tool for uncovering true causal relationships.
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
 tags:
 - Simpson's paradox
 - Berkson's paradox
@@ -36,20 +41,41 @@ In today's data-driven world, we often rely on statistical correlations to make
 This article is aimed at anyone who works with data and is interested in gaining a more accurate understanding of how to interpret statistical relationships. Here, we will explore how to uncover **causal relationships** in data, how to resolve confusing situations like **Simpson's Paradox** and **Berkson's Paradox**, and how to use **causal graphs** as a tool for making better decisions. The goal is to demonstrate that by understanding causality, we can avoid the pitfalls of over-relying on correlation and make more informed decisions.
 
 ---
-
-## Correlation and Causation: Why the Distinction Matters
-
-In statistics, **correlation** measures the strength of a relationship between two variables. For example, if you observe that ice cream sales increase as temperatures rise, you might conclude that warmer weather causes more ice cream to be sold. This conclusion feels intuitive, but what about cases where the data is less obvious? Imagine a study finds a correlation between shark attacks and ice cream sales. Does one cause the other? Clearly not—but the correlation exists because both are influenced by a common factor: hot weather.
-
-This example underscores the central problem: **correlation does not imply causation**. Just because two variables move together doesn’t mean one causes the other. Correlation can arise for several reasons:
-
-- **Direct causality**: One variable causes the other.
-- **Reverse causality**: The relationship runs in the opposite direction.
-- **Confounding variables**: A third variable influences both.
-- **Coincidence**: The relationship is due to chance.
-
-To understand the true nature of relationships in data, we need to go beyond correlation and ask **why** the variables are related. This is where **causal inference** comes in.
-
+author_profile: false
+categories:
+- Statistics
+classes: wide
+date: '2020-01-01'
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
+header:
+  image: /assets/images/data_science_4.jpg
+  og_image: /assets/images/data_science_1.jpg
+  overlay_image: /assets/images/data_science_4.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_4.jpg
+  twitter_image: /assets/images/data_science_1.jpg
+keywords:
+- Simpson's paradox
+- Causality
+- Berkson's paradox
+- Correlation
+- Data science
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
+seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
+seo_type: article
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
+tags:
+- Simpson's paradox
+- Berkson's paradox
+- Correlation
+- Data science
+- Causal inference
+title: 'Causality Beyond Correlation: Simpson''s and Berkson''s Paradoxes'
 ---
 
 ## The Importance of Causal Inference
@@ -61,23 +87,41 @@ In most real-world scenarios, we rely on **observational data**, which is data c
 Fortunately, researchers have developed methods to uncover causal relationships from observational data by combining **statistical reasoning** with a deep understanding of the data's context. This is where **causal graphs** and tools like **Simpson's Paradox** and **Berkson's Paradox** come into play.
 
 ---
-
-## Simpson's Paradox: The Danger of Aggregating Data
-
-Simpson's Paradox is a statistical phenomenon in which a trend that appears in different groups of data disappears or reverses when the groups are combined. This paradox occurs because of a **lurking confounder**, a variable that influences both the independent and dependent variables, skewing the relationship between them.
-
-### The Classic Example
-
-Imagine you're analyzing the effectiveness of a new drug across two groups: younger patients and older patients. Within each group, the drug seems to improve health outcomes. However, when you combine the two groups, the overall analysis shows that the drug is **less** effective.
-
-This reversal happens because age, a **confounding variable**, is driving the overall result. If more older patients received the drug and older patients have worse outcomes in general, it can skew the overall data. Thus, the combined analysis gives a misleading result, suggesting the drug is less effective when it actually benefits each group.
-
-### Why Does This Happen?
-
-Simpson’s Paradox occurs because the relationship between variables changes when data is aggregated. In the example above, **age** confounds the relationship between the drug and health outcomes. It’s important to note that combining data from different groups without accounting for confounders can hide the true relationships within each group.
-
-This paradox demonstrates why it’s crucial to understand the **story behind the data**. If we simply relied on the overall correlation, we would draw the wrong conclusion about the drug’s effectiveness.
-
+author_profile: false
+categories:
+- Statistics
+classes: wide
+date: '2020-01-01'
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
+header:
+  image: /assets/images/data_science_4.jpg
+  og_image: /assets/images/data_science_1.jpg
+  overlay_image: /assets/images/data_science_4.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_4.jpg
+  twitter_image: /assets/images/data_science_1.jpg
+keywords:
+- Simpson's paradox
+- Causality
+- Berkson's paradox
+- Correlation
+- Data science
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
+seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
+seo_type: article
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
+tags:
+- Simpson's paradox
+- Berkson's paradox
+- Correlation
+- Data science
+- Causal inference
+title: 'Causality Beyond Correlation: Simpson''s and Berkson''s Paradoxes'
 ---
 
 ## Berkson's Paradox: The Pitfall of Selection Bias
@@ -99,39 +143,41 @@ Berkson's Paradox illustrates the problem of **selection bias**—when we restri
 The key takeaway from Berkson’s Paradox is that we need to be careful about **how we select data for analysis**. If we focus only on a specific group without understanding how that group was selected, we can introduce misleading correlations.
 
 ---
-
-## Causal Graphs: A Tool for Visualizing Relationships
-
-To avoid falling into the traps of Simpson’s and Berkson’s Paradoxes, it’s helpful to use **causal graphs** to visualize the relationships between variables. These graphs, also known as **Directed Acyclic Graphs (DAGs)**, allow us to represent the causal structure of a system and identify which variables are influencing others.
-
-### What Are Causal Graphs?
-
-A **causal graph** is a diagram that represents variables as **nodes** and the causal relationships between them as **directed edges** (arrows). A directed edge from variable **A** to variable **B** indicates that **A** has a causal influence on **B**.
-
-Causal graphs are powerful because they help us:
-
-1. **Identify confounders**: Variables that influence both the independent and dependent variables.
-2. **Clarify causal relationships**: Show which variables are direct causes and which are effects.
-3. **Avoid incorrect controls**: Help us decide which variables to control for in statistical analysis.
-
-### Using Causal Graphs to Resolve Simpson's Paradox
-
-Let’s return to the example of the drug trial. A causal graph for this scenario might look like this:
-
-- **Age** influences both **Drug Use** and **Health Outcome**.
-- **Drug Use** directly affects **Health Outcome**.
-
-In this case, **Age** is a **confounder** because it influences both the independent variable (**Drug Use**) and the dependent variable (**Health Outcome**). When we control for **Age**, we remove its confounding effect and can properly assess the impact of the drug on health outcomes.
-
-### Using Causal Graphs to Resolve Berkson's Paradox
-
-In the case of celebrities, a causal graph might look like this:
-
-- **Talent** and **Attractiveness** are independent in the general population.
-- **Celebrity Status** depends on both **Talent** and **Attractiveness**.
-
-Here, **Celebrity Status** is a **collider**, a variable that is influenced by both **Talent** and **Attractiveness**. When we condition on a collider (i.e., focus only on celebrities), we create a spurious correlation between **Talent** and **Attractiveness**. The key is to recognize that the negative correlation between these variables only exists because we have selected a specific subset of the population (celebrities), not because there is a true relationship between talent and attractiveness.
-
+author_profile: false
+categories:
+- Statistics
+classes: wide
+date: '2020-01-01'
+excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
+  paradoxes and leading to more accurate insights from data analysis.
+header:
+  image: /assets/images/data_science_4.jpg
+  og_image: /assets/images/data_science_1.jpg
+  overlay_image: /assets/images/data_science_4.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_4.jpg
+  twitter_image: /assets/images/data_science_1.jpg
+keywords:
+- Simpson's paradox
+- Causality
+- Berkson's paradox
+- Correlation
+- Data science
+seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
+  Berkson's, can help us avoid the common pitfalls of interpreting data solely based
+  on correlation.
+seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
+seo_type: article
+summary: An in-depth exploration of the limits of correlation in data interpretation,
+  highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
+  a tool for uncovering true causal relationships.
+tags:
+- Simpson's paradox
+- Berkson's paradox
+- Correlation
+- Data science
+- Causal inference
+title: 'Causality Beyond Correlation: Simpson''s and Berkson''s Paradoxes'
 ---
 
 ## The Broader Implications of Causality in Data Analysis
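The sections this commit deletes describe both paradoxes verbally; each can be checked with a short simulation (NumPy assumed; the counts and thresholds below are hypothetical illustrations of the drug/age and celebrity examples from the removed text):

```python
import numpy as np

# --- Simpson's paradox with the drug/age example ---
# (recovered, total) counts; the numbers are invented for illustration.
young = {"drug": (90, 100), "control": (800, 1000)}
old = {"drug": (300, 1000), "control": (20, 100)}

def rate(recovered, total):
    return recovered / total

# Within each age group the drug does better...
assert rate(*young["drug"]) > rate(*young["control"])  # 0.90 > 0.80
assert rate(*old["drug"]) > rate(*old["control"])      # 0.30 > 0.20

# ...but pooled, the ordering reverses, because far more older
# (worse-prognosis) patients received the drug.
drug_pooled = rate(90 + 300, 100 + 1000)     # ~0.355
control_pooled = rate(800 + 20, 1000 + 100)  # ~0.745

# --- Berkson's paradox: conditioning on a collider ---
rng = np.random.default_rng(42)
talent = rng.normal(size=100_000)
attractiveness = rng.normal(size=100_000)  # independent of talent
celebrity = talent + attractiveness > 2.0  # selection on the sum (the collider)

r_all = np.corrcoef(talent, attractiveness)[0, 1]
r_celeb = np.corrcoef(talent[celebrity], attractiveness[celebrity])[0, 1]
# r_all is near zero; r_celeb is clearly negative among "celebrities".
```

Selecting on the sum means a celebrity who is low on one trait must be high on the other, which manufactures the negative correlation the deleted text describes; the population-wide correlation stays near zero.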
