
Commit 43cb381

Merge pull request #92 from DiogoRibeiro7/feat/reserve_branche
Feat/reserve branche
2 parents ad53e19 + dc11e6d commit 43cb381


244 files changed: +6443 / -3337 lines changed


_posts/-_ideas/Epidemiology.md

Lines changed: 6 additions & 2 deletions
@@ -1,3 +1,7 @@
+---
+tags: []
+---
+
 ## Epidimiology
 
 - TODO: "Leveraging Machine Learning in Epidemiology for Disease Prediction"
@@ -15,10 +19,10 @@
 - TODO: "Bayesian Statistics in Epidemiological Modeling"
 - Introduce how Bayesian methods can improve disease risk assessment and uncertainty quantification in epidemiological studies.
 
-- TODO: "Real-Time Data Processing and Epidemiological Surveillance"
+- "Real-Time Data Processing and Epidemiological Surveillance"
 - Write about how real-time analytics platforms like Apache Flink can be used for tracking diseases and improving epidemiological surveillance systems.
 
-- TODO: "Spatial Epidemiology: Using Geospatial Data in Public Health"
+- "Spatial Epidemiology: Using Geospatial Data in Public Health"
 - Discuss the importance of geospatial data in tracking disease outbreaks and how data science techniques can integrate spatial data for public health insights.
 
 - TODO: "Epidemiological Data Challenges and How Data Science Can Solve Them"

_posts/2019-12-29-understanding_splines_what_they_how_they_used_data_analysis.md

Lines changed: 406 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@

---
author_profile: false
categories:
- Data Science
- Machine Learning
classes: wide
date: '2019-12-30'
excerpt: AUC-ROC and Gini are popular metrics for evaluating binary classifiers, but
  they can be misleading on imbalanced datasets. Discover why AUC-PR, with its focus
  on Precision and Recall, offers a better evaluation for handling rare events.
header:
  image: /assets/images/data_science_8.jpg
  og_image: /assets/images/data_science_8.jpg
  overlay_image: /assets/images/data_science_8.jpg
  show_overlay_excerpt: false
  teaser: /assets/images/data_science_8.jpg
  twitter_image: /assets/images/data_science_8.jpg
keywords:
- Auc-pr
- Precision-recall
- Binary classifiers
- Imbalanced data
- Machine learning metrics
seo_description: When evaluating binary classifiers on imbalanced datasets, AUC-PR
  is a more informative metric than AUC-ROC or Gini. Learn why Precision-Recall curves
  provide a clearer picture of model performance on rare events.
seo_title: 'AUC-PR vs. AUC-ROC: Evaluating Classifiers on Imbalanced Data'
seo_type: article
summary: In this article, we explore why AUC-PR (Area Under Precision-Recall Curve)
  is a superior metric for evaluating binary classifiers on imbalanced datasets compared
  to AUC-ROC and Gini. We discuss how class imbalance distorts performance metrics
  and provide real-world examples of why Precision-Recall curves give a clearer understanding
  of model performance on rare events.
tags:
- Binary classifiers
- Imbalanced data
- Auc-pr
- Precision-recall
title: 'Evaluating Binary Classifiers on Imbalanced Datasets: Why AUC-PR Beats AUC-ROC
  and Gini'
---

When working with binary classifiers, metrics like **AUC-ROC** and **Gini** have long been the default for evaluating model performance. These metrics offer a quick way to assess how well a model discriminates between two classes, typically a **positive class** (e.g., detecting fraud or predicting defaults) and a **negative class** (e.g., non-fraudulent or non-default cases).

However, when dealing with **imbalanced datasets**, where one class is much more prevalent than the other, these metrics can **mislead** us into believing a model is better than it truly is. In such cases, **AUC-PR**—which focuses on **Precision** and **Recall**—offers a more meaningful evaluation of a model’s ability to handle rare events, providing a clearer picture of how the model performs on the **minority class**.

In this article, we'll explore why **AUC-PR** (Area Under the Precision-Recall Curve) is more informative than **AUC-ROC** and **Gini** when evaluating models on imbalanced datasets. We’ll delve into why AUC-ROC often **overstates model performance**, and how AUC-PR shifts the focus to the model’s performance on the **positive class**, giving a more reliable assessment of how well it handles **imbalanced classes**.
## The Challenges of Imbalanced Data

Before diving into metrics, it’s important to understand the **challenges of imbalanced data**. In many real-world applications, the class distribution is highly skewed. For instance, in **fraud detection**, **medical diagnosis**, or **default prediction**, the positive class (e.g., fraudulent transactions, patients with a disease, or customers defaulting on loans) represents only a **tiny fraction** of the total cases.

In these scenarios, models tend to **focus heavily on the majority class**, often leading to deceptive results. A model might show high accuracy by correctly identifying many **True Negatives** but fail to adequately detect the **True Positives**—the rare but critical cases. This is where traditional metrics like AUC-ROC and Gini can fall short.

### Imbalanced Data Example: Fraud Detection

Imagine you’re building a model to detect fraudulent transactions. Out of 100,000 transactions, only 500 are fraudulent. That’s a **0.5% positive class** and a **99.5% negative class**. A model that predicts **all transactions as non-fraudulent** would still achieve **99.5% accuracy**, despite **failing completely** to detect any fraud.

While accuracy alone is clearly misleading, even metrics like **AUC-ROC** and **Gini**, which aim to balance True Positives and False Positives, can still provide an **inflated sense of performance**. This is because they take **True Negatives** into account, which, in imbalanced datasets, dominate the metric and obscure the model’s struggles with the positive class.
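
To make the trap concrete, here is a minimal sketch, using hypothetical labels that mirror the numbers above and scikit-learn's standard metrics, of how an all-negative "model" earns 99.5% accuracy while catching zero fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels mirroring the example: 100,000 transactions, 500 frauds.
y_true = np.zeros(100_000, dtype=int)
y_true[:500] = 1  # 1 = fraud (positive class), 0 = legitimate

# A "model" that simply predicts every transaction as non-fraudulent.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.995 -> looks impressive
print(recall_score(y_true, y_pred))    # 0.0   -> catches no fraud at all
```

Accuracy rewards the model for the 99,500 easy negatives; recall exposes that it never finds the events we actually care about.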
## Why AUC-ROC and Gini Can Be Misleading

The **AUC-ROC curve** (Area Under the Receiver Operating Characteristic Curve) is widely used to evaluate binary classifiers. It plots the **True Positive Rate** (TPR) against the **False Positive Rate** (FPR) at various classification thresholds. The **Gini coefficient** is closely related to AUC-ROC, as it is simply **2 * AUC-ROC - 1**.
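
As a quick sketch of that relationship (the labels and scores below are made up purely for illustration), Gini is just AUC-ROC rescaled so that a random ranker maps to 0 and a perfect ranker maps to 1:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: y_true are binary labels, y_score are a model's predicted
# probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.70, 0.25, 0.60, 0.40, 0.05, 0.55])

auc_roc = roc_auc_score(y_true, y_score)
gini = 2 * auc_roc - 1  # random ranking -> 0, perfect ranking -> 1

# Here every positive outscores every negative, so AUC-ROC = 1.0 and Gini = 1.0.
print(auc_roc, gini)
```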
While AUC-ROC is effective for **balanced datasets**, it becomes problematic when applied to **imbalanced data**. Here’s why:

### 1. **Over-Emphasis on True Negatives**

The ROC curve’s **False Positive Rate** is computed over all negatives (FP / (FP + TN)), which means a model can appear to perform well simply by classifying the abundant non-events (True Negatives) correctly. In imbalanced datasets, where the negative class is plentiful, even a model with **poor performance on the positive class** can still achieve a high AUC-ROC score, giving a **false sense of effectiveness**.

For example, a model that ranks most non-fraudulent transactions correctly can still show a **high AUC-ROC** even when the fraud cases it flags are buried among false alarms: because the negative class is so large, the **False Positive Rate** (FPR) stays low no matter how many false alarms are raised in absolute terms, so the ROC curve looks strong while the model’s handling of fraud remains weak.
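
A short worked example makes the dilution visible; the counts are hypothetical and chosen to continue the earlier fraud scenario (500 frauds among 100,000 transactions) at one particular threshold:

```python
# Hypothetical confusion-matrix counts at one threshold.
tp = 400     # frauds correctly flagged
fn = 100     # frauds missed
fp = 2_000   # legitimate transactions wrongly flagged
tn = 97_500  # legitimate transactions correctly passed

tpr = tp / (tp + fn)        # 0.80   -> looks strong on the ROC's y-axis
fpr = fp / (fp + tn)        # ~0.02  -> tiny, because TN is enormous
precision = tp / (tp + fp)  # ~0.17  -> only about 1 in 6 flagged cases is fraud
print(tpr, fpr, precision)
```

The ROC point (0.02, 0.80) looks excellent, yet five out of every six alerts are false alarms.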
### 2. **Sensitivity to Class Imbalance**

In imbalanced datasets, the **majority class** dominates the calculation of the ROC curve. As a result, the metric often emphasizes performance on the negative class rather than the positive class. For highly skewed datasets, this can result in a **high AUC-ROC score**, even if the model is **failing** to correctly classify the minority class.

For instance, if 95% of your dataset consists of **negative cases**, a model that excels at classifying the negative class but performs poorly on the positive class can still produce a high **AUC-ROC** score. In this way, AUC-ROC can **overstate** how well your model is really doing when you care most about the positive class.
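
The same effect is easy to reproduce end-to-end. The sketch below is an assumption-laden illustration rather than part of the original analysis: it builds a synthetic dataset with roughly 1% positives using scikit-learn, fits a plain logistic regression, and compares AUC-ROC with average precision (the usual estimate of AUC-PR):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic, heavily imbalanced data: about 99% negatives, 1% positives.
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# AUC-ROC typically looks flattering here, while the PR-based score,
# which is anchored to the rare positive class, is usually much lower.
print("AUC-ROC:", roc_auc_score(y_test, scores))
print("AUC-PR (average precision):", average_precision_score(y_test, scores))
```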
## Why AUC-PR Is Better for Imbalanced Data

When evaluating binary classifiers on imbalanced datasets, a better approach is to use **AUC-PR** (the Area Under the Precision-Recall Curve). The **Precision-Recall curve** plots **Precision** (the proportion of correctly predicted positive cases out of all predicted positive cases) against **Recall** (the proportion of actual positive cases that are correctly identified).

### 1. **Focus on the Positive Class**

The key advantage of **AUC-PR** is that it **focuses on the positive class**, without being distracted by the abundance of True Negatives. This is particularly important when dealing with **rare events**, where identifying the minority class (e.g., fraud, defaults, or disease) is the primary goal.

**Precision** measures how many of the predicted positive cases are correct, and **Recall** measures how well the model identifies actual positive cases. Together, they provide a clearer picture of the model's performance when dealing with **imbalanced classes**.

For example, in fraud detection, the **Precision-Recall curve** will give a more accurate sense of how well the model balances **finding fraud cases** (high Recall) with ensuring that **predicted fraud cases are actually fraudulent** (high Precision).
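
Tying those verbal definitions to the standard formulas, Precision = TP / (TP + FP) and Recall = TP / (TP + FN); a tiny sketch with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = positive (e.g., fraud), 0 = negative.
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP): 3 of the 5 predicted positives are correct -> 0.60
# Recall    = TP / (TP + FN): 3 of the 4 actual positives are found      -> 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```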
### 2. **Ignoring True Negatives**

One of the strengths of **AUC-PR** is that it **ignores True Negatives**—which are often overwhelmingly present in imbalanced datasets. This means that the model’s performance is evaluated **solely** on its ability to handle the positive class (the class of interest in most real-world applications).

By ignoring True Negatives, the **Precision-Recall curve** gives a more direct view of the model’s performance on **rare events**, making it **far more suitable** for tasks like **fraud detection**, **default prediction**, or **medical diagnoses** where false positives and false negatives carry different risks and costs.
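
That property is visible directly in how the curve is computed. Here is a minimal sketch with made-up scores (in practice they would come from a trained classifier); average precision is printed alongside because it is the more common single-number summary of the same curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Made-up labels and scores standing in for a classifier's output.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.35, 0.25, 0.45, 0.40, 0.70, 0.80])

precision, recall, _ = precision_recall_curve(y_true, scores)

# Every point on this curve is built from TP, FP and FN only;
# True Negatives never enter the computation.
print("AUC-PR:", auc(recall, precision))
print("Average precision:", average_precision_score(y_true, scores))
```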
## A Real-World Example: Comparing AUC-ROC and AUC-PR

Let’s look at a real-world example to illustrate how AUC-PR offers a better assessment of model performance on imbalanced data. Imagine you’re building a classifier to predict loan defaults.

### Step 1: Evaluating with AUC-ROC

When you plot the **ROC curve**, you see that the model achieves a **high AUC-ROC score** of 0.92. Based on this, it might seem that the model is excellent at distinguishing between default and non-default cases. The **Gini coefficient**, calculated as **2 * AUC-ROC - 1** (here 2 * 0.92 - 1 = 0.84), is similarly high, suggesting strong model performance.

### Step 2: Evaluating with AUC-PR

Now, you turn to the **Precision-Recall curve** and find a different story. Although Recall is high (the model identifies most default cases), **Precision is much lower**, suggesting that many of the predicted defaults are actually **false positives**. This means that while the model is good at flagging potential defaults, a large share of those flags turn out to be wrong. As a result, the **AUC-PR** score is significantly lower than the AUC-ROC score, reflecting the model’s **struggle with class imbalance**.
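
The contrast is easiest to see by drawing both curves for the same model. The sketch below uses a synthetic stand-in for the loan-default data (the dataset, the roughly 2% default rate, and the logistic regression are all assumptions for illustration, not the model described above):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

# Synthetic "loan default" data with roughly 2% defaults.
X, y = make_classification(n_samples=20_000, n_features=15,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax_roc)        # tends to look strong
PrecisionRecallDisplay.from_estimator(model, X_test, y_test, ax=ax_pr)  # tells the harder story
plt.tight_layout()
plt.show()
```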
### Step 3: What This Tells Us

This discrepancy between AUC-ROC and AUC-PR tells us that while the model might appear to perform well overall (high AUC-ROC), its **actual performance** in identifying and confidently predicting defaults is **suboptimal** (low AUC-PR). In practice, this could lead to **incorrect predictions**, where too many non-default cases are classified as defaults, resulting in unnecessary interventions or loss of trust in the model.
## Conclusion: Why AUC-PR Should Be Your Go-To for Imbalanced Data

For **imbalanced datasets**, AUC-ROC and Gini can **mislead** you into thinking your model performs well when, in fact, it struggles with the **minority class**. Metrics like **AUC-PR** offer a more focused evaluation by prioritizing **Precision** and **Recall**—two critical metrics for rare events where misclassification can be costly.

In practice, when evaluating models on tasks like **fraud detection**, **default prediction**, or **disease diagnosis**, where the positive class is rare but crucial, the **Precision-Recall curve** and **AUC-PR** give a more honest reflection of the model’s performance. While AUC-ROC might inflate the model's effectiveness by focusing on the majority class, AUC-PR shows how well the model **balances** Precision and Recall—two metrics that matter most in real-world applications where **rare events** have significant consequences.

### Key Takeaways:

- **AUC-ROC** and **Gini** are suitable for balanced datasets but can **overstate** model performance on imbalanced data.
- **AUC-PR** focuses on the **positive class**, providing a clearer view of how well the model handles **rare events**.
- When evaluating binary classifiers on **imbalanced datasets**, always consider using **AUC-PR** as it offers a more honest assessment of your model's strengths and weaknesses.

In your next machine learning project, especially when handling imbalanced datasets, prioritize **AUC-PR** over AUC-ROC and Gini for a clearer, more accurate evaluation of your model’s ability to manage rare but critical events.

0 commit comments
