
Commit bdc0370

Merge pull request #74 from DiogoRibeiro7/feat/reserve_branche
Feat/reserve branche
2 parents 03ed4d9 + 26f8703 commit bdc0370

13 files changed (+1369, -31 lines)

_posts/-_ideas/2030-01-01-climate_change.md

Lines changed: 2 additions & 4 deletions
@@ -31,14 +31,12 @@ tags:
 title: Exploring Climate Change, Sustainability, and Data Science
 ---
 
-### 1. Leveraging Big Data for Climate Change Mitigation
-Explore how large-scale data collection and analysis are helping scientists better understand climate patterns and predict changes. Discuss the role of satellite data, sensors, and environmental monitoring.
+## TODO:
 
 ### 2. The Role of Machine Learning in Predicting Climate Change Impacts
 Examine how machine learning algorithms are being used to model and predict the future impacts of climate change. Focus on predictive analytics for extreme weather, sea-level rise, and biodiversity loss.
 
-### 3. Sustainability Analytics: How Data Science Drives Green Innovation
-Look at how companies and organizations are using data science to improve sustainability practices. Cover areas like resource optimization, waste reduction, and improving the energy efficiency of supply chains.
+
 
 ### 4. AI and Machine Learning in Renewable Energy Optimization
 Discuss how AI is used to optimize renewable energy sources, such as wind and solar power, by improving energy forecasting, managing grid systems, and balancing energy storage.

_posts/-_ideas/2030-01-01-data_model_drift.md

Lines changed: 0 additions & 3 deletions
@@ -13,9 +13,6 @@ tags: []
 
 ## Article Ideas on Data Drift and Model Drift
 
-### 1. **Understanding Data Drift: What It Is and Why It Matters in Machine Learning**
-- **Overview**: Explain the concept of data drift and how changes in input data distribution can affect model performance over time.
-- **Focus**: Define types of data drift (e.g., **covariate drift**, **label drift**, and **concept drift**), with practical examples from industries like finance and healthcare.
 
 ### 2. **Model Drift: Why Even the Best Machine Learning Models Fail Over Time**
 - **Overview**: Explore the concept of model drift and how changes in the environment or target variable can degrade model accuracy.

_posts/-_ideas/2030-01-01-ideas_statistical_tests.md

Lines changed: 0 additions & 18 deletions
@@ -73,29 +73,11 @@ Here are some interesting article ideas centered around statistical tests, desig
 - Explain how the sign test works as a non-parametric test.
 - Compare the sign test with the Wilcoxon signed-rank test and other small-sample tests.
 
-### 8. **"Levene's Test vs. Bartlett's Test: Checking for Homogeneity of Variances"**
-- Comparison of Levene's and Bartlett's tests for checking homogeneity of variances in data.
-- Discuss when to use each test (parametric vs. non-parametric, normal vs. non-normal data).
-- Application in conjunction with ANOVA and other tests that assume equal variances.
 
-### 9. **"Multiple Comparisons Problem: Bonferroni Correction and Other Solutions"**
-- Explain the multiple comparisons problem that arises in hypothesis testing.
-- Discuss the Bonferroni correction and other methods (e.g., Holm-Bonferroni, FDR).
-- Show practical applications in experiments with multiple testing (e.g., medical studies, genetics).
 
-### 10. **"Kolmogorov-Smirnov Test: Assessing Goodness-of-Fit in Non-Parametric Data"**
-- Introduction to the Kolmogorov-Smirnov (K-S) test for checking distribution fit.
-- Compare K-S test with other goodness-of-fit tests like Shapiro-Wilk.
-- Explore real-world use cases, such as testing if a dataset follows a specific distribution.
 
 
-### 12. **"Log-Rank Test in Survival Analysis: Comparing Survival Curves"**
-- Introduction to the log-rank test used in survival analysis.
-- Discuss its applications in medical studies to compare survival times between two groups.
-- Explain how to interpret survival curves and the p-values from a log-rank test.
-
-
 

_posts/2020-01-02-maximum_likelihood_estimation_statistical_modeling.md

Lines changed: 6 additions & 6 deletions
@@ -59,7 +59,7 @@ The likelihood function is at the heart of MLE. It measures how likely the obser
 
 $$ x_1, x_2, \dots, x_n $$
 
-These observations are assumed to be drawn from some probability distribution, say $p(x | \theta)$, where $\theta$ represents the unknown parameters of the model. The likelihood function is the product of the probability density (or mass) functions for all observations:
+These observations are assumed to be drawn from some probability distribution, say $$p(x | \theta)$$, where $$\theta$$ represents the unknown parameters of the model. The likelihood function is the product of the probability density (or mass) functions for all observations:
 
 $$ L(\theta) = p(x_1 \mid \theta) \times p(x_2 \mid \theta) \times \dots \times p(x_n \mid \theta) $$
 
@@ -75,11 +75,11 @@ $$ \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) $$
 
 ### 2.3 Maximization
 
-The objective of MLE is to find the parameter values that maximize the log-likelihood function. This is typically done by taking the derivative of the log-likelihood with respect to the parameter $\theta$, setting it equal to zero, and solving for $\theta$:
+The objective of MLE is to find the parameter values that maximize the log-likelihood function. This is typically done by taking the derivative of the log-likelihood with respect to the parameter $$\theta$$, setting it equal to zero, and solving for $$\theta$$:
 
 $$ \frac{\partial}{\partial \theta} \log L(\theta) = 0 $$
 
-This solution gives the maximum likelihood estimate of $\theta$, which is denoted as $\hat{\theta}$.
+This solution gives the maximum likelihood estimate of $$\theta$$, which is denoted as $$\hat{\theta}$$.
 
 ## 3. Why MLE is Essential in Data Science
 
@@ -303,16 +303,16 @@ Subclasses are expected to implement these methods.
 
 #### Normal Distribution MLE (`MLENormal`):
 
-- The `log_likelihood()` method computes the log-likelihood for the normal distribution given mean ($\mu$) and variance ($\sigma^2$).
+- The `log_likelihood()` method computes the log-likelihood for the normal distribution given mean ($$\mu$$) and variance ($$\sigma^2$$).
 - The `fit()` method estimates the parameters (mean and variance) using the following formulas:
 
 $$ \hat{\mu} = \text{mean}(data) $$
 $$ \hat{\sigma^2} = \text{variance}(data) $$
 
 #### Bernoulli Distribution MLE (`MLEBernoulli`):
 
-- The `log_likelihood()` method computes the log-likelihood for the Bernoulli distribution based on the probability $p$ of success.
-- The `fit()` method estimates the probability $p$ using the formula:
+- The `log_likelihood()` method computes the log-likelihood for the Bernoulli distribution based on the probability $$p$$ of success.
+- The `fit()` method estimates the probability $$p$$ using the formula:
 
 $$ \hat{p} = \text{mean}(data) $$
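
The closed-form estimates in this hunk are easy to verify numerically. Below is a minimal, self-contained sketch; the post's `MLENormal` and `MLEBernoulli` classes are assumed to implement the same formulas, and the data here is simulated for illustration.

```python
# Hypothetical sketch of the closed-form MLEs discussed in this diff.
import numpy as np

rng = np.random.default_rng(0)

# Normal: the MLEs are the sample mean and the 1/n (biased) sample variance.
x = rng.normal(loc=2.0, scale=1.5, size=1_000)
mu_hat = np.mean(x)
sigma2_hat = np.var(x, ddof=0)  # ddof=0 gives the MLE, not the unbiased 1/(n-1) estimate

# Bernoulli: the MLE of p is the observed proportion of successes.
b = rng.binomial(n=1, p=0.3, size=1_000)
p_hat = np.mean(b)

print(mu_hat, sigma2_hat, p_hat)
```
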
Lines changed: 152 additions & 0 deletions

@@ -0,0 +1,152 @@

---
author_profile: false
categories:
- Statistics
- Data Science
- Machine Learning
classes: wide
date: '2020-01-03'
excerpt: The Kolmogorov-Smirnov test is a powerful tool for assessing goodness-of-fit in non-parametric data. Learn how it works, how it compares to the Shapiro-Wilk test, and explore real-world applications.
header:
  image: /assets/images/data_science_3.jpg
  og_image: /assets/images/data_science_3.jpg
  overlay_image: /assets/images/data_science_3.jpg
  show_overlay_excerpt: false
  teaser: /assets/images/data_science_3.jpg
  twitter_image: /assets/images/data_science_3.jpg
keywords:
- Kolmogorov-Smirnov test
- goodness-of-fit tests
- non-parametric statistics
- distribution fitting
- Shapiro-Wilk test
seo_description: This article introduces the Kolmogorov-Smirnov test for assessing goodness-of-fit in non-parametric data, comparing it with other tests like Shapiro-Wilk, and exploring real-world use cases.
seo_title: 'Kolmogorov-Smirnov Test: A Guide to Non-Parametric Goodness-of-Fit Testing'
seo_type: article
summary: This article explains the Kolmogorov-Smirnov (K-S) test for assessing the goodness-of-fit of non-parametric data. We compare the K-S test to other goodness-of-fit tests, such as Shapiro-Wilk, and provide real-world use cases, including testing whether a dataset follows a specific distribution.
tags:
- Kolmogorov-Smirnov Test
- Goodness-of-Fit Tests
- Non-Parametric Data
- Shapiro-Wilk Test
- Distribution Fitting
title: 'Kolmogorov-Smirnov Test: Assessing Goodness-of-Fit in Non-Parametric Data'
---

## Introduction to the Kolmogorov-Smirnov Test

The **Kolmogorov-Smirnov (K-S) test** is a widely used statistical method for assessing whether a sample of data follows a specific distribution. As a **non-parametric** test, the K-S test does not assume any specific underlying data distribution, making it particularly valuable for situations where we cannot confidently assume a parametric form like the normal distribution. Instead, the K-S test compares the cumulative distribution function (CDF) of the observed data with the CDF of a reference distribution, assessing how well they align.

The K-S test is especially useful for:

- **Goodness-of-fit testing**: Evaluating whether an observed dataset conforms to a known distribution, such as normal, uniform, or exponential.
- **Comparing two distributions**: Testing whether two independent samples come from the same distribution.

In this article, we will explore how the Kolmogorov-Smirnov test works, how it compares to other goodness-of-fit tests like **Shapiro-Wilk**, and some real-world applications where the K-S test is particularly useful.

## How the Kolmogorov-Smirnov Test Works

The K-S test compares the **empirical cumulative distribution function (ECDF)** of a dataset to the CDF of a specified reference distribution. The goal is to determine whether the two distributions differ significantly, which would indicate that the observed data does not follow the hypothesized distribution.

### 1.1 Steps of the K-S Test

1. **Define the hypothesis**:
   - **Null hypothesis ($H_0$):** The sample data follows the specified distribution (e.g., normal, uniform, etc.).
   - **Alternative hypothesis ($H_A$):** The sample data does not follow the specified distribution.

2. **Compute the empirical cumulative distribution function (ECDF)**: The ECDF represents the proportion of observed data points less than or equal to each value in the dataset. This is the observed distribution.

3. **Compare the ECDF to the reference distribution's CDF**: The test calculates the **maximum difference** between the ECDF and the CDF of the specified reference distribution. This difference is denoted by **D**.

4. **Calculate the test statistic (D)**: The test statistic for the K-S test is the maximum absolute difference between the ECDF and the reference CDF:

   $$
   D = \max_x |F_n(x) - F(x)|
   $$

   Where:
   - $F_n(x)$ is the ECDF of the sample data.
   - $F(x)$ is the CDF of the reference distribution.

5. **Interpret the results**: The calculated **D-statistic** is compared to a critical value from the **Kolmogorov distribution**, and a p-value is derived from this comparison. If the p-value is lower than a chosen significance level (usually 0.05), the null hypothesis is rejected, meaning that the data does not follow the specified distribution. A code sketch of this procedure follows the list.

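As a minimal illustration of these steps, here is a sketch of the one-sample test in Python using `scipy.stats.kstest`; the data and the reference distribution are invented for the example and are not from the original post.

```python
# Minimal sketch: one-sample K-S test on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical sample

# H0: the sample follows a standard normal distribution N(0, 1).
# kstest computes D = max |F_n(x) - F(x)| and the corresponding p-value.
result = stats.kstest(sample, "norm")
print(f"D = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
# A p-value below 0.05 would lead us to reject H0 at the 5% level.
```
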
### 1.2 One-Sample and Two-Sample K-S Tests

There are two main variations of the K-S test:

- **One-sample K-S test**: Used to compare an observed sample to a specific theoretical distribution (e.g., testing if data is normally distributed).
- **Two-sample K-S test**: Used to compare two independent samples to see if they are drawn from the same distribution; see the sketch after this list.

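A minimal sketch of the two-sample variant using `scipy.stats.ks_2samp`, with both samples simulated for illustration:

```python
# Minimal sketch: two-sample K-S test on two simulated samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample_a = rng.normal(loc=0.0, scale=1.0, size=150)
sample_b = rng.normal(loc=0.5, scale=1.0, size=180)  # shifted relative to sample_a

# H0: both samples are drawn from the same (unspecified) distribution.
stat, p_value = stats.ks_2samp(sample_a, sample_b)
print(f"D = {stat:.4f}, p-value = {p_value:.4f}")
```
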
### 1.3 P-Value Interpretation

The **p-value** from the K-S test tells us the probability of observing a test statistic as extreme as the one computed, assuming the null hypothesis is true. A small p-value (typically less than 0.05) suggests that the observed data does not come from the specified distribution.

## K-S Test vs. Other Goodness-of-Fit Tests

Several other goodness-of-fit tests serve purposes similar to the Kolmogorov-Smirnov test. The key differences between them lie in their assumptions, their sensitivity to different types of deviations, and their specific use cases.

### 2.1 Shapiro-Wilk Test

The **Shapiro-Wilk test** is one of the most commonly used goodness-of-fit tests for assessing **normality**. Unlike the K-S test, the Shapiro-Wilk test is tied to a single parametric family: it tests specifically whether a sample comes from a normal distribution.

#### Comparison to K-S Test:

- **Assumptions**: The Shapiro-Wilk test is used strictly for testing normality, while the K-S test can be applied to any reference distribution, making it more versatile.
- **Sensitivity**: The Shapiro-Wilk test is more powerful (i.e., it has higher sensitivity) for detecting deviations from normality, especially in small samples. The K-S test may be less sensitive in detecting small differences between the empirical and reference distributions.
- **Use cases**: Shapiro-Wilk is preferred for small datasets and when specifically testing for normality. The K-S test is ideal for larger datasets or when testing goodness-of-fit to any distribution (normal, uniform, exponential, etc.). A side-by-side sketch follows this list.

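The following sketch runs both tests on the same simulated, deliberately non-normal sample, using `scipy.stats.shapiro` and `scipy.stats.kstest`; all data here are invented for illustration.

```python
# Comparing Shapiro-Wilk and one-sample K-S on the same simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.exponential(scale=1.0, size=80)  # deliberately non-normal

sw_stat, sw_p = stats.shapiro(data)         # tests normality only
ks_stat, ks_p = stats.kstest(data, "norm")  # vs. a fixed N(0, 1) reference
print(f"Shapiro-Wilk: W = {sw_stat:.4f}, p = {sw_p:.4g}")
print(f"K-S:          D = {ks_stat:.4f}, p = {ks_p:.4g}")
```
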
### 2.2 Anderson-Darling Test

The **Anderson-Darling test** is an extension of the K-S test that gives more weight to the tails of the distribution. It is particularly useful when the focus is on the fit in the tails, as might be the case in risk modeling or financial applications.

#### Comparison to K-S Test:

- **Tail sensitivity**: The Anderson-Darling test is more sensitive to differences in the tails of the distribution than the K-S test, which treats all parts of the distribution equally.
- **Use cases**: Anderson-Darling is favored when deviations in the tails of the distribution are critical, such as in stress testing for financial risk or extreme event modeling. A short sketch follows this list.

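For reference, `scipy.stats.anderson` reports the A-D statistic alongside critical values rather than a p-value; a minimal sketch on simulated data:

```python
# Minimal sketch: Anderson-Darling normality test in SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
data = rng.normal(size=100)  # hypothetical sample

result = stats.anderson(data, dist="norm")
print(f"A-D statistic: {result.statistic:.4f}")
# Reject H0 at a given level if the statistic exceeds the critical value.
for cv, sl in zip(result.critical_values, result.significance_level):
    print(f"  {sl:g}% level: critical value = {cv:.3f}")
```
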
### 2.3 Chi-Squared Test

The **chi-squared goodness-of-fit test** compares the frequency distribution of observed data to a theoretically expected frequency distribution. It is widely used for categorical data but can also be applied to continuous data if binned into categories.

#### Comparison to K-S Test:

- **Assumptions**: The chi-squared test requires the data to be grouped into categories, which may involve loss of information when applied to continuous data. The K-S test, by contrast, works directly with continuous data without binning.
- **Data type**: Chi-squared is often used for categorical data, while the K-S test is better suited for continuous data.
- **Sensitivity**: The chi-squared test can be unreliable with small sample sizes or when the expected frequencies in some categories are very low. A binned example follows this list.

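A binned chi-squared goodness-of-fit check might look like this sketch; the bin count and the uniform null hypothesis are arbitrary choices for illustration.

```python
# Minimal sketch: chi-squared goodness-of-fit against a uniform hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.uniform(low=0.0, high=1.0, size=500)

# Bin the continuous data into 10 equal-width categories.
observed, _ = np.histogram(data, bins=10, range=(0.0, 1.0))
expected = np.full(10, len(data) / 10)  # uniform H0: equal counts per bin

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
```
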
## Real-World Use Cases of the Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is applicable across various fields, including finance, biology, engineering, and data science. Below, we explore some real-world use cases where the K-S test is particularly valuable.

### 3.1 Testing for Normality in Finance

In financial markets, it is often necessary to test whether the returns on stocks, bonds, or other financial instruments follow a **normal distribution**. Many financial models, such as those used for portfolio optimization or risk management, assume that returns are normally distributed.

The one-sample K-S test can be used to assess whether historical returns data conform to a normal distribution. If the p-value is low, analysts might conclude that the returns deviate significantly from normality, which would affect the assumptions of the models they are using.

#### Example:

An investment firm wants to test whether the daily returns of a stock over the past year follow a normal distribution. By applying the K-S test, they compare the ECDF of the daily returns to the CDF of a normal distribution with the same mean and standard deviation. A significant p-value would suggest that the returns are not normally distributed, and the firm may need to revise its risk models.

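A sketch of this workflow, with simulated fat-tailed returns standing in for real price data. One caveat worth noting: when the mean and standard deviation are estimated from the same sample being tested, the standard K-S p-value is only approximate, and a Lilliefors-type correction is the more rigorous choice.

```python
# Hypothetical sketch: testing simulated daily returns for normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
returns = 0.01 * rng.standard_t(df=4, size=252)  # fat-tailed stand-in for real returns

mu, sigma = returns.mean(), returns.std(ddof=1)
# Caveat: estimating mu and sigma from the same data makes this p-value
# approximate; the Lilliefors test corrects for this.
result = stats.kstest(returns, "norm", args=(mu, sigma))
print(f"D = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```
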
### 3.2 Quality Control in Manufacturing

In manufacturing, the K-S test can be used to determine whether a batch of products conforms to a specific tolerance level for a continuous variable, such as weight or size. Ensuring that products follow the expected distribution can be critical for maintaining quality and consistency in production.

#### Example:

A company that manufactures precision-engineered components wants to ensure that the diameter of its parts follows a uniform distribution within a specific tolerance range. By applying the K-S test, the company compares the observed distribution of part diameters to a uniform distribution. If the p-value from the K-S test is below the threshold, the company may need to investigate potential issues in the manufacturing process.

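In code, testing part diameters against a uniform tolerance band could be sketched as follows; the tolerance limits and measurements are invented for the example.

```python
# Hypothetical sketch: are part diameters uniform over a tolerance band?
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
diameters = rng.uniform(low=9.95, high=10.05, size=400)  # simulated measurements

# H0: diameters ~ Uniform(9.95, 10.05); in scipy terms, loc=9.95, scale=0.10.
result = stats.kstest(diameters, "uniform", args=(9.95, 0.10))
print(f"D = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```
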
### 3.3 Ecological Studies: Comparing Species Distributions

In ecology, researchers often compare the distribution of species in different habitats or regions to understand environmental influences on biodiversity. The two-sample K-S test is useful for comparing the distribution of species abundances between different ecosystems or time periods.

#### Example:

An ecologist is studying the distribution of bird species in two different regions to determine if the environmental conditions result in significant differences in species diversity. The two-sample K-S test can be used to compare the distributions of bird counts in the two regions. A significant result would indicate that the distributions differ, suggesting that the regions have different environmental characteristics affecting biodiversity.

## Conclusion

The **Kolmogorov-Smirnov test** is a powerful, versatile tool for assessing goodness-of-fit in non-parametric settings and for comparing two distributions. Its ability to test data against any theoretical distribution, without the need for strong parametric assumptions, makes it particularly useful in many fields, from finance to ecology.

While the K-S test has some limitations, such as lower sensitivity than tests like Shapiro-Wilk for detecting deviations from normality, its flexibility and simplicity make it a popular choice for distributional comparisons. By understanding how to use the K-S test and how to interpret its results, data scientists and researchers can draw meaningful conclusions about the underlying patterns in their data.
