Commit 638cb5c

Merge pull request #108 from DiogoRibeiro7/feat/reserve_branche
Feat/reserve branche
2 parents 43cb381 + ed98d2d commit 638cb5c

25 files changed, +356 -47 lines changed

_posts/-_ideas/2030-01-01-Article Title Ideas for Statistical Tests.md

Lines changed: 0 additions & 8 deletions
@@ -67,11 +67,3 @@ TODO:
6767
### 13. **"Granger Causality Test: Assessing Temporal Causal Relationships in Time-Series Data"**
6868
- Introduction to the Granger causality test for time-series data.
6969
- Applications in economics, climate science, and finance.
70-
71-
### 14. **"Shapiro-Wilk Test vs. Anderson-Darling: Checking for Normality in Small vs. Large Samples"**
72-
- Comparing two common tests for normality: Shapiro-Wilk and Anderson-Darling.
73-
- How sample size and distribution affect the choice of normality test.
74-
75-
### 15. **"Log-Rank Test: Comparing Survival Curves in Clinical Studies"**
76-
- Overview of the Log-Rank test for comparing survival distributions.
77-
- Applications in clinical trials, epidemiology, and medical research.
Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
1+
---
2+
author_profile: false
3+
categories:
4+
- Statistics
5+
classes: wide
6+
date: '2019-12-28'
7+
excerpt: Explore the differences between the Shapiro-Wilk and Anderson-Darling tests,
8+
two common methods for testing normality, and how sample size and distribution affect
9+
their performance.
10+
header:
11+
image: /assets/images/data_science_20.jpg
12+
og_image: /assets/images/data_science_20.jpg
13+
overlay_image: /assets/images/data_science_20.jpg
14+
show_overlay_excerpt: false
15+
teaser: /assets/images/data_science_20.jpg
16+
twitter_image: /assets/images/data_science_20.jpg
17+
keywords:
18+
- Shapiro-wilk test
19+
- Anderson-darling test
20+
- Normality test
21+
- Small sample size
22+
- Large sample size
23+
- Statistical distribution
24+
- Python
25+
seo_description: A comparison of the Shapiro-Wilk and Anderson-Darling tests for normality,
26+
analyzing their strengths and weaknesses based on sample size and distribution.
27+
seo_title: 'Shapiro-Wilk vs Anderson-Darling: Normality Tests for Small and Large
28+
Samples'
29+
seo_type: article
30+
summary: This article compares the Shapiro-Wilk and Anderson-Darling tests, emphasizing
31+
how sample size and distribution characteristics influence the choice of method
32+
when assessing normality.
33+
tags:
34+
- Normality testing
35+
- Shapiro-wilk test
36+
- Anderson-darling test
37+
- Sample size
38+
- Python
39+
title: 'Shapiro-Wilk Test vs. Anderson-Darling: Checking for Normality in Small vs.
40+
Large Samples'
41+
---
42+
43+
## Shapiro-Wilk Test vs. Anderson-Darling: Checking for Normality in Small vs. Large Samples
44+
45+
Testing for normality is a crucial step in many statistical analyses, particularly when using parametric tests that assume the data are normally distributed. Two of the most widely used normality tests are the **Shapiro-Wilk test** and the **Anderson-Darling test**. Although both assess whether a dataset follows a normal distribution, they perform differently depending on sample size and the characteristics of the underlying distribution. This article explores these differences and offers guidance on choosing the appropriate test for your data.
46+
47+
### 1. Understanding the Basics of Normality Testing
48+
49+
In statistics, many parametric tests (such as t-tests or ANOVAs) require the assumption that the data follows a normal distribution. While visual methods like histograms or Q-Q plots are useful for assessing normality, formal statistical tests like Shapiro-Wilk and Anderson-Darling provide quantitative measures.
50+
51+
#### Why Is Normality Important?
52+
53+
- **Parametric tests** (like the t-test, ANOVA) are based on the assumption that the underlying data follows a normal distribution.
54+
- **Non-normal data** can lead to inaccurate results in hypothesis testing, confidence intervals, and other statistical inferences.
55+
56+
The objective of normality tests is to determine whether to reject the hypothesis that a dataset is drawn from a normally distributed population.
57+
58+
### 2. Shapiro-Wilk Test: Best for Small Samples
59+
60+
The **Shapiro-Wilk test** is commonly regarded as one of the most powerful tests for detecting deviations from normality, especially for **small sample sizes** (usually \( n < 50 \)). It was introduced in 1965 by Shapiro and Wilk and is based on the correlation between the ordered data and the corresponding normal scores.
61+
62+
#### How Does It Work?
63+
64+
The Shapiro-Wilk test compares the ordered data points with the expected values of a normal distribution. The null hypothesis (\( H_0 \)) for the Shapiro-Wilk test states that the data is normally distributed. If the test produces a **p-value** below a predefined significance level (commonly 0.05), the null hypothesis is rejected, suggesting that the data is not normally distributed.
65+
66+
- **Test statistic**: The test statistic \( W \) is calculated using the equation:
67+
68+
$$
69+
W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
70+
$$
71+
72+
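where \( a_i \) are constants derived from the expected values and covariance matrix of the order statistics of a standard normal sample of size \( n \), \( x_{(i)} \) are the ordered sample values, \( x_i \) are the sample values in their original order, and \( \bar{x} \) is the sample mean. \( W \) lies between 0 and 1, and values close to 1 are consistent with normality.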
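The decision rule described above can be sketched in a few lines. This is a minimal illustration rather than a definitive recipe: the helper name `shapiro_decision`, the significance level `ALPHA = 0.05`, the seed, and the two simulated samples are assumptions made for the example, and `scipy.stats.shapiro` is used to obtain \( W \) and the p-value.

```python
import numpy as np
from scipy.stats import shapiro

ALPHA = 0.05  # assumed significance level, as in the text above


def shapiro_decision(sample, alpha=ALPHA):
    """Return (W, p, reject) for the Shapiro-Wilk test on `sample`.

    Hypothetical helper for illustration; `reject` is True when p < alpha.
    """
    w, p = shapiro(sample)
    return float(w), float(p), bool(p < alpha)


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
normal_sample = rng.normal(loc=5.0, scale=1.0, size=30)
skewed_sample = rng.exponential(scale=1.0, size=30)

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    w, p, reject = shapiro_decision(sample)
    print(f"{name:12s} W={w:.3f}  p={p:.3f}  reject H0: {reject}")
```

Failing to reject \( H_0 \) does not prove normality; with small samples the test may simply lack the power to detect a departure.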
73+
74+
#### Strengths of Shapiro-Wilk
75+
76+
- **High power with small samples**: The Shapiro-Wilk test is highly effective in detecting non-normality in small datasets, typically outperforming other tests when \( n \) is below 50.
77+
- **Sensitive to skewness and kurtosis**: It can detect deviations due to both the shape of the distribution and extreme values.
78+
79+
#### Limitations
80+
81+
- **Less effective for large samples**: When sample sizes increase significantly (e.g., \( n > 2000 \)), the Shapiro-Wilk test becomes overly sensitive and may flag trivial deviations as significant. SciPy's documentation also notes that the p-value may not be accurate for \( n > 5000 \).
82+
- **Slower computation**: The test involves more complex calculations, making it computationally heavier for larger datasets.
83+
84+
### 3. Anderson-Darling Test: Better for Large Samples
85+
86+
The **Anderson-Darling test** is another widely used normality test, which is a modification of the Kolmogorov-Smirnov test. It provides a more sensitive measure of the difference between the empirical distribution of the data and the expected cumulative distribution of a normal distribution. Unlike the Shapiro-Wilk test, the Anderson-Darling test performs well with **larger sample sizes**.
87+
88+
#### How Does It Work?
89+
90+
The Anderson-Darling test compares the observed cumulative distribution function (CDF) of the data to the expected CDF of the normal distribution. The test statistic \( A^2 \) is calculated based on the differences between these functions, giving more weight to the tails of the distribution:
91+
92+
- **Test statistic**: The Anderson-Darling statistic is computed as:
93+
94+
$$
95+
A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \left[ (2i-1) \left( \ln F(x_{(i)}) + \ln(1 - F(x_{(n+1-i)})) \right) \right]
96+
$$
97+
98+
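where \( F(x) \) is the cumulative distribution function of the normal distribution (with mean and standard deviation typically estimated from the sample), \( x_{(i)} \) are the ordered sample values, and \( n \) is the sample size.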
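To make the formula concrete, the sketch below evaluates \( A^2 \) term by term and compares it with the statistic reported by `scipy.stats.anderson`. It assumes the mean and standard deviation are estimated from the sample (using the sample standard deviation with `ddof=1`); the function name `anderson_darling_normal`, the seed, and the simulated data are illustrative choices.

```python
import numpy as np
from scipy.stats import anderson, norm


def anderson_darling_normal(sample):
    """Compute A^2 for normality, estimating mean and SD from the sample."""
    x = np.sort(np.asarray(sample, dtype=float))  # ordered values x_(i)
    n = x.size
    # Standardize with the sample mean and the sample SD (ddof=1)
    z = (x - x.mean()) / x.std(ddof=1)
    log_cdf = norm.logcdf(z)  # ln F(x_(i))
    log_sf = norm.logsf(z)    # ln(1 - F(x_(i)))
    i = np.arange(1, n + 1)
    # Reversing log_sf pairs index i with ln(1 - F(x_(n+1-i)))
    return -n - np.sum((2 * i - 1) * (log_cdf + log_sf[::-1])) / n


rng = np.random.default_rng(1)
sample = rng.normal(size=100)

manual = anderson_darling_normal(sample)
library = anderson(sample, dist="norm").statistic
print(f"manual A^2 = {manual:.6f}, scipy A^2 = {library:.6f}")
```

Working with `logcdf` and `logsf` instead of taking logarithms of `cdf` values avoids loss of precision in the extreme tails, which is where the statistic places most of its weight.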
99+
100+
#### Strengths of Anderson-Darling
101+
102+
- **More sensitive to tail behavior**: The Anderson-Darling test gives more weight to observations in the tails of the distribution, making it particularly useful for detecting deviations in the extremes.
103+
- **Suitable for larger samples**: It remains useful across a wide range of sample sizes and is especially reliable for larger datasets (e.g., \( n > 50 \)).
104+
105+
#### Limitations
106+
107+
- **Less powerful for small samples**: The Anderson-Darling test may not detect non-normality as effectively as the Shapiro-Wilk test for small datasets.
108+
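- **Oversensitive in very large samples**: Like most normality tests, it may flag statistically significant but practically negligible deviations from normality when \( n \) is very large. Such rejections are not Type I errors, because the null hypothesis is not exactly true, but they may have little practical relevance.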
109+
110+
### 4. Choosing Between Shapiro-Wilk and Anderson-Darling
111+
112+
The choice between Shapiro-Wilk and Anderson-Darling tests depends primarily on the **sample size** and the **type of deviations** you expect from normality.
113+
114+
#### Small Samples (\( n < 50 \))
115+
116+
For small sample sizes, the Shapiro-Wilk test is generally preferred due to its higher power and reliability. It is more sensitive to deviations in both the center and tails of the distribution in smaller datasets.
117+
118+
- **Recommendation**: Use Shapiro-Wilk for \( n < 50 \).
119+
120+
#### Large Samples (\( n > 200 \))
121+
122+
As sample size increases, the Shapiro-Wilk test can become too sensitive, flagging minor deviations as statistically significant. The Anderson-Darling test, with its focus on tail behavior, often provides a more balanced view of normality for larger samples.
123+
124+
- **Recommendation**: Use Anderson-Darling for larger samples, especially if deviations in the tails are of particular interest.
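One way to explore how sample size affects both tests on your own data is to run them side by side at several sizes. The sketch below is only an illustration under stated assumptions: the function name `sensitivity_by_size`, the choice of a Student's t distribution with 10 degrees of freedom as a "nearly normal" population, the sample sizes, and the seed are all hypothetical choices, and no particular outcome is implied.

```python
import numpy as np
from scipy.stats import anderson, shapiro


def sensitivity_by_size(sizes=(50, 500, 5000), df=10, seed=0):
    """Run both tests on t-distributed samples of increasing size.

    Returns rows of (n, Shapiro-Wilk p-value, A^2, 5% critical value).
    """
    rng = np.random.default_rng(seed)
    rows = []
    for n in sizes:
        sample = rng.standard_t(df, size=n)  # mildly heavy-tailed data
        _, p_sw = shapiro(sample)
        ad = anderson(sample, dist="norm")
        # Locate the critical value that corresponds to the 5% level
        idx = int(np.flatnonzero(np.isclose(ad.significance_level, 5.0))[0])
        rows.append((n, float(p_sw), float(ad.statistic),
                     float(ad.critical_values[idx])))
    return rows


for n, p_sw, a2, crit in sensitivity_by_size():
    print(f"n={n:5d}  Shapiro-Wilk p={p_sw:.4f}  A^2={a2:.3f}  5% critical={crit:.3f}")
```

Pairing the test results with a Q-Q plot or an effect-size measure helps judge whether a rejection at large \( n \) matters in practice.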
125+
126+
#### Mid-range Samples (\( 50 \leq n \leq 200 \))
127+
128+
For datasets that fall in this mid-range, both tests can be useful, depending on the nature of the data. If your analysis is concerned with tail behavior or extreme values, the Anderson-Darling test may be more informative. However, the Shapiro-Wilk test remains a reliable choice if computational efficiency is not a concern.
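The size-based guidance in this section can be summarized as a small helper. This is a sketch of the rule of thumb stated above, not an established API: the function name `suggest_normality_test` and its return labels are hypothetical, and the thresholds (50 and 200) are taken directly from the headings in this section.

```python
def suggest_normality_test(n: int) -> str:
    """Suggest a normality test from the sample size alone.

    Encodes the article's rule of thumb: Shapiro-Wilk below 50,
    Anderson-Darling above 200, and either (or both) in between.
    """
    if n < 1:
        raise ValueError("sample size must be positive")
    if n < 50:
        return "shapiro-wilk"
    if n > 200:
        return "anderson-darling"
    return "either"


for n in (20, 120, 1000):
    print(n, suggest_normality_test(n))
```

Sample size is only one input; the kind of deviation of interest (for example, tail behavior) may still tip the choice in the mid-range.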
129+
130+
### 5. Impact of Distribution Characteristics on Test Choice
131+
132+
Different distributions, especially those with heavy tails, skewness, or kurtosis, can influence the performance of normality tests. Both the Shapiro-Wilk and Anderson-Darling tests can detect non-normality, but their focus differs slightly.
133+
134+
- **Tail-heavy distributions**: The Anderson-Darling test is better suited for detecting deviations in the tails.
135+
- **Skewness and kurtosis**: The Shapiro-Wilk test is generally better at identifying issues related to skewness and kurtosis in smaller datasets.
136+
137+
### 6. Practical Considerations and Software Implementation
138+
139+
Both the Shapiro-Wilk and Anderson-Darling tests are widely implemented in statistical software such as R, Python (via SciPy), and SPSS. Here are examples of how to perform these tests in Python:
140+
141+
#### Shapiro-Wilk in Python
142+
143+
```python
from scipy.stats import shapiro

# Small illustrative sample
data = [4.5, 5.6, 7.8, 4.3, 6.1]

# shapiro returns the W statistic and the p-value
stat, p = shapiro(data)
print(f'Statistic={stat:.3f}, p={p:.3f}')
```
150+
151+
#### Anderson-Darling in Python
152+
153+
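```python
from scipy.stats import anderson

data = [4.5, 5.6, 7.8, 4.3, 6.1]

# dist='norm' (the default) tests against a normal distribution
result = anderson(data, dist='norm')
print(f'Statistic: {result.statistic:.3f}')

# Reject normality at a given level if the statistic exceeds its critical value
for level, critical in zip(result.significance_level, result.critical_values):
    print(f'{level:.1f}%: critical value {critical:.3f}')
```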
160+
161+
### 7. Conclusion: Which Test Should You Use?
162+
163+
Ultimately, the decision between the Shapiro-Wilk and Anderson-Darling tests depends on your sample size and the nature of the deviations you want to detect. For small samples, the Shapiro-Wilk test is a powerful and reliable option, while the Anderson-Darling test offers a more flexible and tail-sensitive approach, particularly useful for larger datasets.
164+
165+
Both tests provide valuable insights into the distribution of your data, ensuring you can make informed decisions in parametric testing and beyond.

_posts/2020-01-07-how_big_data_transforming_predictive_maintenance.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
11
---
22
author_profile: false
33
categories:
4-
- Big Data
4+
- Data Science
55
classes: wide
66
date: '2020-01-07'
77
excerpt: Big Data is revolutionizing predictive maintenance by offering unprecedented
@@ -56,7 +56,7 @@ title: How Big Data is Transforming Predictive Maintenance
5656
---
5757
author_profile: false
5858
categories:
59-
- Big Data
59+
- Data Science
6060
classes: wide
6161
date: '2020-01-07'
6262
excerpt: Big Data is revolutionizing predictive maintenance by offering unprecedented

_posts/2021-06-01-customer_segmentation.md

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
author_profile: false
33
categories:
4-
- Customer Analytics
4+
- Data Science
55
classes: wide
66
date: '2021-06-01'
77
excerpt: RFM Segmentation (Recency, Frequency, Monetary Value) is a widely used method
@@ -15,6 +15,7 @@ header:
1515
teaser: /assets/images/data_science_9.jpg
1616
twitter_image: /assets/images/data_science_1.jpg
1717
keywords:
18+
- Customer analytics
1819
- Customer segmentation
1920
- Unsupervised learning
2021
- Data science

_posts/2022-07-26-geospatial_data_for_public_health_insights.md renamed to _posts/2022-07-26-geospatial_data_public_health_insights.md

Lines changed: 9 additions & 0 deletions
@@ -4,9 +4,17 @@ categories:
44
- Data Science
55
- Public Health
66
classes: wide
7+
date: '2022-07-26'
78
excerpt: Spatial epidemiology combines geospatial data with data science techniques
89
to track and analyze disease outbreaks, offering public health agencies critical
910
tools for intervention and planning.
11+
header:
12+
image: /assets/images/data_science_19.jpg
13+
og_image: /assets/images/data_science_19.jpg
14+
overlay_image: /assets/images/data_science_19.jpg
15+
show_overlay_excerpt: false
16+
teaser: /assets/images/data_science_19.jpg
17+
twitter_image: /assets/images/data_science_19.jpg
1018
keywords:
1119
- Spatial epidemiology
1220
- Geospatial data
@@ -18,6 +26,7 @@ seo_description: Explore how geospatial data is revolutionizing public health. L
1826
how spatial epidemiology and data science techniques track disease outbreaks and
1927
offer critical insights for health interventions.
2028
seo_title: 'Spatial Epidemiology: Leveraging Geospatial Data in Public Health'
29+
seo_type: article
2130
summary: This article explores the importance of geospatial data in spatial epidemiology,
2231
focusing on how it is used to track and analyze disease outbreaks. It delves into
2332
the integration of spatial data with data science methods and how these insights

_posts/2023-12-30-expected_shortfall.md

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
22
author_profile: false
33
categories:
44
- Data Science
5-
- Financial Risk Management
65
classes: wide
76
date: '2023-12-30'
87
excerpt: A comprehensive comparison of Value at Risk (VaR) and Expected Shortfall

_posts/2024-02-01-customer_life_value.md

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
22
author_profile: false
33
categories:
44
- Machine Learning
5-
- Data Science
65
classes: wide
76
date: '2024-02-01'
87
excerpt: Discover the importance of Customer Lifetime Value (CLV) in shaping business

_posts/2024-05-21-Probability_integral_transform.md

Lines changed: 57 additions & 0 deletions
@@ -7,14 +7,33 @@ categories:
77
- Machine Learning
88
classes: wide
99
date: '2024-05-21'
10+
excerpt: An in-depth guide to understanding and applying the Probability Integral
11+
Transform in various fields, from finance to statistics.
1012
header:
1113
image: /assets/images/data_science_2.jpg
1214
og_image: /assets/images/data_science_3.jpg
1315
overlay_image: /assets/images/data_science_2.jpg
1416
show_overlay_excerpt: false
1517
teaser: /assets/images/data_science_2.jpg
1618
twitter_image: /assets/images/data_science_3.jpg
19+
keywords:
20+
- Probability integral transform
21+
- Cumulative distribution function
22+
- Goodness of fit
23+
- Copula construction
24+
- Financial risk management
25+
- Monte carlo simulations
26+
- Hypothesis testing
27+
- Credit risk modeling
28+
- R
29+
seo_description: A comprehensive exploration of the probability integral transform,
30+
its theoretical foundations, and practical applications in fields such as risk management
31+
and marketing mix modeling.
32+
seo_title: 'Probability Integral Transform: Theory and Applications'
1733
seo_type: article
34+
summary: This article explains the Probability Integral Transform, its role in statistical
35+
modeling, and how it is applied in diverse fields like risk management, hypothesis
36+
testing, and Monte Carlo simulations.
1837
tags:
1938
- Probability integral transform
2039
- Cumulative distribution function
@@ -131,14 +150,33 @@ categories:
131150
- Machine Learning
132151
classes: wide
133152
date: '2024-05-21'
153+
excerpt: An in-depth guide to understanding and applying the Probability Integral
154+
Transform in various fields, from finance to statistics.
134155
header:
135156
image: /assets/images/data_science_2.jpg
136157
og_image: /assets/images/data_science_3.jpg
137158
overlay_image: /assets/images/data_science_2.jpg
138159
show_overlay_excerpt: false
139160
teaser: /assets/images/data_science_2.jpg
140161
twitter_image: /assets/images/data_science_3.jpg
162+
keywords:
163+
- Probability integral transform
164+
- Cumulative distribution function
165+
- Goodness of fit
166+
- Copula construction
167+
- Financial risk management
168+
- Monte carlo simulations
169+
- Hypothesis testing
170+
- Credit risk modeling
171+
- R
172+
seo_description: A comprehensive exploration of the probability integral transform,
173+
its theoretical foundations, and practical applications in fields such as risk management
174+
and marketing mix modeling.
175+
seo_title: 'Probability Integral Transform: Theory and Applications'
141176
seo_type: article
177+
summary: This article explains the Probability Integral Transform, its role in statistical
178+
modeling, and how it is applied in diverse fields like risk management, hypothesis
179+
testing, and Monte Carlo simulations.
142180
tags:
143181
- Probability integral transform
144182
- Cumulative distribution function
@@ -259,14 +297,33 @@ categories:
259297
- Machine Learning
260298
classes: wide
261299
date: '2024-05-21'
300+
excerpt: An in-depth guide to understanding and applying the Probability Integral
301+
Transform in various fields, from finance to statistics.
262302
header:
263303
image: /assets/images/data_science_2.jpg
264304
og_image: /assets/images/data_science_3.jpg
265305
overlay_image: /assets/images/data_science_2.jpg
266306
show_overlay_excerpt: false
267307
teaser: /assets/images/data_science_2.jpg
268308
twitter_image: /assets/images/data_science_3.jpg
309+
keywords:
310+
- Probability integral transform
311+
- Cumulative distribution function
312+
- Goodness of fit
313+
- Copula construction
314+
- Financial risk management
315+
- Monte carlo simulations
316+
- Hypothesis testing
317+
- Credit risk modeling
318+
- R
319+
seo_description: A comprehensive exploration of the probability integral transform,
320+
its theoretical foundations, and practical applications in fields such as risk management
321+
and marketing mix modeling.
322+
seo_title: 'Probability Integral Transform: Theory and Applications'
269323
seo_type: article
324+
summary: This article explains the Probability Integral Transform, its role in statistical
325+
modeling, and how it is applied in diverse fields like risk management, hypothesis
326+
testing, and Monte Carlo simulations.
270327
tags:
271328
- Probability integral transform
272329
- Cumulative distribution function

_posts/2024-07-11-pre_commit.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
author_profile: false
33
categories:
4-
- Software Development
4+
- Python
55
classes: wide
66
date: '2024-07-11'
77
header:

_posts/2024-07-16-Einstein.md

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
11
---
22
author_profile: false
33
categories:
4-
- Science
54
- Data Analysis
65
classes: wide
76
date: '2024-07-16'
