Commit 7860f2e

feat: new article
1 parent dc11e6d commit 7860f2e

3 files changed

+172
-8
lines changed

_posts/-_ideas/2030-01-01-Article Title Ideas for Statistical Tests.md

Lines changed: 0 additions & 8 deletions
@@ -67,11 +67,3 @@ TODO:
6767
### 13. **"Granger Causality Test: Assessing Temporal Causal Relationships in Time-Series Data"**
6868
- Introduction to the Granger causality test for time-series data.
6969
- Applications in economics, climate science, and finance.
70-
71-
### 14. **"Shapiro-Wilk Test vs. Anderson-Darling: Checking for Normality in Small vs. Large Samples"**
72-
- Comparing two common tests for normality: Shapiro-Wilk and Anderson-Darling.
73-
- How sample size and distribution affect the choice of normality test.
74-
75-
### 15. **"Log-Rank Test: Comparing Survival Curves in Clinical Studies"**
76-
- Overview of the Log-Rank test for comparing survival distributions.
77-
- Applications in clinical trials, epidemiology, and medical research.
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
1+
---
2+
author_profile: false
3+
categories:
4+
- Statistics
5+
classes: wide
6+
date: '2019-12-28'
7+
excerpt: Explore the differences between the Shapiro-Wilk and Anderson-Darling tests,
8+
two common methods for testing normality, and how sample size and distribution affect
9+
their performance.
10+
header:
11+
image: /assets/images/data_science_20.jpg
12+
og_image: /assets/images/data_science_20.jpg
13+
overlay_image: /assets/images/data_science_20.jpg
14+
show_overlay_excerpt: false
15+
teaser: /assets/images/data_science_20.jpg
16+
twitter_image: /assets/images/data_science_20.jpg
17+
keywords:
18+
- Shapiro-wilk test
19+
- Anderson-darling test
20+
- Normality test
21+
- Small sample size
22+
- Large sample size
23+
- Statistical distribution
24+
seo_description: A comparison of the Shapiro-Wilk and Anderson-Darling tests for normality,
25+
analyzing their strengths and weaknesses based on sample size and distribution.
26+
seo_title: 'Shapiro-Wilk vs Anderson-Darling: Normality Tests for Small and Large
27+
Samples'
28+
seo_type: article
29+
summary: This article compares the Shapiro-Wilk and Anderson-Darling tests, emphasizing
30+
how sample size and distribution characteristics influence the choice of method
31+
when assessing normality.
32+
tags:
33+
- Normality testing
34+
- Shapiro-wilk test
35+
- Anderson-darling test
36+
- Sample size
37+
title: 'Shapiro-Wilk Test vs. Anderson-Darling: Checking for Normality in Small vs.
38+
Large Samples'
39+
---
40+
41+
## Shapiro-Wilk Test vs. Anderson-Darling: Checking for Normality in Small vs. Large Samples
42+
43+
Testing for normality is a common step in many statistical analyses, particularly when using parametric tests that assume normally distributed data. Two of the most widely used normality tests are the **Shapiro-Wilk test** and the **Anderson-Darling test**. Although both assess whether a dataset is consistent with a normal distribution, they perform differently depending on sample size and the characteristics of the underlying distribution. This article explores these differences and offers guidance on choosing the appropriate test for your data.
44+
45+
### 1. Understanding the Basics of Normality Testing
46+
47+
In statistics, many parametric tests (such as t-tests or ANOVAs) require the assumption that the data follows a normal distribution. While visual methods like histograms or Q-Q plots are useful for assessing normality, formal statistical tests like Shapiro-Wilk and Anderson-Darling provide quantitative measures.
48+
49+
#### Why Is Normality Important?
50+
51+
- **Parametric tests** (like the t-test, ANOVA) are based on the assumption that the underlying data follows a normal distribution.
52+
- **Non-normal data** can lead to inaccurate results in hypothesis testing, confidence intervals, and other statistical inferences.
53+
54+
The objective of normality tests is to determine whether to reject the hypothesis that a dataset is drawn from a normally distributed population.
55+
56+
### 2. Shapiro-Wilk Test: Best for Small Samples
57+
58+
The **Shapiro-Wilk test** is widely regarded as one of the most powerful tests for detecting deviations from normality, especially for **small sample sizes** (usually \( n < 50 \)). Introduced by Shapiro and Wilk in 1965, it measures how closely the ordered sample values track the order statistics expected from a normal distribution.
59+
60+
#### How Does It Work?
61+
62+
The Shapiro-Wilk test compares the ordered data points with the values expected for the order statistics of a normal sample. The null hypothesis (\( H_0 \)) states that the data is normally distributed. If the test produces a **p-value** below a predefined significance level (commonly 0.05), the null hypothesis is rejected, suggesting that the data is not normally distributed. A p-value above that level means only that the test found no evidence against normality, not that normality has been shown.
63+
64+
- **Test statistic**: The test statistic \( W \) is calculated using the equation:
65+
66+
$$
W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$
69+
70+
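where \( a_i \) are constant weights derived from the expected values and covariance matrix of the order statistics of a standard normal sample of size \( n \), \( x_{(i)} \) are the ordered sample values, and \( \bar{x} \) is the sample mean. Values of \( W \) close to 1 are consistent with normality; smaller values indicate departure from it.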
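To make the structure of \( W \) concrete, here is a minimal sketch that uses only the Python standard library. It does not compute the exact Shapiro-Wilk coefficients: as a labeled simplification, it swaps in Shapiro-Francia-style weights built from Blom's approximation to the normal scores. The function name `shapiro_francia_w` and both samples are hypothetical, chosen for illustration only.

```python
import math
from statistics import NormalDist, fmean


def shapiro_francia_w(sample):
    """Approximate W with Blom normal scores as weights (Shapiro-Francia style).

    Assumption: this is a simplification, not the exact Shapiro-Wilk a_i.
    """
    x = sorted(sample)  # ordered sample values x_(i)
    n = len(x)
    std_normal = NormalDist()
    # Blom's approximation to the expected normal order statistics
    m = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    scale = math.sqrt(sum(v * v for v in m))
    a = [v / scale for v in m]  # weights scaled so that sum(a_i^2) == 1
    mean = fmean(x)
    numerator = sum(ai * xi for ai, xi in zip(a, x)) ** 2
    denominator = sum((xi - mean) ** 2 for xi in x)
    return numerator / denominator


# Deterministic illustrative samples (hypothetical data)
n = 20
normal_like = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
skewed = [math.exp(z) for z in normal_like]  # log-normal shaped, right-skewed

w_normal = shapiro_francia_w(normal_like)
w_skewed = shapiro_francia_w(skewed)
print('W (normal-like) = %.3f' % w_normal)
print('W (skewed)      = %.3f' % w_skewed)
```

The normal-like sample should give a value very close to 1, while the skewed sample should give a clearly smaller one. For actual testing, a library implementation with the exact coefficients and p-values is the safer choice.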
71+
72+
#### Strengths of Shapiro-Wilk
73+
74+
- **High power with small samples**: The Shapiro-Wilk test is highly effective in detecting non-normality in small datasets, typically outperforming other tests when \( n \) is below 50.
75+
- **Sensitive to skewness and kurtosis**: It can detect deviations due to both the shape of the distribution and extreme values.
76+
77+
#### Limitations
78+
79+
- **Less useful for large samples**: When sample sizes grow large (e.g., \( n > 2000 \)), the test has enough power to flag trivial deviations as statistically significant, so a rejection says little about whether the deviation matters in practice.
80+
- **Slower computation**: The test involves more complex calculations, making it computationally heavier for larger datasets.
81+
82+
### 3. Anderson-Darling Test: Better for Large Samples
83+
84+
The **Anderson-Darling test** is another widely used normality test. It refines the Kolmogorov-Smirnov approach by measuring the distance between the empirical distribution function of the data and the cumulative distribution function of a normal distribution, with extra weight on the tails. Unlike the Shapiro-Wilk test, the Anderson-Darling test is often recommended for **larger sample sizes**.
85+
86+
#### How Does It Work?
87+
88+
The Anderson-Darling test compares the observed cumulative distribution function (CDF) of the data to the expected CDF of the normal distribution. The test statistic \( A^2 \) is calculated based on the differences between these functions, giving more weight to the tails of the distribution:
89+
90+
- **Test statistic**: The Anderson-Darling statistic is computed as:
91+
92+
$$
A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \left[ (2i-1) \left( \ln F(x_{(i)}) + \ln(1 - F(x_{(n+1-i)})) \right) \right]
$$
95+
96+
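where \( F \) is the cumulative distribution function of the normal distribution being tested against (in practice, with mean and standard deviation estimated from the sample), \( x_{(i)} \) are the ordered sample values, and \( n \) is the sample size.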
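The formula above can be sketched directly with the Python standard library. This is an illustration under stated assumptions rather than a reference implementation: the normal distribution is fitted with the sample mean and the sample standard deviation, no small-sample correction is applied, and the helper name `anderson_darling_a2` and both samples are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev


def anderson_darling_a2(sample):
    """Compute A^2 against a normal distribution fitted to the sample."""
    x = sorted(sample)  # ordered sample values x_(i)
    n = len(x)
    fitted = NormalDist(fmean(x), stdev(x))  # assumption: parameters estimated
    cdf = [fitted.cdf(v) for v in x]
    total = 0.0
    for i in range(1, n + 1):
        # x_(i) is cdf[i - 1]; x_(n+1-i) is cdf[n - i] with 0-based indexing
        # Note: math.log fails if a CDF value is exactly 0 or 1 (extreme outliers)
        total += (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
    return -n - total / n


# Deterministic illustrative samples (hypothetical data)
n = 20
normal_like = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
skewed = [math.exp(z) for z in normal_like]  # right-skewed with a heavy upper tail

a2_normal = anderson_darling_a2(normal_like)
a2_skewed = anderson_darling_a2(skewed)
print('A^2 (normal-like) = %.3f' % a2_normal)
print('A^2 (skewed)      = %.3f' % a2_skewed)
```

The logarithms grow quickly as \( F \) approaches 0 or 1, which is where the extra weight on the tails comes from: a single far-out observation in the skewed sample contributes heavily to \( A^2 \).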
97+
98+
#### Strengths of Anderson-Darling
99+
100+
- **More sensitive to tail behavior**: The Anderson-Darling test gives more weight to observations in the tails of the distribution, making it particularly useful for detecting deviations in the extremes.
101+
- **Suitable for larger samples**: It retains good power across sample sizes and is a particularly dependable choice for larger datasets (e.g., \( n > 50 \)).
102+
103+
#### Limitations
104+
105+
- **Less powerful for small samples**: The Anderson-Darling test may not detect non-normality as effectively as the Shapiro-Wilk test for small datasets.
106+
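- **Oversensitive in very large samples**: With enough data, it may flag deviations from normality that are statistically significant but practically negligible. Strictly speaking these are correct rejections rather than Type I errors, but they can still mislead if read as evidence of a meaningful problem.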
107+
108+
### 4. Choosing Between Shapiro-Wilk and Anderson-Darling
109+
110+
The choice between Shapiro-Wilk and Anderson-Darling tests depends primarily on the **sample size** and the **type of deviations** you expect from normality.
111+
112+
#### Small Samples (\( n < 50 \))
113+
114+
For small sample sizes, the Shapiro-Wilk test is generally preferred due to its higher power and reliability. It is more sensitive to deviations in both the center and tails of the distribution in smaller datasets.
115+
116+
- **Recommendation**: Use Shapiro-Wilk for \( n < 50 \).
117+
118+
#### Large Samples (\( n > 200 \))
119+
120+
As sample size increases, the Shapiro-Wilk test can become too sensitive, flagging minor deviations as statistically significant. The Anderson-Darling test, with its focus on tail behavior, often provides a more balanced view of normality for larger samples.
121+
122+
- **Recommendation**: Use Anderson-Darling for larger samples, especially if deviations in the tails are of particular interest.
123+
124+
#### Mid-range Samples (\( 50 \leq n \leq 200 \))
125+
126+
For datasets that fall in this mid-range, both tests can be useful, depending on the nature of the data. If your analysis is concerned with tail behavior or extreme values, the Anderson-Darling test may be more informative. However, the Shapiro-Wilk test remains a reliable choice if computational efficiency is not a concern.
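The sample-size guidance in this section can be written down as a small helper. This is only a sketch of the article's rule of thumb: the function name `suggest_normality_test` and its return labels are hypothetical, and the cut-offs of 50 and 200 are the ones used above, not universal constants.

```python
def suggest_normality_test(n):
    """Map a sample size to the test suggested by this article's rule of thumb."""
    if n < 50:
        return 'shapiro-wilk'      # small samples
    if n <= 200:
        return 'either'            # mid-range: depends on the deviations of interest
    return 'anderson-darling'      # large samples


for size in (30, 120, 1000):
    print(size, '->', suggest_normality_test(size))
```

In the mid-range, the label `'either'` is a prompt to think about whether tail behavior is the main concern, in which case Anderson-Darling may be the more informative of the two.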
127+
128+
### 5. Impact of Distribution Characteristics on Test Choice
129+
130+
Different distributions, especially those with heavy tails, skewness, or kurtosis, can influence the performance of normality tests. Both the Shapiro-Wilk and Anderson-Darling tests can detect non-normality, but their focus differs slightly.
131+
132+
- **Tail-heavy distributions**: The Anderson-Darling test is better suited for detecting deviations in the tails.
133+
- **Skewness and kurtosis**: The Shapiro-Wilk test is generally better at identifying issues related to skewness and kurtosis in smaller datasets.
134+
135+
### 6. Practical Considerations and Software Implementation
136+
137+
Both the Shapiro-Wilk and Anderson-Darling tests are widely implemented in statistical software such as R, Python (via SciPy), and SPSS. Here are examples of how to perform these tests in Python:
138+
139+
#### Shapiro-Wilk in Python
140+
141+
```python
from scipy.stats import shapiro

data = [4.5, 5.6, 7.8, 4.3, 6.1]
stat, p = shapiro(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
```
148+
149+
#### Anderson-Darling in Python
150+
151+
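```python
from scipy.stats import anderson

data = [4.5, 5.6, 7.8, 4.3, 6.1]
result = anderson(data, dist='norm')
print('Statistic: %.3f' % result.statistic)
# Normality is rejected at a given level when the statistic exceeds its critical value
for level, critical in zip(result.significance_level, result.critical_values):
    print('%.1f%%: critical value %.3f' % (level, critical))
```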
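As a possible way to put the two calls side by side, the sketch below runs both tests on the same sample and reports whether each rejects normality at the 5% level. It assumes a SciPy version in which the `anderson` result exposes `critical_values` and `significance_level` (in percent); the helper `check_normality` and the two deterministic samples are hypothetical.

```python
import math
from statistics import NormalDist

from scipy.stats import anderson, shapiro


def check_normality(sample, level_percent=5.0):
    """Report whether each test rejects normality at the given level."""
    _, p_value = shapiro(sample)
    ad = anderson(sample, dist='norm')
    # Look up the critical value tabulated for the requested level
    levels = list(ad.significance_level)
    critical = ad.critical_values[levels.index(level_percent)]
    return {
        'shapiro_rejects': bool(p_value < level_percent / 100.0),
        'anderson_rejects': bool(ad.statistic > critical),
    }


# Deterministic illustrative samples (hypothetical data)
n = 20
normal_like = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
skewed = [math.exp(z) for z in normal_like]

normal_result = check_normality(normal_like)
skewed_result = check_normality(skewed)
print('normal-like:', normal_result)
print('skewed:     ', skewed_result)
```

Only the levels that SciPy tabulates for the normal case can be looked up this way; any other value of `level_percent` raises a `ValueError` from `list.index`.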
158+
159+
### 7. Conclusion: Which Test Should You Use?
160+
161+
Ultimately, the decision between the Shapiro-Wilk and Anderson-Darling tests depends on your sample size and the nature of the deviations you want to detect. For small samples, the Shapiro-Wilk test is a powerful and reliable option, while the Anderson-Darling test offers a more flexible and tail-sensitive approach, particularly useful for larger datasets.
162+
163+
Both tests provide valuable insights into the distribution of your data, ensuring you can make informed decisions in parametric testing and beyond.

_posts/2022-07-26-geospatial_data_for_public_health_insights.md renamed to _posts/2022-07-26-geospatial_data_public_health_insights.md

Lines changed: 9 additions & 0 deletions
@@ -4,9 +4,17 @@ categories:
44
- Data Science
55
- Public Health
66
classes: wide
7+
date: '2022-07-26'
78
excerpt: Spatial epidemiology combines geospatial data with data science techniques
89
to track and analyze disease outbreaks, offering public health agencies critical
910
tools for intervention and planning.
11+
header:
12+
image: /assets/images/data_science_19.jpg
13+
og_image: /assets/images/data_science_19.jpg
14+
overlay_image: /assets/images/data_science_19.jpg
15+
show_overlay_excerpt: false
16+
teaser: /assets/images/data_science_19.jpg
17+
twitter_image: /assets/images/data_science_19.jpg
1018
keywords:
1119
- Spatial epidemiology
1220
- Geospatial data
@@ -18,6 +26,7 @@ seo_description: Explore how geospatial data is revolutionizing public health. L
1826
how spatial epidemiology and data science techniques track disease outbreaks and
1927
offer critical insights for health interventions.
2028
seo_title: 'Spatial Epidemiology: Leveraging Geospatial Data in Public Health'
29+
seo_type: article
2130
summary: This article explores the importance of geospatial data in spatial epidemiology,
2231
focusing on how it is used to track and analyze disease outbreaks. It delves into
2332
the integration of spatial data with data science methods and how these insights
