Commit ef5dd47

feat: new article

1 parent f0c7727 commit ef5dd47

2 files changed: +120 -1 lines changed
Lines changed: 1 addition & 1 deletion

@@ -144,4 +144,4 @@ Here’s why: **Permutation testing** naturally performs under the true null hyp

If you’ve been relying on the same diagrams and pseudorules for choosing statistical tests, it’s time to rethink your approach. These flowcharts may be a decent introduction, but they often ignore the complexities of real-world data. By focusing on meaningful interpretations, using robust methods like **Welch’s t-test**, and avoiding unnecessary data transformations, you can make better decisions and gain deeper insights from your data.

Remember, statistical tests are tools—not laws to be followed blindly. The real power lies in understanding what your data is telling you and choosing methods that respect its structure without distorting the interpretation.
Lines changed: 119 additions & 0 deletions

@@ -0,0 +1,119 @@

---
author_profile: false
categories:
- Data Science
- Statistics
classes: wide
date: '2020-01-14'
excerpt: Residual diagnostics often trigger debates, especially when tests like Shapiro-Wilk suggest non-normality. But should it be the final verdict on your model? Let's dive deeper into residual analysis, focusing on its impact in GLS, mixed models, and robust alternatives.
header:
  image: /assets/images/data_science_13.jpg
  og_image: /assets/images/data_science_13.jpg
  overlay_image: /assets/images/data_science_13.jpg
  show_overlay_excerpt: false
  teaser: /assets/images/data_science_13.jpg
  twitter_image: /assets/images/data_science_13.jpg
keywords:
- Residual Diagnostics
- Shapiro-Wilk Test
- Generalized Least Squares
- Mixed Models
- Statistical Modeling
seo_description: An in-depth exploration of the limitations of Shapiro-Wilk and the real issues to consider in residual diagnostics when fitting models. Focusing on Generalized Least Squares and robust alternatives, this article provides insight into the complexities of longitudinal data analysis.
seo_title: 'Residual Diagnostics: Beyond the Shapiro-Wilk Test in Model Fitting'
seo_type: article
summary: In this article, we examine why the Shapiro-Wilk test should not be the final say in assessing model fit, particularly in complex models like Generalized Least Squares for longitudinal data. Instead, we explore alternative diagnostics, the role of kurtosis and skewness, and the practical impact of non-normality on parameter estimates.
tags:
- Residual Analysis
- Longitudinal Data
- Generalized Least Squares
- Parametric Models
title: 'Don''t Get MAD About Shapiro-Wilk: Real Issues in Residual Diagnostics and Model Fitting'
---
When fitting models, especially in longitudinal studies, residual diagnostics often become a contentious part of the statistical review process. It's not uncommon for a reviewer to wave the **Shapiro-Wilk test** in your face, claiming that the residuals' departure from normality invalidates your entire parametric model. But is this rigid adherence to normality testing warranted?

Today, I'm going to walk you through a discussion I had with a statistical reviewer while analyzing data from a longitudinal study using a **Mixed-Model Repeated Measures** (MMRM) approach. We’ll examine why over-reliance on the **Shapiro-Wilk test** is misguided and how real-world data almost never meets theoretical assumptions perfectly. More importantly, I’ll explain why **other diagnostic tools** and practical considerations should play a bigger role in determining whether your model is valid.

## The Problem with Over-Reliance on the Shapiro-Wilk Test

First, let’s talk about **Shapiro-Wilk**. It’s a test of how well your residuals fit a normal distribution. When the p-value falls below a chosen threshold (usually 0.05), many take it as definitive evidence that the residuals are not normally distributed and, therefore, that the model assumptions are violated. But here's the catch: the test becomes overly sensitive when sample sizes are large.

For instance, with **N ~ 360 observations**, the Shapiro-Wilk test will pick up **even the smallest deviations** from normality. This means that, although your data might not be perfectly normal (and in practice, it never is), it may still be **close enough** that the deviation has no practical effect on the validity of your model. Let’s not forget that **statistical models** are tools for approximation—not exact replicas of reality.

In my experience, using the Shapiro-Wilk test as a **litmus test for model validity** is overly rigid and often misguided. When my reviewer argued that the p-value for the Shapiro-Wilk test was less than 0.001, they essentially viewed this as grounds to dismiss the entire parametric model. However, I knew that other aspects of residual diagnostics—**skewness**, **kurtosis**, and visual inspections such as **QQ plots**—were far more indicative of the model’s practical robustness.

### Sample Size Sensitivity

Shapiro-Wilk is notorious for being **oversensitive** with large datasets. The irony is that, as your sample grows, the test becomes ever more likely to reject normality over minuscule deviations from the theoretical distribution. So, if you’re analyzing hundreds of data points, should you really be worried about a p-value that slips just below 0.05? Most likely not.
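To see this sensitivity for yourself, here is a minimal, hypothetical simulation (not the study data): a few hundred symmetric, only mildly heavy-tailed values. With samples of this size, Shapiro-Wilk tends to push the p-value below 0.05 even though the skewness sits near zero and a histogram would look essentially normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated residual-like values: symmetric, but with slightly heavier
# tails than a normal distribution (Student's t with 6 degrees of freedom).
n = 360
residuals = rng.standard_t(df=6, size=n)

w_stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {w_stat:.4f}, p = {p_value:.4g}")

# Practical summaries of the same sample:
print(f"skewness              = {stats.skew(residuals):.3f}")
print(f"kurtosis (normal = 3) = {stats.kurtosis(residuals, fisher=False):.3f}")
```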
In my case, with **N = 360** residuals, the histogram of residuals overlapped almost perfectly with the normal curve. The **skewness** was practically zero, and while there was some **kurtosis** (~5.5 vs. the ideal of 3), it wasn’t extreme. A simple QQ plot showed only minor deviations in the tails, but the theoretical and empirical quantiles largely matched. Despite this, my reviewer was adamant that these results violated formal assumptions.

## Understanding Residual Diagnostics: More than Just Normality

The point I emphasized during this discussion was that **Shapiro-Wilk should not be the be-all and end-all** of model diagnostics. Residual analysis is about understanding the **behavior** of your data in relation to the assumptions of the model and ensuring that any deviations are not **practically significant**. Here are some of the diagnostic tools and metrics that can provide a clearer picture of what’s happening under the hood of your model:

### 1. **Skewness**: A Measure of Symmetry

One of the first checks I perform after running a model is to look at the **skewness** of the residuals. Skewness measures the asymmetry of their distribution. In an ideal world, residuals should have a skewness of zero, indicating a perfectly symmetrical distribution.

In the case of my longitudinal data, the skewness was around **0.05**, which is essentially **perfectly symmetrical** for practical purposes. A skewness value close to zero means there’s no need to worry about large asymmetries that could bias the results.

### 2. **Kurtosis**: Understanding Fat Tails

**Kurtosis** is another essential metric that often gets overlooked in favor of the Shapiro-Wilk test. Kurtosis describes the **heaviness of the tails** of the residuals' distribution. The normal distribution has a kurtosis of 3. If your residuals have a kurtosis higher than this, the tails are fatter than those of a normal distribution, potentially signaling **outliers** or **extreme values**.

In my case, the kurtosis was around **5.5**—above the ideal of 3, but nowhere near the level where it would be a red flag (usually a kurtosis of **10+**). The modest excess kurtosis here was not indicative of any serious issue.
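If you want to put numbers on these two checks, a small helper like the one below will do; the `residuals` argument is a stand-in for whatever your fitted model exposes (for example, `model_fit.resid` in statsmodels).

```python
import numpy as np
from scipy import stats

def residual_shape_summary(residuals: np.ndarray) -> dict:
    """Return the shape diagnostics discussed above.

    Skewness near 0 suggests symmetry; kurtosis near 3 suggests
    normal-like tails (values of roughly 10+ start to be worrying).
    """
    residuals = np.asarray(residuals, dtype=float)
    return {
        "n": residuals.size,
        "skewness": stats.skew(residuals),
        "kurtosis": stats.kurtosis(residuals, fisher=False),  # Pearson: normal = 3
        "excess_kurtosis": stats.kurtosis(residuals),          # Fisher: normal = 0
    }

# Example with simulated residuals; replace with your model's residuals.
rng = np.random.default_rng(0)
print(residual_shape_summary(rng.normal(size=360)))
```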
### 3. **QQ Plots**: Visualizing Deviations from Normality

**QQ plots** (quantile-quantile plots) are another indispensable tool for diagnosing residuals. They plot the **empirical quantiles** of the residuals against the **theoretical quantiles** of a normal distribution. If the points fall along a straight line, the residuals are consistent with a normal distribution.
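A minimal sketch of such a plot with statsmodels, again using simulated values in place of the real residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
residuals = rng.normal(size=360)  # substitute your model's residuals here

# line="s" draws a reference line fitted to the sample's own
# mean and standard deviation rather than a fixed 45-degree line.
fig = sm.qqplot(residuals, line="s")
plt.title("QQ plot of residuals")
plt.tight_layout()
plt.show()
```

With a real model you would pass the (standardized) residuals from the fit instead of the simulated array.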
In the conversation with my reviewer, the QQ plot showed minor deviations in the tails, but the **axis scaling** made the deviations look far more dramatic than they actually were. In fact, apart from a few outliers, the theoretical and empirical quantiles were almost identical.

This is where **practical significance** comes into play. Yes, there was a slight deviation from normality, but it was minor enough that it didn’t have a substantial impact on the **parameter estimates** of the model.

## Robustness Checks: Going Beyond Normality Assumptions

When fitting models—especially complex ones like **Mixed-Model Repeated Measures** (MMRM)—it’s often helpful to run **robustness checks** to see how much the residual distribution affects your final results. In my case, I re-fitted the model using a **robust mixed-effects model** with **Huberized errors** (a method that reduces the influence of outliers by down-weighting them). This robust model essentially smooths out the impact of deviations in the residuals.
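statsmodels does not ship a Huberized mixed-effects estimator, so as a rough, illustrative stand-in, here is the same down-weighting idea applied to a plain linear model via `RLM` with a Huber norm. The data frame and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical long-format data: one row per subject-visit.
rng = np.random.default_rng(2)
n_subjects, n_visits = 60, 6
data = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_visits),
    "time": np.tile(np.arange(n_visits), n_subjects),
})
data["y"] = 1.0 + 0.5 * data["time"] + rng.standard_t(df=6, size=len(data))

X = sm.add_constant(data[["time"]])

# Ordinary least squares fit ...
ols_fit = sm.OLS(data["y"], X).fit()

# ... versus a robust fit that down-weights large residuals (Huber's T norm).
rlm_fit = sm.RLM(data["y"], X, M=sm.robust.norms.HuberT()).fit()

print("OLS coefficients:   ", ols_fit.params.to_dict())
print("Robust coefficients:", rlm_fit.params.to_dict())
```

If the robust coefficients land close to the ordinary ones, that is exactly the kind of evidence discussed here: the tails are not driving the estimates.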
The result? The **parameter estimates** were nearly identical to those from the original parametric model, indicating that any deviation from normality had **little to no impact** on the overall conclusions of the model.

### Sensitivity Analysis: Non-Parametric Approaches

Another key part of the discussion involved conducting a **sensitivity analysis** using methods that do not lean on normality to validate the parametric model’s results. I ran a **paired permutation t-test** (a non-parametric approach) and fitted **Generalized Estimating Equations** (GEE), which make no assumption of normally distributed residuals. Once again, the estimates were consistent across the parametric and non-parametric analyses, confirming that the original parametric approach was robust.
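The article does not show the code for this, but a paired (sign-flip) permutation test is easy to sketch: under the null hypothesis of no within-pair effect, the sign of each subject's difference is exchangeable, so we compare the observed mean difference against its sign-flipped distribution. The data below are simulated stand-ins.

```python
import numpy as np

def paired_permutation_test(before: np.ndarray, after: np.ndarray,
                            n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided sign-flip permutation test for paired samples.

    Under H0 (no within-pair effect) each difference is equally likely to be
    positive or negative, so we flip signs at random and count how often the
    permuted mean difference is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    observed = abs(diffs.mean())

    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))

    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (np.sum(permuted >= observed) + 1) / (n_permutations + 1)

# Hypothetical paired measurements (e.g. baseline vs. follow-up):
rng = np.random.default_rng(3)
baseline = rng.normal(0.0, 1.0, size=180)
follow_up = baseline + rng.normal(0.3, 1.0, size=180)
print(f"permutation p-value: {paired_permutation_test(baseline, follow_up):.4f}")
```

A GEE counterpart could be fitted with statsmodels along the lines of `sm.GEE.from_formula("y ~ time", groups="subject", data=data, cov_struct=sm.cov_struct.Exchangeable())`, with the formula, grouping variable, and working correlation depending on the study design.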
The **Shapiro-Wilk p-value** did not alter the **practical conclusions** of the study. In fact, the model produced **accurate and reliable results**, despite minor deviations from normality.

## The Real Issue: Are the Estimates Reliable?

Here’s the heart of the matter: the **real issue** with residual diagnostics isn’t whether the p-value from Shapiro-Wilk is below 0.05 or if the QQ plot deviates slightly from a straight line. The real issue is whether these deviations have a **practical impact** on your parameter estimates and conclusions.

In many cases, small deviations from normality will have **no meaningful effect** on your estimates. However, relying too heavily on strict statistical rules without understanding the **underlying behavior** of your model can lead to **overcorrection** and the use of inappropriate methods.

### Random Slopes and Residual Diagnostics

Another important issue that came up in the discussion was the use of **random slopes** in mixed models. In longitudinal studies, it’s common to include **random intercepts** and **random slopes** to account for the variation across individual subjects over time. However, in this particular study, I had difficulty getting the model to converge when adding random slopes.

Rather than forcing a **random slopes model** and risking **model convergence issues**, I opted for a **random intercept model**. Even though my reviewer initially criticized this choice, I showed that the estimates were practically identical to those from the more complex model (when it did converge). This brings us back to the main point: **practical validity** trumps the pursuit of perfect assumptions.
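To make the random-intercept versus random-slope comparison concrete, here is a sketch with statsmodels' `MixedLM` on simulated long-format data; the column names and formula are hypothetical, not the study's actual model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format longitudinal data.
rng = np.random.default_rng(4)
n_subjects, n_visits = 60, 6
data = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_visits),
    "time": np.tile(np.arange(n_visits), n_subjects),
})
subject_effect = rng.normal(0.0, 1.0, size=n_subjects)
data["y"] = (2.0 + 0.4 * data["time"]
             + subject_effect[data["subject"]]
             + rng.normal(0.0, 1.0, size=len(data)))

# Random-intercept model: each subject gets its own baseline level.
ri = smf.mixedlm("y ~ time", data, groups=data["subject"]).fit()

# Random-intercept + random-slope model: subjects may also differ in trend.
# This richer model is the one that can fail to converge on real data.
rs = smf.mixedlm("y ~ time", data, groups=data["subject"],
                 re_formula="~time").fit()

print("Random intercept:        time =", ri.params["time"])
print("Random intercept + slope: time =", rs.params["time"])
```

If the fixed-effect estimate for `time` barely moves between the two fits, the simpler random-intercept model is telling essentially the same story, which was the argument made to the reviewer.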
## Why the Shapiro-Wilk Test Alone Is Not Enough

The takeaway is this: **Shapiro-Wilk** is just one of many tools in the diagnostic toolbox. It’s not sufficient to look at a p-value below 0.05 and conclude that the model is flawed. Real data rarely conforms to perfect normality, and in most cases, **slight deviations from normality are inconsequential**. What’s more important is to assess the overall **robustness** of the model through **multiple diagnostic methods**:

- **Skewness** and **kurtosis** provide more nuanced insights into the distribution of residuals.
- **QQ plots** visually depict the nature of any deviations from normality.
- **Robust models** (such as Huberized models or GEE) allow you to test whether any deviation has a substantial impact on your estimates.
- **Sensitivity analyses** using non-parametric methods can confirm the stability of your results.

### When Normality Really Matters

That said, there are cases where normality really does matter—especially in small-sample studies or when extreme outliers are present. In these cases, deviations from normality can bias the results and lead to **misleading conclusions**. But in studies with larger samples or only slight deviations from normality, the impact on estimates is often minimal.

## The Role of Practicality in Statistical Modeling

Statistical models are ultimately **practical tools**—they’re designed to help us **approximate reality** and make informed decisions. They’re not meant to perfectly fit every theoretical assumption. When working with real-world data, the key is to strike a balance between meeting model assumptions and producing valid, interpretable results.

**Don’t get MAD** (Mean Absolute Deviation, for the pun-inclined) about Shapiro-Wilk when it flags deviations from normality. Look at the **broader picture**: how do your residuals behave? Are there any **outliers** or **heavy tails** that could distort your results? Is your model robust to minor deviations from assumptions?

By understanding these nuances, you can make informed decisions that go beyond mechanistic rules and focus on what really matters: the **interpretation** and **practical significance** of your findings.
