Commit 1d63e02

Added plotting overview page
1 parent 152a473 commit 1d63e02

3 files changed: 230 additions & 0 deletions

resources.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ permalink: /resources/
 ## How-to Guides & Instructions
 
 1. [DevOps Guides Overview](https://gperdrizet.github.io/FSA_devops/devops_pages/overview.html)
+2. [Plotting Overview](https://gperdrizet.github.io/FSA_devops/resource_pages/plotting.html)
 2. [Statistics Overview](https://gperdrizet.github.io/FSA_devops/resource_pages/statistics.html)
 
 ---

site/resource_pages/plotting.md

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
---
layout: page
title: Plotting
permalink: /resource_pages/plotting.html
nav_exclude: true
---

# Plotting Overview

## Introduction

Data visualization is essential for exploring, understanding, and communicating insights from data. This guide covers common plot types, their purposes, and when to use them.

## Table of Contents
1. [Common Plot Types by Data Type and Purpose](#1-common-plot-types-by-data-type-and-purpose)
   - [Univariate Plots (Single Variable)](#11-univariate-plots-single-variable)
   - [Bivariate Plots (Two Variables)](#12-bivariate-plots-two-variables)
   - [Multivariate Plots (Three or More Variables)](#13-multivariate-plots-three-or-more-variables)
   - [Specialized Statistical Plots](#14-specialized-statistical-plots)
   - [Time Series Plots](#15-time-series-plots)
2. [Plot Selection Guide](#2-plot-selection-guide)
   - [By Analysis Goal](#21-by-analysis-goal)
   - [By Data Type Combination](#22-by-data-type-combination)
3. [Best Practices](#3-best-practices)
   - [General Guidelines](#31-general-guidelines)
   - [Common Mistakes to Avoid](#32-common-mistakes-to-avoid)
   - [Python Visualization Libraries](#33-python-visualization-libraries)
4. [Quick Reference Code Examples](#4-quick-reference-code-examples)
   - [Matplotlib Basics](#41-matplotlib-basics)
   - [Seaborn Examples](#42-seaborn-examples)
   - [Pandas Plotting](#43-pandas-plotting)
5. [Additional Resources](#5-additional-resources)

---

## 1. Common Plot Types by Data Type and Purpose

### 1.1. Univariate Plots (Single Variable)

| **Plot Type** | **Data Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** |
|---------------|---------------|-------------|--------------|------------------|---------------------------|
| **Histogram** | Continuous | Show distribution and frequency of values | Understanding data distribution, identifying skewness, detecting outliers | Bins group continuous data; bar heights show frequency | `plt.hist(data, bins=30)` or `sns.histplot(data)` |
| **Box Plot (Box-and-Whisker)** | Continuous | Display distribution summary (quartiles, median, outliers) | Comparing distributions, identifying outliers, seeing spread | Shows Q1, median, Q3, whiskers (extending up to 1.5×IQR beyond the quartiles), and outliers as points | `plt.boxplot(data)` or `sns.boxplot(data)` |
| **Violin Plot** | Continuous | Combination of box plot and density plot | Showing distribution shape and density, comparing groups | Wider sections indicate higher density; includes median marker | `sns.violinplot(x=data)` |
| **Density Plot (KDE)** | Continuous | Smooth estimate of probability density function | Visualizing distribution shape without binning | Smooth curve showing probability density | `sns.kdeplot(data)` or `data.plot(kind='kde')` |
| **Bar Chart** | Categorical | Compare frequencies or values across categories | Showing counts, comparing categories, discrete comparisons | Each bar represents a category; height shows value/count | `plt.bar(categories, values)` or `sns.barplot(x=categories, y=values)` |
| **Count Plot** | Categorical | Show frequency of categorical values | Counting occurrences in categorical data | Specialized bar chart for counts | `sns.countplot(x=category)` |
| **Pie Chart** | Categorical | Show proportions of a whole | Displaying percentage composition (use sparingly) | Circular chart divided into slices; each slice represents proportion | `plt.pie(values, labels=labels)` |
| **Strip Plot** | Continuous (grouped by category) | Show individual data points along one axis | Displaying all observations, small datasets | Points plotted along axis; shows each individual value | `sns.stripplot(x=category, y=values)` |
| **Swarm Plot** | Continuous (grouped by category) | Like strip plot but points don't overlap | Showing distribution of small-medium datasets | Non-overlapping points; good for seeing density | `sns.swarmplot(x=category, y=values)` |
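
As a minimal sketch of the histogram and box plot rows above, the following draws both for the same variable. The synthetic log-normal sample, the seed, and the output filename are assumptions made for illustration and are not part of the guide; the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed sample (assumption: stands in for any continuous variable)
rng = np.random.default_rng(seed=0)
data = rng.lognormal(mean=0.0, sigma=0.5, size=500)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: bar heights are the number of observations falling in each bin
counts, bin_edges, _ = ax_hist.hist(data, bins=30, edgecolor="white")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Histogram")

# Box plot: median, quartiles, whiskers, and outliers drawn as individual points
ax_box.boxplot(data)
ax_box.set_ylabel("Value")
ax_box.set_title("Box plot")

fig.tight_layout()
fig.savefig("univariate_example.png", dpi=150)  # hypothetical output filename
```

Placing the two views side by side shows the same skew twice: as a long right tail in the histogram and as outlier points above the upper whisker in the box plot.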

### 1.2. Bivariate Plots (Two Variables)

| **Plot Type** | **X-axis Type** | **Y-axis Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** |
|---------------|-----------------|-----------------|-------------|--------------|------------------|---------------------------|
| **Scatter Plot** | Continuous | Continuous | Show relationship between two continuous variables | Identifying correlations, patterns, clusters, outliers | Each point represents one observation | `plt.scatter(x, y)` or `sns.scatterplot(x=x, y=y)` |
| **Line Plot** | Continuous/Time | Continuous | Show trends over continuous variable (often time) | Time series data, showing trends and patterns | Points connected by lines | `plt.plot(x, y)` or `data.plot()` |
| **Bar Chart** | Categorical | Continuous | Compare continuous values across categories | Comparing means, totals, or other aggregates by category | Bars represent values for each category | `plt.bar(categories, values)` or `sns.barplot(x=categories, y=values)` |
| **Box Plot (Grouped)** | Categorical | Continuous | Compare distributions across categories | Comparing multiple groups, seeing differences in spread | Multiple box plots side by side | `sns.boxplot(x=category, y=values)` |
| **Violin Plot (Grouped)** | Categorical | Continuous | Compare distribution shapes across categories | Detailed distribution comparison across groups | Multiple violin plots side by side | `sns.violinplot(x=category, y=values)` |
| **Heatmap** | Categorical/Discrete | Categorical/Discrete | Show magnitude of values in two-dimensional space | Correlation matrices, confusion matrices, pivot tables | Color intensity represents value magnitude | `sns.heatmap(data, annot=True)` |
| **Hexbin Plot** | Continuous | Continuous | Show density of points in 2D space | Large datasets where scatter plots become cluttered | Hexagonal bins; color shows point density | `plt.hexbin(x, y, gridsize=30)` |
| **2D Density Plot** | Continuous | Continuous | Show probability density in 2D space | Understanding joint distributions | Contour lines or color gradients show density | `sns.kdeplot(x=x, y=y)` |
| **Joint Plot** | Continuous | Continuous | Combine scatter plot with marginal distributions | Comprehensive view of bivariate relationship | Central scatter with histograms/KDEs on margins | `sns.jointplot(x=x, y=y)` |
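
To illustrate the scatter and hexbin rows, here is a small sketch that plots the same two continuous variables both ways. The data are synthetic (an assumed linear relationship with Gaussian noise), and the sample size, seed, and filename are illustrative choices only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic correlated data (assumption: y depends linearly on x plus noise)
rng = np.random.default_rng(seed=1)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.5, size=n)

fig, (ax_scatter, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: one marker per observation; transparency softens overlap
ax_scatter.scatter(x, y, s=5, alpha=0.3)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

# Hexbin plot: color encodes how many points fall in each hexagonal bin
hb = ax_hex.hexbin(x, y, gridsize=30, cmap="viridis")
fig.colorbar(hb, ax=ax_hex, label="Points per bin")
ax_hex.set_xlabel("x")
ax_hex.set_ylabel("y")
ax_hex.set_title("Hexbin plot")

fig.tight_layout()
fig.savefig("bivariate_example.png", dpi=150)  # hypothetical output filename
```

With a few thousand points the scatter plot starts to saturate in the center, while the hexbin view keeps the density readable.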

### 1.3. Multivariate Plots (Three or More Variables)

| **Plot Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** |
|---------------|-------------|--------------|------------------|---------------------------|
| **Pair Plot (Scatter Matrix)** | Show all pairwise relationships in dataset | Exploratory data analysis, finding correlations | Grid of scatter plots; diagonal shows distributions | `sns.pairplot(dataframe)` |
| **3D Scatter Plot** | Show relationship between three continuous variables | Visualizing 3D relationships | Points plotted in 3D space | `ax = plt.figure().add_subplot(projection='3d')` then `ax.scatter(x, y, z)` |
| **Bubble Chart** | Show three continuous variables (x, y, size) | Adding third dimension to scatter plot | Like scatter plot but point size represents third variable | `plt.scatter(x, y, s=sizes)` |
| **Facet Grid (Small Multiples)** | Show subsets of data in separate subplots | Comparing patterns across categories | Multiple plots arranged in grid | `sns.FacetGrid(data, col='category').map(plt.scatter, 'x', 'y')` |
| **Parallel Coordinates** | Compare multiple variables across observations | Comparing multivariate profiles, clustering | Lines connect values across parallel axes | `pd.plotting.parallel_coordinates(df, 'class_column')` |
| **Correlation Heatmap** | Show correlation between all variable pairs | Identifying multicollinearity, feature selection | Color-coded correlation matrix | `sns.heatmap(df.corr(), annot=True)` |
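
The bubble chart and correlation heatmap rows can be sketched together on one small DataFrame. The column names (`height_cm`, `weight_kg`, `age_years`), the synthetic values, and the filename are assumptions made for this example; the sketch also assumes pandas and seaborn are installed alongside Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with hypothetical column names (assumption for illustration)
rng = np.random.default_rng(seed=2)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + rng.normal(70, 8, n)
age = rng.uniform(18, 80, n)
df = pd.DataFrame({"height_cm": height, "weight_kg": weight, "age_years": age})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()

fig, (ax_bubble, ax_heat) = plt.subplots(1, 2, figsize=(11, 4))

# Bubble chart: x and y positions plus marker area for the third variable
ax_bubble.scatter(df["height_cm"], df["weight_kg"], s=df["age_years"], alpha=0.5)
ax_bubble.set_xlabel("Height (cm)")
ax_bubble.set_ylabel("Weight (kg)")
ax_bubble.set_title("Bubble chart (marker area = age)")

# Correlation heatmap: diverging colormap centered on zero
sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdBu_r", vmin=-1, vmax=1, ax=ax_heat)
ax_heat.set_title("Correlation heatmap")

fig.tight_layout()
fig.savefig("multivariate_example.png", dpi=150)  # hypothetical output filename
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale symmetric, so a neutral color always means a correlation near zero.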

### 1.4. Specialized Statistical Plots

| **Plot Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** |
|---------------|-------------|--------------|------------------|---------------------------|
| **Q-Q Plot (Quantile-Quantile)** | Test if data follows theoretical distribution | Checking normality assumption | Points should follow diagonal line if normally distributed | `stats.probplot(data, dist="norm", plot=plt)` |
| **Residual Plot** | Diagnose regression model fit | Checking regression assumptions | Plot residuals vs fitted values; should show random pattern | `sns.residplot(x=x, y=y)` (fits the regression itself) or `plt.scatter(fitted, residuals)` |
| **ROC Curve** | Evaluate binary classifier performance | Comparing classification models | Plots True Positive Rate vs False Positive Rate | `from sklearn.metrics import roc_curve` then `plt.plot(fpr, tpr)` |
| **Confusion Matrix** | Show classification results | Evaluating classifier accuracy by class | Matrix showing predicted vs actual classes | `sns.heatmap(confusion_matrix, annot=True, fmt='d')` |
| **Error Bars** | Show uncertainty or variability | Displaying confidence intervals, standard errors | Bars extend from points to show range | `plt.errorbar(x, y, yerr=errors)` |
| **Regression Plot** | Show linear relationship and confidence interval | Visualizing regression fit | Scatter plot with fitted line and confidence band | `sns.regplot(x=x, y=y)` |
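
The residual plot and Q-Q plot rows are often used together to check a linear fit. The sketch below assumes SciPy is available, uses a straight-line fit from `np.polyfit` as a stand-in for whatever regression model is being diagnosed, and runs on synthetic data; none of these choices come from the guide itself.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic linear data with Gaussian noise (assumption for illustration)
rng = np.random.default_rng(seed=3)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Ordinary least-squares line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

fig, (ax_resid, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: residuals vs fitted values should scatter randomly around zero
ax_resid.scatter(fitted, residuals, s=15)
ax_resid.axhline(0, color="black", linewidth=1)
ax_resid.set_xlabel("Fitted value")
ax_resid.set_ylabel("Residual")
ax_resid.set_title("Residuals vs fitted")

# Q-Q plot: ordered residuals against theoretical normal quantiles
(osm, osr), (qq_slope, qq_intercept, r) = stats.probplot(residuals, dist="norm", plot=ax_qq)
ax_qq.set_title("Normal Q-Q plot of residuals")

fig.tight_layout()
fig.savefig("diagnostics_example.png", dpi=150)  # hypothetical output filename
```

A funnel shape in the left panel or a curved pattern in the right panel would suggest that the constant-variance or normality assumption does not hold.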

### 1.5. Time Series Plots

| **Plot Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** |
|---------------|-------------|--------------|------------------|---------------------------|
| **Line Plot** | Show values changing over time | General time series visualization | X-axis is time; y-axis is value | `plt.plot(dates, values)` or `data.plot()` |
| **Area Plot** | Show cumulative totals over time | Stacked time series, showing composition | Filled area under line(s) | `data.plot.area()` or `plt.fill_between(x, y)` |
| **Stacked Area Plot** | Show multiple time series composition | Visualizing parts of a whole over time | Multiple series stacked on top of each other | `data.plot.area(stacked=True)` |
| **Seasonal Plot** | Show patterns that repeat over time | Identifying seasonal patterns | Multiple lines for each season/cycle | Manually create with groupby and plot |
| **Autocorrelation Plot** | Show correlation of series with lagged versions | Detecting seasonality, patterns | Correlation at different lag values | `pd.plotting.autocorrelation_plot(data)` |
| **Lag Plot** | Check for randomness in time series | Identifying patterns, testing randomness | Current value vs lagged value | `pd.plotting.lag_plot(data)` |
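
A short sketch of the line plot and autocorrelation plot rows, using pandas plotting on a synthetic daily series. The weekly cycle, trend, noise level, dates, and filename are all assumptions chosen so that the seasonality is visible; they do not come from the guide.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series: slow trend + weekly cycle + noise (assumption for illustration)
rng = np.random.default_rng(seed=4)
dates = pd.date_range("2024-01-01", periods=365, freq="D")
t = np.arange(365)
values = 10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=365)
series = pd.Series(values, index=dates)

# A 7-day rolling mean averages out the weekly cycle and exposes the trend
rolling = series.rolling(window=7).mean()

fig, (ax_line, ax_acf) = plt.subplots(2, 1, figsize=(10, 6))

# Line plot: time on the x-axis, value on the y-axis
series.plot(ax=ax_line, alpha=0.5, label="Daily value")
rolling.plot(ax=ax_line, label="7-day rolling mean")
ax_line.set_ylabel("Value")
ax_line.legend()

# Autocorrelation plot: peaks at multiples of 7 days reveal the weekly pattern
pd.plotting.autocorrelation_plot(series, ax=ax_acf)

fig.tight_layout()
fig.savefig("time_series_example.png", dpi=150)  # hypothetical output filename
```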

---

## 2. Plot Selection Guide

### 2.1. By Analysis Goal

| **Goal** | **Recommended Plot Types** |
|----------|---------------------------|
| **Understand distribution of single variable** | Histogram, Box plot, Violin plot, Density plot |
| **Compare groups** | Box plot, Violin plot, Bar chart, Strip plot |
| **Find relationships between variables** | Scatter plot, Line plot, Regression plot, Heatmap |
| **Show composition** | Pie chart, Stacked bar chart, Area plot |
| **Analyze time series** | Line plot, Area plot, Seasonal plot |
| **Detect outliers** | Box plot, Scatter plot, Strip plot |
| **Explore multivariate data** | Pair plot, Correlation heatmap, Facet grid |
| **Check statistical assumptions** | Q-Q plot, Residual plot, Histogram |
| **Show uncertainty** | Error bars, Confidence bands, Violin plots |

### 2.2. By Data Type Combination

| **X Variable** | **Y Variable** | **Z Variable (optional)** | **Recommended Plots** |
|----------------|----------------|---------------------------|----------------------|
| Continuous | None | None | Histogram, Box plot, Violin plot, Density plot |
| Categorical | None | None | Bar chart, Pie chart, Count plot |
| Continuous | Continuous | None | Scatter plot, Line plot, Hexbin, 2D density |
| Categorical | Continuous | None | Box plot, Violin plot, Bar chart, Strip plot |
| Categorical | Categorical | None | Heatmap, Stacked bar chart, Grouped bar chart |
| Continuous | Continuous | Continuous | 3D scatter, Bubble chart, Contour plot |
| Continuous | Continuous | Categorical | Scatter with hue, Facet grid |
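
The table above can also be read as a lookup from data types to candidate plots. The helper below is a hypothetical convenience, not part of any plotting library: the names `PLOTS_BY_DATA_TYPE` and `recommend_plots` are invented for this sketch, and the entries are transcribed directly from the table.

```python
from typing import Dict, List, Optional, Tuple

# Keys are (x, y, z) data types; values are the recommendations from the table above.
# Both names below are hypothetical and exist only for this illustration.
PLOTS_BY_DATA_TYPE: Dict[Tuple[str, Optional[str], Optional[str]], List[str]] = {
    ("continuous", None, None): ["Histogram", "Box plot", "Violin plot", "Density plot"],
    ("categorical", None, None): ["Bar chart", "Pie chart", "Count plot"],
    ("continuous", "continuous", None): ["Scatter plot", "Line plot", "Hexbin", "2D density"],
    ("categorical", "continuous", None): ["Box plot", "Violin plot", "Bar chart", "Strip plot"],
    ("categorical", "categorical", None): ["Heatmap", "Stacked bar chart", "Grouped bar chart"],
    ("continuous", "continuous", "continuous"): ["3D scatter", "Bubble chart", "Contour plot"],
    ("continuous", "continuous", "categorical"): ["Scatter with hue", "Facet grid"],
}


def recommend_plots(x: str, y: Optional[str] = None, z: Optional[str] = None) -> List[str]:
    """Return the recommended plot types for a combination of variable types."""
    key = (x.lower(), y.lower() if y else None, z.lower() if z else None)
    try:
        return PLOTS_BY_DATA_TYPE[key]
    except KeyError:
        raise ValueError(f"No recommendation for data types {key}") from None


# Example: a categorical grouping variable against a continuous measurement
print(recommend_plots("Categorical", "Continuous"))
```

Combinations that the table does not list raise a `ValueError` rather than returning a guess.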

---

## 3. Best Practices

### 3.1. General Guidelines

1. **Choose the right plot for your data type and message**
   - Match plot type to data structure (categorical vs continuous)
   - Consider what you want to communicate

2. **Keep it simple**
   - Avoid unnecessary 3D effects
   - Remove chart junk (excessive gridlines, decorations)
   - Use appropriate aspect ratios

3. **Use color effectively**
   - Use color to encode information, not just for decoration
   - Ensure accessibility (colorblind-friendly palettes)
   - Maintain consistency across related plots

4. **Label clearly**
   - Always include axis labels with units
   - Add informative titles
   - Include legends when needed
   - Annotate important points

5. **Consider your audience**
   - Technical vs general audience
   - Level of detail appropriate for context
   - Medium of presentation (paper, screen, presentation)
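
The labeling and color guidelines above can be sketched in a few lines of Matplotlib. The sensor names, units, and signals are invented for this example, and `tableau-colorblind10` is one of Matplotlib's built-in style sheets, used here as one possible colorblind-friendly choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Built-in Matplotlib style sheet with a colorblind-friendly color cycle
plt.style.use("tableau-colorblind10")

# Hypothetical signals (assumption for illustration)
time_s = np.linspace(0, 10, 200)
signal_a = np.sin(time_s)
signal_b = 0.5 * np.cos(time_s)

fig, ax = plt.subplots(figsize=(7, 4))

# Distinct line styles back up the color encoding
ax.plot(time_s, signal_a, label="Sensor A")
ax.plot(time_s, signal_b, linestyle="--", label="Sensor B")

# Axis labels with units, an informative title, and a legend
ax.set_xlabel("Time (s)")
ax.set_ylabel("Voltage (V)")
ax.set_title("Sensor output over time")
ax.set_ylim(-1.2, 1.5)
ax.legend(loc="lower left")

# Annotate one important point: the largest sampled value of Sensor A
peak_idx = int(np.argmax(signal_a))
ax.annotate(
    "Peak of A",
    xy=(time_s[peak_idx], signal_a[peak_idx]),
    xytext=(time_s[peak_idx] + 0.8, 1.3),
    arrowprops=dict(arrowstyle="->"),
)

fig.tight_layout()
fig.savefig("labeled_example.png", dpi=150)  # hypothetical output filename
```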

### 3.2. Common Mistakes to Avoid

| **Mistake** | **Problem** | **Solution** |
|-------------|-------------|--------------|
| **Starting y-axis at non-zero** | Exaggerates differences | Start at zero for bar charts; flexible for line plots |
| **Too many categories** | Cluttered, hard to read | Limit to 7-10 categories; consider grouping or filtering |
| **3D when 2D suffices** | Distorts perception, hard to read | Use 2D alternatives with color or size |
| **Pie charts with many slices** | Hard to compare angles | Use bar chart instead |
| **Dual y-axes** | Can be misleading | Use separate plots or normalize scales |
| **Missing error bars** | Unclear uncertainty | Add error bars or confidence intervals |
| **Poor color choices** | Not colorblind-safe, poor contrast | Use established palettes (ColorBrewer, Seaborn) |
| **Overplotting** | Too many overlapping points | Use transparency, jitter, hexbin, or sample data |
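
Two of the fixes in the table, a zero baseline for bar charts and explicit error bars, can be combined in one short sketch. The three groups, their synthetic samples, and the choice of the standard error of the mean as the error measure are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic measurements for three hypothetical groups
rng = np.random.default_rng(seed=5)
groups = ["A", "B", "C"]
samples = [rng.normal(loc=m, scale=2.0, size=30) for m in (50, 52, 55)]

# Group means and standard errors of the mean (sample std / sqrt(n))
means = np.array([s.mean() for s in samples])
sems = np.array([s.std(ddof=1) / np.sqrt(s.size) for s in samples])

fig, ax = plt.subplots(figsize=(5, 4))

# Bars with error bars so the uncertainty in each mean is visible
ax.bar(groups, means, yerr=sems, capsize=4)

# Zero baseline: bar length then stays proportional to the value it encodes
ax.set_ylim(bottom=0)
ax.set_xlabel("Group")
ax.set_ylabel("Mean measurement (arbitrary units)")
ax.set_title("Group means with standard error")

fig.tight_layout()
fig.savefig("bar_error_example.png", dpi=150)  # hypothetical output filename
```

With the baseline at zero, the roughly 10% difference between groups A and C looks like a 10% difference instead of being visually inflated by a truncated axis.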

### 3.3. Python Visualization Libraries

| **Library** | **Strengths** | **Best For** | **Import Statement** |
|-------------|---------------|--------------|---------------------|
| **Matplotlib** | Highly customizable, fine control | Publication-quality plots, custom visualizations | `import matplotlib.pyplot as plt` |
| **Seaborn** | Beautiful defaults, statistical plots | Exploratory data analysis, statistical visualization | `import seaborn as sns` |
| **Pandas** | Integrated with DataFrames | Quick exploration, simple plots | Built-in: `df.plot()` |
| **Plotly** | Interactive plots, 3D support | Dashboards, web applications, interactive exploration | `import plotly.express as px` |
| **Bokeh** | Interactive web-ready plots | Interactive dashboards, streaming data | `from bokeh.plotting import figure` |

---

## 5. Additional Resources

### Recommended Reading
- <a href="https://www.edwardtufte.com/tufte/books_vdqi" target="_blank">**"The Visual Display of Quantitative Information"**</a> by Edward Tufte
- <a href="https://clauswilke.com/dataviz/" target="_blank">**"Fundamentals of Data Visualization"**</a> by Claus O. Wilke
- <a href="https://www.storytellingwithdata.com/" target="_blank">**"Storytelling with Data"**</a> by Cole Nussbaumer Knaflic

### Online Tools
- <a href="https://colorbrewer2.org/" target="_blank">**ColorBrewer**</a>: Color schemes for maps and data visualization
- <a href="https://seaborn.pydata.org/examples/index.html" target="_blank">**Seaborn Gallery**</a>: Examples of statistical plots
- <a href="https://matplotlib.org/stable/gallery/index.html" target="_blank">**Matplotlib Gallery**</a>: Comprehensive plot examples
- <a href="https://www.python-graph-gallery.com/" target="_blank">**Python Graph Gallery**</a>: Collection of plot types with code

### Color Palettes
```python
import seaborn as sns

# Colorblind-friendly palettes (lists of RGB tuples for categorical data)
sns.color_palette("colorblind")
sns.color_palette("Set2")

# Diverging (for correlation matrices); as_cmap=True returns a continuous colormap
sns.color_palette("RdBu_r", as_cmap=True)
sns.color_palette("coolwarm", as_cmap=True)

# Sequential (for heatmaps)
sns.color_palette("YlOrRd", as_cmap=True)
sns.color_palette("viridis", as_cmap=True)
```

site/resource_pages/statistics.md

Lines changed: 18 additions & 0 deletions
@@ -7,6 +7,24 @@ nav_exclude: true
 
 # Statistics Overview
 
+Statistics is the foundation of data science, providing the mathematical framework for analyzing data, quantifying uncertainty, and making informed decisions. This guide covers essential descriptive statistics, probability distributions, and hypothesis testing methods used in data analysis.
+
+## Table of Contents
+1. [Common Descriptive Statistics](#1-common-descriptive-statistics)
+   - [Measures of Central Tendency, Spread, and Shape](#11-measures-of-central-tendency-spread-and-shape)
+   - [Interpretation Guidelines](#12-interpretation-guidelines)
+   - [Choosing the Right Statistic](#13-choosing-the-right-statistic)
+   - [Important Notes](#14-important-notes)
+2. [Common Probability Distributions](#2-common-probability-distributions)
+   - [Discrete and Continuous Probability Distributions](#21-discrete-and-continuous-probability-distributions)
+   - [Key Properties by Distribution](#13-key-properties-by-distribution)
+3. [Statistical Test Selection Guide](#3-statistical-test-selection-guide)
+   - [Hypothesis Testing by Scenario](#31-hypothesis-testing-by-scenario)
+   - [Test Assumptions Checklist](#32-test-assumptions-checklist)
+   - [Multiple Testing Correction](#33-multiple-testing-correction)
+
+---
+
 ## 1. Common Descriptive Statistics
 
 Descriptive statistics summarize and describe the main features of a dataset. They are divided into three main categories based on what aspect of the data they measure.
