| 1 | +--- |
| 2 | +layout: page |
| 3 | +title: Statistics |
| 4 | +permalink: /resource_pages/statistics.html |
| 5 | +nav_exclude: true |
| 6 | +--- |
| 7 | + |
| 8 | +# Plotting Overview |
| 9 | + |
| 10 | +## Introduction |
| 11 | + |
| 12 | +Data visualization is essential for exploring, understanding, and communicating insights from data. This guide covers common plot types, their purposes, and when to use them. |
| 13 | + |
| 14 | +## Table of Contents |
| 15 | +1. [Common Plot Types by Data Type and Purpose](#1-common-plot-types-by-data-type-and-purpose) |
| 16 | + - [Univariate Plots (Single Variable)](#11-univariate-plots-single-variable) |
| 17 | + - [Bivariate Plots (Two Variables)](#12-bivariate-plots-two-variables) |
| 18 | + - [Multivariate Plots (Three or More Variables)](#13-multivariate-plots-three-or-more-variables) |
| 19 | + - [Specialized Statistical Plots](#14-specialized-statistical-plots) |
| 20 | + - [Time Series Plots](#15-time-series-plots) |
| 21 | +2. [Plot Selection Guide](#2-plot-selection-guide) |
| 22 | + - [By Analysis Goal](#21-by-analysis-goal) |
| 23 | + - [By Data Type Combination](#22-by-data-type-combination) |
| 24 | +3. [Best Practices](#3-best-practices) |
| 25 | + - [General Guidelines](#31-general-guidelines) |
| 26 | + - [Common Mistakes to Avoid](#32-common-mistakes-to-avoid) |
| 27 | + - [Python Visualization Libraries](#33-python-visualization-libraries) |
| 28 | +4. [Quick Reference Code Examples](#4-quick-reference-code-examples) |
| 29 | + - [Matplotlib Basics](#41-matplotlib-basics) |
| 30 | + - [Seaborn Examples](#42-seaborn-examples) |
| 31 | + - [Pandas Plotting](#43-pandas-plotting) |
| 32 | +5. [Additional Resources](#5-additional-resources) |
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +## 1. Common Plot Types by Data Type and Purpose |
| 37 | + |
| 38 | +### 1.1. Univariate Plots (Single Variable) |
| 39 | + |
| 40 | +| **Plot Type** | **Data Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** | |
| 41 | +|---------------|---------------|-------------|--------------|------------------|---------------------------| |
| 42 | +| **Histogram** | Continuous | Show distribution and frequency of values | Understanding data distribution, identifying skewness, detecting outliers | Bins group continuous data; bar heights show frequency | `plt.hist(data, bins=30)` or `sns.histplot(data)` | |
| 43 | +| **Box Plot (Box-and-Whisker)** | Continuous | Display distribution summary (quartiles, median, outliers) | Comparing distributions, identifying outliers, seeing spread | Shows Q1, median, and Q3; whiskers extend to the most extreme points within 1.5×IQR of the box; points beyond the whiskers are drawn as outliers | `plt.boxplot(data)` or `sns.boxplot(data)` | |
| 44 | +| **Violin Plot** | Continuous | Combination of box plot and density plot | Showing distribution shape and density, comparing groups | Wider sections indicate higher density; includes median marker | `sns.violinplot(x=data)` | |
| 45 | +| **Density Plot (KDE)** | Continuous | Smooth estimate of probability density function | Visualizing distribution shape without binning | Smooth curve showing probability density | `sns.kdeplot(data)` or `data.plot(kind='kde')` | |
| 46 | +| **Bar Chart** | Categorical | Compare frequencies or values across categories | Showing counts, comparing categories, discrete comparisons | Each bar represents a category; height shows value/count | `plt.bar(categories, values)` or `sns.barplot(x=categories, y=values)` | |
| 47 | +| **Count Plot** | Categorical | Show frequency of categorical values | Counting occurrences in categorical data | Specialized bar chart for counts | `sns.countplot(x=category)` | |
| 48 | +| **Pie Chart** | Categorical | Show proportions of a whole | Displaying percentage composition (use sparingly) | Circular chart divided into slices; each slice represents proportion | `plt.pie(values, labels=labels)` | |
| 49 | +| **Strip Plot** | Continuous (grouped by category) | Show individual data points along one axis | Displaying all observations, small datasets | Points plotted along axis; shows each individual value | `sns.stripplot(x=category, y=values)` | |
| 50 | +| **Swarm Plot** | Continuous (grouped by category) | Like strip plot but points don't overlap | Showing distribution of small-medium datasets | Non-overlapping points; good for seeing density | `sns.swarmplot(x=category, y=values)` | |
| 51 | + |
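As a rough illustration of the first two rows of the table above, the sketch below draws a histogram and a box plot of the same sample side by side. The data are synthetic (a right-skewed log-normal sample generated with NumPy), and the variable names are chosen only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed sample (illustrative data, not from a real dataset)
rng = np.random.default_rng(seed=0)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=500)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: bar heights are the number of values falling in each bin
counts, bin_edges, _ = ax_hist.hist(sample, bins=30, edgecolor="white")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Histogram")

# Box plot: the box spans Q1 to Q3 with a line at the median
ax_box.boxplot(sample)
ax_box.set_ylabel("Value")
ax_box.set_title("Box plot")

fig.tight_layout()
```

The long right tail of the histogram should show up in the box plot as points above the upper whisker, which is one reason the two views are often read together.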
| 52 | +### 1.2. Bivariate Plots (Two Variables) |
| 53 | + |
| 54 | +| **Plot Type** | **X-axis Type** | **Y-axis Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** | |
| 55 | +|---------------|-----------------|-----------------|-------------|--------------|------------------|---------------------------| |
| 56 | +| **Scatter Plot** | Continuous | Continuous | Show relationship between two continuous variables | Identifying correlations, patterns, clusters, outliers | Each point represents one observation | `plt.scatter(x, y)` or `sns.scatterplot(x=x, y=y)` | |
| 57 | +| **Line Plot** | Continuous/Time | Continuous | Show trends over continuous variable (often time) | Time series data, showing trends and patterns | Points connected by lines | `plt.plot(x, y)` or `data.plot()` | |
| 58 | +| **Bar Chart** | Categorical | Continuous | Compare continuous values across categories | Comparing means, totals, or other aggregates by category | Bars represent values for each category | `plt.bar(categories, values)` or `sns.barplot(x=categories, y=values)` | |
| 59 | +| **Box Plot (Grouped)** | Categorical | Continuous | Compare distributions across categories | Comparing multiple groups, seeing differences in spread | Multiple box plots side by side | `sns.boxplot(x=category, y=values)` | |
| 60 | +| **Violin Plot (Grouped)** | Categorical | Continuous | Compare distribution shapes across categories | Detailed distribution comparison across groups | Multiple violin plots side by side | `sns.violinplot(x=category, y=values)` | |
| 61 | +| **Heatmap** | Categorical/Discrete | Categorical/Discrete | Show magnitude of values in two-dimensional space | Correlation matrices, confusion matrices, pivot tables | Color intensity represents value magnitude | `sns.heatmap(data, annot=True)` | |
| 62 | +| **Hexbin Plot** | Continuous | Continuous | Show density of points in 2D space | Large datasets where scatter plots become cluttered | Hexagonal bins; color shows point density | `plt.hexbin(x, y, gridsize=30)` | |
| 63 | +| **2D Density Plot** | Continuous | Continuous | Show probability density in 2D space | Understanding joint distributions | Contour lines or color gradients show density | `sns.kdeplot(x=x, y=y)` | |
| 64 | +| **Joint Plot** | Continuous | Continuous | Combine scatter plot with marginal distributions | Comprehensive view of bivariate relationship | Central scatter with histograms/KDEs on margins | `sns.jointplot(x=x, y=y)` | |
| 65 | + |
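The scatter and hexbin rows above can be compared directly. The following is a minimal sketch, assuming a synthetic dataset of 20,000 correlated points, that shows how a hexbin plot keeps dense regions readable where a plain scatter plot saturates.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic correlated data (illustrative only)
rng = np.random.default_rng(seed=0)
n_points = 20_000
x = rng.normal(size=n_points)
y = 0.6 * x + rng.normal(scale=0.8, size=n_points)

fig, (ax_scatter, ax_hex) = plt.subplots(
    1, 2, figsize=(10, 4), sharex=True, sharey=True
)

# Plain scatter: with this many points the centre becomes a solid blob
ax_scatter.scatter(x, y, s=4)
ax_scatter.set_title("Scatter plot")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

# Hexbin: colour encodes how many points fall in each hexagonal bin
hb = ax_hex.hexbin(x, y, gridsize=30, cmap="viridis")
ax_hex.set_title("Hexbin plot")
ax_hex.set_xlabel("x")
fig.colorbar(hb, ax=ax_hex, label="Points per bin")

bin_counts = hb.get_array()  # one count per hexagon
```

A larger `gridsize` gives finer bins at the cost of noisier counts, so the value 30 here is a starting point rather than a recommendation.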
| 66 | +### 1.3. Multivariate Plots (Three or More Variables) |
| 67 | + |
| 68 | +| **Plot Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** | |
| 69 | +|---------------|-------------|--------------|------------------|---------------------------| |
| 70 | +| **Pair Plot (Scatter Matrix)** | Show all pairwise relationships in dataset | Exploratory data analysis, finding correlations | Grid of scatter plots; diagonal shows distributions | `sns.pairplot(dataframe)` | |
| 71 | +| **3D Scatter Plot** | Show relationship between three continuous variables | Visualizing 3D relationships | Points plotted in 3D space | `ax = plt.figure().add_subplot(projection='3d')` then `ax.scatter(x, y, z)` | |
| 72 | +| **Bubble Chart** | Show three continuous variables (x, y, size) | Adding third dimension to scatter plot | Like scatter plot but point size represents third variable | `plt.scatter(x, y, s=sizes)` | |
| 73 | +| **Facet Grid (Small Multiples)** | Show subsets of data in separate subplots | Comparing patterns across categories | Multiple plots arranged in grid | `sns.FacetGrid(data, col='category').map(plt.scatter, 'x', 'y')` | |
| 74 | +| **Parallel Coordinates** | Compare multiple variables across observations | Comparing multivariate profiles, clustering | Lines connect values across parallel axes | `pd.plotting.parallel_coordinates(df, 'class_column')` | |
| 75 | +| **Correlation Heatmap** | Show correlation between all variable pairs | Identifying multicollinearity, feature selection | Color-coded correlation matrix | `sns.heatmap(df.corr(), annot=True)` | |
| 76 | + |
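To make the correlation heatmap row concrete, here is a small sketch that builds the correlation matrix with pandas and draws it with Matplotlib only. The column names and the data are invented for the example; `sns.heatmap(df.corr(), annot=True)` from the table produces a similar figure in one call if Seaborn is available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data with hypothetical column names (illustrative only)
rng = np.random.default_rng(seed=1)
n = 200
height = rng.normal(170, 10, n)
df = pd.DataFrame({
    "height_cm": height,
    "weight_kg": 0.9 * height - 85 + rng.normal(0, 8, n),  # related to height
    "noise": rng.normal(0, 1, n),                           # unrelated column
})
corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(5, 4))
# Diverging colormap centred on zero, fixed to the full [-1, 1] range
im = ax.imshow(corr.to_numpy(), cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Annotate each cell with its correlation coefficient
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson r")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale comparable between datasets, which matters when several correlation heatmaps are shown together.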
| 77 | +### 1.4. Specialized Statistical Plots |
| 78 | + |
| 79 | +| **Plot Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** | |
| 80 | +|---------------|-------------|--------------|------------------|---------------------------| |
| 81 | +| **Q-Q Plot (Quantile-Quantile)** | Test if data follows theoretical distribution | Checking normality assumption | Points should follow diagonal line if normally distributed | `from scipy import stats` then `stats.probplot(data, dist="norm", plot=plt)` | |
| 82 | +| **Residual Plot** | Diagnose regression model fit | Checking regression assumptions | Plot residuals vs fitted values; should show random scatter around zero | `plt.scatter(fitted, residuals)`, or `sns.residplot(x=x, y=y)` to fit the regression and plot its residuals against `x` | |
| 83 | +| **ROC Curve** | Evaluate binary classifier performance | Comparing classification models | Plots True Positive Rate vs False Positive Rate | `from sklearn.metrics import roc_curve` then `plt.plot(fpr, tpr)` | |
| 84 | +| **Confusion Matrix** | Show classification results | Evaluating classifier accuracy by class | Matrix showing predicted vs actual classes | `sns.heatmap(confusion_matrix, annot=True, fmt='d')` | |
| 85 | +| **Error Bars** | Show uncertainty or variability | Displaying confidence intervals, standard errors | Bars extend from points to show range | `plt.errorbar(x, y, yerr=errors)` | |
| 86 | +| **Regression Plot** | Show linear relationship and confidence interval | Visualizing regression fit | Scatter plot with fitted line and confidence band | `sns.regplot(x=x, y=y)` | |
| 87 | + |
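The residual plot row can be sketched without any statistics library. The example below assumes a simple straight-line model fitted with `np.polyfit` on synthetic data and plots residuals against fitted values; it is one plain way to build the plot, not the only one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic linear data with Gaussian noise (illustrative only)
rng = np.random.default_rng(seed=2)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Ordinary least-squares fit of a straight line
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(fitted, residuals, s=15)
ax.axhline(0, color="gray", linestyle="--", linewidth=1)  # reference line
ax.set_xlabel("Fitted value")
ax.set_ylabel("Residual")
ax.set_title("Residuals vs fitted values")
```

A funnel shape or a curve in this plot would suggest non-constant variance or a missing non-linear term, respectively.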
| 88 | +### 1.5. Time Series Plots |
| 89 | + |
| 90 | +| **Plot Type** | **Purpose** | **Best For** | **Key Features** | **Python Implementation** | |
| 91 | +|---------------|-------------|--------------|------------------|---------------------------| |
| 92 | +| **Line Plot** | Show values changing over time | General time series visualization | X-axis is time; y-axis is value | `plt.plot(dates, values)` or `data.plot()` | |
| 93 | +| **Area Plot** | Show cumulative totals over time | Stacked time series, showing composition | Filled area under line(s) | `data.plot.area()` or `plt.fill_between(x, y)` | |
| 94 | +| **Stacked Area Plot** | Show multiple time series composition | Visualizing parts of a whole over time | Multiple series stacked on top of each other | `data.plot.area(stacked=True)` | |
| 95 | +| **Seasonal Plot** | Show patterns that repeat over time | Identifying seasonal patterns | Multiple lines for each season/cycle | Manually create with groupby and plot | |
| 96 | +| **Autocorrelation Plot** | Show correlation of series with lagged versions | Detecting seasonality, patterns | Correlation at different lag values | `pd.plotting.autocorrelation_plot(data)` | |
| 97 | +| **Lag Plot** | Check for randomness in time series | Identifying patterns, testing randomness | Current value vs lagged value | `pd.plotting.lag_plot(data)` | |
| 98 | + |
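The seasonal plot row says to build the plot manually. One possible way, sketched here under the assumption of monthly data spanning four years, is to reshape the series so that each year becomes a column and then plot one line per year against the month.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic monthly series with a yearly cycle (illustrative only)
dates = pd.date_range("2020-01-01", periods=48, freq="MS")
month = dates.month.to_numpy()
rng = np.random.default_rng(seed=3)
values = 10 + 5 * np.sin(2 * np.pi * (month - 1) / 12) + rng.normal(scale=0.5, size=48)

df = pd.DataFrame({"year": dates.year, "month": month, "value": values})

# Rows are months 1..12, columns are years: one column per seasonal cycle
seasonal = df.pivot(index="month", columns="year", values="value")

ax = seasonal.plot(marker="o", figsize=(8, 4))
ax.set_xticks(range(1, 13))
ax.set_xlabel("Month")
ax.set_ylabel("Value")
ax.set_title("Seasonal plot: one line per year")
```

Lines that rise and fall together across years indicate a stable seasonal pattern; lines that drift apart point to a trend or a changing cycle.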
| 99 | +--- |
| 100 | + |
| 101 | +## 2. Plot Selection Guide |
| 102 | + |
| 103 | +### 2.1. By Analysis Goal |
| 104 | + |
| 105 | +| **Goal** | **Recommended Plot Types** | |
| 106 | +|----------|---------------------------| |
| 107 | +| **Understand distribution of single variable** | Histogram, Box plot, Violin plot, Density plot | |
| 108 | +| **Compare groups** | Box plot, Violin plot, Bar chart, Strip plot | |
| 109 | +| **Find relationships between variables** | Scatter plot, Line plot, Regression plot, Heatmap | |
| 110 | +| **Show composition** | Pie chart, Stacked bar chart, Area plot | |
| 111 | +| **Analyze time series** | Line plot, Area plot, Seasonal plot | |
| 112 | +| **Detect outliers** | Box plot, Scatter plot, Strip plot | |
| 113 | +| **Explore multivariate data** | Pair plot, Correlation heatmap, Facet grid | |
| 114 | +| **Check statistical assumptions** | Q-Q plot, Residual plot, Histogram | |
| 115 | +| **Show uncertainty** | Error bars, Confidence bands, Violin plots | |
| 116 | + |
| 117 | +### 2.2. By Data Type Combination |
| 118 | + |
| 119 | +| **X Variable** | **Y Variable** | **Z Variable (optional)** | **Recommended Plots** | |
| 120 | +|----------------|----------------|---------------------------|----------------------| |
| 121 | +| Continuous | None | None | Histogram, Box plot, Violin plot, Density plot | |
| 122 | +| Categorical | None | None | Bar chart, Pie chart, Count plot | |
| 123 | +| Continuous | Continuous | None | Scatter plot, Line plot, Hexbin, 2D density | |
| 124 | +| Categorical | Continuous | None | Box plot, Violin plot, Bar chart, Strip plot | |
| 125 | +| Categorical | Categorical | None | Heatmap, Stacked bar chart, Grouped bar chart | |
| 126 | +| Continuous | Continuous | Continuous | 3D scatter, Bubble chart, Contour plot | |
| 127 | +| Continuous | Continuous | Categorical | Scatter with hue, Facet grid | |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +## 3. Best Practices |
| 132 | + |
| 133 | +### 3.1. General Guidelines |
| 134 | + |
| 135 | +1. **Choose the right plot for your data type and message** |
| 136 | + - Match plot type to data structure (categorical vs continuous) |
| 137 | + - Consider what you want to communicate |
| 138 | + |
| 139 | +2. **Keep it simple** |
| 140 | + - Avoid unnecessary 3D effects |
| 141 | + - Remove chart junk (excessive gridlines, decorations) |
| 142 | + - Use appropriate aspect ratios |
| 143 | + |
| 144 | +3. **Use color effectively** |
| 145 | + - Use color to encode information, not just for decoration |
| 146 | + - Ensure accessibility (colorblind-friendly palettes) |
| 147 | + - Maintain consistency across related plots |
| 148 | + |
| 149 | +4. **Label clearly** |
| 150 | + - Always include axis labels with units |
| 151 | + - Add informative titles |
| 152 | + - Include legends when needed |
| 153 | + - Annotate important points |
| 154 | + |
| 155 | +5. **Consider your audience** |
| 156 | + - Technical vs general audience |
| 157 | + - Level of detail appropriate for context |
| 158 | + - Medium of presentation (paper, screen, presentation) |
| 159 | + |
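Several of the guidelines above (colorblind-friendly colors, axis labels with units, a title, a legend, and an annotation) can be applied in a few lines. The sketch below uses Matplotlib's built-in `tableau-colorblind10` style; the signals, units, and label text are placeholders for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

t = np.linspace(0, 2 * np.pi, 200)

# Apply a colorblind-friendly color cycle only inside this block
with plt.style.context("tableau-colorblind10"):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(t, np.sin(t), label="sin(t)")
    ax.plot(t, np.cos(t), label="cos(t)")

# Axis labels with units, plus an informative title
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude (arbitrary units)")
ax.set_title("Two example signals")
ax.set_ylim(-1.5, 1.5)

# Legend and an annotation for the point the reader should notice
ax.legend(loc="lower left")
ax.annotate(
    "peak of sin(t)",
    xy=(np.pi / 2, 1.0),
    xytext=(2.5, 1.3),
    arrowprops={"arrowstyle": "->"},
)
```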
| 160 | +### 3.2. Common Mistakes to Avoid |
| 161 | + |
| 162 | +| **Mistake** | **Problem** | **Solution** | |
| 163 | +|-------------|-------------|--------------| |
| 164 | +| **Starting y-axis at non-zero** | Exaggerates differences | Start at zero for bar charts; flexible for line plots | |
| 165 | +| **Too many categories** | Cluttered, hard to read | Limit to 7-10 categories; consider grouping or filtering | |
| 166 | +| **3D when 2D suffices** | Distorts perception, hard to read | Use 2D alternatives with color or size | |
| 167 | +| **Pie charts with many slices** | Hard to compare angles | Use bar chart instead | |
| 168 | +| **Dual y-axes** | Can be misleading | Use separate plots or normalize scales | |
| 169 | +| **Missing error bars** | Unclear uncertainty | Add error bars or confidence intervals | |
| 170 | +| **Poor color choices** | Not colorblind-safe, poor contrast | Use established palettes (ColorBrewer, Seaborn) | |
| 171 | +| **Overplotting** | Too many overlapping points | Use transparency, jitter, hexbin, or sample data | |
| 172 | + |
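Two of the fixes in the table above, a zero baseline for bar charts and visible uncertainty, can be combined in one figure. This is a minimal sketch assuming three synthetic groups and the standard error of the mean as the error measure; a confidence interval would be an equally valid choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic measurements for three hypothetical groups (illustrative only)
rng = np.random.default_rng(seed=4)
groups = {
    "A": rng.normal(loc=10.0, scale=2.0, size=30),
    "B": rng.normal(loc=12.0, scale=2.0, size=30),
    "C": rng.normal(loc=11.0, scale=2.0, size=30),
}
labels = list(groups)
means = np.array([groups[k].mean() for k in labels])
# Standard error of the mean: sample standard deviation / sqrt(n)
sems = np.array([groups[k].std(ddof=1) / np.sqrt(groups[k].size) for k in labels])

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(labels, means, yerr=sems, capsize=4)
ax.set_ylim(bottom=0)  # bar lengths are only comparable from a zero baseline
ax.set_xlabel("Group")
ax.set_ylabel("Mean value")
ax.set_title("Group means with standard error bars")
```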
| 173 | +### 3.3. Python Visualization Libraries |
| 174 | + |
| 175 | +| **Library** | **Strengths** | **Best For** | **Import Statement** | |
| 176 | +|-------------|---------------|--------------|---------------------| |
| 177 | +| **Matplotlib** | Highly customizable, fine control | Publication-quality plots, custom visualizations | `import matplotlib.pyplot as plt` | |
| 178 | +| **Seaborn** | Beautiful defaults, statistical plots | Exploratory data analysis, statistical visualization | `import seaborn as sns` | |
| 179 | +| **Pandas** | Integrated with DataFrames | Quick exploration, simple plots | Built-in: `df.plot()` | |
| 180 | +| **Plotly** | Interactive plots, 3D support | Dashboards, web applications, interactive exploration | `import plotly.express as px` | |
| 181 | +| **Bokeh** | Interactive web-ready plots | Interactive dashboards, streaming data | `from bokeh.plotting import figure` | |
| 182 | + |
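The libraries in the table above are not mutually exclusive. As a small sketch with made-up data, a quick pandas plot returns a regular Matplotlib `Axes`, so it can be refined afterwards with the same Matplotlib calls used elsewhere on this page.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Tiny illustrative DataFrame
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) ** 2})

# pandas draws the plot and hands back the Matplotlib Axes it used
ax = df.plot(x="x", y="y", kind="line", legend=False)

# From here on this is plain Matplotlib customization
ax.set_xlabel("x")
ax.set_ylabel("y = x squared")
ax.set_title("Quick pandas plot, refined with Matplotlib")
```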
| 183 | +--- |
| 184 | + |
| 185 | +## 5. Additional Resources |
| 186 | + |
| 187 | +### Recommended Reading |
| 188 | +- <a href="https://www.edwardtufte.com/tufte/books_vdqi" target="_blank">**"The Visual Display of Quantitative Information"**</a> by Edward Tufte |
| 189 | +- <a href="https://clauswilke.com/dataviz/" target="_blank">**"Fundamentals of Data Visualization"**</a> by Claus O. Wilke |
| 190 | +- <a href="https://www.storytellingwithdata.com/" target="_blank">**"Storytelling with Data"**</a> by Cole Nussbaumer Knaflic |
| 191 | + |
| 192 | +### Online Tools |
| 193 | +- <a href="https://colorbrewer2.org/" target="_blank">**ColorBrewer**</a>: Color schemes for maps and data visualization |
| 194 | +- <a href="https://seaborn.pydata.org/examples/index.html" target="_blank">**Seaborn Gallery**</a>: Examples of statistical plots |
| 195 | +- <a href="https://matplotlib.org/stable/gallery/index.html" target="_blank">**Matplotlib Gallery**</a>: Comprehensive plot examples |
| 196 | +- <a href="https://www.python-graph-gallery.com/" target="_blank">**Python Graph Gallery**</a>: Collection of plot types with code |
| 197 | + |
| 198 | +### Color Palettes |
| 199 | +```python |
| 200 | +import seaborn as sns |
| 201 | + |
| 202 | +# Colorblind-friendly qualitative palettes (lists of colors) |
| 203 | +sns.color_palette("colorblind") |
| 204 | +sns.color_palette("Set2") |
| 205 | + |
| 206 | +# Diverging colormaps (for correlation matrices and other data centered on zero) |
| 207 | +sns.color_palette("RdBu_r", as_cmap=True) |
| 208 | +sns.color_palette("coolwarm", as_cmap=True) |
| 209 | + |
| 210 | +# Sequential colormaps (for heatmaps of non-negative magnitudes) |
| 211 | +sns.color_palette("YlOrRd", as_cmap=True) |
| 212 | +sns.color_palette("viridis", as_cmap=True) |
| 213 | +``` |