This directory contains all data files used for figure generation, organized into raw (original) and processed (cleaned) data.
data/
├── raw/                          # Original, immutable data files (gitignored except READMEs)
│   ├── gpu_prices/               # Epoch AI ML Hardware Database (GPU pricing)
│   └── ...                       # Other external datasets
└── processed/                    # Cleaned, transformed data ready for plotting
    ├── deployment_impact/        # Workshop timeline data
    ├── sleap_dependencies.json   # Cached SLEAP dependency data
    └── ...                       # Processed analysis results
Purpose: Store original, immutable data files exactly as downloaded/collected.
Git status: Gitignored (except README.md files)
- Raw data files are too large or change too frequently for git
- Each subdirectory should have a README.md documenting:
  - Data source (URL, API endpoint, manual collection)
  - Download/collection instructions
  - License and usage terms
  - Last updated date
  - File formats and schemas
Example subdirectories:
- gpu_prices/ - Epoch AI ML Hardware Database CSV
- software_complexity/ - conda-forge feedstock metadata (if cached)
- survey_data/ - Workshop surveys, user feedback (if applicable)
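The README requirement for raw subdirectories can be checked mechanically. The following is a minimal sketch, not part of the repository's scripts; the function name `missing_readmes` is hypothetical, and it only looks at immediate subdirectories of the directory it is given.

```python
from pathlib import Path


def missing_readmes(raw_dir):
    """Return the names of immediate subdirectories that lack a README.md."""
    raw = Path(raw_dir)
    return sorted(
        sub.name
        for sub in raw.iterdir()
        if sub.is_dir() and not (sub / "README.md").is_file()
    )
```

Calling `missing_readmes("data/raw")` from the repository root would list any dataset directories that still need documentation; an empty list means every subdirectory has a README.md.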
Purpose: Store cleaned, transformed data ready for plotting.
Git status: Selectively committed
- Small processed files (<1MB) are committed for reproducibility
- Large processed files (>1MB) are gitignored; regenerate them as needed
- Cached API responses are gitignored (can be refreshed)
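To see which processed files fall on the gitignore side of the size rule, a short check like the one below may help. It is a sketch under assumptions: "1MB" is read as 1,000,000 bytes, and `oversized_files` is a hypothetical helper, not an existing script.

```python
from pathlib import Path

# Assumption: the 1MB threshold is interpreted as 10**6 bytes.
SIZE_LIMIT_BYTES = 1_000_000


def oversized_files(processed_dir, limit=SIZE_LIMIT_BYTES):
    """Return relative paths of files under processed_dir larger than limit."""
    root = Path(processed_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )
```

Running `oversized_files("data/processed")` before committing gives the list of files that should be gitignored rather than committed.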
Characteristics:
- Clean: Missing values handled, outliers addressed
- Standardized: Consistent formats, column names, units
- Aggregated: Summarized at appropriate granularity
- Documented: Metadata files describe processing steps
| Dataset | Location | Source | License | Download |
|---|---|---|---|---|
| GPU Hardware Pricing | raw/gpu_prices/ | Epoch AI | CC BY 4.0 | epoch.ai/data/machine-learning-hardware |
| Software Dependencies | API calls | PyPI, conda-forge | Public APIs | Auto-fetched by scripts |
| SLEAP Dependencies | processed/sleap_dependencies.json | PyPI JSON API | Public API | Auto-fetched and cached |
| Dataset | Location | Purpose | Maintained By |
|---|---|---|---|
| Workshop Timeline | processed/deployment_impact/workshops.csv | Deployment impact figure | Research team |
See individual data/raw/<dataset>/README.md files for download instructions.
Example (GPU pricing):
# 1. Visit https://epoch.ai/data/machine-learning-hardware
# 2. Click "Download Data" to get ZIP
# 3. Extract ml_hardware.csv to data/raw/gpu_prices/
ls data/raw/gpu_prices/ml_hardware.csv # Verify
Most processed data can be regenerated from raw data:
# Software complexity
cd scripts/analysis
uv run python collect_dependency_data.py # Fetch from APIs (20-40 min)
uv run python process_dependency_data.py # Clean and aggregate
# Output: data/processed/dependency_data.csv
# SLEAP dependencies
cd ../plotting
uv run python generate_sleap_dependency_graph.py # Auto-fetches and caches
# Output: data/processed/sleap_dependencies.json
Many scripts cache expensive API calls:
# Use cached data (default, fast)
uv run python scripts/plotting/<script>.py
# Force refresh from APIs (slow, use when data is stale)
uv run python scripts/plotting/<script>.py --force-refresh
Scripts validate raw data before processing:
- Minimum record count: Ensures sufficient data (e.g., ≥50 GPUs)
- Date range: Checks expected time span (e.g., 2006+)
- Required fields: Verifies that essential columns are present (price, date, name)
- Data types: Validates numeric fields, date formats
Example (GPU costs):
✓ Found 162 GPUs in dataset
✓ Date range: 2006-2025 (19 years)
✓ 127/162 (78%) have pricing data
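The four checks above can be sketched as a single function. This is an illustration rather than the repository's actual validation code: the column names `name`, `date`, and `price`, the ISO `YYYY-MM-DD` date format, and the function names are all assumptions. An empty price is allowed, since not every record has pricing data.

```python
import csv
from datetime import date

# Assumption: these column names are illustrative, not the real CSV schema.
REQUIRED_FIELDS = ("name", "date", "price")


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def validate_rows(rows, min_records=50, min_year=2006):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    if len(rows) < min_records:
        problems.append(f"only {len(rows)} records, expected at least {min_records}")
    for index, row in enumerate(rows):
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            problems.append(f"row {index}: missing fields {missing}")
            continue
        try:
            # Assumption: dates are stored as ISO strings (YYYY-MM-DD).
            released = date.fromisoformat(row["date"])
        except (TypeError, ValueError):
            problems.append(f"row {index}: unparseable date {row['date']!r}")
        else:
            if released.year < min_year:
                problems.append(f"row {index}: year {released.year} is before {min_year}")
        price = row["price"]
        if price not in ("", None):  # missing prices are tolerated
            try:
                float(price)
            except (TypeError, ValueError):
                problems.append(f"row {index}: non-numeric price {price!r}")
    return problems
```

Returning a list of problems instead of raising on the first failure lets a script report every issue in one pass before deciding whether to continue.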
Each processed dataset should include:
- Data file: CSV, JSON, or Parquet format
- Metadata file: Processing date, script version, data sources
- Schema documentation: Column names, types, units, descriptions
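One way to satisfy the metadata requirement is to write a JSON sidecar next to each processed file. The sketch below is only a suggestion: the `.meta.json` suffix, the field names, and the `write_metadata` helper are hypothetical and not an established convention in this repository.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_metadata(data_path, script, script_version, sources, columns):
    """Write <data file>.meta.json beside data_path and return its path."""
    data_path = Path(data_path)
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    metadata = {
        "data_file": data_path.name,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "script": script,
        "script_version": script_version,
        "sources": list(sources),
        # columns maps each column name to a short description with type and units
        "columns": dict(columns),
    }
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path
```

A processing script could call this as its last step, for example `write_metadata("data/processed/dependency_data.csv", "process_dependency_data.py", "0.1.0", ["PyPI", "conda-forge"], {...})`, where the version string and column descriptions are placeholders.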
- No credentials: Never commit API tokens, passwords, or private keys
- No PII: Don't store personally identifiable information
- Public data only: Use only publicly available or properly licensed datasets
- Check licenses: Verify usage rights for external datasets (CC BY, MIT, etc.)
When adding a new dataset:
- Create subdirectory: data/raw/<dataset-name>/
- Add README: Document source, license, download instructions
- Update .gitignore: Add large files to .gitignore
- Process data: Create script in scripts/analysis/ to clean/transform
- Save processed: Output to data/processed/<dataset>.csv
- Document processing: Add metadata file with processing details
- Update this README: Add row to data sources table
- Analysis Figures - How data is used for figures
- Development Guide - Contributing new analyses
- Scripts README - Data processing scripts
- Raw data files: Gitignore if >10MB
- Processed files: Commit if <1MB, gitignore if larger
- API caches: Always gitignore (can be regenerated)
- Use git-lfs: Consider for essential large files (>5MB)
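API caches can stay out of git because they are cheap to rebuild on demand. The pattern behind a flag such as `--force-refresh` might look like the sketch below; it assumes JSON-serializable responses, and `load_or_fetch` is a hypothetical helper, not the function the plotting scripts actually use.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch, force_refresh=False):
    """Return cached JSON if present; otherwise call fetch() and cache the result."""
    cache_path = Path(cache_path)
    if cache_path.is_file() and not force_refresh:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch()  # the expensive call, e.g. a request to a package index API
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With this shape, deleting the cache file and passing `force_refresh=True` have the same effect: the next run repopulates the cache from the source.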
Each data subdirectory should have its own README.md:
- raw/gpu_prices/README.md - GPU pricing dataset
- processed/deployment_impact/README.md - Workshop timeline