Tagline: Same Pandas taste, half the calories (RAM).
Pandas is built for safety and ease of use, not memory efficiency. When you load a CSV, standard Pandas defaults to "safe" but wasteful data types (a quick way to check this yourself is sketched below):
- `int64` for small integers (wasting 75%+ memory per number)
- `float64` for simple metrics (wasting 50% memory per number)
- `object` for repetitive strings (wasting massive amounts of memory and CPU)
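This is easy to see with plain Pandas before dietpandas gets involved; the toy columns below are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2020, 2021, 2022],     # defaults to int64 (8 bytes per value)
    "revenue": [1.1, 2.2, 3.3],     # defaults to float64 (8 bytes per value)
    "country": ["BR", "US", "BR"],  # defaults to object (a full Python string per cell)
})
print(df.dtypes)
print(df.memory_usage(deep=True))   # bytes actually held in RAM, per column
```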
Diet Pandas solves this by acting as a strict nutritionist for your data. It aggressively analyzes data distributions and "downcasts" types to the smallest safe representation, often reducing memory usage by 50% to 80% without losing information.
```bash
pip install diet-pandas
```

```python
import dietpandas as dp
# 1. Drop-in replacement for pandas.read_csv
# Loads faster and uses less RAM automatically
df = dp.read_csv("huge_dataset.csv")
# Diet Complete: Memory reduced by 67.3%
# 450.00MB -> 147.15MB
# 2. Or optimize an existing DataFrame
import pandas as pd
df_heavy = pd.DataFrame({
'year': [2020, 2021, 2022],
'revenue': [1.1, 2.2, 3.3]
})
df_heavy.info()
# year int64 (8 bytes each)
# revenue float64 (8 bytes each)
df_light = dp.diet(df_heavy)
# Diet Complete: Memory reduced by 62.5%
# 0.13MB -> 0.05MB
df_light.info()
# year uint16 (2 bytes each)
# revenue float32 (4 bytes each)
```

Diet Pandas now uses multi-threaded processing for 2-4x faster optimization:

```python
import dietpandas as dp
# Parallel processing enabled by default (uses all CPU cores)
df = dp.diet(df, parallel=True)
# Control number of worker threads
df = dp.diet(df, parallel=True, max_workers=4)
# Disable for sequential processing
df = dp.diet(df, parallel=False)
```

Performance improvements:
- 2-4x faster on multi-core systems
- Automatic fallback to sequential for small DataFrames
- Thread-safe optimization of independent columns (a conceptual sketch follows this list)
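To picture how per-column parallelism can work, here is a minimal sketch using a thread pool. It is illustrative only, not dietpandas' internal implementation, and the helper names (`_downcast_column`, `parallel_downcast`) are made up:

```python
# Illustrative sketch only -- not dietpandas' internal code.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def _downcast_column(series: pd.Series) -> pd.Series:
    """Downcast a single numeric column to the smallest dtype pandas will accept."""
    if pd.api.types.is_integer_dtype(series):
        return pd.to_numeric(series, downcast="integer")
    if pd.api.types.is_float_dtype(series):
        return pd.to_numeric(series, downcast="float")
    return series


def parallel_downcast(df: pd.DataFrame, max_workers: int = 4) -> pd.DataFrame:
    """Optimize independent columns concurrently: one column per task, then reassemble."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        columns = list(pool.map(_downcast_column, (df[name] for name in df.columns)))
    return pd.concat(columns, axis=1)
```

Because each task touches only its own column, the work is naturally thread-safe, which is the property the bullet list above relies on.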
Diet Pandas uses Polars (a blazing-fast DataFrame library) to parse CSV files, then automatically converts to optimized Pandas DataFrames.
```python
import dietpandas as dp
# 5-10x faster than pandas.read_csv AND uses less memory
df = dp.read_csv("large_file.csv")import dietpandas as dp
# Automatic optimization
df = dp.diet(df_original)
# See detailed memory report
report = dp.get_memory_report(df)
print(report)
# column dtype memory_bytes memory_mb percent_of_total
# 0 large_text category 12589875 12.59 45.2
# 1 user_id uint32 4000000 4.00 14.4
```

For maximum compression, use aggressive mode:

```python
# Safe mode: float64 -> float32 (enough precision for most ML tasks)
df = dp.diet(df, aggressive=False)
# Keto mode: float64 -> float16 (extreme compression, some precision loss)
df = dp.diet(df, aggressive=True)
# Diet Complete: Memory reduced by 81.2%
```

Optimized readers are available for every supported format:

```python
import dietpandas as dp
# CSV with fast Polars engine
df = dp.read_csv("data.csv")
# Parquet
df = dp.read_parquet("data.parquet")
# Excel
df = dp.read_excel("data.xlsx")
# JSON
df = dp.read_json("data.json")
# HDF5
df = dp.read_hdf("data.h5", key="dataset1")
# Feather
df = dp.read_feather("data.feather")
# All readers automatically optimize memory usage!
```

For data with many repeated values (zeros, NaNs, or any repeated value):

```python
# Enable sparse optimization for columns with >90% repeated values
df = dp.diet(df, optimize_sparse_cols=True)
# Perfect for: binary features, indicator variables, sparse matrices
```

Automatically optimizes datetime columns for better memory efficiency:

```python
import pandas as pd
import dietpandas as dp

df = pd.DataFrame({
'date': pd.date_range('2020-01-01', periods=1000000),
'value': range(1000000)
})
df_optimized = dp.diet(df, optimize_datetimes=True)
# DateTime columns automatically optimized
```

Automatically detects and optimizes boolean-like columns:

```python
import pandas as pd
import dietpandas as dp

df = pd.DataFrame({
'is_active': [0, 1, 1, 0, 1], # int64 -> boolean (87.5% memory reduction)
'has_data': ['yes', 'no', 'yes', 'no', 'yes'], # object -> boolean
'approved': ['True', 'False', 'True', 'False', 'True'] # object -> boolean
})
df_optimized = dp.diet(df, optimize_bools=True)
# All three columns converted to memory-efficient boolean type!
```

Supports multiple boolean representations:
- Numeric: `0`, `1`
- Strings: `'true'`/`'false'`, `'yes'`/`'no'`, `'y'`/`'n'`, `'t'`/`'f'`
- Case-insensitive detection (see the sketch below)
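To make the detection rules concrete, here is a small hedged sketch of how boolean-like columns can be recognized; `looks_boolean` is a hypothetical helper, not part of the dietpandas API:

```python
# Illustrative sketch only -- one simple way to spot boolean-like columns.
import pandas as pd

TRUTHY = {"true", "t", "yes", "y", "1"}
FALSY = {"false", "f", "no", "n", "0"}


def looks_boolean(series: pd.Series) -> bool:
    """True if every non-null value maps to a truthy/falsy token (case-insensitive)."""
    tokens = set(series.dropna().astype(str).str.strip().str.lower())
    return bool(tokens) and tokens <= (TRUTHY | FALSY)


print(looks_boolean(pd.Series([0, 1, 1, 0])))          # True
print(looks_boolean(pd.Series(["Yes", "no", "YES"])))  # True
print(looks_boolean(pd.Series(["yes", "maybe"])))      # False
```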
NEW in v0.3.0! Fine-grained control over optimization:
```python
# Skip specific columns (e.g., IDs, UUIDs)
df = dp.diet(df, skip_columns=['user_id', 'uuid'])
# Force categorical conversion on high-cardinality columns
df = dp.diet(df, force_categorical=['country_code', 'product_sku'])
# Use aggressive mode only for specific columns
df = dp.diet(df, force_aggressive=['approximation_field', 'estimated_value'])
# Combine multiple controls
df = dp.diet(
df,
skip_columns=['id'],
force_categorical=['category'],
force_aggressive=['approx_price']
)
```

NEW in v0.3.0! Analyze your DataFrame before optimization to see what changes will be made:

```python
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'id': range(1000),
'amount': [1.1, 2.2, 3.3] * 333 + [1.1],
'category': ['A', 'B', 'C'] * 333 + ['A']
})
# Analyze without modifying the DataFrame
analysis = dp.analyze(df)
print(analysis)
#
# column current_dtype recommended_dtype current_memory_mb optimized_memory_mb savings_mb savings_percent reasoning
# 0 id int64 uint16 0.008 0.002 0.006 75.0 Integer range 0-999 fits in uint16
# 1 amount float64 float32 0.008 0.004 0.004 50.0 Standard float optimization
# 2 category object category 0.057 0.001 0.056 98.2 Low cardinality (3 unique values)
# Get summary statistics
summary = dp.get_optimization_summary(analysis)
print(summary)
# {
# 'total_columns': 3,
# 'optimizable_columns': 3,
# 'current_memory_mb': 0.073,
# 'optimized_memory_mb': 0.007,
# 'total_savings_mb': 0.066,
# 'total_savings_percent': 90.4
# }
# Quick estimate without detailed analysis
reduction_pct = dp.estimate_memory_reduction(df)
print(f"Estimated reduction: {reduction_pct:.1f}%")
# Estimated reduction: 90.4%
```

NEW in v0.3.0! Get helpful warnings about potential issues:

```python
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'id': range(10000), # High cardinality
'value': [1.123456789] * 10000, # Will lose precision in float16
'empty': [None] * 10000 # All NaN column
})
# Warnings are enabled by default
df_optimized = dp.diet(df, aggressive=True, warn_on_issues=True)
# ⚠️ Warning: Column 'empty' is entirely NaN - consider dropping it
# ⚠️ Warning: Column 'id' has high cardinality (100.0%) - may not benefit from categorical
# ⚠️ Warning: Aggressive mode on column 'value' may lose precision (float64 -> float16)
# Disable warnings if you know what you're doing
df_optimized = dp.diet(df, aggressive=True, warn_on_issues=False)
```

```python
import dietpandas as dp
# CSV (with Polars acceleration)
df = dp.read_csv("data.csv")
# Parquet (with Polars acceleration)
df = dp.read_parquet("data.parquet")
# Excel
df = dp.read_excel("data.xlsx")
# All return optimized Pandas DataFrames
```

Diet Pandas uses a "Trojan Horse" architecture (a simplified sketch of the optimization step follows the list):

- Ingestion Layer (The Fast Lane):
  - Uses Polars or PyArrow for multi-threaded CSV parsing (5-10x faster)
- Optimization Layer (The Metabolism):
  - Calculates min/max for numeric columns
  - Analyzes string cardinality (unique values ratio)
  - Maps stats to smallest safe numpy types
- Conversion Layer (The Result):
  - Returns a standard `pandas.DataFrame` (100% compatible)
  - Works seamlessly with Scikit-Learn, PyTorch, XGBoost, Matplotlib
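The min/max-to-dtype mapping in the optimization layer can be pictured with a short sketch. This is illustrative only, not the library's actual source, and `smallest_int_dtype` is a hypothetical helper name:

```python
# Illustrative sketch of the min/max -> smallest-safe-dtype idea.
import numpy as np
import pandas as pd


def smallest_int_dtype(series: pd.Series) -> str:
    """Pick the narrowest NumPy integer type that can hold the column's min and max."""
    lo, hi = series.min(), series.max()
    candidates = ["uint8", "uint16", "uint32", "uint64"] if lo >= 0 else ["int8", "int16", "int32", "int64"]
    for name in candidates:
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return name
    return str(series.dtype)  # fall back to the existing dtype


years = pd.Series([2020, 2021, 2022])
print(smallest_int_dtype(years))                       # "uint16"
print(years.astype(smallest_int_dtype(years)).dtype)   # uint16
```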
| Original Type | Optimization | Example |
|---|---|---|
| `int64` with only 0/1 | `boolean` | NEW! Flags, indicators (87.5% reduction) |
| `object` with 'yes'/'no' | `boolean` | NEW! Survey responses |
| `int64` with values 0-255 | `uint8` | User ages, small counts |
| `int64` with values -100 to 100 | `int8` | Temperature data |
| `float64` | `float32` | Most ML features |
| `object` with <50% unique | `category` | Country names, product categories |
Diet-pandas has been benchmarked on the ENEM 2024 dataset (Brazilian National Exam) with 4.3 million student records across multiple files:
```python
import pandas as pd
import dietpandas as dp
# Standard Pandas
df = pd.read_csv("RESULTADOS_2024.csv", sep=";")
# Memory: 4,349 MB | Load time: 17.31 sec
# Diet Pandas
df = dp.read_csv("RESULTADOS_2024.csv", sep=";")
# Memory: 1,623 MB | Load time: 32.99 sec
# ✅ 62.7% reduction | 2.7 GB saved!
```

Key Findings:
- ✅ 62-96% memory reduction on real government data
- ✅ 2.7-5.4 GB saved per file - critical for laptop workflows
- ✅ Handles 4.3 million rows with mixed data types
- ✅ Extremely effective on categorical/geographic data (Brazilian states, cities)
- ⚠️ Load time 2-3x slower (worth it for massive memory savings + iterative analysis)
| Dataset Size | Memory Reduction | Optimization Time |
|---|---|---|
| 10K rows | 82.3% | 0.009 sec |
| 50K rows | 85.8% | 0.033 sec |
| 100K rows | 86.3% | 0.061 sec |
| 500K rows | 86.6% | 0.304 sec |
Consistent 85%+ reduction across all dataset sizes with minimal overhead.
You can see other benchmarks in the benchmarks folder.
- Large datasets (>100 MB) on memory-constrained systems
- Laptop workflows - Process 3-5x more data without upgrading RAM
- Iterative analysis - Load once, query many times (worth the initial load time)
- Categorical/geographic data - State codes, city names, categories (95%+ reduction)
- Educational/research - Work with real datasets on student hardware
- ML pipelines - Reduce memory for feature engineering and model training
- Data exploration - Fit larger datasets in Jupyter notebooks
- ⚠️ Tiny datasets (<10 MB) - Optimization overhead not worth it
- ⚠️ One-time read-and-aggregate - Won't query data multiple times
- ⚠️ Time-critical ETL - Where 2-3x load time matters more than memory
- ⚠️ Unlimited RAM available - Cloud instances with 128+ GB RAM
Parquet helps with disk space, diet-pandas helps with RAM usage:
```python
import pandas as pd
import dietpandas as dp

# Scenario 1: Parquet from unoptimized data (COMMON)
df = pd.read_parquet('data.parquet') # int64, object types
# In memory: 1800 MB
df_optimized = dp.diet(df)
# In memory: 500 MB - 72% reduction still possible!
# Scenario 2: Parquet from already-optimized data (BEST)
df = dp.read_csv('data.csv') # Already optimized
df.to_parquet('optimized.parquet') # Saves efficient types
# Future reads already optimal ✅
```

When to use with Parquet:
- ✅ Parquet created from raw/unoptimized data (most cases)
- ✅ Need to reduce in-memory usage during analysis
- ✅ Not sure if original DataFrame was optimized
- ❌ You optimized before saving to Parquet (already efficient)
Pro tip: Optimize THEN save to Parquet for best results!
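If your data is already in memory as a plain DataFrame, the same tip applies. A minimal sketch (the DataFrame and file name here are placeholders):

```python
import pandas as pd
import dietpandas as dp

df_heavy = pd.DataFrame({"year": [2020, 2021, 2022], "city": ["Recife", "Natal", "Recife"]})

df_light = dp.diet(df_heavy)              # shrink dtypes first...
df_light.to_parquet("optimized.parquet")  # ...so the efficient dtypes are stored in the file
```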
The main trade-off: slower initial load (2-3x).
Worth it when:
- You'll run multiple queries on the data
- Memory is limited (8-16 GB laptops)
- Processing multiple large files simultaneously
- Need to keep data in memory for hours
Not worth it when:
- Quick one-off aggregation then done
- Have plenty of RAM available
- Load time is critical (real-time systems)
```python
# Skip optimization for specific columns
df = dp.diet(df, skip_columns=['user_id', 'uuid'])
# Force categorical conversion for high-cardinality columns
df = dp.diet(df, force_categorical=['country_code'])
# Apply aggressive optimization only to specific columns
df = dp.diet(df, force_aggressive=['estimated_value'])

# Convert to category if <30% unique values (default is 50%)
df = dp.diet(df, categorical_threshold=0.3)

# Keep binary columns as integers instead of converting to boolean
df = dp.diet(df, optimize_bools=False)

# Modify DataFrame in place (saves memory)
dp.diet(df, inplace=True)
```

```python
import pandas as pd
import dietpandas as dp
df = dp.read_csv("data.csv", optimize=False) # Load without optimization
df = df.drop(columns=['id_column']) # Remove high-cardinality columns
df = dp.diet(df)  # Now optimize

df = dp.diet(df, verbose=True)
# Diet Complete: Memory reduced by 67.3%
# 450.00MB -> 147.15MB
```

Diet Pandas returns standard Pandas DataFrames, so it works seamlessly with the rest of the ecosystem:

```python
import dietpandas as dp
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Load optimized data
df = dp.read_csv("train.csv")
# Works with Scikit-Learn
X = df.drop('target', axis=1)
y = df['target']
model = RandomForestClassifier()
model.fit(X, y)
# Works with Matplotlib
df['revenue'].plot()
plt.show()
# Works with any Pandas operation
result = df.groupby('category')['sales'].sum()
```

| Solution | Speed | Memory Savings | Pandas Compatible | Learning Curve |
|---|---|---|---|---|
| Diet Pandas | Fast | 50-80% | Yes (100%) | None |
| Manual downcasting | Slow | 50-80% | Yes | High |
| Polars | Very Fast | 60-90% | No | |
| Dask | Medium | Varies | | |
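For contrast with the "Manual downcasting" row above, this is roughly what the hand-rolled approach looks like in plain pandas; the column names are made up for illustration:

```python
# Manual downcasting without dietpandas: workable, but per-column and easy to get wrong.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 45, 31],
    "price": [9.99, 14.50, 3.25],
    "country": ["BR", "US", "BR"],
})

df["age"] = pd.to_numeric(df["age"], downcast="unsigned")   # int64 -> uint8
df["price"] = pd.to_numeric(df["price"], downcast="float")  # float64 -> float32
df["country"] = df["country"].astype("category")            # object -> category

print(df.dtypes)
```

You have to repeat this reasoning for every column and every new dataset, which is the "learning curve" cost the table refers to.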
```bash
git clone https://github.com/yourusername/diet-pandas.git
cd diet-pandas
# Install in development mode
pip install -e ".[dev]"

pytest tests/ -v

python scripts/examples.py

# Or run the interactive demo
python scripts/demo.py
```

```
diet-pandas/
├── src/
│   └── dietpandas/
│       ├── __init__.py       # Public API
│       ├── core.py           # Optimization logic
│       └── io.py             # Fast I/O with Polars
├── tests/
│   ├── test_core.py          # Core function tests
│   └── test_io.py            # I/O function tests
├── scripts/
│   ├── demo.py               # Interactive demo
│   ├── examples.py           # Usage examples
│   └── quickstart.py         # Setup script
├── pyproject.toml            # Project configuration
├── README.md                 # Documentation
├── CHANGELOG.md              # Version history
├── CONTRIBUTING.md           # Contribution guide
└── LICENSE                   # MIT License
```
`dp.diet(df, ...)`: Optimize an existing DataFrame.
Parameters:
- `df` (pd.DataFrame): DataFrame to optimize
- `verbose` (bool): Print memory reduction statistics
- `aggressive` (bool): Use float16 instead of float32 (may lose precision)
- `categorical_threshold` (float): Convert to category if unique_ratio < threshold
- `inplace` (bool): Modify DataFrame in place
Returns: Optimized pd.DataFrame
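Putting the documented parameters together in one call (the argument values and sample DataFrame are only illustrative):

```python
import pandas as pd
import dietpandas as dp

df = pd.DataFrame({"year": [2020, 2021, 2022], "note": ["a", "b", "a"]})

df_light = dp.diet(
    df,
    verbose=True,                # print the memory-reduction summary
    aggressive=False,            # keep float32 rather than dropping to float16
    categorical_threshold=0.5,   # convert object columns with <50% unique values to category
    inplace=False,               # return a new DataFrame instead of modifying df
)
```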
`dp.get_memory_report(df)`: Get detailed memory usage report per column.
Returns: DataFrame with memory statistics
`dp.read_csv(...)`: Read CSV with automatic optimization.
`dp.read_parquet(...)`: Read Parquet with automatic optimization.
`dp.read_excel(...)`: Read Excel with automatic optimization.
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details.
- Built on top of the excellent Pandas library
- Uses Polars for high-speed CSV parsing
- Inspired by the need for memory-efficient data science workflows
- GitHub: @luiz826
- Issues: GitHub Issues
Remember: A lean DataFrame is a happy DataFrame! 🐼🥗