California Gentrification Prediction Model

Predicting neighborhood change across California using machine learning and U.S. Census data

📊 Overview

A comprehensive machine learning pipeline that analyzes and predicts gentrification patterns across 8,000+ census tracts in California. This project uses American Community Survey (ACS) data to identify neighborhoods at risk of gentrification and quantify the factors driving neighborhood change.

Key Features:

🔍 Automated data collection from U.S. Census Bureau API
📈 Multiple ML models (Linear, Ridge, Lasso, Random Forest, Gradient Boosting)
🗺️ County-level analysis across all 58 California counties
📊 Rich visualizations and statistical analysis
🎯 Predicts both rent change (regression) and gentrification status (classification)

🎯 What It Does

This system analyzes how neighborhoods change over time by:

1. Collecting census data for two time periods (e.g., 2012 vs. 2022)
2. Calculating gentrification metrics, including:
   - Rent and home value changes
   - Income shifts
   - Educational attainment increases
   - Demographic transitions
3. Building predictive models that use baseline neighborhood characteristics to forecast future change
4. Generating insights through visualizations, county rankings, and detailed reports
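
As a rough illustration of the metric calculation step, the sketch below joins a baseline and a comparison table on a tract identifier and computes percent changes. The column names (`geoid`, `median_rent`, `median_income`) and the sample values are assumptions made for this example; they are not taken from the project's CSV files.

```python
import pandas as pd

def percent_change(baseline: pd.Series, comparison: pd.Series) -> pd.Series:
    """Percent change from baseline to comparison; NaN where the baseline is zero or missing."""
    safe_baseline = baseline.where(baseline > 0)
    return (comparison - safe_baseline) / safe_baseline * 100

# Illustrative values only (hypothetical tracts, not real ACS estimates)
baseline = pd.DataFrame({
    "geoid": ["06001400100", "06001400200", "06001400300"],
    "median_rent": [1000.0, 1500.0, 0.0],
    "median_income": [40000.0, 90000.0, 55000.0],
})
comparison = pd.DataFrame({
    "geoid": ["06001400100", "06001400200", "06001400300"],
    "median_rent": [1600.0, 1800.0, 1200.0],
    "median_income": [52000.0, 99000.0, 60000.0],
})

# Keep only tracts present in both periods, then compute the change metrics
merged = baseline.merge(comparison, on="geoid", suffixes=("_base", "_comp"))
merged["rent_change_pct"] = percent_change(merged["median_rent_base"], merged["median_rent_comp"])
merged["income_change_pct"] = percent_change(merged["median_income_base"], merged["median_income_comp"])
print(merged[["geoid", "rent_change_pct", "income_change_pct"]])
```

Masking non-positive baselines keeps division by zero from producing infinite changes that would later distort percentile thresholds.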

📈 Results

Analyzes 8,000+ census tracts across California
Achieves R² of 0.20-0.30 for rent change prediction
Identifies top gentrifying counties: Alameda, Contra Costa, San Francisco, Los Angeles

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/dhanyabhat16/california-gentrification.git
cd california-gentrification

# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn census requests scipy

Get a Census API Key

Sign up at: https://api.census.gov/data/key_signup.html
Receive your key instantly via email
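
For orientation, here is a minimal sketch of what a tract-level ACS 5-Year request for California (state FIPS 06) can look like, and how the JSON response (a list of rows whose first row is the header) can be parsed. It is not the code in data_collection.py. The two variable codes shown (median gross rent and median household income) are only an assumed subset of what the project collects, and some ACS vintages may also require a county in the `in` clause.

```python
import json
from urllib.parse import urlencode

# Assumed subset of ACS variables for this sketch
ACS_VARIABLES = {
    "B25064_001E": "median_rent",    # median gross rent
    "B19013_001E": "median_income",  # median household income
}

def build_acs_url(year: int, api_key: str) -> str:
    """Build a request URL for all California census tracts in one ACS 5-Year vintage."""
    params = {
        "get": ",".join(["NAME", *ACS_VARIABLES]),
        "for": "tract:*",
        "in": "state:06",
        "key": api_key,
    }
    return f"https://api.census.gov/data/{year}/acs/acs5?" + urlencode(params, safe=",:*")

def parse_acs_response(payload: str) -> list:
    """Turn the API's list-of-rows JSON into one dict per tract."""
    header, *records = json.loads(payload)
    tracts = []
    for record in records:
        raw = dict(zip(header, record))
        tract = {"geoid": raw["state"] + raw["county"] + raw["tract"], "name": raw["NAME"]}
        for code, label in ACS_VARIABLES.items():
            value = raw.get(code)
            number = float(value) if value is not None else None
            # Negative sentinels such as -666666666 mark unavailable estimates
            tract[label] = number if number is not None and number >= 0 else None
        tracts.append(tract)
    return tracts

print(build_acs_url(2022, "YOUR_KEY"))
```

Keeping `geoid` as a string preserves the leading zero of the state code, which matters when joining the baseline and comparison files.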

Run the Pipeline

# Use the datasets already in the data folder, or collect fresh data with data_collection.py
python data_collection.py       # Collect census data
python analysis_modeling.py     # Train ML models
python ca_gentrify_viz.py       # Generate visualizations

🔬 Methodology

Gentrification Definition

A census tract is classified as gentrified if it meets all of these criteria:

Was lower-income at baseline (bottom 40% of tracts by baseline income)
Experienced high rent increases (top third of tracts by rent change)
Shows significant increases in education levels OR home values
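
The rule above can be sketched with pandas quantiles. The README does not say what counts as a "significant" increase in education or home values, so this example assumes a top-third cutoff for those as well. All column names except `baseline_median_income` are hypothetical, and the values are illustrative.

```python
import pandas as pd

def classify_gentrified(df, income_q=0.40, rent_q=0.67, change_q=0.67):
    """Flag tracts that satisfy all three criteria; thresholds are quantiles across tracts."""
    lower_income = df["baseline_median_income"] <= df["baseline_median_income"].quantile(income_q)
    high_rent_growth = df["rent_change_pct"] >= df["rent_change_pct"].quantile(rent_q)
    # Assumption: "significant" means top third of education OR home value change
    upscaling = (
        (df["education_change"] >= df["education_change"].quantile(change_q))
        | (df["home_value_change_pct"] >= df["home_value_change_pct"].quantile(change_q))
    )
    return lower_income & high_rent_growth & upscaling

# Illustrative values only
tracts = pd.DataFrame({
    "baseline_median_income": [30000, 35000, 60000, 80000, 100000],
    "rent_change_pct": [50, 10, 45, 20, 5],
    "education_change": [8, 1, 2, 3, 0],
    "home_value_change_pct": [10, 60, 20, 15, 5],
})
tracts["gentrified"] = classify_gentrified(tracts)
print(tracts)
```

Only the first tract is flagged here: it is lower-income at baseline, in the top third for rent growth, and in the top third for education change. Because the thresholds are relative, changing the quantile arguments changes how many tracts end up in the positive class.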

Data Sources

U.S. Census Bureau - American Community Survey (ACS) 5-Year Estimates

| Category | Variables |
|---|---|
| Demographics | Population, race/ethnicity, age |
| Economics | Median income, per capita income, poverty rates, employment |
| Housing | Median rent, median home value, vacancy rates, owner/renter ratio |
| Education | Bachelor's degree attainment and higher |

Machine Learning Models

Regression (Predicting Rent Change %):

Linear Regression
Ridge Regression
Lasso Regression
Random Forest Regressor
Gradient Boosting Regressor
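
One way to compare these five regressors is 5-fold cross-validated R², sketched below on synthetic data. The data, hyperparameters, and the use of cross-validation are assumptions for illustration; analysis_modeling.py may evaluate the models differently.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for baseline features (X) and rent change % (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=2.0, size=300)

# Linear models are scaled first so the regularization penalties treat features equally
models = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:18s} mean CV R2 = {score:.3f}")
```

Cross-validated scores are usually a fairer basis for comparison than training-set R², especially for tree ensembles that can fit the training data closely.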

Classification (Predicting Gentrification Status):

Logistic Regression
Random Forest Classifier
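
Gentrified tracts are typically a small share of all tracts, so accuracy alone can be misleading. The sketch below shows one possible setup on synthetic data: a stratified train/test split, `class_weight="balanced"`, and precision, recall, and ROC AUC as metrics. These choices are assumptions for illustration and are not confirmed to match the project's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for baseline features (X) and gentrification status (y)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 2.5
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Stratify so both classes appear in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

classifiers = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    probability = clf.predict_proba(X_test)[:, 1]
    results[name] = {
        "precision": precision_score(y_test, predicted, zero_division=0),
        "recall": recall_score(y_test, predicted, zero_division=0),
        "roc_auc": roc_auc_score(y_test, probability),
    }
    print(name, results[name])
```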

📊 Sample Outputs

County Rankings

Visualizes top counties by gentrification rate and absolute number of gentrified tracts
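
Both ranking measures can be derived from tract-level labels with a single groupby. The sketch below assumes a string `geoid` column (state + county + tract FIPS codes) and a boolean `gentrified` column. These names and the sample rows are illustrative and may not match california_gentrification_metrics.csv.

```python
import pandas as pd

# Illustrative rows: 001 = Alameda, 037 = Los Angeles, 075 = San Francisco
tracts = pd.DataFrame({
    "geoid": ["06001400100", "06001400200", "06037101110",
              "06037101122", "06037101210", "06075010100"],
    "gentrified": [True, False, True, False, False, True],
})

# Characters 3-5 of an 11-digit tract GEOID are the county FIPS code
tracts["county_fips"] = tracts["geoid"].str[2:5]

rankings = (
    tracts.groupby("county_fips")["gentrified"]
    .agg(total_tracts="size", gentrified_tracts="sum")
    .assign(gentrification_rate=lambda d: d["gentrified_tracts"] / d["total_tracts"] * 100)
    .sort_values("gentrification_rate", ascending=False)
)
print(rankings)
```

Sorting by `gentrified_tracts` instead gives the ranking by absolute count, which favors large counties. When loading tract data from CSV, reading `geoid` with `dtype=str` keeps the leading zero intact.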

Feature Importance

Shows which baseline characteristics best predict future gentrification
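
A minimal sketch of how such a ranking can be produced with a random forest's impurity-based importances. The feature names are the ones listed under Customize Features; the data is synthetic and constructed so that only baseline rent matters, so the ordering says nothing about the project's real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = [
    "baseline_median_income",
    "baseline_median_rent",
    "baseline_pct_higher_ed",
    "baseline_poverty_rate",
    "baseline_vacancy_rate",
]

# Synthetic data: the target depends only on baseline_median_rent plus noise
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(400, 5)), columns=feature_names)
y = 3.0 * X["baseline_median_rent"] + rng.normal(scale=0.5, size=400)

model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# Importances sum to 1; higher means the feature contributed more to the trees' splits
importance = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importance)
```

Impurity-based importances can be skewed when features are strongly correlated, as income, rent, and education often are, so they are best read as a rough guide.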

Correlation Analysis

Heatmap revealing relationships between demographic, economic, and housing variables

Exploratory Data Analysis

Distribution plots, scatter plots, and statistical summaries

🎓 Use Cases

Urban Planning: Identify neighborhoods at risk of displacement
Policy Analysis: Evaluate effects of housing policies
Academic Research: Study drivers of neighborhood change
Community Advocacy: Support affordable housing initiatives
Real Estate: Understand market trends and investment patterns

📚 Key Findings

Income Paradox: Lower-income neighborhoods at baseline experience the highest rent increases
Geographic Clustering: Gentrification spreads to adjacent neighborhoods (spillover effect)
Education Factor: Areas with moderate education levels show highest gentrification potential
Bay Area Dominance: Alameda, Contra Costa and Solano lead in gentrification rates
Volume vs. Intensity: Los Angeles County has the most gentrified tracts (177) but a lower gentrification rate (~7%) because of its large number of tracts

⚙️ Configuration

Adjust Time Periods

python complete_pipeline.py --api_key YOUR_KEY --baseline 2010 --comparison 2020

Customize Features

Edit prepare_features() in analysis_modeling.py:

custom_features = [
    'baseline_median_income',
    'baseline_median_rent',
    'baseline_pct_higher_ed',
    'baseline_poverty_rate',
    'baseline_vacancy_rate'  # Add your own
]

Adjust Classification Criteria

python adjust_criteria.py  # Interactive tool for threshold adjustment

Fix "Only One Class" Error

If you encounter classification errors due to imbalanced data:

python adjust_criteria.py  # Automatically adjusts thresholds
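
This error usually means the thresholds labeled every tract the same way, so the classifier has nothing to separate. The check below is a hypothetical helper for confirming that before training; it is not part of adjust_criteria.py, and the `min_per_class` default is an arbitrary choice.

```python
from collections import Counter

def class_counts_ok(labels, min_per_class=2):
    """Return (ok, counts); ok is False when either class has too few tracts."""
    counts = Counter(bool(label) for label in labels)
    ok = all(counts[cls] >= min_per_class for cls in (True, False))
    return ok, {"gentrified": counts[True], "not_gentrified": counts[False]}

# Example: thresholds so strict that no tract was labeled gentrified
ok, counts = class_counts_ok([False] * 50)
if not ok:
    print(f"Labels are too imbalanced to train a classifier: {counts}")
```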

Generate Only Visualizations

python ca_gentrify_viz.py  # Assumes data already collected

📖 Documentation

Command-Line Options

complete_pipeline.py:

--api_key (required): Your Census API key
--baseline (default: 2012): Baseline year for analysis
--comparison (default: 2022): Comparison year for analysis
--skip_collection: Skip data collection, use existing data
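
For reference, the options above map onto an argparse definition roughly like the following. This is a sketch reconstructed from the option list, not the actual contents of complete_pipeline.py. The help strings and the check that the comparison year follows the baseline year are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse the documented pipeline options (illustrative reconstruction)."""
    parser = argparse.ArgumentParser(description="California gentrification pipeline (sketch)")
    parser.add_argument("--api_key", required=True, help="Your Census API key")
    parser.add_argument("--baseline", type=int, default=2012, help="Baseline year for analysis")
    parser.add_argument("--comparison", type=int, default=2022, help="Comparison year for analysis")
    parser.add_argument("--skip_collection", action="store_true",
                        help="Skip data collection, use existing data")
    args = parser.parse_args(argv)
    # Assumed sanity check: the comparison period must come after the baseline
    if args.comparison <= args.baseline:
        parser.error("--comparison must be later than --baseline")
    return args

args = parse_args(["--api_key", "YOUR_KEY", "--baseline", "2010", "--comparison", "2020"])
print(args.baseline, args.comparison, args.skip_collection)
```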

Output Files

Data Files (./data/):

california_baseline.csv: Baseline year census tract data
california_comparison.csv: Comparison year census tract data
california_gentrification_metrics.csv: Calculated metrics and classifications

Visualization Files (./output/):

eda_overview.png: Exploratory data analysis charts
scatter_analysis.png: Variable relationship scatter plots
correlation_heatmap.png: Correlation matrix
feature_importance.png: ML model feature importance
predictions_*.png: Model prediction vs. actual plots
county_rankings.png: County comparison charts
distribution_maps.png: Distribution visualizations by county

Report Files (./output/):

analysis_report.txt: Comprehensive analysis report with model metrics
county_report.txt: Detailed county-level statistics

⚠️ Limitations

Data Lag: Census data has a 1-2 year publication delay
5-Year Estimates: ACS data represents 5-year averages, smoothing annual variation
Correlation ≠ Causation: Models identify patterns but don't prove causal relationships
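R² Values: The modest R² values (0.20-0.30 for rent change, as reported under Results) reflect how hard neighborhood change is to predict from baseline characteristics alone; low R² is common in social science models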
Missing Variables: Unable to capture policy changes, investment decisions, cultural trends, transit development
Small Sample Bias: Small census tracts may have unreliable estimates
Geographic Scope: Currently limited to California (extensible to other states)
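
One common way to screen out unreliable small-tract estimates is the coefficient of variation (CV), computed from the margin of error that the ACS publishes with each estimate at the 90% confidence level. The sketch below is a possible extension, not something the project is known to do, and the 30% cutoff is an assumed, adjustable threshold.

```python
ACS_MOE_Z_SCORE = 1.645  # ACS margins of error correspond to a 90% confidence level

def coefficient_of_variation(estimate, margin_of_error):
    """Return the CV in percent, or None when it cannot be computed."""
    if estimate is None or margin_of_error is None or estimate <= 0:
        return None
    standard_error = margin_of_error / ACS_MOE_Z_SCORE
    return standard_error / estimate * 100

def is_reliable(estimate, margin_of_error, max_cv=30.0):
    """Treat an estimate as usable when its CV is at or below max_cv (assumed cutoff)."""
    cv = coefficient_of_variation(estimate, margin_of_error)
    return cv is not None and cv <= max_cv

# Illustrative values: the same rent estimate with a narrow and a wide margin of error
print(is_reliable(1500, 150))
print(is_reliable(1500, 900))
```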

🐛 Troubleshooting

Common Issues

"This solver needs samples of at least 2 classes"

# Solution: Use the adjustment tool
python adjust_criteria.py

"Error: Could not collect data"

Check your Census API key is correct
Verify internet connection
Try different years (some years may not be available yet)

"Module not found"

pip install pandas numpy matplotlib seaborn scikit-learn census requests scipy

"File not found"

Ensure you run data_collection.py first
Check that ./data/ and ./output/ folders exist
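
A small pre-flight check can catch both problems before running the modeling or visualization scripts. The helper functions below are hypothetical and not part of the repository; the file names come from the Output Files section. The demo runs in a temporary directory so it does not touch a real checkout.

```python
import tempfile
from pathlib import Path

EXPECTED_DATA_FILES = [
    "california_baseline.csv",
    "california_comparison.csv",
    "california_gentrification_metrics.csv",
]

def ensure_project_dirs(root="."):
    """Create ./data/ and ./output/ under root if they do not exist yet."""
    for name in ("data", "output"):
        (Path(root) / name).mkdir(parents=True, exist_ok=True)

def missing_data_files(root="."):
    """List the expected data files that are not present under root/data."""
    data_dir = Path(root) / "data"
    return [name for name in EXPECTED_DATA_FILES if not (data_dir / name).is_file()]

# Demo: only the baseline file exists, so the other two are reported as missing
with tempfile.TemporaryDirectory() as project_root:
    ensure_project_dirs(project_root)
    (Path(project_root) / "data" / "california_baseline.csv").write_text("geoid\n")
    still_missing = missing_data_files(project_root)
    print(still_missing)
```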

📄 License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

⭐ Star This Project

If you find this useful for your research or work, please consider giving it a star! It helps others discover the project.

📊 Project Statistics

Lines of Code: ~2,500
Census Tracts Analyzed: 8,000+
Counties Covered: 58
Variables Tracked: 25+
ML Models Implemented: 7
Visualizations Generated: 10+

Built with 🏙️ for better understanding of urban change in California

Note: This project is for research and educational purposes. Policy decisions should incorporate additional community input, qualitative research, and local context.
