dhanyabhat16/Prediction-Model-for-Gentrification-in-California


California Gentrification Prediction Model

Predicting neighborhood change across California using machine learning and U.S. Census data

📊 Overview

A comprehensive machine learning pipeline that analyzes and predicts gentrification patterns across 8,000+ census tracts in California. This project uses American Community Survey (ACS) data to identify neighborhoods at risk of gentrification and quantify the factors driving neighborhood change.

Key Features:

  • 🔍 Automated data collection from the U.S. Census Bureau API
  • 📈 Multiple ML models (Linear, Ridge, Lasso, Random Forest, Gradient Boosting)
  • 🗺️ County-level analysis across all 58 California counties
  • 📊 Rich visualizations and statistical analysis
  • 🎯 Predicts both rent change (regression) and gentrification status (classification)

🎯 What It Does

This system analyzes how neighborhoods change over time by:

  1. Collecting census data for two time periods (e.g., 2012 vs. 2022)
  2. Calculating gentrification metrics including:
    • Rent and home value changes
    • Income shifts
    • Educational attainment increases
    • Demographic transitions
  3. Building predictive models that use baseline neighborhood characteristics to forecast future change
  4. Generating insights through visualizations, county rankings, and detailed reports
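
As a rough illustration of step 2, the sketch below joins two periods on a tract identifier and computes the percent change in rent. The column names `GEOID` and `median_rent` and the function name are assumptions made for this example, not names confirmed by the project's scripts. The ACS reports unavailable estimates with the sentinel value -666666666, so the sketch converts it to NaN before doing arithmetic.

```python
import numpy as np
import pandas as pd

ACS_MISSING = -666666666  # sentinel the ACS uses when an estimate is unavailable


def compute_change_metrics(baseline: pd.DataFrame, comparison: pd.DataFrame) -> pd.DataFrame:
    """Join two periods on tract ID and compute percent change in median rent.

    Column names ("GEOID", "median_rent") are hypothetical.
    """
    merged = baseline.merge(comparison, on="GEOID", suffixes=("_base", "_comp"))
    merged = merged.replace(ACS_MISSING, np.nan)
    merged["rent_change_pct"] = (
        (merged["median_rent_comp"] - merged["median_rent_base"])
        / merged["median_rent_base"]
        * 100
    )
    return merged


baseline = pd.DataFrame(
    {"GEOID": ["06001400100", "06001400200"], "median_rent": [1000, 1200]}
)
comparison = pd.DataFrame(
    {"GEOID": ["06001400100", "06001400200"], "median_rent": [1500, ACS_MISSING]}
)
metrics = compute_change_metrics(baseline, comparison)
print(metrics[["GEOID", "rent_change_pct"]])
```

The inner join keeps only tracts whose IDs appear in both periods; tracts with a missing estimate in either period end up with NaN rather than a misleading negative change.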

📈 Results

  • Analyzes 8,000+ census tracts across California
  • Achieves R² of 0.20-0.30 for rent change prediction
  • Identifies top gentrifying counties: Alameda, Contra Costa, San Francisco, Los Angeles

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/dhanyabhat16/Prediction-Model-for-Gentrification-in-California.git
cd Prediction-Model-for-Gentrification-in-California

# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn census requests scipy

Get a Census API Key

  1. Sign up at: https://api.census.gov/data/key_signup.html
  2. Receive your key instantly via email
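
For orientation, here is a minimal sketch of what a tract-level ACS 5-year request URL for California looks like. It only builds the URL and does not contact the API. The function name is hypothetical, and the variable codes (B25064_001E for median gross rent, B19013_001E for median household income) are examples; the project's data_collection.py may request a different set or use the `census` package instead.

```python
from urllib.parse import urlencode

CALIFORNIA_FIPS = "06"  # state FIPS code for California


def build_acs5_tract_url(year: int, variables: list[str], api_key: str) -> str:
    """Build an ACS 5-year request URL for every census tract in California."""
    query = [
        ("get", ",".join(["NAME", *variables])),
        ("for", "tract:*"),
        ("in", f"state:{CALIFORNIA_FIPS}"),
        ("in", "county:*"),
        ("key", api_key),
    ]
    # Keep ':', '*' and ',' unescaped so the URL stays readable.
    return f"https://api.census.gov/data/{year}/acs/acs5?" + urlencode(query, safe=":*,")


url = build_acs5_tract_url(2022, ["B25064_001E", "B19013_001E"], "YOUR_KEY")
print(url)
```

Fetching the URL (for example with `requests.get(url).json()`) returns a list of rows whose first row holds the column names.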

Run the Pipeline

# Use the datasets in the data folder directly, or regenerate them with data_collection.py
python data_collection.py       # Collect census data
python analysis_modeling.py     # Train ML models
python ca_gentrify_viz.py       # Generate visualizations

🔬 Methodology

Gentrification Definition

A census tract is classified as gentrified if it meets all of these criteria:

  • Was lower-income at baseline (bottom 40% of tracts by income)
  • Experienced high rent increases (top third of tracts by rent change)
  • Shows significant increases in education levels OR home values
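
The three criteria can be sketched as boolean masks combined with AND/OR. This is an illustration under stated assumptions, not the project's actual code: apart from `baseline_median_income`, the column names are hypothetical, and "significant increase" is interpreted here as "above the median change", which the README does not specify.

```python
import pandas as pd


def classify_gentrified(df: pd.DataFrame) -> pd.Series:
    """Return a boolean Series flagging tracts that meet all three criteria."""
    # 1. Lower-income at baseline: at or below the 40th percentile of income.
    low_income = df["baseline_median_income"] <= df["baseline_median_income"].quantile(0.40)
    # 2. High rent increase: roughly the top third of rent change.
    high_rent = df["rent_change_pct"] >= df["rent_change_pct"].quantile(0.67)
    # 3. Education OR home value rose "significantly" (assumed: above the median change).
    edu_up = df["education_change"] > df["education_change"].median()
    value_up = df["home_value_change_pct"] > df["home_value_change_pct"].median()
    return low_income & high_rent & (edu_up | value_up)


tracts = pd.DataFrame(
    {
        "baseline_median_income": [30, 40, 50, 60, 70],
        "rent_change_pct": [50, 10, 60, 20, 30],
        "education_change": [5, 1, 6, 2, 3],
        "home_value_change_pct": [0, 0, 0, 0, 0],
    }
)
tracts["gentrified"] = classify_gentrified(tracts)
print(tracts)
```

Only the first tract is flagged: the third has an even larger rent increase but was not lower-income at baseline.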

Data Sources

U.S. Census Bureau - American Community Survey (ACS) 5-Year Estimates

| Category | Variables |
|---|---|
| Demographics | Population, race/ethnicity, age |
| Economics | Median income, per capita income, poverty rates, employment |
| Housing | Median rent, median home value, vacancy rates, owner/renter ratio |
| Education | Bachelor's degree attainment and higher |

Machine Learning Models

Regression (Predicting Rent Change %):

  • Linear Regression
  • Ridge Regression
  • Lasso Regression
  • Random Forest Regressor
  • Gradient Boosting Regressor

Classification (Predicting Gentrification Status):

  • Logistic Regression
  • Random Forest Classifier

📊 Sample Outputs

County Rankings

Visualizes top counties by gentrification rate and absolute number of gentrified tracts

Feature Importance

Shows which baseline characteristics best predict future gentrification
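
One common way to obtain such a ranking is the `feature_importances_` attribute of a fitted tree ensemble. The sketch below uses synthetic data in which only one feature drives the target, so the expected ranking is known in advance; whether the project computes importance this way is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = [
    "baseline_median_income",
    "baseline_median_rent",
    "baseline_pct_higher_ed",
    "baseline_poverty_rate",
]

# Synthetic data: the target depends only on baseline_median_income.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=feature_names)
y = 3.0 * X["baseline_median_income"] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances describe what the fitted model relies on, not causal drivers of neighborhood change.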

Correlation Analysis

Heatmap revealing relationships between demographic, economic, and housing variables

Exploratory Data Analysis

Distribution plots, scatter plots, and statistical summaries

🎓 Use Cases

  • Urban Planning: Identify neighborhoods at risk of displacement
  • Policy Analysis: Evaluate effects of housing policies
  • Academic Research: Study drivers of neighborhood change
  • Community Advocacy: Support affordable housing initiatives
  • Real Estate: Understand market trends and investment patterns

📚 Key Findings

  1. Income Paradox: Lower-income neighborhoods at baseline experience the highest rent increases
  2. Geographic Clustering: Gentrification spreads to adjacent neighborhoods (spillover effect)
  3. Education Factor: Areas with moderate education levels show the highest gentrification potential
  4. Bay Area Dominance: Alameda, Contra Costa, and Solano lead in gentrification rates
  5. Volume vs. Intensity: Los Angeles has the most gentrified tracts (177) but a lower rate (~7%) because of its size
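
The volume-versus-intensity distinction in finding 5 comes down to counting gentrified tracts per county versus dividing that count by the county's total tracts. A small sketch with made-up data follows; the `county` and `gentrified` column names and the function name are hypothetical.

```python
import pandas as pd


def rank_counties(tracts: pd.DataFrame) -> pd.DataFrame:
    """Summarize gentrified tract counts and rates per county, sorted by rate."""
    summary = tracts.groupby("county")["gentrified"].agg(
        gentrified_tracts="sum", total_tracts="count"
    )
    summary["gentrification_rate_pct"] = (
        summary["gentrified_tracts"] / summary["total_tracts"] * 100
    )
    return summary.sort_values("gentrification_rate_pct", ascending=False)


# Toy data: a large county with more gentrified tracts but a lower rate.
tracts = pd.DataFrame(
    {
        "county": ["Large"] * 20 + ["Small"] * 4,
        "gentrified": [True] * 2 + [False] * 18 + [True] + [False] * 3,
    }
)
ranking = rank_counties(tracts)
print(ranking)
```

Here the large county leads by count (2 tracts) while the small county leads by rate (25% versus 10%), which mirrors the pattern described for Los Angeles.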

⚙️ Configuration

Adjust Time Periods

python complete_pipeline.py --api_key YOUR_KEY --baseline 2010 --comparison 2020

Customize Features

Edit prepare_features() in analysis_modeling.py:

custom_features = [
    'baseline_median_income',
    'baseline_median_rent',
    'baseline_pct_higher_ed',
    'baseline_poverty_rate',
    'baseline_vacancy_rate'  # Add your own
]

Adjust Classification Criteria

python adjust_criteria.py  # Interactive tool for threshold adjustment

Fix "Only One Class" Error

If you encounter classification errors due to imbalanced data:

python adjust_criteria.py  # Automatically adjusts thresholds
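
Before fitting a classifier, it can help to confirm that the label column actually contains both classes. The sketch below is a generic pre-check with hypothetical helper names; it does not reproduce what adjust_criteria.py does internally.

```python
import numpy as np


def class_counts(labels) -> dict:
    """Count how many samples fall into each class label."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return {v.item(): int(c) for v, c in zip(values, counts)}


def can_fit_classifier(labels) -> bool:
    """A classifier such as logistic regression needs at least two classes."""
    return len(class_counts(labels)) >= 2


labels = [0, 0, 0, 1, 0]
print(class_counts(labels))
print(can_fit_classifier(labels))
print(can_fit_classifier([0, 0, 0]))
```

If only one class is present, the classification thresholds are too strict (or too loose) for the data and need to be adjusted before training.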

Generate Only Visualizations

python ca_gentrify_viz.py  # Assumes data already collected

📖 Documentation

Command-Line Options

complete_pipeline.py:

  • --api_key (required): Your Census API key
  • --baseline (default: 2012): Baseline year for analysis
  • --comparison (default: 2022): Comparison year for analysis
  • --skip_collection: Skip data collection, use existing data

Output Files

Data Files (./data/):

  • california_baseline.csv: Baseline year census tract data
  • california_comparison.csv: Comparison year census tract data
  • california_gentrification_metrics.csv: Calculated metrics and classifications

Visualization Files (./output/):

  • eda_overview.png: Exploratory data analysis charts
  • scatter_analysis.png: Variable relationship scatter plots
  • correlation_heatmap.png: Correlation matrix
  • feature_importance.png: ML model feature importance
  • predictions_*.png: Model prediction vs. actual plots
  • county_rankings.png: County comparison charts
  • distribution_maps.png: Distribution visualizations by county

Report Files (./output/):

  • analysis_report.txt: Comprehensive analysis report with model metrics
  • county_report.txt: Detailed county-level statistics

⚠️ Limitations

  • Data Lag: Census data has a 1-2 year publication delay
  • 5-Year Estimates: ACS data represents 5-year averages, smoothing annual variation
  • Correlation ≠ Causation: Models identify patterns but don't prove causal relationships
  • R² Values: Typical R² of 0.30-0.45 reflects the complexity of human behavior (standard for social science)
  • Missing Variables: The models cannot capture policy changes, investment decisions, cultural trends, or transit development
  • Small Sample Bias: Small census tracts may have unreliable estimates
  • Geographic Scope: Currently limited to California (extensible to other states)

🐛 Troubleshooting

Common Issues

"This solver needs samples of at least 2 classes"

# Solution: Use the adjustment tool
python adjust_criteria.py

"Error: Could not collect data"

  • Check that your Census API key is correct
  • Verify internet connection
  • Try different years (some years may not be available yet)

"Module not found"

pip install pandas numpy matplotlib seaborn scikit-learn census requests scipy

"File not found"

  • Ensure you run data_collection.py first
  • Check that ./data/ and ./output/ folders exist

📄 License

MIT License

Copyright (c) 2024

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

⭐ Star This Project

If you find this useful for your research or work, please consider giving it a star! It helps others discover the project.


📊 Project Statistics

  • Lines of Code: ~2,500
  • Census Tracts Analyzed: 8,000+
  • Counties Covered: 58
  • Variables Tracked: 25+
  • ML Models Implemented: 7
  • Visualizations Generated: 10+

Built with 🏙️ for better understanding of urban change in California

Note: This project is for research and educational purposes. Policy decisions should incorporate additional community input, qualitative research, and local context.

