Predicting neighborhood change across California using machine learning and U.S. Census data
A comprehensive machine learning pipeline that analyzes and predicts gentrification patterns across 8,000+ census tracts in California. This project uses American Community Survey (ACS) data to identify neighborhoods at risk of gentrification and quantify the factors driving neighborhood change.
Key Features:
- Automated data collection from the U.S. Census Bureau API
- Multiple ML models (Linear, Ridge, Lasso, Random Forest, Gradient Boosting)
- County-level analysis across all 58 California counties
- Rich visualizations and statistical analysis
- Predicts both rent change (regression) and gentrification status (classification)
This system analyzes how neighborhoods change over time by:
- Collecting census data for two time periods (e.g., 2012 vs. 2022)
- Calculating gentrification metrics including:
  - Rent and home value changes
  - Income shifts
  - Educational attainment increases
  - Demographic transitions
- Building predictive models that use baseline neighborhood characteristics to forecast future change
- Generating insights through visualizations, county rankings, and detailed reports
- Analyzes 8,000+ census tracts across California
- Achieves R² of 0.20-0.30 for rent change prediction
- Identifies top gentrifying counties: Alameda, Contra Costa, San Francisco, Los Angeles
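The metric-calculation step described above can be illustrated with a short pandas sketch. It assumes two tract-level tables keyed by a hypothetical `GEOID` column, with hypothetical `median_rent` and `median_income` columns; the names and the handling of zero baselines are assumptions for illustration and may differ from what the project's scripts do.

```python
import pandas as pd


def percent_change(baseline: pd.Series, comparison: pd.Series) -> pd.Series:
    """Percent change from baseline to comparison.

    Non-positive or missing baselines yield NaN instead of a division error.
    """
    safe_baseline = baseline.where(baseline > 0)  # invalid baselines become NaN
    return (comparison - safe_baseline) / safe_baseline * 100


def compute_change_metrics(baseline_df, comparison_df, columns, key="GEOID"):
    """Join the two periods on the tract key and add one *_pct_change column per variable."""
    merged = baseline_df.merge(comparison_df, on=key, suffixes=("_base", "_comp"))
    out = merged[[key]].copy()
    for col in columns:
        out[f"{col}_pct_change"] = percent_change(
            merged[f"{col}_base"], merged[f"{col}_comp"]
        )
    return out


# Made-up values for two tracts (hypothetical column names, illustrative numbers only)
baseline = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "median_rent": [1000, 1500],
    "median_income": [50000, 80000],
})
comparison = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "median_rent": [1500, 1800],
    "median_income": [60000, 80000],
})

metrics = compute_change_metrics(baseline, comparison, ["median_rent", "median_income"])
print(metrics)
```

Keeping the tract identifier as a string preserves the leading zero in California's state FIPS code, which matters when the two periods are joined.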
# Clone the repository
git clone https://github.com/dhanyabhat16/california-gentrification.git
cd california-gentrification
# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn census requests scipy

- Sign up at: https://api.census.gov/data/key_signup.html
- Receive your key instantly via email
# Use the datasets in the data folder directly, or regenerate them with data_collection.py
python data_collection.py # Collect census data
python analysis_modeling.py # Train ML models
python ca_gentrify_viz.py # Generate visualizations

A census tract is classified as gentrified if it meets these criteria:
- Was lower-income at baseline (bottom 40% of tracts by income)
- Experienced high rent increases (top third of tracts by rent change)
- Shows significant increases in education levels OR home values
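As a rough illustration, the three criteria can be combined into a boolean rule with pandas. This is a sketch under stated assumptions, not the project's actual code: the column names are hypothetical, and because the text does not define a "significant" increase, the sketch treats it as an above-median change.

```python
import pandas as pd


def classify_gentrified(df, income_q=0.40, rent_q=2 / 3, growth_q=0.50):
    """Flag tracts that satisfy all three criteria.

    Column names are hypothetical. growth_q is an assumed stand-in for
    "significant increase" (here: above the median change).
    """
    # Criterion 1: lower-income at baseline (bottom 40% of tracts)
    low_income = df["baseline_median_income"] <= df["baseline_median_income"].quantile(income_q)
    # Criterion 2: rent increase in the top third of tracts
    high_rent = df["rent_pct_change"] >= df["rent_pct_change"].quantile(rent_q)
    # Criterion 3: notable increase in education levels OR home values
    edu_up = df["education_change"] > df["education_change"].quantile(growth_q)
    value_up = df["home_value_pct_change"] > df["home_value_pct_change"].quantile(growth_q)
    return low_income & high_rent & (edu_up | value_up)


# Five made-up tracts for illustration
tracts = pd.DataFrame({
    "baseline_median_income": [30000, 40000, 60000, 80000, 100000],
    "rent_pct_change": [40, 10, 45, 20, 5],
    "education_change": [5, 1, 6, 2, 0],
    "home_value_pct_change": [10, 50, 5, 20, 30],
})
tracts["gentrified"] = classify_gentrified(tracts)
print(tracts[["baseline_median_income", "rent_pct_change", "gentrified"]])
```

Only the first tract is flagged: the third has an even larger rent increase but was not lower-income at baseline, so it fails the first criterion.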
U.S. Census Bureau - American Community Survey (ACS) 5-Year Estimates
| Category | Variables |
|---|---|
| Demographics | Population, race/ethnicity, age |
| Economics | Median income, per capita income, poverty rates, employment |
| Housing | Median rent, median home value, vacancy rates, owner/renter ratio |
| Education | Bachelor's degree attainment and higher |
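For orientation, a tract-level ACS 5-Year request for a few variables of this kind might be built as follows. The variable codes shown are standard ACS table identifiers chosen for illustration; which variables `data_collection.py` actually requests is not stated here, so treat the selection and the helper names as assumptions. The sketch only builds the URL and parses a response body, so it runs without a network connection.

```python
from urllib.parse import urlencode

ACS5_ENDPOINT = "https://api.census.gov/data/{year}/acs/acs5"
CALIFORNIA_FIPS = "06"

# Illustrative selection of ACS variables (an assumption, not the project's list)
EXAMPLE_VARIABLES = {
    "B01003_001E": "total_population",
    "B19013_001E": "median_household_income",
    "B25064_001E": "median_gross_rent",
    "B25077_001E": "median_home_value",
}


def build_tract_query(year, variables, api_key, state_fips=CALIFORNIA_FIPS):
    """Return a URL requesting the given variables for every tract in one state."""
    params = [
        ("get", ",".join(["NAME", *variables])),
        ("for", "tract:*"),
        ("in", f"state:{state_fips}"),
        ("in", "county:*"),
        ("key", api_key),
    ]
    # Keep ':', '*' and ',' unescaped so the URL stays readable
    return ACS5_ENDPOINT.format(year=year) + "?" + urlencode(params, safe=":*,")


def rows_to_records(rows):
    """Convert the API's list-of-lists JSON (header row first) into dicts."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


url = build_tract_query(2022, EXAMPLE_VARIABLES, api_key="YOUR_KEY")
print(url)
```

The response can then be fetched with `requests.get(url).json()` and passed to `rows_to_records`; values arrive as strings and need converting to numbers before analysis.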
Regression (Predicting Rent Change %):
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- Gradient Boosting Regressor
Classification (Predicting Gentrification Status):
- Logistic Regression
- Random Forest Classifier
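A comparison of the five regressors might look like the following scikit-learn sketch. It runs on synthetic data with mostly default hyperparameters and scores each model by 5-fold cross-validated R²; the hyperparameters, the scaling step, and the evaluation scheme are assumptions for illustration and may not match `analysis_modeling.py`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_regressors(random_state=42):
    """Return the five regressors; linear models are scaled so penalties treat features evenly."""
    return {
        "Linear": make_pipeline(StandardScaler(), LinearRegression()),
        "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
        "Random Forest": RandomForestRegressor(n_estimators=50, random_state=random_state),
        "Gradient Boosting": GradientBoostingRegressor(random_state=random_state),
    }


def compare_regressors(X, y, cv=5):
    """Mean cross-validated R² for each model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        for name, model in build_regressors().items()
    }


# Synthetic stand-in for baseline features (X) and rent change (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

scores = compare_regressors(X, y)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean CV R² = {score:.2f}")
```

Cross-validated scores are used here because a single train/test split can be noisy on a few thousand tracts; the synthetic data is far cleaner than census data, so the scores it produces say nothing about real-world performance.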
Visualizes top counties by gentrification rate and absolute number of gentrified tracts
Shows which baseline characteristics best predict future gentrification
Heatmap revealing relationships between demographic, economic, and housing variables
Distribution plots, scatter plots, and statistical summaries
- Urban Planning: Identify neighborhoods at risk of displacement
- Policy Analysis: Evaluate effects of housing policies
- Academic Research: Study drivers of neighborhood change
- Community Advocacy: Support affordable housing initiatives
- Real Estate: Understand market trends and investment patterns
- Income Paradox: Lower-income neighborhoods at baseline experience the highest rent increases
- Geographic Clustering: Gentrification spreads to adjacent neighborhoods (spillover effect)
- Education Factor: Areas with moderate education levels show highest gentrification potential
- Bay Area Dominance: Alameda, Contra Costa and Solano lead in gentrification rates
- Volume vs. Intensity: Los Angeles has the most gentrified tracts (177) but a lower rate (~7%) because of its size
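The volume-versus-intensity distinction comes down to ranking counties by two different quantities. A minimal pandas sketch, assuming a tract-level table with hypothetical `county` and `gentrified` columns and made-up data:

```python
import pandas as pd


def rank_counties(tracts, county_col="county", flag_col="gentrified"):
    """Summarize each county by gentrified tract count (volume) and rate (intensity)."""
    summary = tracts.groupby(county_col)[flag_col].agg(
        total_tracts="size",
        gentrified_tracts="sum",
    )
    summary["gentrification_rate_pct"] = (
        100 * summary["gentrified_tracts"] / summary["total_tracts"]
    )
    return summary.sort_values("gentrification_rate_pct", ascending=False)


# Made-up data: a large county with more gentrified tracts but a lower rate
tracts = pd.DataFrame({
    "county": ["Small County"] * 10 + ["Large County"] * 40,
    "gentrified": [True] * 3 + [False] * 7 + [True] * 4 + [False] * 36,
})

ranking = rank_counties(tracts)
print(ranking)
```

Sorting by `gentrified_tracts` instead of `gentrification_rate_pct` reverses the order in this example, which is the same effect the Los Angeles figures illustrate.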
python complete_pipeline.py --api_key YOUR_KEY --baseline 2010 --comparison 2020

Edit prepare_features() in analysis_modeling.py:

custom_features = [
    'baseline_median_income',
    'baseline_median_rent',
    'baseline_pct_higher_ed',
    'baseline_poverty_rate',
    'baseline_vacancy_rate',  # Add your own
]

python adjust_criteria.py # Interactive tool for threshold adjustment

If you encounter classification errors due to imbalanced data:

python adjust_criteria.py # Automatically adjusts thresholds

python ca_gentrify_viz.py # Assumes data already collected

complete_pipeline.py:
- --api_key (required): Your Census API key
- --baseline (default: 2012): Baseline year for analysis
- --comparison (default: 2022): Comparison year for analysis
- --skip_collection: Skip data collection, use existing data
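For readers adapting the pipeline, these options map naturally onto `argparse`. The sketch below mirrors the documented flags and defaults; the year-order check and the function names are assumptions added for illustration and are not confirmed to exist in `complete_pipeline.py`.

```python
import argparse


def build_parser():
    """Parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="complete_pipeline.py",
        description="Collect census data, train models, and generate visualizations.",
    )
    parser.add_argument("--api_key", required=True, help="Your Census API key")
    parser.add_argument("--baseline", type=int, default=2012,
                        help="Baseline year for analysis")
    parser.add_argument("--comparison", type=int, default=2022,
                        help="Comparison year for analysis")
    parser.add_argument("--skip_collection", action="store_true",
                        help="Skip data collection, use existing data")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed sanity check (not documented): comparison must follow baseline
    if args.comparison <= args.baseline:
        parser.error("--comparison must be later than --baseline")
    return args


args = parse_args(["--api_key", "YOUR_KEY", "--baseline", "2010", "--comparison", "2020"])
print(args)
```

Passing the argument list explicitly, as in the last lines, makes the parser easy to exercise in tests; calling `parse_args()` with no arguments reads from the real command line.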
Data Files (./data/):
- california_baseline.csv: Baseline year census tract data
- california_comparison.csv: Comparison year census tract data
- california_gentrification_metrics.csv: Calculated metrics and classifications
Visualization Files (./output/):
- eda_overview.png: Exploratory data analysis charts
- scatter_analysis.png: Variable relationship scatter plots
- correlation_heatmap.png: Correlation matrix
- feature_importance.png: ML model feature importance
- predictions_*.png: Model prediction vs. actual plots
- county_rankings.png: County comparison charts
- distribution_maps.png: Distribution visualizations by county
Report Files (./output/):
- analysis_report.txt: Comprehensive analysis report with model metrics
- county_report.txt: Detailed county-level statistics
- Data Lag: Census data has 1-2 year publication delay
- 5-Year Estimates: ACS data represents 5-year averages, smoothing annual variation
- Correlation ≠ Causation: Models identify patterns but don't prove causal relationships
- R² Values: Typical R² of 0.30-0.45 reflects complexity of human behavior (standard for social science)
- Missing Variables: Unable to capture policy changes, investment decisions, cultural trends, transit development
- Small Sample Bias: Small census tracts may have unreliable estimates
- Geographic Scope: Currently limited to California (extensible to other states)
"This solver needs samples of at least 2 classes"
# Solution: Use the adjustment tool
python adjust_criteria.py

"Error: Could not collect data"
- Check your Census API key is correct
- Verify internet connection
- Try different years (some years may not be available yet)
"Module not found"
pip install pandas numpy matplotlib seaborn scikit-learn census requests scipy

"File not found"
- Ensure you run data_collection.py first
- Check that ./data/ and ./output/ folders exist
MIT License
Copyright (c) 2024
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you find this useful for your research or work, please consider giving it a star! It helps others discover the project.
- Lines of Code: ~2,500+
- Census Tracts Analyzed: 8,000+
- Counties Covered: 58
- Variables Tracked: 25+
- ML Models Implemented: 7
- Visualizations Generated: 10+
Built for a better understanding of urban change in California
Note: This project is for research and educational purposes. Policy decisions should incorporate additional community input, qualitative research, and local context.