About

In this project, we identified areas in New York City (NYC) that are underserved by public transit. We wanted to find areas of opportunity, defined as currently underserved areas where we predict high public transportation usage. These are places where the MTA could build a transit station that we expect would attract substantial foot traffic and ticket sales.

To solve this problem, we computed the average distance to a transit station from points all over NYC. We then assessed whether each area is worth investing in (would a transit station here be profitable?) by looking at predicted ridership. Each subway ride in NYC costs $2.90 per person (https://www.mta.info/fares-tolls/subway-bus), so we estimate weekly revenue from the predicted weekly ridership.

Our data science approach not only identifies underserved areas but also predicts ridership for them. The resulting decision-making platform lets the MTA evaluate potential locations for transit stops and stations more readily than manually sorting through every option.

Results

Identifying Underserved Areas

To identify underserved areas, we used the average distance to a subway station as a proxy for accessibility. We calculated the average distance to a subway station for points all over NYC, and then we plotted the results. The areas with the highest average distance to a subway station are the most underserved areas.
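As a minimal sketch of this accessibility proxy, the snippet below computes the great-circle (haversine) distance from a point to each station and averages the distances to the k nearest ones. The function names, the choice of haversine distance, the k parameter, and the coordinates are illustrative assumptions; the project's own code may define "average distance" differently.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_distance_to_nearest_stations(point, stations, k=3):
    """Average distance (km) from `point` to its k nearest stations.

    `point` is a (lat, lon) tuple and `stations` is a list of such tuples.
    Averaging over the k nearest stations is an assumption of this sketch.
    """
    if not stations:
        raise ValueError("at least one station is required")
    distances = sorted(
        haversine_km(point[0], point[1], s[0], s[1]) for s in stations
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Placeholder coordinates for illustration only, not real station locations.
example_stations = [(40.75, -73.99), (40.76, -73.98), (40.70, -74.01)]
example_score = mean_distance_to_nearest_stations((40.73, -73.95), example_stations)
```

Points with a larger score are farther from their nearest stations and would rank as more underserved under this proxy.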

[Maps: Ground Truth Ridership | Predicted Ridership]

We next built a model to predict weekly ridership from demographic data such as median income and population density. This data came from NYC census data, so predictions are made per census tract. We tried several models; the best-performing one was a Random Forest Regressor, with an R^2 of 0.675 (on 10-fold cross-validation).
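To make the evaluation metric concrete, the sketch below shows how an R^2 score can be averaged over k folds. It deliberately swaps in a one-feature least-squares line as a stand-in model so that it runs with the standard library alone; it is not the Random Forest Regressor used in the project, and the data in the usage note is synthetic.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_r_squared(xs, ys, k=10):
    """Mean R^2 over k folds; sample i goes to fold i % k.

    Assumes every fold holds at least two samples with differing targets,
    otherwise SS_tot is zero and R^2 is undefined.
    """
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        predictions = [slope * xs[i] + intercept for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], predictions))
    return sum(scores) / k
```

For example, calling k_fold_r_squared on synthetic inputs such as xs = list(range(40)) and ys = [2 * x + 1 for x in xs] scores a perfectly linear relationship. With real tract-level features, the stand-in line would be replaced by the actual regressor.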

The two maps above show the ground truth ridership and the predicted ridership. There are a few notable differences. In particular, some areas have a single stop with high ridership; because the values are aggregated over the entire census tract, the ground truth ridership is much higher than the prediction for those areas.

We also computed the predicted revenue for each census tract based on the predicted ridership. The map above shows the predicted revenue for each census tract.
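The revenue step can be sketched as a single multiplication of predicted weekly ridership by the $2.90 fare. The function name and the tract identifiers below are hypothetical, and the sketch assumes every ride pays the full fare.

```python
FARE_USD = 2.90  # subway fare per ride, as stated on the MTA fares page


def predicted_weekly_revenue(predicted_weekly_ridership):
    """Map each census tract to predicted weekly revenue in USD.

    `predicted_weekly_ridership` maps a tract identifier to predicted rides
    per week. Assumes every ride pays the full fare (a simplification).
    """
    return {
        tract: round(rides * FARE_USD, 2)
        for tract, rides in predicted_weekly_ridership.items()
    }


# Hypothetical tract identifiers and ridership values, for illustration only.
example_revenue = predicted_weekly_revenue({"tract_a": 1000, "tract_b": 250})
```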

Our final result is a map of the areas of opportunity, which is the product of the average distance to a subway station and the predicted revenue (both normalized to a range of 0-1). The areas with the highest values are the areas that are most underserved and have the highest predicted revenue.
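The scoring described above can be sketched as follows. Min-max scaling is assumed here as the way to reach the 0-1 range, since the text does not name the normalization method; the function names and tract identifiers are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a {tract: value} mapping to the range 0-1 (assumed method)."""
    lowest, highest = min(values.values()), max(values.values())
    if highest == lowest:
        # All values are identical, so there is no spread to rescale.
        return {tract: 0.0 for tract in values}
    return {
        tract: (value - lowest) / (highest - lowest)
        for tract, value in values.items()
    }


def opportunity_scores(avg_distance, predicted_revenue):
    """Product of normalized distance and normalized revenue per tract.

    Only tracts present in both inputs receive a score.
    """
    distance_norm = min_max_normalize(avg_distance)
    revenue_norm = min_max_normalize(predicted_revenue)
    shared = distance_norm.keys() & revenue_norm.keys()
    return {tract: distance_norm[tract] * revenue_norm[tract] for tract in shared}


# Hypothetical inputs, for illustration only.
example_scores = opportunity_scores(
    {"tract_a": 0.0, "tract_b": 1.0, "tract_c": 2.0},
    {"tract_a": 10.0, "tract_b": 20.0, "tract_c": 30.0},
)
```

Because the score is a product, a tract ranks highly only when it is both far from existing stations and expected to generate high revenue; a tract at the minimum of either measure scores zero.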

We discuss these results further in our report and presentation, linked below.

Report: https://docs.google.com/document/d/1vnjbzVh_RYYclWAOj0efJ_VCKNdW5mEic235lsq59Hc/edit?usp=sharing

Slides: https://docs.google.com/presentation/d/1bggkTmrDGwmAPLfzoR553Cgrc8w8DcQyCdhVLl1KfVM/edit?usp=sharing

Datasets

NYC Census Data

https://data.cccnewyork.org/data/download#0,8,10/66,97

Note: FIPS codes identify locations.

Critical Note: The dataset contains aggregate rows. There is a row for NYC as a whole, as well as rows for each borough and for each community district within each borough.

For median incomes, for example, you will want to filter on Household Type, since each location has multiple values, one per type of household. Similar issues exist for total population by age group and ethnicity.
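One way to handle both issues is to filter rows before use. The sketch below is a guess at the approach, not the project's code: the column names ("Location", "Household Type", "Data"), the household type label, and the row values are all assumptions made up for illustration and should be checked against the downloaded file.

```python
def select_rows(rows, household_type, aggregate_locations):
    """Keep rows for one household type and drop aggregate locations.

    `rows` is a list of dicts (for example from csv.DictReader).
    `aggregate_locations` is a set of location names to exclude, such as
    the citywide and borough-level rows.
    """
    return [
        row
        for row in rows
        if row["Household Type"] == household_type
        and row["Location"] not in aggregate_locations
    ]


# Made-up placeholder rows; column names and values are assumptions.
example_rows = [
    {"Location": "New York City", "Household Type": "All Households", "Data": "0"},
    {"Location": "District X", "Household Type": "All Households", "Data": "0"},
    {"Location": "District X", "Household Type": "Families", "Data": "0"},
]
example_selected = select_rows(
    example_rows, "All Households", {"New York City"}
)
```

After filtering, each remaining location should contribute exactly one row for the chosen household type, which avoids double counting when joining against other tables.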

Development

This project uses the uv Python package manager to install its dependencies.

Use uv sync to install the dependencies and create the virtual environment.

You can select the Python interpreter from .venv in VS Code, and then run the code in the terminal.

Use uv add <package> to add a new package to the project. This is preferred to using pip directly.

Usage

Use uv run python main.py to start the project. This executes the main.py file in the current directory.
