A data-driven decision-support system for exploring and ranking commercial locations in New York City using urban data and machine learning.
This project integrates NYC Open Data (pedestrian counts, subway stations, storefront vacancy filings) and NYC Public Neighborhood Profiles–style community statistics, aggregated to CDTA boundaries, to model neighborhood-level commercial environments.
streamlit run app.py opens K-Selection / clustering (app.py). The Ranking UI is pages/Ranking.py: hard SQL filters, then α·semantic + β·competitive penalty — MinMax on the filtered rows for [cosine similarity, −log1p(count/(avg_pedestrian+1))] (overall filings or a chosen act_*_storefront column; same as /api/rank), with one α slider and a Claude panel that explains the fixed top 5 from that blend (/api/agent parity).
Inputs live under data/raw/ and are joined to CDTA 2020 polygons (NYC Planning boundaries: footprint, area_km2, spatial joins). run_pipeline.py wires paths to each source.
- Mobility — DOT bi-annual pedestrian counts → per-CDTA foot traffic (`avg_pedestrian`, `peak_pedestrian`, etc.); MTA/NYS subway station points → `subway_station_count`, `subway_density_per_km2`, `transit_activity_score` inputs.
- Community and economic context — Comptroller Neighborhood Economic Profiles (ACS-style jobs, demographics, income, education, commute); Neighborhood Financial Health indicators → `nfh_*` when that feed is present.
- Commerce and public safety — City storefront vacancy / activity filings → `storefront_*`, `act_*_storefront`, category mix, `competitive_score`, `commercial_activity_score`; NYPD shooting points → per-CDTA shooting totals in the feature table.
The pipeline writes data/processed/neighborhood_features_final.csv (one row per CDTA) for clustering, ranking, and embeddings. Exact filenames, layout, and where to download: see data/raw/README.MD and the data/raw/ tree.
- Raw data in `data/raw/` (CSVs + CDTA shapefile under `nyc_boundaries/`). See `data/raw/README.MD`.
- `python run_pipeline.py` — `src/data_processing.py` cleans sources (including the Neighborhood Financial Health / NFH CSV merged into `nbhd_clean`); `src/feature_engineering.py` reads raw storefront filings (path configured in `run_pipeline.py`), spatially aggregates storefront counts by CDTA and primary business activity, merges MOCEJ-style neighborhood profiles and `nfh_*` columns on a normalized Community District key, then imputes remaining gaps in those merged numeric columns with borough median, then citywide median (dashboard-friendly proxy where a CDTA does not match a single profile row). `commercial_activity_score = log1p(storefront_filing_count × avg_pedestrian)` and `transit_activity_score = log1p(subway_station_count × avg_pedestrian)`, computed after filling missing storefront/subway/pedestrian inputs (inner product clipped at 0 before `log1p`) so scores are not stuck at zero from ordering alone and heavy tails are compressed for hard-filter sliders; see the sketch after this list. (Soft ranking on the dashboard uses semantic + competitive — see Ranking section — not a MinMax of `commercial_activity_score`.) Output: `data/processed/neighborhood_features_final.csv`. A healthy run ends with no missing values in that table; if any column still has NaN, investigate before shipping.
- Embeddings (for the app) — `python -m src.embeddings` builds embeddings from neighborhood text profiles (including every non-zero `act_*_storefront` business-activity count, population proxies, and NFH fields where present — see `data/processed/README.md`); caches under `outputs/embeddings/`. Default (`EMBEDDING_BACKEND` unset or `auto`): OpenAI `text-embedding-3-small` if `OPENAI_API_KEY` is set, else local sentence-transformers. `EMBEDDING_BACKEND=openai` uses OpenAI when a key is present, otherwise falls back like auto. `EMBEDDING_BACKEND=sentence_transformers` forces local only. Use `--force` after changing features or profile text so embeddings match the CSV.
- `streamlit run app.py` — Home = K-Selection / clustering (`app.py`); Ranking is `pages/Ranking.py` (hard filters, α·semantic + β·competitive blend, map, Claude). Loads the feature table (cached; Rerun or Clear cache after regenerating the CSV). The Next.js app on Vercel calls the same `/api/cluster`, `/api/filter`, `/api/rank`, and `/api/agent` endpoints when `NEXT_PUBLIC_API_URL` points at the FastAPI backend.
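The two activity scores reduce to a few lines of pandas. A minimal sketch, assuming the column names from the feature table above (the actual implementation lives in `src/feature_engineering.py` and may differ in detail):

```python
import numpy as np
import pandas as pd

def add_activity_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only — mirrors the formulas described above."""
    out = df.copy()
    # Fill missing inputs first so a NaN on either side does not zero the score.
    for col in ["storefront_filing_count", "subway_station_count", "avg_pedestrian"]:
        out[col] = out[col].fillna(0)
    # log1p compresses heavy tails; clip(lower=0) guards against negative products.
    out["commercial_activity_score"] = np.log1p(
        (out["storefront_filing_count"] * out["avg_pedestrian"]).clip(lower=0)
    )
    out["transit_activity_score"] = np.log1p(
        (out["subway_station_count"] * out["avg_pedestrian"]).clip(lower=0)
    )
    return out
```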
The ranking dashboard reads data/processed/neighborhood_features_final.csv (cached).
Sidebar controls set thresholds on:
- Borough (multiselect)
- Minimum `subway_station_count`, `avg_pedestrian`, `storefront_density_per_km2`, `storefront_filing_count`, `commercial_activity_score`
- Maximum `competitive_score` (competition pressure — same column used in the `competitive_score` hard filter on `/api/filter` / `/api/rank`)
- Maximum shooting-incident count (column name may be `shooting_incident_count` or `shooting_incident_count_2024` depending on the pipeline export)
- NFH thresholds (`nfh_overall_score` / `nfh_goal4_fin_shocks_score`): minimum thresholds via sidebar sliders (shown only when those columns exist and have values)
These are applied with DuckDB using parameterized WHERE clauses (same binding idea as the FastAPI backend): the full table is registered as nbhd, a SELECT … WHERE … runs, and rows are ordered by commercial_activity_score DESC for the hard-filter preview table. The main area shows a table of surviving neighborhoods (key columns). View generated SQL expands to show the exact query. An expander (About zeros, nulls, and refreshing data) documents imputation, score formulas, and when zeros are expected.
If no rows match, the app stops with a warning.
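A minimal sketch of the DuckDB binding pattern described above. The threshold columns match the feature table; the `borough` column name and the exact SQL the app builds are assumptions for illustration:

```python
import duckdb
import pandas as pd

df = pd.read_csv("data/processed/neighborhood_features_final.csv")

con = duckdb.connect()
con.register("nbhd", df)  # expose the DataFrame to SQL as a table named nbhd

# Parameterized WHERE clause: values are bound, never interpolated into the SQL string.
query = """
    SELECT *
    FROM nbhd
    WHERE borough IN (?, ?)
      AND subway_station_count >= ?
      AND competitive_score <= ?
    ORDER BY commercial_activity_score DESC
"""
surviving = con.execute(query, ["Manhattan", "Brooklyn", 3, 0.5]).df()
```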
- User enters a free-text query (ideal area description).
- One blend slider sets α ∈ [0, 1] for semantic similarity (cosine similarity after MinMax on the filtered set). β = 1 − α applies to a competitive-pressure signal derived from `log1p(count / (avg_pedestrian + 1))`, where `count` is either total `storefront_filing_count` or a selected `act_*_storefront` column (same behavior as the Next.js app calling `/api/rank`). Higher competition penalizes rank, so the second axis uses MinMax(−competitive) before blending. No second slider; α + β = 1 by construction.
- Embeddings: query and neighborhoods use the active backend in `src/embeddings.py` — by default OpenAI `text-embedding-3-small` when `OPENAI_API_KEY` is set, else local sentence-transformers (`all-MiniLM-L6-v2`); or whichever backend you force with `EMBEDDING_BACKEND`. Cosine similarity is computed on the filtered set (aligned by neighborhood name to the full embedding matrix).
- Build a matrix [cosine_sim, −competitive] for those rows and apply `sklearn.preprocessing.MinMaxScaler` (column-wise, 0–1 on the filtered set). With a single row, scaling falls back to a neutral mid-score to avoid degenerate MinMax. `blended_score = α·col0 + β·col1`. Sort by `blended_score` descending; a minimal sketch of this blend follows the list below. The table shows `semantic_similarity`, `specific_competitive_score` (the `log1p` competitive scalar used in the blend), and `blended_score`. (`commercial_activity_score` remains available for hard filters and the hard-filter table sort, not as the soft-ranking second axis.)
- Map (when embeddings succeed): a CDTA choropleth colors polygons by `blended_score` (sequential greens; requires `data/raw/nyc_boundaries/nycdta2020.shp`).
If embeddings are missing or the API key is unset, this block shows a warning (pre-generate embeddings with python -m src.embeddings; use --force after feature or profile text changes).
A button sends Claude the fixed top 5 neighborhoods from the soft ranker (same rules as /api/agent) plus the hard-filtered dataframe. The agent may call run_sql only for extra context on those five — it must not re-rank or replace them. Requires ANTHROPIC_API_KEY.
- K-means clustering — not used to order results in the app (it feeds cluster labels on the Ranking page only after you run K-Selection on the home page).
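A minimal sketch of the soft-ranking blend, assuming a cosine-similarity array already aligned with the filtered rows and the competitive scalar described above (illustrative, not the exact `pages/Ranking.py` code):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def blend_rank(filtered: pd.DataFrame, cosine_sim: np.ndarray, alpha: float,
               count_col: str = "storefront_filing_count") -> pd.DataFrame:
    out = filtered.copy()
    # Competitive-pressure scalar: filings per pedestrian, tail-compressed.
    competitive = np.log1p(out[count_col] / (out["avg_pedestrian"] + 1))
    # Negate competition so "less competition" scores higher on the second axis.
    matrix = np.column_stack([cosine_sim, -competitive])
    if len(out) > 1:
        scaled = MinMaxScaler().fit_transform(matrix)
    else:
        scaled = np.full_like(matrix, 0.5)  # single row: neutral mid-score
    out["semantic_similarity"] = cosine_sim
    out["specific_competitive_score"] = competitive
    out["blended_score"] = alpha * scaled[:, 0] + (1 - alpha) * scaled[:, 1]
    return out.sort_values("blended_score", ascending=False)
```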
A walkthrough of the Ranking page from a user's perspective. Open it from the sidebar after streamlit run app.py.
- Set hard filters in the sidebar. Pick boroughs and drag the sliders for the minimums above, plus max `competitive_score`, max shooting count (when present), and NFH minimums when shown. These translate into a parameterized SQL `WHERE` clause; expand View generated SQL to see the literal query. Surviving neighborhoods appear in the table at the top of the page. If no rows match, loosen a constraint until at least one neighborhood passes.
- Describe what you want in plain English. Type a free-text query into the soft-preferences box (see Example queries below for project-shaped prompts). The query is embedded with the same model used for neighborhood text profiles, then matched by cosine similarity against the filtered set.
- Tune the α slider (semantic ↔ competitive tradeoff). α controls how much the ranking trusts your text query versus the competitive-pressure axis (after MinMax with semantic cosine):
  - α = 1 → pure semantic match (text query only).
  - α = 0 → pure competitive axis (lower `log1p(count/(ped+1))` wins after MinMax with negation — same construction as `/api/rank`).
  - In between → MinMax-scaled blend of the two. β is automatically 1 − α.
- Read the output. The ranked table shows `semantic_similarity`, `specific_competitive_score`, and `blended_score` for each surviving neighborhood. Below it, a CDTA choropleth colors the map by `blended_score` so you can see geographic patterns at a glance.
- Run the AI analysis panel. Click the button to send the fixed top 5 from the soft ranker (same contract as `/api/agent`) plus the hard-filtered table to Claude. The agent may call read-only `SELECT` only for extra context and must not re-rank or replace those five neighborhoods. Requires `ANTHROPIC_API_KEY`.
| Piece | Role |
|---|---|
| K-means + K-Selection (`app.py`, home) | Exploratory: sweep k, charts, CDTA choropleth, rich cluster descriptions (titled archetype + percentile-based metric levels + activity-category profile + borough concentration & nearest text-profile matches); labels saved for Ranking. Does not define the rank order on the Ranking page. |
| Ranking (`pages/Ranking.py`) | Hard SQL filters → MinMax([semantic, −competitive]) → α·col0 + (1−α)·col1 (map + Claude explains fixed top 5 + cluster columns). |
Clusters are summarized in three layers (all generated by api/cluster_helpers.py:_cluster_rich_description, shared by Streamlit and FastAPI):
- Titled archetype chosen from centroid z-scores — e.g. Dense Mixed-Use Commercial Core, Transit-Oriented Commercial District, Employment-Heavy Business District, Stable Lower-Density Neighborhood Market, Lower-Density Local Commercial Area, Balanced Neighborhood Market. A special No Recorded Commercial Activity Zone / Mostly No-Activity / Special-Use Geography title fires when member CDTAs have zero non-vacant filings and zero commercial / competitive scores (so reviewers don't mistake parks, waterfront, or other special-use polygons for retail markets).
- Profile signals as percentile-band sentences ("high foot traffic around the 82nd percentile citywide", "moderate transit activity around the 54th percentile") computed against the master feature table — not raw z-scores.
- Granular storefront mix — top categories by per-CDTA density (e.g. "food service — 38% of filings, 91st pct density"), plus unusually elevated and comparatively sparse `act_*_density` categories at the centroid. Ends with borough concentration and the nearest text-profile matches from cached embeddings (cosine to cluster mean).
The legacy one-liner ("Above average on X; relatively lower on Y.") is kept as a fallback in _cluster_brief_description when embeddings or member data are unavailable.
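The percentile-band sentences in the second layer reduce to a rank of the centroid value against the master feature table. A hypothetical helper, not the actual `_cluster_rich_description` code — the band thresholds here are illustrative assumptions:

```python
import pandas as pd

def percentile_band(master: pd.DataFrame, column: str, centroid_value: float) -> str:
    """Describe a centroid value as a citywide percentile band (illustrative)."""
    pct = (master[column] < centroid_value).mean() * 100
    if pct >= 75:        # assumed cut-offs for "high" / "moderate" / "low"
        level = "high"
    elif pct >= 40:
        level = "moderate"
    else:
        level = "low"
    return f"{level} {column.replace('_', ' ')} around the {pct:.0f}th percentile citywide"
```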
The baseline dashboard relies on transparent filters plus scores from the feature table — no supervised model is required. If labeled outcomes are added later, supervised models (with proper validation to avoid geographic leakage) can sit alongside the same pipeline outputs.
| Area | Files | Notes |
|---|---|---|
| Data pipeline | `src/data_processing.py`, `src/feature_engineering.py`, `run_pipeline.py` | Produces `neighborhood_features_final.csv` |
| Embeddings | `src/embeddings.py` | Text profiles → `.npy` cache (OpenAI or sentence-transformers) |
| K-Selection / clustering (home) | `app.py` | Thin Streamlit page: K sweep, viz, CDTA map. Heavy logic lives in `streamlit_app/cluster_helpers.py` (which re-uses `api/cluster_helpers.py`). Elbow heuristics match `/api/cluster` (perpendicular-distance `elbow_k`, Δ² inertia `elbow_k_kneedle`); writes cluster labels + descriptions to session state |
| Ranking | `pages/Ranking.py` | Hard filters, MinMax [semantic, −competitive] (one α), map, Claude (fixed top 5), cluster join |
| FastAPI backend | `api/main.py` | Slim endpoint layer; all SQL building, ranking, clustering and description logic now lives in `api/rank_helpers.py`, `api/cluster_helpers.py`, and `api/formatting.py` (shared with Streamlit) |
| Shared cluster + rank helpers | `api/cluster_helpers.py`, `api/rank_helpers.py`, `api/formatting.py` | Single source of truth for elbow detection, rich cluster descriptions (`_cluster_rich_description`, `_cluster_title`, `_activity_category_profile`), DuckDB SQL building, and feature-name pretty-printing. Imported by both FastAPI and the Streamlit app |
| Streamlit-only adapters | `streamlit_app/cluster_helpers.py`, `streamlit_app/constants.py` | Thin wrappers over `api/cluster_helpers.py` plus Streamlit-specific constants (cluster palette, candidate/default feature lists, readable feature labels). Avoids duplicating logic between `app.py` and `api/main.py` |
| K-means (library) | `src/kmeans_numpy.py` | Used by `app.py`; not the ranking sort key |
| Agent | `src/agent.py` | Claude + DuckDB SELECT tools |
| Scripts | `scripts/load_supabase.py`, `scripts/build_cdta_geojson.py` | Load neighborhoods into Supabase; build `cdta_geo.json` for the API without geopandas on Railway |
- `data/`: raw and processed datasets (see `data/*/README*`)
- `src/`: core logic (pipeline, features, embeddings, agent)
- `api/`: FastAPI app + shared helpers (`cluster_helpers.py`, `rank_helpers.py`, `formatting.py`)
- `streamlit_app/`: Streamlit-side adapters and constants (re-uses `api/*` helpers)
- `pages/`: `Ranking.py` only (soft ranker; **K-Selection lives in `app.py`**)
- `frontend/`: Next.js 14 UI (deployed to Vercel)
- `scripts/`: Supabase load, CDTA GeoJSON build (`load_supabase.py`, `build_cdta_geojson.py`)
- `outputs/`: saved models, embeddings, figures, validation artifacts
- `tests/`: unit tests (`pytest tests/`) + manual evaluation pipeline
- `app.py`: Streamlit entry, the **K-Selection / clustering** home (`streamlit run app.py`)
| Document | Purpose |
|---|---|
| `README.md` (this file) | Setup, Streamlit behavior, API keys, troubleshooting |
| `data/raw/README.MD` | Where to obtain raw CSVs and CDTA shapefile; layout under `data/raw/` |
| `data/processed/README.md` | Processed CSVs, final feature columns, app + embedding pipeline |
```
uv venv
source .venv/bin/activate                 # Windows: .venv\Scripts\activate
uv pip install -r requirements-dev.txt    # local: includes Streamlit, GIS, sentence-transformers, pytest
```

Or with plain `venv`/`pip`:

```
python -m venv .venv
source .venv/bin/activate                 # Windows: .venv\Scripts\activate
pip install -r requirements-dev.txt
```

`requirements.txt` is the slim production set used by Railway (no streamlit, no torch, no geopandas — ~2 min build). `requirements-dev.txt` extends it with the local Streamlit dashboard, the GIS stack, and tests.

```
python run_pipeline.py
python -m src.embeddings    # use --force to refresh; OpenAI when OPENAI_API_KEY set, else sentence-transformers (unless forced local)
streamlit run app.py        # set ANTHROPIC_API_KEY to enable the Claude panel
```

Copy `.env.example` to `.env` and set API keys as needed: `OPENAI_API_KEY` selects OpenAI embeddings when not forcing local-only; omit the key to use sentence-transformers. `ANTHROPIC_API_KEY` enables the Claude panel.
These target the semantic half of the blend (text profiles from the feature table via src/embeddings). Use pedestrians, transit, storefront mix, competition, or NFH-style context—those signals appear in the embeddings.
- "High foot traffic and multiple subway lines; strong food-service storefront activity, moderate competition per pedestrian"
- "Diverse retail and services, lower competitive pressure per pedestrian; NFH stability where the feed has it"
- "Transit-heavy CDTA, solid pedestrian volume, mixed storefront activity—dense commercial corridor"
K-means is implemented from scratch in src/kmeans_numpy.py (Euclidean distance, iterative centroids). The K-Selection home page (app.py) runs sweeps and charts; Ranking (pages/Ranking.py) is separate.
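The core loop is the textbook Lloyd iteration. A minimal NumPy sketch of the same idea (not the exact `src/kmeans_numpy.py` code):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain Lloyd's algorithm: Euclidean assignments, mean-centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```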
```
pytest tests/
```

Includes `tests/test_kmeans.py` and `tests/test_feature_engineering.py`. Utility scripts under `scripts/` are not run by CI unless you add them.
tests/rank_stability_validation_business_queries.py is a manual CLI script (no pytest assertions) that compares blended rankings between the 2022 and 2024 vintages for a fixed set of business queries. Outputs land in outputs/validation/rank_stability_business_queries/ and are now interactive Plotly HTML — no static PNGs anymore:
- `rank_stability_rankings.html` — single combined scatter (rank 2022 vs rank 2024) with a dropdown to switch between queries; each marker is colored by |rank delta| and exposes a hover tooltip with the neighborhood, borough, CD, both ranks, the signed delta, and both blended scores. A dashed y = x reference line marks perfect stability.
- `ranking_stability_<query_slug>.html` — per-query standalone scatter (same encoding) for embedding or sharing one query at a time.
- `query_rank_correlations_summary.html` — Spearman r bar chart with a Kendall τ overlay on a secondary axis; hover surfaces the CDTA overlap count per query.
All HTML files use include_plotlyjs="cdn", so they render anywhere with internet access without bundling Plotly. Generate them with:
```
cd tests
python rank_stability_validation_business_queries.py
```

Open them in the file explorer to interact with the scatter plots.
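The CDN setting mentioned above is a one-argument change when each figure is written out. A minimal sketch with hypothetical data (the validation script's actual figure-building code is more involved):

```python
import plotly.express as px

# Hypothetical ranks — the real script plots rank 2022 vs rank 2024 per query.
fig = px.scatter(x=[1, 2, 3], y=[1, 3, 2], labels={"x": "rank 2022", "y": "rank 2024"})

# include_plotlyjs="cdn" keeps the HTML small and loads plotly.js from the CDN at view time.
fig.write_html("rank_stability_rankings.html", include_plotlyjs="cdn")
```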
- `data/processed/` is committed so you can run `streamlit run app.py` without rebuilding features. Re-run `python run_pipeline.py` after changing pipeline code or raw inputs.
- `data/raw/` CSVs are not committed (download locally; see `data/raw/README.MD`). The CDTA 2020 shapefile under `data/raw/nyc_boundaries/nycdta2020.*` is committed (~1.5 MB) so spatial joins work out of the box.
- Regenerate processed tables: `python run_pipeline.py` (requires `geopandas`, local CSVs as above, and the repo shapefile path).
- Large datasets are not included in the repository.
- Precomputed embeddings live under `outputs/embeddings/` after running `python -m src.embeddings` (`neighborhood_embeddings.npy` for the OpenAI backend, `neighborhood_embeddings_st.npy` for sentence-transformers; `neighborhood_texts.npy` is shared).
- OpenAI 429 / insufficient_quota means the account billing or quota for that API key is exhausted; fix billing in the OpenAI dashboard, then re-run embeddings.
- Linear scoring function (α blend). Rankings use `α · semantic_similarity + (1 − α) · MinMax(−competitive)` on the filtered set (same as `/api/rank`). This is transparent and easy to tune, but cannot capture interactions between features (e.g. "subway access matters only for high-pedestrian areas") or non-monotonic preferences.
- Limited expressiveness of the semantic query. Users can write queries such as "quiet residential area for retail with good subway access and good safety" and the system handles them well at the neighborhood-character level. However, it is difficult to express:
- Fine-grained retail categories (e.g. "specialty bookstore" vs. generic "retail").
- Niche restaurant types (e.g. "Sichuan hot pot only" or "third-wave coffee").
- Specific cinema / movie-related preferences (e.g. "indie art-house cinema" vs. "multiplex").

This limitation exists because neighborhood embeddings are built from aggregated text profiles (business-activity counts, demographics, NFH fields). The text describes neighborhoods at a high abstraction level — categories of activity, not individual venues — so the embedding space cannot distinguish between sub-types that share the same parent category. Sub-category preferences get washed out by the dominant signal of the broader profile.
- Static embeddings (not personalized). Every user sees the same neighborhood embeddings. The system has no notion of who is asking, so a real-estate developer and a first-time café owner get identical rankings for identical queries.
- Dependence on handcrafted features. `commercial_activity_score`, `transit_activity_score`, storefront density, and the activity-by-CDTA aggregations are hand-designed. The model's ceiling is bounded by which features the pipeline happens to compute.
- MinMax scaling instability (depends on the filtered set). Both semantic cosine and the competitive signal are MinMax-scaled on the filtered rows, not the full table. Tightening or loosening a hard filter therefore changes the scale of every score, which means the ordering between two neighborhoods can shift even when the underlying numbers did not (illustrated in the sketch below). With a single surviving row, the scaler falls back to a neutral mid-score.
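A small numeric illustration of that instability, using made-up competitive scalars for three CDTAs:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical competitive scalars for three CDTAs (illustrative numbers only).
wide_filter = np.array([[0.2], [0.5], [3.0]])   # loose filter keeps a high-competition outlier
tight_filter = np.array([[0.2], [0.5]])         # tighter filter drops it

print(MinMaxScaler().fit_transform(-wide_filter).ravel())   # ≈ [1.0, 0.893, 0.0]
print(MinMaxScaler().fit_transform(-tight_filter).ravel())  # [1.0, 0.0]
```

The CDTA with scalar 0.5 scores about 0.89 on the competitive axis under the loose filter but 0.0 under the tight one, so its blended rank can flip against a neighbor purely because the filter changed.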
- More advanced ranking models. Replace the linear α blend with non-linear models (gradient-boosted trees, learning-to-rank such as LambdaMART) once labeled outcomes (e.g. survival of new businesses, observed click-throughs in the UI) are available. This would let the system learn feature interactions instead of forcing the user to guess α.
- Fine-grained taxonomy for business types. Replace the current top-level `act_*_storefront` activity counts with a deeper category hierarchy (e.g. NAICS 6-digit or a custom retail taxonomy) so the embedding text can distinguish between sub-types like "indie bookstore" vs. "chain bookstore" or "art-house cinema" vs. "multiplex".
- Enhanced Competitive + Commercial Activity Scores. Improve the calculation and integration of competitive and commercial activity scores to better reflect neighborhood dynamics and business potential. For example, incorporate more granular storefront categories, temporal trends in filings and pedestrian activity, or additional context from NFH indicators to create a more nuanced competitive pressure signal.
- Hybrid structured + semantic query system. Let the user mix structured constraints and free text in the same query (e.g. "neighborhoods with ≥3 subway stations, low vacancy, and a vibe like SoHo"). An LLM-based parser can extract the structured pieces into SQL filters and pass the remainder to the semantic ranker — closer to how users actually think.
- Personalization (user embeddings / preference learning). Capture session-level signals (which neighborhoods the user clicked, downloaded, or asked Claude about) and learn a user embedding that is added to the query embedding. Even a lightweight bandit over saved preferences would meaningfully improve repeat-user experience.
- Better LLM integration. Use Claude not only for the post-hoc analysis panel but also for query expansion ("quiet" → quiet, residential, low-traffic, low-noise) and constraint extraction (parsing "near a subway" into a hard `subway_station_count ≥ 1` filter). This narrows the gap between what the user types and what the embedding model can match against.
The repo also ships as a two-tier web app: FastAPI on Railway (Python pipeline + ML) and Next.js on Vercel (UI). The Streamlit app remains supported for local use; the FastAPI backend wraps the same src/ modules.
`api/main.py` is the FastAPI app — endpoints `/api/health`, `/api/feature-ranges`, `/api/cluster`, `/api/filter`, `/api/rank`, `/api/agent`, `/api/geo/cdta`. Run locally:
```
uv pip install -r requirements.txt
uvicorn api.main:app --reload --port 8000
```

- Push to GitHub, then on railway.app → New Project → Deploy from GitHub repo. Railway's builder (Railpack) picks the Python version from `.python-version` (pinned to 3.11.9), then installs from `requirements.txt`. Deploy/runtime uses `railway.json` / `Procfile` for the start command and healthcheck path.
- Set environment variables on the Railway service (Settings → Variables):
  - `FRONTEND_ORIGINS` — comma-separated list of Vercel URLs that may call the API (e.g. `https://nyc-commercial.vercel.app`). Without this, only `http://localhost:3000` is allowed.
  - `OPENAI_API_KEY` — required at runtime to embed the user's query.
  - `ANTHROPIC_API_KEY` — required for the `/api/agent` endpoint; other endpoints work without it.
  - `SUPABASE_URL` and `SUPABASE_SERVICE_ROLE_KEY` — strongly recommended for production (see step 4). When both are set, `/api/rank` calls `public.match_neighborhoods` (pgvector cosine + filters in SQL) instead of loading the full DataFrame and the embeddings cache. Falls back to the CSV path if either is missing.
- Memory: the slim `requirements.txt` (no torch, no geopandas, no streamlit) fits on Railway's free tier. Build is ~1.5–2 min.
- Embeddings cache is only used when Supabase is not configured. In that mode the server builds the cache lazily in `outputs/embeddings/` on the first `/api/rank` call (slow on cold start). The Supabase path is the recommended deploy mode.
The repo includes Postgres migrations under supabase/migrations/ that create a public.neighborhoods table with a vector(1536) embedding column, an HNSW cosine index, RLS policies for anon-key reads, and a match_neighborhoods RPC that does cosine similarity + hard-filter SQL in one round trip.
To set up:
1. Create a Supabase project (note the URL and the service-role key — server-only).
2. Apply migrations: install the Supabase CLI (`brew install supabase/tap/supabase`), then `supabase link --project-ref <ref>` and `supabase db push`. (Or paste each `supabase/migrations/*.sql` into the SQL editor in numeric order.)
3. Populate the table once with `scripts/load_supabase.py`:

```
export SUPABASE_URL=https://<ref>.supabase.co
export SUPABASE_SERVICE_ROLE_KEY=...   # never commit
export OPENAI_API_KEY=...
uv run python scripts/load_supabase.py
```

The script handles the `act_OTHER_*` / `act_other_*` case collision (Postgres folds identifiers to lowercase; the originally-lowercase variant is renamed to `*_lower_*` to match the schema).

4. Set `SUPABASE_URL` + `SUPABASE_SERVICE_ROLE_KEY` on Railway. `/api/rank` will switch to the Supabase RPC automatically.
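How the backend might invoke that RPC — a hedged sketch using `supabase-py`; the parameter names here are hypothetical, and the real signature is whatever the `match_neighborhoods` function in `supabase/migrations/` defines:

```python
import os
from supabase import create_client

client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])

# Hypothetical parameters — check the match_neighborhoods definition in the migrations
# for the actual argument names (query embedding, result count, hard-filter values).
resp = client.rpc(
    "match_neighborhoods",
    {"query_embedding": [0.0] * 1536, "match_count": 5},
).execute()
print(resp.data)  # ranked neighborhoods, cosine similarity computed in Postgres
```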
The frontend is a stand-alone Next.js 14 (App Router, TypeScript, Tailwind) app under frontend/.
```
cd frontend
npm install
cp .env.local.example .env.local   # set NEXT_PUBLIC_API_URL=http://127.0.0.1:8000
npm run dev                        # http://localhost:3000
```

To deploy:
- On vercel.com → Add New Project → Import Git Repository, point at this repo.
- Set the Root Directory to `frontend/` in project settings (otherwise Vercel will try to build from the repo root and fail on the Python files).
- Add the env var `NEXT_PUBLIC_API_URL=https://<your-railway-app>.up.railway.app` (no trailing slash).
- Vercel auto-detects Next.js and uses `npm run build`. Pushes to `main` deploy to production; PR branches get preview URLs.
Once deployed, copy the production Vercel URL back into Railway's FRONTEND_ORIGINS so CORS allows it.
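For reference, a minimal sketch of how `FRONTEND_ORIGINS` could be wired into FastAPI's CORS middleware — the actual wiring in `api/main.py` may differ:

```python
import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Comma-separated env var; falls back to the local Next.js dev server.
origins = [o.strip() for o in os.getenv("FRONTEND_ORIGINS", "http://localhost:3000").split(",") if o.strip()]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,   # only these frontends may call the API from a browser
    allow_methods=["*"],
    allow_headers=["*"],
)
```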
- Cold starts: Railway sleeps idle services on Hobby. First request takes 5–15s to wake. The frontend handles this with loading states; for demos, hit `/api/health` from a cron.
- CDTA boundaries ship as a pre-rendered `data/processed/cdta_geo.json` (~4 MB) so Railway doesn't need geopandas. Regenerate locally with `uv run python scripts/build_cdta_geojson.py` after editing the shapefile (requires geopandas — installed via `requirements-dev.txt`).
- The clustering "Run Analysis" call recomputes K-means on every request. For the 71-CDTA dataset this is sub-second; if you swap in a larger feature table, add server-side caching keyed on `(features, max_k, vintage, boroughs)` (see the caching sketch after these notes).
- The Streamlit app (`app.py`) and the FastAPI app share two layers of common code: (1) `src/` modules (pipeline, embeddings, K-means, agent), and (2) `api/cluster_helpers.py` + `api/rank_helpers.py` + `api/formatting.py` (elbow heuristics, rich cluster descriptions, DuckDB SQL building, label formatting). `streamlit_app/cluster_helpers.py` is a thin wrapper that re-exports those helpers for the Streamlit page so logic does not drift between interfaces — when you touch ranking, clustering, or description text, edit the `api/*_helpers.py` module, not the Streamlit page. The Streamlit-specific `@st.cache_data` import in `src/config.py` is wrapped in try/except so the FastAPI deploy doesn't need streamlit installed.
- Vercel (public dashboard) — production Next.js app: https://nyc-commercial-intelligence.vercel.app/. Stack matches `frontend/`: Next.js 14 (App Router), TypeScript, Tailwind CSS; it calls the FastAPI API via `NEXT_PUBLIC_API_URL` (Railway in production).
- Streamlit Community Cloud — K-Selection / clustering UI: https://nyc-commercial-intelligence.streamlit.app/.