A data-driven decision-support system for exploring and ranking commercial locations in New York City using urban data and machine learning.
This project integrates NYC Open Data (pedestrian counts, subway stations, storefront vacancy filings) and NYC Public Neighborhood Profiles–style community statistics, aggregated to CDTA boundaries, to model neighborhood-level commercial environments.
streamlit run app.py opens K-Selection / clustering (app.py). The Ranking UI is pages/Ranking.py: hard SQL filters, then α·semantic + β·competitive penalty — MinMax on the filtered rows for [cosine similarity, −log1p(count/(avg_pedestrian+1))] (overall filings or a chosen act_*_storefront column; same as /api/rank), with one α slider and a Claude panel that explains the fixed top 5 from that blend (/api/agent parity).
Inputs live under data/raw/ and are joined to CDTA 2020 polygons (NYC Planning boundaries: footprint, area_km2, spatial joins). run_pipeline.py wires paths to each source.
- Mobility — DOT bi-annual pedestrian counts → per-CDTA foot traffic (`avg_pedestrian`, `peak_pedestrian`, etc.); MTA/NYS subway station points → `subway_station_count`, `subway_density_per_km2`, `transit_activity_score` inputs.
- Community and economic context — Comptroller Neighborhood Economic Profiles (ACS-style jobs, demographics, income, education, commute); Neighborhood Financial Health indicators → `nfh_*` when that feed is present.
- Commerce and public safety — City storefront vacancy / activity filings → `storefront_*`, `act_*_storefront`, category mix, `competitive_score`, `commercial_activity_score`; NYPD shooting points → per-CDTA shooting totals in the feature table.
The pipeline writes data/processed/neighborhood_features_final.csv (one row per CDTA) for clustering, ranking, and embeddings. Exact filenames, layout, and where to download: see data/raw/README.MD and the data/raw/ tree.
- Raw data in `data/raw/` (CSVs + CDTA shapefile under `nyc_boundaries/`). See `data/raw/README.MD`.
- `python run_pipeline.py` — `src/data_processing.py` cleans sources (including the Neighborhood Financial Health / NFH CSV merged into `nbhd_clean`); `src/feature_engineering.py` reads raw storefront filings (path configured in `run_pipeline.py`), spatially aggregates storefront counts by CDTA and primary business activity, merges MOCEJ-style neighborhood profiles and `nfh_*` columns on a normalized Community District key, then imputes remaining gaps in those merged numeric columns with borough median, then citywide median (dashboard-friendly proxy where a CDTA does not match a single profile row). `commercial_activity_score = log1p(storefront_filing_count × avg_pedestrian)` and `transit_activity_score = log1p(subway_station_count × avg_pedestrian)`, computed after filling missing storefront/subway/pedestrian inputs (inner product clipped at 0 before `log1p`) so scores are not stuck at zero from ordering alone and heavy tails are compressed for hard-filter sliders; see the sketch after this list. (Soft ranking on the dashboard uses semantic + competitive — see Ranking section — not a MinMax of `commercial_activity_score`.) Output: `data/processed/neighborhood_features_final.csv`. A healthy run ends with no missing values in that table; if any column still has NaN, investigate before shipping.
- Embeddings (for the app) — `python -m src.embeddings` builds embeddings from neighborhood text profiles (including every non-zero `act_*_storefront` business-activity count, population proxies, and NFH fields where present — see `data/processed/README.md`); caches under `outputs/embeddings/`. Default (`EMBEDDING_BACKEND` unset or `auto`): OpenAI `text-embedding-3-small` if `OPENAI_API_KEY` is set, else local sentence-transformers. `EMBEDDING_BACKEND=openai` uses OpenAI when a key is present, otherwise falls back like auto. `EMBEDDING_BACKEND=sentence_transformers` forces local only. Use `--force` after changing features or profile text so embeddings match the CSV.
- `streamlit run app.py` — Home = K-Selection / clustering (`app.py`); Ranking is `pages/Ranking.py` (hard filters, α·semantic + β·competitive blend, map, Claude). Loads the feature table (cached; Rerun or Clear cache after regenerating the CSV). The Next.js app on Vercel calls the same `/api/cluster`, `/api/filter`, `/api/rank`, and `/api/agent` endpoints when `NEXT_PUBLIC_API_URL` points at the FastAPI backend.
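The two activity scores reduce to a few lines of pandas. A minimal sketch, assuming the column names from the feature table above (the actual implementation lives in `src/feature_engineering.py` and may differ in detail):

```python
import numpy as np
import pandas as pd

def add_activity_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only — mirrors the formulas described above."""
    out = df.copy()
    # Fill missing inputs first so a NaN on either side does not zero the score.
    for col in ["storefront_filing_count", "subway_station_count", "avg_pedestrian"]:
        out[col] = out[col].fillna(0)
    # log1p compresses heavy tails; clip(lower=0) guards against negative products.
    out["commercial_activity_score"] = np.log1p(
        (out["storefront_filing_count"] * out["avg_pedestrian"]).clip(lower=0)
    )
    out["transit_activity_score"] = np.log1p(
        (out["subway_station_count"] * out["avg_pedestrian"]).clip(lower=0)
    )
    return out
```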
The ranking dashboard reads data/processed/neighborhood_features_final.csv (cached).
Sidebar controls set thresholds on:
- Borough (multiselect)
- Minimum `subway_station_count`, `avg_pedestrian`, `storefront_density_per_km2`, `storefront_filing_count`, `commercial_activity_score`
- Maximum `competitive_score` (competition pressure — same column used in the `competitive_score` hard filter on `/api/filter` / `/api/rank`)
- Maximum shooting-incident count (column name may be `shooting_incident_count` or `shooting_incident_count_2024` depending on the pipeline export)
- NFH thresholds (`nfh_overall_score` / `nfh_goal4_fin_shocks_score`): minimum thresholds via sidebar sliders (shown only when those columns exist and have values)
These are applied with DuckDB using parameterized WHERE clauses (same binding idea as the FastAPI backend): the full table is registered as nbhd, a SELECT … WHERE … runs, and rows are ordered by commercial_activity_score DESC for the hard-filter preview table. The main area shows a table of surviving neighborhoods (key columns). View generated SQL expands to show the exact query. An expander (About zeros, nulls, and refreshing data) documents imputation, score formulas, and when zeros are expected.
If no rows match, the app stops with a warning.
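A minimal sketch of the DuckDB binding pattern described above. The threshold columns match the feature table; the `borough` column name and the exact SQL the app builds are assumptions for illustration:

```python
import duckdb
import pandas as pd

df = pd.read_csv("data/processed/neighborhood_features_final.csv")

con = duckdb.connect()
con.register("nbhd", df)  # expose the DataFrame to SQL as a table named nbhd

# Parameterized WHERE clause: values are bound, never interpolated into the SQL string.
query = """
    SELECT *
    FROM nbhd
    WHERE borough IN (?, ?)
      AND subway_station_count >= ?
      AND competitive_score <= ?
    ORDER BY commercial_activity_score DESC
"""
surviving = con.execute(query, ["Manhattan", "Brooklyn", 3, 0.5]).df()
```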
- User enters a free-text query (ideal area description).
- One blend slider sets α ∈ [0, 1] for semantic similarity (cosine similarity after MinMax on the filtered set). β = 1 − α applies to a competitive-pressure signal derived from `log1p(count / (avg_pedestrian + 1))`, where `count` is either total `storefront_filing_count` or a selected `act_*_storefront` column (same behavior as the Next.js app calling `/api/rank`). Higher competition penalizes rank, so the second axis uses MinMax(−competitive) before blending. No second slider; α + β = 1 by construction.
- Embeddings: query and neighborhoods use the active backend in `src/embeddings.py` — by default OpenAI `text-embedding-3-small` when `OPENAI_API_KEY` is set, else local sentence-transformers (`all-MiniLM-L6-v2`); or whichever backend you force with `EMBEDDING_BACKEND`. Cosine similarity is computed on the filtered set (aligned by neighborhood name to the full embedding matrix).
- Build a matrix [cosine_sim, −competitive] for those rows and apply `sklearn.preprocessing.MinMaxScaler` (column-wise, 0–1 on the filtered set). With a single row, scaling falls back to a neutral mid-score to avoid degenerate MinMax. `blended_score = α·col0 + β·col1`. Sort by `blended_score` descending; a minimal sketch of this blend follows the list below. The table shows `semantic_similarity`, `specific_competitive_score` (the `log1p` competitive scalar used in the blend), and `blended_score`. (`commercial_activity_score` remains available for hard filters and the hard-filter table sort, not as the soft-ranking second axis.)
- Map (when embeddings succeed): a CDTA choropleth colors polygons by `blended_score` (sequential greens; requires `data/raw/nyc_boundaries/nycdta2020.shp`).
If embeddings are missing or the API key is unset, this block shows a warning (pre-generate embeddings with python -m src.embeddings; use --force after feature or profile text changes).
A button sends Claude the fixed top 5 neighborhoods from the soft ranker (same rules as /api/agent) plus the hard-filtered dataframe. The agent may call run_sql only for extra context on those five — it must not re-rank or replace them. Requires ANTHROPIC_API_KEY.
- K-means clustering — not used to order results in the app (it feeds cluster labels on the Ranking page only after you run K-Selection on the home page).
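A minimal sketch of the soft-ranking blend, assuming a cosine-similarity array already aligned with the filtered rows and the competitive scalar described above (illustrative, not the exact `pages/Ranking.py` code):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def blend_rank(filtered: pd.DataFrame, cosine_sim: np.ndarray, alpha: float,
               count_col: str = "storefront_filing_count") -> pd.DataFrame:
    out = filtered.copy()
    # Competitive-pressure scalar: filings per pedestrian, tail-compressed.
    competitive = np.log1p(out[count_col] / (out["avg_pedestrian"] + 1))
    # Negate competition so "less competition" scores higher on the second axis.
    matrix = np.column_stack([cosine_sim, -competitive])
    if len(out) > 1:
        scaled = MinMaxScaler().fit_transform(matrix)
    else:
        scaled = np.full_like(matrix, 0.5)  # single row: neutral mid-score
    out["semantic_similarity"] = cosine_sim
    out["specific_competitive_score"] = competitive
    out["blended_score"] = alpha * scaled[:, 0] + (1 - alpha) * scaled[:, 1]
    return out.sort_values("blended_score", ascending=False)
```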
A walkthrough of the Ranking page from a user's perspective. Open it from the sidebar after streamlit run app.py.
- Set hard filters in the sidebar. Pick boroughs and drag the sliders for the minimums above, plus max `competitive_score`, max shooting count (when present), and NFH minimums when shown. These translate into a parameterized SQL `WHERE` clause; expand View generated SQL to see the literal query. Surviving neighborhoods appear in the table at the top of the page. If no rows match, loosen a constraint until at least one neighborhood passes.
- Describe what you want in plain English. Type a free-text query into the soft-preferences box (see Example queries below for project-shaped prompts). The query is embedded with the same model used for neighborhood text profiles, then matched by cosine similarity against the filtered set.
- Tune the α slider (semantic ↔ competitive tradeoff). α controls how much the ranking trusts your text query versus the competitive-pressure axis (after MinMax with semantic cosine):
  - α = 1 → pure semantic match (text query only).
  - α = 0 → pure competitive axis (lower `log1p(count/(ped+1))` wins after MinMax with negation — same construction as `/api/rank`).
  - In between → MinMax-scaled blend of the two. β is automatically 1 − α.
- Read the output. The ranked table shows `semantic_similarity`, `specific_competitive_score`, and `blended_score` for each surviving neighborhood. Below it, a CDTA choropleth colors the map by `blended_score` so you can see geographic patterns at a glance.
- Run the AI analysis panel. Click the button to send the fixed top 5 from the soft ranker (same contract as `/api/agent`) plus the hard-filtered table to Claude. The agent may call read-only `SELECT` only for extra context and must not re-rank or replace those five neighborhoods. Requires `ANTHROPIC_API_KEY`.
| Piece | Role |
|---|---|
| K-means + K-Selection (`app.py`, home) | Exploratory: sweep k, charts, CDTA choropleth, rich cluster descriptions (titled archetype + percentile-based metric levels + activity-category profile + borough concentration & nearest text-profile matches); labels saved for Ranking. Does not define the rank order on the Ranking page. |
| Ranking (`pages/Ranking.py`) | Hard SQL filters → MinMax([semantic, −competitive]) → α·col0 + (1−α)·col1 (map + Claude explains fixed top 5 + cluster columns). |
Clusters are summarized in three layers (all generated by api/cluster_helpers.py:_cluster_rich_description, shared by Streamlit and FastAPI):
- Titled archetype chosen from centroid z-scores — e.g. Dense Mixed-Use Commercial Core, Transit-Oriented Commercial District, Employment-Heavy Business District, Stable Lower-Density Neighborhood Market, Lower-Density Local Commercial Area, Balanced Neighborhood Market. A special No Recorded Commercial Activity Zone / Mostly No-Activity / Special-Use Geography title fires when member CDTAs have zero non-vacant filings and zero commercial / competitive scores (so reviewers don't mistake parks, waterfront, or other special-use polygons for retail markets).
- Profile signals as percentile-band sentences ("high foot traffic around the 82nd percentile citywide", "moderate transit activity around the 54th percentile") computed against the master feature table — not raw z-scores.
- Granular storefront mix — top categories by per-CDTA density (e.g. "food service — 38% of filings, 91st pct density"), plus unusually elevated and comparatively sparse `act_*_density` categories at the centroid. Ends with borough concentration and the nearest text-profile matches from cached embeddings (cosine to cluster mean).
The legacy one-liner ("Above average on X; relatively lower on Y.") is kept as a fallback in _cluster_brief_description when embeddings or member data are unavailable.
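The percentile-band sentences in the second layer reduce to a rank of the centroid value against the master feature table. A hypothetical helper, not the actual `_cluster_rich_description` code — the band thresholds here are illustrative assumptions:

```python
import pandas as pd

def percentile_band(master: pd.DataFrame, column: str, centroid_value: float) -> str:
    """Describe a centroid value as a citywide percentile band (illustrative)."""
    pct = (master[column] < centroid_value).mean() * 100
    if pct >= 75:        # assumed cut-offs for "high" / "moderate" / "low"
        level = "high"
    elif pct >= 40:
        level = "moderate"
    else:
        level = "low"
    return f"{level} {column.replace('_', ' ')} around the {pct:.0f}th percentile citywide"
```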
The baseline dashboard relies on transparent filters plus scores from the feature table — no supervised model is required. If labeled outcomes are added later, supervised models (with proper validation to avoid geographic leakage) can sit alongside the same pipeline outputs.
| Area | Files | Notes |
|---|---|---|
| Data pipeline | `src/data_processing.py`, `src/feature_engineering.py`, `run_pipeline.py` | Produces `neighborhood_features_final.csv` |
| Embeddings | `src/embeddings.py` | Text profiles → `.npy` cache (OpenAI or sentence-transformers) |
| K-Selection / clustering (home) | `app.py` | Thin Streamlit page: K sweep, viz, CDTA map. Heavy logic lives in `streamlit_app/cluster_helpers.py` (which re-uses `api/cluster_helpers.py`). Elbow heuristics match `/api/cluster` (perpendicular-distance `elbow_k`, Δ² inertia `elbow_k_kneedle`); writes cluster labels + descriptions to session state |
| Ranking | `pages/Ranking.py` | Hard filters, MinMax [semantic, −competitive] (one α), map, Claude (fixed top 5), cluster join |
| FastAPI backend | `api/main.py` | Slim endpoint layer; all SQL building, ranking, clustering and description logic now lives in `api/rank_helpers.py`, `api/cluster_helpers.py`, and `api/formatting.py` (shared with Streamlit) |
| Shared cluster + rank helpers | `api/cluster_helpers.py`, `api/rank_helpers.py`, `api/formatting.py` | Single source of truth for elbow detection, rich cluster descriptions (`_cluster_rich_description`, `_cluster_title`, `_activity_category_profile`), DuckDB SQL building, and feature-name pretty-printing. Imported by both FastAPI and the Streamlit app |
| Streamlit-only adapters | `streamlit_app/cluster_helpers.py`, `streamlit_app/constants.py` | Thin wrappers over `api/cluster_helpers.py` plus Streamlit-specific constants (cluster palette, candidate/default feature lists, readable feature labels). Avoids duplicating logic between `app.py` and `api/main.py` |
| K-means (library) | `src/kmeans_numpy.py` | Used by `app.py`; not the ranking sort key |
| Agent | `src/agent.py` | Claude + DuckDB SELECT tools |
| Scripts | `scripts/load_supabase.py`, `scripts/build_cdta_geojson.py` | Load neighborhoods into Supabase; build `cdta_geo.json` for the API without geopandas on Railway |
- `data/`: raw and processed datasets (see `data/*/README*`)
- `src/`: core logic (pipeline, features, embeddings, agent)
- `api/`: FastAPI app + shared helpers (`cluster_helpers.py`, `rank_helpers.py`, `formatting.py`)
- `streamlit_app/`: Streamlit-side adapters and constants (re-uses `api/*` helpers)
- `pages/`: `Ranking.py` only (soft ranker; **K-Selection lives in `app.py`**)
- `frontend/`: Next.js 14 UI (deployed to Vercel)
- `scripts/`: Supabase load, CDTA GeoJSON build (`load_supabase.py`, `build_cdta_geojson.py`)
- `outputs/`: saved models, embeddings, figures, validation artifacts
- `tests/`: unit tests (`pytest tests/`) + manual evaluation pipeline
- `app.py`: Streamlit entry, the **K-Selection / clustering** home (`streamlit run app.py`)
| Document | Purpose |
|---|---|
| `README.md` (this file) | Setup, Streamlit behavior, API keys, troubleshooting |
| `data/raw/README.MD` | Where to obtain raw CSVs and CDTA shapefile; layout under `data/raw/` |
| `data/processed/README.md` | Processed CSVs, final feature columns, app + embedding pipeline |
```
uv venv
source .venv/bin/activate                 # Windows: .venv\Scripts\activate
uv pip install -r requirements-dev.txt    # local: includes Streamlit, GIS, sentence-transformers, pytest
```

Or with plain `venv`/`pip`:

```
python -m venv .venv
source .venv/bin/activate                 # Windows: .venv\Scripts\activate
pip install -r requirements-dev.txt
```

`requirements.txt` is the slim production set used by Railway (no streamlit, no torch, no geopandas — ~2 min build). `requirements-dev.txt` extends it with the local Streamlit dashboard, the GIS stack, and tests.

```
python run_pipeline.py
python -m src.embeddings    # use --force to refresh; OpenAI when OPENAI_API_KEY set, else sentence-transformers (unless forced local)
streamlit run app.py        # set ANTHROPIC_API_KEY to enable the Claude panel
```

Copy `.env.example` to `.env` and set API keys as needed: `OPENAI_API_KEY` selects OpenAI embeddings when not forcing local-only; omit the key to use sentence-transformers. `ANTHROPIC_API_KEY` enables the Claude panel.
These target the semantic half of the blend (text profiles from the feature table via src/embeddings). Use pedestrians, transit, storefront mix, competition, or NFH-style context—those signals appear in the embeddings.
- "High foot traffic and multiple subway lines; strong food-service storefront activity, moderate competition per pedestrian"
- "Diverse retail and services, lower competitive pressure per pedestrian; NFH stability where the feed has it"
- "Transit-heavy CDTA, solid pedestrian volume, mixed storefront activity—dense commercial corridor"
K-means is implemented from scratch in src/kmeans_numpy.py (Euclidean distance, iterative centroids). The K-Selection home page (app.py) runs sweeps and charts; Ranking (pages/Ranking.py) is separate.
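The core loop is the textbook Lloyd iteration. A minimal NumPy sketch of the same idea (not the exact `src/kmeans_numpy.py` code):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain Lloyd's algorithm: Euclidean assignments, mean-centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```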
```
pytest tests/
```

Includes `tests/test_kmeans.py` and `tests/test_feature_engineering.py`. Utility scripts under `scripts/` are not run by CI unless you add them.
tests/rank_stability_validation_business_queries.py is a manual CLI script (no pytest assertions) that compares blended rankings between the 2022 and 2024 vintages for a fixed set of business queries. Outputs land in outputs/validation/rank_stability_business_queries/ and are now interactive Plotly HTML — no static PNGs anymore:
- `rank_stability_rankings.html` — single combined scatter (rank 2022 vs rank 2024) with a dropdown to switch between queries; each marker is colored by |rank delta| and exposes a hover tooltip with the neighborhood, borough, CD, both ranks, the signed delta, and both blended scores. A dashed y = x reference line marks perfect stability.
- `ranking_stability_<query_slug>.html` — per-query standalone scatter (same encoding) for embedding or sharing one query at a time.
- `query_rank_correlations_summary.html` — Spearman r bar chart with a Kendall τ overlay on a secondary axis; hover surfaces the CDTA overlap count per query.
All HTML files use include_plotlyjs="cdn", so they render anywhere with internet access without bundling Plotly. Generate them with:
```
cd tests
python rank_stability_validation_business_queries.py
```

Open them in the file explorer to interact with the scatter plots.
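The CDN setting mentioned above is a one-argument change when each figure is written out. A minimal sketch with hypothetical data (the validation script's actual figure-building code is more involved):

```python
import plotly.express as px

# Hypothetical ranks — the real script plots rank 2022 vs rank 2024 per query.
fig = px.scatter(x=[1, 2, 3], y=[1, 3, 2], labels={"x": "rank 2022", "y": "rank 2024"})

# include_plotlyjs="cdn" keeps the HTML small and loads plotly.js from the CDN at view time.
fig.write_html("rank_stability_rankings.html", include_plotlyjs="cdn")
```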
- `data/processed/` is committed so you can run `streamlit run app.py` without rebuilding features. Re-run `python run_pipeline.py` after changing pipeline code or raw inputs.
- `data/raw/` CSVs are not committed (download locally; see `data/raw/README.MD`). The CDTA 2020 shapefile under `data/raw/nyc_boundaries/nycdta2020.*` is committed (~1.5 MB) so spatial joins work out of the box.
- Regenerate processed tables: `python run_pipeline.py` (requires `geopandas`, local CSVs as above, and the repo shapefile path).
- Large datasets are not included in the repository.
- Precomputed embeddings live under `outputs/embeddings/` after running `python -m src.embeddings` (`neighborhood_embeddings.npy` for the OpenAI backend, `neighborhood_embeddings_st.npy` for sentence-transformers; `neighborhood_texts.npy` is shared).
- OpenAI 429 / insufficient_quota means the account billing or quota for that API key is exhausted; fix billing in the OpenAI dashboard, then re-run embeddings.
- Linear scoring function (α blend). Rankings use `α · semantic_similarity + (1 − α) · MinMax(−competitive)` on the filtered set (same as `/api/rank`). This is transparent and easy to tune, but cannot capture interactions between features (e.g. "subway access matters only for high-pedestrian areas") or non-monotonic preferences.
- Limited expressiveness of the semantic query. Users can write queries such as "quiet residential area for retail with good subway access and good safety" and the system handles them well at the neighborhood-character level. However, it is difficult to express:
- Fine-grained retail categories (e.g. "specialty bookstore" vs. generic "retail").
- Niche restaurant types (e.g. "Sichuan hot pot only" or "third-wave coffee").
- Specific cinema / movie-related preferences (e.g. "indie art-house cinema" vs. "multiplex").

This limitation exists because neighborhood embeddings are built from aggregated text profiles (business-activity counts, demographics, NFH fields). The text describes neighborhoods at a high abstraction level — categories of activity, not individual venues — so the embedding space cannot distinguish between sub-types that share the same parent category. Sub-category preferences get washed out by the dominant signal of the broader profile.
- Static embeddings (not personalized). Every user sees the same neighborhood embeddings. The system has no notion of who is asking, so a real-estate developer and a first-time café owner get identical rankings for identical queries.
- Dependence on handcrafted features. `commercial_activity_score`, `transit_activity_score`, storefront density, and the activity-by-CDTA aggregations are hand-designed. The model's ceiling is bounded by which features the pipeline happens to compute.
- MinMax scaling instability (depends on the filtered set). Both semantic cosine and the competitive signal are MinMax-scaled on the filtered rows, not the full table. Tightening or loosening a hard filter therefore changes the scale of every score, which means the ordering between two neighborhoods can shift even when the underlying numbers did not (illustrated in the sketch below). With a single surviving row, the scaler falls back to a neutral mid-score.
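A small numeric illustration of that instability, using made-up competitive scalars for three CDTAs:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical competitive scalars for three CDTAs (illustrative numbers only).
wide_filter = np.array([[0.2], [0.5], [3.0]])   # loose filter keeps a high-competition outlier
tight_filter = np.array([[0.2], [0.5]])         # tighter filter drops it

print(MinMaxScaler().fit_transform(-wide_filter).ravel())   # ≈ [1.0, 0.893, 0.0]
print(MinMaxScaler().fit_transform(-tight_filter).ravel())  # [1.0, 0.0]
```

The CDTA with scalar 0.5 scores about 0.89 on the competitive axis under the loose filter but 0.0 under the tight one, so its blended rank can flip against a neighbor purely because the filter changed.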
- More advanced ranking models. Replace the linear α blend with non-linear models (gradient-boosted trees, learning-to-rank such as LambdaMART) once labeled outcomes (e.g. survival of new businesses, observed click-throughs in the UI) are available. This would let the system learn feature interactions instead of forcing the user to guess α.
- Fine-grained taxonomy for business types. Replace the current top-level `act_*_storefront` activity counts with a deeper category hierarchy (e.g. NAICS 6-digit or a custom retail taxonomy) so the embedding text can distinguish between sub-types like "indie bookstore" vs. "chain bookstore" or "art-house cinema" vs. "multiplex".
- Enhanced Competitive + Commercial Activity Scores. Improve the calculation and integration of competitive and commercial activity scores to better reflect neighborhood dynamics and business potential. For example, incorporate more granular storefront categories, temporal trends in filings and pedestrian activity, or additional context from NFH indicators to create a more nuanced competitive pressure signal.
- Hybrid structured + semantic query system. Let the user mix structured constraints and free text in the same query (e.g. "neighborhoods with ≥3 subway stations, low vacancy, and a vibe like SoHo"). An LLM-based parser can extract the structured pieces into SQL filters and pass the remainder to the semantic ranker — closer to how users actually think.
- Personalization (user embeddings / preference learning). Capture session-level signals (which neighborhoods the user clicked, downloaded, or asked Claude about) and learn a user embedding that is added to the query embedding. Even a lightweight bandit over saved preferences would meaningfully improve repeat-user experience.
- Better LLM integration. Use Claude not only for the post-hoc analysis panel but also for query expansion ("quiet" → quiet, residential, low-traffic, low-noise) and constraint extraction (parsing "near a subway" into a hard `subway_station_count ≥ 1` filter). This narrows the gap between what the user types and what the embedding model can match against.
The repo also ships as a two-tier web app: FastAPI on Railway (Python pipeline + ML) and Next.js on Vercel (UI). The Streamlit app remains supported for local use; the FastAPI backend wraps the same src/ modules.
`api/main.py` is the FastAPI app — endpoints `/api/health`, `/api/feature-ranges`, `/api/cluster`, `/api/filter`, `/api/rank`, `/api/agent`, `/api/geo/cdta`. Run locally:
```
uv pip install -r requirements.txt
uvicorn api.main:app --reload --port 8000
```

- Push to GitHub, then on railway.app → New Project → Deploy from GitHub repo. Railway's builder (Railpack) picks the Python version from `.python-version` (pinned to 3.11.9), then installs from `requirements.txt`. Deploy/runtime uses `railway.json` / `Procfile` for the start command and healthcheck path.
- Set environment variables on the Railway service (Settings → Variables):
  - `FRONTEND_ORIGINS` — comma-separated list of Vercel URLs that may call the API (e.g. `https://nyc-commercial.vercel.app`). Without this, only `http://localhost:3000` is allowed.
  - `OPENAI_API_KEY` — required at runtime to embed the user's query.
  - `ANTHROPIC_API_KEY` — required for the `/api/agent` endpoint; other endpoints work without it.
  - `SUPABASE_URL` and `SUPABASE_SERVICE_ROLE_KEY` — strongly recommended for production (see step 4). When both are set, `/api/rank` calls `public.match_neighborhoods` (pgvector cosine + filters in SQL) instead of loading the full DataFrame and the embeddings cache. Falls back to the CSV path if either is missing.
- Memory: the slim `requirements.txt` (no torch, no geopandas, no streamlit) fits on Railway's free tier. Build is ~1.5–2 min.
- Embeddings cache is only used when Supabase is not configured. In that mode the server builds the cache lazily in `outputs/embeddings/` on the first `/api/rank` call (slow on cold start). The Supabase path is the recommended deploy mode.
The repo includes Postgres migrations under supabase/migrations/ that create a public.neighborhoods table with a vector(1536) embedding column, an HNSW cosine index, RLS policies for anon-key reads, and a match_neighborhoods RPC that does cosine similarity + hard-filter SQL in one round trip.
To set up:
1. Create a Supabase project (note the URL and the service-role key — server-only).
2. Apply migrations: install the Supabase CLI (`brew install supabase/tap/supabase`), then `supabase link --project-ref <ref>` and `supabase db push`. (Or paste each `supabase/migrations/*.sql` into the SQL editor in numeric order.)
3. Populate the table once with `scripts/load_supabase.py`:

```
export SUPABASE_URL=https://<ref>.supabase.co
export SUPABASE_SERVICE_ROLE_KEY=...   # never commit
export OPENAI_API_KEY=...
uv run python scripts/load_supabase.py
```

The script handles the `act_OTHER_*` / `act_other_*` case collision (Postgres folds identifiers to lowercase; the originally-lowercase variant is renamed to `*_lower_*` to match the schema).

4. Set `SUPABASE_URL` + `SUPABASE_SERVICE_ROLE_KEY` on Railway. `/api/rank` will switch to the Supabase RPC automatically.
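How the backend might invoke that RPC — a hedged sketch using `supabase-py`; the parameter names here are hypothetical, and the real signature is whatever the `match_neighborhoods` function in `supabase/migrations/` defines:

```python
import os
from supabase import create_client

client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])

# Hypothetical parameters — check the match_neighborhoods definition in the migrations
# for the actual argument names (query embedding, result count, hard-filter values).
resp = client.rpc(
    "match_neighborhoods",
    {"query_embedding": [0.0] * 1536, "match_count": 5},
).execute()
print(resp.data)  # ranked neighborhoods, cosine similarity computed in Postgres
```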
The frontend is a stand-alone Next.js 14 (App Router, TypeScript, Tailwind) app under frontend/.
```
cd frontend
npm install
cp .env.local.example .env.local   # set NEXT_PUBLIC_API_URL=http://127.0.0.1:8000
npm run dev                        # http://localhost:3000
```

To deploy:
- On vercel.com → Add New Project → Import Git Repository, point at this repo.
- Set the Root Directory to `frontend/` in project settings (otherwise Vercel will try to build from the repo root and fail on the Python files).
- Add the env var `NEXT_PUBLIC_API_URL=https://<your-railway-app>.up.railway.app` (no trailing slash).
- Vercel auto-detects Next.js and uses `npm run build`. Pushes to `main` deploy to production; PR branches get preview URLs.
Once deployed, copy the production Vercel URL back into Railway's FRONTEND_ORIGINS so CORS allows it.
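For reference, a minimal sketch of how `FRONTEND_ORIGINS` could be wired into FastAPI's CORS middleware — the actual wiring in `api/main.py` may differ:

```python
import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Comma-separated env var; falls back to the local Next.js dev server.
origins = [o.strip() for o in os.getenv("FRONTEND_ORIGINS", "http://localhost:3000").split(",") if o.strip()]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,   # only these frontends may call the API from a browser
    allow_methods=["*"],
    allow_headers=["*"],
)
```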
- Cold starts: Railway sleeps idle services on Hobby. First request takes 5–15s to wake. The frontend handles this with loading states; for demos, hit `/api/health` from a cron.
- CDTA boundaries ship as a pre-rendered `data/processed/cdta_geo.json` (~4 MB) so Railway doesn't need geopandas. Regenerate locally with `uv run python scripts/build_cdta_geojson.py` after editing the shapefile (requires geopandas — installed via `requirements-dev.txt`).
- The clustering "Run Analysis" call recomputes K-means on every request. For the 71-CDTA dataset this is sub-second; if you swap in a larger feature table, add server-side caching keyed on `(features, max_k, vintage, boroughs)` (see the caching sketch after these notes).
- The Streamlit app (`app.py`) and the FastAPI app share two layers of common code: (1) `src/` modules (pipeline, embeddings, K-means, agent), and (2) `api/cluster_helpers.py` + `api/rank_helpers.py` + `api/formatting.py` (elbow heuristics, rich cluster descriptions, DuckDB SQL building, label formatting). `streamlit_app/cluster_helpers.py` is a thin wrapper that re-exports those helpers for the Streamlit page so logic does not drift between interfaces — when you touch ranking, clustering, or description text, edit the `api/*_helpers.py` module, not the Streamlit page. The Streamlit-specific `@st.cache_data` import in `src/config.py` is wrapped in try/except so the FastAPI deploy doesn't need streamlit installed.
- Vercel (public dashboard) — production Next.js app: https://nyc-commercial-intelligence.vercel.app/. Stack matches `frontend/`: Next.js 14 (App Router), TypeScript, Tailwind CSS; it calls the FastAPI API via `NEXT_PUBLIC_API_URL` (Railway in production).
- Streamlit Community Cloud — K-Selection / clustering UI: https://nyc-commercial-intelligence.streamlit.app/.