Table of Contents generated with DocToc
- CLAUDE.md — sportsdataverse-py Development Guide
- Package Overview
- Commit Convention
- Branches
- Packaging
- Build & Development Commands
- Project Structure
- ESPN Cross-League Architecture (0.0.51+)
- Parser Layer (0.0.51+)
- Key Coding Conventions
- Module pattern (NEW modules)
- NFL — nflreadpy parity
- NFL — ep_wp model application + EPA/WPA (nflfastR alignment, 0.0.67+)
- CFB — cfb_play_participants and the 0.36-live reconciliation
- CFB — offline reprocess (odds_override, raw allowlist, odds_source) (0.0.52+)
- CFB — rule-era models + decision surfaces (0.0.68)
- MLB — Statcast (Baseball Savant) comprehensive surface (0.0.64+)
- HTTP / retry layer
- Polars version
- Type hints
- Test gating
- ID column types (join keys / player & team IDs)
- Common Pitfalls
- Documentation Maintenance
- Docstring conventions for new functions
- Example notebooks
- Reference-docs build toolchain (codegen)
sportsdataverse-py is the Python sister to the SportsDataverse R packages
(wehoop, hoopR, cfbfastR, cfbfastR-py, etc.) and provides tidy access
to play-by-play, box score, schedule, roster, and other sports data across
multiple leagues (NBA, WNBA, NFL, MLB, NHL, MBB, WBB, CFB, plus odds).
When this guide differs from current repository docs, treat
CONTRIBUTING.md and the current test suite under tests/ as authoritative.
- License: MIT
- Branch: main is the default branch and release branch.
- Python target: 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
- Packaging: uv (PEP 621 [project] + PEP 735 [dependency-groups])
- DataFrame engine: polars 1.x (the 0.36-live branch is a parallel pandas-based line of development; see "Branches" below)
Use Conventional Commits:
feat(wbb): add espn_wbb_team_roster() season-level scraper
fix(cfb): correct kneel-down classification in cfb_pbp parser
docs(contributing): document uv workflow and skip_if_no_live gate
test(wnba): add live smoke tests for player_stats canonical categories
refactor(dl_utils): rewrite download() retry as iterative + raise on exhaustion
chore(deps): bump polars to >=1.0,<2.0 + re-lock
ci(actions): add py3.14 to the test matrix
Prefer scoped commit subjects when useful (e.g., feat(wbb): ...,
fix(cfb): ...). Use type!: or a BREAKING CHANGE: footer for breaking
changes. Split unrelated work into separate commits for reviewability.
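A subject line in this convention can be checked mechanically. The sketch below is illustrative only: the accepted type list is inferred from the examples above, and the regex and the helper name `is_conventional_subject` are assumptions, not part of the repository's tooling.

```python
import re

# Hypothetical type list, taken from the examples above; the repository
# may accept more (or fewer) types than this.
_TYPES = ("feat", "fix", "docs", "test", "refactor", "chore", "ci")

# type, optional (scope), optional "!" for breaking changes, then ": subject"
_SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<subject>\S.*)$"
)


def is_conventional_subject(line: str) -> bool:
    """Return True when `line` looks like a Conventional Commits subject."""
    return _SUBJECT_RE.match(line) is not None
```

A check like this only validates the subject line; a BREAKING CHANGE: footer lives in the commit body and would need separate handling.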
Important: Never include AI agents or assistants (e.g., Claude, Copilot,
Cursor, GPT, Gemini) as co-authors on commits. Omit all Co-Authored-By
trailers referencing AI tools. This applies whether the change was
generated, refactored, or reviewed with AI assistance — the human author
is the sole attributable contributor.
- main: default; uses polars 1.x end-to-end. Recently migrated from polars 0.18 → 1.x and converted to uv-based packaging (May 2026).
- 0.36-live: parallel pandas-based line of development. Carries CFB PBP bug fixes (kneel-down handling, half-edge cases, turnover detection, WP cases, punt/yardage parsing, etc.) that have not yet been ported into the polars main branch. When porting fixes from 0.36-live, translate pandas idioms (np.select, df.loc[...], df.assign(...)) into polars (pl.when().then().otherwise(), with_columns(...)). The function-by-function reconciliation notes live in dev/ (untracked).
All packaging metadata lives in pyproject.toml (PEP 621 [project] table)
— there is no setup.py. The build path is PEP 517:
python -m build # produces sdist + wheel into dist/

setuptools is the build backend (build-system.build-backend = "setuptools.build_meta"). Runtime deps live under [project.dependencies],
extras under [project.optional-dependencies] (tests, docs, models,
all). Package data ships via [tool.setuptools.package-data] (currently
cfb/models/*, nfl/models/*, and py.typed). The [tool.setuptools.packages.find]
block excludes tests*, Sphinx-docs*, docs*, examples*, archive*,
recipe*, dev* from the wheel.
recipe/meta.yaml provides a noarch: python conda-build recipe that
mirrors [project.dependencies]. Local conda build recipe/ works today;
conda-forge feedstock submission uses the PyPI-pinned url: + sha256:
mode documented in recipe/README.md. CI verifies the recipe on every PR
that touches recipe/ or pyproject.toml via
.github/workflows/conda-build.yml.
This project uses uv — see CONTRIBUTING.md
for installation. Common commands from the repo root:
# Sync deps (creates .venv on first run)
uv sync --all-extras --dev
# Run the test suite (gated tests skip without SDV_PY_LIVE_TESTS=1)
uv run pytest
# Run live tests (hits real APIs)
SDV_PY_LIVE_TESTS=1 uv run pytest
# Type-check the strict-listed modules
uv run mypy sportsdataverse/<your_module>.py
# Lint
uv run ruff check sportsdataverse/
# Build wheel + sdist
uv build
# Bump a dependency
uv add some-package # runtime
uv add --dev some-package # dev-only

uv.lock is committed for reproducible installs across contributors.
sportsdataverse/
cfb/ # College football (heaviest PBP module)
cfb_pbp.py # CFBPlayProcess; __add_player_cols delegates to cfb_play_participants
cfb_play_participants.py # ESPN per-play participants -> {type}_player_name/_id pivot
cfb_loaders.py, cfb_schedule.py, cfb_teams.py, cfb_game_rosters.py, models/
mbb/ # Men's college basketball
mlb/ # MLB (mlbam endpoints + retrosheet/retrosplits)
nba/ # NBA
nfl/ # NFL — nflreadpy-parity surface
nfl_loaders.py # 24 canonical load_nfl_* + 11 deprecated per-type aliases
nfl_pbp.py, nfl_schedule.py, nfl_teams.py, nfl_games.py, nfl_game_rosters.py
cache.py # @cached_loader, memory/filesystem/off, clear_cache()
config.py # NflConfig dataclass + get_config / update_config / reset_config
datasets.py # team_abbr_mapping, team_abbr_mapping_norelocate, player_name_mapping
utils_date.py # get_current_nfl_season(), get_current_nfl_week()
nhl/ # NHL
wbb/ # Women's college basketball
wbb_pbp.py, wbb_game_rosters.py, wbb_schedule.py, wbb_loaders.py, wbb_teams.py
wbb_team_roster.py # single-table espn_wbb_team_roster()
wbb_player_stats.py # multi-table dict[str, pl.DataFrame]
wnba/ # WNBA
wnba_team_roster.py # thin shim over wbb helper, league="wnba"
wnba_player_stats.py # same shape as wbb_player_stats
odds/ # Odds & betting lines
dl_utils.py # download() retry + janitor + (under|kebab|camel)ize helpers
errors.py # NoESPNDataError, SeasonNotFoundError
config.py # Per-sport URL constants pointing at sportsdataverse-data releases
__init__.py
tests/
conftest.py # skip_if_no_live decorator
cfb/, mbb/, mlb/, nba/, nfl/, nhl/, wbb/, wnba/ # one subdir per source pkg
docs/ # Docusaurus site (don't pollute with internal docs)
docs_instructions.md # Reference for the docs build workflow
dev/ # Local-only working notes (gitignored)
recipe/ # Conda-build recipe (meta.yaml + README)
pyproject.toml # PEP 621 metadata + tooling (ruff lint+format, mypy)
pytest.ini # filterwarnings for env-level pkg_resources / nspkg.pth noise
uv.lock # Committed
CONTRIBUTING.md # uv workflow + new-module standards
The cross-league ESPN wrapper surface lives in
sportsdataverse/_common_espn.py. The pattern is one core +
N thin extensions:
- _common_espn.py: ~80 core functions parameterized on (sport, league) slugs. Every ESPN URL family is wrapped once (Site v2 / Site v2 alt / Web v3 / Core v2 / Core v3 / CDN).
- _UNIVERSAL_WRAPPERS: list of (short_name, core_fn) tuples that map to wrapper functions on every league.
- _NCAA_WRAPPERS / _FOOTBALL_WRAPPERS / _MLB_WRAPPERS: opt-in extras gated by include_ncaa= / include_football= / include_mlb= flags on make_league_module().
- _bind(core_fn, sport, league, full_name, parser=None): wraps each core function with a functools.partial (when no parser registered) or a closure (when a parser is registered) that adds __name__ / __qualname__ / __doc__ for IDE introspection and optionally accepts return_parsed=True / return_as_pandas=True kwargs.
- make_league_module(sport, league, prefix, namespace, ...): iterates the wrapper tables and registers each one in namespace with the canonical espn_{prefix}_{short} name.
Per-league extension modules (sportsdataverse/{league}/{league}_espn_ext.py)
are 4-line files: import make_league_module, call it with the
appropriate (sport, league, prefix) tuple + extras flags, assign
the return value to __all__. Examples:
# sportsdataverse/nba/nba_espn_ext.py
__all__ = make_league_module("basketball", "nba", "nba", globals())
# sportsdataverse/cfb/cfb_espn_ext.py — both NCAA + football extras
__all__ = make_league_module(
"football", "college-football", "cfb", globals(),
include_ncaa=True, include_football=True,
)

Total cross-league surface: 121 short names registered across 8 leagues = 819 wrappers.
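The "one core + N thin extensions" pattern can be sketched in a few lines. This is a simplified model, not the code in _common_espn.py: the two core functions are hypothetical stand-ins that return marker dicts instead of issuing HTTP calls, and the extras flags and return_as_pandas handling are omitted.

```python
import functools
from typing import Any, Callable, Dict, List, Optional, Tuple


# Hypothetical stand-ins for core functions; the real ones call ESPN.
def _core_scoreboard(sport: str, league: str, **params: Any) -> Dict[str, Any]:
    return {"endpoint": "scoreboard", "sport": sport, "league": league, **params}


def _core_teams(sport: str, league: str, **params: Any) -> Dict[str, Any]:
    return {"endpoint": "teams", "sport": sport, "league": league, **params}


_UNIVERSAL_WRAPPERS: List[Tuple[str, Callable[..., Dict[str, Any]]]] = [
    ("scoreboard", _core_scoreboard),
    ("teams", _core_teams),
]


def _bind(core_fn, sport, league, full_name, parser: Optional[Callable] = None):
    """Fix (sport, league) on a core function and give it a readable name."""
    if parser is None:
        bound = functools.partial(core_fn, sport, league)
    else:
        def bound(*args, return_parsed=False, **kwargs):
            payload = core_fn(sport, league, *args, **kwargs)
            return parser(payload) if return_parsed else payload
    bound.__name__ = full_name
    bound.__qualname__ = full_name
    return bound


def make_league_module(sport, league, prefix, namespace) -> List[str]:
    """Register espn_{prefix}_{short} wrappers in `namespace`; return their names."""
    names = []
    for short, core_fn in _UNIVERSAL_WRAPPERS:
        full_name = f"espn_{prefix}_{short}"
        namespace[full_name] = _bind(core_fn, sport, league, full_name)
        names.append(full_name)
    return names


ns: Dict[str, Any] = {}
exported = make_league_module("basketball", "nba", "nba", ns)

# A parser-backed wrapper stays raw unless return_parsed=True is passed.
parsed_teams = _bind(_core_teams, "football", "nfl", "espn_nfl_teams", parser=sorted)
```

Returning the list of registered names is what lets an extension module assign the result straight to `__all__`.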
Every wrapper returns raw Dict by default. The parser layer turns
those payloads into tidy polars / pandas DataFrames. Six parser
modules, one per data surface:
| Module | Surface | Parsers |
|---|---|---|
| _common_espn_parsers.py | ESPN cross-league (Site v2 + Core v2 + Web v3) | 30+ dedicated + 3 generic + 21-section summary dispatcher |
| nhl/nhl_api_web_parsers.py | api-web.nhle.com/v1/ modern game-feed | 16 dedicated + 2 dispatchers (right_rail, club_stats) |
| nhl/nhl_edge_parsers.py | api-web.nhle.com/v1/edge/* player tracking | 4 family + 3 sub-frame + 1 fallback |
| nhl/nhl_stats_rest_parsers.py + nhl_records_parsers.py | api.nhle.com/stats/rest + records.nhl.com | 1 generic each (shared {data: [...]} shape) |
| mlb/mlb_api_parsers.py | statsapi.mlb.com Stats API | 5 dedicated + 1 generic |
| nfl/nfl_api_parsers.py | api.nfl.com "Shield" data API | 11 dedicated (one per nfl_api endpoint) |
Parser contract (universal across all 6 modules):
- Return polars.DataFrame by default; pandas via return_as_pandas=True.
- Empty / malformed payloads return a zero-row frame instead of raising, so callers can chain without null-checks.
- Output columns are snake-cased via sportsdataverse.dl_utils.underscore.
- Use pandas.json_normalize for nested flattening, then convert to polars at the end. List-valued cells are stringified so polars accepts the frame.
ESPN cross-league wrappers whose short name is registered in
ENDPOINT_PARSERS accept an optional return_parsed=True kwarg
that routes the raw payload through the registered parser:
from sportsdataverse.nba import espn_nba_team_roster
raw = espn_nba_team_roster(team_id=13) # → Dict
df = espn_nba_team_roster(team_id=13, return_parsed=True) # → polars
pdf = espn_nba_team_roster(team_id=13, return_parsed=True,
return_as_pandas=True) # → pandas

The shim is strictly additive: every existing caller continues
to get raw Dict when the kwarg is omitted. NHL / MLB sibling-API
wrappers compose with their parser explicitly:
from sportsdataverse.nhl import nhl_web_pbp, parse_nhl_web_pbp
df = parse_nhl_web_pbp(nhl_web_pbp(2023030417)) # 331 plays

ENDPOINT_PARSERS invariant: every wrapper short name across
_UNIVERSAL_WRAPPERS + _NCAA_WRAPPERS + _FOOTBALL_WRAPPERS +
_MLB_WRAPPERS is in the registry. Three generic fall-throughs
cover the long tail:
- parse_single_entity: Core v2 single-resource payloads (team, venue, franchise, coach, etc.).
- parse_items: Core v2 paginated {items: [...]} and the Core v2 {entries: [...]} variant (athlete_statisticslog).
- parse_summary: Site v2 summary dispatcher (21 sub-frames per game).
Three regression tests in tests/test_espn_universal_parsers.py
lock in the 121/121 coverage invariant + the shim invariant. Any
new wrapper short name added without a matching ENDPOINT_PARSERS
entry fails CI.
parse_summary(payload, section=None) is the dispatcher for the
rich Site v2 summary payload (~700KB–1.8MB per game). With
section=None returns a dict of all 21 sub-frames keyed by section
name; with section="<name>" returns just that one frame.
Section list (current 21): boxscore_player, boxscore_team,
plays, winprobability, leaders, game_info, officials,
header, season_series, against_the_spread, standings,
broadcasts, format, pickcenter, odds, article, injuries,
news, drives, drive_plays, scoring_plays.
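The dispatcher shape described above (all sections as a dict, or one named section) can be sketched as follows. This is a toy model with two sections only: lists of dicts stand in for polars frames, the payload keys are assumptions about the ESPN summary layout, and raising ValueError on an unknown section is a choice made for the sketch, not confirmed behavior of parse_summary.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Rows = List[Dict[str, Any]]  # stand-in for a polars DataFrame in this sketch


def _parse_plays(payload: Dict[str, Any]) -> Rows:
    # Missing or null "plays" yields zero rows rather than an error.
    return [{"play_id": p.get("id"), "text": p.get("text")} for p in payload.get("plays") or []]


def _parse_officials(payload: Dict[str, Any]) -> Rows:
    info = payload.get("gameInfo") or {}
    return [{"official": o.get("displayName")} for o in info.get("officials") or []]


# Section name -> parser. The real dispatcher registers 21 of these.
_SECTION_PARSERS: Dict[str, Callable[[Dict[str, Any]], Rows]] = {
    "plays": _parse_plays,
    "officials": _parse_officials,
}


def parse_summary_sketch(
    payload: Dict[str, Any], section: Optional[str] = None
) -> Union[Rows, Dict[str, Rows]]:
    """Return every section when `section` is None, else just the named one."""
    if section is None:
        return {name: fn(payload) for name, fn in _SECTION_PARSERS.items()}
    if section not in _SECTION_PARSERS:
        raise ValueError(f"unknown section: {section!r}")
    return _SECTION_PARSERS[section](payload)


sample = {"plays": [{"id": "1", "text": "Tipoff"}]}
all_sections = parse_summary_sketch(sample)
plays_only = parse_summary_sketch(sample, section="plays")
```

Because each section parser tolerates a missing key, a league that does not publish a section (for example per-play win probability for NHL) still produces an empty result under that name.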
Cross-league shape divergences captured by tests:
- NFL + CFB ship drives.previous[] + scoringPlays instead of top-level plays[]. parse_summary_drive_plays unrolls drive plays into a long-form frame with drive_id + drive_sequence join keys for football PBP parity.
- NHL doesn't publish per-play winprobability.
- pickcenter / odds / against_the_spread are sparse in past-game captures (live games typically populate them).
- NCAA W basketball officials sometimes ships < 3 rows; CFB national championship shipped 0 officials.
Captured fixtures live under tests/fixtures/{espn,mlb_api,nhl_api_web, nhl_edge,nhl_stats_rest,nhl_records}/. Each directory has a
README.md documenting provenance (URL + capture date). See
docs/docs/parsers/fixtures.md for the full inventory.
Adding a new parser → drop a fixture in the right directory + add
a test in the matching test_*_parsers.py file. The parser tests
are payload-agnostic so re-captured fixtures continue to work as
long as the schema doesn't drift; when it does, the weekly cron
drift detector (.github/workflows/live-tests-cron.yml) catches it
and opens a tracking issue labeled live-tests:drift.
| Test file | Count | Surface |
|---|---|---|
| tests/test_espn_universal_parsers.py | 128 | ESPN cross-league + summary dispatcher |
| tests/test_nhl_api_web_parsers.py | 37 | NHL api-web modern game-feed |
| tests/test_nhl_edge_parsers.py | 32 | NHL EDGE player-tracking |
| tests/test_nhl_aux_parsers.py | 21 | NHL Stats REST + Records |
| tests/test_mlb_api_parsers.py | 17 | MLB Stats API |
| Offline parser tests total | 235 | |
| tests/test_espn_live.py | 41 | Live API integration (gated by SDV_PY_LIVE_TESTS=1) |
Each new ESPN scrape module follows the worked-example shape established by
sportsdataverse/wbb/wbb_team_roster.py (single-table return) and
wbb_player_stats.py (multi-table dict[str, pl.DataFrame] return). The
wnba_team_roster.py / wnba_player_stats.py pair are thin shims over the
shared basketball helper.
- Public function espn_<league>_<dataset>(primary_id, ..., *, raw=False, return_as_pandas=False, **kwargs).
- @overload chain to type-narrow return based on raw / return_as_pandas flags.
- Shared private helper _espn_basketball_<dataset>(league, ...) keeps the wbb/wnba pair DRY: the wnba module is a thin wrapper that imports the helper and fixes the league slug.
- Returns pl.DataFrame for single-table endpoints, dict[str, pl.DataFrame] for multi-table endpoints, or dict if raw=True.
- Multi-table returns key on canonical category names (Averages, Totals, Misc for player stats), with an Other fallback bucket added only when ESPN ships a non-canonical category name. Empty frames carry the documented schema so callers always see a stable column set.
- Snake-case columns via sportsdataverse.dl_utils.underscore.
- Append the new module's path to the [tool.mypy] files = [...] ratchet in pyproject.toml once it types cleanly. That list scopes which modules the gate checks (with follow_imports = "skip"); do NOT switch to a whole-package [[tool.mypy.overrides]] model, because the legacy surface isn't typed yet and would make the gate permanently red.
The NFL submodule is a near drop-in replacement for nflreadpy.
The canonical sdv-py names use the load_nfl_* prefix (cross-sport
disambiguation under the umbrella sportsdataverse package); inside
sportsdataverse.nfl itself we additionally export 25 nflreadpy-style
aliases without the prefix. load_nfl_espn_qbr (0.0.68) loads ESPN Total
QBR (nflreadpy load_espn_qbr parity) and brings the canonical count to 24:
import sportsdataverse.nfl as nfl
pbp = nfl.load_pbp([2024]) # alias -> load_nfl_pbp
schedules = nfl.load_schedules([2024]) # alias -> load_nfl_schedule
ngs = nfl.load_nextgen_stats(stat_type="passing")
adv = nfl.load_pfr_advstats(stat_type="pass", summary_level="season")
nfl.clear_cache()

The aliases are NOT re-exported at the top-level sportsdataverse package
on purpose — only inside sportsdataverse.nfl. New nflreadpy-parity
loaders should follow that same scoping rule.
Unified loaders: load_nfl_nextgen_stats(stat_type=) replaces three
per-type variants (load_nfl_ngs_passing / _rushing / _receiving) and
load_nfl_pfr_advstats(stat_type=, summary_level=) replaces eight
per-type/per-summary variants. The legacy per-type wrappers still exist as
thin shims that emit DeprecationWarning and dispatch to the unified
function. Don't add new per-type wrappers; extend the unified function.
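The shim pattern for a legacy per-type wrapper looks roughly like this. The function names carry a `_sketch` suffix to make clear they are stand-ins, the unified loader returns a marker dict instead of a frame, and the accepted stat_type values are inferred from the legacy names above.

```python
import warnings
from typing import Any, Dict


def load_nextgen_stats_sketch(stat_type: str = "passing") -> Dict[str, Any]:
    """Stand-in for the unified loader; returns a marker dict, not a frame."""
    if stat_type not in {"passing", "rushing", "receiving"}:
        raise ValueError(f"unsupported stat_type: {stat_type!r}")
    return {"stat_type": stat_type}


def load_ngs_passing_sketch() -> Dict[str, Any]:
    """Legacy per-type wrapper: warn, then dispatch to the unified loader."""
    warnings.warn(
        "load_ngs_passing_sketch() is deprecated; "
        "use load_nextgen_stats_sketch(stat_type='passing')",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the shim
    )
    return load_nextgen_stats_sketch(stat_type="passing")


# Record warnings so the deprecation is observable in this example.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_ngs_passing_sketch()
```

Keeping the shim to a warning plus a single dispatch call means new behavior only ever needs to be added to the unified function.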
load_nfl_ff_rankings: accepts both kind= (preferred) and type=
(nflreadpy's name; kept for parity). type shadows the builtin so the
codebase prefers kind internally.
Caching layer — sportsdataverse/nfl/cache.py +
config.py. Three modes selected via NflConfig.cache_mode:
| Mode | Storage | TTL |
|---|---|---|
| memory (default) | per-process dict | cache_duration seconds |
| filesystem | parquet under cache_dir | cache_duration seconds |
| off | no caching | n/a |
All 24 canonical loaders + 11 deprecated aliases are wrapped with
@cached_loader. The cache key hashes (qualified_name, args, sorted_kwargs)
and excludes return_as_pandas so a single stored polars frame serves
both polars and pandas callers (the conversion happens on read).
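A key function with the properties described above might look like the following. This is a sketch under assumptions: the guide states which inputs are hashed but not the hashing scheme, so the use of JSON serialization and SHA-256 here is illustrative, and `cache_key_sketch` is a hypothetical name.

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def cache_key_sketch(qualified_name: str, args: Tuple[Any, ...], kwargs: Dict[str, Any]) -> str:
    """Hash (qualified_name, args, sorted_kwargs), ignoring return_as_pandas."""
    # Drop the output-format flag so polars and pandas callers share one entry.
    key_kwargs = sorted((k, v) for k, v in kwargs.items() if k != "return_as_pandas")
    raw = json.dumps([qualified_name, list(args), key_kwargs], default=repr)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


polars_key = cache_key_sketch("nfl.load_nfl_pbp", ([2024],), {})
pandas_key = cache_key_sketch("nfl.load_nfl_pbp", ([2024],), {"return_as_pandas": True})
other_season_key = cache_key_sketch("nfl.load_nfl_pbp", ([2023],), {})
```

Nothing in the key depends on the URL or the function body, which is why changing a loader's resource path without renaming the function leaves callers on the old cached entry until clear_cache() is called.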
Env-var initialization (precedence: explicit update_config() > env > default):
| Env var | Effect |
|---|---|
| SDV_PY_NFL_CACHE | memory / filesystem / off |
| SDV_PY_NFL_CACHE_DIR | filesystem cache directory |
| SDV_PY_NFL_CACHE_DURATION | TTL in seconds |
| SDV_PY_NFL_VERBOSE | progress chatter on/off |
| SDV_PY_NFL_TIMEOUT | HTTP timeout in seconds |
| SDV_PY_NFL_USER_AGENT | custom UA string |
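The precedence rule (explicit update_config() over environment over default) can be modeled as two layering steps. The class and function names below are hypothetical stand-ins for NflConfig and its helpers, only two of the six variables are handled, and the cache_duration default is left as None because this guide does not state the real value.

```python
from dataclasses import dataclass, replace
from typing import Mapping, Optional


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for NflConfig with just two fields."""

    cache_mode: str = "memory"
    cache_duration: Optional[int] = None  # real default is not stated in this guide


def config_from_env(env: Mapping[str, str]) -> ConfigSketch:
    """Layer environment values over the defaults."""
    cfg = ConfigSketch()
    if "SDV_PY_NFL_CACHE" in env:
        cfg = replace(cfg, cache_mode=env["SDV_PY_NFL_CACHE"])
    if "SDV_PY_NFL_CACHE_DURATION" in env:
        cfg = replace(cfg, cache_duration=int(env["SDV_PY_NFL_CACHE_DURATION"]))
    return cfg


def update_config_sketch(cfg: ConfigSketch, **overrides: object) -> ConfigSketch:
    """Explicit overrides win over whatever the environment supplied."""
    return replace(cfg, **overrides)


env_cfg = config_from_env(
    {"SDV_PY_NFL_CACHE": "filesystem", "SDV_PY_NFL_CACHE_DURATION": "3600"}
)
final_cfg = update_config_sketch(env_cfg, cache_mode="off")
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the layering easy to test.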
Programmatic access:
from sportsdataverse.nfl import get_config, update_config, reset_config, clear_cache
update_config(cache_mode="filesystem", cache_duration=3600)
clear_cache() # also wipes both memory + filesystem

Static datasets: sportsdataverse/nfl/datasets.py exports
three module-level dicts: team_abbr_mapping (relocations folded;
OAK -> LV, SD -> LAC, STL -> LA), team_abbr_mapping_norelocate
(historical identity preserved), and player_name_mapping. They're
inline-bundled (not separate JSON files) because the
[tool.setuptools.package-data] block in pyproject.toml only ships
cfb/models/*, nfl/models/*, and py.typed. Refresh procedure is documented in the
datasets.py module docstring.
Date helpers — utils_date.get_current_nfl_season() and
get_current_nfl_week() (also aliased as get_current_season /
get_current_week inside sportsdataverse.nfl).
When in doubt about the upstream API surface, check
gh repo view nflverse/nflreadpy — sdv-py mirrors nflreadpy's signatures
where practical.
sportsdataverse/nfl/ep_wp.py is the single owner of NFL model application
and EPA/WPA derivation. The canonical rule: construction modules
(nfl_pbp.py / native_pbp / load_nfl_pbp) must never re-add EPA/WPA inline —
they emit a frame and ep_wp applies the models. EPA/WPA logic lives in
exactly one place.
- Scorers mirror nflfastR's calculate_*(): calculate_expected_points (single start-of-play ep + 7 class probs), calculate_win_probability (wp naive + vegas_wp spread), calculate_completion_probability (cp + cpoe, percentage-point scale 100*(complete_pass-cp)), calculate_xyac. Outputs are Float64 (cast explicitly: the models emit float32; do NOT let a pl.Series(numpy_f32) silently downcast the public columns).
- Derivations calculate_epa(df) / calculate_wpa(df) were lifted verbatim from NFLPlayProcess.__process_epa / __process_wpa (scoring overlays, half-end-ep, penalty EP_between, kickoff touchback, turnover/onside, OT two-path, posteam→home flip). Every shift/lead is .over("game_id"), so there is no cross-game leak when frames are concatenated.
- enrich_nfl_pbp(df, *, method=...) orchestrates EP→EPA→WP→WPA→CP→CPOE→xYAC in nflfastR order. method="lead_diff" (default, shipped + parity-validated) is a nflverse-native faithful port of nflfastR helper_add_ep_wp.R: scores one ep, derives the rest natively on nflverse columns, applies the kickoff/PAT feature substitution (touchback yardline TOUCHBACK_YARDLINE_PRE/POST_2016 = 80 pre-2016 / 75 from 2016, down→1, ydstogo→10; this is the parity lever), and exposes ep as start-of-play EP. method="snapshot" remains NotImplementedError: it was the intended vehicle for a lead_diff-vs-snapshot cross-era comparison, which was instead validated directly; the comparison confirmed correctness without needing a second live path, so "snapshot" is intentionally left unimplemented.
- NFLPlayProcess.__process_epa / __process_wpa now delegate their derivation to the shared calculate_epa / calculate_wpa, so the ESPN construction path and the nflverse lead_diff path share one derivation engine (byte-identical output verified). There is no inline duplicate.
- fixed_drive / series columns: nflfastR helper_add_fixed_drives.R + helper_add_series_data.R are ported into the ESPN NFLPlayProcess construction path (including lag-2/3 timeout-interleave and onside-recovery handling); they are additive columns appended during run_processing_pipeline().
- build_nfl_season(game_ids, *, source=...): season-compile helper that iterates game IDs, calls construct→enrich→appends, joins via diagonal_relaxed, and caches each game's enriched parquet keyed by (game_id, PIPELINE_VERSION) reusing nfl/cache.py.
- Constants are centralized in model_vars.py: NFLVERSE_FRAME_CONTRACT, _EP_POINT_VALUES, ERA_SEASON_CUTS (cuts 2001/2005/2013/2017), TOUCHBACK_YARDLINE_PRE/POST_2016, SPREAD_TIME_DECAY_EXPONENT (-4.0). receive_2h_ko is derived in _add_wp_aux when absent (per game: 1st-half posteam == opening defense).
- Models nfl/models/*.ubj are the faithful nfl_model_artifacts (EP 18-feat / wp_spread 12 / wp_naive 11 / cp 18) from sportsdataverse-data, not the old CFB-shape placeholders. Refresh by downloading that release and verifying Booster.feature_names == *_FEATURES. As of 0.0.68 the bundle also ships fg_model, qbr_model, two_pt_model, the self-derived xpass_model (offline, with no first-use download), and punt_data.parquet; the fourth-down decision surface lives in nfl/nfl_fourth_down.py.
- Parity (lead_diff vs nflverse, model domain): ep 0.996, epa 0.994, wp 0.997, vegas_wp 0.998, cpoe scale-correct; wpa ≈0.89 is an SNR ceiling (the derivation is exact, with corr 1.0 when fed nflverse's own wp; the residual is WP-model per-play noise amplified by first-differencing, not a bug). The play_level EP/WP/CP recipe + this surface are validated against the nflfastR source in the workspace (nflverse-dev/nflfastR/R/helper_add_ep_wp.R).
sportsdataverse/cfb/cfb_play_participants.py replaced 471 lines of regex
inside cfb_pbp.CFBPlayProcess.__add_player_cols with a 130-line
endpoint-delegated extractor. __add_player_cols now delegates to the
participants module and only runs a narrow regex fallback for ESPN
sidecar gaps (sack_player_name2, fg_block_player_name,
punt_block_player_name, interception_player_name).
Three-tier resolution chain, in order:
- ESPN's per-play participants[] array (the authoritative source).
- cdn.espn.com/.../playbyplay sidecar playerHash for display names (one round trip per game).
- $ref resolution for athletes the sidecar omits (~6 per game on average: split sacks where the second sacker isn't on the leaders list, returners on lateral plays, etc.). Default resolve_missing=True, capped at 50 fetches/game (resolve_missing_max=50) so a pathological game can't run away. Set resolve_missing=False to disable.
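The name-resolution part of the chain (sidecar first, then capped $ref fetches) can be sketched as below. The function name and the injected `fetch_ref` callable are hypothetical; the real module's signatures are not shown in this guide, and only the two keyword names and the cap of 50 come from the text above.

```python
from typing import Callable, Dict, Iterable, Optional


def resolve_names_sketch(
    athlete_ids: Iterable[str],
    sidecar_names: Dict[str, str],
    fetch_ref: Callable[[str], Optional[str]],
    resolve_missing: bool = True,
    resolve_missing_max: int = 50,
) -> Dict[str, Optional[str]]:
    """Map athlete ids to display names, falling back to capped per-athlete fetches."""
    resolved: Dict[str, Optional[str]] = {}
    fetches = 0
    for athlete_id in athlete_ids:
        # Tier 2: the per-game sidecar hash costs no extra round trips.
        name = sidecar_names.get(athlete_id)
        # Tier 3: one fetch per missing athlete, bounded by the cap.
        if name is None and resolve_missing and fetches < resolve_missing_max:
            fetches += 1
            name = fetch_ref(athlete_id)
        resolved[athlete_id] = name
    return resolved


fetched = []


def fake_fetch(athlete_id: str) -> str:
    """Test double standing in for the $ref HTTP call."""
    fetched.append(athlete_id)
    return f"Athlete {athlete_id}"


capped = resolve_names_sketch(["1", "2", "3"], {"1": "A. Smith"}, fake_fetch, resolve_missing_max=1)
disabled = resolve_names_sketch(["2"], {}, fake_fetch, resolve_missing=False)
```

Once the cap is reached, remaining gaps are left as None rather than raising, so a pathological game degrades to missing names instead of unbounded network traffic.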
Hybrid scalar + list output: per-play columns are emitted as both
{type}_player_name (scalar — the first / primary participant of that
type) and {type}_player_names (list — all participants of that type
on the play). Multi-entry types like split sacks are no longer silently
collapsed to a single name.
The "0.36-live → main reconciliation" landed in May 2026 and ported
~17 commits' worth of pandas-side CFB pbp bug fixes into the polars
main branch. Coverage includes yardage parsing, kneel-down handling,
half-edge cases, end-of-game WP, penalty-assessed-on-kickoff, plus the
full participants-module extraction described above.
CFBPlayProcess supports rebuilding a game's enriched output from on-disk raw JSON
without re-hitting ESPN — the contract the cfbfastR-cfb-raw scraper's reprocess
pipeline relies on. Three additive pieces:
- Raw allowlist keeps injuries + gameNotes. espn_cfb_pbp(raw=True) filters the summary to an allowlist (incoming_keys_expected in cfb_pbp.py); injuries and gameNotes are retained (default [] when absent). When adding a summary key the pipeline should preserve, add it to that list.
- odds_source provenance. __helper_cfb_pickcenter tags self.odds_source as "summary_pickcenter" | "core_odds_api" | "default" | "injected", and the value is written into the returned payload (pbp_txt["odds_source"]), not just the instance attribute, so dict consumers retain provenance.
- odds_override constructor arg. The spread/OU/homeFavorite are EPA/WPA inputs, not passthroughs. For 2024+ games the summary pickcenter is empty and the helper otherwise cascades to the live sports.core.api.espn.com odds endpoint (defaulting to (2.5, 55.5, True, False) on failure). Passing CFBPlayProcess(odds_override={...}) with keys gameSpread / overUnder / homeFavorite / gameSpreadAvailable short-circuits resolution to those values (sets odds_source="injected"), so offline rebuilds never touch the network or inherit defaults. The override is validated + type-coerced in __init__ (missing key / non-dict → ValueError). Default None = unchanged behavior.
Offline-rebuild pattern: CFBPlayProcess(gameId, path_to_json=raw_dir, odds_override=<persisted>).cfb_pbp_disk() then .run_processing_pipeline().
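The validate-and-coerce step for odds_override could look like this. It is a sketch, not the code in CFBPlayProcess.__init__: the four key names and the ValueError behavior come from the description above, while the target types (float for the two numeric keys, bool for the two flags) are inferred from the documented default tuple.

```python
from typing import Any, Dict, Optional

_REQUIRED_KEYS = ("gameSpread", "overUnder", "homeFavorite", "gameSpreadAvailable")


def validate_odds_override(override: Optional[Any]) -> Optional[Dict[str, Any]]:
    """Return a type-coerced copy of `override`, or None when no override is given."""
    if override is None:
        return None  # default: fall through to the normal resolution cascade
    if not isinstance(override, dict):
        raise ValueError("odds_override must be a dict or None")
    missing = [key for key in _REQUIRED_KEYS if key not in override]
    if missing:
        raise ValueError(f"odds_override is missing keys: {missing}")
    # Assumed target types; persisted JSON may carry numbers as strings or ints.
    return {
        "gameSpread": float(override["gameSpread"]),
        "overUnder": float(override["overUnder"]),
        "homeFavorite": bool(override["homeFavorite"]),
        "gameSpreadAvailable": bool(override["gameSpreadAvailable"]),
    }


clean = validate_odds_override(
    {"gameSpread": "3.5", "overUnder": 51, "homeFavorite": 1, "gameSpreadAvailable": True}
)

try:
    validate_odds_override({"gameSpread": 3.5})
    missing_key_error = None
except ValueError as exc:
    missing_key_error = exc
```

Failing loudly on a partial override matters here because the values feed EPA/WPA: silently falling back to defaults would change model outputs without any visible error.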
cfb/models/ bundles rule-era XGBoost artifacts trained per the CFB Modeling
Suite: ep_model, wp_naive, wp_spread, cfb_cp_model, plus the 0.0.68
additions qbr_model, fg_model, fd_model (fourth-down), two_pt_model,
xpass_model, and punt_distribution.parquet. The fourth-down / FG / 2pt
decision surfaces are integrated default-on. The spread_time sign fix
landed alongside these (commit fbe11c4, #129).
sportsdataverse/mlb/mlb_statcast*.py wraps the full ~43-endpoint Baseball
Savant surface (baseballsavant.mlb.com) under the canonical naming
mlb_statcast_<family>_<name> (families = search / leaderboard /
gamefeed / player). Every endpoint returns a tidy frame by default
(return_parsed=False / raw=True for the raw payload). The old released
statcast_* names were renamed (no aliases) — don't reintroduce them.
- Codegen owns the leaderboards + gamefeed + schedule (mlb_statcast.py, generated from tools/codegen/endpoints/mlb_statcast.yaml). Savant is heterogeneous (CSV / JSON / HTML), so the YAML sets getter_module: sportsdataverse.mlb.mlb_statcast_runtime, a smart _get that returns dict for JSON bodies and str for CSV/HTML. The shared _codegen_runtime._get is JSON-only and would silently return {} for every CSV leaderboard; use the statcast runtime for any new Savant flat endpoint.
- Two leaderboards (fielding-run-value, statcast-park-factors) return HTML even with csv=true; their rows live in an embedded const data=[...] script blob, parsed by parse_mlb_statcast_html_leaderboard. All other leaderboards are CSV (parse_mlb_statcast_leaderboard).
- Hand-written (mlb_statcast_extra.py): the 25,000-row date-chunked search (mlb_statcast_search + _minors + _wbc, distinct /csv routes) with a friendly→Savant filter translation (_translate_filters: season, pitch_type, at_bat_result, batters_lookup, … → hfSea, hfPT, hfAB, batters_lookup[]; unknown keys pass through). mlb_statcast_player parses the page's serverVals[section] (default "statcast") to a frame.
- Returns-schemas for every frame function live in tools/codegen/schemas/native/mlb_statcast/*.yaml (generated) + schemas/autodoc/mlb/mlb_statcast_*.yaml (hand-written); column names match the parser's snake-cased output exactly.
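Two ideas from the list above can be sketched without touching the network: choosing the return type from the content type, and pulling rows out of an embedded script blob. Both helpers are hypothetical illustrations, not the code in mlb_statcast_runtime; in particular, the regex assumes the blob is a JSON array terminated by `];`, which should be checked against a real capture.

```python
import json
import re
from typing import Any, Dict, List, Union


def decode_body_sketch(content_type: str, body: str) -> Union[Dict[str, Any], List[Any], str]:
    """Parse JSON bodies; hand CSV/HTML back as raw text for a later parser."""
    if "json" in content_type.lower():
        return json.loads(body)
    return body


# Assumed blob shape: `const data = [ ... ];` inside a <script> tag.
_DATA_BLOB = re.compile(r"const data\s*=\s*(\[.*?\]);", re.DOTALL)


def rows_from_html_sketch(html: str) -> List[Dict[str, Any]]:
    """Extract the embedded row array, or an empty list when it is absent."""
    match = _DATA_BLOB.search(html)
    return json.loads(match.group(1)) if match else []


json_payload = decode_body_sketch("application/json; charset=utf-8", '{"a": 1}')
csv_payload = decode_body_sketch("text/csv", "a,b\n1,2\n")
html_rows = rows_from_html_sketch('<script>const data=[{"team": "NYY"}];</script>')
```

The non-greedy match stops at the first `];`, so a string value containing that sequence would truncate the blob; a production parser would need something sturdier.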
All HTTP goes through sportsdataverse.dl_utils.download(). As of May 2026
it's type-hinted, iterative (no recursion), initializes response = None
defensively, and re-raises the most recent exception when the retry budget
is exhausted (instead of returning an unbound variable). Wrappers do NOT
wrap the call in try/except — they trust download() to either return a
usable requests.Response or raise.
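The retry contract described above (iterate, remember the last failure, re-raise it when the budget runs out) can be sketched as follows. This is not the body of download(): the function name, the injected `fetch` callable, and the parameter names are assumptions made so the example runs without a network.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def download_sketch(fetch: Callable[[], T], num_retries: int = 3, backoff_seconds: float = 0.0) -> T:
    """Call `fetch` until it succeeds or the retry budget is spent."""
    last_exc: Optional[Exception] = None
    attempts = max(1, num_retries + 1)
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose in this sketch
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(backoff_seconds)
    # The loop ran at least once without returning, so an exception was recorded.
    assert last_exc is not None
    raise last_exc


calls = {"n": 0}


def flaky() -> str:
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = download_sketch(flaky, num_retries=3)


def always_fails() -> str:
    raise TimeoutError("still down")


try:
    download_sketch(always_fails, num_retries=1)
    exhausted_error = None
except TimeoutError as exc:
    exhausted_error = exc
```

Because the helper either returns a value or raises, callers need no try/except of their own, which is the property the wrappers rely on.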
Pinned to polars>=1.0,<2.0. All seven *_pbp.py modules (cfb, nfl, nba,
nhl, mbb, wbb, wnba) were migrated wholesale from the 0.18 surface to 1.x
in May 2026 — roughly 165 call sites. If you find a 0.18-style API in
this codebase, treat it as a bug, not a style preference.
Use the modern API surface:
| Use this | Don't use this (0.18 era) |
|---|---|
| df.group_by("col") | df.groupby("col") |
| df.with_row_index("name") | df.with_row_count("name") |
| expr.map_elements(f, return_dtype=...) | expr.apply(f) |
| pl.struct(*cols) | pl.struct([cols]) |
| pl.read_csv(schema_overrides=) | pl.read_csv(dtypes=) |
| Series.scatter() | Series.set_at_idx() |
| pl.len() | pl.count() |
| df.join(..., how="full", coalesce=True) | df.join(..., how="outer") |
| s.cum_sum() | s.cumsum() |
| s.shift(n=k, fill_value=v) | s.shift_and_fill(periods=k, fill_value=v) |
| s.str.strip_chars() | s.str.strip() |
| s.str.len_chars() | s.str.n_chars() |
- Boolean masks on polars expressions use pl.col("col") == True / pl.col("col") == False explicitly (NOT pl.col("col") / ~pl.col("col")). Ruff's E712 is suppressed in pyproject.toml for this reason. The explicit form is more readable when the column itself is also a polars expression and avoids surprises around null handling.
- Polars/Rust regex has no lookaround support. (?=...), (?!...), (?<=...), (?<!...) raise ComputeError. To stop a capture at a stopword without lookahead, use the inline case-flag toggle: (?i)prefix(?-i: NAMES). The (?-i:...) group disables case-insensitivity for the captured names so lowercase narrative tails (for, at, return, and, etc.) cannot be folded into a captured proper noun. Example (extract a player name after "sacked by"):

      pl.col("cleaned_text").str.extract(
          r"(?i)sacked by(?-i: ([A-Z][\w'\.\-]+(?:\s+[A-Z][\w'\.\-]+)?))", 1
      )
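The effect of the case-flag toggle is easy to check in isolation. The sketch below uses Python's stdlib `re` rather than polars, purely so it runs without extra dependencies; `re` accepts the same `(?i)` and `(?-i:...)` syntax for this pattern, but the two engines are not identical in general, so treat it as an illustration of the idea rather than a test of the polars expression. The sample sentence is invented.

```python
import re

TEXT = "J. Burrow sacked by Garrett for a loss of 7 yards"

# Case-insensitive everywhere: [A-Z] also matches lowercase letters, so the
# optional second word swallows the narrative tail "for".
loose = re.search(
    r"(?i)sacked by ([A-Z][\w'\.\-]+(?:\s+[A-Z][\w'\.\-]+)?)", TEXT
)

# Same pattern, but the names group is switched back to case-sensitive, so
# only capitalized words can be captured.
strict = re.search(
    r"(?i)sacked by(?-i: ([A-Z][\w'\.\-]+(?:\s+[A-Z][\w'\.\-]+)?))", TEXT
)

loose_name = loose.group(1) if loose else None
strict_name = strict.group(1) if strict else None

# The prefix stays case-insensitive, so shouted play text still matches.
upper = re.search(
    r"(?i)sacked by(?-i: ([A-Z][\w'\.\-]+(?:\s+[A-Z][\w'\.\-]+)?))",
    "QB SACKED BY Myles Garrett at the 20",
)
upper_name = upper.group(1) if upper else None
```

The single-name case is where the difference shows: with two capitalized words the optional group is already full, so the lowercase tail never gets a chance to match.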
New modules MUST be fully typed (params + returns). Append the module path to
the [tool.mypy] files ratchet in pyproject.toml. Legacy modules remain
un-typed and stay out of the gate's files scope until cleaned.
Live-API tests use @skip_if_no_live from tests/conftest.py and run only
when SDV_PY_LIVE_TESTS=1 is set. CI does NOT set the var; live runs are
opt-in by contributor.
Player / athlete / team IDs are join keys, and a join is only as correct as the dtype agreement on both sides. Pin the type early and keep it consistent across the whole pipeline:
- Pick one canonical dtype per id and never silently flip it. ESPN ships athlete / team IDs as both ints and numeric strings depending on the endpoint (participants[] vs the playerHash sidecar vs a $ref payload). Decide the id's dtype at the boundary and cast there; don't let two code paths feed the same column as Int64 in one frame and Utf8 in another.
- Beware the id -> Utf8 "paper-over" cast. Casting an Int/Float id to string to make a join line up is a latent-bug factory: a float-origin id stringifies as "123.0" (not "123"), and zero-padded source ids lose/gain leading zeros. If you must stringify, cast the raw integer (pl.col("id").cast(pl.Int64).cast(pl.Utf8)), never a float, and assert the result on both sides before joining.
- Assert dtype agreement on join keys. Before a .join(...) on an id, confirm left.schema[key] == right.schema[key]. The roster-backed {type}_player_id join (cfb) and any crosswalk join are the high-risk spots; a dtype mismatch surfaces only as wrong/empty matches at test time.
- Match names case-insensitively unless case is load-bearing. Player-name reconciliation (roster joins, alias tables, narrative extraction) should fold case rather than require an exact match. For polars/Rust regex use the inline case toggle (?i)... (lookaround is unsupported; see "Polars version" above).
- Statcast parsers must be validated against REAL captures, not synthetic fixtures. Three Savant parsers shipped wrong because their hand-written fixtures didn't match live payloads: the gamefeed `/gf` has no top-level `events` key (pitches live under `team_home` / `team_away`); the player page's `serverVals` has no `rows` key (it's a multi-table object; use a named `section`); and CSV leaderboards need the content-type-aware `mlb_statcast_runtime._get` (the JSON-only getter returns `{}`). When adding a Savant endpoint, capture a real response and assert against it.
- Regenerate generated docs before pushing, or CI fails on drift. If a change touched endpoint YAML, schemas, docstrings, loaders, or wrappers, the generated reference subtree under `docs/docs/` is stale until you run `uv run python tools/codegen/generate.py`. The `--check` drift gate runs in CI and the `sdv-codegen` pre-commit hook, so an un-regenerated tree turns a green-locally change red in CI. Fold regeneration into the pre-push checklist (the `/ship` skill does this as step 1). Related: do NOT clean up / delete a branch until the PR `state` is `MERGED`; a premature cleanup stranded work in a past session.
- Don't dump large outputs into a reply: redirect, then hand back a read command. A single response that pastes a multi-MB log, a full data dump, or a whole file can blow the output-token limit and truncate the entire turn (this has killed long sessions outright). Instead: redirect the big output to a file, summarize the salient lines in the reply, and give a copy-pasteable command to read the full output live at that path, e.g. `cat c:/path/to/output.output` (or `tail -f c:/path/to/output.output` to stream a still-running job). The context-mode tooling already nudges this (write artifacts to files, return the path + a one-line description); this makes it the durable default for ordinary work too.
- `cfb_play_participants` sidecar gaps are now (mostly) backfilled. ESPN's `cdn.espn.com/.../playbyplay` sidecar omits ~6 athletes per game on average (split-sack secondary participants, lateral returners, etc.). The module's default `resolve_missing=True` fetches each missing athlete's `$ref` URL one-by-one (capped at 50/game via `resolve_missing_max`) before the per-play pivot. A narrow regex fallback against `cleaned_text` is still retained inside `cfb_pbp.__add_player_cols` for `sack_player_name2`, `fg_block_player_name`, `punt_block_player_name`, and `interception_player_name`; those four are documented sidecar blind spots. Don't add new regex extraction; extend the participants module instead.
- Cache invalidation when modifying loaders. The `@cached_loader` decorator hashes `(qualified_name, args, sorted_kwargs)`; it does NOT hash the URL or function body. If you change a loader's URL or resource path without renaming the function, callers will see stale data until they call `clear_cache()`. During development against a cached loader, prefer `update_config(cache_mode="off")` or `clear_cache()` between runs to avoid debugging phantom data.
- Don't add new per-type NFL loaders. `load_nfl_ngs_passing` / `_rushing` / `_receiving` and the eight per-type/per-summary `load_nfl_pfr_advstats_*` wrappers all emit `DeprecationWarning` and dispatch to the unified `load_nfl_nextgen_stats(stat_type=)` / `load_nfl_pfr_advstats(stat_type=, summary_level=)` functions. Extend the unified function; do not introduce new per-type wrappers.
- `load_nfl_ff_rankings`: `kind=` vs `type=`. Both work and resolve to the same parameter; `kind` is preferred internally because `type` shadows the builtin. nflreadpy uses `type=`, so we accept both for parity. Pass exactly one.
- Polars/Rust regex has no lookaround support (`(?=...)`, `(?!...)`, `(?<=...)`, `(?<!...)` raise `ComputeError`). Use the inline case-flag toggle `(?i)prefix(?-i: NAMES)` to stop a capture at a stopword without lookahead. See the example in the "Polars version" table above.
- `docs/` is the Docusaurus site. Internal working notes (specs, reconciliation maps, scratchpads, etc.) live in the gitignored `dev/` directory, NOT under `docs/` and NOT at the repo root. `dev/` is in `.gitignore` precisely because those files are working notes; if a doc graduates to contributor-visible, move it to the repo root and add it to git.
- `requirements.txt`, `requirements-dev.txt`, and `setup.py` are all deleted as of May 2026. All packaging metadata lives in `pyproject.toml` under PEP 621 `[project]`. The build path is `python -m build` (PEP 517); don't reintroduce `setup.py`, and don't add a `requirements*.txt`.
- The 0.36-live branch is intentionally divergent. Don't merge it wholesale into main: it's pandas-flavored and would undo the polars migration. Cherry-pick semantic fixes via translation; the reconciliation notes live in `dev/` (untracked). The May 2026 reconciliation already ported the bulk of the CFB pbp fixes (yardage, kneel-downs, half edges, WP, penalty-on-kickoff, etc.).
- `pkg_resources` is removed in setuptools 81+. `cfb_pbp.py` and `nfl_pbp.py` already migrated to `importlib.resources.files()`. Don't reintroduce `from pkg_resources import resource_filename`. If the `pkg_resources` API-deprecation `UserWarning` surfaces from a transitive dep, the noise is filtered in `pytest.ini`'s `filterwarnings`; investigate before suppressing further.
- `psutil` is optional in `decorators.py`. It's imported lazily so the package still imports cleanly when `psutil` isn't installed. Don't promote it to a hard runtime dep without a deliberate decision.
- Polars literal-from-numpy no longer auto-broadcasts in 1.x. Use `pl.lit(np_array).first()` to extract a scalar, or pass a Python value directly.
- `pyjanitor 0.32.18+` silently switched to pandas 3.x. Keep the defensive `pyjanitor<0.32.18` upper bound in `pyproject.toml` until pandas 3 is the project floor.
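The cache-invalidation pitfall above is easy to reproduce with a toy decorator. This is a minimal sketch, not the package's `@cached_loader`: `toy_cached_loader`, `toy_clear_cache`, `SETTINGS`, and `load_rosters` are hypothetical names, and the only property borrowed from the text is that the key is built from `(qualified_name, args, sorted_kwargs)`. It also assumes cached entries survive the code edit.

```python
import functools
from typing import Any, Callable, Dict, Tuple

_CACHE: Dict[Tuple[Any, ...], Any] = {}


def toy_cached_loader(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical stand-in: the key ignores the URL and the function body."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        key = (
            f"{fn.__module__}.{fn.__qualname__}",
            args,
            tuple(sorted(kwargs.items())),
        )
        if key not in _CACHE:
            _CACHE[key] = fn(*args, **kwargs)
        return _CACHE[key]

    return wrapper


def toy_clear_cache() -> None:
    """Drop every cached entry (stand-in for clear_cache())."""
    _CACHE.clear()


# The resource location lives outside the cache key, as in the pitfall.
SETTINGS = {"url": "https://example.invalid/v1/rosters"}


@toy_cached_loader
def load_rosters(season: int) -> str:
    # Stands in for a download; returns the URL it would have fetched.
    return f"{SETTINGS['url']}?season={season}"


first = load_rosters(2024)
SETTINGS["url"] = "https://example.invalid/v2/rosters"  # simulate editing the loader
stale = load_rosters(2024)  # same key, so the v1 result comes back
toy_clear_cache()
fresh = load_rosters(2024)  # recomputed against the v2 URL
```

Because the key never sees the URL, `stale` still points at the v1 resource until the cache is cleared, which is why clearing or disabling the cache during loader development avoids chasing phantom data.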
- The Docusaurus site lives under `docs/`. The per-league reference subtree (`docs/docs/<sport>/index.md`, `<sport>/reference/*.md`, `_category_.json`, and `reference/parameters.md`) is generated from endpoint metadata by `python tools/codegen/generate.py --docs`; never hand-edit those. Conceptual pages OUTSIDE the generated league / `reference/` dirs (`intro.md`, `quality-of-life.md`, `architecture/`, `parsers/`) ARE hand-authored and are preserved across regeneration.
- Internal working notes (specs, reconciliation maps, scratchpads) live in `dev/`, which is gitignored. Promote a doc to the repo root only if it becomes contributor-visible reference material.
- `CONTRIBUTING.md` is the canonical contributor onboarding file (covers uv, conda, lint/typecheck, dep-bumping flow).
- `README.md` has Standard pip / Modern uv / Conda / Development install paths plus the runtime notes (Python 3.9-3.14, polars 1.x, NFL cache).
- `recipe/meta.yaml` + `recipe/README.md` ship the conda-build recipe and document the conda-forge feedstock submission flow. The local-source build (`conda build recipe/`) reads metadata from `pyproject.toml` through the PEP 517 path; the conda-forge variant pins to a PyPI sdist via `url:` + `sha256:`.
Every public callable (function / class / method that doesn't start with
underscore) ships a Google-style docstring with `Args:` / `Returns:` /
`Raises:` blocks AND an `Example:` block in the napoleon literal-block
format. The canonical shape:
```
Example:
    Quick start::

        from sportsdataverse.<sport> import <fn>
        df = <fn>(<minimal args>)
        print(df.shape)

    Useful parameter combination::

        df_pd = <fn>(..., return_as_pandas=True)

    Pipeline next step (one line)::

        df.filter(pl.col("...") == ...).head()

See Also:
    * `<companion package>`_ -- short rationale
    * `<alternative source>`_ -- short rationale

.. _<companion package>: https://...
.. _<alternative source>: https://...
```
Rules:
- Use the napoleon `Example:` heading (singular), blank line, then a literal block introduced by `::`. Indent the code block 4 spaces.
- Do NOT use raw `>>> ...` doctest prompts. `sphinx.ext.doctest` is enabled and would try to verify them; for live-API loaders the values drift, so doctest noise is guaranteed.
- Each example should be runnable as-is (copy-paste into a REPL).
- Keep examples short: 2-4 sub-blocks max per function. The pipeline next-step is ONE line, not a notebook.
- Cross-link to companion packages in the `See Also:` block. Canonical URLs are listed below; pick the relevant ones for the sport / domain.
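As an illustration of the rules above, here is a minimal sketch of a function whose docstring follows the canonical shape, plus a small checker for the two rules that are easy to verify mechanically (required sections present, no `>>>` prompts). Both `load_example_schedule` and `docstring_problems` are hypothetical and exist only for this example; the repository may enforce the conventions differently.

```python
import inspect
from typing import Callable, Dict, List

REQUIRED_SECTIONS = ("Args:", "Returns:", "Raises:", "Example:")


def load_example_schedule(season: int, return_as_pandas: bool = False) -> List[Dict[str, int]]:
    """Return a tiny fake schedule (hypothetical loader for illustration).

    Args:
        season: Four-digit season year.
        return_as_pandas: Accepted to mirror the documented shape; ignored here.

    Returns:
        A list of ``{"season": ..., "game_id": ...}`` dicts.

    Raises:
        ValueError: If ``season`` is negative.

    Example:
        Quick start::

            games = load_example_schedule(2024)
            print(len(games))

        Useful parameter combination::

            games = load_example_schedule(2024, return_as_pandas=True)

    See Also:
        * `nflreadpy`_ -- NFL loaders in Python

    .. _nflreadpy: https://github.com/nflverse/nflreadpy
    """
    if season < 0:
        raise ValueError("season must be non-negative")
    return [{"season": season, "game_id": game_id} for game_id in (1, 2)]


def docstring_problems(fn: Callable[..., object]) -> List[str]:
    """List convention violations found in ``fn``'s docstring."""
    lines = (inspect.getdoc(fn) or "").splitlines()
    stripped = [line.strip() for line in lines]
    problems = [f"missing section {name}" for name in REQUIRED_SECTIONS if name not in stripped]
    if any(line.startswith(">>>") for line in stripped):
        problems.append("raw doctest prompt found")
    if "Example:" in stripped and not any(line.endswith("::") for line in stripped):
        problems.append("Example: has no literal block introduced by '::'")
    return problems
```

`docstring_problems(load_example_schedule)` returns an empty list, while a function documented with `>>>` prompts or without a `Raises:` block would be reported.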
Companion-package cross-link URLs:
| Package | URL | Domain |
|---|---|---|
| wehoop | https://wehoop.sportsdataverse.org | Women's basketball (R) |
| hoopR | https://hoopR.sportsdataverse.org | Men's basketball (R) |
| cfbfastR | https://cfbfastR.sportsdataverse.org | College football (R) |
| baseballr | https://baseballr.sportsdataverse.org | Baseball (R) |
| fastRhockey | https://fastRhockey.sportsdataverse.org | Hockey (R) |
| nflfastR | https://www.nflfastr.com | NFL (R) |
| nflverse | https://nflverse.nflverse.com | NFL ecosystem |
| nflreadpy | https://github.com/nflverse/nflreadpy | NFL (Python) |
| nba_api | https://github.com/swar/nba_api | NBA/WNBA (Python) |
| nhl-api-py | https://github.com/coreyjs/nhl-api-py | NHL (Python) |
| recruitR | https://github.com/sportsdataverse/recruitR | CFB recruiting (R) |
Intro/intermediate Jupyter notebooks live under `examples/notebooks/`,
one per sport plus a top-level cross-sport quickstart. Each demonstrates
the canonical surface for that sport (schedule, PBP, teams, season
stats) plus the package-wide cache + config layer where relevant. New
sport submodules should add a corresponding `0X_<sport>_intro.ipynb` so
the introductory walkthrough stays parallel across sports.
The legacy Sphinx pipeline (`Sphinx-docs/` + `create_docs.sh`) is retired.
Reference docs are now generated from the same YAML endpoint metadata that drives
the wrappers, via the codegen CLI:
- Generate: `python tools/codegen/generate.py --docs` rewrites the per-league reference subtree under `docs/docs/<sport>/` (full-clobbers each league dir + the shared `docs/docs/reference/` dir; conceptual pages outside them survive). The no-arg `python tools/codegen/generate.py` also regenerates docs alongside the wrappers/loaders/parsed modules.
- Templates: `tools/codegen/templates/_reference_block.jinja` (the 8-section per-function block) + `reference_page` / `league_index` / `loaders_page` / `parameter_reference` / `category_json` templates. `@return` tables come from `tools/codegen/schemas/*.yaml` (ESPN) and `schemas/loader_schemas.yaml` (loaders).
- Native (flat) API families: non-ESPN live APIs are generated from `tools/codegen/endpoints/<stem>.yaml` and registered in `FLAT_APIS` + `_FLAT_API_DOC` (`generate.py`): NHL api-web/edge/stats-rest/records, MLB Stats, and NFL.com (`nfl_api` → `api.nfl.com`). Each emits a `sportsdataverse/<league>/<stem>.py` module (per-endpoint `parser:` → a parser module) and its own reference grouping on the league index. Authenticated families (NFL.com needs a `WEB_DESKTOP` bearer token) set `auth: true` + `getter_module:` (a module exposing `_get`) in the YAML, so the generated wrappers gain a reusable `headers=` arg and import an auth-aware `_get` (e.g. `nfl/nfl_api_runtime.py`) instead of the shared no-auth `_codegen_runtime._get`. Hand-written cached loaders (NFL is not in `_GENERATED_LOADER_LEAGUES`) can still get a "Dataset loaders" docs grouping by listing them in `releases.yaml` (docs-metadata only; the module is left untouched).
- Drift gate: `python tools/codegen/generate.py --check` fails on stale generated docs (orphan-checked only within the generated league / `reference/` dirs). Same gate runs in CI + the `sdv-codegen` pre-commit hook. Offline tests live in `tests/codegen/test_docs.py` + `test_doc_parity.py`.
- Docusaurus: `docs/sidebars.ts` drives each league as a clickable category (link → generated `index`) expanding to an autogenerated reference subtree, so new endpoints surface with no sidebar edit. Verify with `cd docs && yarn build` (broken-link warnings are confined to the frozen `0.0.50` version + CHANGELOG doctoc fragments).
- Deploy & versioning: the site builds on Vercel (auto-deploy on push to `main`; there is no in-repo deploy workflow, and a GitHub Pages action would double-publish). The unversioned `docs/docs/` tree is the live DEFAULT at the root URL (`lastVersion: 'current'`, labelled `main`), so the published docs always track the code. At each release, freeze a per-release archive: `cd docs && yarn version:docs <x.y.z>` (snapshots `docs/docs/` → `versioned_docs/version-<x.y.z>/`), then commit. `current` / `main` stays the default (only add a snapshot, never bump `lastVersion` away from `current`), so the live docs never go stale and each release still gets a frozen record. The legacy pre-codegen docs remain archived at `/docs/0.0.50/`.
- The `--docs` output is markdown the prose linters skip (`docs/docs/**` is excluded from doctoc + markdownlint), so generated tables/fences don't fight the hooks. Docstrings still use Google-style sections (`Args:` / `Returns:` / `Raises:` / `Example:`); those feed the wrappers' runtime help, not a Sphinx build.
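The idea behind a drift gate can be sketched generically: render the expected pages in memory, then compare them with what is on disk. The `find_drift` function below is a hypothetical illustration and makes no claim about how `generate.py --check` is implemented; it treats `root` as a directory that is fully owned by the generator, which mirrors the note that orphans are only checked inside generated dirs.

```python
from pathlib import Path
from typing import Dict, List


def find_drift(rendered: Dict[str, str], root: Path) -> List[str]:
    """Compare freshly rendered pages against the files under ``root``.

    ``rendered`` maps a POSIX-style relative path to the expected content.
    Returns human-readable problems; an empty list means no drift.
    """
    problems: List[str] = []
    for rel_path, expected in sorted(rendered.items()):
        target = root / rel_path
        if not target.exists():
            problems.append(f"missing: {rel_path}")
        elif target.read_text(encoding="utf-8") != expected:
            problems.append(f"stale: {rel_path}")
    # Anything on disk that the generator no longer emits is an orphan.
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*.md")}
    for rel_path in sorted(on_disk - set(rendered)):
        problems.append(f"orphan: {rel_path}")
    return problems
```

A caller would exit non-zero when the returned list is non-empty, which is the behavior that turns an un-regenerated tree red in CI.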