feat(gl): native --gl loader (replaces gl_to_locator.py) #47

Open
stsmall wants to merge 5 commits into kr-colab:main from stsmall:gl-native-loader

stsmall (Contributor) commented May 12, 2026

Summary

Follow-up to the microsat native loader (#46) per your note that GL should get the same treatment. loc.load_genotypes(gl=..., bam_list=..., gl_mode=...) and locator --gl ... --bam_list ... --gl_mode {dosage,full_gl} now load beagle GL data directly. scripts/gl_to_locator.py is gone; parsing helpers are in locator/_gl.py. Both dosage and full_gl modes from the script are preserved end-to-end; downstream filtering behavior is unchanged (full_gl flows through the same filter_dosage_matrix path the script's TSV used via --matrix).
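
For readers unfamiliar with the two modes, here is a minimal sketch of what they represent. It parses a tiny Beagle-style GL table (three likelihood columns per sample after the marker and allele columns) and derives an expected dosage per sample. The helper names `parse_beagle_gl` and `gl_to_dosage`, the toy data, and the dosage formula `P(het) + 2 * P(hom alt)` are assumptions for illustration only; the real helpers in locator/_gl.py may be organized differently.

```python
import io

import numpy as np

# Hypothetical two-site, two-sample Beagle GL text (tab-separated).
BEAGLE_TEXT = (
    "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n"
    "chr1_100\t0\t1\t0.9\t0.1\t0.0\t0.0\t0.2\t0.8\n"
    "chr1_200\t2\t3\t0.0\t1.0\t0.0\t0.25\t0.5\t0.25\n"
)


def parse_beagle_gl(handle):
    """Return (markers, gl) with gl shaped (n_sites, n_samples, 3)."""
    next(handle)  # skip the header line
    markers, rows = [], []
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.rstrip("\n").split("\t")
        markers.append(fields[0])
        # Columns 0-2 are marker, allele1, allele2; the rest are GL triplets.
        rows.append([float(x) for x in fields[3:]])
    gl = np.asarray(rows, dtype=float)
    return markers, gl.reshape(len(markers), -1, 3)


def gl_to_dosage(gl):
    """Expected alternate-allele count: P(het) + 2 * P(hom alt)."""
    return gl[..., 1] + 2.0 * gl[..., 2]


markers, gl = parse_beagle_gl(io.StringIO(BEAGLE_TEXT))
dosage = gl_to_dosage(gl)  # dosage mode: one value per site and sample
```

In this sketch, dosage mode collapses each triplet to a single number, while full_gl mode would keep all three likelihoods per site and sample.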

Imputation lives in the loader (continuous-dosage path), consistent with the microsat PR's resolution and what the script does today.

Branched off main after #45 merged; independent of #46 (no rebase dependency).

Changes

  • locator/_gl.py: module-level parsing helpers (lifted from the removed converter script).
  • locator/loaders.py: _load_from_gl plus gl=/bam_list=/gl_mode= plumbing in load_genotypes, dispatched via is_dosage_matrix into filter_dosage_matrix.
  • locator/cli.py: --gl, --bam_list, --gl_mode {dosage,full_gl} flags.
  • tests/test_gl_helpers.py, tests/test_gl_input.py, tests/test_gl_cli.py: helper unit tests, loader-level tests, and an end-to-end CLI subprocess test (replaces the old test_input_extensions.py).
  • scripts/gl_to_locator.py: deleted.
  • docs/genotype_likelihoods.md: user-facing guide for the native loader.

Filter thresholds (min_maf=0.01, max_missing_frac=0.10, gl_missing_threshold=0.4) mirror the original script defaults and are hard-coded in _load_from_gl. Can be surfaced as CLI flags in a follow-up if desired.
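
To make the three thresholds concrete, here is a rough sketch of how a site-level filter using them could work. The semantics are assumptions, not a description of `_load_from_gl`: a call is treated as missing when the largest value in its GL triplet falls below `gl_missing_threshold` (for example an uninformative 1/3, 1/3, 1/3 triplet), and a site is dropped when its missing fraction exceeds `max_missing_frac` or its minor allele frequency is below `min_maf`. The function name `site_keep_mask` is hypothetical.

```python
import numpy as np


def site_keep_mask(gl, min_maf=0.01, max_missing_frac=0.10,
                   gl_missing_threshold=0.4):
    """Boolean mask over sites for gl shaped (n_sites, n_samples, 3).

    Assumed semantics (not confirmed by the PR): a call is missing when
    its most likely genotype has likelihood below gl_missing_threshold.
    """
    gl = np.asarray(gl, dtype=float)
    missing = gl.max(axis=2) < gl_missing_threshold      # (n_sites, n_samples)
    missing_frac = missing.mean(axis=1)

    # Expected alt-allele count, ignoring calls flagged as missing.
    dosage = np.where(missing, 0.0, gl[..., 1] + 2.0 * gl[..., 2])
    n_called = (~missing).sum(axis=1)
    alt_freq = dosage.sum(axis=1) / (2.0 * np.maximum(n_called, 1))
    maf = np.minimum(alt_freq, 1.0 - alt_freq)

    return (missing_frac <= max_missing_frac) & (maf >= min_maf)


REF, HET, ALT = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
FLAT = [1 / 3, 1 / 3, 1 / 3]  # uninformative triplet, counted as missing

toy_gl = np.array([
    [REF, HET, ALT, REF],    # polymorphic, fully called
    [REF, FLAT, FLAT, HET],  # half of the samples missing
    [REF, REF, REF, REF],    # monomorphic
])
keep = site_keep_mask(toy_gl)
```

Only the first toy site passes: the second exceeds the missingness limit and the third has a minor allele frequency of zero.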

Test plan

  • pixi run pytest tests/test_gl_helpers.py tests/test_gl_input.py tests/test_gl_cli.py -v passes.
  • Full suite green on this branch (247 tests).
  • pixi run ruff check + pixi run ruff format --check clean.
  • Reviewer: spot-check that full_gl mode still produces a (3 * n_sites, n_samples) matrix matching what gl_to_locator.py --gl_mode full_gl + --matrix used to produce.
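
For the spot-check in the last item, a sketch of the shape relationship may help. It flattens a `(n_sites, n_samples, 3)` likelihood array into a `(3 * n_sites, n_samples)` matrix. The row order used here (the three genotype rows of a site kept adjacent) is an assumption and should be verified against real output of the old script; `to_full_gl_matrix` is a hypothetical name.

```python
import numpy as np


def to_full_gl_matrix(gl):
    """Flatten (n_sites, n_samples, 3) into (3 * n_sites, n_samples).

    Assumed row order: site0/g0, site0/g1, site0/g2, site1/g0, ...
    """
    n_sites, n_samples, _ = gl.shape
    # Move the genotype axis next to the site axis, then merge the two.
    return gl.transpose(0, 2, 1).reshape(3 * n_sites, n_samples)


rng = np.random.default_rng(0)
raw = rng.random((4, 5, 3))
gl = raw / raw.sum(axis=2, keepdims=True)  # normalize each triplet to sum to 1
full = to_full_gl_matrix(gl)
```

A comparison against a reference matrix saved from the old script could then be as simple as `np.allclose(full, reference, equal_nan=True)`, assuming both use the same row order.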

stsmall and others added 5 commits May 12, 2026 11:56
Lifted byte-identically from scripts/gl_to_locator.py. Next commits wire
them into DataLoaderMixin and the CLI; the script is removed last.

Mirrors the locator._microsat structure landed on microsats-sculpin.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

loc.load_genotypes(gl=..., bam_list=..., gl_mode=...) returns the same
(n_sites, n_samples) float dosage representation that the continuous-dosage
path produces (for dosage mode) or (3*n_sites, n_samples) for full_gl
mode. Both flow through the existing is_dosage_matrix dispatch into
filter_dosage_matrix — no downstream changes.

Missing samples are imputed to per-site mean dosage (dosage mode) or
per-site mean GL triplet (full_gl mode), matching the script's behavior.
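
A minimal sketch of the imputation rule described above, assuming missing entries are encoded as NaN (the PR does not state the encoding). The function names are hypothetical and this is not the loader's actual code.

```python
import numpy as np


def impute_dosage(dosage):
    """Fill NaNs in a (n_sites, n_samples) matrix with the per-site mean."""
    # np.nanmean emits a RuntimeWarning for sites where every sample is NaN.
    site_mean = np.nanmean(dosage, axis=1, keepdims=True)
    return np.where(np.isnan(dosage), site_mean, dosage)


def impute_gl(gl):
    """Fill NaN triplets in (n_sites, n_samples, 3) with the per-site mean triplet."""
    site_mean = np.nanmean(gl, axis=1, keepdims=True)  # shape (n_sites, 1, 3)
    return np.where(np.isnan(gl), site_mean, gl)


dosage = np.array([[0.0, 2.0, np.nan],
                   [1.0, np.nan, 1.0]])
dosage_filled = impute_dosage(dosage)

gl = np.array([[[1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0],
                [np.nan, np.nan, np.nan]]])
gl_filled = impute_gl(gl)
```

Because the mean of normalized triplets is itself normalized, the imputed triplet in this sketch still sums to 1.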

tests/test_input_extensions.py is renamed to tests/test_gl_input.py and
rewritten to exercise the loader end-to-end via Locator(...). The CLI
wiring follows in the next commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Threads args into loc.load_genotypes alongside vcf/zarr/matrix. --gl
requires --bam_list (enforced in the loader's dispatch elif). No
intermediate TSV; no preprocessing script.

The native loader (loc.load_genotypes(gl=..., bam_list=..., gl_mode=...)
and locator --gl --bam_list --gl_mode {dosage,full_gl}) fully supersedes
scripts/gl_to_locator.py. Parsing helpers live in locator._gl; both dosage and full_gl modes
are preserved end-to-end.
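
The pairing rule from the CLI commit (--gl requires --bam_list, enforced in the loader's dispatch) might look roughly like the sketch below. The function name `resolve_input`, the precedence order of the branches, the return values, and the error messages are all assumptions; only the keyword names and the two gl_mode values come from this PR.

```python
def resolve_input(vcf=None, zarr=None, matrix=None,
                  gl=None, bam_list=None, gl_mode="dosage"):
    """Pick one input source; hypothetical stand-in for the loader's dispatch."""
    if vcf is not None:
        return ("vcf", vcf)
    elif zarr is not None:
        return ("zarr", zarr)
    elif matrix is not None:
        return ("matrix", matrix)
    elif gl is not None:
        # The pairing check lives in the dispatch, not in argument parsing.
        if bam_list is None:
            raise ValueError("gl= requires bam_list=")
        if gl_mode not in ("dosage", "full_gl"):
            raise ValueError(f"unknown gl_mode: {gl_mode!r}")
        return ("gl", gl, bam_list, gl_mode)
    raise ValueError("no genotype input provided")


choice = resolve_input(gl="calls.beagle.gz", bam_list="bams.txt")
```

Keeping the check in the dispatch means the Python API and the CLI would fail the same way when `bam_list` is omitted.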

Also drop the gl_to_locator.py references from _load_from_matrix docstring
and ValueError message, _load_from_gl docstring, cli.py --matrix help text,
filters.py NaN ValueError message, and test_gl_input.py module docstring —
the script is no longer the recommended GL preprocessing path.

User-facing guide for the native flag, --bam_list pairing, --gl_mode
{dosage,full_gl}, and the hard-coded filter thresholds. The parent
CLAUDE.md project note is updated separately (it's outside the
ReLocator repo).
stsmall force-pushed the gl-native-loader branch from f47f6a1 to cc79a0c May 12, 2026 19:00
codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 94.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.46%. Comparing base (be59c4d) to head (cc79a0c).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| locator/_gl.py | 95.71% | 3 Missing ⚠️ |
| locator/cli.py | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   58.49%   59.46%   +0.97%     
==========================================
  Files          27       28       +1     
  Lines        3518     3617      +99     
==========================================
+ Hits         2058     2151      +93     
- Misses       1460     1466       +6     

| Flag | Coverage Δ |
|---|---|
| unittests | 59.46% <94.00%> (+0.97%) ⬆️ |

