feat(gl): native --gl loader (replaces gl_to_locator.py) #47
Open
stsmall wants to merge 5 commits into
Lifted byte-identically from scripts/gl_to_locator.py. Next commits wire them into DataLoaderMixin and the CLI; the script is removed last. Mirrors the locator._microsat structure landed on microsats-sculpin.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
loc.load_genotypes(gl=..., bam_list=..., gl_mode=...) returns the same (n_sites, n_samples) float dosage representation that the continuous-dosage path produces (for dosage mode) or (3*n_sites, n_samples) for full_gl mode. Both flow through the existing is_dosage_matrix dispatch into filter_dosage_matrix, so there are no downstream changes.
Missing samples are imputed to per-site mean dosage (dosage mode) or per-site mean GL triplet (full_gl mode), matching the script's behavior.
tests/test_input_extensions.py is renamed to tests/test_gl_input.py and rewritten to exercise the loader end-to-end via Locator(...). The CLI wiring follows in the next commit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
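As a rough illustration of the two output shapes and the mean imputation described in this commit, here is a minimal NumPy sketch. It is not the code in locator/_gl.py: the function name gl_to_matrix, the (n_sites, n_samples, 3) input layout, the use of NaN to mark missing samples, the expected-allele-count dosage formula, and the per-site row order in full_gl mode are all assumptions made for the example.

```python
import numpy as np


def gl_to_matrix(gl, gl_mode="dosage"):
    """Hypothetical helper: turn GL triplets into a dosage or full_gl matrix.

    gl has shape (n_sites, n_samples, 3) holding P(AA), P(AB), P(BB).
    A sample whose triplet contains NaN is treated as missing (assumption).
    """
    gl = np.asarray(gl, dtype=float)
    n_sites, n_samples, _ = gl.shape
    missing = np.isnan(gl).any(axis=2)  # (n_sites, n_samples)

    if gl_mode == "dosage":
        # Expected alternate-allele count: 0*P(AA) + 1*P(AB) + 2*P(BB).
        dosage = gl[..., 1] + 2.0 * gl[..., 2]
        # Impute missing samples to the per-site mean dosage.
        site_mean = np.nanmean(dosage, axis=1, keepdims=True)
        return np.where(missing, site_mean, dosage)  # (n_sites, n_samples)

    if gl_mode == "full_gl":
        # Impute missing samples to the per-site mean GL triplet.
        site_mean = np.nanmean(gl, axis=1, keepdims=True)  # (n_sites, 1, 3)
        filled = np.where(missing[..., None], site_mean, gl)
        # Assumed row order: three consecutive rows per site.
        return filled.transpose(0, 2, 1).reshape(3 * n_sites, n_samples)

    raise ValueError(f"unknown gl_mode: {gl_mode!r}")


nan = np.nan
example_gl = np.array([
    [[1, 0, 0], [0, 0, 1], [nan, nan, nan]],  # third sample missing
    [[0, 1, 0], [0, 1, 0], [0, 0, 1]],
])
dosage_matrix = gl_to_matrix(example_gl, "dosage")
full_gl_matrix = gl_to_matrix(example_gl, "full_gl")
```

In this sketch a site where every sample is missing would stay NaN (and NumPy would warn); how the real loader handles that case is not stated here.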
Threads args into loc.load_genotypes alongside vcf/zarr/matrix. --gl requires --bam_list (enforced in the loader's dispatch elif). No intermediate TSV; no preprocessing script.
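The --gl / --bam_list pairing could be sketched with argparse roughly as follows. This is an illustrative stand-in, not the actual locator/cli.py or loader code: build_parser and check_gl_args are hypothetical names, the help strings are placeholders, and the "dosage" default for --gl_mode is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the three GL-related flags."""
    parser = argparse.ArgumentParser(prog="locator-sketch")
    parser.add_argument("--gl", help="beagle genotype-likelihood file")
    parser.add_argument("--bam_list", help="BAM list paired with --gl")
    # Default value is an assumption; the PR only states the two choices.
    parser.add_argument("--gl_mode", choices=["dosage", "full_gl"],
                        default="dosage")
    return parser


def check_gl_args(args):
    """Reject --gl without --bam_list, mirroring the rule described above."""
    if args.gl is not None and args.bam_list is None:
        raise ValueError("--gl requires --bam_list")
    return args


ok_args = check_gl_args(
    build_parser().parse_args(["--gl", "x.beagle.gz", "--bam_list", "bams.txt"])
)
```

Per the commit message, the real check lives in the loader's dispatch rather than in the parser, so the same rule also applies to direct loc.load_genotypes(...) calls.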
The native loader (loc.load_genotypes(gl=..., bam_list=..., gl_mode=...)
and locator --gl --bam_list --gl_mode {dosage,full_gl}) fully supersedes
scripts/gl_to_locator.py. Parsing helpers live in locator._gl; both dosage
and full_gl modes are preserved end-to-end.
Also drop the gl_to_locator.py references from _load_from_matrix docstring
and ValueError message, _load_from_gl docstring, cli.py --matrix help text,
filters.py NaN ValueError message, and test_gl_input.py module docstring —
the script is no longer the recommended GL preprocessing path.
User-facing guide for the native flag, --bam_list pairing, --gl_mode
{dosage,full_gl}, and the hard-coded filter thresholds. The parent
CLAUDE.md project note is updated separately (it's outside the
ReLocator repo).
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   58.49%   59.46%   +0.97%
==========================================
  Files          27       28       +1
  Lines        3518     3617      +99
==========================================
+ Hits         2058     2151      +93
- Misses       1460     1466       +6
Summary
Follow-up to the microsat native loader (#46) per your note that GL should get the same treatment.
loc.load_genotypes(gl=..., bam_list=..., gl_mode=...) and locator --gl ... --bam_list ... --gl_mode {dosage,full_gl} now load beagle GL data directly. scripts/gl_to_locator.py is gone; parsing helpers are in locator/_gl.py. Both dosage and full_gl modes from the script are preserved end-to-end; downstream filtering behavior is unchanged (full_gl flows through the same filter_dosage_matrix path the script's TSV used via --matrix).
Imputation lives in the loader (continuous-dosage path), consistent with the microsat PR's resolution and what the script does today.
Branched off main after #45 merged; independent of #46 (no rebase dependency).
Changes
- locator/_gl.py: module-level parsing helpers (lifted from the removed converter script).
- locator/loaders.py: _load_from_gl plus gl= / bam_list= / gl_mode= plumbing in load_genotypes, dispatched via is_dosage_matrix → filter_dosage_matrix.
- locator/cli.py: --gl, --bam_list, --gl_mode {dosage,full_gl} flags.
- tests/test_gl_helpers.py, tests/test_gl_input.py, tests/test_gl_cli.py: helper unit tests, loader-level tests, and an end-to-end CLI subprocess test (replaces the old test_input_extensions.py).
- scripts/gl_to_locator.py: deleted.
- docs/genotype_likelihoods.md: user-facing guide for the native loader.
Filter thresholds (min_maf=0.01, max_missing_frac=0.10, gl_missing_threshold=0.4) mirror the original script defaults and are hard-coded in _load_from_gl. Can be surfaced as CLI flags in a follow-up if desired.
Test plan
- pixi run pytest tests/test_gl_helpers.py tests/test_gl_input.py tests/test_gl_cli.py -v passes.
- pixi run ruff check + pixi run ruff format --check clean.
- full_gl mode still produces a (3 * n_sites, n_samples) matrix matching what gl_to_locator.py --gl_mode full_gl + --matrix used to produce.
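For reviewers who want a feel for what the three hard-coded thresholds in the Summary might do, here is a small NumPy sketch of a per-site filter. It is only a guess at the semantics, not the logic in _load_from_gl: filter_sites is a hypothetical name, and the assumptions are that a sample counts as missing at a site when its largest genotype likelihood is below gl_missing_threshold, that MAF is computed from expected dosage over called samples, and that both cutoffs are inclusive.

```python
import numpy as np

# Values quoted in the PR description.
MIN_MAF = 0.01
MAX_MISSING_FRAC = 0.10
GL_MISSING_THRESHOLD = 0.4


def filter_sites(gl):
    """Hypothetical filter: return a boolean keep-mask over sites.

    gl has shape (n_sites, n_samples, 3) holding P(AA), P(AB), P(BB).
    """
    gl = np.asarray(gl, dtype=float)
    # Assumption: an uninformative triplet such as (1/3, 1/3, 1/3) is missing.
    missing = gl.max(axis=2) < GL_MISSING_THRESHOLD
    missing_frac = missing.mean(axis=1)

    # Expected alt-allele dosage, zeroed where the sample is missing.
    dosage = np.where(missing, 0.0, gl[..., 1] + 2.0 * gl[..., 2])
    n_called = np.maximum((~missing).sum(axis=1), 1)  # avoid divide-by-zero
    alt_freq = dosage.sum(axis=1) / (2.0 * n_called)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)

    return (missing_frac <= MAX_MISSING_FRAC) & (maf >= MIN_MAF)


AA, BB, U = [1, 0, 0], [0, 0, 1], [1 / 3, 1 / 3, 1 / 3]
sites = np.array([
    [AA] * 5 + [BB] * 5,            # polymorphic, no missing data
    [AA] * 10,                      # monomorphic
    [U] * 2 + [AA] * 4 + [BB] * 4,  # 20% missing
    [U] + [AA] * 4 + [BB] * 5,      # 10% missing
])
keep = filter_sites(sites)
```

If the thresholds are later surfaced as CLI flags, as suggested in the Summary, they would presumably replace the module-level constants used in this sketch.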