Enum derivations pipeline with auto-generated specs #291

Draft
Sigfried wants to merge 40 commits into main from 211-enum-derivations-with-unreleased-linkml-stuff

Conversation

@Sigfried
Collaborator

@Sigfried Sigfried commented Mar 26, 2026

Summary

Proves enum_derivations work end-to-end through the dm-bip pipeline (issue #211), with the proof automated in tests/.

The from_raw toy pipeline gains a single twin column. SEX_CODE (string M/F) is added to pht000001 as a semantic duplicate of the existing integer SEX column. Demography.sex is still populated via value_mappings (unchanged); the new Demography.sex_derived slot is populated via enum_derivations from SEX_CODE. Both resolve to the same target_sex_enum permissible values, and an integration test asserts they match row-for-row. The existing value_mappings pathway stays the comparison baseline; enum_derivations is exercised as an additive capability against the same data — no parallel fixtures.

Changes

  • toy_data/ fixture: SEX_CODE column on pht000001; Demography.sex_derived slot (range target_sex_enum) in the target schema; sex_derived slot derivation + top-level enum_derivations block in from_raw/specs/demography.yaml.
  • toy_data/from_raw/config.mk: DM_MAX_ENUM_SIZE := 3, the narrowest bound that lets SEX_CODE (2 distinct values) be inferred as a source enum while keeping SMOKING (5 distinct values) a non-enum string so its value_mappings keep working.
  • src/dm_bip/map_data/compose_specs.py: preserve top-level enum_derivations blocks during spec composition; they were previously dropped silently, making enum derivations invisible to linkml-map.
  • tests/integration/test_from_raw_pipeline.py: test_enum_derivation_twin_matches_value_mapping asserts sex == sex_derived on every row.
  • Dependencies: linkml>=1.11.0, schema-automator>=0.5.5, linkml-map git-pinned (see Staging).
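The row-for-row comparison described above can be sketched as follows. This is a minimal illustration, not the actual test in tests/integration/test_from_raw_pipeline.py: the helper name `mismatched_rows`, the `participant_id` key, and the sample records are assumptions, and the real test presumably reads the pipeline's Demography output rather than an in-memory list.

```python
from typing import Iterable, Mapping


def mismatched_rows(records: Iterable[Mapping[str, object]]) -> list[int]:
    """Return the indexes of records whose sex and sex_derived slots differ."""
    return [
        i
        for i, rec in enumerate(records)
        if rec.get("sex") != rec.get("sex_derived")
    ]


# Hypothetical Demography records; identifiers and codes are illustrative.
demography = [
    {"participant_id": "P1", "sex": "OMOP:8507", "sex_derived": "OMOP:8507"},
    {"participant_id": "P2", "sex": "OMOP:8532", "sex_derived": "OMOP:8532"},
]

# The twin-column claim holds when no row disagrees.
assert mismatched_rows(demography) == []
```

Returning the mismatching indexes, rather than a bare boolean, would make a failing assertion point at the offending rows.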

Staging

linkml-map is pinned to a git ref pending an upstream release that carries the CLI --target-schema support and the delimited-loader forwarding this pipeline needs. The PR stays draft + staged until that release; the pin then bumps to a version constraint and the label drops. Upstream context: linkml/linkml-map#235; this mirrors the same maneuver in linkml/schema-automator#211.

Test plan

  • make test — 146 tests pass (incl. the new twin-column assertion)
  • make lint — clean
  • make pipeline CONFIG=toy_data/from_raw/config.mk: sex and sex_derived match across all 110 Demography records
  • Re-review / dismiss the stale CHANGES_REQUESTED before un-drafting
  • Bump linkml-map pin to a released version once available

Origin & direction

This started as the enum-derivations work for #211 and changed shape along the way; the original intent is recorded here so the history stays legible.

Originally: the branch auto-generated enum_derivations specs from existing value_mappings specs via a new generate_enum_specs.py, wired through a generate-enum-specs Make target gated on DM_ENUM_DERIVATIONS=true, and proved it with a self-contained toy_data_w_enums/ directory that duplicated the toy data to run two parallel pipelines (config-orig-valmaps.mk vs config-enums.mk), plus a docs/pipeline-steps.md comparison reference. Dependencies were carried via [tool.uv.sources] pointing at local forks set up by a scripts/setup-enum-forks.sh script.

Why it changed: #211's goal is narrowly to prove enum_derivations work end-to-end with the proof in tests/. The spec auto-generator was a separable concern; the parallel duplicated fixture, planning docs, fork-cloning script, and embedded reproduction project buried the actual change and were flagged in review. The redirected approach folds the comparison into the existing from_raw fixture as one twin column — same data, both mechanisms, one assertion — and relies on released linkml/schema-automator with only linkml-map git-pinned. The salvageable substance (enum_derivations Make/compose plumbing, a working end-to-end enum test) is kept; the scaffolding is not.

Sigfried and others added 28 commits March 10, 2026 10:02
Captures our understanding of the task, current state of enum handling
in the pipeline, and open questions for the team before implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain default_range: string, expand enum_derivations key features
with plain-language descriptions, clarify where source enums come from,
and simplify the target schema question.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reframe task as exploratory (test if LinkML-Map handles enum derivations),
explain why pre_cleaned path is the right test case (human-readable values
vs coded integers), and simplify plan into concrete steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Created minimal test in toy_data/enum_test/ with enum-enabled source
schema, target schema with enums, and a spec using enum_derivations.
LinkML-Map correctly maps Male→OMOP:8507, Female→OMOP:8532. Key
finding: every source enum needs a derivation (use mirror_source: true
for passthrough). Updated planning doc with results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents current pipeline and future enum derivations pipeline in
table format with linked files, manual/curation steps, and notes.
Includes instructions at top for completing after context refresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change generate_toy_data.py smoking_status from mixed int/string values
([1, 2, "Former", "Never", "Unknown"]) to all-text values (["Current",
"Former", "Never", "Unknown"]). This fixes linkml-validate failures where
bare numeric TSV values were parsed as integers, not matching string enum
permissible values.

Flesh out docs/pipeline-steps.md: separate In/Out on distinct rows,
add line-specific Makefile links, add real data pointers to RTI
NHLBI-BDC-DMC-HV repo, expand future pipeline table with all data
columns, and document the root cause and fix for the validation error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corey confirmed the mixed int/string smoking_status values (1, 2,
"Former", "Never", "Unknown") are intentional, matching real dbGaP
data patterns. A schema-automator fix for mixed types is in progress.

Update docs to document this as a known issue awaiting upstream fix
rather than a data generation bug.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-ran ToyFromRaw pipeline after reverting generate_toy_data.py to
restore output files to their pre-change state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bypassed .gitignore to add output/EnumTest for dev testing
Move target_sex_enum into toy_data/target-schema.yaml (shared) and
delete toy_data/enum_test/target-schema.yaml. Update enum_test config
to point at the shared schema. EnumTest pipeline verified working.

Simplify docs/pipeline-steps.md to focus on toy data only — removed
pre-cleaned and real data columns per current scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pipeline-steps.md: Add copy-pasteable commands for every step, document
  enum test pipeline using raw data path, document --infer-enum-from-integers
  flag, document int/string type mismatch blocker
- issue-211-planning.md: Replace stale "Why pre_cleaned" section (we now use
  raw data), document completed work (enum derivations, --infer-enum-from-integers,
  pipeline wiring), add int/string blocker with question for Corey, update
  remaining questions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "How map_data.py works" section with ASCII flowchart showing
  the full transform pipeline from schema loading through TsvLoader
  to ObjectTransformer.map_object and chunked output
- Expand int/string blocker section with root cause (_parse_numeric
  in TSV loader), code references, why integer PVs can't help, and
  link to linkml-int-enum-repro/ minimal reproduction
- Currently broken: integer-coded enums fail both validation and
  mapping due to _parse_numeric converting all numeric TSV values
  to Python ints before schema-aware code runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explains the bug, expected vs actual output, root cause
(_parse_numeric in TSV loader), and proposed fix (make the
loader schema-aware).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
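The int/string mismatch described in the two commits above can be reproduced in miniature. The sketch below is an illustration only: `parse_numeric` is a stand-in written for this example, not linkml-map's `_parse_numeric`, and `parse_with_schema` is a hypothetical shape for the proposed schema-aware fix.

```python
def parse_numeric(value: str) -> object:
    """Schema-unaware stand-in: coerce anything that looks like an integer."""
    try:
        return int(value)
    except ValueError:
        return value


def parse_with_schema(value: str, is_string_enum: bool) -> object:
    """Hypothetical schema-aware variant: leave string-enum columns untouched."""
    return value if is_string_enum else parse_numeric(value)


# Permissible values of a string enum that mixes coded and text values.
permissible_values = {"1", "2", "Former", "Never", "Unknown"}
raw_cell = "1"

# The naive loader yields the int 1, which is not the string "1".
assert parse_numeric(raw_cell) not in permissible_values
# Consulting the schema first keeps the cell a string, so it matches.
assert parse_with_schema(raw_cell, is_string_enum=True) in permissible_values
```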
Copies the from_raw pipeline setup (raw data, specs, target schema, config)
into a standalone directory. Currently uses value_mappings (identical to
from_raw); enum_derivations changes will be layered on next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses editable installs of local forks of schema-automator, linkml, and
linkml-map to test unreleased features (--infer-enum-from-integers,
int/string enum fixes). Not suitable for merging to main until upstream
releases incorporate these changes.

Changes:
- pyproject.toml/uv.lock: editable deps pointing at local forks
- .gitignore: output/ un-ignored, local clone dirs added
- pipeline.Makefile: DM_INFER_ENUM_FROM_INTEGERS variable
- map_data.py: DataLoader accepts schema_path for type coercion
- toy_data/enum_test: updated config and specs for enum derivations
- new-pipeline-plan.md: plan for generate_enum_specs.py tool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate issue-211-planning.md and new-pipeline-plan.md into a single
document. Adds local fork commit inventory, enum_derivations YAML syntax
reference, expanded passthrough/unreferenced enum handling, and comments
strategy. Removes resolved questions and narrative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…anning

pipeline-steps.md: Restructure around toy_data_w_enums with nested-list
format comparing original (value_mappings) and enum-focused pipelines.
Add generate-enum-specs as step 2a, inline local fork notes at relevant
steps, remove obsolete BLOCKER notes, add config examples.

issue-211-planning.md: Replace notes-to-claude block and duplicated local
fork section with pointer to pipeline-steps.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d sections

Renumber steps 1-5, name after Makefile targets. Add overview table
comparing value_mappings and enum_derivations pipelines. Each step gets
formatted CLI commands, parameter/config tables, and input/output docs.
Rewrite map_data.py algorithm with SchemaView, blocks, entity discovery,
and transformation operations explained with code snippets. Link
generate_enum_specs algorithm to issue-211-planning.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create config-orig-valmaps.mk (original pipeline) and config-enums.mk
(enum inference + derivation generation). Separate output dirs to avoid
collisions. Rename target-schema.yaml to target-schema-orig-valmaps.yaml.
Revert incorrect PyCharm renames in tests/ and toy_data/ that don't use
toy_data_w_enums paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script reads source schema (with inferred enums) and existing specs
(with value_mappings), generates new specs with enum_derivations and a
target schema with enum definitions. Handles deduplication, disambiguation,
passthrough enums, unreferenced enums, and nested object_derivations.

Pipeline wiring: generate-enum-specs Makefile target runs after
schema-create and before map-data when DM_ENUM_DERIVATIONS is set.
Mapping step uses generated specs and target schema automatically.

Verified: full enum pipeline produces identical output to value_mappings
pipeline (except expected None→null for unmapped enum values).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fill in step 3 enum_derivations column with links to input files and
generated outputs. Add source file links on CLI lines for prepare_input,
generate_enum_specs, and map_data. Fix typos (pipline, tranform).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove Step 0 section, merge config descriptions into the intro with
both make commands up front. Drop row 0 from the overview table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change pyproject.toml [tool.uv.sources] to expect all three forks
(schema-automator, linkml, linkml-map) as sibling directories of dm-bip.
Add scripts/setup-enum-forks.sh to clone them with correct branches and
fetch upstream tags for linkml (needed for version resolution).

Update pipeline-steps.md with setup/cleanup instructions for the forks.
Narrow requires-python to <3.13 to avoid resolution issues with the
linkml fork's dynamic versioning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point to pipeline-steps.md, generate_enum_specs.py, and setup script.
Brief description of what the enum pipeline does and how to run it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@amc-corey-cox amc-corey-cox left a comment

@Sigfried I hope this doesn't offend but it looks like this PR got pulled off-course by the LLM. The goal of #211 is proving enum_derivations work end-to-end in the pipeline with proof automated in tests/.

generate_enum_specs.py is a useful tool but it's a separate concern — split it into its own PR and we'll get this core part in separately.

There's a lot of material here that doesn't belong in the repo: planning docs, an embedded reproduction project, an AI-generated reference doc, a fork-cloning shell script. These bury the actual work and make the PR hard to review.

The dependency situation ([tool.uv.sources] pointing at local filesystem paths, unpinned deps with TODO comments) and the .gitignore regression are merge blockers — please see #290 for how I did it there.

Generally, you should also strip any descriptive comments the LLM is throwing in - that's just noise.

Comment thread docs/pipeline-steps.md Outdated
Comment thread linkml-int-enum-repro/README.md Outdated
Comment thread scripts/setup-enum-forks.sh Outdated
Comment thread src/dm_bip/map_data/map_data.py Outdated
Comment thread toy_data/enum_test/specs/person-spec.yaml Outdated
Comment thread src/dm_bip/generate_enum_specs.py Outdated
Comment thread .gitignore Outdated
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Comment thread src/dm_bip/generate_enum_specs.py Outdated
…urces

- Remove generate_enum_specs.py (splitting to separate PR)
- Remove issue-211-planning.md, linkml-int-enum-repro/, setup script, enum_test dir
- Switch pyproject.toml from local filesystem paths to git URL sources
- Restore output/ to .gitignore, remove local clone entries
- Remove DM_ENUM_DERIVATIONS and generate-enum-specs from pipeline.Makefile
- Restore direct DM_MAP_TARGET_SCHEMA/DM_TRANS_SPEC_DIR usage in map-data target
- Point config-enums.mk at committed specs (with_enum_derivations/) and target-schema-enums.yaml
- Point config-orig-valmaps.mk at with_value_mappings/ subdir
- Strip generated comments from enum derivation spec YAML files
- Rewrite test_from_enum_pipeline.py for enum pipeline with enum-specific assertions
- Update docs/pipeline-steps.md and README.md for new structure

Note: uv sync does not yet work with the git URL sources due to
linkml's uv-dynamic-versioning fallback producing version 0.0.0,
which fails transitive dependency constraints. See PR comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Sigfried
Collaborator Author

@ccox-work Cleaned up most of the review feedback in 2a6c980:

  • Removed generate_enum_specs.py, issue-211-planning.md, linkml-int-enum-repro/, setup script, enum_test/
  • Switched [tool.uv.sources] from local filesystem paths to git URL sources (pinned to commit revs)
  • Restored output/ to .gitignore, removed local clone entries
  • Removed DM_ENUM_DERIVATIONS flag and generate-enum-specs target from pipeline.Makefile; map-data now uses DM_MAP_TARGET_SCHEMA and DM_TRANS_SPEC_DIR directly
  • Config files point at committed specs (with_enum_derivations/ and with_value_mappings/ subdirs)
  • Stripped generated comments from enum spec YAMLs
  • Rewrote test_from_enum_pipeline.py with enum-specific assertions
  • Updated docs and README

Blocker: uv sync fails with git URL sources

The linkml fork's pyproject.toml uses uv-dynamic-versioning with fallback-version = "0.0.0". When uv builds it from a git URL, it can't resolve git tags for versioning and falls back to 0.0.0. This causes a transitive dependency resolution failure:

schema-automator depends on linkml>=1.9.1,<2.0.0
linkml (from git source) resolves to version 0.0.0
→ no solution

Even after that's resolved with override-dependencies = ["linkml>=0"], the same pattern cascades to linkml-runtime (schema-automator also requires linkml-runtime>=1.9.2,<2.0.0).

This didn't happen with the local editable installs because uv could read the git history directly and compute a proper version.
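The resolution failure above comes down to a range check that the fallback version cannot pass. A simplified sketch, assuming plain dotted release numbers only (real resolvers follow PEP 440 and also handle dev and pre-release segments, which this does not); the helper names are made up for the illustration:

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Parse a plain dotted release such as '1.9.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """True when lower <= version < upper, for plain dotted releases only."""
    return parse_release(lower) <= parse_release(version) < parse_release(upper)


# schema-automator's constraint on linkml: >=1.9.1,<2.0.0
assert not satisfies("0.0.0", "1.9.1", "2.0.0")  # the fallback version is rejected
assert satisfies("1.11.0", "1.9.1", "2.0.0")     # a properly versioned build is accepted
```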

Options I see:

  1. Tag the fork branches (e.g., v1.10.0-sa-loader) so uv-dynamic-versioning produces a valid version from the git URL
  2. Change the fork's fallback-version from "0.0.0" to something like "1.10.0.dev0"
  3. Use a different pinning approach you may know about from your experience with #290 (Replace map_data.py with linkml-map CLI (#275))

Happy to go whichever direction you prefer.

@amc-corey-cox
Collaborator

Use PEP 440 direct references in [project.dependencies] instead of [tool.uv.sources]. This sidesteps version resolution entirely:

[project]
dependencies = [
    "linkml @ git+https://github.com/Sigfried/linkml.git@<commit-or-branch>",
    "schema-automator @ git+https://github.com/Sigfried/schema-automator.git@<commit-or-branch>",
    "linkml-map @ git+https://github.com/Sigfried/linkml-map.git@<commit-or-branch>",
]

Remove the [tool.uv.sources] section and any override-dependencies. See #290's pyproject.toml for the pattern.

@amc-corey-cox
Collaborator

This is definitely more complicated for your situation. You may have to make a test branch in schema-automator or linkml-map, or both, with the dependencies for linkml from your branch there in order to push through this. I'm not really sure... but that is what I would try.

Sigfried and others added 4 commits March 27, 2026 11:57
…s, add note to pipeline-steps.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three PRs (schema-automator #188, linkml, linkml-map) are merged
upstream but not yet released. Use PEP 440 direct references to upstream
commit hashes — no more [tool.uv.sources] or override-dependencies.

Also fix test_mapping_uses_enum_derivations to unwrap the dict output
format, and update docs to reflect upstream status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Sigfried Sigfried requested a review from amc-corey-cox March 27, 2026 17:10
@Sigfried
Collaborator Author

@amc-corey-cox, is there anything you're waiting for me to do on this? I think I addressed your previous comments. At this point main may have changed in ways that require more conflicts to be resolved.

@amc-corey-cox amc-corey-cox added the staged label (Work ready or in progress, waiting on upstream release) May 13, 2026
linkml/linkml#3289 was released in linkml v1.11.0; schema-automator/#188
was released in v0.5.5. Switch both from git URL pins to PyPI version
specifiers.

linkml-map fix is still unreleased (PR linkml/linkml-map#235 open) — its
git pin stays in place until that ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Sigfried
Collaborator Author

Handoff status (2026-05-14)

Corey, per our chat — here's where this stands in case you decide to pick it up.

What's on this branch right now

The branch is at bddf25b (just pushed), a fresh commit that drops the linkml and schema-automator git pins now that the upstream changes are released.

What's salvageable vs obsolete after your f215a1d rewrite

Salvageable (the substantive enum-derivations work):

  • pipeline.Makefile: DM_INFER_ENUM_FROM_INTEGERS Makefile variable + wiring to schemauto generalize-tsvs --infer-enum-from-integers
  • toy_data_w_enums/: full test fixture directory with two parallel spec variants (with_value_mappings/ vs with_enum_derivations/) and matching target schemas
  • tests/integration/test_from_enum_pipeline.py
  • docs/pipeline-steps.md

Obsolete (replaced by your linkml-map CLI flow):

  • All edits to src/dm_bip/map_data/map_data.py (the schema_path/target_class plumbing was needed because we were calling linkml-map's loaders directly; your compose_specs.py + CLI invocation in pipeline.Makefile makes those edits unnecessary — the schema is passed via -s at the CLI level, and once linkml-map#235 lands, the CLI will propagate it to the loaders)
  • streams.py (already deleted on main)

Merge state

The branch is 14 commits behind main and has unresolved conflicts in pyproject.toml, README.md, uv.lock, map_data.py (modify/delete), and a binary .txt.gz. Given the architectural shift, a fresh branch off main with cherry-picked fixtures + the Makefile wiring is probably easier than a conflict-by-conflict merge. I'd estimate that's an afternoon's work.

I'll get back to it next week unless you take it over. Either way, this branch has everything you'd need to reconstruct it.

…with-unreleased-linkml-stuff

# Conflicts:
#	README.md
#	pyproject.toml
#	src/dm_bip/map_data/map_data.py
#	toy_data/data/raw/phs000000.v1.pht000002.v1.p1.c1.ex0_1s.HMB.txt.gz
#	uv.lock
Sigfried's prior pin (53ad099) predated the --target-schema CLI flag.
PR #235 head adds it plus the schema_path/target_class forwarding needed
by the post-#155 directory-input flow that main now uses.
compose_specs.py previously only collected class_derivations blocks
from per-variable spec files. Top-level enum_derivations blocks (used
to declare source-enum → target-enum value correspondences shared
across entity transforms) were silently dropped, which made every
enum_derivation in the source specs invisible to linkml-map.

Merge enum_derivations dicts across spec files and emit them on each
composed entity spec alongside class_derivations.
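The merge step described in this commit message can be sketched roughly as below. This is not the code in compose_specs.py: the function name, the list-of-blocks output shape, and the sample spec fragments are assumptions chosen to keep the example short.

```python
from typing import Any


def compose_entity_spec(spec_docs: list[dict[str, Any]]) -> dict[str, Any]:
    """Combine per-variable spec documents into one composed spec.

    Hypothetical helper: class_derivations blocks are collected in order,
    and top-level enum_derivations dicts are merged instead of dropped.
    """
    class_blocks: list[Any] = []
    enum_blocks: dict[str, Any] = {}
    for doc in spec_docs:
        if doc.get("class_derivations"):
            class_blocks.append(doc["class_derivations"])
        # Later files win on a name clash; real code might prefer to raise.
        enum_blocks.update(doc.get("enum_derivations") or {})

    composed: dict[str, Any] = {"class_derivations": class_blocks}
    if enum_blocks:
        composed["enum_derivations"] = enum_blocks
    return composed


# Illustrative spec fragments; the keys inside each block are placeholders.
sex_spec = {
    "class_derivations": {"Demography": {"slot_derivations": {"sex_derived": {}}}},
    "enum_derivations": {"target_sex_enum": {"mirror_source": False}},
}
age_spec = {"class_derivations": {"Demography": {"slot_derivations": {"age": {}}}}}

composed = compose_entity_spec([sex_spec, age_spec])
assert "target_sex_enum" in composed["enum_derivations"]
```

Omitting the enum_derivations key when no spec file declares one keeps the composed output unchanged for specs that only use value_mappings.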
The toy_data_w_enums/ directory duplicated toy_data/'s binaries and
specs to A/B compare value_mappings and enum_derivations side by side.
That comparison is moving inline into the from_raw pipeline via a twin
column on Demography, so the parallel fixture and its dedicated test
are no longer needed.
Adds a string SEX_CODE column (M/F) to pht000001 as a semantic twin
of the integer SEX column (1/2). The new Demography.sex_derived slot
is populated via enum_derivations from SEX_CODE, while the existing
Demography.sex slot keeps its value_mappings path on SEX. Both slots
resolve to the same target_sex_enum permissible values, and a new
integration test asserts they match row-for-row.

Why a string twin (M/F) instead of a literal integer duplicate:
enum_derivations needs its source column typed as a source enum, and
forcing integers into source enums (via --infer-enum-from-integers)
would also break the existing value_mappings on SEX/RACE/ETHNICITY.
A string column lets schema-create infer just that one column as an
enum once DM_MAX_ENUM_SIZE permits it.

DM_MAX_ENUM_SIZE := 3 in from_raw/config.mk is the narrowest bound
that lets SEX_CODE (2 distinct values) be inferred as a source enum
while keeping SMOKING (5 distinct values) a plain string, so its
value_mappings keep working as the comparison baseline.
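The effect of the bound can be illustrated with a small sketch. It assumes a column is inferred as an enum when its distinct-value count is strictly less than the bound, which would be consistent with 3 being the narrowest setting that admits a two-value column; schema-automator's actual inference rule may differ, and `infer_enum_columns` is a made-up helper.

```python
def infer_enum_columns(columns: dict[str, list[str]], max_enum_size: int) -> set[str]:
    """Return the names of string columns that would be typed as enums.

    ASSUMPTION: a column qualifies when its number of distinct values is
    strictly less than max_enum_size.
    """
    return {
        name
        for name, values in columns.items()
        if len(set(values)) < max_enum_size
    }


# Toy columns: SEX_CODE has 2 distinct values, SMOKING has 5.
columns = {
    "SEX_CODE": ["M", "F", "F", "M"],
    "SMOKING": ["1", "2", "Former", "Never", "Unknown"],
}

# With the bound at 3, only SEX_CODE becomes a source enum.
assert infer_enum_columns(columns, max_enum_size=3) == {"SEX_CODE"}
```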
@codecov-commenter

codecov-commenter commented May 15, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.74%. Comparing base (a65ec29) to head (bbb4482).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/dm_bip/map_data/compose_specs.py | 80.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
- Coverage   79.87%   79.74%   -0.13%     
==========================================
  Files           9        9              
  Lines         626      632       +6     
==========================================
+ Hits          500      504       +4     
- Misses        126      128       +2     


Sigfried's branch had deleted output/.gitignore and output/README.md
(the placeholder + doc that keep the pipeline output dir present and
explained); restore them from main — unrelated to the enum work and
previously flagged in review.

Also drop the DM_INFER_ENUM_FROM_INTEGERS Makefile plumbing: nothing
in this PR uses it, no test exercises it, and the underlying
schema-automator flag globally forces every low-cardinality integer
column into a source enum, which breaks value_mappings on the
unconverted slots (the reason the sex_derived twin uses a string
column instead). Re-add with a test if a pure-integer-enum study
ever needs it.
Labels

staged Work ready or in progress, waiting on upstream release

3 participants