Overview

This issue documents the intended pipeline for automating transformation specification (trans-spec) creation. The goal is a coherent flow from raw study metadata to finished trans-specs, where manual curation effort decreases over time as each stage becomes more automated.
Today, trans-spec authoring is largely manual: a curator examines source data dictionaries, identifies how source variables map to the BDCHM target model, and writes YAML by hand (or with template assistance). This pipeline aims to automate as much of that pathway as possible, reserving human effort for genuinely ambiguous decisions.
Resolution
This issue is resolved by documenting the pipeline architecture in the repo — the content below (refined through implementation) should become that documentation.
Pipeline Stages

1. Acquire Study Metadata (#204)

Fetch variable digest files (data_dict.xml, var_report.xml) from dbGaP or BDC study data buckets. These contain variable descriptions, coded value sets, units, and other metadata that downstream stages need.
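Once fetched, the digest files need to be parsed into per-variable records. The sketch below reads a data_dict.xml digest with the standard library; the element names (`data_table`, `variable`, `name`, `description`, `unit`, `value`) follow the published dbGaP layout, but real files vary by study, so verify against actual downloads.

```python
import xml.etree.ElementTree as ET

def parse_data_dict(xml_text):
    """Extract per-variable metadata from a dbGaP data_dict.xml digest.

    Element names follow the dbGaP data-dictionary layout; check them
    against real study files before relying on this.
    """
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.findall("variable"):
        variables.append({
            "id": var.get("id"),
            "name": var.findtext("name"),
            "description": var.findtext("description"),
            "unit": var.findtext("unit"),  # None when absent
            # coded value sets become candidate enums downstream
            "values": {v.get("code"): v.text for v in var.findall("value")},
        })
    return variables
```

The coded `values` dict is what later stages promote to enums, so it is worth capturing even when a variable has only a handful of codes.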
2. Normalize to Standard Format (#103)

Ideally, data submitters would provide well-structured, computer-readable data dictionaries in a reliable format. If we had that, enrichment would be trivial. In practice, we get inconsistent metadata in various formats. This stage transforms what we receive into something useful — the process should move through a standard data dictionary format even if the input doesn't start there. This may prove impractical for some sources, and that's okay.
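Normalization can be as simple as coercing source-specific column headers onto one canonical record shape. A minimal sketch, assuming a hypothetical canonical entry and alias table (the field names and aliases here are illustrative, not a ratified standard):

```python
from dataclasses import dataclass, field

# Canonical shape for one data-dictionary entry; illustrative, not a standard.
@dataclass
class DictEntry:
    name: str
    description: str = ""
    unit: str = ""
    values: dict = field(default_factory=dict)

# Source-specific header aliases (hypothetical examples of what shows up).
ALIASES = {
    "name": ["name", "variable", "var_name", "VARNAME"],
    "description": ["description", "desc", "label", "VARDESC"],
    "unit": ["unit", "units", "UNITS"],
}

def normalize_row(row: dict) -> DictEntry:
    """Map a raw row with source-specific headers onto the canonical entry."""
    out = {}
    for target, candidates in ALIASES.items():
        for c in candidates:
            if c in row and row[c]:
                out[target] = str(row[c]).strip()
                break
    return DictEntry(**out)
```

Rows that fail to yield even a variable name are exactly the cases the pipeline should surface for human review rather than silently drop.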
3. Build and Enrich Source Schema (#80, #307)

Generate a LinkML schema from the source data and/or data dictionaries. Schema-automator handles inference from data files; schemasheets handles data dictionary inputs. Conceptually these should be one operation — use both sources when available to produce the richest possible schema.
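For orientation, an enriched source schema might look roughly like the fragment below. Class, slot, and enum names are invented for illustration; only the LinkML constructs (`slots`, `range`, `enums`, `permissible_values`) are real.

```yaml
# Hypothetical enriched LinkML source schema (illustrative names only)
classes:
  SubjectPhenotypes:
    slots:
      - SEX
slots:
  SEX:
    description: Participant sex     # pulled from data_dict.xml
    range: SexEnum                   # coded values promoted to an enum
enums:
  SexEnum:
    permissible_values:
      "1":
        title: Male
      "2":
        title: Female
```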
The generated schema should be enriched with metadata from the data dictionaries: variable descriptions, units, coded value sets (enums), and study documentation. The richer the source schema, the more downstream automation is possible. For example, if source enums are captured in the schema and the target model has defined enums, the pipeline can derive value mappings automatically instead of requiring manual curation.
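That enum-derivation step can start as a plain normalized-label match. A minimal sketch (the function and its return shape are assumptions, not an existing tool): compare source value labels against target permissible values, and return unmatched codes separately so they surface for review.

```python
def derive_enum_mapping(source_values, target_values):
    """Propose source-to-target enum mappings by normalized-label match.

    source_values: {code: label} from the source schema enum
    target_values: permissible values of the target (BDCHM) enum
    Returns (mapped, unmapped); unmapped codes go to human review.
    """
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    target_by_norm = {norm(t): t for t in target_values}
    mapped, unmapped = {}, []
    for code, label in source_values.items():
        hit = target_by_norm.get(norm(label))
        if hit:
            mapped[code] = hit
        else:
            unmapped.append(code)
    return mapped, unmapped
```

Exact matching is deliberately conservative: a wrong automatic value mapping is worse than a gap, so anything fuzzier belongs in the AI-assisted stage.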
4. Align Source Variables to Target Model (#199, #308)
Map source variables to BDCHM target slots. This is where AI-assisted tooling and CDE matching help:
- AI variable mapping tool ("AI multi-tool") — currently in testing with one curator. Takes source metadata and proposes variable-to-slot mappings. See "Create script to run AI API" (#199).
- CDE matching (cde2vec) — vector-based similarity matching across variable definitions. Madan's existing work in monarch-initiative/cde-harmonization has direct synergy here.
- Agent-assisted authoring — Chris Siege has used Claude-based agent swarms in VS Code to help author trans-specs for new datasets (LTRC, SPIROMICS in NHLBI-BDC-DMC-HV). This is a complementary approach for complex or ambiguous mappings where human-directed AI assistance is more appropriate than fully automated alignment.
The deterministic pipeline should handle as much as possible automatically. Agent-based approaches have their place for hard cases and as a cross-check, but should not be the core pathway.
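The similarity-matching idea can be sketched with a toy bag-of-words vectorizer; cde2vec uses learned embeddings, so this stand-in only illustrates the ranking interface, not the real vectors. All names here are hypothetical.

```python
from collections import Counter
from math import sqrt

def tokens(text):
    return [t for t in text.lower().split() if t.isalpha()]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(source_desc, target_slots):
    """Rank target-slot descriptions by similarity to one source variable.

    Bag-of-words cosine is a stand-in for learned embeddings; the
    shape of the output (slot, score) pairs is the point.
    """
    src = Counter(tokens(source_desc))
    scored = [(slot, cosine(src, Counter(tokens(desc))))
              for slot, desc in target_slots.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

A curator (or the AI multi-tool) would then confirm the top-ranked candidate rather than trusting the score blindly.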
5. Generate Trans-Specs (#175)

Take aligned variable mappings and produce YAML transformation specifications using the trans-spec authoring tool (Sabrina's work from RTIInternational/NHLBI-BDC-DMC-HV#325, being brought into dm-bip). The authoring tool uses Jinja2 templates to generate well-formed YAML from curated input — humans shouldn't be writing YAML by hand.
The alignment output (stage 4) should flow directly into this stage as input, reducing the manual curation currently required to bridge them.
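To show the shape of that handoff, here is a stdlib `string.Template` stand-in for the real Jinja2 templates. The spec field names (`source_variable`, `target_slot`, `unit`) are invented for illustration and are not the actual trans-spec schema.

```python
from string import Template

# Jinja2 stand-in; the authoring tool renders richer templates, but the
# idea is the same: curated mappings in, well-formed YAML out.
# Field names are illustrative, not the real trans-spec schema.
SPEC_TMPL = Template("""\
- source_variable: $source
  target_slot: $target
  unit: $unit
""")

def render_trans_spec(mappings):
    """Render aligned variable mappings (stage 4 output) as YAML entries."""
    return "".join(SPEC_TMPL.substitute(m) for m in mappings)
```

Because the template owns the YAML syntax, curators only ever touch the tabular mapping input, never the indentation.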
6. Variable Library (#309)

The accumulated catalog of how source variables across studies connect to BDCHM target slots, including value/enum mappings. This is both an output of the alignment process and an input to future alignments — once a variable mapping is established for one study, it can inform mappings for similar variables in other studies.
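The reuse loop can be sketched as a lookup over prior alignments. Everything below is hypothetical: the entry fields, the sample study accession, and the crude name-prefix heuristic (a real library would also compare descriptions and value sets).

```python
# One library entry per established alignment; fields are illustrative.
LIBRARY = [
    {"study": "phs000001", "variable": "SYSBP1",
     "target_slot": "systolic_blood_pressure"},
]

def suggest_from_library(var_name):
    """Suggest target slots for a new variable from prior alignments.

    Heuristic: strip trailing visit/exam digits and compare names.
    Deliberately crude; descriptions and value sets should also count.
    """
    key = var_name.lower().rstrip("0123456789")
    hits = {e["target_slot"] for e in LIBRARY
            if e["variable"].lower().rstrip("0123456789") == key}
    return sorted(hits)
```

Each confirmed alignment appended to the library makes the next study's alignment pass cheaper, which is the "output and input" loop described above.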
Design Principles

- Deterministic first, AI-assisted second — automate what can be automated reliably. Use AI for genuinely ambiguous mappings, not as the default pathway.
- Progressive automation — each stage reduces manual effort. We don't need full automation on day one. Start with the highest-value automations (enum derivation from known value sets, reuse of established mappings) and expand.
- Pipeline artifacts, not standalone tools — outputs of each stage should be consumable by the next stage without manual intervention.
- Tolerance for messy data — real-world study data is inconsistent. The pipeline should handle what it can and surface what it can't for human review, not fail on imperfection.