Overview

This issue documents the intended pipeline for automating transformation specification (trans-spec) creation. The goal is a coherent flow from raw study metadata to finished trans-specs, where manual curation effort decreases over time as each stage becomes more automated.
Today, trans-spec authoring is largely manual: a curator examines source data dictionaries, identifies how source variables map to the BDCHM target model, and writes YAML by hand (or with template assistance). This pipeline aims to automate as much of that pathway as possible, reserving human effort for genuinely ambiguous decisions.
Resolution
This issue is resolved by documenting the pipeline architecture in the repo — the content below (refined through implementation) should become that documentation.
Pipeline Stages

1. Acquire Study Metadata (#204)

Fetch variable digest files (data_dict.xml, var_report.xml) from dbGaP or BDC study data buckets. These contain variable descriptions, coded value sets, units, and other metadata that downstream stages need.
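Once fetched, the digest files need to be parsed into per-variable records. The sketch below reads a data_dict.xml digest with the standard library; the element names (`data_table`, `variable`, `name`, `description`, `unit`, `value`) follow the published dbGaP layout, but real files vary by study, so verify against actual downloads.

```python
import xml.etree.ElementTree as ET

def parse_data_dict(xml_text):
    """Extract per-variable metadata from a dbGaP data_dict.xml digest.

    Element names follow the dbGaP data-dictionary layout; check them
    against real study files before relying on this.
    """
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.findall("variable"):
        variables.append({
            "id": var.get("id"),
            "name": var.findtext("name"),
            "description": var.findtext("description"),
            "unit": var.findtext("unit"),  # None when absent
            # coded value sets become candidate enums downstream
            "values": {v.get("code"): v.text for v in var.findall("value")},
        })
    return variables
```

The coded `values` dict is what later stages promote to enums, so it is worth capturing even when a variable has only a handful of codes.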
2. Normalize to Standard Format (#103)

Ideally, data submitters would provide well-structured, computer-readable data dictionaries in a reliable format. If we had that, enrichment would be trivial. In practice, we get inconsistent metadata in various formats. This stage transforms what we receive into something useful — the process should move through a standard data dictionary format even if the input doesn't start there. This may prove impractical for some sources, and that's okay.
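Normalization can be as simple as coercing source-specific column headers onto one canonical record shape. A minimal sketch, assuming a hypothetical canonical entry and alias table (the field names and aliases here are illustrative, not a ratified standard):

```python
from dataclasses import dataclass, field

# Canonical shape for one data-dictionary entry; illustrative, not a standard.
@dataclass
class DictEntry:
    name: str
    description: str = ""
    unit: str = ""
    values: dict = field(default_factory=dict)

# Source-specific header aliases (hypothetical examples of what shows up).
ALIASES = {
    "name": ["name", "variable", "var_name", "VARNAME"],
    "description": ["description", "desc", "label", "VARDESC"],
    "unit": ["unit", "units", "UNITS"],
}

def normalize_row(row: dict) -> DictEntry:
    """Map a raw row with source-specific headers onto the canonical entry."""
    out = {}
    for target, candidates in ALIASES.items():
        for c in candidates:
            if c in row and row[c]:
                out[target] = str(row[c]).strip()
                break
    return DictEntry(**out)
```

Rows that fail to yield even a variable name are exactly the cases the pipeline should surface for human review rather than silently drop.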
3. Build and Enrich Source Schema (#80, #307)

Generate a LinkML schema from the source data and/or data dictionaries. Schema-automator handles inference from data files; schemasheets handles data dictionary inputs. Conceptually these should be one operation — use both sources when available to produce the richest possible schema.
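For orientation, an enriched source schema might look roughly like the fragment below. Class, slot, and enum names are invented for illustration; only the LinkML constructs (`slots`, `range`, `enums`, `permissible_values`) are real.

```yaml
# Hypothetical enriched LinkML source schema (illustrative names only)
classes:
  SubjectPhenotypes:
    slots:
      - SEX
slots:
  SEX:
    description: Participant sex     # pulled from data_dict.xml
    range: SexEnum                   # coded values promoted to an enum
enums:
  SexEnum:
    permissible_values:
      "1":
        title: Male
      "2":
        title: Female
```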
The generated schema should be enriched with metadata from the data dictionaries: variable descriptions, units, coded value sets (enums), and study documentation. The richer the source schema, the more downstream automation is possible. For example, if source enums are captured in the schema and the target model has defined enums, the pipeline can derive value mappings automatically instead of requiring manual curation.
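That enum-derivation step can start as a plain normalized-label match. A minimal sketch (the function and its return shape are assumptions, not an existing tool): compare source value labels against target permissible values, and return unmatched codes separately so they surface for review.

```python
def derive_enum_mapping(source_values, target_values):
    """Propose source-to-target enum mappings by normalized-label match.

    source_values: {code: label} from the source schema enum
    target_values: permissible values of the target (BDCHM) enum
    Returns (mapped, unmapped); unmapped codes go to human review.
    """
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    target_by_norm = {norm(t): t for t in target_values}
    mapped, unmapped = {}, []
    for code, label in source_values.items():
        hit = target_by_norm.get(norm(label))
        if hit:
            mapped[code] = hit
        else:
            unmapped.append(code)
    return mapped, unmapped
```

Exact matching is deliberately conservative: a wrong automatic value mapping is worse than a gap, so anything fuzzier belongs in the AI-assisted stage.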
4. Align Source Variables to Target Model (#199, #308)
Map source variables to BDCHM target slots. This is where AI-assisted tooling and CDE matching help:
- AI variable mapping tool ("AI multi-tool") — currently in testing with one curator. Takes source metadata and proposes variable-to-slot mappings. See "Create script to run AI API" (#199).
- CDE matching (cde2vec) — vector-based similarity matching across variable definitions. Madan's existing work in monarch-initiative/cde-harmonization has direct synergy here.
- Agent-assisted authoring — Chris Siege has used Claude-based agent swarms in VS Code to help author trans-specs for new datasets (LTRC, SPIROMICS in NHLBI-BDC-DMC-HV). This is a complementary approach for complex or ambiguous mappings where human-directed AI assistance is more appropriate than fully automated alignment.
The deterministic pipeline should handle as much as possible automatically. Agent-based approaches have their place for hard cases and as a cross-check, but should not be the core pathway.
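The similarity-matching idea can be sketched with a toy bag-of-words vectorizer; cde2vec uses learned embeddings, so this stand-in only illustrates the ranking interface, not the real vectors. All names here are hypothetical.

```python
from collections import Counter
from math import sqrt

def tokens(text):
    return [t for t in text.lower().split() if t.isalpha()]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(source_desc, target_slots):
    """Rank target-slot descriptions by similarity to one source variable.

    Bag-of-words cosine is a stand-in for learned embeddings; the
    shape of the output (slot, score) pairs is the point.
    """
    src = Counter(tokens(source_desc))
    scored = [(slot, cosine(src, Counter(tokens(desc))))
              for slot, desc in target_slots.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

A curator (or the AI multi-tool) would then confirm the top-ranked candidate rather than trusting the score blindly.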
5. Generate Trans-Specs (#175)

Take aligned variable mappings and produce YAML transformation specifications using the trans-spec authoring tool (Sabrina's work from RTIInternational/NHLBI-BDC-DMC-HV#325, being brought into dm-bip). The authoring tool uses Jinja2 templates to generate well-formed YAML from curated input — humans shouldn't be writing YAML by hand.
The alignment output (stage 4) should flow directly into this stage as input, reducing the manual curation currently required to bridge them.
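To show the shape of that handoff, here is a stdlib `string.Template` stand-in for the real Jinja2 templates. The spec field names (`source_variable`, `target_slot`, `unit`) are invented for illustration and are not the actual trans-spec schema.

```python
from string import Template

# Jinja2 stand-in; the authoring tool renders richer templates, but the
# idea is the same: curated mappings in, well-formed YAML out.
# Field names are illustrative, not the real trans-spec schema.
SPEC_TMPL = Template("""\
- source_variable: $source
  target_slot: $target
  unit: $unit
""")

def render_trans_spec(mappings):
    """Render aligned variable mappings (stage 4 output) as YAML entries."""
    return "".join(SPEC_TMPL.substitute(m) for m in mappings)
```

Because the template owns the YAML syntax, curators only ever touch the tabular mapping input, never the indentation.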
6. Variable Library (#309)

The accumulated catalog of how source variables across studies connect to BDCHM target slots, including value/enum mappings. This is both an output of the alignment process and an input to future alignments — once a variable mapping is established for one study, it can inform mappings for similar variables in other studies.
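The reuse loop can be sketched as a lookup over prior alignments. Everything below is hypothetical: the entry fields, the sample study accession, and the crude name-prefix heuristic (a real library would also compare descriptions and value sets).

```python
# One library entry per established alignment; fields are illustrative.
LIBRARY = [
    {"study": "phs000001", "variable": "SYSBP1",
     "target_slot": "systolic_blood_pressure"},
]

def suggest_from_library(var_name):
    """Suggest target slots for a new variable from prior alignments.

    Heuristic: strip trailing visit/exam digits and compare names.
    Deliberately crude; descriptions and value sets should also count.
    """
    key = var_name.lower().rstrip("0123456789")
    hits = {e["target_slot"] for e in LIBRARY
            if e["variable"].lower().rstrip("0123456789") == key}
    return sorted(hits)
```

Each confirmed alignment appended to the library makes the next study's alignment pass cheaper, which is the "output and input" loop described above.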
Design Principles

- Deterministic first, AI-assisted second — automate what can be automated reliably. Use AI for genuinely ambiguous mappings, not as the default pathway.
- Progressive automation — each stage reduces manual effort. We don't need full automation on day one. Start with the highest-value automations (enum derivation from known value sets, reuse of established mappings) and expand.
- Pipeline artifacts, not standalone tools — outputs of each stage should be consumable by the next stage without manual intervention.
- Tolerance for messy data — real-world study data is inconsistent. The pipeline should handle what it can and surface what it can't for human review, not fail on imperfection.