The Human Tumor Atlas Network (HTAN) is transitioning from its Phase 1 data model — a flat Schematic CSV with 59 classes, free-text fields, and cancer-type-specific tiers — to a Phase 2 model built on LinkML with ontology-coded attributes, age-in-days temporal fields, and a hierarchical class structure.
This repository provides:
- Field-level SSSOM mappings between the two models (309 matched field pairs)
- Value-level SSSOM mappings for enum translation (356 clinical value matches)
- A config-driven migration engine that transforms HTAN1 tabular data to HTAN2-compatible format
- OLS-verified ontology lookup tables for UBERON tissues and NCIt diagnoses
The migration has been validated against the HTAN2 v1.3.0 JSON Schemas: all 10,077 rows of real HTAN1 clinical data pulled from BigQuery pass (100% compliance). See mappings/htan1_to_htan2/MIGRATION_REPORT.md for detailed results.
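To make the SSSOM mapping files listed above more concrete, here is a minimal sketch of reading one in Python. It assumes the standard SSSOM TSV layout (metadata lines prefixed with `#`, followed by a tab-separated table). The `htan1:` and `htan2:` prefixes, the `mapping_set_id` URL, and the sample row are hypothetical and are not copied from the files in this repository.

```python
from __future__ import annotations

import csv
import io


def read_sssom_tsv(text: str) -> tuple[list[str], list[dict[str, str]]]:
    """Split an SSSOM TSV into its '#'-prefixed metadata lines and mapping rows."""
    metadata: list[str] = []
    body: list[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Drop only the '#'; indentation is kept so the lines can be fed to a YAML parser.
            metadata.append(line[1:])
        elif line.strip():
            body.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(body)), delimiter="\t"))
    return metadata, rows


# Hypothetical sample content; identifiers are illustrative only.
SAMPLE = (
    "# mapping_set_id: https://example.org/htan1_to_htan2/clinical_fields\n"
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\tconfidence\n"
    "htan1:Ethnicity\tskos:exactMatch\thtan2:ETHNIC_GROUP\tsemapv:LexicalMatching\t0.9\n"
)

metadata, rows = read_sssom_tsv(SAMPLE)

# Keep only mappings at or above a chosen confidence threshold.
renames = {r["subject_id"]: r["object_id"] for r in rows if float(r["confidence"]) >= 0.9}
print(renames)
```

Filtering on `confidence` before building the rename table is one plausible way to use the scores; the threshold here is arbitrary.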
pip install -e ".[dev]"
# Pull data from BigQuery (requires gcloud auth)
uv run htan query bq sql "SELECT * FROM \`isb-cgc-bq.HTAN.clinical_tier1_demographics_current\` LIMIT 3000" \
--format csv 2>/dev/null | python3 -c "import csv,sys; [print('\t'.join(r)) for r in csv.reader(sys.stdin)]" \
> /tmp/htan1_demographics.tsv
# Run migration
uv run python scripts/migrate.py \
--input /tmp/htan1_demographics.tsv \
--config configs/htan1_to_htan2/clinical.transform.yaml \
--source-class Demographics \
--output output/htan1_to_htan2/ \
--normalize-columns
# Validate against HTAN2 JSON Schema
uv run python scripts/validate_transformed.py \
--input-dir output/htan1_to_htan2/ \
--schema-dir /path/to/htan2_json_schemas/ \
--ignore-patterns
# Generate SSSOM mappings (Claude Code skill)
/map-models --source ncihtan/data-models@v25.2.1 --target ncihtan/htan2-data-model@v1.3.0
# Validate the SSSOM mapping files
uv run python scripts/validate_mappings.py mappings/htan1_to_htan2/
mappings/ SSSOM TSV files (field-level + value-level)
htan1_to_htan2/
clinical_fields.sssom.tsv 68 field mappings
clinical_values.sssom.tsv 356 value mappings (deterministic + semantic)
biospecimen_fields.sssom.tsv 35 field mappings
assay_fields.sssom.tsv 206 field mappings
MAPPING.md Field mapping methodology
MIGRATION_REPORT.md Migration results and validation
configs/ Transform configs (YAML)
htan1_to_htan2/
clinical.transform.yaml Conversions, structural transforms, defaults
biospecimen.transform.yaml
assay.transform.yaml
lookups/ Ontology lookup tables (JSON)
uberon_labels_to_codes.json 22,603 entries (HTAN2 enums + 73 OLS-verified)
ncit_diagnosis_to_codes.json 20,195 entries (HTAN2 enums + 59 OLS-verified)
scripts/ Python utilities
migrate.py Migration engine (5-tier pipeline)
value_match.py Deterministic value matching
semantic_value_match.py LLM-assisted value matching
ols_lookup.py EBI OLS4 API for ontology resolution
build_lookup_tables.py Extract lookups from HTAN2 enum YAMLs
validate_transformed.py JSON Schema / LinkML validation
normalize_model.py Model format normalization
deterministic_match.py Field matching (caDSR + name)
generate_sssom_tsv.py SSSOM TSV generation
validate_mappings.py SSSOM file validation
tests/ pytest test suite (36 tests)
.claude/skills/ Claude Code skill definitions
map-models/ /map-models — generate SSSOM mappings
migrate-data/ /migrate-data — orchestrate migration
The migration engine (scripts/migrate.py) applies five tiers of transformation, driven by YAML config files:
| Tier | Transform | Example |
|---|---|---|
| 1 | Field renaming | Ethnicity → ETHNIC_GROUP (from field SSSOM) |
| 2 | Value remapping | not hispanic or latino → Not Hispanic or Latino (from value SSSOM) |
| 3 | Conversions | Colon NOS → UBERON:0001155 (text-to-ontology via OLS-verified lookup) |
| 4 | Structural | Gender split → GENDER_IDENTITY + SEX; Vital Status relocated Demographics → VitalStatus |
| 5 | Defaults | Empty required fields → Not Reported / Unknown / -1 (per HTAN2 sentinel conventions) |
After the five tiers, post-processing applies value corrections for enum mismatches, converts text values in numeric fields to integer sentinels, and strips the ICD-O "NOS" suffix.
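The tiers above can be sketched as a single function over one row. This is a simplified illustration, not the code in scripts/migrate.py: the `CONFIG` layout, the `SITE` field name, and the choice of `Not Reported` as the default for both fields are assumptions made for the example, and tier 4 (structural splits and relocations) is left out.

```python
from typing import Any

# Hypothetical config; the real YAML schema used by the migration engine may differ.
CONFIG: dict[str, Any] = {
    "renames": {"Ethnicity": "ETHNIC_GROUP"},  # tier 1
    "value_maps": {  # tier 2
        "ETHNIC_GROUP": {"not hispanic or latino": "Not Hispanic or Latino"},
    },
    "lookups": {"SITE": {"colon nos": "UBERON:0001155"}},  # tier 3 (SITE is a made-up field)
    "defaults": {"ETHNIC_GROUP": "Not Reported", "SITE": "Not Reported"},  # tier 5
}


def migrate_row(row: dict[str, str], config: dict[str, Any]) -> dict[str, str]:
    # Tier 1: rename fields; fields without a mapping keep their original name.
    out = {config["renames"].get(name, name): value for name, value in row.items()}

    # Tier 2: remap enum values, matching the source value case-insensitively.
    for field, mapping in config["value_maps"].items():
        if field in out:
            out[field] = mapping.get(out[field].strip().lower(), out[field])

    # Tier 3: convert free text to an ontology code through a lookup table.
    for field, table in config["lookups"].items():
        if field in out:
            out[field] = table.get(out[field].strip().lower(), out[field])

    # Tier 4 (structural transforms) is omitted from this sketch.

    # Tier 5: fill empty or missing required fields with a sentinel default.
    for field, default in config["defaults"].items():
        if not out.get(field, "").strip():
            out[field] = default
    return out


migrated = migrate_row({"Ethnicity": "not hispanic or latino", "SITE": "Colon NOS"}, CONFIG)
print(migrated)
```

Running the tiers in a fixed order matters: values are remapped under the renamed field, and defaults are applied last so they only fill what earlier tiers left empty.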
Mappings are generated by a multi-stage pipeline combining deterministic matching with LLM-assisted semantic matching:
- Model Extraction — Fetch from GitHub, normalize to common JSON format
- Deterministic Matching — caDSR ID match (confidence 1.0) + normalized name match (0.9)
- Semantic Matching — Parallel Haiku agents evaluate descriptions, enum overlap, naming patterns
- Quality Review — Sonnet agent checks duplicates, cross-class moves, 1-to-many splits
- SSSOM Generation — TSV output with YAML metadata headers
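The deterministic matching stage can be illustrated with a short sketch. It assumes each field carries a name and an optional caDSR ID; the `Field` class, the `normalize` rule, and the sample fields and IDs are hypothetical and may differ from what scripts/deterministic_match.py does.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    cadsr_id: str | None = None


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'Vital Status' equals 'VITAL_STATUS'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def deterministic_match(
    source: list[Field], target: list[Field]
) -> list[tuple[str, str, float]]:
    by_cadsr = {t.cadsr_id: t for t in target if t.cadsr_id}
    by_name = {normalize(t.name): t for t in target}
    matches = []
    for s in source:
        if s.cadsr_id and s.cadsr_id in by_cadsr:
            # Shared caDSR ID: strongest evidence, confidence 1.0.
            matches.append((s.name, by_cadsr[s.cadsr_id].name, 1.0))
        elif normalize(s.name) in by_name:
            # Same name after normalization: confidence 0.9.
            matches.append((s.name, by_name[normalize(s.name)].name, 0.9))
        # Anything else is left for the semantic matching stage.
    return matches


# Hypothetical fields and caDSR IDs, for illustration only.
source_fields = [Field("Age at Diagnosis", "0000001"), Field("Vital Status"), Field("Gender")]
target_fields = [Field("AGE_IN_DAYS_AT_DIAGNOSIS", "0000001"), Field("VITAL_STATUS")]

matches = deterministic_match(source_fields, target_fields)
print(matches)
```

Checking the caDSR ID before the name means a field that was renamed between phases still matches at full confidence.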
| Domain | Matched | Unmatched Source | New in Phase 2 | Avg Confidence |
|---|---|---|---|---|
| Clinical | 68 | 351 | 9 | 0.88 |
| Biospecimen | 35 | 31 | 2 | 0.88 |
| Assay | 206 | 411 | 77 | 0.87 |
| Total | 309 | 793 | 88 | 0.87 |
Value-level: 356 clinical value matches (285 deterministic + 71 LLM-assisted), avg confidence 0.94.
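The totals above can be cross-checked with a few lines of arithmetic. The figures are copied from the table; treating the overall average confidence as a mean weighted by matched-field count is an inference, not something the table states.

```python
# (matched, unmatched_source, new_in_phase2, avg_confidence) per domain, from the table.
DOMAINS = {
    "Clinical": (68, 351, 9, 0.88),
    "Biospecimen": (35, 31, 2, 0.88),
    "Assay": (206, 411, 77, 0.87),
}

matched = sum(d[0] for d in DOMAINS.values())
unmatched = sum(d[1] for d in DOMAINS.values())
new_in_phase2 = sum(d[2] for d in DOMAINS.values())

# Assumption: the overall average is weighted by the number of matched fields.
weighted_conf = sum(d[0] * d[3] for d in DOMAINS.values()) / matched

print(matched, unmatched, new_in_phase2, round(weighted_conf, 2))  # 309 793 88 0.87

# Value-level matches: deterministic + LLM-assisted.
value_matches = 285 + 71
print(value_matches)  # 356
```

The weighted mean reproduces the reported 0.87, whereas a plain mean of the three domain averages would round to 0.88.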
Text-to-ontology conversions use lookup tables built from HTAN2 enum YAML files, verified against the EBI OLS4 API:
- UBERON (tissues): 22,603 entries — 73 HTAN1 tissue terms OLS-verified, 100% resolution rate
- NCIt (diagnoses): 20,195 entries — 59 ICD-O morphology terms OLS-verified, 100% resolution rate
The scripts/ols_lookup.py utility provides search, resolve, crosswalk, and verify commands against the OLS4 API (free, no auth required).
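For orientation, a label lookup against OLS4 might look like the sketch below. It is not the implementation in scripts/ols_lookup.py. The endpoint URL, the `q`, `ontology`, and `exact` query parameters, and the `response.docs[].obo_id` response shape reflect the public OLS4 search API as I understand it and should be treated as assumptions to verify against the OLS4 documentation.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed OLS4 search endpoint; verify against the current OLS4 documentation.
OLS4_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"


def build_search_url(label: str, ontology: str) -> str:
    """Build an exact-match search URL for one label within one ontology."""
    query = urlencode({"q": label, "ontology": ontology, "exact": "true"})
    return f"{OLS4_SEARCH}?{query}"


def first_obo_id(payload: dict) -> str | None:
    """Return the OBO ID of the first hit, or None if there are no hits."""
    docs = payload.get("response", {}).get("docs", [])
    return docs[0].get("obo_id") if docs else None


def resolve(label: str, ontology: str) -> str | None:
    """Query OLS4 over the network and return the first matching OBO ID."""
    with urlopen(build_search_url(label, ontology), timeout=30) as resp:
        return first_obo_id(json.load(resp))


# Offline demonstration with a hand-written payload in the assumed response shape.
sample_payload = {"response": {"docs": [{"obo_id": "UBERON:0001155", "label": "colon"}]}}
print(build_search_url("colon", "uberon"))
print(first_obo_id(sample_payload))
```

Keeping URL construction and response parsing separate from the network call makes both testable offline; `resolve("colon", "uberon")` would perform the live lookup.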