htan-model-mappings

The Human Tumor Atlas Network (HTAN) is transitioning from its Phase 1 data model (a flat Schematic CSV with 59 classes, free-text fields, and cancer-type-specific tiers) to a Phase 2 model built on LinkML, with ontology-coded attributes, age-in-days temporal fields, and a hierarchical class structure.

This repository provides:

- Field-level SSSOM mappings between the two models (309 matched field pairs)
- Value-level SSSOM mappings for enum translation (356 clinical value matches)
- A config-driven migration engine that transforms HTAN1 tabular data to HTAN2-compatible format
- OLS-verified ontology lookup tables for UBERON tissues and NCIt diagnoses
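
SSSOM TSV files carry a block of `#`-prefixed YAML metadata lines followed by a tab-separated table. As a rough sketch of how field-level mappings could be loaded into a rename dictionary, assuming the standard SSSOM columns `subject_id`, `object_id`, and `confidence` (the sample rows, CURIE prefixes, and the `load_field_mappings` helper are made up for illustration and may not match the files in this repository):

```python
import csv

# Hypothetical sample in SSSOM TSV layout: '#'-prefixed YAML metadata, then a TSV table.
# Identifiers and confidence values are illustrative, not copied from the repository.
SAMPLE_SSSOM = """\
#mapping_set_id: https://example.org/htan1_to_htan2/clinical_fields
#license: https://creativecommons.org/publicdomain/zero/1.0/
subject_id\tpredicate_id\tobject_id\tmapping_justification\tconfidence
htan1:Ethnicity\tskos:exactMatch\thtan2:ETHNIC_GROUP\tsemapv:LexicalMatching\t0.9
htan1:Race\tskos:exactMatch\thtan2:RACE\tsemapv:LexicalMatching\t0.9
htan1:Gender\tskos:relatedMatch\thtan2:SEX\tsemapv:ManualMappingCuration\t0.6
"""


def load_field_mappings(text, min_confidence=0.0):
    """Return {source_field: target_field} for rows at or above min_confidence."""
    # Drop blank lines and the YAML metadata header; the rest is plain TSV.
    table_lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(table_lines, delimiter="\t")
    renames = {}
    for row in reader:
        if float(row["confidence"]) >= min_confidence:
            # Strip the CURIE prefix (e.g. "htan1:") to get the bare field name.
            source = row["subject_id"].split(":", 1)[1]
            target = row["object_id"].split(":", 1)[1]
            renames[source] = target
    return renames


renames = load_field_mappings(SAMPLE_SSSOM, min_confidence=0.8)
print(renames)
```

Filtering on `confidence` is one way a consumer might separate high-confidence renames from mappings that need manual review; the threshold of 0.8 is arbitrary.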

Migrated output validates against the HTAN2 v1.3.0 JSON Schemas with 100% compliance across 10,077 rows of real HTAN1 clinical data pulled from BigQuery. See mappings/htan1_to_htan2/MIGRATION_REPORT.md for detailed results.

Quick Start

pip install -e ".[dev]"

Migrate HTAN1 Data

# Pull data from BigQuery (requires gcloud auth)
uv run htan query bq sql "SELECT * FROM \`isb-cgc-bq.HTAN.clinical_tier1_demographics_current\` LIMIT 3000" \
  --format csv 2>/dev/null | python3 -c "import csv,sys; [print('\t'.join(r)) for r in csv.reader(sys.stdin)]" \
  > /tmp/htan1_demographics.tsv

# Run migration
uv run python scripts/migrate.py \
  --input /tmp/htan1_demographics.tsv \
  --config configs/htan1_to_htan2/clinical.transform.yaml \
  --source-class Demographics \
  --output output/htan1_to_htan2/ \
  --normalize-columns

# Validate against HTAN2 JSON Schema
uv run python scripts/validate_transformed.py \
  --input-dir output/htan1_to_htan2/ \
  --schema-dir /path/to/htan2_json_schemas/ \
  --ignore-patterns

Generate Mappings (via Claude Code skill)

/map-models --source ncihtan/data-models@v25.2.1 --target ncihtan/htan2-data-model@v1.3.0

Validate Existing Mappings

uv run python scripts/validate_mappings.py mappings/htan1_to_htan2/

Structure

mappings/                          SSSOM TSV files (field-level + value-level)
  htan1_to_htan2/
    clinical_fields.sssom.tsv      68 field mappings
    clinical_values.sssom.tsv      356 value mappings (deterministic + semantic)
    biospecimen_fields.sssom.tsv   35 field mappings
    assay_fields.sssom.tsv         206 field mappings
    MAPPING.md                     Field mapping methodology
    MIGRATION_REPORT.md            Migration results and validation

configs/                           Transform configs (YAML)
  htan1_to_htan2/
    clinical.transform.yaml        Conversions, structural transforms, defaults
    biospecimen.transform.yaml
    assay.transform.yaml

lookups/                           Ontology lookup tables (JSON)
  uberon_labels_to_codes.json      22,603 entries (HTAN2 enums + 73 OLS-verified)
  ncit_diagnosis_to_codes.json     20,195 entries (HTAN2 enums + 59 OLS-verified)

scripts/                           Python utilities
  migrate.py                       Migration engine (5-tier pipeline)
  value_match.py                   Deterministic value matching
  semantic_value_match.py          LLM-assisted value matching
  ols_lookup.py                    EBI OLS4 API for ontology resolution
  build_lookup_tables.py           Extract lookups from HTAN2 enum YAMLs
  validate_transformed.py          JSON Schema / LinkML validation
  normalize_model.py               Model format normalization
  deterministic_match.py           Field matching (caDSR + name)
  generate_sssom_tsv.py            SSSOM TSV generation
  validate_mappings.py             SSSOM file validation

tests/                             pytest test suite (36 tests)

.claude/skills/                    Claude Code skill definitions
  map-models/                      /map-models — generate SSSOM mappings
  migrate-data/                    /migrate-data — orchestrate migration

Migration Pipeline

The migration engine (scripts/migrate.py) applies five tiers of transformation, driven by YAML config files:

| Tier | Transform | Example |
|------|-----------|---------|
| 1 | Field renaming | `Ethnicity` → `ETHNIC_GROUP` (from field SSSOM) |
| 2 | Value remapping | `not hispanic or latino` → `Not Hispanic or Latino` (from value SSSOM) |
| 3 | Conversions | `Colon NOS` → `UBERON:0001155` (text-to-ontology via OLS-verified lookup) |
| 4 | Structural | `Gender` split → `GENDER_IDENTITY` + `SEX`; `Vital Status` relocated Demographics → VitalStatus |
| 5 | Defaults | Empty required fields → `Not Reported` / `Unknown` / `-1` (per HTAN2 sentinel conventions) |

Additional post-processing covers value corrections for enum mismatches, conversion of text values in numeric fields to integer sentinels, and stripping of the ICD-O NOS suffix.
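
To make the five tiers concrete, here is a minimal sketch of how they might be applied to a single row. The `CONFIG` dictionary, its keys, the `RACE` and `Tissue or Organ of Origin` field names, and the `migrate_row` function are hypothetical stand-ins; they do not reflect the actual layout of the YAML configs or the internals of scripts/migrate.py. Only the `Ethnicity`, `Colon NOS`, and `Gender` examples come from the table above.

```python
# Hypothetical in-memory config; the real YAML layout may differ.
CONFIG = {
    "renames": {"Ethnicity": "ETHNIC_GROUP"},  # tier 1
    "value_maps": {  # tier 2
        "ETHNIC_GROUP": {"not hispanic or latino": "Not Hispanic or Latino"},
    },
    "conversions": {  # tier 3
        "Tissue or Organ of Origin": {"colon nos": "UBERON:0001155"},
    },
    "splits": {"Gender": ["GENDER_IDENTITY", "SEX"]},  # tier 4
    "defaults": {"ETHNIC_GROUP": "Not Reported", "RACE": "Not Reported"},  # tier 5
}


def migrate_row(row, config):
    """Apply the five tiers in order and return a new dict."""
    # Tier 1: rename fields that have a field-level mapping.
    out = {config["renames"].get(name, name): value for name, value in row.items()}

    # Tier 2: remap enum values (case-insensitive on the source value).
    for field, mapping in config["value_maps"].items():
        if field in out:
            out[field] = mapping.get(out[field].strip().lower(), out[field])

    # Tier 3: convert free text to ontology codes via a lookup table.
    for field, lookup in config["conversions"].items():
        if field in out:
            out[field] = lookup.get(out[field].strip().lower(), out[field])

    # Tier 4: structural split. Copying the value to both targets is a
    # placeholder assumption; the real split logic is not described here.
    for field, targets in config["splits"].items():
        if field in out:
            value = out.pop(field)
            for target in targets:
                out[target] = value

    # Tier 5: fill empty or missing required fields with sentinel defaults.
    for field, default in config["defaults"].items():
        if not out.get(field):
            out[field] = default
    return out


row = {
    "Ethnicity": "not hispanic or latino",
    "Tissue or Organ of Origin": "Colon NOS",
    "Gender": "female",
}
migrated = migrate_row(row, CONFIG)
print(migrated)
```

Running the tiers in a fixed order means later tiers can key on the renamed HTAN2 field names, which is why the value map and defaults above refer to `ETHNIC_GROUP` rather than `Ethnicity`.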

Mapping Methods

Mappings are generated by a multi-stage pipeline combining deterministic matching with LLM-assisted semantic matching:

1. Model Extraction: fetch from GitHub, normalize to common JSON format
2. Deterministic Matching: caDSR ID match (confidence 1.0) + normalized name match (0.9)
3. Semantic Matching: parallel Haiku agents evaluate descriptions, enum overlap, naming patterns
4. Quality Review: Sonnet agent checks duplicates, cross-class moves, 1-to-many splits
5. SSSOM Generation: TSV output with YAML metadata headers
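
A minimal sketch of the deterministic matching stage, assuming each field is a dict with a `name` and an optional `cadsr_id`. These keys, the sample field names and IDs, and the normalization rule (lowercase, drop non-alphanumerics) are assumptions for illustration; the normalized JSON format and the logic in scripts/deterministic_match.py may differ.

```python
import re


def normalize(name):
    """Lowercase and drop anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def deterministic_match(source_fields, target_fields):
    """Return (source_name, target_name, confidence) tuples."""
    by_cadsr = {f["cadsr_id"]: f for f in target_fields if f.get("cadsr_id")}
    by_name = {normalize(f["name"]): f for f in target_fields}
    matches = []
    for src in source_fields:
        # A shared caDSR ID is the strongest signal: confidence 1.0.
        target = by_cadsr.get(src.get("cadsr_id"))
        if target is not None:
            matches.append((src["name"], target["name"], 1.0))
            continue
        # Otherwise fall back to a normalized name comparison: confidence 0.9.
        target = by_name.get(normalize(src["name"]))
        if target is not None:
            matches.append((src["name"], target["name"], 0.9))
    return matches


# Made-up fields and caDSR IDs, for illustration only.
source = [
    {"name": "Ethnicity", "cadsr_id": "1111111"},
    {"name": "Vital Status"},
    {"name": "Days to Recurrence"},
]
target = [
    {"name": "ETHNIC_GROUP", "cadsr_id": "1111111"},
    {"name": "VITAL_STATUS"},
]
matches = deterministic_match(source, target)
print(matches)
```

Fields that match on neither key (here `Days to Recurrence`) are simply absent from the result, which is the set a later semantic stage would pick up.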

Current Results

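| Domain | Matched | Unmatched Source | New in Phase 2 | Avg Confidence |
|--------|---------|------------------|----------------|----------------|
| Clinical | 68 | 351 | 9 | 0.88 |
| Biospecimen | 35 | 31 | 2 | 0.88 |
| Assay | 206 | 411 | 77 | 0.87 |
| Total | 309 | 793 | 88 | 0.87 |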

Value-level: 356 clinical value matches (285 deterministic + 71 LLM-assisted), avg confidence 0.94.
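
As a quick consistency check on the table above, the totals can be recomputed from the per-domain rows. The overall average confidence is reproduced here under the assumption that it is weighted by matched pairs; this README does not state how the figure was computed, and the per-domain averages are already rounded.

```python
# Per-domain figures copied from the table:
# (matched, unmatched source, new in Phase 2, avg confidence)
RESULTS = {
    "Clinical": (68, 351, 9, 0.88),
    "Biospecimen": (35, 31, 2, 0.88),
    "Assay": (206, 411, 77, 0.87),
}

matched = sum(r[0] for r in RESULTS.values())
unmatched = sum(r[1] for r in RESULTS.values())
new_in_phase2 = sum(r[2] for r in RESULTS.values())

# Assumption: the overall average weights each domain by its matched count.
weighted_conf = sum(r[0] * r[3] for r in RESULTS.values()) / matched

print(matched, unmatched, new_in_phase2, round(weighted_conf, 2))
# prints: 309 793 88 0.87
```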

Ontology Resolution

Text-to-ontology conversions use lookup tables built from HTAN2 enum YAML files, verified against the EBI OLS4 API:

- UBERON (tissues): 22,603 entries; 73 HTAN1 tissue terms OLS-verified, 100% resolution rate
- NCIt (diagnoses): 20,195 entries; 59 ICD-O morphology terms OLS-verified, 100% resolution rate

The scripts/ols_lookup.py utility provides search, resolve, crosswalk, and verify commands against the OLS4 API (free, no auth required).
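
A minimal sketch of a text-to-ontology lookup, assuming the table is a flat mapping from lowercased label to CURIE. The single-entry `UBERON_LOOKUP` excerpt and the `resolve_label` helper are hypothetical; the actual structure of lookups/uberon_labels_to_codes.json, and whether it keys on the full label or on a form with the NOS qualifier stripped, is not described here.

```python
import re

# Hypothetical excerpt; assumes a flat {lowercased label: CURIE} layout.
UBERON_LOOKUP = {"colon": "UBERON:0001155"}


def resolve_label(label, lookup):
    """Return the ontology code for a free-text label, or None if unresolved."""
    key = label.strip().lower()
    if key in lookup:
        return lookup[key]
    # Fall back to the label without a trailing " NOS" or ", NOS" qualifier.
    stripped = re.sub(r",?\s+nos$", "", key)
    return lookup.get(stripped)


print(resolve_label("Colon NOS", UBERON_LOOKUP))
print(resolve_label("Unlisted tissue", UBERON_LOOKUP))
```

Returning `None` for unresolved labels, rather than raising, lets a caller collect misses and send them to an online resolver or manual review in one pass.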
