htan-model-mappings

The Human Tumor Atlas Network (HTAN) is transitioning from its Phase 1 data model — a flat Schematic CSV with 59 classes, free-text fields, and cancer-type-specific tiers — to a Phase 2 model built on LinkML with ontology-coded attributes, age-in-days temporal fields, and a hierarchical class structure.

This repository provides:

  1. Field-level SSSOM mappings between the two models (309 matched field pairs)
  2. Value-level SSSOM mappings for enum translation (356 clinical value matches)
  3. A config-driven migration engine that transforms HTAN1 tabular data to HTAN2-compatible format
  4. OLS-verified ontology lookup tables for UBERON tissues and NCIt diagnoses

The migration has been validated against HTAN2 v1.3.0 JSON Schemas at 100% compliance across 10,077 rows of real HTAN1 clinical data from BigQuery. See mappings/htan1_to_htan2/MIGRATION_REPORT.md for detailed results.

Quick Start

pip install -e ".[dev]"

Migrate HTAN1 Data

# Pull data from BigQuery (requires gcloud auth)
uv run htan query bq sql "SELECT * FROM \`isb-cgc-bq.HTAN.clinical_tier1_demographics_current\` LIMIT 3000" \
  --format csv 2>/dev/null \
  | python3 -c "import csv,sys; csv.writer(sys.stdout, delimiter='\t').writerows(csv.reader(sys.stdin))" \
  > /tmp/htan1_demographics.tsv

# Run migration
uv run python scripts/migrate.py \
  --input /tmp/htan1_demographics.tsv \
  --config configs/htan1_to_htan2/clinical.transform.yaml \
  --source-class Demographics \
  --output output/htan1_to_htan2/ \
  --normalize-columns

# Validate against HTAN2 JSON Schema
uv run python scripts/validate_transformed.py \
  --input-dir output/htan1_to_htan2/ \
  --schema-dir /path/to/htan2_json_schemas/ \
  --ignore-patterns
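
Conceptually, the per-row checks that JSON Schema validation performs can be sketched as below. This is a minimal hand-rolled illustration of the enum and required-field checks only; scripts/validate_transformed.py uses real HTAN2 v1.3.0 schemas, and the schema fragment and field names here are illustrative, not the actual model:

```python
# Illustrative fragment of an HTAN2-style class schema (hypothetical fields,
# not the real v1.3.0 schema file).
demographics_schema = {
    "required": ["ETHNIC_GROUP"],
    "enums": {
        "ETHNIC_GROUP": {"Hispanic or Latino", "Not Hispanic or Latino", "Not Reported"},
    },
}

def validate_row(row, schema):
    """Return a list of error strings; an empty list means the row is compliant."""
    errors = []
    for field in schema["required"]:
        if not row.get(field):
            errors.append(f"missing required field: {field}")
    for field, allowed in schema["enums"].items():
        if field in row and row[field] not in allowed:
            errors.append(f"{field}: {row[field]!r} not in enum")
    return errors

errors = validate_row({"ETHNIC_GROUP": "Not Reported"}, demographics_schema)
```

The reported 100% compliance means this error list is empty for every transformed row.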

Generate Mappings (via Claude Code skill)

/map-models --source ncihtan/data-models@v25.2.1 --target ncihtan/htan2-data-model@v1.3.0

Validate Existing Mappings

uv run python scripts/validate_mappings.py mappings/htan1_to_htan2/

Structure

mappings/                          SSSOM TSV files (field-level + value-level)
  htan1_to_htan2/
    clinical_fields.sssom.tsv      68 field mappings
    clinical_values.sssom.tsv      356 value mappings (deterministic + semantic)
    biospecimen_fields.sssom.tsv   35 field mappings
    assay_fields.sssom.tsv         206 field mappings
    MAPPING.md                     Field mapping methodology
    MIGRATION_REPORT.md            Migration results and validation

configs/                           Transform configs (YAML)
  htan1_to_htan2/
    clinical.transform.yaml        Conversions, structural transforms, defaults
    biospecimen.transform.yaml
    assay.transform.yaml

lookups/                           Ontology lookup tables (JSON)
  uberon_labels_to_codes.json      22,603 entries (HTAN2 enums + 73 OLS-verified)
  ncit_diagnosis_to_codes.json     20,195 entries (HTAN2 enums + 59 OLS-verified)

scripts/                           Python utilities
  migrate.py                       Migration engine (5-tier pipeline)
  value_match.py                   Deterministic value matching
  semantic_value_match.py          LLM-assisted value matching
  ols_lookup.py                    EBI OLS4 API for ontology resolution
  build_lookup_tables.py           Extract lookups from HTAN2 enum YAMLs
  validate_transformed.py          JSON Schema / LinkML validation
  normalize_model.py               Model format normalization
  deterministic_match.py           Field matching (caDSR + name)
  generate_sssom_tsv.py            SSSOM TSV generation
  validate_mappings.py             SSSOM file validation

tests/                             pytest test suite (36 tests)

.claude/skills/                    Claude Code skill definitions
  map-models/                      /map-models — generate SSSOM mappings
  migrate-data/                    /migrate-data — orchestrate migration

Migration Pipeline

The migration engine (scripts/migrate.py) applies five tiers of transformation, driven by YAML config files:

Tier  Transform        Example
1     Field renaming   Ethnicity → ETHNIC_GROUP (from field SSSOM)
2     Value remapping  not hispanic or latino → Not Hispanic or Latino (from value SSSOM)
3     Conversions      Colon NOS → UBERON:0001155 (text-to-ontology via OLS-verified lookup)
4     Structural       Gender split → GENDER_IDENTITY + SEX; Vital Status relocated Demographics → VitalStatus
5     Defaults         Empty required fields → Not Reported / Unknown / -1 (per HTAN2 sentinel conventions)
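
A transform config wiring up tiers like these might look roughly as follows. The YAML keys and field names here are a hypothetical sketch of the shape, not the actual schema; see configs/htan1_to_htan2/clinical.transform.yaml for the real one:

```yaml
# Hypothetical sketch of a clinical transform config (illustrative keys only).
renames:                      # tier 1: from field-level SSSOM
  Ethnicity: ETHNIC_GROUP
value_maps:                   # tier 2: from value-level SSSOM
  ETHNIC_GROUP:
    "not hispanic or latino": "Not Hispanic or Latino"
conversions:                  # tier 3: text-to-ontology lookups
  TISSUE_SITE:
    lookup: lookups/uberon_labels_to_codes.json
structural:                   # tier 4: splits and cross-class moves
  - split: {source: Gender, targets: [GENDER_IDENTITY, SEX]}
  - move: {field: "Vital Status", from: Demographics, to: VitalStatus}
defaults:                     # tier 5: HTAN2 sentinel conventions
  ETHNIC_GROUP: "Not Reported"
  AGE_AT_ENROLLMENT: -1
```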

Additional post-processing: value corrections for enum mismatches, integer sentinel conversion for text values in numeric fields, ICD-O NOS suffix stripping.
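
For illustration, the NOS-stripping and integer-sentinel steps could be sketched like this. The function names and exact normalization rules are mine, not the migration engine's actual API:

```python
import re

def strip_nos_suffix(diagnosis: str) -> str:
    """Drop a trailing ICD-O ', NOS' / ' NOS' qualifier so the base term
    can hit the NCIt lookup table (hypothetical helper)."""
    return re.sub(r",?\s+NOS$", "", diagnosis.strip())

def to_int_sentinel(value, sentinel: int = -1) -> int:
    """Coerce a numeric field to int, mapping text values such as
    'Not Reported' to the HTAN2 integer sentinel (hypothetical helper)."""
    try:
        return int(float(value))
    except (TypeError, ValueError):
        return sentinel
```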

Mapping Methods

Mappings are generated by a multi-stage pipeline combining deterministic matching with LLM-assisted semantic matching:

  1. Model Extraction — Fetch from GitHub, normalize to common JSON format
  2. Deterministic Matching — caDSR ID match (confidence 1.0) + normalized name match (0.9)
  3. Semantic Matching — Parallel Haiku agents evaluate descriptions, enum overlap, naming patterns
  4. Quality Review — Sonnet agent checks duplicates, cross-class moves, 1-to-many splits
  5. SSSOM Generation — TSV output with YAML metadata headers
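
The generated SSSOM output pairs a comment-prefixed YAML metadata header with a TSV body. A minimal field-level example, with illustrative CURIEs and rows (the real files carry fuller metadata), looks like:

```
# curie_map:
#   NCIT: http://purl.obolibrary.org/obo/NCIT_
# mapping_set_id: https://github.com/ncihtan/htan-model-mappings/clinical_fields
# license: https://creativecommons.org/licenses/by/4.0/
subject_id	predicate_id	object_id	mapping_justification	confidence
htan1:Demographics.Ethnicity	skos:exactMatch	htan2:Participant.ETHNIC_GROUP	semapv:ManualMappingCuration	1.0
htan1:Demographics.Gender	skos:narrowMatch	htan2:Participant.GENDER_IDENTITY	semapv:SemanticSimilarityThresholdMatching	0.88
```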

Current Results

Domain       Matched  Unmatched Source  New in Phase 2  Avg Confidence
Clinical     68       351               9               0.88
Biospecimen  35       31                2               0.88
Assay        206      411               77              0.87
Total        309      793               88              0.87

Value-level: 356 clinical value matches (285 deterministic + 71 LLM-assisted), avg confidence 0.94.
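
Deterministic value matching typically reduces both sides to a normalized key before comparing. A sketch, assuming case/punctuation/whitespace folding (not necessarily value_match.py's exact rules):

```python
import re

def norm(value: str) -> str:
    """Collapse case, punctuation, and whitespace so e.g.
    'not hispanic or latino' keys to the same form as 'Not Hispanic or Latino'."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()

def match_values(source_values, target_enum):
    """Map each HTAN1 value to the HTAN2 enum member sharing its normalized key
    (None where no deterministic match exists; those go to the LLM stage)."""
    index = {norm(t): t for t in target_enum}
    return {s: index.get(norm(s)) for s in source_values}

matches = match_values(
    ["not hispanic or latino", "unknown"],
    ["Not Hispanic or Latino", "Hispanic or Latino", "Unknown"],
)
```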

Ontology Resolution

Text-to-ontology conversions use lookup tables built from HTAN2 enum YAML files, verified against the EBI OLS4 API:

  • UBERON (tissues): 22,603 entries — 73 HTAN1 tissue terms OLS-verified, 100% resolution rate
  • NCIt (diagnoses): 20,195 entries — 59 ICD-O morphology terms OLS-verified, 100% resolution rate

The scripts/ols_lookup.py utility provides search, resolve, crosswalk, and verify commands against the OLS4 API (free, no auth required).
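
A call to the OLS4 search API can be sketched as below. The endpoint and query parameters are the public OLS4 ones; to keep the sketch runnable offline, the response is a canned fragment of the JSON shape OLS4 returns rather than a live request, and the helper names are mine:

```python
from urllib.parse import urlencode

OLS4_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"

def build_search_url(term: str, ontology: str) -> str:
    """Build an OLS4 search URL restricted to one ontology, exact matches only."""
    return f"{OLS4_SEARCH}?{urlencode({'q': term, 'ontology': ontology, 'exact': 'true'})}"

def first_curie(response_json: dict):
    """Pull the first hit's CURIE (e.g. 'UBERON:0001155') from a search response."""
    docs = response_json.get("response", {}).get("docs", [])
    return docs[0]["obo_id"] if docs else None

url = build_search_url("colon", "uberon")
# Canned fragment of an OLS4 search response (illustrative payload).
sample = {"response": {"docs": [{"obo_id": "UBERON:0001155", "label": "colon"}]}}
code = first_curie(sample)
```

In practice a GET to `url` (e.g. with urllib.request or requests) returns JSON of this shape, which is how terms are verified during lookup-table construction.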
