Add 118 microschemas from CDE HDBSCAN clustering #15

Open
justaddcoffee wants to merge 1 commit into main from microschema/cluster-batch

@justaddcoffee
Collaborator

Summary

  • Generates LinkML microschema YAML files for all 118 CDE clusters from the HDBSCAN clustering of 6,390 NIH/NLM CDEs by Madan et al. (arXiv:2506.02160)
  • Each cluster maps to a separate YAML file, organized by best-fit record type subdirectory:
    • clinical_measurement/: 103 files (questionnaires, lab tests, physiological measurements)
    • condition_status/: 10 files (diagnoses, symptom presence, medical history)
    • drug_status/: 4 files (medication exposure, substance use)
    • procedure_status/: 1 file (EEG, neuroimaging, neurosurgical procedures)
  • All files follow the rules pattern with preconditions (measurement_type/condition_type codes) and postconditions (units, data_type, value ranges, instruments)

Agent swarm strategy

These 118 files were generated by a swarm of 12 Claude subagents running in parallel, each assigned a batch of clusters. The orchestration worked as follows:

1. Data preparation

  • Downloaded the HDBSCAN clustering spreadsheet from monarch-initiative/cde-clustering
  • Parsed the 118 clusters (6,390 CDEs) into individual JSON files at /tmp/clusters/cluster_NNN.json
  • Each cluster JSON contains the cluster name and an array of CDEs with their designation, definition, permissible values, and steward org:
    {
      "name": "Pulmonary Function Assessment Cluster",
      "cdes": [
        {
          "tinyId": "kNUZxW9ZPu",
          "designation": "Forced expiratory volume 1 second value",
          "definition": "Value of %FEV1 from within one month of the assessment",
          "permissible_values": "No permissible values",
          "stewardOrg": "NHLBI"
        }
      ]
    }
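
The parsing step above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: it assumes the spreadsheet has already been read into flat row dictionaries, and the `cluster_id` and `cluster_name` keys are hypothetical column names. Only the five CDE fields and the `cluster_NNN.json` naming come from the description.

```python
import json
from collections import defaultdict
from pathlib import Path

# CDE fields carried into each cluster file (taken from the example JSON above).
CDE_FIELDS = ("tinyId", "designation", "definition", "permissible_values", "stewardOrg")


def write_cluster_files(rows, out_dir):
    """Group flat CDE rows by cluster and write one cluster_NNN.json per cluster."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    clusters = defaultdict(lambda: {"name": None, "cdes": []})
    for row in rows:
        cluster = clusters[int(row["cluster_id"])]  # hypothetical column name
        cluster["name"] = row["cluster_name"]       # hypothetical column name
        cluster["cdes"].append({field: row[field] for field in CDE_FIELDS})

    paths = []
    for cluster_id in sorted(clusters):
        # Zero-pad to three digits so names match cluster_NNN.json.
        path = out_dir / f"cluster_{cluster_id:03d}.json"
        path.write_text(json.dumps(clusters[cluster_id], indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_cluster_files(rows, "/tmp/clusters")` would produce the layout the agents read from, under the assumptions stated above.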

2. Reference prompt

A shared reference document (/tmp/microschema_reference.md) was written for all agents, specifying the LinkML pattern, record type classification rules, directory layout, and file naming convention:

# Microschema Reference for Codex Agents

## Task
Given a cluster of CDEs (Common Data Elements), create a LinkML microschema YAML file.

## Record Types (pick best fit)
1. ClinicalMeasurementRecord - labs, vital signs, anthropometrics, physiological measurements, scores/indices
2. ConditionStatusRecord - diagnoses, diseases, symptoms, medical conditions
3. DrugStatusRecord - medications, drug exposures, therapies
4. ProcedureStatusRecord - procedures, surgeries, clinical interventions

## Key Rules
1. Look at the CDEs in the cluster and identify distinct measurement CONCEPTS (not individual CDEs)
2. Each distinct concept becomes one class in the YAML
3. Use LOINC codes from CDE names when recognizable (e.g. "FEV1" -> LOINC:20150-9)
4. If you can't identify specific codes, use descriptive comments noting which CDEs align
5. Group related CDEs under the same class (e.g. pre-BD and post-BD variants of FEV1)
6. Use medically reasonable value ranges
7. For questionnaire/survey clusters that don't fit any record type well, use ClinicalMeasurementRecord as fallback
8. Keep it simple - don't overthink it. This is a first draft.

3. Agent dispatch

12 general-purpose subagents were launched in parallel via the Claude Code Task tool, each assigned a non-overlapping batch of cluster numbers:

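| Agent | Clusters | Count |
|-------|----------|-------|
| 1     | 0-5      | 6     |
| 2     | 6-11     | 6     |
| 3     | 12-17    | 6     |
| 4     | 18-23    | 6     |
| 5     | 24-29    | 6     |
| 6     | 30-41    | 12    |
| 7     | 42-53    | 12    |
| 8     | 54-65    | 12    |
| 9     | 66-77    | 12    |
| 10    | 78-89    | 12    |
| 11    | 90-101   | 12    |
| 12    | 102-117  | 16    |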
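
As a sanity check on the assignment above, the batch boundaries can be rebuilt from the per-agent counts and checked for full, non-overlapping coverage of clusters 0-117. This is an illustrative sketch; `make_batches` is a hypothetical helper, not part of the orchestration that was run.

```python
def make_batches(sizes, start=0):
    """Turn a list of batch sizes into inclusive (first, last) cluster ranges."""
    batches = []
    first = start
    for size in sizes:
        batches.append((first, first + size - 1))
        first += size
    return batches


# Batch sizes from the table: five agents with 6 clusters, six with 12, one with 16.
SIZES = [6] * 5 + [12] * 6 + [16]
BATCHES = make_batches(SIZES)

# Every cluster number should appear exactly once, in order.
covered = [n for first, last in BATCHES for n in range(first, last + 1)]
assert covered == list(range(118))
```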

Each agent received the same prompt template:

You need to generate LinkML microschema YAML files for CDE clusters {N}-{M}.
Read the reference at /tmp/microschema_reference.md for the pattern.
For each cluster, read /tmp/clusters/cluster_NNN.json (zero-padded),
then write a YAML file to the appropriate directory under
src/linkml_microschema_profile/schema/. Use the Write tool to create
each file. Classify each cluster to best-fit record type
(ClinicalMeasurementRecord for most). Create directories with
Bash mkdir -p if needed. File naming: cluster_NNN_snake_case_name.yaml.
Do NOT use codex - just generate the files directly yourself.
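
The template substitution and the file naming convention can be sketched as below. This is a hedged illustration only: `PROMPT_HEAD` reproduces just the first two sentences of the template, `render_prompt` and `cluster_filename` are hypothetical helpers, and the slug rule (lowercase, non-alphanumerics collapsed to underscores) is an assumption about how `snake_case_name` is derived. In the actual run each agent chose its own file names.

```python
import re

# First two sentences of the prompt template; the remainder is omitted here.
PROMPT_HEAD = (
    "You need to generate LinkML microschema YAML files for CDE clusters {N}-{M}.\n"
    "Read the reference at /tmp/microschema_reference.md for the pattern."
)


def render_prompt(first, last):
    """Fill the {N}-{M} placeholders for one agent's batch."""
    return PROMPT_HEAD.format(N=first, M=last)


def cluster_filename(cluster_id, cluster_name):
    """Build cluster_NNN_snake_case_name.yaml (the slug rule is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "_", cluster_name.lower()).strip("_")
    return f"cluster_{cluster_id:03d}_{slug}.yaml"
```

For example, `render_prompt(0, 5)` yields the opening of the prompt for agent 1, and `cluster_filename` can be used to predict or check the names the agents wrote.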

4. Results

  • All 12 agents completed successfully (~2-4 min each, running concurrently)
  • 118/118 YAML files generated with zero YAML syntax errors
  • No file conflicts between agents (non-overlapping cluster assignments)
  • Total: ~13,000 lines of LinkML microschema YAML
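
One way to re-check the file counts and the absence of conflicts after the fact is sketched below. It assumes the generated files sit one level below the schema root, in the record-type subdirectories listed in the summary, and follow the `cluster_NNN_...yaml` naming; the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Per-directory file counts reported in the summary.
EXPECTED = {
    "clinical_measurement": 103,
    "condition_status": 10,
    "drug_status": 4,
    "procedure_status": 1,
}


def count_schema_files(schema_root):
    """Count cluster_*.yaml files in each record-type subdirectory."""
    counts = Counter()
    for path in Path(schema_root).glob("*/cluster_*.yaml"):
        counts[path.parent.name] += 1
    return dict(counts)


def duplicate_cluster_ids(schema_root):
    """Return cluster numbers that appear in more than one file."""
    ids = Counter(
        path.name.split("_")[1]
        for path in Path(schema_root).glob("*/cluster_*.yaml")
    )
    return sorted(cluster_id for cluster_id, n in ids.items() if n > 1)
```

Comparing `count_schema_files("src/linkml_microschema_profile/schema")` against `EXPECTED`, and checking that `duplicate_cluster_ids` returns an empty list, would confirm the 118/118 figure and the non-overlap claim.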

Note on failed first attempt

The first attempt tried to dispatch 20 OpenAI Codex agents via MCP. It failed because Claude Code subagents cannot nest Task tool calls, which the codex MCP tool configuration required. Pivoting to direct generation by Claude subagents was faster and more reliable.

Test plan

  • All 118 YAML files pass yaml.safe_load() validation
  • Review sample files for correct LinkML structure and ontology codes
  • Validate against LinkML schema with gen-python or linkml-validate
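
The first check in the test plan can be sketched as a small helper that walks the schema tree and collects files a parser rejects. The parser is passed in as `load` so the helper itself depends only on the standard library; `find_unparseable` is a hypothetical name, and this only checks syntax, not LinkML validity.

```python
from pathlib import Path


def find_unparseable(root, load):
    """Return (path, error message) pairs for .yaml files that `load` rejects."""
    failures = []
    for path in sorted(Path(root).rglob("*.yaml")):
        try:
            load(path.read_text(encoding="utf-8"))
        except Exception as exc:  # parser-specific errors vary, so catch broadly
            failures.append((path, str(exc)))
    return failures
```

With PyYAML installed, `find_unparseable("src/linkml_microschema_profile/schema", yaml.safe_load)` should return an empty list if all 118 files parse. The remaining two items in the plan still need a LinkML-aware tool such as linkml-validate.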

Replaces #13 (moved from fork to upstream repo).

🤖 Generated with Claude Code

Generate LinkML microschemas for all 118 CDE clusters from
Madan et al. (arxiv:2506.02160) HDBSCAN clustering of NIH/NLM CDEs.
Each cluster maps to a separate YAML file organized by record type:
- clinical_measurement/: 103 files
- condition_status/: 10 files
- drug_status/: 4 files
- procedure_status/: 1 file

Each file follows the rules pattern with preconditions (measurement_type
or condition_type codes) and postconditions (units, data_type, value
ranges, instruments).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Mar 6, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://linkml.github.io/linkml-microschema-profile/pr-preview/pr-15/

Built to branch gh-pages at 2026-03-06 18:50 UTC.
Preview will be ready when the GitHub Pages deployment is complete.
