Add 118 microschemas from CDE HDBSCAN clustering by justaddcoffee · Pull Request #15 · linkml/linkml-microschema-profile

justaddcoffee · 2026-03-06T18:50:28Z

Summary

Generates LinkML microschema YAML files for all 118 CDE clusters produced by the HDBSCAN clustering of 6,390 NIH/NLM CDEs in Madan et al. (arXiv:2506.02160).
Each cluster maps to a separate YAML file, placed in a subdirectory named for its best-fit record type:
- clinical_measurement/: 103 files (questionnaires, lab tests, physiological measurements)
- condition_status/: 10 files (diagnoses, symptom presence, medical history)
- drug_status/: 4 files (medication exposure, substance use)
- procedure_status/: 1 file (EEG, neuroimaging, neurosurgical procedures)
All files follow the rules pattern with preconditions (measurement_type/condition_type codes) and postconditions (units, data_type, value ranges, instruments)
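As a rough illustration of what a precondition/postcondition rule means, the sketch below models one rule as a plain Python dict and checks a record against it. The slot names (`measurement_type`, `unit`, `value`), the unit, and the numeric range are assumptions chosen for illustration; they are not copied from the generated files. The keys `equals_string`, `minimum_value`, and `maximum_value` mirror LinkML slot-condition names.

```python
# Hypothetical rule for an FEV1 measurement. Slot names, unit, and range are
# illustrative assumptions, not taken from the generated microschemas.
FEV1_RULE = {
    "preconditions": {"measurement_type": {"equals_string": "LOINC:20150-9"}},
    "postconditions": {
        "unit": {"equals_string": "L"},
        "value": {"minimum_value": 0.0, "maximum_value": 10.0},
    },
}


def slot_ok(value, condition):
    """Check one slot value against one condition dict."""
    if "equals_string" in condition and value != condition["equals_string"]:
        return False
    if "minimum_value" in condition and (value is None or value < condition["minimum_value"]):
        return False
    if "maximum_value" in condition and (value is None or value > condition["maximum_value"]):
        return False
    return True


def conditions_hold(record, conditions):
    """True when every listed slot in the record meets its condition."""
    return all(slot_ok(record.get(slot), cond) for slot, cond in conditions.items())


def rule_satisfied(record, rule):
    """A rule is violated only when preconditions match and postconditions fail."""
    if not conditions_hold(record, rule["preconditions"]):
        return True  # the rule does not apply to this record
    return conditions_hold(record, rule["postconditions"])
```

A record whose `measurement_type` is some other code passes trivially, because the rule only constrains records that match the precondition.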

Agent swarm strategy

These 118 files were generated by a swarm of 12 Claude subagents running in parallel, each assigned a batch of clusters. The orchestration worked as follows:

1. Data preparation

Downloaded the HDBSCAN clustering spreadsheet from monarch-initiative/cde-clustering
Parsed the 118 clusters (6,390 CDEs) into individual JSON files at /tmp/clusters/cluster_NNN.json

Each cluster JSON contains the cluster name and an array of CDEs with their designation, definition, permissible values, and steward org:

{
  "name": "Pulmonary Function Assessment Cluster",
  "cdes": [
    {
      "tinyId": "kNUZxW9ZPu",
      "designation": "Forced expiratory volume 1 second value",
      "definition": "Value of %FEV1 from within one month of the assessment",
      "permissible_values": "No permissible values",
      "stewardOrg": "NHLBI"
    }
  ]
}
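The parsing step could look roughly like the sketch below. It assumes the spreadsheet has already been read into row dicts; the column names `cluster_id` and `cluster_name` are hypothetical, since the PR does not show the spreadsheet layout. The CDE field names and the `cluster_NNN.json` naming come from the example above.

```python
import json
from pathlib import Path

# CDE fields kept in each cluster file, as shown in the example JSON.
CDE_FIELDS = ("tinyId", "designation", "definition", "permissible_values", "stewardOrg")


def write_cluster_files(rows, out_dir):
    """Group CDE rows by cluster and write one cluster_NNN.json per cluster.

    `rows` is an iterable of dicts. The keys "cluster_id" and "cluster_name"
    are assumed column names, not confirmed by the source spreadsheet.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    clusters = {}
    for row in rows:
        cluster_id = int(row["cluster_id"])
        cluster = clusters.setdefault(cluster_id, {"name": row["cluster_name"], "cdes": []})
        cluster["cdes"].append({field: row[field] for field in CDE_FIELDS})

    paths = []
    for cluster_id in sorted(clusters):
        path = out_dir / f"cluster_{cluster_id:03d}.json"  # zero-padded to three digits
        path.write_text(json.dumps(clusters[cluster_id], indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```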

2. Reference prompt

A shared reference document (/tmp/microschema_reference.md) was written for all agents, specifying the LinkML pattern, record type classification rules, directory layout, and file naming convention:

# Microschema Reference for Codex Agents

## Task
Given a cluster of CDEs (Common Data Elements), create a LinkML microschema YAML file.

## Record Types (pick best fit)
1. ClinicalMeasurementRecord - labs, vital signs, anthropometrics, physiological measurements, scores/indices
2. ConditionStatusRecord - diagnoses, diseases, symptoms, medical conditions
3. DrugStatusRecord - medications, drug exposures, therapies
4. ProcedureStatusRecord - procedures, surgeries, clinical interventions

## Key Rules
1. Look at the CDEs in the cluster and identify distinct measurement CONCEPTS (not individual CDEs)
2. Each distinct concept becomes one class in the YAML
3. Use LOINC codes from CDE names when recognizable (e.g. "FEV1" -> LOINC:20150-9)
4. If you can't identify specific codes, use descriptive comments noting which CDEs align
5. Group related CDEs under the same class (e.g. pre-BD and post-BD variants of FEV1)
6. Use medically reasonable value ranges
7. For questionnaire/survey clusters that don't fit any record type well, use ClinicalMeasurementRecord as fallback
8. Keep it simple - don't overthink it. This is a first draft.

3. Agent dispatch

12 general-purpose subagents were launched in parallel via the Claude Code Task tool, each assigned a non-overlapping batch of cluster numbers:

Agent	Clusters	Count
1	0-5	6
2	6-11	6
3	12-17	6
4	18-23	6
5	24-29	6
6	30-41	12
7	42-53	12
8	54-65	12
9	66-77	12
10	78-89	12
11	90-101	12
12	102-117	16
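A quick way to confirm that the batches in the table are non-overlapping and cover every cluster is sketched below. The ranges are copied from the table; the helper names are illustrative only.

```python
# Inclusive (first, last) cluster numbers per agent, copied from the table.
BATCHES = {
    1: (0, 5), 2: (6, 11), 3: (12, 17), 4: (18, 23), 5: (24, 29),
    6: (30, 41), 7: (42, 53), 8: (54, 65), 9: (66, 77), 10: (78, 89),
    11: (90, 101), 12: (102, 117),
}


def batch_sizes(batches):
    """Number of clusters assigned to each agent."""
    return {agent: last - first + 1 for agent, (first, last) in batches.items()}


def is_partition(batches, total):
    """True when every cluster 0..total-1 is assigned to exactly one agent."""
    assigned = [n for first, last in batches.values() for n in range(first, last + 1)]
    return sorted(assigned) == list(range(total))
```

With the table's ranges, `is_partition(BATCHES, 118)` holds and the batch sizes sum to 118.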

Each agent received the same prompt template:

You need to generate LinkML microschema YAML files for CDE clusters {N}-{M}.
Read the reference at /tmp/microschema_reference.md for the pattern.
For each cluster, read /tmp/clusters/cluster_NNN.json (zero-padded),
then write a YAML file to the appropriate directory under
src/linkml_microschema_profile/schema/. Use the Write tool to create
each file. Classify each cluster to best-fit record type
(ClinicalMeasurementRecord for most). Create directories with
Bash mkdir -p if needed. File naming: cluster_NNN_snake_case_name.yaml.
Do NOT use codex - just generate the files directly yourself.
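The template placeholders and naming conventions can be sketched as follows. This is a minimal illustration: the slug rule (lowercase, non-alphanumeric runs collapsed to underscores) is an assumption about what `snake_case_name` means, and the cluster number used in the usage note is made up.

```python
import re

# First sentence of the prompt template, with the {N}-{M} placeholders.
PROMPT_HEAD = "You need to generate LinkML microschema YAML files for CDE clusters {N}-{M}."


def render_prompt_head(first, last):
    """Fill the batch range into the prompt template."""
    return PROMPT_HEAD.format(N=first, M=last)


def cluster_json_path(number):
    """Input path for one cluster, zero-padded to three digits."""
    return f"/tmp/clusters/cluster_{number:03d}.json"


def microschema_filename(number, name):
    """Output name following cluster_NNN_snake_case_name.yaml (assumed slug rule)."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"cluster_{number:03d}_{slug}.yaml"
```

For example, `microschema_filename(7, "Pulmonary Function Assessment Cluster")` gives `cluster_007_pulmonary_function_assessment_cluster.yaml` under this assumed rule; the files the agents actually wrote may shorten or alter the name.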

4. Results

All 12 agents completed successfully (~2-4 min each, running concurrently)
118/118 YAML files generated with zero YAML syntax errors
No file conflicts between agents (non-overlapping cluster assignments)
Total: ~13,000 lines of LinkML microschema YAML

Note on failed first attempt

The initial strategy was to dispatch 20 OpenAI Codex agents via MCP. This failed because the codex MCP tool configuration required nested Task tool calls, which Claude Code subagents cannot make. Pivoting to direct generation by Claude subagents was faster and more reliable.

Test plan

All 118 YAML files pass yaml.safe_load() validation
Review sample files for correct LinkML structure and ontology codes
Validate against LinkML schema with gen-python or linkml-validate
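The first check in the plan can be reproduced with a small script along these lines. It is a sketch, not the script used for this PR: the function name and the optional `load` parameter are illustrative, and by default it relies on PyYAML's `yaml.safe_load`. It checks syntax only, so it does not replace validation against the LinkML schema.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable, Optional


def find_unparseable(root: Path, load: Optional[Callable[[str], Any]] = None) -> dict[str, str]:
    """Return {path: error} for every *.yaml file under root that fails to parse.

    By default the parser is PyYAML's yaml.safe_load; `load` can be swapped
    out (a hypothetical convenience for running without PyYAML installed).
    """
    if load is None:
        import yaml  # PyYAML, imported lazily

        load = yaml.safe_load

    failures: dict[str, str] = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        try:
            load(path.read_text(encoding="utf-8"))
        except Exception as exc:  # report any parse error, whatever its type
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling `find_unparseable(Path("src/linkml_microschema_profile/schema"))` from the repository root should return an empty dict if all 118 files parse.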

Replaces #13 (moved from fork to upstream repo).

🤖 Generated with Claude Code

Generate LinkML microschemas for all 118 CDE clusters from Madan et al. (arxiv:2506.02160) HDBSCAN clustering of NIH/NLM CDEs. Each cluster maps to a separate YAML file organized by record type:
- clinical_measurement/: 103 files
- condition_status/: 10 files
- drug_status/: 4 files
- procedure_status/: 1 file
Each file follows the rules pattern with preconditions (measurement_type or condition_type codes) and postconditions (units, data_type, value ranges, instruments).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-06T18:50:58Z

PR Preview Action v1.8.1
🚀 View preview at https://linkml.github.io/linkml-microschema-profile/pr-preview/pr-15/
Built to branch `gh-pages` at 2026-03-06 18:50 UTC. Preview will be ready when the GitHub Pages deployment is complete.

justaddcoffee wants to merge 1 commit into main from microschema/cluster-batch
