
Bulk-add deep research files #480

Open
cmungall wants to merge 1 commit into main from bulk-add-research-files

cmungall (Member) commented Mar 4, 2026

Summary

  • Bulk-add ~560 deep research output files from Falcon, Cyberian Codex, Perplexity, and OpenAI
  • These are source material used during disorder curation, including .citations.md companion files
  • Also adds research/* to codespell skip list (medical terms trigger false positives)

Test plan

  • Verify research files are present and readable
  • No disorder YAML or schema files are modified

🤖 Generated with Claude Code

Deep research outputs (from Falcon, Cyberian Codex, Perplexity, OpenAI)
used as source material for disorder curation, including citation files.
Also adds research/* to codespell skip list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 4, 2026 06:15
claude bot (Contributor) commented Mar 4, 2026

Claude finished @cmungall's task in 4m 45s


PR Review: Bulk-add deep research files

  • Gather context and understand the PR
  • Review pyproject.toml change
  • Sample and review research file content (quality, accuracy, format)
  • Check citations/references files
  • Assess completeness and conformance with guidelines
  • Post final review

Summary

This PR adds 574 research source files (+564 companion citations files) across 301 unique disorders, contributed by four AI research providers (Falcon: 263, Cyberian Codex: 256, Perplexity: 35, OpenAI: 17, plus a few others). These files serve as auditable source material used during KB curation — not as dismech YAML entries themselves. The scope and intent are appropriate.

The test plan items are satisfied: no disorder YAML or schema files are modified, and research files are all readable.


pyproject.toml change

The single code change — adding research/* to the codespell skip list — is correct and necessary. Medical terminology in these files (gene names, disease terms, clinical abbreviations) will reliably trigger false positives.


Content Quality

Falcon files (~263 files, ~400+ lines each): High quality. Detailed mechanistic pathophysiology narratives covering molecular pathways, cell types, anatomical sites, gene/protein players, and clinical phenotypes. Citations are a well-structured mix of PMIDs and DOIs. Well-suited as curation source material. Example: Carbamoyl_Phosphate_Synthetase_I_Deficiency-deep-research-falcon.md covers disease stages, neurotoxicity mechanisms, metabolic cascade, and treatment with good citation density.

Cyberian Codex files (~256 files, ~40–55 lines each): These are short secondary synthesis files that summarize key pathophysiology nodes and enumerate DOI citation inventories from prior primary sources. They are not standalone research reports — that is by design (template_file: codex_supplement_local, duration_seconds: 0.0). Useful as an index of curated citations.

Perplexity / OpenAI files: Longer, well-cited reports (e.g., Alcohol-Related_Disorders-deep-research-openai.md has 124 citations). OpenAI o3-deep-research output is comprehensive.


Issues Found

1. Filename typo — Alzheimer's Disease (minor)

Four files use Alhzeimer_Disease instead of Alzheimers_Disease:

  • Alhzeimer_Disease-deep-research-cyberian-codex.md
  • Alhzeimer_Disease-deep-research-falcon.md
    (+ 2 citations companions)

The KB file is Alzheimers_Disease.yaml. This naming inconsistency won't break anything but may hinder discovery when cross-referencing by filename. Consider renaming at your convenience, or noting in research/README if one exists.
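
A mismatch like this could be caught mechanically by comparing each research filename stem against the KB disorder names. The following is a minimal sketch, not code from the repository: `find_unmatched_stems` and its `cutoff` parameter are hypothetical names, and it assumes the stem is everything before `-deep-research-`.

```python
import difflib


def find_unmatched_stems(research_files, kb_names, cutoff=0.8):
    """Map each research stem with no exact KB match to its closest KB name (or None)."""
    marker = "-deep-research-"
    known = set(kb_names)
    suggestions = {}
    for filename in research_files:
        if marker not in filename:
            continue  # not a standard research filename; out of scope here
        stem = filename.split(marker, 1)[0]
        if stem in known:
            continue  # exact match, nothing to report
        # Fuzzy lookup: returns at most one candidate above the similarity cutoff.
        close = difflib.get_close_matches(stem, list(kb_names), n=1, cutoff=cutoff)
        suggestions[stem] = close[0] if close else None
    return suggestions


if __name__ == "__main__":
    files = [
        "Alhzeimer_Disease-deep-research-falcon.md",
        "Migraine-deep-research-falcon.md",
    ]
    print(find_unmatched_stems(files, ["Alzheimers_Disease", "Migraine"]))
```

Stems that map to `None` have no near neighbour in the KB and may be legitimately new topics rather than typos.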

2. Truncated DOIs in 14 citation files (minor)

Fourteen .citations.md files contain malformed DOIs with unescaped parentheses that suggest truncation, e.g.:

DOI:10.1016/s2213-2600(23

These appear in files like Infectious_Disease-deep-research-cyberian-codex.md.citations.md. Since these are source-material metadata files (not the KB), they won't fail any validator, but they are non-resolvable as DOIs. Affected files include: 22q11.2_Deletion_Syndrome, Addisons_Disease, Amyotrophic_Lateral_Sclerosis, Chronic_Pancreatitis, Diamond-Blackfan_Anemia, Ebola_Virus_Disease_EVD, Ehlers-Danlos_Syndrome, Hemophilia_B, Infectious_Disease, Migraine, Parkinsons_Disease, Polycystic_Echinococcosis, Shigellosis, Ulcerative_Colitis.
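
The truncation pattern above (an opening parenthesis that never closes) is easy to detect. The sketch below is illustrative only: the function names are hypothetical, and it assumes each citation entry carries the identifier as `DOI:<value>` with no internal whitespace, as in the example line.

```python
import re

# Capture everything after "DOI:" up to the next whitespace.
DOI_ENTRY = re.compile(r"DOI:(\S+)")


def has_balanced_parens(doi):
    """Return True if every '(' in the DOI has a matching ')'."""
    depth = 0
    for ch in doi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis with no opener
    return depth == 0


def find_truncated_dois(text):
    """List DOI values in the text whose parentheses are unbalanced."""
    return [doi for doi in DOI_ENTRY.findall(text) if not has_balanced_parens(doi)]


if __name__ == "__main__":
    sample = "1. DOI:10.1016/s2213-2600(23\n2. DOI:10.1371/journal.pone.0301416\n"
    print(find_truncated_dois(sample))
```

This only catches truncation that happens inside a parenthesised segment; a DOI cut off elsewhere would need a resolver lookup to detect.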

3. Corrupted template variables in comorbidity files (low severity)

Two comorbidity perplexity files have clearly malformed frontmatter — the disease names are truncated mid-word, suggesting a pipeline bug with tab-delimited template variable parsing:

com_Atopic_Dermatitis__Type_2_Diabetes_Mellitus-deep-research-perplexity.md:

disease_a_label: A        # should be: Atopic Dermatitis
disease_b_label: opic Derma   # should be: Type 2 Diabetes Mellitus
disease_a_slug: i
disease_b_slug: is

The actual research content appears to be about the correct disease pair (the Perplexity API was presumably given the correct question despite the malformed metadata). Worth flagging as a pipeline bug to fix in the research generation tooling.

4. Non-standard filenames (cosmetic)

A few files don't follow the <Disease>-deep-research-<provider>.md convention:

  • Fanconi_Anemia_Clinical_Care_Guidelines_5thEdition_web.extracted.md: PDF extract, no standard frontmatter
  • Fanconi_Anemia__Phenotypes-openai.md: double underscore, topic-specific variant
  • Fanconi_Anemia__Skeletal-deep-research-claudeweb.md: claudeweb is an undocumented provider name

These are minor and don't affect the KB, but if the research/ directory grows it may be worth documenting naming conventions.
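
If the convention is ever documented, a small check could enforce it. The sketch below is an assumption-laden illustration: the regular expression, the `KNOWN_PROVIDERS` set (slugs inferred from the filenames mentioned in this review), and `classify_research_filename` are all hypothetical and not part of the repository.

```python
import re

# <Disease>-deep-research-<provider>.md, optionally followed by .citations.md
STANDARD_NAME = re.compile(
    r"(?P<disease>.+)-deep-research-(?P<provider>[A-Za-z0-9-]+)\.md"
    r"(?P<citations>\.citations\.md)?"
)

# Assumed provider slugs, inferred from filenames seen in this PR.
KNOWN_PROVIDERS = frozenset({"falcon", "cyberian-codex", "perplexity", "openai"})


def classify_research_filename(filename, known_providers=KNOWN_PROVIDERS):
    """Return 'ok', 'unknown-provider', or 'non-standard' for a research filename."""
    match = STANDARD_NAME.fullmatch(filename)
    if match is None:
        return "non-standard"
    if match.group("provider") not in known_providers:
        return "unknown-provider"
    return "ok"


if __name__ == "__main__":
    for name in (
        "Fanconi_Anemia__Phenotypes-openai.md",
        "Fanconi_Anemia__Skeletal-deep-research-claudeweb.md",
        "Infectious_Disease-deep-research-cyberian-codex.md.citations.md",
    ):
        print(name, "->", classify_research_filename(name))
```

The double underscore is deliberately not flagged here, since the comorbidity files in this PR also use it as a separator.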


CI Failure

The "Build and test" run fails with:

FAILED tests/test_data.py::test_evidence_items_have_references[Glutaryl-CoA_Dehydrogenase_Deficiency.yaml]
reference should start with PMID:, DOI:, clinicaltrials:, or file:
got url:https://www.ncbi.nlm.nih.gov/books/NBK546575/?report=printable

This failure is pre-existing and unrelated to this PR: Glutaryl-CoA_Dehydrogenase_Deficiency.yaml is an existing KB file not touched by this PR. It should be fixed in a separate PR.
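
For context, the rule stated in the failure message amounts to a prefix check. The sketch below restates that rule only; it is not the code in tests/test_data.py, and `invalid_references` is a hypothetical helper name.

```python
# Prefixes taken from the failure message above.
ALLOWED_PREFIXES = ("PMID:", "DOI:", "clinicaltrials:", "file:")


def invalid_references(references):
    """Return the references that do not start with an allowed prefix."""
    # str.startswith accepts a tuple and matches any of its members.
    return [ref for ref in references if not ref.startswith(ALLOWED_PREFIXES)]


if __name__ == "__main__":
    refs = [
        "DOI:10.1371/journal.pone.0301416",
        "url:https://www.ncbi.nlm.nih.gov/books/NBK546575/?report=printable",
    ]
    print(invalid_references(refs))
```

Under this rule the `url:` reference is rejected regardless of what it points to, which is why the fix belongs in the KB file rather than in this PR.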


Verdict

Approved with suggestions. The PR successfully adds well-structured source material that supports reproducibility of curation decisions. No schema, YAML, or test files are modified. The issues noted (filename typo, truncated DOIs, corrupted comorbidity frontmatter) are cosmetic or minor pipeline artifacts in source files that don't affect KB quality. The pre-existing CI failure should be tracked separately.

Copilot AI (Contributor) left a comment

Pull request overview

Bulk-adds deep research output artifacts (Markdown plus companion citation lists) used as source material during disorder curation, and updates codespell configuration to ignore the research corpus.

Changes:

  • Added a large set of research/*.md deep-research outputs with YAML front matter (provider/model/timestamps) and synthesized content sections.
  • Added companion research/*.md.citations.md files enumerating DOI/PMID citation inventories per query.
  • Updated pyproject.toml codespell skip list to ignore research/* files (to avoid false positives on medical terms).

Reviewed changes

Copilot reviewed 297 out of 561 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| research/*.md | Adds deep-research synthesis outputs used as curation source material. |
| research/*.md.citations.md | Adds citation inventories (DOI/PMID lists) corresponding to each research query. |
| pyproject.toml | Skips codespell checking for the research corpus to reduce false positives. |

Comment on lines +10 to +12
disease_name: HER2-Positive Breast Cancer
category:
citation_count: 5

Copilot AI Mar 4, 2026

category: is present but empty (null) in both the front matter and the rendered “Category” line. If downstream tooling expects a non-null string category, this can cause inconsistent indexing/filtering. Consider either populating a category value (e.g., “Oncology”) or omitting the key/line entirely when unknown so consumers can distinguish “missing” vs “explicitly null”.

Copilot uses AI. Check for mistakes.
Comment on lines +23 to +24
- Name: HER2-Positive Breast Cancer
- Category:

Copilot AI Mar 4, 2026

category: is present but empty (null) in both the front matter and the rendered “Category” line. If downstream tooling expects a non-null string category, this can cause inconsistent indexing/filtering. Consider either populating a category value (e.g., “Oncology”) or omitting the key/line entirely when unknown so consumers can distinguish “missing” vs “explicitly null”.

@@ -0,0 +1,43 @@
---
provider: cyberian-codex

Copilot AI Mar 4, 2026

The file metadata indicates provider: cyberian-codex, but source_providers (and the echoed “Existing deep-research providers”) lists only falcon. If source_providers is meant to describe the sources used to generate this artifact, it should include cyberian-codex (and any others) to avoid confusing provenance. If instead it represents pre-existing providers for the disorder, consider renaming the field or adjusting the template output to clarify intent.

Comment on lines +13 to +14
source_providers:
- falcon

Copilot AI Mar 4, 2026

The file metadata indicates provider: cyberian-codex, but source_providers (and the echoed “Existing deep-research providers”) lists only falcon. If source_providers is meant to describe the sources used to generate this artifact, it should include cyberian-codex (and any others) to avoid confusing provenance. If instead it represents pre-existing providers for the disorder, consider renaming the field or adjusting the template output to clarify intent.

### Disorder
- Name: Kummell Disease
- Category: Complex
- Existing deep-research providers: falcon

Copilot AI Mar 4, 2026

The file metadata indicates provider: cyberian-codex, but source_providers (and the echoed “Existing deep-research providers”) lists only falcon. If source_providers is meant to describe the sources used to generate this artifact, it should include cyberian-codex (and any others) to avoid confusing provenance. If instead it represents pre-existing providers for the disorder, consider renaming the field or adjusting the template output to clarify intent.

1. DOI:10.1002/14651858.cd014544.pub2
2. DOI:10.1007/s13353-025-00952-w
3. DOI:10.1007/s40140-024-00635-y
4. DOI:10.1016/s0140-6736(24

Copilot AI Mar 4, 2026

This DOI appears truncated (10.1016/s0140-6736(24), which is not parsable as a DOI and can break any citation validation/linking tooling. Please regenerate or repair truncated DOI entries so each is a complete DOI (or replace with a PMID/URL if DOI is unavailable).

Suggested change
- 4. DOI:10.1016/s0140-6736(24
+ 4. URL:https://pubmed.ncbi.nlm.nih.gov/?term=hemophilia+B

Comment on lines +10 to +11
4. DOI:10.1055/s-0037-1615884.pdf
5. DOI:10.1055/s-0038-1642670.pdf

Copilot AI Mar 4, 2026

These entries include a .pdf suffix, which makes the value no longer a DOI identifier. If consumers assume DOI: lines are valid DOI strings, these will fail validation and linking. Recommend storing the DOI alone (e.g., 10.1055/s-0037-1615884) and, if needed, adding a separate URL field for the PDF link.

Suggested change
- 4. DOI:10.1055/s-0037-1615884.pdf
- 5. DOI:10.1055/s-0038-1642670.pdf
+ 4. DOI:10.1055/s-0037-1615884
+ 5. DOI:10.1055/s-0038-1642670

Comment on lines +9 to +12
3. DOI:10.1272/jnms.jnms.2023\_90-104
4. DOI:10.1272/jnms.jnms.2023_90-104
5. DOI:10.1371/journal.pone.0301416
6. DOI:10.3390/gastroent14030024

Copilot AI Mar 4, 2026

The same citation appears twice, differing only by an escaped underscore (\_ vs _). This will inflate citation counts and can create duplicate references in downstream mapping. Recommend normalizing escaping (prefer plain _ in the raw data) and de-duplicating identical citations.

Suggested change
- 3. DOI:10.1272/jnms.jnms.2023\_90-104
- 4. DOI:10.1272/jnms.jnms.2023_90-104
- 5. DOI:10.1371/journal.pone.0301416
- 6. DOI:10.3390/gastroent14030024
+ 3. DOI:10.1272/jnms.jnms.2023_90-104
+ 4. DOI:10.1371/journal.pone.0301416
+ 5. DOI:10.3390/gastroent14030024
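
Both the escaped-underscore duplicate here and the `.pdf` suffix flagged in the earlier comment could be handled by a normalization pass before citations are written out. The following is a rough sketch under that assumption; `normalize_citation` and `dedupe_citations` are hypothetical names, not functions from the research tooling.

```python
def normalize_citation(entry):
    """Unescape Markdown underscores and drop a trailing .pdf from DOI entries."""
    value = entry.strip().replace("\\_", "_")
    if value.startswith("DOI:") and value.lower().endswith(".pdf"):
        value = value[: -len(".pdf")]
    return value


def dedupe_citations(entries):
    """Normalize each entry and keep only the first occurrence, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        value = normalize_citation(entry)
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


if __name__ == "__main__":
    raw = [
        "DOI:10.1272/jnms.jnms.2023\\_90-104",
        "DOI:10.1272/jnms.jnms.2023_90-104",
        "DOI:10.1371/journal.pone.0301416",
        "DOI:10.3390/gastroent14030024",
    ]
    for number, citation in enumerate(dedupe_citations(raw), start=1):
        print(f"{number}. {citation}")
```

Renumbering after de-duplication, as in the usage example, keeps the list consistent with the suggested change above; any recorded citation count would need to be recomputed from the de-duplicated list as well.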

duration_seconds: 0.0
template_file: codex_supplement_local
template_variables:
disease_name: Alhzeimer_Disease

Copilot AI Mar 4, 2026

"Alhzeimer_Disease" appears to be a misspelling of "Alzheimer_Disease". Since this string is likely used for lookup/filenames/indexing, correcting it (and ideally the filename too) will prevent hard-to-find inconsistencies.

Comment on lines +22 to +23
### Disorder
- Name: Alhzeimer_Disease

Copilot AI Mar 4, 2026

"Alhzeimer_Disease" appears to be a misspelling of "Alzheimer_Disease". Since this string is likely used for lookup/filenames/indexing, correcting it (and ideally the filename too) will prevent hard-to-find inconsistencies.
