Deep research outputs (from Falcon, Cyberian Codex, Perplexity, OpenAI) used as source material for disorder curation, including citation files. Also adds research/* to codespell skip list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude finished @cmungall's task in 4m 45s
PR Review: Bulk-add deep research files
Summary
This PR adds 574 research source files (plus 564 companion citation files) across 301 unique disorders, contributed by four AI research providers (Falcon: 263, Cyberian Codex: 256, Perplexity: 35, OpenAI: 17, plus a few others). These files serve as auditable source material used during KB curation, not as dismech YAML entries themselves. The scope and intent are appropriate. The test plan items are satisfied: no disorder YAML or schema files are modified, and the research files are all readable.
Pull request overview
Bulk-adds deep research output artifacts (Markdown plus companion citation lists) used as source material during disorder curation, and updates codespell configuration to ignore the research corpus.
Changes:
- Added a large set of `research/*.md` deep-research outputs with YAML front matter (provider/model/timestamps) and synthesized content sections.
- Added companion `research/*.md.citations.md` files enumerating DOI/PMID citation inventories per query.
- Updated the `pyproject.toml` codespell skip list to ignore `research/*` files (to avoid false positives on medical terms).
Reviewed changes
Copilot reviewed 297 out of 561 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| research/*.md | Adds deep-research synthesis outputs used as curation source material. |
| research/*.md.citations.md | Adds citation inventories (DOI/PMID lists) corresponding to each research query. |
| pyproject.toml | Skips codespell checking for the research corpus to reduce false positives. |
| disease_name: HER2-Positive Breast Cancer | ||
| category: | ||
| citation_count: 5 |
category: is present but empty (null) in both the front matter and the rendered “Category” line. If downstream tooling expects a non-null string category, this can cause inconsistent indexing/filtering. Consider either populating a category value (e.g., “Oncology”) or omitting the key/line entirely when unknown so consumers can distinguish “missing” vs “explicitly null”.
| - Name: HER2-Positive Breast Cancer | ||
| - Category: |
category: is present but empty (null) in both the front matter and the rendered “Category” line. If downstream tooling expects a non-null string category, this can cause inconsistent indexing/filtering. Consider either populating a category value (e.g., “Oncology”) or omitting the key/line entirely when unknown so consumers can distinguish “missing” vs “explicitly null”.
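One way to act on the "omit the key when unknown" option is sketched below. It assumes the front matter has already been parsed into a plain dictionary; the function name `drop_empty_category` is hypothetical and not part of this repository.

```python
def drop_empty_category(front_matter: dict) -> dict:
    """Return a copy of the front matter without `category` when it is null or blank.

    Assumption: `front_matter` is the already-parsed YAML header of a research file.
    """
    cleaned = dict(front_matter)
    value = cleaned.get("category")
    # An empty `category:` line parses to None; treat whitespace-only strings the same way.
    if value is None or (isinstance(value, str) and not value.strip()):
        cleaned.pop("category", None)
    return cleaned
```

With this shape, a consumer can test `"category" in meta` to tell "missing" apart from a populated value, and no record carries an explicit null.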
| @@ -0,0 +1,43 @@ | |||
| --- | |||
| provider: cyberian-codex | |||
The file metadata indicates provider: cyberian-codex, but source_providers (and the echoed “Existing deep-research providers”) lists only falcon. If source_providers is meant to describe the sources used to generate this artifact, it should include cyberian-codex (and any others) to avoid confusing provenance. If instead it represents pre-existing providers for the disorder, consider renaming the field or adjusting the template output to clarify intent.
| source_providers: | ||
| - falcon |
The file metadata indicates provider: cyberian-codex, but source_providers (and the echoed “Existing deep-research providers”) lists only falcon. If source_providers is meant to describe the sources used to generate this artifact, it should include cyberian-codex (and any others) to avoid confusing provenance. If instead it represents pre-existing providers for the disorder, consider renaming the field or adjusting the template output to clarify intent.
| ### Disorder | ||
| - Name: Kummell Disease | ||
| - Category: Complex | ||
| - Existing deep-research providers: falcon |
The file metadata indicates provider: cyberian-codex, but source_providers (and the echoed “Existing deep-research providers”) lists only falcon. If source_providers is meant to describe the sources used to generate this artifact, it should include cyberian-codex (and any others) to avoid confusing provenance. If instead it represents pre-existing providers for the disorder, consider renaming the field or adjusting the template output to clarify intent.
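If `source_providers` is in fact meant to list the sources that generated the artifact (the first reading in the comment above, which is an assumption here), a consistency check could look like the following sketch. The function name `provider_missing_from_sources` is hypothetical, and the metadata is assumed to be an already-parsed dictionary.

```python
def provider_missing_from_sources(meta: dict) -> bool:
    """Report whether `provider` is absent from `source_providers`.

    Assumption: `source_providers` should include the generating provider.
    If the field instead records pre-existing providers, this check does not apply.
    """
    provider = meta.get("provider")
    sources = meta.get("source_providers") or []
    return provider is not None and provider not in sources
```

For the Kummell Disease file quoted above, `provider` is `cyberian-codex` and `source_providers` is `[falcon]`, so the check would flag it.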
| 1. DOI:10.1002/14651858.cd014544.pub2 | ||
| 2. DOI:10.1007/s13353-025-00952-w | ||
| 3. DOI:10.1007/s40140-024-00635-y | ||
| 4. DOI:10.1016/s0140-6736(24 |
This DOI appears truncated (10.1016/s0140-6736(24), which is not parsable as a DOI and can break any citation validation/linking tooling. Please regenerate or repair truncated DOI entries so each is a complete DOI (or replace with a PMID/URL if DOI is unavailable).
| 4. DOI:10.1016/s0140-6736(24 | |
| 4. URL:https://pubmed.ncbi.nlm.nih.gov/?term=hemophilia+B |
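Truncated entries like the one above could be caught before merge with a small heuristic, sketched here. The function name `looks_complete` is hypothetical. The check combines a loose DOI shape test with a balanced-parentheses test; balanced parentheses are not a formal DOI requirement, so this is a heuristic for spotting likely truncation rather than a validator.

```python
import re

# Loose shape check: "10.", a 4-9 digit registrant code, a slash, then a non-empty suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_complete(doi: str) -> bool:
    """Heuristically decide whether a DOI string looks complete (not truncated)."""
    if not DOI_SHAPE.match(doi):
        return False
    depth = 0
    for ch in doi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis with no opener
    # An unclosed "(" is a strong hint that the string was cut off.
    return depth == 0
```

Entries that fail the check would still need manual repair or regeneration, since the missing characters cannot be recovered from the truncated string itself.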
| 4. DOI:10.1055/s-0037-1615884.pdf | ||
| 5. DOI:10.1055/s-0038-1642670.pdf |
These entries include a .pdf suffix, which makes the value no longer a DOI identifier. If consumers assume DOI: lines are valid DOI strings, these will fail validation and linking. Recommend storing the DOI alone (e.g., 10.1055/s-0037-1615884) and, if needed, adding a separate URL field for the PDF link.
| 4. DOI:10.1055/s-0037-1615884.pdf | |
| 5. DOI:10.1055/s-0038-1642670.pdf | |
| 4. DOI:10.1055/s-0037-1615884 | |
| 5. DOI:10.1055/s-0038-1642670 |
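The suggested fix can be applied mechanically, as in this minimal sketch. The function name `strip_pdf_suffix` is hypothetical. It assumes a trailing `.pdf` is always a file-extension leftover; a DOI suffix could in principle end that way legitimately, so results are worth spot-checking.

```python
def strip_pdf_suffix(doi: str) -> str:
    """Remove a trailing ".pdf" (any letter case) from a DOI string."""
    if doi.lower().endswith(".pdf"):
        return doi[: -len(".pdf")]
    return doi
```

If the PDF link itself is worth keeping, it would go in a separate URL entry, as the comment recommends, rather than being derived from the DOI.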
| 3. DOI:10.1272/jnms.jnms.2023\_90-104 | ||
| 4. DOI:10.1272/jnms.jnms.2023_90-104 | ||
| 5. DOI:10.1371/journal.pone.0301416 | ||
| 6. DOI:10.3390/gastroent14030024 |
The same citation appears twice, differing only by an escaped underscore (\_ vs _). This will inflate citation counts and can create duplicate references in downstream mapping. Recommend normalizing escaping (prefer plain _ in the raw data) and de-duplicating identical citations.
| 3. DOI:10.1272/jnms.jnms.2023\_90-104 | |
| 4. DOI:10.1272/jnms.jnms.2023_90-104 | |
| 5. DOI:10.1371/journal.pone.0301416 | |
| 6. DOI:10.3390/gastroent14030024 | |
| 3. DOI:10.1272/jnms.jnms.2023_90-104 | |
| 4. DOI:10.1371/journal.pone.0301416 | |
| 5. DOI:10.3390/gastroent14030024 |
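The normalization and de-duplication described above could be sketched as follows. The function name `dedupe_citations` is hypothetical, and the input is assumed to be a list of citation strings with the list numbering already stripped.

```python
def dedupe_citations(citations: list[str]) -> list[str]:
    """Unescape Markdown-escaped underscores and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    for citation in citations:
        normalized = citation.replace("\\_", "_")
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result
```

This sketch compares strings exactly. DOIs are case-insensitive, so a stricter version might lowercase the comparison key as well; that is left out here to keep the example close to the reported problem.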
| duration_seconds: 0.0 | ||
| template_file: codex_supplement_local | ||
| template_variables: | ||
| disease_name: Alhzeimer_Disease |
"Alhzeimer_Disease" appears to be a misspelling of "Alzheimer_Disease". Since this string is likely used for lookup/filenames/indexing, correcting it (and ideally the filename too) will prevent hard-to-find inconsistencies.
| ### Disorder | ||
| - Name: Alhzeimer_Disease |
"Alhzeimer_Disease" appears to be a misspelling of "Alzheimer_Disease". Since this string is likely used for lookup/filenames/indexing, correcting it (and ideally the filename too) will prevent hard-to-find inconsistencies.
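Misspellings of this kind could be flagged by comparing each `disease_name` against a list of known disorder names, as in this sketch using the standard library's `difflib`. The function name `suggest_name`, the `known_names` list, and the 0.8 cutoff are assumptions for illustration only.

```python
import difflib
from typing import Optional


def suggest_name(name: str, known_names: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known name when `name` is not an exact match, else None.

    Assumption: `known_names` holds the canonical disorder identifiers.
    """
    if name in known_names:
        return None  # exact match, nothing to suggest
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A non-`None` result is only a hint for a human reviewer; two distinct disorders can have similar names, so the suggestion should not be applied automatically.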
Summary
- Deep research outputs with `.citations.md` companion files
- Adds `research/*` to codespell skip list (medical terms trigger false positives)
Test plan
🤖 Generated with Claude Code