Commit 402f4fb

realmarcin and claude committed
Add unmapped ingredients aggregation and tracking system
- Created LinkML schema for unmapped ingredients (9 classes, 5 enums)
- Implemented aggregation script to identify and track unmapped ingredients
- Added statistics reporting tool for prioritization analysis
- Generated comprehensive documentation and executive summary
- Updated README with system overview and usage commands

System identifies 136 unmapped ingredients across 522 media (4.9% of total), totaling 3,084 instances requiring ontology term mapping. Supports automated detection of numeric placeholders, generic terms, and chemical name extraction from notes fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d5d1774 commit 402f4fb

6 files changed: +1317 -14 lines changed

README.md

Lines changed: 145 additions & 14 deletions
@@ -2,18 +2,18 @@
22

33
**Comprehensive Microbial Culture Media Knowledge Graph**
44

5-
A production-ready knowledge base containing **10,595 culture media recipes** from 10 major international repositories, with LinkML schema validation, ontology grounding, and browser-based exploration.
5+
A production-ready knowledge base containing **10,657 culture media recipes** from 10 major international repositories, with LinkML schema validation, ontology grounding, and browser-based exploration.
66

77
[![License: CC0-1.0](https://img.shields.io/badge/License-CC0_1.0-lightgrey.svg)](http://creativecommons.org/publicdomain/zero/1.0/)
88
[![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
99

1010
## 📊 Current Coverage
1111

12-
**Total Recipes**: **10,595** culture media formulations
12+
**Total Recipes**: **10,657** culture media formulations
1313

1414
| Category | Recipes | Sources |
1515
|----------|---------|---------|
16-
| **Bacterial** | 10,072 | MediaDive, TOGO, BacDive, ATCC, NBRC, KOMODO, MediaDB |
16+
| **Bacterial** | 10,134 | MediaDive, TOGO, BacDive, ATCC, NBRC, KOMODO, MediaDB |
1717
| **Algae** | 242 | UTEX, CCAP, SAG |
1818
| **Fungal** | 119 | MediaDive, TOGO |
1919
| **Specialized** | 99 | KOMODO |
@@ -66,6 +66,14 @@ Complex media contain undefined components (e.g., yeast extract, peptone), while
6666
- Organisms: NCBITaxon (NCBI Taxonomy)
6767
- Media databases: DSMZ, TOGO, ATCC prefixes
6868

69+
**Unmapped Ingredients Tracking System** (2026-03-05):
70+
- 🎯 **136 unmapped ingredients** identified across 522 media (4.9% of total)
71+
- 📊 **3,084 total instances** requiring ontology term mapping
72+
- 🔍 Automated detection of numeric placeholders ('1', '2', '3'), generic terms, and empty values
73+
- 🧪 Chemical name extraction from notes fields for mapping assistance
74+
- 📈 Priority-based mapping recommendations (critical: 51+ occurrences)
75+
- See [UNMAPPED_INGREDIENTS_SUMMARY.md](UNMAPPED_INGREDIENTS_SUMMARY.md) and [docs/unmapped_ingredients_guide.md](docs/unmapped_ingredients_guide.md)
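
The detection rules listed above could look roughly like the following. This is a minimal sketch, not the code in `scripts/aggregate_unmapped_ingredients.py`: the function names, the `GENERIC_TERMS` set, and the "non-critical" label are assumptions made for illustration. Only the three detection categories and the "critical: 51+ occurrences" threshold come from this README.

```python
import re

# Hypothetical list of generic terms; the project's real list is not shown here.
GENERIC_TERMS = {"solution", "supplement", "other", "unknown"}


def classify_unmapped(name):
    """Return why an ingredient name cannot be mapped as-is, or None if it looks usable."""
    text = (name or "").strip()
    if not text:
        return "empty"
    if re.fullmatch(r"\d+", text):  # placeholders such as '1', '2', '3'
        return "numeric_placeholder"
    if text.lower() in GENERIC_TERMS:
        return "generic_term"
    return None


def priority(occurrences):
    """Label an ingredient by how often it occurs; only the 51+ cutoff is from the README."""
    return "critical" if occurrences >= 51 else "non-critical"
```

For example, `classify_unmapped("2")` reports a numeric placeholder, while a real chemical name such as `"NaCl"` falls through and returns `None`.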
76+
6977
**Advanced Normalization & SSSOM Enrichment** (2026-02):
7078
- ✨ Integrated MicroMediaParam's production-grade 16-step chemical normalization pipeline
7179
- 📚 100+ curated biological products (yeast extract, peptone, serum, DNA, agar, etc.)
@@ -77,9 +85,15 @@ Complex media contain undefined components (e.g., yeast extract, peptone), while
7785
- 📈 **68.4% increase** in coverage (27.1% → 45.6%)
7886
- See [PROJECT_STATUS_SUMMARY.md](PROJECT_STATUS_SUMMARY.md) and [GAS_MAPPING_SUMMARY.md](GAS_MAPPING_SUMMARY.md) for details
7987

88+
**Enum Normalization** (2026-02-20):
89+
- 🔧 Normalized **10,657 YAML files** for schema compliance
90+
- ✅ Fixed capitalization: `medium_type` (COMPLEX, DEFINED), `physical_state` (LIQUID, SOLID_AGAR)
91+
- 📁 Recategorized all "imported" files to proper organism types (bacterial, fungal, archaea, algae, specialized)
92+
- 🎯 **100% schema compliance** across all enum fields
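
As an illustration of the capitalization fix, the sketch below upper-cases the two enum fields named above on a recipe loaded as a dictionary. It is an assumption about how such a step might work, not the logic of `culturemech.enrich.normalize_enums`; the function name and the space/hyphen handling are hypothetical.

```python
# Field names are taken from the README; everything else is an illustrative assumption.
ENUM_FIELDS = ("medium_type", "physical_state")


def normalize_enums(recipe):
    """Return a copy of the recipe with enum fields upper-cased and underscore-separated."""
    fixed = dict(recipe)
    for field in ENUM_FIELDS:
        value = fixed.get(field)
        if isinstance(value, str):
            # 'solid agar' or 'solid-agar' becomes 'SOLID_AGAR'
            fixed[field] = value.strip().upper().replace(" ", "_").replace("-", "_")
    return fixed
```

Fields outside `ENUM_FIELDS`, such as `name`, are left untouched, so a value like `BG-11 Medium` keeps its original casing.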
93+
8094
## ✨ Features
8195

82-
**10,595 recipes** - Production-ready dataset from 10 authoritative sources
96+
**10,657 recipes** - Production-ready dataset from 10 authoritative sources
8397
**Four-tier architecture** - Clean separation: raw → raw_yaml → normalized_yaml → merge_yaml
8498
**Recipe deduplication** - Merge recipes with same ingredient sets (~344 unique base formulations)
8599
**LinkML schema validation** - Comprehensive data quality enforcement
@@ -88,6 +102,9 @@ Complex media contain undefined components (e.g., yeast extract, peptone), while
88102
**Automated pipelines** - Fetchers, converters, and importers for all sources
89103
**Browser interface** - Faceted search and filtering
90104
**Knowledge graph export** - Biolink-compliant KGX format
105+
**Literature verification** - 6-tier cascading PDF retrieval for cross-reference validation
106+
**ATCC cross-references** - Automated equivalency detection with DSMZ media
107+
**Unmapped ingredients tracking** - Automated detection and prioritization of ingredients needing ontology mapping
91108
**Comprehensive documentation** - 30+ guides in `docs/`
92109

93110
## 🚀 Quick Start
@@ -206,7 +223,8 @@ just import-nbrc
206223
CultureMech/
207224
├── src/culturemech/ # Python package
208225
│ ├── schema/ # LinkML schema definitions
209-
│ │ └── culturemech.yaml # Main schema (1800+ lines)
226+
│ │ ├── culturemech.yaml # Main schema (1800+ lines)
227+
│ │ └── unmapped_ingredients_schema.yaml # Unmapped ingredients schema
210228
│ ├── fetch/ # Data fetchers (10 sources)
211229
│ │ ├── utex_fetcher.py # UTEX algae media
212230
│ │ ├── ccap_fetcher.py # CCAP algae media
@@ -223,6 +241,13 @@ CultureMech/
223241
│ │ └── kgx_export.py # Knowledge graph export
224242
│ └── render.py # HTML page generator
225243
244+
├── scripts/ # Utility scripts
245+
│ ├── aggregate_unmapped_ingredients.py # Aggregate unmapped ingredients
246+
│ └── unmapped_ingredients_stats.py # Generate statistics reports
247+
248+
├── output/ # Generated outputs
249+
│ └── unmapped_ingredients.yaml # Aggregated unmapped ingredients (502KB)
250+
226251
├── data/ # Three-tier data architecture
227252
│ ├── raw/ # Layer 1: Source files (git ignored)
228253
│ │ ├── utex/ # UTEX raw data
@@ -273,8 +298,11 @@ Comprehensive documentation is available in the [`docs/`](docs/) directory:
273298
- **[CCAP/SAG Deployment](docs/CCAP_SAG_PRODUCTION_DEPLOYMENT.md)** - Metadata import details
274299
- **[Data Sources Summary](docs/DATA_SOURCES_SUMMARY.md)** - All source repositories
275300

276-
### Data Quality
301+
### Data Quality & Enrichment
277302
- **[Enrichment Guide](docs/ENRICHMENT_GUIDE.md)** - Data quality improvement workflow
303+
- **[Implementation Summary](IMPLEMENTATION_SUMMARY.md)** - Literature verification & enum normalization
304+
- **[Unmapped Ingredients Guide](docs/unmapped_ingredients_guide.md)** - System for tracking ingredients needing ontology mapping
305+
- **[Unmapped Ingredients Summary](UNMAPPED_INGREDIENTS_SUMMARY.md)** - Executive summary with statistics and priorities
278306

279307
## 🧬 Recipe Format
280308

@@ -283,8 +311,8 @@ Recipes are stored as YAML files following the LinkML schema:
283311
```yaml
284312
name: BG-11 Medium
285313
category: algae
286-
medium_type: complex
287-
physical_state: liquid
314+
medium_type: COMPLEX
315+
physical_state: LIQUID
288316

289317
description: Standard cyanobacteria medium from UTEX Culture Collection
290318

@@ -403,6 +431,65 @@ Every recipe includes:
403431
- Cross-references to original sources
404432
- PDF URLs for detailed protocols (CCAP/SAG)
405433

434+
## 🔬 Literature Verification
435+
436+
**NEW** (2026-02-20): CultureMech now includes a comprehensive literature verification system for validating cross-references through scientific papers.
437+
438+
### 6-Tier Cascading PDF Retrieval
439+
440+
The system attempts to retrieve PDFs from multiple sources in order:
441+
442+
1. **Direct Publisher Access** - ASM, PLOS, Frontiers, MDPI, Nature, Science, Elsevier
443+
2. **PubMed Central (PMC)** - NCBI idconv API
444+
3. **Unpaywall API** - Open access aggregator
445+
4. **Semantic Scholar** - Open PDF endpoint
446+
5. **Sci-Hub Fallback** - Optional, disabled by default (requires explicit opt-in)
447+
6. **Web Search** - arXiv, bioRxiv, Europe PMC
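
The cascade above can be sketched as a loop over named retriever functions that stops at the first success and records which tier produced the PDF. This is a simplified illustration under stated assumptions: `retrieve_pdf`, the tier names, and the retriever signature are hypothetical and are not the project's actual API.

```python
from typing import Callable, FrozenSet, Optional, Sequence, Tuple

# A retriever takes an identifier (e.g. a DOI) and returns PDF bytes or None.
Retriever = Callable[[str], Optional[bytes]]


def retrieve_pdf(
    identifier: str,
    tiers: Sequence[Tuple[str, Retriever]],
    enable_fallback: bool = False,
    fallback_tiers: FrozenSet[str] = frozenset({"scihub"}),
) -> Optional[Tuple[str, bytes]]:
    """Try each tier in order; return (tier_name, pdf_bytes) for the first hit, else None."""
    for name, fetch in tiers:
        if name in fallback_tiers and not enable_fallback:
            continue  # opt-in tiers are skipped unless explicitly enabled
        try:
            pdf = fetch(identifier)
        except Exception:
            pdf = None  # a failing tier should not abort the whole cascade
        if pdf:
            return name, pdf  # the tier name serves as provenance
    return None
```

Returning the tier name alongside the bytes is one simple way to keep the provenance described in this section. Skipping a disabled tier rather than stopping at it lets the later web-search tier still run.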
448+
449+
### Key Features
450+
451+
- **Legal sources first** - Always tries publisher, PMC, Unpaywall, and Semantic Scholar before fallback
452+
- **Sci-Hub opt-in only** - Disabled by default, requires `--enable-scihub-fallback` flag
453+
- **Full provenance** - Tracks which tier successfully retrieved each PDF
454+
- **Evidence extraction** - 8 regex patterns for detecting media equivalencies
455+
- **Batch processing** - Verify multiple candidates efficiently
456+
- **Caching layer** - Metadata and PDFs cached locally to avoid repeated requests
457+
458+
### Usage Examples
459+
460+
```bash
461+
# Generate ATCC-DSMZ cross-reference candidates (name-based matching only)
462+
python -m culturemech.enrich.atcc_crossref_builder generate
463+
464+
# Verify candidates using legal sources only (no Sci-Hub)
465+
python -m culturemech.enrich.atcc_crossref_builder generate \
466+
--verify-literature
467+
468+
# Verify with Sci-Hub fallback enabled (explicit opt-in)
469+
python -m culturemech.enrich.atcc_crossref_builder generate \
470+
--verify-literature \
471+
--enable-scihub-fallback
472+
473+
# Configure via environment variables
474+
export ENABLE_SCIHUB_FALLBACK=true
475+
export LITERATURE_EMAIL="your@email.com"
476+
export FALLBACK_PDF_MIRRORS="https://sci-hub.se,https://sci-hub.st"
477+
python -m culturemech.enrich.atcc_crossref_builder generate --verify-literature
478+
```
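
For readers wondering how the environment variables above might be consumed, here is a hedged sketch. The variable names come from the example, and `use_fallback_pdf` is the default named under "Safety features"; the function name, the accepted truthy strings, and the returned dictionary layout are assumptions.

```python
import os


def load_literature_config(env=None):
    """Read literature-verification settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    flag = env.get("ENABLE_SCIHUB_FALLBACK", "").strip().lower()
    mirrors = [m.strip() for m in env.get("FALLBACK_PDF_MIRRORS", "").split(",") if m.strip()]
    return {
        # Stays False unless the variable is explicitly set to a truthy string
        "use_fallback_pdf": flag in {"1", "true", "yes"},
        "email": env.get("LITERATURE_EMAIL") or None,
        "fallback_mirrors": mirrors,
    }
```

With no variables set, the fallback stays disabled and the mirror list is empty, which matches the opt-in behavior described in this section.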
479+
480+
### Institutional Compliance
481+
482+
⚠️ **Important**: The Sci-Hub fallback tier is disabled by default and requires explicit opt-in. Use may violate publisher agreements or local laws. Users are responsible for compliance with institutional policies.
483+
484+
**Safety features:**
485+
- Default: `use_fallback_pdf=False`
486+
- Legal sources exhausted first
487+
- Clear warnings when Sci-Hub is enabled
488+
- Full provenance tracking
489+
- No auto-distribution of PDFs
490+
491+
See [IMPLEMENTATION_SUMMARY.md](IMPLEMENTATION_SUMMARY.md) for complete documentation.
492+
406493
## 🌐 Browser Interface
407494

408495
The faceted search browser (`app/index.html`) provides:
@@ -434,6 +521,42 @@ just gen-browser-data # Generate browser search data
434521
just test # Run test suite
435522
```
436523

524+
### Cross-Reference & Enrichment
525+
526+
```bash
527+
# Generate ATCC-DSMZ cross-reference candidates
528+
python -m culturemech.enrich.atcc_crossref_builder generate \
529+
--output data/curation/atcc_candidates.json
530+
531+
# Verify candidates via literature search (legal sources only)
532+
python -m culturemech.enrich.atcc_crossref_builder generate \
533+
--verify-literature
534+
535+
# Verify with Sci-Hub fallback (opt-in, requires explicit flag)
536+
python -m culturemech.enrich.atcc_crossref_builder generate \
537+
--verify-literature \
538+
--enable-scihub-fallback
539+
540+
# Normalize enum values (medium_type, physical_state, category)
541+
python -m culturemech.enrich.normalize_enums --dry-run # Preview changes
542+
python -m culturemech.enrich.normalize_enums # Apply changes
543+
544+
# Aggregate unmapped ingredients for mapping prioritization
545+
python scripts/aggregate_unmapped_ingredients.py --verbose --min-occurrences 2
546+
547+
# View unmapped ingredients statistics
548+
python scripts/unmapped_ingredients_stats.py --top 20
549+
550+
# View full aggregated data
551+
less output/unmapped_ingredients.yaml
552+
553+
# Read the comprehensive guide
554+
cat docs/unmapped_ingredients_guide.md
555+
556+
# Read the executive summary
557+
cat UNMAPPED_INGREDIENTS_SUMMARY.md
558+
```
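
To make the `--min-occurrences` and `--top` options concrete, the sketch below counts unmapped ingredients across recipes and applies both filters. It assumes a simplified recipe shape (an `ingredients` list whose entries have a `name` and an optional `term`); the real YAML structure and the scripts' internals may differ.

```python
from collections import Counter


def aggregate_unmapped(recipes, min_occurrences=1, top=None):
    """Return (name, instances, media_count) rows for ingredients lacking an ontology term."""
    counts = Counter()
    media = {}
    for recipe in recipes:
        for ingredient in recipe.get("ingredients", []):
            if ingredient.get("term") is None:  # hypothetical marker for "unmapped"
                name = ingredient.get("name", "")
                counts[name] += 1
                media.setdefault(name, set()).add(recipe.get("name"))
    rows = [
        (name, n, len(media[name]))
        for name, n in counts.most_common()  # most frequent first
        if n >= min_occurrences
    ]
    return rows if top is None else rows[:top]
```

Each row separates total instances from the number of distinct media, mirroring the distinction this README draws between 3,084 instances and 522 media.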
559+
437560
### Adding New Recipes
438561

439562
1. Create YAML file in appropriate category:
@@ -493,17 +616,18 @@ pytest tests/test_kgx_export.py
493616
$ just count-recipes
494617
Recipe count by category:
495618

496-
algae: 242
619+
algae: 242
497620
archaea: 63
498-
bacterial: 10072
499-
fungal: 119
500-
specialized: 99
621+
bacterial: 10,134
622+
fungal: 119
623+
specialized: 99
501624

502-
Total recipes: 10595
625+
Total recipes: 10,657
503626
```
504627

505628
**Data Quality**:
506629
- ✅ 100% schema-validated
630+
- ✅ 100% enum compliance (10,657 files normalized)
507631
- ✅ Full source attribution
508632
- ✅ Comprehensive provenance tracking
509633
- ✅ LinkML compliance
@@ -514,6 +638,13 @@ Total recipes: 10595
514638
- ✅ 3 algae collections (UTEX, CCAP, SAG)
515639
- ✅ Automated fetch → convert → import workflow
516640

641+
**Enrichment Features**:
642+
- ✅ Literature verification with 6-tier PDF retrieval
643+
- ✅ ATCC-DSMZ cross-reference detection
644+
- ✅ Automated enum normalization
645+
- ✅ Evidence extraction from scientific papers
646+
- ✅ Unmapped ingredients aggregation and tracking (136 ingredients, 3,084 instances)
647+
517648
## 🤝 Contributing
518649

519650
We welcome contributions! Ways to contribute:
@@ -602,4 +733,4 @@ If you use CultureMech in your research, please cite:
602733

603734
**Built with ❤️ for microbiology research**
604735

605-
**10,595 recipes** • **10 sources** • **Production ready** • **Public domain**
736+
**10,657 recipes** • **10 sources** • **Production ready** • **Public domain**
