CultureMech integrates microbial culture media data from multiple international biological resource centers, databases, and ontologies. This document provides a high-level overview of all data sources.
Current Status (as of 2025-01):
- Unique Media: 6,000+ recipes
- Total Records: 80,000+ (including duplicates and organism associations)
- Sources Integrated: 3 complete, 5 in progress, 10 planned
- Coverage: Bacteria, Fungi, Archaea, Algae

| Layer 1: Raw Data → | Layer 2: Processed → | Layer 3: Knowledge Base |
|---|---|---|
| data/raw/{source}/ | data/processed/ | kb/media/{category}/ |
| Original formats | Enriched | CultureMech YAML |
| Immutable archives | Cross-referenced | LinkML validated |
| Full provenance | Standardized | Ontology-annotated |

All sources are tracked in data/MEDIA_SOURCES.tsv with:
- Source metadata (name, URL, API endpoint)
- Record counts and data formats
- Download status and priority
- Integration notes
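
As a minimal sketch of how the tracking table could be queried, the snippet below parses tab-separated rows and lists the sources with a given download status. The column names (`source_name`, `download_status`, `priority`) and the sample rows are assumptions for illustration only; check the actual header of data/MEDIA_SOURCES.tsv before relying on them.

```python
import csv
import io
from typing import Iterable


def read_sources(handle: Iterable[str]) -> list[dict[str, str]]:
    """Parse a tab-separated tracking table into one dict per source."""
    return list(csv.DictReader(handle, delimiter="\t"))


def sources_with_status(rows: list[dict[str, str]], status: str) -> list[str]:
    """Return the names of all sources whose download status matches."""
    # "source_name" and "download_status" are hypothetical column names.
    return [row["source_name"] for row in rows if row["download_status"] == status]


# Illustrative rows only; the real file may use different columns and values.
SAMPLE = (
    "source_name\tdownload_status\tpriority\n"
    "MediaDive\tCOMPLETE\t1\n"
    "BacDive\tNOT_STARTED\t1\n"
)

rows = read_sources(io.StringIO(SAMPLE))
pending = sources_with_status(rows, "NOT_STARTED")
```

To read the real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("data/MEDIA_SOURCES.tsv", newline="", encoding="utf-8")`.
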
- Records: 3,327 media recipes
- Coverage: Bacteria, Fungi, Archaea
- Format: JSON via REST API
- Status: COMPLETE
- Location: data/raw/mediadive/
- Primary Value: Largest well-structured bacterial media collection
- Integration: Full import with CHEBI mappings
- Records: 2,917 media recipes
- Coverage: Bacteria, Fungi, Archaea (Japanese BRCs)
- Format: JSON via REST API + SPARQL
- Status: COMPLETE
- Location: data/raw/togo/
- Primary Value: Aggregates JCM, NBRC, and other Japanese sources
- Integration: Full import; ~900 recipes overlap with MediaDive
- Records: ~5,000 chemical mappings
- Coverage: All domains
- Format: TSV files
- Status: COMPLETE
- Location: data/raw/microbe-media-param/
- Primary Value: CHEBI mappings for ingredient annotation
- Integration: Used by all importers for chemical entity resolution
- Records: ~300 media (manually curated)
- Coverage: Bacteria, Fungi, Archaea, Algae
- Format: HTML/PDF (no bulk API)
- Status: PARTIAL
- Location: data/raw/atcc/
- Primary Value: U.S. culture collection reference
- Challenge: No bulk download - manual curation required
- Records: 66,570 cultivation datasets
- Coverage: Bacteria, Archaea
- Format: JSON via REST API + Python client
- Status: NOT_STARTED
- Planned Location: data/raw/bacdive/
- Primary Value: Largest cultivation dataset with organism-media associations
- Expected New Media: ~500 unique recipes (most reference existing DSMZ media)
- Expected Enrichments: 66,000+ organism-media links to existing recipes
- Records: 3,335 media variants
- Coverage: Bacteria (primarily E. coli and model organisms)
- Format: SQL database
- Status: NOT_STARTED
- Planned Location: data/raw/komodo/
- Primary Value: Standardized molar concentrations for all compounds
- Expected New Media: ~300 unique variants
- Expected Enrichments: Backfill concentrations to ~3,000 existing MediaDive recipes
- Records: 65 defined media
- Coverage: Bacteria (model organisms)
- Format: MySQL dump + TSV
- Status: NOT_STARTED
- Planned Location: data/raw/mediadb/
- Primary Value: Chemically defined media for computational biology
- Expected New Media: ~50-60 recipes (high overlap with existing sources)
- Records: 400+ media recipes
- Coverage: Bacteria, Fungi, Archaea
- Format: HTML (web scraping)
- Status: NOT_STARTED
- Planned Location: data/raw/nbrc/
- Primary Value: Japanese BRC media not in TogoMedium
- Expected New Media: ~200 recipes (overlap with TogoMedium expected)
- Note: Requires ethical web scraping with rate limiting
- Records: ~500 media
- Access: SPARQL endpoint + manual curation
- Note: Partially included in TogoMedium
- Records: 68 algae media
- Access: Web scraping (structured HTML)
- Value: Fills algae coverage gap
- Records: ~100 algae media
- Access: PDF parsing
- Value: UK algae collection
- Records: ~30 algae media
- Access: Web scraping
- Value: German algae collection
- NCTC (UK bacteria)
- NCIMB (UK industrial microbes)
- BioCyc/MetaCyc (subscription required)
- Additional culture collections worldwide
- Classes: 14,550 culture condition terms
- Format: OWL ontology
- Use: Semantic annotation of media properties
- Purpose: Culture media classification framework
- Format: OWL ontology
- Status: Planned integration
| Domain | Current | After Priority 1 | After Priority 2 | Target |
|---|---|---|---|---|
| Bacteria | ~3,053 | ~5,500 | ~6,000 | ~6,500 |
| Fungi | ~114 | ~400 | ~450 | ~500 |
| Archaea | ~63 | ~200 | ~250 | ~300 |
| Algae | 0 | ~60 | ~300 | ~500 |
| Specialized | ~97 | ~200 | ~400 | ~500 |
| Total Unique | ~3,327 | ~6,400 | ~7,400 | ~8,300 |
All integrated sources must meet:
- Schema Validation: 100% LinkML compliance
- Provenance: Full source tracking and dates
- Chemical Mapping: CHEBI IDs where possible (target >80%)
- Cross-referencing: Deduplication against existing recipes
- Documentation: Complete README with fetch instructions
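
The chemical mapping target can be checked with a simple coverage ratio. The sketch below assumes each ingredient is a dict with an optional `chebi_id` key; that field name and the sample recipe are hypothetical and not taken from the CultureMech schema.

```python
from typing import Optional

CHEBI_TARGET = 0.80  # target from the quality standards: more than 80% mapped


def chebi_coverage(ingredients: list[dict[str, Optional[str]]]) -> float:
    """Return the fraction of ingredients that carry a CHEBI ID."""
    if not ingredients:
        return 0.0
    mapped = sum(1 for item in ingredients if item.get("chebi_id"))
    return mapped / len(ingredients)


# Illustrative recipe; "chebi_id" is an assumed field name.
recipe = [
    {"name": "water", "chebi_id": "CHEBI:15377"},
    {"name": "sodium chloride", "chebi_id": "CHEBI:26710"},
    {"name": "yeast extract", "chebi_id": None},  # complex ingredient, unmapped
]

coverage = chebi_coverage(recipe)
meets_target = coverage > CHEBI_TARGET
```

Here two of three ingredients are mapped, so this recipe alone falls below the target. In practice the ratio would more likely be computed over all ingredients of a source than per recipe.
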
Phase 1 (Weeks 1-2): Infrastructure + BacDive
- Create tracking table ✅
- Documentation ✅
- BacDive fetcher + importer
Phase 2 (Week 2): KOMODO + MediaDB
- SQL parsers
- Concentration enrichment pipeline
Phase 3 (Week 3): NBRC + validation
- Ethical web scraping
- Cross-reference deduplication
Phase 4 (Week 4): Algae collections
- UTEX, CCAP, SAG integration
- Final documentation updates
- MediaDive: CC BY 4.0
- TogoMedium: CC BY 4.0
- BacDive: CC BY 4.0
- ATCC: Fair use (limited data)
- Web-scraped sources: Cite and attribute properly
- Respect robots.txt for all web scraping
- Implement rate limiting (1-2s delays)
- Cache to minimize server load
- Provide proper attribution
- Contact administrators for large-scale scraping
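
The guidelines above can be sketched with the Python standard library. This is one possible arrangement, not the project's actual scraper: the class and function names are hypothetical, and the `download` callable is left to the caller so that the sketch stays independent of any particular HTTP client.

```python
import time
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(
        self,
        min_delay: float = 1.5,  # within the 1-2s range suggested above
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to keep min_delay between calls."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_cached(
    url: str,
    cache: dict[str, str],
    limiter: RateLimiter,
    download: Callable[[str], str],
) -> str:
    """Return the cached body for a URL, downloading at most once."""
    if url not in cache:
        limiter.wait()  # only real downloads count against the rate limit
        cache[url] = download(url)
    return cache[url]
```

A caller would first confirm `allowed_by_robots(...)` for the target URL and then route every request through `fetch_cached`, so repeated runs reuse cached pages instead of hitting the server again.
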
- MediaDive: Updated regularly via API
- TogoMedium: Quarterly updates recommended
- BacDive: API access allows incremental updates
- Manual sources: Update as needed
- MEDIA_SOURCES.tsv: Master tracking table
- MEDIA_INTEGRATION_GUIDE.md: How to add new sources
- Individual README files: data/raw/{source}/README.md
For questions about data sources or to suggest new integrations, open an issue on GitHub.
Last Updated: 2025-01-24
Maintained By: CultureMech Development Team