Add comprehensive CLI, GFF utilities, enrichment analysis, and API enhancements #69

cmungall · 2025-10-27T17:01:27Z

This major release transforms nmdc_api_utilities into a full-featured toolkit for
microbiome data analysis, adding 17,000+ lines across 52 files with:

- New `nmdc` CLI with 10+ commands built on Typer and Rich
  - Commands: biosample, study, data-object, link, dump-study, enrich, validate, mint
  - GFF subcommands: query, stats, find-bgc, export
  - Beautiful terminal output with progress indicators and formatted tables
  - Environment selection (prod/dev/backup) with NMDC_ENV support
  - Geographic bounding box filtering (--bbox) for biosample queries
- **GFF utilities** (gff_utils.py, 638 lines): DuckDB-powered parsing of functional
  annotations with sub-second queries on 160k+ features, BGC detection, EC/PFAM/COG/KO
  searches, and region-based queries
- **Enrichment analysis** (enrichment.py, 839 lines): Statistical comparison of
  functional annotations across sample groups with Fisher's exact test, Chi-square,
  FDR correction (Benjamini-Hochberg/Bonferroni), and OAKlib integration for
  ontology term labels
- **Preprocessors** (preprocessors.py, 732 lines): Extensible framework for adding
  headers to NMDC annotation files (KO, EC, COG, PFAM, CATH, SMART, TIGRFAM, SUPERFAMILY)
- **Object linking**: LinkedInstancesSearch class for graph traversal, linking methods
  on BiosampleSearch/StudySearch/DataObjectSearch for finding related entities
- **Study dumper** (study_dumper.py, 455 lines): Download complete studies with metadata,
  functional annotations, and data objects with preprocessing and type filtering
- **Functional biosample search**: Filter biosamples by EC numbers, PFAM domains,
  COG categories, and KEGG orthologs
- **Link caching** (link_cache.py, 474 lines): LRU cache for API responses to reduce
  redundant calls during graph traversal
- **Export utilities**: TSV/CSV export with flattening of nested structures
- Migration to `uv` for modern dependency management (uv.lock, 5724 lines)
- Optional dependency groups: [viz], [gff], [enrich], [dev]
- Claude Code plugin marketplace integration (.claude-plugin/marketplace.json)
- NMDC skills for GFF analysis and enrichment analysis (nmdc-skills/)
- Comprehensive documentation: CLAUDE.md, CONTRIBUTING.md, filters.rst, justfile
- 29 new tests with pytest fixtures (test_cli.py, test_linked_instances.py, etc.)
- 14 doctests for inline documentation
- Enhanced README with CLI examples and Python API usage
- Extensive CHANGES.md documenting all 500+ lines of changes
- New docs/filters.rst explaining MongoDB query patterns
- justfile for common development tasks
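
For readers unfamiliar with the statistics named in the enrichment item, here is a minimal standard-library sketch of a one-sided Fisher's exact test followed by Benjamini-Hochberg adjustment. It is not the code in enrichment.py; the function names `fisher_exact_greater` and `benjamini_hochberg` are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are sample groups, columns are "has annotation" / "lacks annotation".
    Returns P(X >= a) under the hypergeometric null (enrichment in group 1).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    # Sum hypergeometric probabilities for every table at least as extreme.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: one 2x2 table per annotation term.
tables = {"term_A": (3, 0, 0, 3), "term_B": (2, 1, 1, 2)}
raw = {term: fisher_exact_greater(*t) for term, t in tables.items()}
adj = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
```

Bonferroni, the other correction mentioned, would simply multiply each raw p-value by the number of tests and cap the result at 1.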

All changes are backward compatible. Tests run against prod/dev/backup NMDC APIs.
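
The link caching item in the description mentions an LRU cache for API responses. As a rough illustration of that eviction policy only (not the implementation in link_cache.py, whose interface is not shown in this thread), a least-recently-used cache can be sketched with `collections.OrderedDict`. The class name `LRUCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache; hypothetical, for illustration only."""

    def __init__(self, max_size=128):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        # Mark the entry as most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Evict the least recently used entry once over capacity.
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)


cache = LRUCache(max_size=2)
cache.put("nmdc:bsm-123", ["nmdc:sty-456"])
cache.put("nmdc:bsm-789", [])
cache.get("nmdc:bsm-123")      # touch, so bsm-789 becomes the eviction candidate
cache.put("nmdc:bsm-000", [])  # evicts nmdc:bsm-789
```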

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]


aclum · 2025-10-27T17:52:59Z

nmdc_api_utilities/nmdc_search.py

+    >>> search_dev.base_url
+    'https://api-dev.microbiomedata.org'
+    >>> # Backup environment
+    >>> search_backup = NMDCSearch(env="backup")


not sure we want to reference this going forward @shreddd

This was critical for dev use while things were in churn. There should be some kind of mechanism for switching servers. It doesn't need to be as prominent in the docs as it is here, though.
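
One low-profile way to keep a server switch without featuring it in the docs is to resolve the base URL from an explicit argument or the NMDC_ENV variable. The sketch below is an assumption about how that could look, not the code in nmdc_search.py: `resolve_base_url` and `BASE_URLS` are hypothetical names, the dev URL is the one from the doctest above, the prod URL is assumed, and the backup environment is left out because its URL does not appear in this thread.

```python
import os

# dev URL is taken from the doctest under review; the prod URL is an assumption.
BASE_URLS = {
    "prod": "https://api.microbiomedata.org",
    "dev": "https://api-dev.microbiomedata.org",
}


def resolve_base_url(env=None):
    """Pick a base URL from an explicit argument, then NMDC_ENV, then prod."""
    name = env or os.environ.get("NMDC_ENV", "prod")
    try:
        return BASE_URLS[name]
    except KeyError:
        known = ", ".join(sorted(BASE_URLS))
        raise ValueError(f"Unknown environment {name!r}; expected one of: {known}") from None
```

With this shape, user-facing examples can omit the argument entirely while developers still override the target through the environment variable.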

aclum · 2025-10-27T17:55:37Z

nmdc_api_utilities/test/test_link_cache.py

+
+def test_get_links_by_relationship_type(temp_cache):
+    """Test filtering links by relationship type."""
+    temp_cache.add_link("nmdc:bsm-123", "nmdc:sty-456", "part_of")


is there some renaming of the relationship slots under the hood? I'd expect this to be has_associated_studies, not part_of?

Similar story for other relationship slots/tests

We should think about adding wrapper functions to the new linked_instances endpoint in place of this.

This is just for testing the cache, so the value isn't important here. But yes, there should be some kind of enum restricting these.
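
A minimal sketch of the enum idea, assuming a string-valued `Enum` is acceptable. The class and function names are hypothetical, and the two members are only the slot names mentioned in this thread; which relationship slots are actually valid would need to come from the schema.

```python
from enum import Enum


class RelationshipType(str, Enum):
    """Hypothetical closed set of relationship names accepted by the cache."""

    PART_OF = "part_of"
    HAS_ASSOCIATED_STUDIES = "has_associated_studies"


def coerce_relationship(value):
    """Convert a raw string to RelationshipType; raises ValueError if unknown."""
    return RelationshipType(value)


rel = coerce_relationship("has_associated_studies")
```

A call such as `add_link(...)` could run its relationship argument through `coerce_relationship` so that typos fail immediately instead of being stored.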

Fix awkward field names with trailing double underscores (e.g., "tot_nitro__") that occurred when measurement units contained special characters like "%".

The issue arose from direct regex sanitization of unit strings, which converted single-character symbols like "%" to "_", resulting in field names like "field_name__" (field name + "_" + sanitized unit).

Changes:

- Add UNIT_NORMALIZATION mapping (aggregation.py:52-76)
  Maps common unit symbols to readable names:
  - % → percent
  - Cel/°C/℃ → celsius
  - °F/℉ → fahrenheit
  - Standard SI units (m, kg, L, etc.) → full names
  - Chemical units (M, mM, ppm, ppb, etc.)
- Add normalize_unit_for_key() function (aggregation.py:79-116)
  Normalizes unit strings for use as dictionary key suffixes
  - Checks UNIT_NORMALIZATION mapping first
  - Falls back to regex sanitization for unknown units
  - Strips trailing underscores to prevent awkward names
  - Includes comprehensive doctests
- Update aggregate_field() to use unit normalization (aggregation.py:229-292)
  - Replace direct regex substitution with normalize_unit_for_key()
  - Update docstring with examples of normalized field names
  - Add example showing % → percent conversion
- Add comprehensive test coverage (test_aggregation.py)
  - Add TestNormalizeUnitForKey class with 5 test cases:
    * test_normalize_percent: % → percent
    * test_normalize_celsius_variants: Cel/°C/℃ → celsius
    * test_normalize_common_units: m/kg/L/ppm conversions
    * test_normalize_complex_unit: mg/kg → mg_kg
    * test_normalize_strips_trailing_underscore
  - Update existing tests for normalized names:
    * temp_Cel → temp_celsius
    * biogeochemical_rollup → property_rollup
  - All 36 unit tests pass
  - All 8 doctests pass

Result: Field names are now clean and readable
- Before: "tot_nitro__" (awkward trailing __)
- After: "tot_nitro_percent" (clear and descriptive)

This improves the usability of the study rollup API by providing intuitive, human-readable field names in aggregated biosample data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
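
The lookup-then-sanitize behavior described in this commit message can be sketched roughly as follows. This is a reconstruction from the description, not the code in aggregation.py: the mapping holds only the symbols spelled out above, the exact regex is an assumption, and `field_key` is a hypothetical helper that shows how the suffix joins the field name.

```python
import re

# Only the symbols spelled out in the commit message; the real mapping is larger.
UNIT_NORMALIZATION = {
    "%": "percent",
    "Cel": "celsius",
    "°C": "celsius",
    "℃": "celsius",
    "°F": "fahrenheit",
    "℉": "fahrenheit",
}


def normalize_unit_for_key(unit):
    """Turn a unit string into a suffix that is safe to use in a dictionary key."""
    # 1. Known symbols map to readable names.
    if unit in UNIT_NORMALIZATION:
        return UNIT_NORMALIZATION[unit]
    # 2. Unknown units fall back to regex sanitization (pattern is an assumption).
    sanitized = re.sub(r"[^A-Za-z0-9]", "_", unit)
    # 3. Strip trailing underscores so keys never end in "__".
    return sanitized.rstrip("_")


def field_key(field, unit):
    """Hypothetical helper: join a field name and its normalized unit suffix."""
    suffix = normalize_unit_for_key(unit)
    return f"{field}_{suffix}" if suffix else field
```

Under these assumptions, `field_key("tot_nitro", "%")` yields `"tot_nitro_percent"` rather than `"tot_nitro__"`, and `normalize_unit_for_key("mg/kg")` yields `"mg_kg"`, matching the before/after examples in the message.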

cmungall commented Oct 27, 2025
aclum Oct 27, 2025
cmungall Oct 27, 2025
aclum Oct 27, 2025
aclum Oct 27, 2025
kheal Oct 27, 2025
cmungall Oct 27, 2025

4 participants