@cmungall

This major release transforms nmdc_api_utilities into a full-featured toolkit for
microbiome data analysis, adding 17,000+ lines across 52 files with:

  • New nmdc CLI with 10+ commands built on Typer and Rich

  • Commands: biosample, study, data-object, link, dump-study, enrich, validate, mint

  • GFF subcommands: query, stats, find-bgc, export

  • Beautiful terminal output with progress indicators and formatted tables

  • Environment selection (prod/dev/backup) with NMDC_ENV support

  • Geographic bounding box filtering (--bbox) for biosample queries

  • GFF utilities (gff_utils.py, 638 lines): DuckDB-powered parsing of functional
    annotations with sub-second queries on 160k+ features, BGC detection, EC/PFAM/COG/KO
    searches, and region-based queries

  • Enrichment analysis (enrichment.py, 839 lines): Statistical comparison of
    functional annotations across sample groups with Fisher's exact test, Chi-square,
    FDR correction (Benjamini-Hochberg/Bonferroni), and OAKlib integration for
    ontology term labels

  • Preprocessors (preprocessors.py, 732 lines): Extensible framework for adding
    headers to NMDC annotation files (KO, EC, COG, PFAM, CATH, SMART, TIGRFAM, SUPERFAMILY)

  • Object linking: LinkedInstancesSearch class for graph traversal, linking methods
    on BiosampleSearch/StudySearch/DataObjectSearch for finding related entities

  • Study dumper (study_dumper.py, 455 lines): Download complete studies with metadata,
    functional annotations, and data objects with preprocessing and type filtering

  • Functional biosample search: Filter biosamples by EC numbers, PFAM domains,
    COG categories, and KEGG orthologs

  • Link caching (link_cache.py, 474 lines): LRU cache for API responses to reduce
    redundant calls during graph traversal

  • Export utilities: TSV/CSV export with flattening of nested structures

  • Migration to uv for modern dependency management (uv.lock, 5724 lines)

  • Optional dependency groups: [viz], [gff], [enrich], [dev]

  • Claude Code plugin marketplace integration (.claude-plugin/marketplace.json)

  • NMDC skills for GFF analysis and enrichment analysis (nmdc-skills/)

  • Comprehensive documentation: CLAUDE.md, CONTRIBUTING.md, filters.rst, justfile

  • 29 new tests with pytest fixtures (test_cli.py, test_linked_instances.py, etc.)

  • 14 doctests for inline documentation

  • Enhanced README with CLI examples and Python API usage

  • Extensive CHANGES.md documenting all 500+ lines of changes

  • New docs/filters.rst explaining MongoDB query patterns

  • justfile for common development tasks

All changes are backward compatible. Tests run against prod/dev/backup NMDC APIs.
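
The enrichment bullet above names Fisher's exact test followed by Benjamini-Hochberg FDR correction. As a rough illustration of that combination only (this is not the code in enrichment.py; the function names and the one-sided form of the test are assumptions made for this sketch), a standard-library version might look like:

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    a: group-1 samples with the annotation, b: group-1 samples without,
    c: group-2 samples with the annotation, d: group-2 samples without.
    Returns P(X >= a) under the hypergeometric null (enrichment in group 1).
    """
    row1 = a + b           # size of group 1
    col1 = a + c           # samples carrying the annotation
    total = a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per annotation: (with, without) in each of two groups.
tables = {"K00001": (3, 0, 0, 3), "K00002": (1, 1, 1, 1)}
raw = [fisher_exact_greater(*t) for t in tables.values()]
fdr = benjamini_hochberg(raw)
```

The annotation IDs and counts are made up for the example; a real run would build one table per annotation from the sample groups being compared.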

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

>>> search_dev.base_url
'https://api-dev.microbiomedata.org'
>>> # Backup environment
>>> search_backup = NMDCSearch(env="backup")

not sure we want to reference this going forward @shreddd

Author

This was critical for dev use while things were in churn. There should be some kind of mechanism for switching servers. It doesn't need to be as prominent in the docs as it is here, though.
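
For reference, a low-profile switching mechanism could be as small as the sketch below. It is an illustration, not the PR's implementation: `resolve_base_url` is a hypothetical helper, the precedence (explicit argument, then `NMDC_ENV`, then prod) is an assumption, and only the dev URL comes from the diff excerpt above; the prod and backup URLs are placeholders.

```python
import os

# Only the dev URL is taken from the diff excerpt in this thread.
# The prod and backup entries are placeholders, not real endpoints.
BASE_URLS = {
    "prod": "https://prod.example.org",
    "dev": "https://api-dev.microbiomedata.org",
    "backup": "https://backup.example.org",
}


def resolve_base_url(env=None, environ=None):
    """Pick a base URL from an explicit name, then NMDC_ENV, then prod."""
    environ = os.environ if environ is None else environ
    name = env or environ.get("NMDC_ENV") or "prod"
    try:
        return BASE_URLS[name]
    except KeyError:
        allowed = ", ".join(sorted(BASE_URLS))
        raise ValueError(
            f"Unknown environment {name!r}; expected one of: {allowed}"
        ) from None
```

Keeping the lookup in one helper means the docs only need to mention `NMDC_ENV` in a developer section rather than in every usage example.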


def test_get_links_by_relationship_type(temp_cache):
    """Test filtering links by relationship type."""
    temp_cache.add_link("nmdc:bsm-123", "nmdc:sty-456", "part_of")

Is there some renaming of the relationship slots under the hood? I'd expect this to be has_associated_studies, not part_of.

Similar story for other relationship slots/tests

Contributor

We should think about adding wrapper functions to the new linked_instances endpoint in place of this.

Author

This is just for testing the cache; the value isn't important here. But yes, there should be some kind of enum restricting these.
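
One possible shape for such an enum, as a sketch only: the two member values are the slot names mentioned in this thread, while `RelationshipType` and `validate_relationship` are hypothetical names, and the real set of allowed relationships would need to come from the NMDC schema.

```python
from enum import Enum


class RelationshipType(str, Enum):
    """Allowed relationship labels (illustrative, not the schema's full set)."""

    PART_OF = "part_of"
    HAS_ASSOCIATED_STUDIES = "has_associated_studies"


def validate_relationship(value):
    """Return the matching enum member or raise a readable ValueError."""
    try:
        return RelationshipType(value)
    except ValueError:
        allowed = ", ".join(member.value for member in RelationshipType)
        raise ValueError(
            f"Unknown relationship type {value!r}; expected one of: {allowed}"
        ) from None
```

Because the enum subclasses `str`, members compare equal to their plain string values, so existing callers that pass `"part_of"` would keep working if the cache validated its input this way.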

Fix awkward field names with trailing double underscores (e.g., "tot_nitro__")
that occurred when measurement units contained special characters like "%".

The issue arose from direct regex sanitization of unit strings, which converted
single-character symbols like "%" to "_", resulting in field names like
"field_name__" (field name + "_" + sanitized unit).

Changes:

- Add UNIT_NORMALIZATION mapping (aggregation.py:52-76)
  Maps common unit symbols to readable names:
  - % → percent
  - Cel/°C/℃ → celsius
  - °F/℉ → fahrenheit
  - Standard SI units (m, kg, L, etc.) → full names
  - Chemical units (M, mM, ppm, ppb, etc.)

- Add normalize_unit_for_key() function (aggregation.py:79-116)
  Normalizes unit strings for use as dictionary key suffixes
  - Checks UNIT_NORMALIZATION mapping first
  - Falls back to regex sanitization for unknown units
  - Strips trailing underscores to prevent awkward names
  - Includes comprehensive doctests

- Update aggregate_field() to use unit normalization (aggregation.py:229-292)
  - Replace direct regex substitution with normalize_unit_for_key()
  - Update docstring with examples of normalized field names
  - Add example showing % → percent conversion

- Add comprehensive test coverage (test_aggregation.py)
  - Add TestNormalizeUnitForKey class with 5 test cases:
    * test_normalize_percent: % → percent
    * test_normalize_celsius_variants: Cel/°C/℃ → celsius
    * test_normalize_common_units: m/kg/L/ppm conversions
    * test_normalize_complex_unit: mg/kg → mg_kg
    * test_normalize_strips_trailing_underscore
  - Update existing tests for normalized names:
    * temp_Cel → temp_celsius
    * biogeochemical_rollup → property_rollup
  - All 36 unit tests pass
  - All 8 doctests pass

Result: Field names are now clean and readable
- Before: "tot_nitro__" (awkward trailing __)
- After:  "tot_nitro_percent" (clear and descriptive)

This improves the usability of the study rollup API by providing intuitive,
human-readable field names in aggregated biosample data.
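
A minimal sketch of the lookup-then-sanitize flow described above. The mapping here contains only the entries spelled out in this message, and the regex and the `field_key` helper are assumptions for illustration; the actual code in aggregation.py may differ.

```python
import re

# Subset of the mapping listed in this commit message; the real table is larger.
UNIT_NORMALIZATION = {
    "%": "percent",
    "Cel": "celsius",
    "°C": "celsius",
    "℃": "celsius",
    "°F": "fahrenheit",
    "℉": "fahrenheit",
}


def normalize_unit_for_key(unit):
    """Turn a unit string into a suffix that is safe inside a field name."""
    unit = unit.strip()
    # 1. Known symbols map to readable names.
    if unit in UNIT_NORMALIZATION:
        return UNIT_NORMALIZATION[unit]
    # 2. Unknown units fall back to regex sanitization (assumed pattern).
    sanitized = re.sub(r"[^A-Za-z0-9]+", "_", unit)
    # 3. Strip leading and trailing underscores so keys never end in "__".
    return sanitized.strip("_")


def field_key(name, unit):
    """Hypothetical helper: join a field name and its normalized unit."""
    suffix = normalize_unit_for_key(unit)
    return f"{name}_{suffix}" if suffix else name
```

With this sketch, `field_key("tot_nitro", "%")` gives `"tot_nitro_percent"` and `normalize_unit_for_key("mg/kg")` gives `"mg_kg"`, matching the before/after names listed above.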

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>