Add comprehensive CLI, GFF utilities, enrichment analysis, and API enhancements #69
…hancements

This major release transforms nmdc_api_utilities into a full-featured toolkit for microbiome data analysis, adding 17,000+ lines across 52 files with:

- New `nmdc` CLI with 10+ commands built on Typer and Rich
- Commands: biosample, study, data-object, link, dump-study, enrich, validate, mint
- GFF subcommands: query, stats, find-bgc, export
- Beautiful terminal output with progress indicators and formatted tables
- Environment selection (prod/dev/backup) with NMDC_ENV support
- Geographic bounding box filtering (--bbox) for biosample queries
- **GFF utilities** (gff_utils.py, 638 lines): DuckDB-powered parsing of functional annotations with sub-second queries on 160k+ features, BGC detection, EC/PFAM/COG/KO searches, and region-based queries
- **Enrichment analysis** (enrichment.py, 839 lines): Statistical comparison of functional annotations across sample groups with Fisher's exact test, Chi-square, FDR correction (Benjamini-Hochberg/Bonferroni), and OAKlib integration for ontology term labels
- **Preprocessors** (preprocessors.py, 732 lines): Extensible framework for adding headers to NMDC annotation files (KO, EC, COG, PFAM, CATH, SMART, TIGRFAM, SUPERFAMILY)
- **Object linking**: LinkedInstancesSearch class for graph traversal, linking methods on BiosampleSearch/StudySearch/DataObjectSearch for finding related entities
- **Study dumper** (study_dumper.py, 455 lines): Download complete studies with metadata, functional annotations, and data objects with preprocessing and type filtering
- **Functional biosample search**: Filter biosamples by EC numbers, PFAM domains, COG categories, and KEGG orthologs
- **Link caching** (link_cache.py, 474 lines): LRU cache for API responses to reduce redundant calls during graph traversal
- **Export utilities**: TSV/CSV export with flattening of nested structures
- Migration to `uv` for modern dependency management (uv.lock, 5724 lines)
- Optional dependency groups: [viz], [gff], [enrich], [dev]
- Claude Code plugin marketplace integration (.claude-plugin/marketplace.json)
- NMDC skills for GFF analysis and enrichment analysis (nmdc-skills/)
- Comprehensive documentation: CLAUDE.md, CONTRIBUTING.md, filters.rst, justfile
- 29 new tests with pytest fixtures (test_cli.py, test_linked_instances.py, etc.)
- 14 doctests for inline documentation
- Enhanced README with CLI examples and Python API usage
- Extensive CHANGES.md documenting all 500+ lines of changes
- New docs/filters.rst explaining MongoDB query patterns
- justfile for common development tasks

All changes are backward compatible. Tests run against prod/dev/backup NMDC APIs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
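For readers unfamiliar with the statistics named in the enrichment item, here is a rough standard-library sketch of a one-sided Fisher's exact test followed by Benjamini-Hochberg adjustment. It is not taken from enrichment.py, whose implementation is not shown on this page; the function names and the sample counts are hypothetical and exist only for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a / b: group 1 samples with / without the annotation
    c / d: group 2 samples with / without the annotation
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1 = a + b            # size of group 1
    col1 = a + c            # samples carrying the annotation
    col2 = b + d            # samples lacking the annotation
    total = col1 + col2
    denom = comb(total, row1)
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(col2, row1 - k) for k in range(a, upper + 1))
    return tail / denom


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    if m == 0:
        return []
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up counts per annotation term: (a, b, c, d) as described above.
tables = {"term_1": (8, 2, 1, 9), "term_2": (5, 5, 5, 5)}
raw = {term: fisher_exact_greater(*counts) for term, counts in tables.items()}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
```

Adjusting after all per-term tests have run matters because the correction depends on the total number of tests, not on any single p-value.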
    >>> search_dev.base_url
    'https://api-dev.microbiomedata.org'
    >>> # Backup environment
    >>> search_backup = NMDCSearch(env="backup")
not sure we want to reference this going forward @shreddd
This was critical for dev use while things were in churn. There should be some kind of mechanism for switching servers. It doesn't need to be as prominent in the docs as it is here, though.
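One low-profile way to keep a switching mechanism without featuring it in the docs might look like the sketch below. Only the dev URL and the NMDC_ENV variable name appear in this PR; the prod URL, the `resolve_base_url` helper, and the omission of a backup entry are assumptions made for illustration, not the library's actual code.

```python
import os

# Only the dev URL is quoted in the doctest above. The prod URL is an
# assumption, and the backup URL is left out because it is not shown here.
BASE_URLS = {
    "prod": "https://api.microbiomedata.org",
    "dev": "https://api-dev.microbiomedata.org",
}


def resolve_base_url(env=None, environ=None):
    """Pick a base URL from an explicit argument or the NMDC_ENV variable.

    `resolve_base_url` is a hypothetical helper name. An explicit `env`
    argument wins; otherwise NMDC_ENV is consulted; otherwise prod is used.
    """
    environ = os.environ if environ is None else environ
    name = env or environ.get("NMDC_ENV", "prod")
    if name not in BASE_URLS:
        known = ", ".join(sorted(BASE_URLS))
        raise ValueError(f"unknown environment {name!r}; expected one of: {known}")
    return BASE_URLS[name]
```

With something like this, the default path never mentions alternate servers, and developers can still opt in through the environment variable.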
    def test_get_links_by_relationship_type(temp_cache):
        """Test filtering links by relationship type."""
        temp_cache.add_link("nmdc:bsm-123", "nmdc:sty-456", "part_of")
is there some renaming of the relationship slots under the hood? I'd expect this to be has_associated_studies, not part_of?
Similar story for other relationship slots/tests
We should think about adding wrapper functions to the new linked_instances endpoint in place of this.
This is just for testing the cache; the value isn't important here. But yes, there should be some kind of enum restricting these.
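A minimal sketch of the enum idea from this thread, assuming relationship types are passed around as plain strings. The two members are just the names mentioned in the comments above, not a vetted list of schema slots, and `RelationshipType` and `coerce_relationship` are hypothetical names.

```python
from enum import Enum


class RelationshipType(str, Enum):
    # Placeholder members taken from this review thread; the real set
    # would need to come from the NMDC schema.
    PART_OF = "part_of"
    HAS_ASSOCIATED_STUDIES = "has_associated_studies"


def coerce_relationship(value):
    """Return a RelationshipType, rejecting strings that are not members."""
    try:
        return RelationshipType(value)
    except ValueError:
        allowed = ", ".join(member.value for member in RelationshipType)
        raise ValueError(
            f"unknown relationship type {value!r}; expected one of: {allowed}"
        ) from None
```

Because the enum also subclasses `str`, existing call sites that compare against string literals would keep working while new code gets validation at the boundary.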
Fix awkward field names with trailing double underscores (e.g., "tot_nitro__")
that occurred when measurement units contained special characters like "%".
The issue arose from direct regex sanitization of unit strings, which converted
single-character symbols like "%" to "_", resulting in field names like
"field_name__" (field name + "_" + sanitized unit).
Changes:
- Add UNIT_NORMALIZATION mapping (aggregation.py:52-76)
  Maps common unit symbols to readable names:
  - % → percent
  - Cel/°C/℃ → celsius
  - °F/℉ → fahrenheit
  - Standard SI units (m, kg, L, etc.) → full names
  - Chemical units (M, mM, ppm, ppb, etc.)
- Add normalize_unit_for_key() function (aggregation.py:79-116)
  Normalizes unit strings for use as dictionary key suffixes
  - Checks UNIT_NORMALIZATION mapping first
  - Falls back to regex sanitization for unknown units
  - Strips trailing underscores to prevent awkward names
  - Includes comprehensive doctests
- Update aggregate_field() to use unit normalization (aggregation.py:229-292)
  - Replace direct regex substitution with normalize_unit_for_key()
  - Update docstring with examples of normalized field names
  - Add example showing % → percent conversion
- Add comprehensive test coverage (test_aggregation.py)
  - Add TestNormalizeUnitForKey class with 5 test cases:
    * test_normalize_percent: % → percent
    * test_normalize_celsius_variants: Cel/°C/℃ → celsius
    * test_normalize_common_units: m/kg/L/ppm conversions
    * test_normalize_complex_unit: mg/kg → mg_kg
    * test_normalize_strips_trailing_underscore
  - Update existing tests for normalized names:
    * temp_Cel → temp_celsius
    * biogeochemical_rollup → property_rollup
  - All 36 unit tests pass
  - All 8 doctests pass
Result: Field names are now clean and readable
- Before: "tot_nitro__" (awkward trailing __)
- After: "tot_nitro_percent" (clear and descriptive)
This improves the usability of the study rollup API by providing intuitive,
human-readable field names in aggregated biosample data.
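The lookup-then-fallback behaviour described in this commit message can be sketched roughly as follows. This is a reconstruction from the description, not the code in aggregation.py: the mapping holds only the entries whose targets are spelled out above, the exact regex is an assumption, and `field_key` is a hypothetical helper standing in for the relevant part of aggregate_field().

```python
import re

# Subset of the mapping described above; the SI and chemical-unit entries
# are omitted because their exact target names are not given in the message.
UNIT_NORMALIZATION = {
    "%": "percent",
    "Cel": "celsius",
    "°C": "celsius",
    "℃": "celsius",
    "°F": "fahrenheit",
    "℉": "fahrenheit",
}


def normalize_unit_for_key(unit):
    """Turn a unit string into a suffix that is safe to use in a dict key."""
    unit = unit.strip()
    # 1. Known symbols map to readable names.
    if unit in UNIT_NORMALIZATION:
        return UNIT_NORMALIZATION[unit]
    # 2. Unknown units: replace runs of non-alphanumerics with "_"
    #    (assumed pattern), then drop trailing underscores.
    return re.sub(r"[^A-Za-z0-9]+", "_", unit).rstrip("_")


def field_key(name, unit):
    """Hypothetical helper: join a field name and its normalized unit."""
    suffix = normalize_unit_for_key(unit)
    return f"{name}_{suffix}" if suffix else name
```

Under these assumptions, `field_key("tot_nitro", "%")` yields `tot_nitro_percent` instead of `tot_nitro__`, and a compound unit such as `mg/kg` falls through to the regex path and becomes `mg_kg`.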
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>