feat: Add NCBI Datasets API Integration (56 Tools) #40
Open
benjibromberg wants to merge 8 commits into mims-harvard:main from benjibromberg:main
Integrates high-coverage NCBI Datasets API tools with auto-generated tool classes, wrappers, and JSON configs, supporting gene, genome, taxonomy, and virus queries. Introduces OpenAPI-driven discovery and code generation scripts, enabling maintenance automation and parameter synchronization. Ensures all tool schemas and parameters remain up to date with the evolving NCBI Datasets OpenAPI spec, minimizing manual drift. Provides an extensive, parametrized test suite for functionality, error handling, rate limits, and OpenAPI compliance, supporting robust, future-proof integration. Lays groundwork for continuous tool API maintenance and easy coverage extension as NCBI adds endpoints.
Introduces support for retrieving taxonomy dataset reports using NCBI taxon identifiers, including function, tool class, JSON schema, and integration into the tool universe. Enhances automation and code generation logic to handle flexible path parameters for endpoints that accept both single values and arrays. Improves coverage of NCBI Datasets API tools, enabling users to access richer taxonomic metadata across various taxonomic ranks.
Updates the parameter-building logic to use string concatenation that properly separates conditional parameter blocks with newlines. Prevents formatting issues in generated query code, ensuring parameters are correctly added when present.
Adds support for additional flexible path parameters such as locus tags, assembly names, bioprojects, biosample IDs, proteins, tax IDs, and WGS accessions, enabling single values or lists for these inputs. Improves parameter description logic by extracting the first word from descriptions or falling back to parameter names, enhancing auto-generated documentation clarity. Updates response construction to include path parameters for better context. These changes improve tool flexibility and generated API documentation quality.
Introduces new auto-generated tools for NCBI Datasets API endpoints that provide dataset reports by gene ID, accession, taxon, locus tag, and for viruses and genomes by various identifiers. Updates initialization, lazy loading, and exports to support these tools and registers their schemas and Python client functions. Enables broader and more granular access to NCBI Datasets metadata, allowing easier integration and improved flexibility for downstream consumers.
Adds comprehensive integration with the NCBI Datasets API, introducing 56 new tools for accessing gene data, genome assemblies, taxonomy information, virus genomes, organelle data, and biosample records. This update includes auto-generated tool classes, detailed documentation, and a maintenance guide, enhancing the API's usability and flexibility for researchers. Additionally, known test failures are documented to improve testing transparency.
Combined NCBI Datasets tools with upstream's new tools (OLS, ClinVar, literature search tools). Updated type annotations, imports, lazy proxies, and __all__ list to include both sets of tools.
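The flexible path parameters mentioned in these commits (a single value or a list for inputs such as gene IDs, accessions, or tax IDs) can be illustrated with a small helper. This is a minimal sketch, not the generated code from the PR: the function name `format_path_param` is hypothetical, and joining list values with an encoded comma is an assumption about how multi-value path segments are written.

```python
from urllib.parse import quote


def format_path_param(value):
    """Render a path parameter that may be a single value or a list/tuple.

    Hypothetical helper: comma-joining list values is an assumption,
    not something confirmed by the PR.
    """
    if isinstance(value, (list, tuple)):
        items = [str(v) for v in value]
    else:
        items = [str(value)]
    if not items or any(not item for item in items):
        raise ValueError("path parameter must not be empty")
    # Percent-encode each item so it is safe inside a URL path segment,
    # then join the items with an encoded comma.
    return "%2C".join(quote(item, safe="") for item in items)


print(format_path_param(59067))
print(format_path_param(["59067", 50615]))
```

Normalizing to a list first keeps the single-value and multi-value cases on the same code path, which is one way to avoid generating two variants of each tool.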
Contributor, Author:
Tried to do my best here, but let me know if I missed anything that I can fix!
Member:
Looks good to me, thank you! I will test these tools on my side and merge them ASAP!
Summary
This PR adds comprehensive integration with the NCBI Datasets API v2, providing
56 tools for accessing gene data, genome assemblies, taxonomy information,
virus genomes, organelle data, and biosample records. The integration uses an
OpenAPI-driven approach where the OpenAPI specification serves as the single
source of truth for all parameters, endpoints, and validation.
Features
56 Tool Classes: Complete coverage of NCBI Datasets API endpoints
100% OpenAPI Parameter Coverage: All parameters from the OpenAPI
specification are implemented in each tool
Automated Generation System: Configuration files and test definitions
are auto-generated from the OpenAPI specification, ensuring easy updates
when NCBI releases new API versions
Comprehensive Test Suite: 447 tests total (408 passing, 91.3% pass rate)
Known failures documented (see KNOWN_TEST_FAILURES.md)
Test runtime of about 4 minutes (for NCBI tests) due to rate limiting (0.25s delay between tests) to
respect NCBI API limits
Complete Documentation:
docs/tools/ncbi_datasets_tools.rst (774 lines)
src/tooluniverse/data/specs/ncbi/README.md
examples/ncbi_datasets_tool_example.py
Technical Implementation
OpenAPI-Driven Architecture
The integration follows a specification-driven approach:
OpenAPI Specification: src/tooluniverse/data/specs/ncbi/openapi3.docs.yaml
Auto-Generation Scripts:
scripts/discover_and_generate.py: Discovers endpoints and generates tool classes
scripts/update_ncbi_json_from_openapi.py: Updates JSON configurations from spec
Tool Classes: All 56 tools in src/tooluniverse/ncbi_datasets_tool.py (BaseTool)
Function Wrappers: 56 wrapper functions in src/tooluniverse/tools/
Test Results
Test Runtime Impact: This PR adds 447 tests to the test suite, which
extends the overall test runtime by approximately 4 minutes (~228 seconds).
Each test includes a 0.25s delay to respect NCBI API rate limits (5-10
requests/second), ensuring reliable test execution without hitting API
throttling.
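A 0.25s gap between requests caps the request rate at 4 per second, which stays under the 5 requests/second limit. The sketch below shows one way such a delay could be enforced; it is an illustration, not the PR's test code. The class name `MinIntervalLimiter` and the injectable `clock` and `sleep` arguments are assumptions made so the behavior can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive calls to wait().

    Hypothetical helper; the PR only states that each test includes
    a 0.25s delay.
    """

    def __init__(self, min_interval=0.25, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep only as long as needed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


limiter = MinIntervalLimiter(min_interval=0.25)
limiter.wait()  # first call never sleeps
limiter.wait()  # sleeps for whatever is left of the 0.25s gap
```

Unlike a fixed `time.sleep(0.25)` per test, this variant only sleeps for the part of the interval that has not already elapsed, so slow tests do not pay the full delay on top of their own runtime.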
Known Failures: Documented in
src/tooluniverse/data/specs/ncbi/KNOWN_TEST_FAILURES.md. These are upstream NCBI API issues.
Tests are kept active to detect when NCBI fixes these issues.
Upstream Compatibility
Merge Tested: Successfully merged with upstream/main
Conflicts: src/tooluniverse/__init__.py (resolved)
Files Changed
Core Implementation
src/tooluniverse/ncbi_datasets_tool.py: 56 tool classes
src/tooluniverse/data/ncbi_datasets_tools.json: Tool configurations
src/tooluniverse/tools/ncbi_datasets_*.py: 56 wrapper functions
src/tooluniverse/__init__.py: Updated imports and exports (4 locations)
Specifications and Maintenance
src/tooluniverse/data/specs/ncbi/: Complete directory
openapi3.docs.yaml: Official OpenAPI specification
README.md: Maintenance guide for contributors
KNOWN_TEST_FAILURES.md: Documentation of known API issues
scripts/discover_and_generate.py: Auto-generation script
scripts/update_ncbi_json_from_openapi.py: JSON config updater
Tests
tests/tools/test_ncbi_datasets_tool.py: Comprehensive test suite
Documentation
docs/tools/ncbi_datasets_tools.rst: Complete user documentation (774 lines)
examples/ncbi_datasets_tool_example.py: 13 working examples
API Key Support
Tools support optional API key authentication via the NCBI_API_KEY environment
variable for enhanced rate limits (10 rps vs 5 rps default). See
docs/tools/ncbi_datasets_tools.rst for setup instructions.
Usage Example
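A minimal sketch of how the optional `NCBI_API_KEY` variable might be picked up and turned into request headers and a rate limit. This is not the PR's own example: the function names are hypothetical, and the `api-key` header name is an assumption about how the NCBI Datasets REST API expects the key to be sent. The 10 rps and 5 rps figures are the ones stated above.

```python
import os


def build_headers(env=None):
    """Return request headers, adding the API key only when one is set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/json"}
    api_key = env.get("NCBI_API_KEY")
    if api_key:
        # Assumed header name; check the NCBI Datasets documentation.
        headers["api-key"] = api_key
    return headers


def requests_per_second(env=None):
    """Return the allowed request rate: 10 rps with a key, 5 rps without."""
    env = os.environ if env is None else env
    return 10 if env.get("NCBI_API_KEY") else 5


print(build_headers())
print(requests_per_second())
```

Passing the environment as an argument keeps the helpers easy to test without modifying `os.environ`.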
Maintenance
Future updates to the NCBI Datasets API can be easily integrated by:
Updating openapi3.docs.yaml with new specification
Running python src/tooluniverse/data/specs/ncbi/scripts/discover_and_generate.py
See src/tooluniverse/data/specs/ncbi/README.md for detailed maintenance instructions.
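To make the spec-driven workflow concrete, the sketch below walks the `paths` section of a parsed OpenAPI 3 document and lists each operation with its parameters. It is an illustration of the general idea only, not the contents of `discover_and_generate.py`: the function name, the record layout, and the inline `SPEC` dictionary (which stands in for `openapi3.docs.yaml`) are all made up, and `$ref` parameters are not resolved.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def discover_endpoints(spec):
    """Return one record per operation in a parsed OpenAPI 3 document."""
    endpoints = []
    for path, path_item in sorted(spec.get("paths", {}).items()):
        # Parameters declared on the path apply to every operation under it.
        shared = path_item.get("parameters", [])
        for method, operation in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            params = shared + operation.get("parameters", [])
            endpoints.append({
                "path": path,
                "method": method.upper(),
                "operation_id": operation.get("operationId"),
                "parameters": [p["name"] for p in params],
                "required": [p["name"] for p in params if p.get("required")],
            })
    return endpoints


# Tiny inline spec standing in for openapi3.docs.yaml (contents are made up).
SPEC = {
    "paths": {
        "/gene/id/{gene_ids}": {
            "get": {
                "operationId": "gene_by_id",
                "parameters": [
                    {"name": "gene_ids", "in": "path", "required": True},
                    {"name": "page_size", "in": "query"},
                ],
            }
        }
    }
}

for endpoint in discover_endpoints(SPEC):
    print(endpoint["method"], endpoint["path"], endpoint["parameters"])
```

Records like these could then feed both the JSON tool configurations and the parametrized tests, which is what keeps them in sync with the spec.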
Related Issues
This PR adds a new API integration following the OpenAPI-driven approach
documented in the maintenance guide. The integration is complete and ready
for review.
Checklist
__init__.py updated in all 4 required locations