@benjibromberg
Contributor

Summary

This PR adds comprehensive integration with the NCBI Datasets API v2, providing
56 tools for accessing gene data, genome assemblies, taxonomy information,
virus genomes, organelle data, and biosample records. The integration uses an
OpenAPI-driven approach where the OpenAPI specification serves as the single
source of truth for all parameters, endpoints, and validation.

Features

  • 56 Tool Classes: Complete coverage of NCBI Datasets API endpoints

    • 18 Gene tools (by ID, symbol, accession, taxon, locus tag)
    • 15 Genome tools (assembly reports, annotations, sequences)
    • 8 Taxonomy tools (metadata, lineage, related IDs)
    • 9 Virus tools (genome summaries, annotations, SARS-CoV-2 data)
    • 2 Organelle tools
    • 2 Biosample tools
    • 3 Download summary tools (preview before download)
    • 1 Utility tool (version information)
  • 100% OpenAPI Parameter Coverage: All parameters from the OpenAPI
    specification are implemented in each tool

  • Automated Generation System: Configuration files and test definitions
    are auto-generated from the OpenAPI specification, ensuring easy updates
    when NCBI releases new API versions

  • Comprehensive Test Suite: 447 tests total (408 passing, 91.3% pass rate)

    • 39 known failures are upstream NCBI API issues (documented in
      KNOWN_TEST_FAILURES.md)
    • All test data dynamically generated from OpenAPI specification
    • No hardcoded test values
    • Note: This PR adds about 4 minutes to the test suite runtime, because
      each NCBI test waits 0.25s to stay within NCBI API rate limits
  • Complete Documentation:

    • User guide: docs/tools/ncbi_datasets_tools.rst (774 lines)
    • Maintenance guide: src/tooluniverse/data/specs/ncbi/README.md
    • 13 working examples in examples/ncbi_datasets_tool_example.py

Technical Implementation

OpenAPI-Driven Architecture

The integration follows a specification-driven approach:

  1. OpenAPI Specification: src/tooluniverse/data/specs/ncbi/openapi3.docs.yaml

    • Official NCBI Datasets API v2 specification
    • Single source of truth for all endpoints and parameters
  2. Auto-Generation Scripts:

    • scripts/discover_and_generate.py: Discovers endpoints and generates
      tool classes
    • scripts/update_ncbi_json_from_openapi.py: Updates JSON configurations
      from spec
  3. Tool Classes: All 56 tools in src/tooluniverse/ncbi_datasets_tool.py

    • Inherit from BaseTool
    • Support flexible parameters (single value or array)
    • Include comprehensive error handling
    • Support API key authentication for enhanced rate limits
  4. Function Wrappers: 56 wrapper functions in src/tooluniverse/tools/

    • Minimal docstrings linking to official NCBI documentation
    • Full type hints and validation
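
To make the specification-driven flow above more concrete, here is a minimal sketch of how endpoints and their parameters might be discovered from an OpenAPI 3 document. This is not the code in scripts/discover_and_generate.py: the function name `discover_endpoints`, the record layout, and the small inline spec excerpt are assumptions made for illustration only.

```python
from typing import Any

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def discover_endpoints(spec: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten an OpenAPI 3 'paths' object into one record per operation."""
    endpoints = []
    for path, path_item in spec.get("paths", {}).items():
        # Parameters declared on the path item apply to every operation under it.
        shared = path_item.get("parameters", [])
        for method, operation in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters" or "summary"
            params = shared + operation.get("parameters", [])
            endpoints.append({
                "path": path,
                "method": method.upper(),
                "operation_id": operation.get("operationId", ""),
                "path_params": [p["name"] for p in params if p.get("in") == "path"],
                "query_params": [p["name"] for p in params if p.get("in") == "query"],
            })
    return endpoints


# Hypothetical excerpt in the shape of an OpenAPI 3 document (not copied from the NCBI spec).
EXAMPLE_SPEC = {
    "paths": {
        "/gene/id/{gene_ids}": {
            "get": {
                "operationId": "gene_metadata_by_id",
                "parameters": [
                    {"name": "gene_ids", "in": "path", "required": True},
                    {"name": "page_size", "in": "query"},
                ],
            }
        }
    }
}

endpoints = discover_endpoints(EXAMPLE_SPEC)
```

Each record carries enough information to emit a tool class, a JSON configuration entry, and a parametrized test, which is what lets a single spec file drive all three.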

Test Results

447 tests total
- 408 passing (91.3%)
- 39 known failures (upstream NCBI API issues)

Test Runtime Impact: This PR adds 447 tests to the test suite, which
extends the overall test runtime by approximately 4 minutes (~228 seconds).
Each test includes a 0.25s delay to respect NCBI API rate limits (5-10
requests/second), ensuring reliable test execution without hitting API
throttling.
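
For reviewers who want to see where the runtime goes: 447 tests at 0.25s each is about 112 seconds of pure delay, roughly half of the ~228 seconds. The sketch below shows one way such a delay could be implemented. The PR describes a fixed per-test delay; this variant is a closely related minimum-interval throttle, and the class name `MinIntervalThrottle` and its injectable `clock` and `sleep` arguments are hypothetical, not taken from the test suite.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 0.25,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Delay budget for the suite described above: 447 tests x 0.25 s each.
total_delay_seconds = 447 * 0.25
```

Calling `wait()` before each request keeps the request rate at or below 4 per second with the default interval, which sits under the 5 requests/second limit quoted above.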

Known Failures: Documented in
src/tooluniverse/data/specs/ncbi/KNOWN_TEST_FAILURES.md. These are upstream
NCBI API issues affecting:

  • SARS-CoV-2 protein/genome table endpoints (500 errors)
  • Download summary endpoints (500 errors)

Tests are kept active to detect when NCBI fixes these issues.
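
As a rough illustration of why keeping these tests active is useful, the sketch below classifies a response so that a known-broken endpoint that starts returning success is flagged as fixed upstream. The endpoint labels and the function `classify_result` are hypothetical; the actual list of affected endpoints lives in KNOWN_TEST_FAILURES.md.

```python
# Hypothetical labels; the real list is documented in KNOWN_TEST_FAILURES.md.
KNOWN_UPSTREAM_FAILURES = {
    "sars2_protein_table",
    "download_summary",
}


def classify_result(endpoint: str, status_code: int) -> str:
    """Label a test outcome, separating known upstream failures from regressions."""
    known = endpoint in KNOWN_UPSTREAM_FAILURES
    succeeded = 200 <= status_code < 300
    if succeeded and known:
        return "fixed-upstream"  # the entry can be removed from the known-failures list
    if succeeded:
        return "pass"
    if known and status_code >= 500:
        return "known-failure"
    return "new-failure"  # anything else deserves a closer look
```

With this kind of labelling, a run that reports "fixed-upstream" is the signal to update the known-failures document.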

Upstream Compatibility

Merge Tested: Successfully merged with upstream/main

  • Single conflict in src/tooluniverse/__init__.py (resolved)
  • All 56 NCBI tools work correctly with upstream's latest tools
  • ToolUniverse loads successfully with 723 total tools

Files Changed

Core Implementation

  • src/tooluniverse/ncbi_datasets_tool.py: 56 tool classes
  • src/tooluniverse/data/ncbi_datasets_tools.json: Tool configurations
  • src/tooluniverse/tools/ncbi_datasets_*.py: 56 wrapper functions
  • src/tooluniverse/__init__.py: Updated imports and exports (4 locations)

Specifications and Maintenance

  • src/tooluniverse/data/specs/ncbi/: Complete directory
    • openapi3.docs.yaml: Official OpenAPI specification
    • README.md: Maintenance guide for contributors
    • KNOWN_TEST_FAILURES.md: Documentation of known API issues
    • scripts/discover_and_generate.py: Auto-generation script
    • scripts/update_ncbi_json_from_openapi.py: JSON config updater

Tests

  • tests/tools/test_ncbi_datasets_tool.py: Comprehensive test suite
    • 447 tests (408 passing)
    • All test data from OpenAPI specification
    • Rate limiting to respect NCBI API limits

Documentation

  • docs/tools/ncbi_datasets_tools.rst: Complete user documentation (774 lines)
  • examples/ncbi_datasets_tool_example.py: 13 working examples

API Key Support

Tools support optional API key authentication via the NCBI_API_KEY
environment variable, which raises the rate limit from the default 5 rps to
10 rps. See
docs/tools/ncbi_datasets_tools.rst for setup instructions.
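
A minimal sketch of how the optional key could be picked up and turned into request settings. The function name `build_request_settings` is hypothetical, and sending the key in an `api-key` header is an assumption that should be checked against the NCBI documentation; only the NCBI_API_KEY variable name and the 5 and 10 rps figures come from this PR.

```python
import os
from typing import Mapping, Optional

DEFAULT_RPS = 5   # rate quoted above without an API key
API_KEY_RPS = 10  # rate quoted above with an API key


def build_request_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Derive headers and a request interval from the NCBI_API_KEY variable."""
    env = os.environ if env is None else env
    api_key = env.get("NCBI_API_KEY", "").strip()
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: the key travels in an "api-key" header.
        headers["api-key"] = api_key
    rps = API_KEY_RPS if api_key else DEFAULT_RPS
    return {
        "headers": headers,
        "requests_per_second": rps,
        "min_interval": 1.0 / rps,  # seconds to wait between requests
    }
```

Passing a plain dict as `env` keeps the function easy to test without touching the real environment.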

Usage Example

from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Get gene metadata by ID
result = tu.run({
    "name": "ncbi_datasets_gene_by_id",
    "arguments": {"gene_ids": 59067}
})

# Get taxonomy metadata
result = tu.run({
    "name": "ncbi_datasets_taxonomy_metadata",
    "arguments": {"taxons": "9606"}  # Human
})
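
The example above passes a single gene ID, but the tools also accept arrays for such parameters. Below is a sketch of how a single value or a list might be normalized into one path segment. The helper name `to_path_segment` and the comma-separated encoding are assumptions for illustration, not the code in ncbi_datasets_tool.py.

```python
from typing import Iterable, Union
from urllib.parse import quote

Scalar = Union[int, str]


def to_path_segment(value: Union[Scalar, Iterable[Scalar]]) -> str:
    """Render one value or several values as a single URL path segment."""
    if isinstance(value, (str, int)):
        items = [value]  # a lone value is treated as a one-element list
    else:
        items = list(value)
    if not items:
        raise ValueError("at least one value is required")
    # Assumption: multiple values are joined with commas inside the path.
    return ",".join(quote(str(item), safe="") for item in items)
```

With this helper, `to_path_segment(59067)` and `to_path_segment([59067])` produce the same segment, so callers do not need to care which form they pass.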

Maintenance

Future updates to the NCBI Datasets API can be easily integrated by:

  1. Updating openapi3.docs.yaml with the new specification
  2. Running python src/tooluniverse/data/specs/ncbi/scripts/discover_and_generate.py
  3. Running tests to validate changes

See src/tooluniverse/data/specs/ncbi/README.md for detailed maintenance
instructions.
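
The scripted part of this workflow (steps 2 and 3) could be wrapped as shown in the sketch below; step 1 remains a manual file update. The helper names are hypothetical, and running the suite through `python -m pytest` is an assumption, since the PR only says to run the tests.

```python
import subprocess
import sys
from typing import List, Sequence

GENERATOR = "src/tooluniverse/data/specs/ncbi/scripts/discover_and_generate.py"
TEST_FILE = "tests/tools/test_ncbi_datasets_tool.py"


def maintenance_commands(python: str = sys.executable) -> List[List[str]]:
    """Return the commands for regenerating the tools and validating them."""
    return [
        [python, GENERATOR],
        # Assumption: the suite is run with pytest.
        [python, "-m", "pytest", TEST_FILE],
    ]


def run_maintenance(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for command in commands:
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(list(command), check=True)
```

The dry-run default only prints the commands, which makes it safe to try before touching generated files.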

Related Issues

This PR adds a new API integration following the OpenAPI-driven approach
documented in the maintenance guide. The integration is complete and ready
for review.

Checklist

  • All 56 tools implemented and tested
  • 100% OpenAPI parameter coverage
  • Comprehensive test suite (447 tests)
  • User documentation complete
  • Examples provided and tested
  • Maintenance guide included
  • Upstream compatibility verified
  • Code follows ToolUniverse standards
  • __init__.py updated in all 4 required locations

Integrates high-coverage NCBI Datasets API tools with auto-generated tool classes, wrappers, and JSON configs, supporting gene, genome, taxonomy, and virus queries.

Introduces OpenAPI-driven discovery and code generation scripts, enabling maintenance automation and parameter synchronization. Ensures all tool schemas and parameters remain up to date with the evolving NCBI Datasets OpenAPI spec, minimizing manual drift.

Provides an extensive, parametrized test suite for functionality, error handling, rate limits, and OpenAPI compliance, supporting robust, future-proof integration. Lays groundwork for continuous tool API maintenance and easy coverage extension as NCBI adds endpoints.

Introduces support for retrieving taxonomy dataset reports using NCBI taxon identifiers, including function, tool class, JSON schema, and integration into the tool universe. Enhances automation and code generation logic to handle flexible path parameters for endpoints that accept both single values and arrays.

Improves coverage of NCBI Datasets API tools, enabling users to access richer taxonomic metadata across various taxonomic ranks.

Updates the parameter-building logic to use string concatenation that properly separates conditional parameter blocks with newlines. Prevents formatting issues in generated query code, ensuring parameters are correctly added when present.

Adds support for additional flexible path parameters such as locus tags, assembly names, bioprojects, biosample IDs, proteins, tax IDs, and WGS accessions, enabling single values or lists for these inputs.

Improves parameter description logic by extracting the first word from descriptions or falling back to parameter names, enhancing auto-generated documentation clarity.

Updates response construction to include path parameters for better context.

These changes improve tool flexibility and generated API documentation quality.

Introduces new auto-generated tools for NCBI Datasets API endpoints that provide dataset reports by gene ID, accession, taxon, locus tag, and for viruses and genomes by various identifiers. Updates initialization, lazy loading, and exports to support these tools and registers their schemas and Python client functions.

Enables broader and more granular access to NCBI Datasets metadata, allowing easier integration and improved flexibility for downstream consumers.

Adds comprehensive integration with the NCBI Datasets API, introducing 56 new tools for accessing gene data, genome assemblies, taxonomy information, virus genomes, organelle data, and biosample records. This update includes auto-generated tool classes, detailed documentation, and a maintenance guide, enhancing the API's usability and flexibility for researchers. Additionally, known test failures are documented to improve testing transparency.

Combined NCBI Datasets tools with upstream's new tools (OLS, ClinVar, literature search tools). Updated type annotations, imports, lazy proxies, and __all__ list to include both sets of tools.

@benjibromberg
Contributor Author

Tried to do my best here, but let me know if I missed anything that I can fix!

@gasvn
Member

gasvn commented Nov 18, 2025

Looks good to me, thank you! I will test these tools on my side and merge them ASAP!
