
Conversation


@teetangh commented on Jan 7, 2026

Summary

This PR implements 4 enhancements to the catalog system as documented in CATALOG.md:

  • Incremental Updates: Only refresh collections with a 10%+ document-count change or a last refresh older than 1 hour
  • Event-Driven Sampling: Replace 2-minute polling with immediate triggering via ThreadToAsyncBridge
  • Parallel Schema Inference: Run up to 5 concurrent INFER queries using asyncio.gather + Semaphore (sketched together with the incremental check after this list)
  • Job Queue: Priority queue with HIGH/NORMAL/LOW priorities and automatic retries (max 3)
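
As a rough illustration of how the incremental check and the parallel inference compose, here is a minimal sketch; `needs_refresh`, `run_infer`, and the collection attributes are illustrative stand-ins, not the PR's actual API:

```python
import asyncio
from datetime import datetime, timedelta, timezone

REFRESH_THRESHOLD = 0.10          # refresh on a 10%+ document-count change
MAX_AGE = timedelta(hours=1)      # ...or when the last refresh is over an hour old
MAX_CONCURRENT_INFER = 5          # cap on concurrent INFER queries


def needs_refresh(prev_count: int, new_count: int, last_refresh: datetime) -> bool:
    """Incremental-update check: skip collections that have not changed enough."""
    if prev_count == 0:
        return True
    change = abs(new_count - prev_count) / prev_count
    age = datetime.now(timezone.utc) - last_refresh
    return change >= REFRESH_THRESHOLD or age > MAX_AGE


async def refresh_collections(collections, run_infer):
    """Run INFER only for the stale collections, at most 5 at a time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_INFER)

    async def infer_one(coll):
        async with semaphore:             # limits concurrency to 5
            return await run_infer(coll)  # run_infer: placeholder coroutine per collection

    stale = [c for c in collections
             if needs_refresh(c.prev_count, c.new_count, c.last_refresh)]
    return await asyncio.gather(*(infer_one(c) for c in stale))
```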

Additionally adds 4 new MCP tools for catalog interaction.

Changes

New Files

| File | Purpose |
| --- | --- |
| src/catalog/events/bridge.py | ThreadToAsyncBridge for cross-thread signaling |
| src/catalog/jobs/executor.py | ParallelInferenceExecutor with semaphore |
| src/catalog/jobs/queue.py | InferenceJob, InferenceJobQueue |
| src/tools/catalog.py | 4 new MCP tools |
| docs/CATALOG_ENHANCEMENTS.md | Documentation |

Modified Files

| File | Changes |
| --- | --- |
| src/catalog/store/store.py | Added CollectionMetadata for incremental tracking |
| src/catalog/worker.py | Incremental updates + parallel execution |
| src/catalog/enrichment/catalog_enrichment.py | Event-driven enrichment instead of polling |
| src/tools/__init__.py | Register new catalog tools |

New MCP Tools

  1. get_catalog_status - Returns the catalog system status
  2. get_collection_schema_from_catalog - Returns the cached schema without running INFER
  3. refresh_collection_schema - Queues a high-priority refresh for a collection
  4. get_enriched_database_context - Returns the LLM-enriched database context

Test plan

  • Verify incremental updates skip unchanged collections
  • Verify event-driven enrichment triggers immediately on schema change
  • Verify parallel execution respects concurrency limit
  • Verify job queue handles priorities correctly
  • Test new MCP tools return expected data
  • Verify backward compatibility with existing catalog state files

- Add incremental updates (only refresh changed collections)
- Replace polling with event-driven enrichment via ThreadToAsyncBridge
- Add parallel schema inference with asyncio.Semaphore(5)
- Add job queue with priority and retry support
- Add 4 new MCP catalog tools
- Add documentation
@gemini-code-assist

Summary of Changes

Hello @teetangh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the catalog system by introducing several key enhancements aimed at improving efficiency, responsiveness, and manageability. The changes include optimizing schema refreshes through incremental updates and parallel processing, transitioning to an event-driven model for enrichment, and implementing a robust job queue for inference tasks. Additionally, new administrative tools are provided to interact with and monitor the enhanced catalog system.

Highlights

  • Incremental Updates: The catalog system now only refreshes collections if their document count changes by 10% or more, or if their last refresh was over an hour ago, significantly reducing unnecessary processing.
  • Event-Driven Sampling: Replaced the previous 2-minute polling mechanism for enrichment with an immediate, event-driven trigger using a ThreadToAsyncBridge, improving responsiveness and efficiency.
  • Parallel Schema Inference: Schema inference (INFER queries) can now run concurrently for up to 5 collections using asyncio.gather and asyncio.Semaphore, drastically speeding up catalog refreshes for clusters with many collections.
  • Job Queue for Inference: Introduced a priority-based job queue with HIGH/NORMAL/LOW priorities and automatic retry logic (up to 3 retries), decoupling scheduling from execution and enabling on-demand refreshes (a minimal sketch follows this list).
  • New MCP Tools: Added four new MCP tools: get_catalog_status, get_collection_schema_from_catalog, refresh_collection_schema, and get_enriched_database_context for enhanced catalog interaction.
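
A minimal sketch of the queue semantics described above, built on asyncio.PriorityQueue; the Priority enum, the InferenceJob fields, and the retry bookkeeping here are illustrative assumptions and do not reproduce the PR's actual classes:

```python
import asyncio
import itertools
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 0      # lower value dequeues first
    NORMAL = 1
    LOW = 2


@dataclass(order=True)
class InferenceJob:
    priority: Priority
    seq: int                                          # tie-breaker: FIFO within a priority
    collection_path: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


class InferenceJobQueue:
    MAX_RETRIES = 3

    def __init__(self) -> None:
        self._queue = asyncio.PriorityQueue()
        self._seq = itertools.count()

    async def enqueue(self, path: str, priority: Priority = Priority.NORMAL) -> None:
        await self._queue.put(InferenceJob(priority, next(self._seq), path))

    async def run(self, execute) -> None:
        """Dequeue jobs forever; re-enqueue failed jobs up to MAX_RETRIES times."""
        while True:
            job = await self._queue.get()
            try:
                await execute(job.collection_path)    # execute: coroutine doing the INFER work
            except Exception:
                if job.attempts < self.MAX_RETRIES:   # automatic retry, max 3
                    job.attempts += 1
                    await self._queue.put(job)
            finally:
                self._queue.task_done()
```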




@gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the catalog system by introducing incremental updates for schema refresh, an event-driven mechanism for enrichment sampling, parallel schema inference using asyncio.Semaphore, and a priority-based job queue for managing inference tasks. It also adds new MCP tools for interacting with these catalog features. However, the review highlights several issues:

  • The refresh_collection_schema tool's use of asyncio.run() is problematic for cross-thread communication with the job queue.
  • A SQL injection vulnerability exists in _get_index_definitions due to f-string query construction.
  • store.set_collection_metadata() causes inefficient disk I/O when called repeatedly within a loop.
  • The InferenceJobQueue.enqueue method's behavior regarding duplicate jobs contradicts its docstring.
  • The _get_document_count function is duplicated across modules and should be refactored for maintainability.

Comment on lines +203 to +211
# Use asyncio to run the enqueue
try:
    loop = asyncio.get_running_loop()
    # If we're in an async context, create a task
    asyncio.create_task(queue.enqueue(job))
    queued = True
except RuntimeError:
    # Not in async context - run in new loop
    queued = asyncio.run(queue.enqueue(job))

critical

Using asyncio.run() here to call the async queue.enqueue(job) method is problematic. asyncio.run() creates a new event loop, runs the coroutine, and closes it. The InferenceJobQueue and its internal asyncio.Lock are tied to the event loop they are created in. This approach can lead to race conditions, deadlocks, or RuntimeError: Event loop is closed if the queue is accessed from different contexts (e.g., the worker thread's loop).

A tool, which is synchronous, should not directly interact with async components in this way. A better approach would be to use a thread-safe mechanism to pass the job to the worker's event loop, for example by having a thread-safe enqueue method on the queue that uses loop.call_soon_threadsafe internally.
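
One way to realize this suggestion is to hand the coroutine to the worker's own loop with asyncio.run_coroutine_threadsafe; a minimal sketch, assuming the worker thread exposes its running loop (worker_loop is a hypothetical handle, not something the PR currently provides):

```python
import asyncio


def enqueue_from_tool(queue, job, worker_loop: asyncio.AbstractEventLoop) -> bool:
    """Submit queue.enqueue(job) onto the worker's event loop from synchronous tool code.

    Because the coroutine runs on worker_loop, the queue's asyncio primitives
    (e.g. its internal Lock) stay bound to a single event loop.
    """
    future = asyncio.run_coroutine_threadsafe(queue.enqueue(job), worker_loop)
    return future.result(timeout=5)  # block briefly for confirmation; raises on failure
```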

Comment on lines +247 to +254
query = (
    f"SELECT meta().id, i.name, i.index_key, i.metadata.definition "
    f"FROM system:indexes as i "
    f"WHERE i.bucket_id = '{bucket_name}' "
    f"AND i.scope_id = '{scope_name}' "
    f"AND i.keyspace_id = '{collection_name}'"
)
result = await self._cluster.query(query)

high

This query is constructed using an f-string, which makes it vulnerable to SQL injection. The bucket_name, scope_name, and collection_name can originate from user input via the new refresh_collection_schema tool. Please use parameterized queries to prevent this vulnerability.

            query = (
                "SELECT meta().id, i.name, i.index_key, i.metadata.definition "
                "FROM system:indexes as i "
                "WHERE i.bucket_id = $bucket_name "
                "AND i.scope_id = $scope_name "
                "AND i.keyspace_id = $collection_name"
            )
            result = await self._cluster.query(
                query, bucket_name=bucket_name, scope_name=scope_name, collection_name=collection_name
            )
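
Equivalently, if passing named parameters as keyword arguments is not supported in this code path, they can be supplied through QueryOptions; this is a drop-in variant of the suggestion above, assuming the standard Couchbase Python SDK options API:

```python
from couchbase.options import QueryOptions

result = await self._cluster.query(
    query,
    QueryOptions(named_parameters={
        "bucket_name": bucket_name,
        "scope_name": scope_name,
        "collection_name": collection_name,
    }),
)
```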

            last_infer_time=datetime.utcnow().isoformat(),
            document_count=result.document_count,
        )
        store.set_collection_metadata(new_metadata)

high

Calling store.set_collection_metadata() inside this loop is inefficient. This method saves the entire catalog state to disk on every call. For a large number of collections needing refresh, this will cause significant and unnecessary disk I/O. It would be better to collect all new metadata and then update the store once outside the loop.
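
A sketch of the batching suggested here; collections_to_refresh, infer_collection, and set_collection_metadata_bulk are hypothetical names used only to make the shape of the fix concrete:

```python
# Accumulate metadata while iterating, then persist once after the loop.
pending = []
for collection in collections_to_refresh:
    result = await infer_collection(collection)  # placeholder for the per-collection refresh
    pending.append(
        CollectionMetadata(
            last_infer_time=datetime.utcnow().isoformat(),
            document_count=result.document_count,
        )
    )

store.set_collection_metadata_bulk(pending)      # one write to disk instead of N
```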

Comment on lines +99 to +101
Prevents duplicate jobs for the same collection path.
If a job with the same path exists and the new job has higher priority,
the new job will be added (old one will be skipped when dequeued).

medium

The docstring for enqueue states that an existing job can be replaced by a new one with higher priority. However, the implementation simply skips adding the job if one is already pending, regardless of priority. This discrepancy should be resolved. Given that updating items in a priority queue is complex, I recommend updating the docstring to reflect the actual behavior of skipping duplicates.
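
If the docstring is updated to match the implementation, the skip-duplicates behavior might read roughly as follows; self._lock, self._pending_paths, and self._queue are assumed internals shown only to make the recommended semantics concrete:

```python
async def enqueue(self, job: "InferenceJob") -> bool:
    """Add a job to the queue.

    Duplicate jobs for the same collection path are skipped: if a job for
    job.collection_path is already pending, the new job is dropped
    regardless of its priority, and False is returned.
    """
    async with self._lock:
        if job.collection_path in self._pending_paths:
            return False                      # skip duplicate, whatever its priority
        self._pending_paths.add(job.collection_path)
        await self._queue.put(job)
        return True
```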

Comment on lines +113 to +124
async def _get_document_count(bucket: AsyncBucket, scope_name: str, collection_name: str) -> int:
    """Get the document count for a collection."""
    try:
        scope = bucket.scope(name=scope_name)
        count_query = f"SELECT RAW COUNT(*) FROM `{collection_name}`"
        count_result = scope.query(count_query)
        async for row in count_result:
            return row
        return 0
    except Exception as e:
        logger.warning(f"Error getting document count for {scope_name}.{collection_name}: {e}")
        return 0

medium

This function _get_document_count is a duplicate of the one in src/catalog/jobs/executor.py. To improve maintainability and avoid bugs from inconsistent changes, this logic should be centralized in one place, likely in the executor module, and reused here.
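
A minimal way to centralize this, assuming the executor module's copy becomes the canonical one (the exact import path depends on how the package is laid out):

```python
# In src/catalog/worker.py: reuse the shared helper instead of redefining it.
from catalog.jobs.executor import _get_document_count
```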

@teetangh closed this on Jan 12, 2026
@nithishr deleted the catalog-enhancements branch on January 16, 2026 at 13:27