Catalog system enhancements #88
Conversation
- Add incremental updates (only refresh changed collections)
- Replace polling with event-driven enrichment via ThreadToAsyncBridge
- Add parallel schema inference with asyncio.Semaphore(5)
- Add job queue with priority and retry support
- Add 4 new MCP catalog tools
- Add documentation
Summary of Changes

Hello @teetangh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly overhauls the catalog system by introducing several key enhancements aimed at improving efficiency, responsiveness, and manageability. The changes include optimizing schema refreshes through incremental updates and parallel processing, transitioning to an event-driven model for enrichment, and implementing a robust job queue for inference tasks. Additionally, new administrative tools are provided to interact with and monitor the enhanced catalog system.
Code Review
This pull request significantly enhances the catalog system by introducing incremental updates for schema refresh, an event-driven mechanism for enrichment sampling, parallel schema inference using asyncio.Semaphore, and a priority-based job queue for managing inference tasks. It also adds new MCP tools for interacting with these catalog features. However, the review highlights several issues:

- The refresh_collection_schema tool's use of asyncio.run() is problematic for cross-thread communication with the job queue.
- A SQL injection vulnerability exists in _get_index_definitions due to f-string query construction.
- store.set_collection_metadata() causes inefficient disk I/O when called repeatedly within a loop.
- The InferenceJobQueue.enqueue method's behavior regarding duplicate jobs contradicts its docstring.
- The _get_document_count function is duplicated across modules, requiring refactoring for maintainability.
```python
# Use asyncio to run the enqueue
try:
    loop = asyncio.get_running_loop()
    # If we're in an async context, create a task
    asyncio.create_task(queue.enqueue(job))
    queued = True
except RuntimeError:
    # Not in async context - run in new loop
    queued = asyncio.run(queue.enqueue(job))
```
Using asyncio.run() here to call the async queue.enqueue(job) method is problematic. asyncio.run() creates a new event loop, runs the coroutine, and closes it. The InferenceJobQueue and its internal asyncio.Lock are tied to the event loop they are created in. This approach can lead to race conditions, deadlocks, or RuntimeError: Event loop is closed if the queue is accessed from different contexts (e.g., the worker thread's loop).
A tool, which is synchronous, should not directly interact with async components in this way. A better approach would be to use a thread-safe mechanism to pass the job to the worker's event loop, for example by having a thread-safe enqueue method on the queue that uses loop.call_soon_threadsafe internally.
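One way to realize this is to expose a synchronous, thread-safe wrapper that schedules the coroutine on the worker's own event loop via `asyncio.run_coroutine_threadsafe`. The sketch below is illustrative only: the `InferenceJobQueue` shown here is a minimal stand-in for the real class, and the assumption is that the worker thread's loop is available when the queue is constructed.

```python
import asyncio
import threading

class InferenceJobQueue:
    """Minimal stand-in: async state is bound to a single (worker) event loop."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop                       # the worker thread's loop
        self._queue: asyncio.Queue = asyncio.Queue()

    async def enqueue(self, job: str) -> bool:
        # Runs on the worker loop, so the queue is only ever touched there.
        await self._queue.put(job)
        return True

    def enqueue_threadsafe(self, job: str) -> bool:
        # Callable from any thread (e.g. a synchronous MCP tool): schedule
        # the coroutine on the worker's loop and block until it completes.
        future = asyncio.run_coroutine_threadsafe(self.enqueue(job), self._loop)
        return future.result(timeout=5)

# Usage: a worker thread owns the loop; the synchronous tool enqueues safely.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()
queue = InferenceJobQueue(loop)
queued = queue.enqueue_threadsafe("travel-sample.inventory.hotel")
```

This keeps every mutation of the queue's async internals on one loop, avoiding the cross-loop lock issues described above.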
```python
query = (
    f"SELECT meta().id, i.name, i.index_key, i.metadata.definition "
    f"FROM system:indexes as i "
    f"WHERE i.bucket_id = '{bucket_name}' "
    f"AND i.scope_id = '{scope_name}' "
    f"AND i.keyspace_id = '{collection_name}'"
)
result = await self._cluster.query(query)
```
This query is constructed using an f-string, which makes it vulnerable to SQL injection. The bucket_name, scope_name, and collection_name can originate from user input via the new refresh_collection_schema tool. Please use parameterized queries to prevent this vulnerability.
```python
query = (
    "SELECT meta().id, i.name, i.index_key, i.metadata.definition "
    "FROM system:indexes as i "
    "WHERE i.bucket_id = $bucket_name "
    "AND i.scope_id = $scope_name "
    "AND i.keyspace_id = $collection_name"
)
result = await self._cluster.query(
    query, bucket_name=bucket_name, scope_name=scope_name, collection_name=collection_name
)
```

```python
    last_infer_time=datetime.utcnow().isoformat(),
    document_count=result.document_count,
)
store.set_collection_metadata(new_metadata)
```
Calling store.set_collection_metadata() inside this loop is inefficient. This method saves the entire catalog state to disk on every call. For a large number of collections needing refresh, this will cause significant and unnecessary disk I/O. It would be better to collect all new metadata and then update the store once outside the loop.
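The batching pattern can be sketched as follows. The `CatalogStore` below is a hypothetical stand-in for the real store; the only assumption carried over from the review is that `set_collection_metadata()` persists the full state to disk on every call, so a deferred-flush flag (or equivalent) avoids the per-iteration write.

```python
# Sketch: accumulate metadata updates in memory and persist once.
import json
import tempfile
from pathlib import Path

class CatalogStore:
    """Hypothetical store: one JSON file holding all collection metadata."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._metadata: dict[str, dict] = {}

    def set_collection_metadata(self, key: str, meta: dict, *, flush: bool = True) -> None:
        self._metadata[key] = meta
        if flush:
            self.save()  # full-state disk write, as described in the review

    def save(self) -> None:
        self._path.write_text(json.dumps(self._metadata))

store = CatalogStore(Path(tempfile.mkdtemp()) / "catalog.json")

# Inside the refresh loop: update in memory only ...
for name in ["users", "orders", "events"]:
    store.set_collection_metadata(name, {"document_count": 0}, flush=False)

# ... then persist the whole batch with a single write after the loop.
store.save()
```

With N collections this turns N disk writes into one, at the cost of losing in-flight updates if the process dies mid-loop.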
```
Prevents duplicate jobs for the same collection path.
If a job with the same path exists and the new job has higher priority,
the new job will be added (old one will be skipped when dequeued).
```
The docstring for enqueue states that an existing job can be replaced by a new one with higher priority. However, the implementation simply skips adding the job if one is already pending, regardless of priority. This discrepancy should be resolved. Given that updating items in a priority queue is complex, I recommend updating the docstring to reflect the actual behavior of skipping duplicates.
```python
async def _get_document_count(bucket: AsyncBucket, scope_name: str, collection_name: str) -> int:
    """Get the document count for a collection."""
    try:
        scope = bucket.scope(name=scope_name)
        count_query = f"SELECT RAW COUNT(*) FROM `{collection_name}`"
        count_result = scope.query(count_query)
        async for row in count_result:
            return row
        return 0
    except Exception as e:
        logger.warning(f"Error getting document count for {scope_name}.{collection_name}: {e}")
        return 0
```
…log system enhancements cleanup.
Summary

This PR implements 4 enhancements to the catalog system as documented in CATALOG.md. Additionally adds 4 new MCP tools for catalog interaction.

Changes

New Files

- `src/catalog/events/bridge.py`
- `src/catalog/jobs/executor.py`
- `src/catalog/jobs/queue.py`
- `src/tools/catalog.py`
- `docs/CATALOG_ENHANCEMENTS.md`

Modified Files

- `src/catalog/store/store.py`
- `src/catalog/worker.py`
- `src/catalog/enrichment/catalog_enrichment.py`
- `src/tools/__init__.py`

New MCP Tools

- `get_catalog_status` - Returns catalog system status
- `get_collection_schema_from_catalog` - Get cached schema without running INFER
- `refresh_collection_schema` - Queue high-priority refresh for a collection
- `get_enriched_database_context` - Get LLM-enriched database context

Test plan