feat: on-demand keyed data sources for Breeze Buddy#839
Conversation
Adds a pluggable connector layer on top of the inline Google Sheets data source so the agent loads one keyed slice (e.g. a single sheet tab) on demand mid-call, instead of injecting every source up front. - connector registry (type -> connector) with lazy bootstrap so the builtin connector is always registered, even in a cold process - GoogleSheetConnector: keyed fetch / list_keys / cache_key (the only module that knows "tab"/"spreadsheet") - DataSourceRef gains `type` + `mode` (eager|on_demand) + connector_config() - load_data_source global builtin: cache-first (prefetched), fail-open keyed load; global fn so it never transitions the node - loader skips on_demand sources in eager injection (keeps context small) - prefetch warms every tab for on_demand sources (first load is a hit) - 24 tests (registry, connector, model, handler, prefetch, cold-process wiring) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
WalkthroughAdds Google Sheets as an inline data source for Breeze Buddy templates. Introduces ChangesGoogle Sheets Data Source Integration
Sequence Diagram(s)sequenceDiagram
participant Client as Admin Client
participant API as FastAPI Router
participant SheetsHandler as data_sources/handlers.py
participant SheetsService as sheets.py
participant GCP as Google Sheets API
rect rgba(100, 149, 237, 0.5)
note over Client,GCP: Template Configuration Discovery
Client->>API: GET /data-sources/sheets/tabs?spreadsheet_url=...
API->>SheetsHandler: list_tabs_handler(url, user)
SheetsHandler->>SheetsService: extract_spreadsheet_id(url)
SheetsService-->>SheetsHandler: spreadsheet_id (or HTTP 400)
SheetsHandler->>SheetsService: list_tabs(spreadsheet_id)
SheetsService->>GCP: GET spreadsheets/{id} via thread pool
GCP-->>SheetsService: sheet metadata
SheetsService-->>SheetsHandler: [tab names]
SheetsHandler-->>Client: TabsResponse
end
rect rgba(144, 238, 144, 0.5)
note over Client,GCP: Runtime Voice Call Dispatch
Client->>API: POST /calls (template with data_sources)
API->>Worker: dispatch call
Worker->>Worker: asyncio.create_task(prefetch_data_sources(template))
Worker->>FlowConfigLoader: load_template()
FlowConfigLoader->>SheetsService: fetch_formatted() for EAGER sources
SheetsService->>GCP: fetch via thread pool
GCP-->>SheetsService: row data
SheetsService-->>FlowConfigLoader: formatted string
FlowConfigLoader->>Redis: setex(cache_key, TTL, content)
FlowConfigLoader-->>Worker: template_vars injected with datasource_<name>
end
rect rgba(255, 165, 0, 0.5)
note over Client,GCP: LLM On-Demand Data Load
Worker->>LLM: conversation with template context
LLM->>load_data_source: call(source="prices", key="Sheet1")
load_data_source->>Redis: get(cache_key)
alt cache hit
Redis-->>load_data_source: cached content
else cache miss
load_data_source->>GoogleSheetConnector: fetch(config, key)
GoogleSheetConnector->>SheetsService: fetch_formatted()
SheetsService->>GCP: GET values/{range}
GCP-->>SheetsService: values
SheetsService-->>load_data_source: formatted string
load_data_source->>Redis: setex(cache_key, 60, content)
end
load_data_source-->>LLM: {status, source, key, content}
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 16
🧹 Nitpick comments (4)
app/api/routers/breeze_buddy/data_sources/__init__.py (1)
31-34: ⚡ Quick winAdd explicit return type hints to route handlers.
get_sheet_tabs,get_sheet_columns, andpreview_sheetare missing return annotations. Add-> TabsResponse,-> ColumnsResponse, and-> PreviewResponsefor guideline compliance and clearer contracts.Proposed fix
async def get_sheet_tabs( spreadsheet_url: str = Query(..., description="Full Google Sheets URL"), current_user: UserInfo = Depends(get_current_user_with_rbac), -): +) -> TabsResponse: @@ async def get_sheet_columns( @@ current_user: UserInfo = Depends(get_current_user_with_rbac), -): +) -> ColumnsResponse: @@ async def preview_sheet( @@ current_user: UserInfo = Depends(get_current_user_with_rbac), -): +) -> PreviewResponse:As per coding guidelines, “Add type hints on all function signatures.”
Also applies to: 41-47, 54-62
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/api/routers/breeze_buddy/data_sources/__init__.py` around lines 31 - 34, Add explicit return type annotations to the three async route handler functions to comply with coding guidelines. For the function get_sheet_tabs, add the return type hint TabsResponse. For get_sheet_columns (located around lines 41-47), add the return type hint ColumnsResponse. For preview_sheet (located around lines 54-62), add the return type hint PreviewResponse. Each return type annotation should be placed after the closing parenthesis of the function parameters and before the colon that precedes the function body.Source: Coding guidelines
app/ai/voice/agents/breeze_buddy/template/loader.py (1)
341-342: ⚡ Quick winSilent exception swallowing hinders debugging.
The
try-except-passblocks suppress all Redis errors without any logging. While fail-open semantics are correct here, adding a debug-level log would help operators diagnose intermittent cache issues.Proposed fix
try: redis = await get_redis_service() cached = await redis.get(cache_key) if cached: return cached - except Exception: - pass + except Exception as exc: + logger.debug("Cache read failed for '%s': %s", ref.name, exc)try: redis = await get_redis_service() await redis.setex(cache_key, content, ttl_seconds=60) - except Exception: - pass + except Exception as exc: + logger.debug("Cache write failed for '%s': %s", ref.name, exc)Also applies to: 367-368
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/ai/voice/agents/breeze_buddy/template/loader.py` around lines 341 - 342, The except Exception blocks with only pass statements suppress all Redis errors without any logging, making it difficult to debug intermittent cache issues. Replace the pass statements in both exception handlers (at lines 341-342 and 367-368) with a debug-level log statement that captures the actual exception details. This allows operators to diagnose cache-related problems while maintaining the fail-open semantics.app/database/queries/breeze_buddy/template.py (1)
183-187: ⚡ Quick winMetadata list query is over-fetching
data_sources.Line 186 pulls
data_sources, but the accessor maps list results intoTemplateMetadatawithout this field. This adds unnecessary JSONB I/O on a hot listing path.Suggested patch
- name, is_active, supported_channels, data_sources, created_at, updated_at + name, is_active, supported_channels, created_at, updated_at🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/database/queries/breeze_buddy/template.py` around lines 183 - 187, The SELECT statement in the template metadata list query is unnecessarily fetching the data_sources field, which is a JSONB column that adds I/O overhead. Remove data_sources from the SELECT clause on line 186 since the TemplateMetadata object that receives these query results does not have a corresponding field for it. This will improve performance on this frequently-used listing endpoint by eliminating unnecessary JSONB data retrieval.pyproject.toml (1)
26-26: Remove the unusedgoogle-api-python-clientdependency.No imports of
googleapiclientexist in the codebase. Since the Sheets implementation usesgoogle-auth+ direct REST calls instead, this dependency adds unnecessary supply-chain and startup overhead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` at line 26, Remove the unused google-api-python-client dependency from the pyproject.toml file. Since the codebase does not import googleapiclient and the Sheets implementation uses google-auth with direct REST calls instead, delete the line containing the google-api-python-client>=2.86.0 dependency declaration to reduce supply-chain and startup overhead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/ai/voice/agents/breeze_buddy/handlers/internal/load_data_source.py`:
- Around line 79-91: The Redis method calls in the load_data_source function are
missing the required namespace parameter according to coding guidelines. Update
the redis.get() call that retrieves the cached value using cache_key to include
the namespace parameter as specified in the guidelines. Additionally, update the
redis.setex() call around line 113 to also include the namespace parameter in
both locations where Redis methods are invoked.
In `@app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py`:
- Around line 79-82: The redis.setex() method calls in the file are missing the
required namespace parameter according to coding guidelines. Locate all
redis.setex() calls (at line 82 in the try block and at lines 111-114 in the
exception handler) and add the namespace parameter to each call to comply with
the coding standards for Redis operations.
- Around line 37-45: The _cache_key() function duplicates cache key generation
logic and causes inconsistency between _prefetch_one() and _prefetch_on_demand()
methods, where _prefetch_one uses _cache_key(ref) and _prefetch_on_demand uses
connector.cache_key(config, key). Remove the _cache_key() function entirely and
refactor _prefetch_one() to use connector.cache_key() with the same parameters
passed to _prefetch_on_demand() to ensure both methods generate identical cache
keys for the same data source.
In `@app/ai/voice/agents/breeze_buddy/template/field_reference.json`:
- Line 21: The description field for the data_sources object in
field_reference.json incorrectly documents how template variables are injected
at runtime. Update the description to accurately reflect the actual runtime
behavior: eager data sources are injected as {datasource_<name>} template
variables (not {name}), and on-demand sources are not injected wholesale into
the LLM context. Correct the documentation to match the actual template variable
naming convention and clarify the distinction between eager and on-demand source
injection behavior.
In `@app/ai/voice/agents/breeze_buddy/template/loader.py`:
- Around line 336-342: The Redis calls within the try block and the subsequent
write operations are missing the required namespace parameter in their method
invocations. Add the namespace parameter to both the redis.get() call (which
retrieves the cached value using cache_key) and the redis.setex() calls that
write to the cache, ensuring all Redis operations follow the coding guidelines
by specifying a namespace to prevent key collisions across services.
- Around line 329-334: The cache key computation in the loader.py file (the code
block computing cache_key with spreadsheet_id, sheet, cols_digest, and
format.value) duplicates the implementation from GoogleSheetConnector.cache_key
in sheets_connector.py. To fix this, refactor the code to call the connector's
cache_key method instead of recomputing the key locally, ensuring both
eager-loaded and on-demand handlers use the same key generation logic and
preventing cache misses or stale data when implementations diverge.
In `@app/api/routers/breeze_buddy/data_sources/handlers.py`:
- Around line 90-95: The issue is that when the preview returns zero rows, the
columns field in the PreviewResponse becomes an empty list even if the caller
explicitly requested specific columns. To fix this, capture the requested
columns from the function parameters or request object, and modify the logic
that determines col_names so that when rows is empty, it uses the requested
columns instead of defaulting to an empty list. This ensures the API returns
deterministic column information regardless of whether data rows are present.
- Around line 60-64: The ColumnsResponse at the get_column_headers call returns
an empty string for sheet_name when it is None, but the columns are actually
fetched from a specific default tab. Instead of returning an empty string on
line 63, determine the actual resolved tab name that was used by
get_column_headers (the default first tab when sheet_name is not provided) and
return that resolved name in the ColumnsResponse construction. This ensures the
response accurately reflects which tab the columns were retrieved from and
eliminates ambiguity for clients consuming this endpoint.
In `@app/api/routers/breeze_buddy/templates/handlers.py`:
- Around line 543-547: In the data dictionary being constructed around the
data_sources assignment, replace the conditional that sets data_sources to None
when omitted with logic that only includes the data_sources key if
template_data.data_sources is actually provided. Instead of using else None,
conditionally add the data_sources field to the dictionary only when the field
has a value, so omitted fields do not overwrite existing database values with
NULL.
In `@app/database/accessor/breeze_buddy/template.py`:
- Line 134: The current code uses a truthiness check when converting
data_sources to JSON, which causes empty lists to be treated as falsy and
converted to None (SQL NULL). Replace the truthiness check with an explicit `is
not None` comparison in both occurrences (line 134 and line 428) to preserve
empty lists as empty JSON arrays instead of coercing them to NULL. This ensures
the distinction between an empty list and a null value is maintained in the
database.
In `@app/services/data_sources.py`:
- Around line 49-60: The _bootstrapped flag is being set to True before the
connector imports execute, so if an import fails, the flag remains True and
subsequent calls to _bootstrap_builtin_connectors() will return early without
retrying the failed import. Move the _bootstrapped = True assignment to after
all the import statements (both the google.sheets_connector import and any other
connectors mentioned in the "Also applies to" comment) have completed
successfully, so that if any import fails, the flag remains False and the
bootstrap can be retried on the next call.
In `@app/services/google/sheets_connector.py`:
- Around line 30-31: The issue is that `config.get("spreadsheet_url", "")` can
return None or non-string values even though a default empty string is provided,
and passing these directly to `extract_spreadsheet_id()` causes a TypeError
instead of gracefully handling the missing/invalid ID. Fix this at all three
locations (around lines 30, 48, and 55) by normalizing the spreadsheet_url value
to ensure it is always a string type before passing it to
`extract_spreadsheet_id()`. Convert the config value to a string or empty string
to preserve the fail-open behavior where the `if not spreadsheet_id:` checks
properly trigger the DATA_SOURCE_UNAVAILABLE flow without exceptions.
- Around line 32-35: The logger.warning call in the GoogleSheetConnector class
is using printf-style formatting with %s, but Loguru only supports curly-brace
style formatting with {}. Replace the %s placeholder in the format string with
{} to ensure the spreadsheet_url value from config.get("spreadsheet_url") is
properly interpolated and logged.
In `@app/services/google/sheets.py`:
- Around line 247-269: Rename the `format` parameter to avoid shadowing Python's
builtin (e.g., change it to `format_type` or `output_format`). This requires
updating the parameter name in the function definition in sheets.py (around line
247), and then updating all call sites that pass the `format=` keyword argument:
in sheets_connector.py, the three calls in tests/test_google_sheets.py, the mock
function signature in tests/test_data_source_connectors.py, the call in
app/ai/voice/agents/breeze_buddy/template/loader.py, and the call in
app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py. Ensure all
five call sites use the new parameter name consistently to avoid TypeError for
unexpected keyword arguments.
In `@tests/test_data_source_connectors.py`:
- Around line 102-108: The _fake_fetch_formatted function uses `format` as a
parameter name, which shadows Python's builtin function and triggers Ruff A002.
Refactor the function signature to accept **kwargs instead of named parameters
(spreadsheet_id, sheet_name, columns, format), then update the captured.update()
call to read values from kwargs using keys like kwargs["format"],
kwargs["spreadsheet_id"], etc. Apply the same fix to the other occurrence
mentioned at lines 131-133.
- Around line 66-72: The subprocess.run call in the test lacks a timeout
parameter which can cause the test suite to hang indefinitely, and it also
triggers the Ruff S603 security lint rule. Add an explicit timeout parameter to
the subprocess.run call to prevent hangs, and add a noqa comment (# noqa: S603)
to the same line to annotate that this subprocess use is intentional and
suppress the lint gate failure.
---
Nitpick comments:
In `@app/ai/voice/agents/breeze_buddy/template/loader.py`:
- Around line 341-342: The except Exception blocks with only pass statements
suppress all Redis errors without any logging, making it difficult to debug
intermittent cache issues. Replace the pass statements in both exception
handlers (at lines 341-342 and 367-368) with a debug-level log statement that
captures the actual exception details. This allows operators to diagnose
cache-related problems while maintaining the fail-open semantics.
In `@app/api/routers/breeze_buddy/data_sources/__init__.py`:
- Around line 31-34: Add explicit return type annotations to the three async
route handler functions to comply with coding guidelines. For the function
get_sheet_tabs, add the return type hint TabsResponse. For get_sheet_columns
(located around lines 41-47), add the return type hint ColumnsResponse. For
preview_sheet (located around lines 54-62), add the return type hint
PreviewResponse. Each return type annotation should be placed after the closing
parenthesis of the function parameters and before the colon that precedes the
function body.
In `@app/database/queries/breeze_buddy/template.py`:
- Around line 183-187: The SELECT statement in the template metadata list query
is unnecessarily fetching the data_sources field, which is a JSONB column that
adds I/O overhead. Remove data_sources from the SELECT clause on line 186 since
the TemplateMetadata object that receives these query results does not have a
corresponding field for it. This will improve performance on this
frequently-used listing endpoint by eliminating unnecessary JSONB data
retrieval.
In `@pyproject.toml`:
- Line 26: Remove the unused google-api-python-client dependency from the
pyproject.toml file. Since the codebase does not import googleapiclient and the
Sheets implementation uses google-auth with direct REST calls instead, delete
the line containing the google-api-python-client>=2.86.0 dependency declaration
to reduce supply-chain and startup overhead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2b580490-0c23-4446-90a2-c0feb4c10073
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (23)
app/ai/voice/agents/breeze_buddy/dispatch/worker.pyapp/ai/voice/agents/breeze_buddy/handlers/internal/builtin_dispatcher.pyapp/ai/voice/agents/breeze_buddy/handlers/internal/load_data_source.pyapp/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.pyapp/ai/voice/agents/breeze_buddy/template/field_reference.jsonapp/ai/voice/agents/breeze_buddy/template/loader.pyapp/ai/voice/agents/breeze_buddy/template/types.pyapp/api/routers/breeze_buddy/__init__.pyapp/api/routers/breeze_buddy/data_sources/__init__.pyapp/api/routers/breeze_buddy/data_sources/handlers.pyapp/api/routers/breeze_buddy/templates/handlers.pyapp/database/accessor/breeze_buddy/template.pyapp/database/decoder/breeze_buddy/template.pyapp/database/migrations/034_add_data_sources_to_template.sqlapp/database/queries/breeze_buddy/template.pyapp/schemas/breeze_buddy/data_source.pyapp/services/data_sources.pyapp/services/google/__init__.pyapp/services/google/sheets.pyapp/services/google/sheets_connector.pypyproject.tomltests/test_data_source_connectors.pytests/test_google_sheets.py
| try: | ||
| redis = await get_redis_service() | ||
| cached = await redis.get(cache_key) | ||
| if cached: | ||
| logger.info("[load_data_source] cache hit '%s' key=%s", source, key) | ||
| return { | ||
| "status": "success", | ||
| "source": source, | ||
| "key": key, | ||
| "content": cached, | ||
| } | ||
| except Exception as exc: # never let cache issues break the call | ||
| logger.warning("[load_data_source] cache read failed: %s", exc) |
There was a problem hiding this comment.
Missing namespace parameter in Redis calls.
Same issue as in loader.py: the Redis get (line 81) and setex (line 113) calls lack the namespace parameter required by coding guidelines.
Proposed fix
# Fast path — prefetched slice already in Redis.
try:
redis = await get_redis_service()
- cached = await redis.get(cache_key)
+ cached = await redis.get(cache_key, namespace="datasource")
if cached: # Warm the cache for subsequent loads of the same slice.
try:
redis = await get_redis_service()
- await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL)
+ await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL, namespace="datasource")
except Exception as exc:Also applies to: 110-115
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 90-90: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/handlers/internal/load_data_source.py`
around lines 79 - 91, The Redis method calls in the load_data_source function
are missing the required namespace parameter according to coding guidelines.
Update the redis.get() call that retrieves the cached value using cache_key to
include the namespace parameter as specified in the guidelines. Additionally,
update the redis.setex() call around line 113 to also include the namespace
parameter in both locations where Redis methods are invoked.
Source: Coding guidelines
| def _cache_key(ref: DataSourceRef) -> str: | ||
| """Derive a stable cache key from the sheet config.""" | ||
| spreadsheet_id = extract_spreadsheet_id(ref.spreadsheet_url) | ||
| if not spreadsheet_id: | ||
| spreadsheet_id = "invalid_url" | ||
| sheet = ref.sheet_name or "_first_" | ||
| cols = json.dumps(sorted(ref.columns or [])) | ||
| cols_digest = hashlib.md5(cols.encode()).hexdigest()[:8] | ||
| return f"datasource:{spreadsheet_id}:{sheet}:{cols_digest}:{ref.format.value}" |
There was a problem hiding this comment.
Cache key logic duplicated; _prefetch_one vs _prefetch_on_demand use different key sources.
_cache_key() is the third copy of this logic (also in loader.py and sheets_connector.py). More critically, _prefetch_one() uses _cache_key(ref) while _prefetch_on_demand() uses connector.cache_key(config, key). If these produce different formats, the eager-prefetched content won't be found by the on-demand loader.
Consider unifying all cache key generation through connector.cache_key() to guarantee consistency.
Proposed refactor for _prefetch_one
async def _prefetch_one(ref: DataSourceRef) -> None:
if not ref.is_active:
return
spreadsheet_id = extract_spreadsheet_id(ref.spreadsheet_url)
if not spreadsheet_id:
logger.warning("Prefetch: invalid spreadsheet_url for '%s'", ref.name)
return
+ connector = get_connector(ref.type)
+ if connector is None:
+ logger.warning("Prefetch: no connector for type '%s'", ref.type)
+ return
+
try:
content = await asyncio.wait_for(
fetch_formatted(
spreadsheet_id=spreadsheet_id,
sheet_name=ref.sheet_name,
columns=ref.columns,
format=ref.format.value,
),
timeout=_FETCH_TIMEOUT,
)
# ... error handling ...
if not content or content == DATA_SOURCE_UNAVAILABLE:
return
- cache_key = _cache_key(ref)
+ config = ref.connector_config()
+ cache_key = connector.cache_key(config, ref.sheet_name)
try:
redis = await get_redis_service()🧰 Tools
🪛 ast-grep (0.43.0)
[info] 42-42: use jsonify instead of json.dumps for JSON output
Context: json.dumps(sorted(ref.columns or []))
Note: Security best practice.
(use-jsonify)
[warning] 43-43: Do not use insecure functions
Context: hashlib.md5(cols.encode())
Note: [CWE-327] [CWE-328].
(insecure-hash-functions)
🪛 Ruff (0.15.17)
[error] 44-44: Probable use of insecure hash functions in hashlib: md5
(S324)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py` around
lines 37 - 45, The _cache_key() function duplicates cache key generation logic
and causes inconsistency between _prefetch_one() and _prefetch_on_demand()
methods, where _prefetch_one uses _cache_key(ref) and _prefetch_on_demand uses
connector.cache_key(config, key). Remove the _cache_key() function entirely and
refactor _prefetch_one() to use connector.cache_key() with the same parameters
passed to _prefetch_on_demand() to ensure both methods generate identical cache
keys for the same data source.
| cache_key = _cache_key(ref) | ||
| try: | ||
| redis = await get_redis_service() | ||
| await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL) |
There was a problem hiding this comment.
Missing namespace parameter in Redis setex calls.
Lines 82 and 113-114 use redis.setex() without the namespace parameter required by coding guidelines.
Proposed fix
cache_key = _cache_key(ref)
try:
redis = await get_redis_service()
- await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL)
+ await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL, namespace="datasource") try:
redis = await get_redis_service()
await redis.setex(
- connector.cache_key(config, key), content, ttl_seconds=_CACHE_TTL
+ connector.cache_key(config, key), content, ttl_seconds=_CACHE_TTL, namespace="datasource"
)Also applies to: 111-115
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py` around
lines 79 - 82, The redis.setex() method calls in the file are missing the
required namespace parameter according to coding guidelines. Locate all
redis.setex() calls (at line 82 in the try block and at lines 111-114 in the
exception handler) and add the namespace parameter to each call to comply with
the coding standards for Redis operations.
Source: Coding guidelines
| "is_active": { "description": "When false, the template is soft-disabled: dispatch will not enqueue new leads for it. Set to false (not delete) when retiring a template.", "example": true }, | ||
| "supported_channels": { "description": "Channels this template may be served on. Default ['voice']. Add 'chat' to enable the Buddy Assist web-chat mode. Must have at least one entry.", "example": ["voice"] } | ||
| "supported_channels": { "description": "Channels this template may be served on. Default ['voice']. Add 'chat' to enable the Buddy Assist web-chat mode. Must have at least one entry.", "example": ["voice"] }, | ||
| "data_sources": { "description": "Google Sheets data sources attached to this template. Each ref has name, spreadsheet_url, sheet_name, columns filter, output format (markdown_table/csv/json), and is_active flag. Fetched at call dispatch and injected as {name} template vars into LLM context.", "example": [{ "name": "customer_list", "spreadsheet_url": "https://docs.google.com/spreadsheets/d/abc123", "sheet_name": "Sheet1", "columns": ["Name", "Status"], "format": "markdown_table", "is_active": true }] } |
There was a problem hiding this comment.
data_sources docs are out of sync with runtime contract.
Line 21 says values are injected as {name} at dispatch, but runtime uses {datasource_<name>} for eager sources and does not inject on-demand sources wholesale.
Suggested patch
- "data_sources": { "description": "Google Sheets data sources attached to this template. Each ref has name, spreadsheet_url, sheet_name, columns filter, output format (markdown_table/csv/json), and is_active flag. Fetched at call dispatch and injected as {name} template vars into LLM context.", "example": [{ "name": "customer_list", "spreadsheet_url": "https://docs.google.com/spreadsheets/d/abc123", "sheet_name": "Sheet1", "columns": ["Name", "Status"], "format": "markdown_table", "is_active": true }] }
+ "data_sources": { "description": "Inline data sources attached to this template. Each ref includes connector `type` and loading `mode` (`eager`/`on_demand`). Eager sources are fetched up front and injected as {datasource_<name>} template vars; on_demand sources are loaded slice-by-slice via load_data_source(source=..., key=...).", "example": [{ "name": "customer_list", "type": "google_sheet", "mode": "eager", "spreadsheet_url": "https://docs.google.com/spreadsheets/d/abc123", "sheet_name": "Sheet1", "columns": ["Name", "Status"], "format": "markdown_table", "is_active": true }] }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "data_sources": { "description": "Google Sheets data sources attached to this template. Each ref has name, spreadsheet_url, sheet_name, columns filter, output format (markdown_table/csv/json), and is_active flag. Fetched at call dispatch and injected as {name} template vars into LLM context.", "example": [{ "name": "customer_list", "spreadsheet_url": "https://docs.google.com/spreadsheets/d/abc123", "sheet_name": "Sheet1", "columns": ["Name", "Status"], "format": "markdown_table", "is_active": true }] } | |
| "data_sources": { "description": "Inline data sources attached to this template. Each ref includes connector `type` and loading `mode` (`eager`/`on_demand`). Eager sources are fetched up front and injected as {datasource_<name>} template vars; on_demand sources are loaded slice-by-slice via load_data_source(source=..., key=...).", "example": [{ "name": "customer_list", "type": "google_sheet", "mode": "eager", "spreadsheet_url": "https://docs.google.com/spreadsheets/d/abc123", "sheet_name": "Sheet1", "columns": ["Name", "Status"], "format": "markdown_table", "is_active": true }] } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/template/field_reference.json` at line 21,
The description field for the data_sources object in field_reference.json
incorrectly documents how template variables are injected at runtime. Update the
description to accurately reflect the actual runtime behavior: eager data
sources are injected as {datasource_<name>} template variables (not {name}), and
on-demand sources are not injected wholesale into the LLM context. Correct the
documentation to match the actual template variable naming convention and
clarify the distinction between eager and on-demand source injection behavior.
| sheet = ref.sheet_name or "_first_" | ||
| cols = json.dumps(sorted(ref.columns or [])) | ||
| cols_digest = hashlib.md5(cols.encode()).hexdigest()[:8] | ||
| cache_key = ( | ||
| f"datasource:{spreadsheet_id}:{sheet}:{cols_digest}:{ref.format.value}" | ||
| ) |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Cache key computation is duplicated from GoogleSheetConnector.cache_key.
The cache key logic here mirrors the implementation in app/services/google/sheets_connector.py:53-61. If the key format diverges, eager-loaded content won't be found by the on-demand handler (or vice versa), causing redundant fetches or stale data.
Consider delegating to the connector's cache_key method or extracting a shared helper to ensure consistency.
Proposed refactor using connector
+from app.services.data_sources import get_connector
async def _fetch_one_data_source(self, ref: "DataSourceRef") -> str:
if not ref.is_active:
return DATA_SOURCE_UNAVAILABLE
spreadsheet_id = extract_spreadsheet_id(ref.spreadsheet_url)
if not spreadsheet_id:
logger.warning(
"Data source '%s' has malformed spreadsheet_url: %s",
ref.name,
ref.spreadsheet_url,
)
return DATA_SOURCE_UNAVAILABLE
- sheet = ref.sheet_name or "_first_"
- cols = json.dumps(sorted(ref.columns or []))
- cols_digest = hashlib.md5(cols.encode()).hexdigest()[:8]
- cache_key = (
- f"datasource:{spreadsheet_id}:{sheet}:{cols_digest}:{ref.format.value}"
- )
+ connector = get_connector(ref.type)
+ if connector is None:
+ logger.warning("Data source '%s' has unknown type '%s'", ref.name, ref.type)
+ return DATA_SOURCE_UNAVAILABLE
+ config = ref.connector_config()
+ cache_key = connector.cache_key(config, ref.sheet_name)🧰 Tools
🪛 ast-grep (0.43.0)
[warning] 330-330: Do not use insecure functions
Context: hashlib.md5(cols.encode())
Note: [CWE-327] [CWE-328].
(insecure-hash-functions)
[info] 329-329: use jsonify instead of json.dumps for JSON output
Context: json.dumps(sorted(ref.columns or []))
Note: Security best practice.
(use-jsonify)
🪛 Ruff (0.15.17)
[error] 331-331: Probable use of insecure hash functions in hashlib: md5
(S324)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/template/loader.py` around lines 329 - 334,
The cache key computation in the loader.py file (the code block computing
cache_key with spreadsheet_id, sheet, cols_digest, and format.value) duplicates
the implementation from GoogleSheetConnector.cache_key in sheets_connector.py.
To fix this, refactor the code to call the connector's cache_key method instead
of recomputing the key locally, ensuring both eager-loaded and on-demand
handlers use the same key generation logic and preventing cache misses or stale
data when implementations diverge.
| spreadsheet_id = extract_spreadsheet_id(config.get("spreadsheet_url", "")) | ||
| if not spreadsheet_id: |
There was a problem hiding this comment.
Normalize spreadsheet_url type before ID extraction to preserve fail-open behavior.
Lines 30, 48, and 55 pass config.get("spreadsheet_url", "") directly. If the key exists with None/non-string, extract_spreadsheet_id() can raise TypeError, bypassing your intended DATA_SOURCE_UNAVAILABLE flow.
Suggested diff
class GoogleSheetConnector:
@@
async def fetch(self, config: Dict[str, Any], key: Optional[str]) -> str:
- spreadsheet_id = extract_spreadsheet_id(config.get("spreadsheet_url", ""))
+ spreadsheet_url = config.get("spreadsheet_url")
+ spreadsheet_id = extract_spreadsheet_id(
+ spreadsheet_url if isinstance(spreadsheet_url, str) else ""
+ )
@@
async def list_keys(self, config: Dict[str, Any]) -> List[str]:
- spreadsheet_id = extract_spreadsheet_id(config.get("spreadsheet_url", ""))
+ spreadsheet_url = config.get("spreadsheet_url")
+ spreadsheet_id = extract_spreadsheet_id(
+ spreadsheet_url if isinstance(spreadsheet_url, str) else ""
+ )
@@
def cache_key(self, config: Dict[str, Any], key: Optional[str]) -> str:
+ spreadsheet_url = config.get("spreadsheet_url")
spreadsheet_id = (
- extract_spreadsheet_id(config.get("spreadsheet_url", "")) or "invalid_url"
+ extract_spreadsheet_id(
+ spreadsheet_url if isinstance(spreadsheet_url, str) else ""
+ )
+ or "invalid_url"
)Also applies to: 48-49, 55-56
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/services/google/sheets_connector.py` around lines 30 - 31, The issue is
that `config.get("spreadsheet_url", "")` can return None or non-string values
even though a default empty string is provided, and passing these directly to
`extract_spreadsheet_id()` causes a TypeError instead of gracefully handling the
missing/invalid ID. Fix this at all three locations (around lines 30, 48, and
55) by normalizing the spreadsheet_url value to ensure it is always a string
type before passing it to `extract_spreadsheet_id()`. Convert the config value
to a string or empty string to preserve the fail-open behavior where the `if not
spreadsheet_id:` checks properly trigger the DATA_SOURCE_UNAVAILABLE flow
without exceptions.
| logger.warning( | ||
| "GoogleSheetConnector: malformed spreadsheet_url: %s", | ||
| config.get("spreadsheet_url"), | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect logger implementation and placeholder usage patterns.
fd -i 'logger.py' app/core --exec sed -n '1,220p' {}
rg -n --iglob '*.py' 'logger\.(debug|info|warning|error|exception|critical)\(".*%s' appRepository: juspay/clairvoyance
Length of output: 2367
🏁 Script executed:
cat -n app/services/google/sheets_connector.py | head -50Repository: juspay/clairvoyance
Length of output: 2312
🏁 Script executed:
rg -n "import.*logger\|from.*logger" app/services/google/sheets_connector.pyRepository: juspay/clairvoyance
Length of output: 45
🏁 Script executed:
rg -n "logger.*=" app/core --type py | head -20Repository: juspay/clairvoyance
Length of output: 969
🌐 Web query:
Loguru logging library printf style formatting %s interpolation Python
💡 Result:
The Loguru library does not support printf-style formatting (e.g., %s interpolation) and does not plan to add support for it [1][2][3]. Loguru is designed to prioritize modern Python practices and exclusively uses curly-brace {} style formatting, which is implemented via str.format [4][5][6]. To migrate from printf-style logging to Loguru, you must replace your log calls as follows: Standard printf-style: logger.debug("Some variable: %s", var) Loguru brace-style: logger.debug("Some variable: {}", var) This approach offers several advantages, including better performance through lazy evaluation: if the logging level is configured to ignore the message, the formatting operation is never performed [1][7]. If you have existing codebases or external libraries that rely heavily on printf-style formatting, the recommended approach is to use Loguru's InterceptHandler to redirect logs from the standard logging library to Loguru, which preserves the standard library's formatting capabilities for those specific logs [2][6]. Alternatively, you could implement a small wrapper around Loguru if strict project-wide consistency is required [6].
Citations:
- 1: https://github.com/Delgan/loguru/blob/master/docs/resources/migration.rst
- 2: Support printf-style formatting? Delgan/loguru#1151
- 3: Feature request: Format log messages when using % and a comma Delgan/loguru#1367
- 4: https://loguru.readthedocs.io/en/stable/api/logger.html
- 5: https://github.com/Delgan/loguru/
- 6: Handle %s formatted strings? Delgan/loguru#991
- 7: Are f-strings supported? Delgan/loguru#289
Use Loguru-style formatting here so the URL is actually logged.
Line 33 uses %s; Loguru does not support printf-style formatting and exclusively uses curly-brace {} style. With %s, the malformed URL value will not be interpolated in logs.
Suggested diff
logger.warning(
- "GoogleSheetConnector: malformed spreadsheet_url: %s",
+ "GoogleSheetConnector: malformed spreadsheet_url: {}",
config.get("spreadsheet_url"),
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| logger.warning( | |
| "GoogleSheetConnector: malformed spreadsheet_url: %s", | |
| config.get("spreadsheet_url"), | |
| ) | |
| logger.warning( | |
| "GoogleSheetConnector: malformed spreadsheet_url: {}", | |
| config.get("spreadsheet_url"), | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/services/google/sheets_connector.py` around lines 32 - 35, The
logger.warning call in the GoogleSheetConnector class is using printf-style
formatting with %s, but Loguru only supports curly-brace style formatting with
{}. Replace the %s placeholder in the format string with {} to ensure the
spreadsheet_url value from config.get("spreadsheet_url") is properly
interpolated and logged.
Source: Coding guidelines
| format: str = "markdown_table", | ||
| max_rows: int = 500, | ||
| ) -> str: | ||
| """ | ||
| Fetch sheet data and return as a formatted string for LLM injection. | ||
|
|
||
| Returns DATA_SOURCE_UNAVAILABLE on any error or empty sheet. | ||
| """ | ||
| rows = await fetch_sheet_data(spreadsheet_id, sheet_name, columns, max_rows) | ||
| if not rows: | ||
| logger.warning( | ||
| f"No data fetched from spreadsheet={spreadsheet_id}, sheet={sheet_name}" | ||
| ) | ||
| return DATA_SOURCE_UNAVAILABLE | ||
|
|
||
| headers = list(rows[0].keys()) if rows else [] | ||
|
|
||
| if format == "csv": | ||
| return _rows_to_csv(headers, rows) | ||
| elif format == "json": | ||
| return _rows_to_json(rows) | ||
| else: | ||
| return _rows_to_markdown_table(headers, rows) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for function parameters named `format` in Python code.
rg -n --iglob '*.py' 'def\s+\w+\([^)]*\bformat\s*:'Repository: juspay/clairvoyance
Length of output: 45
🏁 Script executed:
# Find files related to sheets
fd -e py sheetsRepository: juspay/clairvoyance
Length of output: 161
🏁 Script executed:
# Check if the files exist and examine the relevant code
if [ -f "app/services/google/sheets.py" ]; then
echo "=== File: app/services/google/sheets.py ==="
wc -l "app/services/google/sheets.py"
fi
if [ -f "app/services/google/sheets_connector.py" ]; then
echo "=== File: app/services/google/sheets_connector.py ==="
wc -l "app/services/google/sheets_connector.py"
fiRepository: juspay/clairvoyance
Length of output: 238
🏁 Script executed:
# Read lines 240-275 from sheets.py to see the function definition
if [ -f "app/services/google/sheets.py" ]; then
sed -n '240,275p' "app/services/google/sheets.py"
fiRepository: juspay/clairvoyance
Length of output: 972
🏁 Script executed:
# Read lines 40-50 from sheets_connector.py to find the caller
if [ -f "app/services/google/sheets_connector.py" ]; then
sed -n '40,50p' "app/services/google/sheets_connector.py"
fiRepository: juspay/clairvoyance
Length of output: 496
🏁 Script executed:
# Check for all calls to fetch_formatted function
rg -n 'fetch_formatted' --iglob '*.py'Repository: juspay/clairvoyance
Length of output: 2855
🏁 Script executed:
# Verify there are no other parameters named 'format' in sheets.py
grep -n 'format:' app/services/google/sheets.pyRepository: juspay/clairvoyance
Length of output: 103
🏁 Script executed:
# Check the other callers in breeze_buddy modules
sed -n '340,355p' app/ai/voice/agents/breeze_buddy/template/loader.pyRepository: juspay/clairvoyance
Length of output: 585
🏁 Script executed:
# Check the caller in data_source_prefetch.py
sed -n '55,70p' app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.pyRepository: juspay/clairvoyance
Length of output: 539
🏁 Script executed:
# Verify exact line numbers in test files
grep -n 'format=' tests/test_google_sheets.py | head -10Repository: juspay/clairvoyance
Length of output: 299
🏁 Script executed:
# Double check exact lines in breeze_buddy files
grep -n 'format=' app/ai/voice/agents/breeze_buddy/template/loader.pyRepository: juspay/clairvoyance
Length of output: 112
🏁 Script executed:
# Check test_data_source_connectors.py for the mock function parameters
sed -n '100,115p' tests/test_data_source_connectors.pyRepository: juspay/clairvoyance
Length of output: 509
🏁 Script executed:
# Verify all the call sites comprehensively - count all format= usages
rg -n 'format=' --iglob '*.py' -B 1 -A 1Repository: juspay/clairvoyance
Length of output: 4449
Rename format parameter to avoid builtin shadowing, but update all 5 callers (not just sheets_connector.py).
The parameter shadowing issue is valid (Ruff A002), but the review is incomplete. Beyond the function definition in sheets.py (line 247) and sheets_connector.py (line 44), the format= keyword argument is used in:
- tests/test_google_sheets.py (lines 432, 449, 464)
- tests/test_data_source_connectors.py (line 102, mock function signature)
- app/ai/voice/agents/breeze_buddy/template/loader.py (line 350)
- app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py (line 63)
All 5 call sites must be updated when renaming the parameter, or the code will break with a TypeError for unexpected keyword arguments.
🧰 Tools
🪛 Ruff (0.15.17)
[error] 247-247: Function argument format is shadowing a Python builtin
(A002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/services/google/sheets.py` around lines 247 - 269, Rename the `format`
parameter to avoid shadowing Python's builtin (e.g., change it to `format_type`
or `output_format`). This requires updating the parameter name in the function
definition in sheets.py (around line 247), and then updating all call sites that
pass the `format=` keyword argument: in sheets_connector.py, the three calls in
tests/test_google_sheets.py, the mock function signature in
tests/test_data_source_connectors.py, the call in
app/ai/voice/agents/breeze_buddy/template/loader.py, and the call in
app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py. Ensure all
five call sites use the new parameter name consistently to avoid TypeError for
unexpected keyword arguments.
Source: Linters/SAST tools
| proc = subprocess.run( | ||
| [sys.executable, "-c", code], | ||
| cwd=_REPO_ROOT, | ||
| env={**os.environ, "PYTHONPATH": _REPO_ROOT}, | ||
| capture_output=True, | ||
| text=True, | ||
| ) |
There was a problem hiding this comment.
Harden this subprocess smoke test against hangs and lint gate failures.
The command is test-controlled, but this call can block the suite without a timeout and currently trips Ruff S603. Add an explicit timeout and annotate the intentional subprocess use.
Suggested patch
- proc = subprocess.run(
+ proc = subprocess.run( # noqa: S603 - test uses constant, non-user-controlled command
[sys.executable, "-c", code],
cwd=_REPO_ROOT,
env={**os.environ, "PYTHONPATH": _REPO_ROOT},
capture_output=True,
text=True,
+ timeout=15,
)🧰 Tools
🪛 Ruff (0.15.17)
[error] 66-66: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_data_source_connectors.py` around lines 66 - 72, The
subprocess.run call in the test lacks a timeout parameter which can cause the
test suite to hang indefinitely, and it also triggers the Ruff S603 security
lint rule. Add an explicit timeout parameter to the subprocess.run call to
prevent hangs, and add a noqa comment (# noqa: S603) to the same line to
annotate that this subprocess use is intentional and suppress the lint gate
failure.
Source: Linters/SAST tools
| async def _fake_fetch_formatted(*, spreadsheet_id, sheet_name, columns, format): | ||
| captured.update( | ||
| spreadsheet_id=spreadsheet_id, | ||
| sheet_name=sheet_name, | ||
| columns=columns, | ||
| format=format, | ||
| ) |
There was a problem hiding this comment.
Avoid builtin shadowing in test doubles for fetch_formatted.
Using format as a parameter name triggers Ruff A002. Keep the same call contract via **kwargs and read kwargs["format"].
Suggested patch
- async def _fake_fetch_formatted(*, spreadsheet_id, sheet_name, columns, format):
+ async def _fake_fetch_formatted(*, spreadsheet_id, sheet_name, columns, **kwargs):
+ fmt = kwargs["format"]
captured.update(
spreadsheet_id=spreadsheet_id,
sheet_name=sheet_name,
columns=columns,
- format=format,
+ format=fmt,
)
return "TAB-CONTENT"
@@
- async def _fake_fetch_formatted(*, spreadsheet_id, sheet_name, columns, format):
+ async def _fake_fetch_formatted(*, spreadsheet_id, sheet_name, columns, **kwargs):
captured["sheet_name"] = sheet_name
return "X"Also applies to: 131-133
🧰 Tools
🪛 Ruff (0.15.17)
[error] 102-102: Function argument format is shadowing a Python builtin
(A002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_data_source_connectors.py` around lines 102 - 108, The
_fake_fetch_formatted function uses `format` as a parameter name, which shadows
Python's builtin function and triggers Ruff A002. Refactor the function
signature to accept **kwargs instead of named parameters (spreadsheet_id,
sheet_name, columns, format), then update the captured.update() call to read
values from kwargs using keys like kwargs["format"], kwargs["spreadsheet_id"],
etc. Apply the same fix to the other occurrence mentioned at lines 131-133.
Source: Linters/SAST tools
Adds a pluggable connector layer on top of the inline Google Sheets data source so the agent loads one keyed slice (e.g. a single sheet tab) on demand mid-call, instead of injecting every source up front.
type+mode(eager|on_demand) + connector_config()Summary by CodeRabbit