feat(slack bot): add federated search#5275
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Weves
left a comment
There was a problem hiding this comment.
@jessicasingh7 overall looks great! Left a few initial comments, let me know if you have any questions about them.
backend/alembic/versions/5ae8240accb3_add_research_agent_database_tables_and_.py
Show resolved
Hide resolved
backend/tests/external_dependency_unit/slack_bot/test_slack_bot_federated_search.py
Outdated
Show resolved
Hide resolved
backend/tests/external_dependency_unit/slack_bot/test_slack_bot_federated_search.py
Show resolved
Hide resolved
backend/tests/external_dependency_unit/slack_bot/test_slack_bot_federated_search.py
Outdated
Show resolved
Hide resolved
08ea4f5 to
63bd0f2
Compare
There was a problem hiding this comment.
Greptile Summary
This PR implements federated Slack search functionality for the Slack bot, enabling it to search through Slack messages directly and include those results in chat responses. The implementation adds support for user OAuth tokens (xoxp- prefixed) in addition to existing bot tokens, as user tokens provide broader search permissions required for accessing Slack's search API.
The changes span multiple layers of the application:
Database & Models: A new nullable user_token field is added to the SlackBot model with proper encryption, along with corresponding API models (SlackBotCreationRequest, SlackBotTokens) and validation functions that ensure user tokens follow the correct format and are valid with Slack's API.
Frontend Integration: The admin UI is updated to include an optional "Slack User Token" field in bot configuration forms, maintaining backward compatibility since not all bots require federated search capabilities.
Search Pipeline Integration: The core search system is enhanced with SlackContext awareness - a new model that captures channel type (public/private/DM), channel ID, and user ID. This context flows through the entire search pipeline from the Slack listener through chat processing to the search tools, enabling context-aware federated searches.
Federated Search Implementation: A new slack_search.py module implements the actual Slack search functionality using Slack's search.messages API. It includes sophisticated channel filtering logic (e.g., restricting private channel access in bot contexts), thread contextualization for better results, and proper conversion of Slack search results to the system's InferenceChunk format.
OAuth Flow Enhancement: The Slack OAuth connector is updated to handle both bot and user tokens correctly, extracting user tokens from the authed_user section of Slack's OAuth response rather than incorrectly using the same token for both purposes.
The architecture prioritizes Slack bot context over regular user-based federated retrieval, allowing bots to function autonomously while falling back to user-based search when appropriate. Channel type detection uses Slack's API rather than potentially unreliable event data, ensuring accurate permission handling.
Confidence score: 3/5
- This PR introduces complex federated search functionality with multiple integration points that could have subtle permission or security issues
- Score reflects the architectural complexity and potential for runtime issues with lambda closures and token validation edge cases
- Pay close attention to backend/onyx/federated_connectors/federated_retrieval.py and the Slack search implementation files for variable capture bugs and permission logic
Context used:
Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
32 files reviewed, 12 comments
| Args: | ||
| user_token: The user OAuth token to validate. | ||
| Returns: | ||
| None is valid and will return successfully. |
There was a problem hiding this comment.
style: Consider updating docstring - says 'None is valid and will return successfully' but should say 'Returns None if valid'
| None is valid and will return successfully. | |
| Returns None if valid. |
| try: | ||
| channel_info = web_client.conversations_info(channel=channel_id) | ||
| if channel_info.get("ok") and channel_info.get("channel"): | ||
| channel: dict[str, Any] = channel_info.get("channel", {}) |
There was a problem hiding this comment.
style: The variable assignment channel: dict[str, Any] = channel_info.get("channel", {}) is redundant since you're already checking channel_info.get("channel") on line 94
| if "not_allowed_token_type" in str(slack_error): | ||
| logger.error( | ||
| f"TOKEN TYPE ERROR: access_token starts with: {access_token[:10]}..." | ||
| ) |
There was a problem hiding this comment.
style: Logging the first 10 characters of access_token could potentially expose sensitive information in logs. Consider logging just the token type prefix (xoxp- vs xoxb-) instead.
| retrieval_function=lambda query: connector.search( | ||
| query, | ||
| {}, # Empty entities for Slack context | ||
| access_token=access_token, | ||
| limit=MAX_FEDERATED_CHUNKS, | ||
| slack_event_context=slack_context, | ||
| bot_token=tenant_slack_bot.bot_token, | ||
| ), |
There was a problem hiding this comment.
logic: Lambda captures variables by reference. access_token and tenant_slack_bot.bot_token could change between lambda creation and execution, causing runtime errors. Use default parameters to capture by value: lambda query, token=access_token, bot_token=tenant_slack_bot.bot_token: ...
| retrieval_function=lambda query: connector.search( | |
| query, | |
| {}, # Empty entities for Slack context | |
| access_token=access_token, | |
| limit=MAX_FEDERATED_CHUNKS, | |
| slack_event_context=slack_context, | |
| bot_token=tenant_slack_bot.bot_token, | |
| ), | |
| retrieval_function=lambda query, token=access_token, bot_token=tenant_slack_bot.bot_token: connector.search( | |
| query, | |
| {}, # Empty entities for Slack context | |
| access_token=token, | |
| limit=MAX_FEDERATED_CHUNKS, | |
| slack_event_context=slack_context, | |
| bot_token=bot_token, | |
| ), |
| # Slack-specific parameters | ||
| slack_event_context: SlackContext | None = None, | ||
| bot_token: str | None = None, |
There was a problem hiding this comment.
style: The Slack-specific parameters break interface abstraction by adding connector-specific parameters to a generic interface. Consider using a more extensible approach like a generic context parameter or connector-specific method overloads.
|
|
||
| # Ensure that the message is a new message of expected type | ||
| event_type = event.get("type") | ||
| event.get("channel_type") |
There was a problem hiding this comment.
syntax: This line calls event.get("channel_type") but doesn't assign the result to anything. Either remove this line or assign it to a variable if it's needed.
|
|
||
| return OAuthResult( | ||
| access_token=access_token, | ||
| access_token=user_token, # Bot token for bot operations |
There was a problem hiding this comment.
syntax: Comment is misleading - this is actually returning the user token, not bot token as the comment suggests
| access_token=user_token, # Bot token for bot operations | |
| access_token=user_token, # User token for user operations |
|
|
||
| started_patches = [p.start() for p in patches] | ||
|
|
||
| self._setup_slack_api_mocks(started_patches[0], started_patches[0]) |
There was a problem hiding this comment.
logic: Bug: passing same mock (started_patches[0]) twice to _setup_slack_api_mocks. Second parameter should be different mock or removed.
There was a problem hiding this comment.
cubic analysis
24 issues found across 33 files
Prompt for AI agents (all 24 issues)
Understand the root cause of the following 24 issues and fix them.
<file name="backend/onyx/connectors/slack/onyx_retry_handler.py">
<violation number="1" location="backend/onyx/connectors/slack/onyx_retry_handler.py:78">
Retry-After parsing uses only first character when header value is str, causing too-short backoff and repeated 429s.</violation>
</file>
<file name="backend/onyx/chat/process_message.py">
<violation number="1" location="backend/onyx/chat/process_message.py:606">
Passing slack_context causes SearchTool to log full SlackContext at info level (user_id/channel_id), risking PII exposure; prefer redaction or debug-level logging.</violation>
</file>
<file name="backend/onyx/federated_connectors/interfaces.py">
<violation number="1" location="backend/onyx/federated_connectors/interfaces.py:91">
Leaking Slack-specific type into the generic FederatedConnector.search signature reduces modularity; use a generic type (e.g., Any or a protocol) or a generic context dict instead.</violation>
</file>
<file name="backend/onyx/federated_connectors/slack/federated_connector.py">
<violation number="1" location="backend/onyx/federated_connectors/slack/federated_connector.py:187">
OAuthResult is assigning the user token to access_token, which mislabels tokens and will break downstream auth flows. Populate user_token with the user token and keep access_token for the bot token returned at the top level of the OAuth response.</violation>
<violation number="2" location="backend/onyx/federated_connectors/slack/federated_connector.py:241">
Avoid logging full entities at INFO; log minimal or redacted metadata to reduce potential leakage in logs.</violation>
</file>
<file name="backend/onyx/context/search/federated/slack_search.py">
<violation number="1" location="backend/onyx/context/search/federated/slack_search.py:60">
Calling conversations_info per match can be slow and hit rate limits; cache channel type by channel_id or prefetch to reduce API calls.</violation>
<violation number="2" location="backend/onyx/context/search/federated/slack_search.py:71">
On any exception during channel type lookup, all messages are filtered out, often resulting in zero results.</violation>
<violation number="3" location="backend/onyx/context/search/federated/slack_search.py:147">
Do not log entire API response objects; restrict logs to safe metadata (status, error).
(Based on your team's feedback about logging only safe metadata rather than raw responses that may include sensitive info.) [FEEDBACK_USED]</violation>
<violation number="4" location="backend/onyx/context/search/federated/slack_search.py:151">
Do not log any part of OAuth tokens; remove token prefix from logs.
(Based on your team's feedback about avoiding logging raw exception details/tokens and preferring safe metadata.) [FEEDBACK_USED]</violation>
</file>
<file name="backend/onyx/onyxbot/slack/utils.py">
<violation number="1" location="backend/onyx/onyxbot/slack/utils.py:87">
chat_postMessage result accessed with message_ts instead of ts, causing KeyError and retry-induced duplicate posts</violation>
<violation number="2" location="backend/onyx/onyxbot/slack/utils.py:101">
Private channel detection omits is_group; legacy private channels may be misclassified as UNKNOWN. Include is_group in the private-channel check.</violation>
<violation number="3" location="backend/onyx/onyxbot/slack/utils.py:113">
Avoid catching broad Exception; catch SlackApiError to handle Slack API errors specifically.</violation>
<violation number="4" location="backend/onyx/onyxbot/slack/utils.py:115">
Avoid logging the raw exception string; it may leak sensitive data. Log safe metadata or omit the exception value.
(Based on your team's feedback about avoiding logging raw exception strings that may include URLs with temporary auth tokens.) [FEEDBACK_USED]</violation>
</file>
<file name="backend/onyx/federated_connectors/models.py">
<violation number="1" location="backend/onyx/federated_connectors/models.py:43">
Mask sensitive access_token in model repr to prevent leaking tokens when objects are logged; add repr=False to Field.</violation>
<violation number="2" location="backend/onyx/federated_connectors/models.py:45">
Add repr=False to user_token Field to avoid exposing the user OAuth token in logs/prints.</violation>
</file>
<file name="backend/tests/external_dependency_unit/slack_bot/test_slack_bot_federated_search.py">
<violation number="1" location="backend/tests/external_dependency_unit/slack_bot/test_slack_bot_federated_search.py:248">
Add a test case where SlackBot has no user_token to validate the bot_token-only fallback and ensure federated search is correctly limited/disabled per requirements.</violation>
<violation number="2" location="backend/tests/external_dependency_unit/slack_bot/test_slack_bot_federated_search.py:290">
Avoid hardcoding embedding dimensionality in tests; derive from configuration or stub consumers to accept arbitrary length.</violation>
<violation number="3" location="backend/tests/external_dependency_unit/slack_bot/test_slack_bot_federated_search.py:529">
Docstring/expectations for DM behavior may be misleading; either update to reflect current implementation or add tests to cover optional inclusion of private channels when feasible. (According to linked Linear issue DAN-2328 optional DM behavior.)</violation>
</file>
<file name="backend/onyx/onyxbot/slack/listener.py">
<violation number="1" location="backend/onyx/onyxbot/slack/listener.py:870">
Avoid per-message Slack API calls for channel type; prefer using event-provided channel_type or caching to reduce latency and rate-limit risk.</violation>
</file>
<file name="backend/onyx/server/manage/validate_tokens.py">
<violation number="1" location="backend/onyx/server/manage/validate_tokens.py:71">
Add a timeout to the Slack API call to prevent hanging requests in case of network issues or Slack outages.</violation>
</file>
<file name="backend/onyx/utils/gpu_utils.py">
<violation number="1" location="backend/onyx/utils/gpu_utils.py:22">
Brittle disabled check: hard-codes port 9000; misses other ports, causing unnecessary requests when MODEL_SERVER_HOST is 'disabled'.</violation>
</file>
<file name="backend/onyx/configs/constants.py">
<violation number="1" location="backend/onyx/configs/constants.py:10">
Overly strict user token prefix "xoxp-" causes validate_user_token to reject valid Slack rotated user tokens (e.g., "xoxe.xoxp-..."), blocking Slack bot setup per DAN-2328.</violation>
</file>
<file name=".github/workflows/pr-external-dependency-unit-tests.yml">
<violation number="1" location=".github/workflows/pr-external-dependency-unit-tests.yml:10">
Duplicate env key OPENAI_API_KEY is defined twice in the same env mapping; remove one to avoid confusion and potential YAML duplicate-key issues.</violation>
<violation number="2" location=".github/workflows/pr-external-dependency-unit-tests.yml:94">
Readiness check masks all failures by continuing on any non-zero status; only continue on actual timeout (exit code 124), fail otherwise.</violation>
</file>
Linked issue analysis
Linked issue: DAN-2328: Add support for slack search into the slack bot
| Status | Acceptance criteria | Notes |
|---|---|---|
| ❌ | If 'All Public Knowledge' is selected, then federated slack (if connected) should be used. | No mapping from 'All Public Knowledge' selection to Slack retrieval |
| ❌ | If 'Specific Document Sets' are selected, then federated slack (if assigned to any of the document sets) should be used. | No document-set → Slack-channel selection used in retrieval |
| ❌ | If 'Search Assistant' is selected, then federated slack (if available to that assistant) should be used. | No assistant-type check enabling Slack federated retrieval found |
| ❌ | If 'Non-Search' assistant is selected, then no federated search should be used. | No logic disables federated retrieval for non-search assistants |
| ✅ | Federated search should search all Public channels, but not include DMs / private channels. | Channel filtering & public-channel detection implemented |
| ❌ | If a document set is specified that specifies certain channels, then search those channels (okay to ignore private channels). | No scoping of Slack search by document-set channel list implemented |
| ✅ | Optional: If DMing the bot, then respond based off of all public channels + private channels + DMs that the user is a part of (including ephemeral responses). | DM context handled via include_dm true and skiplists bypassed |
| ❌ | Optional: Allow user to add the Slack bot to certain private channels to make them searchable. | No UI/API to add bot to private channels found |
| ✅ | Optional: If asking a question in a private channel, respond based off all public channels + that ONE specific private channel. | Allowed_private_channel implemented to permit only that private channel |
| ❌ | Admins/curators should be able to create document sets and attach specific Slack channels to them (instead of only 'All of Slack'). | No document-set UI/API to attach specific Slack channels implemented |
| ❌ | UI: In document set creation page, allow configuring channels (replace gear with 'All Channels' & popup to type channel names). | No document-set UI changes to add channel configuration popup |
React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.
| ttl_ms: int | None = None | ||
|
|
||
| retry_after_value: list[str] | None = None | ||
| retry_after_value: list[str] | str | None = None |
There was a problem hiding this comment.
Retry-After parsing uses only first character when header value is str, causing too-short backoff and repeated 429s.
Prompt for AI agents
Address the following comment on backend/onyx/connectors/slack/onyx_retry_handler.py at line 78:
<comment>Retry-After parsing uses only first character when header value is str, causing too-short backoff and repeated 429s.</comment>
<file context>
@@ -75,7 +75,7 @@ def prepare_for_next_attempt(
ttl_ms: int | None = None
- retry_after_value: list[str] | None = None
+ retry_after_value: list[str] | str | None = None
retry_after_header_name: Optional[str] = None
duration_s: float = 1.0 # seconds
</file context>
✅ Addressed in 1a2badb
| additional_headers=custom_tool_additional_headers, | ||
| ), | ||
| allowed_tool_ids=new_msg_req.allowed_tool_ids, | ||
| slack_context=new_msg_req.slack_context, # Pass Slack context from request |
There was a problem hiding this comment.
Passing slack_context causes SearchTool to log full SlackContext at info level (user_id/channel_id), risking PII exposure; prefer redaction or debug-level logging.
Prompt for AI agents
Address the following comment on backend/onyx/chat/process_message.py at line 606:
<comment>Passing slack_context causes SearchTool to log full SlackContext at info level (user_id/channel_id), risking PII exposure; prefer redaction or debug-level logging.</comment>
<file context>
@@ -603,6 +603,7 @@ def stream_chat_message_objects(
additional_headers=custom_tool_additional_headers,
),
allowed_tool_ids=new_msg_req.allowed_tool_ids,
+ slack_context=new_msg_req.slack_context, # Pass Slack context from request
)
</file context>
| access_token: str, | ||
| limit: int | None = None, | ||
| # Slack-specific parameters | ||
| slack_event_context: SlackContext | None = None, |
There was a problem hiding this comment.
Leaking Slack-specific type into the generic FederatedConnector.search signature reduces modularity; use a generic type (e.g., Any or a protocol) or a generic context dict instead.
Prompt for AI agents
Address the following comment on backend/onyx/federated_connectors/interfaces.py at line 91:
<comment>Leaking Slack-specific type into the generic FederatedConnector.search signature reduces modularity; use a generic type (e.g., Any or a protocol) or a generic context dict instead.</comment>
<file context>
@@ -86,6 +87,9 @@ def search(
access_token: str,
limit: int | None = None,
+ # Slack-specific parameters
+ slack_event_context: SlackContext | None = None,
+ bot_token: str | None = None,
) -> list[InferenceChunk]:
</file context>
| Returns: | ||
| Search results in SlackSearchResponse format | ||
| """ | ||
| logger.info(f"Slack federated search called with entities: {entities}") |
There was a problem hiding this comment.
Avoid logging full entities at INFO; log minimal or redacted metadata to reduce potential leakage in logs.
Prompt for AI agents
Address the following comment on backend/onyx/federated_connectors/slack/federated_connector.py at line 241:
<comment>Avoid logging full entities at INFO; log minimal or redacted metadata to reduce potential leakage in logs.</comment>
<file context>
@@ -229,9 +232,20 @@ def search(
Returns:
Search results in SlackSearchResponse format
"""
+ logger.info(f"Slack federated search called with entities: {entities}")
+
with get_session_with_current_tenant() as db_session:
</file context>
| logger.info(f"Slack federated search called with entities: {entities}") | |
| logger.info("Slack federated search called", extra={"has_entities": bool(entities)}) |
| f"Skipping message from private channel {channel_id} " | ||
| f"(not the allowed private channel: {allowed_private_channel})" | ||
| ) | ||
| return True |
There was a problem hiding this comment.
On any exception during channel type lookup, all messages are filtered out, often resulting in zero results.
Prompt for AI agents
Address the following comment on backend/onyx/context/search/federated/slack_search.py at line 71:
<comment>On any exception during channel type lookup, all messages are filtered out, often resulting in zero results.</comment>
<file context>
@@ -39,6 +41,42 @@
+ f"Skipping message from private channel {channel_id} "
+ f"(not the allowed private channel: {allowed_private_channel})"
+ )
+ return True
+ except Exception as e:
+ logger.warning(
</file context>
|
|
||
| env: | ||
| # OpenAI | ||
| OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} |
There was a problem hiding this comment.
Duplicate env key OPENAI_API_KEY is defined twice in the same env mapping; remove one to avoid confusion and potential YAML duplicate-key issues.
Prompt for AI agents
Address the following comment on .github/workflows/pr-external-dependency-unit-tests.yml at line 10:
<comment>Duplicate env key OPENAI_API_KEY is defined twice in the same env mapping; remove one to avoid confusion and potential YAML duplicate-key issues.</comment>
<file context>
@@ -6,6 +6,9 @@ on:
env:
+ # OpenAI
+ OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+
# AWS
</file context>
✅ Addressed in 1a2badb
| # Use bot token if available (has full permissions), otherwise fall back to user token | ||
| token_to_use = bot_token or access_token | ||
| channel_client = WebClient(token=token_to_use) | ||
| channel_info = channel_client.conversations_info(channel=channel_id) |
There was a problem hiding this comment.
Calling conversations_info per match can be slow and hit rate limits; cache channel type by channel_id or prefetch to reduce API calls.
Prompt for AI agents
Address the following comment on backend/onyx/context/search/federated/slack_search.py at line 60:
<comment>Calling conversations_info per match can be slow and hit rate limits; cache channel type by channel_id or prefetch to reduce API calls.</comment>
<file context>
@@ -39,6 +41,42 @@
+ # Use bot token if available (has full permissions), otherwise fall back to user token
+ token_to_use = bot_token or access_token
+ channel_client = WebClient(token=token_to_use)
+ channel_info = channel_client.conversations_info(channel=channel_id)
+
+ if isinstance(channel_info.data, dict) and not _is_public_channel(
</file context>
| logger.error(f"Slack API error in search_messages: {slack_error}") | ||
| logger.error( | ||
| f"Slack API error details: status={slack_error.response.status_code}, " | ||
| f"error={slack_error.response.get('error')}, response={slack_error.response}" |
There was a problem hiding this comment.
Prompt for AI agents
~~~ Address the following comment on backend/onyx/context/search/federated/slack_search.py at line 147: Do not log entire API response objects; restrict logs to safe metadata (status, error). (Based on your team's feedback about logging only safe metadata rather than raw responses that may include sensitive info.) @@ -79,19 +139,36 @@ def query_slack( + logger.error(f"Slack API error in search_messages: {slack_error}") + logger.error( + f"Slack API error details: status={slack_error.response.status_code}, " + f"error={slack_error.response.get('error')}, response={slack_error.response}" + ) + if "not_allowed_token_type" in str(slack_error): ~~~| ) | ||
| if "not_allowed_token_type" in str(slack_error): | ||
| logger.error( | ||
| f"TOKEN TYPE ERROR: access_token starts with: {access_token[:10]}..." |
There was a problem hiding this comment.
Prompt for AI agents
~~~ Address the following comment on backend/onyx/context/search/federated/slack_search.py at line 151: Do not log any part of OAuth tokens; remove token prefix from logs. (Based on your team's feedback about avoiding logging raw exception details/tokens and preferring safe metadata.) @@ -79,19 +139,36 @@ def query_slack( + ) + if "not_allowed_token_type" in str(slack_error): + logger.error( + f"TOKEN TYPE ERROR: access_token starts with: {access_token[:10]}..." + ) return [] ~~~|
|
||
| return OAuthResult( | ||
| access_token=access_token, | ||
| access_token=user_token, # Bot token for bot operations |
There was a problem hiding this comment.
OAuthResult is assigning the user token to access_token, which mislabels tokens and will break downstream auth flows. Populate user_token with the user token and keep access_token for the bot token returned at the top level of the OAuth response.
Prompt for AI agents
Address the following comment on backend/onyx/federated_connectors/slack/federated_connector.py at line 187:
<comment>OAuthResult is assigning the user token to access_token, which mislabels tokens and will break downstream auth flows. Populate user_token with the user token and keep access_token for the bot token returned at the top level of the OAuth response.</comment>
<file context>
@@ -183,7 +184,7 @@ def callback(self, callback_data: dict[str, Any], redirect_uri: str) -> OAuthRes
return OAuthResult(
- access_token=access_token,
+ access_token=user_token, # Bot token for bot operations
token_type=token_type,
scope=scope,
</file context>
|
Slack bot- fed search |
Co-authored-by: Jessica Singh <jessicasingh@Mac.attlocal.net> Co-authored-by: Jessica Singh <jessicasingh@mac.lan>
Co-authored-by: Jessica Singh <jessicasingh@Mac.attlocal.net> Co-authored-by: Jessica Singh <jessicasingh@mac.lan>
Description
DAN-2328
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Adds federated Slack search to the Slack bot using a user OAuth token, enabling public-channel search results in chat. This delivers the core of DAN-2328 for public channels; channel scoping and private/DM coverage will follow.
New Features
Migration