Drive smart chip indexing #4459

Merged
merged 7 commits into from
Apr 7, 2025
@evan-danswer evan-danswer commented Apr 5, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-1684/handle-date-pills-and-other-similar-things-in-google-docs

Index smart chips in Google Docs. We know of three ways to get content from Google Docs, each with its own pros and cons:
a) Docs advanced API v1
Pros: allows structured retrieval, i.e. can extract headings.
Cons: misses most smart chips (dates, timers, calendar events, etc.). It DOES extract people and doc links.
b) Drive file retrieval
Pros: gets ALL smart chips and all text content.
Cons: no structure.
c) Apps Script
Pros: structured retrieval; DOES get dates.
Cons: misses all other smart chips and requires users to enable several extra scopes.

This PR fixes some prior issues with the advanced retrieval in (a): it was missing the first section and did not extract tables. It also detects when a doc contains smart chips; if so, it uses (b) to get the full file content, then combines that on a best-effort basis with the section information from (a), yielding full-content docs with reasonable section information.
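
To illustrate the best-effort combination described above, here is a minimal sketch, not the code from this PR. It assumes the structured retrieval yields an ordered list of heading strings and the Drive retrieval yields one plain-text string. The names `Section` and `align_sections` are hypothetical and exist only for this example. Headings that cannot be found in the full text are skipped, and any text before the first heading is kept as its own untitled section.

```python
from dataclasses import dataclass


@dataclass
class Section:
    heading: str  # heading text from the structured result ("" for the preamble)
    text: str     # slice of the full exported text assigned to this heading


def align_sections(full_text: str, headings: list[str]) -> list[Section]:
    """Split full_text at each heading found in order; skip headings not found."""
    sections: list[Section] = []
    cursor = 0    # search position; only ever moves forward
    start = 0     # start of the text that belongs to the current heading
    current = ""  # "" marks the preamble before the first matched heading
    for heading in headings:
        idx = full_text.find(heading, cursor) if heading else -1
        if idx == -1:
            continue  # best effort: heading missing from the full text
        body = full_text[start:idx].strip()
        if body or current:
            # Keep the preamble only when it actually contains text.
            sections.append(Section(current, body))
        current = heading
        start = idx + len(heading)
        cursor = start
    # Everything after the last matched heading belongs to that heading.
    sections.append(Section(current, full_text[start:].strip()))
    return sections


if __name__ == "__main__":
    text = "Intro line\nPlan\nKickoff on Apr 7, 2025\nNotes\nSee table"
    for section in align_sections(text, ["Plan", "Missing", "Notes"]):
        print(repr(section.heading), "->", repr(section.text))
```

A plain substring search like this can match a heading string that also appears earlier in body text, so a real implementation would likely want to log a warning when a heading is skipped or the alignment looks suspicious.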

We switched away from (c) upon realizing that (b) had more information than previously thought, but the code from that approach is available in the drive-pill-indexing branch.

How Has This Been Tested?

Tested in the UI. An integration test should be added at some point.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@evan-danswer evan-danswer requested a review from a team as a code owner April 5, 2025 00:04
vercel bot commented Apr 5, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Apr 7, 2025 9:02pm

@evan-danswer evan-danswer changed the title from "Drive pill indexing2" to "Drive smart chip indexing" Apr 5, 2025
@greptile-apps greptile-apps bot left a comment

PR Summary

The PR enhances the Google Drive connector's extraction and indexing of smart chips by integrating multiple retrieval methods and refining section alignment.

  • /backend/onyx/connectors/google_drive/appsscript.json: New configuration enabling the advanced Docs service in a V8 environment.
  • /backend/onyx/chat/process_message.py: Streamlined deduplication logic using a clearer conditional expression.
  • /backend/onyx/connectors/google_drive/section_extraction.py: Improved tab handling and structured processing of headings and tables.
  • /backend/onyx/connectors/google_drive/doc_conversion.py: Introduced best-effort alignment between basic and advanced extraction with warnings for misalignments.
  • /backend/onyx/connectors/google_drive/smart_chip_retrieval.gs: New Apps Script for smart chip extraction with potential Map usage concerns.

6 file(s) reviewed, 1 comment(s)

@Weves Weves added this pull request to the merge queue Apr 7, 2025
Merged via the queue into main with commit 9c73099 Apr 7, 2025
11 checks passed
@evan-danswer evan-danswer deleted the drive-pill-indexing2 branch April 7, 2025 23:35
2 participants