Id not set in checkpoint2 #4468
PR Summary
Improved folder indexing and checkpoint management in the Google Drive connector to better handle cases where folder IDs aren’t set due to access issues.
- backend/onyx/connectors/google_drive/file_retrieval.py: Moved update_traversed_ids_func inside the try block to update folder IDs only when valid files are found.
- backend/onyx/connectors/google_drive/connector.py: Captures the last processed folder on resume and unconditionally updates the checkpoint, ensuring more reliable progress tracking.
- Minor renaming clarifies stage completion handling.
2 file(s) reviewed, no comment(s)
consolidated_context_docs.append(original_doc)
counter += 1
for original_doc in orig_question_retrieval_documents:
    if original_doc in structured_subquestion_docs.cited_documents:
what's going on here?
typing fixes
@@ -98,9 +98,6 @@ def _is_external_doc_permissions_sync_due(cc_pair: ConnectorCredentialPair) -> b
    if cc_pair.status != ConnectorCredentialPairStatus.ACTIVE:
        return False

    if cc_pair.status == ConnectorCredentialPairStatus.DELETING:
why remove?
Was never True, see lines above
yield RetrievedDriveFile(
    drive_file=file,
    user_email=user_email,
    parent_id=parent_id,
    completion_stage=DriveRetrievalStage.FOLDER_FILES,
)
if found_files:
why the move? If it's important, would prefer to add a comment as to why. If not / purely stylistic, ignore
Previously we marked a folder as traversed if at least one file from it was retrieved without an error; now it is only marked as done if ALL files from it are retrieved. With the new system for tracking folder completion (sorting and continuing from the last SEEN folder rather than the last retrieved one), this shouldn't cause us to get stuck, and it should let us handle pathological cases like a bunch of different users having individual access to files in a "shared folder" that isn't actually fully shared due to permission revoking.
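A minimal sketch of the behavior change described in this reply, not the connector's actual code. `crawl_folder`, `fetch_files`, `mark_traversed`, and the use of `PermissionError` are hypothetical stand-ins; the only point is that the folder is marked as traversed after the whole listing completes without an error, rather than after the first successful file.

```python
from collections.abc import Callable, Iterator


def crawl_folder(
    folder_id: str,
    fetch_files: Callable[[str], Iterator[str]],  # hypothetical: yields file ids, may raise
    mark_traversed: Callable[[str], None],  # hypothetical stand-in for update_traversed_ids_func
) -> Iterator[str]:
    """Yield files from one folder; mark it traversed only if every file was retrieved."""
    found_files = False
    try:
        for file_id in fetch_files(folder_id):
            found_files = True
            yield file_id
        # Reached only when the listing finished without raising.
        if found_files:
            mark_traversed(folder_id)
    except PermissionError:
        # Partial access: leave the folder unmarked so it can be visited again
        # (for example, on behalf of a different user).
        pass
```

With this shape, a folder whose listing fails halfway still yields the files that were readable, but it is not recorded as done.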
last_processed_folder = folder_id

skipping_seen_folders = last_processed_folder is not None
for folder_id in sorted(filtered_folder_ids):
could we merge the if statement and this for loop into one?
It's probably doable, but imo it's better/easier to read to have separate logic that handles resuming from a checkpoint
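For readers following the thread, here is a rough sketch of what resuming from a checkpoint over sorted folder IDs could look like. It is an illustration under assumptions, not the code in this PR: `Checkpoint`, `index_folders`, and `files_by_folder` are hypothetical names, and retrying the last seen folder on resume is a guess at the intended behavior.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Checkpoint:
    # Hypothetical, simplified checkpoint; the real one tracks more state.
    last_seen_folder: Optional[str] = None
    yielded: list = field(default_factory=list)


def index_folders(folder_ids, files_by_folder, checkpoint):
    """Walk folders in sorted order, resuming after the last folder seen.

    files_by_folder is a hypothetical lookup; folders the user cannot read
    are simply absent from it and therefore yield nothing.
    """
    # Resume handling, kept separate from the indexing work below.
    last_seen = checkpoint.last_seen_folder
    remaining = [
        folder_id
        for folder_id in sorted(folder_ids)
        # Assumption: the last seen folder itself is retried on resume.
        if last_seen is None or folder_id >= last_seen
    ]

    for folder_id in remaining:
        # Record progress unconditionally, even if the folder yields no files,
        # so the checkpoint is never left without a folder id.
        checkpoint.last_seen_folder = folder_id
        checkpoint.yielded.extend(files_by_folder.get(folder_id, []))
```

Sorting gives the folder IDs a stable order, so "everything before the last seen ID" stays well defined across runs even when earlier folders produced no documents.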
Description
https://linear.app/danswer/issue/DAN-1762/more-drive-id-not-set-fixes
More drive connector improvements to cover cases where "folder id not set in checkpoint" might occur. One such common case is that a user doesn't have access to the first several folders being indexed, so despite that user's completion stage being set to "folders", they don't yield any documents from the first few folders, leading to no information being set in the checkpoint.
Also added a mypy change and associated fixes to prevent broken equality checks in the future.
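The description does not name the mypy setting. One option that targets this class of bug is `strict_equality` (`--strict-equality`), which reports equality and containment checks between non-overlapping types. Assuming something along those lines, the kind of mistake it is meant to catch looks like this; `Status` and both functions are hypothetical illustrations.

```python
from enum import Enum


class Status(Enum):  # hypothetical enum, standing in for a real status type
    ACTIVE = "active"
    DELETING = "deleting"


def is_deleting_buggy(status: Status) -> bool:
    # Always False at runtime: a plain Enum member never equals a bare string.
    # With strict equality enabled, mypy should flag this comparison as a
    # non-overlapping equality check.
    return status == "deleting"


def is_deleting(status: Status) -> bool:
    # Comparing against the enum member is well typed and behaves as intended.
    return status == Status.DELETING
```

Without the type check, the buggy version runs without error and silently returns the wrong answer, which is why a static check is useful here.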
How Has This Been Tested?
tested in UI
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.