Id not set in checkpoint2 #4468

Merged: 5 commits into main from id-not-set-in-checkpoint2, Apr 8, 2025

@evan-danswer evan-danswer commented Apr 7, 2025

Description

https://linear.app/danswer/issue/DAN-1762/more-drive-id-not-set-fixes

More Drive connector fixes for cases where the "folder id not set in checkpoint" error can occur. One common case: a user lacks access to the first several folders being indexed, so even though that user's completion stage is set to "folders", no documents are yielded from those folders and nothing is recorded in the checkpoint.

Also added a mypy change, with associated fixes, to prevent broken equality checks in the future.

How Has This Been Tested?

Tested in the UI.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@evan-danswer evan-danswer requested a review from a team as a code owner April 7, 2025 16:59

@greptile-apps greptile-apps bot left a comment

PR Summary

Improved folder indexing and checkpoint management in the Google Drive connector to better handle cases where folder IDs aren’t set due to access issues.

  • backend/onyx/connectors/google_drive/file_retrieval.py: Moved update_traversed_ids_func inside the try block to update folder IDs only when valid files are found.
  • backend/onyx/connectors/google_drive/connector.py: Captures the last processed folder on resume and unconditionally updates the checkpoint, ensuring more reliable progress tracking.
  • Minor renaming clarifies stage completion handling.

2 file(s) reviewed, no comment(s)

consolidated_context_docs.append(original_doc)
counter += 1
for original_doc in orig_question_retrieval_documents:
if original_doc in structured_subquestion_docs.cited_documents:

Contributor

what's going on here?

Contributor Author

typing fixes

@@ -98,9 +98,6 @@ def _is_external_doc_permissions_sync_due(cc_pair: ConnectorCredentialPair) -> b
    if cc_pair.status != ConnectorCredentialPairStatus.ACTIVE:
        return False

    if cc_pair.status == ConnectorCredentialPairStatus.DELETING:

Contributor

why remove?

Contributor Author

Was never True, see lines above
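
To illustrate why the removed branch was dead code, here is a minimal sketch. The enum and both functions are simplified, hypothetical stand-ins, not the real Onyx definitions. Once the early return has handled every status other than ACTIVE, a later comparison against DELETING can never succeed. This is the kind of comparison that mypy's `--strict-equality` option can report as a non-overlapping equality check; whether that option is the exact mypy change made in this PR is an assumption.

```python
from enum import Enum


class PairStatus(Enum):
    """Hypothetical stand-in for ConnectorCredentialPairStatus."""

    ACTIVE = "ACTIVE"
    PAUSED = "PAUSED"
    DELETING = "DELETING"


def is_sync_due_old(status: PairStatus) -> bool:
    """Shape of the check before the change, including the dead branch."""
    if status != PairStatus.ACTIVE:
        return False
    # Dead branch: only ACTIVE can reach this point, so this is never True.
    if status == PairStatus.DELETING:
        return False
    return True


def is_sync_due_new(status: PairStatus) -> bool:
    """Same behavior with the unreachable comparison removed."""
    if status != PairStatus.ACTIVE:
        return False
    return True
```

Both versions return the same result for every status, which is why the removal does not change behavior.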

yield RetrievedDriveFile(
    drive_file=file,
    user_email=user_email,
    parent_id=parent_id,
    completion_stage=DriveRetrievalStage.FOLDER_FILES,
)
if found_files:

Contributor

why the move? If it's important, would prefer to add a comment as to why. If not / purely stylistic, ignore

Contributor Author

Previously we marked a folder as traversed if at least one file from it was retrieved without an error; now a folder is only marked as done if ALL of its files are retrieved. With the new system for tracking folder completion (sorting and continuing from the last SEEN folder rather than the last retrieved one), this shouldn't cause us to get stuck, and it should let us handle pathological cases such as many different users each having individual access to files in a "shared folder" that isn't actually fully shared because permissions were revoked.
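
A minimal sketch of the behavior described in this reply, under stated assumptions: `fetch_files` is a hypothetical file lister, `mark_traversed` stands in for `update_traversed_ids_func`, and `PermissionError` stands in for whatever error the real retrieval code handles. The folder is marked as traversed only after the loop over its files finishes without raising, and only if at least one file was found.

```python
from collections.abc import Callable, Iterable, Iterator


def crawl_folder(
    folder_id: str,
    fetch_files: Callable[[str], Iterable[str]],  # hypothetical file lister
    mark_traversed: Callable[[str], None],  # stand-in for update_traversed_ids_func
) -> Iterator[str]:
    """Yield file ids from one folder, marking it done only on full success."""
    found_files = False
    try:
        for file_id in fetch_files(folder_id):
            found_files = True
            yield file_id
        # Reached only when every file was retrieved without an error.
        if found_files:
            mark_traversed(folder_id)
    except PermissionError:
        # Partial retrieval: leave the folder unmarked so a later pass,
        # possibly as a different user, can pick it up again.
        return
```

With this shape, a folder that fails partway through still yields the files retrieved so far but is never recorded as traversed.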

last_processed_folder = folder_id

skipping_seen_folders = last_processed_folder is not None
for folder_id in sorted(filtered_folder_ids):

Contributor

could we merge the if statement and this for loop into one?

Contributor Author

It's probably doable, but imo it's better/easier to read to have separate logic that handles resuming from a checkpoint
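
For readers following this thread, a minimal sketch of resuming from a checkpoint over sorted folder ids. The function name and the `last_seen_folder` parameter are hypothetical, and whether the last seen folder itself is skipped or retried is an assumption (this sketch skips it); the connector's actual loop may differ.

```python
from collections.abc import Iterable, Iterator
from typing import Optional


def folders_to_process(
    folder_ids: Iterable[str], last_seen_folder: Optional[str]
) -> Iterator[str]:
    """Yield folder ids in sorted order, resuming after the last seen one."""
    # Resume handling stays separate from the per-folder work.
    skipping_seen_folders = last_seen_folder is not None
    for folder_id in sorted(folder_ids):
        if skipping_seen_folders:
            # Stop skipping once the last seen folder has been passed.
            skipping_seen_folders = folder_id != last_seen_folder
            continue
        yield folder_id
```

One caveat of this flag-based version: if the last seen id is no longer in the list, every folder is skipped. Comparing ids directly (`folder_id <= last_seen_folder`) would avoid that, at the cost of relying on the sort order.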

@Weves Weves merged commit 17562f9 into main Apr 8, 2025
10 of 11 checks passed
@Weves Weves deleted the id-not-set-in-checkpoint2 branch April 8, 2025 00:00