fix: apply additional_context to Documents in the document path of extract() #458

Open
SuperMarioYL wants to merge 1 commit into google:main from
SuperMarioYL:fix/additional-context-documents-path

@SuperMarioYL

Description

When calling extract() with an iterable of Document objects, the top-level
additional_context parameter was silently ignored.

Root cause: extract() has two code paths:

  • String path (working): wraps the string in Document(additional_context=additional_context), then calls annotate_text().
  • Document path (broken): called annotate_documents() directly, never forwarding additional_context.
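
The difference between the two paths can be sketched as follows. This is a simplified illustration, not the library's code: the `Document` dataclass is a hypothetical stand-in, and `contexts_seen_before_fix` is an invented helper that only reports which context each document would carry into annotation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Union


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    text: str
    additional_context: Optional[str] = None


def contexts_seen_before_fix(
    text_or_documents: Union[str, Iterable[Document]],
    additional_context: Optional[str] = None,
) -> List[Optional[str]]:
    """Returns the context each document would carry into annotation."""
    if isinstance(text_or_documents, str):
        # String path: the global context is embedded in the wrapper Document.
        docs = [Document(text_or_documents, additional_context=additional_context)]
    else:
        # Document path: the global context is never forwarded.
        docs = list(text_or_documents)
    return [doc.additional_context for doc in docs]


string_result = contexts_seen_before_fix("some text", additional_context="ctx")
document_result = contexts_seen_before_fix(
    [Document("some text")], additional_context="ctx"
)
```

In this sketch the string input ends up with the context attached, while the document input loses it, which matches the behaviour reported in #445.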

Fix

When a global additional_context is provided alongside a document iterable,
new Document instances are created for any document that has no per-document
context set. Per-document context always takes precedence over the global
value, and original caller objects are not mutated.

Key implementation notes:

  • The iterable is only materialized when additional_context is not None;
    when it is None, the raw iterable is forwarded unchanged to preserve
    streaming behaviour for large inputs.
  • Copied Documents preserve _document_id and _tokenized_text from the
    originals to avoid regenerating IDs or discarding cached tokenization.
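
A minimal sketch of the behaviour described above, under stated assumptions: the `Document` class is a hypothetical stand-in that models only the fields named in this PR, `prepare_documents` is an invented helper name, and "no per-document context" is assumed to mean `additional_context is None`. The actual diff may differ in structure.

```python
from typing import Any, Iterable, List, Optional, Union


class Document:
    """Hypothetical stand-in; only the fields named in the PR are modeled."""

    def __init__(
        self,
        text: str,
        additional_context: Optional[str] = None,
        document_id: Optional[str] = None,
    ):
        self.text = text
        self.additional_context = additional_context
        self._document_id = document_id
        self._tokenized_text: Any = None  # cache, assumed to be filled elsewhere


def prepare_documents(
    documents: Iterable[Document],
    additional_context: Optional[str],
) -> Union[Iterable[Document], List[Document]]:
    """Applies a global context without mutating the caller's objects."""
    if additional_context is None:
        # No global context: forward the raw iterable so it stays lazy.
        return documents

    prepared = []
    for doc in documents:
        if doc.additional_context is not None:
            # Per-document context takes precedence; pass through unchanged.
            prepared.append(doc)
            continue
        copy = Document(doc.text, additional_context=additional_context)
        # Carry over the ID and any cached tokenization from the original.
        copy._document_id = doc._document_id
        copy._tokenized_text = doc._tokenized_text
        prepared.append(copy)
    return prepared
```

Note that an empty string still counts as a provided global context here, because the check is `is None` rather than a truthiness test.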

Fixes #445

How Has This Been Tested?

Added 7 new unit tests to tests/extract_precedence_test.py covering:

  • Global context applied to documents with no per-document context
  • Per-document context takes precedence over the global value
  • None additional_context leaves documents and iterable unchanged
  • Explicit document IDs are preserved on copied documents
  • Generator/lazy-iterable inputs work correctly
  • Original caller Documents are not mutated
  • Empty-string additional_context is treated as a non-None value
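
For illustration, three of these cases might look roughly like the following. This is a sketch only: it uses the standard `unittest` module, a hypothetical `Document` dataclass, and an invented `apply_global_context` helper rather than the real `extract()` call or the actual contents of tests/extract_precedence_test.py.

```python
import dataclasses
import unittest
from typing import Iterable, Optional


@dataclasses.dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    text: str
    additional_context: Optional[str] = None


def apply_global_context(
    documents: Iterable[Document], additional_context: Optional[str]
):
    """Invented helper mirroring the precedence rule described in this PR."""
    if additional_context is None:
        return documents
    return [
        doc
        if doc.additional_context is not None
        else dataclasses.replace(doc, additional_context=additional_context)
        for doc in documents
    ]


class GlobalContextTest(unittest.TestCase):

    def test_generator_input_receives_global_context(self):
        docs = (Document(t) for t in ["a", "b"])
        result = apply_global_context(docs, "ctx")
        self.assertEqual([d.additional_context for d in result], ["ctx", "ctx"])

    def test_per_document_context_wins(self):
        result = apply_global_context([Document("a", "local")], "global")
        self.assertEqual(result[0].additional_context, "local")

    def test_empty_string_is_not_none_and_original_is_untouched(self):
        original = Document("a")
        result = apply_global_context([original], "")
        self.assertEqual(result[0].additional_context, "")
        self.assertIsNone(original.additional_context)
```

Saved as a module, a file like this could be run with `python -m unittest`.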

All 525 existing tests pass (excluding pre-existing env-dependent plugin test).

Checklist

  • My code follows the style guidelines of this project
  • I have self-reviewed my own code
  • I have made corresponding changes to the documentation (n/a)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • New and existing unit tests passed locally with my changes (pytest tests/)
  • I have run ./autoformat.sh on changed files

@github-actions github-actions Bot added the size/M Pull request with 150-600 lines changed label Apr 21, 2026
@google-cla

google-cla Bot commented Apr 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@github-actions

⚠️ Branch Update Required

Your branch is 1 commit behind main. Please update your branch to ensure CI checks run with the latest code:

git fetch origin main
git merge origin/main
git push

Note: Enable "Allow edits by maintainers" to allow automatic updates.

…tract()

When calling extract() with an iterable of Document objects, the top-level
additional_context parameter was silently ignored. The document code path
called annotate_documents() directly, while the string path correctly embedded
the context in a Document wrapper.

The fix intercepts the document iterable when a global additional_context is
provided. For each Document without its own additional_context, a new Document
is created carrying the global value. Documents that already have per-document
context are passed through unchanged, so per-document context always takes
precedence.

Implementation notes:
- The iterable is only materialized into a list when additional_context is not
  None; the raw iterable is forwarded as-is when no global context is given,
  preserving the previous behavior for large or streaming inputs.
- New Document copies preserve _document_id and _tokenized_text from the
  original to avoid regenerating IDs prematurely or discarding cached
  tokenization.
- Original caller Document objects are not mutated.

Fixes google#445
@SuperMarioYL SuperMarioYL force-pushed the fix/additional-context-documents-path branch from aa088da to 6317471 Compare April 22, 2026 21:13
@aksg87
Collaborator

aksg87 commented Apr 25, 2026

Hi @SuperMarioYL, please complete the CLA at https://cla.developers.google.com/ so we can review this PR. Without cla/google green we can't proceed. If we don't see it complete in the next couple of days, we'll close this out and handle the issue separately. Thanks!

Labels

size/M Pull request with 150-600 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

additional_context parameter not work

2 participants