fix: apply additional_context to Documents in the document path of extract() #458
SuperMarioYL wants to merge 1 commit into google:main
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
…tract()

When calling extract() with an iterable of Document objects, the top-level additional_context parameter was silently ignored. The document code path called annotate_documents() directly, while the string path correctly embedded the context in a Document wrapper.

The fix intercepts the document iterable when a global additional_context is provided. For each Document without its own additional_context, a new Document is created carrying the global value. Documents that already have per-document context are passed through unchanged, so per-document context always takes precedence.

Implementation notes:
- The iterable is only materialized into a list when additional_context is not None; the raw iterable is forwarded as-is when no global context is given, preserving the previous behavior for large or streaming inputs.
- New Document copies preserve _document_id and _tokenized_text from the original to avoid regenerating IDs prematurely or discarding cached tokenization.
- Original caller Document objects are not mutated.

Fixes google#445
Force-pushed: aa088da to 6317471
Hi @SuperMarioYL, please complete the CLA at https://cla.developers.google.com/ so we can review this PR. Without …
Description
When calling `extract()` with an iterable of `Document` objects, the top-level `additional_context` parameter was silently ignored.

Root cause: `extract()` has two code paths:
- String path: embeds the context in `Document(additional_context=additional_context)`, then calls `annotate_text()`.
- Document path: calls `annotate_documents()` directly, never forwarding `additional_context`.

Fix
When a global `additional_context` is provided alongside a document iterable, new `Document` instances are created for any document that has no per-document context set. Per-document context always takes precedence over the global value, and original caller objects are not mutated.
Key implementation notes:
- The iterable is only materialized into a list when `additional_context is not None`; when it is `None`, the raw iterable is forwarded unchanged to preserve streaming behaviour for large inputs.
- New `Document` copies preserve `_document_id` and `_tokenized_text` from the originals to avoid regenerating IDs or discarding cached tokenization.
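
For reviewers who want the precedence rule in one place, here is a minimal, self-contained sketch of the behaviour described above. It is not the code in this PR: the `Document` dataclass is a stand-in for the library's class, `apply_global_context` is a hypothetical helper name, and treating `None` as "no per-document context" is an assumption.

```python
import dataclasses
from typing import Iterable, List, Optional


@dataclasses.dataclass
class Document:
    """Stand-in for the library's Document (assumption, not the real class)."""
    text: str
    additional_context: Optional[str] = None
    document_id: Optional[str] = None           # mirrors _document_id
    tokenized_text: Optional[List[str]] = None  # mirrors _tokenized_text


def apply_global_context(
    documents: Iterable[Document],
    additional_context: Optional[str],
) -> Iterable[Document]:
    """Hypothetical helper: give context-less documents the global context."""
    if additional_context is None:
        # No global context: forward the raw iterable without materializing it.
        return documents
    result = []
    for doc in documents:
        if doc.additional_context is not None:
            # Per-document context takes precedence; pass the original through.
            result.append(doc)
        else:
            # Copy instead of mutating the caller's object. replace() carries
            # over the ID and cached tokenization along with the text.
            result.append(
                dataclasses.replace(doc, additional_context=additional_context)
            )
    return result


docs = [
    Document("first", document_id="d1", tokenized_text=["first"]),
    Document("second", additional_context="local"),
]
out = apply_global_context(docs, "global")
```

In this sketch `out[0]` is a new object carrying `"global"` together with the original ID and tokens, `out[1]` is the caller's own object, and `docs[0]` is left untouched. The real class may need its private fields copied explicitly rather than through `dataclasses.replace`.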
Fixes #445
How Has This Been Tested?
Added 7 new unit tests to `tests/extract_precedence_test.py` covering:
- `None` `additional_context` leaves documents and iterable unchanged

All 525 existing tests pass (excluding pre-existing env-dependent plugin test).
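
As an illustration of the kind of checks involved, the sketch below tests the behaviour from the description against a stand-in. These are not the tests added in this PR: `Document`, `apply_global_context`, and the test names are hypothetical, and the cases are inferred from the description rather than copied from `tests/extract_precedence_test.py`.

```python
import dataclasses
import unittest
from typing import Optional


@dataclasses.dataclass
class Document:
    """Stand-in for the library's Document (assumption, not the real class)."""
    text: str
    additional_context: Optional[str] = None


def apply_global_context(documents, additional_context):
    """Hypothetical helper mirroring the behaviour described in this PR."""
    if additional_context is None:
        return documents
    return [
        d if d.additional_context is not None
        else dataclasses.replace(d, additional_context=additional_context)
        for d in documents
    ]


class GlobalContextTest(unittest.TestCase):

    def test_none_context_forwards_iterable_unchanged(self):
        docs = (Document(t) for t in ["a", "b"])
        self.assertIs(apply_global_context(docs, None), docs)
        # The generator must not have been consumed by the call.
        self.assertEqual([d.text for d in docs], ["a", "b"])

    def test_per_document_context_takes_precedence(self):
        local = Document("a", additional_context="local")
        out = apply_global_context([local, Document("b")], "global")
        self.assertIs(out[0], local)
        self.assertEqual(out[1].additional_context, "global")

    def test_original_documents_are_not_mutated(self):
        original = Document("a")
        apply_global_context([original], "global")
        self.assertIsNone(original.additional_context)


# Run the suite programmatically so the script does not call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GlobalContextTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first case uses a generator on purpose: if the helper materialized the input when no global context is given, the follow-up iteration would come back empty.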
Checklist
- … (`pytest tests/`)
- … `./autoformat.sh` on changed files