Commit 15253a5

Authored by: qued, ryannikolaidis, codeflash-ai[bot], luke-kucing, claude
feat: track text source (#4112)
The purpose of this PR is to use the newly created `is_extracted` parameter in `TextRegion` (and the corresponding vector version `is_extracted_array` in `TextRegions`), flagging elements that were extracted directly from PDFs as such. This also involved:

- New tests
- A version update to bring in the new `unstructured-inference`
- An ingest fixtures update
- An optimization from Codeflash that's not directly related

One important thing to review is that all avenues by which an element is extracted and ends up in the output of a partition are covered... fast, hi_res, etc.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: luke-kucing <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: qued <[email protected]>
1 parent b01d35b commit 15253a5
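As a quick orientation before the diffs: the PR threads a provenance flag from inference-side text regions into element metadata. Below is a minimal, self-contained sketch of that idea, using hypothetical stand-ins for the real `IsExtracted` enum and `TextRegion` class from `unstructured-inference` (the string values and field names are assumptions, not the library's actual definitions):

```python
from dataclasses import dataclass
from enum import Enum


class IsExtracted(Enum):
    # Hypothetical stand-in for unstructured_inference.constants.IsExtracted
    TRUE = "true"
    FALSE = "false"


@dataclass
class TextRegion:
    # Simplified stand-in for the inference library's TextRegion
    text: str
    is_extracted: IsExtracted = IsExtracted.FALSE


# Text lifted directly from the PDF text layer is flagged TRUE;
# OCR- or model-derived text stays FALSE.
regions = [
    TextRegion("Embedded PDF text", is_extracted=IsExtracted.TRUE),
    TextRegion("OCR'd text"),
]
extracted = [r for r in regions if r.is_extracted is IsExtracted.TRUE]
```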

File tree

13 files changed: +318, -18 lines changed


CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,7 +1,17 @@
+## 0.18.19-dev0
+
+### Enhancement
+- Flag extracted elements as such in the metadata for downstream use
+
+### Features
+
+### Fixes
+
 ## 0.18.18
 
 ### Fixes
 - **Prevent path traversal in email MSG attachment filenames** Fixed a security vulnerability (GHSA-gm8q-m8mv-jj5m) where malicious attachment filenames containing path traversal sequences could write files outside the intended directory. The fix normalizes both Unix and Windows path separators before sanitizing filenames, preventing cross-platform path traversal attacks in `partition_msg` functions
+
 ## 0.18.17
 
 ### Enhancement

requirements/extra-pdf-image.in

Lines changed: 1 addition & 1 deletion
@@ -12,5 +12,5 @@ google-cloud-vision
 effdet
 # Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
-unstructured-inference>=1.0.5
+unstructured-inference>=1.1.1
 unstructured.pytesseract>=0.3.12

requirements/extra-pdf-image.txt

Lines changed: 1 addition & 1 deletion
@@ -283,7 +283,7 @@ typing-extensions==4.15.0
     #   torch
 tzdata==2025.2
     # via pandas
-unstructured-inference==1.0.5
+unstructured-inference==1.1.1
     # via -r ./extra-pdf-image.in
 unstructured-pytesseract==0.3.15
     # via -r ./extra-pdf-image.in

requirements/extra-xlsx.txt

Lines changed: 5 additions & 5 deletions
@@ -15,9 +15,9 @@ cryptography==46.0.3
 et-xmlfile==2.0.0
     # via openpyxl
 msoffcrypto-tool==5.4.2
-    # via -r ./extra-xlsx.in
+    # via -r extra-xlsx.in
 networkx==3.4.2
-    # via -r ./extra-xlsx.in
+    # via -r extra-xlsx.in
 numpy==2.2.6
     # via
     #   -c /Users/luke/git/unstructured/requirements/base.txt
@@ -27,9 +27,9 @@ olefile==0.47
     #   -c /Users/luke/git/unstructured/requirements/base.txt
     #   msoffcrypto-tool
 openpyxl==3.1.5
-    # via -r ./extra-xlsx.in
+    # via -r extra-xlsx.in
 pandas==2.3.3
-    # via -r ./extra-xlsx.in
+    # via -r extra-xlsx.in
 pycparser==2.23
     # via
     #   -c /Users/luke/git/unstructured/requirements/base.txt
@@ -51,4 +51,4 @@ typing-extensions==4.15.0
 tzdata==2025.2
     # via pandas
 xlrd==2.0.2
-    # via -r ./extra-xlsx.in
+    # via -r extra-xlsx.in

test_unstructured/partition/common/test_common.py

Lines changed: 21 additions & 0 deletions
@@ -4,6 +4,7 @@
 import numpy as np
 import pytest
 from PIL import Image
+from unstructured_inference.constants import IsExtracted
 from unstructured_inference.inference import layout
 from unstructured_inference.inference.elements import TextRegion
 from unstructured_inference.inference.layoutelement import LayoutElement
@@ -445,3 +446,23 @@ def test_ocr_data_to_elements():
         points=layout_el.bbox.coordinates,
         system=coordinate_system,
     )
+
+
+def test_normalize_layout_element_layout_element_text_source_metadata():
+    layout_element = LayoutElement.from_coords(
+        type="NarrativeText",
+        x1=1,
+        y1=2,
+        x2=3,
+        y2=4,
+        text="Some lovely text",
+        is_extracted=IsExtracted.TRUE,
+    )
+    coordinate_system = PixelSpace(width=10, height=20)
+    element = common.normalize_layout_element(
+        layout_element,
+        coordinate_system=coordinate_system,
+    )
+    assert hasattr(element, "metadata")
+    assert hasattr(element.metadata, "is_extracted")
+    assert element.metadata.is_extracted == "true"
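Note that the test compares against the string `"true"` rather than the enum member: the flag is flattened to a plain string when it lands in element metadata. A minimal sketch of that flattening, using a hypothetical mirror of `unstructured_inference.constants.IsExtracted` (the string values here are an assumption inferred from the test's expectations):

```python
from enum import Enum


class IsExtracted(Enum):
    # Hypothetical mirror of unstructured_inference.constants.IsExtracted;
    # the values are assumed from the "true" comparison in the test above.
    TRUE = "true"
    FALSE = "false"


def to_metadata_value(flag: IsExtracted) -> str:
    # Element metadata stores plain strings, so the enum member is
    # flattened to its value when copied onto the element.
    return flag.value
```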
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+from PIL import Image
+from unstructured_inference.constants import IsExtracted
+from unstructured_inference.inference.elements import Rectangle
+from unstructured_inference.inference.layout import DocumentLayout, PageLayout
+from unstructured_inference.inference.layoutelement import LayoutElement, LayoutElements
+
+from unstructured.partition.pdf_image.pdfminer_processing import (
+    merge_inferred_with_extracted_layout,
+)
+
+
+def test_text_source_preserved_during_merge():
+    """Test that text_source property is preserved when elements are merged."""
+
+    # Create two simple LayoutElements with different text_source values
+    inferred_element = LayoutElement(
+        bbox=Rectangle(0, 0, 100, 50), text=None, is_extracted=IsExtracted.FALSE
+    )
+
+    extracted_element = LayoutElement(
+        bbox=Rectangle(0, 0, 100, 50), text="Extracted text", is_extracted=IsExtracted.TRUE
+    )
+
+    # Create LayoutElements arrays
+    inferred_layout_elements = LayoutElements.from_list([inferred_element])
+    extracted_layout_elements = LayoutElements.from_list([extracted_element])
+
+    # Create a PageLayout for the inferred layout
+    image = Image.new("RGB", (200, 200))
+    inferred_page = PageLayout(number=1, image=image)
+    inferred_page.elements_array = inferred_layout_elements
+
+    # Create DocumentLayout from the PageLayout
+    inferred_document_layout = DocumentLayout(pages=[inferred_page])
+
+    # Merge them
+    merged_layout = merge_inferred_with_extracted_layout(
+        inferred_document_layout=inferred_document_layout,
+        extracted_layout=[extracted_layout_elements],
+        hi_res_model_name="test_model",
+    )
+
+    # Verify text_source is preserved
+    # Check the merged page's elements_array
+    merged_page = merged_layout.pages[0]
+    assert "Extracted text" in merged_page.elements_array.texts
+    assert hasattr(merged_page.elements_array, "is_extracted_array")
+    assert IsExtracted.TRUE in merged_page.elements_array.is_extracted_array
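The test above depends on the merged `LayoutElements` carrying a per-element `is_extracted_array`. Here is a minimal numpy sketch of the invariant being checked: when two element arrays are combined, their flag arrays must be combined in the same order so each element keeps its own provenance (the array names below are illustrative, not the library's internals):

```python
import numpy as np

# Illustrative flag arrays: one element inferred by the layout model,
# one element taken from the PDF text layer.
inferred_flags = np.array(["false"])
extracted_flags = np.array(["true"])

# A merge that concatenates the element arrays must concatenate the
# flag arrays identically, element-for-element.
merged_flags = np.concatenate([inferred_flags, extracted_flags])
```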

test_unstructured/partition/pdf_image/test_pdfminer_processing.py

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,7 @@
 import pytest
 from pdfminer.layout import LAParams
 from PIL import Image
+from unstructured_inference.constants import IsExtracted
 from unstructured_inference.constants import Source as InferenceSource
 from unstructured_inference.inference.elements import (
     EmbeddedTextRegion,
@@ -249,6 +250,11 @@ def test_process_file_with_pdfminer():
     assert links[0][0]["url"] == "https://layout-parser.github.io"
 
 
+def test_process_file_with_pdfminer_is_extracted_array():
+    layout, _ = process_file_with_pdfminer(example_doc_path("pdf/layout-parser-paper-fast.pdf"))
+    assert all(is_extracted is IsExtracted.TRUE for is_extracted in layout[0].is_extracted_array)
+
+
 @patch("unstructured.partition.pdf_image.pdfminer_utils.LAParams", return_value=LAParams())
 def test_laprams_are_passed_from_partition_to_pdfminer(pdfminer_mock):
     partition(
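The new test above encodes an invariant of the pdfminer path: pdfminer reads only the PDF's embedded text layer, so every region it produces should be flagged as extracted. A self-contained sketch of that invariant (the helper below is hypothetical, not part of the library):

```python
def flags_for_pdfminer_page(texts: list[str]) -> list[str]:
    # pdfminer yields only embedded (extracted) text, never OCR output,
    # so the per-page flag array is uniformly "true".
    return ["true" for _ in texts]


page_flags = flags_for_pdfminer_page(["line one", "line two"])
```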

test_unstructured_ingest/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json

Lines changed: 26 additions & 0 deletions
@@ -4,6 +4,7 @@
     "element_id": "1e41f20785644cdea2f017cfb67bb359",
     "text": "Core Skills for Biomedical Data Scientists",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -26,6 +27,7 @@
     "element_id": "c915a2a57c901810a698491ca2393669",
     "text": "Maryam Zaringhalam, PhD, AAAS Science & Technology Policy Fellow",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -48,6 +50,7 @@
     "element_id": "b24c3f8d268b2f834a00966d8faef975",
     "text": "Lisa Federer, MLIS, Data Science Training Coordinator",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -70,6 +73,7 @@
     "element_id": "fcff333f886b39cee0a7084a9ff9204d",
     "text": "Michael F. Huerta, PhD, Associate Director of NLM for Program Development and NLM Coordinator of Data Science and Open Science Initiatives",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -92,6 +96,7 @@
     "element_id": "1b86fad341db35208d75a543bcf819ae",
     "text": "Executive Summary",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -114,6 +119,7 @@
     "element_id": "fee71d4f7ef7a5f253a44f6df648d12a",
     "text": "This report provides recommendations for a minimal set of core skills for biomedical data scientists based on analysis that draws on opinions of data scientists, curricula for existing biomedical data science programs, and requirements for biomedical data science jobs. Suggested high-level core skills include:",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -136,6 +142,7 @@
     "element_id": "caa3c2eba90fedb7c8923ae8cd8de961",
     "text": "1. General biomedical subject matter knowledge: biomedical data scientists should have a general working knowledge of the principles of biology, bioinformatics, and basic clinical science;",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -158,6 +165,7 @@
     "element_id": "a4622e6575ee04b0c4d74c0c6b3b2452",
     "text": "2. Programming language expertise: biomedical data scientists should be fluent in at least one programming language (typically R and/or Python);",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -180,6 +188,7 @@
     "element_id": "206899164b194bb9c379531b35eae01b",
     "text": "3. Predictive analytics, modeling, and machine learning: while a range of statistical methods may be useful, predictive analytics, modeling, and machine learning emerged as especially important skills in biomedical data science;",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -202,6 +211,7 @@
     "element_id": "36eb8f3c3778fbb71dc056571e71175d",
     "text": "4. Team science and scientific communication: “soft” skills, like the ability to work well on teams and communicate effectively in both verbal and written venues, may be as important as the more technical skills typically associated with data science.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -224,6 +234,7 @@
     "element_id": "afe37b1ec10a6d08294ff0fb6df79996",
     "text": "5. Responsible data stewardship: a successful data scientist must be able to implement best practices for data management and stewardship, as well as conduct research in an ethical manner that maintains data security and privacy.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -246,6 +257,7 @@
     "element_id": "b29f66200f2cc9ff2b49f3d07fd8022b",
     "text": "The report further details specific skills and expertise relevant to biomedical data scientists.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -268,6 +280,7 @@
     "element_id": "bab05a183c34df666bfc920f04d17637",
     "text": "Motivation",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -290,6 +303,7 @@
     "element_id": "f250e86931949c66fe99d742fd9be29c",
     "text": "Training a biomedical data science (BDS) workforce is a central theme in NLM’s Strategic Plan for the coming decade. That commitment is echoed in the NIH-wide Big Data to Knowledge (BD2K) initiative, which invested $61 million between FY2014 and FY2017 in training programs for the development and use of biomedical big data science methods and tools. In line with",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -312,6 +326,7 @@
     "element_id": "9aa82368657b60536f152fd413aec316",
     "text": "Core Skills for Biomedical Data Scientists",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -334,6 +349,7 @@
     "element_id": "4f2dbe3656a9ebc60c7e3426ad3cb3e3",
     "text": "_____________________________________________________________________________________________",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -356,6 +372,7 @@
     "element_id": "cd359ae8c49885ead47318021438eead",
     "text": "this commitment, a recent report to the NLM Director recommended working across NIH to identify and develop core skills required of a biomedical data scientist to consistency across the cohort of NIH-trained data scientists. This report provides a set of recommended core skills based on analysis of current BD2K-funded training programs, biomedical data science job ads, and practicing members of the current data science workforce.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -378,6 +395,7 @@
     "element_id": "bf8321a34edb7103ec4209f3e4a8a8da",
     "text": "Methodology",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -400,6 +418,7 @@
     "element_id": "1e1d3d1a5c1397fc588393568d829bc8",
     "text": "The Workforce Excellence team took a three-pronged approach to identifying core skills required of a biomedical data scientist (BDS), drawing from:",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -422,6 +441,7 @@
     "element_id": "45d7ff56632d66a2ab2d4dd2716d4d2e",
     "text": "a) Responses to a 2017 Kaggle1 survey2 of over 16,000 self-identified data scientists working across many industries. Analysis of the Kaggle survey responses from the current data science workforce provided insights into the current generation of data scientists, including how they were trained and what programming and analysis skills they use.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -444,6 +464,7 @@
     "element_id": "bf452aac5123fcedda30dd6ed179f41c",
     "text": "b) Data science skills taught in BD2K-funded training programs. A qualitative content analysis was applied to the descriptions of required courses offered under the 12 BD2K-funded training programs. Each course was coded using qualitative data analysis software, with each skill that was present in the description counted once. The coding schema of data science-related skills was inductively developed and was organized into four major categories: (1) statistics and math skills; (2) computer science; (3) subject knowledge; (4) general skills, like communication and teamwork. The coding schema is detailed in Appendix A.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -466,6 +487,7 @@
     "element_id": "ca176cbef532792b1f11830ff7520587",
     "text": "c) Desired skills identified from data science-related job ads. 59 job ads from government (8.5%), academia (42.4%), industry (33.9%), and the nonprofit sector (15.3%) were sampled from websites like Glassdoor, Linkedin, and Ziprecruiter. The content analysis methodology and coding schema utilized in analyzing the training programs were applied to the job descriptions. Because many job ads mentioned the same skill more than once, each occurrence of the skill was coded, therefore weighting important skills that were mentioned multiple times in a single ad.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -488,6 +510,7 @@
     "element_id": "11b170fedd889c3b895bbd28acd811ca",
     "text": "Analysis of the above data provided insights into the current state of biomedical data science training, as well as a view into data science-related skills likely to be needed to prepare the BDS workforce to succeed in the future. Together, these analyses informed recommendations for core skills necessary for a competitive biomedical data scientist.",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -510,6 +533,7 @@
     "element_id": "2665aadf75bca259f1f5b4c91a53a301",
     "text": "1 Kaggle is an online community for data scientists, serving as a platform for collaboration, competition, and learning: http://kaggle.com",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -532,6 +556,7 @@
     "element_id": "8bbfe1c3e6bca9a33226d20d69b2297a",
     "text": "2 In August 2017, Kaggle conducted an industry-wide survey to gain a clearer picture of the state of data science and machine learning. A standard set of questions were asked of all respondents, with more specific questions related to work for employed data scientists and questions related to learning for data scientists in training. Methodology and results: https://www.kaggle.com/kaggle/kaggle-survey-2017",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
@@ -554,6 +579,7 @@
     "element_id": "dd4a661e1a3c898a5cf6328ba56b924d",
     "text": "2",
     "metadata": {
+      "is_extracted": "true",
       "filetype": "application/pdf",
       "languages": [
         "eng"
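The fixture diff above shows the practical payoff: serialized elements now carry `"is_extracted"` in their metadata, so downstream consumers can separate text lifted from the PDF text layer from text produced by OCR or inference. A minimal sketch with inline sample data (the second, OCR-style element is invented here for contrast and is not part of the fixture):

```python
import json

doc = json.loads("""
[
  {"text": "Core Skills for Biomedical Data Scientists",
   "metadata": {"is_extracted": "true", "filetype": "application/pdf"}},
  {"text": "hypothetical OCR-only caption",
   "metadata": {"filetype": "application/pdf"}}
]
""")

# Elements without the flag (older output, or non-extracted text) simply
# fail the filter rather than raising.
extracted_texts = [
    el["text"] for el in doc if el["metadata"].get("is_extracted") == "true"
]
```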
