Description
Describe the bug
I ran a financial statement PDF through partitioning to extract its elements, but the tables are not identified properly. Sometimes the last few rows are missing, and sometimes a few columns are missing altogether. The first few rows that make up the table header are also missing from the chunk, which leaves the RAG system unable to find the right answer.
To Reproduce
Here is the Python code used:
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename=filename,  # path to the PDF linked below
    # Unstructured helpers
    chunking_strategy="by_title",
    strategy="hi_res",
    infer_table_structure=True,
    max_partition=3000,
    max_characters=3000,
    # overlap_all=True
)
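To find which chunks carry a table (such as `elements[13]` below), a small helper along these lines may be useful. This is only a sketch: `table_chunk_indices` is a hypothetical name, and it assumes nothing beyond each element exposing `metadata.text_as_html`, which is the attribute inspected in this report. The demo uses stand-in objects so it runs without the library installed.

```python
from types import SimpleNamespace


def table_chunk_indices(elements):
    """Return the indices of elements whose metadata carries an HTML table.

    Hypothetical helper: it only relies on `metadata.text_as_html`
    being set (and non-empty) on chunks that contain a table.
    """
    indices = []
    for i, element in enumerate(elements):
        metadata = getattr(element, "metadata", None)
        if getattr(metadata, "text_as_html", None):
            indices.append(i)
    return indices


# Stand-in objects (not real library elements) so the sketch is runnable.
demo_elements = [
    SimpleNamespace(
        text="Intro paragraph",
        metadata=SimpleNamespace(text_as_html=None),
    ),
    SimpleNamespace(
        text="Row A 1 2",
        metadata=SimpleNamespace(
            text_as_html="<table><tr><td>Row A</td><td>1</td><td>2</td></tr></table>"
        ),
    ),
]

print(table_chunk_indices(demo_elements))
```

With the real output, the same function could be called on the `elements` list returned by `partition_pdf`.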
I used this file: https://investors.intuit.com/_assets/_a21b3f6dd5cf08cb659458f26330acce/intuit/news/2024-08-22_Intuit_Reports_Strong_Fourth_Quarter_and_Full_1202.pdf
Look at some of the complex tables in it and compare the following outputs:
str(elements[13].metadata.text_as_html)
'
July 31, 2024 | July 31, 2023 | July 31, 2024 | July 31, 2023 | |||||
st revenue: | ||||||||
Service | $ | 2,670 | $ | 2,340 | $ | 13,861 | $ | 12,317 |
Product and other | 514 | 372 | 2,424 | 2,051 | ||||
Total net revenue | 3,184 | —~=*«‘T12 | 16,285 | «14,368. | ||||
sts and expenses: | ||||||||
Cost of revenue: | ||||||||
Cost of service revenue | 733 | 656 | 3,250 | 2,908 | ||||
Cost of product and other revenue | 14 | 16 | 69 | 72 | ||||
Amortization of acquired technology | 36 | 41 | 146 | 163 | ||||
Selling and marketing | 1,104 | 840 | 4,312 | 3,762 | ||||
Research and development | 725 | 680 | 2,754 | 2,539 | ||||
General and administrative | 377 | 341 | 1,418 | 1,300 | ||||
Amortization of other acquired | ||||||||
intangible assets | 123 | 121 | 483 | 483 | ||||
Restructuring | 223 | — | 223 | — | ||||
Total costs and expenses [A] | 3,335 | 2,695 | 12,655 | 11,227 | ||||
Operating income (loss) | (151) | 17 | 3,630 | 3,141 | ||||
erest expense | (60) | (68) | (242) | (248) | ||||
erest and other income, net | 71 | 46 | 162 | 96 | ||||
some (loss) before income taxes | (140) | (5) | 3,550 | 2,989 | ||||
come tax provision (benefit) [B] | (120) | (94) | 587 | 605 | ||||
st income (loss) | $ | (20) | $ | 89 | $ | 2,963 | $ | 2,384 |
isic net income (loss) per share | $ | (0.07) | $ | 0.32 | $ | 10.58 | $ | 8.49 |
lares used in basic per share Iculations | 280 | 280 | 280 | 281 |
str(elements[13].text)
'Three Months Ended Twelve Months Ended July 31, July 31, July 31, July 31, (In millions) 2024 2023 2024 2023 Cost of revenue $ 102 $ 83 $ 402 $ 374 Selling and marketing 137 119 506 Research and development 161 148 639 General and administrative 94 98 368 Restructuring 25 — 25 Total share-based compensation expense $ 519 $ 448 $ 1,940 $'
- The table content (mainly in the text form) is missing some rows or cells and does not match the actual table.
- The top header row, "Three Months Ended", is missing entirely from the HTML table element.
- The first few characters of each row label are cut off in the HTML table (for example, "erest expense" where "Interest expense" is expected).
- The text content does not match the HTML content at all, and the text version drops one column completely.
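The mismatch between the two representations can be made measurable. The sketch below is one possible check, not part of the library: it collects every cell from the HTML with the standard-library `html.parser` and reports the cells whose tokens cannot be found in the plain text. `cells_missing_from_text` is a hypothetical name, and the demo values are synthetic rather than taken from the PDF.

```python
from collections import Counter
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collect the text of every non-empty <td>/<th> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            # Collapse internal whitespace and skip empty cells.
            cell = " ".join("".join(self._buffer).split())
            if cell:
                self.cells.append(cell)

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def cells_missing_from_text(html, text):
    """Return HTML cells whose tokens are not all present in `text`.

    Tokens are consumed as they are matched, so a value that appears
    twice in the table must also appear twice in the text.
    """
    parser = _CellCollector()
    parser.feed(html)
    parser.close()
    remaining = Counter(text.split())
    missing = []
    for cell in parser.cells:
        needed = Counter(cell.split())
        if all(remaining[token] >= n for token, n in needed.items()):
            remaining -= needed
        else:
            missing.append(cell)
    return missing


# Synthetic example: the text version lacks the last column.
demo_html = (
    "<table><tr><td>Row A</td><td>1</td><td>2</td>"
    "<td>3</td><td>4</td></tr></table>"
)
demo_text = "Row A 1 2 3"
print(cells_missing_from_text(demo_html, demo_text))
```

On real chunks this would be called as `cells_missing_from_text(el.metadata.text_as_html, el.text)`. The token matching is deliberately coarse and ignores cell order, so it flags dropped cells but not reordered ones.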
Expected behavior
- There should be no difference between the text and HTML versions of chunks that contain table elements. It is not clear which one is more reliable for embedding.
- The table header is missing from the table element. In addition, the description of the table falls into the previous element, so the table loses its context.
- In some cases the table is also split between chunks, so the partial table loses the context of the parent table.
Environment Info
PyTorch version: 2.5.1+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Quadro T1000 with Max-Q Design
Nvidia driver version: 538.18
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=2310
DeviceID=CPU0
Family=198
L2CacheSize=1536
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2712
Name=Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
ProcessorType=3
Revision=
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.1
[pip3] onnxruntime==1.19.2
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[conda] Could not collect