bug/pdf-extraction-bug

**Describe the bug**
`partition_pdf` or `partition(text.. )` works for docx and txt files; however, some PDFs, especially academic papers, are not parsed well.

**To Reproduce**
`elements = self._partition(**self._build_partition_kwargs(loaded))`
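The line above goes through a project-internal wrapper, so a standalone reproduction may be easier to run. The following is a minimal sketch, assuming the wrapper ends up calling `partition_pdf` from `unstructured.partition.pdf`. The helper names `summarize_elements` and `partition_and_summarize` are hypothetical, and the `strategy` default is an assumption, not what the original code necessarily uses.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by category and collect their non-empty text."""
    # Fall back to the class name if an element has no `category` attribute.
    counts = Counter(
        getattr(el, "category", type(el).__name__) for el in elements
    )
    # Skip elements whose text is missing or only whitespace.
    texts = [
        el.text for el in elements if (getattr(el, "text", "") or "").strip()
    ]
    return counts, texts


def partition_and_summarize(pdf_path, strategy="fast"):
    """Partition a PDF and summarize the result (hypothetical helper)."""
    # Imported lazily so summarize_elements works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    # Assumption: "fast" is only a starting point; "hi_res" or "ocr_only"
    # may behave differently on two-column academic papers.
    elements = partition_pdf(filename=pdf_path, strategy=strategy)
    return summarize_elements(elements)
```

Calling `partition_and_summarize("2025.findings-naacl.114.pdf")` on the attached file and comparing the counts across strategies might help narrow down where the extraction degrades.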

**Expected behavior**
Should return plain text in the `elements` list, along with some metadata.

**Screenshots**

<img width="1294" height="854" alt="Image" src="https://github.com/user-attachments/assets/014a7452-55f3-4ee3-9c3e-c4b0d3965a92" />

**Environment Info**

```
name = "unstructured"
version = "0.17.2"
description = "A library that prepares raw documents for downstream ML tasks."
optional = false
python-versions = ">=3.9.0"
groups = ["main"]
files = [
    {file = "unstructured-0.17.2-py3-none-any.whl", hash = ..
    {file = "unstructured-0.17.2.tar.gz", hash = ..
```

**Additional context**
I am attaching an example PDF that does not parse correctly.

[2025.findings-naacl.114.pdf](https://github.com/user-attachments/files/22782439/2025.findings-naacl.114.pdf)


bug/pdf-extraction-bug #4104

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

bug/pdf-extraction-bug #4104

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions