Some PDFs can be run, but others cannot. What could be the reasons? #172

Open
@pengapple-ps

🐛 Describe the bug

Processing /workplace/olmocr/tests/test/some_ocr1.pdf reports "No document text for /workplace/olmocr/tests/test/some_ocr1.pdf", but small_page_size.pdf and slideshow_mostly_good_some_pages_should_get_filtered.pdf run without this problem.
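
To help narrow this down, here is a minimal diagnostic sketch that checks whether each page of a PDF has an extractable text layer. It assumes (this is not confirmed by the report) that the message is related to pages without embedded text. It uses pypdf, which appears in the version list below. The helper names `load_page_texts` and `summarize_text_layer` are hypothetical and are not part of olmocr.

```python
from typing import Dict, List


def load_page_texts(pdf_path: str) -> List[str]:
    """Return the extracted text of each page (empty string if none)."""
    # Imported lazily so the summary helper below works without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def summarize_text_layer(page_texts: List[str], min_chars: int = 1) -> Dict[str, object]:
    """Report which pages have fewer than `min_chars` non-whitespace characters."""
    empty_pages = [
        index + 1  # 1-based page numbers
        for index, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars
    ]
    return {
        "pages": len(page_texts),
        "empty_pages": empty_pages,
        # True if at least one page yielded some text.
        "has_text_layer": len(empty_pages) < len(page_texts),
    }
```

Calling `summarize_text_layer(load_page_texts("/workplace/olmocr/tests/test/some_ocr1.pdf"))` and comparing the result with the same call on the two PDFs that work could show whether the failing file is a scan without embedded text.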

Versions

aiohappyeyeballs 2.6.1
aiohttp 3.11.16
aiosignal 1.3.2
annotated-types 0.7.0
anthropic 0.49.0
anyio 4.9.0
asttokens 3.0.0
attrs 25.3.0
beaker-py 1.34.1
bleach 6.2.0
boto3 1.37.33
botocore 1.37.33
cached_path 1.7.1
cachetools 5.5.2
certifi 2025.1.31
cffi 1.17.1
chardet 5.2.0
charset-normalizer 3.4.1
click 8.1.8
cloudpickle 3.1.1
compressed-tensors 0.8.0
cryptography 44.0.2
cuda-bindings 12.8.0
cuda-python 12.8.0
datasets 3.5.0
decorator 5.2.1
decord 0.6.0
Deprecated 1.2.18
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
docker 7.1.0
einops 0.8.1
executing 2.2.0
fastapi 0.115.12
filelock 3.18.0
flashinfer 0.1.6+cu124torch2.4
flashinfer-python 0.2.3+cu124torch2.5
frozenlist 1.5.0
fsspec 2024.12.0
ftfy 6.3.1
gguf 0.10.0
google-api-core 2.24.2
google-auth 2.38.0
google-cloud-core 2.4.3
google-cloud-storage 2.19.0
google-crc32c 1.7.1
google-resumable-media 2.7.2
googleapis-common-protos 1.69.2
h11 0.14.0
hf_transfer 0.1.9
httpcore 1.0.8
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.30.2
idna 3.10
img2pdf 0.6.0
importlib_metadata 8.6.1
interegular 0.3.3
ipython 9.1.0
ipython_pygments_lexers 1.1.1
jedi 0.19.2
Jinja2 3.1.6
jiter 0.9.0
jmespath 1.0.1
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
lingua-language-detector 2.1.0
litellm 1.66.0
llguidance 0.7.14
llvmlite 0.44.0
lm-format-enforcer 0.10.11
lxml 5.3.2
markdown-it-py 3.0.0
markdown2 2.5.3
MarkupSafe 3.0.2
matplotlib-inline 0.1.7
mdurl 0.1.2
mistral_common 1.5.4
modelscope 1.25.0
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.19.0
multidict 6.4.3
multiprocess 0.70.16
nanobind 2.6.1
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1.4
numba 0.61.2
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
olmocr 0.1.60 /workplace/olmocr
openai 1.73.0
opencv-python-headless 4.11.0.86
orjson 3.10.16
outlines 0.0.46
packaging 24.2
pandas 2.2.3
parso 0.8.4
partial-json-parser 0.2.1.1.post5
pdf2image 1.17.0
pdfminer.six 20250327
pdfplumber 0.11.6
pexpect 4.9.0
pikepdf 9.7.0
pillow 11.2.1
pip 25.0
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.50
propcache 0.3.1
proto-plus 1.26.1
protobuf 6.30.2
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 19.0.1
pyasn1 0.6.1
pyasn1_modules 0.4.2
pycountry 24.6.1
pycparser 2.22
pydantic 2.11.3
pydantic_core 2.33.1
Pygments 2.19.1
pynvml 12.0.0
pypdf 5.4.0
PyPDF2 3.0.1
pypdfium2 4.30.1
pytesseract 0.3.13
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
pyzmq 26.4.0
ray 2.44.1
referencing 0.36.2
regex 2024.11.6
reportlab 4.3.1
requests 2.32.3
rich 13.9.4
rpds-py 0.24.0
rsa 4.9
s3transfer 0.11.4
safetensors 0.5.3
sentencepiece 0.2.0
setproctitle 1.3.5
setuptools 75.8.0
sgl-kernel 0.0.8
sglang 0.4.5
six 1.17.0
smart-open 7.1.0
sniffio 1.3.1
soundfile 0.13.1
stack-data 0.6.3
starlette 0.46.2
sympy 1.13.1
tiktoken 0.9.0
tokenizers 0.21.1
torch 2.5.1
torchao 0.10.0
torchvision 0.20.1
tqdm 4.67.1
traitlets 5.14.3
transformers 4.51.1
triton 3.1.0
typing_extensions 4.13.2
typing-inspection 0.4.0
tzdata 2025.2
urllib3 2.4.0
uvicorn 0.34.1
uvloop 0.21.0
vllm 0.6.4.post1
watchfiles 1.0.5
wcwidth 0.2.13
webencodings 0.5.1
websockets 15.0.1
wheel 0.45.1
wrapt 1.17.2
xformers 0.0.28.post3
xgrammar 0.1.17
xxhash 3.5.0
yarl 1.19.0
zipp 3.21.0
zstandard 0.23.0

Labels: bug (Something isn't working)
