Replace pdftotext with Docling #523

jborsky · 2025-10-22T07:14:50Z

This pull request implements a basic initial replacement of pdftotext with Docling.

I am unsure of a few things about whether it's the ideal solution:

- I removed the use of the convert_garbage attribute. I just kept it in the internal states for backward compatibility but removed it from serialization and schemas. I don't know if this attribute should be kept.
- I created a new PdfConverter abstract class to allow different future PDF converter implementations and a DoclingConverter class that implements this simple interface. It lives right now in the pdf utils. Maybe it would make sense to create a new file for these definitions?
- The DoclingConverter class has a pipeline setup for the conversion. There are a bunch of options to configure. Should I keep them defined there or move relevant options to the app configuration? If the latter, it may be better to define the whole pipeline there, because the pipeline itself can be changed for a different Docling one or you can create your custom one or just add/replace/remove models or stages.
- I delegated parallelism to Docling and set max_workers=1 in the process_parallel call to avoid excessive threads, while keeping the nice progress bar.

Replace pdftotext + OCR logic with Docling. Add JSON serialization of DoclingDocument alongside text output.

Update CCCertificate to handle the new return value of convert_pdf_file, which now returns a single boolean instead of a tuple. Add json_hash computation and _json_dir parameters for local path setup. Remove convert_garbage and add _json_path and json_hash in DocumentState. Update CCDataset to create JSON directories. Update log messages and progress bar descriptions.

Update FIPSCertificate to handle the new return value of convert_pdf_file, which now returns a single boolean instead of a tuple. Add json_hash computation and _json_dir parameters for local path setup. Add _json_path and json_hash in InternalState. Update FIPSDataset to include JSON directory. Update log messages and progress bar descriptions.

Update ProtectionProfile to handle the new return value of convert_pdf_file, which now returns a single boolean instead of a tuple. Add json_hash computation and _json_dir parameters for local path setup. Update ProtectionProfileDataset to include JSON directories. Update log messages and progress bar descriptions.

…verter Create new PdfConverter abstract class for allowing different PDF converter implementations in the future. Move conversion logic to DoclingConverter, with pipeline setup in init instead of in each call of convert

Update sample functions to accept an abstract PDF converter. Create a single DoclingConverter instance in dataset conversion functions. Delegate threading to Docling by setting max_workers=1 in process_parallel call to avoid excessive threads.
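For readers skimming the commits, the converter abstraction described above could be sketched roughly as follows. This is only an illustration under assumptions: the method name `convert`, its parameters, and the `PlaceholderConverter` backend are hypothetical and not taken from the PR diff. It shows the three ideas mentioned in the commit messages: a boolean return value, optional JSON output, and one converter instance reused in a plain loop.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Optional


class PDFConverter(ABC):
    """Minimal converter interface: one call per PDF, True on success."""

    @abstractmethod
    def convert(self, pdf_path: Path, txt_path: Path, json_path: Optional[Path] = None) -> bool:
        ...


class PlaceholderConverter(PDFConverter):
    """Hypothetical stand-in backend; a real one would call Docling or pdftotext."""

    def __init__(self) -> None:
        # Costly setup (for example building a pipeline) belongs here, done once.
        self.calls = 0

    def convert(self, pdf_path: Path, txt_path: Path, json_path: Optional[Path] = None) -> bool:
        self.calls += 1
        if not pdf_path.exists():
            return False
        txt_path.write_text(f"converted {pdf_path.name}\n", encoding="utf-8")
        # JSON is written only when a path is provided.
        if json_path is not None:
            json_path.write_text(json.dumps({"source": pdf_path.name}), encoding="utf-8")
        return True


def convert_all(converter: PDFConverter, pdfs: List[Path], out_dir: Path) -> Dict[str, bool]:
    """Reuse a single converter instance for every file, in a simple loop."""
    return {p.name: converter.convert(p, out_dir / f"{p.stem}.txt") for p in pdfs}
```

A text-only backend such as pdftotext would fit the same interface by simply ignoring `json_path`.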


Add checks for json_path existence. Add json_hash and remove convert_garbage in toy datasets and fictional cert. Replace template text files with new one produced by Docling.

Add json_path checks. Add json_hash and remove convert_garbage in toy dataset and fictional cert. Update template text file with one produced by Docling.

adamjanovsky · 2025-10-22T12:41:45Z

@jborsky what are you after right now? High-level design review? In-depth implementation review?

Do you consider this complete or should we wait till you mark it "ready" and request review?

J08nY · 2025-10-24T14:42:19Z

I had a quick look at this.

> I removed the use of the convert_garbage attribute. I just kept it in the internal states for backward compatibility but removed it from serialization and schemas. I don't know if this attribute should be kept.

This makes sense.

> I created a new PdfConverter abstract class to allow different future PDF converter implementations and a DoclingConverter class that implements this simple interface. It lives right now in the pdf utils. Maybe it would make sense to create a new file for these definitions?

Makes sense. Could you actually keep the pdf2text conversion in a separate subclass of this? It will not create the JSON, but that is fine.

> The DoclingConverter class has a pipeline setup for the conversion. There are a bunch of options to configure. Should I keep them defined there or move relevant options to the app configuration? If the latter, it may be better to define the whole pipeline there, because the pipeline itself can be changed for a different Docling one or you can create your custom one or just add/replace/remove models or stages.

Let's keep it simple for now. We can extend this later on.

> I delegated parallelism to Docling and set max_workers=1 in the process_parallel call to avoid excessive threads, while keeping the nice progress bar.

I guess this makes sense, but I think it currently still creates an additional process and submits the tasks to it, which is unnecessary. If you look at the logic of process_parallel, you can see that. This means we are still serializing the cert and the Docling converter instance and passing them along to a new process to do the actual conversion. I remember that Docling has quite costly initialization, so I think we should avoid doing this needlessly.

adamjanovsky · 2025-10-24T16:52:37Z

Thanks. I agree with what @J08nY says. On top of that:

- Given that the source branch lives in your fork, we don't get any pipelines here. Do the tests pass?
- Eventually, we'd like to expose major settings to this class (sec-certs/src/sec_certs/configuration.py, line 12 in 85adccc): class Configuration(BaseSettings). No need to address this now.
- Do we have any estimates if this can run on consumer-grade laptops? Or how long this will take on our server on all documents?

More in person. Given that this is your first involvement in the project, it looks excellent! 👍


jborsky · 2025-10-25T17:16:30Z

> I guess this makes sense, but I think it currently still creates an additional process and submits the tasks to it, which is unnecessary. If you look at the logic of process_parallel, you can see that. This means we are still serializing the cert and the Docling converter instance and passing them along to a new process to do the actual conversion. I remember that Docling has quite costly initialization, so I think we should avoid doing this needlessly.

Yes, you're right. It was actually using a second thread, but if multiprocessing were used instead of threading, the Docling converter and the cert would be serialized and passed to the new process. I've replaced the call with a simple loop, so conversion now happens in the main thread.

If we later decide to use multiple processes for conversion, I'll adjust the implementation so that each process creates its own converter instance. For now I think the current parallelism in Docling is sufficient.

> Given that the source branch lives in your fork, we don't get any pipelines here. Do the tests pass?

Yes, most of the tests pass, at least on my machine 😄. The ones that don't are already failing in main. These are the tests with HTTP errors and test_build_dataset[default_dataset1-CveNvdDatasetBuilder].
And just a note that right now the assertions in the tests, which compare the converted output text to the template, depend on the OCR engine used. Changing the OCR can cause the assertion to fail.

Do I have any other option besides keeping the branch on my fork? I don't think I can create a branch here, or am I wrong?

> Eventually, we'd like to expose major settings to this class:
>
> sec-certs/src/sec_certs/configuration.py, line 12 in 85adccc
>
> class Configuration(BaseSettings):

Yep, that was my thought. For now, I've at least added an option in this class to choose the PDF converter.
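As a rough illustration of such an option, here is a standard-library sketch. It deliberately does not use the project's actual Configuration class; the `Settings` dataclass, the `ConverterChoice` enum, the `SECCERTS_PDF_CONVERTER` variable name, and the default value are all assumptions made for the example.

```python
import os
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional


class ConverterChoice(str, Enum):
    """Hypothetical set of selectable PDF converter backends."""

    PDFTOTEXT = "pdftotext"
    DOCLING = "docling"


@dataclass
class Settings:
    # The default is picked arbitrarily for this sketch.
    pdf_converter: ConverterChoice = ConverterChoice.PDFTOTEXT

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        """Read the choice from an environment-like mapping; unknown values raise ValueError."""
        env = os.environ if env is None else env
        raw = env.get("SECCERTS_PDF_CONVERTER", ConverterChoice.PDFTOTEXT.value)
        return cls(pdf_converter=ConverterChoice(raw.lower()))
```

Dataset code could then branch on `settings.pdf_converter` once and build the matching converter instance.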

> Do we have any estimates if this can run on consumer-grade laptops? Or how long this will take on our server on all documents?

It depends on which OCR engine is chosen and whether GPU acceleration is used. For instance, EasyOCR runs fairly slowly on CPU.

Using the current pipeline configuration with EasyOCR on a random sample of 10 certificate artifacts from our dataset:

- On my Apple M1 chip, the conversion took 1.3 s per page on average.
- On the Aura server with GPU acceleration, running the process with a niceness of 15, the conversion took 0.5 s per page on average.

It would take something around 13 days on my MacBook and 5 days on Aura to convert all PDFs.
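As a back-of-the-envelope check of these numbers: the total page count is not stated in this thread, but it can be inferred from the quoted per-page times and day estimates. The helper names below are made up for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def implied_pages(days: float, seconds_per_page: float) -> float:
    """Number of pages that fit into `days` of wall time at the given rate."""
    return days * SECONDS_PER_DAY / seconds_per_page


def days_needed(pages: float, seconds_per_page: float) -> float:
    """Wall time in days to convert `pages` at the given rate."""
    return pages * seconds_per_page / SECONDS_PER_DAY


m1_pages = implied_pages(13, 1.3)   # MacBook figure quoted above
aura_pages = implied_pages(5, 0.5)  # Aura figure quoted above
print(round(m1_pages), round(aura_pages))
```

Both estimates imply the same corpus size of roughly 864,000 pages, so the two figures are consistent with each other, assuming sequential processing at the measured average rate.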

Tweaking the configuration on Aura a bit might further improve performance.

J08nY · 2025-10-25T19:21:04Z

> Do I have any other option besides keeping the branch on my fork? I don't think I can create a branch here, or am I wrong?

I will give you contributor rights to the repo so you can have the branch there.

> Do we have any estimates if this can run on consumer-grade laptops? Or how long this will take on our server on all documents?
>
> It depends on which OCR engine is chosen and whether GPU acceleration is used. For instance, EasyOCR runs fairly slowly on CPU.
>
> Using the current pipeline configuration with EasyOCR on a random sample of 10 certificate artifacts from our dataset:
>
> - On my Apple M1 chip, the conversion took 1.3 s per page on average.
> - On the Aura server with GPU acceleration, running the process with a niceness of 15, the conversion took 0.5 s per page on average.
>
> It would take something around 13 days on my MacBook and 5 days on Aura to convert all PDFs.
>
> Tweaking the configuration on Aura a bit might further improve performance.

That is really quite slow, hmm. That changes my view on the transition a bit. Let's actually not make this the default (keep pdf2text as the base option) and make the resulting enhanced heuristics and data depend on whether we have the Docling JSON or not. Wdyt @adamjanovsky? This is also how other similar dependency/runtime options are handled.

jborsky · 2025-10-25T20:03:38Z

> That is really quite slow, hmm. That changes my view on the transition a bit. Let's actually not make this the default (keep pdf2text as the base option) and make the resulting enhanced heuristics and data depend on whether we have the Docling JSON or not. Wdyt @adamjanovsky? This is also how other similar dependency/runtime options are handled.

Yeah, it is. I think we've discussed that running the processing on the server is feasible even if it takes up to a week. But running it locally on a laptop is a different story.

We could also adjust the default pipeline: disable OCR and switch table recognition from accurate mode to fast mode. That gives me about 0.5 s per page on the M1, but I'm not sure whether that still counts as slow.

adamjanovsky · 2025-10-30T06:43:12Z

I think it's time to migrate to incremental processing, at least on the server. That is, keep the OCR by default but introduce the ability to cache the results from previous iterations.

Not saying that this must be developed by @jborsky.

J08nY · 2025-10-30T07:11:01Z

I think this incremental processing is best handled in the server code, with relatively little support needed in the library. Basically, the library should be able to detect that the expected output files already exist and, if a setting is enabled, not recompute them. Another thing to handle is downloading fresh files and comparing their hashes with the existing ones, so that we retain the ability to spot certificate changes.
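The library-side support could be as small as the following sketch, assuming a stored hash of the previously downloaded PDF is available. The function names, the `reuse_existing` flag, and the choice of SHA-256 are illustrative assumptions, not existing sec-certs API.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_conversion(pdf: Path, txt: Path, known_pdf_hash: Optional[str], reuse_existing: bool) -> bool:
    """Return True when the PDF must be converted or reconverted."""
    if not reuse_existing or not txt.exists():
        return True
    # A changed PDF hash signals that the certificate document was updated.
    return known_pdf_hash != sha256_of(pdf)


def convert_incrementally(
    pdf: Path,
    txt: Path,
    known_pdf_hash: Optional[str],
    convert: Callable[[Path, Path], None],
    reuse_existing: bool = True,
) -> str:
    """Convert only when needed; return the current PDF hash to store for the next run."""
    if needs_conversion(pdf, txt, known_pdf_hash, reuse_existing):
        convert(pdf, txt)
    return sha256_of(pdf)
```

The server would persist the returned hash between runs; a mismatch on the next download both triggers reconversion and flags the certificate as changed.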

jborsky added 20 commits October 18, 2025 21:24

feat: replace pdftotext with Docling

a0e4a87

Replace pdftotext + OCR logic with Docling. Add JSON serialization of DoclingDocument alongside text output.

fix: correct wrong txt variable/function names

ebb99ea

chore: ruff fix and format

8766667

fix(test): remove convert_garbage variable asserts

306f12e

feat: replace pdftotext with docling in deps

20e1bc4

chore: remove dot from progress bar message

34e3dfc

fix: correct attribute name for conversion status

234c9e8

feat: add EasyOCR to deps

964e4e2

docs: add explanation comments about docling options

b15ad8d

fix: correct argument name in save to markdown function

8793263

feat: remove convert_garabage and add json_hash to serialization schemas

d81ff1c

feat: include back convert_garbage to document states for backward compatibility

84f60f0

test(cc): fix existing tests

7fbf109

Add checks for json_path existence. Add json_hash and remove convert_garbage in toy datasets and fictional cert. Replace template text files with new one produced by Docling.

test(fips): fix existing tests

80c139e

Add json_path checks. Add json_hash and remove convert_garbage in toy dataset and fictional cert. Update template text file with one produced by Docling.

refactor: make PDF uppercase

ae06b51

jborsky added 7 commits October 25, 2025 13:35

feat: add pdftotext converter

e009a29

fix: compute json hash only if the file exists

a6b29ff

refactor(cc): make general conversion method and remove parallel processing

8f685d9

fix: set json_hash to None if the file doesn't exist

3738def

refactor(fips): remove parallel processing when converting PDFs

908972b

refactor(PP): make general conversion method and remove parallel processing

3ec2f82

feat: add option to choose PDF converter in configuration

981be2c

jborsky added 3 commits October 25, 2025 15:17

refactor: pass PDFConverter instance through dataset methods

e8b0568

chore: make log messages punctuation consistent

9baf1ac

fix: save json output only when json_path is provided

fc4f008

jborsky added 2 commits October 25, 2025 20:14

chore: remove .DS_store files

ae3ddf3

test(cc): replace templates with EasyOCR versions

bce5dfc
