Conversation

@jborsky
Collaborator

jborsky commented Oct 22, 2025

This pull request implements basic initial replacement of pdftotext for Docling.

I am unsure about a few things, and whether this is the ideal solution:

  • I removed the use of the convert_garbage attribute. I just kept it in the internal states for backward compatibility but removed it from serialization and schemas. I don't know if this attribute should be kept.
  • I created a new PdfConverter abstract class to allow different future PDF converter implementations and a DoclingConverter class that implements this simple interface. It lives right now in the pdf utils. Maybe it would make sense to create a new file for these definitions?
  • The DoclingConverter class has a pipeline setup for the conversion. There are a bunch of options to configure. Should I keep them defined there or move relevant options to the app configuration? If the latter, it may be better to define the whole pipeline there, because the pipeline itself can be changed for a different Docling one or you can create your custom one or just add/replace/remove models or stages.
  • I delegated parallelism to Docling and set max_workers=1 in the process_parallel call
    to avoid excessive threads, while keeping the nice progress bar.
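
To make the PdfConverter / DoclingConverter split above concrete, here is a minimal sketch of what the interface and the Docling-backed implementation could look like. Class and method names, the output handling, and the pipeline options are illustrative assumptions, not necessarily identical to the code in this PR:

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


class PdfConverter(ABC):
    """Common interface for PDF converters."""

    @abstractmethod
    def convert(self, pdf_path: Path, txt_path: Path, json_path: Path) -> bool:
        """Convert pdf_path, write the outputs, and return True on success."""


class DoclingConverter(PdfConverter):
    def __init__(self) -> None:
        # The pipeline is configured once here instead of on every convert() call.
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        self._converter = DocumentConverter(
            format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
        )

    def convert(self, pdf_path: Path, txt_path: Path, json_path: Path) -> bool:
        try:
            document = self._converter.convert(pdf_path).document
            txt_path.write_text(document.export_to_text(), encoding="utf-8")
            json_path.write_text(json.dumps(document.export_to_dict()), encoding="utf-8")
            return True
        except Exception:
            return False
```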

Replace pdftotext + OCR logic with Docling.
Add JSON serialization of DoclingDocument alongside text output.
Update CCCertificate to handle the new return value of
convert_pdf_file, which now returns a single boolean instead of a tuple.
Add json_hash computation and _json_dir parameters for local path
setup.

Remove convert_garbage and add _json_path and json_hash in
DocumentState.

Update CCDataset to create JSON directories. Update log messages and
progress bar descriptions.
Update FIPSCertificate to handle the new return value of
convert_pdf_file, which now returns a single boolean instead of a tuple.
Add json_hash computation and _json_dir parameters for local path
setup. Add _json_path and json_hash in InternalState.

Update FIPSDataset to include JSON directory. Update log messages and
progress bar descriptions.
Update ProtectionProfile to handle the new return value of
convert_pdf_file, which now returns a single boolean instead of a tuple.
Add json_hash computation and _json_dir parameters for local path
setup.

Update ProtectionProfileDataset to include JSON directories. Update
log messages and progress bar descriptions.

Create new PdfConverter abstract class for allowing different PDF
converter implementations in the future.

Move conversion logic to DoclingConverter, with pipeline setup in init
instead of in each call of convert.
Update sample functions to accept an abstract PDF converter.

Create a single DoclingConverter instance in dataset conversion functions.

Delegate threading to Docling by setting max_workers=1 in process_parallel call
to avoid excessive threads.
Add checks for json_path existence.

Add json_hash and remove convert_garbage in toy datasets and fictional cert.

Replace template text files with new ones produced by Docling.
Add json_path checks.

Add json_hash and remove convert_garbage in toy dataset and fictional cert.

Update template text file with one produced by Docling.
@adamjanovsky
Collaborator

@jborsky what are you after right now? High-level design review? In-depth implementation review?

Do you consider this complete or should we wait till you mark it "ready" and request review?

@J08nY
Member

J08nY commented Oct 24, 2025

I had a quick look at this.

  • I removed the use of the convert_garbage attribute. I just kept it in the internal states for backward compatibility but removed it from serialization and schemas. I don't know if this attribute should be kept.

This makes sense.

  • I created a new PdfConverter abstract class to allow different future PDF converter implementations and a DoclingConverter class that implements this simple interface. It lives right now in the pdf utils. Maybe it would make sense to create a new file for these definitions?

Makes sense, could you actually keep the pdf2text conversion in a separate subclass of this? It will not create the JSON, but that is fine.
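
A rough sketch of such a subclass, reusing the PdfConverter interface sketched in the PR description (the pdftotext usage mirrors the pre-PR behaviour; names are assumptions):

```python
from pathlib import Path

import pdftotext


class PdftotextConverter(PdfConverter):
    """Plain-text extraction only; intentionally produces no Docling JSON."""

    def convert(self, pdf_path: Path, txt_path: Path, json_path: Path) -> bool:
        try:
            with pdf_path.open("rb") as handle:
                pages = pdftotext.PDF(handle)
            txt_path.write_text("\n\n".join(pages), encoding="utf-8")
            return True  # json_path is left untouched on purpose
        except Exception:
            return False
```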

  • The DoclingConverter class has a pipeline setup for the conversion. There are a bunch of options to configure. Should I keep them defined there or move relevant options to the app configuration? If the latter, it may be better to define the whole pipeline there, because the pipeline itself can be changed for a different Docling one or you can create your custom one or just add/replace/remove models or stages.

Let's keep it simple for now. We can extend this later on.

  • I delegated parallelism to Docling and set max_workers=1 in the process_parallel call
    to avoid excessive threads, while keeping the nice progress bar.

I guess this makes sense, but I think it currently still creates an additional process and submits the tasks to it, which is unnecessary; you can see that in the logic of process_parallel. This means we are still serializing the cert and the Docling converter instance and passing them along to a new process to do the actual conversion. I remember that Docling has quite costly initialization, so I think we should avoid doing this needlessly.

@adamjanovsky
Collaborator

adamjanovsky commented Oct 24, 2025

Thanks. I agree with what @J08nY says. On top of that:

  • Given that the source branch lives in your fork, we don't get any pipelines here. Do the tests pass?
  • Eventually, we'd like to expose major settings to this class:
    class Configuration(BaseSettings):
    • No need to address this now
  • Do we have any estimates if this can run on consumer-grade laptops? Or how long this will take on our server on all documents?
  • More in person. Given that this is your first involvement in the project, it looks excellent! 👍

@jborsky
Collaborator Author

jborsky commented Oct 25, 2025

I guess this makes sense, but I think it currently still creates an additional process and submits the tasks to it, which is unnecessary; you can see that in the logic of process_parallel. This means we are still serializing the cert and the Docling converter instance and passing them along to a new process to do the actual conversion. I remember that Docling has quite costly initialization, so I think we should avoid doing this needlessly.

Yes, you're right. It was actually using a second thread, but if multiprocessing were used instead of threading, the Docling converter and the cert would be serialized and passed to the new process. I've replaced the call with a simple loop, so conversion now happens in the main thread.
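
For illustration, the replacement loop could look roughly like this (tqdm keeps the progress bar; the cert-level method name is an assumption):

```python
from tqdm import tqdm

converter = DoclingConverter()  # constructed once, in the main process
failures = 0
for cert in tqdm(certs, desc="Converting PDFs"):
    # Docling parallelizes internally, so no extra processes or threads are spawned here.
    if not cert.convert_pdf_file(converter):
        failures += 1
```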

If we later decide to use multiple processes for conversion, I'll adjust the implementation so that each process creates its own converter instance. For now I think the current parallelism in Docling is sufficient.

Given that the source branch lives in your fork, we don't get any pipelines here. Do the tests pass?

Yes, most of the tests pass, at least on my machine 😄. The ones that don't are already failing in main. These are the tests with HTTP errors and test_build_dataset[default_dataset1-CveNvdDatasetBuilder].
Also, just a note that the test assertions that compare the converted output text to the template currently depend on the OCR engine used; changing the OCR engine can cause those assertions to fail.

Do I have any other option besides having the branch on my fork? I don't think I can create a branch here, or am I wrong?

Eventually, we'd like to expose major settings to this class:

class Configuration(BaseSettings):

Yep, that was my thought. For now, I've at least added an option in this class to choose the PDF converter.
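
Something along these lines, as a sketch only (field name and default are hypothetical):

```python
from pydantic_settings import BaseSettings  # or `from pydantic import BaseSettings` on pydantic v1


class Configuration(BaseSettings):
    # Hypothetical field selecting the PdfConverter implementation to use.
    pdf_converter: str = "docling"  # e.g. "docling" or "pdftotext"
```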

Do we have any estimates if this can run on consumer-grade laptops? Or how long this will take on our server on all documents?

It depends on which OCR engine is chosen and whether GPU acceleration is used. For instance, EasyOCR runs fairly slowly on CPU.

Using the current pipeline configuration with EasyOCR on a random sample of 10 certificate artifacts from our dataset:

  • On my Apple M1 chip, the conversion took 1.3 s per page on average.
  • On the Aura server with GPU acceleration, running the process with a niceness of 15, the conversion took 0.5 s per page on average.

It would take roughly 13 days on my MacBook and about 5 days on Aura to convert all PDFs.

Maybe tweaking the configuration on Aura a bit could further improve performance.

@J08nY
Member

J08nY commented Oct 25, 2025

Do I have any other option besides having the branch on my fork? I don't think I can create a branch here, or am I wrong?

I will give you contributor rights to the repo so you can have the branch there.

Do we have any estimates if this can run on consumer-grade laptops? Or how long this will take on our server on all documents?

It depends on which OCR engine is chosen and whether GPU acceleration is used. For instance, EasyOCR runs fairly slowly on CPU.

Using the current pipeline configuration with EasyOCR on a random sample of 10 certificate artifacts from our dataset:

  • On my Apple M1 chip, the conversion took 1.3 s per page on average.
  • On the Aura server with GPU acceleration, running the process with a niceness of 15, the conversion took 0.5 s per page on average.

It would take roughly 13 days on my MacBook and about 5 days on Aura to convert all PDFs.

Maybe tweaking the configuration on Aura a bit could further improve performance.

That is really quite slow, hmm. That changes my view on the transition a bit. Let's make this not the default (keep pdf2text as the base option) and make the resulting enhanced heuristics and data dependent on whether we have the Docling JSON or not. Wdyt @adamjanovsky? This is also how other similar dependency/runtime options are handled.

@jborsky
Collaborator Author

jborsky commented Oct 25, 2025

That is really quite slow, hmm. That changes my view on the transition a bit. Let's make this not the default (keep pdf2text as the base option) and make the resulting enhanced heuristics and data dependent on whether we have the Docling JSON or not. Wdyt @adamjanovsky? This is also how other similar dependency/runtime options are handled.

Yeah, it is. I think we've discussed that running the processing on the server is feasible even if it takes up to a week. But running it locally on a laptop is a different story.

We could also adjust the default pipeline: disable OCR and switch table recognition to fast mode instead of accurate mode. That gives me about 0.5 s per page on the M1, but I'm not sure whether that's still considered slow.
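
For reference, the faster configuration is roughly the following (a sketch against Docling's pipeline options; import paths may differ between Docling versions):

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False                                        # skip OCR entirely
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.FAST   # instead of ACCURATE
```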

@adamjanovsky
Collaborator

I think it's time to migrate to incremental processing, at least on the server. That is, keep the OCR by default but introduce the ability to cache the results from previous iterations.

Not saying that this must be developed by @jborsky.

@J08nY
Member

J08nY commented Oct 30, 2025

I think this incremental processing is best handled in the server code, with relatively little support needed in the library. Basically, it should be able to detect that the files that are supposed to be in a given location are already there and, if a setting is enabled, skip recomputing them. Another thing to handle is downloading fresh artifacts and comparing their hashes with the existing ones, so that we retain the ability to spot certificate changes.
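
A rough sketch of the skip logic on the server side, assuming we keep the previously computed pdf_hash for each certificate (all names here are hypothetical):

```python
import hashlib
from pathlib import Path


def needs_reconversion(pdf_path: Path, json_path: Path, previous_pdf_hash: str | None) -> bool:
    """Return True if the freshly downloaded PDF has to be (re)converted."""
    current_hash = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    if previous_pdf_hash != current_hash:
        return True  # the certificate artifact changed upstream
    return not json_path.exists()  # cached converted output is missing
```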
