-
Notifications
You must be signed in to change notification settings - Fork 11
Replace pdftotext with Docling #523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Replace pdftotext + OCR logic with Docling. Add JSON serialization of DoclingDocument alongside text output.
Update CCCertificate to handle the new return value of convert_pdf_file, which now returns a single boolean instead of a tuple. Add json_hash computation and _json_dir parameters for local path setup. Remove convert_garbage and add _json_path and json_hash in DocumentState. Update CCDataset to create JSON directories. Update log messages and progress bar descriptions.
Update FIPSCertificate to handle the new return value of convert_pdf_file, which now returns a single boolean instead of a tuple. Add json_hash computation and _json_dir parameters for local path setup. Added _json_path and json_hash in InternalState. Update FIPSDataset to include JSON directory. Update log messages and progress bar descriptions.
Update ProtectionProfile to handle the new return value of convert_pdf_file, which now returns a single boolean instead of a tuple. Add json_hash computation and _json_dir parameters for local path setup. Update ProtectionProfileDataset to include JSON directories. Update log messages and progress bar descriptions.
…verter Create new PdfConverter abstract class for allowing different PDF converter implementations in the future. Move conversion logic to DoclingConverter, with pipeline setup in init instead of in each call of convert
Update sample functions to accept an abstract PDF converter. Create a single DoclingConverter instance in dataset conversion functions. Delegate threading to Docling by setting max_workers=1 in process_parallel call to avoid excessive threads.
Add checks for json_path existence. Add json_hash and remove convert_garbage in toy datasets and fictional cert. Replace template text files with new one produced by Docling.
Add json_path checks. Add json_hash and remove convert_garbage in toy dataset and fictional cert. Update template text file with one produced by Docling.
|
@jborsky what are you after right now? High-level design review? In-depth implementation review? Do you consider this complete or should we wait till you mark it "ready" and request review? |
|
I had a quick look at this.
This makes sense.
Makes sense, could you actually keep the pdf2text conversion in a separate subclass of this? It will not create the JSON, but that is fine.
Lets keep it simple now. We can extend this later on.
I guess this makes sense but currently I think it still creates an additional process and submits the tasks to it which is unnecessary. If you look at the logic of process_parallel you can see that. This means we are still serializing the cert and the Docling converter instance and passing it along to a new process to actually convert. I remember that Docling has quite some costly initialization so I think we should avoid doing this needlessly. |
|
Thanks. I agree with what @J08nY says. On top of that:
|
Yes, you're right. It was actually using a second thread but if the multiprocessing was used instead of threading the Docling converter and cert would be serialized and passed to the new process. I've replaced the call with a simple loop, so now conversion happens in the main thread. If we later decide to use multiple processes for conversion, I'll adjust the implementation so that each process creates its own converter instance. For now I think the current parallelism in Docling is sufficient.
Yes, most of the tests pass, at least on my machine 😄. The ones that don't are already failing in main. These are the tests with HTTP errors and test_build_dataset[default_dataset1-CveNvdDatasetBuilder]. Do I have other option besides having the branch on my fork? I don't think I can create branch here, or am I wrong?
Yep, that was my thought. I've at least for now added an option in this class to choose the PDF converter.
It depends on what OCR is chosen and whether GPU acceleration is used. For instance, EasyOCR runs fairly slow on CPU. Using the current pipeline configuration with EasyOCR on a random sample of 10 certificate artifacts from our dataset:
It would take something around 13 days on my MacBook and 5 days on Aura to convert all PDFs. Maybe tweaking a little bit the configuration on Aura could further improve the performance. |
I will give you contributor rights tk the repo sonyou can have the branch there.
That is really quite slow, hmm. That changes my view on the transition a bit. Let's make this actually not be the default (keep pdf2text as the base option) and make the resulting enhanced heuristics and data be dependent on whether we have the docling JSON or not. Wdyt @adamjanovsky ? This is also how other similarly dependency/runtime options are done. |
Yeah, it is. I think we’ve discussed that running the processing on the server is feasible even if it takes up a week. But running it locally on a laptop is a different story. We could also adjust the default pipeline—disable OCR and switch table recognition to fast mode instead of accurate mode. That gives me about 0.5s per page on the M1, but I’m not sure if that’s still considered slow. |
|
I think it's time to migrate to incremental processing, at least on the server. That is, keep the OCR by default but introduce the ability to cache the results from previous iterations. Not saying that this must by developed by @jborsky. |
|
I think this incremental processing is best handled in the server code with relatively little support in the library needed. Basically it should be able to detect that the files that are meant to be somewhere are there and if a setting is set it should not recompute them. Another thing to handle is the download of fresh stuff and comparison of hashes of the files with the existing ones such that we retain the ability to spot certificate changes. |
This pull request implements basic initial replacement of pdftotext for Docling.
I am unsure of a few things about whether it's the ideal solution:
to avoid excessive threads, while keeping the nice progress bar.