Split and delete forms#6277
Conversation
| log.error("Error closing document", e); | ||
| } | ||
| Set<Integer> keep = new HashSet<>(keepIndices); | ||
| try (PDDocument doc = pdfDocumentFactory.load(sourceFile)) { |
There was a problem hiding this comment.
Loading the source PDF (pdfDocumentFactory.load(...)) inside a per-output loop causes repeated expensive PDF parses; reuse a single loaded representation or a lightweight clone instead.
Details
✨ AI Reasoning
The new approach persists the uploaded PDF to disk and then, for each output range, calls the PDF factory to load the source file and then mutates (removePage) the loaded document. Loading a PDDocument from disk (parsing PDF) is an I/O and CPU-heavy operation. When there are many output parts, repeated pdfDocumentFactory.load(...) in a loop scales linearly with the number of parts and duplicates work that could be done once (or batched) by operating on a shared cached representation or by using a lighter-weight copy/cloning strategy. This harms throughput for large splits and high-concurrency requests. The change was introduced in this diff where writeRangeToZip loads the source file per range.
🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.
Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info
🚀 V2 Auto-Deployment Complete!Your V2 PR with embedded architecture has been deployed! 🔗 Direct Test URL (non-SSL) http://54.175.155.236:6277 🔐 Secure HTTPS URL: https://6277.ssl.stirlingpdf.cloud This deployment will be automatically cleaned up when the PR is closed. 🔄 Auto-deployed for approved V2 contributors. |
| private Set<COSDictionary> collectLiveWidgetDictionaries(PDDocument document) { | ||
| Set<COSDictionary> live = new HashSet<>(); | ||
| int pageCount = document.getNumberOfPages(); | ||
| for (int i = 0; i < pageCount; i++) { |
There was a problem hiding this comment.
Potential place for threaded implementation. If you got a chonky doc that is. May not be worth it
There was a problem hiding this comment.
I don't think this is worth doing, used very big docs to test and had no issue nad I don't think the overhead for this is all that significant
| return live; | ||
| } | ||
|
|
||
| private List<PDField> pruneFieldList(List<PDField> fields, Set<COSDictionary> liveWidgets) { |
There was a problem hiding this comment.
Fancy recursion! What happens if the number of fields get really big?
There was a problem hiding this comment.
It's on tree depth not field count. This matches how pdf box does it if this fails due to tree depth then we have biger problems
| group.setPartialName("group"); | ||
|
|
||
| PDTextField kept = new PDTextField(acroForm); | ||
| kept.setPartialName("kept"); | ||
| PDAnnotationWidget keptWidget = new PDAnnotationWidget(); | ||
| keptWidget.setRectangle(new PDRectangle(50, 50, 100, 20)); | ||
| keptWidget.setPage(pageA); | ||
| kept.setWidgets(List.of(keptWidget)); | ||
| pageA.getAnnotations().add(keptWidget); | ||
|
|
||
| PDTextField dropped = new PDTextField(acroForm); | ||
| dropped.setPartialName("dropped"); |
There was a problem hiding this comment.
Are these names shown to the user? Does it matter if they are english only?
There was a problem hiding this comment.
These are just for tests. Shouldn't have added this file though it can just be moved to formutilstest
| continue; | ||
| } | ||
| if (hasForm) { | ||
| writeRangeViaReload( |
There was a problem hiding this comment.
You have writeRangeViaReload and writeRangeViaSharedSource but you also have writeSplitViaReload/sharedSource in another file. Do they share implemetation that can reduce dupes
There was a problem hiding this comment.
Great call out, will sort
Delete orphaned forms when removing pages and maintain forms correctly when splitting