Split and delete forms#6277

Open
reecebrowne wants to merge 4 commits into main from
bug/splitting-forms

@reecebrowne
Contributor

Delete orphaned forms when removing pages and maintain forms correctly when splitting

@dosubot dosubot Bot added size:XL This PR changes 500-999 lines ignoring generated files. Bugfix Pull requests that fix bugs labels Apr 30, 2026
@stirlingbot stirlingbot Bot added Java Pull requests that update Java code Back End Issues related to back-end development API API-related issues or pull requests Test Testing-related issues or pull requests and removed Bugfix Pull requests that fix bugs labels Apr 30, 2026
log.error("Error closing document", e);
}
Set<Integer> keep = new HashSet<>(keepIndices);
try (PDDocument doc = pdfDocumentFactory.load(sourceFile)) {
Contributor

Loading the source PDF (pdfDocumentFactory.load(...)) inside a per-output loop causes repeated expensive PDF parses; reuse a single loaded representation or a lightweight clone instead.

Details

✨ AI Reasoning
The new approach persists the uploaded PDF to disk and then, for each output range, calls the PDF factory to load the source file and then mutates (removePage) the loaded document. Loading a PDDocument from disk (parsing PDF) is an I/O and CPU-heavy operation. When there are many output parts, repeated pdfDocumentFactory.load(...) in a loop scales linearly with the number of parts and duplicates work that could be done once (or batched) by operating on a shared cached representation or by using a lighter-weight copy/cloning strategy. This harms throughput for large splits and high-concurrency requests. The change was introduced in this diff where writeRangeToZip loads the source file per range.

🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.
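As a rough illustration of "move constant work outside loops" applied to this comment, here is a minimal sketch. It does not use PDFBox or the PR's code: `SplitLoadSketch`, `load()`, and the string page labels are hypothetical stand-ins, and `loadCount` only exists to make the number of parses visible. It also assumes the parts can be built without mutating the shared source, which may not hold for documents with forms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an expensive PDF parse; not the PR's code and not PDFBox.
class SplitLoadSketch {
    static int loadCount = 0;

    // Pretend parse: returns page labels and counts how often it is called.
    static List<String> load() {
        loadCount++;
        return List.of("p0", "p1", "p2", "p3");
    }

    // Per-range reload: the source is parsed again for every output part.
    static List<List<String>> splitWithReload(List<int[]> ranges) {
        List<List<String>> parts = new ArrayList<>();
        for (int[] r : ranges) {
            List<String> doc = load();
            parts.add(new ArrayList<>(doc.subList(r[0], r[1])));
        }
        return parts;
    }

    // Shared source: parse once, then copy the pages each part needs.
    static List<List<String>> splitWithSharedSource(List<int[]> ranges) {
        List<String> doc = load(); // parsed once and never mutated
        List<List<String>> parts = new ArrayList<>();
        for (int[] r : ranges) {
            parts.add(new ArrayList<>(doc.subList(r[0], r[1])));
        }
        return parts;
    }
}
```

Both methods take half-open `[start, end)` ranges and return the same parts; only the number of `load()` calls differs.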

@stirlingbot
Contributor

stirlingbot Bot commented Apr 30, 2026

🚀 V2 Auto-Deployment Complete!

Your V2 PR with embedded architecture has been deployed!

🔗 Direct Test URL (non-SSL) http://54.175.155.236:6277

🔐 Secure HTTPS URL: https://6277.ssl.stirlingpdf.cloud

This deployment will be automatically cleaned up when the PR is closed.

🔄 Auto-deployed for approved V2 contributors.

private Set<COSDictionary> collectLiveWidgetDictionaries(PDDocument document) {
Set<COSDictionary> live = new HashSet<>();
int pageCount = document.getNumberOfPages();
for (int i = 0; i < pageCount; i++) {
Member

Potential place for threaded implementation. If you got a chonky doc that is. May not be worth it

Contributor Author

I don't think this is worth doing. I tested with very large documents and had no issues, and I don't think the overhead here is all that significant.

return live;
}

private List<PDField> pruneFieldList(List<PDField> fields, Set<COSDictionary> liveWidgets) {
Member

Fancy recursion! What happens if the number of fields gets really big?

Contributor Author

It depends on tree depth, not field count. This matches how PDFBox does it; if this fails due to tree depth, then we have bigger problems.
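To make the depth-versus-count point concrete, here is a minimal sketch of recursive pruning over a field tree. It is not the PR's `pruneFieldList`: `FieldPruneSketch`, its `Field` class, and the string widget ids are hypothetical stand-ins for PDFBox fields and widget dictionaries, and the keep/drop rule (keep a field only if it or a descendant still has a live widget) is an assumption about the intended behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class FieldPruneSketch {
    // Hypothetical stand-in for a form field: holds either child fields or widget ids.
    static final class Field {
        final String name;
        final List<Field> kids;
        final List<String> widgets;

        Field(String name, List<Field> kids, List<String> widgets) {
            this.name = name;
            this.kids = kids;
            this.widgets = widgets;
        }
    }

    static List<Field> prune(List<Field> fields, Set<String> liveWidgets) {
        List<Field> kept = new ArrayList<>();
        for (Field f : fields) {
            if (f.kids.isEmpty()) {
                // Terminal field: keep only widgets that are still on a page.
                List<String> live = f.widgets.stream()
                        .filter(liveWidgets::contains)
                        .collect(Collectors.toList());
                if (!live.isEmpty()) {
                    kept.add(new Field(f.name, List.of(), live));
                }
            } else {
                // Recurse into children: stack depth grows with nesting depth,
                // not with the number of sibling fields.
                List<Field> kids = prune(f.kids, liveWidgets);
                if (!kids.isEmpty()) {
                    kept.add(new Field(f.name, kids, List.of()));
                }
            }
        }
        return kept;
    }
}
```

Siblings are handled by the loop, so a flat form with many thousands of fields uses a single stack frame per nesting level.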

Comment on lines +202 to +213
group.setPartialName("group");

PDTextField kept = new PDTextField(acroForm);
kept.setPartialName("kept");
PDAnnotationWidget keptWidget = new PDAnnotationWidget();
keptWidget.setRectangle(new PDRectangle(50, 50, 100, 20));
keptWidget.setPage(pageA);
kept.setWidgets(List.of(keptWidget));
pageA.getAnnotations().add(keptWidget);

PDTextField dropped = new PDTextField(acroForm);
dropped.setPartialName("dropped");
Member

Are these names shown to the user? Does it matter if they are English-only?

Contributor Author

These are just for tests. I shouldn't have added this file, though; it can just be moved to formutilstest.

continue;
}
if (hasForm) {
writeRangeViaReload(
Member

You have writeRangeViaReload and writeRangeViaSharedSource, but you also have writeSplitViaReload/sharedSource in another file. Do they share an implementation that could reduce the duplication?

Contributor Author

Great call out, will sort
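One possible shape for that cleanup, sketched under assumptions: both files could delegate to a single shared strategy instead of each keeping its own reload and shared-source variants. Everything here (`RangeWriterSketch`, `RangeWriter`, `forDocument`, string page labels) is hypothetical and only illustrates the pattern; the actual methods in the PR may not be similar enough to merge this way.

```java
import java.util.ArrayList;
import java.util.List;

class RangeWriterSketch {
    // Hypothetical strategy: turns (source pages, [start, end)) into one output part.
    interface RangeWriter {
        List<String> write(List<String> source, int start, int end);
    }

    // Mimics "reload, then remove unwanted pages": work on a fresh copy and trim it.
    static final RangeWriter VIA_RELOAD = (src, start, end) -> {
        List<String> copy = new ArrayList<>(src);
        copy.subList(end, copy.size()).clear(); // drop pages after the range first
        copy.subList(0, start).clear();         // then drop pages before it
        return copy;
    };

    // Mimics "shared source": copy only the wanted pages out of one loaded document.
    static final RangeWriter VIA_SHARED_SOURCE =
            (src, start, end) -> new ArrayList<>(src.subList(start, end));

    // Single decision point that both callers could use.
    static RangeWriter forDocument(boolean hasForm) {
        return hasForm ? VIA_RELOAD : VIA_SHARED_SOURCE;
    }
}
```

With a helper like this, each call site would reduce to `RangeWriterSketch.forDocument(hasForm).write(pages, start, end)`, and the `hasForm` branch would live in one place.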

Labels

API API-related issues or pull requests Back End Issues related to back-end development Java Pull requests that update Java code size:XL This PR changes 500-999 lines ignoring generated files. Test Testing-related issues or pull requests

2 participants