Add API to supply pre-computed figures/tables and areas to ignore #1375

Draft
lfoppiano wants to merge 2 commits into master from feature/add-ignore-areas-api

@lfoppiano
Member

TBC

@lfoppiano marked this pull request as draft on March 3, 2026, 08:04
@lfoppiano force-pushed the feature/add-ignore-areas-api branch from e280e1e to 275a28e on March 3, 2026, 08:04
Introduce a new typedAreas API parameter that allows users to specify
regions in PDF documents for specialized processing:

- FIGURE: regions processed with FigureParser model
- TABLE: regions processed with TableParser model
- IGNORE: regions completely excluded from processing
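
To make the three area types concrete, here is a minimal Java sketch of how a typed area might be represented and parsed. It is an illustration only: the `TypedArea` class, its fields, and the `TYPE:page,x,y,width,height` string format are assumptions made for this example and are not taken from the PR's code. Only the three type names and the handling of each type come from the description above.

```java
import java.util.Locale;

// Hypothetical sketch: the type names follow the commit message,
// everything else (class name, fields, string format) is assumed.
enum AreaType { FIGURE, TABLE, IGNORE }

final class TypedArea {
    final AreaType type;
    final int page;
    final double x, y, width, height;

    TypedArea(AreaType type, int page, double x, double y, double width, double height) {
        this.type = type;
        this.page = page;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // Assumed format: "TYPE:page,x,y,width,height", e.g. "TABLE:3,50,100,400,200".
    static TypedArea parse(String spec) {
        String[] parts = spec.trim().split(":", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected TYPE:page,x,y,width,height but got: " + spec);
        }
        // valueOf throws IllegalArgumentException for an unknown type name.
        AreaType type = AreaType.valueOf(parts[0].trim().toUpperCase(Locale.ROOT));
        String[] nums = parts[1].split(",");
        if (nums.length != 5) {
            throw new IllegalArgumentException("Expected 5 comma-separated numbers but got: " + parts[1]);
        }
        return new TypedArea(
                type,
                Integer.parseInt(nums[0].trim()),
                Double.parseDouble(nums[1].trim()),
                Double.parseDouble(nums[2].trim()),
                Double.parseDouble(nums[3].trim()),
                Double.parseDouble(nums[4].trim()));
    }

    // Names the handling that the description assigns to each type.
    String handling() {
        switch (type) {
            case FIGURE: return "FigureParser";
            case TABLE:  return "TableParser";
            default:     return "excluded";
        }
    }
}
```

With this sketch, `TypedArea.parse("TABLE:3,50,100,400,200").handling()` names `TableParser` as the model that would receive the region, while an `IGNORE` area reports that it is excluded.
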

Key changes:
- New AreaType enum and IgnoreArea class for typed area handling
- Document.filterLayoutTokensByTypedAreas() for token categorization
- GrobidAnalysisConfig builder methods for typed areas
- REST API endpoints updated with typedAreas parameter
- FullTextParser.processTypedAreas() for specialized parsing
- HeaderParser integration for typed area filtering
- Comprehensive unit tests (23 tests passing)
- Complete API documentation with examples
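
The token categorization step listed above can be sketched roughly as follows. This is not the implementation of `Document.filterLayoutTokensByTypedAreas()`; the `Token`, `Box`, and `Result` classes are hypothetical stand-ins, and the rule used here (a token belongs to the first area on the same page that contains its reference point) is an assumption chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of splitting tokens by typed area; all names are assumed.
final class TypedAreaFilterSketch {
    enum Kind { FIGURE, TABLE, IGNORE }

    static final class Token {
        final String text;
        final int page;
        final double x, y; // reference point of the token on the page

        Token(String text, int page, double x, double y) {
            this.text = text;
            this.page = page;
            this.x = x;
            this.y = y;
        }
    }

    static final class Box {
        final Kind kind;
        final int page;
        final double x, y, w, h;

        Box(Kind kind, int page, double x, double y, double w, double h) {
            this.kind = kind;
            this.page = page;
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        // True when the token's reference point lies inside this box on the same page.
        boolean contains(Token t) {
            return t.page == page && t.x >= x && t.x <= x + w && t.y >= y && t.y <= y + h;
        }
    }

    static final class Result {
        final Map<Kind, List<Token>> byKind = new EnumMap<>(Kind.class);
        final List<Token> body = new ArrayList<>(); // tokens outside every typed area
    }

    static Result categorize(List<Token> tokens, List<Box> boxes) {
        Result result = new Result();
        for (Kind k : Kind.values()) {
            result.byKind.put(k, new ArrayList<>());
        }
        for (Token t : tokens) {
            Kind hit = null;
            for (Box b : boxes) {
                if (b.contains(t)) {
                    hit = b.kind; // first matching area wins (assumption)
                    break;
                }
            }
            if (hit == null) {
                result.body.add(t);
            } else {
                result.byKind.get(hit).add(t);
            }
        }
        return result;
    }
}
```

In this sketch, tokens collected under `FIGURE` and `TABLE` would be handed to the corresponding parser, tokens under `IGNORE` would be dropped, and `body` would continue through the regular pipeline.
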

This replaces the legacy ignoreAreas parameter. Backward compatibility is
preserved: the old constructors and methods remain available but are marked
as deprecated.
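
One common way to keep a legacy parameter working while steering callers to its replacement is to leave the old method in place, mark it `@Deprecated`, and have it delegate to the new one. The sketch below shows that pattern under stated assumptions: `AnalysisConfigSketch`, its builder, and the method names `typedArea` and `ignoreAreas` are hypothetical and do not reproduce the `GrobidAnalysisConfig` API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a deprecated legacy method delegating to a new typed API.
final class AnalysisConfigSketch {
    enum AreaType { FIGURE, TABLE, IGNORE }

    static final class TypedArea {
        final AreaType type;
        final String coordinates; // opaque coordinate string; format not specified here

        TypedArea(AreaType type, String coordinates) {
            this.type = type;
            this.coordinates = coordinates;
        }
    }

    private final List<TypedArea> typedAreas;

    private AnalysisConfigSketch(List<TypedArea> areas) {
        // Defensive copy so later builder changes do not leak into the config.
        this.typedAreas = Collections.unmodifiableList(new ArrayList<>(areas));
    }

    List<TypedArea> getTypedAreas() {
        return typedAreas;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private final List<TypedArea> areas = new ArrayList<>();

        // New entry point: every area carries an explicit type.
        Builder typedArea(AreaType type, String coordinates) {
            areas.add(new TypedArea(type, coordinates));
            return this;
        }

        /** @deprecated use {@link #typedArea(AreaType, String)} with {@code AreaType.IGNORE}. */
        @Deprecated
        Builder ignoreAreas(List<String> coordinates) {
            // Legacy callers keep working: each old area becomes an IGNORE typed area.
            for (String c : coordinates) {
                typedArea(AreaType.IGNORE, c);
            }
            return this;
        }

        AnalysisConfigSketch build() {
            return new AnalysisConfigSketch(areas);
        }
    }
}
```

Mapping the legacy areas onto `IGNORE` keeps a single code path downstream, so only the typed representation needs to be handled by the parsers.
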
@lfoppiano force-pushed the feature/add-ignore-areas-api branch from 275a28e to 63f0189 on March 3, 2026, 08:05