Extracts Conflict of interests / Author contributions statements #1319

Merged
lfoppiano merged 96 commits into master from feature/coi-ac on Feb 21, 2026
@lfoppiano lfoppiano commented Jul 30, 2025

Adds the identification of "conflict of interests / declaration of interests" and author contribution / credit statements.

The evaluation requires an updated version of the dataset which can be found here: https://huggingface.co/datasets/sciencialab/grobid-evaluation where the JATS files have been modified to identify the new statements.
We did our best, given the mess of the JATS Jungle.

We also used a dataset in which the author contribution and data availability statements had already been extracted with a different tool, and selected the documents that had missing or truncated statements. Here are some results:

| | Iteration 0 (grobid 0.8.1) | Iteration 1 (grobid dev) | Iteration 2 (+50 docs) | Iteration 3 (+113 docs) | Iteration 4 (+97 docs) |
|---|---|---|---|---|---|
| Docs missing availability statements | 2420 | 1240 | 805 | 317 | 236 |
| Docs missing contribution statements | 3737 | 1205 | 767 | 523 | 483 |
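
To put the counts above in relative terms, here is a minimal sketch that computes the share of previously missing statements recovered between iteration 0 and iteration 4. The class and method names are hypothetical and are not part of GROBID; only the four counts come from the table.

```java
// Illustrative helper; the class and method names are hypothetical, not part of GROBID.
class MissingStatements {

    /** Fraction of the documents missing a statement before that are no longer missing it. */
    static double relativeReduction(int before, int after) {
        if (before <= 0) {
            throw new IllegalArgumentException("before must be positive");
        }
        return (before - after) / (double) before;
    }

    public static void main(String[] args) {
        // Counts copied from the table: iteration 0 versus iteration 4.
        System.out.printf("availability: %.1f%%%n", 100 * relativeReduction(2420, 236));
        System.out.printf("contribution: %.1f%%%n", 100 * relativeReduction(3737, 483));
    }
}
```

With these numbers the reduction is roughly 90% for availability statements and roughly 87% for contribution statements.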


coveralls commented Jul 30, 2025

Coverage Status

coverage: 40.323% (-0.07%) from 40.394%
when pulling 80ec370 on feature/coi-ac
into 01fe109 on master.

@lfoppiano lfoppiano changed the title from "Add Conflict of interests / Author contributions" to "Extracts Conflict of interests / Author contributions statements" on Aug 10, 2025
# Conflicts:
#	grobid-home/models/affiliation-address-BidLSTM_CRF/config.json
#	grobid-home/models/affiliation-address-BidLSTM_CRF/model_weights.hdf5
#	grobid-home/models/affiliation-address-BidLSTM_CRF/preprocessor.json
#	grobid-home/models/header-BidLSTM_CRF/config.json
#	grobid-home/models/header-BidLSTM_CRF/model_weights.hdf5
#	grobid-home/models/header-BidLSTM_CRF/preprocessor.json
#	grobid-home/models/header-BidLSTM_ChainCRF/config.json
#	grobid-home/models/header-BidLSTM_ChainCRF/model_weights.hdf5
#	grobid-home/models/header-BidLSTM_ChainCRF/preprocessor.json
#	grobid-home/models/name-header-BidLSTM_CRF/config.json
#	grobid-home/models/name-header-BidLSTM_CRF/model_weights.hdf5
#	grobid-home/models/name-header-BidLSTM_CRF/preprocessor.json
Signed-off-by: Luca Foppiano <luca@foppiano.org>
Signed-off-by: Luca Foppiano <luca@foppiano.org>
@lfoppiano (Member, Author) commented

@kermitt2 it seems that the default "recipe" for training the header-BidLSTM_ChainCRF_FEATURES model does not yield the same results as the current header model in the master repository. Do you remember, by any chance, which parameters you used? 🙏

FYI I've solved the issue and obtained similar, or slightly better, results.

@lfoppiano lfoppiano requested a review from Copilot February 21, 2026 14:27

Copilot AI left a comment


Pull request overview

This PR adds support for extracting conflict of interest statements and author contribution statements from scientific articles. The implementation includes updated model configurations, new tagging labels, and modifications to parsing and processing logic to handle these new statement types.

Changes:

  • Added new tagging labels for conflict of interest (<conflict>) and author contribution (<contribution>) statements
  • Updated model configurations to accommodate expanded character vocabularies and new tag types
  • Modified parsing engines to extract and process the new statement types from both header and body sections
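
As a rough illustration of the last point, the sketch below groups labelled tokens into statement strings. It is not the code from this PR: the class, method, and input shape are hypothetical, and it assumes that a label prefixed with `I-` marks the start of a new segment. Only the `<conflict>` and `<contribution>` label names come from the summary above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: these are not GROBID's actual classes or method names.
class StatementGrouper {
    static final String CONFLICT = "<conflict>";
    static final String CONTRIBUTION = "<contribution>";

    /**
     * Groups tokens into statements. Each row is {token, label}.
     * Assumption: an "I-" prefix on a label marks the start of a new segment.
     */
    static Map<String, List<String>> group(String[][] labelledTokens) {
        Map<String, List<String>> statements = new LinkedHashMap<>();
        statements.put(CONFLICT, new ArrayList<>());
        statements.put(CONTRIBUTION, new ArrayList<>());

        StringBuilder current = new StringBuilder();
        String currentLabel = null;
        for (String[] row : labelledTokens) {
            String token = row[0];
            boolean begins = row[1].startsWith("I-");
            String label = begins ? row[1].substring(2) : row[1];
            // Close the running segment when a new one starts or the label changes.
            if (begins || !label.equals(currentLabel)) {
                flush(statements, currentLabel, current);
                currentLabel = label;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
        }
        flush(statements, currentLabel, current);
        return statements;
    }

    // Stores the segment if its label is one we track, then resets the buffer.
    private static void flush(Map<String, List<String>> statements, String label, StringBuilder text) {
        if (label != null && statements.containsKey(label) && text.length() > 0) {
            statements.get(label).add(text.toString());
        }
        text.setLength(0);
    }
}
```

Segments with any other label are simply dropped here; a real parser would route them to their own handlers.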

Reviewed changes

Copilot reviewed 27 out of 2468 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| `grobid-home/models/*/preprocessor.json` | Expanded character vocabularies to support additional Unicode characters encountered in documents |
| `grobid-home/models/*/config.json` | Updated character vocabulary sizes and batch sizes to align with preprocessor changes |
| `TaggingLabels.java` | Added label constants for conflict of interest and author contribution statements |
| `SegmentationLabels.java` | Registered new labels for segmentation model |
| `Segmentation.java` | Added handling for conflict and contribution statement extraction in training mode |
| `HeaderParser.java` | Implemented extraction logic for conflict and contribution statements from headers |
| `FullTextParser.java` | Added processing for conflict and contribution statements from non-header sections |
| `BiblioItem.java` | Added fields and accessors for storing conflict and contribution statements |
| `BasicStructureBuilder.java` | Removed unused import |
| `doc/training/*.md` | Updated documentation to describe new statement types and annotation guidelines |
| `doc/benchmarks/*.md` | Updated benchmark results reflecting model performance changes |

Comments suppressed due to low confidence (1)

grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java:1496

  • Extra blank line added at line 1443. While not functionally problematic, this inconsistent spacing should be removed to maintain code style consistency.
    }
}


@lfoppiano lfoppiano merged commit b91c01d into master Feb 21, 2026
6 of 7 checks passed
@lfoppiano lfoppiano deleted the feature/coi-ac branch February 21, 2026 15:58

Development

Successfully merging this pull request may close these issues.

  • section incorrectly classified as keywords using CRF models
  • Extracts author contribution and conflict of interest statements

3 participants