Extracts Conflict of interests / Author contributions statements#1319
# Conflicts: # grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java
# Conflicts: # grobid-home/models/affiliation-address-BidLSTM_CRF/config.json # grobid-home/models/affiliation-address-BidLSTM_CRF/model_weights.hdf5 # grobid-home/models/affiliation-address-BidLSTM_CRF/preprocessor.json # grobid-home/models/header-BidLSTM_CRF/config.json # grobid-home/models/header-BidLSTM_CRF/model_weights.hdf5 # grobid-home/models/header-BidLSTM_CRF/preprocessor.json # grobid-home/models/header-BidLSTM_ChainCRF/config.json # grobid-home/models/header-BidLSTM_ChainCRF/model_weights.hdf5 # grobid-home/models/header-BidLSTM_ChainCRF/preprocessor.json # grobid-home/models/name-header-BidLSTM_CRF/config.json # grobid-home/models/name-header-BidLSTM_CRF/model_weights.hdf5 # grobid-home/models/name-header-BidLSTM_CRF/preprocessor.json
Signed-off-by: Luca Foppiano <luca@foppiano.org>
FYI, I've solved the issue and obtained similar or slightly better results.
…LSTM_CRF_FEATURES architecture
…LSTM_ChainCRF_FEATURES architecture second pass
Pull request overview
This PR adds support for extracting conflict of interest statements and author contribution statements from scientific articles. The implementation includes updated model configurations, new tagging labels, and modifications to parsing and processing logic to handle these new statement types.
Changes:
- Added new tagging labels for conflict of interest (<conflict>) and author contribution (<contribution>) statements
- Updated model configurations to accommodate expanded character vocabularies and new tag types
- Modified parsing engines to extract and process the new statement types from both header and body sections
Reviewed changes
Copilot reviewed 27 out of 2468 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| grobid-home/models/*/preprocessor.json | Expanded character vocabularies to support additional Unicode characters encountered in documents |
| grobid-home/models/*/config.json | Updated character vocabulary sizes and batch sizes to align with preprocessor changes |
| TaggingLabels.java | Added label constants for conflict of interest and author contribution statements |
| SegmentationLabels.java | Registered new labels for segmentation model |
| Segmentation.java | Added handling for conflict and contribution statement extraction in training mode |
| HeaderParser.java | Implemented extraction logic for conflict and contribution statements from headers |
| FullTextParser.java | Added processing for conflict and contribution statements from non-header sections |
| BiblioItem.java | Added fields and accessors for storing conflict and contribution statements |
| BasicStructureBuilder.java | Removed unused import |
| doc/training/*.md | Updated documentation to describe new statement types and annotation guidelines |
| doc/benchmarks/*.md | Updated benchmark results reflecting model performance changes |
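As a rough illustration of the "fields and accessors" row for BiblioItem.java, the sketch below shows one way such statement storage could look. It is not the code from this PR: the class `StatementHolder`, its field and method names, and the whitespace normalization are all hypothetical assumptions made for the example.

```java
import java.util.Optional;

// Hypothetical stand-in for the statement fields described for BiblioItem.java.
// Class, field, and method names are illustrative assumptions, not GROBID's API.
class StatementHolder {
    private String conflictStatement;      // text labelled <conflict>
    private String contributionStatement;  // text labelled <contribution>

    public void setConflictStatement(String text) {
        this.conflictStatement = normalize(text);
    }

    public void setContributionStatement(String text) {
        this.contributionStatement = normalize(text);
    }

    // Optional makes "no statement found" explicit for callers.
    public Optional<String> getConflictStatement() {
        return Optional.ofNullable(conflictStatement);
    }

    public Optional<String> getContributionStatement() {
        return Optional.ofNullable(contributionStatement);
    }

    // Collapse runs of whitespace (e.g. PDF line breaks) and treat blank input as absent.
    private static String normalize(String text) {
        if (text == null) {
            return null;
        }
        String collapsed = text.trim().replaceAll("\\s+", " ");
        return collapsed.isEmpty() ? null : collapsed;
    }
}
```

Returning `Optional` rather than a nullable string is a design choice of this sketch only; the actual accessors in the PR may well return plain strings.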
Comments suppressed due to low confidence (1)
grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java:1496
- Extra blank line added at line 1443. While not functionally problematic, this inconsistent spacing should be removed to maintain code style consistency.
Adds the identification of "conflict of interests / declaration of interests" and author contribution / credit statements.
The evaluation requires an updated version of the dataset, available at https://huggingface.co/datasets/sciencialab/grobid-evaluation, in which the JATS files have been modified to identify the new statements.
We did our best, given the mess of the JATS Jungle.
We also used a dataset in which the author contribution and data availability statements had already been extracted with a different tool, and selected the documents that had missing or truncated statements. Here are some results: