This repository contains the 9th place solution for the Make Data Count: Finding Data References competition on Kaggle.
The goal of this competition is to identify and classify data references in scientific papers. The task involves finding two types of data references:
- Accession numbers (database identifiers such as CAB12345 or HPA67890)
- DOI links (e.g., https://doi.org/10.5061/dryad.abc123)
Each reference must be classified as either:
- Primary: The dataset is a main contribution of the paper
- Secondary: The dataset is referenced or used but not the main focus
This solution uses separate pipelines for the two types of data references.

Accession pipeline:
- Text Extraction: Extracts text from PDF and XML files
- Citation Mapping: Builds citation mappings using Europe PMC (EUPMC) v4
- Pattern Matching: Uses regex patterns to find accession numbers (CAB, HPA, EPI, CVCL patterns); a minimal sketch follows this list
- Chunk Processing: Extracts relevant text chunks around found accessions
- Embedding Similarity: Uses BGE embeddings to score relevance for primary/secondary classification
- Range Processing: Handles accession ranges and filters ENA-specific ranges
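
The pattern-matching step can be pictured as a small set of compiled regexes applied to the extracted text. The patterns below are illustrative assumptions covering the families named above; the actual patterns live in the repository (presumably `src/utils/acc_helpers.py`):

```python
# Hedged sketch: example patterns for a few accession families only.
import re

ACCESSION_PATTERNS = {
    "CAB": re.compile(r"\bCAB\d{5,}\b"),          # e.g. CAB12345
    "HPA": re.compile(r"\bHPA\d{5,}\b"),          # e.g. HPA67890
    "EPI": re.compile(r"\bEPI_ISL_\d+\b"),        # GISAID isolate IDs
    "CVCL": re.compile(r"\bCVCL_[A-Z0-9]{4}\b"),  # Cellosaurus cell lines
}

def find_accessions(text):
    """Yield (family, accession, offset) for every match in the text."""
    for family, pattern in ACCESSION_PATTERNS.items():
        for m in pattern.finditer(text):
            yield family, m.group(0), m.start()
```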
DOI pipeline:
- Citation Mapping: Builds citation mappings using DataCite v3
- LLM Extraction: Uses Qwen2.5-7B-Instruct to extract DOIs from paper text
- Validation: Validates extracted DOIs against known article datasets
- Two-Stage Classification (see the sketch after this list):
  - Stage 1: Classify as Data vs Literature vs Code (using Qwen2.5-32B-Instruct)
  - Stage 2: Classify Data DOIs as Primary vs Secondary
- Special Handling: Automatic primary classification for Dryad DOIs
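
The two LLM stages can be sketched with plain vLLM calls. The prompts and parsing below are invented for illustration, not the repository's actual prompts (those live in `src/models/doi_classifier_p1.py` and `doi_classifier_p2.py`):

```python
# Hedged sketch of the two-stage DOI classification; prompts are assumptions.
from typing import Optional

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-AWQ", quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=8)

def classify_doi(doi: str, context: str) -> Optional[str]:
    # Stage 1: Data vs Literature vs Code.
    stage1 = llm.generate(
        f"Context: {context}\nDOI: {doi}\n"
        "Answer with one word - Data, Literature, or Code:",
        params,
    )[0].outputs[0].text.strip()
    if stage1 != "Data":
        return None  # only Data DOIs proceed to stage 2
    if "dryad" in doi.lower():
        return "Primary"  # special handling: Dryad DOIs are always Primary
    # Stage 2: Primary vs Secondary.
    stage2 = llm.generate(
        f"Context: {context}\nDOI: {doi}\n"
        "Is this dataset Primary or Secondary? Answer with one word:",
        params,
    )[0].outputs[0].text.strip()
    return stage2
```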
A final combination step merges the outputs from both pipelines and generates the submission file, roughly as sketched below.
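
A minimal sketch of that merge, assuming the two pipelines write intermediate CSVs with article/dataset columns (the file names and column names here are illustrative; the actual logic is in `src/run_combine.py`):

```python
# Hedged sketch: file names and column names are assumptions.
import pandas as pd

acc = pd.read_csv("acc_predictions.csv")  # output of run_acc.py
doi = pd.read_csv("doi_predictions.csv")  # output of run_doi.py

submission = (
    pd.concat([acc, doi], ignore_index=True)
    .drop_duplicates(subset=["article_id", "dataset_id"])
    .reset_index(drop=True)
)
submission.to_csv("submission.csv", index=False)
```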
Repository layout:

```
src/
├── data/
│   ├── pdf_to_text.py                # PDF text extraction
│   └── xml_to_text.py                # XML text extraction
├── models/
│   ├── doi_extraction.py             # LLM-based DOI extraction
│   ├── doi_classifier_p1.py          # Data vs Literature classification
│   ├── doi_classifier_p2.py          # Primary vs Secondary classification
│   ├── paper_metadata_extractor.py   # Paper metadata extraction
│   └── embed_sim.py                  # BGE embedding similarity scoring
├── utils/
│   ├── acc_helpers.py                # Accession processing utilities
│   ├── doi_helpers.py                # DOI processing utilities
│   ├── config.py                     # Configuration management
│   ├── get_citations.py              # Citation mapping builder
│   ├── logging_util.py               # Logging utilities
│   └── metric.py                     # Evaluation metrics
├── run_acc.py                        # Main accession pipeline
├── run_doi.py                        # Main DOI pipeline
└── run_combine.py                    # Final combination script
```
- Python 3.8+
- CUDA-compatible GPU (for LLM inference)
- 32GB+ RAM recommended
- See requirements.txt for specific package versions
Key dependencies:
- vllm==0.8.5.post1 - Fast LLM inference
- sentence-transformers - BGE embeddings
- pymupdf - PDF processing
- logits-processor-zoo - LLM output processing
- pandas, numpy - Data processing
- Clone the repository:

```sh
git clone https://github.com/bogoconic1/9th-place-kaggle-mdc-finding-data-references.git
cd 9th-place-kaggle-mdc-finding-data-references
```

- Install dependencies:

```sh
pip install -r requirements.txt
```
- Download required models (see the sketch after this list):
- Qwen/Qwen2.5-7B-Instruct-AWQ (for DOI extraction)
- Qwen/Qwen2.5-32B-Instruct-AWQ (for DOI classification)
- Qwen/Qwen3-Embedding-0.6B (for similarity scoring)
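
One way to fetch the weights is huggingface_hub's snapshot_download; any method works as long as the paths in conf/main.yaml point at the downloaded weights:

```python
# Downloads each model into the local Hugging Face cache.
from huggingface_hub import snapshot_download

for repo_id in (
    "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "Qwen/Qwen2.5-32B-Instruct-AWQ",
    "Qwen/Qwen3-Embedding-0.6B",
):
    snapshot_download(repo_id=repo_id)
```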
Edit conf/main.yaml to set paths for:
- Data directories (PDF/XML files)
- Model paths
- Environment settings (LOCAL/KAGGLE)
Example configuration:
```yaml
runtime:
  env: "LOCAL"

data:
  pdf_dir:
    LOCAL: "/path/to/train/PDF"
  xml_dir:
    LOCAL: "/path/to/train/XML"

models:
  doi_extract_model:
    LOCAL: "Qwen/Qwen2.5-7B-Instruct-AWQ"
  doi_classify_model:
    LOCAL: "Qwen/Qwen2.5-32B-Instruct-AWQ"
  embedding_model:
    LOCAL: "Qwen/Qwen3-Embedding-0.6B"
```
Run the complete pipeline:
```sh
python src/run_acc.py      # Process accession numbers
python src/run_doi.py      # Process DOI references
python src/run_combine.py  # Generate final submission
```
This project is released under the MIT License.