
9th Place Solution - Kaggle Make Data Count: Finding Data References Competition

This repository contains the 9th place solution for the Make Data Count: Finding Data References competition on Kaggle.

Competition Overview

The goal of this competition is to identify and classify data references in scientific papers. The task involves finding two types of data references: dataset accession IDs and dataset DOIs.

Each reference must then be classified as either:

  • Primary: The dataset is a main contribution of the paper
  • Secondary: The dataset is referenced or used but not the main focus

Solution Architecture

Two-Pipeline Approach

This solution uses separate pipelines for different types of data references:

1. Accession Pipeline (run_acc.py)

  • Text Extraction: Extracts text from PDF and XML files
  • Citation Mapping: Builds citation mappings using Europe PMC (EUPMC) v4
  • Pattern Matching: Uses regex patterns to find accession numbers (CAB, HPA, EPI, CVCL patterns)
  • Chunk Processing: Extracts relevant text chunks around found accessions
  • Embedding Similarity: Uses BGE embeddings to score relevance for primary/secondary classification
  • Range Processing: Handles accession ranges and filters ENA-specific ranges
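The pattern-matching and chunking steps can be sketched as follows. The regexes below are illustrative only; the actual patterns live in acc_helpers.py and cover more databases and edge cases than shown here.

```python
import re

# Illustrative accession patterns (hypothetical; the real pipeline's
# patterns cover more databases, including CAB, HPA, EPI, and CVCL).
ACCESSION_PATTERNS = {
    "CVCL": re.compile(r"\bCVCL_[A-Z0-9]{4}\b"),  # Cellosaurus cell lines
    "EPI":  re.compile(r"\bEPI_ISL_\d+\b"),       # GISAID isolates
    "GEO":  re.compile(r"\bGSE\d{3,8}\b"),        # GEO series
}

def find_accessions(text: str, window: int = 200):
    """Return (db, accession, surrounding chunk) for each regex match.

    The chunk of text around each match is what would later be scored
    by the embedding model for primary/secondary relevance.
    """
    hits = []
    for db, pattern in ACCESSION_PATTERNS.items():
        for m in pattern.finditer(text):
            start = max(0, m.start() - window)
            chunk = text[start:m.end() + window]
            hits.append((db, m.group(0), chunk))
    return hits

paper = "Cell line HeLa (CVCL_0030) was used; raw reads are in GSE123456."
for db, acc, _chunk in find_accessions(paper):
    print(db, acc)
```

The window size and the exact set of patterns are assumptions for this sketch; the pipeline additionally expands accession ranges and filters ENA-specific ranges before scoring.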

2. DOI Pipeline (run_doi.py)

  • Citation Mapping: Builds citation mappings using DataCite v3
  • LLM Extraction: Uses Qwen2.5-7B-Instruct to extract DOIs from paper text
  • Validation: Validates extracted DOIs against known article datasets
  • Two-Stage Classification:
    • Stage 1: Classify as Data vs Literature vs Code (using Qwen2.5-32B-Instruct)
    • Stage 2: Classify Data DOIs as Primary vs Secondary
  • Special Handling: Automatic primary classification for Dryad DOIs

Final Combination (run_combine.py)

Combines outputs from both pipelines and generates the final submission file.
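A minimal sketch of the combination step, assuming each pipeline emits (article_id, dataset_id, type) rows and that the accession pipeline's predictions take precedence on conflicts. The helper name, column names, and precedence rule are assumptions, not the exact logic of run_combine.py.

```python
import csv

def combine_predictions(acc_rows, doi_rows, out_path="submission.csv"):
    """Merge accession- and DOI-pipeline rows into one submission file.

    Keeps the first prediction seen per (article_id, dataset_id) pair,
    so rows from the accession pipeline win on duplicates.
    """
    seen = set()
    merged = []
    for row in list(acc_rows) + list(doi_rows):
        key = (row[0], row[1])   # one prediction per (article, dataset)
        if key not in seen:
            seen.add(key)
            merged.append(row)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["article_id", "dataset_id", "type"])
        writer.writerows(merged)
    return merged
```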

Project Structure

src/
├── data/
│   ├── pdf_to_text.py          # PDF text extraction
│   └── xml_to_text.py          # XML text extraction
├── models/
│   ├── doi_extraction.py       # LLM-based DOI extraction
│   ├── doi_classifier_p1.py    # Data vs Literature classification
│   ├── doi_classifier_p2.py    # Primary vs Secondary classification
│   ├── paper_metadata_extractor.py  # Paper metadata extraction
│   └── embed_sim.py            # BGE embedding similarity scoring
├── utils/
│   ├── acc_helpers.py          # Accession processing utilities
│   ├── doi_helpers.py          # DOI processing utilities
│   ├── config.py               # Configuration management
│   ├── get_citations.py        # Citation mapping builder
│   ├── logging_util.py         # Logging utilities
│   └── metric.py               # Evaluation metrics
├── run_acc.py                  # Main accession pipeline
├── run_doi.py                  # Main DOI pipeline
└── run_combine.py              # Final combination script

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (for LLM inference)
  • 32GB+ RAM recommended
  • See requirements.txt for specific package versions

Key Dependencies

  • vllm==0.8.5.post1 - Fast LLM inference
  • sentence-transformers - BGE embeddings
  • pymupdf - PDF processing
  • logits-processor-zoo - LLM output processing
  • pandas, numpy - Data processing

Installation

  1. Clone the repository:
git clone https://github.com/bogoconic1/9th-place-kaggle-mdc-finding-data-references.git
cd 9th-place-kaggle-mdc-finding-data-references
  2. Install dependencies:
pip install -r requirements.txt
  3. Download required models:
  • Qwen/Qwen2.5-7B-Instruct-AWQ (for DOI extraction)
  • Qwen/Qwen2.5-32B-Instruct-AWQ (for DOI classification)
  • Qwen/Qwen3-Embedding-0.6B (for similarity scoring)


Configuration

Edit conf/main.yaml to set paths for:

  • Data directories (PDF/XML files)
  • Model paths
  • Environment settings (LOCAL/KAGGLE)

Example configuration:

runtime:
  env: "LOCAL"

data:
  pdf_dir:
    LOCAL: "/path/to/train/PDF"
  xml_dir:
    LOCAL: "/path/to/train/XML"

models:
  doi_extract_model:
    LOCAL: "Qwen/Qwen2.5-7B-Instruct-AWQ"
  doi_classify_model:
    LOCAL: "Qwen/Qwen2.5-32B-Instruct-AWQ"
  embedding_model:
    LOCAL: "Qwen/Qwen3-Embedding-0.6B"
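The per-environment lookup can be sketched as below, assuming the YAML is loaded with PyYAML. The helper name and the KAGGLE path are hypothetical; the nesting mirrors the example configuration above.

```python
import yaml

# Inline stand-in for conf/main.yaml (the KAGGLE path is hypothetical).
CONF = yaml.safe_load("""
runtime:
  env: LOCAL
data:
  pdf_dir:
    LOCAL: /path/to/train/PDF
    KAGGLE: /kaggle/input/mdc-data/PDF
""")

def resolve(conf, *keys):
    """Walk nested config keys, then pick the value for the active env."""
    node = conf
    for k in keys:
        node = node[k]
    return node[conf["runtime"]["env"]]

print(resolve(CONF, "data", "pdf_dir"))
```

Switching runtime.env to "KAGGLE" makes every resolve() call pick up the Kaggle-side paths without touching the rest of the config.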

Usage

Full Pipeline

Run the complete pipeline:

python src/run_acc.py      # Process accession numbers
python src/run_doi.py      # Process DOI references
python src/run_combine.py  # Generate final submission

License

This project is released under the MIT License.
