The challenge of extracting structured metadata from scientific posters requires flexible solutions that accommodate different cost constraints, resource availability, and accuracy requirements. Our solution provides three distinct approaches to the same problem, each optimized for different operational contexts and user needs.
Method 1: DeepSeek API - For users who need immediate deployment with moderate costs (~$0.003/poster). Best balance of ease-of-use and accuracy for most applications.
Method 2: Qwen Local - For users with GPU resources who prioritize zero ongoing costs and data privacy. Runs entirely on local hardware with competitive accuracy.
Method 3: BioELECTRA+CRF - For users who need the highest accuracy and fastest processing, and can invest in creating training data. Eliminates hallucination risks through deterministic sequence labeling.
The choice depends entirely on your resources and requirements: If you have API budget and need quick deployment, use Method 1. If you have local GPU resources and prioritize privacy/zero costs, use Method 2. If you can invest in training data and need maximum accuracy with zero hallucination, Method 3 is optimal.
Additionally, Methods 1 and 2 can serve as data generation tools for training Method 3, allowing users to bootstrap their way to the highest-accuracy solution by auto-labeling posters for CRF training.
All accuracy figures in this document are unvalidated estimates based on limited testing and theoretical benchmarks. Actual accuracy can only be determined through proper validation using Cochran's random sampling methodology as outlined below; please validate before production use.
This repository directly addresses the original take-home task requirements with our three-tiered approach:
Core Pipeline Steps:
- PDF Text Extraction: Use PyMuPDF (fitz) to extract text from poster PDFs (see the sketch after this list)
- Text Preprocessing: Clean and structure extracted text for model consumption
- Metadata Extraction: Apply selected method (API LLM, Local LLM, or Transformer+CRF)
- JSON Output: Structure results according to Table 1 requirements (see below)
- Validation: Apply Cochran sampling methodology for expedited manual quality assessment
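A minimal sketch of the first two steps, assuming PyMuPDF is installed; `extract_poster_text` and `clean_text` are illustrative helpers, not necessarily the functions used in the repository's scripts:

```python
import fitz  # PyMuPDF

def extract_poster_text(pdf_path: str) -> str:
    """Extract raw text from every page of a poster PDF."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text("text") for page in doc)

def clean_text(raw: str) -> str:
    """Light preprocessing: strip whitespace and drop empty lines."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)

# Example with the bundled sample poster
poster_text = clean_text(extract_poster_text("data/test-poster.pdf"))
```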
Table 1 Metadata Fields (from original task):
- Title of the poster
- Authors (with affiliations)
- Summary of the poster
- Keywords
- Methods
- Results (main findings)
- References
- Funding source
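A hedged example of the target JSON structure covering these fields; the exact key names and nesting used by the repository's scripts may differ:

```json
{
  "title": "Example poster title",
  "authors": [{"name": "A. Author", "affiliation": "Example University"}],
  "summary": "One-paragraph summary of the poster.",
  "keywords": ["keyword one", "keyword two"],
  "methods": "Brief description of the experimental methods.",
  "results": "Main findings reported on the poster.",
  "references": ["First cited reference", "Second cited reference"],
  "funding": "Funding agency and grant number"
}
```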
Key Components:
- Text Extraction Engine: PyMuPDF for robust PDF processing
- LLM Interface Layer: Unified API for different model backends (sketched after this list)
- Prompt Engineering Module: Structured templates for consistent extraction
- Validation Framework: Statistical sampling and accuracy measurement
- Output Standardization: JSON schema compliance with required metadata fields
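A hedged sketch of what the interface layer could look like; the class and method names are illustrative and do not necessarily match the repository's code:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class MetadataExtractor(ABC):
    """Common interface so the three methods can be swapped behind one call."""

    @abstractmethod
    def extract(self, poster_text: str) -> Dict[str, Any]:
        """Return the Table 1 fields as a dictionary."""

class DeepSeekExtractor(MetadataExtractor):   # Method 1: API-backed
    def extract(self, poster_text: str) -> Dict[str, Any]:
        raise NotImplementedError("wraps the DeepSeek API call")

class QwenLocalExtractor(MetadataExtractor):  # Method 2: local inference
    def extract(self, poster_text: str) -> Dict[str, Any]:
        raise NotImplementedError("wraps local Qwen generation")
```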
Core Technologies:
- PyMuPDF (fitz): PDF text extraction engine
- OpenAI Python SDK: API client for Method 1
- HuggingFace Transformers: Model loading and inference for Method 2
- PyTorch: Deep learning framework with CUDA support
- python-dotenv: Secure environment variable management
- Jupyter: Interactive development and execution
Models by Method:
- Method 1: DeepSeek-Chat API (cost-effective, ~$0.003/poster)
- Method 2: Qwen2.5-1.5B-Instruct (local inference, 8-bit quantized)
- Method 3: BioELECTRA-base + CRF layer (demo only - requires training)
Hardware Requirements:
- Method 1: Any system with internet (CPU sufficient)
- Method 2: CUDA GPU with 8GB+ VRAM (RTX 4090 recommended for batching)
- Method 3: Same as Method 2 (after training completion)
API Integration:
- DeepSeek API: Primary endpoint (OpenAI-compatible; see the example after this list)
- Environment Variables: API keys loaded securely via .env file
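A minimal sketch of that integration, assuming the standard OpenAI Python SDK pointed at DeepSeek's OpenAI-compatible endpoint and a `DEEPSEEK_API_KEY` entry in `.env`; the prompt is heavily abbreviated compared to the structured templates in the notebooks, and the base URL should be checked against DeepSeek's current documentation:

```python
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads DEEPSEEK_API_KEY from the local .env file

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

def extract_metadata(poster_text: str) -> dict:
    """Ask the model to return the Table 1 fields as JSON."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "You extract scientific poster metadata as JSON."},
            {"role": "user", "content": (
                "Extract title, authors, summary, keywords, methods, results, "
                "references and funding from this poster text:\n" + poster_text
            )},
        ],
    )
    # Production code should handle non-JSON replies (e.g. fenced output) gracefully.
    return json.loads(response.choices[0].message.content)
```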
Key Assumptions:
- Poster PDFs contain extractable text (not image-only scans)
- Scientific posters follow standard academic formatting conventions
- Target metadata fields (Table 1) are present in poster content
- For Method 3: Training data can be generated via Methods 1-2 bootstrapping
Dependencies:
- Python 3.8+: Core runtime environment
- CUDA-capable GPU: Required for Methods 2-3 (8GB+ VRAM recommended)
- API Keys: DeepSeek API access for Method 1
- Training Data: 500-1000 labeled posters for Method 3 (generated via auto-labeling)
- Validation Dataset: Representative poster sample for Cochran sampling
Evaluation Approach:
- Cochran's Random Sampling: Statistically grounded validation methodology
- Field-Specific Metrics: Tailored accuracy measures per metadata type
- Cross-Method Comparison: Benchmarking across all three approaches
- Statistical Validation: 95% confidence intervals with finite population correction
Specific Metrics:
- Title Extraction: Exact match + semantic similarity (>0.8 threshold)
- Author Detection: Fuzzy string matching with edit distance <2
- Keyword Extraction: Overlap coefficient >0.6 with expert annotations
- Abstract Fields: BLEU score >0.7 vs. expert summaries
- Overall Accuracy: Weighted F1-score across all metadata fields
Sample Size Requirements (Cochran's Formula):
- 1000 posters → 278 validation samples (27.8%)
- 10,000 posters → 370 validation samples (3.7%)
- 100,000+ posters → 383 validation samples (0.4%)
Current Implementation:
- Method 1: Fully functional with enhanced structured prompting
- Method 2: Complete with 8-bit quantization and batching optimization
- Method 3: Demonstration framework only (requires training data)
Differences from Ideal Pipeline:
- Method 3 Limitation: Currently demo-only due to training data requirements
- Hardware Constraints: Optimized for single-GPU deployment vs. distributed inference
- API Fallbacks: Demo results provided when API keys unavailable
Testing Instructions:
- Clone repository: `git clone https://github.com/jimnoneill/poster-metadata-extractor.git`
- Install dependencies: `pip install -r requirements.txt`
- Configure API keys: `cp env.example .env` (edit as needed)
- Run notebooks: execute cells in `notebooks/01_method1_deepseek_api.ipynb`
- Validate outputs: check the `output/` directory for generated JSON files
Notebooks: 01_method1_deepseek_api.ipynb
Cost-effective API-based extraction using DeepSeek's language model.
Performance Characteristics:
- Estimated Accuracy: 85-90% (unvalidated - requires Cochran sampling validation)
- Cost: ~$0.003 per poster (200x cheaper than GPT-4)
- Speed: 5-15 seconds per poster
- Hallucination Risk: Low-Medium (mitigated by structured prompts)
- Setup: Easy - just requires API key
Best For: Production systems with budget constraints, high-volume processing
Notebooks: 02_method2_qwen_local.ipynb
Local small language model (1.5B parameters) for privacy-sensitive environments.
Performance Characteristics:
- Estimated Accuracy: 80-85% (unvalidated - requires Cochran sampling validation)
- Cost: $0 (runs locally, only electricity costs)
- Speed: 10-40 seconds per poster (single), ~1.1s per poster (RTX 4090 batched)
- Hallucination Risk: Low (structured prompting)
- Setup: Medium - requires model download and GPU memory
RTX 4090 Batching Capacity:
- Recommended batch size: 32 posters simultaneously
- Throughput: ~3,273 posters/hour, ~26,182 posters/day (8hrs)
- Memory efficiency: 8-bit quantization enables large-scale processing
Best For: Privacy-sensitive environments, budget-conscious deployments, edge computing
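A hedged sketch of how the 8-bit quantized model could be loaded and queried with HuggingFace Transformers and bitsandbytes; the generation settings and prompt formatting in the notebook may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",                                          # place on available GPU
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run a single chat-formatted prompt through the local model."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```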
Notebooks: 03_method3_bioelectra_crf_demo.ipynb
DEMONSTRATION ONLY - Future possibility requiring 500-1000 labeled posters for training.
Performance Characteristics (Estimated):
- Estimated Accuracy: 85-92% (theoretical - based on BLURB benchmarks, requires training & validation)
- Cost: $0 (after training - local inference only)
- Speed: <0.5 seconds per poster (fastest of all methods)
- Hallucination Risk: 0% (deterministic sequence labeling)
- Setup: Complex - requires extensive training data
Training Requirements: 500-1000 manually labeled poster PDFs with BIO annotations
Auto-Labeling Plan: The training data for Method 3 will be generated by auto-labeling roughly 1,000 posters (or as many as needed) using Methods 1 (DeepSeek) and 2 (Qwen) to bootstrap the CRF training dataset. This approach leverages LLM-generated annotations as weak supervision for the final deterministic model.
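A hedged sketch of the alignment step behind that bootstrapping idea: take a field value returned by Method 1 or 2, locate it in the tokenized poster text, and emit BIO tags. Real auto-labeling would need fuzzier matching and conflict handling; this only illustrates the principle.

```python
from typing import List

def bio_tags_for_field(tokens: List[str], field_tokens: List[str], label: str) -> List[str]:
    """Tag the first occurrence of field_tokens inside tokens with B-/I- labels."""
    tags = ["O"] * len(tokens)
    span = len(field_tokens)
    for i in range(len(tokens) - span + 1):
        window = [t.lower() for t in tokens[i:i + span]]
        if window == [t.lower() for t in field_tokens]:
            tags[i] = f"B-{label}"
            tags[i + 1:i + span] = [f"I-{label}"] * (span - 1)
            break
    return tags

# Example: align an LLM-extracted title against the poster tokens
tokens = "Deep learning for poster parsing improves recall".split()
title = "Deep learning for poster parsing".split()
print(bio_tags_for_field(tokens, title, "TITLE"))
# ['B-TITLE', 'I-TITLE', 'I-TITLE', 'I-TITLE', 'I-TITLE', 'O', 'O']
```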
| Feature | Method 1 (DeepSeek) | Method 2 (Qwen Local) | Method 3 (BioELECTRA) |
|---|---|---|---|
| Accuracy | 85-90% (unvalidated) | 80-85% (unvalidated) | 85-92% (theoretical) |
| Cost per poster | $0.003 | $0 | $0 |
| Speed | 5-15s | 10-40s | <0.5s |
| Privacy | External API | Local | Local |
| Setup complexity | Easy | Medium | Complex |
| Hallucination | Low-Med | Low | None |
| Training required | No | No | Yes (500-1000 posters) |
We strongly recommend validating extraction quality using Cochran's formula to determine a statistically defensible sample size (a short calculation sketch follows the definitions below):
n = (Z² × p × (1-p)) / e²
n_adjusted = n / (1 + (n-1)/N) # Finite population correction
Where:
- Z = 1.96 (95% confidence level)
- p = 0.5 (maximum variability assumption)
- e = 0.05 (±5% margin of error)
- N = total population size (number of posters)
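A small sketch of this calculation; it reproduces the sample sizes listed below:

```python
def cochran_sample_size(population: int, z: float = 1.96,
                        p: float = 0.5, e: float = 0.05) -> int:
    """Cochran's sample size with finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # ~384.16 for the defaults above
    n = n0 / (1 + (n0 - 1) / population)     # finite population correction
    return round(n)

for size in (100, 500, 1_000, 10_000, 100_000):
    print(size, cochran_sample_size(size))
# 100 -> 80, 500 -> 217, 1000 -> 278, 10000 -> 370, 100000 -> 383
```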
Sample Sizes by Dataset (with finite population correction):
- 100 posters: Validate ~80 randomly selected outputs (79.5%)
- 500 posters: Validate ~217 randomly selected outputs (43.5%)
- 1000 posters: Validate ~278 randomly selected outputs (27.8%)
- 10,000 posters: Validate ~370 randomly selected outputs (3.7%)
- 100,000+ posters: Validate ~383 randomly selected outputs (0.4%)
Key Insight: For smaller datasets (<1000), you must validate a high percentage. Only when scaling to tens of thousands of posters does the required validation percentage become practical (under 5%).
Validation Process:
- Extract metadata from full dataset
- Randomly sample using calculated sample size
- Expert manual review of sampled outputs
- Calculate accuracy metrics (precision, recall, F1)
- Apply correction factors to full dataset if needed
This provides a statistically sound quality assessment across all methods.
Field-specific accuracy calculation (a small sketch follows this list):
- Title: Exact match or semantic similarity >0.8
- Authors: Name matching with fuzzy string matching (edit distance <2)
- Keywords: Overlap coefficient >0.6 with expert annotations
- Methods/Results: BLEU score >0.7 compared to expert summaries
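A hedged sketch of two of these checks (edit distance for authors, overlap coefficient for keywords); the notebooks may use different libraries and thresholds:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance for fuzzy author-name matching."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def keyword_overlap(predicted: set, gold: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    if not predicted or not gold:
        return 0.0
    return len(predicted & gold) / min(len(predicted), len(gold))

assert edit_distance("Jane Smyth", "Jane Smith") < 2
assert keyword_overlap({"crf", "ner"}, {"crf", "ner", "posters"}) > 0.6
```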
```
poster_project/
├── notebooks/                              # Jupyter notebooks with execution outputs
│   ├── 01_method1_deepseek_api.ipynb       # DeepSeek API extraction
│   ├── 02_method2_qwen_local.ipynb         # Qwen local model
│   └── 03_method3_bioelectra_crf_demo.ipynb # BioELECTRA demo
├── src/                                    # Python implementation scripts
│   ├── method1_deepseek_api.py             # API extraction script
│   ├── method2_qwen_local.py               # Local model script
│   └── method3_bioelectra_crf_demo.py      # Demo script
├── data/                                   # Sample data and test files
│   ├── test-poster.pdf                     # Sample poster for testing
│   └── Take-home-task.pdf                  # Original assignment
├── output/                                 # Generated extraction results
│   ├── method1_deepseek_results.json       # DeepSeek API results
│   └── method2_qwen_results.json           # Qwen local results
├── requirements.txt                        # Python dependencies
├── env.example                             # Environment variables template
└── README.md                               # This documentation
```
```bash
git clone https://github.com/jimnoneill/poster-metadata-extractor.git
cd poster-metadata-extractor
pip install -r requirements.txt
cp env.example .env
# Edit .env and add your DEEPSEEK_API_KEY
```

For optimal performance with the Qwen local model:
- CUDA-capable GPU with 8GB+ VRAM
- PyTorch with CUDA support
```python
# Method 1: DeepSeek API
from src.method1_deepseek_api import extract_poster_metadata
results = extract_poster_metadata("data/your-poster.pdf")

# Method 2: Qwen Local
from src.method2_qwen_local import QwenExtractor, extract_text_from_pdf
extractor = QwenExtractor()
results = extractor.extract_poster_metadata("data/your-poster.pdf")

# Method 3: Demo only
from src.method3_bioelectra_crf_demo import bioelectra_crf_demo
demo_results = bioelectra_crf_demo()
```

All notebooks are ready to run with pre-executed outputs:
- Open desired method notebook in Jupyter
- Set API keys if using Method 1
- Run all cells to see extraction results
- Core: DeepSeek-Chat API with enhanced structured prompting
- Dependencies: `openai`, `python-dotenv`, `PyMuPDF`
- Features: JSON schema enforcement, cost tracking, fallback handling
- Authentication: Secure API key loading via environment variables
- Core: Qwen2.5-1.5B-Instruct with 8-bit quantization
- Dependencies: `torch`, `transformers`, `bitsandbytes`, `accelerate`
- Features: Few-shot prompting, GPU batching, memory optimization
- Optimization: CUDA acceleration with stderr suppression for clean execution
- Core: BioELECTRA-base + CRF layer (demonstration framework)
- Dependencies: `pytorch-crf`, `spacy`, `scikit-learn`
- Features: Sequence labeling, deterministic extraction, zero hallucination (see the model sketch below)
- Status: Demo only - requires training data for production use
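A hedged sketch of that architecture using `pytorch-crf`; the encoder checkpoint name and tag set are placeholders, and the actual demo script may be organized differently:

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # provided by the pytorch-crf package

class ElectraCrfTagger(nn.Module):
    """Token classifier: transformer encoder -> linear emissions -> CRF layer."""

    def __init__(self, encoder_name: str, num_tags: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # e.g. a BioELECTRA checkpoint
        self.emissions = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        scores = self.emissions(hidden)
        mask = attention_mask.bool()
        if tags is not None:                       # training: negative log-likelihood
            return -self.crf(scores, tags, mask=mask)
        return self.crf.decode(scores, mask=mask)  # inference: best BIO tag sequence
```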
- Start with Method 1 (DeepSeek API) for immediate deployment
- Implement Cochran sampling for quality validation
- Consider Method 2 for privacy-sensitive applications
- Plan Method 3 as long-term solution with proper training data
- Collect 500-1000 poster PDFs with diverse layouts and fields
- Manual BIO annotation (~40-60 expert hours)
- Entity types: Title, Authors, Affiliations, Methods, Results, Funding
- Alternative simpler approaches: Rule-based NER, spaCy custom models
- Validation: Cross-validation on held-out test set
Iterative Methodology:
- Phase 1: Rapid prototyping with API-based solution (Method 1)
- Phase 2: Privacy-preserving local implementation (Method 2)
- Phase 3: Scientific rigor through transformer+CRF architecture (Method 3)
- Validation: Cochran sampling framework for statistical significance
Design Decisions:
- Multi-tiered approach addresses different operational requirements
- Bootstrapping strategy leverages LLM capabilities for CRF training data generation
- JSON output standardization ensures consistency across all methods
- Modular architecture enables easy method comparison and selection
Technical Constraints:
- Method 3 Training: Requires substantial labeled dataset (500-1000 posters)
- GPU Dependencies: Methods 2-3 require CUDA-capable hardware for optimal performance
- Text-Only Processing: Cannot handle image-only or poorly scanned PDFs
- Single-Language Support: Optimized for English academic papers
Validation Limitations:
- Accuracy Estimates: Based on limited testing, require proper validation
- Domain Specificity: Tested primarily on biomedical/engineering posters
- Scale Testing: Not yet validated on large-scale deployments (>10K posters)
Immediate Improvements (3-6 months):
- Complete Method 3 Training: Generate 1000+ labeled examples using Methods 1-2
- OCR Integration: Add image processing for scanned posters using Tesseract/PaddleOCR
- Multilingual Support: Extend to Spanish, French, German scientific literature
- Batch Processing: Implement distributed processing for large poster collections
Future Developments (6-12 months):
- Multi-modal Architecture: Incorporate visual layout analysis using LayoutLM
- Domain Adaptation: Fine-tune models for specific scientific disciplines
- Active Learning: Implement uncertainty-based sample selection for validation
- Real-time API: Deploy as microservice with REST API for integration
Research Extensions (1+ years):
- Cross-lingual Transfer: Leverage multilingual transformers for global poster analysis
- Temporal Analysis: Track research trend evolution across poster collections
- Graph-based Extraction: Model author-institution-topic relationships
- Automated Quality Assessment: Self-monitoring extraction confidence scoring
Before Production Use:
- Conduct Cochran Sampling: Validate accuracy on representative poster sample
- Domain Testing: Evaluate performance across different scientific fields
- Scale Assessment: Test throughput and accuracy on large poster collections
- User Studies: Gather feedback from scientific librarians and researchers
Success Metrics:
- Accuracy: >90% field-level accuracy on validation set
- Coverage: Extract ≥7 of 8 Table 1 metadata fields per poster
- Throughput: Process ≥1000 posters/hour on standard hardware
- User Satisfaction: ≥85% user acceptance in library/repository contexts
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit changes (`git commit -am 'Add improvement'`)
- Push to the branch (`git push origin feature/improvement`)
- Create a Pull Request
```bibtex
@software{oneill2025poster,
  title={Scientific Poster Metadata Extraction Toolkit},
  author={ONeill, Jamey},
  year={2025},
  url={https://github.com/jimnoneill/poster-metadata-extractor}
}
```