The challenge of extracting structured metadata from scientific posters requires flexible solutions that accommodate different cost constraints, resource availability, and accuracy requirements. Our solution provides three distinct approaches to the same problem, each optimized for different operational contexts and user needs.
Method 1: DeepSeek API - For users who need immediate deployment with moderate costs (~$0.003/poster). Best balance of ease-of-use and accuracy for most applications.
Method 2: Qwen Local - For users with GPU resources who prioritize zero ongoing costs and data privacy. Runs entirely on local hardware with competitive accuracy.
Method 3: BioELECTRA+CRF - For users who need the highest accuracy and fastest processing, and can invest in creating training data. Eliminates hallucination risks through deterministic sequence labeling.
The choice depends entirely on your resources and requirements: If you have API budget and need quick deployment, use Method 1. If you have local GPU resources and prioritize privacy/zero costs, use Method 2. If you can invest in training data and need maximum accuracy with zero hallucination, Method 3 is optimal.
Additionally, Methods 1 and 2 can serve as data generation tools for training Method 3, allowing users to bootstrap their way to the highest-accuracy solution by auto-labeling posters for CRF training.
All accuracy figures in this document are unvalidated estimates based on limited testing and theoretical benchmarks. Actual accuracy can only be determined through proper validation using Cochran's random sampling methodology as outlined below; please validate before production use.
This repository directly addresses the original take-home task requirements with our three-tiered approach:
Core Pipeline Steps:
- PDF Text Extraction: Use PyMuPDF (fitz) to extract text from poster PDFs (see the sketch after this list)
- Text Preprocessing: Clean and structure extracted text for model consumption
- Metadata Extraction: Apply selected method (API LLM, Local LLM, or Transformer+CRF)
- JSON Output: Structure results according to Table 1 requirements (see below)
- Validation: Apply Cochran sampling methodology for expedited manual quality assessment
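A minimal sketch of the first two steps, assuming PyMuPDF is installed; `extract_poster_text` and `clean_text` are illustrative helpers, not necessarily the functions used in the repository's scripts:

```python
import fitz  # PyMuPDF

def extract_poster_text(pdf_path: str) -> str:
    """Extract raw text from every page of a poster PDF."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text("text") for page in doc)

def clean_text(raw: str) -> str:
    """Light preprocessing: strip whitespace and drop empty lines."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)

# Example with the bundled sample poster
poster_text = clean_text(extract_poster_text("data/test-poster.pdf"))
```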
Table 1 Metadata Fields (from original task):
- Title of the poster
- Authors (with affiliations)
- Summary of the poster
- Keywords
- Methods
- Results (main findings)
- References
- Funding source
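A hedged example of the target JSON structure covering these fields; the exact key names and nesting used by the repository's scripts may differ:

```json
{
  "title": "Example poster title",
  "authors": [{"name": "A. Author", "affiliation": "Example University"}],
  "summary": "One-paragraph summary of the poster.",
  "keywords": ["keyword one", "keyword two"],
  "methods": "Brief description of the experimental methods.",
  "results": "Main findings reported on the poster.",
  "references": ["First cited reference", "Second cited reference"],
  "funding": "Funding agency and grant number"
}
```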
Key Components:
- Text Extraction Engine: PyMuPDF for robust PDF processing
- LLM Interface Layer: Unified API for different model backends (sketched after this list)
- Prompt Engineering Module: Structured templates for consistent extraction
- Validation Framework: Statistical sampling and accuracy measurement
- Output Standardization: JSON schema compliance with required metadata fields
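A hedged sketch of what the interface layer could look like; the class and method names are illustrative and do not necessarily match the repository's code:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class MetadataExtractor(ABC):
    """Common interface so the three methods can be swapped behind one call."""

    @abstractmethod
    def extract(self, poster_text: str) -> Dict[str, Any]:
        """Return the Table 1 fields as a dictionary."""

class DeepSeekExtractor(MetadataExtractor):   # Method 1: API-backed
    def extract(self, poster_text: str) -> Dict[str, Any]:
        raise NotImplementedError("wraps the DeepSeek API call")

class QwenLocalExtractor(MetadataExtractor):  # Method 2: local inference
    def extract(self, poster_text: str) -> Dict[str, Any]:
        raise NotImplementedError("wraps local Qwen generation")
```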
Core Technologies:
- PyMuPDF (fitz): PDF text extraction engine
- OpenAI Python SDK: API client for Method 1
- HuggingFace Transformers: Model loading and inference for Method 2
- PyTorch: Deep learning framework with CUDA support
- python-dotenv: Secure environment variable management
- Jupyter: Interactive development and execution
Models by Method:
- Method 1: DeepSeek-Chat API (cost-effective, ~$0.003/poster)
- Method 2: Qwen2.5-1.5B-Instruct (local inference, 8-bit quantized)
- Method 3: BioELECTRA-base + CRF layer (demo only - requires training)
Hardware Requirements:
- Method 1: Any system with internet (CPU sufficient)
- Method 2: CUDA GPU with 8GB+ VRAM (RTX 4090 recommended for batching)
- Method 3: Same as Method 2 (after training completion)
API Integration:
- DeepSeek API: Primary endpoint (OpenAI-compatible; see the example after this list)
- Environment Variables: API keys loaded securely via .env file
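A minimal sketch of that integration, assuming the standard OpenAI Python SDK pointed at DeepSeek's OpenAI-compatible endpoint and a `DEEPSEEK_API_KEY` entry in `.env`; the prompt is heavily abbreviated compared to the structured templates in the notebooks, and the base URL should be checked against DeepSeek's current documentation:

```python
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads DEEPSEEK_API_KEY from the local .env file

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

def extract_metadata(poster_text: str) -> dict:
    """Ask the model to return the Table 1 fields as JSON."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "You extract scientific poster metadata as JSON."},
            {"role": "user", "content": (
                "Extract title, authors, summary, keywords, methods, results, "
                "references and funding from this poster text:\n" + poster_text
            )},
        ],
    )
    # Production code should handle non-JSON replies (e.g. fenced output) gracefully.
    return json.loads(response.choices[0].message.content)
```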
Key Assumptions:
- Poster PDFs contain extractable text (not image-only scans)
- Scientific posters follow standard academic formatting conventions
- Target metadata fields (Table 1) are present in poster content
- For Method 3: Training data can be generated via Methods 1-2 bootstrapping
Dependencies:
- Python 3.8+: Core runtime environment
- CUDA-capable GPU: Required for Methods 2-3 (8GB+ VRAM recommended)
- API Keys: DeepSeek API access for Method 1
- Training Data: 500-1000 labeled posters for Method 3 (generated via auto-labeling)
- Validation Dataset: Representative poster sample for Cochran sampling
Evaluation Approach:
- Cochran's Random Sampling: Statistically grounded validation methodology
- Field-Specific Metrics: Tailored accuracy measures per metadata type
- Cross-Method Comparison: Benchmarking across all three approaches
- Statistical Validation: 95% confidence intervals with finite population correction
Specific Metrics:
- Title Extraction: Exact match + semantic similarity (>0.8 threshold)
- Author Detection: Fuzzy string matching with edit distance <2
- Keyword Extraction: Overlap coefficient >0.6 with expert annotations
- Abstract Fields: BLEU score >0.7 vs. expert summaries
- Overall Accuracy: Weighted F1-score across all metadata fields
Sample Size Requirements (Cochran's Formula):
- 1000 posters → 278 validation samples (27.8%)
- 10,000 posters → 370 validation samples (3.7%)
- 100,000+ posters → 383 validation samples (0.4%)
Current Implementation:
- Method 1: Fully functional with enhanced structured prompting
- Method 2: Complete with 8-bit quantization and batching optimization
- Method 3: Demonstration framework only (requires training data)
Differences from Ideal Pipeline:
- Method 3 Limitation: Currently demo-only due to training data requirements
- Hardware Constraints: Optimized for single-GPU deployment vs. distributed inference
- API Fallbacks: Demo results provided when API keys unavailable
Testing Instructions:
- Clone repository: `git clone https://github.com/jimnoneill/poster-metadata-extractor.git`
- Install dependencies: `pip install -r requirements.txt`
- Configure API keys: `cp env.example .env` (edit as needed)
- Run notebooks: execute cells in `notebooks/01_method1_deepseek_api.ipynb`
- Validate outputs: check the `output/` directory for generated JSON files
Notebooks: 01_method1_deepseek_api.ipynb
Cost-effective API-based extraction using DeepSeek's language model.
Performance Characteristics:
- Estimated Accuracy: 85-90% (unvalidated - requires Cochran sampling validation)
- Cost: ~$0.003 per poster (200x cheaper than GPT-4)
- Speed: 5-15 seconds per poster
- Hallucination Risk: Low-Medium (mitigated by structured prompts)
- Setup: Easy - just requires API key
Best For: Production systems with budget constraints, high-volume processing
Notebooks: 02_method2_qwen_local.ipynb
Local small language model (1.5B parameters) for privacy-sensitive environments.
Performance Characteristics:
- Estimated Accuracy: 80-85% (unvalidated - requires Cochran sampling validation)
- Cost: $0 (runs locally, only electricity costs)
- Speed: 10-40 seconds per poster (single), ~1.1s per poster (RTX 4090 batched)
- Hallucination Risk: Low (structured prompting)
- Setup: Medium - requires model download and GPU memory
RTX 4090 Batching Capacity:
- Recommended batch size: 32 posters simultaneously
- Throughput: ~3,273 posters/hour, ~26,182 posters/day (8hrs)
- Memory efficiency: 8-bit quantization enables large-scale processing
Best For: Privacy-sensitive environments, budget-conscious deployments, edge computing
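A hedged sketch of how the 8-bit quantized model could be loaded and queried with HuggingFace Transformers and bitsandbytes; the generation settings and prompt formatting in the notebook may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",                                          # place on available GPU
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run a single chat-formatted prompt through the local model."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```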
Notebooks: 03_method3_bioelectra_crf_demo.ipynb
DEMONSTRATION ONLY - Future possibility requiring 500-1000 labeled posters for training.
Performance Characteristics (Estimated):
- Estimated Accuracy: 85-92% (theoretical - based on BLURB benchmarks, requires training & validation)
- Cost: $0 (after training - local inference only)
- Speed: <0.5 seconds per poster (fastest of all methods)
- Hallucination Risk: 0% (deterministic sequence labeling)
- Setup: Complex - requires extensive training data
Training Requirements: 500-1000 manually labeled poster PDFs with BIO annotations
Auto-Labeling Plan: The training data for Method 3 will be generated by auto-labeling roughly 1,000 posters (or as many as needed) using Methods 1 (DeepSeek) and 2 (Qwen) to bootstrap the CRF training dataset. This approach leverages LLM-generated annotations as weak supervision for the final deterministic model.
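A hedged sketch of the alignment step behind that bootstrapping idea: take a field value returned by Method 1 or 2, locate it in the tokenized poster text, and emit BIO tags. Real auto-labeling would need fuzzier matching and conflict handling; this only illustrates the principle.

```python
from typing import List

def bio_tags_for_field(tokens: List[str], field_tokens: List[str], label: str) -> List[str]:
    """Tag the first occurrence of field_tokens inside tokens with B-/I- labels."""
    tags = ["O"] * len(tokens)
    span = len(field_tokens)
    for i in range(len(tokens) - span + 1):
        window = [t.lower() for t in tokens[i:i + span]]
        if window == [t.lower() for t in field_tokens]:
            tags[i] = f"B-{label}"
            tags[i + 1:i + span] = [f"I-{label}"] * (span - 1)
            break
    return tags

# Example: align an LLM-extracted title against the poster tokens
tokens = "Deep learning for poster parsing improves recall".split()
title = "Deep learning for poster parsing".split()
print(bio_tags_for_field(tokens, title, "TITLE"))
# ['B-TITLE', 'I-TITLE', 'I-TITLE', 'I-TITLE', 'I-TITLE', 'O', 'O']
```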
| Feature | Method 1 (DeepSeek) | Method 2 (Qwen Local) | Method 3 (BioELECTRA) |
|---|---|---|---|
| Accuracy | 85-90% (unvalidated) | 80-85% (unvalidated) | 85-92% (theoretical) |
| Cost per poster | $0.003 | $0 | $0 |
| Speed | 5-15s | 10-40s | <0.5s |
| Privacy | External API | Local | Local |
| Setup complexity | Easy | Medium | Complex |
| Hallucination | Low-Med | Low | None |
| Training required | No | No | Yes (500-1000 posters) |
We strongly recommend validating extraction quality using Cochran's formula to determine a statistically defensible sample size (a short calculation sketch follows the definitions below):
n = (Z² × p × (1-p)) / e²
n_adjusted = n / (1 + (n-1)/N) # Finite population correction
Where:
- Z = 1.96 (95% confidence level)
- p = 0.5 (maximum variability assumption)
- e = 0.05 (±5% margin of error)
- N = total population size (number of posters)
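A small sketch of this calculation; it reproduces the sample sizes listed below:

```python
def cochran_sample_size(population: int, z: float = 1.96,
                        p: float = 0.5, e: float = 0.05) -> int:
    """Cochran's sample size with finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # ~384.16 for the defaults above
    n = n0 / (1 + (n0 - 1) / population)     # finite population correction
    return round(n)

for size in (100, 500, 1_000, 10_000, 100_000):
    print(size, cochran_sample_size(size))
# 100 -> 80, 500 -> 217, 1000 -> 278, 10000 -> 370, 100000 -> 383
```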
Sample Sizes by Dataset (with finite population correction):
- 100 posters: Validate ~80 randomly selected outputs (79.5%)
- 500 posters: Validate ~217 randomly selected outputs (43.5%)
- 1000 posters: Validate ~278 randomly selected outputs (27.8%)
- 10,000 posters: Validate ~370 randomly selected outputs (3.7%)
- 100,000+ posters: Validate ~383 randomly selected outputs (0.4%)
Key Insight: For smaller datasets (<1000), you must validate a high percentage. Only when scaling to tens of thousands of posters does the required validation percentage become practical (under 5%).
Validation Process:
- Extract metadata from full dataset
- Randomly sample using calculated sample size
- Expert manual review of sampled outputs
- Calculate accuracy metrics (precision, recall, F1)
- Apply correction factors to full dataset if needed
This provides a statistically sound quality assessment across all methods.
Field-specific accuracy calculation (a small sketch follows this list):
- Title: Exact match or semantic similarity >0.8
- Authors: Name matching with fuzzy string matching (edit distance <2)
- Keywords: Overlap coefficient >0.6 with expert annotations
- Methods/Results: BLEU score >0.7 compared to expert summaries
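A hedged sketch of two of these checks (edit distance for authors, overlap coefficient for keywords); the notebooks may use different libraries and thresholds:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance for fuzzy author-name matching."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def keyword_overlap(predicted: set, gold: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    if not predicted or not gold:
        return 0.0
    return len(predicted & gold) / min(len(predicted), len(gold))

assert edit_distance("Jane Smyth", "Jane Smith") < 2
assert keyword_overlap({"crf", "ner"}, {"crf", "ner", "posters"}) > 0.6
```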
```
poster_project/
├── notebooks/                              # Jupyter notebooks with execution outputs
│   ├── 01_method1_deepseek_api.ipynb       # DeepSeek API extraction
│   ├── 02_method2_qwen_local.ipynb         # Qwen local model
│   └── 03_method3_bioelectra_crf_demo.ipynb # BioELECTRA demo
├── src/                                    # Python implementation scripts
│   ├── method1_deepseek_api.py             # API extraction script
│   ├── method2_qwen_local.py               # Local model script
│   └── method3_bioelectra_crf_demo.py      # Demo script
├── data/                                   # Sample data and test files
│   ├── test-poster.pdf                     # Sample poster for testing
│   └── Take-home-task.pdf                  # Original assignment
├── output/                                 # Generated extraction results
│   ├── method1_deepseek_results.json       # DeepSeek API results
│   └── method2_qwen_results.json           # Qwen local results
├── requirements.txt                        # Python dependencies
├── env.example                             # Environment variables template
└── README.md                               # This documentation
```
```bash
git clone https://github.com/jimnoneill/poster-metadata-extractor.git
cd poster-metadata-extractor
pip install -r requirements.txt
cp env.example .env
# Edit .env and add your DEEPSEEK_API_KEY
```

For optimal performance with the Qwen local model:
- CUDA-capable GPU with 8GB+ VRAM
- PyTorch with CUDA support
```python
# Method 1: DeepSeek API
from src.method1_deepseek_api import extract_poster_metadata
results = extract_poster_metadata("data/your-poster.pdf")

# Method 2: Qwen Local
from src.method2_qwen_local import QwenExtractor, extract_text_from_pdf
extractor = QwenExtractor()
results = extractor.extract_poster_metadata("data/your-poster.pdf")

# Method 3: Demo only
from src.method3_bioelectra_crf_demo import bioelectra_crf_demo
demo_results = bioelectra_crf_demo()
```

All notebooks are ready to run with pre-executed outputs:
- Open desired method notebook in Jupyter
- Set API keys if using Method 1
- Run all cells to see extraction results
- Core: DeepSeek-Chat API with enhanced structured prompting
- Dependencies: `openai`, `python-dotenv`, `PyMuPDF`
- Features: JSON schema enforcement, cost tracking, fallback handling
- Authentication: Secure API key loading via environment variables
- Core: Qwen2.5-1.5B-Instruct with 8-bit quantization
- Dependencies: `torch`, `transformers`, `bitsandbytes`, `accelerate`
- Features: Few-shot prompting, GPU batching, memory optimization
- Optimization: CUDA acceleration with stderr suppression for clean execution
- Core: BioELECTRA-base + CRF layer (demonstration framework)
- Dependencies: `pytorch-crf`, `spacy`, `scikit-learn`
- Features: Sequence labeling, deterministic extraction, zero hallucination (see the model sketch below)
- Status: Demo only - requires training data for production use
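A hedged sketch of that architecture using `pytorch-crf`; the encoder checkpoint name and tag set are placeholders, and the actual demo script may be organized differently:

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # provided by the pytorch-crf package

class ElectraCrfTagger(nn.Module):
    """Token classifier: transformer encoder -> linear emissions -> CRF layer."""

    def __init__(self, encoder_name: str, num_tags: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # e.g. a BioELECTRA checkpoint
        self.emissions = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        scores = self.emissions(hidden)
        mask = attention_mask.bool()
        if tags is not None:                       # training: negative log-likelihood
            return -self.crf(scores, tags, mask=mask)
        return self.crf.decode(scores, mask=mask)  # inference: best BIO tag sequence
```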
- Start with Method 1 (DeepSeek API) for immediate deployment
- Implement Cochran sampling for quality validation
- Consider Method 2 for privacy-sensitive applications
- Plan Method 3 as long-term solution with proper training data
- Collect 500-1000 poster PDFs with diverse layouts and fields
- Manual BIO annotation (~40-60 expert hours)
- Entity types: Title, Authors, Affiliations, Methods, Results, Funding
- Alternative simpler approaches: Rule-based NER, spaCy custom models
- Validation: Cross-validation on held-out test set
Iterative Methodology:
- Phase 1: Rapid prototyping with API-based solution (Method 1)
- Phase 2: Privacy-preserving local implementation (Method 2)
- Phase 3: Scientific rigor through transformer+CRF architecture (Method 3)
- Validation: Cochran sampling framework for statistical significance
Design Decisions:
- Multi-tiered approach addresses different operational requirements
- Bootstrapping strategy leverages LLM capabilities for CRF training data generation
- JSON output standardization ensures consistency across all methods
- Modular architecture enables easy method comparison and selection
Technical Constraints:
- Method 3 Training: Requires substantial labeled dataset (500-1000 posters)
- GPU Dependencies: Methods 2-3 require CUDA-capable hardware for optimal performance
- Text-Only Processing: Cannot handle image-only or poorly scanned PDFs
- Single-Language Support: Optimized for English academic papers
Validation Limitations:
- Accuracy Estimates: Based on limited testing, require proper validation
- Domain Specificity: Tested primarily on biomedical/engineering posters
- Scale Testing: Not yet validated on large-scale deployments (>10K posters)
Immediate Improvements (3-6 months):
- Complete Method 3 Training: Generate 1000+ labeled examples using Methods 1-2
- OCR Integration: Add image processing for scanned posters using Tesseract/PaddleOCR
- Multilingual Support: Extend to Spanish, French, German scientific literature
- Batch Processing: Implement distributed processing for large poster collections
Future Developments (6-12 months):
- Multi-modal Architecture: Incorporate visual layout analysis using LayoutLM
- Domain Adaptation: Fine-tune models for specific scientific disciplines
- Active Learning: Implement uncertainty-based sample selection for validation
- Real-time API: Deploy as microservice with REST API for integration
Research Extensions (1+ years):
- Cross-lingual Transfer: Leverage multilingual transformers for global poster analysis
- Temporal Analysis: Track research trend evolution across poster collections
- Graph-based Extraction: Model author-institution-topic relationships
- Automated Quality Assessment: Self-monitoring extraction confidence scoring
Before Production Use:
- Conduct Cochran Sampling: Validate accuracy on representative poster sample
- Domain Testing: Evaluate performance across different scientific fields
- Scale Assessment: Test throughput and accuracy on large poster collections
- User Studies: Gather feedback from scientific librarians and researchers
Success Metrics:
- Accuracy: >90% field-level accuracy on validation set
- Coverage: Extract ≥7 of 8 Table 1 metadata fields per poster
- Throughput: Process ≥1000 posters/hour on standard hardware
- User Satisfaction: ≥85% user acceptance in library/repository contexts
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit changes (`git commit -am 'Add improvement'`)
- Push to the branch (`git push origin feature/improvement`)
- Create a Pull Request
```bibtex
@software{oneill2025poster,
  title={Scientific Poster Metadata Extraction Toolkit},
  author={ONeill, Jamey},
  year={2025},
  url={https://github.com/jimnoneill/poster-metadata-extractor}
}
```