A configurable AI-powered tool to automatically score research papers based on custom criteria.
- Multi-Model AI Support - Gemini, Claude, and Groq with automatic fallbacks
- Hybrid File Management - Upload files or browse local folders
- Configurable Scoring - Customize prompts and output formats
- Real-time Dashboard - Streamlit web interface with progress tracking
- Resume Capability - Process can be stopped and resumed
- Robust Output - CSV exports with comprehensive metadata
- Citation Lookup - Automatic citation count retrieval
# Clone the repository
git clone https://github.com/your-username/research-paper-scorer.git
cd research-paper-scorer
# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env
# Edit .env with your API keys
nano .env

Add your API keys to the .env file:
# Get these from respective providers
GEMINI_API_KEY=your_gemini_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
GROQ_API_KEY=your_groq_api_key_here
# Processing settings
BATCH_SIZE=3
MAX_WORKERS=2
DEFAULT_MODEL=gemini

- Gemini API: Google AI Studio
- Anthropic API: Anthropic Console
- Groq API: Groq Console
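At startup the application needs these settings available as environment variables. A minimal sketch of how they might be read, using only the standard library (the `load_env` helper is hypothetical, not part of this project's code; real projects often use python-dotenv for this):

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: KEY=value lines; blank lines and '#' comments skipped."""
    settings = {}
    if not os.path.exists(path):
        return settings
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

# Real environment variables take precedence over values from the file
config = {**load_env(), **os.environ}
BATCH_SIZE = int(config.get("BATCH_SIZE", "3"))
MAX_WORKERS = int(config.get("MAX_WORKERS", "2"))
DEFAULT_MODEL = config.get("DEFAULT_MODEL", "gemini")
```

Missing keys fall back to the defaults shown in the .env template above, so the tool can start even with a partial configuration.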
Web Interface (Recommended):
streamlit run app.py

Command Line Interface:
# Setup directory structure
python main.py --setup
# Process papers
python main.py --process --max-files 5
# Check status
python main.py --status

- Upload PDFs: Use the sidebar to upload PDFs or browse local folders
- Configure Scoring: Customize scoring criteria in the Configuration tab
- Start Processing: Monitor real-time progress in the Processing tab
- View Results: Analyze results with visualizations in the Results tab
The application organizes files into folders:
data/
├── pending/      # PDFs ready for processing
├── processing/   # Currently being processed
├── completed/    # Successfully processed
├── failed/       # Failed processing
└── outputs/      # CSV results and logs
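Moving files between these stage folders is what makes the stop-and-resume workflow possible: a paper's location records its state. A sketch of that idea with `pathlib` (the helper names here are illustrative, not the project's actual API):

```python
import shutil
from pathlib import Path

DATA = Path("data")
STAGES = ["pending", "processing", "completed", "failed", "outputs"]

def setup_dirs(base=DATA):
    """Create the directory layout shown above (like `python main.py --setup`)."""
    for stage in STAGES:
        (base / stage).mkdir(parents=True, exist_ok=True)

def move_to_stage(pdf: Path, stage: str, base=DATA) -> Path:
    """Move a paper into the folder for its next processing stage."""
    target = base / stage / pdf.name
    shutil.move(str(pdf), target)
    return target
```

On restart, anything still in `pending/` simply gets picked up again, and anything stranded in `processing/` can be swept back by the cleanup command.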
Choose from pre-built templates or create custom ones:
- General Research - Standard academic paper scoring
- Medical Research - Clinical relevance, study design, ethics
- Engineering - Technical innovation, validation, scalability
- Social Sciences - Theory, methodology, social relevance
research-paper-scorer/
├── app.py            # Streamlit web interface
├── main.py           # Command line interface
├── config.py         # Configuration settings
├── src/              # Source code modules
│   ├── processors/   # File and paper processing
│   ├── models/       # AI model handlers
│   ├── extractors/   # PDF and citation extraction
│   ├── outputs/      # CSV handling
│   └── dashboard/    # Streamlit UI components
├── data/             # Processing directories
├── templates/        # Scoring prompts and formats
├── logs/             # Error logs
└── tests/            # Unit tests
Customize in templates/scoring_prompts/:
- Modify existing templates
- Create domain-specific scoring criteria
- Use JSON format for structured outputs
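Asking the model to reply in JSON is what makes the output machine-parseable. A sketch of what handling such a reply might look like (the dimension names below are illustrative examples, not the tool's fixed schema):

```python
import json

# Example of the kind of reply a JSON-format scoring prompt might request.
# The dimension names are illustrative, not the project's actual schema.
raw_response = """
{
  "novelty": {"score": 8, "justification": "Introduces a new benchmark."},
  "methodology": {"score": 7, "justification": "Sound design, small sample."},
  "overall": 7.5
}
"""

def parse_scores(text: str) -> dict:
    """Parse the model's JSON reply, failing loudly on malformed output."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc

scores = parse_scores(raw_response)
```

Failing loudly here lets a bad response route the paper to `failed/` instead of silently writing garbage to the CSV.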
Customize CSV columns in templates/output_formats/:
- Basic: Essential paper information
- Detailed: Complete scoring breakdown
- Custom: Your own column structure
Adjust in .env:
- BATCH_SIZE: Papers processed simultaneously
- MAX_WORKERS: Parallel processing threads
- DEFAULT_MODEL: Preferred AI model
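The interplay of these two settings can be sketched as batched parallel processing: papers are consumed one batch at a time, with a small thread pool inside each batch (the `score_paper` function is a stand-in for the real scoring call):

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 3   # papers per batch (BATCH_SIZE in .env)
MAX_WORKERS = 2  # parallel scoring threads (MAX_WORKERS in .env)

def score_paper(path: str) -> str:
    """Hypothetical stand-in for the real scoring call."""
    return f"{path}: scored"

def process(papers):
    results = []
    # One batch at a time, so an interruption loses at most one batch of work
    for i in range(0, len(papers), BATCH_SIZE):
        batch = papers[i:i + BATCH_SIZE]
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            results.extend(pool.map(score_paper, batch))
    return results
```

Raising MAX_WORKERS speeds things up but multiplies your request rate against the provider's RPM limit, which is why the tuning advice below says to watch API limits.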
# Process all papers
python main.py --process
# Process with limits
python main.py --process --max-files 10 --batch-size 2
# Use custom scoring prompt
python main.py --process --prompt-file my_prompt.txt
# Check file status
python main.py --status
# Clean up processing folder
python main.py --cleanup
# Verbose logging
python main.py --process --verbose

Gemini:
- Model: gemini-1.5-flash
- Rate Limits: 15 RPM, 1M TPM (free tier)
- Best For: General research papers

Claude:
- Model: claude-3-sonnet-20240229
- Rate Limits: 5 RPM (free tier)
- Best For: Complex analysis, nuanced scoring

Groq:
- Model: llama3-8b-8192
- Speed: Very fast inference
- Best For: High-volume processing
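The automatic-fallback behavior across these providers can be sketched as a preference-ordered chain: try the first model, and on failure move to the next (the `callers` mapping and `ModelError` class are hypothetical stand-ins for the real provider clients):

```python
MODEL_ORDER = ["gemini", "claude", "groq"]  # preference order

class ModelError(Exception):
    """Raised when a provider call fails (rate limit, auth error, timeout)."""

def score_with_fallback(paper_text, callers):
    """Try each provider in order and return (provider_name, result) from the
    first that succeeds. `callers` maps a provider name to a scoring function —
    hypothetical stand-ins for the real model handlers in src/models/."""
    errors = {}
    for name in MODEL_ORDER:
        try:
            return name, callers[name](paper_text)
        except ModelError as exc:
            errors[name] = str(exc)  # record the failure and try the next one
    raise RuntimeError(f"All providers failed: {errors}")
```

Only after every provider has failed does the paper land in `failed/`, which keeps a single rate-limited key from halting a run.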
Generated CSV includes:
- Paper Metadata: Title, authors, DOI, journal, year
- Citation Data: Citation count, journal impact
- Scoring Results: Individual dimension scores and justifications
- Processing Info: Date, model used, processing status
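Appending one row per scored paper can be sketched with `csv.DictWriter`; the column names below are illustrative, chosen to mirror the four groups above, not the tool's exact header:

```python
import csv
import os

# Illustrative columns mirroring the groups above (not the tool's exact header)
FIELDNAMES = [
    "title", "authors", "doi", "journal", "year",        # paper metadata
    "citation_count", "journal_impact",                  # citation data
    "overall_score", "justification",                    # scoring results
    "processed_at", "model", "status",                   # processing info
]

def append_result(path: str, row: dict) -> None:
    """Append one scored paper to the CSV, writing the header on first use."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

Appending row by row (rather than writing the whole file at the end) is what lets an interrupted run keep every result completed so far.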
No API keys configured:
- Check that your .env file exists and contains valid keys
- Verify your API keys have sufficient quota
PDF extraction fails:
- Ensure PDFs are text-based (not scanned images)
- Check file permissions and size limits
Processing stuck:
- Use python main.py --cleanup to reset
- Check logs/error.log for detailed errors
Citation lookup fails:
- External APIs may have rate limits
- Some papers may not be indexed in citation databases
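When a lookup fails transiently, retrying with exponential backoff usually recovers it; a minimal sketch (the `fetch` callable and its use of `RuntimeError` as a rate-limit signal are hypothetical stand-ins for the real citation client):

```python
import time

def with_backoff(fetch, retries=3, base_delay=1.0):
    """Retry a citation lookup with exponential backoff on rate-limit errors.

    `fetch` is a zero-argument callable standing in for the real API call;
    here RuntimeError plays the role of an HTTP 429 from the citation service.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of retries: let the caller mark the lookup failed
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
```

A paper that still fails after the last retry simply gets an empty citation count rather than blocking the scoring run.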
- Start Small: Test with 2-3 papers first
- Batch Size: Reduce if hitting rate limits
- Worker Threads: Increase for faster processing (watch API limits)
- Model Selection: Use Groq for speed, Claude for quality
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Documentation: Project Wiki
- Email: your-email@example.com
- OCR support for scanned PDFs
- More citation databases (PubMed, arXiv)
- Batch export formats (Excel, JSON)
- API endpoint for integration
- Docker containerization
- Cloud deployment options
- Streamlit for the amazing web framework
- Semantic Scholar for citation data
- OpenAlex for paper metadata
- AI model providers: Google, Anthropic, Groq