# Monthly Recommender Agent

This Python script acts as a personalized literature recommender, designed to keep researchers updated with new and relevant papers from Semantic Scholar based on their past work. It leverages semantic embeddings and large language models (LLMs) to identify, rank, and summarize papers, delivering a concise HTML digest.
## Features

- **Personalized Paper Discovery**: Automatically fetches recent papers related to your research interests by analyzing your past publications on Semantic Scholar.
- **Semantic Ranking**: Ranks candidate papers by the cosine similarity of their abstracts to a "centroid" embedding of your previous work, ensuring high relevance.
- **AI-Powered Summarization**: Uses an LLM (via LiteLLM) to generate a concise summary, key findings, and a relevance rating for each top-ranked paper.
- **Elegant HTML Digest**: Produces a beautifully styled HTML file (`monthly_digest.html`) for easy reading and sharing.
- **Embedding Caching**: Employs ChromaDB to cache paper embeddings, speeding up subsequent runs and reducing API calls.
- **Robust API Handling**: Retries failed API calls and handles Semantic Scholar rate limits with fallbacks (a minimal sketch of such a retry wrapper follows this list).
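For illustration, a retry wrapper of this kind can look roughly like the following; the `get_with_retries` name and its parameters are illustrative, not the script's actual API:

```python
import time
import requests

def get_with_retries(url, params=None, max_retries=3, backoff=2.0):
    """GET a JSON endpoint, retrying with increasing backoff and
    waiting extra on HTTP 429 rate-limit responses."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code == 429:  # rate limited: wait, then retry
                time.sleep(backoff * (attempt + 1))
                continue
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
    raise RuntimeError("Exhausted retries for " + url)
```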
## How It Works

- **Fetch Seed Papers**: The script first identifies a set of your most recent papers on Semantic Scholar using your `S2_AUTHOR_ID`.
- **Discover Candidate Papers**: For each seed paper, it queries Semantic Scholar's recommendations API (falling back to the citations API) to find related papers published within a specified recency window.
- **Build Author Centroid**: It computes a "centroid" vector in embedding space: the average of the embeddings of your past paper abstracts, representing your core research interests.
- **Rank Candidates**: Each candidate paper's abstract is embedded, and its cosine similarity to your author centroid is computed. Papers are then ranked by this similarity score (see the sketch after this list).
- **Summarize Top Papers**: The top-ranked papers are passed to an LLM (e.g., `openai/gpt5mini` via LiteLLM), which generates a concise summary, extracts up to 5 key findings, and assigns a relevance rating (0-5).
- **Generate HTML Digest**: Finally, the summarized papers are compiled into a single, well-formatted HTML file, `monthly_digest.html`, ready for review.
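For concreteness, here is a minimal sketch of the seed-fetching and centroid-ranking steps. It assumes the `HuggingFaceEmbeddings` wrapper from `langchain-community` (which requires `sentence-transformers` to be installed); the function names `fetch_author_abstracts` and `rank_by_centroid` are illustrative, not the script's actual API:

```python
import numpy as np
import requests
from langchain_community.embeddings import HuggingFaceEmbeddings

SEMANTIC_URL = "https://api.semanticscholar.org/graph/v1"

def fetch_author_abstracts(author_id, limit=10):
    """Fetch papers (with abstracts) for an author from the
    Semantic Scholar Graph API."""
    url = f"{SEMANTIC_URL}/author/{author_id}/papers"
    resp = requests.get(url, params={"fields": "title,abstract", "limit": limit}, timeout=30)
    resp.raise_for_status()
    return [p["abstract"] for p in resp.json().get("data", []) if p.get("abstract")]

def rank_by_centroid(seed_abstracts, candidate_abstracts):
    """Average the seed embeddings into a centroid, then rank candidates
    by cosine similarity to it (highest first)."""
    embedder = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")
    seeds = np.array(embedder.embed_documents(seed_abstracts))
    centroid = seeds.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    cands = np.array(embedder.embed_documents(candidate_abstracts))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    sims = cands @ centroid  # cosine similarity of each candidate to the centroid
    order = np.argsort(sims)[::-1]
    return [(candidate_abstracts[i], float(sims[i])) for i in order]
```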
## Prerequisites

- Python 3.8+
- `pip` package manager
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/your-username/monthly-recommender-agent.git
  cd monthly-recommender-agent
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  If you don't have a `requirements.txt`, create one with the following content and then run the command above:

  ```
  requests
  numpy
  tqdm
  openai
  chromadb
  python-dateutil
  litellm
  langchain-community
  ```
## Configuration

Set the following environment variables. You can do this by creating a `.env` file in the project root (an example follows this list) or by setting them directly in your shell.

- `S2_AUTHOR_ID`: Your Semantic Scholar Author ID. You can find it by opening your profile on Semantic Scholar and extracting the ID from the URL (e.g., `https://www.semanticscholar.org/author/Your-Name/YOUR_AUTHOR_ID`).
- `OPENAI_API_KEY`: Your OpenAI API key, used by LiteLLM to access the LLM for summarization.
- `S2_API_KEY` (optional): Your Semantic Scholar API key. Not strictly required for basic usage, but providing one raises API rate limits and improves reliability for larger fetches.
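For example, a `.env` file might look like this (all values are placeholders):

```
S2_AUTHOR_ID=1234567
OPENAI_API_KEY=sk-your-openai-key
S2_API_KEY=your-semantic-scholar-key
```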
## Usage

Simply execute the Python script:

```bash
python monthly_recommender_agent.py
```

## Customization

The script's behavior is controlled by constants defined near the top of the file (a sketch of how `LLM_MODEL` is used for summarization follows the block):

```python
AUTHOR_ID = os.environ["S2_AUTHOR_ID"]  # Semantic Scholar id
NUM_CANDIDATES = 10   # pull this many candidate papers
NUM_FINAL = 10        # keep this many for the final digest
NUM_SEED_PAPERS = 10  # number of your recent papers to use as seeds for finding related work
RECENCY_MONTHS = 1    # only consider candidate papers from the last N months
EMB_MODEL = "all-mpnet-base-v2"  # embedding model used by HuggingFaceEmbeddings
LLM_MODEL = "openai/gpt5mini"    # LLM model used via LiteLLM for summarization
SEMANTIC_URL = "https://api.semanticscholar.org/graph/v1"  # Semantic Scholar API base URL
CHROMA_DB_PATH = "./chroma_db"   # directory for persistent ChromaDB storage
```
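As an illustration of how `LLM_MODEL` is used, a summarization call via LiteLLM's OpenAI-compatible `completion` API can look roughly like this (the prompt wording and the `summarize_paper` helper are illustrative, not the script's exact implementation):

```python
from litellm import completion

LLM_MODEL = "openai/gpt5mini"

def summarize_paper(title, abstract):
    """Ask the LLM for a short summary, key findings, and a 0-5
    relevance rating for a single paper."""
    prompt = (
        f"Paper title: {title}\n\nAbstract: {abstract}\n\n"
        "Write a concise summary, list up to 5 key findings, "
        "and rate the paper's relevance from 0 to 5."
    )
    response = completion(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```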
Each constant in more detail:

- `NUM_CANDIDATES`: The maximum number of candidate papers to fetch per seed paper.
- `NUM_FINAL`: The number of top-ranked papers to include in the final digest.
- `NUM_SEED_PAPERS`: The number of your own recent papers used to establish your research centroid and find related work.
- `RECENCY_MONTHS`: Only papers published within this many months of the current date are considered as candidates.
- `EMB_MODEL`: The HuggingFace embedding model to use (`all-mpnet-base-v2` is a good default).
- `LLM_MODEL`: The model identifier passed to LiteLLM for summarization (e.g., `openai/gpt5mini`, `openai/gpt-4-turbo`, `ollama/llama3`).
- `CHROMA_DB_PATH`: The local directory where ChromaDB stores cached embeddings (a minimal caching sketch follows this list).
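For reference, embedding caching with ChromaDB's persistent client can look roughly like this (the collection name and the `get_embedding` helper are assumptions for illustration):

```python
import chromadb
from langchain_community.embeddings import HuggingFaceEmbeddings

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("paper_embeddings")
embedder = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")

def get_embedding(paper_id, abstract):
    """Return a cached embedding if present, otherwise compute and cache it."""
    hit = collection.get(ids=[paper_id], include=["embeddings"])
    if hit["ids"]:  # already cached: reuse the stored vector
        return hit["embeddings"][0]
    vec = embedder.embed_query(abstract)
    collection.add(ids=[paper_id], embeddings=[vec], documents=[abstract])
    return vec
```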
## Output

The script generates a file named `monthly_digest.html` in the directory it is run from. Open this file in any web browser to view your personalized literature digest.
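The digest itself is plain HTML, so generating it amounts to assembling per-paper fragments and writing them to disk. A rough sketch follows; the paper dictionary keys and markup here are assumptions, not the script's actual structure:

```python
import html

def write_digest(papers, path="monthly_digest.html"):
    """Render summarized papers into a single HTML file."""
    items = "\n".join(
        f"<article><h2>{html.escape(p['title'])}</h2>"
        f"<p><b>Relevance:</b> {p['rating']}/5</p>"
        f"<p>{html.escape(p['summary'])}</p></article>"
        for p in papers
    )
    doc = f"<!DOCTYPE html>\n<html><body><h1>Monthly Digest</h1>\n{items}\n</body></html>"
    with open(path, "w", encoding="utf-8") as f:
        f.write(doc)
```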
## Contributing

Contributions are welcome! Feel free to open issues or pull requests for bug fixes, new features, or improvements.
## License

This project is open source and available under the MIT License.
## Acknowledgments

- Semantic Scholar: For providing the academic paper data and APIs.
- OpenAI: For powerful language models used in summarization.
- LiteLLM: For simplifying LLM API calls across various providers.
- HuggingFace: For providing robust embedding models.
- ChromaDB: For efficient and persistent vector database capabilities.