Research papers are often long and complex, making it difficult for researchers, students, and professionals to extract key insights efficiently. This tool automates the summarization process using Retrieval-Augmented Generation (RAG) and Generative AI models, providing quick and accurate summaries along with essential keyword extraction.
✅ Users can upload a PDF or provide a URL to a research paper.
✅ For PDFs: The system extracts text using PyPDF2.
✅ For URLs: The system scrapes text from web pages using BeautifulSoup (a minimal extraction sketch follows this list).
✅ Extracted text is sent to the FastAPI backend.
✅ Keywords are extracted using TF-IDF.
✅ FAISS/ChromaDB stores embeddings for efficient retrieval.
✅ A summarization model (planned: Llama-2 / BART / Pegasus) condenses the text.
✅ Summarized text and extracted keywords are displayed in the frontend.
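A minimal sketch of the two extraction paths, assuming PyPDF2 3.x and a plain-HTML article page; the function names are illustrative, not the project's actual API:

```python
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfReader


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Concatenate the text of every page in an uploaded PDF."""
    reader = PdfReader(BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_url_text(url: str) -> str:
    """Fetch a web page and keep only its visible paragraph text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
```

Note that scanned PDFs yield empty strings from `extract_text()`, so an OCR fallback would be needed for image-only papers.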
- Frontend: Streamlit
- Backend: FastAPI (an endpoint sketch follows this list)
- ML Models (Planned): Llama-2 / BART / Pegasus for text summarization
- Database: FAISS / ChromaDB (Vector-based retrieval)
- Libraries: PyPDF2, Hugging Face Transformers, Scikit-learn, BeautifulSoup, gdown
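A minimal sketch of how the backend could expose the pipeline over HTTP; the `/summarize` route, payload shape, and stub bodies are assumptions, not the project's final API:

```python
# backend/main.py — illustrative FastAPI endpoint
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PaperRequest(BaseModel):
    text: str  # raw text extracted from the PDF or URL


@app.post("/summarize")
def summarize(req: PaperRequest) -> dict:
    # Placeholders: the real backend would call keyword_extractor.py
    # and summarizer.py here (see the sketches further below).
    keywords: list[str] = []
    summary = req.text[:500]  # stub until a summarization model is wired in
    return {"summary": summary, "keywords": keywords}
```

The Streamlit frontend would then call it with something like `requests.post("http://localhost:8000/summarize", json={"text": text})` and render the returned fields.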
✅ Keyword extraction using TF-IDF (see the sketch after this feature list)
✅ Supports direct PDF uploads & Web links
✅ FAISS Vector Search for fast retrieval of stored summaries
✅ Text Extraction & Processing (FastAPI, PyPDF2, BeautifulSoup, FAISS/ChromaDB)
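A possible implementation of the TF-IDF keyword extractor with scikit-learn; the function name and parameter values are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def extract_keywords(text: str, top_k: int = 10) -> list[str]:
    """Score terms in a document with TF-IDF and return the top_k."""
    vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
    scores = vectorizer.fit_transform([text]).toarray()[0]
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
    return [term for term, score in ranked[:top_k] if score > 0]
```

Fitting on a single document reduces TF-IDF to normalized term frequency (the IDF factor is constant), so fitting the vectorizer over the corpus of stored papers would give more discriminative keywords.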
- Keyword Extraction: roughly O(n log n) — a linear TF-IDF scoring pass over the tokens plus a sort of terms by score
- FAISS Query Retrieval: sub-linear per query with an IVF/HNSW index; a flat index scans all n stored vectors (see the sketch below)
- API-Based Processing: stateless endpoints make it easy to scale out over large research datasets
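A sketch of FAISS storage and retrieval with stand-in random embeddings; the vector dimension and index parameters are assumptions:

```python
import faiss
import numpy as np

d = 384  # embedding dimension, e.g. from a sentence-transformers model
embeddings = np.random.rand(1000, d).astype("float32")  # stand-in data

# Exact search: simple, but every query scans all n vectors (O(n)).
flat = faiss.IndexFlatL2(d)
flat.add(embeddings)

# Inverted-file index: partitions vectors into nlist clusters and scans
# only nprobe of them per query, giving sub-linear retrieval.
nlist = 32
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(embeddings)  # k-means pass to learn the cluster centroids
ivf.add(embeddings)
ivf.nprobe = 4

query = np.random.rand(1, d).astype("float32")
distances, ids = ivf.search(query, 5)  # top-5 nearest stored vectors
```

The IVF index trades a small amount of recall for the query speed claimed above; the flat index is a reasonable default while the corpus is small.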
🔹 Fix Frontend-Backend Connection: Debug Streamlit & FastAPI communication issues
🔹 Enhance Frontend UI: Improve responsiveness, add visual elements for keyword highlights
🔹 Implement Summarization Models: Integrate Llama-2, BART, or Pegasus for text summarization (a BART-based starting point is sketched after this roadmap)
🔹 Implement Query-based Retrieval: Allow users to search for related papers in the FAISS database
🔹 Support for More File Formats: Extend support beyond PDFs (e.g., DOCX, TXT)
🔹 Optimize Batch Processing: Implement parallelization for faster summarization of large datasets
🔹 Deploy the Application: Deploy the project on Hugging Face Spaces or Render for public access
🔹 Make Local Processing Feasible: Optimize LLM inference locally while allowing scalable API-based processing
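A possible starting point for the planned summarization step, using the public `facebook/bart-large-cnn` checkpoint from Hugging Face Transformers; the character-based chunking is a simplification, not the project's final design:

```python
from transformers import pipeline

# Downloads the checkpoint on first use; runs on CPU by default.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def summarize_text(text: str) -> str:
    """Summarize a long paper chunk by chunk, then join the partial summaries."""
    chunk_size = 3000  # characters; keeps each chunk under BART's 1024-token limit
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    parts = summarizer(chunks, max_length=150, min_length=40, do_sample=False)
    return " ".join(part["summary_text"] for part in parts)
```

Splitting on characters can cut sentences in half; a production version would chunk on sentence boundaries, and Pegasus or a hosted Llama-2 endpoint could be swapped in behind the same function.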
🔹 Makes research more accessible through RAG-based summarization
🔹 Saves hours of reading time for researchers & students
🔹 Useful for academics, journalists, legal analysts & enterprise research platforms
🔹 Scalable to millions of research papers with optimized retrieval and batch processing
🔹 Improves productivity by enabling faster literature reviews & knowledge discovery
```
📂 Research-Paper-Summarizer
│── 📂 backend
│   │── summarizer.py          # Handles text summarization
│   │── keyword_extractor.py   # Extracts keywords from the paper
│   │── fetch_paper.py         # Fetches research paper from URL/Drive
│   │── main.py                # FastAPI backend
│
│── 📂 frontend
│   │── app.py                 # Streamlit UI (main app)
│   │── ui_components.py       # UI components (sidebar, upload, results)
│
│── 📂 models
│   │── faiss_index            # FAISS Vector Database
│   │── model.pth              # Trained ML Model (optional)
│
│── 📂 data
│   │── example_papers/        # Store local PDF research papers
│
│── requirements.txt           # Dependencies
│── README.md                  # Documentation
```
### **🚀 Conclusion:**
🔹 Saves hours of reading time for researchers & students.
🔹 Useful for academic institutions, journalists, and legal analysts.
🔹 Scales to millions of papers for enterprise research platforms.
### **🌍 Final Note:**
**Work is in progress... 🚧**