kadiryonak/DeepArticle

🎓 DeepArticle: Multi-Agent Academic Paper Analysis

An AI multi-agent system that discovers, ranks, and summarizes academic papers. It is bilingual (TR + EN) and includes an AI study plan and real supplementary resources.


✨ What it does

  • 🔬 Deterministic query analysis: an LLM extracts the topic's key concepts and a focused, on-topic query set (temperature 0, greedy top_p/top_k → same topic, same queries, no drift).
  • 📚 Multi-source search: ArXiv, Semantic Scholar, OpenAlex, CORE, CrossRef, DOAJ, and DBLP in parallel (OpenAIRE/PubMed optional), with cross-database dedup ("found in 2 databases" + both links).
  • 🌍 Bilingual (TR + EN): queries run in both languages; the 🇹🇷/🇬🇧 toggle only translates the UI, never the results.
  • 🎯 Language-agnostic LLM reranking: Haiku re-scores the top candidates by true topical relevance and drops off-topic papers, so a Turkish query surfaces the best work in any language.
  • 📊 Smart ranking: citations, relevance, venue quality, recency, influence.
  • 🧭 AI study plan: reading path (Foundational → Core → Advanced) + result groups (most-cited / newest / top venues / open access).
  • 🌐 Real resources only: GitHub repos (star-filtered), Medium articles, and YouTube videos pulled from live APIs, shown only if they actually exist, never AI-generated.
  • 🖥️ Web UI: FastAPI + React SPA with live agent-progress streaming (SSE). No build step.
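
The cross-database dedup mentioned in the list above can be pictured with a small sketch. This is not the project's actual code: the record fields (`title`, `source`, `url`) and both function names are assumptions made for illustration, and matching on a normalized title is just one plausible strategy.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Case-fold and strip punctuation so near-identical titles compare equal."""
    folded = unicodedata.normalize("NFKC", title).casefold()
    return re.sub(r"\W+", " ", folded).strip()


def merge_across_sources(records):
    """Merge records that share a normalized title, keeping every source and link.

    `records` is a list of dicts with hypothetical keys: title, source, url.
    """
    merged = {}
    for rec in records:
        key = normalize_title(rec["title"])
        entry = merged.setdefault(
            key, {"title": rec["title"], "sources": [], "links": []}
        )
        if rec["source"] not in entry["sources"]:
            entry["sources"].append(rec["source"])
            entry["links"].append(rec["url"])
    return list(merged.values())


if __name__ == "__main__":
    # Illustrative records only; the URLs are placeholders.
    hits = [
        {"title": "Attention Is All You Need", "source": "arxiv", "url": "https://example.org/a"},
        {"title": "Attention is all you need.", "source": "semantic_scholar", "url": "https://example.org/b"},
    ]
    for paper in merge_across_sources(hits):
        print(paper["title"], "found in", len(paper["sources"]), "databases")
```

Keeping both `sources` and `links` on the merged entry is what would let a UI show the provenance badge together with a link per database.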

🖼️ Screenshots

Live search & topic analysis (Turkish query, EN UI toggle):

Live search and pipeline progress

Ranked paper cards with score badges, filters, and cross-database provenance ("2 veritabanı", Turkish for "2 databases"):

Ranked paper cards

Result groups & real related resources:

Result groups and resources

Full AI study plan (TR mode: topic analysis → reading path → groups → resources):

AI study plan


📦 Installation

git clone https://github.com/kadiryonak/DeepArticle.git
cd DeepArticle

python -m venv venv
.\venv\Scripts\activate          # Windows
# source venv/bin/activate       # macOS/Linux

pip install -e ".[api]"          # core + web UI
cp .env.example .env             # then add at least one LLM key

Minimum: one LLM API key. To run on Claude Haiku (recommended, since it avoids Groq's daily limits):

LLM_PROVIDER=anthropic
LLM_MODEL=claude-haiku-4-5

| Provider | Get a key | Free tier |
|---|---|---|
| Groq | console.groq.com | ✅ |
| Anthropic | console.anthropic.com | ❌ |
| OpenAI | platform.openai.com | ❌ |
| Google AI | makersuite.google.com | ✅ |
| DeepSeek | platform.deepseek.com | ❌ (very cheap) |

💻 Usage

# Web UI (recommended): open http://localhost:8000
uvicorn api.server:app --reload

# CLI
python main.py "unit test generation using large language models"
python main.py --interactive

🐳 Docker (any machine, no Python setup)

cp .env.example .env             # add e.g. GROQ_API_KEY
docker compose up --build        # then open http://localhost:8000

The response cache is persisted in a named volume so repeat searches stay fast.


🏗️ Architecture

A 10-stage LangGraph pipeline:

QUERY → 1.Orchestrator → 2.Query Analyzer (deterministic, EN+TR)
      → 3.Search (8+ sources in parallel, cross-DB dedup + provenance)
      → 4.Metadata (citations, SCImago Q-quartile, CrossRef)
      → 5.Analysis + 🎯 language-agnostic LLM rerank (drops off-topic)
      → 6.Summarizer → 7.Prioritizer → 8.Recommender (study plan + groups)
      → 9.Resources (real GitHub / Medium / YouTube) → 10.Output (JSON/Markdown)

Scoring

| Factor | Weight |
|---|---|
| Citations (log-scaled) | 25% |
| Relevance (LLM-reranked) | 25% |
| Venue quality | 20% |
| Recency | 15% |
| Influence | 15% |
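
As a rough illustration of how these weights could combine, the sketch below computes a weighted sum over factor values normalized to the range 0 to 1. Only the weights come from the table; the function names, the normalization, and the `cap` used for log-scaling citations are assumptions, and the project's real formula may differ.

```python
import math

# Weights taken from the scoring table; they sum to 1.0.
WEIGHTS = {
    "citations": 0.25,
    "relevance": 0.25,
    "venue": 0.20,
    "recency": 0.15,
    "influence": 0.15,
}


def citation_score(citations: int, cap: int = 1000) -> float:
    """Log-scale a citation count into [0, 1].

    `cap` is an assumed saturation point: `cap` or more citations score 1.0.
    """
    return min(1.0, math.log1p(max(citations, 0)) / math.log1p(cap))


def combined_score(factors: dict) -> float:
    """Weighted sum of factor values (each in [0, 1]), reported on a 0-100 scale."""
    return 100 * sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Illustrative factor values, not measured data.
    paper = {
        "citations": citation_score(120),
        "relevance": 0.95,
        "venue": 0.75,
        "recency": 0.80,
        "influence": 0.40,
    }
    print(round(combined_score(paper), 1))
```

Log-scaling keeps a handful of very highly cited papers from dominating the ranking: under this assumed normalization, going from 10 to 100 citations adds about as much as going from 100 to 1,000.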

Keyword relevance is only the first pass and is biased toward the query's language. An LLM then re-scores the top candidates for language-agnostic relevance and drops off-topic papers. This step is configured through ENABLE_RERANK, RELEVANCE_MIN, and RERANK_PROVIDER/RERANK_MODEL.
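
The rerank-and-filter step described above might look roughly like the following. This is a sketch under assumptions: `rerank` and its `llm_score` callback are hypothetical names, the LLM call is replaced by a lookup table, and the keyword scores and the off-topic paper are made up for illustration (the two real titles and their scores of 95 and 85 come from the evaluation section).

```python
def rerank(papers, llm_score, relevance_min=40):
    """Replace keyword relevance with an LLM score, drop low scorers, sort descending.

    `llm_score` is a hypothetical callable returning 0-100 for a paper;
    `relevance_min` mirrors the RELEVANCE_MIN setting.
    """
    rescored = [{**paper, "relevance": llm_score(paper)} for paper in papers]
    kept = [paper for paper in rescored if paper["relevance"] >= relevance_min]
    return sorted(kept, key=lambda paper: paper["relevance"], reverse=True)


# Stand-in for the LLM judge: a fixed lookup table of illustrative scores.
fake_scores = {
    "Türkçe soru cevaplama için büyük dil modelleri": 95,
    "Multilingual Benchmarking of LLMs": 85,
    "Hypothetical off-topic paper": 10,
}

# Keyword relevance values here are invented to show the language bias.
candidates = [
    {"title": "Hypothetical off-topic paper", "relevance": 100},
    {"title": "Multilingual Benchmarking of LLMs", "relevance": 60},
    {"title": "Türkçe soru cevaplama için büyük dil modelleri", "relevance": 100},
]

ranked = rerank(candidates, lambda paper: fake_scores[paper["title"]])

if __name__ == "__main__":
    for paper in ranked:
        print(paper["relevance"], paper["title"])
```

The function builds new dicts rather than mutating its input, so the original keyword scores remain available for comparison or debugging.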


📊 Evaluation results (DeepEval)

105 unit/integration tests pass (offline). A product-level benchmark runs the pipeline over 300 bilingual topics (150 EN / 150 TR) and scores it with DeepEval (LLM-as-judge).

python -m evals.benchmark --limit 10          # quick (query metrics)
python -m evals.benchmark --deep --limit 5    # + search + summary metrics
python -m evals.benchmark --safety --limit 5  # + safety metrics

Quality (higher is better):

| Metric | Bar | Mean | Pass |
|---|---|---|---|
| query_relevance (GEval) | ≥ 0.60 | 0.90 | 100% |
| bilingual_coverage | true | n/a | 100% |
| query_count | ≥ 10 | 30 | 100% |
| retrieval_count | ≥ 10 | 66–141 | 100% |
| dedup_integrity (no dup titles) | true | n/a | 100% |
| summary_faithfulness | ≥ 0.70 | 1.00 | 100% |
| summary_relevancy | ≥ 0.60 | ~0.50 | ~50% (to improve) |

Safety: six DeepEval dimensions, scored by each metric's own .success. On benign academic summaries all six pass: Bias 0.00 · Toxicity 0.00 · Misuse 0.00 · PIILeakage safe · NonAdvice safe · RoleViolation safe.

Reranking impact: a Turkish query for "large language models for question answering" drops 16 off-topic papers (autonomous-vehicle sentiment, digital diplomacy, …) and surfaces the best work in both languages, e.g. "Türkçe soru cevaplama için büyük dil modelleri" (rel 95) next to "Multilingual Benchmarking of LLMs" (rel 85). Without reranking, keyword relevance ranked unrelated Turkish papers at 100%.

The full 300-topic --deep/--safety run makes ~9–11 judge calls per topic (roughly 2,700–3,300 calls in total), which exceeds Groq's free daily limit (HTTP 429). Use Haiku (LLM_PROVIDER=anthropic) or run in --limit batches.

Agent-trace metrics, MCP & Confident AI (TaskCompletion, PlanQuality, ToolCorrectness, …) are on the roadmap; they require instrumenting agents with DeepEval @observe tracing.


🔧 Configuration (.env)

LLM_PROVIDER=anthropic               # Haiku avoids Groq's daily limits
LLM_MODEL=claude-haiku-4-5

MAX_SEARCH_QUERIES=12                 # fewer = more focused & on-topic
SOURCES=arxiv,semantic_scholar,openalex,openalex_thesis,core,crossref,doaj,dblp
BILINGUAL_SEARCH=1                    # EN + TR

ENABLE_RERANK=1                       # language-agnostic LLM rerank (run on Haiku)
RELEVANCE_MIN=40                      # drop papers scored below this (0-100)

# Real supplementary resources (omitted entirely if no key / nothing found)
GITHUB_TOKEN=                         # higher rate limit; GITHUB_MIN_STARS=100
YOUTUBE_API_KEY=                      # enable "YouTube Data API v3" in Google Cloud

Theses (PhD/Master's) come from openalex_thesis and core (both multilingual, incl. Turkish); yoktez (YÖK Ulusal Tez Merkezi) is best-effort and returns nothing when blocked.

Determinism: topic analysis and reranking run at temperature=0 with greedy top_p/top_k, so the same query reliably produces the same keywords, queries and ranking.
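
For readers wiring these settings into their own scripts, the sketch below shows one way the variables in this section could be parsed into typed values. It is an assumption-laden illustration, not the project's loader: `load_settings` and `as_bool` are hypothetical names, the fallback defaults simply mirror the example values shown above, and the function reads from a mapping (by default `os.environ`), so the .env file would need to be exported or loaded separately.

```python
import os


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings such as 1/true/yes/on."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> dict:
    """Parse the documented variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    sources = env.get("SOURCES", "arxiv")
    return {
        "llm_provider": env.get("LLM_PROVIDER", "anthropic"),
        "llm_model": env.get("LLM_MODEL", "claude-haiku-4-5"),
        "max_search_queries": int(env.get("MAX_SEARCH_QUERIES", "12")),
        # Comma-separated list; empty items are ignored.
        "sources": [s.strip() for s in sources.split(",") if s.strip()],
        "bilingual_search": as_bool(env.get("BILINGUAL_SEARCH", "1")),
        "enable_rerank": as_bool(env.get("ENABLE_RERANK", "1")),
        "relevance_min": int(env.get("RELEVANCE_MIN", "40")),
        # Empty keys become None so optional resources can be skipped entirely.
        "github_token": env.get("GITHUB_TOKEN") or None,
        "youtube_api_key": env.get("YOUTUBE_API_KEY") or None,
    }


if __name__ == "__main__":
    print(load_settings({"SOURCES": "arxiv,core,dblp", "RELEVANCE_MIN": "55"}))
```

Passing a plain dict instead of relying on the process environment makes the parser easy to exercise in offline tests.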


🧪 Tests

python -m pytest tests/ -v           # 105 offline tests
pytest src/evals/ -v -m eval         # LLM-as-judge evals (needs one key)

📄 License

MIT. Free to use for your research. Contributions welcome (see CONTRIBUTING.md).

Made with ❤️ using LangChain & LangGraph
