An AI multi-agent system that discovers, ranks and summarizes academic papers. It is bilingual (TR + EN) and adds an AI study plan and real supplementary resources.
- Deterministic query analysis: an LLM extracts the topic's key concepts and a focused, on-topic query set (temperature 0, greedy top_p/top_k, so the same topic yields the same queries with no drift).
- Multi-source search: arXiv, Semantic Scholar, OpenAlex, CORE, CrossRef, DOAJ, DBLP in parallel (OpenAIRE/PubMed optional), with cross-database dedup ("found in 2 databases" + both links).
- Bilingual (TR + EN): queries run in both languages; the TR/EN toggle only translates the UI, never the results.
- Language-agnostic LLM reranking: Haiku re-scores the top candidates by true topical relevance and drops off-topic papers, so a Turkish query surfaces the best work in any language.
- Smart ranking: citations, relevance, venue quality, recency, influence.
- AI study plan: reading path (Foundational → Core → Advanced) + result groups (most-cited / newest / top venues / open access).
- Real resources only: GitHub repos (star-filtered), Medium articles and YouTube videos pulled from live APIs, shown only if they actually exist, never AI-generated.
- Web UI: FastAPI + React SPA with live agent-progress streaming (SSE). No build step.
Live search & topic analysis (Turkish query, EN UI toggle):
Ranked paper cards with score badges, filters, and cross-database provenance ("2 veritabanı", i.e. "2 databases"):
Result groups & real related resources:
Full AI study plan (TR mode: topic analysis → reading path → groups → resources):
git clone https://github.com/kadiryonak/DeepArticle.git
cd DeepArticle
python -m venv venv
.\venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
pip install -e ".[api]" # core + web UI
cp .env.example .env # then add at least one LLM key

Minimum: one LLM API key. To run on Claude Haiku (recommended, since it avoids Groq's daily limits):
LLM_PROVIDER=anthropic
LLM_MODEL=claude-haiku-4-5

| Provider | Get a key | Free tier |
|---|---|---|
| Groq | console.groq.com | β |
| Anthropic | console.anthropic.com | β |
| OpenAI | platform.openai.com | β |
| Google AI | makersuite.google.com | β |
| DeepSeek | platform.deepseek.com | β (very cheap) |
# Web UI (recommended): open http://localhost:8000
uvicorn api.server:app --reload
# CLI
python main.py "unit test generation using large language models"
python main.py --interactive

cp .env.example .env # add e.g. GROQ_API_KEY
docker compose up --build # then open http://localhost:8000

The response cache is persisted in a named volume so repeat searches stay fast.
A 10-stage LangGraph pipeline:
QUERY → 1.Orchestrator → 2.Query Analyzer (deterministic, EN+TR)
→ 3.Search (8+ sources in parallel, cross-DB dedup + provenance)
→ 4.Metadata (citations, SCImago Q-quartile, CrossRef)
→ 5.Analysis + language-agnostic LLM rerank (drops off-topic)
→ 6.Summarizer → 7.Prioritizer → 8.Recommender (study plan + groups)
→ 9.Resources (real GitHub / Medium / YouTube) → 10.Output (JSON/Markdown)
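The cross-database dedup in stage 3 can be pictured with a minimal sketch like the one below. It is an illustration only, not the project's code: the `Paper` class, `normalize_title`, and `merge_duplicates` are hypothetical names, and matching on a normalized title is an assumed strategy (the real pipeline may also compare DOIs or other identifiers).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Paper:
    title: str
    source: str  # e.g. "arxiv", "openalex"
    url: str
    # Every (source, url) pair this paper was found under; filled by merge_duplicates.
    provenance: list = field(default_factory=list)


def normalize_title(title: str) -> str:
    """Lowercase and collapse each run of non-word characters to one space."""
    return re.sub(r"\W+", " ", title.lower()).strip()


def merge_duplicates(papers: list) -> list:
    """Keep the first record per normalized title and collect all its sources."""
    merged = {}
    for paper in papers:
        key = normalize_title(paper.title)
        kept = merged.setdefault(key, paper)  # first occurrence wins
        link = (paper.source, paper.url)
        if link not in kept.provenance:
            kept.provenance.append(link)
    return list(merged.values())
```

A record whose `provenance` holds two entries corresponds to the "found in 2 databases" badge with both links.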
| Factor | Weight |
|---|---|
| Citations (log-scaled) | 25% |
| Relevance (LLM-reranked) | 25% |
| Venue quality | 20% |
| Recency | 15% |
| Influence | 15% |
Keyword relevance is only the first pass and is biased toward the query's language. An LLM then re-scores the top candidates by language-agnostic relevance and drops off-topic papers (`ENABLE_RERANK`, `RELEVANCE_MIN`, `RERANK_PROVIDER`/`RERANK_MODEL`).
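As a rough sketch of how the ranking weights and the relevance cutoff could combine: the weights and the `RELEVANCE_MIN` value of 40 come from this README, while the function names, the normalization of each factor to 0-1, and the citation cap of 1000 are assumptions made for illustration.

```python
import math

# Weights from the ranking table (they sum to 1.0).
WEIGHTS = {
    "citations": 0.25,
    "relevance": 0.25,
    "venue": 0.20,
    "recency": 0.15,
    "influence": 0.15,
}
RELEVANCE_MIN = 40  # documented example value, on a 0-100 scale


def citation_score(citations: int, cap: int = 1000) -> float:
    """Log-scale a raw citation count into [0, 1]; the cap is an assumption."""
    return min(math.log1p(max(citations, 0)) / math.log1p(cap), 1.0)


def final_score(factors: dict) -> float:
    """Weighted sum of factors that are each already normalized to [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


def rank(papers: list) -> list:
    """Drop papers below RELEVANCE_MIN, then sort the rest by final score."""
    kept = [p for p in papers if p["relevance"] >= RELEVANCE_MIN]
    for p in kept:
        p["score"] = final_score({
            "citations": citation_score(p["citations"]),
            "relevance": p["relevance"] / 100,  # LLM rerank score, 0-100
            "venue": p["venue"],
            "recency": p["recency"],
            "influence": p["influence"],
        })
    return sorted(kept, key=lambda p: p["score"], reverse=True)
```

Filtering before scoring means a highly cited but off-topic paper never reaches the final list, which is the behaviour the rerank step is described as providing.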
105 unit/integration tests pass (offline). A product-level benchmark runs the pipeline over 300 bilingual topics (150 EN / 150 TR) and scores it with DeepEval (LLM-as-judge).
python -m evals.benchmark --limit 10 # quick (query metrics)
python -m evals.benchmark --deep --limit 5 # + search + summary metrics
python -m evals.benchmark --safety --limit 5 # + safety metrics

Quality (higher is better):
| Metric | Bar | Mean | Pass |
|---|---|---|---|
| query_relevance (GEval) | ≥ 0.60 | 0.90 | 100% |
| bilingual_coverage | true | n/a | 100% |
| query_count | ≥ 10 | 30 | 100% |
| retrieval_count | ≥ 10 | 66–141 | 100% |
| dedup_integrity (no dup titles) | true | n/a | 100% |
| summary_faithfulness | ≥ 0.70 | 1.00 | 100% |
| summary_relevancy | ≥ 0.60 | ~0.50 | ~50% (to improve) |
Safety: six DeepEval dimensions, scored by each metric's own `.success`. On benign academic summaries all six pass: Bias 0.00 · Toxicity 0.00 · Misuse 0.00 · PIILeakage safe · NonAdvice safe · RoleViolation safe.
Reranking impact: a Turkish query for "large language models for question answering" drops 16 off-topic papers (autonomous-vehicle sentiment, digital diplomacy, …) and surfaces the best work in both languages, e.g. "Türkçe soru cevaplama için büyük dil modelleri" ("Large language models for Turkish question answering", rel 95) next to "Multilingual Benchmarking of LLMs" (rel 85). Without reranking, keyword relevance ranked unrelated Turkish papers at 100%.
The full 300-topic `--deep`/`--safety` run makes ~9–11 judge calls per topic, which exceeds Groq's free daily limit (HTTP 429). Use Haiku (`LLM_PROVIDER=anthropic`) or run in `--limit` batches.
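To see why the full run is heavy, the arithmetic can be spelled out in a few lines. The topic count and the 9–11 calls per topic come from the note above; the `daily_limit` argument is a hypothetical placeholder, not Groq's actual quota, and the helper names are illustrative.

```python
import math

TOPICS = 300
CALLS_PER_TOPIC = (9, 11)  # approximate range quoted above


def total_judge_calls(topics: int = TOPICS) -> tuple:
    """Return the (low, high) estimate of judge calls for a full run."""
    low, high = CALLS_PER_TOPIC
    return topics * low, topics * high


def max_topics_per_batch(daily_limit: int) -> int:
    """Largest batch that stays under the limit even at 11 calls per topic."""
    batch = daily_limit // CALLS_PER_TOPIC[1]
    if batch < 1:
        raise ValueError("daily_limit is too small for even one topic")
    return batch


def batches_needed(daily_limit: int, topics: int = TOPICS) -> int:
    """Number of batches needed to cover all topics under the given limit."""
    return math.ceil(topics / max_topics_per_batch(daily_limit))
```

A full run therefore needs roughly 2,700 to 3,300 judge calls. With a placeholder limit of 1,000 calls per day, `max_topics_per_batch(1000)` gives 90 topics, so the 300 topics would take 4 batches.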
Agent-trace metrics, MCP & Confident AI (`TaskCompletion`, `PlanQuality`, `ToolCorrectness`, …) are on the roadmap; they require instrumenting agents with DeepEval `@observe` tracing.
LLM_PROVIDER=anthropic # Haiku avoids Groq's daily limits
LLM_MODEL=claude-haiku-4-5
MAX_SEARCH_QUERIES=12 # fewer = more focused & on-topic
SOURCES=arxiv,semantic_scholar,openalex,openalex_thesis,core,crossref,doaj,dblp
BILINGUAL_SEARCH=1 # EN + TR
ENABLE_RERANK=1 # language-agnostic LLM rerank (run on Haiku)
RELEVANCE_MIN=40 # drop papers scored below this (0-100)
# Real supplementary resources (omitted entirely if no key / nothing found)
GITHUB_TOKEN= # higher rate limit; GITHUB_MIN_STARS=100
YOUTUBE_API_KEY= # enable "YouTube Data API v3" in Google Cloud

Theses (PhD/Master's) come from openalex_thesis and core (both multilingual, incl. Turkish); yoktez (YÖK Ulusal Tez Merkezi) is best-effort and returns nothing when blocked.
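For readers wiring these variables into their own scripts, here is a minimal sketch of parsing them from the environment. The variable names are the documented ones; the `Settings` class, `load_settings`, and the fallback values (copied from the example above, except the shortened `SOURCES` fallback) are assumptions and may not match the project's real loader or defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    provider: str
    model: str
    max_search_queries: int
    sources: tuple
    bilingual: bool
    enable_rerank: bool
    relevance_min: int


def _flag(env: Mapping[str, str], name: str, default: str = "1") -> bool:
    """Treat the string "1" as enabled and anything else as disabled."""
    return env.get(name, default).strip() == "1"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables; fallbacks here are illustrative only."""
    env = os.environ if env is None else env
    raw_sources = env.get("SOURCES", "arxiv,semantic_scholar,openalex")
    return Settings(
        provider=env.get("LLM_PROVIDER", "anthropic"),
        model=env.get("LLM_MODEL", "claude-haiku-4-5"),
        max_search_queries=int(env.get("MAX_SEARCH_QUERIES", "12")),
        # Split the comma-separated list and ignore empty entries.
        sources=tuple(s.strip() for s in raw_sources.split(",") if s.strip()),
        bilingual=_flag(env, "BILINGUAL_SEARCH"),
        enable_rerank=_flag(env, "ENABLE_RERANK"),
        relevance_min=int(env.get("RELEVANCE_MIN", "40")),
    )
```

Passing a plain dict instead of `os.environ` keeps the parser easy to test, e.g. `load_settings({"RELEVANCE_MIN": "55"})`.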
Determinism: topic analysis and reranking run at `temperature=0` with greedy `top_p`/`top_k`, so the same query reliably produces the same keywords, queries and ranking.
python -m pytest tests/ -v # 105 offline tests
pytest src/evals/ -v -m eval # LLM-as-judge evals (needs one key)

MIT: free to use for your research. Contributions welcome (see CONTRIBUTING.md).
Made with ❤️ using LangChain & LangGraph



