Paper-Essence is an automated paper-digest workflow built on the Dify platform. This workflow can:
- 🕐 Fetch the latest papers from arXiv for specified research areas on a daily schedule
- 🤖 Use large language models to filter and select the most valuable papers
- 📄 Parse PDF papers with OCR to extract technical details
- 📧 Generate a structured daily digest and send it by email
GitHub repository: https://github.com/LiaoYFBH/PaperFlow — you can import prj/Paper-Essence-CN.yml or prj/Paper-Essence-EN.yml directly.
- Dify account: Register and log in to Dify
- Email account: An SMTP-capable email (this tutorial uses 163 Mail)
- LLM API: Configure either Baidu Wenxin or an OpenAI-compatible model
Install the following plugins from the Dify plugin marketplace:
| Plugin | Purpose |
|---|---|
| paddle-aistudio/ernie-paddle-aistudio | Baidu Wenxin LLM integration (Xinghe Community API) |
| langgenius/paddleocr | OCR for PDFs and images |
| wjdsg/163-smtp-send-mail | 163 SMTP email sending |
| langgenius/supabase | Database storage for pushed records |
We use a cloud database (Supabase) to record papers that have already been pushed to avoid duplicates.
In the SQL Editor, run the following SQL statement:
```sql
create table pushed_papers (
  arxiv_id text not null,
  pushed_at timestamp default now(),
  primary key (arxiv_id)
);
```

This table records pushed paper IDs to ensure no duplicates.
Record the following information:
- `NEXT_PUBLIC_SUPABASE_URL` → Supabase URL for the Dify plugin
- `NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY` → Supabase Key for the Dify plugin
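Downstream, deduplication against this table reduces to a set-membership check. A minimal standalone sketch (the dict shapes are assumptions about the Supabase row format, not the plugin's actual output):

```python
def filter_new_papers(candidates, pushed_rows):
    """Keep only papers whose arxiv_id is not already in pushed_papers.

    candidates: list of dicts with an "arxiv_id" key (assumed shape)
    pushed_rows: rows from Supabase, e.g. [{"arxiv_id": "2401.00001"}, ...]
    """
    pushed_ids = {row["arxiv_id"] for row in pushed_rows}
    return [p for p in candidates if p["arxiv_id"] not in pushed_ids]
```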
This tutorial uses WSL + Docker. You can refer to this article for WSL and Docker configuration.
First, clone the Dify repository. If you haven't configured Git, you can directly download the ZIP file from the repository page and extract it.
If you have Git configured, run the following commands in your terminal:
You need to have Git and Docker configured.

```bash
# Clone the Dify repository
git clone https://github.com/langgenius/dify.git

# Navigate to the Docker deployment directory
cd dify/docker

# Copy the environment configuration file
cp .env.example .env
```

First, open Docker Desktop, then start Dify in the terminal (this will automatically pull images and start all services):

```bash
docker compose up -d
```

Check the status:

```bash
docker compose ps
```

Access the application at http://localhost/.
After logging in:
The core flow of the workflow is shown below:
| Stage | Node | Function |
|---|---|---|
| Trigger | Schedule Trigger | Auto-start at specified time daily |
| Config | Config Node | Read environment variables |
| Translation | LLM Translation | Translate research topic to English |
| Search | Get Rows → Pre-process → HTTP → Post-process | Query pushed records, search arXiv |
| Review | LLM Review | Use LLM to select Top 3 papers |
| Iteration | Iteration Node | For each paper: unpack → record → OCR → analyze → assemble |
| Output | Template Transform → Email | Generate report and send email |
- Log in to Dify
- In the Studio, click "Create App" → choose "Workflow"
- Enter an application name
- Choose the Trigger type for the workflow
Click the Settings button (top-right):
Click "Add Environment Variable":
Key variables:
| Name | Type | Description | Example |
|---|---|---|---|
| table_name | string | Supabase table name | pushed_papers |
| SMTP_PORT | string | SMTP port | 465 |
| SMTP_SERVER | string | SMTP server | smtp.163.com |
| SMTP_PASSWORD | secret | SMTP authorization code | (your auth code) |
| SMTP_USER | secret | SMTP user/email | your_email@163.com |
| MY_RAW_TOPIC | string | Research topic | agent memory |
How to get email authorization code:
Node name: Schedule Trigger
Configuration:
- Trigger Frequency: Daily
- Trigger Time: 8:59 AM (or adjust as needed)
Node name: Config (Type: Code) — This node reads environment variables and outputs them for downstream nodes.
Input Variables:
- From environment variables: SMTP_PORT, SMTP_SERVER, SMTP_USER, SMTP_PASSWORD, MY_RAW_TOPIC, table_name
Output Variables:
- raw_topic: Research topic
- user_email: Recipient email
- fetch_count: Number of papers to fetch (default: 50)
- push_limit: Push limit (default: 3)
- days_lookback: Days to look back (default: 30)
- Plus SMTP configuration
Code:
```python
def main(
    SMTP_USER: str,
    MY_RAW_TOPIC: str,
    SMTP_PORT: str,
    SMTP_SERVER: str,
    SMTP_PASSWORD: str,
    table_name: str
) -> dict:
    # Re-expose environment variables for downstream nodes,
    # plus fixed defaults for fetch/push limits.
    return {
        "raw_topic": MY_RAW_TOPIC,
        "user_email": SMTP_USER,
        "smtp_port": SMTP_PORT,
        "smtp_server": SMTP_SERVER,
        "smtp_password": SMTP_PASSWORD,
        "fetch_count": 50,
        "push_limit": 3,
        "days_lookback": 30,
        "table_name": table_name
    }
```

Node name: Research Field LLM Translation (Type: LLM) — Converts the research topic into an optimized English boolean query for arXiv.
Model Configuration:
- Model: ernie-4.5-turbo-128k or ernie-5.0-thinking-preview
- Temperature: 0.7
Prompt Rules: Extract core concepts, translate terms (if necessary), construct boolean logic using AND/OR, wrap phrases in quotes, and output only the query string.
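As a sanity check on what this node should emit, the same transformation can be sketched in plain Python (the concept list and quoting rules here are illustrative assumptions, not the node's actual prompt logic):

```python
def build_boolean_query(concepts):
    # Wrap multi-word phrases in quotes and join core concepts with AND,
    # mirroring the prompt rules above (illustrative only).
    parts = [f'"{c}"' if " " in c else c for c in concepts]
    return " AND ".join(parts)
```

For example, `build_boolean_query(["agent memory", "LLM"])` yields `'"agent memory" AND LLM'`.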
Node name: Get Rows (Tool: Supabase) — Fetches existing pushed arXiv IDs to avoid duplicates.
Configuration:
- Table name: {{table_name}} (from Config node)
To improve stability and maintainability, the search function is split into "Pre-process" → "HTTP Request" → "Post-process".
Node name: Search Pre-process (Type: Code) — Builds the arXiv API request and prepares search parameters.
Input Variables:
- topic: Translated English search term
- days_lookback: Days to look back
- count: Number of papers to fetch
- supabase_output: Already pushed records (for deduplication)
Code Logic:
- Calculate cutoff date (cutoff_date)
- Parse Supabase returned pushed paper ID list
- Build boolean query string based on topic (supports AND/OR logic)
- Add arXiv category restrictions based on topic keywords (e.g., cs.CV, cs.CL)
- Extract search keywords for subsequent filtering
Output Variables:
- base_query: Constructed query string
- pushed_ids: List of already pushed IDs
- cutoff_str: Cutoff date string
- search_keywords: List of search keywords
- fetch_limit: API fetch limit
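A condensed sketch of the date and deduplication handling in this step (variable names follow the list above; the exact Supabase payload shape is an assumption):

```python
import json
from datetime import datetime, timedelta

def preprocess(days_lookback, supabase_output):
    # Cutoff date, formatted as YYYYMMDD (the form arXiv date ranges use)
    cutoff = datetime.utcnow() - timedelta(days=days_lookback)
    cutoff_str = cutoff.strftime("%Y%m%d")

    # Supabase output may arrive as a JSON string; tolerate both forms
    rows = json.loads(supabase_output) if isinstance(supabase_output, str) else supabase_output
    pushed_ids = [row["arxiv_id"] for row in rows]

    return {"cutoff_str": cutoff_str, "pushed_ids": pushed_ids}
```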
Node name: HTTP Request (Type: http-request) — Calls arXiv API to get raw XML data.
Configuration:
- API URL: http://export.arxiv.org/api/query
- Method: GET
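The arXiv API takes its query entirely in URL parameters (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder` are documented arXiv API parameters). A sketch of the URL the HTTP node effectively issues, with illustrative values:

```python
from urllib.parse import urlencode

def arxiv_url(base_query, fetch_limit):
    params = {
        "search_query": base_query,
        "start": 0,
        "max_results": fetch_limit,
        "sortBy": "submittedDate",   # newest submissions first
        "sortOrder": "descending",
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)
```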
Node name: Search Post-process (Type: Code) — Parses XML response and filters papers.
Input Variables:
- http_response_body: HTTP node response body
- Plus all output variables from the pre-process node
Code Logic:
- Parse the XML response
- Deduplication filtering: remove papers in pushed_ids
- Date filtering: remove papers earlier than cutoff_date
- Keyword filtering: ensure the title or abstract contains at least one search keyword
- Format the output as a list of JSON objects
Output Variables:
- result: Final filtered paper list (JSON string)
- count: Final paper count
- debug: Debug information (including filtering statistics)
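The XML parsing itself needs only the standard library. A minimal sketch of extracting entries from the arXiv Atom feed (the field names match the workflow's later nodes; everything else is an assumption):

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def parse_arxiv_feed(xml_text):
    """Extract the fields the downstream nodes need from an arXiv Atom response."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        papers.append({
            # e.g. "http://arxiv.org/abs/2401.00001v1" -> "2401.00001v1"
            "arxiv_id": entry.find("atom:id", ATOM).text.rsplit("/", 1)[-1],
            # collapse the line-wrapped whitespace arXiv puts in titles
            "title": " ".join(entry.find("atom:title", ATOM).text.split()),
            "published": entry.find("atom:published", ATOM).text[:10],
        })
    return papers
```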
Node name: LLM Initial Review (Type: LLM) — Uses an LLM to score and select the top papers (Top 3).
Output Requirements:
- Clean JSON array format
- Preserve all original fields
- Output Top 3 papers
Node name: JSON Parse (Type: Code) — Tolerant parsing of LLM outputs into a normalized list of papers.
Core Logic:
- Handle nested JSON
- Support papers or top_papers fields
- Error-tolerant processing
Node name: Iteration — Processes each selected paper sequentially (unpack, record to Supabase, OCR the PDF, analyze with LLMs, assemble the final object).
Configuration:
- Input: top_papers (paper array)
- Output: merged_paper (processed paper object)
- Parallel Mode: Off (sequential execution)
- Error Handling: Stop on error
| # | Node Name | Type | Function |
|---|---|---|---|
| 1 | DataUnpack | code | Unpack iteration item into individual variables |
| 2 | Create a Row | tool | Record arxiv_id to Supabase to prevent duplicates |
| 3 | Document Parsing | tool | PaddleOCR parses PDF to extract text |
| 4 | get_footnote_text | code | Extract footnote information (for affiliation recognition) |
| 5 | truncated_text | code | Truncate OCR text (control LLM input length) |
| 6 | (LLM) Analysis | llm | Deep analysis to extract key information |
| 7 | Data Assembly | code | Assemble final paper object |
Unpacks the iteration item into individual variables.
Output:
- title_str: Paper title
- pdf_url: PDF link
- summary_str: Abstract
- published: Publication date
- authors: Authors
- arxiv_id: arXiv ID
Records the paper ArXiv ID to the database to prevent duplicate pushes.
Configuration:
- Table name: From Config node
- Data: {"arxiv_id": "{{arxiv_id}}"}
Node name: Document Parsing (Type: tool - PaddleOCR)
Uses PaddleOCR to parse the paper PDF and extract text content.
Configuration:
- file: PDF URL
- fileType: 0 (PDF file)
- useLayoutDetection: true (enable layout detection)
- prettifyMarkdown: true (beautify output)
Extracts footnote information from OCR text for subsequent affiliation recognition.
Truncates OCR text to control LLM input length and avoid exceeding token limits.
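One possible truncation strategy is to cut at the last paragraph boundary within the budget, so the LLM never receives a half-finished paragraph (the character budget below is an assumed value, not the workflow's actual setting):

```python
def truncate_text(text, max_chars=20000):
    """Truncate OCR output to keep the LLM prompt within token limits."""
    if len(text) <= max_chars:
        return text
    head = text[:max_chars]
    # Prefer cutting at the last paragraph break inside the budget
    cut = head.rfind("\n\n")
    return head[:cut] if cut > 0 else head
```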
Node name: (LLM) Analysis (Type: llm)
Performs deep analysis of the paper to extract key information.
Extracted Fields:
- One_Liner: One-sentence pain point and solution
- Architecture: Model architecture and key innovations
- Dataset: Data sources and scale
- Metrics: Core performance metrics
- Chinese_Abstract: Chinese abstract translation
- Affiliation: Author affiliations
- Code_Url: Code repository link
Core Principles:
- No fluff: Directly state specific methods
- Deep dive into details: Summarize algorithm logic, loss function design
- Data first: Show improvement over SOTA
- No N/A: Make reasonable inferences
Output Format: Pure JSON object
Node name: Data Assembly (Type: code)
Assembles all information into a structured paper object.
Core Functions:
- Parse publication status (identify top conference papers)
- Parse LLM output JSON
- Extract code links
- Assemble final paper object
Output Fields:
- title: Title
- authors: Authors
- affiliation: Affiliation
- pdf_url: PDF link
- summary: English abstract
- published: Publication status
- github_stats: Code status
- code_url: Code link
- ai_evaluation: AI analysis results
Node name: Template Transform (Type: template-transform)
Uses a Jinja2 template to convert paper data into formatted email content.
Template Structure:
```jinja
📅 PaperEssence Daily Digest
Based on your research topic "{{ raw_topic }}", we deliver 3 selected papers from arXiv's last 30 days, updated daily.
--------------------------------------------------
<small><i>⚠️ Note: Content is AI-generated for academic reference only. Please click the PDF link to verify the original paper before citing or conducting in-depth research.</i></small>
Generated: {{ items.target_date | default('Today') }}
==================================================
{% set final_list = items.paper | default(items) %}
{% for item in final_list %}
📄 [{{ loop.index }}] {{ item.title }}
--------------------------------------------------
👤 Authors: {{ item.authors }}
🏢 Affiliation: {{ item.affiliation }}
🔗 PDF: {{ item.pdf_url }}
📅 Status: {{ item.published }}
{% if item.code_url and item.code_url != 'N/A' %}
📦 Code: {{ item.github_stats }}
🔗 {{ item.code_url }}
{% else %}
📦 Code: {{ item.github_stats }}
{% endif %}
English Abstract:
{{ item.summary | replace('\n', ' ') }}
Chinese Abstract:
{{ item.ai_evaluation.Chinese_Abstract }}
🚀 Core Innovation:
{{ item.ai_evaluation.One_Liner }}
📊 Summary:
--------------------------------------------------
🏗️ Architecture:
{{ item.ai_evaluation.Architecture | replace('\n- ', '\n\n 🔹 ') | replace('- ', ' 🔹 ') }}
💾 Dataset:
{{ item.ai_evaluation.Dataset | replace('\n- ', '\n\n 🔹 ') | replace('- ', ' 🔹 ') }}
📈 Metrics:
{{ item.ai_evaluation.Metrics | replace('\n- ', '\n\n 🔹 ') | replace('- ', ' 🔹 ') }}
==================================================
{% else %}
⚠️ No new papers today.
{% endfor %}
```

Node name: 163 SMTP Email Sender (Type: tool - 163-smtp-send-mail)
Configuration:
- username_send: Sender email (from environment variables)
- authorization_code: Email authorization code (from environment variables)
- username_recv: Recipient email
- subject: PaperEssence-{{cutoff_str}}-{{today_str}}
- content: Content from template transform
Node name: Output (Type: end)
Outputs the final result for debugging and verification.
After testing and confirming the workflow works correctly:
Click the Publish button (top-right):
Record the following information:
- API endpoint: https://api.dify.ai/v1/workflows/run (or your private deployment URL)
- API key: app-xxxxxxxxxxxx
If Dify cloud scheduling is restricted on the free tier, you can use Windows Task Scheduler to trigger the workflow via a curl POST.
This solution uses Git Bash to execute curl commands, so you need to install Git for Windows first.
📥 Download: https://git-scm.com/downloads/win
Installation Notes:
- Recommended: use the default installation path (e.g., C:\Program Files\Git) or a custom path (e.g., D:\ProgramFiles\Git)
- Ensure "Git Bash Here" is checked during installation
- Press Win + R → type taskschd.msc → Enter
- Click "Create Task"
- Name: Paper-Essence Daily Run
- Check "Run with highest privileges"
- Click "New"
- Select "On a schedule"
- Choose "Daily", set the time (recommended to match the Dify workflow timer, e.g., 20:55)
- Click "OK"
- Click "New"
- Action: "Start a program"
- Program/script: enter your Git Bash path, for example D:\ProgramFiles\Git\bin\bash.exe, or the default installation path C:\Program Files\Git\bin\bash.exe
- Add arguments:

```bash
curl -N -X POST "https://api.dify.ai/v1/workflows/run" \
  -H "Authorization: Bearer app-YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{ "inputs": {}, "response_mode": "streaming", "user": "cron-job" }'
```

⚠️ Note: Replace app-YOUR-API-KEY with your actual API key.
- You can uncheck "Start the task only if the computer is on AC power" to ensure laptops run the task on battery
- Check "If the task fails, restart every" and set retry interval
- Click "OK" to save the task
- Click "Run" in the workflow editor (top-right)
- Observe node execution and outputs
- Verify each node's output meets expectations
When the workflow executes successfully, you will receive an email like this:
This tutorial covers building a complete end-to-end pipeline: arXiv fetching → PaddleOCR parsing → LLM analysis → Jinja2 templating → SMTP delivery, with Supabase-based deduplication and scheduled triggering. Along the way it walks through workflow node configuration, environment variables, and Supabase integration.
The provided prj/Paper-Essence-CN.yml and prj/Paper-Essence-EN.yml can be imported into a Dify workspace to reproduce the workflow.
Special thanks to Professor Zhang Jing, Professor Guan Mu, and Professor Yang Youzhi for their guidance.