
Paper-Essence — Tutorial: Building an Automated Paper Digest Workflow

📖 Project Overview

Paper-Essence is an automated paper-digest workflow built on the Dify platform. This workflow can:

  • 🕐 Fetch the latest papers from arXiv for specified research areas on a daily schedule
  • 🤖 Use large language models to filter and select the most valuable papers
  • 📄 Parse PDF papers with OCR to extract technical details
  • 📧 Generate a structured daily digest and send it by email

GitHub repository: https://github.com/LiaoYFBH/PaperFlow — you can import prj/Paper-Essence-CN.yml or prj/Paper-Essence-EN.yml directly.


🛠️ Prerequisites

1. Platform and Accounts

  • Dify account: Register and log in to Dify
  • Email account: An SMTP-capable email (this tutorial uses 163 Mail)
  • LLM API: Configure either Baidu Wenxin or an OpenAI-compatible model

2. Install Required Plugins

Install the following plugins from the Dify plugin marketplace:

| Plugin | Purpose |
|--------|---------|
| paddle-aistudio/ernie-paddle-aistudio | Baidu Wenxin LLM integration (Xinghe Community API) |
| langgenius/paddleocr | OCR for PDFs and images |
| wjdsg/163-smtp-send-mail | 163 SMTP email sending |
| langgenius/supabase | Database storage for pushed records |

3. Prepare Supabase Database

We use a cloud database (Supabase) to record papers that have already been pushed to avoid duplicates.

Step 1: Login and Create Project

Step 2: Create Table

In the SQL Editor, run the following SQL statement:

create table pushed_papers (
  arxiv_id text not null,
  pushed_at timestamp default now(),
  primary key (arxiv_id)
);

This table records pushed paper IDs to ensure no duplicates.

Step 3: Get API Keys

Record the following information:

  • NEXT_PUBLIC_SUPABASE_URL → Supabase URL for Dify plugin
  • NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY → Supabase Key for Dify plugin

Step 4: Configure Supabase Plugin in Dify

(Optional) Deploy Dify with Docker

Environment Setup

This tutorial uses WSL + Docker. You can refer to this article for WSL and Docker configuration.

Clone the Dify Repository

First, clone the Dify repository. If you haven't configured Git, you can directly download the ZIP file from the repository page and extract it.

If you have Git configured, run the following commands in your terminal:

# Clone Dify repository
git clone https://github.com/langgenius/dify.git

# Navigate to the docker deployment directory
cd dify/docker

# Copy the environment configuration file
cp .env.example .env

Make sure Docker Desktop is running, then start Dify (this automatically pulls the images and starts all services):

docker compose up -d

Check the status:

docker compose ps

Access the application at: http://localhost/


📊 Workflow Architecture

The core flow of the workflow is shown below:

Flow Description

| Stage | Node | Function |
|-------|------|----------|
| Trigger | Schedule Trigger | Auto-start at a specified time daily |
| Config | Config Node | Read environment variables |
| Translation | LLM Translation | Translate the research topic to English |
| Search | Get Rows → Pre-process → HTTP → Post-process | Query pushed records, search arXiv |
| Review | LLM Review | Use an LLM to select the Top 3 papers |
| Iteration | Iteration Node | For each paper: unpack → record → OCR → analyze → assemble |
| Output | Template Transform → Email | Generate the report and send the email |

🔧 Step-by-step Setup

Step 1 — Create the Workflow

  1. Log in to Dify
  2. In the Studio, click "Create App" → choose "Workflow"
  3. Enter an application name
  4. Choose the Trigger type for the workflow

Step 2 — Configure Environment Variables

Click the Settings button (top-right):

Click "Add Environment Variable":

Key variables:

| Name | Type | Description | Example |
|------|------|-------------|---------|
| table_name | string | Supabase table name | pushed_papers |
| SMTP_PORT | string | SMTP port | 465 |
| SMTP_SERVER | string | SMTP server | smtp.163.com |
| SMTP_PASSWORD | secret | SMTP authorization code | (your auth code) |
| SMTP_USER | secret | SMTP user/email | your_email@163.com |
| MY_RAW_TOPIC | string | Research topic | agent memory |

How to get email authorization code:


Step 3 — Schedule Trigger

Node name: Schedule Trigger

Configuration:

  • Trigger Frequency: Daily
  • Trigger Time: 8:59 AM (or adjust as needed)

Step 4 — Configuration (Code Node)

Node name: Config (Type: Code) — This node reads environment variables and outputs them for downstream nodes.

Input Variables:

  • From environment variables: SMTP_PORT, SMTP_SERVER, SMTP_USER, SMTP_PASSWORD, MY_RAW_TOPIC, table_name

Output Variables:

  • raw_topic: Research topic
  • user_email: Recipient email
  • fetch_count: Number of papers to fetch (default: 50)
  • push_limit: Push limit (default: 3)
  • days_lookback: Days to look back (default: 30)
  • Plus SMTP configuration

Code:

def main(
    SMTP_USER: str,
    MY_RAW_TOPIC: str,
    SMTP_PORT: str,
    SMTP_SERVER: str,
    SMTP_PASSWORD: str,
    table_name: str
) -> dict:
    # Pass the environment variables through and attach workflow defaults
    return {
        "raw_topic": MY_RAW_TOPIC,
        "user_email": SMTP_USER,    # the digest is sent to the sender's own address
        "smtp_port": SMTP_PORT,
        "smtp_server": SMTP_SERVER,
        "smtp_password": SMTP_PASSWORD,
        "fetch_count": 50,          # papers to fetch from arXiv
        "push_limit": 3,            # papers per digest
        "days_lookback": 30,        # search window in days
        "table_name": table_name
    }

Step 5 — Research Field LLM Translation (LLM Node)

Node name: Research Field LLM Translation (Type: LLM) — Converts the research topic into an optimized English boolean query for arXiv.

Model Configuration:

  • Model: ernie-4.5-turbo-128k or ernie-5.0-thinking-preview
  • Temperature: 0.7

Prompt Rules: Extract core concepts, translate terms (if necessary), construct boolean logic using AND/OR, wrap phrases in quotes, and output only the query string.
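As an illustration of the target shape (a hypothetical input/output pair, not taken from the project), the node should emit a bare boolean query string with quoted phrases and AND/OR connectives, and nothing else:

```python
# Illustrative only: what the translation node's output should look like.
raw_topic = "agent memory"
expected_query = '"agent memory" AND ("large language model" OR "LLM agent")'

# No explanations, no markdown fences — just the query string itself.
print(expected_query)
```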


Step 6 — Query Pushed Records (Supabase Node)

Node name: Get Rows (Tool: Supabase) — Fetches existing pushed arXiv IDs to avoid duplicates.

Configuration:

  • Table name: {{table_name}} (from Config node)

Step 7 — Search Papers (3 Nodes)

To improve stability and maintainability, the search function is split into "Pre-process" → "HTTP Request" → "Post-process".

7.1 Search Pre-process (Code Node)

Node name: Search Pre-process (Type: Code) — Builds the arXiv API request and prepares search parameters.

Input Variables:

  • topic: Translated English search term
  • days_lookback: Days to look back
  • count: Number of papers to fetch
  • supabase_output: Already pushed records (for deduplication)

Code Logic:

  1. Calculate cutoff date (cutoff_date)
  2. Parse Supabase returned pushed paper ID list
  3. Build boolean query string based on topic (supports AND/OR logic)
  4. Add arXiv category restrictions based on topic keywords (e.g., cs.CV, cs.CL)
  5. Extract search keywords for subsequent filtering

Output Variables:

  • base_query: Constructed query string
  • pushed_ids: List of already pushed IDs
  • cutoff_str: Cutoff date string
  • search_keywords: List of search keywords
  • fetch_limit: API fetch limit
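A minimal sketch of this logic, assuming the input/output names listed above (the actual node code ships in the YAML files; the category heuristic and term splitting here are illustrative):

```python
import json
from datetime import datetime, timedelta, timezone

def main(topic: str, days_lookback: int, count: int, supabase_output: str) -> dict:
    # 1. Cutoff date for the search window
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_lookback)
    cutoff_str = cutoff.strftime("%Y-%m-%d")

    # 2. Collect already-pushed arXiv IDs for deduplication
    pushed_ids = []
    try:
        for row in json.loads(supabase_output):
            if row.get("arxiv_id"):
                pushed_ids.append(row["arxiv_id"])
    except (ValueError, TypeError):
        pass  # tolerate an empty or malformed Supabase response

    # 3. Boolean query over title and abstract for each OR-term
    terms = [t.strip().strip('"') for t in topic.replace(" OR ", "|").split("|")]
    terms = [t for t in terms if t]
    base_query = " OR ".join(f'(ti:"{t}" OR abs:"{t}")' for t in terms)

    # 4. Illustrative category restriction based on topic keywords
    if any(k in topic.lower() for k in ("image", "vision", "detection")):
        base_query = f"({base_query}) AND cat:cs.CV"

    return {
        "base_query": base_query,
        "pushed_ids": pushed_ids,
        "cutoff_str": cutoff_str,
        "search_keywords": [t.lower() for t in terms],
        "fetch_limit": count,
    }
```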

7.2 HTTP Request (HTTP Node)

Node name: HTTP Request (Type: http-request) — Calls arXiv API to get raw XML data.

Configuration:

  • API URL: http://export.arxiv.org/api/query
  • Method: GET
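The GET parameters follow the standard arXiv Atom API (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`); `base_query` and `fetch_limit` come from the pre-process node. A sketch of the equivalent request URL:

```python
from urllib.parse import urlencode

base_query = '(ti:"agent memory" OR abs:"agent memory")'  # from Search Pre-process
fetch_limit = 50                                          # from Search Pre-process

params = {
    "search_query": base_query,
    "start": 0,
    "max_results": fetch_limit,
    "sortBy": "submittedDate",   # newest submissions first
    "sortOrder": "descending",
}
url = "http://export.arxiv.org/api/query?" + urlencode(params)
print(url)
```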

7.3 Search Post-process (Code Node)

Node name: Search Post-process (Type: Code) — Parses XML response and filters papers.

Input Variables:

  • http_response_body: HTTP node response body
  • Plus all output variables from the pre-process node

Code Logic:

  1. Parse XML response
  2. Deduplication filtering: Remove papers in pushed_ids
  3. Date filtering: Remove papers earlier than cutoff_date
  4. Keyword filtering: Ensure title or abstract contains at least one search keyword
  5. Format output as JSON object list

Output Variables:

  • result: Final filtered paper list (JSON string)
  • count: Final paper count
  • debug: Debug information (including filtering statistics)
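A condensed sketch of the post-process logic using only the standard library (the real node also assembles the `debug` statistics; the Atom namespace below is the one arXiv returns):

```python
import json
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def main(http_response_body: str, pushed_ids: list, cutoff_str: str,
         search_keywords: list) -> dict:
    papers = []
    root = ET.fromstring(http_response_body)
    for entry in root.findall(f"{ATOM}entry"):
        arxiv_id = entry.find(f"{ATOM}id").text.rsplit("/", 1)[-1]
        title = " ".join(entry.find(f"{ATOM}title").text.split())
        summary = " ".join(entry.find(f"{ATOM}summary").text.split())
        published = entry.find(f"{ATOM}published").text[:10]

        if arxiv_id in pushed_ids:      # 2. deduplication filter
            continue
        if published < cutoff_str:      # 3. date filter (ISO dates sort lexically)
            continue
        text = (title + " " + summary).lower()
        if not any(k in text for k in search_keywords):  # 4. keyword filter
            continue
        papers.append({"arxiv_id": arxiv_id, "title": title,
                       "summary": summary, "published": published})

    return {"result": json.dumps(papers), "count": len(papers)}
```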

Step 8 — LLM Initial Review

Node name: LLM Initial Review (Type: LLM) — Uses an LLM to score and select the top papers (Top 3).

Output Requirements:

  • Clean JSON array format
  • Preserve all original fields
  • Output Top 3 papers

Step 9 — JSON Parsing (Code Node)

Node name: JSON Parse (Type: Code) — Tolerant parsing of LLM outputs into a normalized list of papers.

Core Logic:

  • Handle nested JSON
  • Support papers or top_papers fields
  • Error-tolerant processing
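A sketch of a tolerant parser in this spirit (the field names follow the step above; the fence-stripping and array-extraction fallbacks are common defensive measures, not necessarily the project's exact code):

```python
import json
import re

def parse_llm_json(text: str) -> list:
    # Strip markdown code fences the LLM may wrap around its output
    text = re.sub(r"```(?:json)?", "", text).strip()
    try:
        data = json.loads(text)
    except ValueError:
        # Fall back to the outermost JSON array embedded in the text
        match = re.search(r"\[.*\]", text, re.DOTALL)
        if not match:
            return []
        data = json.loads(match.group(0))
    # Accept a bare array, or an object with "papers" / "top_papers"
    if isinstance(data, dict):
        data = data.get("papers") or data.get("top_papers") or []
    return data if isinstance(data, list) else []
```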

Step 10 — Iteration Node

Node name: Iteration — Processes each selected paper sequentially (unpack, record to Supabase, OCR the PDF, analyze with LLMs, assemble the final object).

Configuration:

  • Input: top_papers (paper array)
  • Output: merged_paper (processed paper object)
  • Parallel Mode: Off (sequential execution)
  • Error Handling: Stop on error

Iteration Internal Flow

| # | Node Name | Type | Function |
|---|-----------|------|----------|
| 1 | DataUnpack | code | Unpack the iteration item into individual variables |
| 2 | Create a Row | tool | Record arxiv_id to Supabase to prevent duplicates |
| 3 | Document Parsing | tool | PaddleOCR parses the PDF to extract text |
| 4 | get_footnote_text | code | Extract footnote information (for affiliation recognition) |
| 5 | truncated_text | code | Truncate OCR text (control LLM input length) |
| 6 | (LLM) Analysis | llm | Deep analysis to extract key information |
| 7 | Data Assembly | code | Assemble the final paper object |

10.1 DataUnpack (Code Node)

Unpacks the iteration item into individual variables.

Output:

  • title_str: Paper title
  • pdf_url: PDF link
  • summary_str: Abstract
  • published: Publication date
  • authors: Authors
  • arxiv_id: ArXiv ID
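A minimal sketch, assuming the item keys match the search node's output (defaults guard against missing fields):

```python
def main(item: dict) -> dict:
    # Flatten the iteration item so downstream nodes can reference
    # each field directly instead of indexing into the object.
    return {
        "title_str": item.get("title", ""),
        "pdf_url": item.get("pdf_url", ""),
        "summary_str": item.get("summary", ""),
        "published": item.get("published", ""),
        "authors": item.get("authors", ""),
        "arxiv_id": item.get("arxiv_id", ""),
    }
```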

10.2 Create a Row (Supabase Node)

Records the paper ArXiv ID to the database to prevent duplicate pushes.

Configuration:

  • Table name: From Config node
  • Data: {"arxiv_id": "{{arxiv_id}}"}

10.3 Document Parsing (PaddleOCR Node)

Node name: Document Parsing (Type: tool - PaddleOCR)

Uses PaddleOCR to parse the paper PDF and extract text content.

Configuration:

  • file: PDF URL
  • fileType: 0 (PDF file)
  • useLayoutDetection: true (enable layout detection)
  • prettifyMarkdown: true (beautify output)

10.4 get_footnote_text (Code Node)

Extracts footnote information from OCR text for subsequent affiliation recognition.
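One hypothetical way to do this: affiliations in arXiv PDFs usually appear near the top of page 1 or in first-page footnotes, so a simple heuristic scans the head of the OCR text for institution keywords (the keyword list and window size here are illustrative):

```python
def main(ocr_text: str) -> dict:
    # Affiliations typically sit in the first-page header/footnotes,
    # so only scan the head of the OCR output.
    head = ocr_text[:3000]
    keywords = ("University", "Institute", "Laboratory", "Lab", "Research", "Inc.")
    footnote_lines = [
        line.strip() for line in head.splitlines()
        if any(k in line for k in keywords)
    ]
    return {"footnote_text": "\n".join(footnote_lines[:10])}
```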

10.5 truncated_text (Code Node)

Truncates OCR text to control LLM input length and avoid exceeding token limits.
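A sketch, assuming a character budget as a rough proxy for tokens (the real node may use a different limit):

```python
def main(ocr_text: str, max_chars: int = 12000) -> dict:
    # Keep the head of the paper (title, abstract, intro, method) and drop
    # the tail (references, appendices) when the budget is exceeded.
    if len(ocr_text) <= max_chars:
        return {"truncated_text": ocr_text}
    return {"truncated_text": ocr_text[:max_chars] + "\n\n[... truncated ...]"}
```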

10.6 (LLM) Analysis

Node name: (LLM) Analysis (Type: llm)

Performs deep analysis of the paper to extract key information.

Extracted Fields:

  1. One_Liner: One-sentence pain point and solution
  2. Architecture: Model architecture and key innovations
  3. Dataset: Data sources and scale
  4. Metrics: Core performance metrics
  5. Chinese_Abstract: Chinese abstract translation
  6. Affiliation: Author affiliations
  7. Code_Url: Code repository link

Core Principles:

  • No fluff: Directly state specific methods
  • Deep dive into details: Summarize algorithm logic, loss function design
  • Data first: Show improvement over SOTA
  • No N/A: Make reasonable inferences

Output Format: Pure JSON object

10.7 Data Assembly (Code Node)

Node name: Data Assembly (Type: code)

Assembles all information into a structured paper object.

Core Functions:

  1. Parse publication status (identify top conference papers)
  2. Parse LLM output JSON
  3. Extract code links
  4. Assemble final paper object

Output Fields:

  • title: Title
  • authors: Authors
  • affiliation: Affiliation
  • pdf_url: PDF link
  • summary: English abstract
  • published: Publication status
  • github_stats: Code status
  • code_url: Code link
  • ai_evaluation: AI analysis results
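A condensed sketch of the assembly step under the field names above (the venue list, fallbacks, and status strings are illustrative; the JSON parsing reuses the same fence-stripping tolerance as Step 9):

```python
import json
import re

def main(title_str: str, authors: str, pdf_url: str, summary_str: str,
         published: str, footnote_text: str, llm_output: str) -> dict:
    # 1. Tolerantly parse the LLM analysis JSON
    cleaned = re.sub(r"```(?:json)?", "", llm_output).strip()
    try:
        ai_eval = json.loads(cleaned)
    except ValueError:
        ai_eval = {}

    # 2. Publication status: flag well-known venues if mentioned anywhere
    venues = ("CVPR", "ICCV", "NeurIPS", "ICML", "ICLR", "ACL", "EMNLP")
    hits = [v for v in venues if v in summary_str or v in footnote_text]
    status = f"{published} (accepted: {', '.join(hits)})" if hits else published

    # 3. Code link from the analysis, with a sensible placeholder
    code_url = ai_eval.get("Code_Url") or "N/A"
    github_stats = "Code released" if code_url != "N/A" else "No code found"

    # 4. Assemble the final paper object
    return {"merged_paper": {
        "title": title_str,
        "authors": authors,
        "affiliation": ai_eval.get("Affiliation", "Unknown"),
        "pdf_url": pdf_url,
        "summary": summary_str,
        "published": status,
        "github_stats": github_stats,
        "code_url": code_url,
        "ai_evaluation": ai_eval,
    }}
```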

Step 11 — Template Transform

Node name: Template Transform (Type: template-transform)

Uses a Jinja2 template to convert paper data into formatted email content.

Template Structure:

📅 PaperEssence Daily Digest
Based on your research topic "{{ raw_topic }}", we deliver 3 selected papers from arXiv's last 30 days, updated daily.
--------------------------------------------------
<small><i>⚠️ Note: Content is AI-generated for academic reference only. Please click the PDF link to verify the original paper before citing or conducting in-depth research.</i></small>
Generated: {{ items.target_date | default('Today') }}
==================================================

{% set final_list = items.paper | default(items) %}

{% for item in final_list %}
📄 [{{ loop.index }}] {{ item.title }}
--------------------------------------------------
👤 Authors: {{ item.authors }}
🏢 Affiliation: {{ item.affiliation }}
🔗 PDF: {{ item.pdf_url }}
📅 Status: {{ item.published }}
{% if item.code_url and item.code_url != 'N/A' %}
📦 Code: {{ item.github_stats }}
   🔗 {{ item.code_url }}
{% else %}
📦 Code: {{ item.github_stats }}
{% endif %}

English Abstract:
{{ item.summary | replace('\n', ' ') }}

Chinese Abstract:
{{ item.ai_evaluation.Chinese_Abstract }}

🚀 Core Innovation:
{{ item.ai_evaluation.One_Liner }}

📊 Summary:
--------------------------------------------------
🏗️ Architecture:
{{ item.ai_evaluation.Architecture | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

💾 Dataset:
{{ item.ai_evaluation.Dataset | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

📈 Metrics:
{{ item.ai_evaluation.Metrics | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

==================================================
{% else %}
⚠️ No new papers today.
{% endfor %}

Step 12 — Send Email (163 SMTP)

Node name: 163 SMTP Email Sender (Type: tool - 163-smtp-send-mail)

Configuration:

  • username_send: Sender email (from environment variables)
  • authorization_code: Email authorization code (from environment variables)
  • username_recv: Recipient email
  • subject: PaperEssence-{{cutoff_str}}-{{today_str}}
  • content: Content from template transform

Step 13 — Output Node

Node name: Output (Type: end)

Outputs the final result for debugging and verification.


📤 Publishing and Getting the Workflow API

After testing and confirming the workflow works correctly:

Click the Publish button (top-right):

Record the following information:

  • API endpoint: https://api.dify.ai/v1/workflows/run (or your private deployment URL)
  • API key: app-xxxxxxxxxxxx

⏰ Alternative: Local Schedule Trigger (Windows Task Scheduler)

If Dify cloud scheduling is restricted on the free tier, you can use Windows Task Scheduler to trigger the workflow via a curl POST.

Prerequisite: Install Git for Windows

This solution uses Git Bash to execute curl commands, so you need to install Git for Windows first.

📥 Download: https://git-scm.com/downloads/win

Installation Notes:

  • Recommended to use default installation path (e.g., C:\Program Files\Git) or custom path (e.g., D:\ProgramFiles\Git)
  • Ensure "Git Bash Here" is checked during installation

Configure Windows Task Scheduler

  1. Press Win + R → type taskschd.msc → Enter
  2. Click "Create Task"

General Tab:

  • Name: Paper-Essence Daily Run
  • Check "Run with highest privileges"

Triggers Tab:

  1. Click "New"
  2. Select "On a schedule"
  3. Choose "Daily", set time (recommended to match Dify workflow timer, e.g., 20:55)
  4. Click "OK"

Actions Tab:

  1. Click "New"

  2. Action: "Start a program"

  3. Program/script: Enter your Git Bash path, for example:

    D:\ProgramFiles\Git\bin\bash.exe
    

    or default installation path:

    C:\Program Files\Git\bin\bash.exe
    
  4. Add arguments:

    curl -N -X POST "https://api.dify.ai/v1/workflows/run" -H "Authorization: Bearer app-YOUR-API-KEY" -H "Content-Type: application/json" -d '{ "inputs": {}, "response_mode": "streaming", "user": "cron-job" }'
    

    ⚠️ Note: Replace app-YOUR-API-KEY with your actual API key

Conditions Tab (Optional):

  • You can uncheck "Start the task only if the computer is on AC power" to ensure laptops run the task on battery

Settings Tab (Optional):

  • Check "If the task fails, restart every" and set a retry interval

Finally, click "OK" to save the task.

🧪 Testing and Debugging

Manual Test

  1. Click "Run" in the workflow editor (top-right)
  2. Observe node execution and outputs
  3. Verify each node's output meets expectations

Success Result

When the workflow executes successfully, you will receive an email like this:


📝 Summary

This tutorial covers building a complete end-to-end pipeline: arXiv fetching → PaddleOCR parsing → LLM analysis → Jinja2 templating → SMTP delivery, with Supabase-based deduplication and scheduled triggering. Along the way it exercises Dify's environment variables, code nodes, tool plugins, and YAML workflow import.

The provided prj/Paper-Essence-CN.yml and prj/Paper-Essence-EN.yml can be imported into a Dify workspace to reproduce the workflow.


🙏 Acknowledgments

Special thanks to Professor Zhang Jing, Professor Guan Mu, and Professor Yang Youzhi for their guidance.