| name | epub2md |
|---|---|
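| description | Splits an EPUB e-book into Markdown files, output by chapter in a multi-level directory structure. Use this skill when the user mentions EPUB, epub, e-book splitting, epub to markdown, epub to md, chapter splitting, or needs to convert an epub file into Markdown documents. It also applies when the user wants to extract content from an epub, read an epub chapter by chapter, or batch-process epub files. Do NOT use for PDF, DOCX, or other non-EPUB formats. |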
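Splits an EPUB e-book into separate Markdown files following its table of contents (TOC), preserving the multi-level directory structure. Images are extracted to the images/ directory, and the image references in each MD file are rewritten with the correct relative path for the directory the chapter lives in. Chapter names come from the TOC titles when available; when there is no TOC, the real title is extracted from the HTML content. Processing is done by a Python script, which avoids loading entire file contents into context.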
Copy this checklist and track progress:
Epub2md Progress:
- [ ] Step 1: Validate EPUB path and resolve to absolute path
- [ ] Step 2: Run epub2md.py to split EPUB
- [ ] Step 3: Review chapter list from manifest
- [ ] Step 4: Verify output structure
Context Management: This skill handles potentially large EPUB files. The Python script does all heavy lifting (HTML parsing, image extraction, markdown conversion) outside of the conversation context. Only the chapter directory (TOC) and manifest metadata are brought into context — never the full chapter content. After every 5 chapters processed, a context compression checkpoint is performed.
The user may provide the EPUB path in various formats:
- Relative path: `./books/mybook.epub` or `mybook.epub`
- Windows absolute path: `D:\Books\mybook.epub` or `C:\Users\name\Downloads\book.epub`
- macOS absolute path: `/Users/name/Downloads/book.epub`
Resolution rules:
- If the path is relative, resolve it against the current working directory
- If the path contains backslashes (Windows), convert to forward slashes for Python compatibility
- Verify the file exists and has `.epub` extension
- If the file is not found, ask the user to verify the path
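The resolution rules above can be sketched in a few lines of Python. `resolve_epub_path` is a hypothetical helper written for illustration and is not part of epub2md.py. Note one assumption: a Windows drive path such as `D:/Books/mybook.epub` is only recognized as absolute when the code runs on Windows.

```python
from pathlib import Path


def resolve_epub_path(raw: str, cwd: str) -> Path:
    """Resolve a user-supplied EPUB path (hypothetical helper, for illustration)."""
    # Rule: convert Windows backslashes to forward slashes.
    normalized = raw.strip().replace("\\", "/")
    path = Path(normalized)
    # Rule: resolve relative paths against the current working directory.
    if not path.is_absolute():
        path = Path(cwd) / path
    # Rule: the file must have an .epub extension and must exist.
    if path.suffix.lower() != ".epub":
        raise ValueError(f"Not an .epub file: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"EPUB not found, please verify the path: {path}")
    return path.resolve()
```

A `FileNotFoundError` here corresponds to the last rule: rather than guessing, ask the user to verify the path.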
Execute the Python script to perform the full split:
`python <skill_dir>/scripts/epub2md.py <epub_path> [--output-dir <dir>]`
Where:
- `<skill_dir>` is the directory containing this SKILL.md
- `<epub_path>` is the absolute path resolved in Step 1
- `--output-dir` is optional; defaults to a directory named after the EPUB file in the same location as the EPUB
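As a rough illustration of the invocation and the default output location, the sketch below builds the argument list for the command. Both helpers are hypothetical, and the default directory is assumed to be the EPUB filename without its extension, which matches the `MyBook.epub` to `MyBook/` example in this document but is not confirmed against the script itself.

```python
import sys
from pathlib import Path
from typing import List, Optional


def default_output_dir(epub_path: Path) -> Path:
    # Assumption: same location as the EPUB, named after the file minus ".epub".
    return epub_path.with_suffix("")


def build_command(skill_dir: Path, epub_path: Path,
                  output_dir: Optional[Path] = None) -> List[str]:
    # Mirrors: python <skill_dir>/scripts/epub2md.py <epub_path> [--output-dir <dir>]
    cmd = [sys.executable, str(skill_dir / "scripts" / "epub2md.py"), str(epub_path)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list instead of a shell string avoids quoting problems with paths that contain spaces.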
No external Python dependencies are required: the script uses only the Python standard library (zipfile, html.parser, re, json, etc.), so there is nothing to pip install.
The script will:
- Parse the EPUB structure (container.xml → OPF → spine/manifest)
- Extract the table of contents (NCX for EPUB2, NAV for EPUB3, or fallback to spine)
- Extract all images to `images/` directory
- Convert each chapter's HTML to Markdown
- Fix image references to point to the images/ directory (relative paths computed per-chapter)
- Create the directory structure matching the TOC hierarchy
- Write each chapter as a separate `.md` file
- Generate `manifest.json` with complete metadata
Important: The script outputs only the chapter directory (titles and hierarchy) to stdout, NOT the chapter content. This keeps context usage minimal.
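To make the first step in the list above concrete (container.xml → OPF → spine/manifest), here is a minimal standard-library sketch. It is not the code in epub2md.py; `read_spine` is a hypothetical name, and the sketch only returns the spine documents in reading order without handling the TOC or malformed packages.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def read_spine(epub_file) -> list:
    """Return the spine documents as archive paths, in reading order."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
    # The manifest maps item ids to hrefs relative to the OPF file.
    hrefs = {item.get("id"): item.get("href")
             for item in opf.findall("opf:manifest/opf:item", OPF_NS)}
    opf_dir = posixpath.dirname(opf_path)
    # The spine lists manifest ids in reading order.
    return [posixpath.normpath(posixpath.join(opf_dir, hrefs[ref.get("idref")]))
            for ref in opf.findall("opf:spine/opf:itemref", OPF_NS)
            if ref.get("idref") in hrefs]
```

The spine order is also what the script is described as falling back to when neither an NCX nor a NAV table of contents is present.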
After the script completes, read the manifest.json to present the chapter list to the user:
`# Read only the chapter titles and structure, not the content`
Read `<output_dir>/manifest.json` and present:
- Total number of chapters
- Directory tree structure
- Number of images extracted
- Any warnings or issues
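One possible way to produce that summary without touching chapter content is sketched below. `summarize_manifest` is a hypothetical helper; the field names (`book_title`, `total_chapters`, `image_count`, `chapters`, `level`, `status`) are taken from the manifest.json example in this document.

```python
import json
from pathlib import Path


def summarize_manifest(manifest_path) -> str:
    """Build a short chapter summary from manifest.json only (no chapter content)."""
    data = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    chapters = data.get("chapters", [])
    lines = [
        f"Book: {data.get('book_title', '(unknown)')}",
        f"Total chapters: {data.get('total_chapters', len(chapters))}",
        f"Images extracted: {data.get('image_count', 0)}",
    ]
    for ch in chapters:
        # Indent by TOC level and flag anything that did not finish.
        flag = "" if ch.get("status") == "done" else f"  [status: {ch.get('status')}]"
        lines.append(f"{'  ' * ch.get('level', 0)}- {ch['title']}{flag}")
    return "\n".join(lines)
```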
Context Compression Checkpoint (every 5 chapters during review)
If there are more than 5 chapters, do NOT read all chapter files into context. Instead:
- Read only `manifest.json` for the chapter list
- After every 5 chapters reviewed, write a `workflow_state.md` file in the output directory:
# Epub2md Workflow State
## Source
- File: <epub_filename>
- Path: <absolute_path_to_epub>
## Output
- Directory: <absolute_path_to_output_dir>
- Total chapters: N
- Images: M (in images/)
- Manifest: <absolute_path_to_manifest.json>
## Current Progress
- Chapters reviewed: X/N
- Last reviewed chapter: <title>
## Chapter Structure Summary
- Level 0 chapters: [list of top-level chapter titles]
- Level 1 chapters: [list of second-level chapter titles]
- ...
- State: "Context checkpoint saved. Previous chapter details released from context. Progress preserved in `workflow_state.md`."
Why this matters: A book with 30+ chapters would consume enormous context if every chapter's content were loaded. By only loading the manifest and compressing every 5 chapters, we keep context usage constant regardless of book size.
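A checkpoint file following the template above could be written with a helper along these lines. This is a sketch under assumptions: `write_checkpoint` is a hypothetical name, the output directory is taken to be the folder containing manifest.json, and the absolute EPUB path is passed in because the manifest example only records the filename.

```python
import json
from pathlib import Path


def write_checkpoint(manifest_path, epub_path, reviewed: int) -> Path:
    """Write workflow_state.md next to manifest.json (hypothetical helper)."""
    manifest_path = Path(manifest_path)
    data = json.loads(manifest_path.read_text(encoding="utf-8"))
    chapters = data["chapters"]
    out_dir = manifest_path.parent
    last = chapters[reviewed - 1]["title"] if reviewed > 0 else "(none)"
    # Group titles by TOC level for the structure summary.
    by_level = {}
    for ch in chapters:
        by_level.setdefault(ch["level"], []).append(ch["title"])
    lines = [
        "# Epub2md Workflow State",
        "## Source",
        f"- File: {data['source_file']}",
        f"- Path: {epub_path}",
        "## Output",
        f"- Directory: {out_dir}",
        f"- Total chapters: {len(chapters)}",
        f"- Images: {data['image_count']} (in {data['attachments_dir']}/)",
        f"- Manifest: {manifest_path}",
        "## Current Progress",
        f"- Chapters reviewed: {reviewed}/{len(chapters)}",
        f"- Last reviewed chapter: {last}",
        "## Chapter Structure Summary",
    ]
    for level in sorted(by_level):
        lines.append(f"- Level {level} chapters: {', '.join(by_level[level])}")
    state_path = out_dir / "workflow_state.md"
    state_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return state_path
```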
- Verify the output directory contains:
  - Chapter `.md` files in the expected directory structure
  - `images/` directory with images
  - `manifest.json` with all chapters listed as `"status": "done"`
- Optionally read 1-2 chapter files to spot-check content quality
- Report the final result to the user
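The structural checks above can be automated with a small script such as the following sketch. `verify_output` is a hypothetical helper, and it assumes that each chapter's `dir_path` in the manifest is relative to the output directory, as the empty `dir_path` of the root-level example entry suggests.

```python
import json
from pathlib import Path


def verify_output(output_dir) -> list:
    """Return a list of problems found in the output directory (empty if none)."""
    out = Path(output_dir)
    manifest_path = out / "manifest.json"
    if not manifest_path.is_file():
        return ["manifest.json is missing"]
    data = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems = []
    images_dir = out / data.get("attachments_dir", "images")
    if data.get("image_count", 0) > 0 and not images_dir.is_dir():
        problems.append("images directory is missing")
    for ch in data.get("chapters", []):
        # Assumption: dir_path is relative to the output directory.
        md = out / ch.get("dir_path", "") / ch["filename"]
        if ch.get("status") != "done":
            problems.append(f"chapter '{ch['title']}' has status {ch.get('status')!r}")
        elif not md.is_file():
            problems.append(f"missing file for chapter '{ch['title']}': {md}")
    return problems
```

An empty list means the structure matches the manifest; spot-checking one or two chapter files for content quality remains a manual step.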
Given an EPUB named MyBook.epub, the output looks like:
MyBook/
├── 001_Foreword.md
├── 002_Chapter One/
│ ├── 002_Chapter One.md
│ ├── 003_Section 1.1.md
│ └── 004_Section 1.2.md
├── 005_Chapter Two/
│ ├── 005_Chapter Two.md
│ └── 006_Section 2.1.md
├── images/
│ ├── cover.png
│ ├── fig1.jpg
│ └── fig2.png
└── manifest.json
Each .md file contains the chapter content converted from HTML to Markdown, with image references pointing to the images/ directory using correct relative paths (e.g., images/cover.png for root-level chapters, ../images/cover.png for subdirectory chapters).
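The relative-path rule can be illustrated with `posixpath.relpath`. This is a sketch of the idea, not the script's own code; `image_ref` is a hypothetical helper whose `chapter_dir` argument is the chapter's directory relative to the output root (empty for root-level chapters).

```python
import posixpath


def image_ref(chapter_dir: str, image_name: str, images_dir: str = "images") -> str:
    """Path from a chapter's directory to an image in the shared images/ folder."""
    # Anchor both paths at "/" so the result does not depend on the working directory.
    target = posixpath.join("/", images_dir, image_name)
    start = posixpath.join("/", chapter_dir)
    return posixpath.relpath(target, start=start)


print(image_ref("", "cover.png"))                 # images/cover.png
print(image_ref("002_Chapter One", "cover.png"))  # ../images/cover.png
```

Using `posixpath` rather than `os.path` keeps forward slashes in the Markdown links even when the code runs on Windows.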
manifest.json contains:
{
  "source_file": "MyBook.epub",
  "book_title": "My Book",
  "output_dir": "/path/to/MyBook",
  "total_chapters": 6,
  "attachments_dir": "images",
  "image_count": 3,
  "chapters": [
    {
      "index": 0,
      "title": "Foreword",
      "src": "foreword.xhtml",
      "dir_path": "",
      "filename": "001_Foreword.md",
      "full_path": "/path/to/MyBook/001_Foreword.md",
      "level": 0,
      "status": "done"
    }
  ]
}
Only use the manifest for metadata; never load all chapter content into context at once. If you need to inspect a specific chapter, read that single file only.
- NEVER read the entire EPUB file into context — the Python script handles all file I/O
- Only load manifest.json for chapter listings and progress tracking
- Read individual chapter files only when the user asks about specific content
- Compress context every 5 chapters when reviewing or processing sequentially: write `workflow_state.md` and release earlier details
- The script prints only the TOC summary (chapter numbers and titles), not chapter content, to stdout
- File not found: Verify the path with the user, suggest checking for typos
- No TOC found: The script falls back to spine items and extracts real titles from HTML headings (h1/h2/h3) or source filenames
- Encoding issues: The script tries multiple encodings (UTF-8, GBK, Latin-1); if all fail, report the specific file
- Empty chapters: Some chapters may have no content (e.g., cover pages, blank separator pages); these are marked with a placeholder in the MD file
- Image extraction failures: Individual image failures are logged as warnings; the conversion continues
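The "No TOC found" fallback above could look roughly like the following sketch, built on the standard library's `html.parser`. It is an illustration rather than the script's implementation: `fallback_title` is a hypothetical name, and preferring h1 over h2 over h3 is an assumption about the priority order.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath


class _HeadingFinder(HTMLParser):
    """Collects the first non-empty text of each h1/h2/h3 element."""

    def __init__(self):
        super().__init__()
        self._current = None
        self._buf = []
        self.headings = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3") and self._current is None and tag not in self.headings:
            self._current = tag
            self._buf = []

    def handle_data(self, data):
        if self._current is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse whitespace, including text split across inline tags.
            text = " ".join("".join(self._buf).split())
            if text:
                self.headings[tag] = text
            self._current = None


def fallback_title(html: str, src: str) -> str:
    finder = _HeadingFinder()
    finder.feed(html)
    finder.close()
    for tag in ("h1", "h2", "h3"):  # assumed priority order
        if tag in finder.headings:
            return finder.headings[tag]
    # No heading found: fall back to the source filename without extension.
    return PurePosixPath(src).stem
```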
- Batch processing: To process multiple EPUB files, run the script once per file
- Custom output location: Use `--output-dir` to specify where the chapter files go
- Resuming: Check `manifest.json` to see which chapters were successfully extracted; if all have `"status": "done"`, the extraction is complete
- Large books: For books with 50+ chapters, rely on `workflow_state.md` and `manifest.json` rather than trying to track every chapter in conversation context