name epub2md
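description Splits an EPUB e-book into Markdown files, output by chapter in a multi-level directory structure. Use this skill when the user mentions EPUB, epub, e-book splitting, EPUB to Markdown, epub to md, or chapter splitting, or needs to convert an EPUB file into Markdown documents. It also applies when the user wants to extract content from an EPUB, read an EPUB chapter by chapter, or batch-process EPUB files. Do NOT use for PDF, DOCX, or other non-EPUB formats.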

Epub2md — EPUB to Markdown Chapter Splitter

Overview

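Splits an EPUB e-book into separate Markdown files according to its table of contents (TOC), preserving the multi-level directory structure. Image attachments are extracted to the images/ directory, and image references in each Markdown file are rewritten with the correct relative path for the directory the chapter lives in. Chapters are named from their TOC titles when available; when there is no TOC, the real title is extracted from the HTML content. All processing is done by a Python script, so the full file content is never loaded into context.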

Workflow

Copy this checklist and track progress:

Epub2md Progress:
- [ ] Step 1: Validate EPUB path and resolve to absolute path
- [ ] Step 2: Run epub2md.py to split EPUB
- [ ] Step 3: Review chapter list from manifest
- [ ] Step 4: Verify output structure

Context Management: This skill handles potentially large EPUB files. The Python script does all heavy lifting (HTML parsing, image extraction, markdown conversion) outside of the conversation context. Only the chapter directory (TOC) and manifest metadata are brought into context — never the full chapter content. After every 5 chapters processed, a context compression checkpoint is performed.

Step 1: Validate EPUB path and resolve to absolute path

The user may provide the EPUB path in various formats:

  • Relative path: ./books/mybook.epub or mybook.epub
  • Windows absolute path: D:\Books\mybook.epub or C:\Users\name\Downloads\book.epub
  • macOS absolute path: /Users/name/Downloads/book.epub

Resolution rules:

  1. If the path is relative, resolve it against the current working directory
  2. If the path contains backslashes (Windows), convert to forward slashes for Python compatibility
  3. Verify the file exists and has .epub extension
  4. If the file is not found, ask the user to verify the path
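The resolution rules above can be sketched as a small helper. This is a minimal illustration, not code taken from the skill: the name `resolve_epub_path` and its error messages are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_epub_path(raw: str, cwd: Optional[Path] = None) -> Path:
    """Hypothetical helper that applies the four resolution rules in order."""
    # Rule 2: normalize Windows backslashes to forward slashes.
    normalized = raw.strip().strip('"').replace("\\", "/")
    path = Path(normalized)
    # Rule 1: resolve relative paths against the current working directory.
    if not path.is_absolute():
        path = (cwd or Path.cwd()) / path
    path = path.resolve()
    # Rule 3: check the extension, then check that the file exists.
    if path.suffix.lower() != ".epub":
        raise ValueError(f"not an .epub file: {path}")
    if not path.is_file():
        # Rule 4: the caller should ask the user to verify the path.
        raise FileNotFoundError(f"EPUB not found: {path}")
    return path
```

Note that a Windows-style path such as D:/Books/mybook.epub is only treated as absolute when the code runs on Windows; on other platforms this sketch would resolve it against the working directory.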

Step 2: Run epub2md.py to split EPUB

Execute the Python script to perform the full split:

python <skill_dir>/scripts/epub2md.py <epub_path> [--output-dir <dir>]

Where:

  • <skill_dir> is the directory containing this SKILL.md
  • <epub_path> is the absolute path resolved in Step 1
  • --output-dir is optional; defaults to a directory named after the EPUB file in the same location as the EPUB
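As an illustration of the invocation above, the following sketch builds the argument list and computes the default output location described for --output-dir. The helper names `build_command` and `default_output_dir` are hypothetical; only the script path and the --output-dir flag come from this document.

```python
import sys
from pathlib import Path


def default_output_dir(epub_path):
    """Directory named after the EPUB file, next to the EPUB itself."""
    epub = Path(epub_path)
    return epub.parent / epub.stem


def build_command(skill_dir, epub_path, output_dir=None):
    """Argument list for running epub2md.py with the current interpreter."""
    script = Path(skill_dir) / "scripts" / "epub2md.py"
    cmd = [sys.executable, str(script), str(epub_path)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.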

No external Python dependencies are required: the script uses only the Python standard library (zipfile, html.parser, re, json, etc.), so there is nothing to pip install.

The script will:

  1. Parse the EPUB structure (container.xml → OPF → spine/manifest)
  2. Extract the table of contents (NCX for EPUB2, NAV for EPUB3, or fallback to spine)
  3. Extract all images to images/ directory
  4. Convert each chapter's HTML to Markdown
  5. Fix image references to point to the images/ directory (relative paths computed per-chapter)
  6. Create the directory structure matching the TOC hierarchy
  7. Write each chapter as a separate .md file
  8. Generate manifest.json with complete metadata
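Step 1 of the list above (container.xml → OPF → spine/manifest) can be sketched with the standard library as follows. This is an assumption about how such parsing might look, not the script's actual code: the function name `read_spine` is hypothetical, and using `xml.etree.ElementTree` is a choice made for this example.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def read_spine(epub_file):
    """Return the spine documents as archive paths, in reading order."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
    opf_dir = posixpath.dirname(opf_path)
    # The manifest maps item ids to hrefs relative to the OPF file.
    hrefs = {
        item.get("id"): item.get("href")
        for item in opf.findall("opf:manifest/opf:item", OPF_NS)
    }
    spine = []
    for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
        href = hrefs.get(itemref.get("idref"))
        if href:
            spine.append(posixpath.normpath(posixpath.join(opf_dir, href)))
    return spine
```

The spine order is what the TOC fallback would rely on when neither an NCX nor a NAV document is present.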

Important: The script outputs only the chapter directory (titles and hierarchy) to stdout, NOT the chapter content. This keeps context usage minimal.

Step 3: Review chapter list from manifest

After the script completes, read <output_dir>/manifest.json (chapter titles and structure only, not chapter content) and present to the user:

  • Total number of chapters
  • Directory tree structure
  • Number of images extracted
  • Any warnings or issues
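A summary like this can be produced from the manifest alone. The sketch below assumes the field names shown under Manifest Format; the function name `summarize_manifest` is hypothetical.

```python
import json
from pathlib import Path


def summarize_manifest(manifest_path):
    """Build a short text summary without opening any chapter file."""
    data = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    lines = [
        f"{data['book_title']}: {data['total_chapters']} chapters, "
        f"{data['image_count']} images"
    ]
    for chapter in data["chapters"]:
        # Indent each title by its TOC level to show the hierarchy.
        lines.append("  " * chapter["level"] + f"- {chapter['title']}")
    return "\n".join(lines)
```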

Context Compression Checkpoint (every 5 chapters during review)

If there are more than 5 chapters, do NOT read all chapter files into context. Instead:

  1. Read only manifest.json for the chapter list
  2. After every 5 chapters reviewed, write a workflow_state.md file in the output directory:
# Epub2md Workflow State

## Source
- File: <epub_filename>
- Path: <absolute_path_to_epub>

## Output
- Directory: <absolute_path_to_output_dir>
- Total chapters: N
- Images: M (in images/)
- Manifest: <absolute_path_to_manifest.json>

## Current Progress
- Chapters reviewed: X/N
- Last reviewed chapter: <title>

## Chapter Structure Summary
- Level 0 chapters: [list of top-level chapter titles]
- Level 1 chapters: [list of second-level chapter titles]
- ...
  3. State: "Context checkpoint saved. Previous chapter details released from context. Progress preserved in workflow_state.md."

Why this matters: A book with 30+ chapters would consume enormous context if every chapter's content were loaded. By loading only the manifest and compressing every 5 chapters, context usage stays bounded by the size of the manifest rather than growing with the book's content.
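One way to generate the checkpoint file from the manifest is sketched below. The helper names `needs_checkpoint` and `render_state` are hypothetical, and the manifest fields are assumed to match the Manifest Format section.

```python
CHECKPOINT_EVERY = 5  # interval stated in this workflow


def needs_checkpoint(reviewed):
    """True after every fifth reviewed chapter."""
    return reviewed > 0 and reviewed % CHECKPOINT_EVERY == 0


def render_state(manifest, epub_path, reviewed):
    """Render the workflow_state.md text from manifest metadata only."""
    chapters = manifest["chapters"]
    total = manifest["total_chapters"]
    out_dir = manifest["output_dir"]
    last_title = chapters[reviewed - 1]["title"] if reviewed else "(none)"
    lines = [
        "# Epub2md Workflow State", "",
        "## Source",
        f"- File: {manifest['source_file']}",
        f"- Path: {epub_path}", "",
        "## Output",
        f"- Directory: {out_dir}",
        f"- Total chapters: {total}",
        f"- Images: {manifest['image_count']} (in {manifest['attachments_dir']}/)",
        f"- Manifest: {out_dir}/manifest.json", "",
        "## Current Progress",
        f"- Chapters reviewed: {reviewed}/{total}",
        f"- Last reviewed chapter: {last_title}", "",
        "## Chapter Structure Summary",
    ]
    # One line per TOC level, listing the titles found at that level.
    for level in sorted({ch["level"] for ch in chapters}):
        titles = [ch["title"] for ch in chapters if ch["level"] == level]
        lines.append(f"- Level {level} chapters: {', '.join(titles)}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to workflow_state.md in the output directory whenever `needs_checkpoint` is true keeps the progress record outside the conversation.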

Step 4: Verify output structure

  1. Verify the output directory contains:
    • Chapter .md files in the expected directory structure
    • images/ directory with images
    • manifest.json with all chapters listed as "status": "done"
  2. Optionally read 1-2 chapter files to spot-check content quality
  3. Report the final result to the user
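The checks in item 1 can be automated along these lines. This is a sketch under the assumption that manifest.json uses the fields shown under Manifest Format; `verify_output` is a hypothetical name, not part of the script.

```python
import json
from pathlib import Path


def verify_output(output_dir):
    """Return a list of problems; an empty list means every check passed."""
    out = Path(output_dir)
    manifest_file = out / "manifest.json"
    if not manifest_file.is_file():
        return ["manifest.json is missing"]
    data = json.loads(manifest_file.read_text(encoding="utf-8"))
    problems = []
    if not (out / data["attachments_dir"]).is_dir():
        problems.append(f"{data['attachments_dir']}/ directory is missing")
    for ch in data["chapters"]:
        if ch["status"] != "done":
            problems.append(
                f"chapter {ch['index']} ({ch['title']}) has status {ch['status']!r}"
            )
        # dir_path is empty for root-level chapters.
        if not (out / ch["dir_path"] / ch["filename"]).is_file():
            problems.append(f"missing file for chapter {ch['index']}: {ch['filename']}")
    if len(data["chapters"]) != data["total_chapters"]:
        problems.append("total_chapters does not match the chapter list")
    return problems
```

Only file existence and manifest metadata are inspected here, so the check itself never pulls chapter content into context.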

Output Structure

Given an EPUB named MyBook.epub, the output looks like:

MyBook/
├── 001_Foreword.md
├── 002_Chapter One/
│   ├── 002_Chapter One.md
│   ├── 003_Section 1.1.md
│   └── 004_Section 1.2.md
├── 005_Chapter Two/
│   ├── 005_Chapter Two.md
│   └── 006_Section 2.1.md
├── images/
│   ├── cover.png
│   ├── fig1.jpg
│   └── fig2.png
└── manifest.json

Each .md file contains the chapter content converted from HTML to Markdown, with image references pointing to the images/ directory using correct relative paths (e.g., images/cover.png for root-level chapters, ../images/cover.png for subdirectory chapters).
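The per-chapter relative path can be computed from the chapter's directory alone. The following is a minimal sketch of that calculation, assuming the chapter directory is given relative to the output root (as `dir_path` is in the manifest); `image_ref` is a hypothetical helper, not the script's own function.

```python
import posixpath


def image_ref(chapter_dir_path, image_name, images_dir="images"):
    """Relative path from a chapter's directory to an image under images/."""
    # An empty dir_path means the chapter sits in the output root.
    start = chapter_dir_path or "."
    target = posixpath.join(images_dir, image_name)
    return posixpath.relpath(target, start=start)


print(image_ref("", "cover.png"))                 # images/cover.png
print(image_ref("002_Chapter One", "cover.png"))  # ../images/cover.png
```

Using `posixpath` rather than `os.path` keeps forward slashes in the Markdown references regardless of the operating system.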

Manifest Format

manifest.json contains:

{
  "source_file": "MyBook.epub",
  "book_title": "My Book",
  "output_dir": "/path/to/MyBook",
  "total_chapters": 6,
  "attachments_dir": "images",
  "image_count": 3,
  "chapters": [
    {
      "index": 0,
      "title": "Foreword",
      "src": "foreword.xhtml",
      "dir_path": "",
      "filename": "001_Foreword.md",
      "full_path": "/path/to/MyBook/001_Foreword.md",
      "level": 0,
      "status": "done"
    }
  ]
}

Only use the manifest for metadata — never load all chapter content into context at once. If you need to inspect a specific chapter, read that single file only.

Context Management Rules

  1. NEVER read the entire EPUB file into context — the Python script handles all file I/O
  2. Only load manifest.json for chapter listings and progress tracking
  3. Read individual chapter files only when the user asks about specific content
  4. Compress context every 5 chapters when reviewing or processing sequentially — write workflow_state.md and release earlier details
  5. The script prints only the TOC summary (chapter numbers and titles), not chapter content, to stdout

Error Handling

  • File not found: Verify the path with the user, suggest checking for typos
  • No TOC found: The script falls back to spine items and extracts real titles from HTML headings (h1/h2/h3) or source filenames
  • Encoding issues: The script tries multiple encodings (UTF-8, GBK, Latin-1); if all fail, report the specific file
  • Empty chapters: Some chapters may have no content (e.g., cover pages, blank separator pages); these are marked with a placeholder in the MD file
  • Image extraction failures: Individual image failures are logged as warnings; the conversion continues
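The encoding fallback described above might look like the following sketch. The function name `decode_with_fallback` and the exact ordering are assumptions for illustration; the script's real behavior may differ. One caveat worth knowing: Latin-1 maps every byte to a character, so a chain that ends with it always returns some text, and the final error path is only reachable if Latin-1 is left out.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk", "latin-1")):
    """Try each encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Reached only when no candidate encoding accepts the bytes.
    raise ValueError("could not decode content with: " + ", ".join(encodings))
```

Because a Latin-1 result can be silently wrong for text in other encodings, it may be useful to log which encoding was used for each file.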

Tips

  • Batch processing: To process multiple EPUB files, run the script once per file
  • Custom output location: Use --output-dir to specify where the chapter files go
  • Resuming: Check manifest.json to see which chapters were successfully extracted; if all have "status": "done", the extraction is complete
  • Large books: For books with 50+ chapters, rely on workflow_state.md and manifest.json rather than trying to track every chapter in conversation context