AdkinsHan/epub2md

epub2md

Split EPUB ebooks into Markdown files, preserving the table of contents hierarchy as a multi-level directory structure.

Why This Skill?

Reading English-language EPUBs is a core part of language learning for millions of readers in China and across Asia. But the workflow is painful: you toggle between the book and a dictionary, copy-paste paragraphs into translation tools, and lose your place, over and over.

epub2md was built to change that. By splitting an EPUB into per-chapter Markdown files, you unlock a fundamentally better reading loop:

  • Feed chapters directly into AI — Drop a chapter into Claude, ChatGPT, or any LLM for instant translation, vocabulary annotation, or bilingual interleaving. No more copy-pasting from a clunky e-reader.
  • Bilingual interleaved reading — Markdown makes it trivial to produce paragraph-by-paragraph bilingual text (original + translation), which matches how most Asian learners actually read: English paragraph first, Chinese translation below, repeat. This interleaved rhythm keeps you in the flow instead of bouncing between tabs.
  • Chapter-level context control: LLMs have context limits. Splitting by chapter keeps each chunk small enough to fit in a typical context window, which reduces truncation and the quality loss that comes from over-stuffing a prompt.
  • Markdown as universal format — Notes, highlights, and AI annotations live alongside the text. Edit in Obsidian, VS Code, or any Markdown editor. Version-control your study notes with Git. The format is yours.

In short: epub2md turns a locked-up ebook into a learner-ready, AI-ready, Markdown-native study kit.

Features

  • Zero dependencies — uses only the Python standard library (zipfile, html.parser, re, json, etc.)
  • TOC-aware splitting — reads NCX (EPUB2) and NAV (EPUB3) tables of contents to build the chapter hierarchy
  • Directory structure — creates nested folders matching the book's chapter hierarchy
  • Image extraction — saves all images to an images/ directory and fixes Markdown references with correct relative paths
  • Smart title enhancement — automatically enriches thin titles (e.g. "1", "Part II") by extracting descriptive text from the HTML content
  • Spine fallback — when no TOC is available, falls back to the OPF spine order and extracts real titles from HTML headings
  • Manifest tracking — generates manifest.json with full chapter metadata for progress tracking and resuming
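
As an illustration of TOC-aware splitting, the sketch below reads an EPUB2 NCX document with only the standard library and flattens it into (level, title, src) entries in reading order. It assumes the standard NCX layout (a navMap containing nested navPoint elements). The function name flatten_ncx and the tuple shape are hypothetical and are not taken from scripts/epub2md.py, which may work differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by EPUB2 NCX documents.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def flatten_ncx(ncx_xml):
    """Return (level, title, src) tuples in reading order (hypothetical helper)."""
    root = ET.fromstring(ncx_xml)
    entries = []

    def walk(parent, level):
        # findall() without "//" matches direct children only, so nesting is preserved.
        for point in parent.findall("ncx:navPoint", NCX_NS):
            label = point.find("ncx:navLabel/ncx:text", NCX_NS)
            content = point.find("ncx:content", NCX_NS)
            title = (label.text or "").strip() if label is not None else ""
            src = content.get("src", "") if content is not None else ""
            entries.append((level, title, src))
            walk(point, level + 1)

    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is not None:
        walk(nav_map, 0)
    return entries


SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="p1" playOrder="1">
      <navLabel><text>Foreword</text></navLabel>
      <content src="foreword.xhtml"/>
    </navPoint>
    <navPoint id="p2" playOrder="2">
      <navLabel><text>Chapter One</text></navLabel>
      <content src="ch1.xhtml"/>
      <navPoint id="p3" playOrder="3">
        <navLabel><text>Section 1.1</text></navLabel>
        <content src="ch1.xhtml#s1"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>"""

for level, title, src in flatten_ncx(SAMPLE_NCX):
    print("  " * level + title, "->", src)
```

The level value is what would let a converter decide whether an entry becomes a top-level file or a file inside a chapter folder.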

Usage

CLI

python scripts/epub2md.py <input.epub> [--output-dir <dir>]
  • <input.epub> — path to the EPUB file
  • --output-dir — output directory (defaults to a directory named after the EPUB file in the same location)
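
The default output location described above can be sketched as follows. This is an assumption about how the two arguments might be handled, not the script's actual code; build_parser and resolve_output_dir are hypothetical names.

```python
import argparse
from pathlib import Path


def build_parser():
    # Mirrors the documented CLI: one positional EPUB path, one optional flag.
    parser = argparse.ArgumentParser(prog="epub2md.py")
    parser.add_argument("epub", help="path to the EPUB file")
    parser.add_argument("--output-dir", default=None, help="output directory")
    return parser


def resolve_output_dir(argv):
    """Return the output directory for the given argument list (hypothetical helper)."""
    args = build_parser().parse_args(argv)
    if args.output_dir:
        return Path(args.output_dir)
    epub = Path(args.epub)
    # Default: a sibling directory named after the EPUB file, minus ".epub".
    return epub.parent / epub.stem


print(resolve_output_dir(["books/MyBook.epub"]))
print(resolve_output_dir(["books/MyBook.epub", "--output-dir", "out"]))
```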

Claude Code Skill

This project doubles as a Claude Code skill.

Install the Skill

Clone this repository into your Claude Code skills folder:

# macOS / Linux
git clone https://github.com/AdkinsHan/epub2md.git ~/.claude/skills/epub2md

# Windows (PowerShell)
git clone https://github.com/AdkinsHan/epub2md.git "$env:USERPROFILE\.claude\skills\epub2md"

Or if you've already cloned it elsewhere:

# macOS / Linux
cp -r /path/to/epub2md ~/.claude/skills/epub2md

# Windows (PowerShell)
Copy-Item -Recurse "D:\path\to\epub2md" "$env:USERPROFILE\.claude\skills\epub2md"

Restart Claude Code or reload the session — the /epub2md command will be available automatically.

Use the Skill

In a Claude Code session, simply type:

/epub2md <path-to-book.epub>

The skill will:

  1. Validate and resolve the EPUB file path
  2. Run the Python script to split the EPUB into chapters
  3. Present the chapter list and output structure for review
  4. Verify the results

You can also specify a custom output directory:

/epub2md /path/to/book.epub --output-dir /path/to/output

Or simply mention an EPUB file in conversation — the skill auto-triggers on keywords like "EPUB", "epub to markdown", or "split ebook".

Output Structure

Given MyBook.epub, the output looks like:

MyBook/
├── 001_Foreword.md
├── 002_Chapter One/
│   ├── 002_Chapter One.md
│   ├── 003_Section 1.1.md
│   └── 004_Section 1.2.md
├── 005_Chapter Two/
│   ├── 005_Chapter Two.md
│   └── 006_Section 2.1.md
├── images/
│   ├── cover.png
│   ├── fig1.jpg
│   └── fig2.png
└── manifest.json

Image references in Markdown files use correct relative paths (e.g. images/cover.png for root-level chapters, ../images/cover.png for nested chapters).
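
One way to compute such relative references is shown below. It is a sketch under the assumption that chapter directories are expressed relative to the output root with forward slashes; image_ref is a hypothetical helper, not a function from the script.

```python
import posixpath


def image_ref(chapter_dir, image_name, images_dir="images"):
    """Return the path from a chapter's directory to an image (hypothetical helper).

    chapter_dir is relative to the output root; "" means the root itself.
    """
    target = posixpath.join(images_dir, image_name)
    return posixpath.relpath(target, start=chapter_dir or ".")


print(image_ref("", "cover.png"))                 # root-level chapter
print(image_ref("002_Chapter One", "cover.png"))  # chapter inside a folder
```

Using posixpath rather than os.path keeps the separators as forward slashes on every platform, which is what Markdown image links expect.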

Manifest Format

manifest.json contains:

{
  "source_file": "MyBook.epub",
  "book_title": "My Book",
  "output_dir": "/path/to/MyBook",
  "total_chapters": 6,
  "attachments_dir": "images",
  "image_count": 3,
  "chapters": [
    {
      "index": 0,
      "title": "Foreword",
      "src": "foreword.xhtml",
      "dir_path": "",
      "filename": "001_Foreword.md",
      "full_path": "/path/to/MyBook/001_Foreword.md",
      "level": 0,
      "status": "done"
    }
  ]
}
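
A manifest in this shape can be consumed by downstream tooling, for example to find chapters that still need processing. The sketch below assumes only the fields shown above; "done" is the only status value documented here, so the "pending" value in the sample data is made up for illustration, and load_manifest and unfinished_chapters are hypothetical names.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read manifest.json from disk and return it as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def unfinished_chapters(manifest):
    """Return chapters whose status is anything other than "done"."""
    return [ch for ch in manifest.get("chapters", []) if ch.get("status") != "done"]


# Sample data; the "pending" status is an assumption, not a documented value.
sample = {
    "total_chapters": 2,
    "chapters": [
        {"index": 0, "filename": "001_Foreword.md", "status": "done"},
        {"index": 1, "filename": "002_Chapter One.md", "status": "pending"},
    ],
}

for chapter in unfinished_chapters(sample):
    print(chapter["index"], chapter["filename"])
```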

HTML to Markdown Conversion

Supports common EPUB HTML elements:

HTML                   Markdown
<h1> to <h6>           # to ######
<p>                    Paragraph with blank line
<strong>, <b>          **bold**
<em>, <i>              *italic*
<code>                 `inline`
<pre>                  Fenced code block
<ul>, <ol>, <li>       - item / 1. item
<blockquote>           > quote
<a>                    [text](href)
<img>                  ![alt](src)
<hr>                   ---
<dl>, <dt>, <dd>       Definition lists
<sup>, <sub>           ^superscript, ~subscript
HTML entities          Unicode characters
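
To show how a few of these mappings can be implemented with html.parser alone, here is a deliberately small sketch. It covers only headings, paragraphs, bold, italic, inline code, links, images, horizontal rules, and entities. The class MiniMarkdown and the function html_to_markdown are hypothetical and far simpler than a real converter would need to be (no lists, no nesting rules, no whitespace normalization).

```python
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Tiny HTML-to-Markdown converter for a handful of tags (illustrative only)."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}
    HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag] + " ")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.hrefs.append(attr.get("href") or "")
            self.parts.append("[")
        elif tag == "img":
            self.parts.append(f"![{attr.get('alt') or ''}]({attr.get('src') or ''})")
        elif tag == "hr":
            self.parts.append("---\n\n")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")  # block elements end with a blank line
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a" and self.hrefs:
            self.parts.append(f"]({self.hrefs.pop()})")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(html_to_markdown(
    '<h2>Title</h2><p>Some <strong>bold</strong> and <em>italic</em> text '
    'with a <a href="ch2.xhtml">link</a>.</p>'
))
```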

Limitations & Caveats

  • No table support — HTML <table> elements are preserved as-is or stripped; complex tables will not render correctly in Markdown.
  • No CSS styling — Font sizes, colors, column layouts, and other CSS-driven formatting are lost during conversion.
  • Footnotes & endnotes — EPUB footnote links and back-references are not converted to Markdown footnote syntax ([^1]); they become plain links.
  • SVG images — Only raster images (PNG, JPG, etc.) are extracted. SVG and embedded base64 images are not handled.
  • DRM-protected EPUBs — Files with digital rights management will fail to open. This tool only works with DRM-free EPUBs.
  • Encoding edge cases: The script tries UTF-8, GBK, and Latin-1 in turn, so EPUBs that use other encodings may produce garbled text. Check the output if the source file uses an unusual encoding.
  • Merged chapters — Some EPUBs pack multiple logical chapters into a single HTML file. The script splits by TOC entries, not by internal headings, so these will appear as one long chapter.
  • Not a full ebook reader — This tool is designed for extraction and conversion, not for reading. It does not preserve reading position, bookmarks, or annotations from the original EPUB.
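
The encoding caveat above is easier to see with a sketch of a try-in-order decoder. This assumes the three encodings named in the list are attempted in that order; decode_with_fallback is a hypothetical helper, not the script's actual function.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk", "latin-1")):
    """Return (text, encoding_used), trying each encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reachable if a custom list without latin-1 is passed in.
    return raw.decode("utf-8", errors="replace"), "utf-8"


print(decode_with_fallback("héllo".encode("utf-8")))
print(decode_with_fallback("中文".encode("gbk")))
print(decode_with_fallback(b"caf\xe9"))
```

Latin-1 accepts every byte sequence, so it never raises. That makes it a safe last resort, but it also means text in an encoding outside this list can decode without error and still come out garbled.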

License

Apache License 2.0
