A high-quality tool for adapting EPUB content using LLMs. Users can make changes ranging from simple cleanup to full genre transformations, with built-in quality controls to preserve author voice and ensure consistency.
Empower users to adapt books however they see fit, while maintaining the highest possible quality.
The tool should:
- Support the full spectrum of changes (typo fixes → genre transformations)
- Quantify and preserve author voice/style
- Ensure consistency across chapters
- Give users full control and transparency
- Make quality the default, not an afterthought
┌─────────────────────────────────────────────────────────────────┐
│ EDIT - Change existing content (cleanup → transformation) │
│ TRANSFORM - Major adaptation (genre, setting, plot) │
│ ANNOTATE - Add commentary layer, original unchanged │
└─────────────────────────────────────────────────────────────────┘
MINIMAL CHANGES                                        EXTENSIVE CHANGES
     │                                                       │
     ▼                                                       ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐
│ Cleanup │ │ Filtering│ │ Style │ │ Plot │ │ Genre/Setting │
│ │ │ │ │ Adapt │ │ Changes │ │ Transformation │
├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤ ├──────────────────┤
│ OCR fix │ │ Content │ │ Modernize│ │ Character│ │ Steampunk LOTR │
│ Typos │ │ removal │ │ language │ │ arcs │ │ Sci-fi → Fantasy │
│ Format │ │ Age-gate │ │ Simplify │ │ Endings │ │ Period changes │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────────────┘
     │            │            │            │                │
 ~5% drift   ~15% drift   ~30% drift   ~50% drift       ~80% drift
  (safe)     (moderate)   (notable)   (significant)    (derivative)
┌─────────────────────────────────────────────────────────────────┐
│ COMMENTARY STYLES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Scholarly - Literary analysis, sources, references │
│ Historical - Period context, author biography, events │
│ Educational - Vocabulary, concepts, explanations │
│ Devil's Advocate - Challenge assumptions, alternative views │
│ Thematic - Connections to other works, parallels │
│ Personal Lens - User-specified perspective (economic, etc.) │
│ Fun Facts - Trivia, behind-the-scenes, inspirations │
│ Funny - Humorous observations, witty asides │
│ Cross-Reference - Links to other texts, author's other works │
│ │
└─────────────────────────────────────────────────────────────────┘
Output formats: Footnotes, Endnotes, Commentary Chapters
- Style Profiler: Analyze source EPUBs to extract quantified author fingerprint
- Drift Scoring: Measure how much output deviates from original voice
- Style-Aware Prompts: Feed metrics to LLM to guide rewrites
- Multi-Source Profiles: Build robust profiles from multiple works by same author
- Book Model: Extract characters, locations, timeline, relationships
- Change Planning: Map ripple effects before execution
- Cross-Chapter Validation: Detect contradictions introduced by edits
- Reference Tracking: Ensure modified elements stay consistent
- HTML Preservation: Attributes, classes, formatting untouched
- Surgical Edits: Only modify targeted paragraphs
- EPUB Standards: Valid, readable output files
- Audit Trail: Detailed change reports (before/after)
- Drift Reports: Per-chapter style deviation scores
- Dry-Run Mode: Preview changes without applying (with caching)
- Verbose Logging: Full visibility into process
- Content Categorization: Clear FILTER/BORDERLINE/CLEAN classifications
- Configurable Thresholds: "Warn if drift > 30%"
- Approval Workflows: Review flagged changes before applying
- Granular Prompts: Full control over analysis/rewrite behavior
- CLI Flexibility: Override any setting via command line
- Borderline Review: User decides on edge-case content
- Persistent Preferences: Save filtering decisions across sessions
- Filter Profiles: Quick presets (strict/moderate/minimal)
┌─────────────────────────────────────────────────────────┐
│ INPUT: One or more EPUBs by target author │
│ │
│ PROCESS: │
│ - Extract all text content │
│ - Analyze linguistic features: │
│ - Sentence length distribution │
│ - Vocabulary richness / word frequencies │
│ - Punctuation patterns │
│ - Function word ratios │
│ - Common phrases / n-grams │
│ - Part-of-speech patterns │
│ - Style embeddings (sentence transformers) │
│ │
│ OUTPUT: author_profile.json │
└─────────────────────────────────────────────────────────┘
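A few of the statistical features listed above can be sketched with the standard library alone. This is a minimal illustration, not the profiler's actual implementation: the function name `profile_text`, the output keys, and the short `FUNCTION_WORDS` list are all assumptions made for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Hypothetical, deliberately short list; a real profiler would use a fuller set.
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "of", "to", "in", "that", "it"}

WORD_RE = re.compile(r"[A-Za-z']+")


def profile_text(text: str) -> dict:
    """Compute a handful of style features for one block of plain text."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD_RE.findall(text.lower())
    lengths = [len(WORD_RE.findall(s)) for s in sentences]
    n_words = len(words)
    punctuation = sum(text.count(mark) for mark in ",;:!?")
    bigrams = Counter(zip(words, words[1:]))
    return {
        "sentence_length_mean": mean(lengths) if lengths else 0.0,
        "sentence_length_stdev": pstdev(lengths) if lengths else 0.0,
        # Vocabulary richness as a simple type/token ratio
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "function_word_ratio": (
            sum(w in FUNCTION_WORDS for w in words) / n_words if n_words else 0.0
        ),
        "punctuation_per_100_words": 100 * punctuation / n_words if n_words else 0.0,
        "common_bigrams": [" ".join(pair) for pair, _ in bigrams.most_common(3)],
    }


if __name__ == "__main__":
    sample = "The road goes on. The road goes ever on and on!"
    print(profile_text(sample))
```

Part-of-speech patterns and style embeddings would need third-party libraries and are left out of this sketch.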
┌─────────────────────────────────────────────────────────┐
│ INPUT: Target EPUB │
│ │
│ PROCESS: │
│ - Parse structure (chapters, sections) │
│ - Extract entities (characters, locations, objects) │
│ - Map relationships and timeline │
│ - Identify key plot points │
│ │
│ OUTPUT: book_model.json │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ INPUT: book_model.json + user's change request │
│ │
│ PROCESS: │
│ - Interpret user intent │
│ - Identify affected chapters/passages │
│ - Map ripple effects │
│ - Generate change_plan.json │
│ │
│ OUTPUT: change_plan.json │
│ (List of specific modifications per chapter) │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ INPUT: EPUB + prompts + (optional) profile + plan │
│ │
│ FOR EACH CHAPTER: │
│ 1. Extract paragraphs │
│ 2. Analysis pass → FILTER or CLEAN │
│ 3. If FILTER: Cleaning pass → selective rewrites │
│ 4. Apply changes to HTML │
│ 5. Measure style drift (if profile exists) │
│ 6. Log changes │
│ │
│ OUTPUT: Modified EPUB │
└─────────────────────────────────────────────────────────┘
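The per-chapter loop above could look roughly like the following sketch. The analysis and cleaning passes are passed in as plain callables so the example runs without an LLM; `process_chapter`, `ChapterResult`, and the callable signatures are hypothetical names, and the HTML write-back (step 4) is reduced to replacing strings in a list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class ChapterResult:
    paragraphs: List[str]
    changes: List[dict] = field(default_factory=list)
    drift: Optional[float] = None


def process_chapter(
    paragraphs: Sequence[str],
    analyze: Callable[[str], str],          # returns "FILTER" or "CLEAN"
    rewrite: Callable[[str], str],          # cleaning pass for flagged text
    measure_drift: Optional[Callable[[Sequence[str], Sequence[str]], float]] = None,
) -> ChapterResult:
    """Run steps 2-6 on paragraphs that were already extracted (step 1)."""
    result = ChapterResult(paragraphs=list(paragraphs))
    for index, original in enumerate(paragraphs):
        if analyze(original) != "FILTER":
            continue  # CLEAN paragraphs are never touched
        cleaned = rewrite(original)
        if cleaned != original:
            result.paragraphs[index] = cleaned
            # Step 6: keep a before/after record for the audit trail
            result.changes.append(
                {"index": index, "before": original, "after": cleaned}
            )
    if measure_drift is not None:
        # Step 5: only measured when a profile-backed scorer is supplied
        result.drift = measure_drift(paragraphs, result.paragraphs)
    return result


if __name__ == "__main__":
    demo = process_chapter(
        ["A quiet morning.", "A damn loud evening."],
        analyze=lambda p: "FILTER" if "damn" in p else "CLEAN",
        rewrite=lambda p: p.replace("damn ", ""),
    )
    print(demo.changes)
```

Injecting the passes as callables also makes it easy to swap in cached dry-run results instead of live API calls.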
┌─────────────────────────────────────────────────────────┐
│ INPUT: Modified EPUB + book_model + author_profile │
│ │
│ CHECKS: │
│ - Consistency scan (contradictions, broken refs) │
│ - Style drift report (per-chapter scores) │
│ - Structural validation (valid EPUB) │
│ │
│ OUTPUT: │
│ - validation_report.json │
│ - Warnings for user review │
└─────────────────────────────────────────────────────────┘
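As one illustration of the drift part of this stage, the sketch below turns per-chapter drift scores into warnings against a configurable threshold. It assumes drift is already available as a percentage per chapter; the function names and the report keys are hypothetical, not the actual validation_report.json format.

```python
import json
from typing import Dict, List


def drift_warnings(chapter_drift: Dict[str, float], max_drift: float = 30.0) -> List[str]:
    """Return one warning per chapter whose drift exceeds the threshold."""
    return [
        f"{chapter}: drift {score:.1f}% exceeds {max_drift:.1f}%"
        for chapter, score in chapter_drift.items()
        if score > max_drift
    ]


def build_report(chapter_drift: Dict[str, float], max_drift: float = 30.0) -> str:
    """Serialize scores and warnings; the key names are assumptions."""
    report = {
        "max_drift": max_drift,
        "drift_by_chapter": chapter_drift,
        "warnings": drift_warnings(chapter_drift, max_drift),
    }
    return json.dumps(report, indent=2)


if __name__ == "__main__":
    print(build_report({"chapter_01": 12.0, "chapter_02": 41.5}))
```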
- epub_cleaner.py - Main editing engine
- config.example.yaml - Configuration template
- prompts.example.yaml - Prompt templates
- requirements.txt - Dependencies
- README.md - Documentation

- style_profiler.py - Analyze EPUBs, generate author profiles
- style_validator.py - Measure drift, score output quality
- author_profile.schema.json - Profile format specification

- book_analyzer.py - Extract book model (characters, plot, timeline)
- change_planner.py - Generate modification plans
- consistency_checker.py - Cross-chapter validation
- book_model.schema.json - Model format specification

- cli.py - Unified CLI with subcommands
- workflows/ - Preset workflows (cleanup, filter, adapt, transform)
- interactive.py - Guided mode for complex changes

- annotator.py - Generate commentary for passages
- footnote_inserter.py - Insert footnotes/endnotes into EPUB
- commentary.example.yaml - Commentary style templates

- voice_composer.py - Extract and compose character voices
- character_voice.schema.json - Character voice profile format
- Voice blending for narrative (e.g., 70% Tolkien + 30% Herbert)
- Cross-work character interactions with distinct voices preserved
- constraints.example.yaml - Explicit style rules system
- epub_validator.py - E-reader compatibility validation
Content filtering is designed like VidAngel: users toggle content categories, with guaranteed filtering and control over edge cases.
- borderline_reviewer.py - Borderline content detection and user interaction
- filter_categories.yaml - External customizable content categories
- FilterPreferences class - Persistent user preferences
- Filter profiles (strict/moderate/minimal)
- Dry-run caching system for API efficiency
- Multi-pass filtering with verification
- Non-interactive batch processing mode
booksmith/
├── epub_cleaner.py # Main editing engine (with profile integration + caching)
├── borderline_reviewer.py # Borderline content detection and user interaction
├── style_profiler.py # Author style analysis (LLM-based)
├── style_validator.py # Drift measurement
├── book_analyzer.py # Book model extraction
├── change_planner.py # Modification planning
├── consistency_checker.py # Cross-chapter validation
├── annotator.py # Commentary generation
├── footnote_inserter.py # Insert annotations into EPUB
├── voice_composer.py # Character voice extraction & composition
├── epub_validator.py # E-reader compatibility validation
├── cost_estimator.py # API cost estimation before processing
├── interactive.py # Guided wizard mode
├── cli.py # Unified command-line interface
│
├── config.example.yaml # Configuration template
├── prompts.example.yaml # Prompt templates
├── filter_categories.yaml # Content filtering categories (customizable)
├── commentary.example.yaml # Commentary style templates
├── constraints.example.yaml # Style constraints template
├── requirements.txt # Dependencies
├── README.md # Documentation
├── claude.md # Development notes and roadmap
├── .gitignore # Keep secrets out of git
│
├── schemas/
│ ├── author_profile.schema.json
│ ├── book_model.schema.json
│ ├── change_plan.schema.json
│ └── character_voice.schema.json
│
├── workflows/
│ ├── cleanup.yaml # OCR/formatting fixes
│ ├── filter.yaml # Content filtering
│ ├── modernize.yaml # Language modernization
│ ├── transform.yaml # Major adaptations
│ └── annotate.yaml # Commentary presets
│
└── examples/
├── profile_tolkien.json # Sample author profile
├── voice_gandalf.json # Sample character voice
├── plan_steampunk.json # Sample change plan
├── constraints_formal.yaml # Formal writing constraints
└── constraints_modern.yaml # Modern style constraints
~/.booksmith/ # User data directory
├── filter_preferences.yaml # Saved borderline content preferences
└── cache/
└── dryrun_<hash>.json # Cached dry-run analysis results
# Simple cleanup (current functionality)
epub-cleaner edit --input book.epub --output clean.epub
# Build author profile
epub-cleaner profile --input author_works/*.epub --output tolkien.profile.json
# Edit with style preservation
epub-cleaner edit --input book.epub --profile tolkien.profile.json --max-drift 20
# Full transformation workflow
epub-cleaner transform --input lotr.epub --profile tolkien.profile.json \
--goal "Convert to steampunk setting" --output lotr_steampunk.epub
# Validate existing EPUB against profile
epub-cleaner validate --input modified.epub --profile original_author.profile.json
# Estimate API costs before processing
epub-cleaner estimate --input book.epub --workflow transform --model opus
epub-cleaner estimate --input book.epub --all-features # Include all optional features
# Add commentary/annotations
epub-cleaner annotate --input lotr.epub --output lotr_annotated.epub \
--style scholarly,historical,funny \
--format footnotes \
--frequency "2-4 per chapter"
# VidAngel-style filtering with borderline review
epub-cleaner --input book.epub --borderline-review --filter-profile moderate
# Batch processing with saved preferences
epub-cleaner --input book.epub --non-interactive --filter-profile strict
# Multi-pass filtering for thoroughness
epub-cleaner --input book.epub --passes 2 --verify
# Process specific chapters
epub-cleaner --input book.epub --chapters "1-5,10,15-20"
# Dry-run with caching (preview then apply)
epub-cleaner --input book.epub --dry-run # Analyze and cache results
epub-cleaner --input book.epub # Apply using cached analysis
epub-cleaner --input book.epub --no-cache # Force fresh analysis
epub-cleaner --input book.epub --clear-cache # Clear cached results
Annotations must follow EPUB3 accessibility standards:
<!-- In chapter HTML: footnote reference -->
<p>Frodo took the Ring<a epub:type="noteref" href="#note1" id="ref1">1</a></p>
<!-- At end of chapter or in separate notes file -->
<aside epub:type="footnote" id="note1">
<p><a href="#ref1">1.</a> Note the contrast with Isildur, who claimed
the Ring by force. Frodo's acceptance is reluctant, thrust upon him
by circumstance rather than desire.</p>
</aside>
Requirements for footnote_inserter.py:
- Generate valid EPUB3 footnote markup with epub:type attributes
- Create bidirectional links (reference ↔ footnote)
- Update content.opf manifest if adding notes file
- Add appropriate CSS for footnote styling
- Ensure compatibility with major e-readers (Kindle, Kobo, Apple Books)
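The first two requirements can be illustrated with a small helper that emits a matching reference and footnote pair in the markup style shown above. This is a sketch only: `make_footnote` is a hypothetical name, it covers notes placed in the same file as the reference, and it does not touch content.opf or CSS.

```python
from html import escape
from typing import Tuple


def make_footnote(number: int, note_text: str) -> Tuple[str, str]:
    """Return (reference_html, footnote_html) linked in both directions.

    Assumes the footnote lives in the same XHTML file as its reference,
    so both hrefs are bare fragment identifiers.
    """
    reference = (
        f'<a epub:type="noteref" href="#note{number}" '
        f'id="ref{number}">{number}</a>'
    )
    footnote = (
        f'<aside epub:type="footnote" id="note{number}">'
        f'<p><a href="#ref{number}">{number}.</a> {escape(note_text)}</p>'
        f"</aside>"
    )
    return reference, footnote


if __name__ == "__main__":
    ref, note = make_footnote(1, "Note the contrast with Isildur & his claim.")
    print(ref)
    print(note)
```

Escaping the note text matters because LLM-generated commentary may contain characters such as `&` or `<` that would otherwise produce invalid XHTML.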
Core editing functionality - clean, configurable, shareable.
- Clean main script
- YAML config/prompts
- CLI with argparse
- Documentation
LLM-based author voice preservation.
- Style profiler module (feed EPUBs to Opus, get rich analysis)
- Style validator (LLM-based drift scoring)
- Author profile schema
- Style-aware prompting (integrate profile into edit prompts)
- Drift reports per chapter
Support for plot/character-level changes.
- Book model extraction (characters, plot, timeline)
- Book model schema
- Change planning (ripple effect mapping)
- Consistency checking (cross-chapter validation)
User experience polish.
- Unified CLI with subcommands
- Preset workflows (cleanup, filter, adapt, transform)
- Interactive mode for complex changes
- Example profiles/plans
Add commentary layer without modifying original text.
- Annotator module (generate commentary per user style)
- Footnote inserter (proper EPUB3 markup)
- Commentary style templates
- Multiple styles: scholarly, historical, educational, devil's advocate, funny
- E-reader compatibility testing
Mix and match voices/characters from different works and authors.
- voice_composer.py - Extract and compose character voices
- Character voice profiles (separate from author style profiles)
- Voice blending for narrative (e.g., 70% Tolkien + 30% Herbert)
- Cross-work character interactions with distinct voices preserved
- Constraints system (constraints.yaml) for explicit style rules
- epub_validator.py - E-reader compatibility validation
User-controlled content filtering with transparency and guaranteed results.
- Borderline content review system
- External filter categories (customizable YAML)
- Persistent user preferences
- Filter profiles (strict/moderate/minimal)
- Dry-run caching for API efficiency
- Multi-pass filtering with verification
- Non-interactive batch mode
The filtering system is designed to work like VidAngel, where users can toggle content categories and have guaranteed filtering with user control over edge cases.
- User Control: Users decide exactly what content categories to filter
- Transparency: Clear categorization of content (FILTER/BORDERLINE/CLEAN)
- Persistence: Preferences saved across sessions
- Efficiency: Dry-run caching avoids redundant API calls
Analyzes content and categorizes it for filtering decisions.
# Key classes:
FilterPreferences # Manages persistent user preferences
BorderlineReviewer # Analyzes content and presents choices to user
# Workflow:
1. First pass: Identify FILTER, BORDERLINE, and CLEAN content
2. Present borderline items to user with category info
3. User selects which borderline categories to filter
4. Second pass: Apply filtering with user's choices
5. Optionally save preferences for future books
External YAML file defining content categories. Each category has:
- name: Human-readable name
- description: What this category covers (used in LLM prompts)
- default_threshold: filter | borderline | clean
- examples: Sample content for LLM reference
Category Groups:
- Sensual/Sexual: explicit_sexual, sensual_scenes, nudity_descriptions, kiss_sensory, physical_attraction, revealing_clothing, implied_intimacy, lingering_gaze, brief_kiss, romantic_tension
- Violence: graphic_gore, torture_scenes, combat_violence, death_descriptions, implied_violence
- Language: strong_profanity, slurs, mild_profanity, crude_humor
- Substance: drug_use_detailed, alcohol_abuse, substance_references
Search Paths (in order):
- Current working directory
- Script directory (same folder as borderline_reviewer.py)
- ~/.booksmith/filter_categories.yaml
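A first-match lookup over these three locations might be sketched as follows. The function names are hypothetical, and the script directory is passed in explicitly so the example does not depend on where it is run from.

```python
from pathlib import Path
from typing import Iterable, List, Optional

FILENAME = "filter_categories.yaml"


def candidate_paths(script_dir: Path) -> List[Path]:
    """Build the search list in the documented order."""
    return [
        Path.cwd() / FILENAME,                 # 1. current working directory
        script_dir / FILENAME,                 # 2. folder containing the script
        Path.home() / ".booksmith" / FILENAME, # 3. user data directory
    ]


def find_categories_file(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a file, else None."""
    for path in candidates:
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    here = Path(__file__).resolve().parent
    print(find_categories_file(candidate_paths(here)))
```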
Persistent storage for user's filtering decisions:
- always_filter: Categories to always filter
- never_filter: Categories to never filter
- ask_each_time: Categories requiring user decision
- profiles: Named presets (strict/moderate/minimal)
- history: Past decisions for learning
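To make the first three fields concrete, here is a minimal sketch of how a saved preference could resolve a category to an action. It is not the real FilterPreferences class: `PreferencesSketch`, the `decide` method, the return values, and the precedence order (always, then never, then ask) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PreferencesSketch:
    always_filter: Set[str] = field(default_factory=set)
    never_filter: Set[str] = field(default_factory=set)
    ask_each_time: Set[str] = field(default_factory=set)

    def decide(self, category: str) -> str:
        """Return "filter", "keep", or "ask" for a borderline category.

        Assumed precedence: always_filter wins over never_filter, and any
        category with no saved decision falls back to asking the user.
        """
        if category in self.always_filter:
            return "filter"
        if category in self.never_filter:
            return "keep"
        return "ask"


if __name__ == "__main__":
    prefs = PreferencesSketch(
        always_filter={"graphic_gore"},
        never_filter={"brief_kiss"},
        ask_each_time={"mild_profanity"},
    )
    for name in ("graphic_gore", "brief_kiss", "mild_profanity", "crude_humor"):
        print(name, "->", prefs.decide(name))
```

In a non-interactive run, an "ask" result would need a fallback such as the active filter profile.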
Caches analysis results to avoid duplicate API calls.
Cache Location: ~/.booksmith/cache/dryrun_<hash>.json
Cache Key = hash of:
- Input file path
- File modification time
- File size
- Prompts content
- Model name
Default TTL: 24 hours (configurable with --cache-ttl)
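One way to derive such a key and check the TTL is sketched below. The choice of SHA-256, the separator byte, and the helper names `cache_key` and `is_fresh` are assumptions; the source only specifies which inputs feed the hash and the 24-hour default.

```python
import hashlib
import os
import time
from pathlib import Path
from typing import Optional


def cache_key(epub_path: str, prompts_text: str, model: str) -> str:
    """Hash the five documented inputs into a hex digest."""
    stat = os.stat(epub_path)
    parts = [
        str(Path(epub_path).resolve()),  # input file path
        str(stat.st_mtime_ns),           # file modification time
        str(stat.st_size),               # file size
        prompts_text,                    # prompts content
        model,                           # model name
    ]
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


def is_fresh(cache_file: str, ttl_hours: float = 24.0, now: Optional[float] = None) -> bool:
    """True if the cache file is younger than the TTL."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(cache_file) < ttl_hours * 3600


if __name__ == "__main__":
    import tempfile

    with tempfile.NamedTemporaryFile(suffix=".epub", delete=False) as handle:
        handle.write(b"demo")
    print(cache_key(handle.name, "prompt text", "example-model"))
    os.unlink(handle.name)
```

Because the prompts and model name are part of the key, editing a prompt or switching models invalidates the cached analysis automatically.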
Workflow:
# Step 1: Dry-run analyzes and caches results
python epub_cleaner.py --input book.epub --dry-run
# Step 2: Full run reuses cached analysis (saves API calls)
python epub_cleaner.py --input book.epub --output clean.epub

| Option | Description |
|---|---|
| --borderline-review | Enable interactive borderline content review |
| --filter-profile PROFILE | Apply preset: strict, moderate, or minimal |
| --non-interactive | Use saved preferences, no prompts |
| --passes N | Multi-pass filtering (1-3) for thoroughness |
| --verify | Add verification pass after filtering |
| --chapters SELECTION | Process specific chapters (e.g., "1-5,10") |
| --model MODEL | Override model from config |
| --dry-run | Preview changes with caching |
| --no-cache | Force fresh analysis |
| --clear-cache | Clear cached results for book |
| --cache-ttl HOURS | Cache expiration time |
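The SELECTION syntax accepted by --chapters (comma-separated numbers and inclusive ranges) could be parsed along these lines. This is a sketch under the assumption that chapters are 1-based integers; `parse_chapters` is a hypothetical helper name.

```python
from typing import List


def parse_chapters(selection: str) -> List[int]:
    """Expand a selection such as "1-5,10" into a sorted list of chapter numbers."""
    chapters = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            chapters.update(range(start, end + 1))  # ranges are inclusive
        else:
            chapters.add(int(part))
    return sorted(chapters)


if __name__ == "__main__":
    print(parse_chapters("1-5,10"))
```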
| Profile | Behavior |
|---|---|
| strict | Filter all borderline content automatically |
| moderate | Filter explicit, ask about borderline content |
| minimal | Only filter explicit content, keep everything else |
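The three presets can be read as a mapping from a content classification to an action. The sketch below assumes that "explicit" content corresponds to the FILTER classification and that only BORDERLINE items vary by profile; the names `BORDERLINE_ACTION` and `action_for` are hypothetical.

```python
# Action taken on BORDERLINE content under each preset (assumed reading of the table)
BORDERLINE_ACTION = {
    "strict": "filter",   # filter all borderline content automatically
    "moderate": "ask",    # ask the user about borderline content
    "minimal": "keep",    # keep everything that is not explicit
}


def action_for(classification: str, profile: str) -> str:
    """Return "filter", "ask", or "keep" for one classified passage."""
    if profile not in BORDERLINE_ACTION:
        raise ValueError(f"Unknown profile: {profile!r}")
    if classification == "FILTER":
        return "filter"   # explicit content is filtered under every profile
    if classification == "CLEAN":
        return "keep"
    if classification == "BORDERLINE":
        return BORDERLINE_ACTION[profile]
    raise ValueError(f"Unknown classification: {classification!r}")


if __name__ == "__main__":
    for profile in BORDERLINE_ACTION:
        print(profile, action_for("BORDERLINE", profile))
```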
- LLM-based profiling: Feed EPUBs directly to Claude for rich, qualitative analysis
- Captures nuance: Tone, thematic preoccupations, narrative techniques, authorial quirks
- Multi-source: Can analyze multiple works by same author for robust profile
- Trade-off: Richer results than statistical metrics, but requires LLM calls
- JSON: Profiles, models, plans (human-readable, versionable)
- YAML: Config, prompts (user-editable)
- EPUB: Input/output (standard format)
- Primary: Anthropic Claude (current)
- Planned: OpenAI, local models (ollama)
- Granularity: Paragraph-level vs sentence-level vs scene-level edits?
- Caching: Cache analysis results for iterative editing? RESOLVED: Dry-run caching implemented
- Diff Format: How to present before/after for user review?
- Profile Portability: Share profiles for popular authors?
- Batch Processing: Multiple EPUBs in one run?
- Original script: ../epub_cleaner_RUN_THIS.py
- Keep prompts and config in YAML for clean separation
- Style profiler should work offline (no LLM needed)
- Prioritize user control and transparency