Booksmith

A tool for adapting EPUB content using LLMs. Changes can range from simple cleanup to full genre transformations, with built-in quality controls that preserve author voice and keep the text consistent.

Vision

Empower users to adapt books however they see fit, while maintaining the highest possible quality.

The tool should:

Support the full spectrum of changes (typo fixes → genre transformations)
Quantify and preserve author voice/style
Ensure consistency across chapters
Give users full control and transparency
Make quality the default, not an afterthought

Modes

┌─────────────────────────────────────────────────────────────────┐
│  EDIT       - Change existing content (cleanup → transformation)│
│  TRANSFORM  - Major adaptation (genre, setting, plot)           │
│  ANNOTATE   - Add commentary layer, original unchanged          │
└─────────────────────────────────────────────────────────────────┘

Use Case Spectrum (Edit/Transform)

MINIMAL CHANGES                                              EXTENSIVE CHANGES
      │                                                              │
      ▼                                                              ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐
│  Cleanup │  │ Filtering│  │  Style   │  │  Plot    │  │  Genre/Setting   │
│          │  │          │  │ Adapt    │  │  Changes │  │  Transformation  │
├──────────┤  ├──────────┤  ├──────────┤  ├──────────┤  ├──────────────────┤
│ OCR fix  │  │ Content  │  │ Modernize│  │ Character│  │ Steampunk LOTR   │
│ Typos    │  │ removal  │  │ language │  │ arcs     │  │ Sci-fi → Fantasy │
│ Format   │  │ Age-gate │  │ Simplify │  │ Endings  │  │ Period changes   │
└──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────────────┘
      │              │             │             │                │
   ~5% drift     ~15% drift    ~30% drift    ~50% drift      ~80% drift
   (safe)        (moderate)    (notable)     (significant)   (derivative)

Commentary Types (Annotate Mode)

┌─────────────────────────────────────────────────────────────────┐
│  COMMENTARY STYLES                                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Scholarly      - Literary analysis, sources, references        │
│  Historical     - Period context, author biography, events      │
│  Educational    - Vocabulary, concepts, explanations            │
│  Devil's Advocate - Challenge assumptions, alternative views    │
│  Thematic       - Connections to other works, parallels         │
│  Personal Lens  - User-specified perspective (economic, etc.)   │
│  Fun Facts      - Trivia, behind-the-scenes, inspirations       │
│  Funny          - Humorous observations, witty asides           │
│  Cross-Reference - Links to other texts, author's other works   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Output formats: Footnotes, Endnotes, Commentary Chapters

Quality Pillars

1. Style Preservation

Style Profiler: Analyze source EPUBs to extract quantified author fingerprint
Drift Scoring: Measure how much output deviates from original voice
Style-Aware Prompts: Feed metrics to LLM to guide rewrites
Multi-Source Profiles: Build robust profiles from multiple works by same author

2. Consistency

Book Model: Extract characters, locations, timeline, relationships
Change Planning: Map ripple effects before execution
Cross-Chapter Validation: Detect contradictions introduced by edits
Reference Tracking: Ensure modified elements stay consistent

3. Structural Integrity

HTML Preservation: Attributes, classes, formatting untouched
Surgical Edits: Only modify targeted paragraphs
EPUB Standards: Valid, readable output files

4. Transparency

Audit Trail: Detailed change reports (before/after)
Drift Reports: Per-chapter style deviation scores
Dry-Run Mode: Preview changes without applying (with caching)
Verbose Logging: Full visibility into process
Content Categorization: Clear FILTER/BORDERLINE/CLEAN classifications

5. User Control

Configurable Thresholds: "Warn if drift > 30%"
Approval Workflows: Review flagged changes before applying
Granular Prompts: Full control over analysis/rewrite behavior
CLI Flexibility: Override any setting via command line
Borderline Review: User decides on edge-case content
Persistent Preferences: Save filtering decisions across sessions
Filter Profiles: Quick presets (strict/moderate/minimal)

Architecture

Phase 0: Style Profiling (optional but recommended)

┌─────────────────────────────────────────────────────────┐
│  INPUT: One or more EPUBs by target author              │
│                                                         │
│  PROCESS:                                               │
│  - Extract all text content                             │
│  - Analyze linguistic features:                         │
│    - Sentence length distribution                       │
│    - Vocabulary richness / word frequencies             │
│    - Punctuation patterns                               │
│    - Function word ratios                               │
│    - Common phrases / n-grams                           │
│    - Part-of-speech patterns                            │
│    - Style embeddings (sentence transformers)           │
│                                                         │
│  OUTPUT: author_profile.json                            │
└─────────────────────────────────────────────────────────┘
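
Several of the linguistic features listed in the box above can be computed with the standard library alone. The following is a minimal sketch under stated assumptions, not the code in style_profiler.py: the function name `extract_style_features`, the tiny `FUNCTION_WORDS` list, the regex-based sentence splitter, and the output keys are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Illustrative subset only (assumption); a real profile would use a fuller list.
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "or", "of", "to", "in", "that", "it", "was"}

WORD_RE = re.compile(r"[A-Za-z']+")


def extract_style_features(text: str) -> dict:
    """Compute a few Phase 0 style features from plain text."""
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD_RE.findall(text.lower())
    lengths = [len(WORD_RE.findall(s)) for s in sentences]
    word_count = len(words) or 1          # avoid division by zero on empty input
    sentence_count = len(sentences) or 1
    return {
        "sentence_count": len(sentences),
        "mean_sentence_length": mean(lengths) if lengths else 0.0,
        "sentence_length_stdev": pstdev(lengths) if lengths else 0.0,
        # Vocabulary richness as distinct words / total words
        "type_token_ratio": len(set(words)) / word_count,
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / word_count,
        "commas_per_sentence": text.count(",") / sentence_count,
        # Bigrams are taken over the whole text, so they may span sentence boundaries
        "top_bigrams": Counter(zip(words, words[1:])).most_common(3),
    }
```

A dictionary like this could be serialized into author_profile.json, although the actual schema is defined by author_profile.schema.json and may differ.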

Phase 1: Book Analysis (for significant changes)

┌─────────────────────────────────────────────────────────┐
│  INPUT: Target EPUB                                     │
│                                                         │
│  PROCESS:                                               │
│  - Parse structure (chapters, sections)                 │
│  - Extract entities (characters, locations, objects)    │
│  - Map relationships and timeline                       │
│  - Identify key plot points                             │
│                                                         │
│  OUTPUT: book_model.json                                │
└─────────────────────────────────────────────────────────┘

Phase 2: Change Planning (for significant changes)

┌─────────────────────────────────────────────────────────┐
│  INPUT: book_model.json + user's change request         │
│                                                         │
│  PROCESS:                                               │
│  - Interpret user intent                                │
│  - Identify affected chapters/passages                  │
│  - Map ripple effects                                   │
│  - Generate change_plan.json                            │
│                                                         │
│  OUTPUT: change_plan.json                               │
│  (List of specific modifications per chapter)           │
└─────────────────────────────────────────────────────────┘

Phase 3: Execution

┌─────────────────────────────────────────────────────────┐
│  INPUT: EPUB + prompts + (optional) profile + plan      │
│                                                         │
│  FOR EACH CHAPTER:                                      │
│  1. Extract paragraphs                                  │
│  2. Analysis pass → FILTER or CLEAN                     │
│  3. If FILTER: Cleaning pass → selective rewrites       │
│  4. Apply changes to HTML                               │
│  5. Measure style drift (if profile exists)             │
│  6. Log changes                                         │
│                                                         │
│  OUTPUT: Modified EPUB                                  │
└─────────────────────────────────────────────────────────┘

Phase 4: Validation

┌─────────────────────────────────────────────────────────┐
│  INPUT: Modified EPUB + book_model + author_profile     │
│                                                         │
│  CHECKS:                                                │
│  - Consistency scan (contradictions, broken refs)       │
│  - Style drift report (per-chapter scores)              │
│  - Structural validation (valid EPUB)                   │
│                                                         │
│  OUTPUT:                                                │
│  - validation_report.json                               │
│  - Warnings for user review                             │
└─────────────────────────────────────────────────────────┘
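
The per-chapter drift report and the "warnings for user review" step can be pictured as a simple threshold check. This is a hedged sketch: it assumes drift scores have already been computed elsewhere as percentages (0-100), and the names `DriftWarning` and `flag_drift` are hypothetical rather than part of style_validator.py.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DriftWarning:
    chapter: str
    drift: float       # percent, 0-100 (assumed scale)
    threshold: float   # the limit that was exceeded


def flag_drift(chapter_scores: Dict[str, float], max_drift: float = 30.0) -> List[DriftWarning]:
    """Return one warning per chapter whose drift is strictly above max_drift."""
    return [
        DriftWarning(chapter, score, max_drift)
        for chapter, score in chapter_scores.items()
        if score > max_drift
    ]


if __name__ == "__main__":
    scores = {"ch1": 12.0, "ch2": 30.0, "ch3": 41.5}
    for w in flag_drift(scores):
        print(f"WARNING: {w.chapter} drift {w.drift:.1f}% exceeds {w.threshold:.1f}%")
```

The comparison is strict, so a chapter sitting exactly at the threshold is not flagged; that matches the "Warn if drift > 30%" wording but is still a design choice.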

Component Modules

Core (MVP - Phase 1) ✓

epub_cleaner.py - Main editing engine
config.example.yaml - Configuration template
prompts.example.yaml - Prompt templates
requirements.txt - Dependencies
README.md - Documentation

Style Preservation (Phase 2) ✓

style_profiler.py - Analyze EPUBs, generate author profiles
style_validator.py - Measure drift, score output quality
author_profile.schema.json - Profile format specification

Deep Editing (Phase 3) ✓

book_analyzer.py - Extract book model (characters, plot, timeline)
change_planner.py - Generate modification plans
consistency_checker.py - Cross-chapter validation
book_model.schema.json - Model format specification

CLI & Workflows (Phase 4) ✓

cli.py - Unified CLI with subcommands
workflows/ - Preset workflows (cleanup, filter, adapt, transform)
interactive.py - Guided mode for complex changes

Annotation Mode (Phase 5) ✓

annotator.py - Generate commentary for passages
footnote_inserter.py - Insert footnotes/endnotes into EPUB
commentary.example.yaml - Commentary style templates

Creative Composition (Phase 6) ✓

voice_composer.py - Extract and compose character voices
character_voice.schema.json - Character voice profile format
Voice blending for narrative (e.g., 70% Tolkien + 30% Herbert)
Cross-work character interactions with distinct voices preserved
constraints.example.yaml - Explicit style rules system
epub_validator.py - E-reader compatibility validation

VidAngel-Style Filtering (Phase 7) ✓

Content filtering designed like VidAngel - users toggle content categories with guaranteed filtering and control over edge cases.

borderline_reviewer.py - Borderline content detection and user interaction
filter_categories.yaml - External customizable content categories
FilterPreferences class - Persistent user preferences
Filter profiles (strict/moderate/minimal)
Dry-run caching system for API efficiency
Multi-pass filtering with verification
Non-interactive batch processing mode

File Structure (Current)

booksmith/
├── epub_cleaner.py          # Main editing engine (with profile integration + caching)
├── borderline_reviewer.py   # Borderline content detection and user interaction
├── style_profiler.py        # Author style analysis (LLM-based)
├── style_validator.py       # Drift measurement
├── book_analyzer.py         # Book model extraction
├── change_planner.py        # Modification planning
├── consistency_checker.py   # Cross-chapter validation
├── annotator.py             # Commentary generation
├── footnote_inserter.py     # Insert annotations into EPUB
├── voice_composer.py        # Character voice extraction & composition
├── epub_validator.py        # E-reader compatibility validation
├── cost_estimator.py        # API cost estimation before processing
├── interactive.py           # Guided wizard mode
├── cli.py                   # Unified command-line interface
│
├── config.example.yaml      # Configuration template
├── prompts.example.yaml     # Prompt templates
├── filter_categories.yaml   # Content filtering categories (customizable)
├── commentary.example.yaml  # Commentary style templates
├── constraints.example.yaml # Style constraints template
├── requirements.txt         # Dependencies
├── README.md                # Documentation
├── claude.md                # Development notes and roadmap
├── .gitignore               # Keep secrets out of git
│
├── schemas/
│   ├── author_profile.schema.json
│   ├── book_model.schema.json
│   ├── change_plan.schema.json
│   └── character_voice.schema.json
│
├── workflows/
│   ├── cleanup.yaml         # OCR/formatting fixes
│   ├── filter.yaml          # Content filtering
│   ├── modernize.yaml       # Language modernization
│   ├── transform.yaml       # Major adaptations
│   └── annotate.yaml        # Commentary presets
│
└── examples/
    ├── profile_tolkien.json     # Sample author profile
    ├── voice_gandalf.json       # Sample character voice
    ├── plan_steampunk.json      # Sample change plan
    ├── constraints_formal.yaml  # Formal writing constraints
    └── constraints_modern.yaml  # Modern style constraints

~/.booksmith/                    # User data directory
├── filter_preferences.yaml      # Saved borderline content preferences
└── cache/
    └── dryrun_<hash>.json       # Cached dry-run analysis results

CLI Design (Target)

# Simple cleanup (current functionality)
epub-cleaner edit --input book.epub --output clean.epub

# Build author profile
epub-cleaner profile --input author_works/*.epub --output tolkien.profile.json

# Edit with style preservation
epub-cleaner edit --input book.epub --profile tolkien.profile.json --max-drift 20

# Full transformation workflow
epub-cleaner transform --input lotr.epub --profile tolkien.profile.json \
    --goal "Convert to steampunk setting" --output lotr_steampunk.epub

# Validate existing EPUB against profile
epub-cleaner validate --input modified.epub --profile original_author.profile.json

# Estimate API costs before processing
epub-cleaner estimate --input book.epub --workflow transform --model opus
epub-cleaner estimate --input book.epub --all-features  # Include all optional features

# Add commentary/annotations
epub-cleaner annotate --input lotr.epub --output lotr_annotated.epub \
    --style scholarly,historical,funny \
    --format footnotes \
    --frequency "2-4 per chapter"

# VidAngel-style filtering with borderline review
epub-cleaner --input book.epub --borderline-review --filter-profile moderate

# Batch processing with saved preferences
epub-cleaner --input book.epub --non-interactive --filter-profile strict

# Multi-pass filtering for thoroughness
epub-cleaner --input book.epub --passes 2 --verify

# Process specific chapters
epub-cleaner --input book.epub --chapters "1-5,10,15-20"

# Dry-run with caching (preview then apply)
epub-cleaner --input book.epub --dry-run           # Analyze and cache results
epub-cleaner --input book.epub                     # Apply using cached analysis
epub-cleaner --input book.epub --no-cache          # Force fresh analysis
epub-cleaner --input book.epub --clear-cache       # Clear cached results
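
A selection string such as "1-5,10,15-20" has to be expanded into concrete chapter numbers before processing. A minimal sketch of one way to do that follows; the function name `parse_chapter_selection` and its error handling are assumptions, not the parser used by the CLI.

```python
from typing import List


def parse_chapter_selection(selection: str) -> List[int]:
    """Expand a selection like "1-5,10,15-20" into a sorted list of chapter numbers."""
    chapters = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid chapter range: {part!r}")
            chapters.update(range(start, end + 1))  # ranges are inclusive
        else:
            chapters.add(int(part))
    return sorted(chapters)
```

Duplicates and overlapping ranges collapse into a single entry, so "3, 1-2, 2" yields chapters 1, 2, and 3.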

EPUB Footnote Standards

Annotations must follow EPUB3 accessibility standards:

<!-- In chapter HTML: footnote reference -->
<p>Frodo took the Ring<a epub:type="noteref" href="#note1" id="ref1">1</a></p>

<!-- At end of chapter or in separate notes file -->
<aside epub:type="footnote" id="note1">
  <p><a href="#ref1">1.</a> Note the contrast with Isildur, who claimed
  the Ring by force. Frodo's acceptance is reluctant, thrust upon him
  by circumstance rather than desire.</p>
</aside>

Requirements for footnote_inserter.py:

Generate valid EPUB3 footnote markup with epub:type attributes
Create bidirectional links (reference ↔ footnote)
Update content.opf manifest if adding notes file
Add appropriate CSS for footnote styling
Ensure compatibility with major e-readers (Kindle, Kobo, Apple Books)
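
The first two requirements (epub:type markup and bidirectional links) can be illustrated by generating the two fragments shown in the example above. This is a sketch only: `build_footnote` and the `ref<N>`/`note<N>` id scheme are assumptions modeled on that example, and a real inserter would also need the `epub` namespace declared on the chapter's root element.

```python
from html import escape
from typing import Tuple


def build_footnote(number: int, note_text: str) -> Tuple[str, str]:
    """Return (inline reference markup, footnote aside markup) for one note."""
    ref_id = f"ref{number}"
    note_id = f"note{number}"
    # Inline marker: links forward to the note
    reference = f'<a epub:type="noteref" href="#{note_id}" id="{ref_id}">{number}</a>'
    # Note body: links back to the marker; text is escaped so it stays valid markup
    footnote = (
        f'<aside epub:type="footnote" id="{note_id}">\n'
        f'  <p><a href="#{ref_id}">{number}.</a> {escape(note_text)}</p>\n'
        f"</aside>"
    )
    return reference, footnote


if __name__ == "__main__":
    ref, note = build_footnote(1, "Note the contrast with Isildur.")
    print(ref)
    print(note)
```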

Development Phases

Phase 1: MVP ✓

Core editing functionality - clean, configurable, shareable.

Clean main script
YAML config/prompts
CLI with argparse
Documentation

Phase 2: Style Preservation ✓

LLM-based author voice preservation.

Style profiler module (feed EPUBs to Opus, get rich analysis)
Style validator (LLM-based drift scoring)
Author profile schema
Style-aware prompting (integrate profile into edit prompts)
Drift reports per chapter

Phase 3: Deep Editing ✓

Support for plot/character-level changes.

Book model extraction (characters, plot, timeline)
Book model schema
Change planning (ripple effect mapping)
Consistency checking (cross-chapter validation)

Phase 4: CLI & Workflows ✓

User experience polish.

Unified CLI with subcommands
Preset workflows (cleanup, filter, adapt, transform)
Interactive mode for complex changes
Example profiles/plans

Phase 5: Annotation Mode ✓

Add commentary layer without modifying original text.

Annotator module (generate commentary per user style)
Footnote inserter (proper EPUB3 markup)
Commentary style templates
Multiple styles: scholarly, historical, educational, devil's advocate, funny
E-reader compatibility testing

Phase 6: Creative Composition ✓

Mix and match voices/characters from different works and authors.

voice_composer.py - Extract and compose character voices
Character voice profiles (separate from author style profiles)
Voice blending for narrative (e.g., 70% Tolkien + 30% Herbert)
Cross-work character interactions with distinct voices preserved
Constraints system (constraints.yaml) for explicit style rules
epub_validator.py - E-reader compatibility validation

Phase 7: VidAngel-Style Filtering ✓

User-controlled content filtering with transparency and guaranteed results.

Borderline content review system
External filter categories (customizable YAML)
Persistent user preferences
Filter profiles (strict/moderate/minimal)
Dry-run caching for API efficiency
Multi-pass filtering with verification
Non-interactive batch mode

VidAngel-Style Filtering System

The filtering system works like VidAngel: users toggle content categories, content in the selected categories is always filtered, and the user decides how edge cases are handled.

Core Philosophy

User Control: Users decide exactly what content categories to filter
Transparency: Clear categorization of content (FILTER/BORDERLINE/CLEAN)
Persistence: Preferences saved across sessions
Efficiency: Dry-run caching avoids redundant API calls

Components

1. Borderline Reviewer (`borderline_reviewer.py`)

Analyzes content and categorizes it for filtering decisions.

# Key classes:
FilterPreferences     # Manages persistent user preferences
BorderlineReviewer    # Analyzes content and presents choices to user

# Workflow:
1. First pass: Identify FILTER, BORDERLINE, and CLEAN content
2. Present borderline items to user with category info
3. User selects which borderline categories to filter
4. Second pass: Apply filtering with user's choices
5. Optionally save preferences for future books

2. Filter Categories (`filter_categories.yaml`)

External YAML file defining content categories. Each category has:

name: Human-readable name
description: What this category covers (used in LLM prompts)
default_threshold: filter | borderline | clean
examples: Sample content for LLM reference

Category Groups:

Sensual/Sexual: explicit_sexual, sensual_scenes, nudity_descriptions, kiss_sensory, physical_attraction, revealing_clothing, implied_intimacy, lingering_gaze, brief_kiss, romantic_tension
Violence: graphic_gore, torture_scenes, combat_violence, death_descriptions, implied_violence
Language: strong_profanity, slurs, mild_profanity, crude_humor
Substance: drug_use_detailed, alcohol_abuse, substance_references

Search Paths (in order):

Current working directory
Script directory (same folder as borderline_reviewer.py)
~/.booksmith/filter_categories.yaml
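
The ordered lookup can be sketched as a first-match search. The function names below are hypothetical, and the sketch assumes the file is named filter_categories.yaml in the first two locations as well; the `cwd` and `home` parameters exist only to make the function easy to test.

```python
from pathlib import Path
from typing import List, Optional

FILENAME = "filter_categories.yaml"


def candidate_paths(script_dir: Path, cwd: Optional[Path] = None,
                    home: Optional[Path] = None) -> List[Path]:
    """List the locations to check, in priority order."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    return [
        cwd / FILENAME,                  # 1. current working directory
        script_dir / FILENAME,           # 2. folder containing borderline_reviewer.py
        home / ".booksmith" / FILENAME,  # 3. user data directory
    ]


def find_filter_categories(script_dir: Path, cwd: Optional[Path] = None,
                           home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing categories file, or None if none is found."""
    for path in candidate_paths(script_dir, cwd, home):
        if path.is_file():
            return path
    return None
```

With this ordering, a project-local file overrides the copy shipped next to the script, which in turn overrides the per-user file.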

3. User Preferences (`~/.booksmith/filter_preferences.yaml`)

Persistent storage for user's filtering decisions:

always_filter: Categories to always filter
never_filter: Categories to never filter
ask_each_time: Categories requiring user decision
profiles: Named presets (strict/moderate/minimal)
history: Past decisions for learning

4. Dry-Run Caching System

Caches analysis results to avoid duplicate API calls.

Cache Location: ~/.booksmith/cache/dryrun_<hash>.json

Cache Key = hash of:
- Input file path
- File modification time
- File size
- Prompts content
- Model name

Default TTL: 24 hours (configurable with --cache-ttl)
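
The cache key and TTL check described above could look roughly like the following. This is a sketch under assumptions: the source does not name the hash function (SHA-256 is used here), the function names are hypothetical, and cache age is judged by the cache file's own modification time.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def dryrun_cache_key(epub_path: Path, prompts_text: str, model: str) -> str:
    """Hash the five inputs listed above into a hex digest."""
    stat = epub_path.stat()
    parts = [
        str(epub_path.resolve()),  # input file path
        str(stat.st_mtime_ns),     # file modification time
        str(stat.st_size),         # file size
        prompts_text,              # prompts content
        model,                     # model name
    ]
    digest = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def cache_is_fresh(cache_file: Path, ttl_hours: float = 24.0,
                   now: Optional[float] = None) -> bool:
    """True if the cache file exists and is no older than ttl_hours."""
    if not cache_file.is_file():
        return False
    now = time.time() if now is None else now
    return (now - cache_file.stat().st_mtime) <= ttl_hours * 3600
```

Because the prompts and model name are part of the key, editing a prompt or switching models produces a different key and therefore a fresh analysis, without needing --no-cache.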

Workflow:

# Step 1: Dry-run analyzes and caches results
python epub_cleaner.py --input book.epub --dry-run

# Step 2: Full run reuses cached analysis (saves API calls)
python epub_cleaner.py --input book.epub --output clean.epub

CLI Options for Filtering

| Option | Description |
| --- | --- |
| `--borderline-review` | Enable interactive borderline content review |
| `--filter-profile PROFILE` | Apply preset: strict, moderate, or minimal |
| `--non-interactive` | Use saved preferences, no prompts |
| `--passes N` | Multi-pass filtering (1-3) for thoroughness |
| `--verify` | Add verification pass after filtering |
| `--chapters SELECTION` | Process specific chapters (e.g., "1-5,10") |
| `--model MODEL` | Override model from config |
| `--dry-run` | Preview changes with caching |
| `--no-cache` | Force fresh analysis |
| `--clear-cache` | Clear cached results for book |
| `--cache-ttl HOURS` | Cache expiration time |

Filter Profiles

| Profile | Behavior |
| --- | --- |
| `strict` | Filter all borderline content automatically |
| `moderate` | Filter explicit, ask about borderline content |
| `minimal` | Only filter explicit content, keep everything else |
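
The three profiles differ only in how they treat BORDERLINE content, which can be sketched as a small lookup. This is an illustration, not the logic in borderline_reviewer.py: `resolve_action` is a hypothetical name, and the sketch assumes that "explicit" content in the profile descriptions corresponds to the FILTER classification.

```python
# How each profile treats BORDERLINE content (taken from the profile descriptions)
BORDERLINE_ACTIONS = {
    "strict": "filter",   # filter all borderline content automatically
    "moderate": "ask",    # ask the user about borderline content
    "minimal": "keep",    # keep everything that is not explicit
}


def resolve_action(classification: str, profile: str) -> str:
    """Map a FILTER/BORDERLINE/CLEAN classification to 'filter', 'ask', or 'keep'."""
    if profile not in BORDERLINE_ACTIONS:
        raise ValueError(f"Unknown filter profile: {profile!r}")
    classification = classification.upper()
    if classification == "FILTER":
        return "filter"   # assumed: explicit content is filtered under every profile
    if classification == "CLEAN":
        return "keep"
    if classification == "BORDERLINE":
        return BORDERLINE_ACTIONS[profile]
    raise ValueError(f"Unknown classification: {classification!r}")
```

In a non-interactive run, an "ask" result would presumably fall back to saved preferences, but that behavior is not specified here.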

Technical Decisions

Style Profiling Approach

LLM-based profiling: Feed EPUBs directly to Claude for rich, qualitative analysis
Captures nuance: Tone, thematic preoccupations, narrative techniques, authorial quirks
Multi-source: Can analyze multiple works by same author for robust profile
Trade-off: Richer results than statistical metrics, but requires LLM calls

Storage Formats

JSON: Profiles, models, plans (human-readable, versionable)
YAML: Config, prompts (user-editable)
EPUB: Input/output (standard format)

LLM Providers

Primary: Anthropic Claude (current)
Planned: OpenAI, local models (ollama)

Open Questions

Granularity: Paragraph-level vs sentence-level vs scene-level edits?
Caching: Cache analysis results for iterative editing? (RESOLVED: dry-run caching is implemented.)
Diff Format: How to present before/after for user review?
Profile Portability: Share profiles for popular authors?
Batch Processing: Multiple EPUBs in one run?

Development Notes

Original script: ../epub_cleaner_RUN_THIS.py
Keep prompts and config in YAML for clean separation
Style profiler was originally meant to work offline (no LLM needed); the current approach is LLM-based (see Style Profiling Approach)
Prioritize user control and transparency
