Booksmith

Booksmith adapts EPUB content using LLMs. Changes can range from simple cleanup to full genre transformations, with built-in quality controls that preserve author voice and keep the book consistent.


Vision

Empower users to adapt books however they see fit, while maintaining the highest possible quality.

The tool should:

  • Support the full spectrum of changes (typo fixes → genre transformations)
  • Quantify and preserve author voice/style
  • Ensure consistency across chapters
  • Give users full control and transparency
  • Make quality the default, not an afterthought

Modes

┌─────────────────────────────────────────────────────────────────┐
│  EDIT       - Change existing content (cleanup → transformation)│
│  TRANSFORM  - Major adaptation (genre, setting, plot)           │
│  ANNOTATE   - Add commentary layer, original unchanged          │
└─────────────────────────────────────────────────────────────────┘

Use Case Spectrum (Edit/Transform)

MINIMAL CHANGES                                              EXTENSIVE CHANGES
      │                                                              │
      ▼                                                              ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐
│  Cleanup │  │ Filtering│  │  Style   │  │  Plot    │  │  Genre/Setting   │
│          │  │          │  │ Adapt    │  │  Changes │  │  Transformation  │
├──────────┤  ├──────────┤  ├──────────┤  ├──────────┤  ├──────────────────┤
│ OCR fix  │  │ Content  │  │ Modernize│  │ Character│  │ Steampunk LOTR   │
│ Typos    │  │ removal  │  │ language │  │ arcs     │  │ Sci-fi → Fantasy │
│ Format   │  │ Age-gate │  │ Simplify │  │ Endings  │  │ Period changes   │
└──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────────────┘
      │              │             │             │                │
   ~5% drift     ~15% drift    ~30% drift    ~50% drift      ~80% drift
   (safe)        (moderate)    (notable)     (significant)   (derivative)

Commentary Types (Annotate Mode)

┌─────────────────────────────────────────────────────────────────┐
│  COMMENTARY STYLES                                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Scholarly      - Literary analysis, sources, references        │
│  Historical     - Period context, author biography, events      │
│  Educational    - Vocabulary, concepts, explanations            │
│  Devil's Advocate - Challenge assumptions, alternative views    │
│  Thematic       - Connections to other works, parallels         │
│  Personal Lens  - User-specified perspective (economic, etc.)   │
│  Fun Facts      - Trivia, behind-the-scenes, inspirations       │
│  Funny          - Humorous observations, witty asides           │
│  Cross-Reference - Links to other texts, author's other works   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Output formats: Footnotes, Endnotes, Commentary Chapters

Quality Pillars

1. Style Preservation

  • Style Profiler: Analyze source EPUBs to extract quantified author fingerprint
  • Drift Scoring: Measure how much output deviates from original voice
  • Style-Aware Prompts: Feed metrics to LLM to guide rewrites
  • Multi-Source Profiles: Build robust profiles from multiple works by same author

2. Consistency

  • Book Model: Extract characters, locations, timeline, relationships
  • Change Planning: Map ripple effects before execution
  • Cross-Chapter Validation: Detect contradictions introduced by edits
  • Reference Tracking: Ensure modified elements stay consistent

3. Structural Integrity

  • HTML Preservation: Attributes, classes, formatting untouched
  • Surgical Edits: Only modify targeted paragraphs
  • EPUB Standards: Valid, readable output files

4. Transparency

  • Audit Trail: Detailed change reports (before/after)
  • Drift Reports: Per-chapter style deviation scores
  • Dry-Run Mode: Preview changes without applying (with caching)
  • Verbose Logging: Full visibility into process
  • Content Categorization: Clear FILTER/BORDERLINE/CLEAN classifications

5. User Control

  • Configurable Thresholds: "Warn if drift > 30%"
  • Approval Workflows: Review flagged changes before applying
  • Granular Prompts: Full control over analysis/rewrite behavior
  • CLI Flexibility: Override any setting via command line
  • Borderline Review: User decides on edge-case content
  • Persistent Preferences: Save filtering decisions across sessions
  • Filter Profiles: Quick presets (strict/moderate/minimal)
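
A configurable threshold such as "Warn if drift > 30%" could be checked along these lines. This is a minimal sketch: the function name `chapters_over_threshold` and the shape of the per-chapter score mapping are assumptions for illustration, not Booksmith's actual API.

```python
def chapters_over_threshold(
    drift_by_chapter: dict[str, float], max_drift: float = 30.0
) -> list[str]:
    """Return the chapters whose drift score (in percent) exceeds max_drift.

    drift_by_chapter is a hypothetical mapping of chapter name -> drift score.
    """
    return [name for name, drift in drift_by_chapter.items() if drift > max_drift]


if __name__ == "__main__":
    scores = {"chapter_01": 12.0, "chapter_02": 41.5, "chapter_03": 30.0}
    for chapter in chapters_over_threshold(scores, max_drift=30.0):
        print(f"WARNING: {chapter} drift {scores[chapter]:.1f}% exceeds threshold")
```

The comparison is strict, so a chapter sitting exactly at the threshold does not trigger a warning.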

Architecture

Phase 0: Style Profiling (optional but recommended)

┌─────────────────────────────────────────────────────────┐
│  INPUT: One or more EPUBs by target author              │
│                                                         │
│  PROCESS:                                               │
│  - Extract all text content                             │
│  - Analyze linguistic features:                         │
│    - Sentence length distribution                       │
│    - Vocabulary richness / word frequencies             │
│    - Punctuation patterns                               │
│    - Function word ratios                               │
│    - Common phrases / n-grams                           │
│    - Part-of-speech patterns                            │
│    - Style embeddings (sentence transformers)           │
│                                                         │
│  OUTPUT: author_profile.json                            │
└─────────────────────────────────────────────────────────┘
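
A few of the linguistic features listed in the Phase 0 box can be approximated offline with standard-library Python. The sketch below is illustrative only: `style_features`, the tiny `FUNCTION_WORDS` set, and the output keys are assumptions, not the format of author_profile.json. It leaves out n-grams, part-of-speech patterns, and embeddings.

```python
import re
import statistics
from collections import Counter

# Deliberately small, illustrative list (an assumption); a real profiler
# would use a fuller function-word inventory.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "at", "for", "with", "that",
}


def style_features(text: str) -> dict:
    """Compute a few simple, quantifiable style metrics for a block of text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    punctuation = Counter(ch for ch in text if ch in ",;:!?-")
    total_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "sentence_count": len(sentences),
        "mean_sentence_length": statistics.mean(lengths) if lengths else 0.0,
        "type_token_ratio": len(set(words)) / total_words,
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / total_words,
        "punctuation_per_100_words": {
            mark: 100 * count / total_words for mark, count in punctuation.items()
        },
    }


if __name__ == "__main__":
    print(style_features("The cat sat. The dog ran away!"))
```

Metrics like these are cheap to compute and easy to compare between original and rewritten text, but they capture far less nuance than an LLM-based analysis.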

Phase 1: Book Analysis (for significant changes)

┌─────────────────────────────────────────────────────────┐
│  INPUT: Target EPUB                                     │
│                                                         │
│  PROCESS:                                               │
│  - Parse structure (chapters, sections)                 │
│  - Extract entities (characters, locations, objects)    │
│  - Map relationships and timeline                       │
│  - Identify key plot points                             │
│                                                         │
│  OUTPUT: book_model.json                                │
└─────────────────────────────────────────────────────────┘

Phase 2: Change Planning (for significant changes)

┌─────────────────────────────────────────────────────────┐
│  INPUT: book_model.json + user's change request         │
│                                                         │
│  PROCESS:                                               │
│  - Interpret user intent                                │
│  - Identify affected chapters/passages                  │
│  - Map ripple effects                                   │
│  - Generate change_plan.json                            │
│                                                         │
│  OUTPUT: change_plan.json                               │
│  (List of specific modifications per chapter)           │
└─────────────────────────────────────────────────────────┘

Phase 3: Execution

┌─────────────────────────────────────────────────────────┐
│  INPUT: EPUB + prompts + (optional) profile + plan      │
│                                                         │
│  FOR EACH CHAPTER:                                      │
│  1. Extract paragraphs                                  │
│  2. Analysis pass → FILTER or CLEAN                     │
│  3. If FILTER: Cleaning pass → selective rewrites       │
│  4. Apply changes to HTML                               │
│  5. Measure style drift (if profile exists)             │
│  6. Log changes                                         │
│                                                         │
│  OUTPUT: Modified EPUB                                  │
└─────────────────────────────────────────────────────────┘

Phase 4: Validation

┌─────────────────────────────────────────────────────────┐
│  INPUT: Modified EPUB + book_model + author_profile     │
│                                                         │
│  CHECKS:                                                │
│  - Consistency scan (contradictions, broken refs)       │
│  - Style drift report (per-chapter scores)              │
│  - Structural validation (valid EPUB)                   │
│                                                         │
│  OUTPUT:                                                │
│  - validation_report.json                               │
│  - Warnings for user review                             │
└─────────────────────────────────────────────────────────┘

Component Modules

Core (MVP - Phase 1) ✓

  • epub_cleaner.py - Main editing engine
  • config.example.yaml - Configuration template
  • prompts.example.yaml - Prompt templates
  • requirements.txt - Dependencies
  • README.md - Documentation

Style Preservation (Phase 2) ✓

  • style_profiler.py - Analyze EPUBs, generate author profiles
  • style_validator.py - Measure drift, score output quality
  • author_profile.schema.json - Profile format specification

Deep Editing (Phase 3) ✓

  • book_analyzer.py - Extract book model (characters, plot, timeline)
  • change_planner.py - Generate modification plans
  • consistency_checker.py - Cross-chapter validation
  • book_model.schema.json - Model format specification

CLI & Workflows (Phase 4) ✓

  • cli.py - Unified CLI with subcommands
  • workflows/ - Preset workflows (cleanup, filter, adapt, transform)
  • interactive.py - Guided mode for complex changes

Annotation Mode (Phase 5) ✓

  • annotator.py - Generate commentary for passages
  • footnote_inserter.py - Insert footnotes/endnotes into EPUB
  • commentary.example.yaml - Commentary style templates

Creative Composition (Phase 6) ✓

  • voice_composer.py - Extract and compose character voices
  • character_voice.schema.json - Character voice profile format
  • Voice blending for narrative (e.g., 70% Tolkien + 30% Herbert)
  • Cross-work character interactions with distinct voices preserved
  • constraints.example.yaml - Explicit style rules system
  • epub_validator.py - E-reader compatibility validation

VidAngel-Style Filtering (Phase 7) ✓

Content filtering designed like VidAngel - users toggle content categories with guaranteed filtering and control over edge cases.

  • borderline_reviewer.py - Borderline content detection and user interaction
  • filter_categories.yaml - External customizable content categories
  • FilterPreferences class - Persistent user preferences
  • Filter profiles (strict/moderate/minimal)
  • Dry-run caching system for API efficiency
  • Multi-pass filtering with verification
  • Non-interactive batch processing mode

File Structure (Current)

booksmith/
├── epub_cleaner.py          # Main editing engine (with profile integration + caching)
├── borderline_reviewer.py   # Borderline content detection and user interaction
├── style_profiler.py        # Author style analysis (LLM-based)
├── style_validator.py       # Drift measurement
├── book_analyzer.py         # Book model extraction
├── change_planner.py        # Modification planning
├── consistency_checker.py   # Cross-chapter validation
├── annotator.py             # Commentary generation
├── footnote_inserter.py     # Insert annotations into EPUB
├── voice_composer.py        # Character voice extraction & composition
├── epub_validator.py        # E-reader compatibility validation
├── cost_estimator.py        # API cost estimation before processing
├── interactive.py           # Guided wizard mode
├── cli.py                   # Unified command-line interface
│
├── config.example.yaml      # Configuration template
├── prompts.example.yaml     # Prompt templates
├── filter_categories.yaml   # Content filtering categories (customizable)
├── commentary.example.yaml  # Commentary style templates
├── constraints.example.yaml # Style constraints template
├── requirements.txt         # Dependencies
├── README.md                # Documentation
├── claude.md                # Development notes and roadmap
├── .gitignore               # Keep secrets out of git
│
├── schemas/
│   ├── author_profile.schema.json
│   ├── book_model.schema.json
│   ├── change_plan.schema.json
│   └── character_voice.schema.json
│
├── workflows/
│   ├── cleanup.yaml         # OCR/formatting fixes
│   ├── filter.yaml          # Content filtering
│   ├── modernize.yaml       # Language modernization
│   ├── transform.yaml       # Major adaptations
│   └── annotate.yaml        # Commentary presets
│
└── examples/
    ├── profile_tolkien.json     # Sample author profile
    ├── voice_gandalf.json       # Sample character voice
    ├── plan_steampunk.json      # Sample change plan
    ├── constraints_formal.yaml  # Formal writing constraints
    └── constraints_modern.yaml  # Modern style constraints

~/.booksmith/                    # User data directory
├── filter_preferences.yaml      # Saved borderline content preferences
└── cache/
    └── dryrun_<hash>.json       # Cached dry-run analysis results

CLI Design (Target)

# Simple cleanup (current functionality)
epub-cleaner edit --input book.epub --output clean.epub

# Build author profile
epub-cleaner profile --input author_works/*.epub --output tolkien.profile.json

# Edit with style preservation
epub-cleaner edit --input book.epub --profile tolkien.profile.json --max-drift 20

# Full transformation workflow
epub-cleaner transform --input lotr.epub --profile tolkien.profile.json \
    --goal "Convert to steampunk setting" --output lotr_steampunk.epub

# Validate existing EPUB against profile
epub-cleaner validate --input modified.epub --profile original_author.profile.json

# Estimate API costs before processing
epub-cleaner estimate --input book.epub --workflow transform --model opus
epub-cleaner estimate --input book.epub --all-features  # Include all optional features

# Add commentary/annotations
epub-cleaner annotate --input lotr.epub --output lotr_annotated.epub \
    --style scholarly,historical,funny \
    --format footnotes \
    --frequency "2-4 per chapter"

# VidAngel-style filtering with borderline review
epub-cleaner --input book.epub --borderline-review --filter-profile moderate

# Batch processing with saved preferences
epub-cleaner --input book.epub --non-interactive --filter-profile strict

# Multi-pass filtering for thoroughness
epub-cleaner --input book.epub --passes 2 --verify

# Process specific chapters
epub-cleaner --input book.epub --chapters "1-5,10,15-20"

# Dry-run with caching (preview then apply)
epub-cleaner --input book.epub --dry-run           # Analyze and cache results
epub-cleaner --input book.epub                     # Apply using cached analysis
epub-cleaner --input book.epub --no-cache          # Force fresh analysis
epub-cleaner --input book.epub --clear-cache       # Clear cached results

EPUB Footnote Standards

Annotations must follow EPUB3 accessibility standards:

<!-- In chapter HTML: footnote reference -->
<p>Frodo took the Ring<a epub:type="noteref" href="#note1" id="ref1">1</a></p>

<!-- At end of chapter or in separate notes file -->
<aside epub:type="footnote" id="note1">
  <p><a href="#ref1">1.</a> Note the contrast with Isildur, who claimed
  the Ring by force. Frodo's acceptance is reluctant, thrust upon him
  by circumstance rather than desire.</p>
</aside>

Requirements for footnote_inserter.py:

  • Generate valid EPUB3 footnote markup with epub:type attributes
  • Create bidirectional links (reference ↔ footnote)
  • Update content.opf manifest if adding notes file
  • Add appropriate CSS for footnote styling
  • Ensure compatibility with major e-readers (Kindle, Kobo, Apple Books)
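
As a rough illustration of the first two requirements, the markup pair shown above could be generated like this. The function name `footnote_markup` and its signature are hypothetical, not the actual interface of footnote_inserter.py. Manifest updates and CSS are out of scope for the sketch.

```python
from html import escape


def footnote_markup(number: int, note_text: str) -> tuple[str, str]:
    """Build the (reference, footnote) markup pair for one note.

    The reference links forward to the note, and the note links back to the
    reference, giving the bidirectional navigation described above.
    """
    reference = (
        f'<a epub:type="noteref" href="#note{number}" id="ref{number}">{number}</a>'
    )
    footnote = (
        f'<aside epub:type="footnote" id="note{number}">\n'
        f'  <p><a href="#ref{number}">{number}.</a> {escape(note_text)}</p>\n'
        f"</aside>"
    )
    return reference, footnote


if __name__ == "__main__":
    ref, note = footnote_markup(1, "Note the contrast with Isildur.")
    print(f"<p>Frodo took the Ring{ref}</p>")
    print(note)
```

The note text is escaped so that characters such as `&` and `<` in generated commentary cannot break the chapter's XHTML.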

Development Phases

Phase 1: MVP ✓

Core editing functionality - clean, configurable, shareable.

  • Clean main script
  • YAML config/prompts
  • CLI with argparse
  • Documentation

Phase 2: Style Preservation ✓

LLM-based author voice preservation.

  • Style profiler module (feed EPUBs to Opus, get rich analysis)
  • Style validator (LLM-based drift scoring)
  • Author profile schema
  • Style-aware prompting (integrate profile into edit prompts)
  • Drift reports per chapter

Phase 3: Deep Editing ✓

Support for plot/character-level changes.

  • Book model extraction (characters, plot, timeline)
  • Book model schema
  • Change planning (ripple effect mapping)
  • Consistency checking (cross-chapter validation)

Phase 4: CLI & Workflows ✓

User experience polish.

  • Unified CLI with subcommands
  • Preset workflows (cleanup, filter, adapt, transform)
  • Interactive mode for complex changes
  • Example profiles/plans

Phase 5: Annotation Mode ✓

Add commentary layer without modifying original text.

  • Annotator module (generate commentary per user style)
  • Footnote inserter (proper EPUB3 markup)
  • Commentary style templates
  • Multiple styles: scholarly, historical, educational, devil's advocate, funny
  • E-reader compatibility testing

Phase 6: Creative Composition ✓

Mix and match voices/characters from different works and authors.

  • voice_composer.py - Extract and compose character voices
  • Character voice profiles (separate from author style profiles)
  • Voice blending for narrative (e.g., 70% Tolkien + 30% Herbert)
  • Cross-work character interactions with distinct voices preserved
  • Constraints system (constraints.yaml) for explicit style rules
  • epub_validator.py - E-reader compatibility validation

Phase 7: VidAngel-Style Filtering ✓

User-controlled content filtering with transparency and guaranteed results.

  • Borderline content review system
  • External filter categories (customizable YAML)
  • Persistent user preferences
  • Filter profiles (strict/moderate/minimal)
  • Dry-run caching for API efficiency
  • Multi-pass filtering with verification
  • Non-interactive batch mode

VidAngel-Style Filtering System

The filtering system works like VidAngel: users toggle content categories, content in the selected categories is always filtered, and the user decides how edge cases are handled.

Core Philosophy

  • User Control: Users decide exactly what content categories to filter
  • Transparency: Clear categorization of content (FILTER/BORDERLINE/CLEAN)
  • Persistence: Preferences saved across sessions
  • Efficiency: Dry-run caching avoids redundant API calls

Components

1. Borderline Reviewer (borderline_reviewer.py)

Analyzes content and categorizes it for filtering decisions.

# Key classes:
FilterPreferences     # Manages persistent user preferences
BorderlineReviewer    # Analyzes content and presents choices to user

# Workflow:
1. First pass: Identify FILTER, BORDERLINE, and CLEAN content
2. Present borderline items to user with category info
3. User selects which borderline categories to filter
4. Second pass: Apply filtering with user's choices
5. Optionally save preferences for future books

2. Filter Categories (filter_categories.yaml)

External YAML file defining content categories. Each category has:

  • name: Human-readable name
  • description: What this category covers (used in LLM prompts)
  • default_threshold: filter | borderline | clean
  • examples: Sample content for LLM reference

Category Groups:

  • Sensual/Sexual: explicit_sexual, sensual_scenes, nudity_descriptions, kiss_sensory, physical_attraction, revealing_clothing, implied_intimacy, lingering_gaze, brief_kiss, romantic_tension
  • Violence: graphic_gore, torture_scenes, combat_violence, death_descriptions, implied_violence
  • Language: strong_profanity, slurs, mild_profanity, crude_humor
  • Substance: drug_use_detailed, alcohol_abuse, substance_references

Search Paths (in order):

  1. Current working directory
  2. Script directory (same folder as borderline_reviewer.py)
  3. ~/.booksmith/filter_categories.yaml
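
The lookup order above might be implemented roughly as follows. This is a sketch under assumptions: the function name `find_filter_categories` and its parameters are invented for illustration, and the file is assumed to be named filter_categories.yaml in all three locations.

```python
from pathlib import Path
from typing import Optional

CATEGORIES_FILENAME = "filter_categories.yaml"


def find_filter_categories(
    script_dir: Path,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first existing categories file, or None if none is found."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    candidates = [
        cwd / CATEGORIES_FILENAME,                  # 1. current working directory
        script_dir / CATEGORIES_FILENAME,           # 2. script directory
        home / ".booksmith" / CATEGORIES_FILENAME,  # 3. user data directory
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    print(find_filter_categories(script_dir=Path(__file__).resolve().parent))
```

Passing `cwd` and `home` as optional parameters keeps the search easy to test without touching the real home directory.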

3. User Preferences (~/.booksmith/filter_preferences.yaml)

Persistent storage for user's filtering decisions:

  • always_filter: Categories to always filter
  • never_filter: Categories to never filter
  • ask_each_time: Categories requiring user decision
  • profiles: Named presets (strict/moderate/minimal)
  • history: Past decisions for learning

4. Dry-Run Caching System

Caches analysis results to avoid duplicate API calls.

Cache Location: ~/.booksmith/cache/dryrun_<hash>.json

Cache Key = hash of:
- Input file path
- File modification time
- File size
- Prompts content
- Model name

Default TTL: 24 hours (configurable with --cache-ttl)
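
One way to derive such a key and apply the TTL is sketched below. The hash algorithm (SHA-256), the 16-character truncation, and the function names are assumptions. The notes above only specify which inputs feed the hash and the 24-hour default.

```python
import hashlib
import os
import time
from pathlib import Path


def dryrun_cache_key(epub_path: str, prompts_text: str, model: str) -> str:
    """Hash the inputs listed above into a short, filename-safe key."""
    stat = os.stat(epub_path)
    parts = [
        os.path.abspath(epub_path),  # input file path
        str(stat.st_mtime_ns),       # file modification time
        str(stat.st_size),           # file size
        prompts_text,                # prompts content
        model,                       # model name
    ]
    digest = hashlib.sha256("\0".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


def cache_is_fresh(cache_file: Path, ttl_hours: float = 24.0) -> bool:
    """Return True if the cache file exists and is younger than the TTL."""
    if not cache_file.is_file():
        return False
    age_seconds = time.time() - cache_file.stat().st_mtime
    return age_seconds < ttl_hours * 3600
```

Because the modification time and size are part of the key, editing the EPUB or the prompts yields a different key, so a stale analysis is never picked up for changed inputs.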

Workflow:

# Step 1: Dry-run analyzes and caches results
python epub_cleaner.py --input book.epub --dry-run

# Step 2: Full run reuses cached analysis (saves API calls)
python epub_cleaner.py --input book.epub --output clean.epub

CLI Options for Filtering

Option                    Description
--borderline-review       Enable interactive borderline content review
--filter-profile PROFILE  Apply preset: strict, moderate, or minimal
--non-interactive         Use saved preferences, no prompts
--passes N                Multi-pass filtering (1-3) for thoroughness
--verify                  Add verification pass after filtering
--chapters SELECTION      Process specific chapters (e.g., "1-5,10")
--model MODEL             Override model from config
--dry-run                 Preview changes with caching
--no-cache                Force fresh analysis
--clear-cache             Clear cached results for book
--cache-ttl HOURS         Cache expiration time
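
A chapter selection such as "1-5,10" could be expanded into chapter numbers as in the sketch below. The function name `parse_chapter_selection` is hypothetical, and the treatment of duplicates and reversed ranges is an assumption rather than documented behavior.

```python
def parse_chapter_selection(selection: str) -> list[int]:
    """Expand a selection like "1-5,10,15-20" into a sorted list of chapter numbers."""
    chapters: set[int] = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid chapter range: {part!r}")
            chapters.update(range(start, end + 1))  # ranges are inclusive
        else:
            chapters.add(int(part))
    return sorted(chapters)


if __name__ == "__main__":
    print(parse_chapter_selection("1-5,10"))
```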

Filter Profiles

Profile   Behavior
strict    Filter all borderline content automatically
moderate  Filter explicit, ask about borderline content
minimal   Only filter explicit content, keep everything else
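
Read as a decision rule, the three profiles might map onto the FILTER/BORDERLINE/CLEAN classifications as sketched below. The sketch assumes that "explicit" content is whatever the analysis pass labels FILTER. The function name and the returned action strings are invented for illustration.

```python
# How each profile treats BORDERLINE items, per the profile descriptions above.
BORDERLINE_ACTION = {
    "strict": "filter",  # filter all borderline content automatically
    "moderate": "ask",   # ask the user about borderline content
    "minimal": "keep",   # keep everything that is not explicit
}


def resolve_action(classification: str, profile: str) -> str:
    """Return "filter", "keep", or "ask" for one classified passage."""
    if profile not in BORDERLINE_ACTION:
        raise ValueError(f"Unknown filter profile: {profile!r}")
    if classification == "FILTER":
        return "filter"  # explicit content is filtered under every profile
    if classification == "CLEAN":
        return "keep"
    if classification == "BORDERLINE":
        return BORDERLINE_ACTION[profile]
    raise ValueError(f"Unknown classification: {classification!r}")


if __name__ == "__main__":
    for profile in ("strict", "moderate", "minimal"):
        print(profile, resolve_action("BORDERLINE", profile))
```

In non-interactive runs an "ask" result would need a fallback, for example the saved preferences. That behavior is not specified here.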

Technical Decisions

Style Profiling Approach

  • LLM-based profiling: Feed EPUBs directly to Claude for rich, qualitative analysis
  • Captures nuance: Tone, thematic preoccupations, narrative techniques, authorial quirks
  • Multi-source: Can analyze multiple works by same author for robust profile
  • Trade-off: Richer results than statistical metrics, but requires LLM calls

Storage Formats

  • JSON: Profiles, models, plans (human-readable, versionable)
  • YAML: Config, prompts (user-editable)
  • EPUB: Input/output (standard format)

LLM Providers

  • Primary: Anthropic Claude (current)
  • Planned: OpenAI, local models (ollama)

Open Questions

  1. Granularity: Paragraph-level vs sentence-level vs scene-level edits?
  2. Caching: Cache analysis results for iterative editing? (Resolved: dry-run caching is implemented.)
  3. Diff Format: How to present before/after for user review?
  4. Profile Portability: Share profiles for popular authors?
  5. Batch Processing: Multiple EPUBs in one run?

Development Notes

  • Original script: ../epub_cleaner_RUN_THIS.py
  • Keep prompts and config in YAML for clean separation
  • Style profiler was planned to work offline (no LLM needed); the current profiler is LLM-based (see Style Profiling Approach)
  • Prioritize user control and transparency