Document-to-Audiobook Pipeline

A restart-safe Go-based pipeline that turns any PDF into a polished, single-file MP3.

Status

✅ Code Standards Compliant: Follows Go Coding Standards, Bash Coding Standards, and Design Principles
✅ Fully tested on Fedora 42 (stock repos + RPM Fusion)
⚠️ Untested on Ubuntu – behavior there is unknown
🔧 TTS layer powered by the project-specific fork: https://github.com/nnikolov3/book_expert_f5-tts

Pipeline Overview

PDF → PNG: High-DPI page conversion (pdf-to-png)
PNG → OCR: Text extraction with Tesseract (png-to-text-tesseract)
OCR → Enhanced Text: LLM-powered narration enhancement (png-text-augment)
Text Organization: Merging and structuring (merge-text)
Text → Audio: TTS synthesis with F5-TTS (text-to-wav)
Audio Processing: WAV → 48kHz mono → MP3 (wav-to-mp3)

Architecture: Implemented in Go for performance and reliability, with comprehensive configuration management through project.toml. Follows modern software engineering practices with extensive testing, linting, and quality assurance.

Repository Structure

book_expert/
├── bin/                      # Compiled Go binaries
├── cmd/                      # Go command implementations
│   ├── pdf-to-png/           # PDF → PNG conversion
│   ├── png-to-text-tesseract/ # OCR processing  
│   ├── png-text-augment/     # LLM enhancement
│   ├── merge-text/           # Text concatenation
│   ├── text-to-wav/          # TTS synthesis
│   └── wav-to-mp3/           # Audio conversion
├── internal/                 # Internal Go packages
│   ├── config/               # Configuration management
│   └── logging/              # Structured logging
├── scripts/                  # Development tools
│   ├── test_pipeline.sh      # Comprehensive testing
│   └── profile_go.sh         # Performance profiling
├── test/                     # Integration tests
├── logs/                     # Pipeline and test logs
├── data/                     # Processing workspace
│   ├── raw/                  # Source PDFs (configurable)
│   └── <pdf_name>/           # Per-document processing
│       ├── png/              # Rendered pages
│       ├── text/             # OCR + enhanced text
│       ├── wav/              # TTS audio chunks
│       └── mp3/              # Final audiobook
├── project.toml              # ★ Complete pipeline configuration ★
├── Makefile                  # Build and test automation
├── go.mod                    # Go module definition
├── DESIGN_PRINCIPLES_GUIDE.md # Development standards
├── GO_CODING_STANDARD.md     # Go coding guidelines
└── README.md                 # This file

Note: All directory paths are configurable through project.toml - nothing is hardcoded.

Quick Start (Fedora 42)

1. System Dependencies

sudo dnf install \
  ghostscript tesseract tesseract-langpack-eng \
  poppler-utils ImageMagick jq yq rsync ffmpeg \
  shellcheck nproc coreutils awk grep curl flock

2. F5-TTS Setup

git clone https://github.com/nnikolov3/book_expert_f5-tts.git
cd book_expert_f5-tts
python -m venv .venv && source .venv/bin/activate
pip install -e .

3. Project Setup

git clone https://github.com/<your-org>/book_expert.git
cd book_expert
make build

4. API Configuration

export GEMINI_API_KEY="sk-…"      # Google Gemini
export CEREBRAS_API_KEY="cb-…"     # Cerebras inference endpoint
# export NVIDIA_API_KEY="na-…"     # Optional

5. Configuration

Open project.toml and adjust:

[paths] / [directories] / [processing_dir] / [logs_dir] – folder layout
[settings] – DPI, force, worker counts
[google_api] & [cerebras_api] – model names, temps, tokens
[f5_tts_settings] – TTS model, worker threads
[prompts.*] – full system/user prompts used by each LLM call

Pipeline Components

Binary	Purpose	Configuration
`pdf-to-png`	PDF → PNG conversion	`settings.dpi`, `settings.*`
`png-to-text-tesseract`	OCR text extraction	`tesseract.*`
`png-text-augment`	LLM enhancement	`google_api.`, `prompts.`
`merge-text`	Text concatenation	`text_concatenation.*`
`text-to-wav`	TTS synthesis	`f5_tts_settings.*`
`wav-to-mp3`	Audio conversion	Audio processing settings

Usage

# Build all binaries
make build

# Run pipeline stages
./bin/pdf-to-png --input data/raw --output data         # PDF → PNG
./bin/png-to-text-tesseract --input data --output data  # PNG → OCR + LLM
./bin/merge-text --input data --output data             # Text → complete.txt
./bin/text-to-wav --input data --output data            # Text → WAV chunks
./bin/wav-to-mp3 --input data --output data             # WAV → MP3

Key Features:

📋 Reads project.toml for all configuration and paths
❓ Supports --help for detailed usage information
🔄 Idempotent operations—safe to rerun; use --force to overwrite
📊 Comprehensive logging and error reporting
⚡ Parallel processing where applicable

Typical Workflow

Drop PDFs into the folder pointed to by paths.input_dir (default data/raw/).
Run make build to compile all binaries.
Execute the pipeline binaries in order.
Find your audiobook at <output_dir>/<pdf_name>/mp3/<pdf_name>.mp3.

Configuration Guide

Key Configuration Sections in `project.toml`:

[paths], [directories], [processing_dir], [logs_dir] – all folder locations
[settings] – DPI, worker counts, force rebuild flag
[google_api], [cerebras_api] – model, temp, tokens, key var names
[prompts.*] – editable multi-paragraph prompts for every LLM stage
[f5_tts_settings] – TTS model name and worker threads
[retry] – global max-retries & back-off seconds

Dynamic Configuration: All binaries read configurations at runtime, enabling directory restructuring, model switching, and prompt modifications without recompilation.

Troubleshooting

Issue	Solution
🔨 Missing binary	Run `make build` or install system dependencies
🔑 Missing API key	Binary indicates required environment variable
🚫 HTTP 429 errors	Automatic retry with exponential backoff
💥 Partial runs/crashes	Rerun binary; completed outputs skipped unless `--force`
🐛 Pipeline issues	Check logs in `logs/` directory
🔍 Debug mode	Use `--verbose` flag for detailed output

Development and Testing

Quick Commands

make help          # Show all available targets
make build         # Build all binaries
make test          # Run full testing pipeline
make test-quick    # Run essential checks only  
make lint          # Run linters on all code
make fmt           # Format all code
make clean         # Clean build artifacts
make ci            # Full CI pipeline (clean, format, lint, test, build)

Quality Assurance

✅ Comprehensive testing: Unit tests, integration tests, and performance benchmarks
🔍 Static analysis: golangci-lint, staticcheck, go vet
📏 Code formatting: gofmt, goimports
🔨 Shell script validation: shellcheck for all Bash scripts
📊 Code coverage: Tracked and reported
⚡ Performance profiling: CPU and memory profiling available

Development Workflow

make dev           # Quick development cycle (format + test-quick)
make test-profile  # Run tests with profiling enabled
make metrics       # Show code quality metrics

Contributing

Code Standards

Go Code: Must pass go fmt, go vet, and golangci-lint
Bash Scripts: Must pass shellcheck validation
Design Principles: Follow guidelines in DESIGN_PRINCIPLES_GUIDE.md
Documentation: Update project.toml docs for new configuration options

Requirements

✅ All new functionality requires tests in cmd/*/main_test.go
✅ Code must follow the established patterns and conventions
✅ PRs must pass the full CI pipeline (make ci)
✅ Changes should maintain backward compatibility

Testing Standards

Unit tests for all public functions
Integration tests for pipeline components
Performance benchmarks for critical paths
Error case coverage

License

This project follows modern software engineering practices with comprehensive testing, linting, and quality assurance. All code adheres to established coding standards and design principles for maintainability and reliability.

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
.github/workflows		.github/workflows
cmd		cmd
examples		examples
internal		internal
logs/pipeline		logs/pipeline
scripts		scripts
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.python-version		.python-version
BASH_CODING_STANDARD.md		BASH_CODING_STANDARD.md
DESIGN_PRINCIPLES_GUIDE.md		DESIGN_PRINCIPLES_GUIDE.md
GO_CODING_STANDARD.md		GO_CODING_STANDARD.md
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum
package-lock.json		package-lock.json
project.toml		project.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Document-to-Audiobook Pipeline

Status

Pipeline Overview

Repository Structure

Quick Start (Fedora 42)

1. System Dependencies

2. F5-TTS Setup

3. Project Setup

4. API Configuration

5. Configuration

Pipeline Components

Usage

Typical Workflow

Configuration Guide

Key Configuration Sections in `project.toml`:

Troubleshooting

Development and Testing

Quick Commands

Quality Assurance

Development Workflow

Contributing

Code Standards

Requirements

Testing Standards

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Document-to-Audiobook Pipeline

Status

Pipeline Overview

Repository Structure

Quick Start (Fedora 42)

1. System Dependencies

2. F5-TTS Setup

3. Project Setup

4. API Configuration

5. Configuration

Pipeline Components

Usage

Typical Workflow

Configuration Guide

Key Configuration Sections in project.toml:

Troubleshooting

Development and Testing

Quick Commands

Quality Assurance

Development Workflow

Contributing

Code Standards

Requirements

Testing Standards

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Key Configuration Sections in `project.toml`:

Packages