Agents Guide: Voice Typing Project

Overview

This repository contains a collection of bash scripts for privacy-focused, offline voice typing. It uses OpenAI's whisper or whisper.cpp for speech-to-text and ydotool to type the transcribed text into the active window.

Core Components

voice_typing: A standalone script that runs the whisper CLI for every audio clip. Best for occasional use and low resource consumption.
voice_client_local: A client/server setup where a local whisper.cpp server is managed as a user service (whisper.service). Faster than voice_typing but uses more resources.
voice_client: A client script designed to connect to a remote whisper.cpp server.
llama_edit: An optional text-correction utility that uses an LLM (via a llama.cpp server) to refine transcribed text, e.g., removing filler words such as "um" and "uh", resolving self-corrections like "oops, I mean...", and fixing grammar.

Modes of Operation

| Mode | Script | Backend | Connection Type |
| --- | --- | --- | --- |
| Standalone | `voice_typing` | `whisper` CLI | Local process |
| Local Client/Server | `voice_client_local` | `whisper.cpp` | Local `whisper.service` |
| Remote Client/Server | `voice_client` | `whisper.cpp` | Networked server |

Architecture & Patterns

Producer-Consumer Pattern

All main scripts use a FIFO-based producer-consumer architecture:

Producer: A background loop records audio using sox (rec) and writes the filename to a named pipe (FIFO).
Consumer: A foreground loop reads filenames from the FIFO, processes them (transcription), and then types the result.
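
The pattern can be sketched in a few lines of bash. This is a minimal illustration under stated assumptions, not the repository's code: the producer only simulates recording by creating empty placeholder files, and the names producer, consumer, and clips.fifo are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch of a FIFO-based producer-consumer loop.
# Assumption: recording and transcription are replaced by placeholders.
set -u

workdir=$(mktemp -d)
fifo="$workdir/clips.fifo"
mkfifo "$fifo"

# Producer: "records" three clips and announces each filename on the FIFO.
producer() {
    local i clip
    for i in 1 2 3; do
        clip="$workdir/clip_$i.wav"
        : > "$clip"                # placeholder for: rec "$clip" silence ...
        printf '%s\n' "$clip"
    done
}

# Consumer: reads one filename per line, handles it, then cleans up.
consumer() {
    local clip
    while IFS= read -r clip; do
        # placeholder for transcription and typing
        printf 'processing %s\n' "$(basename "$clip")"
        rm -f "$clip"
    done
}

producer > "$fifo" &            # background writer
output=$(consumer < "$fifo")    # foreground reader; ends when the writer closes
wait
rm -rf "$workdir"
printf '%s\n' "$output"
```

Because the FIFO carries only filenames, the recorder can keep capturing the next clip while the previous one is still being transcribed.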

Service Management

The client/server scripts rely on systemd user services:

whisper.service: Manages the whisper.cpp server.
llama.service: Manages the llama.cpp server for llama_edit.
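
A client script can make sure a user service is running before it sends any requests. The helper below is a hedged sketch: ensure_service is a hypothetical name, and whether the repository's scripts start the services themselves is not stated here.

```shell
# Start a systemd user unit only if it is not already active.
# ensure_service is a hypothetical helper, not part of the repository.
ensure_service() {
    local unit=$1
    if ! systemctl --user is-active --quiet "$unit"; then
        systemctl --user start "$unit"
    fi
}
```

For example, `ensure_service whisper.service` could be called before the first transcription request, and `ensure_service llama.service` before the first correction request.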

Text Injection

Text is injected into the OS using ydotool. This requires the ydotoold daemon to be running and the YDOTOOL_SOCKET environment variable to be correctly set.
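
A small wrapper can verify the socket before typing, so a missing daemon produces a clear message instead of a connection error. This is a sketch only: type_text is a hypothetical helper name, and the fallback path is the socket location mentioned in the troubleshooting section.

```shell
# Type text into the active window, but only if the ydotoold socket exists.
# type_text is a hypothetical helper, not a function from the repository.
type_text() {
    local text=$1
    local sock=${YDOTOOL_SOCKET:-/tmp/.ydotool_socket}
    if [ ! -S "$sock" ]; then
        echo "ydotoold socket not found at $sock" >&2
        return 1
    fi
    YDOTOOL_SOCKET=$sock ydotool type "$text"
}
```

Calling `type_text "hello world"` would then type the text, or fail early when the daemon is not running.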

Essential Commands

Running the tools

# Standalone mode
./voice_typing

# Standalone mode with LLM text correction
./voice_typing -flow

# Local client/server mode
./voice_client_local

# Local client/server mode with LLM text correction
./voice_client_local -flow

# Remote client/server mode
./voice_client

# Remote client/server mode with LLM text correction
./voice_client -flow

Testing `llama_edit`

./llama_edit "Your uncorrected text here"

Dependencies & Requirements

Audio: sox, lame, ffmpeg
Transcription: whisper (OpenAI) or whisper.cpp
Text Injection: ydotool
Data Processing: jq, curl

Gotchas & Troubleshooting

`ydotool` Permissions

If you encounter the error "failed to connect socket '/tmp/.ydotool_socket': Permission denied", ensure that:

The user is in the input group.
ydotoold is running.
export YDOTOOL_SOCKET=/tmp/.ydotool_socket is in your .bashrc.
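If the error persists, you may need to run sudo chmod +s $(which ydotool). This sets the setuid bit, so ydotool then runs with its owner's privileges (typically root).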
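
The first three checks can be automated. The script below is a rough diagnostic sketch; list_contains and check_ydotool are hypothetical names, and it assumes pgrep is available on the system.

```shell
# Succeeds if word $1 appears in the space-separated list $2.
list_contains() {
    case " $2 " in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# Print one line per failed check; return non-zero if any check failed.
check_ydotool() {
    local sock=${YDOTOOL_SOCKET:-/tmp/.ydotool_socket}
    local status=0
    if ! list_contains input "$(id -nG)"; then
        echo "user is not in the input group"
        status=1
    fi
    if ! pgrep -x ydotoold >/dev/null; then
        echo "ydotoold is not running"
        status=1
    fi
    if [ ! -S "$sock" ]; then
        echo "no socket at $sock"
        status=1
    elif [ ! -w "$sock" ]; then
        echo "socket $sock is not writable by this user"
        status=1
    fi
    return "$status"
}
```

Run `check_ydotool` in the same shell that launches the voice typing script. Note that a newly added group membership only takes effect after logging in again.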

Silence Detection

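The rec command uses silence detection thresholds to decide when a clip starts and ends. If recording (and therefore transcription) doesn't start, or stops too early:

Adjust the silence thresholds in the rec command (e.g., silence 1 0.2 4% 1 1.0 2%).
Raising the percentages (e.g., to 6% or 8%) makes the detector treat louder background noise as silence, which helps in noisy rooms. Lowering them helps when quiet speech fails to trigger recording.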
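
To make the example values easier to tune, the thresholds can be passed as parameters. In the sketch below, silence_args and record_clip are hypothetical helpers. The first triple (1 0.2 4%) starts recording once the signal stays above 4% for 0.2 seconds; the second (1 1.0 2%) stops it after 1.0 second below 2%.

```shell
# Print the sox "silence" effect arguments for the given thresholds.
# Arguments: start threshold percent (default 4), stop threshold percent (default 2).
silence_args() {
    local start_pct=${1:-4} stop_pct=${2:-2}
    printf 'silence 1 0.2 %s%% 1 1.0 %s%%\n' "$start_pct" "$stop_pct"
}

# Record one clip to the given file using those thresholds.
record_clip() {
    local out=$1
    shift
    # Word splitting of the helper's output is intentional here.
    # shellcheck disable=SC2046
    rec -q "$out" $(silence_args "$@")
}
```

For a noisy room, `record_clip clip.wav 8 6` would raise both thresholds while keeping the durations unchanged.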

Server Configuration

voice_client: By default, it targets http://127.0.0.1:7777. If using a remote server, you must edit the script to point to the correct IP/hostname.
llama_edit: Targets http://127.0.0.1:8087/v1/chat/completions. Ensure your llama.service is configured to listen on this port.
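
The two endpoints can be exercised directly with curl and jq. This is a sketch under assumptions rather than the scripts' actual code: the environment variable names, function names, and prompt wording are invented, and it assumes the whisper.cpp server exposes the /inference endpoint of whisper.cpp's example server and that the llama.cpp server speaks the OpenAI-compatible chat format.

```shell
# Hypothetical overrides; the document says the scripts themselves must be edited.
WHISPER_URL=${WHISPER_URL:-http://127.0.0.1:7777}
LLAMA_URL=${LLAMA_URL:-http://127.0.0.1:8087/v1/chat/completions}

# Send one audio file to the whisper.cpp server and print the transcript.
transcribe_clip() {
    curl -sS --fail "$WHISPER_URL/inference" \
        -F "file=@$1" -F response_format=json |
        jq -r '.text'
}

# Ask the llama.cpp server to clean up a transcript and print the result.
correct_text() {
    jq -n --arg text "$1" '{
        messages: [
            {role: "system", content: "Remove filler words and fix grammar. Reply with the corrected text only."},
            {role: "user", content: $text}
        ]
    }' |
        curl -sS --fail -H 'Content-Type: application/json' -d @- "$LLAMA_URL" |
        jq -r '.choices[0].message.content'
}
```

With both services running, `correct_text "$(transcribe_clip clip.wav)"` would chain the two steps, and setting WHISPER_URL to a remote host would redirect transcription without touching the functions.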

Transcription Noise

Whisper models sometimes emit phrases such as "Thanks for watching!" or other artifacts, typically on silent or noisy clips. The scripts include logic to filter these out based on string matching or length checks.
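
A filter of this kind might look like the sketch below. The function name is_noise, the phrase list, and the minimum length of two characters are illustrative assumptions, not the values used by the scripts.

```shell
# Return 0 if the transcript looks like a model artifact, 1 if it should be typed.
# is_noise, the phrase list, and the length cutoff are assumptions.
is_noise() {
    local text=$1
    # Trim leading and trailing whitespace.
    text="${text#"${text%%[![:space:]]*}"}"
    text="${text%"${text##*[![:space:]]}"}"
    # Length check: empty or single-character results carry no content.
    if [ "${#text}" -lt 2 ]; then
        return 0
    fi
    # String matching against known artifacts.
    case "$text" in
        "Thanks for watching!"|"[BLANK_AUDIO]"|"(silence)") return 0 ;;
    esac
    return 1
}
```

In a consumer loop this would be used as `is_noise "$text" || ydotool type "$text"`, so filtered results are silently dropped.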
