Complete reference for all configuration options in Voxtype.
Voxtype looks for configuration in the following locations (in order):
- Path specified via the `-c`/`--config` flag
- `~/.config/voxtype/config.toml` (XDG config directory)
- `/etc/voxtype/config.toml` (system-wide default)
- Built-in defaults
Type: String
Default: "whisper"
Required: No
Selects which speech-to-text engine to use for transcription.
Values:
- `whisper` - OpenAI Whisper via whisper.cpp (default, recommended)
- `parakeet` - NVIDIA Parakeet via ONNX Runtime (experimental, requires a special binary)
Example:
engine = "whisper"
CLI override:
voxtype --engine parakeet daemon
Notes:
- Parakeet requires a Parakeet-enabled binary (`voxtype-*-parakeet-*`)
- When using Parakeet, you must also configure the `[parakeet]` section
- See PARAKEET.md for detailed Parakeet setup instructions
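Putting this together, a minimal Parakeet configuration might look like the following (using the documented default model name):

```toml
# Requires a Parakeet-enabled binary (voxtype-*-parakeet-*)
engine = "parakeet"

[parakeet]
model = "parakeet-tdt-0.6b-v3"
```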
Controls which key triggers push-to-talk recording.
Type: String
Default: "SCROLLLOCK"
Required: No
The main key to hold for recording. Must be a valid Linux evdev key name.
Common values:
- `SCROLLLOCK` - Scroll Lock key (recommended)
- `PAUSE` - Pause/Break key
- `RIGHTALT` - Right Alt key
- `F13` through `F24` - Extended function keys
- `INSERT` - Insert key
- `HOME` - Home key
- `END` - End key
- `PAGEUP` - Page Up key
- `PAGEDOWN` - Page Down key
- `DELETE` - Delete key
Example:
[hotkey]
key = "PAUSE"
Finding key names:
sudo evtest
# Select keyboard, press desired key, note the KEY_XXXX name
Type: Array of strings
Default: []
Required: No
Additional modifier keys that must be held along with the main key.
Valid modifiers:
- `LEFTCTRL`, `RIGHTCTRL`
- `LEFTALT`, `RIGHTALT`
- `LEFTSHIFT`, `RIGHTSHIFT`
- `LEFTMETA`, `RIGHTMETA`
Example:
[hotkey]
key = "SCROLLLOCK"
modifiers = ["LEFTCTRL"] # Requires Ctrl+ScrollLock
Type: String
Default: "push_to_talk"
Required: No
Activation mode for the hotkey.
Values:
- `push_to_talk` - Hold hotkey to record, release to transcribe (default)
- `toggle` - Press hotkey once to start recording, press again to stop
Example:
[hotkey]
key = "SCROLLLOCK"
mode = "toggle" # Press to start, press again to stop
Type: Boolean
Default: true
Required: No
Enable or disable the built-in hotkey detection.
When set to false, voxtype will not listen for keyboard events via evdev. Instead, use the voxtype record command to control recording from external sources like compositor keybindings.
When to disable:
- You prefer using your compositor's native keybindings (Hyprland, Sway)
- You don't want to add your user to the `input` group
- You want to use key combinations not supported by evdev (e.g., Super+V)
Example:
[hotkey]
enabled = false # Use compositor keybindings instead
Usage with compositor keybindings:
When enabled = false, control recording via CLI:
voxtype record start # Start recording
voxtype record start --file=out.txt # Write transcription to a file
voxtype record start --file # Write to file_path from config
voxtype record stop # Stop and transcribe
voxtype record toggle # Toggle recording state
Bind these commands in your compositor config:
Hyprland:
bind = SUPER, V, exec, voxtype record start
bindr = SUPER, V, exec, voxtype record stop
Sway:
bindsym $mod+v exec voxtype record start
bindsym --release $mod+v exec voxtype record stop
Note: For toggle mode to work correctly, you must also set state_file = "auto" so voxtype can track its current state.
See User Manual - Compositor Keybindings for complete setup instructions.
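For toggle-style control, a single compositor keybinding can invoke the toggle command instead of a start/stop pair (Hyprland syntax; the key choice here is just an example):

```
bind = SUPER, V, exec, voxtype record toggle
```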
Type: String Default: None (disabled) Required: No
Optional modifier key that triggers the secondary model when held while pressing the hotkey. Requires secondary_model to be set in the [whisper] section.
Example:
[hotkey]
key = "SCROLLLOCK"
model_modifier = "LEFTSHIFT" # Hold Shift + hotkey for secondary model
[whisper]
model = "base.en"
secondary_model = "large-v3-turbo"
Valid key names: Same modifier keys as the modifiers option:
- `LEFTSHIFT`, `RIGHTSHIFT`
- `LEFTCTRL`, `RIGHTCTRL`
- `LEFTALT`, `RIGHTALT`
- `LEFTMETA`, `RIGHTMETA`
Note: This only applies when using evdev hotkey detection (enabled = true). When using compositor keybindings, use voxtype record start --model <model> instead.
Type: String Default: None (disabled) Required: No
Optional key to cancel recording or transcription in progress. When pressed, any active recording is discarded and any in-progress transcription is aborted. No text is output.
Example:
[hotkey]
key = "SCROLLLOCK"
cancel_key = "ESC" # Press Escape to cancel
Valid key names: Same as the key option - any valid Linux evdev key name.
Common cancel keys:
- `ESC` - Escape key
- `BACKSPACE` - Backspace key
- `F12` - Function key
Note: This only applies when using evdev hotkey detection (enabled = true). When using compositor keybindings, use voxtype record cancel instead. See User Manual - Canceling Transcription.
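When driving recording from compositor keybindings, cancellation can be bound the same way as start/stop (Hyprland syntax; the key choice is an example):

```
bind = SUPER, C, exec, voxtype record cancel
```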
Controls audio capture settings.
Type: String
Default: "default"
Required: No
The audio input device to use. Use "default" for the system default microphone.
Finding device names:
pactl list sources short
Example:
[audio]
device = "alsa_input.usb-Blue_Microphones_Yeti-00.analog-stereo"
Type: Integer
Default: 16000
Required: No
Audio sample rate in Hz. Whisper expects 16000 Hz; other rates will be resampled.
Recommended: Keep at 16000 unless your hardware requires otherwise.
[audio]
sample_rate = 16000
Type: Integer
Default: 60
Required: No
Maximum recording duration in seconds. Recording automatically stops after this limit as a safety measure.
Example:
[audio]
max_duration_secs = 120 # Allow 2-minute recordings
Controls audio feedback sounds (beeps when recording starts/stops).
Type: Boolean
Default: false
Required: No
When true, plays audio cues when recording starts and stops.
Type: String
Default: "default"
Required: No
Sound theme to use for audio feedback.
Built-in themes:
- `default` - Clear, pleasant two-tone beeps
- `subtle` - Quiet, unobtrusive clicks
- `mechanical` - Typewriter/keyboard-like sounds
Custom themes: Specify a path to a directory containing start.wav, stop.wav, and error.wav files.
Type: Float
Default: 0.7
Required: No
Volume level for audio feedback, from 0.0 (silent) to 1.0 (full volume).
Example:
[audio.feedback]
enabled = true
theme = "subtle"
volume = 0.5
Custom theme example:
[audio.feedback]
enabled = true
theme = "/home/user/.config/voxtype/sounds"
volume = 0.8
Controls the Whisper speech-to-text engine.
Type: String
Default: "local"
Required: No
Selects the transcription backend.
Values:
- `local` - Use whisper.cpp locally via FFI bindings (default, fully offline)
- `remote` - Send audio to a remote server for transcription
- `cli` - Use a whisper-cli subprocess (fallback for systems where FFI crashes)
Privacy Notice: When using the `remote` backend, audio is transmitted over the network. See User Manual - Remote Whisper Servers for privacy considerations.
When to use cli backend:
The cli backend is a workaround for systems where the whisper-rs FFI bindings crash due to C++ exceptions crossing the FFI boundary. This affects some systems with glibc 2.42+ (e.g., Ubuntu 25.10). If voxtype crashes during transcription, try the cli backend.
Requires whisper-cli from whisper.cpp.
Examples:
[whisper]
backend = "remote"
remote_endpoint = "http://192.168.1.100:8080"

[whisper]
backend = "cli"
whisper_cli_path = "/usr/local/bin/whisper-cli" # Optional
Type: String
Default: "base.en"
Required: No
Which Whisper model to use for transcription.
Model names:
| Value | Size | Speed | Accuracy | Notes |
|---|---|---|---|---|
| `tiny` | 39 MB | Fastest | Good | Multilingual |
| `tiny.en` | 39 MB | Fastest | Better | English only |
| `base` | 142 MB | Fast | Better | Multilingual |
| `base.en` | 142 MB | Fast | Good | English only (default) |
| `small` | 466 MB | Medium | Great | Multilingual |
| `small.en` | 466 MB | Medium | Great | English only |
| `medium` | 1.5 GB | Slow | Excellent | Multilingual |
| `medium.en` | 1.5 GB | Slow | Excellent | English only |
| `large-v3` | 3.1 GB | Slowest | Best | Multilingual |
| `large-v3-turbo` | 1.6 GB | Fast | Excellent | Multilingual, GPU recommended |
Custom model path:
[whisper]
model = "/home/user/models/custom-whisper.bin"
Type: String or Array of Strings
Default: "en"
Required: No
Language code for transcription. Supports three modes:
- Single language - Use a specific language for all transcriptions
- Auto-detect - Let Whisper detect from all ~99 supported languages
- Constrained auto-detect - Detect from a specific set of allowed languages
Common values:
- `"en"` - English
- `"auto"` - Auto-detect from all languages
- `"es"` - Spanish
- `"fr"` - French
- `"de"` - German
- `"ja"` - Japanese
- `"zh"` - Chinese
- `["en", "fr"]` - Auto-detect between English and French only
Examples:
[whisper]
# Single language (fastest, most accurate for monolingual use)
language = "en"
# Auto-detect from all languages
language = "auto"
# Constrained auto-detect (recommended for multilingual users)
# Whisper sometimes misdetects language for short sentences.
# This limits detection to your known languages for better accuracy.
language = ["en", "fr"]
# Works with any number of languages
language = ["en", "fr", "de", "es"]
When to use constrained auto-detect:
- You regularly speak in 2-3 languages
- Whisper misdetects language for short sentences
- You want faster detection than full auto-detect
Note: Remote backends (OpenAI API) don't support language arrays. When using remote backend with an array, the first language is used.
Type: Boolean
Default: false
Required: No
When true, translates non-English speech to English.
Example:
[whisper]
language = "auto"
translate = true # Translate everything to English
Type: Integer Default: Auto-detected Required: No
Number of CPU threads for Whisper inference. If omitted, automatically detects optimal thread count.
Example:
[whisper]
threads = 4 # Limit to 4 threads
Tip: For best performance, set to your physical core count (not hyperthreads).
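On systems with util-linux, one way to check the physical core count is to count unique core/socket pairs reported by lscpu (a sketch; output format may vary by platform):

```shell
# Count physical cores: unique (core, socket) pairs, skipping lscpu's comment header
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```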
Type: Boolean
Default: false
Required: No
Controls when the Whisper model is loaded into memory.
Values:
- `false` (default) - Model is loaded at daemon startup and kept in memory. Provides the fastest response times but uses memory/VRAM continuously.
- `true` - Model is loaded when recording starts and unloaded after transcription completes. Saves memory/VRAM but adds a brief delay when starting each recording.
When to use on_demand_loading = true:
- Running on a memory-constrained system
- Using GPU acceleration and want to free VRAM for other applications
- Running multiple GPU-accelerated applications simultaneously
- Using large models (medium, large-v3) that consume significant memory
When to keep default (false):
- Want the fastest possible response time
- Have plenty of available memory/VRAM
- Using voxtype frequently throughout the day
Example:
[whisper]
model = "large-v3"
on_demand_loading = true # Free VRAM when not transcribing
Performance note: On modern systems with SSDs, model loading typically takes under 1 second for base/small models. Larger models (medium, large-v3) may take 2-3 seconds to load.
Type: Boolean
Default: false
Required: No
GPU memory isolation mode. When enabled, transcription runs in a subprocess that exits after each recording, fully releasing GPU memory between transcriptions.
Values:
- `false` (default) - Model stays loaded in the daemon process. Fastest response, but GPU memory is held continuously.
- `true` - Model loads in a subprocess when recording starts; the subprocess exits after transcription. Releases all GPU/VRAM between recordings.
When to use gpu_isolation = true:
- Laptops with hybrid graphics (NVIDIA Optimus, AMD switchable)
- You want the discrete GPU to power down when not transcribing
- Battery life is a priority
- You're running other GPU-intensive applications alongside Voxtype
When to keep default (false):
- Desktop systems with dedicated GPUs
- You transcribe frequently and want zero latency
- Power consumption is not a concern
Performance impact:
Benchmarks on AMD Radeon RX 7800 XT with large-v3-turbo:
| Mode | Transcription Latency | Idle RAM | Idle GPU Memory |
|---|---|---|---|
| Standard (`false`) | 0.49s avg | ~1.6 GB | 409 MB |
| GPU Isolation (`true`) | 0.50s avg | 0 | 0 |
The model loads while you speak (0.38-0.42s), so the additional latency after recording stops is only ~10ms (2%). Because model loading overlaps with speaking time, the delay should be barely perceptible.
Example:
[whisper]
model = "large-v3-turbo"
gpu_isolation = true # Release GPU memory between transcriptions
Note: This setting only applies when using the local whisper backend (backend = "local"). It has no effect with remote transcription since no local GPU is used.
Type: Boolean
Default: false
Required: No
Optimizes Whisper's context window size for short recordings. When enabled, clips under 22.5 seconds use a smaller context window proportional to their length, speeding up transcription. Also sets no_context=true to prevent phrase repetition.
Values:
- `false` (default) - Use Whisper's full 30-second context window (1500 tokens). Most compatible.
- `true` - Use an optimized context window for short clips. Faster, but may cause issues with some models.
Performance impact:
| Mode | ~1.5s clip (CPU) | ~1.5s clip (GPU) |
|---|---|---|
| Enabled (`true`) | ~8s | ~0.28s |
| Disabled (`false`) | ~15s | ~0.46s |
The optimization provides roughly 1.6-1.9x speedup for short recordings on both CPU and GPU.
When to enable (true):
- You want faster transcription for short clips
- Your model doesn't exhibit repetition issues (test before enabling)
- You're using smaller models (tiny, base, small) which are more stable
When to keep disabled (false):
- You use large-v3 or large-v3-turbo models (known repetition issues)
- You experience phrase repetition like "word word word"
- You want maximum compatibility across all models
Example:
[whisper]
model = "base.en"
context_window_optimization = true # Enable for faster transcription
CLI override:
voxtype --whisper-context-optimization daemon
Note: This setting only applies when using the local whisper backend (backend = "local"). It has no effect with remote transcription.
Type: String Default: None (empty) Required: No
Provides context to Whisper to improve transcription accuracy for domain-specific vocabulary. The prompt hints at terminology, proper nouns, or formatting conventions that Whisper should expect in the audio.
Why use it:
Whisper sometimes mistranscribes uncommon words, especially:
- Technical jargon (Kubernetes, TypeScript, PostgreSQL)
- Company or product names (Voxtype, Hyprland, Waybar)
- People's names (especially non-English names)
- Acronyms and abbreviations (API, CLI, LLM)
- Domain-specific terms (medical, legal, scientific)
By providing an initial prompt with these terms, Whisper is more likely to recognize and transcribe them correctly.
Example:
[whisper]
model = "base.en"
initial_prompt = "Technical discussion about Rust, TypeScript, and Kubernetes."
More examples:
# Software development context
initial_prompt = "Voxtype, Hyprland, Waybar, Sway, wtype, ydotool, systemd, journalctl."
# Medical dictation
initial_prompt = "Medical notes. Terms: hypertension, myocardial infarction, CT scan, MRI."
# Meeting with specific attendees
initial_prompt = "Meeting with Zhang Wei, François Dupont, and Priya Sharma."
CLI override:
voxtype --initial-prompt "Discussion about Kubernetes and Terraform" daemon
Tips:
- Keep prompts concise (a few words or a short sentence)
- List specific terms you expect to appear in your dictation
- Update the prompt when your context changes (different project, different domain)
- The prompt doesn't need to be grammatically correct—a list of terms works well
Note: This setting only applies when using the local whisper backend (backend = "local"). Remote servers may ignore the initial_prompt parameter.
Type: String Default: None (disabled) Required: No
A secondary Whisper model that can be triggered on-demand using the model_modifier hotkey or the --model CLI flag. Useful for having a fast model for everyday use and a more accurate model available when needed.
Example:
[hotkey]
model_modifier = "LEFTSHIFT"
[whisper]
model = "base.en" # Fast model for everyday use
secondary_model = "large-v3-turbo" # Accurate model when needed
Usage:
- Hold `model_modifier` while pressing the hotkey to use the secondary model
- Or use the CLI: `voxtype record start --model large-v3-turbo`
Type: Array of strings
Default: []
Required: No
Additional models that can be requested via the --model CLI flag. The primary model and secondary_model are always available; this list adds more options.
Example:
[whisper]
model = "base.en"
secondary_model = "large-v3-turbo"
available_models = ["medium.en", "small.en"] # Additional models for CLI
Usage:
voxtype record start --model medium.en
Note: Models must be downloaded before use. Run `voxtype setup --download --model <name>` to download.
Type: Integer
Default: 2
Required: No
Maximum number of models to keep loaded in memory simultaneously. When this limit is reached and a new model is requested, the least recently used non-primary model is evicted.
Example:
[whisper]
model = "base.en"
secondary_model = "large-v3-turbo"
max_loaded_models = 3 # Keep up to 3 models in memory
Notes:
- The primary model is never evicted
- Only applies when `gpu_isolation = false` (subprocess mode doesn't cache models)
- Higher values use more memory but reduce model loading latency
Type: Integer
Default: 300 (5 minutes)
Required: No
Time in seconds after which idle non-primary models are automatically evicted from memory. Set to 0 to disable auto-eviction.
Example:
[whisper]
model = "base.en"
secondary_model = "large-v3-turbo"
cold_model_timeout_secs = 60 # Evict unused models after 1 minute
Notes:
- Only evicts models that haven't been used within the timeout period
- The primary model is never evicted
- Helps free memory when switching models infrequently
The following options are used when backend = "remote". They have no effect when using local transcription.
Privacy Notice: Remote transcription sends your audio over the network. This feature was designed for users who self-host Whisper servers on their own hardware. While it can also connect to cloud services like OpenAI, users with privacy concerns should carefully consider the implications. See User Manual - Remote Whisper Servers for details.
Type: String
Default: None
Required: Yes (when backend = "remote")
The base URL of the remote Whisper server. Must include the protocol (http:// or https://).
Examples:
[whisper]
backend = "remote"
# Self-hosted whisper.cpp server
remote_endpoint = "http://192.168.1.100:8080"
# OpenAI API
remote_endpoint = "https://api.openai.com"
Security note: Voxtype logs a warning if you use HTTP (unencrypted) for non-localhost endpoints, as your audio would be transmitted in the clear.
Type: String
Default: "whisper-1"
Required: No
The model name to send to the remote server.
- For the whisper.cpp server: this is ignored (the server uses whatever model it was started with)
- For the OpenAI API: must be `"whisper-1"`
- For other providers: check their documentation
Example:
[whisper]
backend = "remote"
remote_endpoint = "https://api.openai.com"
remote_model = "whisper-1"
Type: String Default: None Required: No (depends on server)
API key for authenticating with the remote server. Sent as a Bearer token in the Authorization header.
Recommendation: Use the VOXTYPE_WHISPER_API_KEY environment variable instead of putting keys in your config file.
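If voxtype runs as a systemd user service, the variable can be supplied via a drop-in rather than a shell profile (the unit name and drop-in path here are assumptions; adjust to your setup):

```ini
# ~/.config/systemd/user/voxtype.service.d/api-key.conf (hypothetical path)
[Service]
Environment=VOXTYPE_WHISPER_API_KEY=sk-...
```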
Example using environment variable:
export VOXTYPE_WHISPER_API_KEY="sk-..."
Example in config (less secure):
[whisper]
backend = "remote"
remote_endpoint = "https://api.openai.com"
remote_api_key = "sk-..."
Type: Integer
Default: 30
Required: No
Maximum time in seconds to wait for the remote server to respond. Increase for slow networks or when transcribing long audio.
Example:
[whisper]
backend = "remote"
remote_endpoint = "http://192.168.1.100:8080"
remote_timeout_secs = 60 # 60 second timeout for long recordings
Type: String Default: Auto-detected from PATH Required: No
Path to the whisper-cli binary. Only used when backend = "cli".
If not specified, voxtype searches for whisper-cli or whisper in:
- Your `$PATH`
- Common system locations (`/usr/local/bin`, `/usr/bin`)
- The current directory
- `~/.local/bin`
Example:
[whisper]
backend = "cli"
whisper_cli_path = "/opt/whisper.cpp/build/bin/whisper-cli"
Installing whisper-cli:
Build from source at github.com/ggerganov/whisper.cpp:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build --config Release
sudo cp build/bin/whisper-cli /usr/local/bin/
Configuration for the Parakeet speech-to-text engine. This section is only used when engine = "parakeet".
Note: Parakeet support is experimental. See PARAKEET.md for detailed setup instructions.
Type: String
Default: "parakeet-tdt-0.6b-v3"
Required: No
The Parakeet model to use. Can be a model name (looked up in ~/.local/share/voxtype/models/) or an absolute path to a model directory.
Example:
[parakeet]
model = "parakeet-tdt-0.6b-v3"
Using an absolute path:
[parakeet]
model = "/opt/models/parakeet-tdt-0.6b-v3"
Type: String Default: Auto-detected from model files Required: No
The model architecture type. Usually auto-detected based on files present in the model directory.
Values:
- `tdt` - Token-Duration-Transducer (recommended, proper punctuation)
- `ctc` - Connectionist Temporal Classification (faster, character-level)
Example:
[parakeet]
model = "parakeet-tdt-0.6b-v3"
model_type = "tdt"
Type: Boolean
Default: false
Required: No
Same behavior as [whisper].on_demand_loading. When true, loads the model only when recording starts and unloads after transcription.
Example:
[parakeet]
model = "parakeet-tdt-0.6b-v3"
on_demand_loading = true # Free memory when not transcribing

engine = "parakeet"
[parakeet]
model = "parakeet-tdt-0.6b-v3"
on_demand_loading = false # Keep model loaded for fast response
Controls how transcribed text is delivered.
Type: String
Default: "type"
Required: No
Primary output method.
Values:
- `type` - Simulate keyboard input at the cursor position (uses wtype, dotool, or ydotool)
- `clipboard` - Copy text to the clipboard (requires wl-copy)
- `paste` - Copy to the clipboard, then simulate a paste keystroke (requires wl-copy, and wtype, dotool, or ydotool)
- `file` - Write the transcription to a file (requires `file_path` to be set)
Example:
[output]
mode = "paste"
Example (file output):
[output]
mode = "file"
file_path = "~/transcriptions/output.txt"
file_mode = "append"
Note about wtype compatibility:
wtype does not work on KDE Plasma or GNOME Wayland because these compositors don't support the virtual keyboard protocol. On these desktops, voxtype automatically falls back to dotool (if installed) or ydotool. For ydotool, the daemon must be running (systemctl --user enable --now ydotool). See Troubleshooting for details.
Note about non-US keyboard layouts:
For non-US keyboard layouts (German QWERTZ, French AZERTY, etc.), dotool is recommended over ydotool. Set dotool_xkb_layout to your layout code (e.g., "de" for German). ydotool does not support keyboard layouts and will produce incorrect characters (e.g., 'y' and 'z' swapped on German layouts).
Note about paste mode:
The paste mode is an alternative for non-US keyboard layouts. Instead of typing characters directly, it copies text to the clipboard and simulates a paste keystroke. This works regardless of keyboard layout but overwrites your clipboard. Requires wl-copy for clipboard access.
Type: String
Default: "ctrl+v"
Required: No
Keystroke to simulate for paste mode. Change this if your environment uses a different paste shortcut.
Format: "modifier+key" or "modifier+modifier+key" (case-insensitive)
Common values:
- `"ctrl+v"` - Standard paste (default)
- `"shift+insert"` - Universal paste for Hyprland/Omarchy
- `"ctrl+shift+v"` - Some terminal emulators
Example:
[output]
mode = "paste"
paste_keys = "shift+insert" # For Hyprland/Omarchy
Supported keys:
- Modifiers: `ctrl`, `shift`, `alt`, `super` (also `leftctrl`, `rightctrl`, etc.)
- Letters: `a`-`z`
- Special: `insert`, `enter`
Type: Boolean
Default: true
Required: No
When true and mode = "type", falls back to clipboard if typing fails.
Note: This setting is ignored when driver_order is set, since the driver list explicitly defines what's tried.
Example:
[output]
mode = "type"
fallback_to_clipboard = true # Use clipboard if typing drivers fail
Type: Array of strings
Default: ["wtype", "dotool", "ydotool", "clipboard", "xclip"]
Required: No
Custom order of output drivers to try when mode = "type". Each driver is tried in sequence until one succeeds. This allows you to prefer specific drivers or exclude others entirely.
Available drivers:
- `wtype` - Wayland virtual keyboard (best CJK/Unicode support, wlroots compositors only)
- `dotool` - uinput-based typing (supports keyboard layouts; works on X11/Wayland/TTY)
- `ydotool` - uinput-based typing (requires a daemon; X11/Wayland/TTY)
- `clipboard` - Wayland clipboard via wl-copy
- `xclip` - X11 clipboard via xclip
Default behavior (no driver_order set): wtype → dotool → ydotool → clipboard → xclip
Examples:
[output]
mode = "type"
# Prefer ydotool over dotool, skip wtype
driver_order = ["ydotool", "dotool", "clipboard"]
# X11-only setup
driver_order = ["dotool", "ydotool", "xclip"]
# Force single driver (no fallback)
driver_order = ["ydotool"]
# KDE/GNOME Wayland (wtype doesn't work)
driver_order = ["dotool", "ydotool", "clipboard"]
CLI override:
voxtype --driver=ydotool,clipboard daemon
Note: When driver_order is set, fallback_to_clipboard is ignored; the driver list explicitly defines what's tried.
Type: String (optional) Default: None Required: No
Keyboard layout for dotool output driver. Required for non-US keyboard layouts (German, French, etc.) when using dotool as the typing backend.
dotool is automatically used as a fallback when wtype fails (e.g., on GNOME/KDE Wayland). Unlike ydotool, dotool supports keyboard layouts via XKB environment variables.
Common values:
- `"de"` - German (QWERTZ)
- `"fr"` - French (AZERTY)
- `"es"` - Spanish
- `"uk"` - Ukrainian
- `"ru"` - Russian
Example:
[output]
mode = "type"
dotool_xkb_layout = "de" # German keyboard layout
Type: String (optional) Default: None Required: No
Keyboard layout variant for dotool. Use this for layout variations like nodeadkeys.
Example:
[output]
dotool_xkb_layout = "de"
dotool_xkb_variant = "nodeadkeys" # German without dead keys
Type: String (path)
Default: None
Required: Only when mode = "file"
File path for file output mode. When mode = "file", transcriptions are written to this file instead of being typed or copied to clipboard.
This path is also used as the default for the --output-file CLI flag when appending.
Example:
[output]
mode = "file"
file_path = "~/transcriptions/output.txt"
Note: Parent directories are created automatically if they don't exist.
Type: String
Default: "overwrite"
Required: No
Controls how file output handles existing files.
Values:
- `overwrite` - Replace the file contents on each transcription (default)
- `append` - Add the transcription to the end of the file
This setting applies to both config-based file output (mode = "file") and the --output-file CLI flag.
Example:
[output]
mode = "file"
file_path = "~/transcriptions/log.txt"
file_mode = "append" # Build a running log of transcriptions
Controls desktop notifications at various stages.
Type: Boolean
Default: false
Required: No
When true, shows a notification when recording starts (hotkey pressed).
Type: Boolean
Default: false
Required: No
When true, shows a notification when recording stops (transcription begins).
Type: Boolean
Default: true
Required: No
When true, shows a notification with the transcribed text after transcription completes.
Requires: notify-send (libnotify)
Example:
[output.notification]
on_recording_start = true # Notify when PTT activates
on_recording_stop = true # Notify when transcribing
on_transcription = true # Show transcribed text
Type: Integer
Default: 0
Required: No
Delay in milliseconds between each typed character. Increase if characters are being dropped.
Example:
[output]
type_delay_ms = 10 # 10ms delay between characters
Type: Integer
Default: 0
Required: No
Delay in milliseconds before typing starts. This allows the virtual keyboard to initialize and helps prevent the first character from being dropped on some compositors. Try 100-200ms if you experience issues.
Note: When using compositor integration (via `voxtype setup compositor`), best results come from not binding Escape in the submap. Some users have had success with Escape bound by increasing this delay, but the most consistent fix is to use F12 or another key instead.
Example:
[output]
pre_type_delay_ms = 100 # 100ms delay before typing starts
Type: Boolean
Default: false
Required: No
Automatically send an Enter keypress after outputting the transcribed text. Useful for chat applications, command lines, or forms where you want to auto-submit after dictation.
Example:
[output]
auto_submit = true # Press Enter after transcription
Note: This works with all output modes (type, paste) but has no effect in clipboard mode, since clipboard-only output doesn't simulate keypresses.
Type: Boolean
Default: false
Required: No
Convert newlines in transcribed text to Shift+Enter instead of regular Enter. This is useful for applications where pressing Enter submits the message or form, but you want to insert line breaks within your text.
Why use it:
Many chat and messaging applications (Slack, Discord, Teams, etc.) and some IDEs (Cursor AI chat) use Enter to submit/send and Shift+Enter to insert a line break. When dictating multi-line text, regular newlines would submit prematurely. This option ensures line breaks are inserted without triggering submission.
Example:
[output]
shift_enter_newlines = true # Use Shift+Enter for newlines
Common use cases:
- Slack, Discord, Microsoft Teams chat
- AI coding assistants (Cursor, GitHub Copilot Chat)
- Web forms where Enter submits
- Any application where Enter has special meaning
Note: This only affects the wtype output driver. When combined with auto_submit = true, the final Enter (to submit) is still sent as a regular Enter after all Shift+Enter line breaks.
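For dictating multi-line messages into chat applications, both behaviors can be combined (values as documented above):

```toml
[output]
shift_enter_newlines = true  # Line breaks sent as Shift+Enter
auto_submit = true           # Final Enter still submits the message
```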
Type: String Default: None (disabled) Required: No
Shell command to execute immediately before typing output. Runs after post-processing but before text is typed/pasted.
Primary use case: Compositor integration to block modifier keys during typing. When using compositor keybindings with modifiers (e.g., SUPER+CTRL+X), if you release keys slowly, held modifiers can interfere with typed output.
Example:
[output]
pre_output_command = "hyprctl dispatch submap voxtype_suppress"
Automatic setup: Use voxtype setup compositor hyprland|sway|river to automatically configure this.
Type: String Default: None (disabled) Required: No
Shell command to execute immediately after typing output completes.
Primary use case: Compositor integration to restore normal modifier behavior after typing.
Example:
[output]
post_output_command = "hyprctl dispatch submap reset"
Compositor integration example:
[output]
# Switch to modifier-blocking submap before typing
pre_output_command = "hyprctl dispatch submap voxtype_suppress"
# Return to normal submap after typing
post_output_command = "hyprctl dispatch submap reset"Other uses:
[output]
# Notification when typing starts/finishes
pre_output_command = "notify-send 'Typing...'"
post_output_command = "notify-send 'Done'"
# Logging
post_output_command = "echo $(date) >> ~/voxtype.log"
See User Manual - Output Hooks for detailed setup instructions.
Optional post-processing command that runs after transcription. The command receives the transcribed text on stdin and should output the processed text on stdout.
Best use cases:
- Translation: Speak in one language, output in another
- Domain vocabulary: Medical, legal, or technical term correction
- Reformatting: Convert casual dictation to formal prose
- Filler word removal: Remove "um", "uh", "like" that Whisper sometimes keeps
- Custom workflows: Multi-output scenarios (e.g., translate to 5 languages, save JSON to file, inject only English at cursor)
Important notes:
- Adds 2-5 seconds latency depending on model size
- For most users, Whisper large-v3-turbo with Voxtype's built-in `spoken_punctuation` is sufficient
- LLMs interpret text literally: saying "slash" won't produce "/" (use `spoken_punctuation` for that)
- Use instruct/chat models, not reasoning models (they output `<think>` blocks)
- Avoid emojis in LLM output; ydotool cannot type them
Type: String
Default: None (disabled)
Required: Yes (if section is present)
The shell command to execute. Text is piped to stdin, processed text read from stdout.
Examples:
# Use Ollama with a small model for quick cleanup
[output.post_process]
command = "ollama run llama3.2:1b 'Clean up this transcription. Fix grammar and remove filler words. Output only the cleaned text:'"
# Simple filler word removal with sed
[output.post_process]
command = "sed 's/\\bum\\b//g; s/\\buh\\b//g; s/\\blike\\b//g'"
# Custom Python script
[output.post_process]
command = "python3 ~/.config/voxtype/cleanup.py"
# LM Studio API (OpenAI-compatible)
[output.post_process]
command = "~/.config/voxtype/lm-studio-cleanup.sh"
Type: Integer
Default: 30000 (30 seconds)
Required: No
Maximum time in milliseconds to wait for the command to complete. If exceeded, the original text is used and a warning is logged.
Recommendations:
- Simple shell commands: 5000 (5 seconds)
- Local LLMs: 30000-60000 (30-60 seconds)
- Remote APIs: 30000 or higher
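To pick a sensible timeout_ms, you can time a candidate command outside Voxtype. A quick shell sketch (the tr pipeline is only a placeholder for your real command):

```shell
# Time a candidate post-processing command to inform timeout_ms.
# The tr pipeline below is a placeholder; substitute your real command.
start=$(date +%s%N)
echo "sample dictation text" | tr '[:lower:]' '[:upper:]' > /dev/null
end=$(date +%s%N)
echo "elapsed: $(( (end - start) / 1000000 )) ms"
```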
Example:
[output.post_process]
command = "ollama run llama3.2:1b 'Clean up:'"
timeout_ms = 45000 # 45 second timeout for LLM
If the post-processing command fails for any reason (command not found, non-zero exit, timeout, empty output), Voxtype gracefully falls back to the original transcribed text and logs a warning. This ensures voice-to-text output is never blocked by post-processing issues.
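Because Voxtype pipes the transcription through stdin and reads stdout, you can dry-run any candidate command in a terminal exactly as the daemon would invoke it. For example, the sed filler-word filter shown earlier (with minor extra space cleanup for readability):

```shell
# Pipe sample text through a post-process command, as Voxtype does
# (text in on stdin, processed text out on stdout).
echo "um so this is uh a test like transcription" \
  | sed 's/\bum\b//g; s/\buh\b//g; s/\blike\b//g; s/  */ /g; s/^ //'
# → so this is a test transcription
```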
Debugging:
Run voxtype with -v or -vv to see detailed logs about post-processing:
voxtype -vv
For users running LM Studio locally, here's an example script:
#!/bin/bash
# ~/.config/voxtype/lm-studio-cleanup.sh
INPUT=$(cat)
# Build the request body with jq so quotes and newlines in the
# transcription are escaped correctly instead of breaking the JSON.
BODY=$(jq -n --arg input "$INPUT" '{
  messages: [
    { role: "system",
      content: "Clean up this dictated text. Fix spelling, remove filler words (um, uh), add proper punctuation. Output ONLY the cleaned text - no quotes, no emojis, no explanations." },
    { role: "user", content: $input }
  ],
  temperature: 0.1
}')
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$BODY" | jq -r '.choices[0].message.content'
Make it executable: chmod +x ~/.config/voxtype/lm-studio-cleanup.sh
Named profiles for context-specific settings. Profiles allow you to define different post-processing commands and output modes for different use cases, selectable at recording time via --profile.
Each profile is a TOML table under [profiles]:
[profiles.slack]
post_process_command = "ollama run llama3.2:1b 'Format for Slack:'"
[profiles.code]
post_process_command = "ollama run llama3.2:1b 'Format as code comment:'"
output_mode = "clipboard"
[profiles.email]
post_process_command = "ollama run llama3.2:1b 'Format as professional email:'"
post_process_timeout_ms = 45000Type: String
Default: None (uses [output.post_process].command)
Required: No
Shell command for post-processing. Overrides the default [output.post_process].command when this profile is active.
Type: Integer
Default: None (uses [output.post_process].timeout_ms or 30000)
Required: No
Timeout in milliseconds for the post-processing command.
Type: String
Default: None (uses [output].mode)
Required: No
Output mode override. Valid values: type, clipboard, paste.
Specify a profile when starting a recording:
voxtype record start --profile slack
voxtype record toggle --profile code
- Options not specified in a profile inherit from the main config
- Unknown profile names log a warning and use default settings
- Profiles have no effect on record stop or record cancel
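Profiles pair naturally with compositor keybindings, giving each key a different post-processing flavor. An illustrative Hyprland snippet (the bindings and profile names here are examples, not defaults):

```
# ~/.config/hypr/hyprland.conf (illustrative)
bind = SUPER, S, exec, voxtype record toggle --profile slack
bind = SUPER, C, exec, voxtype record toggle --profile code
```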
# Default post-processing
[output.post_process]
command = "ollama run llama3.2:1b 'Clean up:'"
timeout_ms = 30000
# Profile: casual chat
[profiles.slack]
post_process_command = "ollama run llama3.2:1b 'Rewrite casually for Slack:'"
# Profile: code comments, output to clipboard
[profiles.code]
post_process_command = "ollama run llama3.2:1b 'Format as code comment:'"
output_mode = "clipboard"
# Profile: meeting notes with longer timeout
[profiles.notes]
post_process_command = "ollama run llama3.2:1b 'Convert to bullet points:'"
post_process_timeout_ms = 60000
Controls text post-processing after transcription.
Type: Boolean
Default: false
Required: No
When true, converts spoken punctuation words into their symbol equivalents. Useful for developers and technical writing.
Supported conversions:
| Spoken | Symbol |
|---|---|
| period | . |
| comma | , |
| question mark | ? |
| exclamation mark / exclamation point | ! |
| colon | : |
| semicolon | ; |
| open paren / open parenthesis | ( |
| close paren / close parenthesis | ) |
| open bracket | [ |
| close bracket | ] |
| open brace | { |
| close brace | } |
| dash / hyphen | - |
| underscore | _ |
| at sign / at symbol | @ |
| hash / hashtag | # |
| dollar sign | $ |
| percent / percent sign | % |
| ampersand | & |
| asterisk | * |
| plus / plus sign | + |
| equals / equals sign | = |
| slash / forward slash | / |
| backslash | \ |
| pipe | \| |
| tilde | ~ |
| backtick | ` |
| single quote | ' |
| double quote | " |
| new line | newline character |
| new paragraph | double newline |
| tab | tab character |
Example:
[text]
spoken_punctuation = true
With this enabled, saying "function open paren close paren" produces function().
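Conceptually the conversion is a phrase-to-symbol mapping followed by spacing cleanup. A rough shell approximation of a few entries (illustrative only; Voxtype's converter covers the full table and applies smarter spacing rules):

```shell
# Rough approximation of a few spoken_punctuation mappings;
# the last substitution removes spaces around the symbols.
echo "function open paren close paren semicolon" \
  | sed 's/open paren/(/g; s/close paren/)/g; s/semicolon/;/g; s/ *\([();]\) */\1/g'
# → function();
```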
Type: Table (key-value pairs)
Default: {}
Required: No
Custom word replacements applied after transcription. Matching is case-insensitive but preserves word boundaries. Useful for:
- Correcting frequently misheard words
- Expanding abbreviations
- Fixing brand names or technical terms
Example:
[text]
replacements = { "vox type" = "voxtype", "oh marky" = "Omarchy" }
If Whisper transcribes "vox type" (or "Vox Type"), it will be replaced with "voxtype".
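You can sanity-check a replacement on the command line before adding it to the config; an approximate sed equivalent (GNU sed's I flag supplies the case-insensitivity, though Voxtype's matcher is not literally sed):

```shell
# Approximate a case-insensitive, word-boundary replacement with GNU sed.
echo "I installed Vox Type yesterday" | sed 's/\bvox type\b/voxtype/gI'
# → I installed voxtype yesterday
```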
Multiple replacements:
[text.replacements]
"vox type" = "voxtype"
"oh marky" = "Omarchy"
"oh marchy" = "Omarchy"
"omar g" = "Omarchy"
"omar key" = "Omarchy"
Voice Activity Detection settings. VAD filters silence-only recordings before transcription, preventing Whisper hallucinations when processing silence.
Type: Boolean
Default: false
Required: No
Enable Voice Activity Detection. When enabled, recordings with no detected speech are rejected before transcription, and the "Cancelled" audio feedback is played.
Example:
[vad]
enabled = true
CLI override:
voxtype --vad daemon
Type: Float
Default: 0.5
Required: No
Speech detection threshold from 0.0 to 1.0. Higher values require more confident speech detection (stricter), lower values are more permissive.
Example:
[vad]
enabled = true
threshold = 0.6 # More strict, may reject soft speech
CLI override:
voxtype --vad --vad-threshold 0.3 daemon # More permissive
Type: Integer
Default: 100
Required: No
Minimum speech duration in milliseconds. Recordings with less detected speech than this are rejected. Helps filter out brief noise spikes.
Example:
[vad]
enabled = true
min_speech_duration_ms = 200 # Require at least 200ms of speech
Type: String (path)
Default: Auto-detected based on engine
Required: No
Path to a custom VAD model file. If not set, uses the default model location (~/.local/share/voxtype/models/).
- Whisper engine: Uses ggml-silero-vad.bin (GGML format)
- Parakeet engine: Uses bundled Silero model (no external file needed)
Example:
[vad]
enabled = true
model = "/custom/path/to/vad-model.bin"
Download the VAD model before enabling:
voxtype setup vad
This downloads the appropriate model for your configured transcription engine.
[vad]
enabled = true
threshold = 0.5
min_speech_duration_ms = 100
Controls status display icons for Waybar and other tray integrations.
Type: String
Default: "emoji"
Required: No
The icon theme to use for status display. Determines which icons appear in Waybar and other integrations.
Built-in themes:
Font-based themes (require specific fonts installed):
| Theme | idle | recording | transcribing | stopped | Font Required |
|---|---|---|---|---|---|
| emoji | 🎙️ | 🎤 | ⏳ | (empty) | None (default) |
| nerd-font | U+F130 | U+F111 | U+F110 | U+F131 | Nerd Font |
| material | U+F036C | U+F040A | U+F04CE | U+F036D | Material Design Icons |
| phosphor | U+E43A | U+E438 | U+E225 | U+E43B | Phosphor Icons |
| codicons | U+EB51 | U+EBFC | U+EB4C | U+EB52 | Codicons |
| omarchy | U+EC12 | U+EC1C | U+EC1C | U+EC12 | Omarchy font |
Universal themes (no special fonts required):
| Theme | idle | recording | transcribing | stopped | Description |
|---|---|---|---|---|---|
| minimal | ○ | ● | ◐ | × | Simple Unicode circles |
| dots | ◯ | ⬤ | ◔ | ◌ | Geometric shapes |
| arrows | ▶ | ● | ↻ | ■ | Media player style |
| text | [MIC] | [REC] | [...] | [OFF] | Plain text labels |
Icon codepoint reference:
| Theme | idle | recording | transcribing | stopped |
|---|---|---|---|---|
| nerd-font | microphone | circle | spinner | microphone-slash |
| material | mdi-microphone | mdi-record | mdi-sync | mdi-microphone-off |
| phosphor | ph-microphone | ph-record | ph-circle-notch | ph-microphone-slash |
| codicons | codicon-mic | codicon-record | codicon-sync | codicon-mute |
Custom theme: Specify a path to a TOML file containing custom icons.
Example:
[status]
icon_theme = "nerd-font"
Custom theme file format (~/.config/voxtype/icons.toml):
idle = "🎙️"
recording = "🔴"
transcribing = "⏳"
stopped = ""
Per-state icon overrides. These take precedence over the theme.
Type: Table
Default: Empty (use theme icons)
Required: No
Override specific icons without creating a full custom theme.
Example:
[status]
icon_theme = "emoji"
[status.icons]
recording = "🔴" # Override just the recording icon
Voxtype outputs an alt field in JSON that enables Waybar's format-icons feature. You can either:
- Use voxtype's icon themes (simpler):
[status]
icon_theme = "nerd-font"
- Override in Waybar config (more control): the alt field values match the state names idle, recording, transcribing, stopped.
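A sketch of the second approach, using Waybar's format-icons keyed on the alt values above; the glyphs here are arbitrary placeholders, not Voxtype defaults:

```json
"custom/voxtype": {
    "exec": "voxtype status --follow --format json",
    "return-type": "json",
    "format": "{icon}",
    "format-icons": {
        "idle": "○",
        "recording": "●",
        "transcribing": "◐",
        "stopped": "×"
    }
}
```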
See User Manual - Waybar Integration for complete setup instructions.
Type: String
Default: "auto"
Required: No
Path to a state file for external integrations like Waybar or Polybar. When configured, the daemon writes its current state to this file whenever state changes.
Values written:
- idle - Ready for input
- recording - Push-to-talk active, capturing audio
- transcribing - Processing audio through Whisper
Special values:
- "auto" - Uses $XDG_RUNTIME_DIR/voxtype/state (default, recommended)
- "disabled" - Turns off state file (also accepts "none", "off", "false")
Example:
# Use automatic location (default)
state_file = "auto"
# Or specify explicit path
state_file = "/tmp/voxtype-state"
# Disable state file
state_file = "disabled"
Usage with voxtype status:
Once enabled, you can monitor the state:
# One-shot check
voxtype status
# JSON output for scripts
voxtype status --format json
# Continuous monitoring (for Waybar)
voxtype status --follow --format json
Waybar module example:
"custom/voxtype": {
"exec": "voxtype status --follow --format json",
"return-type": "json",
"format": "{}",
"tooltip": true
}
See User Manual - Waybar Integration for complete setup instructions.
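For scripts that should not depend on the voxtype CLI, the state file can also be read directly. A minimal sketch, assuming the default "auto" location and treating a missing file as the daemon being down:

```shell
# Read the daemon state from the default "auto" state file location,
# falling back to "unknown" when the file (or daemon) is absent.
STATE_FILE="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/voxtype/state"
state=$(cat "$STATE_FILE" 2>/dev/null || echo "unknown")
echo "$state"
```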
Most configuration options can be overridden via command line:
| Config Option | CLI Flag |
|---|---|
| Config file | -c, --config |
| hotkey.key | --hotkey |
| whisper.model | --model |
| output.mode = "clipboard" | --clipboard |
| output.mode = "paste" | --paste |
| vad.enabled | --vad |
| vad.threshold | --vad-threshold |
| status.icon_theme | --icon-theme (status subcommand) |
| Verbosity | -v, -vv, -q |
Example:
voxtype --hotkey PAUSE --model small.en --clipboard
voxtype status --format json --icon-theme nerd-font
RUST_LOG
Controls log verbosity via the tracing crate.
RUST_LOG=debug voxtype
RUST_LOG=voxtype=trace voxtype
XDG_CONFIG_HOME
Overrides the config directory location (default: ~/.config).
XDG_CONFIG_HOME=/custom/config voxtype
# Looks for: /custom/config/voxtype/config.toml
XDG_DATA_HOME
Overrides the data directory location (default: ~/.local/share).
XDG_DATA_HOME=/custom/data voxtype
# Models stored in: /custom/data/voxtype/models/
VOXTYPE_WHISPER_API_KEY
API key for remote Whisper server authentication. Used when backend = "remote".
export VOXTYPE_WHISPER_API_KEY="sk-..."
This is the recommended way to provide API keys instead of putting them in the config file.
[whisper]
model = "base.en"
[whisper]
model = "medium.en"
threads = 8
[output.notification]
on_transcription = true
[whisper]
model = "tiny.en"
[audio]
max_duration_secs = 30
[output]
type_delay_ms = 0
[whisper]
model = "large-v3"
language = "auto"
translate = true # Translate to English
[whisper]
model = "large-v3-turbo"
on_demand_loading = true # Free VRAM when not transcribing
[audio.feedback]
enabled = true # Helpful feedback since model loading adds brief delay
theme = "default"
[whisper]
model = "large-v3-turbo"
gpu_isolation = true # Release GPU memory between transcriptions
[audio.feedback]
enabled = true
theme = "default"
GPU isolation runs transcription in a subprocess that exits after each recording, allowing the discrete GPU to power down. The model loads while you speak, so perceived latency is nearly identical to standard mode.
[hotkey]
key = "F13"
modifiers = ["LEFTCTRL", "LEFTSHIFT"]
[output]
mode = "clipboard"
[output.notification]
on_transcription = true
[whisper]
model = "base.en"
[text]
# Say "period" to get ".", "open paren" to get "(", etc.
spoken_punctuation = true
# Fix common misheard technical terms
[text.replacements]
javascript = "JavaScript"
typescript = "TypeScript"
python = "Python"
[hotkey]
key = "F24"
[output]
mode = "clipboard"
[output.notification]
on_recording_start = false
on_recording_stop = false
on_transcription = false # No desktop notifications
[hotkey]
enabled = false # Disable built-in hotkey, use compositor keybindings
# Required for toggle mode
state_file = "auto"
[whisper]
model = "base.en"
[audio.feedback]
enabled = true # Audio cues helpful when using external triggers
Then configure your compositor:
Hyprland (~/.config/hypr/hyprland.conf):
bind = SUPER, V, exec, voxtype record start
bindr = SUPER, V, exec, voxtype record stop
Sway (~/.config/sway/config):
bindsym $mod+v exec voxtype record start
bindsym --release $mod+v exec voxtype record stop
Offload transcription to a GPU server on your local network:
[whisper]
backend = "remote"
language = "en"
# Your whisper.cpp server
remote_endpoint = "http://192.168.1.100:8080"
remote_timeout_secs = 30
On your GPU server, run whisper.cpp server:
./server -m models/ggml-large-v3-turbo.bin --host 0.0.0.0 --port 8080
Use OpenAI's hosted Whisper API (requires API key, has privacy implications):
[whisper]
backend = "remote"
language = "en"
remote_endpoint = "https://api.openai.com"
remote_model = "whisper-1"
remote_timeout_secs = 30
# API key set via: export VOXTYPE_WHISPER_API_KEY="sk-..."
Note: Cloud-based transcription sends your audio to third-party servers. See User Manual - Remote Whisper Servers for privacy considerations.
Use a fast model for everyday dictation with a more accurate model available on-demand:
[hotkey]
key = "SCROLLLOCK"
model_modifier = "LEFTSHIFT" # Hold Shift + hotkey for secondary model
[whisper]
model = "base.en" # Fast model, always ready
secondary_model = "large-v3-turbo" # Accurate model on-demand
available_models = ["medium.en"] # Additional models for CLI
max_loaded_models = 2 # Keep 2 models in memory
cold_model_timeout_secs = 300 # Evict unused models after 5 min
[audio.feedback]
enabled = true # Helpful when switching models
Usage:
- Normal hotkey press: Uses base.en (fast)
- Hold Shift + hotkey: Uses large-v3-turbo (accurate)
- CLI override: voxtype record start --model medium.en
Download models first:
voxtype setup --download --model base.en
voxtype setup --download --model large-v3-turbo
voxtype setup --download --model medium.en
Compatibility: Multi-model works with all modes:
- on_demand_loading = true: Models load in background during recording
- gpu_isolation = true: Fresh subprocess per transcription with requested model
- backend = "remote": Model name passed to remote server
The following configuration options are deprecated but still supported for backwards compatibility. They will log a warning when used.
| Deprecated Option | Replacement | Notes |
|---|---|---|
| wtype_delay_ms | pre_type_delay_ms | Renamed for clarity (applies to all output drivers, not just wtype) |
| --wtype-delay CLI flag | --pre-type-delay | CLI equivalent of the above |