Commit 5764aa2

Allow outputting transcriptions to a file
Parent: 40c67dd

5 files changed

Lines changed: 144 additions & 50 deletions

README.md

Lines changed: 66 additions & 16 deletions
@@ -11,7 +11,7 @@ VoxInput is meant to be used with [LocalAI](https://localai.io), but it will fun
 ## Features
 
 - **Speech-to-Text Daemon**: Runs as a background process to listen for signals to start or stop recording audio.
-- **Audio Capture and Playback**: Records audio from the microphone and plays it back for verification.
+- **Audio Capture**: Records audio from the microphone or any other device, including audio you are listening to.
 - **Transcription**: Converts recorded audio into text using a local or remote transcription service.
 - **Text Automation**: Simulates typing the transcribed text into an application using [`dotool`](https://git.sr.ht/~geb/dotool).
 - **Voice Activity Detection**: In realtime mode VoxInput uses VAD to detect speech segments and automatically transcribe them.
@@ -24,11 +24,15 @@ VoxInput is meant to be used with [LocalAI](https://localai.io), but it will fun
 - `dotool` (for simulating keyboard input)
 - `OPENAI_API_KEY` or `VOXINPUT_API_KEY`: Your OpenAI API key for Whisper transcription. If you have a local instance with no key, then just leave it unset.
 - `OPENAI_BASE_URL` or `VOXINPUT_BASE_URL`: The base URL of the OpenAI compatible API server: defaults to `http://localhost:8080/v1`
-- `OPENAI_WS_BASE_URL` or `VOXINPUT_WS_BASE_URL`: The base URL of the realtime websocket API: defaults to `ws://localhost:8080/v1/realtime`
-- OpenAI Realtime API support - VoxInput's realtime mode with VAD requires a [websocket endpoint that support's OpenAI's realtime API in transcription only mode](https://github.com/mudler/LocalAI/pull/5392). You can disable realtime mode with `--no-realtime`.
-
-Note that the VoxInput env vars take precedence over the OpenAI ones.
-
+- `XDG_RUNTIME_DIR`: Required for PID and state files in `$XDG_RUNTIME_DIR`.
+- `VOXINPUT_LANG`: Language code for transcription (defaults to empty).
+- `VOXINPUT_TRANSCRIPTION_MODEL`: Transcription model (default: `whisper-1`).
+- `VOXINPUT_TRANSCRIPTION_TIMEOUT`: Timeout duration (default: `30s`).
+- `VOXINPUT_SHOW_STATUS`: Show GUI notifications (`yes`/`no`, default: `yes`).
+- `VOXINPUT_CAPTURE_DEVICE`: Specific audio capture device name (run `voxinput devices` to list).
+- `VOXINPUT_OUTPUT_FILE`: Path to save the transcribed text to a file instead of typing it with dotool.
+
+**Note**: `VOXINPUT_` vars take precedence over `OPENAI_` vars.
 Unless you don't mind running VoxInput as root, then you also need to ensure the following is setup for `dotool`
 
 - Your user is in the `input` user group
@@ -78,36 +82,50 @@ The pop-up window showing when recording has begun can be disabled by setting `V
 
 ### Commands
 
-- **`listen`**: Starts the speech-to-text daemon.
+- **`listen`**: Start speech to text daemon.
+- `--replay`: Play the audio just recorded for transcription (non-realtime mode only).
+- `--no-realtime`: Use the HTTP API instead of the realtime API; disables VAD.
+- `--no-show-status`: Don't show when recording has started or stopped.
+- `--output-file <path>`: Save transcript to file instead of typing.
 ```bash
 ./voxinput listen
 ```
 
-- **`record`**: Sends a signal to the daemon to start recording audio then exits. In realtime mode this will start transcription.
+- **`record`**: Tell existing listener to start recording audio. In realtime mode it also begins transcription.
 ```bash
 ./voxinput record
 ```
 
-- **`write`** or **`stop`**: Sends a signal to the daemon to stop recording. When not in realtime mode this triggers transcription.
+- **`write`** or **`stop`**: Tell existing listener to stop recording audio and begin transcription if not in realtime mode. `stop` alias makes more sense in realtime mode.
 ```bash
 ./voxinput write
 ```
 
-- **`toggle`**: Toggle recording on/off (start recording if idle, stop if recording). Only works in realtime mode.
+- **`toggle`**: Toggle recording on/off (start recording if idle, stop if recording).
 ```bash
 ./voxinput toggle
 ```
 
-- **`status`**: Show whether the server is listening and if it's currently recording. Only works in realtime mode.
+- **`status`**: Show whether the server is listening and if it's currently recording.
 ```bash
 ./voxinput status
 ```
 
-- **`help`**: Displays help information.
+- **`devices`**: List capture devices.
+```bash
+./voxinput devices
+```
+
+- **`help`**: Show help message.
 ```bash
 ./voxinput help
 ```
 
+- **`ver`**: Print version.
+```bash
+./voxinput ver
+```
+
 ### Example Realtime Workflow
 
 1. Start the daemon in a terminal window:
@@ -146,6 +164,42 @@ The pop-up window showing when recording has begun can be disabled by setting `V
 
 4. The transcribed text will be typed into the active application.
 
+### Example Workflow: Transcribing an Online Meeting or Video Stream
+
+To create a transcript of an online meeting or video stream by capturing system audio:
+
+1. List available capture devices:
+
+```bash
+./voxinput devices
+```
+
+Identify the monitor device, e.g., "Monitor of Built-in Audio Analog Stereo".
+
+2. Start the daemon specifying the device and output file:
+
+```bash
+VOXINPUT_CAPTURE_DEVICE="Monitor of Built-in Audio Analog Stereo" ./voxinput listen --output-file meeting_transcript.txt
+```
+
+Note: Add `--no-realtime` if you prefer the HTTP API.
+
+3. Start recording:
+
+```bash
+./voxinput record
+```
+
+4. Play your online meeting or video stream; the system audio will be captured.
+
+5. Stop recording:
+
+```bash
+./voxinput stop
+```
+
+6. The transcript is now in `meeting_transcript.txt`.
+
 ### Quick start with LocalAI
 
 1. Follow https://localai.io/installation/ to install LocalAI, the simplest way is using Docker:
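The meeting-transcript workflow above can also be scripted from Go. The sketch below only builds the step-2 command with `os/exec` and does not start it. The helper name `listenCommand` is hypothetical, and the sketch assumes `./voxinput` is in the working directory and that the device name was taken from `./voxinput devices`.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// listenCommand (hypothetical helper) builds, but does not start, the daemon
// command from step 2: "<binary> listen --output-file <outputFile>",
// optionally with --no-realtime, and VOXINPUT_CAPTURE_DEVICE set to device.
func listenCommand(binary, device, outputFile string, realtime bool) *exec.Cmd {
	args := []string{"listen", "--output-file", outputFile}
	if !realtime {
		// Use the HTTP API instead of the realtime API.
		args = append(args, "--no-realtime")
	}
	cmd := exec.Command(binary, args...)
	// Keep the caller's environment. For duplicate keys os/exec uses the
	// last value, so this device name overrides any inherited one.
	cmd.Env = append(os.Environ(), "VOXINPUT_CAPTURE_DEVICE="+device)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	cmd := listenCommand("./voxinput", "Monitor of Built-in Audio Analog Stereo", "meeting_transcript.txt", true)
	fmt.Println(cmd.Args)
	// Calling cmd.Start() would launch the daemon. "./voxinput record" and
	// "./voxinput stop" then mark the part of the meeting to transcribe.
}
```

Flag order does not matter here, because the `main.go` diff below detects `--no-realtime` with `slices.Contains` over the arguments.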
@@ -183,10 +237,6 @@ The realtime mode has a UI to display various actions being taken by VoxInput. H
 - `SIGUSR2`: Stop recording and transcribe audio.
 - `SIGTERM`: Stop the daemon.
 
-## Limitations
-
-- Uses the default audio input, make sure you have the device you want to use set as the default on your system.
-
 ## License
 
 This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

flake.lock

Lines changed: 3 additions & 3 deletions
Generated file; diff not rendered.

listen.go

Lines changed: 24 additions & 8 deletions
@@ -114,15 +114,16 @@ func waitForSessionUpdated(ctx context.Context, conn *openairt.Conn) error {
 }
 
 type ListenConfig struct {
-	PIDPath string
-	APIKey string
-	HTTPAPIBase string
-	WSAPIBase string
-	Lang string
-	Model string
-	Timeout time.Duration
-	UI *gui.GUI
+	PIDPath string
+	APIKey string
+	HTTPAPIBase string
+	WSAPIBase string
+	Lang string
+	Model string
+	Timeout time.Duration
+	UI *gui.GUI
 	CaptureDevice string
+	OutputFile string
 }
 
 func listen(config ListenConfig) {
@@ -349,6 +350,21 @@ Listen:
 
 log.Println("main: received transcribed text: ", text)
 
+if config.OutputFile != "" {
+	f, err := os.OpenFile(config.OutputFile, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
+	if err != nil {
+		log.Printf("Failed to open output file %s: %v\n", config.OutputFile, err)
+		continue
+	}
+	if _, err := fmt.Fprintln(f, text); err != nil {
+		log.Printf("Failed to write to output file: %v\n", err)
+	}
+	if err := f.Close(); err != nil {
+		log.Printf("Failed to close output file: %v\n", err)
+	}
+	continue
+}
+
 dotool := exec.CommandContext(ctx, "dotool")
 stdin, err := dotool.StdinPipe()
 if err != nil {

main.go

Lines changed: 50 additions & 22 deletions
@@ -36,19 +36,32 @@ func main() {
 
 switch cmd {
 case "help":
-	fmt.Println("Available commands:")
-	fmt.Println(" listen - Start speech to text daemon")
-	fmt.Println(" --replay play the audio just recorded for transcription")
-	fmt.Println(" --no-realtime use the HTTP API instead of the realtime API; disables VAD")
-	fmt.Println(" --no-show-status don't show when recording has started or stopped")
-	fmt.Println(" record - Tell existing listener to start recording audio. In realtime mode it also begins transcription")
-	fmt.Println(" write - Tell existing listener to stop recording audio and begin transcription if not in realtime mode")
-	fmt.Println(" stop - Alias for write; makes more sense in realtime mode")
-	fmt.Println(" toggle - Toggle recording on/off (start recording if idle, stop if recording)")
-	fmt.Println(" status - Show whether the server is listening and if it's currently recording")
-	fmt.Println(" devices - List capture devices")
-	fmt.Println(" help - Show this help message")
-	fmt.Println(" ver - Print version")
+	fmt.Println(`Available commands:
+listen - Start speech to text daemon
+--replay play the audio just recorded for transcription
+--no-realtime use the HTTP API instead of the realtime API; disables VAD
+--no-show-status don't show when recording has started or stopped
+--output-file <path> Write transcribed text to file instead of keyboard
+record - Tell existing listener to start recording audio. In realtime mode it also begins transcription
+write - Tell existing listener to stop recording audio and begin transcription if not in realtime mode
+stop - Alias for write; makes more sense in realtime mode
+toggle - Toggle recording on/off (start recording if idle, stop if recording)
+status - Show whether the server is listening and if it's currently recording
+devices - List capture devices
+help - Show this help message
+ver - Print version
+
+Environment variables:
+VOXINPUT_API_KEY or OPENAI_API_KEY - OpenAI API key (default: sk-xxx)
+VOXINPUT_BASE_URL or OPENAI_BASE_URL - HTTP API base URL (default: http://localhost:8080/v1)
+VOXINPUT_WS_BASE_URL or OPENAI_WS_BASE_URL - WebSocket API base URL (default: ws://localhost:8080/v1/realtime)
+VOXINPUT_LANG or LANG - Language code for transcription (default: none)
+VOXINPUT_TRANSCRIPTION_MODEL or TRANSCRIPTION_MODEL - Transcription model (default: whisper-1)
+VOXINPUT_TRANSCRIPTION_TIMEOUT or TRANSCRIPTION_TIMEOUT - Transcription timeout (default: 30s)
+VOXINPUT_SHOW_STATUS or SHOW_STATUS - Show status notifications (yes/no, default: yes)
+VOXINPUT_CAPTURE_DEVICE - Name of the capture device (default: system default; use 'devices' to list)
+VOXINPUT_OUTPUT_FILE - File to write transcribed text to (instead of keyboard)
+XDG_RUNTIME_DIR - Directory for PID and state files (required, standard XDG variable)`)
 	return
 case "ver":
 	fmt.Printf("v%s\n", strings.TrimSpace(string(version)))
@@ -75,7 +88,9 @@ func main() {
 timeoutStr := getPrefixedEnv([]string{"VOXINPUT", ""}, "TRANSCRIPTION_TIMEOUT", "30s")
 showStatus := getPrefixedEnv([]string{"VOXINPUT", ""}, "SHOW_STATUS", "yes")
 captureDeviceName := getPrefixedEnv([]string{"VOXINPUT"}, "CAPTURE_DEVICE", "")
-
+
+outputFile := getPrefixedEnv([]string{"VOXINPUT"}, "OUTPUT_FILE", "")
+
 timeout, err := time.ParseDuration(timeoutStr)
 if err != nil {
 	log.Println("main: failed to parse timeout", err)
@@ -101,21 +116,34 @@ func main() {
 replay := slices.Contains(os.Args[2:], "--replay")
 realtime := !slices.Contains(os.Args[2:], "--no-realtime")
 
+var outputFileArg string
+for i := 2; i < len(os.Args); i++ {
+	arg := os.Args[i]
+	if arg == "--output-file" && i+1 < len(os.Args) {
+		outputFileArg = os.Args[i+1]
+		break
+	}
+}
+if outputFileArg != "" {
+	outputFile = outputFileArg
+}
+
 if realtime {
 	ctx, cancel := context.WithCancel(context.Background())
 	ui := gui.New(ctx, showStatus)
 
 	go func() {
 		listen(ListenConfig{
-			PIDPath: pidPath,
-			APIKey: apiKey,
-			HTTPAPIBase: httpApiBase,
-			WSAPIBase: wsApiBase,
-			Lang: lang,
-			Model: model,
-			Timeout: timeout,
-			UI: ui,
+			PIDPath: pidPath,
+			APIKey: apiKey,
+			HTTPAPIBase: httpApiBase,
+			WSAPIBase: wsApiBase,
+			Lang: lang,
+			Model: model,
+			Timeout: timeout,
+			UI: ui,
 			CaptureDevice: captureDeviceName,
+			OutputFile: outputFile,
 		})
 		cancel()
 	}()
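The argument loop above gives `--output-file` precedence over `VOXINPUT_OUTPUT_FILE`. The rule can be pulled out into pure functions so it can be checked in isolation. The function names below are hypothetical, not from the commit.

```go
package main

import "fmt"

// outputFileFromArgs mirrors the loop added to main.go: scan the arguments
// after the subcommand (index 2 onward) for "--output-file" and return the
// token that follows it. A trailing "--output-file" with no value is ignored.
func outputFileFromArgs(args []string) string {
	for i := 2; i < len(args); i++ {
		if args[i] == "--output-file" && i+1 < len(args) {
			return args[i+1]
		}
	}
	return ""
}

// resolveOutputFile applies the commit's precedence: a non-empty
// --output-file value overrides the VOXINPUT_OUTPUT_FILE value.
func resolveOutputFile(args []string, envValue string) string {
	if v := outputFileFromArgs(args); v != "" {
		return v
	}
	return envValue
}

func main() {
	args := []string{"voxinput", "listen", "--output-file", "meeting_transcript.txt"}
	fmt.Println(resolveOutputFile(args, "from-env.txt"))
}
```

As in the original loop, the token after `--output-file` is taken verbatim. So `--output-file --replay` would use `--replay` as the path, while `slices.Contains` would still enable replay.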

version.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.7.1
+0.8.0
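The `ver` command formats this file's contents with `fmt.Printf("v%s\n", strings.TrimSpace(string(version)))`, as shown in the `main.go` diff. This diff does not show where `version` comes from. Embedding `version.txt` with `go:embed` would be one option, but that is an assumption, so the sketch takes the raw bytes as a parameter. `versionString` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// versionString (hypothetical) applies the same formatting as the `ver`
// command: trim surrounding whitespace from the raw contents, prefix "v".
func versionString(raw []byte) string {
	return fmt.Sprintf("v%s", strings.TrimSpace(string(raw)))
}

func main() {
	// Assumed raw contents of version.txt after this commit.
	fmt.Println(versionString([]byte("0.8.0\n")))
}
```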
