AI-powered voice input for your desktop — press a hotkey, speak, and the transcript lands wherever your cursor is.
Soto is a cross-platform desktop voice-input app: hold a hotkey, speak, release, and your transcript is injected into whatever app holds focus. It is inspired by type4me by @joewongjc — bringing the same hotkey-driven flow to macOS and Windows. Compared with type4me, Soto replaces the original ASR2LM pipeline with an Omni-model architecture where the provider consumes audio and prompt context in a single multimodal request.
The whole UX is anchored on a tiny floating voice capsule that confirms the mic is hot, visualises your speech in real time, and gets out of the way the moment the transcript is delivered. It has two modes — dictation and translate:
Download the latest installer from sotoapp.org (or the GitHub Releases page):
| Platform | Installer |
|---|---|
| macOS (Apple Silicon) | Soto_*_aarch64.dmg |
| Windows 11 (x64) | Soto_*_x64-setup.exe |
- Install and launch Soto.
- Open the Models page and configure an Omni provider (e.g. MiMo, Doubao, or Qwen).
- Hold the configured hotkey, speak, release — the transcript is injected at your cursor.
macOS only — two permissions in System Settings → Privacy & Security:
- Microphone — capture audio while the hotkey is held
- Accessibility — inject the final transcript and listen for the global hotkey across apps
You can review and request access anytime in Settings → Permissions inside Soto; that panel keeps status checks read-only and uses macOS' native authorization prompts for Microphone and Accessibility.
If you'd rather build from source, see Development below.
- Hotkey-triggered recording — hold or tap a configurable key combo to start and stop
- Voice capsule — a tiny floating window confirms you're being heard and lets you abort
- Omni-model providers — MiMo, Doubao, and Qwen out of the box, plus any OpenAI-compatible endpoint; one request carries audio + prompt + hot-words
- Modes — per-mode system prompts (Dictation, Translate, and your own) tailor the model's output
- Dictionary / hot words — boost recognition for domain-specific terms
- Universal text injection — inserts into any app through the platform injection ladder: Windows native insertion uses SendInput, while macOS uses clipboard paste with Cmd+V dispatch through the native bridge
- History — searchable local session log with privacy controls
- Self-contained desktop package — no background services; the native macOS dylib or Windows DLL is bundled with the app
Soto is an Electron + React + TypeScript desktop app in a pnpm workspace monorepo. All the business logic lives in a pure-TypeScript core package; the Electron main process is a thin orchestration + persistence + native-bridge layer; the renderer is React; and OS-native capabilities are reached through koffi FFI into per-platform native libraries. There is no Rust — @soto/core is the single source of truth for types and logic.
packages/
core (@soto/core) pure TS logic + canonical zod types: chord matching,
session FSM, audio DSP/WAV, omni provider client,
prompt assembly, hot-word ranking, IPC schemas.
Layered foundation/contract/capabilities/domain
(dependency-cruiser-enforced). Zero Electron/native
deps; fully unit-tested.
ipc (@soto/ipc) the IPC trust boundary: per-command router + the
command policy catalog. Builds on @soto/core.
native-bridge (@soto/native-bridge) the koffi FFI seam: native ABI decls +
bridge loader + the InjectionNativePort contract.
apps/desktop (@soto/desktop) the Electron app (electron-vite + React)
src/main/ main process: the IPC router (trust boundary) from
@soto/ipc, Drizzle/better-sqlite3 SqliteStore
(~/.soto), the SessionController, and the
@soto/native-bridge koffi bridge.
src/preload/ per-command contextBridge surface (no generic invoke).
src/renderer/ React UI (pages + Zustand) + the transparent capsule.
native/
macos/ SwiftPM package → libSotoMacNative.dylib
(`@_cdecl` shim, called via koffi).
windows/ .NET Native AOT project → SotoWinNative.dll
(`UnmanagedCallersOnly` exports, called via koffi).
Frontend layering rules and module boundaries are spelled out in CONTRIBUTING.md.
Prerequisites: pnpm 10.24.0. The workspace pins Node.js 24.16.0
through pnpm (devEngines.runtime in package.json plus the lockfile runtime);
run JS commands through pnpm so scripts use the managed runtime.
Platform-native dependencies (only needed when (re)building the native libraries):
- macOS development and release builds require macOS 14+ on Apple Silicon and Xcode 16+ / Swift 6+ to build
libSotoMacNative.dylib. - Windows development and release builds require Windows 11 x64 and the .NET 8 SDK with Native AOT support to build
SotoWinNative.dll.
Native package details live in native/macos/README.md and native/windows/README.md.
# Install JS dependencies (pnpm workspace)
pnpm install
# Type-check the Electron app (main + renderer)
pnpm check
# Run all tests (@soto/core + @soto/ipc + the Electron app)
pnpm test
# Dev server (electron-vite hot-reload main + renderer)
pnpm dev
# Production build (electron-vite build)
pnpm build
# Local macOS package smoke (Swift dylib + Electron Builder + package checks)
pnpm smoke:package:macNative modules (better-sqlite3, koffi) are rebuilt against the bundled Electron
ABI by the app's dev/package scripts (electron-rebuild); pnpm test rebuilds
them against the pnpm-managed Node 24 ABI first. better-sqlite3 is versioned
through the workspace catalog and remains the only approved install-time native
build in pnpm-workspace.yaml.
The local package smoke creates apps/desktop/dist/Soto-<version>-arm64.dmg and
verifies the staged Soto.app, codesign status, and bundled native dylib. It is
faster than the signed release gate and does not notarize.
macOS release builds that need Developer ID signing and notarization use
pnpm build:mac:signed with maintainer signing credentials (kept outside the
repo); it builds the Swift dylib, packages it into Contents/Resources/native/,
and runs Electron Builder signing/notarization. That path also emits Electron
Builder feed artifacts such as latest-mac.yml, and the packaging scripts verify
those files exist. The release scripts live directly under scripts/.
On a locked-down Windows shell, enable pnpm via Corepack first:
corepack enable pnpm --install-directory "$env:APPDATA\npm"Soto currently builds prompts from the mode system prompt and dictionary hot-words. Future work will feed richer signals to the AI so it can produce better transcriptions and transformations with less manual configuration.
| Signal | Approach | Status |
|---|---|---|
| Focused window — app name + window title | Query the OS window manager at session start and append to the prompt (e.g. "User is writing in Slack → #engineering-channel") | Planned |
| Window content — visible text in the active app | macOS Accessibility API (AXUIElement) / Windows UI Automation; screen-reader–style extraction at record time |
Planned |
| Clipboard — text already copied by the user | Read clipboard at record start and include as a hint; powers natural "rewrite this" / "translate this" flows without a custom mode | Planned |
Show a live preview of the transcription inside the recording capsule as audio is processed, rather than waiting until the session ends. Reduces the "did it hear me?" uncertainty and lets the user abort early if recognition goes off-track. Requires provider-side streaming support.
Words that appear in a completed transcript but were not in the user's dictionary could be automatically surfaced as candidates to add. A post-session review UI (or a confidence-threshold auto-add) would let the dictionary grow organically from real usage without manual curation.
Make hot words easier to tune once the dictionary grows beyond a flat enabled/disabled list. Candidate improvements include per-mode hot-word sets, priority or weight controls, alias/pronunciation hints, bulk import/export, and review tools for stale or conflicting entries.
See CONTRIBUTING.md for layering rules and test expectations.
If Soto has been useful to you, you're welcome to buy a coffee on Ko-fi. It helps cover code-signing certificates and release distribution costs — much appreciated either way.
- Inspiration: type4me by @joewongjc — the original macOS voice-input app written in SwiftUI. We owe the core product design to type4me: hotkey-driven sessions, mode-based prompts, dictionary / hot words, and local history. If you're on macOS only, please also check out the upstream project.
- Runtime: Electron, React, koffi
- Providers: MiMo (Xiaomi), Doubao (ByteDance), Qwen (Alibaba Cloud), and any OpenAI-compatible endpoint
MIT — see LICENSE.