What if you can fulfill your dream of becoming a cute girl? Well, it's possible now (sort of).
- Audio transcription is done with Whisper.
- Translation is done with DeepL.
- Text to (cute) speech is done with Voicevox.
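The three stages above chain into one flow: audio in, text, translated text, audio out. A minimal sketch of that flow, with each stage passed in as a plain callable. The names `VoicePipeline`, `transcribe`, `translate`, and `synthesize` are hypothetical and not taken from this repository, and the lambdas are stand-ins that only mimic Whisper, DeepL, and Voicevox:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains speech-to-text, translation, and text-to-speech (hypothetical names)."""

    transcribe: Callable[[bytes], str]   # stand-in for Whisper: audio bytes -> text
    translate: Callable[[str], str]      # stand-in for DeepL: text -> translated text
    synthesize: Callable[[str], bytes]   # stand-in for Voicevox: text -> audio bytes

    def run(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        if not text.strip():
            # Nothing was recognized, so there is nothing to speak.
            return b""
        return self.synthesize(self.translate(text))


# Stand-in stages so the sketch runs without any of the real services.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    translate=lambda text: f"[ja] {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Keeping the stages as swappable callables means any one of them could later be pointed at a remote service without touching the other two.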
Demo, recorded on my laptop with CPU only: Screen.Recording.2023-05-07.at.12.12.45.AM.mov
- Install Docker for the Voicevox engine.
- Install Python 3.10 + Poetry. I recommend using asdf for this.
- Install dependencies with Poetry by running `poetry install`. If you don't want to use it, check `pyproject.toml` for Python and package versions.
- Rename/copy `config.template.py` to `config.py`.
- Download Whisper's models (https://github.com/openai/whisper#available-models-and-languages) and update `WHISPER_MODEL_PATH` in config.py with the path to the model file of your choice.
- Update the array `VOICE_OUTPUT_DEVICE_IDS` in config.py with the devices that you want the final voice to go to (e.g. speaker/headphone/"fake" microphone for voice chats).
- Set `SPEAKER_ID` in voicevox_client/voice_config.py to your desired speaker ID. See below for how to check the voices out.
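For orientation, here is a sketch of what the edited values might look like. Only the three setting names come from the steps above. The path and the IDs are placeholder assumptions, and the real config.template.py may contain other settings.

```python
# config.py (sketch with placeholder values)
WHISPER_MODEL_PATH = "models/base.pt"  # assumed location of the downloaded model file
VOICE_OUTPUT_DEVICE_IDS = [1, 5]       # assumed device IDs, e.g. headphones + virtual cable

# voicevox_client/voice_config.py (sketch)
SPEAKER_ID = 10                        # a style id; 10 is only an example value
```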
Start the Voicevox engine in one console:
# Depends on whether you have GPU or not
# With GPU
docker compose -f docker-compose.gpu.yml up
# Without GPU
docker compose -f docker-compose.cpu.yml up

Start the program in another console:
poetry run python main.py
# Or with a shell inside poetry's virtualenv
poetry shell
python main.py

- Move Whisper audio transcription + the Voicevox engine to a cloud server with a GPU, or just Google Colab if the internet connection is good, so fewer local resources are needed and things run faster.
Run this inside a python console with asyncio (python -m asyncio):
from voicevox_client.client import Client
with Client() as client:
    for speaker in client.fetch_speakers():
        print(speaker)

speaker_uuid from this can be used to get more info about the speaker.
Each speaker has a styles array. Each element has its own id that can be used for speaker initialization/voice synthesis.
We can combine speaker_uuid and id to check voice samples from the get speaker info API.
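As a sketch of that lookup, the helper below takes an already fetched speaker info dictionary, finds the style with the given id, and writes its first voice sample to a file. It assumes the response has a `style_infos` list whose entries carry `id` and base64 encoded `voice_samples`. The function name `save_voice_sample` is hypothetical and not part of this repository.

```python
import base64
from pathlib import Path


def save_voice_sample(speaker_info: dict, style_id: int, out_path: str) -> Path:
    """Decode the first voice sample of one style and write it to out_path."""
    for style in speaker_info["style_infos"]:
        if style["id"] == style_id:
            samples = style["voice_samples"]
            if not samples:
                raise ValueError(f"style {style_id} has no voice samples")
            path = Path(out_path)
            # Samples are base64 text; decode to raw audio bytes before writing.
            path.write_bytes(base64.b64decode(samples[0]))
            return path
    raise KeyError(f"no style with id {style_id}")
```

Raising on an unknown style id, rather than returning nothing, makes a mistyped id obvious right away.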
Run this inside a python console with asyncio (python -m asyncio):
from voicevox_client.client import Client
with Client() as client:
    speaker = client.fetch_speaker_info("<speaker_uuid>")
    # speaker["portrait"] is a base64 encoded image
    # speaker["style_infos"] is an array where each element contains id (style id), portrait (base64 encoded image), icon (base64 encoded image), voice_samples (array of base64 encoded voice samples)
    # Sample code to write the base64 encoded data to a file:
    # import base64
    # decoded = base64.b64decode(speaker["style_infos"][0]["voice_samples"][0])
    # out_file = "test.wav"
    # with open(out_file, "wb") as file:
    #     file.write(decoded)

Run this inside a python console with asyncio (python -m asyncio):
from voicevox_client.client import Client
with Client() as client:
    with open("test.wav", "wb") as f:
        f.write(client.text_to_speech("交流できて嬉しいです", speaker_id=10))

Run this inside a python console:
import sounddevice as sd
print(sd.query_devices())

Use something like VB-CABLE to forward the audio output of this program to a fake audio input device, then use that fake device as the audio input for your voice chat application. This should work with most games/Discord/Zoom.
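To pick values for VOICE_OUTPUT_DEVICE_IDS from that device list, a small filter can help. This is a sketch that works on any list of device dictionaries with `name` and `max_output_channels` keys, which is the shape `sd.query_devices()` returns. The helper name `find_output_device_ids` is hypothetical, and the position in the list is assumed to be the device ID.

```python
def find_output_device_ids(devices, name_fragment):
    """Return indices of output-capable devices whose name contains name_fragment."""
    fragment = name_fragment.lower()
    return [
        index
        for index, device in enumerate(devices)
        if device["max_output_channels"] > 0 and fragment in device["name"].lower()
    ]


# With the real library this could be called as (assumes sounddevice is installed):
#   import sounddevice as sd
#   find_output_device_ids(sd.query_devices(), "cable")
```

Filtering on `max_output_channels` skips microphones and other input-only devices, since the program needs devices it can play audio into.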