This repository contains materials from my self-designed introductory course Speech Synthesis and Voice Cloning, delivered during the Independent Study Period 2025 (ISP'25) at Skoltech.
The course introduces students to modern deep learning–based speech synthesis technologies and provides hands-on experience in building personalized text-to-speech (TTS) models.
- syllabus.pdf: full course syllabus as submitted for the ISP'25 course proposals. Also available online.
- lectures/: complete set of lecture slides (PDF). The structure is described below.
- isp-tts: accompanying repository with a simple TTS model implementation and example notebooks used in lectures and project work.
- (22.01) What is Speech Synthesis?
  Text-to-speech demos, brief history, TTS pipeline
- (23.01) Audio Representations and Text Preprocessing
  Waveforms and spectrograms, graphemes and phonemes, practice
- (24.01) Creating Own Speech Dataset
  TTS dataset checklist, useful tools, practice, start of the project work
- (27.01) Text-to-Speech Models. Acoustic models
  TTS architectures, code-along session
- (28.01) Model Training and Evaluation
  Training pipeline, evaluation metrics, code-along session
- (29.01) Vocoders. Voice Conversion
  Vocoder architectures, voice conversion models
- (30.01) Extra Topics and Project Work
  Expressive, emotional, multi-lingual and modern speech synthesis
- (31.01) Project Presentations and Wrap-up
  Presentations and fun
Develop a working speech synthesis model trained or fine-tuned on your own voice data.
- Collect and prepare a dataset
- Train or fine-tune a TTS model
- Present results and insights
- Groups of 2–3 students: explore and prepare a solution together, but everyone uses their own dataset
- A baseline solution is presented in class, but students are free to use any models or ready-to-use solutions
- Peer voting for the most interesting projects
Feedback is given on the individual steps and results; there is no formal grade.
All material was collected and prepared with care and love. References and attributions are included on individual slides.
This course was inspired by the following open courses:
- YSDA Speech Processing Course: https://github.com/yandexdataschool/speech_course
- HSE Deep Learning for Audio (DLA): https://github.com/markovka17/dla
- MIPT Deep Learning for Audio Course: https://github.com/severilov/DL-Audio-AIMasters-Course