Fine-tune Text to Speech Models in 2025: CSM-1B and Orpheus TTS

Published: 24 March 2025
Channel: Trelis Research

📜 Get repo access at Trelis.com/ADVANCED-transcription

📧 Get the Trelis AI Newsletter: https://trelis.substack.com

❗️If you subscribed here, click the bell to be notified of new vids

🤝 Work for Trelis: https://trelis.com/jobs/

💡 Need Technical or Market Assistance?
Book a Consult Here: https://forms.gle/wJXVZXwioKMktjyVA

💸 Starting a New Project/Venture?
Apply for a Trelis Grant: https://trelis.com/trelis-ai-grants/

Video Links:
Slides: https://docs.google.com/presentation/...
One-click Runpod template (affiliate): https://www.runpod.io/console/deploy?...
Llama 3 Paper: https://arxiv.org/pdf/2407.21783
StyleTTS2: https://arxiv.org/pdf/2306.07691
Moshi: https://arxiv.org/pdf/2410.00037
Orpheus: https://canopylabs.ai/model-releases
Sesame’s CSM-1B: https://www.sesame.com/research/cross...
Colab Notebook - Orpheus Cloning: https://colab.research.google.com/dri...
Colab Notebook - Orpheus Inference: https://colab.research.google.com/dri...

TIMESTAMPS:
00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4 (?)
01:04 End-to-End Multimodal Models and Their Capabilities
02:36 Traditional Approaches to Text-to-Speech
03:06 Token-Based Approaches and Their Advantages
03:25 Detailed Look at Orpheus and CSM-1B Models
06:58 Training and Inference with Token-Based Models
12:53 Hierarchical Tokenization for High-Quality Audio
14:11 Kyutai’s Moshi Model for Text + Speech
23:41 Sesame’s CSM-1B Model Architecture
25:13 Orpheus TTS Architecture by Canopy Labs
27:34 Inference and Cloning with CSM-1B
40:13 Context-Aware Text-to-Speech with CSM-1B
48:21 Orpheus Inference and Cloning - FREE Colab
55:09 Orpheus Voice Cloning Setup
01:01:20 Orpheus Fine-tuning (Full fine-tuning and LoRA fine-tuning)
01:09:55 Running Full Fine-tuning
01:19:33 Running LoRA Fine-tuning
01:25:20 Inference and Comparison
01:29:27 Inference with Cloning AND Fine-tuning
01:35:48 The Future of Token-Based Multimodal Models

PS: I've rotated all Hugging Face access keys shown in the video.