Muyan-TTS Make Podcasts with AI Model Locally Step-by-Step Tutorial
AI Summary
Overview
- Video by Fahd Mirza on Muyan-TTS, a text-to-speech model optimized for podcast scenarios.
Key Features
- Open-source and trainable model suitable for zero- and one-shot use cases.
- Two model versions: Base TTS model (multispeaker) and Supervised Fine-tuning (single speaker).
- Efficient for voice cloning with lightweight fine-tuning capabilities.
Installation Instructions
- Environment Setup:
- Install the FFmpeg multimedia library.
- Create a virtual environment using Python 3.10.
- Clone the Repository:
- The repository link is in the video description.
- Install PyAudio using the command:
pip install pyaudio
- Directory Structure:
- Create a directory for pre-trained models with specified subdirectories.
- Download the base model, SFT model, and Chinese Hubert model.
- Log into the Hugging Face CLI, using an access token from your Hugging Face account:
huggingface-cli login
- Running the Model:
- Use the tts.py script for text-to-speech conversion.
- Provide sample audio and corresponding text for voice synthesis.
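Before running the model, it can help to confirm that the setup steps above actually took effect. The following is a minimal sketch of such a check, not part of the Muyan-TTS repository: the function name `check_setup`, the directory name `pretrained_models`, and the three subdirectory names are all hypothetical placeholders, so substitute the names given in the repository's README.

```python
import importlib.util
import shutil
import sys
from pathlib import Path

# Hypothetical subdirectory names; replace with the ones the README specifies.
MODEL_SUBDIRS = ("base-model", "sft-model", "chinese-hubert")


def check_setup(models_root, subdirs=MODEL_SUBDIRS):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # The tutorial creates the virtual environment with Python 3.10.
    if sys.version_info[:2] != (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python 3.10 expected, found {found}")

    # FFmpeg must be discoverable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")

    # PyAudio is imported under the module name 'pyaudio'.
    if importlib.util.find_spec("pyaudio") is None:
        problems.append("pyaudio is not installed (pip install pyaudio)")

    # Each downloaded model should sit in its own subdirectory.
    root = Path(models_root)
    for name in subdirs:
        if not (root / name).is_dir():
            problems.append(f"missing model directory: {root / name}")

    return problems


if __name__ == "__main__":
    # 'pretrained_models' is an assumed directory name.
    for problem in check_setup("pretrained_models"):
        print("WARNING:", problem)
```

Returning a list of problems instead of raising on the first failure lets you see every missing piece in a single run.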
Performance
- Generates approximately 1 second of audio per 30 seconds on standard GPUs.
- VRAM consumption during inference stays below 8 GB under typical usage.
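Speed figures like the one above are easier to compare when expressed as a real-time factor: seconds of compute spent per second of audio produced. The sketch below shows one way to measure it on your own hardware, assuming the model writes a WAV file. The helper names and the sample numbers are illustrative and do not come from the video.

```python
import wave


def wav_duration_seconds(path):
    """Return the playback length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Hypothetical measurement: 6 s of compute produced a 12 s clip.
rtf = real_time_factor(generation_seconds=6.0, audio_seconds=12.0)
print(f"RTF = {rtf:.2f}")
```

In practice you would time the synthesis call with `time.perf_counter()` and pass the output file to `wav_duration_seconds` to get the two inputs.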
Conclusion
- Muyan-TTS shows promise as a viable model for podcasting and voice cloning, with efficient deployment and adaptability to user-specific voices.