Muyan-TTS Make Podcasts with AI Model Locally Step-by-Step Tutorial



AI Summary

Overview

  • Video by Fahd Mirza on Muyan-TTS, a text-to-speech model optimized for podcast scenarios.

Key Features

  • Open-source and trainable model suitable for zero- and one-shot use cases.
  • Two model versions: a base TTS model (multi-speaker) and a supervised fine-tuned (SFT) model (single speaker).
  • Efficient for voice cloning with lightweight fine-tuning capabilities.

Installation Instructions

  1. Environment Setup:
    • Install FFmpeg multimedia library.
    • Create a virtual environment using Python 3.10.
  2. Clone the Repository:
    • Link for repository in video description.
    • Install PyAudio using the command: pip install pyaudio.
  3. Directory Structure:
    • Create a directory for pre-trained models with specified subdirectories.
    • Download the base model, the SFT model, and the Chinese HuBERT model.
    • Log in to Hugging Face with huggingface-cli login, which prompts for an access token from your Hugging Face account.
  4. Running the Model:
    • Use tts.py script for text-to-speech conversion.
    • Provide sample audio and corresponding text for voice synthesis.
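
The setup steps above can be sanity-checked with a small script before any models are downloaded. The following is a minimal sketch, not part of the Muyan-TTS repository: it checks that FFmpeg is on the PATH and that the interpreter is Python 3.10, then creates a pre-trained models directory. The function names, the root folder name pretrained_models, and the three subdirectory names are assumptions made for illustration; the summary only says "specified subdirectories", so follow the repository's README for the actual layout.

```python
import shutil
import sys
from pathlib import Path

# Assumed subdirectory names, one per model mentioned in the summary.
# The real names come from the repository's README.
ASSUMED_MODEL_DIRS = ("Muyan-TTS", "Muyan-TTS-SFT", "chinese-hubert-base")


def check_prerequisites(expected_python=(3, 10)):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # shutil.which returns None when the executable is not on the PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    found = tuple(sys.version_info[:2])
    if found != tuple(expected_python):
        problems.append(
            f"expected Python {expected_python[0]}.{expected_python[1]}, "
            f"found {found[0]}.{found[1]}"
        )
    return problems


def create_model_dirs(root="pretrained_models", names=ASSUMED_MODEL_DIRS):
    """Create root/<name> for every name and return the resulting paths."""
    created = []
    for name in names:
        path = Path(root) / name
        # exist_ok makes the call safe to repeat on an existing layout.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling check_prerequisites() inside the activated virtual environment and printing the returned list shows what is still missing; create_model_dirs() can then be run from the repository root before downloading the three models into the folders it creates.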

Performance

  • Generates roughly 1 second of audio per 30 seconds of processing time on standard GPUs.
  • VRAM consumption during inference stays below 8 GB under typical usage.
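
Throughput figures like these are easier to compare across machines as a real-time factor (RTF): the time spent generating divided by the duration of the audio produced, where a value below 1 means faster than real time. The sketch below is an illustration rather than anything shown in the video. The function names are hypothetical, and it assumes the output is saved as a WAV file that Python's standard wave module can read.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a WAV file in seconds (frame count divided by sample rate)."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimate_generation_seconds(target_audio_seconds, rtf):
    """Projected compute time for a clip of the given length at a measured RTF."""
    return target_audio_seconds * rtf
```

As a worked example with made-up measurements: if a run takes 15 seconds to produce a 60-second clip, real_time_factor(15.0, 60.0) gives 0.25, and estimate_generation_seconds(600.0, 0.25) projects 150 seconds of compute for a 10-minute episode. Timing the tts.py run yourself and passing the output file to wav_duration_seconds gives the two inputs for your own hardware.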

Conclusion

  • Muyan-TTS shows promise as a viable model for podcasting and voice cloning, with efficient deployment and adaptability to user-specific voices.