Qwen3-TTS
Qwen3-TTS is an open-source text-to-speech (TTS) system developed by Alibaba's Qwen team. It supports voice cloning, voice design, and high-quality, human-like speech generation, achieving state-of-the-art results with end-to-end streaming latency as low as 97 ms and support for 10 languages plus several Chinese dialects.
Core Capabilities
1. Voice Cloning (3-Second)
Clone any voice with minimal reference audio:
- Minimum: 3 seconds of reference audio
- Recommended: 10-30 seconds for best results
- Requirements: Clean audio with minimal background noise, varied intonation, accurate transcription
- Cross-lingual: Cloned voices can generate speech in any supported language
- Output: Voice can replicate emotion, tone, and speaking style
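For example, a voice cloned from a short Chinese reference clip can then read English text. A minimal sketch using the generate_voice_clone API shown under Usage Examples below; the checkpoint id and file paths are placeholders:

from qwen_tts import Qwen3TTSModel

# Base (cloning) checkpoint id follows the naming pattern used later on this page; treat it as an assumption.
model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")

# A ~3-10 s clip of the target speaker plus an accurate transcription of that clip.
audio = model.generate_voice_clone(
    text="The quarterly report is ready for review.",  # output language differs from the reference
    ref_audio="speaker_zh.wav",                        # Chinese reference recording (placeholder path)
    ref_text="这是我的参考语音样本。",                      # transcription of the reference clip
    language="en",
)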
2. Voice Design
Create custom voices through natural language descriptions:
- Control dimensions: Timbre, cadence, emotional nuance, persona
- No reference audio needed: Pure text-based voice creation
- Instruction-driven: “Deep male voice with slight rasp,” “Speak with excitement,” “Slow, deliberate pace”
- Adaptive prosody: Automatically adjusts tone and rhythm based on text meaning
- Multi-character support: Create consistent character voices for dialogue
3. Custom Voice (Preset Timbres)
Use 9 premium pre-designed timbres:
- Vivian (Chinese) — Bright, slightly edgy young female
- Serena (Chinese) — Warm, gentle young female
- Uncle Fu (Chinese) — Seasoned male with low, mellow timbre
- Dylan (Beijing dialect) — Youthful male with clear timbre
- Eric (Sichuan dialect) — Lively male with husky brightness
- Ryan (English) — Dynamic male with strong rhythmic drive
- Aiden (English) — Sunny American male with clear midrange
- Ono Anna (Japanese) — Playful female with light nimble timbre
- Sohee (Korean) — Warm female with rich emotion
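To audition the presets, loop over speaker names with the CustomVoice model from Usage Examples below. A rough sketch; the save step is omitted because the return type of the call is not documented on this page:

from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")

for speaker in ["Ryan", "Aiden"]:  # the two English presets from the list above
    audio = model.generate_custom_voice(
        text="This is a short timbre audition.",
        language="en",
        speaker=speaker,
    )
    # How `audio` is persisted (e.g. to f"{speaker}.wav") depends on the object the package returns.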
Technical Architecture
Qwen3-TTS-Tokenizer-12Hz
Multi-codebook speech encoder achieving:
- High-fidelity compression: Efficient acoustic compression while preserving quality
- Paralinguistic preservation: Maintains emotion, tone, speaking style, acoustic environment
- Lightweight reconstruction: Non-DiT architecture for fast inference
- Discrete tokens: Encodes speech as tokens for language model processing
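Back-of-the-envelope token budget implied by the 12 Hz frame rate; the number of codebooks per frame is not stated here, so it is a parameter below:

# Rough token-budget arithmetic for a 12 Hz multi-codebook tokenizer.
FRAME_RATE_HZ = 12   # frames of speech tokens per second (from the tokenizer name)
NUM_CODEBOOKS = 4    # assumption: the actual codebook count is not given on this page

def token_count(seconds: float) -> int:
    """Total discrete tokens the language model must emit for `seconds` of audio."""
    return int(seconds * FRAME_RATE_HZ * NUM_CODEBOOKS)

print(token_count(3))        # ~144 tokens for a 3-second reference clip
print(token_count(10 * 60))  # ~28,800 tokens for a 10-minute long-form synthesis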
Dual-Track Streaming Architecture
Enables ultra-low latency synthesis:
- First packet latency: Generated after just 1 character input
- End-to-end latency: As low as 97 milliseconds
- Dual-mode: Supports both streaming and non-streaming generation
- Real-time capable: Suitable for conversational AI applications
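In practice the number that matters for conversational use is time-to-first-audio. The snippet below measures it; note that the streaming call signature (a stream=True flag yielding audio chunks) is purely hypothetical, since this page does not document the streaming API:

import time
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")

start = time.perf_counter()
first_packet_ms = None

# Hypothetical streaming interface: assume the call yields audio chunks as they are decoded.
for chunk in model.generate_custom_voice(
    text="Hello there, how can I help you today?",
    language="en",
    speaker="Ryan",
    stream=True,  # assumption: not confirmed by this page
):
    if first_packet_ms is None:
        first_packet_ms = (time.perf_counter() - start) * 1000
    # Forward `chunk` to the audio output or WebSocket here.

print(f"time to first audio packet: {first_packet_ms:.0f} ms")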
End-to-End Language Model
Discrete multi-codebook LM architecture:
- Full-information modeling: Complete end-to-end speech modeling
- No bottlenecks: Bypasses traditional LM+DiT information bottlenecks
- Cascading error reduction: Eliminates cascading errors from separate components
- Enhanced versatility: Greater generalization and performance ceiling
Performance Metrics
Voice Cloning Quality
- Multilingual WER (10 languages): 1.835% average
- Speaker similarity: 0.789 (outperforms MiniMax, ElevenLabs)
- Cross-lingual capability: Outperforms CosyVoice3
- Speech stability: Surpasses MiniMax and SeedTTS
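WER here is word error rate, typically measured by transcribing the synthesized audio with an ASR system and comparing the transcript against the input text; speaker similarity is usually a cosine similarity between speaker embeddings. For reference, the standard WER computation in plain Python, no dependencies:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.333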
Long-Form Generation
- Continuous synthesis: Up to 10 minutes of audio
- Chinese WER: 2.36% (10-minute synthesis)
- English WER: 2.81% (10-minute synthesis)
- Consistency: Maintains voice quality throughout
Voice Design Performance
- Instruction-following: Outperforms MiniMax's closed-source model
- Generative expressiveness: Significantly ahead of open-source competitors
- Style control: 75.4% score on InstructTTS-Eval benchmark
Model Variants
1.7B Models (Peak Performance)
| Model | Features | Languages | Streaming | Instruction Control |
|---|---|---|---|---|
| VoiceDesign | Create custom voices from text descriptions | 10 | ✅ | ✅ |
| CustomVoice | Style control over 9 preset timbres | 10 | ✅ | ✅ |
| Base | 3-second rapid voice cloning | 10 | ✅ | ❌ |
0.6B Models (Speed/Efficiency Balance)
| Model | Features | Languages | Streaming | Instruction Control |
|---|---|---|---|---|
| CustomVoice | 9 preset timbres with faster inference | 10 | ✅ | ❌ |
| Base | Voice cloning optimized for speed | 10 | ✅ | ❌ |
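Switching variants is just a matter of loading a different checkpoint. Only the 1.7B CustomVoice repo id appears verbatim on this page; the other ids below follow the same naming pattern and should be treated as assumptions:

from qwen_tts import Qwen3TTSModel

# Peak quality with instruction control:
hq = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")

# Lower latency / smaller footprint (repo id assumed from the naming pattern):
fast = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-0.6B-Base")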
Language Support
10 Major Languages:
- Chinese (Standard, Beijing, Shanghai, Sichuan dialects)
- English
- Japanese
- Korean
- German
- French
- Russian
- Portuguese
- Spanish
- Italian
Multilingual capabilities: Single-speaker multilingual generation, cross-lingual voice cloning
Installation & Setup
Environment Setup
# Create clean Python 3.12 environment
python3.12 -m venv qwen-tts-env
source qwen-tts-env/bin/activate
# Install package
pip install qwen-tts
Optional: FlashAttention 2 (Reduces VRAM)
# Optional: FlashAttention 2 reduces GPU memory use during inference.
# If your machine has <96GB RAM and many CPU cores, limit parallel compile jobs when building (e.g. MAX_JOBS=4).
pip install flash-attn
Usage Examples
Custom Voice Generation
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
# Generate with preset voice
audio = model.generate_custom_voice(
    text="Hello, this is a test.",
    language="en",
    speaker="Aiden",
    instruct="Speak with enthusiasm"
)
Voice Cloning
# Clone a voice from reference audio
audio = model.generate_voice_clone(
    text="Please read this text in my voice.",
    ref_audio="reference_voice.wav",
    ref_text="This is my reference voice sample.",
    language="en"
)
# Reuse prompt to avoid recomputing features
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Reference text"
)
# Multiple generations with same voice
for text in text_batch:
    audio = model.generate_voice_clone(
        text=text,
        voice_clone_prompt=prompt
    )
Voice Design
# Create voice from description
audio = model.generate_voice_design(
    text="Say this with enthusiasm!",
    instruct="Young female voice, energetic, speaking rapidly"
)
Web UI Demo
# Launch local web interface
qwen-tts-demo
# For HTTPS (required for microphone on remote access)
qwen-tts-demo --ssl-certfile cert.pem --ssl-keyfile key.pem
OpenAI-Compatible API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="sk-xxx")
response = client.audio.speech.create(
    model="qwen3-tts",
    voice="Vivian",
    input="This sounds like a real person speaking."
)
response.stream_to_file("output.mp3")
Use Cases
Ideal For
- Audiobook production: Consistent narrator voices across chapters
- Content creation: Natural voice-overs for videos
- Real-time applications: 97ms latency enables live conversation
- Multilingual content: Cross-lingual voice cloning
- Character voices: Consistent voices for dialogue/gaming
- Accessibility: High-quality text-to-speech for assistive tech
- Local deployment: Privacy-focused, self-hosted solution
Practical Examples
Audiobook Workflow:
- Record 30-60 seconds of desired narrator voice
- Create a voice-clone prompt from the recording (3-second minimum; 10-30 seconds recommended), as sketched below
- Process chapters in batches
- Maintain consistent narration throughout
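A minimal sketch of this workflow, reusing the prompt-caching API from Usage Examples above; the Base repo id, file layout, and the final save step are assumptions:

from pathlib import Path
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")  # repo id assumed for the Base (cloning) variant

# One-time: build the narrator prompt from the 30-60 s recording and its transcript.
narrator = model.create_voice_clone_prompt(
    ref_audio="narrator_sample.wav",
    ref_text="Transcript of the narrator sample.",
)

# Batch: synthesize each chapter with the same cached prompt for consistent narration.
for chapter_file in sorted(Path("chapters").glob("*.txt")):
    audio = model.generate_voice_clone(
        text=chapter_file.read_text(),
        voice_clone_prompt=narrator,
    )
    # Write `audio` to e.g. f"audio/{chapter_file.stem}.wav" using whatever save utility the package provides.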
Character Voice Workflow (Design then Clone):
- Use VoiceDesign to synthesize a sample clip from the character description (see the sketch below)
- Create voice clone prompt from synthesized clip
- Generate all character lines reusing prompt
- Consistent voice across entire project
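A sketch of the design-then-clone chain using the APIs from Usage Examples above; the repo ids and the step that writes the designed clip to disk are assumptions:

from qwen_tts import Qwen3TTSModel

designer = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")  # repo id assumed
cloner = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")           # repo id assumed

# 1. Synthesize a short clip that matches the character description.
seed_text = "Stand back, I know exactly what I'm doing."
seed_audio = designer.generate_voice_design(
    text=seed_text,
    instruct="Gravelly older male voice, theatrical, slow and deliberate",
)
# Assumption: persist `seed_audio` to "villain_seed.wav" with the package's save utility.

# 2. Turn the designed clip into a reusable clone prompt.
villain = cloner.create_voice_clone_prompt(
    ref_audio="villain_seed.wav",
    ref_text=seed_text,
)

# 3. Generate every line of dialogue with the same prompt for a consistent character voice.
for line in ["Who dares enter my tower?", "You are too late."]:
    audio = cloner.generate_voice_clone(text=line, voice_clone_prompt=villain)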
Real-Time Performance
- Latency: 97ms end-to-end (first audio packet after 1 character)
- Streaming: Real-time generation suitable for interactive AI
- Processing: The first audio packet can be delivered while input text is still arriving
- Conversation: Fast enough for natural dialogue flow
Community Integration
OpenAI-Compatible Server
git clone https://github.com/groxaxo/Qwen3-TTS-Openai-Fastapi
docker build -t qwen3-tts-api .
docker run --gpus all -p 8880:8880 qwen3-tts-api Compatible with:
- OpenAI Python client
- Open-WebUI
- LLM applications expecting OpenAI API
- Any service using OpenAI TTS endpoints
vLLM Integration
Use vLLM-Omni for optimized inference:
- Offline inference examples
- Batch processing
- Performance optimization
Advantages
✅ State-of-the-art quality: Outperforms ElevenLabs, MiniMax
✅ Ultra-low latency: 97ms enables real-time conversation
✅ Voice cloning: Just 3 seconds of reference audio
✅ Voice design: Create voices from text descriptions
✅ Multilingual: 10+ languages with dialect support
✅ Long-form: Generate 10 minutes of continuous speech
✅ Open-source: Apache 2.0 license
✅ Self-hosted: Full privacy, no cloud dependence
✅ Flexible models: 0.6B (fast) to 1.7B (high quality)
✅ Well-documented: Comprehensive examples and guides
✅ Community momentum: 3.3k+ GitHub stars
Limitations & Considerations
⚠️ Requires GPU for acceptable inference speed
⚠️ 0.6B model faster but lower quality than 1.7B
⚠️ Streaming mode has different latency profile than batch
⚠️ Best results with clean reference audio for voice cloning
⚠️ Cross-lingual cloning may not perfectly reproduce every accent
Deployment Options
Local Installation
- Direct Python package install
- Web UI demo
- Full control, privacy
API Access
- DashScope API (Alibaba Cloud) — Mainland China
- DashScope International — Global access
- Real-time endpoints for all models
Docker/Containerized
- Pre-built images available
- GPU support
- OpenAI-compatible FastAPI wrapper
Related Resources
- GitHub: https://github.com/QwenLM/Qwen3-TTS
- Hugging Face: https://huggingface.co/collections/Qwen/qwen3-tts
- HF Demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS
- ModelScope Demo: https://modelscope.cn/studios/Qwen/Qwen3-TTS
- Research Paper: https://arxiv.org/abs/2601.15621
- DashScope API: https://www.alibabacloud.com/help/en/model-studio/qwen-tts-voice-design
Related Concepts
- Text-to-Speech - Overview of TTS technology
- Voice Cloning - Voice cloning techniques
- Speech Synthesis - Speech generation methods
- Streaming Audio - Real-time audio generation
- Multilingual TTS - Multi-language synthesis
- Qwen - Alibaba’s Qwen model family
- Open-Source AI - Open-source AI projects
Last updated: January 2026
Confidence: High (official documentation and GitHub)
Status: Active development
License: Apache 2.0
Creator: Alibaba Qwen Team
GitHub Stars: 3.3k+
Key Advantage: 97ms latency for real-time voice synthesis