Qwen3-TTS

Qwen3-TTS is an open-source text-to-speech (TTS) system from Alibaba’s Qwen team that supports voice cloning, voice design, and high-quality, human-like speech generation. It delivers state-of-the-art quality with end-to-end streaming latency as low as 97ms and supports 10 languages plus several Chinese dialects.

Core Capabilities

1. Voice Cloning (3-Second)

Clone any voice with minimal reference audio (a quick pre-flight check is sketched after this list):

  • Minimum: 3 seconds of reference audio
  • Recommended: 10-30 seconds for best results
  • Requirements: Clean audio with minimal background noise, varied intonation, accurate transcription
  • Cross-lingual: Cloned voices can generate speech in any supported language
  • Output: Voice can replicate emotion, tone, and speaking style
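Before cloning, it can help to sanity-check the reference clip against the requirements above. The following is a minimal pre-flight sketch, assuming soundfile and numpy are installed; the thresholds and the noise heuristic are illustrative choices, not part of Qwen3-TTS.

import numpy as np
import soundfile as sf

def check_reference_clip(path, min_sec=3.0, good_range=(10.0, 30.0)):
    """Rough pre-flight checks for a voice-cloning reference clip."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:                       # mix stereo down for the checks
        audio = audio.mean(axis=1)
    duration = len(audio) / sr
    if duration < min_sec:
        raise ValueError(f"Clip is {duration:.1f}s; at least {min_sec:.0f}s is required")
    if not good_range[0] <= duration <= good_range[1]:
        print(f"Note: {duration:.1f}s works, but 10-30s usually clones best")
    # Crude noise-floor heuristic: a clean recording should contain some
    # near-silent frames between phrases.
    frame = sr // 10
    rms = np.array([np.sqrt(np.mean(audio[i:i + frame] ** 2))
                    for i in range(0, len(audio) - frame, frame)])
    if rms.min() > 0.05 * rms.max():
        print("Warning: no quiet passages found; background noise may be high")

check_reference_clip("reference_voice.wav")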

2. Voice Design

Create custom voices through natural language descriptions:

  • Control dimensions: Timbre, cadence, emotional nuance, persona
  • No reference audio needed: Pure text-based voice creation
  • Instruction-driven: “Deep male voice with slight rasp,” “Speak with excitement,” “Slow, deliberate pace”
  • Adaptive prosody: Automatically adjusts tone and rhythm based on text meaning
  • Multi-character support: Create consistent character voices for dialogue

3. Custom Voice (Preset Timbres)

Use 9 premium pre-designed timbres (a quick audition loop is sketched after the list):

  • Vivian (Chinese) — Bright, slightly edgy young female
  • Serena (Chinese) — Warm, gentle young female
  • Uncle Fu (Chinese) — Seasoned male with low, mellow timbre
  • Dylan (Beijing dialect) — Youthful male with clear timbre
  • Eric (Sichuan dialect) — Lively male with husky brightness
  • Ryan (English) — Dynamic male with strong rhythmic drive
  • Aiden (English) — Sunny American male with clear midrange
  • Ono Anna (Japanese) — Playful female with light nimble timbre
  • Sohee (Korean) — Warm female with rich emotion
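To pick a preset, it is easy to audition a few timbres on the same sentence. A minimal sketch follows; only the Aiden speaker id appears verbatim in the usage examples below, so the Ryan id, the NumPy return value, and the sample_rate attribute are assumptions.

import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")

line = "The quick brown fox jumps over the lazy dog."
for speaker in ["Aiden", "Ryan"]:            # "Ryan" id is assumed from the list above
    audio = model.generate_custom_voice(text=line, language="en", speaker=speaker)
    # Assumes a NumPy waveform return value and a sample_rate attribute.
    sf.write(f"preset_{speaker.lower()}.wav", audio, model.sample_rate)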

Technical Architecture

Qwen3-TTS-Tokenizer-12Hz

Multi-codebook speech encoder achieving:

  • High-fidelity compression: Efficient acoustic compression while preserving quality
  • Paralinguistic preservation: Maintains emotion, tone, speaking style, acoustic environment
  • Lightweight reconstruction: Non-DiT architecture for fast inference
  • Discrete tokens: Encodes speech as discrete tokens for language-model processing (see the token-budget arithmetic after this list)
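The 12Hz frame rate in the tokenizer name gives a rough sense of the token budget per second of speech. The arithmetic below is illustrative; the number of codebooks per frame is an assumed example value, not an official figure.

# Back-of-the-envelope token budget for a 12Hz multi-codebook tokenizer.
frame_rate_hz = 12      # token frames per second of audio (from the model name)
num_codebooks = 4       # assumed example value, not an official figure
audio_seconds = 60      # one minute of speech

frames = frame_rate_hz * audio_seconds    # 720 frames
tokens = frames * num_codebooks           # 2880 discrete tokens
print(f"{audio_seconds}s of audio -> {frames} frames, {tokens} tokens")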

Dual-Track Streaming Architecture

Enables ultra-low latency synthesis (a streaming usage sketch follows this list):

  • First packet latency: Generated after just 1 character input
  • End-to-end latency: As low as 97 milliseconds
  • Dual-mode: Supports both streaming and non-streaming generation
  • Real-time capable: Suitable for conversational AI applications
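The streaming entry point is not shown in this document, so the sketch below is hypothetical: the stream=True flag and the per-chunk NumPy payload are assumptions, meant only to illustrate how a caller would typically consume a dual-track streaming generator.

import numpy as np
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")

chunks = []
for chunk in model.generate_custom_voice(
    text="Streaming lets playback start before generation finishes.",
    language="en",
    speaker="Aiden",
    stream=True,                 # hypothetical flag; check the library for the real API
):
    chunks.append(chunk)         # a real app would push each chunk to playback here
audio = np.concatenate(chunks)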

End-to-End Language Model

Discrete multi-codebook LM architecture:

  • Full-information modeling: Complete end-to-end speech modeling
  • No bottlenecks: Bypasses traditional LM+DiT information bottlenecks
  • Cascading error reduction: Eliminates cascading errors from separate components
  • Enhanced versatility: Greater generalization and performance ceiling

Performance Metrics

Voice Cloning Quality

  • Multilingual WER (10 languages): 1.835% average
  • Speaker similarity: 0.789 (outperforms MiniMax, ElevenLabs)
  • Cross-lingual capability: Outperforms CosyVoice3
  • Speech stability: Surpasses MiniMax and SeedTTS

Long-Form Generation

  • Continuous synthesis: Up to 10 minutes of audio
  • Chinese WER: 2.36% (10-minute synthesis)
  • English WER: 2.81% (10-minute synthesis)
  • Consistency: Maintains voice quality throughout

Voice Design Performance

  • Instruction-following: Outperforms MiniMax closed-source model
  • Generative expressiveness: Significantly leads open-source competitors
  • Style control: 75.4% score on InstructTTS-Eval benchmark

Model Variants

1.7B Models (Peak Performance)

  • VoiceDesign: Create custom voices from text descriptions (10 languages)
  • CustomVoice: Style control over 9 preset timbres (10 languages)
  • Base: 3-second rapid voice cloning (10 languages)

0.6B Models (Speed/Efficiency Balance)

  • CustomVoice: 9 preset timbres with faster inference (10 languages)
  • Base: Voice cloning optimized for speed (10 languages)
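Switching between variants is just a matter of loading a different checkpoint. Only the 1.7B CustomVoice id appears verbatim in the usage examples below; the other ids in this sketch are assumed to follow the same naming pattern.

from qwen_tts import Qwen3TTSModel

CHECKPOINTS = {
    ("1.7B", "custom_voice"): "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("1.7B", "voice_design"): "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",   # assumed id
    ("1.7B", "clone"):        "Qwen/Qwen3-TTS-12Hz-1.7B-Base",          # assumed id
    ("0.6B", "custom_voice"): "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",   # assumed id
    ("0.6B", "clone"):        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",          # assumed id
}

# Pick the 0.6B variant when latency matters more than peak quality.
model = Qwen3TTSModel(CHECKPOINTS[("0.6B", "custom_voice")])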

Language Support

10 Major Languages:

  • Chinese (Standard, Beijing, Shanghai, Sichuan dialects)
  • English
  • Japanese
  • Korean
  • German
  • French
  • Russian
  • Portuguese
  • Spanish
  • Italian

Multilingual capabilities: Single-speaker multilingual generation, cross-lingual voice cloning
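Cross-lingual cloning means the reference clip and the output text can be in different languages. A minimal sketch, assuming a Base (voice-cloning) checkpoint id that follows the naming pattern above and a "ja" language code:

from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")    # assumed checkpoint id

# English reference clip, Japanese output: the cloned timbre carries over
# while the target language is set explicitly.
audio = model.generate_voice_clone(
    text="こんにちは、これはクロスリンガル音声クローンのテストです。",
    ref_audio="english_reference.wav",
    ref_text="This reference clip was recorded in English.",
    language="ja",                                         # assumed language code
)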

Installation & Setup

Environment Setup

# Create clean Python 3.12 environment  
python3.12 -m venv qwen-tts-env  
source qwen-tts-env/bin/activate  
  
# Install package  
pip install qwen-tts  

Optional: FlashAttention 2 (Reduces VRAM)

# FlashAttention 2 reduces VRAM usage; on machines with <96GB RAM and many
# CPU cores, limit parallel compile jobs (e.g. MAX_JOBS=4 pip install flash-attn)
pip install flash-attn

Usage Examples

Custom Voice Generation

from qwen_tts import Qwen3TTSModel  
  
model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")  
  
# Generate with preset voice  
audio = model.generate_custom_voice(  
    text="Hello, this is a test.",  
    language="en",  
    speaker="Aiden",  
    instruct="Speak with enthusiasm"  
)  

Voice Cloning

# Clone a voice from reference audio  
audio = model.generate_voice_clone(  
    text="Please read this text in my voice.",  
    ref_audio="reference_voice.wav",  
    ref_text="This is my reference voice sample.",  
    language="en"  
)  
  
# Reuse prompt to avoid recomputing features  
prompt = model.create_voice_clone_prompt(  
    ref_audio="reference.wav",  
    ref_text="Reference text"  
)  
  
# Multiple generations with same voice  
for text in text_batch:  
    audio = model.generate_voice_clone(  
        text=text,  
        voice_clone_prompt=prompt  
    )  

Voice Design

# Create voice from description  
audio = model.generate_voice_design(  
    text="Say this with enthusiasm!",  
    instruct="Young female voice, energetic, speaking rapidly"  
)  

Web UI Demo

# Launch local web interface  
qwen-tts-demo  
  
# For HTTPS (required for microphone on remote access)  
qwen-tts-demo --ssl-certfile cert.pem --ssl-keyfile key.pem

OpenAI-Compatible API

from openai import OpenAI  
  
client = OpenAI(base_url="http://localhost:8880/v1", api_key="sk-xxx")  
  
response = client.audio.speech.create(  
    model="qwen3-tts",  
    voice="Vivian",  
    input="This sounds like a real person speaking."  
)  
response.stream_to_file("output.mp3")  

Use Cases

Ideal For

  • Audiobook production: Consistent narrator voices across chapters
  • Content creation: Natural voice-overs for videos
  • Real-time applications: 97ms latency enables live conversation
  • Multilingual content: Cross-lingual voice cloning
  • Character voices: Consistent voices for dialogue/gaming
  • Accessibility: High-quality text-to-speech for assistive tech
  • Local deployment: Privacy-focused, self-hosted solution

Practical Examples

Audiobook Workflow (sketched in code below):

  1. Record 30-60 seconds of desired narrator voice
  2. Clone voice using 3-second minimum
  3. Process chapters in batches
  4. Maintain consistent narration throughout
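A minimal version of that workflow, reusing one clone prompt across chapters. The Base checkpoint id, the NumPy return value, and the sample_rate attribute are assumptions; the API calls themselves mirror the usage examples above.

import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")    # assumed checkpoint id

# Compute the narrator prompt once, then reuse it for every chapter.
prompt = model.create_voice_clone_prompt(
    ref_audio="narrator_reference.wav",
    ref_text="Transcript of the 30-60 second narrator recording.",
)

chapters = ["Chapter one text...", "Chapter two text..."]
for i, chapter_text in enumerate(chapters, start=1):
    audio = model.generate_voice_clone(text=chapter_text, voice_clone_prompt=prompt)
    # Assumes a NumPy waveform return value and a sample_rate attribute.
    sf.write(f"chapter_{i:02d}.wav", audio, model.sample_rate)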

Character Voice Workflow (Design then Clone; sketched in code below):

  1. Use VoiceDesign to synthesize character description
  2. Create voice clone prompt from synthesized clip
  3. Generate all character lines reusing prompt
  4. Consistent voice across entire project
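The same steps in code. The VoiceDesign and Base checkpoint ids, the NumPy return values, and the sample_rate attribute are assumptions; the function calls mirror the usage examples above.

import soundfile as sf
from qwen_tts import Qwen3TTSModel

design_model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")   # assumed id
clone_model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")           # assumed id

# 1. Synthesize a short clip that defines the character.
seed_text = "I guard the northern gate, and nobody passes without the password."
seed_audio = design_model.generate_voice_design(
    text=seed_text,
    instruct="Gruff older male voice, slow and suspicious",
)
sf.write("character_seed.wav", seed_audio, design_model.sample_rate)   # assumed return type

# 2. Build a reusable clone prompt from the designed clip.
prompt = clone_model.create_voice_clone_prompt(
    ref_audio="character_seed.wav",
    ref_text=seed_text,
)

# 3. Generate every character line with the same prompt.
for line in ["Halt!", "State your business.", "You may pass."]:
    audio = clone_model.generate_voice_clone(text=line, voice_clone_prompt=prompt)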

Real-Time Performance

  • Latency: 97ms end-to-end (first audio packet after 1 character)
  • Streaming: Real-time generation suitable for interactive AI
  • Processing: First audio packet can be delivered while the input text is still arriving
  • Conversation: Fast enough for natural dialogue flow

Community Integration

OpenAI-Compatible Server

git clone https://github.com/groxaxo/Qwen3-TTS-Openai-Fastapi  
docker build -t qwen3-tts-api .  
docker run --gpus all -p 8880:8880 qwen3-tts-api  

Compatible with:

  • OpenAI Python client
  • Open-WebUI
  • LLM applications expecting OpenAI API
  • Any service using OpenAI TTS endpoints

vLLM Integration

Use vLLM-Omni for optimized inference:

  • Offline inference examples
  • Batch processing
  • Performance optimization

Advantages

  • State-of-the-art quality: Outperforms ElevenLabs, MiniMax
  • Ultra-low latency: 97ms enables real-time conversation
  • Voice cloning: Just 3 seconds of reference audio
  • Voice design: Create voices from text descriptions
  • Multilingual: 10+ languages with dialect support
  • Long-form: Generate 10 minutes of continuous speech
  • Open-source: Apache 2.0 license
  • Self-hosted: Full privacy, no cloud dependence
  • Flexible models: 0.6B (fast) to 1.7B (high quality)
  • Well-documented: Comprehensive examples and guides
  • Community momentum: 3.3k+ GitHub stars

Limitations & Considerations

⚠️ Requires GPU for acceptable inference speed
⚠️ 0.6B model faster but lower quality than 1.7B
⚠️ Streaming mode has different latency profile than batch
⚠️ Best results with clean reference audio for voice cloning
⚠️ Cross-lingual cloning may not perfectly reproduce every accent

Deployment Options

Local Installation

  • Direct Python package install
  • Web UI demo
  • Full control, privacy

API Access

  • DashScope API (Alibaba Cloud) — Mainland China
  • DashScope International — Global access
  • Real-time endpoints for all models

Docker/Containerized

  • Pre-built images available
  • GPU support
  • OpenAI-compatible FastAPI wrapper

Last updated: January 2026
Confidence: High (official documentation and GitHub)
Status: Active development
License: Apache 2.0
Creator: Alibaba Qwen Team
GitHub Stars: 3.3k+
Key Advantage: 97ms latency for real-time voice synthesis