Qwen3-TTS

Qwen3-TTS is an open-source text-to-speech (TTS) system from Alibaba’s Qwen team that supports voice cloning, voice design, and high-quality, human-like speech generation. It delivers state-of-the-art quality with end-to-end streaming latency as low as 97ms and supports 10 languages plus several Chinese dialects.

Core Capabilities

1. Voice Cloning (3-Second)

Clone any voice with minimal reference audio (a quick pre-flight check is sketched after this list):

  • Minimum: 3 seconds of reference audio
  • Recommended: 10-30 seconds for best results
  • Requirements: Clean audio with minimal background noise, varied intonation, accurate transcription
  • Cross-lingual: Cloned voices can generate speech in any supported language
  • Output: Voice can replicate emotion, tone, and speaking style
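Before cloning, it can help to sanity-check the reference clip against the requirements above. The following is a minimal pre-flight sketch, assuming soundfile and numpy are installed; the thresholds and the noise heuristic are illustrative choices, not part of Qwen3-TTS.

import numpy as np
import soundfile as sf

def check_reference_clip(path, min_sec=3.0, good_range=(10.0, 30.0)):
    """Rough pre-flight checks for a voice-cloning reference clip."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:                       # mix stereo down for the checks
        audio = audio.mean(axis=1)
    duration = len(audio) / sr
    if duration < min_sec:
        raise ValueError(f"Clip is {duration:.1f}s; at least {min_sec:.0f}s is required")
    if not good_range[0] <= duration <= good_range[1]:
        print(f"Note: {duration:.1f}s works, but 10-30s usually clones best")
    # Crude noise-floor heuristic: a clean recording should contain some
    # near-silent frames between phrases.
    frame = sr // 10
    rms = np.array([np.sqrt(np.mean(audio[i:i + frame] ** 2))
                    for i in range(0, len(audio) - frame, frame)])
    if rms.min() > 0.05 * rms.max():
        print("Warning: no quiet passages found; background noise may be high")

check_reference_clip("reference_voice.wav")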

2. Voice Design

Create custom voices through natural language descriptions:

  • Control dimensions: Timbre, cadence, emotional nuance, persona
  • No reference audio needed: Pure text-based voice creation
  • Instruction-driven: “Deep male voice with slight rasp,” “Speak with excitement,” “Slow, deliberate pace”
  • Adaptive prosody: Automatically adjusts tone and rhythm based on text meaning
  • Multi-character support: Create consistent character voices for dialogue

3. Custom Voice (Preset Timbres)

Use 9 premium pre-designed timbres (a quick audition loop is sketched after the list):

  • Vivian (Chinese) — Bright, slightly edgy young female
  • Serena (Chinese) — Warm, gentle young female
  • Uncle Fu (Chinese) — Seasoned male with low, mellow timbre
  • Dylan (Beijing dialect) — Youthful male with clear timbre
  • Eric (Sichuan dialect) — Lively male with husky brightness
  • Ryan (English) — Dynamic male with strong rhythmic drive
  • Aiden (English) — Sunny American male with clear midrange
  • Ono Anna (Japanese) — Playful female with light nimble timbre
  • Sohee (Korean) — Warm female with rich emotion
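To pick a preset, it is easy to audition a few timbres on the same sentence. A minimal sketch follows; only the Aiden speaker id appears verbatim in the usage examples below, so the Ryan id, the NumPy return value, and the sample_rate attribute are assumptions.

import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")

line = "The quick brown fox jumps over the lazy dog."
for speaker in ["Aiden", "Ryan"]:            # "Ryan" id is assumed from the list above
    audio = model.generate_custom_voice(text=line, language="en", speaker=speaker)
    # Assumes a NumPy waveform return value and a sample_rate attribute.
    sf.write(f"preset_{speaker.lower()}.wav", audio, model.sample_rate)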

Technical Architecture

Qwen3-TTS-Tokenizer-12Hz

Multi-codebook speech encoder achieving:

  • High-fidelity compression: Efficient acoustic compression while preserving quality
  • Paralinguistic preservation: Maintains emotion, tone, speaking style, acoustic environment
  • Lightweight reconstruction: Non-DiT architecture for fast inference
  • Discrete tokens: Encodes speech as discrete tokens for language-model processing (see the token-budget arithmetic after this list)
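The 12Hz frame rate in the tokenizer name gives a rough sense of the token budget per second of speech. The arithmetic below is illustrative; the number of codebooks per frame is an assumed example value, not an official figure.

# Back-of-the-envelope token budget for a 12Hz multi-codebook tokenizer.
frame_rate_hz = 12      # token frames per second of audio (from the model name)
num_codebooks = 4       # assumed example value, not an official figure
audio_seconds = 60      # one minute of speech

frames = frame_rate_hz * audio_seconds    # 720 frames
tokens = frames * num_codebooks           # 2880 discrete tokens
print(f"{audio_seconds}s of audio -> {frames} frames, {tokens} tokens")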

Dual-Track Streaming Architecture

Enables ultra-low latency synthesis (a streaming usage sketch follows this list):

  • First packet latency: Generated after just 1 character input
  • End-to-end latency: As low as 97 milliseconds
  • Dual-mode: Supports both streaming and non-streaming generation
  • Real-time capable: Suitable for conversational AI applications
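The streaming entry point is not shown in this document, so the sketch below is hypothetical: the stream=True flag and the per-chunk NumPy payload are assumptions, meant only to illustrate how a caller would typically consume a dual-track streaming generator.

import numpy as np
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")

chunks = []
for chunk in model.generate_custom_voice(
    text="Streaming lets playback start before generation finishes.",
    language="en",
    speaker="Aiden",
    stream=True,                 # hypothetical flag; check the library for the real API
):
    chunks.append(chunk)         # a real app would push each chunk to playback here
audio = np.concatenate(chunks)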

End-to-End Language Model

Discrete multi-codebook LM architecture:

  • Full-information modeling: Complete end-to-end speech modeling
  • No bottlenecks: Bypasses traditional LM+DiT information bottlenecks
  • Cascading error reduction: Eliminates cascading errors from separate components
  • Enhanced versatility: Greater generalization and performance ceiling

Performance Metrics

Voice Cloning Quality

  • Multilingual WER (10 languages): 1.835% average
  • Speaker similarity: 0.789 (outperforms MiniMax, ElevenLabs)
  • Cross-lingual capability: Outperforms CosyVoice3
  • Speech stability: Surpasses MiniMax and SeedTTS

Long-Form Generation

  • Continuous synthesis: Up to 10 minutes of audio
  • Chinese WER: 2.36% (10-minute synthesis)
  • English WER: 2.81% (10-minute synthesis)
  • Consistency: Maintains voice quality throughout

Voice Design Performance

  • Instruction-following: Outperforms MiniMax closed-source model
  • Generative expressiveness: Significantly leads open-source competitors
  • Style control: 75.4% score on InstructTTS-Eval benchmark

Model Variants

1.7B Models (Peak Performance)

  • VoiceDesign: Create custom voices from text descriptions (10 languages)
  • CustomVoice: Style control over 9 preset timbres (10 languages)
  • Base: 3-second rapid voice cloning (10 languages)

0.6B Models (Speed/Efficiency Balance)

  • CustomVoice: 9 preset timbres with faster inference (10 languages)
  • Base: Voice cloning optimized for speed (10 languages)
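Switching between variants is just a matter of loading a different checkpoint. Only the 1.7B CustomVoice id appears verbatim in the usage examples below; the other ids in this sketch are assumed to follow the same naming pattern.

from qwen_tts import Qwen3TTSModel

CHECKPOINTS = {
    ("1.7B", "custom_voice"): "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("1.7B", "voice_design"): "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",   # assumed id
    ("1.7B", "clone"):        "Qwen/Qwen3-TTS-12Hz-1.7B-Base",          # assumed id
    ("0.6B", "custom_voice"): "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",   # assumed id
    ("0.6B", "clone"):        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",          # assumed id
}

# Pick the 0.6B variant when latency matters more than peak quality.
model = Qwen3TTSModel(CHECKPOINTS[("0.6B", "custom_voice")])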

Language Support

10 Major Languages:

  • Chinese (Standard, Beijing, Shanghai, Sichuan dialects)
  • English
  • Japanese
  • Korean
  • German
  • French
  • Russian
  • Portuguese
  • Spanish
  • Italian

Multilingual capabilities: Single-speaker multilingual generation, cross-lingual voice cloning
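Cross-lingual cloning means the reference clip and the output text can be in different languages. A minimal sketch, assuming a Base (voice-cloning) checkpoint id that follows the naming pattern above and a "ja" language code:

from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")    # assumed checkpoint id

# English reference clip, Japanese output: the cloned timbre carries over
# while the target language is set explicitly.
audio = model.generate_voice_clone(
    text="こんにちは、これはクロスリンガル音声クローンのテストです。",
    ref_audio="english_reference.wav",
    ref_text="This reference clip was recorded in English.",
    language="ja",                                         # assumed language code
)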

Installation & Setup

Environment Setup

# Create clean Python 3.12 environment  
python3.12 -m venv qwen-tts-env  
source qwen-tts-env/bin/activate  
  
# Install package  
pip install qwen-tts  

Optional: FlashAttention 2 (Reduces VRAM)

# FlashAttention 2 reduces VRAM usage; on machines with <96GB RAM and many
# CPU cores, limit parallel compile jobs (e.g. MAX_JOBS=4 pip install flash-attn)
pip install flash-attn

Usage Examples

Custom Voice Generation

from qwen_tts import Qwen3TTSModel  
  
model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")  
  
# Generate with preset voice  
audio = model.generate_custom_voice(  
    text="Hello, this is a test.",  
    language="en",  
    speaker="Aiden",  
    instruct="Speak with enthusiasm"  
)  

Voice Cloning

# Clone a voice from reference audio  
audio = model.generate_voice_clone(  
    text="Please read this text in my voice.",  
    ref_audio="reference_voice.wav",  
    ref_text="This is my reference voice sample.",  
    language="en"  
)  
  
# Reuse prompt to avoid recomputing features  
prompt = model.create_voice_clone_prompt(  
    ref_audio="reference.wav",  
    ref_text="Reference text"  
)  
  
# Multiple generations with same voice  
for text in text_batch:  
    audio = model.generate_voice_clone(  
        text=text,  
        voice_clone_prompt=prompt  
    )  

Voice Design

# Create voice from description  
audio = model.generate_voice_design(  
    text="Say this with enthusiasm!",  
    instruct="Young female voice, energetic, speaking rapidly"  
)  

Web UI Demo

# Launch local web interface  
qwen-tts-demo  
  
# For HTTPS (required for microphone on remote access)  
qwen-tts-demo --ssl-certfile cert.pem --ssl-keyfile key.pem

OpenAI-Compatible API

from openai import OpenAI  
  
client = OpenAI(base_url="http://localhost:8880/v1", api_key="sk-xxx")  
  
response = client.audio.speech.create(  
    model="qwen3-tts",  
    voice="Vivian",  
    input="This sounds like a real person speaking."  
)  
response.stream_to_file("output.mp3")  

Use Cases

Ideal For

  • Audiobook production: Consistent narrator voices across chapters
  • Content creation: Natural voice-overs for videos
  • Real-time applications: 97ms latency enables live conversation
  • Multilingual content: Cross-lingual voice cloning
  • Character voices: Consistent voices for dialogue/gaming
  • Accessibility: High-quality text-to-speech for assistive tech
  • Local deployment: Privacy-focused, self-hosted solution

Practical Examples

Audiobook Workflow (sketched in code below):

  1. Record 30-60 seconds of desired narrator voice
  2. Clone voice using 3-second minimum
  3. Process chapters in batches
  4. Maintain consistent narration throughout
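A minimal version of that workflow, reusing one clone prompt across chapters. The Base checkpoint id, the NumPy return value, and the sample_rate attribute are assumptions; the API calls themselves mirror the usage examples above.

import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")    # assumed checkpoint id

# Compute the narrator prompt once, then reuse it for every chapter.
prompt = model.create_voice_clone_prompt(
    ref_audio="narrator_reference.wav",
    ref_text="Transcript of the 30-60 second narrator recording.",
)

chapters = ["Chapter one text...", "Chapter two text..."]
for i, chapter_text in enumerate(chapters, start=1):
    audio = model.generate_voice_clone(text=chapter_text, voice_clone_prompt=prompt)
    # Assumes a NumPy waveform return value and a sample_rate attribute.
    sf.write(f"chapter_{i:02d}.wav", audio, model.sample_rate)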

Character Voice Workflow (Design then Clone; sketched in code below):

  1. Use VoiceDesign to synthesize character description
  2. Create voice clone prompt from synthesized clip
  3. Generate all character lines reusing prompt
  4. Consistent voice across entire project
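The same steps in code. The VoiceDesign and Base checkpoint ids, the NumPy return values, and the sample_rate attribute are assumptions; the function calls mirror the usage examples above.

import soundfile as sf
from qwen_tts import Qwen3TTSModel

design_model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")   # assumed id
clone_model = Qwen3TTSModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base")           # assumed id

# 1. Synthesize a short clip that defines the character.
seed_text = "I guard the northern gate, and nobody passes without the password."
seed_audio = design_model.generate_voice_design(
    text=seed_text,
    instruct="Gruff older male voice, slow and suspicious",
)
sf.write("character_seed.wav", seed_audio, design_model.sample_rate)   # assumed return type

# 2. Build a reusable clone prompt from the designed clip.
prompt = clone_model.create_voice_clone_prompt(
    ref_audio="character_seed.wav",
    ref_text=seed_text,
)

# 3. Generate every character line with the same prompt.
for line in ["Halt!", "State your business.", "You may pass."]:
    audio = clone_model.generate_voice_clone(text=line, voice_clone_prompt=prompt)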

Real-Time Performance

  • Latency: 97ms end-to-end (first audio packet after 1 character)
  • Streaming: Real-time generation suitable for interactive AI
  • Processing: First audio packet can be delivered while the input text is still arriving
  • Conversation: Fast enough for natural dialogue flow

Community Integration

OpenAI-Compatible Server

git clone https://github.com/groxaxo/Qwen3-TTS-Openai-Fastapi  
docker build -t qwen3-tts-api .  
docker run --gpus all -p 8880:8880 qwen3-tts-api  

Compatible with:

  • OpenAI Python client
  • Open-WebUI
  • LLM applications expecting OpenAI API
  • Any service using OpenAI TTS endpoints

vLLM Integration

Use vLLM-Omni for optimized inference:

  • Offline inference examples
  • Batch processing
  • Performance optimization

Advantages

  • State-of-the-art quality: Outperforms ElevenLabs, MiniMax
  • Ultra-low latency: 97ms enables real-time conversation
  • Voice cloning: Just 3 seconds of reference audio
  • Voice design: Create voices from text descriptions
  • Multilingual: 10+ languages with dialect support
  • Long-form: Generate 10 minutes of continuous speech
  • Open-source: Apache 2.0 license
  • Self-hosted: Full privacy, no cloud dependence
  • Flexible models: 0.6B (fast) to 1.7B (high quality)
  • Well-documented: Comprehensive examples and guides
  • Community momentum: 3.3k+ GitHub stars

Limitations & Considerations

⚠️ Requires GPU for acceptable inference speed
⚠️ 0.6B model faster but lower quality than 1.7B
⚠️ Streaming mode has different latency profile than batch
⚠️ Best results with clean reference audio for voice cloning
⚠️ Cross-lingual cloning may not perfectly reproduce every accent

Deployment Options

Local Installation

  • Direct Python package install
  • Web UI demo
  • Full control, privacy

API Access

  • DashScope API (Alibaba Cloud) — Mainland China
  • DashScope International — Global access
  • Real-time endpoints for all models

Docker/Containerized

  • Pre-built images available
  • GPU support
  • OpenAI-compatible FastAPI wrapper

Last updated: January 2026
Confidence: High (official documentation and GitHub)
Status: Active development
License: Apache 2.0
Creator: Alibaba Qwen Team
GitHub Stars: 3.3k+
Key Advantage: 97ms latency for real-time voice synthesis