NVIDIA PersonaPlex - Full-Duplex Conversational Speech Model
Overview
NVIDIA PersonaPlex is a 7-billion-parameter full-duplex speech-to-speech conversational model that combines voice cloning and role conditioning for natural, real-time conversations. Released in January 2026, it addresses the traditional tradeoff between naturalness (full-duplex models locked to a fixed voice and role) and customizability (cascaded ASR→LLM→TTS systems).
PersonaPlex listens and speaks simultaneously using a single unified architecture, enabling natural conversational dynamics like interruptions, overlaps, backchannels (“uh-huh”, “okay”), and rapid turn-taking with sub-second latency.
Key Differentiators
Solves Traditional Limitations:
- ❌ Cascaded ASR→LLM→TTS: Customizable but robotic, rigid turn-taking
- ❌ Full-duplex models (Moshi): Natural but fixed voice and role
- ✅ PersonaPlex: Natural + Customizable voice + Role control
Dual Prompting System:
- Voice Prompt - Audio embedding defining vocal characteristics, speaking style, prosody (zero-shot voice cloning)
- Text Prompt - Role definition, background, context, and business instructions for persona control (a combined usage sketch follows this list)
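The released inference API is not documented in this section, so the following is only a minimal sketch of how the two prompts might be assembled; `PersonaPlexSession` and its parameters are illustrative assumptions, not the actual interface (only the `torchaudio` calls are standard):

```python
import torchaudio

# Voice prompt: a short reference clip defining timbre and prosody.
waveform, sr = torchaudio.load("reference_speaker.wav")
waveform = torchaudio.functional.resample(waveform, sr, 24_000)  # Mimi runs at 24 kHz

# Text prompt: role, background, and task instructions.
text_prompt = (
    "You work for First Neuron Bank. Your name is Sanni Virtanen. "
    "Verify the customer's identity and explain the unusual location flag."
)

# Hypothetical session object conditioning generation on both prompts:
# session = PersonaPlexSession(model, voice_prompt=waveform, text_prompt=text_prompt)
```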
Architecture
Model Specs
- Base Model: Moshi (7B parameters from Kyutai)
- Network: Dual-stream Transformer with concurrent listening/speaking
- Audio Codec: Mimi neural codec (24kHz sample rate)
- Components (dataflow sketched after this list):
- Mimi Speech Encoder (ConvNet + Transformer)
- Temporal Transformer + Depth Transformer
- Mimi Speech Decoder (Transformer + ConvNet)
- Underlying Language Model: Helium (enables OOD generalization)
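The component list pins down the dataflow but not the sizes. The shape-level sketch below is an illustration only: the codebook count, codebook size, and model width are assumptions in line with the Moshi/Mimi lineage, not published PersonaPlex hyperparameters.

```python
import torch

B, T, Q = 1, 100, 8        # batch, audio frames, codebooks per frame (assumed, per Mimi)
d_model = 4096             # temporal transformer width (assumed for a 7B model)

# Mimi encoder: waveform -> discrete tokens, Q codebooks per frame.
user_codes = torch.randint(0, 2048, (B, T, Q))   # 2048-entry codebooks (assumed)
agent_codes = torch.randint(0, 2048, (B, T, Q))  # the agent's own past speech

# Temporal transformer: one fused embedding per frame, mixing the user
# stream, the agent stream, and the text-token stream.
frame_states = torch.randn(B, T, d_model)

# Depth transformer: decodes the Q codebooks of the next agent frame
# autoregressively, conditioned on the current frame's temporal state.
next_frame_state = frame_states[:, -1]           # (B, d_model)
```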
Full-Duplex Design
- User audio and agent speech processed simultaneously
- Streaming input/output with 170-240 ms response latency
- Generates text tokens and audio tokens autoregressively (see the step sketch after this list)
- Learns non-verbal aspects: pausing, interruption timing, natural rhythm
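As a hedged illustration of that loop, the sketch below shows one streaming step; `encode_frame`, `step`, and `decode_frame` are hypothetical helpers standing in for the real interfaces:

```python
def duplex_step(model, mic_frame, state):
    """One streaming step: the model listens and speaks in the same pass."""
    user_tokens = model.mimi_encoder.encode_frame(mic_frame)          # listen
    state, text_token, agent_tokens = model.step(user_tokens, state)  # temporal + depth transformers
    agent_audio = model.mimi_decoder.decode_frame(agent_tokens)       # speak
    return state, agent_audio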
Training Data
Real Conversations:
- 7,303 conversations (1,217 hours) from the Fisher English corpus
- Multi-speaker overlapping audio with speaker separation
- Annotated with varying detail levels using GPT-OSS-120B
Synthetic Data:
- 1,840 hours of customer service dialogs (105,410 dialogs)
- 410 hours of QA dialogs (39,322 dialogs)
- Generated using LLMs + ChatterboxTTS + TortoiseTTS
Total: ~2,250 hours of synthetic data plus 1,217 hours of real conversations (~3,470 hours combined); a generation-pipeline sketch follows
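The synthetic pipeline itself is not released, but its shape is clear from the description above: an LLM scripts the dialog and a TTS model voices each turn. A hedged sketch with placeholder function names:

```python
def synthesize_dialog(llm_generate, tts_synthesize, scenario_prompt):
    """Placeholder pipeline: LLM-scripted dialog rendered turn-by-turn with TTS."""
    turns = llm_generate(scenario_prompt)   # e.g. [{"role": "agent", "text": "..."}, ...]
    audio_turns = []
    for turn in turns:
        ref = "agent_voice.wav" if turn["role"] == "agent" else "user_voice.wav"
        audio_turns.append(tts_synthesize(turn["text"], reference=ref))
    return turns, audio_turns               # transcripts + per-turn audio
```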
Capabilities & Use Cases
Conversational Dynamics
- Interruptions - User can interrupt without agent stopping
- Turn-Taking - Sub-second latency, natural pacing
- Backchanneling - Contextual “mm-hmm”, “yeah”, “okay” expressions
- Pauses - Natural silence and thinking time
- Overlaps - Handles speech overlaps naturally
Real-World Applications
- Customer Service - Verify identity, handle scenarios (banking, medical office)
- Assistants - General knowledge answering with persona
- Specialized Roles - Emergency scenarios (spaceship reactor meltdown), character-based interactions
- Accessibility - Personalized conversational interfaces
- Multi-Character - Different voices/personalities in single session
Example Scenarios
- Banking - “You work for First Neuron Bank. Name: Sanni Virtanen. A $1,200 Home Depot transaction was declined…”
- Medical Reception - “Record patient: full name, DOB, allergies, tobacco/alcohol use, medical history…”
- Emergency Roleplay - “You’re an astronaut on a Mars mission. The reactor core is melting…”
Performance Metrics
FullDuplexBench Scores (published results; ↑ = higher is better, ↓ = lower is better)
| Metric | Score |
|---|---|
| Pause Handling (Synthetic) | 0.358 ↓ |
| Pause Handling (Candor) | 0.431 ↓ |
| Backchannel Rate | 0.042 ↑ |
| Smooth Turn Taking | 0.908 ↑ |
| Turn-Taking Latency | 0.170s ↓ |
| User Interruption Handling | 0.950 ↑ |
| Interruption Latency | 0.240s ↓ |
| Response Quality (GPT-4o judge) | 4.29/5 ↑ |
| Voice Similarity (WavLM-TDNN) | 0.650 ↑ |
On these published results, PersonaPlex outperforms other open-source and commercial systems on conversational dynamics, latency, and task adherence.
ServiceDuplexBench (Custom Benchmark)
- Extended FullDuplexBench with 350 customer service evaluation questions
- Evaluates proper noun recall, context adherence, unfulfillable requests, rudeness management
- PersonaPlex achieves SOTA performance across service scenarios
Technical Details
Inference
- Hardware: NVIDIA A100 80GB (Ampere), or H100 and other Hopper-based GPUs
- Framework: PyTorch
- Latency: 170-240 ms response time
- Real-time Operation: Streaming audio in/out at 24 kHz (a frame-budget calculation follows this list)
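A back-of-envelope frame budget, assuming PersonaPlex inherits Mimi's 12.5 Hz frame rate from Moshi (an assumption, not a stated PersonaPlex spec):

```python
SAMPLE_RATE = 24_000                               # Hz, per the Mimi codec
FRAME_RATE = 12.5                                  # Hz, assumed from Moshi/Mimi
samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)  # 1920 samples
frame_ms = 1000 / FRAME_RATE                       # 80.0 ms per frame

# To run in real time, one full step (encode + transformer step + decode)
# must finish within one frame interval; the reported 170-240 ms response
# latency then corresponds to roughly 2-3 frames.
print(samples_per_frame, frame_ms)
```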
Licensing & Availability
- Code: MIT License (GitHub)
- Model Weights: NVIDIA Open Model License (commercial use allowed)
- Base Model: CC-BY-4.0 (Moshi from Kyutai)
- Release Date: January 15, 2026
- Platform: Hugging Face (nvidia/personaplex-7b-v1)
Installation & Setup
From Hugging Face
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nvidia/personaplex-7b-v1")
```
Docker Deployment
```bash
docker run -e FAL_KEY="..." ghcr.io/nvidia/personaplex-7b-v1
```
Requirements
- Python 3.8+
- PyTorch
- NVIDIA GPU with 80GB VRAM (A100, H100 recommended)
Prompting Examples
Voice Prompt
An audio sample (3-10 seconds) establishing the target speaker’s voice characteristics (preparation sketched below)
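A minimal preparation sketch using standard `torchaudio` calls; the 3-10 s window and 24 kHz target come from this document, while the exact preprocessing PersonaPlex applies is not specified:

```python
import torchaudio

waveform, sr = torchaudio.load("speaker_reference.wav")
waveform = waveform.mean(dim=0, keepdim=True)                    # mix down to mono
waveform = torchaudio.functional.resample(waveform, sr, 24_000)  # match Mimi's 24 kHz
waveform = waveform[:, : 10 * 24_000]                            # cap at 10 s
```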
Text Prompt Examples
Assistant Role:
You are a wise and friendly teacher. Answer questions or provide
advice in a clear and engaging way.
Customer Service - Banking:
You work for First Neuron Bank. Your name is Sanni Virtanen.
The customer's transaction for $1,200 at Home Depot was declined.
Verify customer identity and explain the unusual location flag.
Character Role:
You are Alex, an astronaut on a Mars mission. The reactor core
is melting. Several systems are failing. Explain the situation
and urgently ask for help stabilizing the reactor.
Benchmarks & Evaluation
Available Benchmarks
- FullDuplexBench - Turn-taking, interruption, pause handling, QA quality
- ServiceDuplexBench - Customer service scenarios, role adherence, context recall
- WavLM-TDNN - Voice similarity / speaker-consistency metric (a similarity sketch follows this list)
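For the voice-similarity metric, a workable stand-in is cosine similarity between WavLM x-vector speaker embeddings. The sketch below uses the public microsoft/wavlm-base-plus-sv checkpoint, which may differ from the paper's exact WavLM-TDNN setup:

```python
import torch
from transformers import AutoFeatureExtractor, WavLMForXVector

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
sv_model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

def voice_similarity(audio_a, audio_b, sr=16_000):
    """Cosine similarity between speaker embeddings of two 16 kHz clips."""
    inputs = extractor([audio_a, audio_b], sampling_rate=sr,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = sv_model(**inputs).embeddings
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()
```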
Strengths
✅ State-of-the-art conversational dynamics
✅ Natural interruption handling
✅ Strong role/persona adherence
✅ Low-latency responses
✅ Voice consistency with prompts
✅ Generalizes to OOD scenarios
Limitations
⚠️ English-only (v1.0)
⚠️ Requires GPU with ~80GB VRAM
⚠️ Training data limited to ~3,470 hours total (1,217 h real + ~2,250 h synthetic)
Related Technologies
- Moshi - Base architecture from Kyutai Labs
- Helium - Language model enabling semantic understanding
- Mimi - Neural audio codec
- Model Context Protocol - For integration with AI assistants
- NVIDIA - Company behind PersonaPlex research
Research & Papers
- Paper: PersonaPlex preprint (PDF linked under Resources)
- Benchmark Paper: Full-Duplex-Bench (arXiv:2503.04721)
- Base Model Paper: Moshi (arXiv:2410.00037)
- Speech Processing: WavLM (arXiv:2110.13900)
Comparison to Alternatives
| Feature | PersonaPlex | Moshi | Cascaded ASR→LLM→TTS |
|---|---|---|---|
| Full-Duplex | ✅ | ✅ | ❌ |
| Voice Cloning | ✅ Zero-shot | ❌ | ✅ (TTS only) |
| Role Conditioning | ✅ Text prompt | ❌ | ✅ (LLM) |
| Latency | <250ms | <200ms | 1-3s |
| Naturalness | High | High | Lower |
| Customization | High | Low | High |
| Task Adherence | SOTA | Good | Good |
Use in Production
Ideal For:
- Customer service bots with human-like interaction
- Personalized voice assistants
- Character/NPC dialogue systems
- Accessibility applications
- Real-time translation (with fine-tuning)
Deployment Considerations:
- Requires A100/H100 GPU infrastructure
- ~80GB VRAM per instance
- Batch processing possible for throughput
- Streaming API design for real-time interaction (a hypothetical server loop is sketched below)
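One hypothetical shape for that streaming API, using websockets; the endpoint, chunk format, and `infer_step` stub are illustrative assumptions, not a shipped NVIDIA interface:

```python
import asyncio
import websockets

def infer_step(mic_chunk, state):
    # Placeholder: a real deployment would run one PersonaPlex frame here.
    return state, mic_chunk  # echo, for illustration only

async def handle(ws):
    state = None
    async for mic_chunk in ws:                            # raw PCM from the client
        state, agent_chunk = infer_step(mic_chunk, state)
        await ws.send(agent_chunk)                        # stream agent audio back

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()                            # serve forever

# asyncio.run(main())
```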
Resources
- Research Page: https://research.nvidia.com/labs/adlr/personaplex/
- Model Card: https://huggingface.co/nvidia/personaplex-7b-v1
- GitHub Repository: https://github.com/NVIDIA/personaplex
- Paper PDF: https://research.nvidia.com/labs/adlr/files/personaplex/personaplex_preprint.pdf
- Demo Audio Examples: https://research.nvidia.com/labs/adlr/personaplex/
Notes
- Represents major advancement in conversational AI naturalness + customization
- First model to combine zero-shot voice cloning with instruction-based role conditioning in a real-time full-duplex system
- NVIDIA ADLR team builds on open-source foundations (Kyutai Moshi, TortoiseTTS, ChatterboxTTS)
- Addresses key limitation of previous duplex models (fixed voice/role)
- Training on hybrid real + synthetic data shows strong generalization
- Results show a 95% user-interruption handling rate and sub-second response latency