NVIDIA PersonaPlex - Full-Duplex Conversational Speech Model

Overview

NVIDIA PersonaPlex is a 7-billion-parameter full-duplex speech-to-speech conversational model that combines zero-shot voice cloning and text-based role conditioning for natural, real-time conversations. Released in January 2026, it resolves the traditional tradeoff between conversational naturalness (full-duplex models locked to a fixed voice and role) and customization (cascaded ASR→LLM→TTS systems with rigid turn-taking).

PersonaPlex listens and speaks simultaneously using a single unified architecture, enabling natural conversational dynamics like interruptions, overlaps, backchannels (“uh-huh”, “okay”), and rapid turn-taking with sub-second latency.

Key Differentiators

Solves Traditional Limitations:

  • ❌ Cascaded ASR→LLM→TTS: customizable, but robotic and rigid in turn-taking
  • ❌ Full-duplex models (Moshi): natural, but locked to a fixed voice and role
  • ✅ PersonaPlex: natural, with customizable voice and role control

Dual Prompting System:

  1. Voice Prompt - Audio embedding defining vocal characteristics, speaking style, prosody (zero-shot voice cloning)
  2. Text Prompt - Role definition, background, context, and business instructions for persona control (a usage sketch follows below)
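
As a concrete illustration, a session might be conditioned on both prompts as below. This is a minimal sketch only: the personaplex package, the start_session helper, and its parameters are hypothetical assumptions, not the published API.

# Hypothetical dual-prompt session setup; the package and function names
# are illustrative assumptions, not the published PersonaPlex API.
import torchaudio
import personaplex  # hypothetical package name

# Voice prompt: a short reference clip that defines timbre and prosody.
voice_wav, sample_rate = torchaudio.load("reference_speaker.wav")

# Text prompt: the role, background, and business instructions.
text_prompt = (
    "You work for First Neuron Bank. Your name is Sanni Virtanen. "
    "Verify the customer's identity and explain the declined transaction."
)

# One session conditioned on both prompts (assumed interface).
session = personaplex.start_session(
    voice_prompt=(voice_wav, sample_rate),
    text_prompt=text_prompt,
)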

Architecture

Model Specs

  • Base Model: Moshi (7B parameters from Kyutai)
  • Network: Dual-stream Transformer with concurrent listening/speaking
  • Audio Codec: Mimi neural codec (24kHz sample rate)
  • Components:
    • Mimi Speech Encoder (ConvNet + Transformer)
    • Temporal Transformer + Depth Transformer (see the sketch after this list)
    • Mimi Speech Decoder (Transformer + ConvNet)
  • Underlying Language Model: Helium (enables out-of-distribution generalization)
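
To make the Temporal + Depth split concrete, here is a simplified sketch of the per-frame decoding pattern used in Moshi-style models: the temporal transformer yields one context vector per audio frame, and a depth model expands it autoregressively into that frame's codebook tokens. All modules and dimensions are illustrative (a GRU stands in for the depth transformer), not the actual PersonaPlex layout.

# Simplified per-frame decoding in a temporal/depth factorization.
# Illustrative only: a GRU stands in for the depth transformer.
import torch
import torch.nn as nn

class DepthDecoder(nn.Module):
    """Expands one temporal context vector into K codebook tokens."""
    def __init__(self, dim=512, num_codebooks=8, codebook_size=2048):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, codebook_size)
        self.num_codebooks = num_codebooks

    def forward(self, context):               # context: (B, dim), one per frame
        h = context.unsqueeze(0)               # temporal context seeds the state
        inp = torch.zeros(context.shape[0], 1, context.shape[1])
        tokens = []
        for _ in range(self.num_codebooks):    # one token per codebook, in order
            out, h = self.rnn(inp, h)
            tok = self.head(out[:, -1]).argmax(-1)   # greedy, for illustration
            tokens.append(tok)
            inp = self.embed(tok).unsqueeze(1)
        return torch.stack(tokens, dim=-1)     # (B, K) audio tokens for the frame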

Full-Duplex Design

  • User audio and agent speech are processed simultaneously
  • Streaming input and output with 170-240 ms response latency
  • Generates both text tokens and audio tokens autoregressively
  • Learns non-verbal dynamics: pausing, interruption timing, natural rhythm (see the streaming sketch below)
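
A minimal sketch of what such a loop looks like in code, assuming per-frame encode/step/decode interfaces (these are assumptions for illustration, not the released API):

# Hypothetical full-duplex streaming loop; encode_frame, step, and
# decode_frame are assumed interfaces, not the published API.
def duplex_loop(model, mic_frames, speaker):
    """Consume user audio and emit agent audio one frame at a time."""
    state = model.init_state()                     # assumed: conversation context
    for user_frame in mic_frames:                  # e.g. 80 ms of 24 kHz PCM
        user_tokens = model.encode_frame(user_frame)          # Mimi encoder
        agent_tokens, state = model.step(user_tokens, state)  # one duplex step
        agent_frame = model.decode_frame(agent_tokens)        # Mimi decoder
        speaker.write(agent_frame)                 # speak while still listening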

Training Data

Real Conversations:

  • 7,303 conversations (1,217 hours) from Fisher English corpus
  • Multi-speaker overlapping audio with speaker separation
  • Annotated with varying detail levels using GPT-OSS-120B

Synthetic Data:

  • 1,840 hours of customer service dialogs (105,410 dialogs)
  • 410 hours of QA dialogs (39,322 dialogs)
  • Generated using LLMs + ChatterboxTTS + TortoiseTTS

Total: ~3,470 hours of conversational training data (~1,217 hours real + ~2,250 hours synthetic)

Capabilities & Use Cases

Conversational Dynamics

  • Interruptions - Users can interrupt while the agent is speaking; the model hears the interruption and yields naturally
  • Turn-Taking - Sub-second latency, natural pacing
  • Backchanneling - Contextual “mm-hmm”, “yeah”, “okay” expressions
  • Pauses - Natural silence and thinking time
  • Overlaps - Handles speech overlaps naturally

Real-World Applications

  • Customer Service - Verify identity, handle scenarios (banking, medical office)
  • Assistants - General knowledge answering with persona
  • Specialized Roles - Emergency scenarios (spaceship reactor meltdown), character-based interactions
  • Accessibility - Personalized conversational interfaces
  • Multi-Character - Different voices/personalities in single session

Example Scenarios

  1. Banking - “You work for First Neuron Bank. Name: Sanni Virtanen. A $1,200 Home Depot transaction was declined…”
  2. Medical Reception - “Record patient: full name, DOB, allergies, tobacco/alcohol use, medical history…”
  3. Character Role - “You’re an astronaut on a Mars mission. The reactor core is melting…”

Performance Metrics

FullDuplexBench Scores (Published Results)

Metric                              Score
Pause Handling (Synthetic)          0.358 ↓
Pause Handling (Candor)             0.431 ↓
Backchannel Rate                    0.042 ↑
Smooth Turn-Taking                  0.908 ↑
Turn-Taking Latency                 0.170 s ↓
User Interruption Handling          0.950 ↑
Interruption Latency                0.240 s ↓
Response Quality (GPT-4o judge)     4.29/5 ↑
Voice Similarity (WavLM-TDNN)       0.650 ↑

(↑ = higher is better, ↓ = lower is better)

According to the published results, PersonaPlex outperforms comparable open-source and commercial systems on conversational dynamics, latency, and task adherence.
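
The Voice Similarity row is the kind of score obtained by comparing speaker embeddings of the voice prompt against the generated speech. Below is a sketch of how such a metric is commonly computed with a WavLM-based x-vector model from Hugging Face; the specific checkpoint and preprocessing used in the paper are assumptions here.

# Cosine similarity between speaker embeddings (WavLM x-vector).
# The checkpoint choice is an assumption, not the paper's exact setup.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

def speaker_embedding(path):
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(0), sr, 16_000)  # mono, 16 kHz
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).embeddings[0]

score = torch.nn.functional.cosine_similarity(
    speaker_embedding("voice_prompt.wav"),
    speaker_embedding("agent_output.wav"),
    dim=0,
)
print(f"voice similarity: {score.item():.3f}")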

ServiceDuplexBench (Custom Benchmark)

  • Extended FullDuplexBench with 350 customer service evaluation questions
  • Evaluates proper noun recall, context adherence, unfulfillable requests, rudeness management
  • PersonaPlex achieves SOTA performance across service scenarios

Technical Details

Inference

  • Hardware: NVIDIA A100 80GB, H100, or other Hopper-class GPUs
  • Framework: PyTorch
  • Latency: 170-240 ms response time
  • Real-time Operation: streaming audio in/out at 24 kHz (see the frame-budget sketch below)
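
For intuition on these numbers at the codec level: Mimi emits frames at 12.5 Hz (one frame per 80 ms of audio, per the Moshi paper), so a 170-240 ms response corresponds to roughly two to three codec frames. A quick back-of-the-envelope check:

# Latency budget in Mimi codec frames (12.5 Hz frame rate, 24 kHz audio).
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
frame_ms = 1000 / FRAME_RATE_HZ                          # 80.0 ms per frame
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples
for latency_ms in (170, 240):
    print(f"{latency_ms} ms ≈ {latency_ms / frame_ms:.1f} codec frames")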

Licensing & Availability

  • Code: MIT License (GitHub)
  • Model Weights: NVIDIA Open Model License (commercial use allowed)
  • Base Model: CC-BY-4.0 (Moshi from Kyutai)
  • Release Date: January 15, 2026
  • Platform: Hugging Face (nvidia/personaplex-7b-v1)

Installation & Setup

From Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/personaplex-7b-v1")
model = AutoModelForCausalLM.from_pretrained("nvidia/personaplex-7b-v1")

Docker Deployment

# --gpus all requires the NVIDIA Container Toolkit for GPU passthrough
docker run --gpus all -e FAL_KEY="..." ghcr.io/nvidia/personaplex-7b-v1

Requirements

  • Python 3.8+
  • PyTorch
  • NVIDIA GPU with ~80GB VRAM (A100 or H100 recommended; a quick check follows below)
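
A quick pre-flight check that the visible GPU meets the VRAM requirement, using standard PyTorch calls:

# Verify the GPU has roughly the recommended 80 GB of VRAM.
import torch

assert torch.cuda.is_available(), "an NVIDIA GPU is required"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.0f} GB VRAM")
if vram_gb < 80:
    print("warning: below the recommended 80 GB; expect out-of-memory errors")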

Prompting Examples

Voice Prompt

A 3-10 second audio sample establishing the target speaker’s voice characteristics and speaking style.
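
A minimal sketch of preparing such a clip, assuming the model expects mono 24 kHz audio to match the Mimi codec (the exact preprocessing PersonaPlex applies is an assumption here):

# Prepare a voice prompt: downmix to mono and resample to 24 kHz.
import torchaudio

wav, sr = torchaudio.load("speaker_reference.wav")        # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                       # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 24_000)     # match Mimi's sample rate
duration_s = wav.shape[1] / 24_000
assert 3.0 <= duration_s <= 10.0, "use a 3-10 second reference clip"
torchaudio.save("voice_prompt_24k.wav", wav, 24_000)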

Text Prompt Examples

Assistant Role:

You are a wise and friendly teacher. Answer questions or provide   
advice in a clear and engaging way.  

Customer Service - Banking:

You work for First Neuron Bank. Your name is Sanni Virtanen.   
The customer's transaction for $1,200 at Home Depot was declined.   
Verify customer identity and explain the unusual location flag.  

Character Role:

You are Alex, an astronaut on a Mars mission. The reactor core   
is melting. Several systems are failing. Explain the situation   
and urgently ask for help stabilizing the reactor.  

Benchmarks & Evaluation

Available Benchmarks

  1. FullDuplexBench - Turn-taking, interruption, pause handling, QA quality
  2. ServiceDuplexBench - Customer service scenarios, role adherence, context recall
  3. WavLM-TDNN - Voice similarity/speaker consistency metrics

Strengths

✅ State-of-the-art conversational dynamics
✅ Natural interruption handling
✅ Strong role/persona adherence
✅ Low-latency responses
✅ Voice consistency with prompts
✅ Generalizes to OOD scenarios

Limitations

⚠️ English-only (v1.0)
⚠️ Requires GPU with ~80GB VRAM
⚠️ Training data limited to ~3,470 hours

Related Projects

  • Moshi - Base architecture from Kyutai Labs
  • Helium - Language model enabling semantic understanding
  • Mimi - Neural audio codec
  • Model Context Protocol - For integration with AI assistants
  • NVIDIA - Company behind PersonaPlex research

Research & Papers

  • Paper: PersonaPlex preprint
  • Benchmark Paper: Full-Duplex-Bench (arXiv:2503.04721)
  • Base Model Paper: Moshi (arXiv:2410.00037)
  • Speech Processing: WavLM (arXiv:2110.13900)

Comparison to Alternatives

Feature             PersonaPlex       Moshi      Cascaded ASR→LLM→TTS
Full-Duplex         ✅                ✅         ❌
Voice Cloning       ✅ Zero-shot      ❌         ✅ (TTS only)
Role Conditioning   ✅ Text prompt    ❌         ✅ (LLM)
Latency             <250 ms           <200 ms    1-3 s
Naturalness         High              High       Lower
Customization       High              Low        High
Task Adherence      SOTA              Good       Good

Use in Production

Ideal For:

  • Customer service bots with human-like interaction
  • Personalized voice assistants
  • Character/NPC dialogue systems
  • Accessibility applications
  • Real-time translation (with fine-tuning)

Deployment Considerations:

  • Requires A100/H100 GPU infrastructure
  • ~80GB VRAM per instance
  • Batch processing possible for throughput
  • Streaming API design for real-time interaction (sketched below)
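
As a sketch of that last design point, a session can hold two independent directions open at once, with one coroutine feeding user audio in, one draining agent audio out, and the model stepping between them. All names below are illustrative assumptions, not an NVIDIA API.

# Illustrative full-duplex session skeleton (e.g. over the `websockets`
# library); `model` and its methods are assumed interfaces.
import asyncio

async def pump_uplink(websocket, inbox: asyncio.Queue):
    """Receive user audio frames from the client and queue them."""
    async for frame in websocket:
        await inbox.put(frame)

async def pump_downlink(websocket, outbox: asyncio.Queue):
    """Send generated agent audio frames back to the client."""
    while True:
        await websocket.send(await outbox.get())

async def run_session(websocket, model):
    inbox, outbox = asyncio.Queue(), asyncio.Queue()

    async def infer():
        state = model.init_state()                          # assumed API
        while True:
            user_frame = await inbox.get()
            agent_frame, state = model.step(user_frame, state)  # assumed API
            await outbox.put(agent_frame)

    await asyncio.gather(pump_uplink(websocket, inbox),
                         pump_downlink(websocket, outbox),
                         infer())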

Notes

  • Represents major advancement in conversational AI naturalness + customization
  • First model to combine zero-shot voice cloning with instruction-based role conditioning in real-time duplex
  • NVIDIA ADLR team builds on open-source foundations (Kyutai Moshi, TortoiseTTS, ChatterboxTTS)
  • Addresses key limitation of previous duplex models (fixed voice/role)
  • Training on hybrid real + synthetic data shows strong generalization
  • Results show a 0.950 user-interruption-handling score and 170-240 ms response latency