NVIDIA Parakeet TDT High-Quality English Transcription - Install Locally



AI Summary

Summary of NVIDIA’s Parakeet Speech-to-Text Model

  • Overview: NVIDIA has released Parakeet, a new speech-to-text model that outperforms Whisper large-v3.
  • Features:
    • Transcribes audio segments of up to 24 minutes in a single pass.
    • Contains only 600 million parameters while maintaining high performance.
    • Predicts word-level timestamps and punctuation well, and handles specialized audio such as song lyrics.
  • Input & Output:
    • Accepts 16 kHz mono audio in WAV or FLAC format.
    • Outputs well-formatted English text with timing information.
  • Architecture: Built for advanced English transcription tasks using an XL variant of the FastConformer.
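
Because the model expects 16 kHz mono input and handles up to 24 minutes per pass, it can help to check a file before uploading it. The following is a minimal sketch using only Python's standard `wave` module; it covers WAV files only (not FLAC), and the function name `check_wav` and the constant `MAX_SECONDS` are illustrative, not part of NeMo.

```python
import wave

# Single-pass limit stated in the summary: 24 minutes of audio.
MAX_SECONDS = 24 * 60


def check_wav(path):
    """Return (ok, details) describing whether a WAV file is 16 kHz mono
    and no longer than 24 minutes."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    ok = rate == 16000 and channels == 1 and seconds <= MAX_SECONDS
    return ok, {"rate_hz": rate, "channels": channels, "seconds": seconds}
```

A file that fails the check would need to be resampled or downmixed with an audio tool of your choice before transcription.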

Installation Steps:

  1. Set up a Conda environment.
  2. Clone NVIDIA’s NeMo repository and navigate to the ASR examples.
  3. Install the NeMo toolkit for Automatic Speech Recognition.
  4. Launch the local demo server on port 7860 using the command provided in the video.
  5. Upload an audio file to transcribe.
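
As an alternative to uploading through the demo page, transcription can likely be scripted from Python once the NeMo toolkit is installed. The sketch below is an assumption-laden illustration, not the command from the video: the checkpoint name `nvidia/parakeet-tdt-0.6b-v2` and the `timestamp["segment"]` output layout follow the publicly posted model card and should be verified against your installed NeMo version. The helper names `format_segments` and `transcribe_with_timestamps` are hypothetical.

```python
def format_segments(segments):
    """Render segment timestamps as 'start - end : text' lines.

    Each item is assumed to be a dict with 'start', 'end' (seconds)
    and 'segment' (text) keys, as shown on the model card.
    """
    return [f"{s['start']}s - {s['end']}s : {s['segment']}" for s in segments]


def transcribe_with_timestamps(audio_path, model_name="nvidia/parakeet-tdt-0.6b-v2"):
    """Transcribe one file and return (text, segment_timestamps).

    The model name is an assumption; substitute the checkpoint you installed.
    """
    # Imported lazily so format_segments works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    output = model.transcribe([audio_path], timestamps=True)
    return output[0].text, output[0].timestamp["segment"]


# Example usage (requires NeMo and a 16 kHz mono file):
#   text, segments = transcribe_with_timestamps("sample.wav")
#   print(text)
#   print("\n".join(format_segments(segments)))
```

The first call downloads the checkpoint, so it needs network access and may take a while.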

Performance:

  • A 10-second audio clip was transcribed accurately, with timestamps.
  • A longer, 3-minute file was also transcribed quickly and accurately.
  • The model seems optimized for English but handled some French and Spanish to a limited degree.
  • Runs with low resource usage (around 3 GB of VRAM).
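
The speed observations above are qualitative. One way to put a number on them is the real-time factor: seconds of audio divided by seconds of processing time. This is a minimal sketch; `real_time_factor` is a hypothetical helper, and `transcribe` stands for whatever zero-argument callable runs your transcription.

```python
import time


def real_time_factor(transcribe, audio_seconds):
    """Run transcribe() once and return (result, audio_seconds / elapsed).

    Values above 1 mean the audio was processed faster than real time.
    """
    start = time.perf_counter()
    result = transcribe()
    elapsed = time.perf_counter() - start
    return result, audio_seconds / elapsed
```

For a 3-minute file, pass `audio_seconds=180`. The first run usually includes model loading, so a second run gives a fairer figure.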

Conclusion:

  • The model is robust, fast, and optimized for GPU acceleration, but it can also run on CPUs. It is licensed under CC BY 4.0; check the license terms before commercial use.