Aero-1 Audio: Audio Language Model for ASR - Install and Test Locally



AI Summary

Overview of the Aero-1 Audio Model

  • Developed by LMMs-Lab for automatic speech recognition and audio tasks.
  • Built on the Qwen 2.5 1.5B language model architecture.
  • Competes with larger models like OpenAI’s Whisper and commercial offerings like ElevenLabs.
  • Trained using 16 H100 GPUs on 50,000 hours of high-quality filtered audio data in just one day.

Installation Steps

  1. Create a virtual environment.
  2. Install the necessary prerequisites (a command sketch follows this list):
    • Transformers: pip install transformers
    • Gradio: pip install gradio (the full set of commands is shared in the repository).
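
A minimal setup sketch, assuming Linux or macOS with Python 3.10+ and a CUDA-capable GPU (the exact commands shared in the repository take precedence):

    python -m venv aero-env
    source aero-env/bin/activate
    pip install torch transformers gradio librosa accelerate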

Running the Model

  • Code from the repository downloads the model and launches a Gradio demo (a minimal sketch follows this list):
    • Model size: ~5 GB.
    • Provides audio transcription features.
    • Access the Gradio interface in your browser.
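
A minimal inference-and-demo sketch is shown below, assuming the Hugging Face model id lmms-lab/Aero-1-Audio-1.5B, a chat-style processor interface similar to other Qwen-derived audio models, and 16 kHz input. The prompt schema and processor keyword arguments here are assumptions; the code shared in the GitHub repository is the authoritative version.

    import gradio as gr
    import librosa
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Model id is an assumption; confirm it on the LMMs-Lab Hugging Face page.
    MODEL_ID = "lmms-lab/Aero-1-Audio-1.5B"

    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",
        device_map="cuda",          # requires the accelerate package
        trust_remote_code=True,     # the model ships custom code with its weights
    )

    def transcribe(audio_path: str) -> str:
        """Hypothetical helper: transcribe one audio file with Aero-1 Audio."""
        # Resample to 16 kHz mono, the rate most speech models expect (assumption).
        audio, _ = librosa.load(audio_path, sr=16000)
        # Chat-style prompt with an audio placeholder; the exact message schema
        # is an assumption and should be copied from the official example.
        messages = [{
            "role": "user",
            "content": [
                {"type": "audio", "audio": "placeholder"},
                {"type": "text", "text": "Please transcribe this audio."},
            ],
        }]
        prompt = processor.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=False
        )
        inputs = processor(
            text=prompt, audios=[audio], sampling_rate=16000, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=512)
        # Decode only the newly generated tokens, not the prompt.
        new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
        return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

    # Simple Gradio demo: upload or record audio, read back the transcription.
    demo = gr.Interface(
        fn=transcribe,
        inputs=gr.Audio(type="filepath", label="Audio file"),
        outputs=gr.Textbox(label="Transcription"),
        title="Aero-1 Audio ASR demo",
    )
    demo.launch()  # prints a local URL (http://127.0.0.1:7860 by default)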

Testing and Observations

  • Several audio files were fed in for transcription tests.
  • The model performed well with curated examples but struggled with user-uploaded files, especially in different formats.
  • Noted high VRAM consumption (over 5 GB) for the model; a quick way to check the peak usage is sketched below.
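
To reproduce the VRAM observation, PyTorch's peak-memory counters can be wrapped around a single call to the (hypothetical) transcribe helper from the sketch above:

    import torch

    torch.cuda.reset_peak_memory_stats()
    _ = transcribe("sample.wav")  # illustrative path; use any test recording
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM during transcription: {peak_gib:.2f} GiB")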

Final Thoughts

  • Despite being an innovative effort, the model showed only average performance; newer ASR models such as NVIDIA’s Parakeet outperformed it significantly.
  • Currently supports English only.

Links

  • GitHub repository for model details: GitHub Link.
  • Discount codes and more details available in the video description.