Aero-1 Audio: Audio Language Model for ASR - Install and Test Locally
AI Summary
Overview of the Aero-1 Audio Model
- Developed by LMMs-Lab for automatic speech recognition and other audio tasks.
- Built on the Qwen 2.5 1.5B language model architecture.
- Competes with larger models such as OpenAI’s Whisper and commercial offerings such as ElevenLabs.
- Trained using 16 H100 GPUs on 50,000 hours of high-quality filtered audio data in just one day.
Installation Steps
- Create a virtual environment.
- Install necessary prerequisites:
- Transformers:
pip install transformers
- Gradio: the install commands are included in the repository (see the sketch below).
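A minimal setup sketch, assuming a standard Python virtual environment; the exact dependency list (e.g., torch and gradio) is an assumption, so follow the repository's README for the authoritative commands:

```bash
# Create and activate a virtual environment (standard venv workflow)
python3 -m venv aero-env
source aero-env/bin/activate

# Install the prerequisites; torch and gradio are assumed extras here,
# check the repository's README for the exact list
pip install transformers torch gradio
```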
Running the Model
- Code to download the model and launch a Gradio demo (see the sketch after this list).
- Model size: ~5 GB.
- Provides audio transcription features.
- Access the Gradio interface in your browser.
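A minimal sketch of the download-and-demo step, assuming the Hugging Face repo id lmms-lab/Aero-1-Audio-1.5B and a Qwen2-Audio-style processor interface (both are assumptions; the official model card documents the exact usage):

```python
# Minimal sketch: download the model and expose a small Gradio transcription demo.
# The repo id and the processor call pattern are assumptions based on typical
# audio LLMs in transformers; follow the official model card for exact usage.
import gradio as gr
import librosa
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "lmms-lab/Aero-1-Audio-1.5B"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

def transcribe(audio_path: str) -> str:
    # Load the uploaded file as 16 kHz mono, the sample rate most ASR models expect.
    audio, sr = librosa.load(audio_path, sr=16000, mono=True)
    messages = [{
        "role": "user",
        "content": [
            {"type": "audio", "audio": "placeholder"},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, audios=[audio], sampling_rate=sr, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[:, inputs["input_ids"].shape[-1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath", label="Audio file"),
    outputs=gr.Textbox(label="Transcription"),
    title="Aero-1 Audio ASR demo",
).launch()
```

Once launched, Gradio prints a local URL (typically http://127.0.0.1:7860) that you open in the browser to upload audio files and read the transcriptions.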
Testing and Observations
- Input several audio files for transcription tests.
- The model performed well with curated examples but struggled with user-uploaded files, especially files in other formats (a conversion sketch follows this list).
- Noted high VRAM consumption (over 5 GB) for the model.
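Since the format issues showed up mainly with user-uploaded files, one practical workaround is to normalize every upload to plain 16 kHz mono WAV before transcription. This is a small sketch assuming a 16 kHz target, which is a common ASR default rather than a documented requirement:

```python
# Convert an arbitrary audio file to 16 kHz mono WAV before feeding it to the model.
# The 16 kHz target is an assumption based on common ASR defaults; adjust it if the
# model card specifies a different sample rate.
import librosa
import soundfile as sf

def to_wav_16k(src_path: str, dst_path: str = "normalized.wav") -> str:
    audio, sr = librosa.load(src_path, sr=16000, mono=True)  # resample and downmix
    sf.write(dst_path, audio, sr)                            # write a plain PCM WAV
    return dst_path
```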
Final Thoughts
- Despite being an innovative effort, the model delivered only average performance in these tests; newer ASR models such as NVIDIA's Parakeet outperformed it significantly.
- Currently supports English only.
Links
- GitHub repository for model details: GitHub Link.
- Discount codes and more details available in the video description.