Groq Inference Provider

by Groq

Low-latency, high-throughput LLM inference powered by Groq’s proprietary Language Processing Unit (LPU™).

See https://groq.com and the GroqCloud console at https://console.groq.com

Features

  • Purpose-built LPU hardware optimized for LLM inference (low latency, high throughput)
  • OpenAI-compatible API surface (works with many OpenAI-style SDKs)
  • Integration with popular ML ecosystems: Hugging Face (as a provider), LangChain, LlamaIndex, Vercel AI, Instructor
  • Hosted models and support for many open-source model families (Mixtral, LLaMA variants, Qwen-family, etc.)
  • SDK and HTTP/REST access (Python, JavaScript/TypeScript, curl)
  • Support for structured outputs (e.g., with Instructor/Pydantic)
  • Pay-as-you-go pricing and account-level API keys via GroqCloud

Superpowers

Groq is designed for applications where inference latency and predictable throughput matter most. The LPU architecture removes many GPU bottlenecks encountered during sequence processing, so Groq is a strong choice for:

  • Real-time conversational agents and chat systems
  • Low-latency search and retrieval-augmented generation (RAG)
  • High-concurrency production endpoints (APIs, live services)

Who should use it

  • Teams that need lower tail-latency than typical GPU hosts
  • Applications that benefit from deterministic latency and high request rates
  • Developers who want an OpenAI-compatible API but faster inference

Quick integration examples

Note: Groq exposes an OpenAI-compatible endpoint (typically used with base_url https://api.groq.com/openai/v1) and is also available as a provider on Hugging Face (provider="groq").

Python (OpenAI-compatible client):

import os  
from openai import OpenAI  
  
client = OpenAI(  
    api_key=os.environ.get("GROQ_API_KEY"),  
    base_url="https://api.groq.com/openai/v1",  
)  
  
resp = client.chat.completions.create(  
    model="mixtral-8x7b-32768",  
    messages=[{"role":"user","content":"Explain the importance of fast LLM inference."}],  
)  
print(resp.choices[0].message.content)  
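
For real-time chat (see Superpowers above), the same OpenAI-compatible client can stream tokens as they are generated, which lowers perceived latency. A minimal sketch, reusing the GROQ_API_KEY, base_url and model id from the example above; verify the model id in GroqCloud before relying on it:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1",
)

# Request a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # same model id as above; check availability
    messages=[{"role": "user", "content": "Explain the importance of fast LLM inference."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()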

JavaScript (OpenAI-compatible):

import OpenAI from "openai";  
  
const client = new OpenAI({  
  apiKey: process.env.GROQ_API_KEY,  
  baseURL: "https://api.groq.com/openai/v1",  
});  
  
const response = await client.chat.completions.create({  
  model: "mixtral-8x7b-32768",  
  messages: [{ role: "user", content: "What is the capital of France?" }],  
});  
console.log(response.choices[0].message.content);  

cURL (direct HTTP):

curl -X POST https://api.groq.com/openai/v1/chat/completions \  
  -H "Authorization: Bearer $GROQ_API_KEY" \  
  -H "Content-Type: application/json" \  
  -d '{  
    "model": "mixtral-8x7b-32768",  
    "messages": [{"role":"user","content":"Summarize Groq for me."}]  
  }'  

Hugging Face (use Groq as provider via HF InferenceClient):

from huggingface_hub import InferenceClient  
import os  
  
client = InferenceClient(provider="groq", api_key=os.environ["HF_TOKEN"])  # HF token; the request is routed to Groq through Hugging Face
  
out = client.chat.completions.create(  
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  
    messages=[{"role":"user","content":"What is low-latency inference?"}],  
)  
print(out.choices[0].message.content)
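
The structured-outputs support mentioned under Features can be layered on top of the OpenAI-compatible endpoint with Instructor and Pydantic. A minimal sketch, assuming the instructor and pydantic packages and the same model id as in the earlier examples; adapt the schema to your use case:

import os
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Schema the model response will be parsed and validated into
class City(BaseModel):
    name: str
    country: str
    population: int

# Wrap the OpenAI-compatible client so completions return Pydantic objects
client = instructor.from_openai(OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1",
))

city = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # example model id; verify availability in GroqCloud
    response_model=City,
    messages=[{"role": "user", "content": "Tell me about Paris."}],
)
print(city.model_dump())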

Supported models & ecosystem

  • Hosts many open-source LLM families (as reported in docs and community: Mixtral, Llama 3 and later variants, Qwen, among others)
  • Integrates with LangChain, LlamaIndex, Instructor, and other tooling used to build RAG and production LLM systems
  • Available via Hugging Face as a provider to route requests through HF’s unified proxy or directly via Groq API keys

Performance & architecture notes

  • Groq’s LPU is a custom inference accelerator optimized for sequence model workloads; its design trades general GPU flexibility for predictable, lower-latency LLM execution
  • Typical benefits: lower tail latency, high sustained throughput, predictable performance under concurrent load
  • Real-world performance will depend on model size, batch/concurrency settings, tokenization and prompt length
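
Because of that, it is worth measuring latency directly for your own prompts and models rather than relying on headline numbers. A minimal sequential probe, assuming the OpenAI-compatible client and model id from the examples above; extend it with your production concurrency pattern for realistic percentiles:

import os
import time
import statistics
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1",
)

latencies = []
for _ in range(20):  # small sample; increase for stable percentiles
    start = time.perf_counter()
    client.chat.completions.create(
        model="mixtral-8x7b-32768",  # example model id; verify availability
        messages=[{"role": "user", "content": "Reply with a single word: ready."}],
        max_tokens=5,
    )
    latencies.append(time.perf_counter() - start)

# 19 cut points for n=20: index 9 is the median, index 18 is the 95th percentile
q = statistics.quantiles(latencies, n=20)
print(f"p50={q[9]*1000:.0f} ms  p95={q[18]*1000:.0f} ms")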

Pricing

  • Groq offers a pay-as-you-go pricing model (billed via GroqCloud or, if routed via Hugging Face, billed through HF depending on routing configuration)
  • Exact rates vary by model, compute tier, and region — check GroqCloud console or sales for up-to-date pricing

Operational considerations & best practices

  • Use the OpenAI-compatible client to minimize code changes when switching providers (base_url + api_key)
  • Prefer smaller context windows or prompt engineering to reduce cost and latency when appropriate
  • Monitor tail-latency and token usage in GroqCloud console (console.groq.com) and set up alerting for quota or performance anomalies
  • When integration-critical, run load tests to measure latency percentiles for your workload and model choice
  • Consider provider fallbacks (e.g., route through the Hugging Face provider stack with provider="auto") to increase availability
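
A minimal sketch of the fallback idea from the last bullet, using the Hugging Face InferenceClient with provider="auto" so that HF selects an available provider for the model; requests are then not guaranteed to hit Groq:

import os
from huggingface_hub import InferenceClient

# provider="auto" lets Hugging Face route the request to an available provider
client = InferenceClient(provider="auto", api_key=os.environ["HF_TOKEN"])

out = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "What is low-latency inference?"}],
)
print(out.choices[0].message.content)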

Limitations & cautions

  • Not every model or fine-tuned variant may be available; verify model availability before committing to a specific model in production
  • Proprietary hardware may have different behavior for some kernels/ops vs GPU — test model outputs for parity
  • Pricing and regional availability may change; validate commercial terms for sustained production usage

Links & resources

  • Groq: https://groq.com
  • GroqCloud console: https://console.groq.com
  • Groq API (OpenAI-compatible): see the Groq docs and GroqCloud for the full reference
  • Hugging Face provider integration: search “Groq” on Hugging Face model/provider pages