How to Build Your Own AI Data Center in 2025 — Paul Gilbert, Arista Networks
AI Summary
Title: AI Network Infrastructure Overview
Speaker: Paul Gilbert, Tech Lead at Arista Networks
Key Points:
- Introduction to AI Models and Infrastructure:
  - Focus on the infrastructure behind training models and serving inference.
  - Job completion time is the key measure of how well the infrastructure supports training and inference.
- GPU Infrastructure:
  - Setup of separate backend (GPU-to-GPU) and frontend networks for AI workloads.
  - Typical configurations cited: 248 GPUs for training, while inference can run on as few as four H100s.
- Networking Challenges in AI:
  - GPU backend networks are kept isolated; the GPUs are too costly and power-hungry to risk interference from other traffic.
  - Fast leaf and spine switches are used, with no connections to external networks, so nothing can compromise performance.
  - A 1:1 (non-oversubscribed) bandwidth ratio is needed to absorb the bursty traffic GPUs generate (see the sketch below).
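As a rough illustration of the 1:1 ratio mentioned above, the following sketch checks whether a leaf switch is oversubscribed; the port counts and speeds are assumptions for illustration, not figures from the talk.

```python
# Hypothetical leaf-switch port plan; the numbers are illustrative, not from the talk.
GPU_PORTS = 32         # downlinks to GPUs, 400G each
UPLINK_PORTS = 32      # uplinks to the spine layer, 400G each
PORT_SPEED_GBPS = 400

downlink_bw = GPU_PORTS * PORT_SPEED_GBPS
uplink_bw = UPLINK_PORTS * PORT_SPEED_GBPS

# A ratio of 1.0 means every GPU on the leaf can burst toward the spines at
# line rate simultaneously without contention inside the leaf.
ratio = downlink_bw / uplink_bw
print(f"downlink {downlink_bw} Gbps, uplink {uplink_bw} Gbps, subscription {ratio:.1f}:1")
```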
- Importance of Design and Scale:
  - Early design choices determine both performance and how far the network can grow as AI workloads expand.
  - Comparison of scale-up (larger, denser GPU nodes) vs. scale-out (more nodes connected across the fabric); a rough sizing sketch follows below.
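The scale-out side can be sized with simple port arithmetic. The sketch below assumes a two-tier leaf/spine fabric built from 64-port switches at a 1:1 ratio; the radix is an assumption, not a number from the talk.

```python
# Rough sizing of a two-tier, non-oversubscribed leaf/spine fabric.
RADIX = 64  # assumed ports per switch, all at the same speed

# With a 1:1 ratio, half of each leaf's ports face GPUs and half face spines.
gpus_per_leaf = RADIX // 2
uplinks_per_leaf = RADIX // 2

# Each spine port terminates one leaf uplink, so a spine supports RADIX leaves,
# and each leaf spreads its uplinks across uplinks_per_leaf spines.
max_leaves = RADIX
spines_needed = uplinks_per_leaf
max_gpus = gpus_per_leaf * max_leaves

print(f"{gpus_per_leaf} GPUs per leaf, {spines_needed} spines, "
      f"up to {max_leaves} leaves and {max_gpus} GPUs")
```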
- Traffic Management in AI Networks:
  - Traffic is predominantly east-west (GPU-to-GPU communication), with north-south traffic for data retrieval.
  - RDMA for direct memory-to-memory transfers, plus careful error and congestion handling, keep the network efficient.
  - Key software layers include CUDA and NCCL (the NVIDIA collective communications library, pronounced "nickel"), whose collective operations shape network traffic; see the sketch below.
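To make the CUDA/NCCL point concrete, here is a minimal sketch of the kind of collective NCCL runs across GPUs during training. It uses PyTorch's distributed API with the NCCL backend as a stand-in; PyTorch and the file name are assumptions, not something named in the talk.

```python
# allreduce_demo.py -- hypothetical file name; launch with:
#   torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL is the backend that actually moves data between the GPUs.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a gradient-sized tensor; all_reduce sums them in place.
    # On a multi-node cluster this is the east-west traffic the backend fabric carries.
    grad = torch.ones(1024 * 1024, device="cuda") * (rank + 1)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(f"after all_reduce, first element = {grad[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```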
- Power Requirements:
  - AI racks require significantly more power than traditional server racks; a single 8-GPU system draws roughly 10.2 kW (a rough breakdown is sketched below).
  - Enterprises must adapt to higher power consumption and cooling requirements, including water-cooled racks.
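As a back-of-the-envelope check on the ~10.2 kW figure, the sketch below breaks it into assumed per-component draws; only the total echoes the talk, while the individual wattages and the rack density are illustrative.

```python
# Assumed per-component wattages; only the ~10.2 kW server total echoes the talk.
GPU_WATTS = 700            # assumed per-GPU draw under training load
GPUS_PER_SERVER = 8
OVERHEAD_WATTS = 4600      # assumed CPUs, NICs, fans, NVMe, power-conversion losses

server_kw = (GPU_WATTS * GPUS_PER_SERVER + OVERHEAD_WATTS) / 1000
print(f"one 8-GPU server: ~{server_kw:.1f} kW")    # ~10.2 kW

SERVERS_PER_RACK = 4       # assumed rack density
rack_kw = server_kw * SERVERS_PER_RACK
print(f"one rack of {SERVERS_PER_RACK} servers: ~{rack_kw:.1f} kW")
```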
- Future Trends:
  - Expected advancements in Ethernet technology to improve congestion control and packet handling.
  - Continuous growth of data consumption and network demands in AI.
Conclusion: The AI infrastructure landscape is evolving, with a focus on network designs that keep GPUs highly utilized while managing the growing complexity of power, cooling, and traffic control.