How to Build Your Own AI Data Center in 2025 — Paul Gilbert, Arista Networks



AI Summary

Title: AI Network Infrastructure Overview
Speaker: Paul Gilbert, Tech Lead at Arista Networks
Key Points:

  1. Introduction to AI Models and Infrastructure:
    • Covers both training models and the infrastructure that serves inference.
    • Job completion time (JCT) is the key metric for training and inference workloads; a toy JCT model appears after the key points.
  2. GPU Infrastructure:
    • Described the separation of backend (GPU-to-GPU) and frontend (storage and management) networks for AI workloads.
    • Example configurations: on the order of 2,048 GPUs for training a large model versus 4 H100s for serving inference.
  3. Networking Challenges in AI:
    • GPU backend networks are kept isolated: the GPUs are too expensive and power-hungry to risk idling them on a shared network.
    • Fast leaf-and-spine switching, with no connections to external networks that could compromise performance.
    • A 1:1 (non-blocking) subscription ratio is needed to absorb the bursty traffic GPUs generate; see the fabric-sizing sketch after the key points.
  4. Importance of Design and Scale:
    • Design choices determine performance and how far the network can scale as AI workloads grow.
    • Comparison of scale-up (bigger nodes) versus scale-out (more nodes) in AI infrastructure.
  5. Traffic Management in AI Networks:
    • Training traffic is predominantly east-west (GPU-to-GPU), with north-south traffic mainly for data retrieval.
    • RDMA moves data directly between GPU memories; congestion and error management keep the fabric efficient.
    • Key software layers are CUDA and NCCL (the NVIDIA Collective Communications Library, pronounced "nickel"), whose collective operations shape network traffic; see the all-reduce sketch after the key points.
  6. Power Requirements:
    • AI racks draw far more power than traditional server racks (e.g., roughly 10.2 kW for a single 8-GPU server; see the power math after the key points).
    • Enterprises must adapt to higher power consumption and cooling requirements, including water-cooled racks.
  7. Future Trends:
    • Expected advancements in Ethernet technology to improve congestion control and packet handling.
    • Continuous growth of data consumption and network demands in AI.
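
The sketches below are not from the talk; they are minimal illustrations of the points above, and every number, name, and parameter in them is an assumption unless noted.

A toy model for the job completion time (JCT) point: in synchronous training, any communication that cannot be hidden behind compute stretches JCT, which is why fabric congestion directly costs GPU hours. Iteration counts, step times, and the overlap fraction are illustrative.

```python
# Toy JCT model (illustrative numbers, not from the talk): in synchronous
# training, each iteration pays compute time plus whatever communication
# cannot be overlapped with compute.

def job_completion_time(iterations: int,
                        compute_s: float,
                        comm_s: float,
                        overlap: float) -> float:
    """JCT in seconds; `overlap` is the fraction of communication
    hidden behind compute (0.0 = fully exposed, 1.0 = fully hidden)."""
    exposed_comm = comm_s * (1.0 - overlap)
    return iterations * (compute_s + exposed_comm)

# A congested fabric that inflates communication time stretches the job.
base = job_completion_time(100_000, compute_s=0.30, comm_s=0.10, overlap=0.9)
congested = job_completion_time(100_000, compute_s=0.30, comm_s=0.25, overlap=0.9)
print(f"baseline JCT:  {base / 3600:.1f} h")       # ~8.6 h
print(f"congested JCT: {congested / 3600:.1f} h")  # ~9.0 h
```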
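
For the 1:1 subscription point, a back-of-the-envelope sizing of a two-tier leaf/spine backend fabric. The 64-port switch size and the one-NIC-per-GPU wiring are assumptions, not Arista specifications.

```python
# Non-blocking (1:1) leaf/spine sizing sketch. Assumes one backend NIC
# port per GPU and identical port counts on leaves and spines.
import math

def size_fabric(gpus: int, ports_per_switch: int = 64) -> tuple[int, int]:
    """With a 1:1 subscription ratio, each leaf splits its ports half
    down (to GPU NICs) and half up (to spines)."""
    down_per_leaf = ports_per_switch // 2
    leaves = math.ceil(gpus / down_per_leaf)
    uplinks_total = leaves * down_per_leaf   # uplink bandwidth == downlink
    spines = math.ceil(uplinks_total / ports_per_switch)
    return leaves, spines

for gpus in (256, 1024, 2048):
    leaves, spines = size_fabric(gpus)
    print(f"{gpus:5d} GPUs -> {leaves:3d} leaves, {spines:3d} spines")
```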
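
For the CUDA/NCCL point, a minimal sketch of the collective operation behind most of that east-west traffic, written against PyTorch's distributed API with the NCCL backend. It assumes CUDA GPUs and a launch such as `torchrun --nproc_per_node=4 allreduce_demo.py`; the file name is hypothetical.

```python
# Minimal NCCL all-reduce via torch.distributed. Each rank contributes a
# tensor; the sum lands on every GPU -- the gradient-exchange pattern
# that drives east-west traffic on the backend network.
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")  # NCCL drives the GPU fabric
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    t = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum across all ranks
    print(f"rank {rank}: reduced value = {t[0].item():.0f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```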
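
For the power point, the arithmetic behind the 10.2 kW figure. The per-server number is from the talk; the rack power budgets are illustrative assumptions.

```python
# Rack power back-of-the-envelope. SERVER_KW comes from the talk; the
# rack budgets below are assumptions for illustration.
SERVER_KW = 10.2        # one 8-GPU server (figure cited in the talk)
LEGACY_RACK_KW = 15.0   # assumed traditional enterprise rack budget
AI_RACK_KW = 40.0       # assumed upgraded rack with better power/cooling

print(f"servers per legacy rack: {int(LEGACY_RACK_KW // SERVER_KW)}")   # 1
print(f"servers per AI rack:     {int(AI_RACK_KW // SERVER_KW)}")       # 3
print(f"power for 2,048 GPUs:    {2048 / 8 * SERVER_KW:,.0f} kW")       # ~2,611 kW
```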

Conclusion:

    • The AI infrastructure landscape is evolving, with network design optimized to keep GPU utilization high while managing the growing complexity of power delivery and traffic control.