Scaling RoCE Networks for AI Training | Adi Gangidi

  • Published: 28 Nov 2024

Comments • 3

  • @lolcat6294 · 4 months ago · +2

    🎯 Key points for quick navigation:
    00:19 *Meta transitioned AI training from horizontal to vertical scaling, requiring a dedicated RDMA over Converged Ethernet (RoCE) network.*
    01:00 *RDMA fabrics at Meta support tens of thousands of GPUs, handling diverse AI training use cases.*
    03:11 *AI training involves complex, recursive processes that scale vertically with HPC-style parallel processing.*
    05:32 *RDMA with RoCE v2 enables the high-bandwidth, low-latency GPU communication that is crucial for AI training.*
    08:30 *Meta's network design for AI training includes balanced topologies and traffic patterns accommodating hierarchical and full mesh models.*
    12:44 *Load-balancing challenges in RDMA deployments at Meta involve adapting to an uneven distribution of server destinations across IP prefixes (see the sketch after this list).*
    16:52 *Slow receivers that degrade network performance at Meta often trace back to GPU memory-allocation pressure, which creates PCIe and network bottlenecks.*
    Made with HARPA AI
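
To make the 12:44 point concrete, here is a minimal, hypothetical Python sketch (not Meta's code or configuration) of ECMP-style flow hashing: with only a handful of long-lived RoCE flows, a 5-tuple hash can easily leave some uplinks carrying several flows while others sit idle. The uplink count, addresses, and source ports are invented for illustration; only the RoCE v2 UDP destination port (4791) is standard.

```python
# Hypothetical sketch: why hashing a small number of large RDMA flows across
# equal-cost paths (ECMP) can leave uplinks unevenly loaded.
import hashlib
from collections import Counter

NUM_UPLINKS = 8  # assumed number of equal-cost uplinks out of a ToR switch

def ecmp_pick(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Pick an uplink by hashing the flow's 5-tuple, as a typical ECMP switch would."""
    key = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}-udp".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_UPLINKS

# A handful of long-lived RoCE v2 flows (UDP destination port 4791) between GPU servers.
flows = [("10.0.0.1", f"10.0.1.{i}", 49152 + i, 4791) for i in range(16)]

load = Counter(ecmp_pick(*flow) for flow in flows)
for uplink in range(NUM_UPLINKS):
    print(f"uplink {uplink}: {load[uplink]} flow(s)")
# With only 16 elephant flows, some uplinks typically end up with 3-4 flows while
# others get none; that is the kind of imbalance the talk describes working around.
```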

  • @jagsinghbrar · 1 year ago · +1

    Adi, that was a good talk. I enjoyed watching it. Lots of useful info. Thank you! - Jag

  • @aaa-hw2ty · 3 months ago

    Each spine switch connects to 256 ToR switches plus some uplink switches. Which types of spine switches can support nearly 300 × 400 Gbps ports?
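
For context on the numbers in that question, a back-of-the-envelope sketch in Python: the 256 ToR-facing ports come from the comment above, while the uplink count and the 51.2 Tbps single-ASIC figure are assumptions for illustration, not figures from the talk.

```python
# Rough radix/bandwidth arithmetic implied by the question above.
tor_ports = 256          # downlinks to ToR switches (from the comment)
uplink_ports = 40        # assumed uplinks toward the rest of the data-center fabric
port_speed_gbps = 400

total_ports = tor_ports + uplink_ports
aggregate_tbps = total_ports * port_speed_gbps / 1000
print(f"{total_ports} ports x {port_speed_gbps} Gbps = {aggregate_tbps:.1f} Tbps")
# ~296 ports at 400 Gbps is roughly 118 Tbps, well beyond a single 51.2 Tbps
# switch ASIC (128 x 400G ports), so a spine with this radix generally implies a
# modular chassis switch built from multiple ASICs rather than a fixed-form box.
```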