🎯 Key points for quick navigation:
00:19 *Meta transitioned AI training from horizontal to vertical scaling, requiring a dedicated RDMA over Converged Ethernet (RoCE) network.*
01:00 *RDMA fabrics at Meta support tens of thousands of GPUs, handling diverse AI training use cases.*
03:11 *AI training involves complex, recursive processes that scale vertically with HPC-style parallel processing.*
05:32 *RDMA with RoCEv2 enables high-bandwidth, low-latency GPU communication crucial for AI training.*
08:30 *Meta's network design for AI training includes balanced topologies and traffic patterns accommodating hierarchical and full mesh models.*
12:44 *Load balancing challenges in RDMA deployments at Meta involve adapting to uneven distribution of server destinations across IP prefixes.*
16:52 *Slow receivers that degrade network performance at Meta are often caused by GPU memory allocation pressure, which creates PCIe and network bottlenecks.*
Adi, that was a good talk. I enjoyed watching it. Lots of useful info. Thank you! - Jag
Each spine switch connects to 256 ToR switches and some uplink switches. Which types of spine switches can support nearly 300 × 400 Gbps ports?