SC'19: Tutorials: IB, Omni-Path, and HSE: Advanced Features, Challenges & Usage
- Published: 10 Feb 2025
- As InfiniBand (IB), Omni-Path, and High-Speed Ethernet (HSE) technologies
mature, they are being used to design and deploy various High-End Computing
(HEC) systems: HPC clusters with GPGPUs supporting MPI, Storage
and Parallel File Systems, Cloud Computing systems with SR-IOV Virtualization,
Grid Computing systems, and Deep Learning systems. These systems are bringing
new challenges in terms of performance, scalability, portability, reliability
and network congestion. Many scientists, engineers, researchers, managers and
system administrators are becoming interested in learning about these
challenges, the approaches being used to address them, and the associated
impact on performance and scalability. This tutorial will start with an
overview of these systems. Advanced hardware and software features of IB,
Omni-Path, HSE, and RoCE and their capabilities to address these challenges will
be emphasized. Next, we will focus on OpenFabrics RDMA and Libfabric
programming, and network management infrastructure and tools to effectively use
these systems. A common set of challenges faced while designing these
systems will be presented, followed by case studies focusing on domain-specific
challenges in designing these systems, their solutions, and sample performance
numbers. Finally, hands-on exercises will be carried out with the
OpenFabrics and Libfabric software stacks and network management tools.
sc19.supercomp...