Training Large Language Models on Kubernetes - Ronen Dar, Run:ai

  • Published: 12 Nov 2023
  • Large Language Models (LLMs) are emerging as the biggest technology breakthrough since the launch of the iPhone. LLMs are huge, and training them requires massive amounts of data and compute. LLM training is often carried out on bare-metal servers with workload schedulers from the high-performance computing world, such as Slurm. In this talk, we present the challenges involved in pre-training LLMs in general and on Kubernetes in particular. We discuss best practices for network optimization, distributed resource management, scheduling, and code manipulation. We provide scripts based on NVIDIA’s Megatron Transformer framework with pre-made configurations, data pre-processing workflows, and a training setup that makes it easy for users to quickly start LLM training on K8s (see the illustrative sketch below). We also provide benchmark results comparing training throughput between bare-metal and K8s-based environments for models such as GPT, T5, and BERT, across a varying number of GPU nodes.
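The talk's own Megatron scripts and configurations are not reproduced here. As a rough illustration only, the sketch below assumes a Kubernetes indexed Job (or a Kubeflow PyTorchJob) that injects MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK into each worker pod, and shows how a training entrypoint could bring up a NCCL process group from those variables before handing off to the actual model code.

```python
# Minimal sketch (not the speaker's actual scripts): a distributed training
# entrypoint for GPU worker pods on Kubernetes. It assumes the Job spec or
# operator injects the standard torch.distributed rendezvous variables:
#   MASTER_ADDR, MASTER_PORT - address of the rank-0 pod
#   WORLD_SIZE               - total number of workers
#   RANK, LOCAL_RANK         - this worker's global rank and GPU index
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    """Initialize the NCCL process group from environment variables."""
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Bind this process to its GPU, then join the process group.
    # init_process_group defaults to the env:// rendezvous, which reads
    # MASTER_ADDR and MASTER_PORT from the environment.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return local_rank


if __name__ == "__main__":
    local_rank = init_distributed()
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} ready on GPU {local_rank}")
    # ... a real pretraining loop (e.g. Megatron-style GPT/T5/BERT) would run here ...
    dist.destroy_process_group()
```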
