Scaling ML Serving to 1000s of Models

  • Published: Sep 4, 2024
  • Gerard Casas Saez, Cash App
    Join the Cash App engineering team as we discuss effective strategies for scaling ML serving solutions to manage thousands of models efficiently. In this talk, Gerard Casas Saez (Senior Machine Learning Engineer) shares how Cash App optimized their platform, focusing on ONNX model performance, hot container replacements, and automatic, streamlined model deployments. Learn about the enhancements made to AWS SageMaker Multi-Model Endpoints, including zero-downtime upgrades and process improvements that accelerate productionization through a custom Python client and robust approval workflows (a minimal invocation sketch follows below).
    Gerard will also discuss Cash App’s approach to managing AWS SageMaker endpoints as a unified team, highlighting techniques to minimize on-call disruptions and manage services without the team becoming a bottleneck. He also shares insights into the future of the platform, including plans for hosting large language models and ongoing optimization efforts.
    Attendees will leave with a clear understanding of best practices for ONNX serving (a minimal onnxruntime sketch also follows below), strategies for reducing deployment times, and techniques to enhance monitoring and stability. This session is essential for professionals looking to scale their ML operations effectively in a cost-sensitive and high-demand environment.
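
As a rough illustration of the Multi-Model Endpoint pattern mentioned in the description, here is a minimal sketch of invoking one model on a SageMaker Multi-Model Endpoint through boto3. The endpoint name, model artifact path, and payload are illustrative assumptions, not Cash App's actual client or configuration.

```python
# Minimal sketch: invoking an AWS SageMaker Multi-Model Endpoint with boto3.
# Endpoint name, model artifact path, and payload shape are hypothetical.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {"features": [0.12, 1.5, 3.7]}  # hypothetical feature vector

response = runtime.invoke_endpoint(
    EndpointName="ml-serving-mme",                 # hypothetical endpoint name
    TargetModel="models/fraud-scorer-v3.tar.gz",   # selects one of the many models hosted on the endpoint
    ContentType="application/json",
    Body=json.dumps(payload),
)

prediction = json.loads(response["Body"].read())
print(prediction)
```

The `TargetModel` parameter is what lets a single endpoint route requests to any of the thousands of model artifacts it hosts, rather than running one endpoint per model.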

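For the ONNX serving practices the talk covers, a minimal onnxruntime inference sketch is shown below; the model file name, input name, and feature shape are assumptions for illustration only.

```python
# Minimal sketch: running inference on an ONNX model with onnxruntime.
# Model file and input data are hypothetical.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("fraud-scorer-v3.onnx")      # hypothetical model file
input_name = session.get_inputs()[0].name                   # name of the model's first input
features = np.array([[0.12, 1.5, 3.7]], dtype=np.float32)   # hypothetical feature vector
outputs = session.run(None, {input_name: features})         # None -> return all outputs
print(outputs[0])
```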