Build Large-Scale Data Analytics and AI Pipeline Using RayDP

  • Published: 18 Sep 2024
  • A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs together. Other solutions include running deep learning frameworks in an Apache Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All of these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start an Apache Spark job on Ray in your Python program and use Ray's in-memory object store to efficiently exchange data between Apache Spark and other libraries (a minimal usage sketch follows below). We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
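    A minimal sketch of the pattern described above, assuming RayDP's init_spark/stop_spark entry points and Ray Data's from_spark converter; the app name and resource sizes are illustrative, adjust them for your cluster:

        import ray
        import raydp

        # Connect to (or start) a Ray cluster; Spark executors will run as Ray actors.
        ray.init()

        # Launch a Spark session on top of Ray with the requested resources.
        spark = raydp.init_spark(
            app_name="raydp_pipeline",
            num_executors=2,
            executor_cores=2,
            executor_memory="4GB",
        )

        # Use regular Spark APIs for preprocessing.
        df = spark.range(0, 1000).withColumnRenamed("id", "feature")

        # Hand the preprocessed data to Ray through the shared in-memory object
        # store so ML/DL libraries running on Ray can consume it without an
        # intermediate write to external storage.
        ds = ray.data.from_spark(df)
        print(ds.count())

        # Tear down the Spark-on-Ray session when done.
        raydp.stop_spark()
        ray.shutdown()

    The resulting Ray Dataset can then be fed to a distributed trainer running on the same Ray cluster, which is the end-to-end pipeline the talk describes.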

Comments • 1

  • @DZitLee (2 years ago)

    I cannot make this run on Databricks. It keeps saying: "java gateway process exited before sending its port number databricks"
    Is there any config that I need to set?