Scaling Machine Learning Feature Engineering in Apache Spark at Facebook

Databricks

2.7K views

Published: Aug 2, 2024
Machine learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we’ve added several features in Spark core/SQL to provide first-class support for Feature Injection and Feature Reaping in Spark. Feature Injection is an important prerequisite to (offline) ML training, in which the base features are injected/aligned with new/experimental features with the goal of improving model performance over time. From a query engine’s perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table, which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark: if the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.
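To make the alignment idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes both tables were written sorted by the same unique row key, so the LEFT OUTER join reduces to a single forward pass with no shuffle or hash table. The names `aligned_left_outer_join`, `base`, and `injected` are hypothetical and chosen for illustration; this is not the implementation described in the talk.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical row shape: (row key, feature map).
Row = Tuple[int, Dict[str, float]]
Joined = Tuple[int, Dict[str, float], Optional[Dict[str, float]]]


def aligned_left_outer_join(base: List[Row], injected: List[Row]) -> Iterator[Joined]:
    """Left outer join of two tables that are both sorted by a unique row key.

    Assumption: alignment was established when the tables were written,
    so one forward pass over each input is enough.
    """
    j = 0
    for key, base_features in base:
        # Skip injected rows whose key is smaller than the current base key.
        while j < len(injected) and injected[j][0] < key:
            j += 1
        if j < len(injected) and injected[j][0] == key:
            yield key, base_features, injected[j][1]
        else:
            # No injected features for this row: keep the base row (LEFT OUTER).
            yield key, base_features, None


base = [(1, {"f1": 0.1}), (2, {"f1": 0.2}), (4, {"f1": 0.4})]
injected = [(2, {"f9": 9.0}), (3, {"f9": 3.0}), (4, {"f9": 7.0})]
joined = list(aligned_left_outer_join(base, injected))
```

Every base row survives the join, and rows without a matching injected feature carry `None`. The cost is linear in the two inputs, which is the property that alignment at write time is meant to buy.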
Feature Reaping is a compute-efficient, low-latency solution for deleting historical data at sub-partition granularity (i.e., columns or selected map keys). To do it efficiently at our scale, we added a new physical encoding in ORC (called FlatMap) that allows us to selectively reap/delete specific map keys (features) without performing expensive decoding/encoding and decompression/compression. In this talk, we’ll take a deep dive into Spark’s optimizer, evaluation engine, data layouts and commit protocols and share how we’ve implemented these complementary techniques. To this end, we’ll discuss several Catalyst optimizations that automatically rewrite feature injection/reaping queries as SQL joins/transforms, describe new ORC physical encodings for storing feature maps, and discuss details of how Spark writes/commits indexed feature tables.
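The benefit of a flattened map layout for reaping can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: it is not the ORC FlatMap format, and the helpers `write_flat`, `reap`, and `read_key` are hypothetical. Each map key is stored as its own independently compressed stream, so deleting a key means dropping its stream while the remaining streams are carried over byte for byte.

```python
import json
import zlib
from typing import Dict, List, Optional, Set


def write_flat(rows: List[Dict[str, float]]) -> Dict[str, bytes]:
    """Store each map key as a separate, independently compressed stream."""
    keys = sorted({k for row in rows for k in row})
    return {
        # Rows that lack the key are recorded as null in that key's stream.
        k: zlib.compress(json.dumps([row.get(k) for row in rows]).encode("utf-8"))
        for k in keys
    }


def reap(streams: Dict[str, bytes], keys_to_reap: Set[str]) -> Dict[str, bytes]:
    """Drop the reaped keys; surviving streams are never decompressed."""
    return {k: blob for k, blob in streams.items() if k not in keys_to_reap}


def read_key(streams: Dict[str, bytes], key: str) -> List[Optional[float]]:
    """Decode a single key's values, one entry per row."""
    return json.loads(zlib.decompress(streams[key]).decode("utf-8"))


rows = [{"f1": 0.1, "f2": 1.0}, {"f1": 0.3}]
streams = write_flat(rows)
reaped = reap(streams, {"f2"})
```

In a layout where each row's whole map is encoded together, removing `f2` would require decoding and re-encoding every row. Here `reap` only filters the set of streams, which mirrors the "no decoding/encoding and decompression/compression" property the abstract describes.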
About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: databricks.com/product/unifie...
See all the previous Summit sessions:
Connect with us:
Website: databricks.com
Facebook: / databricksinc
Twitter: / databricks
LinkedIn: / databricks
Instagram: / databricksinc
Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: databricks.com/databricks-nam...
Science
