Advancing Spark - Data Lakehouse Star Schemas with Dynamic Partition Pruning!

  • Published: 1 Jul 2020
  • Hot on the heels of last week's Spark & AI Summit announcement, Simon is digging into the new features of Spark 3.0. In this episode, we're looking at Dynamic Partition Pruning, which should dramatically speed up queries over partitioned data!
    Not sure about partitioning? Don't know why you should care? Watch now!
    Don't forget to like & subscribe for more sparky goodness, and check out the Advancing Analytics blog for more content! www.advancinganalytics.co.uk/...
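
    For orientation, Dynamic Partition Pruning ships enabled by default in Spark 3.0 behind a session config. A minimal PySpark sketch for checking and toggling it (the config key is standard Spark; the app name is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

    # Spark 3.0 config key controlling DPP; defaults to "true".
    print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")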

Comments • 19

  • @krishnakomanduri9432
    @krishnakomanduri9432 3 years ago +3

    Hey!
    I have been watching RUclips videos for ages but this is the first time I am commenting on a RUclips video. Your content is awesome and I mean it! Just buy a professional mic and an HD camera and never stop making videos like this one. I'd love to see more practical demonstrations on your channel. Good job!

  • @ConnorRoss311
    @ConnorRoss311 4 years ago +2

    Great video! Can't wait for the git project either!

  • @loganboyd
    @loganboyd 3 years ago

    Really like your videos. RUclips is NOT the best source for good detailed Spark content, but watching videos is better than reading :)
    We are moving to Cloudera CDP from an HDP platform in the next couple of months. Spark 3.0 and its new features look cool and should be very helpful.
    Am I understanding the DPP feature correctly if I said it's only going to provide partition pruning when these two things are true (sketched below)?
    1. you have a predicate on a column of a smaller dimension table that is joined to a larger fact table
    2. the join key on the fact table side is an existing partition column
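
    A minimal sketch of that two-condition pattern (assumes an active SparkSession named spark, as in a Databricks notebook; all table and column names are illustrative, not from the video):

    # 1. Fact table physically partitioned on the join key (condition 2).
    spark.table("sales_staging").write \
        .format("delta") \
        .partitionBy("date_key") \
        .saveAsTable("fact_sales")

    # 2. Selective predicate on the smaller dimension side (condition 1).
    result = spark.sql('''
        SELECT f.*
        FROM fact_sales f
        JOIN dim_date d ON f.date_key = d.date_key
        WHERE d.calendar_year = 2019
    ''')

    # With DPP, the dim_date filter is translated into a partition filter on
    # fact_sales, so non-matching date_key partitions are never read.
    result.explain()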

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago

      Hey Logan - yep, I believe that's correct. This means you'll need to have tied your partitioning strategy to a foreign key of some sort to get maximum benefit from this approach, otherwise you'll never be hitting it... that said, I'm now questioning myself, I'll have a quick play over the next couple of days and confirm that it's only when the join key is your partition column. Lemme get back to you with a definitive!
      Simon

    • @karol2614
      @karol2614 2 years ago

      @@AdvancingAnalytics
      Do you have any answer to Logan's question?

  • @divyanshjain6679
    @divyanshjain6679 3 years ago

    Hi!
    I have gone through the AQE video & found it very interesting.
    Coming to DPP, I'm totally new to Delta Lake and don't know much about the concept. Can you please share the block of code you used to load data into the Delta table? Also, which Databricks dataset have you loaded? I can see multiple folders inside the "nyctaxi" dataset.
    Thanks
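
    The video's exact notebook isn't shared (see the GitHub thread below), but a load along these lines would produce a partitioned Delta table from the Databricks-hosted nyctaxi sample data. The subfolder, schema options and partition column here are assumptions, not the video's actual choices:

    # Read one of the sample folders under /databricks-datasets/nyctaxi/.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/databricks-datasets/nyctaxi/tripdata/yellow/"))  # assumed subfolder

    # Write it back out as a Delta table, partitioned on an illustrative column.
    (df.write
       .format("delta")
       .partitionBy("vendor_id")      # illustrative partition key
       .mode("overwrite")
       .save("/tmp/delta/nyctaxi_trips"))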

  • @karol2614
    @karol2614 2 years ago

    What is the best partitioning strategy for a star schema warehouse? There are big fact tables in this structure that are related to a large number of dimensions; partitioning on one join key means queries using another key will be suboptimal.

  • @gardnmi
    @gardnmi 2 years ago

    I'm not sure if there have been updates to how Spark handles data partitioning since this video, but when I tried out your example on a Delta table, it actually managed to filter the date partition using the calculated date field within the fact table (see below). However, when I tested it with a non-date dimension that was partitioned, such as organization_id, and filtered on organizational_name, it was not able to filter the partitions, so the dynamic partitioning join with an organizational_dim table outperformed the filter in the fact table.
    PartitionFilters: [isnotnull(service_from_date#299881), (date_format(cast(service_from_date#299881 as timestamp), y...,
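
    One way to reproduce that check: the PartitionFilters entry appears in the FileScan node of the physical plan. A sketch (the column follows the fragment above; the table name is illustrative):

    q = spark.sql('''
        SELECT *
        FROM fact_table
        WHERE date_format(CAST(service_from_date AS timestamp), 'yyyy') = '2020'
    ''')
    # Look for "PartitionFilters: [...]" in the FileScan line of the output.
    q.explain(True)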

  • @mohitsanghai5455
    @mohitsanghai5455 3 years ago

    Great video... Just have a few questions - you applied a filter on the dimension table Date and Spark filters the data, converts it into a hash table and broadcasts it. At the same time, it applied partition pruning on the fact table Sales and only picked up the required records. Does it broadcast those records as well? Does "subquery broadcast" mean those records? What if the filtered data is also huge? Will Spark still broadcast it, or use a SortMerge join in that case?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Broadcast join just means that one of the two joining tables is small enough to be broadcast. So if one side of the join is huge, each worker will only have the RDD blocks it needs, but it will pull a whole copy of the smaller table onto each worker so that it can satisfy all joins.
      If both sides of the query are huge, then yeah, it'll revert to a SortMerge etc., but at least it will still have pushed the partition filter back down to the file system.
      Simon
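
      A sketch of the size cut-off being described (fact_df and dim_df are hypothetical DataFrames; the config key and broadcast hint are standard Spark):

      from pyspark.sql.functions import broadcast

      # Spark broadcasts the smaller join side only if its estimated size is
      # under this threshold (default 10MB); above it, a shuffle-based join
      # such as sort-merge is used instead.
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

      # Explicit hint: broadcast the dimension regardless of the size estimate.
      joined = fact_df.join(broadcast(dim_df), "date_key")
      joined.explain()  # BroadcastHashJoin vs SortMergeJoin shows the choice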

    • @mohitsanghai5455
      @mohitsanghai5455 3 years ago

      @@AdvancingAnalytics Thanks for clearing some doubts... But what was the subquery broadcast?

  • @EvgenAnufriev
    @EvgenAnufriev 1 year ago

    Could you share your opinion on whether the Data Vault methodology is a good fit for implementation using Databricks Spark and/or Spark Streaming (Azure Cloud) with Delta tables? Data size is in the tens of GB/TBs.

  • @ravisamal3533
    @ravisamal3533 3 years ago

    Hey, can you index your Spark videos playlist?

  • @adrestrada1
    @adrestrada1 3 years ago

    Hi Simon, do you have a GitHub where I can start following you?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Not really! I have a git account for slides/demos from conference talks, but the examples on RUclips are all very quick & hardcoded to my env. We're looking at ways of sharing the notebooks in a more sustainable way!

  • @flixgpt
    @flixgpt 3 years ago +1

    Your accent is irritating and even the subtitles are not able to pick it up... it's hard to follow.

    • @Advancing_Terry
      @Advancing_Terry 3 years ago +5

      If you press ALT+F4, RUclips will change the accent. It's a cool feature

    • @bittu007ize
      @bittu007ize 2 years ago

      @@Advancing_Terry awesome feature

    • @curiouslycally
      @curiouslycally 1 month ago

      your comment is irritating