Master Databricks and Apache Spark Step by Step: Lesson 13 - Using SQL Joins

  • Published: 5 Jan 2025

Comments • 25

  • @oliviersac1
    @oliviersac1 1 year ago +1

    Hello there. I found your presentation interesting. If it can help, I can provide another use case where a cross join is useful.
    For instance, you have a table of cars with their attributes and you want to compare them and give a comparison score.
    Scoring could be as follows:
    - Checking if both cars have the same transmission type (manual or automatic)
    - Checking if both cars have the same fuel type
    - Comparing the HP (the closest gets the higher score)
    - And so on...
    - For each attribute we can give a score, which gives a total at the end for each pair of cars
    In order to do it, you can create the cartesian product of the cars.
    Then we filter the list where car_id_x <> car_id_y (so we don't compare the cars with themselves).
    We then obtain the score for each permutation as mentioned.
    Then we can order by score DESC.
    At the end we obtain, for each car, a list of all the other cars ordered by score DESC! (A sketch of this query follows below.)
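
    A minimal Spark SQL sketch of the idea above (the cars table, its columns, and the scoring weights are all hypothetical):

      -- Cross join builds every (x, y) pair; the WHERE clause drops self-comparisons.
      SELECT
        x.car_id AS car_id_x,
        y.car_id AS car_id_y,
        ( CASE WHEN x.transmission = y.transmission THEN 10 ELSE 0 END
        + CASE WHEN x.fuel_type    = y.fuel_type    THEN 10 ELSE 0 END
        + GREATEST(10 - ABS(x.hp - y.hp) / 50, 0) ) AS score   -- closer HP scores higher
      FROM cars AS x
      CROSS JOIN cars AS y
      WHERE x.car_id <> y.car_id
      ORDER BY car_id_x, score DESC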

  • @moe10134
    @moe10134 2 years ago

    This video helped me get noticed in a program that was new to me, Databricks, in week 2 of a new job. You explained these joins very well. Great video!!

  • @potnuruavinash
    @potnuruavinash 7 months ago +1

    Hi Bryan, you mentioned that the ideal practice is that the system of record data should come from the warehouse, and that pulling directly from application production databases is not a best practice.
    My question is: even to populate the warehouse, you need to pull data from the application databases, right? That can never be avoided. Am I missing anything here?

    • @BryanCafferky
      @BryanCafferky  7 months ago

      Right. The DW data comes from the systems of record, and that means it needs to get the data from the app databases. The standard practice is for DBAs to set up a process that creates a duplicate, read-only copy of the prod database which can be used by reporting systems. For SQL Server, log shipping is commonly used, but in the cloud, copying the database should be simple. Make sense?

  • @gianniprocida3332
    @gianniprocida3332 2 years ago +2

    I'm learning SQL thanks to you. This video clarified my ideas on how to use SQL joins. Amazing content. It would have been nice to have the same lesson using Python.

    • @BryanCafferky
      @BryanCafferky  2 years ago +2

      Thanks. Check out my Python & SQL playlist. ruclips.net/p/PL7_h0bRfL52oWNfE0GhwbnNjeJSmf8Q35
      Also, I cover using Python on Spark later in this "Master Databricks & Spark" series so you'll probably like that too.

  • @joyo2122
    @joyo2122 3 years ago +2

    Most of the time I use a left join.

  • @josegheevarghese
    @josegheevarghese 3 years ago +2

    Thank you very much for the series! They were very helpful!!

  • @Raaj_ML
    @Raaj_ML 3 years ago +1

    Can you please explain your point at 7:04, "I prefer not to use outer joins..."? I think you said at the beginning of this video that you prefer to use outer joins to identify missing data.

    • @BryanCafferky
      @BryanCafferky  3 years ago +1

      Good point. Outer joins can be useful. I may be thinking too much of how relational databases work, in that outer joins are more costly there. I'm trying to find more info about that. However, this blog explains that outer joins can generate tons of rows, which is a consideration. See this blog: databricks.com/session/optimizing-apache-spark-sql-joins
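
      For readers following along, a minimal sketch of the "outer join to find missing data" pattern the question refers to, assuming hypothetical orders and customers tables:

        -- The left outer join keeps every order; where no customer matches, the
        -- customer columns come back NULL, so the WHERE clause isolates orders
        -- whose customer_id is missing from the customers table.
        SELECT o.order_id, o.customer_id
        FROM orders AS o
        LEFT OUTER JOIN customers AS c
          ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL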

  • @vinr
    @vinr 3 years ago +1

    Thanks a lot for this great series on Spark

  • @jazzy9302
    @jazzy9302 3 years ago +1

    Thank you for the series!

  • @rhard007
    @rhard007 2 years ago

    Thank you for your videos. They have been so helpful.

  • @vinr
    @vinr 3 years ago +1

    Can I ask a question related to Spark streaming? Say we have incoming CSV files and we need to process them, but we need to do all the transformation on the data within a single file and output it as a file. That means each incoming file should have a corresponding outgoing file with the necessary transformations applied to the records in that file only. However, we also need to work on a cluster so that the load can be distributed and files can be processed in parallel. Is this possible? Thanks

    • @BryanCafferky
      @BryanCafferky  3 years ago

      That sounds complex from an orchestration standpoint. My gut sense would be to try to save the transformed data into a common datastore such as a Delta Lake table or external database like Azure SQL database, Synapse, or Snowflake. Just seems like you may be making the process more complicated by keeping so many separate streams and persisted data files.

    • @vinr
      @vinr 3 years ago +1

      @@BryanCafferky I don't mind saving it to a different datastore. My main point is that I need to do the transformation within the individual incoming data, as each incoming file contains a unique set of data which cannot be mingled with data from another incoming file. For example, say I take a rank over certain grouped data; that rank should be applied to the data within a single incoming file, not to data from several incoming files at once.

    • @BryanCafferky
      @BryanCafferky  3 years ago +1

      @@vinr Ok. Yeah. Sorry. I'd have to dig deeper into what you are doing to offer advice. It's a bit beyond the scope of a RUclips comment.

    • @vinr
      @vinr 3 years ago

      @@BryanCafferky Sure, thanks

  • @RajaBabu-su6ow
    @RajaBabu-su6ow 3 years ago +1

    Lovely sir..

  • @kumarpyarasani9068
    @kumarpyarasani9068 3 years ago +1

    Hi Bryan,
    Could you please create a video on how to append data to an existing table when we receive a new file or additional data?
    Thanks,
    Sri

    • @BryanCafferky
      @BryanCafferky  3 years ago

      Hi Kumar, Generally, you would use Delta Lake tables for that, but parquet can also be used. I do plan to cover Delta Lake in a different series, but I'll consider the idea of doing a video in this series with just parquet. This link has info you may find helpful. docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-dml-insert-into.html
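
      For anyone looking for the syntax behind that link, a minimal sketch of the append pattern (the table name, view name, and file path are hypothetical):

        -- Expose the newly arrived file as a temporary view (path and names are hypothetical).
        CREATE OR REPLACE TEMPORARY VIEW new_sales_csv
        USING CSV
        OPTIONS (path = '/mnt/landing/new_sales.csv', header = 'true', inferSchema = 'true');

        -- Append the new rows to the existing table (works for Delta and parquet-backed tables).
        INSERT INTO sales
        SELECT * FROM new_sales_csv;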