Hello there I found your presentation interesting. If that can help, I can provide another use case where cross join is usefull. For instance, you have a table of cars with their attributes and you want to compare them and give a comparison score. Scoring could be as following - Checking if both car have the same transmission type (manual or automatic) - Checking if both cars have the same fuel type - Comparing the HP (closet gives the higher score) - And so on... - For each attributes we can give a score which will give a total at the end between each cars In order to do it, you can create the cartesian product of the cars Then we filter the list where car_id_x car_id_y (so we dont compare the cars with themselves) We then obtain the score for each permutation as mentionned Then we can order by score DESC At the end we obtain an ordered list by score DESC of all the other cars for each cars!
HI Bryan, you mentioned that ideal practise is thst system of record should come from warehouse .. directly pulling from application production databases is not a best practise. My question is that even to populate warehouse you need to pull data from application databases right ?? which can never be avoided .. Am i missing anything here ?
Right. The DW data comes from the systems of record and that means it needs to get the data from the app databases. The standard practice is for DBAs to set up a process to create a duplicate read only copy of the prod database which can be used by reporting systems. For SQL Server, log shipping is commonly used but in the cloud, copying the database should be simple. Make sense?
I'm learning SQL thanks to you. This video clarified my ideas on how to use SQL joins. Amazing content. It would have been nice to have the same lesson using Python.
Thanks. Check out my Python & SQL playlist. ruclips.net/p/PL7_h0bRfL52oWNfE0GhwbnNjeJSmf8Q35 Also, I cover using Python on Spark later in this "Master Databricks & Spark" series so you'll probably like that too.
Can you please explain your point at 7:04 "I prefer not to use outer joins...." ? I think you said you prefer to use outer joins in the beginning of this video, to identify data missing..
Good point. Outer joins can be useful. I may be thinking too much of how relational databases work in that Outer Joins are more costly. I'm trying to find more info about that. However, this blog explains that Outer Joins can generate tons of rows which is a consideration. See this blog: databricks.com/session/optimizing-apache-spark-sql-joins
Can I ask a question related to spark streaming, say we have incoming csv files and we need to process them, however we need to do all the transformation on the data within a single file and output it as a file, which means each incoming file should have and corresponding out going file with the necessary transformation done to the records in that file only. However we also need to work on a cluster so that load can be distributed and files can be processed in parallel. Is this something possible? Thanks
That sounds complex from an orchestration standpoint. My gut sense would be to try to save the transformed data into a common datastore such as a Delta Lake table or external database like Azure SQL database, Synapse, or Snowflake. Just seems like you may be making the process more complicated by keeping so many separate streams and persisted data files.
@@BryanCafferky I don't mind saving it to different datastore, my main point is I need to do the transformation within the individual incoming data, as each incoming file contain some unique set of data, which cannot be mingled with the another data set from another incoming file, eX: say I take rank over certain group by data, now this rank should be applied to the data within single incoming file, not on set of data from few of incoming data files at once
Hi kumar, Generally, you would use Delta Lake tables for that but parquet can also be used. I do plan to cover Detla Lake in a different series but I'll consider the idea of doing a video in this series with just parquet. This link has info you may find helpful. docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-dml-insert-into.html
Hello there I found your presentation interesting. If that can help, I can provide another use case where cross join is usefull.
For instance, you have a table of cars with their attributes and you want to compare them and give a comparison score.
Scoring could be as following
- Checking if both car have the same transmission type (manual or automatic)
- Checking if both cars have the same fuel type
- Comparing the HP (closet gives the higher score)
- And so on...
- For each attributes we can give a score which will give a total at the end between each cars
In order to do it, you can create the cartesian product of the cars
Then we filter the list where car_id_x car_id_y (so we dont compare the cars with themselves)
We then obtain the score for each permutation as mentionned
Then we can order by score DESC
At the end we obtain an ordered list by score DESC of all the other cars for each cars!
This video helped me get notice in a new program to me, Databricks, in week 2 of a new job. You explained these joins very well. Great Video!!
HI Bryan, you mentioned that ideal practise is thst system of record should come from warehouse .. directly pulling from application production databases is not a best practise.
My question is that even to populate warehouse you need to pull data from application databases right ?? which can never be avoided .. Am i missing anything here ?
Right. The DW data comes from the systems of record and that means it needs to get the data from the app databases. The standard practice is for DBAs to set up a process to create a duplicate read only copy of the prod database which can be used by reporting systems. For SQL Server, log shipping is commonly used but in the cloud, copying the database should be simple. Make sense?
I'm learning SQL thanks to you. This video clarified my ideas on how to use SQL joins. Amazing content. It would have been nice to have the same lesson using Python.
Thanks. Check out my Python & SQL playlist. ruclips.net/p/PL7_h0bRfL52oWNfE0GhwbnNjeJSmf8Q35
Also, I cover using Python on Spark later in this "Master Databricks & Spark" series so you'll probably like that too.
most of the time i use left join
Thank you very much for the series! They were very helpful !!
YW
Can you please explain your point at 7:04 "I prefer not to use outer joins...." ? I think you said you prefer to use outer joins in the beginning of this video, to identify data missing..
Good point. Outer joins can be useful. I may be thinking too much of how relational databases work in that Outer Joins are more costly. I'm trying to find more info about that. However, this blog explains that Outer Joins can generate tons of rows which is a consideration. See this blog: databricks.com/session/optimizing-apache-spark-sql-joins
Thanks a lot for this great series on Spark
You're welcome.
Thank you for the series!
You're welcome.
Thanks you for your videos. They have been so helpful.
Thanks. Glad it helps.
Can I ask a question related to spark streaming, say we have incoming csv files and we need to process them, however we need to do all the transformation on the data within a single file and output it as a file, which means each incoming file should have and corresponding out going file with the necessary transformation done to the records in that file only. However we also need to work on a cluster so that load can be distributed and files can be processed in parallel. Is this something possible? Thanks
That sounds complex from an orchestration standpoint. My gut sense would be to try to save the transformed data into a common datastore such as a Delta Lake table or external database like Azure SQL database, Synapse, or Snowflake. Just seems like you may be making the process more complicated by keeping so many separate streams and persisted data files.
@@BryanCafferky I don't mind saving it to different datastore, my main point is I need to do the transformation within the individual incoming data, as each incoming file contain some unique set of data, which cannot be mingled with the another data set from another incoming file, eX: say I take rank over certain group by data, now this rank should be applied to the data within single incoming file, not on set of data from few of incoming data files at once
@@vinr Ok. Yeah. Sorry. I'd have to dig deeper into what you are doing to offer advice. Its a bit beyond the scope of a RUclips comment.
@@BryanCafferky Sure, thanks
Lovely sir..
Hi Bryan,
Could you please create a video how to append the data to existing table when we receive new file or additional data.
Thanks,
Sri
Hi kumar, Generally, you would use Delta Lake tables for that but parquet can also be used. I do plan to cover Detla Lake in a different series but I'll consider the idea of doing a video in this series with just parquet. This link has info you may find helpful. docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-dml-insert-into.html