Informative video... and the comment section too.
Thanks Raja sir 💐
Thanks and welcome!
Superb sir now I have cleared this concept
Great to hear👍🏻
Very informative..keep continuing..
Thanks Paramesh
Very good content. keep it up.
Thanks Vivek
@@rajasdataengineering7585 Wanted to confirm one thing. Is the Delta Lake feature available in Spark 3.x onwards?
Yes available
@@rajasdataengineering7585 OK, and can we implement Delta Lake in Spark 2.3.x?
nice information
Thanks
Very well explained
Thank you
Great Video for data scientist like me
Thank you
Your videos are nice.
Glad you like them!
Nice content Raja👍
Thanks Abhay
I think the video title should be changed to "how to implement SCD 1 in databricks". It'll reach a larger audience.
Sure, will change it as suggested
Hi Raja, nice videos. I have gone through all of your videos.
You have titled this video SCD Type 1. As per my knowledge, it's Delta Lake with all kinds of history (versions), so I think it should be SCD Type 2.
Hi Rambabu, thanks for your comment. I believe I explained the merge statement, which overwrites the previous version and does not maintain history. Anyway, I will check the video and make corrections if needed.
Also, I have posted another video on SCD Type 2.
Great explanation 👏👏, thanks
Thanks Sanjay!
Truly appreciate your efforts!!
Can you please share the script you have used, so that we can do hands-on with the same?
Very nice. Is it possible to supply the column names dynamically from somewhere? Currently the column name in the ON condition is hardcoded as id, and the set columns are hardcoded as well. Can we pull those columns dynamically from a list, array, or config file?
Yes that's possible
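For example, here is a minimal sketch (the update column names and the source dataframe source_df are assumptions, not from the video) that builds the ON condition and the SET mapping from lists, which could equally be loaded from a config file:

from delta.tables import DeltaTable

# hypothetical config - could also be read from a JSON/YAML config file
key_columns = ["id"]
update_columns = ["name", "city"]

target = DeltaTable.forPath(spark, "/FileStore/tables/delta_merge")

# build "tgt.id = src.id AND ..." from the key column list
merge_condition = " AND ".join(f"tgt.{c} = src.{c}" for c in key_columns)
# build {"name": "src.name", ...} from the update column list
set_expr = {c: f"src.{c}" for c in update_columns}

(target.alias("tgt")
    .merge(source_df.alias("src"), merge_condition)
    .whenMatchedUpdate(set=set_expr)
    .whenNotMatchedInsertAll()
    .execute())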
Nice explanation 👌
Thank you Sravan
Hi, in this example there is only one table.
If there are multiple tables with multiple columns, and the primary key is also different for each table, how do we generalize this?
We can create a parameterized (user-defined) function to make it generic.
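As a rough sketch of that idea (the table paths, key columns and the source_dataframes lookup below are made up for illustration), a parameterized helper lets the same merge logic serve tables with different primary keys:

from delta.tables import DeltaTable

def upsert_delta(target_path, source_df, key_columns):
    # generic upsert: the merge condition is derived from each table's own key columns
    target = DeltaTable.forPath(spark, target_path)
    condition = " AND ".join(f"tgt.{c} = src.{c}" for c in key_columns)
    (target.alias("tgt")
        .merge(source_df.alias("src"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# hypothetical per-table config with different primary keys
table_config = {
    "/mnt/delta/customers": ["customer_id"],
    "/mnt/delta/orders": ["order_id", "line_no"],
}
for path, keys in table_config.items():
    upsert_delta(path, source_dataframes[path], keys)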
Could you make a video on "How to implement SCD 2 using PySpark/Spark SQL in Databricks" ? Thanks.
Sure Pritam, it would be my next video as per your request
loved the content. Thanks Brother
Thank you Mrinal.
@@rajasdataengineering7585 I am planning to make a transition to Data Engineering and would love to have someone like you guide me in my journey. Could we connect in some way?
Sure, please contact me at audaciousazure@gmail.com
@@rajasdataengineering7585 Sure, let me do that right away. Thanks a lot :-)
Hi Raja,
I am also doing an upsert with Structured Streaming into an Azure SQL database. Everything is not working as it should: I can upload via an ODBC connection in a normal (batch) write, but not in writeStream, where I get an error that ODBC is not installed (but it is). I do the upsert with foreach.
Can you give me some advice? Many thanks.
How do we update records in a DB table via JDBC in Databricks? I tried read and write (overwrite/append), but not update.
You can use the presql and postsql statement options with the JDBC connection.
@@rajasdataengineering7585 Do you have an example? Let's say updating records using the presql and postsql options.
I haven't yet created a video on it
I have a suggestion: create a stored procedure to delete the records which are found, call the procedure before the insert, then insert all the values as new values.
What will be the syntax for inserting a record manually into a Delta Lake table and a dataframe using PySpark?
Hi Ashish, it is the same SQL syntax, but use the %sql magic command if you are using a Python or Scala notebook.
If you want to insert based on a dataframe, convert the dataframe into a temp view first; then you can use the SQL insert syntax.
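As a small sketch of both options (the sample values and the new_rows_df dataframe are made up; the path is the one used in the video):

# direct SQL insert (same statement you would run in a %sql cell)
spark.sql("INSERT INTO delta.`/FileStore/tables/delta_merge` VALUES (101, 'Ravi', 'Chennai')")

# insert based on a dataframe: register it as a temp view first, then use SQL insert syntax
new_rows_df.createOrReplaceTempView("new_rows")
spark.sql("INSERT INTO delta.`/FileStore/tables/delta_merge` SELECT * FROM new_rows")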
Sir please make playlist on streaming
Sure, will create a playlist for streaming concepts
Can you please share all the notebooks in this series?
How do we manage it if one of the rows in the source table got deleted and we also want to delete that row in the target table?
There is a whenMatchedDelete option as well, but that is for matching cases. In your case, if the source dataframe contains the latest snapshot, you are better off going for truncate and load.
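For reference, a minimal sketch of whenMatchedDelete (the src_df dataframe and its is_deleted flag are assumptions, not something from the video), plus the truncate-and-load alternative:

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/FileStore/tables/delta_merge")

# delete matched target rows that the source marks as deleted (hypothetical is_deleted flag)
(target.alias("tgt")
    .merge(src_df.alias("src"), "tgt.id = src.id")
    .whenMatchedDelete(condition="src.is_deleted = true")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# truncate and load: simplest when the source dataframe is the full latest snapshot
src_df.write.format("delta").mode("overwrite").save("/FileStore/tables/delta_merge")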
Thank you
In real time, which one should we use, PySpark or SQL? Which one is more effective?
Both perform at the same level. It's all about the developer's convenience.
Do we have SCD type 1 and Type 2 videos in PySpark and Spark SQL ?
Hi Ashish, this tutorial can be used for SCD Type 1, and I will post another video for SCD Type 2.
Where can I get the scripts you have shown in the tutorials? I liked them very much.
Will share the scripts
In my case the table is in Hive; can I implement the same solution?
Yes, you can implement it for a Hive table also.
Hello
Can you please tell how to change the data type of the columns of the created delta table?
For example: in this video you have created
Sure, will create a video on this requirement
How can we delete the data which is not in the source, in the same merge statement, using PySpark?
@rajasdataengineering7585 hi sir,
I have data in an RDBMS SQL database (source).
I do some transformations and write that data into a Postgres DB using PySpark. As this job is triggered on an hourly basis and fetches the data from the source in an 8-hour interval, there are so many duplicates in the Postgres table. How do I overcome that? Please explain.
Hi Keerthana, it is better to create a view on the Postgres side with the logic for handling duplicates, and ingest data from that view into Databricks.
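A minimal sketch of that ingestion (host, database, view name and credentials are placeholders):

# read from a de-duplicating view created on the Postgres side
dedup_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.sales_dedup_view")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "org.postgresql.Driver")
    .load())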
@@rajasdataengineering7585 I need to move that data to a warehouse. The transformed data can only be written, but the existing data and the transformed data have duplication. How do I write without entering the same data into Postgres again?
Please upload a Delta Live Tables series.
I am getting an error when inserting specific columns instead of all columns, saying a column is missing in the insert.
For update, you can give specific columns, but for insert the complete list of columns has to be provided.
Has the SCD Type 2 video been removed or made private? Could you please make it public? Awesome videos!
Sure Sanjay, will add that one
@@rajasdataengineering7585 please upload SCD type 2 video
How do we write the count of inserted records into some audit table?
Hi Krishna,
Very good question.
In my example, the delta table is located at /FileStore/tables/delta_merge.
So after performing the merge operation on this delta table, you can follow the steps below:
from delta.tables import *
from pyspark.sql.functions import explode  # needed for exploding the operationMetrics map
delta_df = DeltaTable.forPath(spark, "/FileStore/tables/delta_merge")
lastOperationDF = delta_df.history(1)  # get the last operation
display(lastOperationDF)
explode_df = lastOperationDF.select(lastOperationDF.operation, explode(lastOperationDF.operationMetrics))
display(explode_df)
The column operationMetrics contains all the metrics, including the number of records inserted, the number of records updated, etc.
explode_df can be used to retrieve these metrics.
Hope it helps
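As a follow-up sketch for the audit-table part (the audit table path and its columns below are assumptions): the insert/update counts can be pulled out of the operationMetrics map and appended to a Delta audit table:

from pyspark.sql.functions import col, current_timestamp

# extract the counts from the operationMetrics map (keys as reported by Delta for MERGE operations)
audit_df = (lastOperationDF
    .select(
        col("operation"),
        col("operationMetrics")["numTargetRowsInserted"].alias("rows_inserted"),
        col("operationMetrics")["numTargetRowsUpdated"].alias("rows_updated"))
    .withColumn("audit_time", current_timestamp()))

# hypothetical audit table location
audit_df.write.format("delta").mode("append").save("/FileStore/tables/merge_audit")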
@@rajasdataengineering7585 thank you for instant reply..
You are welcome
Can we directly update a table in SQL Server instead of a delta table?
Yes, we can do that.
@@rajasdataengineering7585 can you post a video on that ?
Sure will create a video on this request
Which join is equivalent to merge?
Merge is an upsert operation, not a join operation. However, internally it is equivalent to an outer join.
@@rajasdataengineering7585 thank you so much 😊👌
Hi sir,
Is that possible through standalone PySpark? Please explain.
Hi Keerthana, yes, it is possible through standalone PySpark as well.
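For standalone PySpark (outside Databricks), a minimal sketch, assuming the delta-spark pip package is installed and its version matches the installed PySpark:

# pip install delta-spark
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-standalone")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# adds the Delta Lake jars and returns a session that can read, write and merge Delta tables
spark = configure_spark_with_delta_pip(builder).getOrCreate()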
Hey, thank you for the video. I am using Method 1 to perform a merge on a big table (1 TB). It takes 3+ hours to do that.
Can you please suggest how I can improve that?
Also, is it possible and advisable to perform merges on Parquet rather than converting it to Delta?
Have you got the solution?
Please add one production-ready, live project.
Sure, I will create a project for this requirement
Can you please explain updating an Oracle table from PySpark? Insert I am able to do.
We need to use the JDBC driver for the Oracle database as well. The process is the same as for the MS SQL table in this example.
I tried using JDBC. It is inserting, but the update statement is not supported; it will just overwrite the full data instead. Is there any way to execute a merge statement on an RDBMS using PySpark?
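For reference on the JDBC approach mentioned above, a rough sketch of the Oracle write (URL, schema and credentials are placeholders); as noted, the plain JDBC writer only appends or overwrites, it does not do row-level updates or merges:

# append rows into an Oracle table over JDBC
(df.write.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//<host>:1521/<service_name>")
    .option("dbtable", "SCHEMA_NAME.TARGET_TABLE")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "oracle.jdbc.OracleDriver")
    .mode("append")
    .save())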
How to merge Spark DataFrames of complex type: if we have two JSON files and the json1 schema and json2 schema are different, how can we merge them using PySpark? Can you please explain this scenario?
When you say merge, I assume you mean a union of the 2 dataframes from the JSON files.
Please let me know if you mean the SQL merge operation, not union.
For a union, the number of columns and the datatypes should match. So you need to alter the dataframes first to meet these 2 conditions, and then they can be combined.
@@rajasdataengineering7585 Yes, we can use union, but how can we handle complex JSON when the schemas are different, so that finally we can use UNION? Can you please explain this scenario?
First, we need to flatten the nested JSON fields, remove unwanted columns, and make the schema the same for both dataframes.
Then we can apply the union.
If you have any sample dataset, you can share it to my mailbox and I can help you.
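A small sketch of that flow (the paths and column names are made up): flatten the nested fields on both sides, then union; unionByName with allowMissingColumns (Spark 3.1+) can fill columns missing on one side with nulls, or the schemas can be made identical first as described above:

from pyspark.sql.functions import col

df1 = spark.read.json("/path/json1")   # placeholder paths
df2 = spark.read.json("/path/json2")

# flatten the nested fields needed from each file (hypothetical column names)
flat1 = df1.select(col("id"), col("customer.name").alias("customer_name"), col("amount"))
flat2 = df2.select(col("id"), col("customer.name").alias("customer_name"))

# union by column name, filling the column missing in df2 with nulls
combined = flat1.unionByName(flat2, allowMissingColumns=True)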
@@rajasdataengineering7585 can you please share your mail id
audaciousazure@gmail.com
Hi, how do we merge on 2 columns?
I have posted another video on SCD Type 2 and explained merging on multiple columns. Kindly refer to that video.
@@rajasdataengineering7585
Hi sir
In my case
I need to merge on 2 columns with 2 merge keys
Please help me out 🙏
Please refer to this video: ruclips.net/video/GhBlup-8JbE/видео.html
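For the two-key case asked about above, a minimal sketch (updates_df and the column names id and country are assumptions): the merge condition simply combines both keys with AND:

from delta.tables import DeltaTable

# merge on two key columns by ANDing them in the merge condition
(DeltaTable.forPath(spark, "/FileStore/tables/delta_merge").alias("tgt")
    .merge(updates_df.alias("src"), "tgt.id = src.id AND tgt.country = src.country")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())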
Please make a video on delta lake
Sure Shalini, will make videos on delta lake
Sir can you share the notebook please
Can you provide the complete script so we can use it to practice?
Hi,
I need this notebook. Can you please share it?
Please add the HTML version of your notebook.
04:40
Hello, I copied the template from the Databricks documentation and saw that there are some differences from your example. Why doesn't the documentation's version work?
from delta.tables import *

deltaTableVendas = DeltaTable.forPath(spark, 'dbfs:/mnt/bronze/vendas/')
deltaTableVendasUpdates = DeltaTable.forPath(spark, 'dbfs:/mnt/silver/vendas/')

dfUpdates = deltaTableVendasUpdates.toDF()

deltaTableVendas.alias('vendas') \
    .merge(
        dfUpdates.alias('updates'),
        'vendas.numero_transacao = updates.numero_transacao'
    ) \
    .whenMatchedUpdate(set =
        {
            "numero_transacao": "updates.numero_transacao",
            "numped": "updates.numped",
            "codcli": "updates.codcli",
            "codprod": "updates.codprod",
            "data_venda": "updates.data_venda",
            "quantidade": "updates.quantidade",
            "valor": "updates.valor"
        }
    ) \
    .whenNotMatchedInsert(values =
        {
            "numero_transacao": "updates.numero_transacao",
            "numped": "updates.numped",
            "codcli": "updates.codcli",
            "codprod": "updates.codprod",
            "data_venda": "updates.data_venda",
            "quantidade": "updates.quantidade",
            "valor": "updates.valor"
        }
    ) \
    .execute()
Ideally this code should work as well. I would need to look into the details of any specific error. Could you elaborate more if you get an error?
Thank you
You're welcome