Now YOU can practice PySpark Coding questions without paying SO MUCH to the paid platforms. I have spent a lot of time creating this SOLUTION so that you don't need to PAY anyone. Looking for your SUPPORT ❤
Really, really love you bro. The amount of hard work you are putting in to teach is pushing us to study harder.
"You have a gift for explaining difficult topics with clarity. Your videos have saved countless hours of frustration!"
Thanks bro....
Stop using GPT for commenting. That quoted one is 100% GPT.
@SatyamKumarJha-d2e That means you are also searching for the same things on ChatGPT....😂
@SatyamKumarJha-d2e It means you are also using GPT, right??... So please add nice comments to the chat instead of sulking like an upset uncle (naraz fufa)...😂
Thank you for your kind words:) Happy Learning
I have seen a lot of PySpark videos, but your teaching is the best.
This came in just in time! Blessings bro
Glad this video helped you :)
No words for your hard work bro, Thank you so much for bringing us this knowledge.
Glad you liked it :)
Your content is going crazy, and for DE I can say your channel is a one-stop solution.
Indeed!
Thank you Ansh Bro. I was waiting for this video for a long time.
I am happy that you loved this video :)
Really amazing content, well explained. I'm transitioning to data engineering, so I have been learning PySpark for interviews.
I believe this is all you need, especially for entry-level DE job roles. Thanks so much, Ansh!!!
Thank you for your kind words :)
Hi Ansh,
I am a big fan of yours, as you are really doing good work to spread proper knowledge and help people save their money.
Keep it up 👍
Happy you found my videos useful :)
Thanks buddy, I am learning new frameworks from your tutorials ❤ it is very helpful 🙏🏾
Glad to hear that. Happy Learning :)
For question 10,
it asks to remove duplicates while preserving the order. While the row_number function does the job, it does not maintain the order.
Instead,
df_deduplicated = df.dropDuplicates()
df_deduplicated.display()
maintains the order.
Love your videos.
Love your aura.
Keep making more videos.
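A minimal sketch of the idea above, assuming a toy DataFrame in a Databricks notebook; note that Spark does not strictly guarantee row order after dropDuplicates, so treat this as an illustration rather than a guarantee:
# hypothetical example data; column names are only for illustration
data = [("John", 25), ("Jane", 30), ("John", 25), ("Alice", 22)]
df = spark.createDataFrame(data, ["name", "age"])
# drop exact duplicate rows; on small data the input order is usually kept,
# but Spark's API does not promise any ordering after this transformation
df_deduplicated = df.dropDuplicates()
df_deduplicated.display()  # display() is available in Databricks notebooks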
Thanks a lot, Ansh... this is a masterpiece for the PySpark interview.
Thank you for your kind words :)
Awesome man… We appreciate your efforts❤❤
Glad to hear this:) Happy Learning!
Timestamps (Powered by Merlin AI)
00:06 - Prepare for 2025 PySpark interviews with comprehensive resources.
02:50 - Introduction to a specialized PySpark interview notebook for real-time scenarios.
07:34 - Preparing PySpark interview resources to boost confidence.
09:52 - Creating a supportive Telegram community for resource sharing and error debugging.
14:27 - Creating a free Databricks Community Edition account.
16:40 - Guide on accessing Databricks for PySpark interview preparation.
20:21 - Understanding duplicates in PySpark DataFrames is crucial for interviews.
22:14 - Effective interview strategies for handling coding questions.
26:07 - Understanding data engineer responsibilities and data handling techniques is crucial.
28:15 - Career preparation: Watch recommended video for complete interview readiness.
32:16 - Converting date columns in PySpark DataFrames efficiently.
34:32 - Sorting and deduplicating data in PySpark efficiently.
38:22 - Understanding schema merging is crucial for handling inconsistent data in PySpark.
40:34 - PySpark optimizes data processing by reducing disk writes using in-memory computations.
45:01 - Handling null values in PySpark DataFrames using fillna.
47:07 - Calculate and retrieve top five users by total actions.
51:06 - Using window functions to analyze recent customer transactions.
53:18 - Demonstrating filtering and sorting of DataFrame in PySpark.
57:37 - Identifying customers without purchases in 30 days using PySpark.
59:50 - Calculate users with purchase gaps over 30 days using PySpark.
1:03:59 - Transforming a text column into an array and exploding its values.
1:06:19 - Grouping and counting words in PySpark with aliasing techniques.
1:10:44 - Calculate cumulative sales sum over time using PySpark.
1:12:49 - Using cumulative sum with window functions in PySpark.
1:17:18 - Using row number to retain order while removing duplicates.
1:19:32 - Surrogate keys simplify database design by replacing complex primary keys.
1:23:41 - Focus on covering all key topics, including data aggregation in PySpark.
1:25:54 - Double aggregation to find top-selling products per month.
1:30:33 - Extract highest sales product per month using dense rank.
1:33:17 - Debugging PySpark code focuses on correcting syntax and indentation issues.
1:37:40 - Understanding Spark architecture and components in PySpark.
1:39:54 - Understanding Spark architecture and driver node functionality.
1:44:25 - Applying merge conditions using Delta tables in PySpark.
1:46:53 - Understanding upsert and schema inference in PySpark.
1:50:54 - Understanding RDDs, DataFrames, and Datasets in PySpark.
1:52:53 - DataFrames enhance usability and integration in PySpark compared to RDDs.
1:57:13 - Understanding logical and physical plans in PySpark for optimization.
1:59:29 - Spark chooses the optimal join type based on cost models.
2:03:51 - Understanding Spark's entry points and transformation types.
2:05:59 - Narrow transformations allow independent data processing without shuffling between machines.
2:10:27 - Understanding partitioning and data management in PySpark.
2:12:31 - Understanding data storage and management in PySpark using cache and persist.
2:17:11 - Partitions in PySpark enable massive parallel processing for efficient data handling.
2:19:17 - Counting employees by department and classifying sales transactions.
2:24:03 - Understanding fundamental concepts is essential for implementing solutions in data processing.
2:26:14 - Using current timestamp for tracking record changes in PySpark.
2:30:36 - Understanding temporary and global views in PySpark SQL.
2:32:59 - Understanding Global Temp views and flattening nested structures in PySpark.
2:37:37 - Understanding data partitioning in PySpark for optimized storage.
2:39:45 - Using Snappy compression enhances performance with Parquet files.
2:43:58 - Understanding the OPTIMIZE and ZORDER BY commands in PySpark.
2:46:05 - Understanding Z-Order and Data Skipping in Spark for Efficient Querying.
2:50:20 - Understanding DataFrame actions and lazy evaluation in PySpark.
2:52:40 - Understanding actions and lazy evaluation in PySpark.
2:57:22 - Key advantages of Delta Lake and memory management in PySpark jobs.
2:59:48 - Optimizing memory and mitigating data skew with AQE in PySpark.
3:04:31 - Dynamic join optimization in PySpark enhances performance.
3:06:29 - Discussing skewed-data handling methods in PySpark using salting and AQE.
3:10:56 - Understanding Spark's memory management and Delta Lake's time travel features.
3:13:18 - Utilize Delta Lake's time travel feature to recover deleted data.
3:17:49 - Using collect_list for aggregating product names by category in PySpark.
3:20:38 - Using collect_set to find unique product IDs per customer.
3:25:49 - Using concat_ws for efficient string concatenation in PySpark.
3:28:23 - Calculate product counts for customers using PySpark DataFrame operations.
3:33:02 - Validating phone numbers using prefix filtering in PySpark.
3:35:25 - Calculate average courses per student from a dataset.
3:40:28 - Discussion on preparing for PySpark interviews with valuable resources.
3:43:01 - Determine if postal codes are standard or custom based on length.
Superb work bro .... Great energy 🎉
Glad you liked it:) Happy Learning
Your Data Fam is here
Glad to see the excitement
Thanks Ansh bhai
Happy that you liked it:)
Thank you Ansh, I am currently studying the PySpark course video... once it is completed, this will be very useful ❤
Great that you are watching my other course as well! Happy learning :)
You are awesome @Ansh Lamba
Happy Learning :)
Thank you ansh
Thanks for sharing
Great content - this will help😊
Happy that you found it useful :)
Love you bro❤ love the way of your explanation..very informative
I'm happy that you loved this video :)
Hey Ansh, I can’t thank you enough for this wonderful Masterpiece. Your explanation is always clear and really made a difference in how I understood the material. You have a great teaching style that makes even the most complicated topics seem accessible. I’ve learned so much from you, and I’m really grateful for the time and effort you put into helping us improve our skills!♥👏👏👏
Waiting for this❤
Happy Learning :)
Fabulous Ansh❤
Happy Learning :)
This is so helpful! But I'm unable to access the notebook... could you please check?
It's there: github.com/anshlambagit/PySparkInterview
You are awesome bro...
Great
Covered everything, bro, this video is enough for preparation. Thanks a ton.....!!!!!👌👌👏👏👏🙌🙌🙌🙌
Happy to hear that :)
Awesome explanation!!!! And really helpful content. Kindly make such a video on Scala as well...
Stay tuned :)
Thanks a lot
awesome Bro 🙏🙏
Happy Learning :)
Hello Ansh, I am just in love with your content, bro. You made my life easier, thank you so much; I am happy I found your channel. I have a request for you: what kind of SQL questions will be asked for the Azure data engineer role? If possible, can you make a video on it?
Thank you for your kind words :) Stay tuned
Learning Growing thanks ❤
Happy Learning :)
Thanks a lot Ansh ❤
Happy learning:)
Thanks brother 💖
Glad you liked it :)
Informative one broooo!!!!
Can you please create videos on delta live tables?
Thank you. Stay Tuned :)
Like it bro...❤❤
Happy you liked it:)
This is brilliant
Thank you for your kind words :)
Guru 🔥🔥🔥🔥🔥🔥
Happy Learning :)
You are awesome bro ❤ wow😮
Thank you for your kind words:) Happy Learning!
Thank you Ansh.
Glad you liked it:)
You are awesome ❤@Ansh bro
Happy Learning :)
Not able to open the .dbc notebook. Can you please change the extension and send it? Thank you.
Wow thanks ansh
Glad you liked it:)
Thanks Ansh
Happy you liked it :)
Hey Ansh, can we get interview questions for Azure Data Factory as well? Thank you so much for your wonderful help.
Stay Tuned :)
Thanks a lot Ansh. Can you make the same kind of real-time interview questions for ADF as well? It would be so helpful.
Stay tuned :)
@ Thank you☺
Hi @AnshLambaJSR, I can't use a .dbc file on my work laptop. Can you please provide it in CSV format?
Bro, how can a notebook be CSV?
@aniketghodake1366 so, which app shall I use to open it?
Ansh bro, I struggle a lot while answering CI/CD-related questions in interviews. Can you explain how teams move code between environments and do prod deployments in Databricks? Thanks.
Hi Ansh, I'm unable to view the notebook on GitHub. It seems empty. Can you reshare it?
Hi Ansh,
at 1:09:26 there are two product rows, and one of them has a comma attached to it, which caused that product's count to come out as 1.
How are you going to correct that issue?
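One possible fix, as a minimal sketch only: strip punctuation such as commas from the text before splitting and exploding it. The column name "text", the sample data, and the pattern are assumptions for illustration, not the exact code from the video:
from pyspark.sql import functions as F
# hypothetical data reproducing the issue: a trailing comma sticks to a word
df = spark.createDataFrame([("apple banana,",), ("banana apple",)], ["text"])
clean = (
    df.withColumn("text", F.regexp_replace("text", ",", ""))  # drop commas so "banana," matches "banana"
      .withColumn("word", F.explode(F.split("text", " ")))    # one row per word
)
clean.groupBy("word").count().show()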
Hey @AnshLambaJSR Love your videos. Can we also get a video on Azure interview questions??
Stay tuned and Happy Learning :)
OP bro OP........
Happy Learning :)
You know that famous Telugu politician's dialogue: "Nenu vinnanu, nenu vunnanu". That fits you perfectly.
What is the meaning of this Telugu dialogue?
Thank you ANSH for great content...❤
Please make one for AZURE
Are you arranging any classes?
Kindly requesting: will you post a complete Scala explanation video on YouTube?🤗
Stay tuned :)
Also, what does checkpointing() do?
Can you please make a tutorial on Apache Airflow?
Stay tuned :)
How did you guys open this file?
It's in .dbc format.
Edit: steps to import a .dbc file into Databricks:
Log in to Databricks.
Go to the Workspace tab.
Navigate to a folder or create a new folder.
Click on the "Import" button (usually at the top-right).
Choose the .dbc file (PySpark Interview.dbc) from your local machine.
Click Import.
Bro, can you make a video on optimization in ADF/Databricks and how to do data quality checks in the bronze layer?
Ansh, can you make an ADF real-time scenario questions playlist?
Stay Tuned :)
Bro,
rather than a single video,
break this one into a series. It will be more helpful during preparation.
Make a video on SQL with more queries also.
Stay tuned :)
Thank you❤ thank you ❤ thank you❤ thank you❤ thank you❤ thank you❤ thank you ❤... Thanks ❤ a lot for the video, it's very useful ❤
I am happy that you loved this PySpark Interview questions video:)
Hi Ansh, starting with the Azure free trial, I'm getting a phone number error: "We're unable to validate your phone number." I tried phone numbers from different countries and get the same error. Please assist me.
bang on bro
Bro, the content is good... but one issue... no one can understand your content when in a hurry, I mean at 2x speed.
Don't be in a hurry, bro.
Unable to download the notebook.
If you make one video for AIRFLOW, that would be helpful for many people
Is DP-203 expiring? And is a new Microsoft Fabric certificate being introduced? Is it true? Is ADF no longer going to exist on its own; will it be part of Fabric in the future? Will everything be under Fabric in the future for data engineers? Can you please make a video on this for clarity, since there is confusion about which certifications will be useful in the future for Azure Data Engineers and which side we should focus on most: Fabric, ADF, Synapse, or Databricks? Please make one video on a priority basis. Thanks in advance, Ansh.
Stay tuned and Happy Learning :)
HI ANSH, GOOD START IN 2025
Happy you liked it :)
For Q10: While preparing a data pipeline, you notice some duplicate rows in a dataset.
How would you remove the duplicates without affecting the original order?
The correct answer is to use monotonically_increasing_id along with dropDuplicates to preserve the order; we can't rely on a window function with ROW_NUMBER here.
from pyspark.sql import functions as f
data = [("John", 25), ("Jane", 30), ("John", 25), ("Alice", 22)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
# tag each row with a monotonically increasing id to remember the original order
df = df.withColumn("index", f.monotonically_increasing_id())
df.printSchema()
df.show()
# drop duplicates, restore the original order, then remove the helper column
df = df.dropDuplicates(["name", "age"]).orderBy("index").drop("index")
df.show()
Bro what about machine learning with azure data engineering?
Finally, data engineer aspirants have found the YouTube channel where they can learn everything 🎉
Thanks a lot✨🛐
Thank you so much for your kind words :)
Bro, SQL and Python questions?
Great content as always ❤ You are a gift for your data fam🎉
Thank you for your kind words :)
Make an interview questions video on ADF.
Stay tuned:)
Bro, please bring an end-to-end project on Fabric.
Stay tuned :)
Ansh, I have a small suggestion for you. It would be great if you could upload videos on Friday nights. Since many of us are working professionals, this would allow us to dedicate time over the weekend, on Saturday and Sunday, to watch and engage with the content.
Thank you for your suggestion :)
Exactly, I was thinking of commenting the same. Hi Ansh, this would help us so much😊
Brother, how can we get a data engineering job as a fresher? Because I only see openings asking for 3+ years of experience in the Mumbai location.
Waiting for the DLT project.
Stay Tuned :)
Azure Key Vault video
Airflow tuition request
Great, please continue the Azure DE interview series.
More to come!
Superb video. Nowadays companies are asking about Azure DevOps and CI/CD integration as well.
Stay tuned :)
Bro, your face isn't that great either, so stop showing off so much style and just teach straightforwardly.
Azure Logic Apps
Also, please make a video on why the Azure Data Engineering certification is retiring and how Microsoft Fabric is replacing it.
I have spent 6 years using Azure tools. Should I take the DP-203 certification and complete it, or wait, learn, and prep for MS Fabric?
Stay tuned :)
First of all, thank you for your efforts! Your content is truly helpful for those who are serious about the data engineering field. I have recommended your channel to many people, including my colleagues and friends. They have subscribed and are regularly following your videos. Please continue creating more videos; we are here to support you always 😊🙏🙂
Glad to hear this:) Happy Learning :)
our SUPERHERO is back !!! 💟❣❣💟💟❣❣❤🔥❤🔥❤🔥❤🔥
Happy to see such excitement for my videos :) Happy Learning!
Worst, tacky (chapri) way of talking.
Thanks for the lovely compliment
For question 1:
Can we use this? It's easier, right?
from pyspark.sql import functions as F
df1 = df.groupBy("product_id").agg(F.max("sales").alias("sales"), F.max("date").alias("date"))
df1.display()
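A side note on this, hedged and not from the video: agg(max("sales"), max("date")) computes each maximum independently, so the returned sales and date may come from different rows. A minimal sketch of a window-based alternative that keeps the whole top row per product, using hypothetical column names:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# hypothetical data: per product, keep the row with the highest sales
data = [(1, 100, "2025-01-01"), (1, 300, "2025-01-05"), (2, 200, "2025-01-02")]
df = spark.createDataFrame(data, ["product_id", "sales", "date"])
w = Window.partitionBy("product_id").orderBy(F.col("sales").desc())
top = (
    df.withColumn("rn", F.row_number().over(w))  # rank rows within each product by sales
      .filter(F.col("rn") == 1)                  # keep only the top row per product
      .drop("rn")
)
top.show()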
Please create a video on ADF interview questions for real-time scenarios, and also an end-to-end project on ADF only, covering the most commonly used activities and important scenarios. @anshLambaJSR
Stay tuned :)
@AnshLambaJSR thanks bro