Many have asked for the file I used for this video. You can download it from here:
drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing
Remove the last 2 lines from the CSV file.
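For anyone who prefers to script the trimming step rather than edit the file by hand, here is a minimal sketch. It assumes the file is UTF-8 encoded and small enough to read into memory; the function name `drop_last_lines` is made up for illustration and is not from the video or the notebook.

```python
def drop_last_lines(src, dst, n=2):
    """Copy src to dst without its final n lines; return how many lines were dropped."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(src, "r", encoding="utf-8", newline="") as f:
        lines = f.readlines()
    # Guard n <= 0, because lines[:-0] would be an empty list.
    kept = lines[:-n] if n > 0 else lines
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.writelines(kept)
    return len(lines) - len(kept)
```

Calling it with the path of the downloaded CSV as `src` and a new path as `dst` leaves the original download untouched, so the step can be repeated if needed.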
Would like to congratulate you, sir! Really liked your passion for educating people about these cutting-edge technologies and for giving the whole picture rather than sticking to solving just one problem. Loved your work, and you inspired me to always try to give others what you have, as it will only come back to you. Once again, kudos to your passion and humanity.
Hats off to you, sir, for providing such a wonderful explanation and detailed analysis. :) Thank you once again.
Very nice and useful pointers. Thank you very much.
Sir, you are providing us great content, and free of cost too. I sincerely want to thank you for all the hard work that you are putting in.
Also, sir, could you please suggest some personal projects that we could take up to apply this knowledge?
Well explained in each code and scenario. Thanks a lot
Thank you so much sir, really respect what you are doing to help people that want to learn and make a career in data
Very, very useful playlist, thanks for sharing in-depth knowledge. I have a question: how do I use Spark with Snowflake, and how do I connect?
Great videos! Looking forward to more videos. Keep up the good work! :)
DAMN no DISLIKES thats so cool sirji
Hello Sir,
Just wanted to confirm: Spark is a framework that works on the principle of distributed datasets, and here we are using the pyspark library in the Databricks notebook to perform the EDA and data cleaning. Right?
Yes, right. This is a Databricks-specific feature. If you are running outside of Databricks, you can convert the Spark dataframe to pandas and do the EDA there. In that case all the data is brought to the driver node and is no longer distributed. I think I have covered this towards the end of the video as well.
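Because converting to pandas pulls the rows onto the driver, it can help to cap how many rows are collected. The sketch below assumes `spark_df` is a PySpark DataFrame and that pandas is installed on the driver. It relies only on the `limit()` and `toPandas()` DataFrame methods; the helper name `sample_to_pandas` and the default cap of 10,000 rows are made up for illustration.

```python
def sample_to_pandas(spark_df, max_rows=10_000):
    """Collect at most max_rows rows of a Spark DataFrame into a pandas DataFrame."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is still a lazy Spark operation; toPandas() is the call that
    # moves the selected rows onto the driver node.
    return spark_df.limit(max_rows).toPandas()
```

The resulting pandas DataFrame lives entirely in driver memory, so plotting libraries that expect pandas can be used on it, at the cost of seeing only the capped subset of rows.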
@AIEngineering - I would like to learn Spark, so I am following your "Mastering APACHE Spark" playlist. Am I right to understand that the 30 videos in the playlist are in the proper order? As the playlist progresses I see some MLOps videos as well, so I just want your help in understanding whether the order is correct. Thanks for your help with these tutorials.
Akshay, the first 8 videos are key if you are focusing on data engineering and big data. The videos from the ML one onwards are for understanding Spark for machine learning. You can pick whichever topic you are looking to learn.
Hi Sir,
Thanks for your time. At 18:32 you talk about creating the "Exposure" column. What is revol_bal (Revolving Balance)?
Is it (rev_util)% * Loan_amnt?
I ask because the statement below is throwing an error.
lc_df = lc_df.withColumn("exposure",when(lc_df.bad_loan=="No",col("revol_bal")).otherwise(-10*col("revol_bal")))
display(lc_df)
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`revol_bal`' given input columns:[.....................]
Please correct me if I'm wrong.
revol_bal is a column in the input dataset. Can you check whether that column is present in your data?
@AIEngineeringLife Got it, sir. I was able to include it in 'df.select()' and continue with the analysis. Thank you.
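A "cannot resolve" error like the one above usually means the column was dropped by an earlier select. A quick way to check before calling withColumn is to compare the columns you need against `lc_df.columns`, which PySpark exposes as a plain list of names. The helper below is a minimal sketch; the name `missing_columns` is made up for illustration.

```python
def missing_columns(available, required):
    """Return the names in `required` that are absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]
```

For the statement in this thread, `missing_columns(lc_df.columns, ["revol_bal", "bad_loan"])` would return `["revol_bal"]` if that column had been left out of the select. The comparison here is exact, whereas Spark resolves column names case-insensitively by default, so a name that differs only in case would be reported as missing by this helper even though Spark would accept it.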
Hello Sir,
It would be very helpful if you could make a dedicated video on how to prepare for a Data Engineering interview, with details of the topics and sub-topics.
I am a beginner and I want to move into the data engineering field. I have working experience with SQL and Python.
Sorry to trouble you with my silly doubt.
Thanks in advance.
Can you please explain your choice of Databricks Community Edition? Can we use it completely free, just like Colab?
royxss, yes, you can use Colab and run PySpark. I wanted to show the Databricks features for EDA and, in future videos, Scala Spark. I doubt we can run Scala on Colab. But to your point, PySpark can be run in Colab.
Thanks. Helpful.
Hi Sir,
If possible please explain the usage of " -,-" from line "sns.countplot(pd_df.loc[pd_df['total_acc']
Did you do it as in my notebook here? github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Lending%20Club%20EDA.ipynb
You can also leave it out, as it is just there to set the label and text location.
Sir, have you shared these notebooks anywhere, like GitHub?
How can I access them?
Check this repo out - github.com/srivatsan88/Mastering-Apache-Spark
Sir, can you rearrange the video sequence in this Mastering Apache Spark playlist? It would be good to get every video one after the other. Thank you.
Kamdeep, I have arranged it in this order: data analysis/cleaning, EDA, data engineering and then ML.
The only thing I did was move transformation part 2 to the end, as it is kind of optional. Do you suggest any change?
@AIEngineeringLife No, sir, no changes needed. I got your point. 🙂
Sir, can you recommend the best course to learn and apply all of these different stages of DATA SCIENCE using APACHE SPARK!! That would be a great help!!
Thank you!!
Karndeep, even if you go for a course, it will mostly cover toy datasets and just explain the functions. There are plenty online; some good ones can be found on the MapR website, but they are outdated. My plan when I started doing Spark videos was to cover the functions and apply them to real-world datasets like Lending Club and the Churn dataset. I have a lot of videos on Spark in my course below, and it is free:
ruclips.net/p/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
I will keep adding to it. If you feel I am missing something, and I think it might help, I can add it. Also, I cover all of Spark, not only data engineering: I have ML, and graph is coming in the future.
@AIEngineeringLife Thank you, sir. I am going through all your videos and they are helping me a lot.
Along with the videos, I would suggest keeping the Spark book open for reference. I find the Spark documentation a bit difficult to wade through.
Can you make a video on how to use Snowpark?
If I'm preparing for a DE interview, should I use Spark or pandas/matplotlib for data cleaning? Which one do you suggest?
You can use pandas if your data is not large and fits easily on your system. If you are dealing with a very large dataset that needs distributed computing, that's where Spark comes into play.
AIEngineering tq!!
hello sir, where can i find the dataset you used in this video??
It is over here:
drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing
Remove the last 2 lines from the CSV file.
thank you, you're such a life saver, your explanations on the vlogs answered all my questions
Where can I find the notebooks and code?
Lorenco, you can find the notebook here:
github.com/srivatsan88/RUclipsLI
@AIEngineeringLife Thank you, sir. Cheers! Keep up the good content!
Day 2 Crash Course : databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6601281371951374/1205982057485879/6899823951304183/latest.html