These tutorials are really awesome and to the point.
Thank you so much sir for providing such good content on Data Engineering!
Salute to you Srivatsan... for your effort.
Thank you Srivatsan, amazing sessions.. I am doing hands-on work in parallel with your sessions, which gives me confidence in this new skill. Please let us know if we can contribute to your work.
Great Govind.. :). Thanks for asking. I will reach out if I need any help on this :)
Valuable content right there. Thank you again, Srivatsan.
Thank you so much! Your content is very helpful.
Thanks so much sir, really great insights you have provided. After watching the whole video I have a question: when should we use Spark SQL and when should we use the Spark DataFrame API? I see both doing the same job.
Really nice explanation and real-time scenarios. I am an ETL developer, and this video seems perfect. It's not just the basics. Srivatsan, is there any way I can reach out to you for more understanding?
You can message me on LinkedIn in case you have any specific questions.
Thanks sir, very helpful 🙂
Thanks Sir, your courses are really helpful.
Can you please provide a link to the Airbnb files? We only have access to the 2021 files on the website, and that data is somewhat different from the data in your video.
What tool are you using in this video? Is it Azure Databricks? Great video by the way. I am new to data engineering.
This is Databricks.
Hello Sir,
While using a UDF in my code for processing I am getting the error below. I tried to import each SQL function separately instead of importing all as *, which I found on Stack Overflow, but it still did not help.
Can you suggest a fix?
code:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import regexp_replace
def trim_char(string):
    return regexp_replace(string, '\\$', " ")
spark.udf.register("trim_func", trim_char)
trim_func_udf = udf(trim_char)
df_listing.select("property_type","price",trim_func_udf("price").cast(FloatType()).alias("price_f")).groupBy("property_type").agg(({"price_f":"average"})).show()
Error:
jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
AttributeError: 'NoneType' object has no attribute '_jvm'
We need to see if regexp_replace is serializable as a UDF. Need to try it. Else it is better to use string-based functions like strip or replace.
Sir, I tried string.strip and string.replace but still get the same error.
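For readers hitting the same error: pyspark.sql.functions.regexp_replace builds a JVM column expression, so invoking it on plain Python strings inside a UDF fails with the '_jvm' AttributeError. A minimal pure-Python sketch, assuming the df_listing columns from the question (property_type, price):

import re

from pyspark.sql.functions import avg, udf
from pyspark.sql.types import FloatType

# Pure-Python UDF: strip "$" (and commas) with re.sub instead of
# pyspark.sql.functions.regexp_replace, which only works on Column objects.
@udf(returnType=FloatType())
def price_to_float(s):
    return float(re.sub(r"[$,]", "", s)) if s else None

(df_listing
    .select("property_type", price_to_float("price").alias("price_f"))
    .groupBy("property_type")
    .agg(avg("price_f"))
    .show())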
Hi Srivatsan,
Nice tutorial. Would you be able to point me to some practical use cases (small and big) or exercises for hands-on practice?
Ravindranath.. You can try playing around with datasets like Lending Club and Amazon reviews and see. You can define your own analysis criteria here.. What products are better accepted or disregarded by customers in the case of Amazon, or why do Lending Club customers default.
@AIEngineeringLife Thank you so much for the response. I will check out the datasets you pointed to. I really appreciate it.
Thanks Srivatsan, good sessions. Could you please share the notebooks from the video, if possible? Thanks again!
You can find the notebooks for my entire free Spark course here - github.com/srivatsan88/Mastering-Apache-Spark
Hello sir, does the programming we are doing here, be it DataFrame or SQL, come under Python programming? Because mostly we are executing SQL commands.
The SQL syntax is similar to how you use it in other ecosystems, but the backend execution engine in Spark is different, as it executes the commands on distributed datasets.
@AIEngineeringLife Is what you have covered in these two data engineering videos on the Airbnb files enough to understand PySpark and SQL? Also, when I go idle for more than 2 hours the cluster gets detached, and recently the cluster also got deleted and I had to create a new one. Is there anything I can do to avoid that?
@harshithag5769 .. Databricks stops the cluster if it is not used for more than 2 hours. It is easy to spin up and attach the cluster to your notebook again; there is no way around this.. Yes, it is enough, and also check my PySpark functions video to learn more functions.
Hi sir, I just had one question: I am able to do 90% of the transformations with Spark SQL. Is that a good approach when dealing with data engineering specifically, or is it advised to use PySpark functions for the transformations?
Btw, great content!
You can very much use Spark SQL if it solves your problem. Spark SQL compiles to the same DAG as the PySpark functions. In some cases PySpark functions make it easier to write multi-level transformations than multiple SQLs. In many cases I use a hybrid approach of both SQL and functions in my programs. My next part on this will cover Spark SQL.
Thanks for the quick response!
And right, a hybrid seems like the right approach.
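To illustrate the point above that Spark SQL and the DataFrame API compile to the same DAG, a small self-contained sketch (using a synthetic table, not the Airbnb data); explain() prints near-identical physical plans for both:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic stand-in table registered as a temp view.
df = spark.range(100).withColumnRenamed("id", "price")
df.createOrReplaceTempView("listing")

# The same aggregation expressed two ways.
sql_df = spark.sql(
    "SELECT price % 10 AS bucket, COUNT(*) AS n FROM listing GROUP BY price % 10"
)
api_df = df.selectExpr("price % 10 AS bucket").groupBy("bucket").count()

# Both compile down to equivalent physical plans.
sql_df.explain()
api_df.explain()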
Is there a way to find dependent views on a table in Databricks?
Ashwin.. I have not tried it, but have you checked the metastore that you have? - docs.databricks.com/data/metastores/index.html
I am not able to run the following; I am getting the error "column object is not callable". Can you kindly help? Also, where can I get your code for reference?
from pyspark.sql.functions import isnan,when,count,col
# COMMAND ----------
#df_listing.show()
display(df_listing.select([count(when(isnan(c) | col(c).isnull(),c)).alias(c) for c in df_listing.columns]))
The function call itself looks fine to me; maybe there is an issue with some earlier modification to the data frame. Here is my code - github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Data%20Engineering%20-%20Part%202.ipynb
Thanks so much.
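For reference, a self-contained version of the null-count pattern from the question, run on a toy DataFrame (hypothetical data, not the Airbnb files). One caveat worth knowing: isnan() only resolves on float/double columns, so the expression is built per column type:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, isnan, when

spark = SparkSession.builder.getOrCreate()

# Toy frame standing in for df_listing.
df = spark.createDataFrame(
    [("apt", 120.0), ("house", None), (None, float("nan"))],
    ["property_type", "price"],
)

# Count nulls (and NaNs for float/double columns) per column.
exprs = [
    count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    if t in ("float", "double")
    else count(when(col(c).isNull(), c)).alias(c)
    for c, t in df.dtypes
]
df.select(exprs).show()  # property_type: 1, price: 2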
Thank you Srivatsan for the wonderful content. This set of videos is extremely insightful. Having gone through a bunch of Data Engineering courses, I can vouch that this course is the best in terms of the range of concepts covered, difficulty level, pace, and clarity of understanding. Doing the hands-on in parallel is extremely helpful and solidifies understanding of the concepts. Just a small thing: the audio quality in some parts of the videos is a little jittery. Other than that, great content. Keep it coming. Love your posts on LinkedIn too!
Thanks Raghuram.. Yes, sometimes the audio quality deteriorates, and I am trying my best to fix it, but in long videos I am not able to remove the patchy noise in between.. Will try to get better as I learn more about audio processing :)
Thank you. It works.
Sir, why are we creating a function to convert string to DateTime format when we can instead use the cast function: df_review_dt = df_review_fil.withColumn('datetime', F.col('date').cast('date'))
Yes you can, Arvind. I just wanted to demonstrate Spark UDFs here.
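Both routes side by side, as a minimal sketch (the df_review_fil frame and its ISO-formatted 'date' column are assumptions based on the question):

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DateType

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for df_review_fil with a date stored as a string.
df_review_fil = spark.createDataFrame([("2020-01-15",)], ["date"])

# 1) Built-in cast, as suggested in the question.
df_cast = df_review_fil.withColumn("datetime", F.col("date").cast("date"))

# 2) Equivalent UDF, as demonstrated in the video, using Python's
#    datetime so it serializes cleanly to the executors.
to_date_udf = F.udf(lambda s: datetime.strptime(s, "%Y-%m-%d").date(), DateType())
df_udf = df_review_fil.withColumn("datetime", to_date_udf("date"))

df_cast.show()
df_udf.show()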
Hi sir, can you make the next video echo-free? Thank you for the valuable info.
Thanks Ampolu. Is it very bad? I noticed it for a few seconds in between. The problem is that it happens when I record a long video in one stretch; otherwise I would have to do multiple short parts.
💖👍
I was able to complete the Data Engineering setup of the AWS S3 bucket with three tables successfully, and I have created one more notebook and am trying to access the listing table. It's throwing an error.
code
%sql
select * from listing
Error
File "", line 5
select * from listing
^
SyntaxError: invalid syntax
Not sure why. How do I troubleshoot this error?
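A hedged note on the likely cause: in Databricks the %sql magic must be the very first line of its own cell; if the cell is still running as Python, the interpreter tries to parse the SQL text itself and raises exactly this SyntaxError. The equivalent query from a Python cell would be:

# Assumes the listing table was registered in the metastore as in the video.
spark.sql("SELECT * FROM listing").show(5)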
Sir, is there a way to contact you regarding some doubts and career questions, e.g. WhatsApp or Telegram? And thank you for the amazing content, sir.