These tutorials are really awesome and to the point.
Thank you so much sir for providing such good content on Data Engineering!
Salute to you Srivatsan... for your effort.
Thank you Srivatsan, amazing sessions.. I am doing hands-on work in parallel with your sessions, which gives me confidence in this new skill. Please let us know if we can contribute to your work.
Great Govind.. :). Thanks for asking. I will reach out if I need any help on this :)
Valuable content right there. Thank you again, Srivatsan.
Thank you so much! Your content is very helpful.
Thanks so much sir, really great insights you have provided. After watching the whole video I have a question: when should we use Spark SQL and when should we use the Spark DataFrame API? I see both doing the same job.
Really nice explanation and real-time scenarios. I am an ETL developer, and this video seems perfect. It's not just the basics. Srivatsan, is there any way I can reach out to you for more understanding?
You can message me on LinkedIn in case you have any specific questions.
Thanks sir, very helpful 🙂
Thanks Sir, your courses are really helpful.
Can you please provide a link to the Airbnb files? We only have access to the 2021 files on the website, and that data is somewhat different from the data in your video.
What tool are you using in this video? Is it Azure Databricks? Great video by the way. I am new to data engineering.
This is Databricks.
Hello Sir,
While using a UDF in my code for processing I am getting the error below. I tried to import each SQL function separately instead of importing all as *, which I found on Stack Overflow, but it still did not help.
Can you suggest a fix?
code:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import regexp_replace
def trim_char(string):
    return regexp_replace(string, '\\$', " ")
spark.udf.register("trim_func", trim_char)
trim_func_udf = udf(trim_char)
df_listing.select("property_type","price",trim_func_udf("price").cast(FloatType()).alias("price_f")).groupBy("property_type").agg(({"price_f":"average"})).show()
Error:
jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
AttributeError: 'NoneType' object has no attribute '_jvm'
We need to see if regexp_replace is serializable as a UDF. Need to try it. Else it is better to use string-based functions like strip or replace.
Sir, I tried string.strip and string.replace but still get the same error.
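For readers hitting the same error: pyspark.sql.functions.regexp_replace builds a JVM column expression, so invoking it on plain Python strings inside a UDF fails with the '_jvm' AttributeError. A minimal pure-Python sketch, assuming the df_listing columns from the question (property_type, price):

import re

from pyspark.sql.functions import avg, udf
from pyspark.sql.types import FloatType

# Pure-Python UDF: strip "$" (and commas) with re.sub instead of
# pyspark.sql.functions.regexp_replace, which only works on Column objects.
@udf(returnType=FloatType())
def price_to_float(s):
    return float(re.sub(r"[$,]", "", s)) if s else None

(df_listing
    .select("property_type", price_to_float("price").alias("price_f"))
    .groupBy("property_type")
    .agg(avg("price_f"))
    .show())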
Hi Srivatsan,
Nice tutorial. Would you be able to point me to some practical use cases (small and big) or exercises for hands-on practice?
Ravindranath.. You can try playing around with datasets like Lending Club and Amazon reviews and see. You can define your own analysis criteria here.. What products are better accepted or disregarded by customers in the case of Amazon, or why do Lending Club customers default.
@AIEngineeringLife Thank you so much for the response. I will check out the datasets you pointed to. I really appreciate it.
Thanks Srivatsan, good sessions. Could you please share the notebooks from the video, if possible? Thanks again!
You can find the notebooks for my entire free Spark course here - github.com/srivatsan88/Mastering-Apache-Spark
Hello sir, does the programming we are doing here, be it DataFrame or SQL, come under Python programming? Because mostly we are executing SQL commands.
The SQL syntax is similar to how you use it in other ecosystems, but the backend execution engine in Spark is different, as it executes the commands on distributed datasets.
@AIEngineeringLife Is what you have covered in these two data engineering videos on the Airbnb files enough to understand PySpark and SQL? Also, when I go idle for more than 2 hours the cluster gets detached, and recently the cluster also got deleted and I had to create a new one. Is there anything I can do to avoid that?
@harshithag5769 .. Databricks stops the cluster if it is not used for more than 2 hours. It is easy to spin up and attach the cluster to your notebook again; there is no way around this.. Yes, it is enough, and also check my PySpark functions video to learn more functions.
Hi sir, I just had one question: I am able to do 90% of the transformations with Spark SQL. Is that a good approach when dealing with data engineering specifically, or is it advised to use PySpark functions for the transformations?
Btw, great content!
You can very much use Spark SQL if it solves your problem. Spark SQL compiles to the same DAG as the PySpark functions. In some cases PySpark functions make it easier to write multi-level transformations than multiple SQLs. In many cases I use a hybrid approach of both SQL and functions in my programs. My next part on this will cover Spark SQL.
Thanks for the quick response!
And right, a hybrid seems like the right approach.
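To illustrate the point above that Spark SQL and the DataFrame API compile to the same DAG, a small self-contained sketch (using a synthetic table, not the Airbnb data); explain() prints near-identical physical plans for both:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic stand-in table registered as a temp view.
df = spark.range(100).withColumnRenamed("id", "price")
df.createOrReplaceTempView("listing")

# The same aggregation expressed two ways.
sql_df = spark.sql(
    "SELECT price % 10 AS bucket, COUNT(*) AS n FROM listing GROUP BY price % 10"
)
api_df = df.selectExpr("price % 10 AS bucket").groupBy("bucket").count()

# Both compile down to equivalent physical plans.
sql_df.explain()
api_df.explain()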
Is there a way to find dependent views on a table in Databricks?
Ashwin.. I have not tried it, but have you checked the metastore that you have? - docs.databricks.com/data/metastores/index.html
I am not able to run the following; I am getting the error "column object is not callable". Can you kindly help? Also, where can I get your code for reference?
from pyspark.sql.functions import isnan,when,count,col
# COMMAND ----------
#df_listing.show()
display(df_listing.select([count(when(isnan(c) | col(c).isnull(),c)).alias(c) for c in df_listing.columns]))
The function call itself looks fine to me; maybe there is an issue with some earlier modification to the data frame. Here is my code - github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Data%20Engineering%20-%20Part%202.ipynb
Thanks so much.
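For reference, a self-contained version of the null-count pattern from the question, run on a toy DataFrame (hypothetical data, not the Airbnb files). One caveat worth knowing: isnan() only resolves on float/double columns, so the expression is built per column type:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, isnan, when

spark = SparkSession.builder.getOrCreate()

# Toy frame standing in for df_listing.
df = spark.createDataFrame(
    [("apt", 120.0), ("house", None), (None, float("nan"))],
    ["property_type", "price"],
)

# Count nulls (and NaNs for float/double columns) per column.
exprs = [
    count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    if t in ("float", "double")
    else count(when(col(c).isNull(), c)).alias(c)
    for c, t in df.dtypes
]
df.select(exprs).show()  # property_type: 1, price: 2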
Thank you Srivatsan for the wonderful content. This set of videos is extremely insightful. Having gone through a bunch of Data Engineering courses, I can vouch that this course is the best in terms of the range of concepts covered, difficulty level, pace, and clarity of understanding. Doing the hands-on in parallel is extremely helpful and solidifies understanding of the concepts. Just a small thing: the audio quality in some parts of the videos is a little jittery. Other than that, great content. Keep it coming. Love your posts on LinkedIn too!
Thanks Raghuram.. Yes, sometimes the audio quality deteriorates, and I am trying my best to fix it, but in long videos I am not able to remove the patchy noise in between.. Will try to get better as I learn more about audio processing :)
Thank you. It works.
Sir, why are we creating a function to convert string to DateTime format when we can instead use the cast function: df_review_dt = df_review_fil.withColumn('datetime', F.col('date').cast('date'))
Yes you can, Arvind. I just wanted to demonstrate Spark UDFs here.
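Both routes side by side, as a minimal sketch (the df_review_fil frame and its ISO-formatted 'date' column are assumptions based on the question):

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DateType

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for df_review_fil with a date stored as a string.
df_review_fil = spark.createDataFrame([("2020-01-15",)], ["date"])

# 1) Built-in cast, as suggested in the question.
df_cast = df_review_fil.withColumn("datetime", F.col("date").cast("date"))

# 2) Equivalent UDF, as demonstrated in the video, using Python's
#    datetime so it serializes cleanly to the executors.
to_date_udf = F.udf(lambda s: datetime.strptime(s, "%Y-%m-%d").date(), DateType())
df_udf = df_review_fil.withColumn("datetime", to_date_udf("date"))

df_cast.show()
df_udf.show()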
Hi sir, can you make the next video echo-free? Thank you for the valuable info.
Thanks Ampolu. Is it very bad? I noticed it for a few seconds in between. The problem is that it happens when I record a long video in one stretch; otherwise I would have to do multiple short parts.
💖👍
I was able to complete the Data Engineering setup of the AWS S3 bucket with three tables successfully, and I have created one more notebook and am trying to access the listing table. It's throwing an error.
code
%sql
select * from listing
Error
File "", line 5
select * from listing
^
SyntaxError: invalid syntax
Not sure why. How do I troubleshoot this error?
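A hedged note on the likely cause: in Databricks the %sql magic must be the very first line of its own cell; if the cell is still running as Python, the interpreter tries to parse the SQL text itself and raises exactly this SyntaxError. The equivalent query from a Python cell would be:

# Assumes the listing table was registered in the metastore as in the video.
spark.sql("SELECT * FROM listing").show(5)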
Sir, is there a way to contact you regarding some doubts and career questions, e.g. WhatsApp or Telegram? And thank you for the amazing content, sir.