Data Engineering - Setting up AWS S3 and Getting Data into Spark - Part 1
- Published: 10 Feb 2025
- #datascience #dataengineering #apachespark
In this video we will be setting up an Amazon S3 bucket, setting the right permissions, and connecting to it from Databricks. We will also look at some common challenges when getting data into Spark. In the next video we will go through the details of various data engineering functions. You can also watch my detailed videos on data collection to understand the process of collecting data across multiple systems of engagement:
Data Collection Part 1 - Data Architecture - • Data Collection - Data...
Data Collection Part 2 - Centralizing data assets - • Data Collection - Cen...
Data Collection Part 3 - • Data Collection - Summ...
Through this video you have shared invaluable knowledge, and quite beautifully. Thank you so much.
I haven't reached the end of this video, but I just want to say it's fantastic - thank you all the way from London
Thank you Carlton for sending this note :)
Plain, simple and too good. Fantastic.
Superb. At 4.2, 8.29, and 10.41 there is some echo, but I got so much knowledge. Thank you, sir.
Thank you for your clear videos. Looking forward to seeing more real-life scenario data engineering pipeline examples.
Ezhil.. Have you seen my Mastering Spark playlist that has details on DE pipelines? If you are looking for anything specific let me know - ruclips.net/p/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO
Thank you so much for this content!
Super Sir. Thanks for this video very helpful
Very much useful! Good explanation :)
Really informative video, thank you. Is there a way to connect with you personally to discuss data engineering as a career and an end-to-end learning path?
superb ....
Hi, awesome video.
Could you show another video integrating AWS with PySpark? As in, some more real-time use cases similar to this one.
Very nice. Can you make a video on the next part, where you show which parameters you would monitor for the same business case, with code ✌✌👌
Nice video. Could you please also make a video on how we can read data from S3 into a notebook that is running on EC2?
Manas.. I am planning to do it once I get to the cloud section of my course
ruclips.net/p/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
Will get there in some time soon
When you save the data as a table in parquet format, is it stored at the AWS S3 level or the Databricks level?
Ashwin.. It will be in S3, as my S3 bucket is mounted onto Databricks now
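For anyone replicating this, a minimal sketch of what that write looks like, assuming the /mnt/s3data mount from this video and placeholder file and table names:

# Read a CSV through the mount and write it back as parquet; because /mnt/s3data
# is just a pointer to the bucket, the parquet files land in S3 itself
df = spark.read.option("header", "true").csv("/mnt/s3data/listings.csv")
df.write.mode("overwrite").parquet("/mnt/s3data/parquet/listings")

# Or register it as a table whose underlying files stay at the mounted S3 path
df.write.mode("overwrite").option("path", "/mnt/s3data/parquet/listings_tbl").saveAsTable("listings_parquet")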
Thanks for sharing great knowledge.
Could you please share the code for reference, or a GitHub link to this Databricks notebook if there is one?
You can check it here - github.com/srivatsan88/Mastering-Apache-Spark
Can you show us a way to use a role instead of credentials, from EC2 to S3 using PySpark?
Abhishek.. I can, but that is not possible in the Community Edition of Databricks to my knowledge; only in the Enterprise Edition can we set it up. I will try to show it within AWS, or cross-cloud, in one of my videos
@AIEngineering Can you please explain what you mean by Community and Enterprise Edition? I have an EC2 instance with Spark (PySpark) set up and I am trying to access a file in the S3 bucket by using roles. Can you please show us that?
The video you commented under uses Databricks, so I thought you were referring to Databricks. If it is EC2 that should not be that complex; I will show it. On a side note, any reason for going with EC2 and not EMR-based Spark?
@AIEngineering We are planning to test on a standalone EC2 before we run on EMR, and also to avoid extra cost from having EMR running in the dev environment
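For readers asking about roles on EC2, a minimal sketch, assuming a standalone PySpark install with the matching hadoop-aws and AWS SDK jars on the classpath and an IAM instance profile attached to the EC2 instance (bucket and path are placeholders; this is not covered in the video):

from pyspark.sql import SparkSession

# Use the EC2 instance profile (IAM role) instead of access keys for s3a access
spark = (
    SparkSession.builder
    .appName("s3-role-read")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

# The role only needs s3:GetObject / s3:ListBucket on the bucket being read
df = spark.read.option("header", "true").csv("s3a://my-bucket/data/listings.csv")
df.show(5)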
Hi sir,
I have a doubt. My CSV file has two columns that contain "," inside the text. Since it is a CSV file, every "," is treated as a new column, which shifts the column types and leaves null values. How can I solve this issue?
In the Spark DataFrame reader options, have you tried the quote option?
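A minimal sketch of those reader options, with a placeholder path; the fields containing commas need to be wrapped in quotes in the source file:

df = (
    spark.read
    .option("header", "true")
    .option("quote", '"')          # keep commas that appear inside quoted fields
    .option("escape", '"')         # handle double quotes embedded in quoted fields
    .option("multiLine", "true")   # optional: tolerate newlines inside quoted fields
    .csv("/mnt/s3data/listings.csv")
)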
I have tried to connect to S3, and if your secret access key has a "/" in it, replace it with %2F or you will not be able to connect to S3
Hello Sir, just a doubt. Suppose I have multiple files for multiple users in an S3 bucket, and I just need to load these files from the bucket, convert them into a DataFrame, do some filtering, and return the response to a dashboard. For this, do I need to create multiple SparkSessions for the multiple files, or is loading the files into different DataFrames in a single SparkSession okay? Please help me out.
I am getting confused about whether to use multiple Spark sessions or not.
If all the files are of the same format then you can move them to a single S3 location and read all the files together. Even if they are in separate S3 buckets, you can create multiple DataFrame objects in a single session
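A minimal sketch of that single-session approach, with placeholder paths and column names:

# One SparkSession, several DataFrames; a directory path reads every file under it
users_df = spark.read.option("header", "true").csv("/mnt/s3data/users/")
orders_df = spark.read.option("header", "true").csv("/mnt/s3data/orders.csv")

# Filter and hand a small result back to the dashboard layer
result = users_df.filter(users_df["status"] == "active").limit(100).toPandas()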
@@AIEngineeringLife Thank you, Sir. Also, can multiple users process different datasets at the same time in a single Spark session?
Hello sir, when you say it gets mounted on s3data, does the data get copied to our local drive? I don't see any data getting copied in Databricks
It is more of a passthrough, so the mount is logical. You can query your local DBFS path and it will retrieve the data from S3
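In other words, nothing lands in DBFS; a read through the mount point pulls the objects from S3 at query time (the file name below is a placeholder):

df = spark.read.option("header", "true").csv("/mnt/s3data/reviews.csv")
df.count()   # computed against data streamed from S3, not a local copy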
@@AIEngineeringLife Sir, can you please upload the listings and reviews .csv files to GitHub? I don't see them in your GitHub repo
@@jaswantthadi5312 .. They must be on the Airbnb dataset site shown in my video above. Are you not able to download them?
Hello sir, import urllib is throwing an error. Do we need to add external libraries to the cluster?
Yes.. just pip install the package if it is not there. What error are you seeing?
Knowledge is power and you showed it to us. Just upload the data to a public S3 bucket and share the URL, or I will sync it to my bucket, of course with your IAM permission
AIwaya.. Thank you, and this is the Airbnb dataset; you can download the three files from here:
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/listings.csv.gz
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/reviews.csv.gz
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/visualisations/neighbourhoods.csv
Just unzip the first 2
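If it helps, a minimal sketch for downloading and unzipping one of those files in Python (the https scheme is an assumption):

import gzip, shutil, urllib.request

url = "https://data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/listings.csv.gz"
urllib.request.urlretrieve(url, "listings.csv.gz")

# Decompress the gzip archive into a plain CSV
with gzip.open("listings.csv.gz", "rb") as src, open("listings.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)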
Can you explain why you prefer AWS with Databricks instead of Azure, given that Azure and Databricks have partnered?
royxss.. It can be anything: Azure Blob Storage or Google GCS. I just took AWS S3 as the example here :)
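For reference, a minimal sketch of the equivalent mount against Azure Blob Storage instead of S3 (account, container, and key are placeholders):

dbutils.fs.mount(
    source="wasbs://mycontainer@myaccount.blob.core.windows.net",
    mount_point="/mnt/blobdata",
    extra_configs={"fs.azure.account.key.myaccount.blob.core.windows.net": "<storage-account-key>"}
)
display(dbutils.fs.ls("/mnt/blobdata"))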
@@AIEngineeringLife Thanks for answering my question. I would like to ask for another piece of advice: I want to play with Kafka and Spark Streaming, and then possibly MLlib. Can you suggest a use case with a source system to try out?
@@royxss Streaming sources are difficult to get, but you can write your own scraping component. Let me see if there are some good use cases and update you
@@AIEngineeringLife Thanks a lot. Maybe a video of the pipeline in the future would be awesome :-D. Just an idea!
Hello Sir,
Great and structured explanation!
I am getting an error while executing the code. Please help me resolve it:
:5: error: . expected
import urllib
Adding the "%python" magic command to the cell resolved the issue.
Where are you using it? Databricks? In Databricks, Python is the default kernel unless changed. Can you please confirm?
@@AIEngineeringLife Yeah, initially I was using Scala; later I found that it should be Python.
Thank you very much for the support, sir!❤️❤️
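Following up on the "%python" fix mentioned above, a minimal sketch of what that cell looks like in a Databricks notebook whose default language is Scala:

%python
# The magic command makes this one cell run as Python; without it,
# "import urllib" is parsed as Scala and fails with the ". expected" error above
import urllib.parse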
The videos are really good. I am trying to replicate the same in my account, but I am having trouble getting the files from S3 into Databricks. I am getting an error with display(dbutils.fs.ls('/mnt/s3data')). Error message: the authorization mechanism you have provided is not supported.
Gururajan.. Can you check whether you have set the S3 policy for the bucket in AWS, and also that the code is similar to the below with your credentials?
import urllib

# Fill in your own AWS credentials and bucket name
ACCESS_KEY = ""
SECRET_KEY = ""
# URL-encode the secret key so characters like "/" do not break the mount URL
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = ""
MOUNT_NAME = "s3data"
# Mount the bucket so it is reachable under /mnt/s3data in DBFS
dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
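A quick way to verify the mount right after running the code above; if an earlier failed attempt left a stale mount behind, unmount it first and rerun the mount command:

display(dbutils.fs.ls("/mnt/s3data"))   # should list the objects in the bucket
# dbutils.fs.unmount("/mnt/s3data")     # only to clean up a broken mount before retrying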
@@AIEngineeringLife The S3 bucket has AmazonS3FullAccess and I have used the same code as above, but I am getting an error when running display(dbutils.fs.ls('/mnt/s3data')):
Error: java.rmi.RemoteException: com.databricks.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/' XML Error Message: InvalidRequest: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.
@@gururajangovindan7766 The error is related to the region in which you created the S3 bucket. I faced the same issue; after creating the S3 bucket in a US region I was able to access it.
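As an alternative to recreating the bucket, a hedged sketch using the newer s3a connector and pointing it at the bucket's regional endpoint so Signature Version 4 requests are accepted (region and bucket name are placeholders; this differs from the s3n mount used in the video):

# Point the s3a connector at the bucket's region before reading directly over s3a
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
df = spark.read.option("header", "true").csv("s3a://my-bucket/listings.csv")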
ravindranath oruganti Thank you. Need to check the region details
@@gururajangovindan7766 Did you solve this issue? It still appears for me even though the region is US
Can someone tell me where we open the Spark environment in AWS for the bucket that has been created?
Divendu.. It must be in my video... if you go to the AWS IAM section you can see that option
@@AIEngineeringLife which video is it?
Can you share the notebook files sir
How can we automate this script?
Any git repo for this?