Data Engineering - Setting up AWS S3 and Getting Data into Spark - Part 1
- Published: 10 Feb 2025
- #datascience #dataengineering #apachespark
In this video we will be setting up an Amazon S3 bucket, setting the right permissions, and connecting to it from Databricks. We will also look at some common challenges when getting data into Spark. In the next video we will go through the details of various data engineering functions. You can also watch my detailed videos on data collection to understand the process of collecting data across multiple systems of engagement:
Data Collection Part 1 - Data Architecture - • Data Collection - Data...
Data Collection Part 2 - Centralizing data assets - • Data Collection - Cen...
Data Collection Part 3 - • Data Collection - Summ...
Through this video you have shared invaluable knowledge, and quite beautifully. Thank you so much.
I haven't reached the end of this video, but I just want to say it's fantastic - thank you all the way from London
Thank you Carlton for sending this note :)
Plain, simple and too good. Fantastic.
Superb. At 4.2, 8.29, and 10.41 there is some echo, but I got so much knowledge. Thank you, sir.
Thank you for your clear videos. Looking forward to seeing more real-life scenario data engineering pipeline examples.
Ezhil.. Have you seen my Mastering Spark playlist that has details on DE pipelines? If you are looking for anything specific let me know - ruclips.net/p/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO
Thank you so much for this content!
Super Sir. Thanks for this video very helpful
Very much useful! Good explanation :)
Really informative video, thank you. Is there a way to connect with you personally to discuss data engineering as a career and an end-to-end learning path?
superb ....
Hi, awesome video.
Could you show another video integrating AWS with PySpark? As in, some more real-time use cases similar to this one.
Very nice. Can you make a video on the next part, where you show which parameters you would monitor for the same business case, with code ✌✌👌
Nice video. Could you please also make a video on how we can read data from S3 into a notebook that is running on EC2?
Manas.. I am planning to do it once I get to the cloud section of my course
ruclips.net/p/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
Will get there in some time soon
When you save the data as a table in parquet format, is it stored at the AWS S3 level or the Databricks level?
Ashwin.. It will be in S3, as my S3 bucket is mounted onto Databricks now
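For anyone replicating this, a minimal sketch of what that write looks like, assuming the /mnt/s3data mount from this video and placeholder file and table names:

# Read a CSV through the mount and write it back as parquet; because /mnt/s3data
# is just a pointer to the bucket, the parquet files land in S3 itself
df = spark.read.option("header", "true").csv("/mnt/s3data/listings.csv")
df.write.mode("overwrite").parquet("/mnt/s3data/parquet/listings")

# Or register it as a table whose underlying files stay at the mounted S3 path
df.write.mode("overwrite").option("path", "/mnt/s3data/parquet/listings_tbl").saveAsTable("listings_parquet")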
Thanks for sharing great knowledge.
Could you please share the code for reference, or a GitHub link to this Databricks notebook if there is one?
You can check it here - github.com/srivatsan88/Mastering-Apache-Spark
Can you show us a way to use a role instead of credentials, from EC2 to S3 using PySpark?
Abhishek.. I can, but that is not possible in the Community Edition of Databricks to my knowledge; only in the Enterprise Edition can we set it up. I will try to show it within AWS, or cross-cloud, in one of my videos
@AIEngineering Can you please explain what you mean by Community and Enterprise Edition? I have an EC2 instance with Spark (PySpark) set up and I am trying to access a file in the S3 bucket by using roles. Can you please show us that?
The video you commented under uses Databricks, so I thought you were referring to Databricks. If it is EC2 that should not be that complex; I will show it. On a side note, any reason for going with EC2 and not EMR-based Spark?
@AIEngineering We are planning to test on a standalone EC2 before we run on EMR, and also to avoid extra cost from having EMR running in the dev environment
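For readers asking about roles on EC2, a minimal sketch, assuming a standalone PySpark install with the matching hadoop-aws and AWS SDK jars on the classpath and an IAM instance profile attached to the EC2 instance (bucket and path are placeholders; this is not covered in the video):

from pyspark.sql import SparkSession

# Use the EC2 instance profile (IAM role) instead of access keys for s3a access
spark = (
    SparkSession.builder
    .appName("s3-role-read")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

# The role only needs s3:GetObject / s3:ListBucket on the bucket being read
df = spark.read.option("header", "true").csv("s3a://my-bucket/data/listings.csv")
df.show(5)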
Hi sir,
I have a doubt. My CSV file has two columns that contain "," inside the text. Since it is a CSV file, every "," is treated as a new column, which shifts the column types and leaves null values. How can I solve this issue?
In the Spark DataFrame reader options, have you tried the quote option?
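A minimal sketch of those reader options, with a placeholder path; the fields containing commas need to be wrapped in quotes in the source file:

df = (
    spark.read
    .option("header", "true")
    .option("quote", '"')          # keep commas that appear inside quoted fields
    .option("escape", '"')         # handle double quotes embedded in quoted fields
    .option("multiLine", "true")   # optional: tolerate newlines inside quoted fields
    .csv("/mnt/s3data/listings.csv")
)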
I have tried to connect to S3, and if your secret access key has a "/" in it, replace it with %2F or you will not be able to connect to S3
Hello Sir, just a doubt. Suppose I have multiple files for multiple users in an S3 bucket, and I just need to load these files from the bucket, convert them into a DataFrame, do some filtering, and return the response to a dashboard. For this, do I need to create multiple SparkSessions for the multiple files, or is loading the files into different DataFrames in a single SparkSession okay? Please help me out.
I am getting confused about whether to use multiple Spark sessions or not.
If all the files are of the same format then you can move them to a single S3 location and read all the files together. Even if they are in separate S3 buckets, you can create multiple DataFrame objects in a single session
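A minimal sketch of that single-session approach, with placeholder paths and column names:

# One SparkSession, several DataFrames; a directory path reads every file under it
users_df = spark.read.option("header", "true").csv("/mnt/s3data/users/")
orders_df = spark.read.option("header", "true").csv("/mnt/s3data/orders.csv")

# Filter and hand a small result back to the dashboard layer
result = users_df.filter(users_df["status"] == "active").limit(100).toPandas()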
@@AIEngineeringLife Thank you, Sir. Also, can multiple users process different datasets at the same time in a single Spark session?
Hello sir, when you say it gets mounted on s3data, does the data get copied to our local drive? I don't see any data getting copied in Databricks
It is more of a passthrough, so the mount is logical. You can query your local DBFS path and it will retrieve the data from S3
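In other words, nothing lands in DBFS; a read through the mount point pulls the objects from S3 at query time (the file name below is a placeholder):

df = spark.read.option("header", "true").csv("/mnt/s3data/reviews.csv")
df.count()   # computed against data streamed from S3, not a local copy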
@@AIEngineeringLife Sir, can you please upload the listings and reviews .csv files to GitHub? I don't see them in your GitHub repo
@@jaswantthadi5312 .. They must be on the Airbnb dataset site shown in my video above. Are you not able to download them?
Hello sir, import urllib is throwing an error. Do we need to add external libraries to the cluster?
Yes.. just pip install the package if it is not there. What error are you seeing?
Knowledge is power and you showed it to us. Just upload the data to a public S3 bucket and share the URL, or I will sync it to my bucket, of course with your IAM permission
AIwaya.. Thank you, and this is the Airbnb dataset; you can download the three files from here:
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/listings.csv.gz
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/reviews.csv.gz
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/visualisations/neighbourhoods.csv
Just unzip the first 2
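If it helps, a minimal sketch for downloading and unzipping one of those files in Python (the https scheme is an assumption):

import gzip, shutil, urllib.request

url = "https://data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/listings.csv.gz"
urllib.request.urlretrieve(url, "listings.csv.gz")

# Decompress the gzip archive into a plain CSV
with gzip.open("listings.csv.gz", "rb") as src, open("listings.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)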
Can you explain why you prefer AWS with Databricks instead of Azure, given that Azure and Databricks have partnered?
royxss.. It can be anything: Azure Blob Storage or Google GCS. I just took AWS S3 as the example here :)
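For reference, a minimal sketch of the equivalent mount against Azure Blob Storage instead of S3 (account, container, and key are placeholders):

dbutils.fs.mount(
    source="wasbs://mycontainer@myaccount.blob.core.windows.net",
    mount_point="/mnt/blobdata",
    extra_configs={"fs.azure.account.key.myaccount.blob.core.windows.net": "<storage-account-key>"}
)
display(dbutils.fs.ls("/mnt/blobdata"))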
@@AIEngineeringLife Thanks for answering my question. I would like to ask for another piece of advice: I want to play with Kafka and Spark Streaming, and then possibly MLlib. Can you suggest a use case with a source system to try out?
@@royxss Streaming sources are difficult to get, but you can write your own scraping component. Let me see if there are some good use cases and update you
@@AIEngineeringLife Thanks a lot. Maybe a video of the pipeline in the future would be awesome :-D. Just an idea!
Hello Sir,
Great and structured explanation!
I am getting an error while executing the code. Please help me resolve it:
:5: error: . expected
import urllib
Adding the "%python" magic command to the cell resolved the issue.
Where are you using it? Databricks? In Databricks, Python is the default kernel unless changed. Can you please confirm?
@@AIEngineeringLife Yeah, initially I was using Scala; later I found that it should be Python.
Thank you very much for the support, sir!❤️❤️
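Following up on the "%python" fix mentioned above, a minimal sketch of what that cell looks like in a Databricks notebook whose default language is Scala:

%python
# The magic command makes this one cell run as Python; without it,
# "import urllib" is parsed as Scala and fails with the ". expected" error above
import urllib.parse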
The videos are really good. I am trying to replicate the same in my account, but I am having trouble getting the files from S3 into Databricks. I am getting an error with display(dbutils.fs.ls('/mnt/s3data')). Error message: the authorization mechanism you have provided is not supported.
Gururajan.. Can you check whether you have set the S3 policy for the bucket in AWS, and also that the code is similar to the below with your credentials?
import urllib

# Fill in your own AWS credentials and bucket name
ACCESS_KEY = ""
SECRET_KEY = ""
# URL-encode the secret key so characters like "/" do not break the mount URL
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = ""
MOUNT_NAME = "s3data"
# Mount the bucket so it is reachable under /mnt/s3data in DBFS
dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
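A quick way to verify the mount right after running the code above; if an earlier failed attempt left a stale mount behind, unmount it first and rerun the mount command:

display(dbutils.fs.ls("/mnt/s3data"))   # should list the objects in the bucket
# dbutils.fs.unmount("/mnt/s3data")     # only to clean up a broken mount before retrying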
@@AIEngineeringLife The S3 bucket has AmazonS3FullAccess and I have used the same code as above, but I am getting an error when running display(dbutils.fs.ls('/mnt/s3data')):
Error: java.rmi.RemoteException: com.databricks.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/' XML Error Message: InvalidRequest: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.
@@gururajangovindan7766 The error is related to the region in which you created the S3 bucket. I faced the same issue; after creating the S3 bucket in a US region I was able to access it.
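As an alternative to recreating the bucket, a hedged sketch using the newer s3a connector and pointing it at the bucket's regional endpoint so Signature Version 4 requests are accepted (region and bucket name are placeholders; this differs from the s3n mount used in the video):

# Point the s3a connector at the bucket's region before reading directly over s3a
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
df = spark.read.option("header", "true").csv("s3a://my-bucket/listings.csv")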
ravindranath oruganti Thank you. Need to check the region details
@@gururajangovindan7766 Did you solve this issue? It still appears for me even though the region is US
Can someone tell me where we open the Spark environment in AWS for the bucket that has been created?
Divendu.. It must be in my video... if you go to the AWS IAM section you can see that option
@@AIEngineeringLife which video is it?
Can you share the notebook files sir
How can we automate this script?
Any git repo for this?