Through this video, you have beautifully shared invaluable knowledge. Thank you so much.
I haven't reached the end of this video, but I just want to say it's fantastic - thank you all the way from London
Thank you Carlton for sending this note :)
Superb! At 4.2, 8.29 and 10.41 there is some echo, but I got so much knowledge. Thank you, sir.
Plain, simple and too good. Fantastic.
Thank you for your clear videos. Looking forward to seeing more real-life scenario data engineering pipeline examples.
Ezhil.. Have you seen my Mastering Spark playlist that has details on DE pipelines? If you are looking for anything specific, let me know - ruclips.net/p/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO
Really informative video, thank you. Is there a way to personally connect with you to discuss Data Engineering as a career and an end-to-end learning path?
Thank you so much for this content!
Super, sir. Thanks for this video, very helpful.
Very useful! Good explanation :)
When you save the data as a table in Parquet format, is it stored at the AWS S3 level or the Databricks level?
Ashwin.. It will be in S3, as my S3 bucket is mounted onto Databricks now.
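A minimal sketch of what that save could look like, assuming the bucket is mounted at /mnt/s3data as in the mount code later in this thread; the file name and table name are illustrative only:

# Read a CSV from the mounted S3 path into a DataFrame
df = spark.read.csv("/mnt/s3data/listings.csv", header=True, inferSchema=True)

# Writing with an explicit path keeps the Parquet files on the mounted S3 location,
# while the table metadata is registered in the Databricks metastore
(df.write
   .format("parquet")
   .option("path", "/mnt/s3data/tables/listings_parquet")
   .saveAsTable("listings_parquet"))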
Very nice. Can you make a video on the next part, showing what parameters you would monitor for the same business case, with code? ✌✌👌
superb ....
Nice video. Could you please also make a video on how we can read data from S3 into a notebook that runs on EC2?
Manas.. I am planning to do it once I get to the cloud section of my course
ruclips.net/p/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
Will get there sometime soon.
Hi sir,
I have a doubt. My CSV file has a couple of columns whose text contains "," inside the values. Since it is a CSV file, every "," is treated as a new column, which shifts the column types and produces null values. How can I solve this issue?
In the Spark DataFrame reader options, have you tried the quote option?
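A minimal sketch of those reader options, assuming a standard CSV where the affected fields are wrapped in double quotes (the path is illustrative):

# quote keeps commas inside quoted fields in a single column;
# escape handles double quotes that appear within a quoted field;
# multiLine helps if a quoted field spans line breaks
df = spark.read.csv(
    "/mnt/s3data/listings.csv",
    header=True,
    quote='"',
    escape='"',
    multiLine=True
)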
Hi, awesome video.
Could you make another video on integrating AWS with PySpark? As in, some more real-time use cases similar to this one.
Hello Sir, just a doubt. Suppose I have multiple files for multiple users in an S3 bucket, and I just need to load these files from the bucket, convert them into a DataFrame, do some filtering and return the response to a dashboard. For this, do I need to create multiple SparkSessions for the multiple files, or is loading the files into different DataFrames within a single SparkSession okay? Please help me out.
I am confused about whether to use multiple Spark sessions or not.
If all the files are of the same format then you can move them to a single S3 location and read all the files together. Even if they are in separate S3 buckets, you can create multiple DataFrame objects in a single session.
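A rough sketch of both options under a single SparkSession; the bucket names, prefixes and the filter column are made up for illustration:

# Same format under one prefix: read everything into a single DataFrame
all_users_df = spark.read.csv("s3a://my-bucket/users/*.csv", header=True)

# Separate locations: multiple DataFrames, still one SparkSession
df_a = spark.read.csv("s3a://bucket-a/user_a.csv", header=True)
df_b = spark.read.csv("s3a://bucket-b/user_b.csv", header=True)

# Filter before returning results to the dashboard
filtered = all_users_df.filter(all_users_df["country"] == "US")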
@@AIEngineeringLife Thank you, Sir. Also, can multiple users process different datasets at the same time in a single Spark session?
Hello sir, import urllib is throwing an error. Do we need to add external libraries to the cluster?
Yes.. just pip install if the package is not there. What error are you seeing?
Can you show us a way to use an IAM role instead of credentials, from EC2 to S3, using PySpark?
Abhishek.. I can, but that is not possible in the community edition of Databricks to my knowledge; only in the enterprise edition can we set it up. I will try to show it within AWS or cross-cloud in one of my videos.
AIEngineering
Can you please explain what you mean by community and enterprise edition? I have an EC2 instance with Spark (PySpark) set up, and I am trying to access a file in the S3 bucket by using roles. Can you please show us that?
The video you commented under uses Databricks, and I thought you were referring to Databricks. If it is EC2, that should not be that complex; will show it. On a side note, any reason for going with EC2 and not EMR-based Spark?
AIEngineering
We are planning to test on a standalone EC2 instance before we run on EMR, and also to avoid the extra cost of having EMR running in the dev environment.
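On the earlier question of using a role instead of keys from EC2, a rough sketch of what that might look like with the s3a connector; the credentials-provider setting, bucket name, and the assumption that an instance-profile role is attached to the EC2 machine are mine, not from the video, and the matching hadoop-aws/aws-java-sdk jars need to be on the classpath:

from pyspark.sql import SparkSession

# Use the IAM role attached to the EC2 instance instead of access/secret keys
spark = (SparkSession.builder
         .appName("s3-with-instance-role")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.InstanceProfileCredentialsProvider")
         .getOrCreate())

# No keys in code: the role's permissions govern access to the bucket
df = spark.read.csv("s3a://my-bucket/listings.csv", header=True)
df.show(5)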
Hello sir, when you say it gets mounted on s3data, does the data get copied to our local drive? I don't see any data getting copied into Databricks.
It is more of a passthrough, so the mount is logical. You can query your local DBFS path and it will retrieve the data from S3.
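To illustrate the passthrough behaviour, something like the below; the mount name matches the mount code later in this thread and the file name is assumed:

# Listing the mount shows the S3 objects; nothing is copied into DBFS storage
display(dbutils.fs.ls("/mnt/s3data"))

# Reading through the mount pulls the data from S3 at query time
reviews_df = spark.read.csv("/mnt/s3data/reviews.csv", header=True)
reviews_df.count()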
@@AIEngineeringLife Sir, can you please upload the listings and reviews CSV files to GitHub? I don't see them in your GitHub repo.
@@jaswantthadi5312 .. They must be on the Inside Airbnb website that I showed in my video above. Are you not able to download them?
Can you explain why you prefer AWS with Databricks instead of Azure, since Azure and Databricks have partnered?
royxss.. It can be anything: Azure Blob Storage or Google GCS. It's just that I took AWS S3 as the example here :)
@@AIEngineeringLife Thanks for answering my question. I would like to ask for one more piece of advice: I want to play with Kafka and Spark Streaming, and then possibly MLlib. Can you suggest any use case with a source system to try out?
@@royxss Streaming sources are difficult to get, but you can write your own scraping component. Let me see if there are some good use cases and update.
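Not from the video, but a bare-bones Structured Streaming read from Kafka could look like the below, pointed at whatever producer or scraping component you build; the broker address and topic name are placeholders, and the spark-sql-kafka package for your Spark version has to be on the classpath:

# Subscribe to a Kafka topic as a streaming DataFrame
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "events")
             .load())

# Kafka delivers binary key/value columns; cast the payload to string before parsing
parsed = stream_df.selectExpr("CAST(value AS STRING) AS json_payload")

# Console sink is handy for experimentation; swap in a real sink later
query = (parsed.writeStream
         .format("console")
         .outputMode("append")
         .start())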
@@AIEngineeringLife Thanks a lot. Maybe a video of the pipeline in the future would be awesome :-D. Just an idea!
Knowledge is power, as you showed us. Just upload it to a public S3 bucket and share the URL, or I will sync it to my own bucket, of course with your IAM permission.
AIwaya.. Thank you, and this is the Airbnb dataset; you can download the 3 files from here:
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/listings.csv.gz
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/reviews.csv.gz
data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/visualisations/neighbourhoods.csv
Just unzip the first 2
Thanks for sharing great knowledge.
Could you please share the code for reference, or a GitHub link if any, or this Databricks notebook?
You can check it here - github.com/srivatsan88/Mastering-Apache-Spark
I have tried to connect to S3: if your secret key has a "/" in it, replace it with %2F (URL-encode it), or else you will not be able to connect to S3.
How do we automate this script?
Hello Sir,
Great and structured explanation!
I am getting an error while executing the code. Please help resolve it:
:5: error: . expected
import urllib
Adding "%python" platform to the cell resolved the issue.
Where are you using it? Databricks? In Databricks, Python is the default kernel unless changed. Can you please confirm?
@@AIEngineeringLife Yeah, initially I was using Scala; later I found that it should be Python.
Thank you very much for the support, Sir!❤️❤️
Can you share the notebook files, sir?
The videos are really good. I am trying to replicate the same in my account, but I am having trouble getting the files from S3 into Databricks. I am getting an error with display(dbutils.fs.ls('/mnt/s3data')). Error message: the authorization mechanism you have provided is not supported.
Gururajan.. Can you check whether in AWS you have set the S3 policy for the bucket, and also that your code is similar to the below, with your credentials:
import urllib
ACCESS_KEY = ""
SECRET_KEY = ""
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = ""
MOUNT_NAME = "s3data"
dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
@@AIEngineeringLife The S3 bucket has AmazonS3FullAccess and I have used the same code as above, but I am getting an error when running display(dbutils.fs.ls('/mnt/s3data')).
Error:
java.rmi.RemoteException: com.databricks.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/' XML Error Message: InvalidRequest: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.; nested exception is:
@@gururajangovindan7766 The error is related to the region in which you created the S3 bucket. I faced the same issue; after creating the S3 bucket in a US region, I was able to access it.
ravindranath oruganti Thank you. Need to check the region details
@@gururajangovindan7766 Did you solve this issue? It still appears for me even though the region is US.
Can someone tell me where we open the Spark environment in AWS for the bucket that is created?
Divendu.. It must be in my video... if you go to the AWS IAM section you can see that option.
@@AIEngineeringLife which video is it?
Any git repo for this?