Data Engineering - Setting up AWS S3 and Getting Data into Spark - Part 1

  • Published: 7 Nov 2024

Comments • 76

  • @ijeffking
    @ijeffking 4 years ago +5

    Through this video you have quite beautifully disseminated invaluable knowledge. Thank you so much.

  • @carltonpatterson5539
    @carltonpatterson5539 3 years ago +1

    I haven't reached the end of this video, but I just want to say it's fantastic - thank you all the way from London

  • @imransharief2891
    @imransharief2891 4 years ago +1

    Superb. At 4.2, 8.29, and 10.41 there is some echo, but I got so much knowledge. Thank you, sir.

  • @venkatramachandran2912
    @venkatramachandran2912 3 years ago +2

    Plain, simple and too good. Fantastic.

  • @dsezhilarasu
    @dsezhilarasu 4 years ago +1

    Thank you for your clear videos. Looking forward to seeing more real-life scenario data engineering pipeline examples.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Ezhil, have you seen my Mastering Spark playlist? It has details on DE pipelines. If you are looking for anything specific, let me know - ruclips.net/p/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO

  • @sandeepsingavarapu3839
    @sandeepsingavarapu3839 3 years ago +1

    Really informative video, thank you. Is there a way to personally connect with you to discuss Data Engineering as a career, learning, and an end-to-end path?

  • @joannwatu7603
    @joannwatu7603 3 years ago

    Thank you so much for this content!

  • @AnandKumar-dc2bf
    @AnandKumar-dc2bf 3 years ago

    Super, sir. Thanks for this video, very helpful.

  • @anmoljaiswal6686
    @anmoljaiswal6686 3 years ago +1

    Very useful! Good explanation :)

  • @ashwinvijay373
    @ashwinvijay373 4 years ago +1

    When you save the data as a table in parquet format, is it stored at the AWS S3 level or the Databricks level?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      Ashwin, it will be in S3, as my S3 bucket is mounted onto Databricks now.
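
      A minimal sketch of what that looks like, assuming the '/mnt/s3data' mount from the video and the listings file from the dataset; the parquet files land in S3 via the mount, while only the table metadata lives in the metastore:

      # Read a CSV from the S3 mount and save it as an unmanaged parquet table.
      df = spark.read.csv("/mnt/s3data/listings.csv", header=True, inferSchema=True)
      (df.write
         .format("parquet")
         .option("path", "/mnt/s3data/tables/listings")   # data written to S3 through the mount
         .saveAsTable("listings"))                        # metastore entry points at the S3 path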

  • @chetanmundhe8619
    @chetanmundhe8619 4 years ago

    Very nice. Can you make a video on the next part, where you show what parameters you would monitor for the same business case, with code? ✌✌👌

  • @rushikeshbulbule8120
    @rushikeshbulbule8120 4 years ago +1

    superb ....

  • @manaspradhan2166
    @manaspradhan2166 4 years ago +1

    Nice video. Could you please also make a video on how we can read data from S3 into a notebook running on EC2?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Manas, I am planning to do it once I get to the cloud section of my course:
      ruclips.net/p/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
      Will get there soon.

  • @moulalichebolu1946
    @moulalichebolu1946 3 years ago

    Hi sir,
    I have a doubt. My CSV file has 2 columns that contain "," inside the text. Since every "," in a CSV file is treated as a new column, the column types shift and I get null values. How can I solve this issue?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      In the Spark dataframe reader options, have you tried the quote option?
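
      A minimal sketch of those reader options in PySpark, assuming the troublesome fields are wrapped in double quotes (the file path is a placeholder):

      # Commas inside quoted fields stay part of the field instead of splitting columns.
      df = (spark.read
            .option("header", "true")
            .option("quote", '"')         # fields wrapped in " may contain commas
            .option("escape", '"')        # a doubled "" inside a field stays literal
            .option("multiLine", "true")  # optional: quoted fields containing newlines
            .csv("/mnt/s3data/listings.csv"))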

  • @sagnikachakraborty1493
    @sagnikachakraborty1493 3 years ago

    Hi, awesome video.
    Could you do another video integrating AWS with PySpark? As in, some more real-time use cases similar to this one.

  • @tridipdas5445
    @tridipdas5445 3 years ago

    Hello sir, just a doubt. Suppose I have multiple files for multiple users in an S3 bucket, and I just need to load these files from the bucket, convert them into a dataframe, do some filtering, and return the response to a dashboard. For this, do I need to create multiple SparkSessions for the multiple files, or is loading the files into different dataframes in a single SparkSession okay? Please help me out.
    I am getting confused about whether to use multiple Spark sessions or not.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      If all the files are of the same format, you can move them to a single S3 bucket and read them all together. Even if they are in separate S3 buckets, you can create multiple dataframe objects in a single session (a small sketch follows at the end of this thread).

    • @tridipdas5445
      @tridipdas5445 3 years ago

      @@AIEngineeringLife Thank you, sir. Also, can multiple users process different datasets at the same time in a single Spark session?
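
      A minimal sketch of the single-session approach, with placeholder bucket names, file names, and filter:

      # Same-format files under one prefix can be read in a single call:
      all_users_df = spark.read.csv("s3a://my-bucket/users/*.csv", header=True)

      # Files from different locations simply become separate dataframes in the
      # same SparkSession; no extra session is needed:
      df_a = spark.read.csv("s3a://my-bucket/users/user_a.csv", header=True)
      df_b = spark.read.csv("s3a://other-bucket/users/user_b.csv", header=True)
      filtered = all_users_df.filter(all_users_df["country"] == "US")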

  • @jaswantthadi5312
    @jaswantthadi5312 3 years ago

    Hello sir, import urllib is throwing an error. Do we need to add external libraries to the cluster?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      Yes, just pip install it if the package is not there. What error are you seeing?

  • @abhishekkulkarni6020
    @abhishekkulkarni6020 4 years ago +1

    Can you show us a way to use an IAM role instead of credentials, from EC2 to S3, using PySpark?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Abhishek, I can, but that is not possible in the Community Edition of Databricks to my knowledge; only in the enterprise edition can we set it up. I will try to show it within AWS, or cross-cloud, in one of my videos.

    • @abhishekkulkarni6020
      @abhishekkulkarni6020 4 years ago

      @AIEngineering Can you please explain what you mean by community and enterprise editions? I have an EC2 with Spark (PySpark) set up and I am trying to access a file in the S3 bucket using roles. Can you please show us that?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      The video you commented under uses Databricks, so I thought you were referring to Databricks. If it is EC2, that should not be that complex; will show it. On a side note, any reason for going with EC2 and not EMR-based Spark?

    • @abhishekkulkarni6020
      @abhishekkulkarni6020 4 years ago

      @AIEngineering We are planning to test on a standalone EC2 before we run on EMR, also to avoid the extra cost of having EMR running in the dev environment.
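
      For the EC2-with-IAM-role setup discussed in this thread, a hedged sketch of one common approach; it assumes an instance profile role is attached to the EC2 machine and that the hadoop-aws and matching aws-java-sdk jars are on the classpath, with placeholder bucket and file names:

      # Read from S3 using the instance profile role: no access/secret keys in code.
      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("s3-role-demo")
               .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                       "com.amazonaws.auth.InstanceProfileCredentialsProvider")
               .getOrCreate())

      df = spark.read.csv("s3a://my-bucket/listings.csv", header=True)
      df.show(5)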

  • @harshithag5769
    @harshithag5769 3 years ago

    Hello sir, when you say it gets mounted on s3data, does the data get copied to our local drive? I don't see any data getting copied in Databricks.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      It is more of a passthrough, so the mount is logical. You can query your local DBFS path and it will retrieve the data from S3 (see the sketch after this thread).

    • @jaswantthadi5312
      @jaswantthadi5312 3 years ago

      @@AIEngineeringLife Sir, can you please upload the listings and reviews CSV files to GitHub? I don't see them in your GitHub repo.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      @@jaswantthadi5312 .. It should be on the Airbnb website that I showed in the video above. Are you not able to download it?
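
      A minimal sketch of that passthrough behaviour, assuming the '/mnt/s3data' mount from the video; nothing is copied locally, and each read below is served from S3 on demand:

      # The mount is a logical path: listing and reading it go through to S3.
      display(dbutils.fs.ls("/mnt/s3data"))                  # lists objects that live in S3
      df = spark.read.csv("/mnt/s3data/listings.csv", header=True)
      df.count()                                             # data streams from S3 at query time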

  • @royxss
    @royxss 4 years ago

    Can you explain why you prefer AWS with Databricks instead of Azure, given that Azure and Databricks have partnered?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      royxss.. It can be anything: Azure Blob Storage or Google GCS. I just took AWS S3 as the example here :)

    • @royxss
      @royxss 4 years ago +3

      @@AIEngineeringLife Thanks for answering my question. I would like to ask for another piece of advice: I want to play with Kafka and Spark Streaming, and then possibly MLlib. Can you suggest any use case with a source system to try out?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      @@royxss Streaming sources are difficult to get, but you can write your own scraping component. Let me see if some good use cases are out there and update.

    • @royxss
      @royxss 4 years ago +1

      @@AIEngineeringLife Thanks a lot. Maybe a video of the pipeline in the future would be awesome :-D. Just an idea!

  • @dkyadav6971
    @dkyadav6971 4 years ago

    Knowledge is power, and you showed us that. Just upload the data to a public S3 bucket and share the URL, or I will sync it to my bucket, of course with your IAM permission.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      AIwaya.. Thank you, and this is the Airbnb dataset; you can download the 3 files from here:
      data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/listings.csv.gz
      data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/data/reviews.csv.gz
      data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12/visualisations/neighbourhoods.csv
      Just unzip the first 2
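
      If it helps, a small hedged Python sketch that fetches the three files and decompresses the two gzipped ones (URLs copied from above):

      import gzip, shutil, urllib.request

      base = "https://data.insideairbnb.com/united-states/ny/new-york-city/2020-02-12"
      for name in ["data/listings.csv.gz", "data/reviews.csv.gz"]:
          gz_path = name.split("/")[-1]
          urllib.request.urlretrieve("%s/%s" % (base, name), gz_path)
          with gzip.open(gz_path, "rb") as src, open(gz_path[:-3], "wb") as dst:
              shutil.copyfileobj(src, dst)              # -> listings.csv / reviews.csv
      urllib.request.urlretrieve("%s/visualisations/neighbourhoods.csv" % base,
                                 "neighbourhoods.csv")  # already plain CSV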

  • @nikhilmeghnani6234
    @nikhilmeghnani6234 4 years ago +1

    Thanks for sharing great knowledge.
    Could you please share the code for reference, a GitHub link if any, or this Databricks notebook?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      You can check it here - github.com/srivatsan88/Mastering-Apache-Spark

  • @Cricketpracticevideoarchive
    @Cricketpracticevideoarchive 4 years ago +1

    I have tried to connect with S3: if your secret access key has "/" in it, replace it with %2F, or else you will not be able to connect to S3.
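
    For reference, a one-line sketch of the encoding described above (the key value is a placeholder, not a real key):

      import urllib.parse
      SECRET_KEY = "abc/def+ghi"                      # placeholder, not a real key
      print(urllib.parse.quote(SECRET_KEY, safe=""))  # -> abc%2Fdef%2Bghi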

  • @photomyste3279
    @photomyste3279 2 months ago

    How do you automate this script?

  • @abhishek007123
    @abhishek007123 3 years ago

    Hello sir,
    Great and structured explanation!
    I am getting an error while executing the code. Please help resolve it:
    :5: error: . expected
    import urllib

    • @abhishek007123
      @abhishek007123 3 years ago

      Adding "%python" platform to the cell resolved the issue.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      Where are you running it? Databricks? In Databricks, Python is the default kernel unless changed. Can you please confirm?

    • @abhishek007123
      @abhishek007123 3 years ago +1

      @@AIEngineeringLife Yeah, initially I was using Scala; later I found that it should be Python.
      Thank you very much for the support, sir! ❤️❤️
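
      A minimal illustration of the fix mentioned above: in a Databricks notebook whose default language is Scala, a "%python" magic on the first line of a cell switches just that cell to Python. A hedged sketch:

      %python
      # This cell now runs as Python even though the notebook default is Scala,
      # so a Python-style import such as the one below compiles.
      import urllib.parse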

  • @gopalbehara8283
    @gopalbehara8283 1 year ago

    Can you share the notebook files, sir?

  • @gururajangovindan7766
    @gururajangovindan7766 4 years ago

    The videos are really good. I am trying to replicate the same in my account, but I am having trouble getting the files from S3 into Databricks. I am getting an error with display(dbutils.fs.ls('/mnt/s3data')). Error message: the authorization mechanism you have provided is not supported.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Gururajan, can you check that in AWS you have set the S3 policy for the bucket, and also that the code is similar to the below, with your credentials:
      import urllib.parse                                        # urllib.parse must be imported explicitly
      ACCESS_KEY = ""
      SECRET_KEY = ""
      ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")    # escape "/" in the key as %2F
      AWS_BUCKET_NAME = ""
      MOUNT_NAME = "s3data"
      dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)

    • @gururajangovindan7766
      @gururajangovindan7766 4 years ago

      @@AIEngineeringLife The S3 bucket has AmazonS3FullAccess and I have used the same code as above, but I am getting an error when running display(dbutils.fs.ls('/mnt/s3data')).
      Error:
      java.rmi.RemoteException: com.databricks.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/'. XML error message: InvalidRequest - The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256. (RequestId: 988C6BCCCD57FBB0, HostId: aM7FNZvIJDGnwi9JQ0Pg1Idj6hlO/8BUkFNdwyQ8UvbhxdT0GoWM5MfRe02CM2JHy5m7/RWeDQ8=); nested exception is:

    • @ravee9090
      @ravee9090 4 years ago +2

      @@gururajangovindan7766 The error is related to the region in which you created the S3 bucket. I faced the same issue; after creating the S3 bucket in a US region I was able to access it (a related workaround sketch follows at the end of this thread).

    • @gururajangovindan7766
      @gururajangovindan7766 4 years ago

      ravindranath oruganti Thank you. Need to check the region details

    • @HSHIHADAH
      @HSHIHADAH 4 years ago

      @@gururajangovindan7766 Did you solve this issue? It still appears for me even though the region is US.
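
      One workaround sometimes suggested for this AWS4-HMAC-SHA256 error is to mount over s3a (which supports Signature V4) and pin the bucket's regional endpoint. A hedged sketch only; the region and variable values are placeholders, and extra_configs behaviour may vary by Databricks runtime:

      import urllib.parse
      ACCESS_KEY = ""                                   # as in the mount snippet above
      SECRET_KEY = ""
      AWS_BUCKET_NAME = ""
      ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, safe="")
      dbutils.fs.mount(
          source = "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME),
          mount_point = "/mnt/s3data",
          extra_configs = {"fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com"})  # bucket's own region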

  • @divendughati6114
    @divendughati6114 4 years ago

    Can someone tell me where we open the Spark environment in AWS for the bucket that was created?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Divendu.. it must be in my video; if you go to the AWS IAM section you can see that option.

    • @fintech1378
      @fintech1378 3 years ago

      @@AIEngineeringLife which video is it?

  • @siddhantsapte
    @siddhantsapte 4 years ago

    Any git repo for this?