Big Data Hadoop Spark Cluster on AWS EMR Cloud | Big Data on AWS Cloud | Production Big Data Cluster

  • Published: 17 Oct 2024

Comments • 64

  • @sumitmittal07
    @sumitmittal07  2 years ago

    Checkout the Big Data course details here: trendytech.in/?referrer=youtube_bd22

  • @gebrilyoussef6851
    @gebrilyoussef6851 1 month ago

    Sumit, you are the master trainer of Big Data. Thank you so much for all the efforts you made.

  • @anuragdubey5898
    @anuragdubey5898 2 years ago +2

    Very informative session. I have learnt a lot and cleared my doubts as well. The easy and simplified way of explaining made it the best video for AWS use in Big Data. Thanks for the session.

  • @NaturalPro100
    @NaturalPro100 4 years ago +2

    This really cleared some basics I needed for starting Spark with AWS. The content and explanation are to the point. Thanks for sharing, Sumit.

  • @udaynayak4788
    @udaynayak4788 2 years ago

    One of the most informative sessions, thank you so much for sharing.

  • @sampaar
    @sampaar 3 years ago +1

    Amazing presentation. Better than many of the udemy courses that I have come across.

  • @VallabhGhodkeB
    @VallabhGhodkeB 2 years ago

    Top stuff this is. Just got started, way to go!

  • @mdabdulmujeebmalik422
    @mdabdulmujeebmalik422 4 years ago

    Excellent video on AWS and how to run a Spark job on AWS. Amazing! Thank you so much for the video, and kudos to the instructor.

  • @kashamp9388
    @kashamp9388 2 years ago +1

    Best session ever. Concise.

    • @sumitmittal07
      @sumitmittal07  2 years ago

      Glad you are liking my teaching :)

  • @laxmisuresh
    @laxmisuresh 3 years ago

    Very meaningful presentation. Explained at the right pace and with proper content.

  • @gauravrai4398
    @gauravrai4398 4 years ago +1

    Very lucid and concise explanation .... A job well done!

  • @RaviKumar-oy5jq
    @RaviKumar-oy5jq 4 years ago +2

    Excellent session ..

  • @vijeandran
    @vijeandran 3 years ago

    Neat explanation.... and very very informative video....

  • @sridharreddy9605
    @sridharreddy9605 3 years ago

    very clear explanation, thank you for your time...

  • @datoalavista581
    @datoalavista581 2 years ago

    Thank you for sharing

  • @vairammoorthy6665
    @vairammoorthy6665 4 years ago

    best tutorial for AWS EMR

  • @swaroupbanikk4444
    @swaroupbanikk4444 2 years ago

    BEST

  • @subratakr5353
    @subratakr5353 4 years ago

    Thanks for the lovely presentation!
    Had 2 questions though :
    1) When you say you are running code on the master, do you mean the namenode of the cluster? Where is the namenode for this cluster?
    2) Since the data is stored in S3, does EMR copy it to HDFS and then Spark eventually reads from HDFS? In which HDFS path is the data stored?

    • @vijeandran
      @vijeandran 3 years ago +2

      Answer 1: Here the namenode, driver node, edge node and master node are all the same machine.
      Answer 2: As soon as you create one master and two slave nodes, the slave nodes' hard disks act as HDFS; Spark fetches files from those disks and processes them in the slave nodes' memory. The master and slaves together form the processing part inside AWS, while S3, also part of AWS, is the storage part. When you want to process data, you copy the data file from the storage part (S3) to the processing part (HDFS), which lives on the two slave nodes you created; then you can run the Spark jar file.
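The flow described in this answer can be sketched with shell commands run on the EMR master node; the bucket name, paths, jar and class names below are illustrative placeholders, not taken from the video:

```shell
# Copy the input file from S3 (the storage part) to the master node,
# then put it into HDFS, which lives on the core/slave nodes' disks.
aws s3 cp s3://my-demo-bucket/book-data.txt .
hdfs dfs -mkdir -p /data
hdfs dfs -put book-data.txt /data/

# Fetch the application jar from S3 and run the Spark job against HDFS.
aws s3 cp s3://my-demo-bucket/wordcount.jar .
spark-submit --class com.example.WordCount wordcount.jar hdfs:///data/book-data.txt
```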

  • @amitbajpai6209
    @amitbajpai6209 4 years ago +1

    Best video to get an overall understanding of AWS EMR..
    It was really helpful 😊
    Kudos to the Instructor !!
    Liked 👍 and Subscribed..
    Hoping for more such videos..

  • @sohailhosseini2266
    @sohailhosseini2266 2 years ago +1

    Thanks for the video!

  • @divakarluffy3773
    @divakarluffy3773 2 years ago

    One video resolved all my doubts, thanks!

    • @sumitmittal07
      @sumitmittal07  2 years ago +1

      Happy to hear that your doubts are resolved

  • @pankajnakil6173
    @pankajnakil6173 3 years ago

    Very useful & good explanation..

  • @ririraman7
    @ririraman7 2 years ago +1

    beautiful

  • @shilparathore8849
    @shilparathore8849 4 years ago +1

    Very well explained thanks for sharing

  • @ramprasadbandari8195
    @ramprasadbandari8195 4 years ago +1

    Excellent explanation and very useful Info!!

  • @NIHAL960
    @NIHAL960 4 years ago +5

    S3: Amazon storage
    On-demand instance: available on demand
    Spot instance: available at high discounts on a temporary basis; can be taken back with a 2-minute warning
    Reserved instance: available at a discounted price compared to on-demand if the commitment is long, such as a year
    Types of nodes:
    1. Master node: manages the cluster; a single EC2 instance.
    2. Core node: each cluster has one or more core nodes; they host data and run tasks.
    3. Task node: can only run tasks, not store data; needed if the application is compute-heavy. Spot instances are a good choice here.
    Cluster types:
    1. A transient cluster terminates automatically.
    2. A long-running cluster requires manual termination.
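The node and cluster types summarized above map onto the AWS CLI roughly as follows. This is a hedged sketch: the release label, instance types and counts, key name, and bid price are illustrative placeholders:

```shell
# Transient EMR cluster: 1 master, 2 core nodes, plus 2 spot task nodes.
# Specifying BidPrice makes that instance group use Spot Instances.
aws emr create-cluster \
  --name "demo-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
    InstanceGroupType=TASK,InstanceCount=2,InstanceType=m5.xlarge,BidPrice=0.10 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --auto-terminate   # transient: terminates after its steps finish; omit for long-running
```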

  • @rrjishan
    @rrjishan 3 years ago

    As we say, on Amazon AWS we can shut down the cluster after computation and the data will be saved in S3. So are clusters only responsible for computing data? Isn't data also stored in clusters? Getting a bit confused... please clarify.

  • @puneetnaik8719
    @puneetnaik8719 4 years ago

    Great explanation sir..thanks for video.

  • @anuj3922
    @anuj3922 3 years ago

    EMR clusters are billed hourly -- if I don't use it, do I still have to pay for it? What if I build it just for learning purposes and come back to it as my learning progresses?

  • @keyursolanki
    @keyursolanki 10 months ago

    Will access to S3 from the EMR cluster be allowed by default?

  • @dineshughade6570
    @dineshughade6570 1 year ago

    Nice explanation. Can we have a pdf of this video?

  • @dharmeswaranparamasivam5498
    @dharmeswaranparamasivam5498 4 years ago

    Very good session. Thanks for doing this.

  • @fzgarcia
    @fzgarcia 4 years ago

    Do you know if I can run an EMR cluster like this on a free-tier account?
    Even if I can only run t3.micro in the free tier, can I create a manual cluster with a minimum of 3 t3.micro nodes, or more? Thanks.

  • @Dyslexic_Neuron
    @Dyslexic_Neuron 3 years ago

    Very good explanation. Can you make a video on Spark shuffle and its issues?

  • @AparnaBL
    @AparnaBL 4 years ago +1

    Moreover, HDFS data is ephemeral, right? If you want the data to persist even after the cluster is terminated, you can use S3.

    • @sumitmittal07
      @sumitmittal07  4 years ago

      absolutely. you can see the same thing mentioned around the 24th minute of the session

    • @AparnaBL
      @AparnaBL 4 years ago +2

      @@sumitmittal07 yeah @ 22:36

  • @fzgarcia
    @fzgarcia 4 years ago

    Thank you, nice presentation!

  • @BinduReddy-n1q
    @BinduReddy-n1q 1 year ago

    How do I save the word-count output in HDFS and also in S3?
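One way to answer this question, sketched as shell commands on the master node (the bucket and paths are placeholders): have Spark write its output to HDFS, then copy the output directory to S3 with `s3-dist-cp`, which ships with EMR:

```shell
# Suppose the Spark job wrote its result to hdfs:///output/wordcount.
hdfs dfs -ls /output/wordcount        # verify the output files exist in HDFS

# Copy the whole output directory from HDFS to S3.
s3-dist-cp --src hdfs:///output/wordcount \
           --dest s3://my-demo-bucket/wordcount-output
```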

  • @Naveen-xi7os
    @Naveen-xi7os 4 years ago

    it was awesome session

  • @gaurav1825
    @gaurav1825 4 years ago

    Sir, please give some guidance on AWS EMR with Apache Flink and Hudi.

  • @amulmgr
    @amulmgr 3 years ago

    thank you very much for the video

  • @diptyojha174
    @diptyojha174 4 years ago +1

    Very nice explanation

  • @techtransform
    @techtransform 3 years ago

    Excellent Explanation :)

  • @piby1802
    @piby1802 4 years ago

    Really nice presentation! Thank you!

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 4 years ago

    Thanks, but why is HDFS data gone when the cluster shuts down? Since HDFS is persistent while the cluster is up, it should automatically be available, right?

    • @vijeandran
      @vijeandran 3 years ago

      When you start the cluster you are creating three instances: one for the master and two for the datanodes. These nodes exist only for that session because they are virtual; once you terminate the cluster, the 1 master and 2 slave instances are killed, and the data present in HDFS is deleted with them. As Sumit said, if you want to run your cluster continuously then the data remains available in HDFS, but Amazon will bill you more for the continuous usage of the cluster.

  • @sancharighosh8204
    @sancharighosh8204 3 years ago

    Can you make some tutorials on Databricks?

  • @priyabhatia4107
    @priyabhatia4107 4 years ago

    Great content!!

  • @satishj801
    @satishj801 2 years ago

    @1:14:30, he downloaded the jar from S3 but didn't copy it to HDFS the way he copied book-data.txt, yet he mentioned he is running the jar from HDFS, not from S3. But it's the same step as @1:06:52. I'm a bit confused at that point. If someone has understood, please drop a reply.

    • @user-co8oc1rm5w
      @user-co8oc1rm5w 2 years ago

      He kept the jar file in the root path of the cluster (on the master node), but the file to be processed he kept in HDFS, in the directory he created named '/data'. That's why he said he is running the job from HDFS: the input file, book-data.txt, was downloaded into HDFS instead of being read from S3. He then changed the file location in the Scala code, recreated the jar, placed that jar in S3 first, downloaded it from S3 to the master node, and executed the Spark job to process book-data.txt from HDFS, not from S3.

    • @satishj801
      @satishj801 2 years ago

      @@user-co8oc1rm5w Thanks for the explanation👌🏻

  • @rajsekhargada9212
    @rajsekhargada9212 2 years ago

    I think S3 is not distributed storage

    • @sumitmittal07
      @sumitmittal07  2 years ago

      it's an object store, but in this scenario it's a replacement for distributed storage, serving a similar use case.

  • @makdan1331
    @makdan1331 4 years ago

    Where is the jar file?