What is Apache Spark? Learn Apache Spark in 15 Minutes

  • Published: 26 Nov 2024
  • #apachespark #databricks #sparkteam #dataengineering #pyspark #architecture
    In this video, I have covered one of the most important topics for Data Engineers: "Apache Spark". In particular, I have walked through the complete end-to-end architecture of Spark, covering all the individual components below (a short PySpark sketch follows the list):
    1. Driver Program
    2. Worker Node
    3. Cluster Manager
    4. Spark Context or Spark Session
    5. DAG
    6. RDD
    7. Lazy Evaluation
    8. Stages
    9. Tasks
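    As a quick illustration of several of these pieces, here is a minimal PySpark sketch (local mode and a hypothetical sales.csv file are assumptions; the code is not taken from the video):

    from pyspark.sql import SparkSession

    # The driver program starts here; SparkSession wraps the SparkContext
    # and talks to the cluster manager ("local[*]" runs worker threads
    # in-process, handy for testing).
    spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

    df = spark.read.csv("sales.csv", header=True)  # hypothetical input file

    # Transformations are lazy: nothing executes yet, Spark only records
    # the step in the DAG.
    filtered = df.filter(df["amount"] > 100)

    # An action triggers execution: the DAG is split into stages, and each
    # stage runs as tasks on the worker nodes, one task per partition.
    print(filtered.count())

    spark.stop()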
    Watch the full video to get a complete understanding of Apache Spark.
    - - - Book a Private One on One Meeting with me (1 Hour) - - -
    www.buymeacoff...
    - - - Express your encouragement by brewing up a cup of support for me - - -
    www.buymeacoff...
    - - - Other useful playlist: - - -
    1. Microsoft Fabric Playlist: • Microsoft Fabric Tutor...
    2. Azure General Topics Playlist: • Azure Beginner Tutorials
    3. Azure Data Factory Playlist: • Azure Data Factory Tut...
    4. Databricks CICD Playlist: • CI/CD (Continuous Inte...
    5. Azure Databricks Playlist: • Azure Databricks Tutor...
    6. Azure End to End Project Playlist: • End to End Azure Data ...
    7. End to End Azure Data Engineering Project: • An End to End Azure Da...
    - - - Let’s Connect: - - -
    Email: mrktalkstech@gmail.com
    Instagram: mrk_talkstech
    - - - About me: - - -
    Mr. K is a passionate teacher who created this channel with only one goal: "TO HELP PEOPLE LEARN ABOUT THE MODERN DATA PLATFORM SOLUTIONS USING CLOUD TECHNOLOGIES"
    I will be creating playlists covering the topics below (with demos):
    1. Azure Beginner Tutorials
    2. Azure Data Factory
    3. Azure Synapse Analytics
    4. Azure Databricks
    5. Microsoft Power BI
    6. Azure Data Lake Gen2
    7. Azure DevOps
    8. GitHub (and several other topics)
    After creating some foundational videos, I will create videos with real-time scenarios / use cases specific to the three common data fields:
    1. Data Engineer
    2. Data Analyst
    3. Data Scientist
    Can't wait to help people with my videos.
    - - - Support me: - - -
    Please Subscribe: / @mr.ktalkstech

Comments • 43

  • @MohanKrishna-yi9cc
    @MohanKrishna-yi9cc 2 months ago +5

    OMG, I even tried many Udemy courses to understand this. None of the tutors explained it this clearly... I am loving it... Sir, please start a full Databricks course to help us. Please... 🙏

    • @mr.ktalkstech
      @mr.ktalkstech  1 month ago

      Thank you so much :) Sure, will do :)

  • @dprakash1793
    @dprakash1793 3 months ago +3

    This is what I was looking for, well explained. Thank you.

  • @manikandan-fq5sh
    @manikandan-fq5sh 2 months ago +2

    Simply a great explanation of the Spark architecture and how it's connected; step by step, it connects all the dots in Spark.

    • @mr.ktalkstech
      @mr.ktalkstech  1 month ago

      Thank you so much :)

    • @manikandan-fq5sh
      @manikandan-fq5sh 1 month ago

      @@mr.ktalkstech Looking for further concepts in Spark; it would be great if you did a full course.

  • @chandrakumar348
    @chandrakumar348 1 month ago

    Simple and brilliant analogy Mr K

  • @benim1917
    @benim1917 3 months ago

    Clear and well explained

  • @rakeshverma6867
    @rakeshverma6867 3 months ago

    Simplest and excellent explanation Mr K.

  • @digitalabi
    @digitalabi 1 month ago

    I appreciate your explanation; it has clarified the topic for me. Thank you. 🙏🏼
    However, I have one question: if the CSV file is split into two, how will one worker determine whether there are any duplicates in another worker's portion?

  • @sharaniyaswaminathan8760
    @sharaniyaswaminathan8760 3 months ago

    Excellent! Thank you for explaining this.

  • @dogzrgood
    @dogzrgood 1 month ago

    Great explanation. Do you have a full pyspark tutorial?

  • @shyammaths5705
    @shyammaths5705 2 months ago

    This is such a simple and clear explanation that it made me share it with my friends.
    Keep making videos; your efforts are having a great impact on our lives.

  • @smderoller
    @smderoller 3 months ago

    Very well explained!!!

  • @Bijuthtt
    @Bijuthtt 3 months ago

    Awesome explanation bro.

  • @selvakumarr.k.8660
    @selvakumarr.k.8660 3 months ago

    Useful presentation

  • @satish1012
    @satish1012 16 days ago

    This is my understanding:
    - Apache Spark falls under the compute category.
    - It's related to MapReduce but is faster due to in-memory processing.
    - Spark can read large datasets from object stores like S3 or Azure Blob Storage.
    - It dynamically scales compute resources, similar to autoscaling and Kubernetes orchestration.
    - It processes the data to deliver analytics, ML models, or other results efficiently (see the sketch below).
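    A minimal sketch of that flow (the s3a://my-bucket/events/ path is a hypothetical example, and pre-configured S3 credentials are assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-demo").getOrCreate()

    # Read a large dataset straight from object storage (hypothetical
    # bucket; assumes the S3A connector and credentials are configured).
    events = spark.read.parquet("s3a://my-bucket/events/")

    # In-memory processing: cache once, then reuse across aggregations.
    events.cache()
    events.groupBy("event_date").count().show()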

  • @shabeerkhan379
    @shabeerkhan379 2 months ago

    Really good

  • @062nanthagopalm6
    @062nanthagopalm6 3 months ago

    Wow! Just mind blowing brother💥💥!! Looking for more DE fundamentals videos ✨♥️👌

  • @neeraj_dama
    @neeraj_dama 3 months ago

    thanks for this

  • @AlexFosterAI
    @AlexFosterAI 1 month ago

    Hey man, it might be worth checking out LakeSail's PySail, built on Rust. Supposedly 4x faster with 90% less hardware cost, according to their latest benchmarks, and it can migrate existing Python code. Might be cool to make a video on it!
    Love your content!

  • @seethaba
    @seethaba 1 month ago

    Great primer @Mr. K! Thanks. Quick question - How does the driver program create task partitions for the plan? For example, if there are duplicates across two worker nodes, wouldn't the count be misrepresented if it simply adds 4500 and 5500? Does this get auto-handled or do we have to control the partitioning logic?

    • @Bhavik_9988
      @Bhavik_9988 1 month ago

      It depends on the number of partitions of the files. You can also control the number of tasks by configuring the partition limit used after each transformation, with the code below:
      spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
      The number of tasks always depends on the number of partitions.
      Your question is that each worker node may hold duplicates, and the count operation will just sum the results, right?
      Answer: after getting the result from each worker node, the driver program aggregates the partial results again and then gives the final count (see the sketch below).
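      A minimal sketch of that behaviour (people.csv is a hypothetical input): a plain count() sums the per-partition counts, while distinct() shuffles first so cross-partition duplicates collapse:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("count-demo").getOrCreate()

      # Cap the number of partitions (and hence tasks) created by shuffles.
      spark.conf.set("spark.sql.shuffle.partitions", 8)

      df = spark.read.csv("people.csv", header=True)  # hypothetical file

      # Each worker counts its own partitions; the driver sums the partial
      # counts, so rows duplicated across workers are counted twice.
      print(df.count())

      # distinct() shuffles rows by value first, so identical rows land in
      # the same partition and are dropped before the final count.
      print(df.distinct().count())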

  • @Abhinavkumar-kt8gj
    @Abhinavkumar-kt8gj 3 months ago

    Excellent!

  • @mgdesire9255
    @mgdesire9255 3 months ago

    Waiting for your PySpark playlist :)

  • @zakeerp
    @zakeerp 3 months ago +1

    Hi, what tools did you use to create this type of video? Please help.

    • @mr.ktalkstech
      @mr.ktalkstech  3 months ago +1

      Final cut pro, CapCut, PowerPoint and After effects.

    • @zakeerp
      @zakeerp 3 months ago

      @@mr.ktalkstech thank you for the info

  • @kirankarthikeyan4940
    @kirankarthikeyan4940 3 months ago

    Will this same topic be covered on the other channel (Mr.K Talks Tech Tamil)?

  • @PatelTushya
    @PatelTushya 13 days ago

    Respect++

  • @neuera9556
    @neuera9556 2 months ago

    You did not talk about RDDs.