What is Apache Spark? Learn Apache Spark in 15 Minutes
- Published: 26 Nov 2024
- #apachespark #databricks #sparkteam #dataengineering #pyspark #architecture
In this video, I have covered the most important topic in Data Engineering: "Apache Spark". In particular, I have walked through the complete end-to-end architecture of Spark, covering all the individual components below:
1. Driver Program
2. Worker Node
3. Cluster Manager
4. Spark Context or Spark Session
5. DAG
6. RDD
7. Lazy Evaluation
8. Stages and
9. Tasks
Watch the full video to get a complete understanding of Apache Spark. (For how these pieces fit together in code, see the sketch below.)
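As a rough companion to the list above, here is a minimal PySpark sketch of how these components show up in code. It is only an illustration; the file path "sales.csv" and the column "amount" are hypothetical, not from the video.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession (wrapping the SparkContext) is the entry point;
# the driver program uses it to talk to the cluster manager.
spark = SparkSession.builder.appName("spark-in-15-minutes").getOrCreate()

# Reads and filters are lazy transformations: nothing executes yet,
# Spark only records them in a DAG over the underlying RDDs.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
big_sales = df.filter(F.col("amount") > 100)

# count() is an action: Spark splits the DAG into stages, each stage
# into tasks (one per partition), and runs the tasks on worker nodes.
print(big_sales.count())

spark.stop()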
- - - Book a Private One on One Meeting with me (1 Hour) - - -
www.buymeacoff...
- - - Express your encouragement by brewing up a cup of support for me - - -
www.buymeacoff...
- - - Other useful playlist: - - -
1. Microsoft Fabric Playlist: • Microsoft Fabric Tutor...
2. Azure General Topics Playlist: • Azure Beginner Tutorials
3. Azure Data Factory Playlist: • Azure Data Factory Tut...
4. Databricks CICD Playlist: • CI/CD (Continuous Inte...
5. Azure Databricks Playlist: • Azure Databricks Tutor...
6. Azure End to End Project Playlist: • End to End Azure Data ...
7. End to End Azure Data Engineering Project: • An End to End Azure Da...
- - - Let’s Connect: - - -
Email: mrktalkstech@gmail.com
Instagram: mrk_talkstech
- - - About me: - - -
Mr. K is a passionate teacher who created this channel with only one goal: "TO HELP PEOPLE LEARN ABOUT THE MODERN DATA PLATFORM SOLUTIONS USING CLOUD TECHNOLOGIES"
I will be creating playlists covering the topics below (with demos):
1. Azure Beginner Tutorials
2. Azure Data Factory
3. Azure Synapse Analytics
4. Azure Databricks
5. Microsoft Power BI
6. Azure Data Lake Gen2
7. Azure DevOps
8. GitHub (and several other topics)
After creating some basic foundational videos, I will be creating videos with real-time scenarios / use cases specific to the three common data fields:
1. Data Engineer
2. Data Analyst
3. Data Scientist
Can't wait to help people with my videos.
- - - Support me: - - -
Please Subscribe: / @mr.ktalkstech
OMG, I even tried many UDEMY courses to understand this, and none of the tutors explained it this clearly... I am loving it... Sir, please start a full Databricks course to help us. Please.. 🙏
Thank you so much :) Sure, will do :)
this is what i was looking for well explained. thank you
Thank you so much :)
Simply great explanation of the Spark architecture and how everything is connected, step by step; it connects all the dots in Spark.
Thank you so much :)
@@mr.ktalkstech Looking forward to more Spark concepts; it would be great if you made a full course.
Simple and brilliant analogy Mr K
Clear and well explained
Thank you so much :)
Simplest and excellent explanation Mr K.
Thank you so much :)
I appreciate your explanation; it has clarified the topic for me. Thank you. 🙏🏼
However, I have one question: if the CSV file is split into two, how will one worker determine whether there are any duplicates in the other worker's portion?
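(A hedged sketch of why this matters: a plain count() never compares rows across workers, so duplicates are only removed if you explicitly ask for it, which forces a shuffle so identical rows meet on the same worker. The toy data below is made up for illustration.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

# Three rows, one of them a duplicate, spread over two partitions
# to mimic the two-worker split from the video.
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (1, "a")], ["order_id", "item"]
).repartition(2)

# count() just sums per-partition counts on the driver: 3.
print(df.count())

# distinct() shuffles identical rows into the same partition before
# de-duplicating, so duplicates across workers are caught: 2.
print(df.distinct().count())

spark.stop()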
Excellent! Thank you for explaining this.
Thank you so much :)
Great explanation. Do you have a full pyspark tutorial?
This is such a simple and clear explanation that I had to share it with my friends.
Keep making videos; your efforts are making a great impact on our lives.
Thank you so much :)
Very well explained!!!
Thank you so much :)
Awesome explanation bro.
Thank you so much :)
Useful presentation
Thank you so much :)
This is my understanding:
- Apache Spark falls under the compute category.
- It's related to MapReduce but is faster due to in-memory processing.
- Spark can read large datasets from object stores like S3 or Azure Blob Storage.
- It dynamically scales compute resources, similar to autoscaling and Kubernetes orchestration.
- It processes the data to deliver analytics, ML models, or other results efficiently.
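To make the object-store point concrete, a hedged sketch (the bucket, container, and account names are made up, and the right connector package and credentials must already be on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-read").getOrCreate()

# The URI scheme picks the storage connector; the paths are hypothetical.
s3_df = spark.read.parquet("s3a://my-bucket/events/")            # Amazon S3
abfs_df = spark.read.parquet(
    "abfss://container@account.dfs.core.windows.net/events/"     # ADLS Gen2
)

# cache() keeps the data in executor memory across actions, which is
# the in-memory advantage Spark has over disk-based MapReduce.
s3_df.cache()
print(s3_df.count())

spark.stop()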
Really good
Wow! Just mind blowing brother💥💥!! Looking for more DE fundamentals videos ✨♥️👌
Thank you so much :)
thanks for this
Hey man, it may be worth checking out LakeSail's PySail, built on Rust. Supposedly 4x faster with 90% less hardware cost according to their latest benchmarks, and it can migrate existing Python code. Might be cool to make a vid on it!
love ur content!
Great primer @Mr. K! Thanks. Quick question - How does the driver program create task partitions for the plan? For example, if there are duplicates across two worker nodes, wouldn't the count be misrepresented if it simply adds 4500 and 5500? Does this get auto-handled or do we have to control the partitioning logic?
Tasks are created according to the number of partitions of the files. You can also control the number of tasks produced after each wide transformation by configuring the shuffle partition limit with the code below:
spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
The number of tasks always depends on the number of partitions.
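A self-contained sketch of that setting in action (toy data; note that adaptive query execution may coalesce small shuffle partitions even further):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-partitions").getOrCreate()

# Cap the number of partitions (and hence tasks) produced by wide
# transformations such as groupBy.
spark.conf.set("spark.sql.shuffle.partitions", 8)

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

# After the shuffle, the result has at most 8 partitions.
print(agg.rdd.getNumPartitions())

spark.stop()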
Your question is that each worker node may have duplicates, and the count operation will just sum the results, right?
Ans: after getting the result from each worker node, the driver program will aggregate them again and then give the final result.
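You can even watch those per-partition partial results yourself; a small sketch with the RDD API, using two partitions to stand in for the two workers in the video:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-counts").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10_000), numSlices=2)

# Each worker counts only its own partition...
partials = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(partials)       # e.g. [5000, 5000]

# ...and the driver aggregates the partial results into the final count,
# which is exactly what count() does under the hood.
print(sum(partials))  # 10000
print(rdd.count())    # 10000

spark.stop()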
Excellent!
Thank you so much :)
waiting for your pyspark playlist:)
Very soon :)
Hi, what tools did you use to create this type of video? Please help.
Final cut pro, CapCut, PowerPoint and After effects.
@@mr.ktalkstech thank you for the info
Will this same topic be covered on the other channel (Mr.K Talks Tech Tamil)?
No brother :)
Respect++
You did not talk about RDDs.