Apache Spark | Spark Interview Question | Spark Optimization { PartitionBy & Repartition }
- Published: 15 Sep 2024
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
⭐ Kite is a free AI-powered coding assistant for Python that will help you code smarter and faster. The Kite plugin integrates with all the top editors and IDEs (Atom, PyCharm, VS Code, Sublime, Vim, and Spyder) to give you smart completions and documentation while you're typing. I've been using Kite for 6 months and I love it! www.kite.com/g....
-----------------------------------------------------------
Apache Spark Interview Question - In this video, we will learn to answer the interview question: what is the difference between partitionBy and repartition in Apache Spark? We will understand this Spark optimization technique with a small demo.
-----------------------------------------------------------
Dataset for you to work with:
github.com/aza...
Blog link to learn more on Spark:
www.learntospark.com
LinkedIn profile:
/ azarudeen-s-83652474
FB page:
/ learntospark-104523781...
#apachespark #spark #sparkoptimization
Awesome video. Greatly explained
Good work, Azar. I have gone through your videos from the last year, and there are a lot of them. Could you please make one consolidated series of big data engineer interview questions and answers, starting from HDFS, Sqoop, Hadoop, Apache Spark, Hive, Impala, etc.? Similarly, one more series of lectures for beginners to understand each concept in detail. If this is too big a task, could you please provide a list of sources, a learning path, and a clear-cut strategy and resources for interviews as well as for learning, and create separate playlists for them?
Hi, you should clearly explain why the source file shows only 1 partition after reading. Simply saying that the file is small does not give clarity to the audience. Technically, a lot of factors contribute to how many partitions are created, such as parallelism, block size, input format and splitting logic, data locality, and cluster configuration.
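For a DataFrame read from a file, the single partition can be reasoned about with a small calculation. The sketch below is a simplified pure-Python model, not Spark code. It assumes the default values of spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB), a single splittable file, and a hypothetical parallelism of 8 cores; the function names are illustrative only.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Approximate the split size Spark SQL picks for file-based sources.

    Assumption: defaults of spark.sql.files.maxPartitionBytes and
    spark.sql.files.openCostInBytes; `parallelism` stands in for the
    number of cores available to the job.
    """
    # Each file is padded with the "open cost" before sizes are summed.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def estimate_partitions(file_size, parallelism):
    """Estimated input partitions for ONE splittable file (toy model)."""
    split = max_split_bytes([file_size], parallelism)
    return max(1, -(-file_size // split))  # ceiling division


print(estimate_partitions(1 * MB, 8))     # 1  (small file: one split is enough)
print(estimate_partitions(10 * MB, 8))    # 3
print(estimate_partitions(1024 * MB, 8))  # 8
```

In this model a 1 MB file is smaller than the minimum split size of 4 MB, so it lands in one partition. Real jobs can differ: non-splittable compression, many small files, and cluster settings all change the result.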
Superly explained...Thanks for sharing the knowledge
Very informative and helpful videos... Thanks for sharing the knowledge 👍👍
Thanks for your support :)
Thanks Azar. It’s really helpful 👍
Excellent Azar
After watching the complete video, I understood that repartition should be used when reading the file and partitionBy should be used when writing. repartition divides the data into the number of partitions passed to it.
But I didn't get clarity on partitionBy. Could someone please explain it?
Superb
Why do we need files in memory (the repartition use case)? Isn't a file always used when storing on disk?
very nice explanation. 👌 👍
Thanks for your support
voice was so good azar
Thank you, please share note books
Useful.
Hi Azar, which is faster (Repartition vs PartitionBy vs Coalesce) if we are dealing with 1 TB data?
Thank you
They do different things. partitionBy writes the DataFrame into separate subdirectories based on the partition column, while repartition and coalesce deal with the distribution of data in memory among executors. coalesce is used to reduce the number of partitions and tries to avoid a full shuffle, so it will be faster than repartition, which can either increase or decrease the partition count but does so with a full shuffle. But if you reduce the number of partitions beyond what an executor can hold, it can lead to OOM. So it depends on the problem you are solving, i.e. whether you want to reduce skew or spread the data more widely.
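The difference between the two in-memory operations can be pictured with a toy model. This is a pure-Python sketch, not the Spark API (in Spark the calls are df.repartition(n) and df.coalesce(n)): partitions are plain lists, repartition is modelled as a round-robin redistribution of every row, and coalesce as a merge of whole neighbouring partitions. The sample data is made up for illustration.

```python
from typing import List

Partition = List[int]


def repartition(parts: List[Partition], n: int) -> List[Partition]:
    """Toy full shuffle: every row is redistributed round-robin."""
    out: List[Partition] = [[] for _ in range(n)]
    rows = [row for part in parts for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def coalesce(parts: List[Partition], n: int) -> List[Partition]:
    """Toy narrow merge: whole partitions are concatenated, never split."""
    if n >= len(parts):
        # Without a shuffle, coalesce cannot increase the partition count.
        return [list(p) for p in parts]
    out: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i * n // len(parts)].extend(part)
    return out


# Hypothetical skewed input: one large partition and three small ones.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

print([len(p) for p in repartition(skewed, 2)])  # [5, 5]
print([len(p) for p in coalesce(skewed, 2)])     # [7, 3]
```

The model shows the trade-off described above: the full shuffle touches every row but evens out the skew, while the merge moves less data and keeps whatever imbalance the input already had.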
Hello Azarudeen, your videos are awesome.
I have one question: can you please provide me with some code? I want to decrypt files using a private key. All my files are PGP-encrypted, and both the files and the private key are stored in an S3 bucket. Please help me with the code.
If possible, can you please share a sample file?
Bro, is partitionBy an action or a transformation?
partitionBy is not related to RDD. It's neither action nor transformation. Right Azar?
partitionBy is a method of DataFrameWriter, not of DataFrame, so it is neither an action nor a transformation.
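Since partitionBy only affects how the writer lays files out on disk, its effect is easiest to see in the output directory. The sketch below imitates that layout in plain Python with CSV files. It is an illustration under assumptions, not Spark code (in Spark the call would be df.write.partitionBy("country") followed by a format method such as parquet or csv); the rows, the column name and the part-00000.csv file name are hypothetical.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_partitioned(rows, column, base_dir):
    """Write one subdirectory per distinct value of `column` (toy model)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)

    for value, group in groups.items():
        # Hive-style directory name: <column>=<value>
        subdir = os.path.join(base_dir, f"{column}={value}")
        os.makedirs(subdir, exist_ok=True)
        # The partition column lives in the path, not in the data file.
        fields = [name for name in group[0] if name != column]
        path = os.path.join(subdir, "part-00000.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)

    return sorted(os.listdir(base_dir))


# Hypothetical sample rows.
rows = [
    {"id": 1, "country": "IN"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": "IN"},
]

with tempfile.TemporaryDirectory() as tmp:
    print(write_partitioned(rows, "country", tmp))  # ['country=IN', 'country=US']
```

A reader that filters on the partition column can then skip whole subdirectories, which is the usual reason for partitioning at write time.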