Spark Out of Memory Issue | Spark Memory Tuning | Spark Memory Management | Part 1
- Published: 26 Jul 2024
- This video is part of the Spark Interview Questions Series.
Spark memory issues are among the most common problems faced by developers, so during Spark interviews this is a very common question. In this video we will cover the following:
What a memory issue in Spark is
Which components can face an out-of-memory issue in Spark
Out-of-memory issues in the driver
Out-of-memory issues in the executor
How Spark's performance is impacted by Dynamic Partition Pruning
Here are a few Links useful for you
Git Repo: github.com/harjeet88/
Spark Interview Questions: • Spark Interview Questions
If you are interested in joining our community, please join the following groups
Telegram: t.me/bigdata_hkr
Whatsapp: chat.whatsapp.com/KKUmcOGNiix...
You can drop me an email for any queries at
aforalgo@gmail.com
#apachespark #sparktutorial #bigdata
#spark #hadoop #spark3
So well explained, even the images were very useful. Thank you very much!
Great, please don't stop uploading new content!!
recently discovered this channel. this is gold
Thanks Nikhil :)
True
It is a great video. Content is very useful. Keep it up man 👍🏻👍🏻👍🏻
very well explained, thank you
Lots of respect for ur content ❤️
Thanks mate
Thank you so much. I have been facing this question many times in recent days. 👍
Thanks :)
Is the 2nd Part not there yet?
Your videos are AWSUUMMM !!! :D
Great video.. perfect explanation
Your videos on troubleshooting are pretty good.
Thanks Sree Ram... :)
Neatly explained thank you....
Great information...... 👏👏👏
Very useful videos. Thank you :)
Thanks Nisha
Very useful. Please keep making more such videos
Thanks Viraaj :)
Can you please also show code to repartition and increase executor on dummy process by changing values so that you can show us the impact on the run time of the jobs ? That will be really great to understand concepts
Waiting for Part 2 :)
Will come soon :)
Pure content, great topic, informative, interactive and simple. Thank you!!
Very nice video!! thank you
Thanks Ajay
Very helpful
Waiting for part 2! :🙈
Working on it... Will post in a few weeks. I need to explain one related concept first before that video
amazing
Nice video, Sir. This is a very commonly asked interview question. Could you please make a video on other issues we face in Spark?
Sure Sambit... Do u have any other suggestion on questions?
@@DataSavvy Could you please explain how to deal with semi-structured data, from ingestion to computation?
Great video. Can you share the source of information for further reading?
U are one of the best mentors I have ever seen on YouTube. The way you explain is awesome, and all real-time questions.
If my cluster memory is 10 GB and the data we want to process is 20 GB, will it process the data? Sir, can you please explain this topic?
No, you cannot process it
You can do it using MapReduce if it is in the batch layer and does not use iterative algorithms like machine-learning algos
Can you explain me difference between yarn memory over head vs Spark reserved and user memory?
Hi Sir, Could you please make a video on the factors that decide the number of tasks, stages, and jobs created after submitting our application.
Thank you. Can you make a video about what Azure SQL is?
Great video! I had a question regarding the yarn memory overhead. When a pyspark job runs, my understanding is that python worker processes are started within the memory allocated to the executor. JVM then sends data back and forth to these python processes. Won't the allocated python objects use the memory of these python processes instead of the yarn memory overhead?
The worker nodes run on resources from YARN memory. YARN normally runs on a shared cluster, so there is always a tug of war between the tenants of the cluster for memory; as a result, one cannot always use too much memory. However, when there is ample YARN memory, a process called preemption can obtain more memory for the executors.
I have one doubt:
Are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals.
Thank you for your time.
Nice video again, Harjeet :) Hey, can you make videos on test cases in Spark/Scala as well? I have seen no one talk about it.
Hi Ravi, test cases are generally functional and use-case specific...
Ravishankar Maybe you can try using holdenkarau
Thanks for suggesting... Looks like a good resource... I will go through this github.com/holdenk/spark-testing-base
Is there any real-time Spark project? Please upload a video on it. It would be helpful.
Good videos
Thanks Ravi :)
Recruiters say that you don't have production experience and that POC Spark work will not help. How can we convince them despite having a good understanding of PySpark? Please suggest.
Dear Data Savvy,
Could you please clarify: if we go for a broadcast join, it copies the small file into every available executor's memory, right? How come it causes a driver out-of-memory exception?
That file is first brought to the driver and merged (if it has multiple partitions), then it is sent to the executors
@@DataSavvy Thanks for the answer
Thanks
@@DataSavvy Isn't broadcast done executor-to-executor, similar to BitTorrent? Please correct me if I am wrong
Can you please give an example of each OOM you have explained here? Lots of blogs give the same explanations; what is extra here? Please provide examples, it would be great.
Hi @all
I just got to know about the wonderful videos on the DataSavvy channel.
In that executor OOM - big partitions slide: in Spark, every partition is of block size only, right (128 MB)? Then how come a big partition will cause an issue?
Can someone please explain this?
A little confused here.
Even if there is a 10 GB file, when Spark reads the file it creates around 80 partitions of 128 MB. Even if one of the partitions is large, it cannot exceed 128 MB, right? Then how come OOM occurs??
Issue: container killed by YARN, Spark application exited with code 1. This is the most common issue in AWS Glue or any Spark job. Increasing spark.yarn.executor.memoryOverhead and spark.executor.memory will help, but make sure it doesn't exceed the total yarn.nodemanager memory, or else there will be a configuration issue.
Spark on Kubernetes works completely differently; this applies only to Spark on Hadoop/YARN.
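A minimal sketch of the tuning described in the comment above, at spark-submit time. The values and script name are illustrative only; note that recent Spark versions spell the overhead setting spark.executor.memoryOverhead (the spark.yarn.* form is the older name).

```shell
# Sketch only: raise executor heap and off-heap overhead.
# Keep (memory + memoryOverhead) * executors-per-node within
# what yarn.nodemanager.resource.memory-mb allows on each node.
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.memoryOverhead=1g \
  my_job.py
```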
When loading a file into a DataFrame you get an OOM error; how will you rectify it? Can we get a demo?
Hello. I have 16 crore records on which I want to use a window function, but the order by is taking a huge amount of time and giving a memory issue. Is there any alternative approach?
Nice video. Question: when we call coalesce(1), does it cause any OOM issues in either the driver or the executor? If calling this operation does not throw any OOM, what could be the reason? Please clarify.
U are right... Coalesce can also cause a memory breach in a few situations...
@@DataSavvy Thanks. In that case OOM will happen at executor side not at driver side. is my understanding correct?
Yes...
Wait... A correction here... repartition(1) can cause the issue, not coalesce(1), as coalesce will not cause a shuffle and data will stay on the same machines...
@@DataSavvyThanks. i was about to ask the same question. u replied in time. Kudos
Question: if I use PySpark, do I still get these errors? Another question: instead of collect, what other command can we use?
saveAsTextFile (or a DataFrame write) instead of collect
How would we know which file is small and which file is larger? An interviewer asked me this question.
You can list the files in the folder and see the size of each file... hdfs dfs -ls -h <path>... this is the command
thank you@@DataSavvy
But I am using an S3 bucket, so @@DataSavvy
Where is the second part ?
Part 2??
Why use RDDs in all the questions?? Why not DataFrames?
Is groupByKey also a cause of out of memory, right?
U are right... If there is skewness in the data, in the case of groupByKey we can end up facing a memory issue
How to avoid the collect operation?
You usually don't need collect... Can you give an example of where you are using it? I can suggest how to avoid it and code it the right way
I am not able to join your WhatsApp group. I am facing some issues on my local machine while setting up Spark; please let me know where to post my query
Please join telegram group and send query there... We have moved to telegram... Http://t.me/bigdata_hkr
@@DataSavvy I already dropped a mail to aforalgo@gmail.com; could you please check the issue I faced?
I am learning the concepts, but without real-time experience I am not able to get practice in data collection from various sources
I am able to clean the data well using PySpark and can do ML using Spark ML via the MLlib library
But please suggest some sources to practice data collection from various sources
Thank you
Sure, let me look into this and I will share some links... You can join our document library and Data Savvy group... U will get lots of relevant information there
Who is the person who disliked this video... I think they are frustrated with life or wife... 😀😀😀
Ha ha ha 😀
The whatsapp group is full
Yes... Please join telegram group
Can you pls share your telegram group name?