Questions:
1) Why did you shift from MapReduce development to Spark development?
2) How is the Spark engine different from the Hadoop MapReduce engine?
3) What are the steps for Spark job optimization?
4) What are an executor and an executor core? Explain in terms of processes and threads.
5) How do you identify that your Hive script is slow?
6) When do we use partitioning and bucketing in Hive?
7) The small-file problem in Hive? ---> skewness
8) How do you handle a high-cardinality column in a dataset, with respect to Hive?
9) How do you handle code merging with other teams? Explain your development process.
10) Again, the small-files issue in Hadoop?
11) Metadata size in Hadoop?
12) How is Spark differentiated from MapReduce?
13) A class has 3 fields: name, age, salary, and you create a series of objects from this class. How do you compare the objects? (I didn't get the question exactly.)
14) Scala: what is === in join conditions? What does it mean?
Hope it helps!
Yes Thank you 👍
Thank you,..
Thank you
Question 13 is not clear even to me.
Thank you guys, you are a big reason why I got a job as an AWS data engineer. Spark and optimizations are the most-asked questions, and partitioning and bucketing with Hive as well. I would also add that the interviewers are similar to a real setting, because they usually point you in the direction of the answer they are looking for, so always listen to their follow-ups.
How are you feeling now? Did you transition your career from some other tech? Do you face complexities in your day-to-day activities?
How do you optimize a job? (A quick config sketch follows below.)
- Check the input data size and output data size and correlate them to the operating cluster memory
- Check the input partition size, output partition size and number of partitions, along with the shuffle partitions, and decide the number of cores
- Check for disk/memory spills during stage execution
- Check the number of executors used for the given cluster size
- Check the available cluster memory and the memory in use by the application/job
- Check the average run time of all stages in the job, to identify any data-skewed tasks
- Check whether the table is partitioned or bucketed by a column
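Not an official recipe, just a minimal Scala sketch of how those sizing decisions often end up expressed as Spark config; every figure here (100 GB input, ~128 MB target partitions, 5 cores per executor, 20g memory) is an assumption for illustration, not a rule.

```scala
import org.apache.spark.sql.SparkSession

// Assumed example: ~100 GB input at ~128 MB per partition -> roughly 800 partitions,
// so shuffle partitions are set in the same ballpark and executors are sized to match.
val spark = SparkSession.builder()
  .appName("sizing-sketch")
  .config("spark.sql.shuffle.partitions", "800")      // roughly inputSize / targetPartitionSize
  .config("spark.executor.cores", "5")                // common rule of thumb, not a hard rule
  .config("spark.executor.memory", "20g")             // leave headroom for overhead and caching
  .config("spark.dynamicAllocation.enabled", "true")  // let Spark scale executor count with load
  .getOrCreate()
```

Whether numbers like these are right for a given job still comes back to the checks above: spills, stage-level skew and actual memory use in the Spark UI.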
perfect!
Great..
Hello .
Can you guide me where I can learn all these concepts
Is it possible to explain all of this in detail? Ideally with an example.
I really appreciate you posting this interview in the public domain. This is a really good one. It would be great to see a video on the process of optimizing a job.
Since this is a mock interview, the interviewers should have given feedback at the end of the call itself, so it's helpful for viewers.
Really great video. It would have been even better if you could answer the questions the candidate was not able to answer, like what the symptoms of a job are that tell you to increase the number of executors versus the memory per executor. Can anyone please answer here, so it benefits other candidates? Thanks a lot for this video.
@Data Savvy
It can be watched at one stretch. Really helpful. 👍🏻🙌🏻
Hadoop is meant for handling big files in small numbers, and the small-file problem arises when the file size is less than the HDFS block size (64 or 128 MB). Moreover, handling a bulk number of small files increases pressure on the NameNode when we have the option to handle big files instead. So in Hadoop, file size matters a lot, which is why partitioning and bucketing came into the picture. Correct me if I made a mistake.
Partitioning and bucketing are related to YARN (the processing side of Hadoop).
HDFS small files explained: blog.cloudera.com/the-small-files-problem/ (the storage side of Hadoop).
Also, to handle a huge number of small files we need to increase the NN heap (~1 GB per 1 million blocks), which then causes GC issues and makes things complicated.
The default block size is 128 MB, so when small files are created by partitioning, a lot of storage goes to waste, and it requires horizontal scaling (which defeats the purpose of distribution).
But we can configure the block size as well, right?
@@kartikeyapande5878 Yes,
When dealing with small files in a distributed storage system with a default block size of 128MB, indeed, there can be inefficiencies and wasted storage space due to the space allocated for each block. This issue is commonly known as the "small files problem" in distributed storage systems like Hadoop's HDFS.
Here are a few strategies to mitigate this problem:
1. **Combine Small Files**: One approach is to periodically combine small files into larger ones. This process is often referred to as file compaction or consolidation. By combining multiple small files into larger ones, you can reduce the overhead of storing metadata for each individual file and make better use of the storage space.
2. **Adjust Block Size**: Depending on your workload, you might consider adjusting the default block size. While larger block sizes are more efficient for storing large files, smaller block sizes can be more suitable for small files. However, this adjustment requires careful consideration since smaller block sizes can increase the overhead of managing metadata and may impact performance.
3. **Use Alternate Storage Solutions**: Depending on your specific requirements, you might explore alternative storage solutions that are better suited for managing small files. For example, using a distributed object storage system like Amazon S3 or Google Cloud Storage might be more efficient for storing and retrieving small files compared to traditional block-based storage systems.
4. **Metadata Optimization**: Optimizing the metadata management mechanisms within your distributed storage system can help reduce the overhead associated with storing small files. Techniques such as hierarchical namespace structures, metadata caching, and efficient indexing can improve the performance and scalability of the system when dealing with small files.
5. **Compression and Deduplication**: Employing compression and deduplication techniques can help reduce the overall storage footprint of small files. By compressing similar files or identifying duplicate content and storing it only once, you can optimize storage utilization and reduce wastage.
6. **Object Storage**: Consider using object storage solutions that are designed to efficiently store and retrieve small objects. These systems typically offer features such as fine-grained metadata management, scalable architectures, and optimized data access patterns for small files.
Each of these strategies has its own trade-offs in terms of complexity, performance, and overhead. The most suitable approach depends on the specific requirements and constraints of your application and infrastructure.
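Of these, compaction is the one most often done directly in Spark. A minimal sketch, assuming hypothetical HDFS paths and an assumed target file count:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Hypothetical input path containing thousands of small files.
val df = spark.read.parquet("hdfs:///data/events/raw/")

// Rewrite as fewer, larger files; 80 is just an assumed figure aiming at ~128 MB per file.
df.repartition(80)
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/events/compacted/")
```

Running something like this periodically keeps both the NameNode object count and the per-query file-listing overhead down.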
Ur interview is very helpful.
Keep up the good work 👍👍👍
Thanks Chaitanya :)
Amazing .. thanks @Data Savvy for your efforts :)
Also, Spark does dynamic executor allocation, so you don't need to pass 800 executors as input. Size your job by running test loads.
Nice video. The purpose of using '===' while joining is to make sure that we are comparing the right values (the join key values) with the right data type as well. Please correct me if my understanding is wrong.
You are right... Using more keywords here will help in giving better answer
Triple equals (===) is a method defined on the Column class in Scala/Spark that is specifically designed to compare columns in DataFrames.
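A minimal sketch of what that looks like in a join; the tiny datasets and column names below are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("triple-equals-demo").getOrCreate()
import spark.implicits._

val employees = Seq((1, "Asha"), (2, "Ravi")).toDF("emp_id", "name")
val salaries  = Seq((1, 50000), (2, 60000)).toDF("emp_id", "salary")

// '==' compares the two Column objects themselves and yields a plain Boolean,
// whereas '===' (Column.equalTo) builds a column expression Spark evaluates per row.
val joined = employees.join(salaries, employees("emp_id") === salaries("emp_id"))
joined.show()
```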
@@DeekshithnagHi.. are you working as a data engineer?
Hi sir, it's really helpful for me because I have faced many of the questions you asked there. Thank you so much, sir.
Please make more videos for an advanced-level Spark series.
This is very useful. Please make more videos like this.
Thanks Kranthi... We will create more videos like this
Hello, I was asked the following questions in an AWS developer interview:
Q1. We have *sensitive data* coming in from a source and API. Help me design a pipeline to bring in data, clean and transform it and park it.
Q2. So where does pyspark come into play in this?
Q3. Which all libraries will you need to import to run the above glue job?
Q4. What are shared variables in pyspark
Q5. How to optimize glue jobs
Q6. How to protect sensitive data in your data.
Q7. How do you identify sensitive information in your data.
Q8. How do you provision a S3 bucket?
Q9. How do I check if a file has been changed or deleted?
Q10. How do I protect my file having sensitive data stored in S3?
Q11. How does KMS work?
Q12. Do you know S3 glacier?
Q13. Have you worked on S3 glacier?
Wish I didn't have the haircut that day😂😂😀😀😂😂😂
This is fine Bro :)
arindam patra the king of datasavvy
It would be really helpful if you could make more such mock interviews. I think we have only 3 live interviews on the channel yet.
Very helpful too; please do one or two more interviews at the same level.
Great effort by interviewer and interviewee.
Thanks for your kind words... Yup more interviews are planned
Awesome video. Thank you for putting this out. It's helpful.
Thanks Sukanya
amazing job , really interesting thank you for sharing this interview with us.
Thank you very much, this is very useful!!!
Can you post a video now for data engineering interviews, and also post question sets as well?
Very Informative.. Thanks a lot Guys...
Thanks Rahul... Sathya and Arindham helped with this :)
Awesome Harjeet sir!!
I could watch a thousand such videos at a stretch 😁
Very informative!!!
Can't wait for long, please upload as much as u can sir.
Thanks Rohit... Yes, I will try my best :)
It's fine, but it should be more of a discussion than a question-answer session.
Hi Gaurav... Are you suggesting the way of interviewing is not appropriate?
@@DataSavvy no no .. definitely not that.
I was saying a discussion style of interviewing is more effective, in my opinion.
I feel more comfortable and able to express myself that way.
Very much helpful. Thanks a lot for uploading.
Keep up the excellent work👍 expecting more such videos.
Thanks :)
Can you do interview questions on Scala? I believe these are really important for cracking tough interviews.
Yes Rahul... I will plan for that
This guy was interviewing for a TCS data science role once.
which Guy
@Data Savvy Nice one very informative
Thanks Rajesh... More videos like this will be posted
This is good, but these are just basic questions for a DE. It would be great if you shared some code and advanced logic questions from day-to-day DE work.
There will be more videos... With more complex problems.
Sir, can you please make a video on which language is best for data engineering: Scala or Python?
Nice discussion
Good logical questions 👌👌
Thanks David :)
Awesome. Keep it up 👍🏻
Thanks mate
That broadcast join example looked cooked up 😆😆
Attach a feedback video to this; it will go a long way in knowing what should have been answered better.
Thank you sir ..it is very helpful
Thanks Nivedita
Hey, it was really helpful. Thank you 👍
Sir.. can you please provide us Hadoop & Spark developer with Scala videos, from beginner to perfect?
It would be very, very useful for us, sir, because I checked all kinds of videos on YouTube and no one covers it...
And sir, whatever does exist is paid courses like EDUREKA, Intellipath, Simplilearn, etc. So, sir, please make it soon; students need it a lot.
Students are able to learn and gain more and more knowledge, but they don't have money 😭
Sure Randeep.. I will plan for spark course
We can use bucketing when there are a lot of small files... Correct me if I'm wrong...
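For reference, this is roughly how bucketing is declared when writing from Spark; the path, table name, column and bucket count below are assumptions for illustration, and bucketing mainly caps the number of files per partition rather than fixing small files by itself:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical source data with a high-cardinality key.
val orders = spark.read.parquet("hdfs:///data/orders/")

orders.write
  .mode("overwrite")
  .bucketBy(32, "customer_id")    // 32 buckets is an assumed figure
  .sortBy("customer_id")
  .saveAsTable("analytics.orders_bucketed")  // bucketBy requires saveAsTable, not a plain path
```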
i like all videos of yours :) very informative
Thanks Vidya... I am happy that these videos are useful for you :)
Can you teach us big data from scratch? Your videos are really useful.
Sure... Which topics are you looking for
@@DataSavvy from basics.
Ok... Planning that :)
@@DataSavvy Thanks..waiting for it.
@@DataSavvy sir do it love u advance
Informative. Also, it would be good if the correct answers were mentioned as well.
Thanks Nitish... Most of answers are available as dedicated video on channel
@@DataSavvy Sorry, I could not follow. Do we have separate videos with the answers?
Please make a course on databricks certification for pyspark
Sure Mani... Added in my list. Thanks for suggestion :)
and please make
1. hands on schema evolution with all formats
2. databricks delta lake
3. how to connect with different datasources
You are the only creator we can expect this from.
Seems like most questions are on how to optimize jobs, not much on the technical side. Do data engineer interviews go like this, or are other technical questions asked too?
Can you explain this question with a video?
What is the best way to join 3 tables in Spark?
Sure Nivedita...
Pls cover estimating spark cluster size on cloud infrastructure like aws
Sure Abhinee... Thanks for suggestion
Hello sir, this is the first time I am getting in touch with you. It was a great interview; I have seen so many tricky questions. I am preparing for a Spark administrator interview. Do you have some Spark tuning interview questions and some advanced interview questions related to Spark?
Hi Tushar... I am happy this video is useful for you. There is a playlist, on my channel for spark performance tuning and I am also working on a new series... Let me know if u need anything extra
This is really helpful
Good questions
Actually, asking to compare Spark and Hadoop is incorrect; they should ask MR vs Spark. Also, Spark has improved insanely, so please, interviewers, retire this question.
It's correct. HDFS has its own framework and Spark has its own. HDFS works by batch processing, whereas Spark works by in-memory computation. Take all of this as an info dump.
@@KiranKumar-um2gz Spark is a framework in itself, and it also does batch processing. Please don't spread half-knowledge.
True
I will still ask this question. When comparing Hadoop with Spark, it is assumed that we are comparing two processing engines, not a processing engine to a file system. We expect a candidate to be sane enough to understand that. Also, Spark is built on top of the MR concept, so this is a very good question to test your understanding of it.
@@hasifs8139 Anyone who has picked up Spark in the last 3 years does not need to understand MR to be good at Spark or data processing. Spark's implementation is too different from MR to make any comparison. Do you do the same or similar steps to optimize joins in Spark and MR? No. You can keep asking this question and rejecting good candidates.
Nice and informative video. Can you please answer the question asked in the interview: how to compare two Scala objects based on one field's value?
Hi... Answers to all the questions are available as separate videos on the Data Savvy channel... Let me know if you find anything missing and I will add it... I will add answers to the Scala questions also.
Sir, can you please do more interviews like this as it is helpful ..
Yes Chetan, I am already planning for more videos on this
Please add videos for fresher also
Can you please explain the roles and responsibilities of a Spark and Scala developer?
Any new video? I would appreciate it.
just posted ruclips.net/video/pTFkjdNng-U/видео.html
Sir, this was really helpful and informative... I'm a fresher seeking an opportunity to work with big data technologies like Hadoop, Spark, Kafka, etc. Please guide me on how to enter the corporate world starting with these technologies, as very few firms hire freshers for them.
Have you done any projects?
@@satyanathparvatham4406 Worked on a Hive project, and other than that I keep practicing some Spark scenarios on Databricks.
Very informative. Thanks :) Can you suggest some small Spark project for portfolio building?
What is your profile and which language do you use?
I'm a Scala Spark engineer.
I'm familiar with Cassandra, Kafka, HDFS, and Spark.
Same request... can you please suggest some small Spark project using Python?
I am currently working on creating videos explaining an end-to-end project.
Hi Data Savvy, unable to join Telegram, authorization issue....
Hello sir, your videos are very helpful... I am unable to join your Telegram group... please help me, sir.
Hi sir, will you please explain the difference between map and foreach?
Will create video on this...
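In the meantime, a small sketch of the usual distinction, using plain Scala collections (the same split applies to Spark, where map is a lazy transformation and foreach is an action run for its side effects):

```scala
val nums = List(1, 2, 3)

// map transforms every element and returns a new collection.
val doubled: List[Int] = nums.map(_ * 2)   // List(2, 4, 6)

// foreach applies a side-effecting function to every element and returns Unit.
nums.foreach(n => println(n))

// In Spark terms: rdd.map(...) builds a new (lazy) RDD,
// while rdd.foreach(...) triggers execution on the workers purely for side effects.
```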
In Spark it is not possible to apply bucketing without partitioning the tables. So if we do not find a suitable column to partition the table on, how will we proceed with optimization?
Can you answer the question of how you decide when to increase the number of executors versus the memory, please? :)
You have to find out whether your job is compute-intensive or IO-intensive... you will get hints of that in the log files... I realised after taking this mock interview that I should create a video on that... I am working on it... thanks for asking this question 😀
@@DataSavvy I do not completely agree with what you said, or maybe the question is a bit off. To start with, increasing executor memory has a limit restricted by the total memory available (depending on the instance type you are using). Memory tuning at the executor level needs to consider (1) the amount of memory used by your objects (you may want your entire dataset to fit in memory), (2) the cost of accessing those objects, and (3) the overhead of garbage collection (if you have high turnover in terms of objects). Now, when we say increasing the number of executors, I consider this the scaling needed to meet the job's requirements. IO-intensive doesn't directly mean increase the number of executors; rather, it means increasing the level of parallelism (dependent on the underlying part files/sizes etc.), which starts at the executor level. So I would rather look at this as optimizing the executor config for a (considerably) small dataset first, and then assessing the level of scaling needed, viz. increasing the number of executors to meet the scale of the actual data. I would like to discuss this further with you.
Hi, can you do an interview on Scala and Spark?
Hi, I am sorry, I did not understand your question completely... are you asking to do a mock interview with me on Scala and Spark? If yes, please drop me a message at aforalgo@gmail.com and we can work this out.
What's the purpose of using '===' while joining? Nice video, btw.
Thanks Nikhil... Will post a video about the answer in few days :)
@@DataSavvy Can you post something like this on Airflow?
@@DataSavvy My best guess is that = and == are already reserved operators.
= is the assignment operator, like val a = 5.
== is the comparison operator when the object types are the same, like how we use it in a normal string comparison, for example.
=== is the comparison operator when the object types are different, like when we compare two columns from different datasets/DataFrames.
I'm moving from web development to Spark development. Any inputs on that, please? Can I switch without any experience working with Spark?
What is the answer to the Scala question at 31:00, eliminating duplicate objects in a set on the basis of name?
Could you please answer: how do you achieve optimization in a Hive query on columns that have high cardinality?
Maybe we can also use vectorization in such scenarios, and as he said, bucketing on the id column can help drastically. Apart from that, choosing the right file format can work as well.
Can you guys tell which companies would be interviewing in this pattern?
Hi, we can use the distinct method in Scala for getting unique names, right?
Absolutely.
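To expand on that: .distinct on a collection of case class instances compares all fields, so deduplicating on name alone needs a keyed approach. A small sketch, assuming the class from question 13 looks roughly like this:

```scala
case class Employee(name: String, age: Int, salary: Double)

val people = Seq(
  Employee("Asha", 30, 50000),
  Employee("Asha", 31, 52000),
  Employee("Ravi", 28, 45000)
)

// Keep one object per name by grouping on that single field.
val uniqueByName = people.groupBy(_.name).values.map(_.head).toList

// On Scala 2.13+ the same thing is a one-liner:
// val uniqueByName = people.distinctBy(_.name)
```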
I liked some of the questions but not the answers completely, say the HDFS block size, the memory used per file in the NameNode, and the type-safe equality. How do you plan to publish the right content/answers?
Hi Nithin... This was a completely impromptu interview... Are you looking for the answer to any specific question?
If we use 800 executors for 100 GB of input data like you've mentioned in your example, Spark would be busy managing the high volume of executors rather than processing data. So it could be better to use one executor per 5-10 GB, which would leave us with 10-20 executors for 100 GB of data. If you have any explanation for using 800 executors, then do post it.
let me look into this
Not 800 executors - he said to use 800 cores for maximum parallelism. Keep five cores in each executor, resulting in 160 executors in total.
Sathiyan is a genius
I agree :)
Can you make one interview video for a big data developer with 2-3 years of experience?
Sure Pratik... That is already planned... This interview also fits in the less-than-4-years experience category.
@4:40, is the interviewer expecting the answer to be DAG?
How can we make only the name the deciding factor for removing duplicates in a set, instead of all the fields, in Scala?
I have been working on big data for quite some time now, but I don't know why I can't clear interviews.
Can you please suggest a correct way to determine executor cores and executor memory by looking at the input data, without hit and trial and without the thumb rule that assumes 5 is the optimal number of cores? Is there any other way to determine it?
It purely depends on the size of the input data and the kind of processing/computation, like a heavy join versus a simple scan of the data. In general, for worker nodes (data nodes) of size {40 cores (80 with Hyper-Threading); 500 GiB memory}, use roughly 1 vCore for every 5 GB.
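As a worked example of that rule of thumb (all numbers assumed for illustration):

```scala
// Back-of-the-envelope sizing using ~1 vCore per 5 GB, as suggested above.
val inputGb          = 100   // assumed input size
val gbPerCore        = 5     // rule of thumb from the comment above
val coresPerExecutor = 5     // commonly assumed default

val totalCores = math.ceil(inputGb.toDouble / gbPerCore).toInt           // 20 cores
val executors  = math.ceil(totalCores.toDouble / coresPerExecutor).toInt // 4 executors
println(s"~$totalCores cores across ~$executors executors")
```

It is still only a starting point; heavy shuffles or wide joins can easily need more than a simple per-GB rule suggests.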
Hi, it would be nice if you gave the correct answer whenever an answer is wrong.
Any resource link for doing a Spark-related mini project?
I am creating a series for spark project... Will post soon
Wow, kindly make a video on Hive integration with Spark.
Hmmmm... What is the problem that you are facing in integration...
@@DataSavvy In Databricks, when I am creating a managed Hive table with the "USING JSON" keyword it creates fine, but when I am creating an external table it shows an error.
@@DataSavvy Why doesn't the USING keyword work with external tables?
I am a fresher; can I start my career with big data?
I am new to the data engineering field. Which language should I choose, Scala or Python? Which language has more job roles?
go with python (pyspark I suggest)
How much experience does this candidate have?
Did the participant consent to posting this online? If not, you should blur his face.
Yes.. It was agreed with participants
Hi sir, if you are providing interview guidance personally, please let me know... I'll contact you personally... I need guidance.
Join our group... It's very vibrant and people help each other a lot
Please share group link we will join
chat.whatsapp.com/KKUmcOGNiixH8NdTWNKMGZ
@@hasnainmotagamwala2608 Hi, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr
@@DataSavvy It is giving a "not authorized to access" error.
Hi, why doesn't Hadoop support small files?
He meant to ask... Why small files are not good for Hadoop... Hadoop can store small files though
@@DataSavvy Thanks, but why is it not good? A performance issue? How exactly?
@@msbhargavi1846 If you have too many small files, then your NameNode will have to keep metadata for each of them; each piece of metadata takes around 100-150 bytes, so just think how much memory the NameNode has to spend to manage millions of small files.
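To put rough numbers on that (a sketch only; the per-object size above is an approximation, and the file count is assumed):

```scala
// Assume 10 million small files, each small enough to occupy a single block.
val smallFiles     = 10000000L
val objectsPerFile = 2L     // roughly one file object plus one block object in the NameNode
val bytesPerObject = 150L   // approximate figure quoted above

val heapBytes = smallFiles * objectsPerFile * bytesPerObject
println(f"~${heapBytes / math.pow(1024, 3)}%.1f GB of NameNode heap just for metadata")  // ~2.8 GB
```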
@@rajeshkumardash1222 yes got it.... thanks
Thanks Rajesh
Small file issue 16:45
Hi how are you
Please create a new WhatsApp group, as it is full.
Hi XYZ, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr
Everyone is in nightwear 😂😂😂
There is no such thing as in-memory processing... Memory is used to store data that can be reused. Four years back I was grilled on this "in-memory processing" stuff at one of the Big 4 firms.
You should Google the meaning of in-memory processing once... It doesn't mean that your memory will process the data for you 😂😂😂😂 Even kids in school know that the CPU does the actual computations.
No, "in-memory" usually refers to your RAM, where data is stored and computed on in parallel; hence it is fast.
Can you just let me know how you got grilled for this?
Hadoop works in batches by moving data through HDFS, while Spark does its operations in memory: the data is cached in memory and all operations are done live. Unless you were asked questions about Hadoop, I don't see how you could get grilled for this...
@@EnimaMoe Spark does not do operations "in memory"; in fact, no processing happens in memory. I am not talking about the concept, I am talking about the phrase that is used, "in-memory processing".
For those advising me to Google about Spark, just an FYI: it's been years since I started using Spark.
You can always challenge whatever is written or told by someone. Take care.
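For what it's worth, what people usually mean by "in-memory" in Spark is the pattern below (a sketch with a hypothetical path): intermediate data kept in executor RAM via cache/persist and reused across actions, while the CPU still does the actual computation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()

// Hypothetical input; cache() keeps the data in executor memory after the first action.
val logs = spark.read.parquet("hdfs:///data/logs/").cache()

val errors   = logs.filter("level = 'ERROR'").count() // first action materialises the cache
val warnings = logs.filter("level = 'WARN'").count()  // reuses cached data instead of re-reading disk
```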
Q1: What made you move to Spark from Hadoop/MR? Both the question and the answer are wrong. Hadoop is a file system, whereas Spark is a framework/library to process data in a distributed fashion. There is no such thing as one being better than the other. It's like comparing apples and oranges.
Hi Omkar... Hadoop is a combination of MapReduce and HDFS. HDFS is the file system and MR is the processing engine... The interviewer wanted to know why you prefer Spark compared to the Hadoop MR style of processing... You will generally see people who work in big data using this kind of language, generally people who started with Hadoop and then moved to the Spark processing engine later.
@@DataSavvy Can you tell me where and how you do MR using Hadoop? And can you elaborate on what exactly you mean by "Hadoop MR style of programming"? If the interviewer is using this language, clearly he has learnt from and limited his knowledge to tutorials. Someone who has worked on large-scale clusters using EMR or their own EC2 cluster won't use such vague language.
Hi Omkar... Plz read en.m.wikipedia.org/wiki/MapReduce ... or Google Hadoop map reduce...
@@DataSavvy I understand what MapReduce is... it's a paradigm, not a framework/library as you are suggesting. The interviewer asked: why do you prefer Spark compared to the Hadoop MR style of processing? This question itself is wrong, as Spark is a framework that allows you to process data using the map-reduce paradigm. There is no such thing as a "Hadoop MR style of processing".
I see... You seem to have an issue with the words used to frame the question... I think we should focus on the intent of the question rather than thinking too much about the words...
This one can't be considered a Spark interview.
Hi Krishna... Please share your expectation... I will cover that as another video
Please don't use the word "we"; use "I".
Never use 'I' unless you are working alone.
@@hasifs8139 Always use "I" in an interview.
@@MrTeslaX Thanks for the advice, luckily I rarely have to go for interviews nowadays. Personally, I don't hire people who use a lot of 'I' in their answers, because they are just ignoring the existence of others in the team. Most likely they are a bad team player and don't want such people in my team.
@@hasifs8139 Thanks for your response. I live and work in the US and have attended FANG companies and other small companies as well. One of the most important things I was told to keep in mind was to highlight my contribution and achievement and not talk about the overall work done. Be specific and talk about the work you have done and use I while talking about them.
@@MrTeslaX Thanks for your explanation. Yes, you must definitely highlight your contributions and achievements within the project. All I am saying is that you should not give the impression that you did it all on your own. Also what difference does it make, if you are living in the US or Germany(where I am) or anywhere else?
Terrible accent… 😮
Hey 👋.. thank you for this awesome initiative. Btw, one thing: the WhatsApp link does not work!!