
Data Engineering Interview | Apache Spark Interview | Live Big Data Interview

  • Published: 28 Sep 2020
  • This video is part of the Spark Interview Questions Series.
    Many subscribers have asked me to share what an actual Big Data interview looks like. In this video we cover what usually happens in a Big Data or Data Engineering interview.
    There will be more videos covering different aspects of Data Engineering Interviews.
    Here are a few Links useful for you
    Git Repo: github.com/harjeet88/
    Spark Interview Questions: • Spark Interview Questions
    If you are interested in joining our community, please join the following groups
    Telegram: t.me/bigdata_hkr
    Whatsapp: chat.whatsapp.com/KKUmcOGNiix...
    You can drop me an email for any queries at
    aforalgo@gmail.com
    #apachespark #sparktutorial #bigdata
    #spark #hadoop #spark3 #bigdata #dataengineer

Comments • 233

  • @tradingtransformation346
    @tradingtransformation346 2 years ago +73

    Questions:
    1) Why did you shift from MapReduce development to Spark development?
    2) How is the Spark engine different from the Hadoop MapReduce engine?
    3) What are the steps for Spark job optimization?
    4) What are executors and executor cores? Explain in terms of processes & threads.
    5) How do you identify that your Hive script is slow?
    6) When do we use partitioning and bucketing in Hive?
    7) Small file problem in Hive? ---> Skewness
    8) How do you handle a high-cardinality column in a dataset, with respect to Hive?
    9) How do you handle code merging with other teams? Explain your development process.
    10) Again, small files issue in Hadoop?
    11) Metadata size in Hadoop?
    12) How is Spark differentiated from MapReduce?
    13) A class has 3 fields: name, age, salary, and you create a series of objects from this class. How do you compare the objects? ----(I didn't get the question exactly; see the sketch after this list)
    14) Scala: what is === in join conditions? What does it mean?
    Hope this helps!
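
    A minimal Scala sketch of what question 13 seems to be asking (the Employee class and values here are illustrative, not from the video): case classes get structural equality generated for all fields, and keying by a single field is one way to deduplicate on name alone.

    case class Employee(name: String, age: Int, salary: Double)

    object CompareAndDedup {
      def main(args: Array[String]): Unit = {
        val a = Employee("Asha", 30, 50000.0)
        val b = Employee("Asha", 30, 50000.0)
        println(a == b)  // true: case class equality compares field by field, not references

        // To deduplicate on ONE field only (name), key the collection by that field:
        val people = Set(
          Employee("Asha", 30, 50000.0),
          Employee("Asha", 31, 60000.0),
          Employee("Ravi", 25, 40000.0))
        val uniqueByName = people.groupBy(_.name).map { case (_, group) => group.head }
        println(uniqueByName)  // keeps one Employee per distinct name
      }
    }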

  • @bramar1278
    @bramar1278 3 years ago +24

    I must really appreciate you posting this interview in the public domain. This is a really good one... it would be really great to see a video on the process of optimizing a job.

  • @Nasirmah
    @Nasirmah 2 years ago +12

    Thank you guys, you are a big reason why I got a job as an AWS data engineer. Spark and optimizations are the most asked questions, partitioning and bucketing with Hive as well. I would also add that the interviewers are similar to a real setting because they usually point you in the right direction of the answer they are looking for, so always listen to their follow-ups.

    • @karna9156
      @karna9156 1 year ago

      How are you feeling now? Did you transition your career from some other tech? Do you face complexities in your day-to-day activities?

  • @JaiGurujiAshwi
    @JaiGurujiAshwi 2 years ago +2

    Hi sir, it's really helpful for me because I have faced a lot of the questions you asked there, thank you so much sir.
    Please make one more video on an advanced-level Spark series.

  • @AhlamLamo
    @AhlamLamo 3 years ago +1

    Amazing job, really interesting. Thank you for sharing this interview with us.

  • @amansinghshrinet8594
    @amansinghshrinet8594 3 years ago +5

    @Data Savvy
    It can be watched at one stretch. Really helpful. 👍🏻🙌🏻

  • @sukanyapatnaik7633
    @sukanyapatnaik7633 3 years ago +1

    Awesome video. Thank you for putting this out. It's helpful.

  • @ramkumarananthapalli7151
    @ramkumarananthapalli7151 3 years ago

    Very much helpful. Thanks a lot for uploading.

  • @mayanksrivastava4121
    @mayanksrivastava4121 1 year ago

    Amazing .. thanks @Data Savvy for your efforts :)

  • @rohitrathod8150
    @rohitrathod8150 3 years ago +6

    Awesome, Harjeet sir!!
    I could even watch a thousand such videos at a stretch 😁
    Very informative!!!
    Can't wait for long, please upload as many as you can, sir.

    • @DataSavvy
      @DataSavvy  3 years ago +2

      Thanks Rohit... Yes, I will try my best :)

  • @MoinKhan-cg8cu
    @MoinKhan-cg8cu 3 years ago

    Very, very helpful, and please have one or two more interviews at the same level.
    Great effort by the interviewer and interviewee.

    • @DataSavvy
      @DataSavvy  3 years ago

      Thanks for your kind words... Yup more interviews are planned

  • @sathyansathyan3213
    @sathyansathyan3213 3 years ago +1

    Keep up the excellent work👍 expecting more such videos.

  • @sujaijain4511
    @sujaijain4511 2 years ago +1

    Thank you very much, this is very useful!!!

  • @chaitanya5869
    @chaitanya5869 3 years ago +3

    Ur interview is very helpful.
    Keep up the good work 👍👍👍

    • @DataSavvy
      @DataSavvy  3 years ago +1

      Thanks Chaitanya :)

  • @MageswaranD
    @MageswaranD 3 years ago +35

    How do you optimize a job?
    - Check the input data size and output data size and correlate them to the operating cluster memory
    - Check input partition size, output partition size and number of partitions, along with shuffle partitions, and decide the number of cores
    - Check for disk/memory spills during stage execution
    - Number of executors used for the given cluster size
    - Available cluster memory and memory in use by the application/job
    - Check the average run time of all stages in the job, to identify any data-skewed stage tasks
    - Check whether the table is partitioned by column or not / bucketing
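
    A minimal Spark (Scala) sketch of checking a few of the figures in this list on a running job (the input path and the 400-partition value are illustrative):

    import org.apache.spark.sql.SparkSession

    object OptimizationChecks {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("optimization-checks").getOrCreate()

        val df = spark.read.parquet("hdfs:///data/input")          // hypothetical input path
        println(s"input partitions: ${df.rdd.getNumPartitions}")   // drives task parallelism and the core count you need

        println(spark.conf.get("spark.sql.shuffle.partitions"))    // post-shuffle parallelism, 200 by default
        spark.conf.set("spark.sql.shuffle.partitions", "400")      // tune to input size and available cores

        // disk/memory spills and skewed stage task times show up per stage in the Spark UI / event logs
        spark.stop()
      }
    }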

  • @rahulpandit9082
    @rahulpandit9082 3 years ago +2

    Very Informative.. Thanks a lot Guys...

    • @DataSavvy
      @DataSavvy  3 years ago +1

      Thanks Rahul... Sathya and Arindham helped with this :)

  • @kranthikumarjorrigala
    @kranthikumarjorrigala 3 years ago +1

    This is very useful. Please make more videos like this.

    • @DataSavvy
      @DataSavvy  3 years ago

      Thanks Kranthi... We will create more videos like this

  • @deepikalamba1058
    @deepikalamba1058 3 years ago +1

    Hey, it was really helpful. Thank you 👍

  • @shubhamshingi4657
    @shubhamshingi4657 3 years ago +12

    It would be really helpful if you could make more such mock interviews. I think we have only 3 live interviews on the channel so far.

  • @vidyac6775
    @vidyac6775 3 years ago

    I like all of your videos :) very informative

    • @DataSavvy
      @DataSavvy  3 years ago

      Thanks Vidya... I am happy that these videos are useful for you :)

  • @ajithkannan522
    @ajithkannan522 2 years ago +2

    Since this is a mock interview, the interviewers should have given feedback at the end of the call itself, so it's helpful for viewers.

  • @surajyejare1627
    @surajyejare1627 2 years ago

    This is really helpful

  • @tradingtexi
    @tradingtexi 3 years ago +18

    Really great video. It would have been even better if you could answer the questions the candidate was not able to answer, like: what are the symptoms of a job from which you decide that you should increase the number of executors or the memory per executor? Can anyone please answer here, so that it may benefit candidates? Thanks a lot for this video.

    • @shivankchaturvedi210
      @shivankchaturvedi210 1 year ago +1

      Bhai, he has already made a video on how to set the executor config; see that video and you will get the answer.

  • @rohitkamra1628
    @rohitkamra1628 3 years ago +1

    Awesome. Keep it up 👍🏻

  • @kaladharnaidusompalyam851
    @kaladharnaidusompalyam851 3 years ago +11

    Hadoop is meant for handling big files in small numbers, and the small file problem arises when file sizes are less than the HDFS block size [64 or 128 MB]. Moreover, handling a bulk number of small files may increase pressure on the NameNode when we have the option to handle big files instead. So in Hadoop, file size matters a lot, which is why partitioning and bucketing came into the picture. Correct me if I made a mistake.

    • @sank9508
      @sank9508 3 years ago

      Partitioning and bucketing are related to YARN (the processing side of Hadoop).
      HDFS small files explained: blog.cloudera.com/the-small-files-problem/ (the storage side of Hadoop)
      Also, to handle a huge number of small files we need to increase the NN heap (1 million blocks -> ~1 GB), which then causes GC issues and makes things complicated.
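
      A minimal Spark (Scala) sketch of the usual mitigation, compacting many small files into fewer, larger ones (paths and the target count of 16 are illustrative):

      import org.apache.spark.sql.SparkSession

      object CompactSmallFiles {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

          val df = spark.read.parquet("hdfs:///landing/small_files")  // directory full of tiny files
          df.coalesce(16)                                             // merge partitions without a full shuffle
            .write.mode("overwrite")
            .parquet("hdfs:///warehouse/compacted")                   // fewer, larger files -> less NameNode metadata

          spark.stop()
        }
      }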

  • @nivedita5639
    @nivedita5639 3 years ago +1

    Thank you sir ..it is very helpful

  • @saeba7528
    @saeba7528 3 years ago +2

    Sir, can you please make a video on which language is best for data engineering: Scala or Python?

  • @rajeshkumardash1222
    @rajeshkumardash1222 3 years ago +2

    @Data Savvy Nice one very informative

    • @DataSavvy
      @DataSavvy  3 years ago

      Thanks Rajesh... More videos like this will be posted

  • @abhinee
    @abhinee 3 years ago +3

    Also, Spark does dynamic executor allocation, so you don't need to pass 800 executors as input. Size your job by running test loads.
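
    A minimal sketch of what that looks like in Scala; the min/max bounds are illustrative, and on YARN this also needs the external shuffle service or spark.dynamicAllocation.shuffleTracking.enabled:

    import org.apache.spark.sql.SparkSession

    object DynamicAllocationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("dynamic-allocation-sketch")
          .config("spark.dynamicAllocation.enabled", "true")
          .config("spark.dynamicAllocation.minExecutors", "2")
          .config("spark.dynamicAllocation.maxExecutors", "100")
          .getOrCreate()

        // Spark now scales executors up and down with pending tasks,
        // so the job is sized by test runs rather than a hard-coded 800 executors.
        spark.stop()
      }
    }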

  • @davidgupta110
    @davidgupta110 3 years ago +1

    Good logical questions 👌👌

  • @kiranmudradi26
    @kiranmudradi26 3 years ago +5

    Nice video. The purpose of using '===' while joining is to make sure that we are comparing right values (join key value) and right data type as well. Please correct me if my understanding is wrong.

    • @DataSavvy
      @DataSavvy  3 years ago +1

      You are right... Using more keywords here would help in giving a better answer

    • @deekshithnag1655
      @deekshithnag1655 1 year ago

      Using three equals (===) calls a method defined on the Column class in Spark's Scala API that is specifically designed to compare columns in DataFrames.
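
      A minimal Scala sketch of why === shows up in join conditions (the column names are illustrative):

      import org.apache.spark.sql.SparkSession

      object TripleEqualsJoin {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("triple-equals-join").getOrCreate()
          import spark.implicits._

          val emp  = Seq((1, "Asha"), (2, "Ravi")).toDF("dept_id", "name")
          val dept = Seq((1, "Sales"), (2, "HR")).toDF("dept_id", "dept_name")

          // emp("dept_id") == dept("dept_id") is plain Scala equality between two Column objects
          // and yields a Boolean; === is Column.===, which builds a column expression
          // that Spark evaluates row by row inside the join.
          emp.join(dept, emp("dept_id") === dept("dept_id")).show()

          spark.stop()
        }
      }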

    • @Varnam_trendZ
      @Varnam_trendZ 9 months ago

      @@deekshithnag1655 Hi.. are you working as a data engineer?

  • @enishaeshwar7617
    @enishaeshwar7617 2 years ago

    Good questions

  • @kashifanwar4034
    @kashifanwar4034 3 years ago

    How can we make only the name the deciding factor for removing duplicates from a Set in Scala, instead of all the fields it takes?

  • @call_me_sruti
    @call_me_sruti 3 years ago +2

    Hey 👋.. thank you for this awesome initiative. Btw one thing the whatsapp link does not work!!

  • @venkataramanak8264
    @venkataramanak8264 2 years ago

    In Spark it is not possible to apply bucketing without partitioning the tables. So if we do not find a suitable column to partition the table, how will we proceed ahead with optimization?

  • @digwijoymandal5216
    @digwijoymandal5216 2 years ago +1

    Seems like most questions are on how to optimize jobs, not much on the technical side. Do data engineer interviews go like this, or are other technical questions asked too?

  • @RahulRawat-wu1vv
    @RahulRawat-wu1vv 3 years ago +10

    Can you do interview questions on Scala? I believe these are really important for cracking tough interviews.

    • @DataSavvy
      @DataSavvy  3 years ago +3

      Yes Rahul... I will plan for that

  • @chetankakkireni8870
    @chetankakkireni8870 3 years ago

    Sir, can you please do more interviews like this as it is helpful ..

    • @DataSavvy
      @DataSavvy  3 years ago

      Yes Chetan, I am already planning for more videos on this

  • @Karmihir
    @Karmihir 3 years ago +2

    This is good, but these are just basic questions for a DE. It would be great if you shared some code and advanced logic questions from daily DE work.

    • @DataSavvy
      @DataSavvy  3 years ago +2

      There will be more videos... With more complex problems.

  • @vijeandran
    @vijeandran 3 years ago

    Hi Data Savvy, unable to join Telegram, authorization issue....

  • @MrRemyguy
    @MrRemyguy 3 years ago

    I'm moving from web development to Spark development. Any inputs on that, please? Can I switch without any experience of working with Spark?

  • @sssaamm29988
    @sssaamm29988 2 years ago

    What is the answer to the Scala question at 31:00, eliminating duplicate objects in a set on the basis of name?

  • @bhavaniv1721
    @bhavaniv1721 3 years ago

    Can you please explain the roles and responsibilities involving Spark and Scala?

  • @tusharsonawane3055
    @tusharsonawane3055 3 years ago +6

    Hello sir, this is the first time I am getting in touch with you. It was a great interview; I have seen so many tricky questions. I am preparing for a Spark administrator interview; do you have some Spark tuning interview questions and some advanced interview questions related to Spark?

    • @DataSavvy
      @DataSavvy  3 years ago

      Hi Tushar... I am happy this video is useful for you. There is a playlist, on my channel for spark performance tuning and I am also working on a new series... Let me know if u need anything extra

  • @shivamgupta-bc7fn
    @shivamgupta-bc7fn 3 years ago

    Can you guys tell which companies would be interviewing in this pattern?

  • @yugandharpulicherla4078
    @yugandharpulicherla4078 3 years ago

    Nice and informative video. Can you please answer the question asked in the interview: how to compare two Scala objects based on one field's value?

    • @DataSavvy
      @DataSavvy  3 years ago +1

      Hi... All answers of the questions are available as different videos on data Savvy channel... Let me know if you find anything missing... I will add that... I will add answer to scala questions also

  • @KahinBhiKuchBhi
    @KahinBhiKuchBhi 1 year ago

    We can use bucketing when there are a lot of small files... Correct me if I'm wrong...

  • @lavanyareddy310
    @lavanyareddy310 3 years ago

    Hello sir, your videos are very helpful... I am unable to join your Telegram group... please help me, sir.

  • @dramaqueen5889
    @dramaqueen5889 2 years ago

    I have been working on big data for quite some time now, but I don't know why I can't clear interviews.

  • @msbhargavi1846
    @msbhargavi1846 3 years ago +2

    Hi, we can use the distinct method in Scala to get unique names, right??

  • @abhinee
    @abhinee 3 years ago +2

    Pls cover estimating spark cluster size on cloud infrastructure like aws

    • @DataSavvy
      @DataSavvy  3 years ago

      Sure Abhinee... Thanks for suggestion

  • @ShashankGupta347
    @ShashankGupta347 2 years ago +2

    The default block size is 128 MB; when small files get created by partitioning, a lot of storage goes to waste and more horizontal scaling is required (defeating the purpose of distribution).

    • @kartikeyapande5878
      @kartikeyapande5878 2 months ago

      But we can configure the block size as well, right?

    • @ShashankGupta347
      @ShashankGupta347 2 months ago

      @@kartikeyapande5878 Yes,
      When dealing with small files in a distributed storage system with a default block size of 128MB, indeed, there can be inefficiencies and wasted storage space due to the space allocated for each block. This issue is commonly known as the "small files problem" in distributed storage systems like Hadoop's HDFS.
      Here are a few strategies to mitigate this problem:
      1. **Combine Small Files**: One approach is to periodically combine small files into larger ones. This process is often referred to as file compaction or consolidation. By combining multiple small files into larger ones, you can reduce the overhead of storing metadata for each individual file and make better use of the storage space.
      2. **Adjust Block Size**: Depending on your workload, you might consider adjusting the default block size. While larger block sizes are more efficient for storing large files, smaller block sizes can be more suitable for small files. However, this adjustment requires careful consideration since smaller block sizes can increase the overhead of managing metadata and may impact performance.
      3. **Use Alternate Storage Solutions**: Depending on your specific requirements, you might explore alternative storage solutions that are better suited for managing small files. For example, using a distributed object storage system like Amazon S3 or Google Cloud Storage might be more efficient for storing and retrieving small files compared to traditional block-based storage systems.
      4. **Metadata Optimization**: Optimizing the metadata management mechanisms within your distributed storage system can help reduce the overhead associated with storing small files. Techniques such as hierarchical namespace structures, metadata caching, and efficient indexing can improve the performance and scalability of the system when dealing with small files.
      5. **Compression and Deduplication**: Employing compression and deduplication techniques can help reduce the overall storage footprint of small files. By compressing similar files or identifying duplicate content and storing it only once, you can optimize storage utilization and reduce wastage.
      6. **Object Storage**: Consider using object storage solutions that are designed to efficiently store and retrieve small objects. These systems typically offer features such as fine-grained metadata management, scalable architectures, and optimized data access patterns for small files.
      Each of these strategies has its own trade-offs in terms of complexity, performance, and overhead. The most suitable approach depends on the specific requirements and constraints of your application and infrastructure.
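
      For point 2, a minimal Spark (Scala) sketch: dfs.blocksize can be set on the Hadoop configuration used by the job so that newly written files use a different HDFS block size (the 256 MB value and paths are illustrative):

      import org.apache.spark.sql.SparkSession

      object CustomBlockSizeWrite {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("custom-block-size").getOrCreate()

          // files written after this use 256 MB HDFS blocks instead of the 128 MB default
          spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", (256L * 1024 * 1024).toString)

          val df = spark.read.parquet("hdfs:///data/input")
          df.write.mode("overwrite").parquet("hdfs:///data/output_256mb_blocks")

          spark.stop()
        }
      }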

  • @nitishr5197
    @nitishr5197 3 years ago +2

    Informative. Also, it would be good if the correct answers were mentioned too.

    • @DataSavvy
      @DataSavvy  3 years ago +1

      Thanks Nitish... Most of the answers are available as dedicated videos on the channel

    • @chaitanyachimakurthi2370
      @chaitanyachimakurthi2370 3 years ago

      @@DataSavvy Sorry, I could not get that; do we have a separate video with answers?

  • @biswadeeppatra1726
    @biswadeeppatra1726 3 years ago

    Can you please suggest a correct way to determine executor cores and executor memory by looking at the input data, without hit and trial, instead of the thumb rule that assumes 5 would be the optimal number of cores? Any other way to determine it?

    • @sank9508
      @sank9508 3 years ago

      It purely depends on the size of the input data and the kind of processing/computation, like a heavy join or a simple scan of the data.
      In general, for worker nodes (data nodes) of size {40 cores (80 with hyperthreading); 500 GiB memory}, allow ~1 vCore for every 5 GB.

  • @rheaalexander4798
    @rheaalexander4798 3 years ago

    Could you please answer... how to achieve optimization in a Hive query with columns that have high cardinality?

    • @sakshamsrivastava9894
      @sakshamsrivastava9894 3 years ago +1

      Maybe we can also use vectorization in such scenarios, and as he said, bucketing on the id column can help drastically; apart from that, choosing the right file format can work as well.

  • @raviteja1875
    @raviteja1875 3 years ago

    Attach a feedback video to this; it will go a long way in knowing what should have been answered better.

  • @mohammadjunaidshaik2664
    @mohammadjunaidshaik2664 3 years ago

    I am a fresher; can I start my career with big data?

  • @msbhargavi1846
    @msbhargavi1846 3 years ago

    Hi sir, will you please explain the difference between map and foreach....

    • @DataSavvy
      @DataSavvy  3 years ago +1

      Will create video on this...

  • @nivedita5639
    @nivedita5639 3 years ago +1

    Can you explain this question with a video:
    what is the best way to join 3 tables in Spark?
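
    Until there is a video, a minimal Scala sketch of one common pattern: join the two large tables on their key first and broadcast the small lookup table (table names and keys are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object ThreeTableJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("three-table-join").getOrCreate()

        val orders    = spark.read.parquet("hdfs:///data/orders")     // large
        val customers = spark.read.parquet("hdfs:///data/customers")  // large
        val country   = spark.read.parquet("hdfs:///data/country")    // small lookup table

        orders
          .join(customers, Seq("customer_id"))            // shuffle join between the two big tables
          .join(broadcast(country), Seq("country_code"))  // broadcast hint avoids shuffling the small table
          .show()

        spark.stop()
      }
    }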

  • @hgjhhj3491
    @hgjhhj3491 2 years ago

    That broadcast join example looked cooked up 😆😆

  • @arindampatra6283
    @arindampatra6283 3 years ago +14

    Wish I didn't have the haircut that day😂😂😀😀😂😂😂

  • @Anandhusreekumar
    @Anandhusreekumar 3 years ago +1

    Very informative. Thanks :) Can you suggest some small Spark project for portfolio building?

    • @DataSavvy
      @DataSavvy  3 years ago

      What is your profile and which language do you use?

    • @Anandhusreekumar
      @Anandhusreekumar 3 years ago +1

      @@DataSavvy I'm a Scala Spark engineer.
      I am familiar with Cassandra, Kafka, HDFS, Spark.

    • @sindhugarlapati2776
      @sindhugarlapati2776 3 years ago

      Same request...can you please suggest some small spark project using python..

    • @DataSavvy
      @DataSavvy  3 years ago +2

      I am currently working on creating videos and explaining an end-to-end project..

  • @ravurunaveenkumar7987
    @ravurunaveenkumar7987 3 years ago +1

    Hi, can you do an interview on Scala and Spark?

    • @DataSavvy
      @DataSavvy  3 years ago

      Hi, I am sorry, I did not understand your question completely... Are you asking whether you want to do a mock interview with me on Scala and Spark? If yes, please drop me a message at aforalgo@gmail.com and we can work this out.

  • @MANGESHpawarsm42
    @MANGESHpawarsm42 8 months ago

    Please add videos for freshers also

  • @MaheshKumar-yz7ns
    @MaheshKumar-yz7ns 3 years ago

    @4:40 is the interviewer expecting the answer DAG?

  • @rajdeepsinghborana2409
    @rajdeepsinghborana2409 3 years ago +2

    Sir, can you please provide us Hadoop & Spark developer videos with Scala, beginner to expert?
    It would be very, very useful for us, sir, because I have checked all types of videos on YouTube and no one does it...
    Or sir, the ones that exist are paid courses like EDUREKA, Intellipath, Simpllearn, etc. So, sir, please make it soon... students need it a lot.

    • @rajdeepsinghborana2409
      @rajdeepsinghborana2409 3 years ago

      Students are able to learn and gain more and more knowledge but don't have money 😭

    • @DataSavvy
      @DataSavvy  3 years ago

      Sure Rajdeep.. I will plan a Spark course

  • @adhikariaman01
    @adhikariaman01 3 years ago +1

    Can you answer the question of how you decide when to increase executors or memory, please? :)

    • @DataSavvy
      @DataSavvy  3 years ago +7

      You have to find if your job is compute intensive or IO intensive.. you will get hints of that in logs files... I realised after taking this mock interview that I should create a video on that... I am working on this.. thanks for asking this question 😀

    • @NithinAP007
      @NithinAP007 3 years ago +3

      @@DataSavvy I do not completely agree with what you said, or maybe the question looks a bit off. To start with, increasing executor memory has a limit restricted by the total memory available (depending on the instance type you are using). Memory-usage tuning at the executor level needs to consider 1) the amount of memory used by your objects (you may want your entire dataset to fit in memory), 2) the cost of accessing those objects, and 3) the overhead of garbage collection (if you have high turnover in terms of objects). Now, when we say increasing the number of executors, I consider this as scaling needed to meet the job requirements. IO-intensive doesn't directly mean increase the number of executors, but rather increase the level of parallelism (dependent on the underlying part files/sizes etc.), which starts at the executor level. So I would rather look at this as optimizing the executor config for a (considerably) small dataset, tuning the executor config first, and then assessing the level of scaling needed, viz. increasing the number of executors to meet the scale of the actual data. I would like to discuss this further with you.

  • @RanjanKumar-ue5id
    @RanjanKumar-ue5id 3 years ago +1

    Any resource link to do a spark related mini project ?

    • @DataSavvy
      @DataSavvy  3 years ago

      I am creating a series for spark project... Will post soon

  • @darshitsuthar6653
    @darshitsuthar6653 3 years ago +5

    Sir, this was really helpful and informative... I'm a fresher seeking an opportunity to work with big data technologies like Hadoop, Spark, Kafka, etc... Please guide me on how to enter the corporate world starting with these technologies, as there are very few firms that hire freshers for them...

    • @satyanathparvatham4406
      @satyanathparvatham4406 3 years ago

      Have you done any projects?

    • @darshitsuthar6653
      @darshitsuthar6653 3 years ago

      @@satyanathparvatham4406 Worked on a Hive project, and other than that I keep practicing some scenarios (Spark) on Databricks.

  • @lavuittech3136
    @lavuittech3136 3 years ago +2

    Can you teach us big data from scratch? Your videos are really useful.

    • @DataSavvy
      @DataSavvy  3 years ago +2

      Sure... Which topics are you looking for

    • @lavuittech3136
      @lavuittech3136 3 years ago +2

      @@DataSavvy from basics.

    • @DataSavvy
      @DataSavvy  3 years ago +3

      Ok... Planning that :)

    • @lavuittech3136
      @lavuittech3136 3 years ago +1

      @@DataSavvy Thanks..waiting for it.

    • @rudraganesh1507
      @rudraganesh1507 3 years ago

      @@DataSavvy Sir, do it; love you in advance

  • @NithinAP007
    @NithinAP007 3 years ago +1

    I liked some of the questions but not the answers completely. Say the HDFS block size, memory used per file in name node and the type safe equality. How do you plan to publish the right content/answers?

    • @DataSavvy
      @DataSavvy  3 years ago

      Hi Nithin... This was a completely impromptu interview... Are u looking for answer of any specific question?

  • @gauravlotekar660
    @gauravlotekar660 3 years ago +5

    It's fine... but it should be more of a discussion than a question-answer session.

    • @DataSavvy
      @DataSavvy  3 years ago

      Hi Gaurav... Are you suggesting the way of interviewing is not appropriate?

    • @gauravlotekar660
      @gauravlotekar660 3 years ago

      @@DataSavvy No no... definitely not that.
      I was saying the discussion way of interviewing is more effective, in my opinion.
      I feel more comfortable and able to express myself that way.

  • @Manisood001
    @Manisood001 3 years ago +2

    Please make a course on databricks certification for pyspark

    • @DataSavvy
      @DataSavvy  3 years ago

      Sure Mani... Added in my list. Thanks for suggestion :)

    • @Manisood001
      @Manisood001 3 years ago

      And please make:
      1. hands-on schema evolution with all formats
      2. Databricks Delta Lake
      3. how to connect with different data sources
      You are the only creator we can expect this from.

  • @pratikj2538
    @pratikj2538 3 years ago

    Can you make one interview video for Bigdata developer with 2-3 yrs of exp.

    • @DataSavvy
      @DataSavvy  3 years ago

      Sure Pratik... That is already in plan... This interview also fits in less than 4 year exp category

  • @taherkutarwadli8368
    @taherkutarwadli8368 1 year ago

    I am new to the data engineering field; which language should I choose, Scala or Python? Which language has more job roles?

  • @jeevithat6038
    @jeevithat6038 3 years ago

    Hi, it would be nice if you gave the correct answers when the candidate's answer is wrong.

  • @newbeautifulworldq2936
    @newbeautifulworldq2936 1 year ago +1

    Any new video? I would appreciate it.

    • @DataSavvy
      @DataSavvy  1 year ago

      just posted ruclips.net/video/pTFkjdNng-U/видео.html

  • @mohitupadhayay1439
    @mohitupadhayay1439 1 year ago

    This guy was giving Interview for TCS Data science role once.

  • @nikhilmishra7572
    @nikhilmishra7572 3 years ago +1

    What's the purpose of using '===' while joining? Nice video, btw.

    • @DataSavvy
      @DataSavvy  3 years ago +1

      Thanks Nikhil... Will post a video about the answer in few days :)

    • @harshavardhanreddyakiti4655
      @harshavardhanreddyakiti4655 3 years ago

      @@DataSavvy Can you post something like this on Airflow?

    • @abhirupghosh806
      @abhirupghosh806 3 years ago

      @@DataSavvy My best guess is that = and == are already reserved operators.
      = is the assignment operator, like val a = 5.
      == is the comparison operator when the object types are the same, like how we use it in a normal string comparison, for example.
      === is the comparison operator when the object types are different, like when we compare two different columns from different datasets/dataframes.

  • @abhinee
    @abhinee 3 years ago +6

    Actually, asking to compare Spark and Hadoop is incorrect; they should ask MR vs Spark. Also, Spark has improved insanely, so please, interviewers, retire this question.

    • @KiranKumar-um2gz
      @KiranKumar-um2gz 3 years ago

      It's correct. HDFS has its own framework and Spark has its own. HDFS works by batch processing whereas Spark works by in-memory computation. Consider everything as an info dump.

    • @abhinee
      @abhinee 3 years ago +1

      @@KiranKumar-um2gz Spark is a framework in itself and it also does batch processing. Please don't spread half-knowledge.

    • @alokdutta4712
      @alokdutta4712 3 years ago

      True

    • @hasifs8139
      @hasifs8139 3 years ago

      I will still ask this question. When comparing Hadoop with Spark, it is assumed that we are comparing 2 processing engines, not a processing engine to a file system. We expect a candidate to be sane enough to understand that. Also, Spark is built on top of the MR concept, so this is a very good question to test your understanding of it.

    • @abhinee
      @abhinee 3 years ago +2

      @@hasifs8139 Anyone who has picked up Spark in the last 3 years does not need to understand MR to be good at Spark or data processing. The Spark implementation is way too different from MR to make any comparison. Do you take the same or similar steps to optimize joins in Spark and MR? No. You can keep asking this question and rejecting good candidates.

  • @sheshkumar8502
    @sheshkumar8502 4 months ago

    Hi how are you

  • @anudeepyerikala8517
    @anudeepyerikala8517 3 years ago

    arindam patra the king of datasavvy

  • @msbhargavi1846
    @msbhargavi1846 3 years ago

    Hi, why doesn't Hadoop support small files?

    • @DataSavvy
      @DataSavvy  3 years ago

      He meant to ask... Why small files are not good for Hadoop... Hadoop can store small files though

    • @msbhargavi1846
      @msbhargavi1846 3 years ago

      @@DataSavvy Thanks. But why is it not good? A performance issue? How exactly?

    • @rajeshkumardash1222
      @rajeshkumardash1222 3 years ago +2

      @@msbhargavi1846 If you have too many small files, then your NameNode will have to keep metadata for each of them; each piece of metadata takes around 100-150 bytes, so just think how much memory the NameNode has to spend to manage this if you have millions of small files...
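
      A rough back-of-the-envelope in Scala for the point above (the 10-million-file count is illustrative and the ~150-byte figures are approximations; real usage varies with replication and path lengths):

      object NameNodeHeapEstimate {
        def main(args: Array[String]): Unit = {
          val smallFiles   = 10000000L        // 10 million small files, each fitting in one block
          val bytesPerFile = 150L + 150L      // ~150 B for the file object + ~150 B for its block object
          val heapGb       = smallFiles * bytesPerFile / math.pow(1024, 3)
          println(f"approx NameNode heap just for this metadata: $heapGb%.1f GB")  // ~2.8 GB
        }
      }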

    • @msbhargavi1846
      @msbhargavi1846 3 years ago

      @@rajeshkumardash1222 yes got it.... thanks

    • @DataSavvy
      @DataSavvy  3 years ago

      Thanks Rajesh

  • @sangeethanagarajan278
    @sangeethanagarajan278 3 years ago

    How much experience does this candidate have?

  • @Manisood001
    @Manisood001 3 years ago

    wow, kindly make hive integration with spark

    • @DataSavvy
      @DataSavvy  3 years ago

      Hmmmm... What is the problem that you are facing in integration...

    • @Manisood001
      @Manisood001 3 years ago +1

      @@DataSavvy In Databricks, when I am creating a managed Hive table with the "USING JSON" keyword it creates fine, but when I am creating an external table it shows an error.

    • @Manisood001
      @Manisood001 3 years ago

      @@DataSavvy Why doesn't the "USING" keyword work with external tables?

  • @ashutoshrai5342
    @ashutoshrai5342 3 years ago +2

    Sathiyan is a genius

  • @ferozqureshi5228
    @ferozqureshi5228 1 year ago +1

    If we use 800 executors for 100 GB of input data like you've mentioned in your example, Spark would then be busy managing the high volume of executors rather than processing data. So it could be better to use one executor per 5-10 GB, which would leave us using 10-20 executors for 100 GB of data. If you have any explanation for using 800 executors, then do post it.

    • @DataSavvy
      @DataSavvy  1 year ago

      let me look into this

    • @kshitizagarwal8389
      @kshitizagarwal8389 1 month ago

      Not 800 executors; he said to use 800 cores for maximum parallelism. Keep five cores in each executor, resulting in 160 executors in total.
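
      A small worked version of that arithmetic in Scala (5 cores per executor is the usual rule of thumb; the 800-core figure is the one discussed in the video):

      object ExecutorCountSketch {
        def main(args: Array[String]): Unit = {
          val totalCores       = 800
          val coresPerExecutor = 5
          val executors        = totalCores / coresPerExecutor  // 160 executors, not 800
          println(s"--num-executors $executors --executor-cores $coresPerExecutor")
        }
      }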

  • @lavakumar5181
    @lavakumar5181 3 years ago +1

    Hi sir, if you are providing interview guidance personally please let me know..I'll contact personally...I need guidance

    • @DataSavvy
      @DataSavvy  3 years ago

      Join our group... It's very vibrant and people help each other a lot

    • @subhaniguddu2870
      @subhaniguddu2870 3 years ago

      Please share group link we will join

    • @DataSavvy
      @DataSavvy  3 years ago +1

      chat.whatsapp.com/KKUmcOGNiixH8NdTWNKMGZ

    • @DataSavvy
      @DataSavvy  3 years ago

      @@hasnainmotagamwala2608 Hi, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr

    • @arunnautiyal2424
      @arunnautiyal2424 3 years ago

      @@DataSavvy it is giving not authorized to access.

  • @GauravSingh-dn2yx
    @GauravSingh-dn2yx 3 years ago

    Everyone is in nightwear 😂😂😂

  • @atanu4321
    @atanu4321 3 years ago

    Small file issue 16:45

  • @THEPOSTBYLOT
    @THEPOSTBYLOT 3 years ago

    Please create a new WhatsApp group as it is full.

    • @DataSavvy
      @DataSavvy  3 years ago +2

      Hi XYZ, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr

  • @Anonymous-fe2ep
    @Anonymous-fe2ep 10 months ago +1

    Hello, I was asked the following questions in an AWS developer interview:
    Q1. We have *sensitive data* coming in from a source and an API. Help me design a pipeline to bring in the data, clean and transform it, and park it.
    Q2. So where does PySpark come into play in this?
    Q3. Which libraries will you need to import to run the above Glue job?
    Q4. What are shared variables in PySpark? (see the sketch after this list)
    Q5. How to optimize Glue jobs?
    Q6. How to protect the sensitive data in your data?
    Q7. How do you identify sensitive information in your data?
    Q8. How do you provision an S3 bucket?
    Q9. How do I check if a file has been changed or deleted?
    Q10. How do I protect my file having sensitive data stored in S3?
    Q11. How does KMS work?
    Q12. Do you know S3 Glacier?
    Q13. Have you worked on S3 Glacier?
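
    For Q4, a minimal sketch of Spark's two shared-variable types, shown in Scala (PySpark exposes the analogous SparkContext broadcast and accumulator APIs):

    import org.apache.spark.sql.SparkSession

    object SharedVariablesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("shared-variables").getOrCreate()
        val sc = spark.sparkContext

        val lookup   = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))  // read-only copy shipped once per executor
        val badCodes = sc.longAccumulator("bad country codes")                      // counter that tasks can only add to

        val resolved = sc.parallelize(Seq("IN", "US", "XX")).map { code =>
          if (!lookup.value.contains(code)) badCodes.add(1)
          lookup.value.getOrElse(code, "unknown")
        }

        println(resolved.collect().toList)  // List(India, United States, unknown)
        println(badCodes.value)             // 1
        spark.stop()
      }
    }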

  • @seaofpoppies
    @seaofpoppies 3 years ago +2

    There is no such thing as in memory processing.. Memory is used to store data that can be reused. 4 years back I was grilled on this 'in memory processing' stuff in one of the big4 firm.

    • @arindampatra6283
      @arindampatra6283 3 years ago

      You should google the meaning of in memory processing once..It doesn't mean that your memory will process the Data for you 😂😂😂😂 Even kids there in school know that cpu does the actual computations..

    • @b__05
      @b__05 3 years ago

      No, "in-memory" is usually your RAM, where data is stored and computed on in parallel; hence it is fast.
      Can you just let me know how you got grilled for this?

    • @EnimaMoe
      @EnimaMoe 3 years ago +2

      Hadoop works in batches by moving data through HDFS, whereas Spark does its operations in memory: the data is cached in memory and all operations are done live. Unless you were asked questions about Hadoop, I don't see how you could get grilled for this...

    • @seaofpoppies
      @seaofpoppies 3 years ago

      @@EnimaMoe Spark does not do operations in memory. In fact, no processing happens in memory. I am not talking about the concept; I am talking about the phrase that is used, "in-memory processing".
      For those advising me to Google about Spark, just an FYI, it's been years that I have been using Spark.
      You can always challenge whatever is written or told by someone. Tc.

  • @omkarkulkarni9202
    @omkarkulkarni9202 3 years ago

    Q1: What made you move to Spark from Hadoop/MR? Both the question and the answer are wrong. Hadoop is a file system whereas Spark is a framework/library to process data in a distributed fashion. There is no such thing as one being better than the other; it's like comparing apples and oranges.

    • @DataSavvy
      @DataSavvy  3 years ago

      Hi Omkar... Hadoop is a combination of MapReduce and HDFS. HDFS is the file system and MR is the processing engine... The interviewer wanted to know why you prefer Spark as compared to the Hadoop MR style of processing... You will generally see people who are working in big data using this kind of language, generally people who started with Hadoop and then moved to the Spark processing engine later.

    • @omkarkulkarni9202
      @omkarkulkarni9202 3 years ago

      @@DataSavvy Can you tell me where and how you do MR using Hadoop? And can you elaborate what exactly you mean by "Hadoop MR style of programming?" If the interviewer is using this language, clearly he has learnt and limited his knowledge to tutorials. Someone who has worked on large scale clusters using EMR or his own EC2 cluster wont use such vague language.

    • @DataSavvy
      @DataSavvy  3 years ago

      Hi Omkar... Plz read en.m.wikipedia.org/wiki/MapReduce ... or Google Hadoop map reduce...

    • @omkarkulkarni9202
      @omkarkulkarni9202 3 years ago +1

      @@DataSavvy I understand what MapReduce is... it's a paradigm, not the framework/library that you are asking about. The interviewer asked this question: why do you prefer Spark as compared to the Hadoop MR style of processing? This question itself is wrong, as Spark is a framework that allows you to process data using the map-reduce paradigm. There is no such thing as a "Hadoop MR style of processing".

    • @DataSavvy
      @DataSavvy  3 years ago +1

      I see... You seem to have an issue with the words used to frame the question... I think we should focus on the intent of the question rather than thinking too much about the words...

  • @iam_krishna15
    @iam_krishna15 2 years ago

    This one can't be considered a Spark interview.

    • @DataSavvy
      @DataSavvy  2 years ago

      Hi Krishna... Please share your expectation... I will cover that as another video

  • @gauthamn1603
    @gauthamn1603 3 years ago

    Please don't use the word "we"; use "I".

    • @hasifs8139
      @hasifs8139 3 years ago +1

      Never use 'I' unless you are working alone.

    • @MrTeslaX
      @MrTeslaX 3 years ago

      @@hasifs8139 Always use I in interview

    • @hasifs8139
      @hasifs8139 3 years ago

      @@MrTeslaX Thanks for the advice, luckily I rarely have to go for interviews nowadays. Personally, I don't hire people who use a lot of 'I' in their answers, because they are just ignoring the existence of others in the team. Most likely they are a bad team player and don't want such people in my team.

    • @MrTeslaX
      @MrTeslaX 3 years ago +1

      @@hasifs8139 Thanks for your response. I live and work in the US and have attended FANG companies and other small companies as well. One of the most important things I was told to keep in mind was to highlight my contribution and achievement and not talk about the overall work done. Be specific and talk about the work you have done and use I while talking about them.

    • @hasifs8139
      @hasifs8139 3 years ago +1

      @@MrTeslaX Thanks for your explanation. Yes, you must definitely highlight your contributions and achievements within the project. All I am saying is that you should not give the impression that you did it all on your own. Also what difference does it make, if you are living in the US or Germany(where I am) or anywhere else?

  • @ldk6853
    @ldk6853 2 months ago

    Terrible accent… 😮