RDD vs DataFrame vs Datasets | Spark Tutorial Interview Questions

  • Published: 27 May 2018
  • As part of our Spark interview question series, we want to help you prepare for your Spark interviews. We discuss various Spark topics such as lineage, reduceByKey vs. groupByKey, YARN client mode vs. YARN cluster mode, etc. In this video we cover the
    differences between RDDs, DataFrames, and Datasets.
    Please subscribe to our channel.
    Here is a link to other Spark interview questions
    • Spark Interview Questions
    Here is a link to other Hadoop interview questions
    • 1.1 Why Spark is Faste...

Comments • 74

  • @ganeshdhareshwar6053 4 years ago

    Nicely explained. Thank you for your effort in gathering this information and publishing it. These videos are much needed.

  • @TusharKakaiya 3 years ago

    Really helpful content. Much appreciated.

  • @apekshatrivedi8689 3 years ago

    Very nice explanation. Your videos really help me while preparing for interviews. Highly recommended. Thank you!

  • @souravsinha5330 1 year ago +1

    Nice and clear explanation, straight to the point. Thanks.

  • @ravinderkarra3187 6 years ago +5

    A DataFrame also serializes the data into off-heap storage in binary format and then performs transformations directly on off-heap memory, since Spark understands the schema. It also provides the Tungsten physical execution back-end, which explicitly manages memory and dynamically generates byte-code for expression evaluation. So memory management is handled better here.

    • @akp7-7 1 year ago

      Yes, so is a DataFrame faster compared to a Dataset?
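
    A minimal sketch of the off-heap Tungsten memory this thread refers to; it only shows how it is switched on, assuming a local session, and the app name and size are illustrative:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("offheap-demo")                          // hypothetical app name
        .master("local[*]")
        .config("spark.memory.offHeap.enabled", "true")   // let Tungsten allocate off-heap
        .config("spark.memory.offHeap.size", "1g")        // illustrative pool size
        .getOrCreate()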

  • @bhargavhr1891 6 years ago +2

    Again a very nice video, thanks. It would be great if you provided pseudo-code or simple code syntax for each abstraction, so that the understanding becomes very clear (see the sketch below).
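
    A minimal Scala sketch of the three abstractions, along the lines the commenter asks for; the case class and data are hypothetical:

      import org.apache.spark.sql.SparkSession

      case class Person(name: String, age: Int)   // hypothetical schema

      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._

      val people = Seq(Person("Ann", 32), Person("Bob", 25))   // illustrative data

      // RDD: low-level, no schema known to Spark, no Catalyst optimization
      val rdd = spark.sparkContext.parallelize(people)
      rdd.filter(_.age > 30).collect()

      // DataFrame: rows with a schema, optimized by Catalyst/Tungsten
      val df = people.toDF()
      df.filter(df("age") > 30).show()

      // Dataset: typed JVM objects on top of the same optimizer
      val ds = df.as[Person]
      ds.filter(_.age > 30).show()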

  • @someshmungikar4466 3 years ago

    Cool, great answer sir... thanks!

  • @rameshgangabathula6221 4 years ago +3

    Nice explanation. Can you please explain, in another video, how to do checkpointing and how to resume a failed Spark job (one that failed due to an action/transformation failure or exceeded executor memory)?

  • @RajKumar-zw7vt 5 years ago +1

    Nice video, bro...

  • @rahulshandilya880 4 years ago +2

    When should we use a DataFrame, when a Dataset, when an RDD, and when Spark SQL / SparkSession?

  • @RahulRawat-wu1vv 5 years ago

    Will it serialize the data or deserialize it? As far as I know, deserialization is the conversion of a byte stream to a Java object. Please correct me if I am wrong.

  • @max6447 3 years ago +1

    Thanks, your videos are very useful!

  • @nehabansal677 5 years ago +1

    Great content... Very helpful for interviews.

    • @DataSavvy 5 years ago

      Thanks... Please watch the full Spark interview series.

  • @shubhamkumar-uz7ux 5 years ago +1

    Very informative... just one thing, the voice is too low in the video.

  • @naresh5273 5 years ago +1

    Thank you.
    Last time in my interview,
    the interviewer asked me the same question...

    • @DataSavvy 5 years ago +1

      Thanks Kartik... I am happy this content was useful to you... Can you share the other questions your interviewer asked?

  • @arundhingra4536 5 years ago +4

    @Data Savvy - A small correction: at 8:10 you mentioned that we cannot do map, join, and other operations on a DataFrame.

    • @sharathchandra5314 5 years ago

      Data Savvy should have said that if we use map, join, and other operations that take higher-order functions, then we lose the optimizations done by the Spark framework.
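
    A minimal Scala sketch of the distinction drawn in the reply above; the data and column names are hypothetical. A typed lambda is a black box to Catalyst, while an equivalent Column expression can be analyzed and optimized:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.col

      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._

      val ds = Seq(("a", 30), ("b", 40)).toDF("name", "age").as[(String, Int)]

      // Opaque to Catalyst: the optimizer cannot see inside the lambda,
      // so column pruning and predicate pushdown are lost.
      val viaLambda = ds.map { case (_, age) => age + 1 }

      // Visible to Catalyst: an expression the optimizer can analyze.
      val viaExpr = ds.select(col("age") + 1)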

  • @ajaypratap4025 5 years ago +3

    When should we use a DataFrame and when a Dataset?

    • @owaisshaikh3983 3 years ago

      When you need strict data types, use a Dataset; otherwise a DataFrame is more convenient.

  • @anilcvs1 6 years ago +2

    Please show some real-time scenarios in the videos.

    • @DataSavvy 6 years ago +1

      Hi Anil, thanks for the comment... Give me some examples and I will create a video for them... Please subscribe to the channel.

    • @akshathab.s6751 5 years ago

      Hi, a real-time scenario like industry-level data processing. I mean performance tuning when there is a large amount of data to process: which component is preferred, DataFrame, Dataset, or RDD? In which situation is each approach suitable?

  • @raviyadav-dt1tb 7 months ago +1

    Please provide AWS questions and answers.

    • @DataSavvy 6 months ago +1

      Planning that.

    • @raviyadav-dt1tb 6 months ago +1

      @DataSavvy can you please provide interview questions on Scala programming? Several times I have got rejections due to the Scala programming round.

    • @DataSavvy 6 months ago +1

      I will add that to my list... Need to work out when I can start on it.

    • @raviyadav-dt1tb 6 months ago +1

      @DataSavvy please do, thank you 🙏

    • @DataSavvy 6 months ago +1

      Thank you

  • @SuperSazzad2010 5 years ago

    Hi, please throw some light on the claim that DataFrames make use of Java serialization. And what is off-heap memory used for?

    • @DataSavvy 5 years ago

      A DataFrame can use Java serialization or Kryo... Off-heap memory is used for shuffle.

    • @akp7-7 1 year ago

      @DataSavvy So does the DataFrame use Java serialization, or is it the Dataset that uses it?

  • @yeoreumkwon 5 years ago +1

    If I understood correctly, PySpark does not support Datasets because Python is not a type-safe language, right?

    • @DataSavvy 5 years ago

      You are right, my friend... The Dataset philosophy is different from the philosophy of the Python language.

    • @yeoreumkwon 5 years ago +1

      @DataSavvy So I will have to learn Scala to use Spark Datasets.
      Thank you very much for your effort. I really enjoy your series.

    • @DataSavvy 5 years ago

      If you are particular about using Datasets then yes, use a JVM language, Scala or Java... I am happy that you like the series... Please suggest more topics that you are interested in.
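
    A minimal Scala sketch of the compile-time type safety discussed in this thread, which is what requires a JVM language; the case class is hypothetical:

      import org.apache.spark.sql.SparkSession

      case class Person(name: String, age: Int)   // hypothetical schema

      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._

      val ds = Seq(Person("Ann", 32)).toDS()

      ds.filter(_.age > 21)      // checked at compile time: age is an Int field
      // ds.filter(_.salary > 0) // would not compile: Person has no salary field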

  • @ambikaiyer29 5 years ago +1

    Hi, can you please share details on why the Dataset API is not available in Python?

    • @DataSavvy 5 years ago +1

      Because Python is not type-safe... and Datasets are type-safe.

    • @vinodmani3900 5 years ago

      Thanks @DataSavvy. I was after this for some time, wondering why all the APIs in PySpark are for DataFrames. So it means that in PySpark we need to code with DataFrames, right?

  • @chiranjeevikatta8116 3 years ago +1

    I am new to the Spark and big data world. I chose to use/learn PySpark because I am familiar with Python. I got to know that Python is not type-safe and does not support Datasets. Can someone say whether PySpark is used to build real-world applications, or do I need to learn Scala/Java?
    Thanks.
    - Great video

    • @DataSavvy 3 years ago +1

      PySpark is used in a lot of real-world projects... I have generally seen people doing ML or data-analysis projects using PySpark, while data-ingestion teams use Scala... However, this is not always true... It usually boils down to the comfort level of the developers and the team composition.

    • @chiranjeevikatta8116 3 years ago +1

      @DataSavvy thank you. Good work. I took online courses, but I got more clarity after watching your videos.

    • @DataSavvy 3 years ago +1

      Thanks Chiranjeevi... Very happy to hear that :)

  • @rakeshkumarsharma3920 5 years ago

    How can we automate an incremental import in Sqoop?

    • @lokeshmvs 5 years ago +1

      Use "sqoop job --create" to create a job for the incremental import; the Sqoop job will then store the metadata of the incremental load.

    • @rakeshkumarsharma3920 5 years ago

      @lokeshmvs I am asking: if I have to do an incremental import for 50 tables, and the job has to execute at midnight, then how do I achieve that?
      Please let me know with examples.

  • @Pratik0917 4 months ago

    So people aren't using Datasets everywhere, then?

  • @TheBjjninja 5 years ago +2

    Can you fix your volume, please?

  • @alexanderkorchagin67 4 years ago +1

    ERROR! Actually DataFrame, Dataset, RDD is the correct order of performance, from most effective to least effective. A DataFrame performs better than a Dataset because it does not use serialization and deserialization when working with the data.

    • @DataSavvy 4 years ago

      Hi Alexander, excuse me if the explanation was not clear. The message was that Datasets use encoders, which are a more efficient way of serializing and deserializing data than Kryo or the default serialization, so shuffle operations become more efficient, since they involve serialization and deserialization. In general there is not much difference between DataFrame and Dataset performance these days... Could you elaborate on a DataFrame not using serialization and deserialization? I did not get what you meant there.

    • @akp7-7 1 year ago

      @DataSavvy I recently learnt that in a DataFrame, serialization is managed by the Tungsten binary format (encoders), whereas in a Dataset serialization is managed by Java serialization, so DataFrame performance is a little faster than a Dataset's.
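
    A small Scala sketch of the encoder machinery under discussion; the case class is hypothetical. A product encoder maps object fields to Tungsten's binary row format field by field, while the generic Kryo fallback stores one opaque binary blob:

      import org.apache.spark.sql.Encoders

      case class Person(name: String, age: Int)   // hypothetical

      // Product encoder: field-level schema backed by Tungsten binary rows
      println(Encoders.product[Person].schema.simpleString)  // struct<name:string,age:int>

      // Kryo fallback: a single opaque binary column, no per-field optimization
      println(Encoders.kryo[Person].schema.simpleString)     // struct<value:binary>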

  • @luckyomprakash8437 5 years ago

    What questions can one ask to check Spark RDD experience?

  • @iftekharkhan3254 10 months ago

    The sound is low.

  • @akashchaudhary6953 2 years ago

    Sir, take my interview and make me famous.. 💌

  • @dharmendrabhojwani 5 years ago +2

    Very low voice.

  • @meswapnilspal 5 years ago

    The volume is very low.

  • @mikecmw8492 6 years ago +3

    Please remake this video with real examples. For example, open a Spark 2 REPL and load a file full of data. Show how to create an RDD, a DF, and a DS, then show some operations on each. Having just text on the screen will not help in an interview. Most interviews are now hands-on, especially with big data. Thank you.

    • @DataSavvy 5 years ago

      Sure Mike, will do this... adding it to my next steps... I appreciate these suggestions.

  • @5669ashish 5 years ago +1

    Brother, can you make the videos in Hindi? That way they can be understood in one go.

    • @DataSavvy 5 years ago +2

      Not everyone will understand Hindi, brother... By the way, what did you not understand in one go? I can help.

  • @knightganesh 4 years ago

    The voice is very low, please work on it.

    • @DataSavvy 4 years ago

      You are right... I have improved this in the new videos.

  • @sergeibatiuk3468 4 years ago

    How many times did he say 'you know'?

  • @rishabhjain8558 3 years ago

    You talk about extra things, not the topic.
    Also, your concepts are not clear.
    And try to give some examples.
    You are always showing a PPT and dictating.

    • @DataSavvy 3 years ago

      Thanks for your wise comments... Will try to improve...