cache and persist in spark | Lec-20

  • Published: 7 Sep 2024
  • In this video I have talked about cache and persist in Spark.
    Directly connect with me on:- topmate.io/man...
    spark.apache.o...
    spark.apache.o...
    For more queries reach out to me on my below social media handle.
    Follow me on LinkedIn:- / manish-kumar-373b86176
    Follow Me On Instagram:- / competitive_gyan1
    Follow me on Facebook:- / manish12340
    My Second Channel -- / @competitivegyan1
    Interview series Playlist:- • Interview Questions an...
    My Gear:-
    Rode Mic:-- amzn.to/3RekC7a
    Boya M1 Mic-- amzn.to/3uW0nnn
    Wireless Mic:-- amzn.to/3TqLRhE
    Tripod1 -- amzn.to/4avjyF4
    Tripod2:-- amzn.to/46Y3QPu
    camera1:-- amzn.to/3GIQlsE
    camera2:-- amzn.to/46X190P
    Pentab (Medium size):-- amzn.to/3RgMszQ (Recommended)
    Pentab (Small size):-- amzn.to/3RpmIS0
    Mobile:-- amzn.to/47Y8oa4 (You absolutely should not buy this one)
    Laptop -- amzn.to/3Ns5Okj
    Mouse+keyboard combo -- amzn.to/3Ro6GYl
    21 inch Monitor-- amzn.to/3TvCE7E
    27 inch Monitor-- amzn.to/47QzXlA
    iPad Pencil:-- amzn.to/4aiJxiG
    iPad 9th Generation:-- amzn.to/470I11X
    Boom Arm/Swing Arm:-- amzn.to/48eH2we
    My PC Components:-
    intel i7 Processor:-- amzn.to/47Svdfe
    G.Skill RAM:-- amzn.to/47VFffI
    Samsung SSD:-- amzn.to/3uVSE8W
    WD blue HDD:-- amzn.to/47Y91QY
    RTX 3060Ti Graphic card:- amzn.to/3tdLDjn
    Gigabyte Motherboard:-- amzn.to/3RFUTGl
    O11 Dynamic Cabinet:-- amzn.to/4avkgSK
    Liquid cooler:-- amzn.to/472S8mS
    Antec Prizm FAN:-- amzn.to/48ey4Pj

Comments • 52

  • @gauravpp5768
    @gauravpp5768 10 months ago +3

    Brother, such a detailed and clear explanation is not available anywhere else. Appreciate it, brother.

  • @shubhamwaingade4144
    @shubhamwaingade4144 7 months ago +1

    Awesome explanation, brother!!!! Going to watch the entire playlist now!
    Please keep posting such amazing content...

  • @Speed-0-meter
    @Speed-0-meter 11 months ago +3

    Please make a video on
    1. Dataset vs DataFrame vs RDD
    2. Hive with Spark
    3. Checkpointing

  • @aditya9c
    @aditya9c 5 months ago

    What a superb video. Keep making these videos for us... huge thanks.

  • @OmprakashSingh-sv5zi
    @OmprakashSingh-sv5zi 11 months ago +1

    Very well done, super lecture, Manish bhai.

  • @apoorvkansal9266
    @apoorvkansal9266 3 months ago

    Notes on what Manish Bhai explained regarding caching of DataFrames:
    As we know, DataFrame objects get stored in the Executor Memory Pool during computations (i.e. transformations), but this memory is short-lived: it holds the data only while a transformation is running on that DataFrame.
    So if our use case is such that joins keep happening on a particular DataFrame and we need to:
    1. keep the required partitions of the DataFrame around,
    2. save execution time,
    3. avoid re-calculation (i.e. executor effort/resources), and
    4. reduce cost,
    then we can use cache() so that the required partitions of the DataFrame get stored in the Storage Memory Pool. We already know that the Storage Memory Pool is used for storing the intermediate state of tasks (especially during joins) as well as cached data, and memory eviction is done using the LRU method.

  • @harshitasija5801
    @harshitasija5801 9 months ago +1

    Really great explanation

  • @mallangivinaykumar9500
    @mallangivinaykumar9500 8 months ago

    Your way of teaching is awesome. Please make videos in the English language.

  • @subrahmanyanmn9189
    @subrahmanyanmn9189 10 months ago +2

    Manish Bhai, thanks a lot for the video series. It's really helpful.
    One doubt: since Spark removes data stored in memory on an LRU basis, is it still compulsory/necessary to unpersist?

    • @ritikpatil4077
      @ritikpatil4077 9 months ago +2

      Good question, but the main use case I can think of is this:
      if you have a big DataFrame (it still needs to be smaller than the storage memory 😁) and you are using it repeatedly, you can cache it,
      and after you are done you can unpersist it to free up storage memory.

  • @luvvkatara1466
    @luvvkatara1466 5 months ago

    Thank you Manish bhai for this lecture

  • @payalbhatia6927
    @payalbhatia6927 1 month ago

    Where could we see high CPU usage in the Spark UI when data comes from disk into memory and gets deserialized?

  • @tanushreenagar3116
    @tanushreenagar3116 7 months ago

    Best content sir 🎉

  • @ajaypatil1881
    @ajaypatil1881 9 months ago

    Great video, bhaiya

  • @da_nalyst
    @da_nalyst 11 months ago

    Manish bhai, thank you for the great explanation. One doubt: the DataFrame that is cached — is it stored separately on every executor, or in the memory of just one executor?

  • @algorhythm3103
    @algorhythm3103 5 months ago

    Nice video sir!

  • @ajinkyadeshmukh2343
    @ajinkyadeshmukh2343 2 months ago

    Manish bhai, cache()'s default storage level is MEMORY_ONLY — please check the Spark documentation once.

  • @dkr1998
    @dkr1998 11 months ago +1

    Can you please make a video on serialization and deserialization?

  • @upendrareddy5880
    @upendrareddy5880 1 month ago

    Hi Manish,
    Which one is correct?
    For df.show(), does it fetch one complete partition and then take only the default 20 rows from it,
    or does it fetch the default 20 rows from across all the partitions (not from a single partition)?

  • @aneksingh4496
    @aneksingh4496 9 months ago

    Very, very informative. Can you please make videos on Spark scenario-based questions?

  • @Ks-yi8ky
    @Ks-yi8ky 5 months ago

    Superb, sir.

  • @adarsharora6097
    @adarsharora6097 6 months ago

    Thanks Manish!

  • @pratyushkumar8567
    @pratyushkumar8567 9 months ago +1

    Manish bhaiya, please explain a bit more specifically: as we know, cached data sits in the Storage Memory Pool. So when we call .cache() on a DataFrame, is the result stored in the Storage Memory Pool or the Executor Memory Pool? Please explain.

  • @AIBard-pk5bs
    @AIBard-pk5bs 3 months ago

    Hi Manish. Thanks for your video. It's very good.
    Could you please clear one doubt of mine?
    At 20:43, why are 4 partitions created even though the size of the DataFrame is only around 11 MB, which is smaller than 128 MB (the default partition size)?
    I don't see spark.sql.files.maxPartitionBytes being updated anywhere in the video, so what is the reason for this behaviour?

    • @manish_kumar_1
      @manish_kumar_1  3 months ago

      Default parallelism is set to 4; I think that is why 4 partitions were shown there. Anyway, I will check that.

    • @AIBard-pk5bs
      @AIBard-pk5bs 3 months ago

      @@manish_kumar_1 Thanks for your reply

  • @pramod3469
    @pramod3469 11 months ago

    Thanks Manish, very well explained. When we choose MEMORY_ONLY, you said that if partition size > memory available then it will recalculate.
    What will be recalculated here?

    • @syedadnan4910
      @syedadnan4910 11 months ago +1

      The partition that is not stored in memory will be recalculated; as we know, Spark caches only at full-partition granularity.

  • @sanooosai
    @sanooosai 5 months ago

    thank you sir

  • @madanmohan6487
    @madanmohan6487 11 months ago

    Please make a video on dimension tables vs fact tables

  • @Speed-0-meter
    @Speed-0-meter 11 months ago

    No one is talking about Spark Streaming. It would be good if you threw some light on it.

  • @amankumarsrivastava6852
    @amankumarsrivastava6852 11 months ago

    Bhai, where can I practice all these PySpark DataFrame API functions? I keep forgetting them. Please suggest a site or practice platform where I can practice.

  • @pawansalwe1926
    @pawansalwe1926 9 months ago

    Thank you.

  • @mmohammedsadiq2483
    @mmohammedsadiq2483 11 months ago

    Can you provide the Spark documentation URL and reference books? Thanks.

  • @dkr1998
    @dkr1998 11 months ago

    Hi Manish, how much does Jio pay for 3 YOE in data engineering?

  • @arpittrivedi6636
    @arpittrivedi6636 11 months ago

    Sir, do you teach big data from scratch? Please share a video if possible.

  • @raajnghani
    @raajnghani 11 months ago

    How can we un-broadcast a DataFrame?

  • @piyushjain5852
    @piyushjain5852 11 months ago

    What do you mean by recalculating the partition when it exceeds the memory?

    • @manish_kumar_1
      @manish_kumar_1  11 months ago

      The partition that did not get stored in memory has to be recalculated with the help of the DAG, if it is needed again.

    • @piyushjain5852
      @piyushjain5852 11 months ago

      @@manish_kumar_1 But even after recalculating, it will still have to be stored, right? So how is that handled when it could not be stored the first time?

    • @manish_kumar_1
      @manish_kumar_1  11 months ago

      @@piyushjain5852 It will not store it. Every time you use the same DataFrame's data, the data will be recalculated. It is the Storage Memory Pool that is full, not the Executor Memory Pool. That is exactly why there is a performance impact.

  • @amanpirjade9
    @amanpirjade9 11 months ago

    Make video on SQL for data engineering

  • @shobhitsharma2137
    @shobhitsharma2137 11 months ago +1

    Sir, I want to talk to you in person.

  • @raghavendrakulkarni3920
    @raghavendrakulkarni3920 11 months ago

    Why so much time? Why such a big gap?

  • @sankuM
    @sankuM 11 months ago +1

    Bhai @@manish_kumar_1, some issue with the video quality... only 360p is coming up even though it's been 7 mins since the video upload!! 🤨🤨

    • @manish_kumar_1
      @manish_kumar_1  11 months ago +1

      Maybe the 4K version had not been processed yet at that point.