Very Helpful, Super explanation on the concept 👍
Glad it was helpful!
Good video... nice comment section. Thank you for answering people's comments. :) Extra information is always good.
Thanks! Hope it helps
I found this in my research:
Furthermore, rdd.persist(StorageLevel.DISK_ONLY) is also different from checkpoint. Though the former can persist RDD partitions to disk, those partitions are managed by the blockManager. Once the driver program finishes (meaning the thread where CoarseGrainedExecutorBackend lives stops), the blockManager stops too, and the RDD cached to disk is dropped (the local files used by the blockManager are deleted). Checkpoint, on the other hand, persists the RDD to HDFS or a local directory. Unless removed manually, the files stay on disk, so they can be used by the next driver program.
When an error occurs, the next run can read data from the checkpoint, but the downside is that checkpointing needs to execute the job twice.
That's absolutely true. Thank you for sharing the additional input 👍🏻
Hi Sir...
This is a great effort... Thank you for the deep research and understanding...
But what does it mean that checkpoint needs to run the job twice, and why?
Best explanation 👌
Glad you liked it
First of all, thanks for the detailed responses to all the questions asked. :)
I have a question:
Q1. What if we lose the checkpoint data on both the worker node and the external disc, in the absence of the DAG before those checkpoints? Is it recalculated again?
Q2. Are the checkpoint results completely copied to each and every worker node in the cluster? If yes, can any lost data be replicated from the other worker nodes?
Excellent video Raja. Just one piece of feedback: I wish the content that starts at 4:07 had come earlier. It helps to first understand a business use case and then jump to the theoretical part.
Question: How is checkpoint different from PERSIST then, since both store the dataframe on DISK?
Also, could you share a video where you write the code, so we can actually analyse the stuff?
Thanks!
Hi Raja, any comments on this??
Persist has the flexibility of choosing disk or memory for storage, whereas checkpoint is always on disk
@@rajasdataengineering7585 Oh okay. Thanks for the quick reply!
Raja, can you prepare video tutorials on the latest developments in Databricks, like DLT, Autoloader, and the change data feed mechanism? Companies nowadays are starting to bring these into their projects. A separate playlist on streaming, covering Spark Streaming and Kafka, would also be really beneficial.
Thanks a lot!!!
Sure Atharva, will cover those topics soon
Raja always amazes us with such informative content ♥️🫡
Thanks Ashutosh ❤️
Nice explanation
Thanks
Great video sir! 😇🙌
Thank you Amar 🙌
such a great video
Glad you enjoyed it
Nice explanation thank you
Glad you liked it! Keep watching
Another great video Raja!!! Questions:
1. When you say the intermediate result is stored in cache, is that each executor's on-heap memory or off-heap memory? If so, how can it be shared across executors/worker nodes?
2. Checkpoint: to which disc does it write the intermediate result, each worker node's disc? If so, how can it be shared across the cluster? It would impact parallelism, right?
Ideally it should be common storage (disc) that the whole cluster can refer to for faster parallelism.
Hi Sarav, very good questions.
1. When we cache, the intermediate result set is stored in the memory of the worker nodes, i.e. on-heap memory. Again, it is distributed across multiple worker nodes.
2. Checkpoint always writes the intermediate result to disc. The disc can be either a worker node's disc or an external storage disc such as HDFS. Storing the data on worker node storage is called a local checkpoint, whereas storing it in an external system is called a standard checkpoint. It is always better to go with a standard checkpoint, as storage is guaranteed. With worker node storage, if there is a node failure we lose the data, and remember that checkpoint has already truncated the lineage graph as well, so the lost data cannot be recomputed.
With a local checkpoint, the intermediate result is stored across multiple worker nodes in a distributed fashion. When a subsequent process reads this checkpointed data, it again creates a number of partitions based on the Spark config (e.g. a default parallelism of 8 and a default block size of 128 MB).
Hope this clarifies your doubts.
@@rajasdataengineering7585 thanks for the deep-dive response and crystal-clear clarity. Yes, standard checkpoint is more reliable than local checkpoint. I assume "DISK_ONLY" in persist corresponds to checkpoint. I believe persist to disk can also write to external storage, not just the worker node's disc. Please advise.
Hi Raja, when you say checkpoint stores the intermediate result on disk, it sounds like persist, right?
e.g. df.persist(StorageLevel.DISK_ONLY) (note that cache() itself takes no storage level). If so, what is the main difference between cache and checkpoint?
Cache only stores the result in memory.
Checkpoint only stores the result on disk (and also truncates the lineage).
Persist has the flexibility of choosing between memory and disk.
Hi Raja, thanks for such informative material. Can we have a demo using checkpoint in your next video? Thanks in advance.
sure Vipin, will make demo video on checkpointing
Hi Raja, nice explanation. Can you please give an example of how to create and use a checkpoint?
Hi Dat, there are 2 steps involved:
1. Configure the checkpoint directory
2. Checkpoint the dataframe
I have given the syntax in the video. Please follow accordingly.
Thanks for the great explanation. Need one clarification: if the Databricks cluster is restarted, then cache, persist and checkpoints get reset, right?
Good question. When the cluster is restarted, all the cached/persisted data is erased, along with local checkpoints. Standard checkpoint files written to external storage (e.g. HDFS/DBFS) remain on disk until removed manually. The erased data is recreated when we run an action again.
Nicely explained
Thank you!
Nice explanation Raj 👌 👍
Thanks Sravan
How do we reuse the checkpoint data when we resubmit the job? The job keeps writing a new checkpoint every time we resubmit it, so I have a lot of duplicated checkpoint data.
Check Point and Persist both same?
No, both are different. Persist can store the data either in memory or on disk, but checkpoint stores data only on disk.
Checkpoint similar to Persist?
No, persist has the option of storing data in both memory and disc, with many storage levels to choose from. But checkpoint can store data only on disc.
@@rajasdataengineering7585 Yeah, so persist(DISK_ONLY) = checkpoint, right? And what is the difference between checkpoint and localCheckpoint?
@@manikandanmuthiah438 Almost: both write only to disk, but checkpoint also truncates the lineage graph, and its files survive the application, whereas persist(DISK_ONLY) blocks are managed by the blockManager and cleaned up when the application ends.
A local checkpoint stores the intermediate result on the worker nodes' discs, whereas a standard checkpoint stores the data in a reliable storage layer such as DBFS, HDFS, etc.
If there are 100 transformations and I create a dataframe checkpoint at the 50th transformation, is the computation done and the data stored even before the action is called?
Good question.
It depends on the parameter "eager". For DataFrame.checkpoint, eager defaults to True, so the data is materialized immediately; if you set eager=False, the data is stored only when an action is called. Eager evaluation is just the opposite of lazy evaluation.