10 best practices for building Data lakes | Evolution of data lakes
- Published: 26 Jul 2024
- This video covers what a data lake is, the evolution of data lakes, and 10 best practices to keep in mind while designing and developing data lakes.
What are Data lakes
• What are Data Lakes | ...
How to build healthy data pipelines
• How to Build healthy d...
Disaster Recovery
• Disaster Recovery in 1...
Chaos Engineering
• What is Chaos Engineer...
Science
Thanks for the great content, appreciate your work.
Thanks Amritesh
Thanks for the video. It was very informative.
Thanks gaurav
Learnt a few new dimensions related to data lakes. Thank you so much for sharing your wealth of knowledge, dear Shreya.
Thanks arun
❤
Great content, covered at the right pace. Thank you!
Thanks Farhad
Very useful content Shreya…
Thanks Praveen
Lots of great points you have highlighted and touched upon, as always very useful !! Thanks Shreya
Thanks Ranjith
Awesome video. Madam, can you please make a video on how to tackle system design and what tools to use for building pipelines, both in-house and in the cloud? It would be helpful.
Thanks Kiran sure
Very well described from a concepts standpoint. One example bridging the concepts would have been a valuable addition.
Please do suggest a book or some articles for a deep dive as well.
I have a few queries; it would be really helpful if you could share some thoughts on them:
1. Let's say we have a common batch pipeline for multiple customers with different data sources (S3, MySQL, etc.), with one identifying timestamp column in YYYYMMDD format, since it runs every 24 hours. Now, if I want to convert it to real-time ELT, what should I keep in mind?
2. Extending the above question: say we are running parallel ETL batches for 30 customers. What is the best optimization strategy: running the ELT jobs in parallel for each customer (best performance), or increasing the number of nodes that run the jobs? Which is the best way to scale with cost as a constraint?
1. Moving from batch to real time requires a complete change of architecture and technology choices. You also need to check whether it is real-time ingestion and processing only, or real-time consumption as well. This determines how you design the system.
2. Orchestrating jobs in parallel, in sequence, or as a mix depends on many factors:
a. SLA
b. Whether each flow is independent or has dependencies
c. Cost
d. Consumption pattern
You may not need to run all flows in parallel on more nodes (which costs extra) if the data isn't interdependent or doesn't have the same SLAs.
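To make the factors above a bit more concrete, here is a minimal Python sketch of one way to plan a mix of parallel and sequential runs. It is only an illustration under stated assumptions, not a recommendation for any particular orchestrator: the `Flow` class, the `plan_waves` function, the `max_parallel` cap (a stand-in for the cost constraint), and the flow names are all hypothetical. Dependencies are respected with the standard library's `graphlib.TopologicalSorter`, and flows with tighter SLAs are scheduled first.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Flow:
    """Hypothetical description of one customer flow."""
    name: str
    sla_hours: int                      # smaller value = tighter SLA
    depends_on: tuple[str, ...] = ()    # names of flows that must finish first


def plan_waves(flows, max_parallel):
    """Group flows into sequential waves of at most `max_parallel` flows.

    Assumption: `max_parallel` models the cost constraint (how many jobs
    you are willing to pay to run at once). The plan is conservative: a
    flow starts only after every flow in its dependency level has run.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    by_name = {f.name: f for f in flows}
    for f in flows:
        for dep in f.depends_on:
            if dep not in by_name:
                raise ValueError(f"unknown dependency: {dep}")

    sorter = TopologicalSorter({f.name: f.depends_on for f in flows})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    waves = []
    while sorter.is_active():
        # Everything whose dependencies are done, tightest SLA first.
        ready = sorted(sorter.get_ready(),
                       key=lambda n: (by_name[n].sla_hours, n))
        for i in range(0, len(ready), max_parallel):
            waves.append(ready[i:i + max_parallel])
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    flows = [
        Flow("a_load", sla_hours=4),
        Flow("b_load", sla_hours=24),
        Flow("c_load", sla_hours=2),
        Flow("a_report", sla_hours=6, depends_on=("a_load",)),
    ]
    print(plan_waves(flows, max_parallel=2))
    # [['c_load', 'a_load'], ['b_load'], ['a_report']]
```

With a cap of 2, the flow with the loosest SLA (`b_load`) simply waits for the next wave instead of forcing a larger cluster, which is the trade-off described in the reply above.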