10 best practices for building Data lakes | Evolution of data lakes

  • Published: 26 Jul 2024
  • This video explains what a data lake is, how data lakes have evolved, and 10 best practices to keep in mind while designing and developing them.
    What are Data Lakes
    • What are Data Lakes | ...
    How to build healthy data pipelines
    • How to Build healthy d...
    Disaster Recovery
    • Disaster Recovery in 1...
    Chaos Engineering
    • What is Chaos Engineer...
  • Science

Comments • 17

  • @amriteshsingh2952 1 year ago +1

    Thanks for the great content, appreciate your work.

  • @GauravSharmagvs 2 years ago +1

    Thanks for the video. It was very informative.

  • @arunalex1988 2 years ago +1

    Learnt a few new dimensions related to data lakes. Thank you so much for sharing your wealth of knowledge, dear Shreya.

  • @TotuBabyBird 2 years ago +1

  • @Learn2Share786 2 years ago +1

    Great content covered at the right pace. Thank you!

  • @Praveen_Kumar_R_CBE 2 years ago +1

    Very useful content Shreya…

  • @ranjithpals 2 years ago +1

    Lots of great points you have highlighted and touched upon, as always very useful!! Thanks Shreya

  • @kiranmudradi26 2 years ago +1

    Awesome video. Madam, can you please make a video on how to tackle system design and what tools to use for building pipelines, both in-house and in the cloud? It would be helpful.

  • @nikhilgupta110 2 years ago

    Very well described from a conceptual standpoint. One example bridging the concepts would have added value.
    Please also suggest a book or articles for a deep dive.
    I have a few queries; it would be really helpful if you could share your thoughts on them:
    1. Let's say we have a common batch pipeline for multiple customers with different data sources (S3, MySQL, etc.). It uses an identifying timestamp column in YYYYMMDD format, since it runs at a 24-hour frequency. If I want to convert it to real-time ELT, what should I keep in mind?
    2. Extending the above question, suppose we are running parallel ETL batches for, say, 30 customers. What is the best optimization strategy: running the ELT jobs in parallel for each customer (best performance), or increasing the nodes that run the jobs? Which is the best way to scale with cost as a constraint?

    • @BigDataThoughts 2 years ago +1

      1. Moving from batch to real time requires a complete change of architecture and technology choices. You also need to check whether only ingestion and processing are real time, or consumption as well. This determines how you design the system.
      2. Whether to orchestrate jobs in parallel, in sequence, or as a mix depends on many factors:
      a. SLA
      b. Whether each flow is independent or has dependencies
      c. Cost
      d. Consumption pattern
      You may not need to run all flows in parallel on more nodes (as that is extra cost) if the data isn't interdependent or the flows don't have the same SLAs.
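
The trade-off in point 2 can be sketched in a few lines of Python. This is only an illustration, not something taken from the video: the function name `run_flows`, the flow names, and the `max_workers` cap (standing in for the cost constraint) are all hypothetical. Flows that depend on each other run in sequence, independent flows run in parallel, and the pool size limits how much compute is used at once.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_flows(flows, depends_on, max_workers=4):
    """Run flows in dependency order; independent flows share a bounded pool.

    flows:       dict of flow name -> zero-argument callable (hypothetical jobs)
    depends_on:  dict of flow name -> set of flow names that must finish first
    max_workers: cap on concurrent jobs, an assumed stand-in for the cost limit
    """
    graph = {name: depends_on.get(name, set()) for name in flows}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies are circular

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every flow returned here has all of its prerequisites finished,
            # so the whole batch can safely run in parallel.
            ready = sorter.get_ready()
            futures = {name: pool.submit(flows[name]) for name in ready}
            for name, future in futures.items():
                results[name] = future.result()  # re-raises job errors
                sorter.done(name)
    return results


if __name__ == "__main__":
    # Hypothetical flows: one shared extract, two per-customer loads, one report.
    demo_flows = {
        "extract": lambda: "extracted",
        "load_customer_a": lambda: "loaded a",
        "load_customer_b": lambda: "loaded b",
        "report": lambda: "reported",
    }
    demo_deps = {
        "load_customer_a": {"extract"},
        "load_customer_b": {"extract"},
        "report": {"load_customer_a", "load_customer_b"},
    }
    print(run_flows(demo_flows, demo_deps, max_workers=2))
```

Lowering `max_workers` trades latency for cost, which is reasonable when the flows do not share a tight SLA; a real deployment would more likely express the same dependencies in a workflow scheduler than in a hand-rolled loop.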