Netflix: Active-Active Behind the Scenes

  • Published: 29 Jul 2024
  • Speaker: Roopa Tangirala, Senior Cloud Data Architect at Netflix
    High availability is an important requirement for any online business. The key to success is to architect around failures: expect infrastructure to fail and remain highly available even then. One such effort at Netflix was the Active-Active implementation, which provided region resiliency. This presentation gives a brief overview of the Active-Active implementation and how it leveraged Cassandra's architecture in the backend to achieve its goal. It covers our journey through Active-Active from Cassandra's perspective and the data validation we did to prove the backend would work without impacting the customer experience. It also discusses the problems we faced, such as long repair times and gc_grace settings, the lessons learned, and what we would do differently next time.
  • Science

Comments • 1

  • @ferozshaik100 5 years ago +1

    Thanks for sharing your experience with the Active-Active C* deployment. However, we have one particular use case where, after a region failure, the app users cannot be moved back to their original region because of inconsistent data (assuming the region restoration takes more than 5-6 hours, so hints are also dropped!). Moving them back would cause corruption in our application. The only way is to run a Cassandra repair and make the region/nodes consistent before we can host those users again. However, due to tight SLAs, we cannot finish repairs in time, and we are looking for a solution to this. The repairs cannot be sped up because of the large volume of data and the multiple JVMs per server we run. Any ideas or help would be appreciated! We are using vnodes, BTW.
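
    The trade-off described in this comment can be sketched as a small check. This is a minimal illustration, not Netflix's or the commenter's tooling: the function `plan_region_recovery` and the `RecoveryPlan` fields are hypothetical names, the constants are Cassandra's shipped defaults for `max_hint_window_in_ms` (3 hours) and `gc_grace_seconds` (10 days), and the tombstone check is a deliberately conservative simplification.

```python
from dataclasses import dataclass

# Cassandra's shipped defaults; a real cluster may override both.
DEFAULT_MAX_HINT_WINDOW_S = 3 * 60 * 60   # max_hint_window_in_ms = 10800000
DEFAULT_GC_GRACE_S = 10 * 24 * 60 * 60    # gc_grace_seconds = 864000


@dataclass
class RecoveryPlan:
    """Hypothetical summary of what a returning region needs."""
    hints_may_cover: bool   # outage fit inside the hint window (hints are still best-effort)
    repair_required: bool   # hints were dropped, so only repair restores consistency
    tombstone_risk: bool    # repair may finish after gc_grace, so deletes could resurrect


def plan_region_recovery(outage_s, est_repair_s,
                         max_hint_window_s=DEFAULT_MAX_HINT_WINDOW_S,
                         gc_grace_s=DEFAULT_GC_GRACE_S):
    """Classify a region outage given its length and an estimated repair time (seconds)."""
    hints_may_cover = outage_s <= max_hint_window_s
    repair_required = not hints_may_cover
    # Assumption: treat outage plus repair time exceeding gc_grace as risky.
    tombstone_risk = repair_required and (outage_s + est_repair_s) > gc_grace_s
    return RecoveryPlan(hints_may_cover, repair_required, tombstone_risk)


# A 6-hour outage with an estimated 12-hour repair, as in the scenario above.
plan = plan_region_recovery(outage_s=6 * 3600, est_repair_s=12 * 3600)
print(plan)
```

    With the default settings, a 6-hour outage exceeds the 3-hour hint window, so the sketch reports that a repair is required before users can safely return to the region, which is the constraint the comment describes.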