19 - Google BigQuery / Dremel (CMU Advanced Databases / Spring 2023)

  • Published: 25 Jan 2025

Comments • 2

  • @StasPakhomov-wj1nn 1 year ago +3

    The missing 18th lecture can be substituted with 2020's at the link below! Cheers

  • @SteveLoughran 1 year ago

    AFAIK Hadoop MR only uses local storage between Map and Reduce; the output of each job is committed to shared storage. That is where the write to HDFS takes place. Writing to cloud storage is trickier because of the non-POSIX semantics, especially on rename, plus the stores' tendency to throttle.
    Complex SQL-equivalent statements can also turn into multiple MR jobs. Storage of intermediate shuffle data is managed by the YARN NodeManager, so it outlives the mapper/reducer processes. You can also plug in new shufflers, e.g. for Spark. This design does contain the "hosts are long lived" assumption, so it doesn't suit compute-only VMs running at spot prices.
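
The rename point in the comment above can be illustrated with a rough sketch. It assumes a simplified commit protocol (write under a `_temporary` directory, then rename into place) and models an object store as a plain dict. The names `commit_by_rename` and `object_store_rename` are hypothetical and are not Hadoop APIs.

```python
import os


def commit_by_rename(output_dir: str, task_id: str, data: bytes) -> str:
    """Write task output under a temporary attempt directory, then
    publish it with a single rename (hypothetical, simplified protocol)."""
    attempt_dir = os.path.join(output_dir, "_temporary", task_id)
    os.makedirs(attempt_dir, exist_ok=True)
    tmp_path = os.path.join(attempt_dir, "part-00000")
    with open(tmp_path, "wb") as f:
        f.write(data)
    final_path = os.path.join(output_dir, f"part-{task_id}")
    # On a POSIX filesystem this is one atomic metadata operation.
    os.rename(tmp_path, final_path)
    return final_path


def object_store_rename(store: dict, src_prefix: str, dst_prefix: str) -> int:
    """Emulate 'rename' on a flat key/value store (assumed model of an
    object store): copy every key under the prefix, then delete the
    sources. Returns the number of copy operations performed."""
    keys = [k for k in store if k.startswith(src_prefix)]
    for k in keys:
        # Each copy moves the object's data; cost grows with data size.
        store[dst_prefix + k[len(src_prefix):]] = store[k]
    for k in keys:
        del store[k]
    return len(keys)
```

In the first function the commit is a single rename regardless of file size. In the second, the same logical step costs one copy and one delete per object and is not atomic as a whole, which is the kind of difference the comment points at.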