19 - Google BigQuery / Dremel (CMU Advanced Databases / Spring 2023)

  • Published on 9 Jan 2025

Comments • 2

  • @StasPakhomov-wj1nn 1 year ago +3

    The missing 18th lecture can be substituted with 2020's at the link below! Cheers

  • @SteveLoughran 1 year ago

    AFAIK Hadoop MR only uses local storage between Map and Reduce; the output of each job is committed to shared storage. That is where writing to HDFS takes place; writing to cloud storage is "trickier" due to non-POSIX semantics, especially on rename, plus a tendency to throttle.
    And complex SQL-equivalent statements can be multiple MR jobs. Storage of intermediate shuffle data is managed by the YARN NodeManager, so it outlives the mapper/reducer processes. You can also plug in new shufflers, e.g. for Spark. It does contain the "hosts are long lived" assumption, so it doesn't suit compute-only VMs running at spot prices.
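
To make the data flow in the comment above concrete, here is a minimal single-process sketch in Python. It is a toy, not Hadoop code: the function names (`run_map_task`, `run_reduce_task`, `run_job`), the file naming, and the use of two temporary directories to stand in for node-local disk and shared storage (HDFS) are all assumptions made for illustration. Map output is spilled to the "local" directory only, and each reducer writes to a temporary file in the "shared" directory and then commits it with a rename, which is the step that relies on POSIX-like rename semantics.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_map_task(task_id, lines, local_dir, num_reducers):
    """Emit (word, 1) pairs and spill one partition file per reducer to local disk."""
    partitions = defaultdict(list)
    for line in lines:
        for word in line.split():
            # Deterministic toy partitioner (Python's hash() is salted per process).
            partitions[sum(word.encode()) % num_reducers].append((word, 1))
    for r in range(num_reducers):
        path = os.path.join(local_dir, f"map{task_id}_part{r}.json")
        with open(path, "w") as f:
            json.dump(partitions[r], f)


def run_reduce_task(r, num_maps, local_dir, shared_dir):
    """Read partition r from every map task, aggregate, then commit to shared storage."""
    counts = defaultdict(int)
    for m in range(num_maps):
        with open(os.path.join(local_dir, f"map{m}_part{r}.json")) as f:
            for word, n in json.load(f):
                counts[word] += n
    # Write to a temporary name first, then commit with a rename.
    # On a POSIX file system the rename is atomic; an object store
    # typically has to emulate it with copy + delete.
    attempt = os.path.join(shared_dir, f"_temporary_part{r}.json")
    with open(attempt, "w") as f:
        json.dump(counts, f, sort_keys=True)
    final = os.path.join(shared_dir, f"part-{r:05d}.json")
    os.rename(attempt, final)
    return final


def run_job(splits, num_reducers=2):
    """Run all map tasks, then all reduce tasks, and return the merged word counts."""
    shared_dir = tempfile.mkdtemp(prefix="shared_")  # stands in for HDFS
    with tempfile.TemporaryDirectory() as local_dir:  # stands in for node-local disk
        for i, split in enumerate(splits):
            run_map_task(i, split, local_dir, num_reducers)
        outputs = [
            run_reduce_task(r, len(splits), local_dir, shared_dir)
            for r in range(num_reducers)
        ]
    # The local shuffle files are gone here; only committed output remains.
    result = {}
    for path in outputs:
        with open(path) as f:
            result.update(json.load(f))
    return result
```

Calling `run_job([["a b a"], ["b c"]])` runs two map tasks and two reduce tasks. In this sketch a multi-stage query would simply call `run_job` several times, each stage reading the committed output of the previous one from the shared directory.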