AFAIK Hadoop MR will only use local storage between Map and Reduce; the output of each job is committed to shared storage. That is where writing to HDFS takes place; writing to cloud storage is "trickier" due to non-POSIX semantics, especially on rename, plus the stores' tendency to throttle. And a complex SQL-equivalent statement can compile down to multiple MR jobs, so that commit step happens once per job, not once per query.

Storage of intermediate shuffle data is managed by the YARN NodeManager, so it outlives the mapper/reducer processes. You can also plug in new shuffle handlers, e.g. for Spark. This does bake in the "hosts are long-lived" assumption, so it doesn't suit compute-only VMs running at spot prices.
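For concreteness, here is the stock WordCount job with comments marking which side of that storage split each piece of data lands on. The hdfs:// paths and user name are just illustrative, not anything from the lecture:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper output is NOT written to HDFS: it spills to the node's local
  // disks (mapreduce.cluster.local.dir) and is later served to reducers
  // by the NodeManager's shuffle handler, so it survives the mapper JVM.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Only the reducer's output is committed to shared storage, via the
  // job's OutputCommitter -- the rename-heavy step that object stores
  // handle badly.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and final output live on shared storage (illustrative paths).
    // Pointing these at an object store (s3a:// etc.) instead is exactly
    // where the non-POSIX rename semantics start to bite.
    FileInputFormat.addInputPath(job, new Path("hdfs:///user/alice/input"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/alice/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note nothing in the job itself mentions the shuffle location; that split between local and shared storage is baked into the framework, which is why swapping the shuffle implementation (or moving off long-lived hosts) requires cluster-level changes rather than job changes.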
The missing 18th lecture can be substituted with the 2020 version at the link below! Cheers