It was a pleasure to meet you at the last re:Invent session, and the content has been very useful to me. Thank you, Richard.
To sum up 20:00: the WAL protocol requires that redo logs be written to disk before the corresponding DB blocks in the buffer pool can be written to disk. This second process (writing the buffer pool to disk) takes CPU and memory, especially in the case of a "torn page" (when writing some of the OS blocks that make up one DB block failed, and writing the missing OS blocks may need to be re-attempted using the redo logs).
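Here's a minimal sketch of that WAL rule (all names are illustrative, not actual MySQL internals): a dirty page may only be flushed once the redo log covering it is durable, and because one DB block spans several OS blocks, a crash mid-flush can tear the page.

```python
OS_BLOCK = 4096    # typical OS/IO block size
DB_BLOCK = 16384   # one DB page spans four OS blocks

class BufferPool:
    def __init__(self, redo_log, disk):
        self.redo_log = redo_log   # assumed to expose flush_up_to(lsn)
        self.disk = disk           # assumed to expose write(page_id, offset, bytes)
        self.dirty = {}            # page_id -> (page_lsn, page_bytes)

    def flush_page(self, page_id):
        page_lsn, data = self.dirty.pop(page_id)
        # WAL rule: the redo log covering this page must hit disk first.
        self.redo_log.flush_up_to(page_lsn)
        # Only then is the 16 KB page written, as four 4 KB OS-block writes;
        # a crash between two of these writes leaves a "torn page" that
        # recovery must repair using the redo log.
        for off in range(0, DB_BLOCK, OS_BLOCK):
            self.disk.write(page_id, off, data[off:off + OS_BLOCK])
```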
But in Aurora this second process is not performed by the writer node. Instead, the writer node just sends its redo logs to the storage layer (and waits for confirmation that those redo logs have been written to disk on 4 out of the 6 copies before considering the write a success), and never has to flush the buffer pool to disk. The storage layer takes care of this: it builds the data blocks from the redo logs.
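A rough sketch of that 4-of-6 quorum write, assuming a hypothetical send_to_storage_node() coroutine (this is not the actual Aurora wire protocol):

```python
import asyncio

QUORUM = 4   # out of 6 copies spread across 3 AZs

async def replicate_redo(record, storage_nodes, send_to_storage_node):
    # send_to_storage_node(node, record) is a hypothetical coroutine that
    # resolves once that node has written the redo record to its disk.
    tasks = [asyncio.ensure_future(send_to_storage_node(n, record))
             for n in storage_nodes]
    acks = 0
    for fut in asyncio.as_completed(tasks):
        try:
            await fut
            acks += 1
        except Exception:
            continue          # a slow or failed copy is simply outvoted
        if acks >= QUORUM:
            return True       # durable: the commit can be acknowledged
    return False              # fewer than 4 acks: the write is not durable
```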
As explained in 25:30, each storage node first writes to disk the redo logs it has received from the writer node, and acknowledges back to it (so that the writer can reach the 4-out-of-6 quorum required to consider the transaction committed). It then assembles in memory a coherent sequence of redo logs (if some logs are missing, it asks the other storage nodes for them), and finally applies those redo logs to the data blocks on disk.
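In pseudocode, that storage-node pipeline might look like this (function names are mine, not Aurora's); note that only steps 1 and 2 are on the latency-critical path:

```python
def on_redo_record(node, record):
    # Step 1: make the incoming redo record durable; nothing else happens
    # on the synchronous path.
    node.log_disk.append(record)
    # Step 2: ack the writer node; this ack counts toward its 4/6 quorum.
    node.send_ack(record.lsn)

def background_apply(node, peers):
    # Steps 3 and 4 run asynchronously, off the ack path.
    while True:
        # Step 3: fill holes in the log sequence by asking peer storage nodes.
        for lsn in node.missing_lsns():
            node.request_record(peers, lsn)
        # Step 4: apply the contiguous prefix of redo records to data blocks.
        for rec in node.contiguous_unapplied_records():
            node.apply_to_data_block(rec)
```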
For backup/recovery purposes, redo logs are also streamed to S3 by the storage nodes (I wonder how the fact that 6 storage servers do this for the same data is handled); the whitepapers state a target RPO of 5 minutes for the last redo log streamed to S3. Each storage node also streams to S3 the data blocks it has updated from the redo logs (so that a data block backed up to S3 can be used directly, rather than having to replay redo logs during PITR).
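Here's why backing up materialized data blocks helps PITR, as a sketch (apply_redo and the s3 accessors are hypothetical helpers): restore starts from the newest backed-up block at or before the target point, so only a short tail of redo needs replaying.

```python
def point_in_time_restore(s3, target_lsn, disk):
    for page_id in s3.list_page_ids():
        # Start from the newest materialized block at or before the target.
        block = s3.latest_block_before(page_id, target_lsn)
        # Replay only the redo records between that block and the target.
        for rec in s3.redo_records(page_id, since=block.lsn, until=target_lsn):
            block = apply_redo(block, rec)
        disk.write_page(page_id, block)
```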
You say a write is defined as successful when 4 of the 6 replicas (in storage) have sent back an ack, and that we read just 1 of the 6 copies to retrieve the data. How can you guarantee that MySQL would not read a stale one?
I rechecked the video. You say Aurora does a 3-out-of-6 read for a small fraction of reads, and most of the time just 1 out of 6. I just want to know why Aurora permits this dangerous read. In quorum theory, we need to read enough copies to make sure we get the latest, correct version of the data.
Since replicas are asynchronous, I'm wondering how you define the ordering of reads and writes across different servers.
@amywaken6800 Great question.
This 4/6 write quorum is for the writing of the WAL logs (aka redo logs). The storage nodes use those WAL logs to write the data into the DB blocks (more specifically, they first update their buffer pool, and then write it to disk). Those DB blocks are therefore correct on all 6 storage nodes: the 2 nodes that may not have received some WAL logs request them from the other 4 storage nodes before writing anything into their buffer pool.
Note: You may have heard that there's a "3/6 read quorum in Aurora", but that's the quorum required for crash recovery, when the database restarts and must rebuild a consistent view of the log; normal reads don't use a quorum at all.
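To make that concrete, here's a sketch of why a single-copy read is safe in steady state (per the Aurora paper, the writer tracks how far each storage node has acknowledged, so it routes reads only to nodes known to be current; names are illustrative):

```python
def read_block(page_id, read_lsn, nodes):
    # The writer knows each node's highest durably-acked LSN, so it never
    # needs a read quorum in steady state: it just picks a current copy.
    current = [n for n in nodes if n.acked_lsn >= read_lsn]
    best = min(current, key=lambda n: n.recent_latency)
    return best.fetch_block(page_id, read_lsn)
```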
Love the deep dive into the concepts.
Can this magic be done for SQL Server? Or would M$ need to share the source code with you?
You definitely need the source code to make such a fundamental change to a database engine.
Pretty good one!
Thanks 🙏💕💲