Loved this talk. Just one comment at 8:36 (referring to the example of 100 rows): Parquet is not purely columnar. It is actually hybrid: rows are divided into row groups, and each row group is stored in a columnar format. This hybrid layout actually helps with row reconstruction. Also, with Delta Lake becoming more mainstream (it also uses Parquet, but with a commit log), there is little reason to use plain Parquet :)
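You can see this hybrid layout for yourself. Here is a minimal sketch using pyarrow to inspect the row groups of a Parquet file (the file path "example.parquet" is an assumption; point it at any Parquet file you have):

```python
import pyarrow.parquet as pq

# Open the file without reading the data; metadata alone reveals the layout.
pf = pq.ParquetFile("example.parquet")  # hypothetical path
meta = pf.metadata

# A file with more than one row group is horizontally partitioned (row groups),
# and each row group stores its columns contiguously (columnar within the group).
print(f"row groups: {meta.num_row_groups}")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"  row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
```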
This is an enterprise-level explanation that is highly useful. Great work, Omkar!!
Probably the best talk so far, citing real-life issues and their solutions.
Very useful ideas from real-life scenarios.
How do you set columnar compression?
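Not the speaker, but here is a minimal sketch of two common ways to set Parquet compression in Spark (the codec names and the output path are just illustrative choices):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-demo").getOrCreate()

# Option 1: session-wide default codec for all Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

# Option 2: per-write option, which overrides the session default
df = spark.range(100)
df.write.option("compression", "snappy").parquet("/tmp/demo_parquet")  # hypothetical path
```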
@omkar Thanks for your talk, and just to let you know, we are facing the YARN memory overhead issue with Spark 2.4 as well when doing Spark SQL joins.
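One thing that often mitigates "Container killed by YARN for exceeding memory limits" on wide joins is raising the executor memory overhead. A hedged sketch, assuming Spark 2.3+ on YARN (the memory values are illustrative, not recommendations; tune them for your cluster):

```python
from pyspark.sql import SparkSession

# memoryOverhead covers off-heap usage (shuffle buffers, native libs, etc.)
# that heap memory alone does not account for during large joins.
spark = (
    SparkSession.builder
    .appName("join-overhead-demo")
    .config("spark.executor.memory", "8g")          # illustrative value
    .config("spark.executor.memoryOverhead", "2g")  # Spark 2.3+ config name
    .getOrCreate()
)
```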
I am new to Spark. Can anyone please tell me exactly which operations produce the 5 stages in the left diagram and the 2 stages in the right diagram?