Amazing hands-on session!
Wonderful demonstration and a very handy notebook.
Following are my assumptions (a quick illustrative sketch of 1-3 appears after the questions below).
1. Delta Lake keeps multiple versions of the data (like HBase).
2. Delta Lake takes care of atomicity for the user, showing only the latest version unless specified otherwise.
3. Delta Lake checks the schema before appending to prevent corruption of the table. This makes the developer's job easy; similar results can be achieved with manual effort, like explicitly specifying the schema instead of inferring it.
4. In case of an update, it always overwrites the entire table or the entire partition (DataFrames are immutable).
Questions:
1. If it keeps multiple versions, is there a default limit on the number of versions?
2. Since it keeps multiple versions, is it only suitable for smaller tables? For tables in the terabytes, won't that be a waste of space?
3. In a relational DB, data is tightly coupled with the metadata/schema, so we can only get the data from the table, not from the data files. But in Hive/Spark this is different: external tables are also allowed, and we can recreate a table without having access to the metadata. How is this handled in Delta Lake? Since we have multiple snapshots/versions of the same table, will someone be able to access the data without the log/metadata? In Hive/Spark, tables can be created on the same data with different tools (Hive, Presto, Spark). Can other tools share the same data with Delta Lake?
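A minimal PySpark sketch of assumptions 1-3, assuming a Delta table at a hypothetical path (/tmp/delta/loans) and the standard delta-spark APIs: a plain read returns only the latest committed snapshot, older versions stay reachable via time travel, and an append with a mismatched schema is rejected.

```python
# Sketch of assumptions 1-3 (hypothetical path; standard Delta Lake APIs on Spark).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-versions-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/loans"  # hypothetical location

# Assumptions 1/2: a plain read always returns the latest committed snapshot...
latest = spark.read.format("delta").load(path)

# ...but older versions remain addressable via time travel.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Assumption 3: appending a DataFrame whose schema does not match the table
# fails with an AnalysisException instead of silently corrupting the table.
bad = spark.createDataFrame([(1, "oops")], ["id", "unexpected_col"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("Schema enforcement rejected the append:", e)
```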
For updates, it will not overwrite the entire table; it looks at the files that contain the data that needs to be updated and creates new copies of only those files. Those new files contain the updated records plus the untouched records from the same files. To eventually clean up the older versions, you will have to run a VACUUM command. Currently only Spark SQL works for querying the Delta location, but I believe they are working on making Presto and Hive work with it.
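To make that concrete, here is a rough sketch using the delta-spark Python API (the table path, column name, and predicate are hypothetical): UPDATE rewrites only the data files that contain matching rows, and VACUUM later removes files no longer referenced by the current table version.

```python
# Sketch of the update-then-vacuum flow described above
# (hypothetical table path, column, and predicate).
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

delta_table = DeltaTable.forPath(spark, "/tmp/delta/loans")  # hypothetical path

# Only data files containing matching rows are rewritten; each new file holds
# the updated rows plus the untouched rows that happened to share those files.
delta_table.update(
    condition="loan_status = 'late'",           # hypothetical predicate
    set={"loan_status": lit("defaulted")})      # hypothetical column/value

# Older file versions stay on disk for time travel until vacuumed away
# (default retention is 7 days, i.e. 168 hours).
delta_table.vacuum()
```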
Starts at 3:10
Thanks Andy, I trimmed it. The video now starts right at 0:00.
Suppose the streaming/batch notebook you demonstrated were being run in a workflow and, let's say, 100K rows have streamed in successfully, but then an error occurs and the job fails. As I understand it, the 100K rows and all other changes that occurred in the workflow would be automatically rolled back. Is this correct?
Great demo... very useful for learning the Delta architecture.
Thanks for the feedback Nithin! Glad you enjoyed it.
Can you share the steps for importing the notebook from the GitHub link into Databricks Community Edition?
Please refer to the "Importing Notebooks" section of github.com/delta-io/delta/tree/master/examples/tutorials/saiseu19#importing-notebooks for step-by-step instructions. HTH!