Amazing job, thank you Dacort
That's an excellent demonstration
Brilliant work, made this so easy to understand. Great overview!
Nice speech and manner! Clear mind!
great tutorial
Amazing job. Thank you!! What is the best way to read these Delta tables now? Data Catalog and then Athena? I would like to see this data in our QuickSight.
Amazing job, Damon.
Thank you!
Great job! Your channel reminds me of your colleague Julien Simon's.
1- Could you show how to use AWS Lake Formation's governed tables in Amazon EMR? What is the difference between governed tables and Apache Iceberg/Apache Hudi/Delta Lake?
2- Could you have a demo of Pandas-on-Spark when Apache Spark 3.2 would be available in Amazon EMR? I'm interested to know whether it is possible to run Pandas code on Amazon EMR without big changes.
3- Could you talk about the book about Amazon EMR that your colleague Sakti Mishra will publish soon? I would like to know if it can help me to prepare for the AWS Data Analytics Certification.
1- AWS Lake Formation's Governed Tables have some limitations you need to know about: docs.aws.amazon.com/lake-formation/latest/dg/governed-table-restrictions.html. From my perspective, those are the main differences between them and Apache Iceberg/Apache Hudi/Delta Lake.
2- The latest version of EMR is based on Spark 3.1.2: docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-6.5.0.html
3- You can find Sakti Mishra's book here amzn.to/3sRare1
Do any of these have a "vacuum" equivalent, or how do you do housekeeping / maintenance on these incremental data lakes?
Both Hudi and Iceberg have "maintenance" operations you can run, including compaction. For Iceberg ( iceberg.apache.org/docs/1.2.0/maintenance/#compact-data-files ) and Hudi ( hudi.apache.org/docs/compaction/ ).
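To make the Iceberg side a bit more concrete, here is a rough sketch of what a housekeeping pass could look like. It only builds the Spark SQL `CALL` statements for three of Iceberg's stored procedures (`rewrite_data_files`, `expire_snapshots`, `remove_orphan_files`); running them assumes a Spark session with the Iceberg SQL extensions enabled. The helper name, the catalog name `glue_catalog`, the table `db.events`, and the 7-day retention are all made up for illustration, so check the maintenance docs linked above before relying on any of it.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def iceberg_maintenance_sql(
    catalog: str,
    table: str,
    retain_days: int = 7,
    now: Optional[datetime] = None,
) -> List[str]:
    """Build Spark SQL CALL statements for a basic Iceberg maintenance pass.

    `catalog` and `table` are placeholders for your own names (hypothetical here).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compaction: rewrite many small data files into fewer, larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Closest thing to a "vacuum": expire snapshots older than the cutoff
        # so the data files only they reference can be cleaned up.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Remove files in the table location that no metadata references.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


if __name__ == "__main__":
    for stmt in iceberg_maintenance_sql("glue_catalog", "db.events"):
        # In a real job you would run: spark.sql(stmt)
        print(stmt)
```

Expiring snapshots removes the ability to time travel to them, so the retention window is a trade-off you would want to pick deliberately.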
Hi Damon. Thank you but we are trying to implement the Iceberg format using Glue. Do you have any idea if Glue Spark will support Iceberg?
I haven't used it personally, but looks like there is an Iceberg connector you can subscribe to from Glue Studio. Dremio has a pretty good tutorial about it here: www.dremio.com/resources/tutorials/getting-started-with-apache-iceberg-using-aws-glue-and-dremio/
@dacort Yes, saw that one, but it doesn't do the CDC part. The one you have on your channel is what we are looking for. Hope we can replicate the same using the Glue connector for Iceberg. So far no luck, but we will work with support if the connector does not work.
Does this same approach work with Spark on a Glue job? Trying it, but no luck so far.
Hi Arjun - Glue does have the ability to connect to Hudi tables, but there are some different steps to set it up. You can find more details here: aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/
It should work provided you supply the necessary dependencies and Spark configs. I have done basic CRUD on all these file types in Glue without using connectors.
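For anyone wondering what "the necessary Spark configs" might look like, here is a minimal sketch that collects the commonly documented session settings for each format into a dict. It does not cover the dependencies themselves: the matching JARs still have to be supplied separately (for example through Glue's `--extra-jars` job parameter), and versions must match your Spark version. The function names, the catalog name `glue_catalog`, and the `s3://example-bucket/warehouse/` path are placeholders I made up, and the chained `--conf` rendering is an assumption about how you pass several settings to a Glue job, so treat this as a starting point rather than a verified setup.

```python
from typing import Dict


def table_format_spark_conf(
    fmt: str,
    catalog: str = "glue_catalog",  # hypothetical catalog name
    warehouse: str = "s3://example-bucket/warehouse/",  # hypothetical path
) -> Dict[str, str]:
    """Return Spark settings typically needed for a given table format."""
    if fmt == "delta":
        return {
            "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
            "spark.sql.catalog.spark_catalog":
                "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        }
    if fmt == "iceberg":
        prefix = f"spark.sql.catalog.{catalog}"
        return {
            "spark.sql.extensions":
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            prefix: "org.apache.iceberg.spark.SparkCatalog",
            # Use the Glue Data Catalog as the Iceberg catalog implementation.
            f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            f"{prefix}.warehouse": warehouse,
        }
    if fmt == "hudi":
        return {
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.sql.hive.convertMetastoreParquet": "false",
        }
    raise ValueError(f"unknown table format: {fmt!r}")


def as_glue_conf_param(conf: Dict[str, str]) -> str:
    """Render settings as one chained value for a `--conf` job parameter."""
    return " --conf ".join(f"{key}={value}" for key, value in conf.items())


if __name__ == "__main__":
    print(as_glue_conf_param(table_format_spark_conf("iceberg")))
```

The same dict can also be applied in the script itself by looping over it and calling `SparkSession.builder.config(key, value)` before the session is created.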