This is so nice. Now I don't have to pay for Databricks in order to learn Spark!
Hi Alex, this looks really great and I can imagine so many use cases for it. I just wonder if there is a way to tell what has changed between the branches at the moment? And please correct me if I'm wrong, but is this only for use with Parquet or structured data files? I have a project where we use other data formats like FASTQ and FASTA, which are widely used in bioinformatics to store genetic information. They are nothing like Parquet, and I don't think any engine can query them. We keep them in a "data warehouse" (an S3 bucket) and we would need to version them. Would Nessie be a good use case for this? Thanks!
You are correct, this is mainly for structured and semi-structured data. You'd need to take that data and find a way to represent it in Parquet/Iceberg to leverage Nessie. For versioning non-Iceberg datasets you may want to use Git or lakeFS, depending on what you are trying to achieve.
@Dremio Thanks Alex, that's really useful as always!
Awesome, just what I was looking for to get rid of AWS. How can I create tables from a CSV file uploaded to MinIO?
This should help -> www.dremio.com/blog/ingesting-data-into-apache-iceberg-tables-with-dremio-a-unified-path-to-iceberg/
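If you'd rather do it from the Spark side of this tutorial, here is a minimal sketch (it assumes the Nessie/MinIO-configured Spark session from the video, and the bucket path and table name are hypothetical, not from the linked article):

# Assumes the Spark session is already configured against Nessie + MinIO
# as in this tutorial; the bucket and file path below are hypothetical.
df = (
    spark.read
    .option("header", "true")       # first row holds the column names
    .option("inferSchema", "true")  # guess column types from the data
    .csv("s3a://warehouse/raw/people.csv")
)

# Land the dataframe as an Apache Iceberg table in the Nessie catalog.
df.writeTo("nessie.people").createOrReplace()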
Amazing stuff, thank you so much for that.
I was wondering if Spark is a must, or can we just use Dremio to do the data ingestion too?
Dremio can do a lot of ingestion work, and those capabilities are growing every day.
- Using CTAS, INSERT INTO, and COPY INTO commands we can move data from any of our sources into Apache Iceberg tables on our data lake; see the sketch below.
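As a hedged sketch of the three statement shapes (all source and table names are placeholders, and exact COPY INTO options vary by Dremio version), held as Python strings you could submit through Dremio's SQL runner or REST API:

# CTAS: create a new Iceberg table from a query result.
ctas = "CREATE TABLE nessie.sales AS SELECT * FROM postgres.public.sales"

# INSERT INTO: append query results to an existing Iceberg table.
insert = "INSERT INTO nessie.sales SELECT * FROM postgres.public.sales_staging"

# COPY INTO: bulk-load files from object storage into an existing table.
copy = "COPY INTO nessie.sales FROM '@s3_source/sales/' FILE_FORMAT 'csv'"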
Great video.
How do you orchestrate all of that?
Airflow would be the most likely way to orchestrate it all. Dremio has a REST API you can use to send it SQL programmatically.
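For example, an Airflow task could wrap something like the following (a hedged sketch against a local Dremio instance; the host, credentials, and SQL statement are placeholders):

import requests

DREMIO = "http://localhost:9047"  # default Dremio port; adjust for your setup

# 1. Log in to get a token (Dremio software uses the "_dremio<token>" header scheme).
token = requests.post(
    f"{DREMIO}/apiv2/login",
    json={"userName": "admin", "password": "password123"},
).json()["token"]
headers = {"Authorization": f"_dremio{token}"}

# 2. Submit a SQL statement as a job; the response holds a job id that an
#    Airflow task could poll until the job finishes.
job = requests.post(
    f"{DREMIO}/api/v3/sql",
    headers=headers,
    json={"sql": "INSERT INTO nessie.sales SELECT * FROM postgres.public.sales_staging"},
).json()
print(job["id"])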
Why is the Spark configuration with all of the lakehouse services hardcoded in a notebook? Shouldn't these configurations be incorporated into the Docker image you're using for Spark?
I do that primarily for educational purposes, to help people learn the Spark configs so they can apply the learning to their own environment. Many tutorials abstract the configs away, and then when people try to apply what they learned, they don't know what the configs are or where they come from. - Alex
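For reference, the shape of those configs looks roughly like this. This is a minimal sketch assuming the nessie and minio container names from this tutorial's Docker network; the package versions are examples and must match your Spark/Scala versions:

import pyspark
from pyspark.sql import SparkSession

conf = (
    pyspark.SparkConf()
    .setAppName("lakehouse-demo")
    # Iceberg runtime + Nessie SQL extensions (example versions).
    .set("spark.jars.packages",
         "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,"
         "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0")
    .set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
         "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # A "nessie" catalog backed by the Nessie container on the Docker network.
    .set("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .set("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .set("spark.sql.catalog.nessie.ref", "main")
    # MinIO as the S3-compatible warehouse; credentials come from the
    # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.
    .set("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .set("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .set("spark.sql.catalog.nessie.s3.endpoint", "http://minio:9000")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()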
Great article Alex. Slight issue creating a view in Dremio: I get the following exception, "Validation of view sql failed. Version context for table nessie.names must be specified using AT SQL syntax". Nothing obvious in the console output; any ideas?
That means the table is in Nessie and it needs to know which branch you're using, so it would be AT BRANCH main.
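Concretely, the view definition just needs the branch pinned in the FROM clause. A hedged example, where the space and view names are placeholders:

# Dremio SQL for the view, held as a Python string; AT BRANCH pins which
# Nessie branch the view reads from. Space/view names are hypothetical.
create_view = """
CREATE VIEW my_space.names_view AS
SELECT * FROM nessie.names AT BRANCH main
"""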
@AlexMercedCoder Thanks Alex. This would seem to be a limitation of the 'Save as View' dialogue, as it doesn't allow me to do this and it doesn't default to the branch you're currently in the context of.
We aren't able to read files directly from the MinIO bucket into Apache Spark.
How can we read a file from the MinIO bucket and process it in Spark?
If you're following this tutorial, Spark sometimes has weird DNS issues with the Docker network. The solution is to use the IP address of the Nessie container, which you can find by inspecting the network in the Docker Desktop UI or with the Docker CLI.
If you run into an "Unknown Host" error using minio:9000, there may be an issue with the DNS on your Docker network that matches the name minio to the IP address of the container. In that situation, replace minio with the container's IP address. You can look it up with docker inspect minio (check the network section of the output) and update the STORAGE_URI variable, for example STORAGE_URI = "172.18.0.6:9000". See the sketch below.
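If you want to script the lookup, here is a small sketch to run on the host where the Docker CLI is available (it assumes the container is literally named minio, as in this tutorial's compose file):

import json
import subprocess

# Ask Docker for the minio container's metadata.
raw = subprocess.check_output(["docker", "inspect", "minio"])
networks = json.loads(raw)[0]["NetworkSettings"]["Networks"]

# Take the IP on whichever Docker network the container joined and use it
# in place of the hostname that fails to resolve.
ip = next(iter(networks.values()))["IPAddress"]
STORAGE_URI = f"{ip}:9000"
print(STORAGE_URI)  # e.g. 172.18.0.6:9000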
This tutorial does the same thing without Spark: www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/
Is there a new link for the article? The Flink+Nessie article is still available, but the "Blog Tutorial" link is dead.
Both links still seem to be working for me.
I got the error Failed to load class "org.slf4j.impl.StaticLoggerBinder" when running the script for Spark.
I'd have to see the whole log output and catalog settings to determine the issue. If you want, message me on LinkedIn and I can examine it further.
- Alex Merced
I am getting the same error too.
Awesome tutorial. Just a question: trying to create the table, I'm getting this error (can you help?)...
{
"name": "Py4JJavaError",
"message": "An error occurred while calling o64.sql.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/AnsiCast
\tat org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions.$anonfun$apply$6(IcebergSparkSessionExtensions.scala:54) .....
I'd need to see the code and the error. Can you send me more details at Alex.merced@dremio.com, or provide as much context as you can?
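For other readers hitting the AnsiCast error above: it usually points to a Spark/Iceberg version mismatch, since org.apache.spark.sql.catalyst.expressions.AnsiCast was removed in Spark 3.4, so an iceberg-spark-runtime built for Spark 3.3 or earlier cannot load on newer Spark. A quick sanity check (this is a common cause, not a confirmed diagnosis of this exact report):

from pyspark.sql import SparkSession

# Print the running Spark version; the iceberg-spark-runtime artifact must
# match its minor version (e.g. iceberg-spark-runtime-3.5_2.12 for Spark 3.5).
spark = SparkSession.builder.getOrCreate()
print(spark.version)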