Tutorial: Building a Data Lakehouse with Apache Iceberg, Spark, Dremio, Nessie & Minio

  • Published Jan 25, 2025

Comments •

  • @apollon456
    6 months ago +3

    This is so nice. Now I don't have to pay for Databricks in order to learn Spark!

  • @hieuthaingoc
    3 months ago +1

    Hi Alex, this looks really great and I can imagine so many use cases with it. I just wonder if there is a way to tell what has changed between the branches at the moment? And please correct me if I'm wrong, but is this only for use with Parquet or structured data files? I have a project where we use other data formats like FASTQ and FASTA, which are widely used in bioinformatics to store genetic information. They are nothing like Parquet, and I don't think any engines can query anything from them. We keep them in a "data warehouse" (an S3 bucket) and we would need to version them. Would Nessie be a good use case for this? Thanks!

    • @Dremio
      3 months ago +1

      You are correct, this is mainly for structured and semi-structured data. You'd need to take that data and find a way to represent it in Parquet/Iceberg to leverage Nessie. For versioning non-Iceberg datasets you may want to use Git or lakeFS, depending on what you are trying to achieve.

    • @hieuthaingoc
      3 months ago

      @@Dremio Thanks Alex, that's really useful as always!

  • @OswaldoSaumet
    11 months ago +1

    Awesome! Just what I was looking for to get rid of AWS. How can I create tables from a CSV file uploaded to MinIO?

    • @Dremio
      11 months ago

      This should help -> www.dremio.com/blog/ingesting-data-into-apache-iceberg-tables-with-dremio-a-unified-path-to-iceberg/

  • @SharonLavie_1981
    1 year ago

    Amazing stuff, thank you so much for that.
    I was wondering if Spark is a must, or can we just use Dremio to do the data ingestion too?

    • @Dremio
      1 year ago

      Dremio can do a lot of ingestion work, and those capabilities are growing every day.
      - Using the CTAS, INSERT INTO, and COPY INTO commands, we can move data from any of our sources into Apache Iceberg tables on our data lake.
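The three ingestion patterns named in that reply can be sketched as Dremio SQL strings; all source, table, and path names below (postgres.public.orders, nessie.orders, @s3/landing/orders/) are hypothetical examples, not values from the tutorial.

```python
# Hedged sketch of the three Dremio ingestion patterns.
# All object names are hypothetical examples.

# CTAS: create a new Iceberg table from a query over any connected source.
ctas = """CREATE TABLE nessie.orders AS
SELECT * FROM postgres.public.orders"""

# INSERT INTO: append query results to an existing Iceberg table.
insert_into = """INSERT INTO nessie.orders
SELECT * FROM postgres.public.orders_staging"""

# COPY INTO: bulk-load files (for example CSV) from object storage.
copy_into = """COPY INTO nessie.orders
FROM '@s3/landing/orders/'
FILE_FORMAT 'csv'"""

for stmt in (ctas, insert_into, copy_into):
    print(stmt.splitlines()[0])
```

Any of these can be run in the Dremio SQL runner or submitted programmatically.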

  • @orafaelgf
    8 months ago +1

    Great video.
    How do you orchestrate all of that?

    • @Dremio
      5 months ago

      Airflow would be the most likely way to orchestrate it all. Dremio has a REST API to send it SQL programmatically.
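As a minimal sketch of that hook, the function below builds and posts a statement to Dremio's SQL REST endpoint (/api/v3/sql); the host, token scheme, and SQL text are assumptions for illustration, and an Airflow task could simply call submit_sql.

```python
import json
import urllib.request

def build_sql_request(host: str, token: str, sql: str) -> urllib.request.Request:
    """Build the POST request for Dremio's SQL endpoint (no network I/O)."""
    return urllib.request.Request(
        f"{host}/api/v3/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={
            # Assumption: a personal access token sent as a Bearer token;
            # self-managed Dremio instead expects "_dremio{token}".
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def submit_sql(host: str, token: str, sql: str) -> dict:
    """Submit SQL and return the job descriptor (an id you can poll)."""
    with urllib.request.urlopen(build_sql_request(host, token, sql)) as resp:
        return json.load(resp)
```

From Airflow this could be wrapped in a PythonOperator, with one task per statement.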

  • @apollon456
    6 months ago +1

    Why is the Spark configuration with all of the lakehouse services hardcoded in a notebook? Shouldn't these configurations be incorporated into the Docker image you're using for Spark?

    • @Dremio
      5 months ago +2

      I do that primarily for educational purposes, to help people learn the Spark configs so they can apply the learning to their own environment. Many tutorials abstract away the configs, and then when people try to apply what they learned, they don't know what the configs are or where they come from. - Alex
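A notebook-visible Spark configuration along those lines typically looks like the sketch below. This is an illustrative assumption for a local Docker setup, not the tutorial's exact code: the package versions, service hostnames (nessie, minio), and warehouse path are all placeholders to adapt to your environment.

```python
import pyspark
from pyspark.sql import SparkSession

# Hedged sketch: Iceberg + Nessie + MinIO catalog config, kept in the
# notebook so every setting is visible. Versions/endpoints are assumptions.
conf = (
    pyspark.SparkConf()
    .setAppName("lakehouse")
    # Iceberg runtime, Nessie Spark extensions, and the AWS SDK bundle.
    .set("spark.jars.packages",
         "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,"
         "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.77.1,"
         "software.amazon.awssdk:bundle:2.24.8")
    # Enable Iceberg SQL extensions and Nessie branch/tag SQL.
    .set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
         "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register a catalog named "nessie" backed by the Nessie server.
    .set("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .set("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .set("spark.sql.catalog.nessie.ref", "main")
    # Write table data and metadata to MinIO through S3FileIO.
    .set("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .set("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .set("spark.sql.catalog.nessie.s3.endpoint", "http://minio:9000")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS nessie.names (name STRING) USING iceberg")
```

Baking these into the Docker image (e.g. via spark-defaults.conf) works the same way; keeping them in the notebook just makes each setting explicit while learning.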

  • @joeingle1745
    7 months ago

    Great article, Alex. Slight issue creating a view in Dremio, I get the following exception: "Validation of view sql failed. Version context for table nessie.names must be specified using AT SQL syntax". Nothing obvious in the console output, any ideas?

    • @AlexMercedCoder
      7 months ago +1

      That means the table is in Nessie and it needs to know which branch you're using, so it would be AT BRANCH main.
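Concretely, the Dremio SQL for that fix looks like the statements below, kept as plain strings here; nessie.names is the table from this thread, while the view name is a hypothetical example.

```python
# Dremio SQL with an explicit version context for a Nessie-backed table.
# "nessie.names" is the table from the thread; the view name is hypothetical.
query = "SELECT * FROM nessie.names AT BRANCH main"

create_view = (
    "CREATE VIEW nessie.names_view AS "
    "SELECT * FROM nessie.names AT BRANCH main"
)
print(create_view)
```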

    • @joeingle1745
      7 months ago

      @@AlexMercedCoder Thanks Alex. This would seem to be a limitation of the 'Save as View' dialog, as it doesn't allow me to do this, and it doesn't default to the branch you're currently in the context of.

  • @Amateur_Pandit
    8 months ago

    We aren't able to read files directly from the MinIO bucket into Apache Spark.
    How can we read a file from the MinIO bucket and process it in Spark?

    • @Dremio
      8 months ago

      If you're following this tutorial, Spark sometimes has DNS issues with the Docker network. The solution is to use the IP address of the Nessie container, which you can find by inspecting the network in the Docker Desktop UI or with the Docker CLI.
      If you run into an "Unknown Host" error using minio:9000, then the DNS in your Docker network may not be matching the name minio with the IP address of the container on the network. In that situation, replace minio with the container's IP address: run docker inspect minio, look for the IP address in the network section, and update the STORAGE_URI variable, for example STORAGE_URI = "172.18.0.6:9000".

    • @Dremio
      8 months ago

      This tutorial does the same thing without Spark: www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/

  • @gfinleyg
    10 months ago

    Is there a new link for the article? The Flink+Nessie article is still available, but the "Blog Tutorial" link is dead.

    • @Dremio
      10 months ago

      Both links still seem to be working for me.

  • @joshuajames7231
    10 months ago

    I got an error, Failed to load class "org.slf4j.impl.StaticLoggerBinder", when running the script for Spark.

    • @Dremio
      10 months ago

      I'd have to see the whole log output and catalog settings to determine the issue. If you want, message me on LinkedIn and I can examine it further.
      - Alex Merced

    • @khushimuddi7337
      6 months ago

      I am getting the same error too.

  • @marceloacarrasco
    11 months ago +1

    Awesome tutorial! Just a question: trying to create the table, I'm getting this error (can you help?):
    {
    "name": "Py4JJavaError",
    "message": "An error occurred while calling o64.sql.
    : java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/AnsiCast
    \tat org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions.$anonfun$apply$6(IcebergSparkSessionExtensions.scala:54) .....

    • @Dremio
      11 months ago

      I'd need to see the code and the error. Can you send me more details at Alex.merced@dremio.com, or provide as much context as you can?