Data Warehouse Ingestion Patterns with Apache NiFi

  • Published 30 Sep 2024
  • This video talks through the pros and cons of three patterns you can use in Apache NiFi to ingest data into a table created with the Iceberg format.
    - 1st option: PutIceberg
    Simply push data using the PutIceberg processor. Super efficient but really only does inserts of new data into the table. It may not be a fit in all cases.
    - 2nd option: PutDatabaseRecord
    Great option that is a bit more generic than the previous one if the destination is not an Iceberg formatted table. In this case the data is sent over JDBC. Great for small datasets but won't be super efficient for huge datasets.
    - 3rd option: Staging area with external temporary tables
    A bit more involved in terms of flow design, but more reliable, very flexible, and very efficient, as it delegates most of the work to the query engine. In this case, data is pushed into a staging area of the object store; you create an external table on top of it, merge the data from that external table into your final table, and then do some cleanup.
    Thanks for watching the video! As always, feel free to leave comments and share your feedback. And let me know what you'd like to see in the next video!
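
    The third option described above could be sketched roughly as the SQL below, assuming a Hive/Impala-style query engine; the table names, columns, and bucket path are invented for illustration:

    ```sql
    -- Hypothetical sketch of option 3. NiFi has already landed Parquet files
    -- in the staging prefix; the query engine does the heavy lifting.

    -- 1. Create an external table on top of the staged files.
    CREATE EXTERNAL TABLE staging_orders (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DECIMAL(10, 2),
      updated_at  TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/staging/orders/';

    -- 2. Merge staged rows into the final Iceberg table (upsert semantics).
    MERGE INTO orders AS t
    USING staging_orders AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET
      customer_id = s.customer_id,
      amount      = s.amount,
      updated_at  = s.updated_at
    WHEN NOT MATCHED THEN INSERT VALUES
      (s.order_id, s.customer_id, s.amount, s.updated_at);

    -- 3. Clean up: drop the external table, then delete the staged files
    --    (the file deletion itself would be another step in the NiFi flow).
    DROP TABLE staging_orders;
    ```

    The exact DDL and MERGE syntax vary by engine, but the shape of the pattern (stage, expose, merge, clean up) stays the same.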

Comments • 10

  • @franckroutier
    @franckroutier 1 month ago

    Hi, and thanks for the video.
    I have a question though... would there be a way to handle transactions in a scenario where I'm upserting into multiple tables, and I'd like the whole process to succeed or fail?
    Coming from Talend, I usually have a pre-job that starts a transaction on a db connection, all "processors" will use the transaction, and in the post-job I will commit or rollback, depending on whether there is an error or not.

    • @pvillard31
      @pvillard31  1 month ago +2

      I guess the closest thing to what you describe is the option in ExecuteSQL and/or ExecuteSQLRecord processors to set SQL queries in the pre-query property and in the post-query properties. But if you mean a transaction to the database that would span across multiple processors in the flow, then it's not possible today. I could see ways of implementing this with custom processors and controller services but there is nothing out of the box today. That could be a valid feature request if you'd like to file a JIRA in the Apache NiFi project.
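
      A hedged sketch of what the pre-query/post-query approach mentioned above could look like; the statements and table are illustrative, and whether this gives you real transactional behavior depends on your database and driver settings:

      ```sql
      -- SQL Pre-Query property (runs before the main statement, same session):
      START TRANSACTION;

      -- Main SQL statement of the ExecuteSQL / ExecuteSQLRecord processor:
      UPDATE inventory SET qty = qty - 1 WHERE item_id = 42;

      -- SQL Post-Query property (runs after the main statement succeeds):
      COMMIT;
      ```

      Note the transaction scope is limited to that single processor's database session, which is exactly why it cannot span multiple processors in the flow.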

  • @clintonchikwata4049
    @clintonchikwata4049 2 months ago

    Thanks! The third option is phenomenal.

    • @clintonchikwata4049
      @clintonchikwata4049 2 months ago

      @Pierre, when using option 3, how would you handle a scenario where you want a surrogate key on the destination table?

  • @nasrinidhal4162
    @nasrinidhal4162 22 hours ago

    Thanks for sharing! Insightful content.
    I am a beginner and I am wondering whether NiFi is able to handle cross-team collaboration? If so, I would be glad if you could share some useful links.
    At the same time, I doubt whether it is really a good choice for heavy ETL/ELT or even CDC (even though it is possible to implement).
    I see it as good only for mediation and routing; am I mistaken?
    Thank you for your feedback!

    • @pvillard31
      @pvillard31  4 hours ago

      Hi, NiFi is definitely able to handle cross-team collaboration. The concept of a registry client is usually what is recommended to version control flow definitions, have multiple people working on the same flows, and build CI/CD pipelines to test and promote flows to upper environments. NiFi should be considered more as an ELT tool than an ETL one. Any kind of transformation is technically doable at the FlowFile level in NiFi, but if you need to do complex transformations over multiple FlowFiles (joins, aggregations, etc.), then a proper engine like Flink, for example, would likely be better (or delegate this to whatever destination system you're using - data warehouse, etc.). Finally, CDC is definitely something you can do very well with NiFi. Some vendors that provide support for NiFi offer processors based on Debezium for capturing CDC events, as well as processors to push those events into systems (Iceberg, Kudu, etc.). There are some things to keep in mind when designing a flow to make sure event ordering is preserved, but there are many options to do that in NiFi very well. Hope this helps!

    • @nasrinidhal4162
      @nasrinidhal4162 3 hours ago

      @@pvillard31 Hi,
      So buckets can be considered as separate projects in NiFi, where data engineers can work together without disturbing other teams that are on other buckets using the same NiFi instance?
      And if a team wants to test or deploy a given version, it could be done through scripts that they need to implement and maintain?
      If so, this would be very interesting! I will try to have a closer look.
      Thank you and keep posting!

    • @pvillard31
      @pvillard31  2 hours ago

      @@nasrinidhal4162 Yeah, buckets can be a separation for different teams, or for logically grouping flows serving a similar purpose, and then you have flows versioned in that bucket and multiple people can work on the same flow. I have a video coming soon with some nice features of NiFi 2: branching, filing pull requests, and comparing versions before merging a pull request for a new flow version. I have a series of blog posts and videos coming that focus on CI/CD with NiFi.

    • @nasrinidhal4162
      @nasrinidhal4162 1 hour ago

      @@pvillard31 Cool! That would be amazing!
      Thanks for sharing again and keep posting.

  • @LesterMartinATL
    @LesterMartinATL 2 months ago

    Good stuff!