DBT Core on Cloud Run Job

  • Published Sep 5, 2024

Comments • 16

  • @HARDselection • 2 months ago +1

    As a member of a very small data team managing a complex orchestration workload, this is exactly what I was looking for. Thanks!

  • @fbnz742 • 1 day ago

    Hi Richard, thank you so much for sharing this. This is exactly what I wanted. I have a few questions:
    1. Do you have any example of how to orchestrate it using Composer? I mean the DAG code.
    2. I am quite new to DBT. I used DBT Cloud before, and I could run everything (upstream + downstream jobs) or just upstream, just downstream, etc. Can I do that using DBT Core + Cloud Run?
    3. This is quite off-topic for the video, but I wanted to ask: DBT Cloud offers a VERY nice visualization of the full chain of dependencies. Is there any way to get that outside of DBT Cloud?
    Thanks again!
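On question 1, a minimal Composer DAG sketch, assuming the Google provider package is installed at a version that supports runtime overrides and that the job container's entrypoint is dbt; the project, region, job, and model names are illustrative, not from the video:

```python
# Sketch: trigger the dbt Cloud Run Job from Composer (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_run import (
    CloudRunExecuteJobOperator,
)

with DAG(
    dag_id="dbt_cloud_run_job",        # hypothetical DAG id
    schedule="0 6 * * *",
    start_date=datetime(2024, 9, 1),
    catchup=False,
) as dag:
    run_dbt = CloudRunExecuteJobOperator(
        task_id="dbt_build",
        project_id="my-gcp-project",   # hypothetical
        region="europe-west2",         # hypothetical
        job_name="dbt-core-job",       # hypothetical
        # Per-run override of the container args (entrypoint assumed to be `dbt`):
        overrides={
            "container_overrides": [
                {"args": ["build", "--select", "+my_model+"]}
            ]
        },
    )
```

On question 2, node selection is a dbt Core CLI feature rather than a dbt Cloud one: `dbt build --select +my_model` runs a model plus everything upstream of it, and `dbt build --select my_model+` runs it plus everything downstream. On question 3, `dbt docs generate` followed by `dbt docs serve` renders the same lineage graph in a local browser, outside dbt Cloud.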

  • @adeolamorren2678 • 1 month ago

    One separate question: since it's a serverless environment, if we have dependencies, should we add the dbt deps command to the Dockerfile args or to the runtime override args?

    • @practicalgcp2780 • 1 month ago

      No, I don't think that is the right way to do it. In serverless environments you can still package up dependencies, and that is something you typically need to do at build time, not run time, i.e. while you are packaging the container in your CI pipeline (see the sketch below). DBT can generate a lock file which can be used to keep package versions consistent, so you don't end up with different versions each time you run the build. See docs.getdbt.com/reference/commands/deps
      The other reason you don't want to do it at run time is that it could be very slow, because the dependencies have to be downloaded on every run; also, in some setups you may not want internet access in a production environment for security reasons, so doing this at build time makes a lot more sense.
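A minimal Dockerfile sketch of that build-time approach, assuming a BigQuery adapter and dbt 1.7+ (which writes a package-lock.yml you can commit); the base image, adapter, and file layout are illustrative:

```dockerfile
# Resolve dbt packages while building the image, not when the job runs.
FROM python:3.11-slim

RUN pip install --no-cache-dir dbt-bigquery   # adapter choice is an assumption

WORKDIR /app

# Copy just the project file, packages.yml, and the committed lock file first,
# so `dbt deps` resolves to the pinned versions and is cached as its own layer.
COPY dbt_project.yml packages.yml package-lock.yml ./
RUN dbt deps

# Now add the rest of the project (models, macros, profiles, etc.).
COPY . .

# Default command; Cloud Run Job runtime overrides can replace the args per run.
ENTRYPOINT ["dbt"]
CMD ["build"]
```

With the lock file baked in, every image build resolves exactly the same package versions, and the job container needs no outbound internet access to the dbt package hub at run time.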

  • @agss • 2 months ago +1

    Thank you for the very insightful video!
    What is your take on using Dataform instead of DBT, when it comes to the capabilities of both tools and the ease of deploying and managing those solutions?

    • @practicalgcp2780 • 2 months ago +2

      Thank you, and spot-on question; I was wondering who was going to ask this first 🙌 I am actually making a Dataform video in the background, but I don't want to publish it unless I am 100% sure I am saying something useful.
      But based on my current findings, you could use either, and depending on what you need, both can be a good fit. Dataform is a lot easier to get up and running, but it's quite new, and I wouldn't recommend using it for anything too critical at this stage. It's also missing some key features like templating with Jinja (I don't really like the JavaScript templating system; it's built on TypeScript, which nobody uses for this kind of work, so you would be locked into something with no support, which in my view is quite dangerous). But it is a lot easier to get up and running natively in GCP.
      DBT is still the go-to choice in my view, because it is built in Python and has a strong open source community. For mission-critical data modelling work, I still think DBT is much better.

    • @agss • 2 months ago

      @practicalgcp2780 you brought up exactly what I was worrying about.
      I highly appreciate your insight!

    • @strmanlt • 16 days ago +1

      Our team was debating migrating from dbt to Dataform. Dataform is actually a pretty decent tool, but the main issue for us was the 1000-node limit per repo. So if you have very simple models that do not require a lot of nodes it would work fine, but for us the long-term scalability was the deciding factor.

    • @practicalgcp2780 • 16 days ago

      @strmanlt thanks for the input on this! Can I ask what the 1000-node limit you are referring to is? Can you share the docs on it? Is it a 1000-node limit on the number of steps / SQL files you can write?

    • @fbnz742 • 1 day ago

      Just wanted to share my thoughts here: I used Dataform for an entire project and it worked quite well. My data model was not so complex, and I learned how to integrate its logs with Airflow, so I was able to set up alerts to Slack pointing to the log file of the failed job, etc. However, I agree that Dataform templating is very strange; I personally don't have expertise with JavaScript, so I struggled a lot with some things, but I was able to do pretty much everything I wanted. I also struggled to find answers on the internet, and DBT is the exact opposite: you can find tons of content online. I would go with DBT.

  • @adeolamorren2678 • 1 month ago

    With this approach, is it possible to add environment variables that are isolated for each run? I basically want to pass environment variables for each run when I invoke the Cloud Run job.

    • @practicalgcp2780 • 1 month ago +1

      Environment variables are typically not designed for manipulating runtime values on each run; they are normally set per environment and stick to each deployment, not each run.
      But it looks like both options are possible; I would stick to passing command-line arguments, because those are more appropriate to override than environment variables. See this article on how to do it, it's explained well (a sketch follows below): chrlschn.medium.com/programmatically-invoke-cloud-run-jobs-with-runtime-overrides-4de96cbd158c
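A minimal sketch of the pattern described in that article, using the google-cloud-run Python client; the job path, dbt args, and the DBT_TARGET variable are illustrative, not from the video:

```python
# Sketch: invoke a Cloud Run Job with per-run overrides (pip install google-cloud-run).
from google.cloud import run_v2

client = run_v2.JobsClient()

request = run_v2.RunJobRequest(
    # Hypothetical fully-qualified job name:
    name="projects/my-project/locations/europe-west2/jobs/dbt-core-job",
    overrides=run_v2.RunJobRequest.Overrides(
        container_overrides=[
            run_v2.RunJobRequest.Overrides.ContainerOverride(
                # Preferred: per-run CLI args for the dbt entrypoint...
                args=["build", "--select", "my_model"],
                # ...but per-run env vars are possible too, if you need them.
                env=[run_v2.EnvVar(name="DBT_TARGET", value="prod")],  # hypothetical var
            )
        ]
    ),
)

operation = client.run_job(request=request)
execution = operation.result()  # blocks until the job execution finishes
print(f"Finished: {execution.name}")
```

The overrides apply only to the execution they are submitted with, so each invocation is isolated from the job's stored configuration.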

  • @10xApe • 4 months ago

    Can Cloud Run be used as a Power BI data refresh gateway?

    • @practicalgcp2780 • 3 months ago

      I haven't used Power BI, so I googled what a data refresh gateway is. According to learn.microsoft.com/en-us/power-bi/connect-data/refresh-scheduled-refresh it looks like it's some sort of service that lets you control refreshes on a schedule? Unless there is some API that lets you trigger it from the Google Cloud ecosystem, I am not sure you can use it. I assume you are thinking of triggering a DBT job first and then refreshing the dashboard?