PracticalGCP
Save 50 percent of your Data Engineering effort via Continuous Queries
Can #BigQuery Continuous Queries save your organisation 50% of its data engineering effort?
I know I'm a bit late to the conversation, but I wanted to test this feature thoroughly to give a well-rounded review. I dive deep into which use cases benefit the most, where the real time savings come from, and, most importantly, whether it's ready for production environments. I also cover some of the key challenges I encountered. Hope it's worth the wait!
A big shoutout to Nick Orlove for his incredible support and passion in driving the Continuous Queries feature in BigQuery. He’s been instrumental in gathering feedback, and even authored a fantastic article that’s definitely worth a read: BigQuery Continuous Queries Makes Data Analysis Real-Time (cloud.google.com/blog/products/data-analytics/bigquery-continuous-queries-makes-data-analysis-real-time)
Agenda
01:48 - What are Continuous Queries?
04:18 - How do Continuous Queries work?
08:02 - Saving 50% of the data engineering effort?
14:16 - Continuous Queries in Action!
25:42 - Concerns about concurrency and cost
28:50 - Features I would love for mission critical pipelines
34:52 - Considerations to run on production
36:28 - Next steps
Deck: docs.google.com/presentation/d/1EU_8hhz9QtkrFJNQKQsICdOlyu2mV_Zd75ziN6QMKZk
While talking about this extremely powerful feature, here's a significant issue I discovered just two days ago that I hope is only a limitation of the public preview. If you skip to 24:21 in the video, you'll see that I demonstrated only one continuous query can be submitted with 50 slots, and just three with 100 slots. This doesn't seem logical (continuous queries don't use many slots at all, judging by the metrics) and makes the entry cost extremely high ($2,500 per month for 50 slots). It doesn't seem practical if 100 slots only allow for three submissions. I've tested this with two different Google Cloud accounts and encountered the same issue.
Given there is no way to increase concurrency manually (quoting the docs: "You can't configure continuous query concurrency. BigQuery automatically determines the number of continuous queries that can run concurrently, based on available reservation assignments that use the CONTINUOUS job type."), I couldn't find a way to resolve this at this stage.
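To make the setup concrete, here is a rough sketch of the reverse ETL pattern the video focuses on: a continuous query that exports new rows to Pub/Sub. The project, dataset and topic names are made up, and the `continuous=True` job-config flag is an assumption about recent google-cloud-bigquery releases; the same SQL can also be submitted from the console or the bq CLI once a CONTINUOUS reservation assignment exists.

```python
# Hypothetical sketch: stream new rows from a BigQuery table into a Pub/Sub topic
# via a continuous query. All names are placeholders; the continuous flag below is
# an assumption about recent google-cloud-bigquery releases.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
EXPORT DATA
  OPTIONS (
    format = 'CLOUD_PUBSUB',
    uri = 'https://pubsub.googleapis.com/projects/my-project/topics/orders-events')
AS (
  SELECT TO_JSON_STRING(STRUCT(order_id, customer_id, order_ts)) AS message
  FROM `my-project.sales.orders`
);
"""

# Assumption: recent client versions expose a continuous flag; otherwise submit the
# same SQL from the console or with `bq query --continuous=true`.
job_config = bigquery.QueryJobConfig(continuous=True)
job = client.query(sql, job_config=job_config)
print(f"Submitted continuous query: {job.job_id}")
```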
Views: 1,290

Videos

Stream sharing with Pub/Sub using Analytics Hub
422 views • a month ago
In this video, I focus on the challenges I've faced and explain why this simple addition to Analytics Hub can be an extremely effective method for sharing streaming data. This is particularly beneficial in large organisations where multiple topics are published across various projects, and numerous subscribers belong to different teams. - 01:28 Quick intro to Analytics Hub - 02:19 Quick intro t...
Seamless transition of Vector Search from BigQuery to Feature Store
407 views • a month ago
Last week I found a Notebook created by Elia (Google) and Lorenzo (Google) that greatly simplifies transitioning Vector Search from BigQuery to Feature Store, enabling a smooth shift from offline to online serving with minimal code changes. Considering my last RAG video didn't cover online serving in depth, I think this is a perfect topic for a follow-up video. I'll demonstrate how easy it is t...
Run Cloud Composer Locally
575 views • 2 months ago
Google Cloud has introduced a command-line interface (CLI) for running an Airflow environment with Cloud Composer. This tool offers arguably the most convenient method for operating Airflow in a Composer-like setting for local development purposes. The significance of this tool lies in its comprehensive features for local development and its ability to easily incorporate additional Python packa...
DBT Core on Cloud Run Job
1.8K views • 4 months ago
I'm thrilled to share my latest video: "DBT Core on Cloud Run Job" This tutorial is tailored for those looking to streamline their data transformation workflows in a serverless environment using Google Cloud's powerful service, Cloud Run. Whether you're a seasoned professional or just starting, this video has valuable insights for everyone. 🔹 What's Inside? - A step-by-step guide on setting up ...
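For context, a minimal sketch of what the Cloud Run Job entrypoint in such a setup might look like, assuming dbt-core 1.5+ (which ships the programmatic dbtRunner API) and placeholder paths and target names.

```python
# Minimal Cloud Run Job entrypoint that runs dbt Core programmatically.
# Paths and target name are placeholders for your own project layout.
import sys

from dbt.cli.main import dbtRunner, dbtRunnerResult


def main() -> None:
    runner = dbtRunner()
    # Equivalent to `dbt build --project-dir /app --profiles-dir /app --target prod`
    result: dbtRunnerResult = runner.invoke(
        ["build", "--project-dir", "/app", "--profiles-dir", "/app", "--target", "prod"]
    )
    # Exit non-zero so the Cloud Run Job execution is marked as failed on errors
    sys.exit(0 if result.success else 1)


if __name__ == "__main__":
    main()
```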
How to build a sustainable data ecosystem on Google Cloud
636 views • 5 months ago
Today, I’m sharing my experience on how to establish a data ecosystem within Google Cloud that addresses some significant challenges such as enhancing the speed of development, improving data management and sharing, ensuring quality, and most importantly, identifying clear methods to evaluate our progress. It’s essential to recognise that creating a lasting data system isn’t merely about adheri...
A practical application leveraging Langchain and BigQuery Vector Search
2.3K views • 6 months ago
Today, I'm thrilled to share insights into the integration of Langchain and BigQuery Vector search through a practical application that I've developed. This video presentation goes beyond theoretical discussion, offering a hands-on look at leveraging these cutting-edge technologies. I cover critical topics like Large Language Models (LLM), Embeddings, and Vector Search, which are fundamental to...
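As a rough illustration of the kind of integration discussed here, the sketch below wires LangChain to BigQuery Vector Search. The project, dataset and table names are placeholders, and import paths have moved between LangChain releases, so treat them as assumptions rather than the exact code used in the video.

```python
# Hedged sketch: LangChain + BigQuery Vector Search with made-up names.
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import BigQueryVectorSearch

embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")

store = BigQueryVectorSearch(
    project_id="my-project",
    dataset_name="rag_demo",
    table_name="doc_embeddings",
    location="EU",
    embedding=embeddings,
)

# Embed and index a couple of documents, then run a similarity search
store.add_texts(["BigQuery supports vector search", "Cloud Run is serverless"])
docs = store.similarity_search("Which service offers vector search?", k=2)
for doc in docs:
    print(doc.page_content)
```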
Scaling development teams with Cloud Workstations
619 views • 7 months ago
In today's rapidly evolving digital landscape, the need for flexible and secure development environments has never been greater. Enter Cloud Workstation, a game-changing solution that empowers development teams to harness the full potential of cloud computing, specifically within the Google Cloud Platform (GCP). In this video, we'll delve deep into the world of Cloud Workstation, exploring how ...
Privileged Just-in-time access on Google Cloud with JIT
933 views • 7 months ago
Just-In-Time privileged access is a method for managing access to Google Cloud projects in a more secure and efficient manner. It's an approach that aligns with the principle of least privilege, granting users only the access they need to perform specific tasks and only when they need it. This method helps reduce risks, such as accidental modifications or deletions of resources, and creates an ...
Real-time Analytics with Cloud Spanner CDC
461 views • 8 months ago
In the realm of relational databases, Cloud Spanner stands as a remarkable force, offering unparalleled horizontal scalability, reaching near-infinite capacity. Its 99.999% availability SLA across regions makes it a formidable contender for even the most demanding transactional workloads. Cloud Spanner's native support for CDC (change data capturing) through its "Change Stream" feature empowers...
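As a small illustration, a Change Stream is just a DDL object; the hedged sketch below creates one with the Spanner Python client, using made-up instance, database and table names (consuming the stream downstream is typically done with the Dataflow change streams templates).

```python
# Hedged sketch: create a Spanner change stream via DDL (placeholder names).
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("orders-db")

operation = database.update_ddl(
    ["CREATE CHANGE STREAM orders_stream FOR orders"]
)
operation.result()  # wait for the schema change to complete
print("Change stream created")
```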
Streaming data from BigQuery to Datastore using Dataflow
1.1K views • 9 months ago
🚀 Today, I'm eager to discuss a method for using a Dataflow streaming pipeline to move data from BigQuery to Cloud Datastore.. At first glance, this approach might seem unconventional. Why? Because BigQuery isn't typically associated with streaming capabilities. However, I believe this strategy has immense potential. 🔍 Here's the context: a significant portion of our data now resides in BigQuer...
Serverless distributed processing with BigFrames
2.3K views • 10 months ago
Exciting news from Google Cloud with the launch of BigFrames (in preview). 🚀🚀🚀 This new library has significant potential to streamline processes that were traditionally managed by more intricate technologies like Apache Beam (Dataflow) or Spark. It also fills the gap between local Pandas operations running on Jupyter and deploying large-scale workloads in production, and enables faster interac...
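A tiny, hedged example of what that looks like in practice: pandas-style syntax, with the heavy lifting pushed down to BigQuery (the public natality sample table is used purely for illustration).

```python
# Hedged sketch: pandas-like code executed by BigQuery via BigFrames.
import bigframes.pandas as bpd

bpd.options.bigquery.project = "my-project"  # placeholder billing project

df = bpd.read_gbq("bigquery-public-data.samples.natality")

# Filter and aggregate; the computation runs in BigQuery, not locally
recent = df[df["year"] >= 2000]
avg_weight = recent.groupby("year")["weight_pounds"].mean()

print(avg_weight.to_pandas().head(10))  # only the small result is pulled locally
```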
Automated data profiling and quality scan via Dataplex
7K views • 11 months ago
Data quality is a critical concern within a complex data environment, particularly when dealing with a substantial volume of data distributed across multiple locations. To systematically identify and visualise potential issues, establish periodic scans, and notify the relevant teams at an organisational level on a significant scale, where should one begin? This is precisely where the automated ...
Centralised Data Sharing using Analytics Hub
2.9K views • a year ago
Sharing data in a medium - large organisation has always been a big challenge. In today's talk I've described some of these data sharing challenges I've seen over the past years in different organisations, and how the new Google Cloud product Analytics Hub can potentially solve this in a much easier and user friendly way in the analytics community. 01:50 - Data Sharing challenges 04:59 - What i...
BigQuery to Datastore via Remote Functions
1.6K views • a year ago
BigQuery Remote Functions has revolutionised the way we design data pipelines, eliminating the need for additional overhead between teams. Thanks to the exceptional efforts of the Unytics.io team, for creating Bigfunctions (unytics.io/bigfunctions/), a remarkable library collection bundled with Cloud Run deployment for BigQuery Remote Functions. With just a simple SQL query, we can seamlessly p...
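For readers new to Remote Functions, this is roughly the contract a Cloud Run service (such as the ones BigFunctions deploys) has to implement: BigQuery POSTs a JSON body with a `calls` array, one entry per row, and expects a `replies` array of the same length back. The sketch below is a minimal, hypothetical handler, not the BigFunctions code itself.

```python
# Minimal sketch of a BigQuery remote function backend on Cloud Run.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.post("/")
def remote_function():
    payload = request.get_json()
    replies = []
    for call in payload.get("calls", []):
        value = call[0]  # the SQL function's first argument for this row
        # ... write to Datastore / call an API / etc., then return one reply per row
        replies.append(f"processed {value}")
    return jsonify({"replies": replies})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```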
Connect to services on another VPC via Private Service Connect (PSC)
8K views • a year ago
Cloud PubSub Multi-Team Design
969 views • a year ago
Snapshotting Data using PubSub and Datastore for Efficient API Serving
612 views • a year ago
Cloud Run with IAP
6K views • a year ago
Run Apache Spark jobs on serverless Dataproc
4.3K views • a year ago
GPG GCS File Decryptor on Cloud Run
1.1K views • a year ago
Is BigQuery Slot Autoscaling any good?
2.5K views • a year ago
Cloud Run PubSub Consumer via Pull Subscription
5K views • a year ago
Data-Aware Scheduling in Airflow 2.4
1.1K views • a year ago
Improve Organisation Observability with Cloud Asset Inventory
772 views • a year ago
Near real-time CDC using DataStream
6K views • a year ago
The 2022 Wrap-up
328 views • a year ago
Run DBT jobs with Cloud Batch
1.2K views • a year ago
Super Lightweight Real-time Ingestion Design
1.1K views • a year ago
Cloud Datastore TTL
1K views • a year ago

Comments

  • @listenpramod
    @listenpramod • 1 day ago

    Thanks Richard, very well explained. I would say we need links to your videos in the GCP docs for the respective topics. If I want to push events from on-prem to a topic in GCP and then use Analytics Hub to share the stream with other downstream GCP projects, where do you think the topics that receive the events from on-prem should be created?

  • @fbnz742
    @fbnz742 • 1 day ago

    Hi Richard, thank you so much for sharing this. This is exactly what I wanted. I have a few questions: 1. Do you have any example on how to orchestrate it using Composer? I mean the DAG code. 2. I am quite new to DBT. I used DBT Cloud before and I could run everything (Upstream + Downstream jobs) or just Upstream, just Downstream, etc. Can I do it using DBT Core + Cloud Run? 3. This is quite off-topic to the video but wanna ask: DBT Cloud offers a VERY nice visualization of the full chain of dependencies. Is there any way to do it outside of DBT Cloud? Thanks again!

  • @s7006
    @s7006 • 10 days ago

    Very thorough analysis! One of the best YouTube channels out there for GCP product deep dives.

  • @AdrianIborra
    @AdrianIborra • 11 days ago

    Are you referring to dbt Core, as that does not have VCS? From your point of view, does GCP have a similar service to dbt that supports a real-world, complex client system?

  • @sergioortega5130
    @sergioortega5130 • 14 days ago

    Excellent video, I spent hours figuring out where the run.invoke role needed to go, maybe I skipped it but is not mentioned anywhere in docs 🥲 Gold, thanks man!

  • @AghaOwais
    @AghaOwais • 24 days ago

    Hi, my dbt code is successfully deployed to Google Cloud Run. I am using dbt Core, not dbt Cloud. The only issue is that when I hit the URL, "Not Found" is shown. I have identified the issue: when the code runs it keeps looking for dbt_cloud.yml, but why would that be needed when I am only using dbt Core? Please help. Thanks

  • @ItsMe-mh5ib
    @ItsMe-mh5ib • 24 days ago

    what happens if your source query combines multiple tables?

    • @practicalgcp2780
      @practicalgcp2780 • 24 days ago

      The short answer is most of these won't work, at least during the public preview. You can use certain subqueries if they don't contain keywords like EXISTS or NOT EXISTS. JOIN won't work either. See the list of limitations here: cloud.google.com/bigquery/docs/continuous-queries-introduction#limitations This makes sense because it's a storage layer feature, so it is very hard to implement things like listening to append logs on two separate tables and somehow joining them together. I would suggest focusing on reverse ETL use cases, which is what it's mostly useful for at the moment.

    • @ItsMe-mh5ib
      @ItsMe-mh5ib • 24 days ago

      @@practicalgcp2780 thank you

  • @nickorlove-dev
    @nickorlove-dev • 24 days ago

    LOVE LOVE LOVE the passion from our Google Developer Expert program, and Richard for going above and beyond to create this video! It's super exciting to see the enthusiasm being generated around BigQuery continuous queries! Quick feedback regarding the concerns/recommendations highlighted in the video:
    - All feedback is welcome and valid, so THANK YOU! Seriously!
    - The observed query concurrency limit of 1 query max for 50 slots and 3 queries max for 100 slots is an identified bug. We're in the process of fixing this, which will raise the limit and allow BigQuery to dynamically adjust the concurrent continuous queries being submitted based on the available CONTINUOUS reservation assignment resources.
    - Continuous queries are currently in public preview, which simply means we aren't done with feature development yet. There are some really exciting items on our roadmap, which I cannot comment on in such a public forum, but concerns over cost efficiency, monitoring, administration, etc. are at the VERY TOP of that list.

    • @practicalgcp2780
      @practicalgcp2780 • 24 days ago

      Amazing ❤ thanks for the kind words and also the clarification on the concurrency bug; can't wait to see it lifted so we can try it at scale!

  • @mohammedsafiahmed1639
    @mohammedsafiahmed1639 • 25 days ago

    so this is like CDC for BQ tables?

    • @practicalgcp2780
      @practicalgcp2780 • 25 days ago

      Yes, pretty much, via SQL, but more the reverse of CDC (reverse ETL in streaming mode if you prefer to call it that).

  • @mohammedsafiahmed1639
    @mohammedsafiahmed1639 • 25 days ago

    thanks! Good to see you back

    • @practicalgcp2780
      @practicalgcp2780 • 25 days ago

      😊 was on holiday for a couple of weeks

  • @SwapperTheFirst
    @SwapperTheFirst • a month ago

    Thanks. It is trivial to connect from local VS Code - it is just a small gcloud TCP tunnel command and that's it. Though you're right that the web browser experience is surprisingly good.

    • @practicalgcp2780
      @practicalgcp2780 • a month ago

      Yup, I think a lot of the time when you have a remote workforce it is easier to keep everything together so you can install plugins as well. The tunnel approach can work, but often it's just additional risk to manage and IT issues to resolve when something doesn't work.

  • @user-wf5er3eo8v
    @user-wf5er3eo8v • a month ago

    This is good content. I have a question. I have a use case where my data has the columns "customers_reviews", "Country", "Year" and "sentiment". I am trying to create a chatbot that can answer queries like "Negative comments related to xyz issue from USA from year 2023". For this I need to filter the data for USA and year 2023, with embeddings for the xyz issue to be searched from the database. Which database would be suitable for this: BigQuery, Cloud SQL or AlloyDB? All of these have vector search capabilities, but I need the most suitable and easiest to understand. Thanks

    • @practicalgcp2780
      @practicalgcp2780 • 29 days ago

      One important thing to understand is the difference between databases suitable for highly concurrent traffic (B2C or consumer traffic) and B2B (internal or external business use with a small number of users). BigQuery can be suitable for B2B when the number of users hitting it at the same time at peak is low. For B2C traffic you never want to use BigQuery, because it's not designed for that. There are three databases on GCP suitable for B2C traffic, all of which support highly concurrent workloads: Cloud SQL, AlloyDB and Vertex AI Feature Store vector search if you want serverless. You can use any of the three, whichever you are more comfortable with; Vertex AI Feature Store can be quite convenient if your data is in BigQuery. A video I created recently might give you some good ideas on how to do this: th-cam.com/video/QIZwwCmEhzI/w-d-xo.html

  • @adeolamorren2678
    @adeolamorren2678 • a month ago

    One separate question: if we have dependencies, since it's a serverless environment, we should add the dbt deps command in the Dockerfile args, or the runtime override args, right?

    • @practicalgcp2780
      @practicalgcp2780 • a month ago

      No, I don't think that is the right way to do it. In serverless environments you can still package up dependencies, and this is something you typically need to do at build time, not run time, i.e. while you are packaging the container in your CI pipeline. dbt can generate a lock file which can be used to ensure consistent package versions, so you don't end up with different versions each time you run the build. See docs.getdbt.com/reference/commands/deps The other reason you don't want to do this at run time is that it can be very slow to install dependencies on each run, because they have to be downloaded; plus you may not want internet access in a production environment (to be more secure in some setups), so doing it at build time makes a lot more sense.

  • @adeolamorren2678
    @adeolamorren2678 • a month ago

    with this approach is it possible to add environment variables that are isolated for each run? I basically want to pass environment variables for each run when I invoke google cloud run

    • @practicalgcp2780
      @practicalgcp2780 • a month ago

      Environment variables are typically not designed for manipulating runtime values on each run; they are usually set per environment and stick to each deployment, not each run. But it looks like both options are possible, and I would stick to passing command-line arguments because they are more appropriate to override than environment variables. See this article on how to do it, it's explained well: chrlschn.medium.com/programmatically-invoke-cloud-run-jobs-with-runtime-overrides-4de96cbd158c
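A hedged sketch of the override approach referenced above, using the google-cloud-run (run_v2) client; the job name and arguments are placeholders, and the exact override fields should be checked against the client version you use.

```python
# Hedged sketch: trigger a Cloud Run Job with per-execution argument overrides.
from google.cloud import run_v2

client = run_v2.JobsClient()

request = run_v2.RunJobRequest(
    name="projects/my-project/locations/europe-west2/jobs/dbt-run",  # placeholder
    overrides=run_v2.RunJobRequest.Overrides(
        container_overrides=[
            run_v2.RunJobRequest.Overrides.ContainerOverride(
                args=["build", "--select", "my_model", "--target", "prod"],
            )
        ]
    ),
)

operation = client.run_job(request=request)
execution = operation.result()  # block until the execution finishes
print(f"Execution completed: {execution.name}")
```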

  • @aniket-kulkarni
    @aniket-kulkarni • a month ago

    After researching so much on this topic, finally a video that explains it clearly, especially the motivations and the problem we are solving with PSC.

    • @practicalgcp2780
      @practicalgcp2780 • a month ago

      Comments like this is what keeps me going mate ❤ thanks for the feedback

  • @ritwikverma2463
    @ritwikverma2463 • a month ago

    Thank you Richard for great GCP tutorials, please continue making these GCP video series.

    • @practicalgcp2780
      @practicalgcp2780 • a month ago

      Thanks, will do! Glad you liked these.

  • @42svb58
    @42svb58 • a month ago

    thank you for posting these videos!

  • @dollan1991
    @dollan1991 • a month ago

    I can't find it now, but I remember reading a GCP Issue Tracker entry stating that the sync will always take 5 min+ due to resources that need to be provisioned in the background

    • @practicalgcp2780
      @practicalgcp2780 • a month ago

      I guess for a daily workload it's OK. These things don't typically need to be that up to date for most use cases. I do want to try that continuous mode though, which is more likely designed for real-time sync

  • @travisbot1414
    @travisbot1414 • a month ago

    Awesome videos, these are awesome; you should make courses that cover this content $$$$$$$$

    • @practicalgcp2780
      @practicalgcp2780 • a month ago

      Haha, thanks. It’s more important to share knowledge for free so more companies can adopt Google cloud and make it work better and hope it will become the no. 1 cloud provider 😎 maybe one day in the future I will make a course.

    • @johnphillip9013
      @johnphillip9013 • a month ago

      @@practicalgcp2780thank you so much

  • @AI0331
    @AI0331 • 2 months ago

    This is really an amazing video, especially the troubleshooting part. Very clear 😊 Love it!!

  • @yinliu5471
    @yinliu5471 • 2 months ago

    I like this video; it is the most informative and practical video on the topic of IAP. Thanks for sharing

  • @ritwikverma2463
    @ritwikverma2463 • 2 months ago

    Hi Richard, can we create a Dataproc Serverless job in a different GCP project using a service account?

    • @practicalgcp2780
      @practicalgcp2780 • 2 months ago

      I am not sure I understood you fully, but a service account can do anything in any project, regardless of which project the service account was created in. The way it works is by granting the service account IAM permissions in the project where you want the job to be created; then it will work. But it may not be the best way to do it, as that one service account may end up with too much permission and scope. You can use separate service accounts, one for each project, if you want to reduce scope, or have a master one that impersonates other service accounts in those projects. Keep in mind it's key to reduce the scope of what each service account can do, otherwise when there is a breach the damage across everything can be massive.
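As a hedged illustration of the cross-project setup described above, the sketch below submits a Dataproc Serverless batch in a target project while running as a dedicated service account; all names are placeholders, and the service account must already hold the required Dataproc IAM roles in that project.

```python
# Hedged sketch: Dataproc Serverless batch in another project with a service account.
from google.cloud import dataproc_v1

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": "europe-west2-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/transform.py"  # placeholder
    ),
    environment_config=dataproc_v1.EnvironmentConfig(
        execution_config=dataproc_v1.ExecutionConfig(
            service_account="spark-runner@target-project.iam.gserviceaccount.com"
        )
    ),
)

operation = client.create_batch(
    parent="projects/target-project/locations/europe-west2",
    batch=batch,
    batch_id="transform-20240901",
)
print(operation.result().state)  # wait for the batch to finish and print its state
```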

  • @HARDselection
    @HARDselection • 2 months ago

    As a member of a very small data team managing a complex orchestration workload, this is exactly what I was looking for. Thanks!

  • @nishantmiglani7021
    @nishantmiglani7021 • 2 months ago

    Thanks a lot, Richard He, for creating this insightful video on Analytics Hub.

  • @Iyanu-eb2eh
    @Iyanu-eb2eh • 2 months ago

    how do you know if the sql table is actually connected?

    • @practicalgcp2780
      @practicalgcp2780 • 2 months ago

      Sorry it’s been a while since I created this, if it works it is connected right? Am I missing something?

  • @agss
    @agss • 2 months ago

    Thank you for the very insightful video! What is your take on using Dataform instead of DBT, when it comes to capabilities of both tools and ease to deploy and manage those solutions?

    • @practicalgcp2780
      @practicalgcp2780 • 2 months ago

      Thank you, and spot-on question, I was wondering who would ask this first 🙌 I am actually making a Dataform video in the background, but I don't want to publish it unless I am 100% sure I am saying something useful. Based on my current findings, you could use either, and depending on what you need both can be a good fit. Dataform is a lot easier to get up and running, but it's quite new and I wouldn't recommend it for anything too critical at this stage; it's also missing some key features like templating with Jinja (I don't really like the JavaScript templating system: it's built on TypeScript, which few data teams use, so you would be locked in to something with little support, which in my view is quite dangerous). But it is a lot easier to get up and running natively in GCP. dbt is still the go-to choice in my view, because it is built in Python and has a strong open-source community. For mission-critical data modelling work, I still think dbt is much better.

    • @agss
      @agss • 2 months ago

      @@practicalgcp2780 you brought up exactly what I was worrying about. I highly appreciate your insight!

    • @strmanlt
      @strmanlt • 16 days ago

      Our team was debating migrating from dbt to Dataform. Dataform is actually a pretty decent tool, but the main issue for us was the 1,000-node limit per repo. So maybe if you have very simple models that do not require a lot of nodes it would work fine, but for us long-term scalability was the deciding factor

    • @practicalgcp2780
      @practicalgcp2780 • 16 days ago

      @@strmanlt thanks for the input on this! Can I ask what the 1,000-node limit you are referring to is? Can you share the docs on this? Is it a limit on the number of steps / SQL files you can write?

    • @fbnz742
      @fbnz742 • 1 day ago

      Just wanted to share my thoughts here: I used Dataform for an entire project and it worked quite well. My data model was not very complex, and I learned how to integrate its logs with Airflow, being able to set up alerts to Slack pointing to the log file of the failed job, etc. However, I agree that Dataform templating is very strange. I personally don't have expertise with JavaScript so I struggled a lot with some things, but I was able to do pretty much everything I wanted. I also struggled to find answers on the internet, and dbt is the exact opposite: you can find tons of content online. I would go with dbt.

  • @Rising_Ballers
    @Rising_Ballers • 2 months ago

    Hi Richard, love your content. I always wanted someone to do GCP training videos emphasising real-world use cases. I work with BigQuery and Composer, and I wanted to learn Dataproc and Dataflow, but everywhere I see the same type of training with little focus on real-world implementations. I wanted to learn how Dataproc and Dataflow jobs are deployed to different environments like dev, test and prod. Your videos are helping a lot; I hope you will do more videos on Dataflow and Dataproc and how these jobs are created in real projects using CI/CD

    • @practicalgcp2780
      @practicalgcp2780 • 2 months ago

      No worries glad you found this useful ❤

    • @Rising_Ballers
      @Rising_Ballers • 2 months ago

      @@practicalgcp2780 I have one doubt: in an organisation, if we have many Dataproc jobs, how do we create them in different environments like dev, test and prod? Can you please do a video on that

  • @ayoubelmaaradi7409
    @ayoubelmaaradi7409 • 3 months ago

    🤩🤩🤩🤩🤩

  • @ap2394
    @ap2394 • 3 months ago

    Thanks for the detailed video. Can we have scheduling at the task level? E.g. if I have 2 tasks in the downstream DAG and they depend on different datasets, can I control the schedule at the task level?

    • @practicalgcp2780
      @practicalgcp2780 • 20 days ago

      Just realised I never replied to this one, my apologies. I am not sure that is the right way to think about how this works. Regardless of which task or DAG it is, it's about listening to a change event emitted when something in the upstream dataset gets triggered, then reacting to that event. As long as you design DAGs so that triggering a DAG from a change event is the right behaviour, it will work.
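A minimal sketch of the pattern described in this thread, assuming Airflow 2.4+ and a made-up dataset URI: the downstream DAG has no cron schedule at all and simply runs whenever the upstream task updates the dataset.

```python
# Hedged sketch: Airflow data-aware scheduling with Datasets (placeholder URI).
import pendulum

from airflow import Dataset
from airflow.decorators import dag, task

orders = Dataset("bigquery://my-project/sales/orders")


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def upstream_load():
    @task(outlets=[orders])  # marks the dataset as updated when this task succeeds
    def load_orders():
        ...  # load / refresh the upstream table

    load_orders()


@dag(schedule=[orders], start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def downstream_transform():
    @task
    def build_reporting_model():
        ...  # runs only after the orders dataset has been updated

    build_reporting_model()


upstream_load()
downstream_transform()
```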

  • @viralsurani7944
    @viralsurani7944 • 3 months ago

    Getting below error while running pipeline with DirectRunner. Any idea? Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.

  • @DExpertz
    @DExpertz • 3 months ago

    I appreciate this video Sir, 😍 (Subscribed and liked) will share too with my team.

    • @practicalgcp2780
      @practicalgcp2780 • 3 months ago

      Thanks so much for your support ❤

    • @DExpertz
      @DExpertz • 3 months ago

      @@practicalgcp2780 Of course man, thank you for sharing this information in a simpler way

  • @10xApe
    @10xApe • 4 months ago

    Can Cloud Run be used as a Power BI data refresh gateway?

    • @practicalgcp2780
      @practicalgcp2780 • 3 months ago

      I haven't used Power BI, so I googled what a data refresh gateway is. According to learn.microsoft.com/en-us/power-bi/connect-data/refresh-scheduled-refresh it looks like it's some sort of service where you control refreshes via a schedule? Unless there is an API that lets you trigger it from the Google Cloud ecosystem, I am not sure you can use it. I assume you are thinking of triggering some dbt job first and then refreshing the dashboard?

  • @adityab693
    @adityab693 • 4 months ago

    In my org, they have to get an exception, and the analyticshub.listing.subscribe role is not available. Also, data can be shared within the org VPC; how about sharing outside of the VPC?

  • @SamirSeth
    @SamirSeth • 4 months ago

    Simply the best (and only) clear explanation of how this works. Thank you very much.

  • @QuynhNguyen-zy2rs
    @QuynhNguyen-zy2rs • 4 months ago

    Hi, After you have created data profile scan and data quality scan, is the insights tab displayed? I don't see the insights tab in your video. Please explain to me! Thanks!

  • @alifarah9
    @alifarah9 • 4 months ago

    Really appreciate these high-quality videos! Seriously, your videos are better than the official videos for GCP. What makes them invaluable is that you teach from first principles and talk about problems that will be faced in any cloud environment, not just GCP.

    • @practicalgcp2780
      @practicalgcp2780 • 4 months ago

      Thank you so much 🙏 you are right, the principles are very much the same no matter which cloud provider it is. My focus is GCP because I believe that as an ecosystem it's much more powerful, yet remains the easiest to implement and scale compared to other cloud providers.

  • @anantvardhan1212
    @anantvardhan1212 • 4 months ago

    Amazing explanation! However, I have a doubt regarding the use of OAuth 2.0 creds in this whole setup. Does the OAuth client ID represent the backend service here, which is delegating authentication to IAP?

    • @practicalgcp2780
      @practicalgcp2780 • 4 months ago

      Thank you, and I don't think this was explained well in the video. I did some more reading, and one thing I noticed is that the docs on how to create the backend service for the LB have changed: cloud.google.com/iap/docs/enabling-cloud-run#enabling. As you can see at 15:08 in the video, it used to require the client_id and client_secret to create the backend and enable IAP, but that doesn't seem to be there anymore. The latest docs have a note saying "The ability to authenticate users with a Google-managed OAuth client is available in Preview." Technically, if it's in preview the docs shouldn't have removed the old option, but if it is accurate then by default it will use the Google-managed OAuth client and creating the credentials manually is no longer required. I've not tested this out yet, but I think it's worth trying it without a custom credential and just enabling IAP. It makes sense, as creating one manually and then specifying it is a lot of faff since you need to manage the secret rotation etc. yourself.

    • @practicalgcp2780
      @practicalgcp2780 • 4 months ago

      And my understanding the way this works is when a user comes in, the user will pass the auth header, the load balancer backend will intercept and use IAP to do the verification to see if the user has permission or not which is defined in IAM with the user group. Because the IAP SA has been granted the invoker access to the cloud run service, hence user will be granted access after passing through the IAP validation

  • @harshchoudhary6069
    @harshchoudhary6069 • 4 months ago

    How can we share an authorized view using Analytics Hub?

    • @practicalgcp2780
      @practicalgcp2780 • 4 months ago

      It makes no difference using authorised views, as authorised view permissions are managed the same way as tables, different to normal views. However, using authorised views has some tradeoffs, a key one being losing metadata such as column descriptions which isn’t great for data consumers. But it does have the advantage if you don’t want to duplicate data models or increase latencies

  • @LongHD14
    @LongHD14 • 4 months ago

    May I ask one more question regarding this matter? I would like to implement permissions for a chat application concerning access to documents. For example, a person should only have access to certain tables or specific fields within those tables, and if they don't have permissions, they wouldn't be able to search. Do you have any suggestions or keywords that might help with this issue? Thank you very much for your assistance

    • @practicalgcp2780
      @practicalgcp2780 • 4 months ago

      That is something you have to do through some sort of RBAC implementation (role based access control). That isn’t anything to do with the search, it’s more on mapping out the role of a user through logging in like what most applications do today. Then depends on the role, you can add specific filters in the search queries, like filtering via certain metadata or have a set of tables you can restrict based on roles etc.

    • @LongHD14
      @LongHD14 • 4 months ago

      Sure, I understand that. However, I'm looking for a service that can assist me with implementing RBAC.

    • @practicalgcp2780
      @practicalgcp2780 • 4 months ago

      ok I see. I think it really depends on what you are using. For example, if you are building a backend with Python. You can use Django which has a RBAC module, but generally any framework would have some sort of RBAC component you can use. If it’s an internal app (like for within the company use) then you can simplify things by just using IAP, but IAP isn’t suitable for external consumer applications

    • @LongHD14
      @LongHD14 • 4 months ago

      @@practicalgcp2780 thank you for your answer!

  • @kavirajansakthivel
    @kavirajansakthivel • 5 months ago

    Hello Richard, it was a wonderful video, but somehow I couldn't set up the TCP proxy. How did you do it? Through the reverse proxy method or the auth proxy method? You seem to be the only person who has done this successfully so far. Could you please create a tutorial video for the same?

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Hi there, it’s been a while since I did it last time, it’s gonna be quite difficult to understand what your problems are as it’s a quite complex setup. If I remember correctly this is the documentation I followed cloud.google.com/datastream/docs/private-connectivity#reverse-csql-proxy, make sure you follow this step by step, especially don’t forget to open the required firewall rules as this can be a common cause.

  • @LongHD14
    @LongHD14 • 5 months ago

    Wow, this video is incredibly insightful and informative! 👏 I've learned so much and am grateful that you've shared this valuable content with us. Just a quick question: Could I apply these concepts to create a conversation app that details the findings in the search results? Looking forward to your guidance on this.

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Glad you found it useful! I don't see why not, but as I mentioned, for a conversational app (I assume you mean an app that is consumer facing, i.e. real customers) the concept is exactly the same, but you need to change the vector DB to something that supports highly concurrent workloads. So BigQuery is out of the picture; you can look at Vertex AI Vector Search and also AlloyDB, which I am hearing about a lot lately. I haven't tried either yet, but as far as I know they are both valid approaches for consumer apps that need highly concurrent workloads. The docs for AlloyDB are here: cloud.google.com/alloydb/docs/ai/work-with-embeddings

    • @LongHD14
      @LongHD14 • 5 months ago

      Thank you for your valuable insights and guidance!

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      You are welcome ;)

  • @kamalmuradov6731
    @kamalmuradov6731 • 5 months ago

    I implemented a similar solution using Cloud Workflows (CW) + Cloud Functions (CF). The CW runs a loop and makes N requests to the CF in parallel each iteration, where N is equal to the CF’s max instances. I’ll look into querying Stackdriver each loop to dynamically determine concurrency. I chose CW over Cloud Scheduler (CS) for a few reasons. First, CS is limited to at most 1 run per minute, which wasn’t fast enough to keep my workers busy (they process a batch in under 30 seconds). Second, CS can’t make N requests in parallel so would required something in between to replicate the CW is doing. Third, CW has a configurable retry policy which is handy for dealing with the occasional CF network issues. One caveat with CW is that a single execution is limited to 100k steps. To workaround this issue, I limit each CW execution to 10k loops, at the end of which it triggers a new workflow execution and exits. I setup an alerting policy to ensure there is always exactly 1 execution of this workflow running and haven’t had any issues.

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Hmm, interesting approach, although I am not sure we are comparing apples with apples here. The solution demonstrated in this video is an always-on approach. In other words, the pull subscriber is always listening to the Pub/Sub subscription; it doesn't die after processing all remaining messages, but simply waits. So if you change the interval of Cloud Scheduler to 10 minutes and let the pull subscriber run for 9 minutes 50 seconds, for example, it will not get killed until it reaches that timeout (which is in the code example I gave). I am not sure if I misunderstood you here, but the solution is no different from what you would normally do with a GKE deployment; it's just an alternative without needing any infrastructure.
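For reference, a rough sketch of that "always on for a bounded window" pull consumer, with a placeholder subscription name and timeout; the actual code shown in the video may differ.

```python
# Hedged sketch: streaming pull subscriber that runs until a timeout, then exits.
from concurrent import futures

from google.cloud import pubsub_v1

SUBSCRIPTION = "projects/my-project/subscriptions/orders-sub"  # placeholder
RUN_FOR_SECONDS = 9 * 60 + 50  # just under the Cloud Scheduler interval


def handle(message: pubsub_v1.subscriber.message.Message) -> None:
    # ... process / batch the payload here ...
    message.ack()


subscriber = pubsub_v1.SubscriberClient()
streaming_pull = subscriber.subscribe(SUBSCRIPTION, callback=handle)

with subscriber:
    try:
        streaming_pull.result(timeout=RUN_FOR_SECONDS)
    except futures.TimeoutError:
        streaming_pull.cancel()  # stop accepting new messages
        streaming_pull.result()  # wait for in-flight callbacks to finish
```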

    • @kamalmuradov6731
      @kamalmuradov6731 • 5 months ago

      That sounds correct! In my case the CF does a “synchronous pull” of a few thousand messages, processes them, and acks them all in bulk. So it’s not an always-on streaming setup like what you demoed here. It handles 1 batch per request, shuts down, and then is invoked again in the next loop by the CW. For this particular use case, batching is advantageous so I went with synchronous pull. But it would be straightforward to switch the CF to a streaming pull if batching was not necessary.

  • @stevenchang4784
    @stevenchang4784 • 5 months ago

    VS Code here is a browser-based environment, but a lot of organisations restrict Cloud Shell and SSH connections. Do you think Cloud Workstations could get around those restrictions?

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Hi there, I haven’t used this at scale yet but my understanding is one of the most important reasons for having cloud workstation is to get around these restrictions. The common reason cloud shell does not work in many org is because the inability to support private IP and VPC SC, but this isn’t the case for workstations as these are deployed within your network. Check this out and it’s documented here cloud.google.com/workstations

    • @stevenchang4784
      @stevenchang4784 • 5 months ago

      @@practicalgcp2780 Hi, I tested it all day. Thank you for your reply. It really solves the Cloud Shell public IP issue.

  • @jean9174
    @jean9174 • 5 months ago

    😎 "Promosm"

  • @digimonsta
    @digimonsta • 5 months ago

    Really interesting and informative. I'm currently looking at migrating away from a GKE workload purely because of the complexity, so this may prove useful. I'd be interested to know if you feel Cloud Run Jobs would support my use case? Essentially, based on a Pub/Sub message, I need to pull down a bunch of files from a GCS bucket, zip them into a single archive and then push the resulting archive back into GCS. This zip file is then presented to the user for download. There could be many thousands of files to zip and the resulting archive could be several terabytes in size. I was planning on hooking up Filestore or GCS FUSE to Cloud Run to facilitate this. The original implementation was in Cloud Run (prior to Jobs), but at the time no one knew how many files users would need to download or how big the resulting zip files would be. We had to move over to GKE, as we hit the maximum time limit allowed for Cloud Run before it was automatically terminated.

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Thanks for the kind comment. It's quite an interesting problem, because the size of the archive can potentially be huge so it can take a long time. You are right: a Cloud Run service, I think even today, can only handle up to a 1-hour timeout, while a Cloud Run job can handle 24 hours now. So if your archive process won't take longer than a day, I don't see why you can't use this approach. If you need more time you can look at Cloud Batch, which can run longer without needing to create a cluster, but it's more complex to track the state of the operation; I have another video describing use cases for Batch. Having said that, it feels a bit wrong to have archives of such huge size. Have you considered generating the Pub/Sub messages from upstream systems in smaller chunks, or using the Cloud Run service to break things down and only zip so many files in a single execution, tracking the offset somewhere (i.e. in a separate Pub/Sub topic) to trigger more short-lived zip operations? The thing is, if there's a network glitch, which happens every now and then, you could waste a huge amount of compute. Personally, I would always prefer to make the logic slightly more complex in the code than maintain a GKE cluster myself, just to keep the infrastructure as simple as possible, but that is just my opinion.

  • @user-dl5mm9fu9g
    @user-dl5mm9fu9g • 5 months ago

    The introduction is very detailed and very good. Good job, buddy...

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Thank you! It's great you found it useful

  • @eyehear10
    @eyehear10 • 5 months ago

    This seems like it adds more complexity compared with the push model

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Yes it does, but not everything upstream supports the push model, plus not every downstream can handle the load via the push model. I explained some of the pros and cons, mainly related to controlling or improving throughput (i.e. limiting how much traffic you consume if there is too much, or using batching). A really important thing to consider is the downstream system and how many connections you establish, or how many concurrent requests you make if, for example, the downstream system is an HTTPS endpoint. Opening too many requests can easily overwhelm the system on the other side, whereas batching the requests or opening a single connection and reusing it makes a huge difference. If it's possible to use push without the constraints above, it's almost always better to use push. Hope that makes sense

  • @SwapperTheFirst
    @SwapperTheFirst • 5 months ago

    Any examples of such tools for cataloging, certification and lineage? Especially OSS? I had some experience with Qlik Catalog, but not sure if this is a good choice to GCP and how well it is integrated with BQ. Beyond usual suspects (Collibra, Immuta, ...)

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      There are a few GCP partners with very good GCP integration that save engineers a lot of time on metadata integration. Collibra is one of them, as you already mentioned; you can also look at Atlan, a new player in the field with some powerful features too. Those are the two I am aware of that in my view have pretty good integration and features, but please do your own research; there are pros and cons and these are not recommendations I am making here. By OSS do you mean support systems like JSM?

    • @SwapperTheFirst
      @SwapperTheFirst • 5 months ago

      @@practicalgcp2780 nope, I mean open source software, like Apache Airflow for workflow management. From which you can also make managed solutions, like Astronomer or Cloud Composer. I think something should exist in this space too?

  • @SwapperTheFirst
    @SwapperTheFirst • 5 months ago

    I like this format of battle stories/coaching.

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      Thanks ☺️ I thought I'd try a different way to present; it feels like more people can relate to this

  • @WiktorJurek
    @WiktorJurek • 5 months ago

    This is bang on. It would be awesome to see how this works in practice - as in, how all of this looks in the console, how to set it up, and practically how you can oversee/manage this kind of setup.

    • @practicalgcp2780
      @practicalgcp2780 • 5 months ago

      There's quite a lot of effort involved, but the foundation isn't that difficult to set up. It's not as though there is a single UI where everything can be done. I think the entry point for data management and discovery for a large group of users can be the catalog tool, and a platform team can own the tooling for things like quality scans and Analytics Hub while making them self-service. There are some things, especially the data quality check rules, I would prefer to keep in version control so it's much easier to control the changes and the quality of the checks, whereas for other things, like Analytics Hub, the UI should be sufficient as long as there is a way to recover if something goes wrong.