Near real-time CDC using DataStream

  • Published 5 Sep 2024

Comments • 17

  • @calufa 10 months ago +1

    Good stuff!

  • @manojkumarchinnaswamy4489 7 months ago +1

    Hi Richard, thank you for the wonderful tutorial. I find myself in a situation where I need to revoke the SELECT permission on the MySQL database after the Datastream backfill has run, while the CDC is still in progress. The intent is to relieve some load on the master. Although Google's documentation states that we need the SELECT permission on the database, I had a conversation with the Google team and they want me to test this scenario. Have you encountered something similar? What are your thoughts?

    • @richardshenghua 7 months ago

      Hi there, thanks for the kind comments. I haven’t used this service for a while. I remember that when I created the video I wasn’t comfortable using Datastream’s built-in backfill because it lacked the control I needed (I wasn’t sure whether it could run against a replica, or avoid putting too much pressure on the source). As a result I haven’t done any backfills with this solution. The easiest and safest way to validate it is to use a sandbox environment to simulate exactly what you want to test: revoke the permission and see what happens.
      But from what I understand, SELECT is only used for the backfill, because the CDC itself is based on reading the binary logs rather than going through the SQL layer. So my assumption is that it will work without issues. That said, why would you want to revoke it? Are there any security concerns? If the service account is locked down and only has access to the binlogs, I’m not sure why you wouldn’t keep it. What if you need to do another backfill?
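
      As a rough sketch only (the user name and the *.* scope are assumptions, not from the video), the privileges involved could look like this, with only SELECT being dropped after the backfill since the binlog-based CDC relies on the replication privileges:

      ```sql
      -- Sketch: privileges a binlog-based CDC user typically needs on MySQL.
      -- The 'datastream' user name and the *.* scope are assumptions.
      GRANT REPLICATION SLAVE, REPLICATION CLIENT, SELECT ON *.* TO 'datastream'@'%';

      -- After the backfill has completed, SELECT could be revoked while the
      -- CDC keeps running on the replication privileges alone.
      REVOKE SELECT ON *.* FROM 'datastream'@'%';
      ```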

    • @manojkumarchinnaswamy4489 7 months ago

      > That said, why would you want to revoke it?
      Thanks @richardshenghua. The concern is purely the load impact: if Datastream misbehaves and scans entire tables all the time, we think it could hurt the master. At the same time, I don't want to use a replica either, as its operational cost is fairly high.

  • @SaiDileepfantasy 1 year ago +2

    Hi, can we connect BI tools like Tableau or Looker directly to these authorized views? Is it advisable?

    • @practicalgcp2780 1 year ago

      You can do it, but it is likely to cause problems down the road, and I wouldn’t advise giving these raw tables to a large group of people. These views are basically data lake views: the data is very raw and has tons of duplicates (for good reasons). It’s not great for direct consumption by a large group, or they will just build different versions of the same thing that don’t match up.
      A better approach is to give them to a small, centralised team of analytics engineers; that is where you can build the clean data models, and those are what you then expose to the BI tools. At the end of the day it’s not always possible to have everything clean from the start, so it’s a trade-off between speed and quality. And if you are using DBT already, it’s quite easy to put something in that middle layer.
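
      As an illustration only (dataset, table and column names are invented, and it assumes Datastream’s source-timestamp metadata column is present on the raw table), the deduplication that such a middle layer typically does could be sketched as:

      ```sql
      -- Sketch: keep only the latest CDC row per key from the raw replicated table.
      CREATE OR REPLACE VIEW analytics.orders_latest AS
      SELECT *
      FROM raw_cdc.orders
      WHERE TRUE
      QUALIFY ROW_NUMBER() OVER (
        PARTITION BY order_id
        ORDER BY datastream_metadata.source_timestamp DESC
      ) = 1;
      ```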

  • @ItsMe-mh5ib 10 months ago +1

    Hello Richard, first of all thanks a lot for all the wonderful videos. I have a question: is there something like this for replicating data in near real time from BigQuery into a relational DB? If there is no out-of-the-box service for this, what would be the best way to build a custom solution? Thanks! Keep up the great work.

    • @practicalgcp2780 10 months ago

      No worries, glad you found the videos useful. To answer your question, the short answer is no, and for good reasons. Typically the databases used to master and collect data are RDBMSs, which is what these CDC-type solutions are designed for, as they are transactional databases built to run OLTP workloads.
      Systems like BigQuery or Snowflake are analytical databases. They are typically where the data ends up for analytics purposes, i.e. querying it to get insights; they are not designed to be consumed in real time in the same way, and they don’t support binary-log-based replication.
      There are, however, some use cases, often called analytics operations, where you run some queries (maybe using DBT, Airflow or BigFrames) in batch and then send the results to Cloud Pub/Sub or another message queue; all downstream systems consuming those messages are then event driven, or you could say real time from that point onwards. See a new video I created, it might give you some ideas: www.linkedin.com/posts/shenghuahe_serverless-distributed-processing-with-bigframes-activity-7121664636927533056-8fjp?
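
      As a hedged illustration of the batch side of that pattern (all names are invented, and the actual publishing to Pub/Sub would be handled by the orchestrator, e.g. an Airflow task), the query step might look like:

      ```sql
      -- Sketch: pick up rows changed since the last run and shape them as JSON
      -- payloads for a separate publisher step to push onto a Pub/Sub topic.
      SELECT
        order_id,
        TO_JSON_STRING(STRUCT(order_id, status, updated_at)) AS message_payload
      FROM analytics.orders_latest
      WHERE updated_at > @last_run_timestamp;
      ```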

    • @ItsMe-mh5ib 10 months ago

      @practicalgcp2780 Thanks a lot for your prompt response. Appreciate all you do!

  • @ariefhalim5425 1 year ago +2

    Hello Richard,
    Great video. I have some questions though:
    From your video, it appears that the data freshness is 56s, system latency is 5-10s, and total latency is 20-90s.
    Is this the normal SLA for Datastream? Wouldn't Debezium have much lower latency?
    Also, the need for a TCP proxy somewhat defeats the serverless model of Datastream.
    As the number of transactions increases, do we need to scale the TCP proxy? How do we scale it?

    • @practicalgcp2780 1 year ago

      Hi Arief, the latency on the source should be near real time (in milliseconds) because the solution is streaming from the binlog. On the destination, however, it depends on how you have configured the staleness (see point no. 3 in this article, which explains it pretty well: medium.com/google-cloud/configure-streams-in-datastream-with-predefined-tables-in-bigquery-for-postgresql-as-source-528340f7989b). It’s a decision based on latency vs cost. BigQuery is an analytical database, so a latency of less than a few minutes usually doesn’t make a big difference, but depending on what you use the data for downstream it might.
      The proxy is a limitation if you want a private connection with no exposure to internet traffic. It’s also because I have the source database on Cloud SQL; otherwise you wouldn’t have this issue. Having said that, it is still a bit of a bummer that you have to create and manage a proxy yourself; I hope Google finds a way to avoid needing it in the future. With regards to scaling, the TCP proxy really isn’t doing much other than forwarding data over a single TCP socket, so it shouldn’t be that heavy. I recommend you test it under high traffic before using it for anything important; I haven’t tested it at scale yet. Finally, keep in mind the BigQuery target isn’t in GA yet, so it’s not supported; feel free to try it out, but don’t use it for anything critical until it reaches GA.
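
      For reference, the staleness setting mentioned above is configured on the BigQuery target table; a minimal sketch (the table name and interval here are assumptions) could be:

      ```sql
      -- Sketch: trade freshness for cost on the CDC target table in BigQuery.
      ALTER TABLE raw_cdc.orders
      SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
      ```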

  • @mickaelchau2415 1 year ago +2

    Hello, very interesting, thank you!
    I don't understand why replication of historical data could cause issues.
    Let's say the binary log gets corrupted from timestamp X onwards. To be sure we have valid data, can we not just delete everything after X-1?

    • @practicalgcp2780 1 year ago

      Very good question 👍. Typically, historical data doesn’t come through the binary log (whether with Datastream or similar solutions). This is a publisher-subscriber model, and you can’t really go back in the binary logs: the underlying technology could be an in-memory Kafka that receives only new events from the binlogs (i.e. how Debezium works), and you can’t easily go back to historical binlogs unless they have already been streamed out. The other thing is that you can’t keep all binlogs forever on a database, so the most common way to get historical data is simply a SELECT * via SQL. However, that can put a lot of load on the database, which is why it is often controlled by a separate process. If the data is very large, doing it directly in the CDC solution can take a very long time; if you instead do it on a replica with a distributed compute framework such as Spark, you can get the historical data out a lot quicker and in a controlled way. Hope that makes sense.
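
      As a rough sketch of that controlled SELECT-based backfill against a replica (table and key names are invented), reading in key-ordered chunks keeps the load on the database bounded:

      ```sql
      -- Sketch: backfill one bounded chunk at a time from a read replica,
      -- resuming from the last key seen in the previous chunk.
      SELECT *
      FROM orders
      WHERE order_id > @last_seen_id
      ORDER BY order_id
      LIMIT 100000;
      ```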

  • @kavirajansakthivel 5 months ago

    Hello Richard, it was a wonderful video, but somehow I couldn't set up the TCP proxy. How did you do it, through the reverse-proxy method or the Auth Proxy method? You seem to be the only person who has succeeded with this so far. Could you please create a tutorial video on it?

    • @practicalgcp2780 5 months ago

      Hi there, it’s been a while since I last did this, so it’s going to be quite difficult to work out what your problem is, as it’s a fairly complex setup.
      If I remember correctly, this is the documentation I followed: cloud.google.com/datastream/docs/private-connectivity#reverse-csql-proxy. Make sure you follow it step by step, and in particular don’t forget to open the required firewall rules, as that is a common cause of failures.

  • @AyushMandloi 6 months ago

    Can you release an updated video on the same topic?

    • @practicalgcp2780 6 months ago

      It hasn’t changed much since then. Is there anything in particular you would like me to cover?