Microservice Pitfalls: Solving the Dual-Write Problem | Designing Event-Driven Microservices

  • Published on 18 Sep 2024

Comments • 29

  • @ConfluentDevXTeam
    @ConfluentDevXTeam 2 months ago +1

    Wade here. This is my second video on the Dual-Write Problem. The previous video generated a lot of interesting comments and discussions. I'd love for the same to happen here. If you haven't watched the previous video, you can find it linked in the description. But if you are still confused about the Dual-Write problem, drop me a comment and I'll do my best to try and clarify.

    • @marcom.
      @marcom. 2 months ago

      Hi Wade. Thanks a lot for your videos. I think I have one thing that you could probably address. I read somewhere that, if you use CDC (e.g. Debezium), there is no need to store the events in the outbox table at all. It is sufficient to save and immediately delete them, because CDC only sees the changelog of the database, reacts to the insert, and ignores the delete. Right?

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago

      @@marcom. Wade here. That makes the assumption that the events that Debezium sees are the correct events to publish. Debezium is publishing changes to tables that are internal to your microservice. Do you really want to expose the internal structure of your database to the rest of the world? What if there is private data in the records? How will that affect your ability to evolve the database later? The benefit of the Outbox table is that it allows you to customize what the outside world sees. It keeps your internal database structure decoupled from your external API.
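
      Below is a minimal sketch of the pattern Wade describes here: the business row and a curated outbox event are written in one local transaction, so only the sanitized payload is ever exposed. Python with sqlite3 is used purely for illustration; the table names, columns, and event shape are hypothetical.

      ```python
      import json
      import sqlite3
      import uuid

      conn = sqlite3.connect("service.db")
      conn.executescript("""
          CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, customer_id TEXT, ssn TEXT);
          CREATE TABLE IF NOT EXISTS outbox (id TEXT PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0);
      """)

      def place_order(order_id: str, customer_id: str, ssn: str) -> None:
          # The outbox payload is a deliberately public-facing shape: private columns
          # (the ssn) and the internal table structure never leak out.
          event = {"type": "OrderPlaced", "orderId": order_id, "customerId": customer_id}
          with conn:  # one local transaction: both rows commit, or neither does
              conn.execute(
                  "INSERT INTO orders (id, customer_id, ssn) VALUES (?, ?, ?)",
                  (order_id, customer_id, ssn),
              )
              conn.execute(
                  "INSERT INTO outbox (id, payload, published) VALUES (?, ?, 0)",
                  (str(uuid.uuid4()), json.dumps(event)),
              )
          # A separate relay (CDC on the outbox table, or hand-written code) later
          # publishes the outbox rows to Kafka, so there is no dual write here.
      ```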

  • @MrGoGetItt
    @MrGoGetItt 2 months ago

    Exceptional content delivery! Not only were you articulate, but the visuals were an excellent aid. Great work

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago +1

      Wade here. Thanks. I'm glad you enjoyed it. On the visual side, I've started working a bit more with animations to try to assist in focusing the eyes where they need to be. It sounds like that has been effective.

  • @AxrelOn
    @AxrelOn 2 months ago

    Great video, thanks! When using the outbox pattern, the outbox records should be removed once the Kafka producer receives an ack, but could batching prevent that? Should batching be disabled in this case?

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago

      Wade here. The outbox records don't have to be removed. They can just be marked as published if you prefer to keep them around. In terms of batching, you'll have to decide what the consequences of duplicate publishes would be and establish the right contracts on your data. Duplication is always going to be possible, assuming you are aiming for an "at-least-once" delivery mechanism. But with batching, those duplicates can be more complicated because you can end up with duplicate batches rather than single events. If you are going to use batching, make sure your downstream consumers are aware of the potential duplicates so they can react appropriately.
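
      A minimal sketch of the consumer-side awareness Wade mentions, assuming each published event carries a stable ID; the names are hypothetical, and a real service would persist the processed IDs rather than keep them in memory.

      ```python
      # With at-least-once delivery (and possibly re-sent batches), duplicates can
      # arrive, so the consumer drops any event ID it has already processed.
      processed_ids: set[str] = set()

      def handle(event: dict) -> None:
          event_id = event["eventId"]    # hypothetical stable ID on every event
          if event_id in processed_ids:  # duplicate single event or duplicate batch member
              return
          # ... apply the event's effects here ...
          processed_ids.add(event_id)
      ```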

  • @ilijanl
    @ilijanl 2 months ago +1

    You can actually leverage the legacy DB transaction to publish to Kafka, with some tradeoffs. The flow can be the following (sketched below):
    1. Start a transaction
    2. Insert into the legacy DB
    3. Publish to Kafka
    4. Commit
    If step 2 or 3 throws, nothing is committed and the whole handler fails, which can be retried later. If for some reason 2 and 3 succeed but 4 fails, you have published the event to Kafka without storing it in the DB; however, you now have at-least-once publishing.
    The tradeoff is of course that your runtime has a dependency on Kafka, and if Kafka is down, the transaction can never succeed. However, they say Kafka is HA and high-performance, so the problem might be smaller than it seems.
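
    A minimal sketch of the flow described above, assuming Python with sqlite3 and confluent-kafka purely for illustration (the table and topic names are hypothetical); the gap between steps 3 and 4 is what the replies below discuss.

    ```python
    import sqlite3
    from confluent_kafka import Producer

    conn = sqlite3.connect("legacy.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders_raw (payload TEXT)")
    conn.commit()
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def handle_command(order_json: str) -> None:
        try:
            # steps 1-2: the sqlite3 transaction starts implicitly with this write
            conn.execute("INSERT INTO orders_raw (payload) VALUES (?)", (order_json,))
            # step 3: publish to Kafka and wait for the broker to acknowledge
            producer.produce("orders", value=order_json)
            producer.flush()
            # step 4: commit the database transaction
            conn.commit()
        except Exception:
            conn.rollback()  # if step 2 or 3 throws, nothing is committed
            raise
        # If the process dies after flush() but before commit(), the event is in
        # Kafka but the row never lands in the DB -- the case discussed below.
    ```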

    • @Fikusiklol
      @Fikusiklol 2 months ago +1

      That is not true - "now you have at-least-once publishing".
      It depends on:
      1. whether you have transient retries or not, and those might not work either
      2. whether the user would want to retry or not
      I'd say that's fundamentally wrong and should be avoided for transactions that are important.
      For generic concerns like emails - whatever.
      All in all, just ask your business people about it.

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago

      Wade here. Unfortunately, this solution doesn't work. I go into it specifically in a related video here: developer.confluent.io/courses/microservices/the-dual-write-problem/
      Essentially, the problem here is exactly what you outline for #4 failing. In that case, you are publishing an event for something that never happened. Downstream consumers are all going to be notified about a change to the database that never occurred. It's still broken, just in a different way.

    • @ilijanl
      @ilijanl 2 months ago

      @@ConfluentDevXTeam You are correct if your handler doesn't have any retry logic. If it does, then the commit (step 4) will eventually succeed if you set it up correctly (doing idempotency etc). However, like you mentioned in the video, I also prefer the outbox pattern.

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 1 month ago

      @@ilijanl Wade here. Retry logic, unfortunately, still doesn't solve the problem (on its own). If the failure results in the entire application crashing, then any in-memory state will get lost, and you no longer have anything to retry. Your retry won't get executed, and you still end up with inconsistencies.
      But...You might say...What if you store the details of the retry in a durable fashion so that if your application crashes, you can still retry anything you missed? Well...Now you've invented the Transactional Outbox (or Event Sourcing).

  • @petermoskovits8470
    @petermoskovits8470 2 months ago

    At 1:25 you cover in great detail how to address the problem when the Kafka write fails and the DB write succeeds. How about the other way around? What if the Kafka write succeeds, and the DB write fails?

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago +3

      Wade here. There's an implicit assumption in the video about order. I am assuming that your order of operations is 1. Write to the Database, 2. Write to Kafka. With that assumption in mind, if the write to the database fails, then presumably the operation aborts, and the write to Kafka never happens. This scenario is fine and we don't have to do anything. Now, if you reverse the order of operations and do the write to Kafka first, then we are back to the same problem and have to investigate the same solutions. You can find a deeper discussion on this here: developer.confluent.io/courses/microservices/the-dual-write-problem/. There's also a separate edge case that I haven't discussed. What happens if the write to the database times out? In that case, it's actually possible that the write succeeded, but the code will view it as a failure and the write to Kafka won't proceed, again resulting in an inconsistency. To be safe, it's best to deal with the dual-write problem upfront so we can avoid these edge cases.
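
      A minimal sketch of that assumed order of operations (database first, then Kafka), with the failure points from the answer marked in comments; sqlite3 and confluent-kafka are used purely for illustration and the names are hypothetical.

      ```python
      import sqlite3
      from confluent_kafka import Producer

      conn = sqlite3.connect("service.db")
      conn.execute("CREATE TABLE IF NOT EXISTS orders_raw (payload TEXT)")
      conn.commit()
      producer = Producer({"bootstrap.servers": "localhost:9092"})

      def naive_dual_write(order_json: str) -> None:
          # 1. Write to the database.
          conn.execute("INSERT INTO orders_raw (payload) VALUES (?)", (order_json,))
          conn.commit()
          # If the commit raises a real error, we never reach Kafka: still consistent.
          # If it merely times out, the row may exist even though the code saw a
          # "failure", so the Kafka write below never happens: inconsistent.
          # 2. Write to Kafka.
          producer.produce("orders", value=order_json)
          producer.flush()
          # If the process crashes between the commit and produce(), the database has
          # the row but Kafka never sees the event: the dual-write problem.
      ```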

  • @ghoshsuman9495
    @ghoshsuman9495 2 months ago

    What occurs if the Change Data Capture (CDC) process attempts to write to Kafka, but Kafka is unavailable? How do we write the retry logic?

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 1 month ago

      Wade here. If you are using a CDC system, then it hopefully has retries built-in. If it doesn't, then you might not want to use it if at-least-once delivery is your goal.
      If you opt to not use a CDC process with built-in retries, then you are left writing your own logic. The basic idea there, assuming you are using an Outbox or Event Sourcing is:
      - Read each record from the database.
      - Emit the record to Kafka.
      - When a failure occurs, retry the record (can be just a simple retry loop).
      - Whenever the application restarts, pick up from the last record that was successfully emitted to Kafka.
      Just be aware that a "failure" doesn't always mean the write failed. In the event of a timeout, it's possible that the write succeeded but took too long to respond, or the acknowledgment got lost somewhere. In that case, sending the event again results in a duplicate. That's why it's called an `at-least-once` delivery. You could potentially use Apache Kafka's transactional semantics to eliminate the duplication.
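
      A minimal sketch of that loop, assuming an outbox table with a `published` flag and using confluent-kafka purely for illustration; the delivery-report handling is simplified and all names are hypothetical.

      ```python
      import sqlite3
      import time
      from confluent_kafka import Producer

      conn = sqlite3.connect("service.db")
      producer = Producer({"bootstrap.servers": "localhost:9092"})

      def publish_with_retry(payload: str, topic: str = "orders") -> None:
          # Retry until the broker acknowledges. A timeout can still hide a write
          # that actually succeeded, so duplicates remain possible: at-least-once.
          while True:
              result = {}
              producer.produce(topic, value=payload,
                               on_delivery=lambda err, msg: result.update(err=err))
              producer.flush()               # block until the delivery report arrives
              if result.get("err") is None:  # broker acknowledged the record
                  return
              time.sleep(1)                  # delivery failed: back off and resend

      def run_relay() -> None:
          # On restart this resumes from the last successfully emitted record,
          # because only acknowledged rows get flagged as published.
          rows = conn.execute(
              "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY rowid"
          ).fetchall()
          for outbox_id, payload in rows:
              publish_with_retry(payload)
              conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (outbox_id,))
              conn.commit()
      ```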

  • @BlindVirtuoso
    @BlindVirtuoso 2 months ago

    Nice one. Thanks Wade.

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago

      Wade here. Thank you. Glad you enjoyed it.

  • @Fikusiklol
    @Fikusiklol 2 months ago

    Hello, Wade!
    Wanted to ask about CDC with ES option.
    From my perspective, publishing (domain/internal) events directly is wrong, as they might not contain enough data and consumers would be coupled to internal contracts (there are ways to avoid it, but just for simplicity's sake).
    Did you mean somehow reading the domain event, transforming it into an integration (ESCT/fat) event, and then publishing it using CDC?
    Because I've been doing ES and still used an Outbox table to fulfill that integration event with metadata.

    • @sarwan.surchi
      @sarwan.surchi 2 months ago

      @Fikusiklol Why not store a mature domain event where a process could pick it up without an additional Outbox table?
      Although the Outbox is great for controlling what should be emitted, it just degrades write performance a little.

    • @Fikusiklol
      @Fikusiklol 2 months ago

      @@sarwan.surchi Because a mature domain event doesn't care about metadata for integration, which can only be captured during the request, like tracing, user ID, timespan, etc.
      Also, I'm not a fan of slim/notification events for integration, as they are not sufficient.

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago +1

      ​@@Fikusiklol I tend to agree that you shouldn't publish the domain/internal events directly. You should translate them into a different format before publishing to avoid coupling.
      If your CDC tool allows it, perhaps you can do that directly in the tool. Otherwise, you could write a separate process that translates the events into an outbox table and attach CDC to that. Note that this wouldn't actually be using the Transactional Outbox Pattern. But using a Transactional Outbox is another option. A final option would be to skip CDC and write the publisher yourself.
      The one thing I will say, however, is that early in development, publishing the internal events might be fine, as long as you leave yourself room to convert them to a different event later. I.E. It may be acceptable to publish the internal event until you find yourself in a position where the internal and external events must be different. At that point, you would introduce the extra complexity to translate from internal to external events. Let your future self worry about the problem. Just make sure you leave the door open to solve it.
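
      A minimal sketch of that internal-to-external translation in Python; the event shapes, field names, and metadata are hypothetical.

      ```python
      import json
      from dataclasses import dataclass

      @dataclass
      class OrderPlaced:       # internal/domain event (hypothetical shape)
          order_id: str
          customer_id: str
          ledger_ref: str      # internal detail that should never be exposed

      def to_integration_event(e: OrderPlaced, trace_id: str, occurred_at: str) -> str:
          # Translate the domain event into the external contract before it goes into
          # the outbox / onto Kafka, so consumers never couple to internal fields and
          # the request-scoped metadata (tracing, user, time) is attached here.
          return json.dumps({
              "type": "OrderPlaced",
              "version": 1,    # leaves room to evolve the external contract later
              "orderId": e.order_id,
              "customerId": e.customer_id,
              "metadata": {"traceId": trace_id, "occurredAt": occurred_at},
          })
      ```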

  • @darwinmanalo5436
    @darwinmanalo5436 2 months ago

    So instead of manually sending events to Kafka, we save the events to the database first. Then, there is a CDC tool that detects updates and automatically sends them to Kafka?
    Another tool adds another layer of complexity. Event Sourcing is quite complex, so people should carefully consider if it's the right tool for the project before implementing it. I wish these inconsistencies/issues were already solved in Kafka itself, not by us.
    P.S. The presentation is well-explained though. Wade is a good teacher.

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago +1

      Wade here. CDC is entirely optional. If you already have a CDC system in place, it can assist you in this process. But, if you don't have CDC in place, and don't want to introduce it, you can write some code in your microservice to do this.
      Unfortunately, this isn't a problem you can solve in Kafka. That's in part because it's not a Kafka problem. It's a distributed systems problem that can exist in any system. You can encounter the exact same problem when saving to a file and sending an email.
      But the other issue is that in this specific instance, Kafka can't solve the problem because Kafka doesn't know about it. I.E. The situation outlined is what happens when Kafka doesn't get the message? How could Kafka solve the problem when it doesn't even know the message exists?
      To understand the problem in more depth, check out www.confluent.io/blog/dual-write-problem/

    • @darwinmanalo5436
      @darwinmanalo5436 2 months ago

      @@ConfluentDevXTeam I got your point. Thanks for the reply, Wade.

  • @soloboy118
    @soloboy118 2 months ago

    Hello. I'm working as a Kafka admin. Recently I have faced one issue: for a few topics (each topic has 6 partitions), one partition's end offset has become zero. How do I resolve this? It is in production, please help.

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 2 months ago

      Wade here. I'd suggest hopping over to our community channels and seeking help there: developer.confluent.io/community/