Realtime Advertisement Clicks Aggregator | System Design

  • Published on Jun 14, 2024
  • Let’s design a real-time advertisement clicks aggregator with Kafka, Flink, and Cassandra. We start with a simple design and gradually make it scalable while talking about different trade-offs.
    Note: pdfhost.io/edit?doc=8a32143c-...
    System Design Playlist: • System Design Beginner...
    🥹 If you found this helpful, follow me online here:
    ✍️ Blog / irtizahafiz
    👨‍💻 Website irtizahafiz.com
    📲 Instagram / irtiza.hafiz
    00:00 Why Track & Aggregate Clicks?
    01:07 Simple System
    02:12 Will it scale?
    04:00 Logs, Kafka & Stream Processing
    12:02 Database Bottlenecks
    17:13 Replace MySQL
    18:59 Data Model
    25:45 Data Reconciliation
    29:00 Offline Batch Process
    32:10 Future Videos
    #systemDesign #programming #softwareDevelopment

Comments • 84

  • @kevindebruyne17
    @kevindebruyne17 1 year ago +3

    Your videos are really great: no fluff, straight to the point, and they cover a lot of detail. Thank you and keep it up!

    • @irtizahafiz
      @irtizahafiz  1 year ago

      Hi! Even though you are a City fan, thank you for watching the video haha!

  • @sarthakgupta290
    @sarthakgupta290 2 months ago

    Thank you so much for this perfect explanation!!

  • @freezefrancis
    @freezefrancis 1 year ago +2

    I think you deserve a much larger audience! The quality of the content is really good. Thanks for sharing.

    • @irtizahafiz
      @irtizahafiz  1 year ago

      Thank you for watching! Please like and share to reach more people.

  • @nosh3019
    @nosh3019 9 months ago

    Thanks for the effort of making this! Very informative and a perfect companion to the System Design Interview volume 2 book.

    • @irtizahafiz
      @irtizahafiz  7 months ago

      Yes, please refer to the book for more details! It's a brilliant book!

  • @indavarapuaneesh2871
    @indavarapuaneesh2871 3 months ago +2

    What I would've done differently: have both warm and cold storage. If your data access pattern is mostly reading data from the last 90 days (pick your number), then store that data in warm storage like Vitess (sharded MySQL, or some other distributed relational DB). And run a background process periodically that vacuums the stale data from the warm tier and exports it to a cold tier like a data lake.
    This way you're optimizing both read query latency and storage cost. Best of both worlds.
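
    A minimal sketch of that vacuum/export job, assuming a MySQL-compatible warm tier with a clicks table, Parquet files on S3 as the cold tier, and a 90-day window; every name here (connection string, table, bucket) is illustrative, not from the video:

    ```python
    """Nightly vacuum: move click rows older than 90 days from the warm
    MySQL/Vitess tier into a cold S3 data lake as Parquet files."""
    from datetime import datetime, timedelta, timezone

    import boto3
    import pandas as pd
    from sqlalchemy import create_engine, text

    WARM_DB_URL = "mysql+pymysql://user:pass@vitess-host/ads"  # assumption
    COLD_BUCKET = "ad-clicks-cold"                             # assumption
    RETENTION_DAYS = 90

    def vacuum_stale_clicks() -> None:
        cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
        engine = create_engine(WARM_DB_URL)

        # 1. Export stale rows from the warm tier to a local Parquet file.
        stale = pd.read_sql(
            text("SELECT * FROM clicks WHERE clicked_at < :cutoff"),
            engine,
            params={"cutoff": cutoff},
        )
        if stale.empty:
            return
        local_path = f"/tmp/clicks_before_{cutoff:%Y%m%d}.parquet"
        stale.to_parquet(local_path, index=False)

        # 2. Upload the file to the cold tier (S3 / data lake).
        boto3.client("s3").upload_file(
            local_path, COLD_BUCKET, f"clicks/dt={cutoff:%Y-%m-%d}/part-0.parquet"
        )

        # 3. Only after a successful upload, delete the rows from the warm tier.
        with engine.begin() as conn:
            conn.execute(
                text("DELETE FROM clicks WHERE clicked_at < :cutoff"),
                {"cutoff": cutoff},
            )
    ```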

  • @ranaafifi5487
    @ranaafifi5487 1 year ago

    Very organized and neat! Thank you

  • @sergiim5601
    @sergiim5601 1 year ago

    Great explanation and examples!

  • @rajeshkishore7119
    @rajeshkishore7119 4 months ago

    Excellent

  • @pratikjain2264
    @pratikjain2264 3 months ago

    This was by far the best video. Thanks for doing it!

    • @irtizahafiz
      @irtizahafiz  months ago

      Most welcome!

  • @jkl89966
    @jkl89966 1 year ago

    awesome

  • @protyaybanerjee5051
    @protyaybanerjee5051 3 months ago +1

    Some notes about this design:
    - "Adding more topics" is a very vague statement. We have to define the data model that captures each click event and then allow data partitioning based on advertisement_id and some form of timestamp (see the sketch after these notes).
    - Not sure why replication lag is stated as an issue here. The read patterns for this design don't require reading consistent data, so this should not be an issue.
    - "Relational DBs won't do well with aggregation queries" is a little misleading. Doing aggregation queries efficiently requires storing the data in a column-major format, which unlocks efficient compression and data loading.
    - Why provision stream-processing infra to upload data to cold storage? Once a log file reaches X MB, we can place an event in Kafka with a (file_id, offset) pair. A consumer would read this and upload the data to S3. This avoids unnecessary dollar cost as well as the operational cost of maintaining stream infrastructure.
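
    On the first note, a minimal sketch of what that click-event model and partitioning could look like, assuming kafka-python and an illustrative topic name; keying the message by advertisement_id spreads the data across partitions while keeping per-ad ordering:

    ```python
    import json
    import time
    import uuid

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["localhost:9092"],
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_click(advertisement_id: str, user_id: str, country: str) -> None:
        event = {
            "click_id": str(uuid.uuid4()),           # useful for de-duplication
            "advertisement_id": advertisement_id,
            "user_id": user_id,
            "country": country,
            "clicked_at": int(time.time() * 1000),   # event time in epoch millis
        }
        # Partition by advertisement_id: all clicks for one ad land together.
        producer.send("ad-click-events", key=advertisement_id, value=event)

    publish_click("ad_42", "user_1001", "US")
    producer.flush()
    ```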

  • @weixing8985
    @weixing8985 1 year ago +2

    Thanks for the tutorials! I think you're following the topics of the book System Design Interview 2, but in a way that is a lot easier to understand. I struggled a lot with those topics in the book until I came across your tutorials!

    • @irtizahafiz
      @irtizahafiz  7 months ago

      Yes! I refer to both volumes of that book. I mention it in the video, and it's linked in most descriptions.

  • @TheZoneTrader
    @TheZoneTrader 1 year ago +12

    Each day's click storage = 0.1 KB * 3B = 0.3 TB/day, and not 3 TB/day? Correct me if I am wrong.

    • @ramin5021
      @ramin5021 4 months ago

      I think you missed the K in KB.
      BTW, he didn't calculate it correctly either. I think the calculation below is correct:
      0.1 KB = 100 Bytes
      100 Bytes * 3B = 300 TB
      Correct me if I am wrong.

    • @erickadinugraha5906
      @erickadinugraha5906 2 months ago

      @@ramin5021 you're wrong. 100 Bytes * 3B = 300 GB = 0.3 TB

    • @srawat1212
      @srawat1212 2 months ago +1

      It should be 0.3 TB or 300 GB.
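
      Working the arithmetic out in full (plain Python, numbers taken from this thread) confirms the 300 GB figure:

      ```python
      clicks_per_day = 3_000_000_000   # 3 billion clicks per day
      bytes_per_click = 100            # 0.1 KB per click event

      total_bytes = clicks_per_day * bytes_per_click  # 3e11 bytes
      print(total_bytes / 1e9)    # 300.0  -> 300 GB/day
      print(total_bytes / 1e12)   # 0.3    -> 0.3 TB/day
      ```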

  • @mohsanabbas6835
    @mohsanabbas6835 1 year ago

    Event data streaming platform. It's a more complex system, where data is processed either in real-time streams or in batches: ETL, data pipelines, etc.

  • @roopashastri9908
    @roopashastri9908 2 months ago

    Your videos are great! Very clearly articulated! I was curious why we have to use a NoSQL DB if we are storing only the aggregated data keyed by advertiser ID. What are the drawbacks of using a columnar DB like Snowflake in this case?

    • @irtizahafiz
      @irtizahafiz  months ago

      You can use whatever DB works best for your case. I think Snowflake will work just as well here.

  • @rishabhjain2404
    @rishabhjain2404 2 months ago

    We should use a count-min sketch for real-time click aggregation in the stream processor; it is going to be very fast, and you can query data at minute-level granularity. A MapReduce system can be useful for exact click information: clicks can be batched, put into HDFS, reduced into aggregates, and saved to the DB.

    • @monodeepdas2778
      @monodeepdas2778 2 months ago

      AFAIK the count-min sketch is applicable to the top-k problem.
      I know we can use faster, less accurate algorithms, but that is what stream processors can do.
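
      For reference, a minimal count-min sketch in plain Python, since both comments mention it; the width, depth, and hashing scheme are illustrative, and a real deployment would size them for the error bound it can tolerate:

      ```python
      import hashlib

      class CountMinSketch:
          """Approximate counter: never under-counts, may slightly over-count."""

          def __init__(self, width: int = 2048, depth: int = 5):
              self.width = width
              self.depth = depth
              self.table = [[0] * width for _ in range(depth)]

          def _cells(self, key: str):
              for row in range(self.depth):
                  digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
                  yield row, int(digest, 16) % self.width

          def add(self, key: str, count: int = 1) -> None:
              for row, col in self._cells(key):
                  self.table[row][col] += count

          def estimate(self, key: str) -> int:
              # Take the minimum across rows to limit hash-collision inflation.
              return min(self.table[row][col] for row, col in self._cells(key))

      sketch = CountMinSketch()
      for _ in range(1000):
          sketch.add("ad_42")
      print(sketch.estimate("ad_42"))   # >= 1000, usually exactly 1000
      ```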

  • @sumonmal009
    @sumonmal009 3 months ago

    Capturing the click with application logging is a good idea. Main crux at 6:30 and 21:30.

  • @kirillkirooha3848
    @kirillkirooha3848 1 year ago

    Thanks for the video, that's brilliant! But I didn't quite understand the problem with inaccurate data in the stream processor. Late data can arrive at the stream processor, but I suppose it has a timestamp from the Apache log, so can't we just insert that late data into Cassandra?

    • @irtizahafiz
      @irtizahafiz  1 year ago +1

      Hi!
      Yes, you can insert the late data when storing individual clicks.
      However, when you are computing aggregations such as "total clicks every hour", you will have already emitted the count at the end of the hour (with some buffer). Then, when late data arrives, you won't be able to "correct" your aggregate.
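
      A plain-Python sketch of that behaviour, assuming hourly tumbling windows and a small lateness buffer (the window size, buffer, and event shape are illustrative, not the video's exact Flink job): once a window's count has been emitted, later events for that hour can only be routed to the reconciliation path.

      ```python
      from collections import defaultdict

      WINDOW_MS = 3_600_000          # 1-hour tumbling windows
      ALLOWED_LATENESS_MS = 60_000   # wait 1 extra minute before emitting

      open_counts = defaultdict(int)   # (ad_id, window_start) -> clicks so far
      emitted_windows = set()          # window starts already emitted downstream
      late_events = []                 # would feed the offline reconciliation job

      def on_click(ad_id: str, event_time_ms: int, watermark_ms: int) -> None:
          window_start = event_time_ms - (event_time_ms % WINDOW_MS)
          if window_start in emitted_windows:
              # Count already published; we can no longer "correct" it here.
              late_events.append((ad_id, event_time_ms))
              return
          open_counts[(ad_id, window_start)] += 1

          # Close every window whose end (plus the lateness buffer) has passed.
          for ad, start in list(open_counts):
              if watermark_ms >= start + WINDOW_MS + ALLOWED_LATENESS_MS:
                  count = open_counts.pop((ad, start))
                  emitted_windows.add(start)
                  print(f"{ad} window={start}: {count} clicks")
      ```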

  • @PrateekSaini
    @PrateekSaini 2 months ago

    Thanks for such a clear and detailed explanation.
    Could you please share a couple of blogs/articles for reference where companies are using these kinds of systems?

    • @irtizahafiz
      @irtizahafiz  months ago

      You should find some in the Uber engineering blog.

  • @dhruvgarg722
    @dhruvgarg722 5 months ago

    Great video! Did you create these diagrams in Obsidian, or are these images?

    • @irtizahafiz
      @irtizahafiz  4 months ago

      I use a combination of Obsidian and Miro to create the notes.

  • @KShi-vq4mg
    @KShi-vq4mg 9 months ago

    Big fan! Hopefully a lot more people find these. Just one piece of feedback: you covered what kind of data we store. Would it also be worth going a little deeper into the data model?

    • @irtizahafiz
      @irtizahafiz  7 months ago +2

      Hi! That's really good feedback!
      It's definitely worth diving deeper, but it's difficult to do that in a high-level SD video. If that's something you would find valuable, I can definitely create videos on data models.

  • @tonyliu2858
    @tonyliu2858 1 year ago +2

    Can we use MapReduce for stream processing? Will it meet the latency requirement? Or do we have to use other stream processors such as Flink/Spark?

    • @irtizahafiz
      @irtizahafiz  7 months ago

      I think it all depends on your application.
      If you want to be the most real-time, Flink or Spark (with micro-batches) can get you that.

  • @lytung1532
    @lytung1532 1 year ago

    The tutorial is helpful. Could you give a sample of the format of the records stored in Cassandra? Do we need to store all the data from the original click in Cassandra?

    • @irtizahafiz
      @irtizahafiz  1 year ago +1

      Depends on your use case. You could have two Cassandra tables, one for individual records and one for aggregations.
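
      To make that concrete, an illustrative pair of tables created through the DataStax Python driver; the names, columns, and the per-hour granularity are assumptions, not the video's exact schema:

      ```python
      from cassandra.cluster import Cluster

      session = Cluster(["127.0.0.1"]).connect("ads")   # assumes an "ads" keyspace

      # Raw clicks, partitioned by ad and day so no partition grows unbounded.
      session.execute("""
          CREATE TABLE IF NOT EXISTS click_events (
              advertisement_id text,
              day              date,
              clicked_at       timestamp,
              click_id         uuid,
              user_id          text,
              country          text,
              PRIMARY KEY ((advertisement_id, day), clicked_at, click_id)
          ) WITH CLUSTERING ORDER BY (clicked_at DESC, click_id ASC)
      """)

      # Pre-aggregated counts written by the stream processor, one row per ad per hour.
      session.execute("""
          CREATE TABLE IF NOT EXISTS click_counts_per_hour (
              advertisement_id text,
              hour             timestamp,
              click_count      bigint,
              PRIMARY KEY (advertisement_id, hour)
          )
      """)
      ```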

  • @ax5344
    @ax5344 3 months ago

    @5:35 0.1 KB * 3B = 3 TB. Hi, how is the computation done? I thought 3B has 9 zeros; multiplying it by 0.1 gives 8 zeros. 1 TB is 1e9 KB, so I thought it would be 0.3 TB. Did I get something wrong?

  • @ankitagarwal4022
    @ankitagarwal4022 4 months ago

    Excellent video, going from a basic design to a scalable design.
    A few questions:
    1. What technology can we actually use for the log watcher?
    2. Can you please correct me about stream processors for saving data in S3? Here the stream processor can be any consumer of Kafka events, like a simple Java service that pushes data to S3, and for the stream processor that aggregates the data and saves it in Cassandra we can use Flink.

    • @irtizahafiz
      @irtizahafiz  months ago

      1. I don't remember off the top of my head, but you can even write a small cron job to read the file every few minutes. There are better tools if you Google around.
      2. Yes, that should work. Flink can also automatically checkpoint its state to S3 while doing the aggregation.
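
      On point 2, a minimal sketch of such a consumer, no stream-processing framework involved; the topic, bucket, and batch size are illustrative, and it assumes kafka-python plus boto3:

      ```python
      import json

      import boto3
      from kafka import KafkaConsumer

      consumer = KafkaConsumer(
          "ad-click-events",
          bootstrap_servers=["localhost:9092"],
          group_id="s3-archiver",
          value_deserializer=lambda v: json.loads(v.decode("utf-8")),
      )
      s3 = boto3.client("s3")

      batch, batch_no = [], 0
      for message in consumer:
          batch.append(message.value)
          if len(batch) >= 10_000:                    # flush every 10k click events
              body = "\n".join(json.dumps(event) for event in batch)
              s3.put_object(
                  Bucket="ad-clicks-raw",             # illustrative bucket name
                  Key=f"raw-clicks/batch-{batch_no:08d}.json",
                  Body=body.encode("utf-8"),
              )
              batch, batch_no = [], batch_no + 1
      ```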

  • @code-commenter-vs7lb
    @code-commenter-vs7lb 5 months ago

    Hello, when we introduce a log file, how do we ensure the aggregates are still "near real-time"? IMO, when you introduce log files in the middle, you will have append-only logs which we will probably only publish once a log file has finished appending and a new file has started. So there is a delay of some time, maybe a minute or so (depending on how big your rolling window is).

    • @irtizahafiz
      @irtizahafiz  3 months ago

      That depends on how you are reading the log file. You can add a "watcher" that tails the log file and publishes a message downstream whenever a new line is appended.
      Alternatively, as a first step, you can write to Kafka, and the Kafka consumer can both process the data and add it to the log file.

  • @parthmahajan6535
    @parthmahajan6535 months ago

    That was an awesome video; I had a similar approach and got it validated. I was wondering if you could also start a coding series on building such systems (as demonstrated in the video).

    • @irtizahafiz
      @irtizahafiz  months ago

      Thank you for watching! I plan on building similar sub-systems, but TBH, building an e2e system like this without an actual use case (and traffic) is not really worth it. I don't think it will add much value either. Thank you for the suggestion though. Appreciate it.

    • @parthmahajan6535
      @parthmahajan6535 months ago

      @@irtizahafiz It would make sense though, if someone's just starting. I was thinking we could use some dataset of clickstream logs, create a stream of the logs coming in (simulate a stream through Python), and then build the system.

  • @VishalThakur-wo1vx
    @VishalThakur-wo1vx 3 months ago

    We could also keep state in a Kafka Streams application (local or global state) and use interactive queries to fetch the result of the aggregation. Can you please share how to decide when to offload the aggregation result to an external DB vs. when to use interactive queries? I understand that durability can be one factor, but what are the others?

    • @irtizahafiz
      @irtizahafiz  3 months ago

      If your tables are on a relational DB and they are relatively large, aggregations will do poorly. Instead, either store precomputed aggregations in a different table, or compute on the fly using something like Flink.

  • @bobuputheeckal2693
    @bobuputheeckal2693 9 months ago

    Is it 300 GB or 3 TB?

  • @shadytanny
    @shadytanny 1 year ago +1

    How are we using log files? Is reading from log files the best way to source the events?

    • @irtizahafiz
      @irtizahafiz  1 year ago

      It totally depends on your system. In this example we decided to read from log files for simplicity.
      If you want, you can also run a Spark job on some logs for every website visit.

  • @chetanyaahuja1241
    @chetanyaahuja1241 months ago

    Thanks for clearly explaining the end-to-end design. Just a couple of questions:
    1) Could you explain a little bit about how the Apache log files get the click information, and how that is realtime?
    2) Also, do you have any link to these notes/diagrams? The one in the description doesn't work.

    • @irtizahafiz
      @irtizahafiz  months ago +1

      The simplest would be to write a cron job or something similar that executes every couple of minutes, reads the log file, and writes new data to Kafka.
      You can also poll using a continuously running Python program. What that would look like is a Python program running in, say, a "while" loop that reads from the file every couple of minutes and writes to Kafka.
      These are two solutions you can quickly prototype. For more comprehensive solutions, there are dedicated file-watcher daemons that you could use.
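
      A quick prototype of the second option, assuming kafka-python, an illustrative topic name, and the default Apache access-log path; note it doesn't handle log rotation, which a dedicated file-watcher daemon would:

      ```python
      import time

      from kafka import KafkaProducer

      LOG_PATH = "/var/log/apache2/access.log"   # assumption
      producer = KafkaProducer(bootstrap_servers=["localhost:9092"])

      with open(LOG_PATH, "r") as log_file:
          log_file.seek(0, 2)                    # jump to the end, like `tail -f`
          while True:
              line = log_file.readline()
              if not line:
                  time.sleep(1)                  # nothing new yet; poll again
                  continue
              producer.send("ad-click-events", line.strip().encode("utf-8"))
      ```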

    • @irtizahafiz
      @irtizahafiz  months ago

      Right, about the links. Unfortunately, they expired. Even I don't have access to most of them anymore. Sorry!

    • @chetanyaahuja1241
      @chetanyaahuja1241 months ago

      @@irtizahafiz Thank you for the explanation.

  • @vikramreddy7586
    @vikramreddy7586 months ago

    Correct me if I'm wrong: we are processing the data fetched via the batch job, so essentially we are processing the data twice to get rid of the inaccuracies?

    • @irtizahafiz
      @irtizahafiz  months ago

      It's been a while since I uploaded the video, but I believe we are only processing it once, with the stream-processing pipeline.
      However, it's pretty common to run some kind of nightly job to "correct" some of the small inaccuracies coming from the real-time aggregations.
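
      A rough sketch of such a nightly correction job, assuming the raw click events were archived to S3 as day-partitioned Parquet and that the aggregates live in a click_counts_per_hour table (both assumptions; adjust to whatever the archiver and stream processor actually write):

      ```python
      import pandas as pd
      from cassandra.cluster import Cluster

      RAW_PATH = "s3://ad-clicks-raw/dt=2024-06-13/"   # illustrative day partition

      def reconcile_day() -> None:
          # Exact counts recomputed from the immutable raw events (needs s3fs + pyarrow).
          events = pd.read_parquet(RAW_PATH)
          events["hour"] = pd.to_datetime(events["clicked_at"], unit="ms").dt.floor("h")
          exact = (
              events.groupby(["advertisement_id", "hour"])
              .size()
              .reset_index(name="clicks")
          )

          # Overwrite the possibly-inaccurate real-time aggregates.
          session = Cluster(["127.0.0.1"]).connect("ads")
          insert = session.prepare(
              "INSERT INTO click_counts_per_hour (advertisement_id, hour, click_count) "
              "VALUES (?, ?, ?)"
          )
          for row in exact.itertuples(index=False):
              session.execute(
                  insert,
                  (row.advertisement_id, row.hour.to_pydatetime(), int(row.clicks)),
              )
      ```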

  • @utkarshgupta2909
    @utkarshgupta2909 1 year ago +3

    Correct me if I am wrong, but this seems to me more like a Lambda architecture: the aggregation is fast but inaccurate, whereas S3 is slow but accurate.

    • @irtizahafiz
      @irtizahafiz  1 year ago +2

      Yup, I think you can use that term for this.

    • @freezefrancis
      @freezefrancis 1 year ago +1

      Yes. That's right.

  • @matthayden1979
    @matthayden1979 3 months ago

    Is it correct to say that Kafka is a data ingestion platform, since data gets stored in the topics and is later processed by the stream processor?

    • @irtizahafiz
      @irtizahafiz  months ago

      Yup, you can say that. But you shouldn't treat Kafka as persistent storage; treat it as a temporary buffer. If you want to store the incoming data for longer, you can dump it into S3 or a data warehouse like Redshift/Snowflake.

  • @srishtiganjoo8549
    @srishtiganjoo8549 3 months ago

    Given that Kafka is durable, why are we storing the clicks in log files? Would this not hamper system performance?
    If we need logs, maybe the Kafka consumer can log them to files.

    • @irtizahafiz
      @irtizahafiz  3 months ago

      Kafka is durable for a short time (14 to 30 days, depending on how it's configured). You also cannot easily run analytical queries while your data is in Kafka, or connect it to reporting software like Power BI and Tableau.
      Because of all that and more, most of the time you only use Kafka as a buffer rather than as permanent storage.

  • @krutikpatel906
    @krutikpatel906 3 months ago

    For the batch job, I think you need a separate stream processor; you can't mix real-time data and old data. Please share your thoughts.

    • @irtizahafiz
      @irtizahafiz  3 months ago

      You don't need a stream processor for batch jobs. You can do it offline.

  • @user-eq4oy6bk5p
    @user-eq4oy6bk5p 2 years ago

    What do we store in S3? Apache log files?

    • @irtizahafiz
      @irtizahafiz  2 years ago +1

      No. In S3 we store the individual click events. In Cassandra we store the aggregations. We are doing that under the assumption that individual clicks are rarely accessed, only aggregations are accessed regularly.

  • @shubhamdalvi6424
    @shubhamdalvi6424 8 months ago +1

    Great video! A minor error: 0.1 KB x 3B = 300 GB.

    • @irtizahafiz
      @irtizahafiz  7 months ago

      Thank you! I will start posting again soon, so please let me know what type of content interests you the most.

  • @eternalsunshine313
    @eternalsunshine313 9 months ago

    Why can’t we process the clicks a bit later than they happen, so we capture late data and avoid the batch job? This wouldn’t be 100% real-time, but do most systems need super-fast real-time processing for ad clicks?

    • @irtizahafiz
      @irtizahafiz  7 months ago

      Totally depends on your use case.
      There are real-life use cases where you need it to be real-time, in those cases the tradeoff of dropping "some" late data is acceptable.

  • @unbox2359
    @unbox2359 months ago

    Can someone help me with the Cassandra database schema design?
    Like, what tables will there be and what columns will they have?

    • @irtizahafiz
      @irtizahafiz  months ago

      It depends on the type of application you are trying to build.

    • @unbox2359
      @unbox2359 months ago

      @@irtizahafiz I'm asking about this application only.

  • @siarheikaravai
    @siarheikaravai 1 year ago

    What about cost calculations? How do we estimate how much money this system will cost to run?

    • @irtizahafiz
      @irtizahafiz  7 months ago

      That depends on your AWS, GCP, etc. platform costs. Most of these systems and DBs will be hosted on instances with different billing requirements.

  • @neelakshisoni6118
    @neelakshisoni6118 2 months ago

    I can't download your PDF notes.

    • @irtizahafiz
      @irtizahafiz  months ago

      Sorry, some of the links have expired!

  • @szyulian
    @szyulian months ago

    Watched. --

    • @irtizahafiz
      @irtizahafiz  months ago

      Thank you!