Denormalizing DB for Justin Bieber

  • Published Oct 5, 2024

Comments • 543

  • @rayforever 9 months ago +1800

    It's such a simple thing, but a franchise-level industry still won't treat it as secure or even consider it an option.

    • @carlosmspk 8 months ago +163

      Because you can easily get into scenarios where "total_likes" doesn't actually reflect the number of likes

    • @crysist13 8 months ago +33

      And you'd no longer check who liked the post, just the number of likes. Meaning the list of who liked what would need to be kept somewhere else.

    • @snakefinn 8 months ago +17

      @carlosmspk If posts couldn't be un-liked it would be a lot simpler

    • @Rhidayah 8 months ago +54

      Race Conditions be like 🗿

    • @carlosmspk 8 months ago +36

      @crysist13 The idea is not to stop adding tuples to the "likes" table; it's just that you don't count them when displaying total likes. That part wouldn't be an issue: you could still get who liked what.
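
      A minimal SQL sketch of the technique being discussed, assuming PostgreSQL and an illustrative posts/likes schema (names are hypothetical, not Instagram's actual tables):

        CREATE TABLE posts (
          id          bigserial PRIMARY KEY,
          total_likes bigint NOT NULL DEFAULT 0
        );

        CREATE TABLE likes (
          user_id bigint NOT NULL,
          post_id bigint NOT NULL REFERENCES posts (id),
          PRIMARY KEY (user_id, post_id)  -- one like per user per post
        );

        -- Record who liked what AND bump the cached counter in one transaction.
        BEGIN;
        INSERT INTO likes (user_id, post_id) VALUES (42, 7);
        UPDATE posts SET total_likes = total_likes + 1 WHERE id = 7;
        COMMIT;

      Reading the count then becomes a single-row lookup (SELECT total_likes FROM posts WHERE id = 7) instead of a COUNT over the likes table.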

  • @ianbdb7686 9 months ago +921

    For the people complaining about displaying incorrect like counts: YouTube does the same with view counts. The transactions are queued, and race conditions are not a big deal from a user's perspective.

    • @usernamesrbacknowthx 9 months ago

      Yeah I can’t believe people are arguing about accuracy lmao. Who cares. Goes to show you most programmers are GEEKS who can’t actually problem solve

    • @gabrielpedroza1346 9 months ago +10

      What do you mean, race conditions are not a big deal from a user's perspective? I'm pretty sure the reason is that they value availability for certain parts of YouTube (say, the core system and the like count) and consistency for other parts (like the comments system). When a user visits a video, the request hits a node (preferably the closest node), and that node might not have the most up-to-date information, which means it's eventual consistency.

    • @Kasukke 8 months ago

      @gabrielpedroza1346 Users are fine with eventually consistent like counts.

    • @frenchButt44 8 months ago +94

      @gabrielpedroza1346 It doesn't matter if a video gets 1001 views or 1000 views. The discrepancy is because of "eventual consistency", so it will eventually match the true count even if it is updated in other shards.
      The benefits outweigh the drawbacks so much that big companies are using this. I don't see you being an architect there.

    • @titan5064 7 months ago +1

      @frenchButt44 I don't know exactly how YouTube works, but does it go like: you press like, the client-side interface updates what should be the eventual number, and then it queues a transaction to update the DB?

  • @christophcooneyoff 8 months ago +484

    Eventual consistency is a fair trade off for performance.

    • @oggassaggaoggaffa 7 months ago +31

      But ONLY for data/rows/fields that aren't mission critical... such as "like" counts! 😁 If you're writing an airline reservation system or a stock trading program, "eventual" doesn't really cut it.

    • @christophcooneyoff 7 months ago +25

      @oggassaggaoggaffa No, this is why atomic transactions exist 😬

    • @lukeP01 6 months ago +2

      Counting the rows is probably still not consistent either, as you'd need to lock the table.

    • @Fennecbutt 5 months ago +1

      Eventual consistency is only okay for this model, one producer many consumers, where the content isn't liable to ever change much. Using it for something with state that changes a lot over time is always a terrible idea imo

    • @christophcooneyoff 5 months ago

      @Fennecbutt Yes.

  • @lefteriseleftheriades7381 8 months ago +1265

    The fact that they even counted shows they never thought that thing would scale to more than 100k users

    • @marcialabrahantes3369 8 months ago +289

      you don't optimize prematurely when building a startup

    • @MungeParty 8 months ago

      @marcialabrahantes3369 This guy gets it.

    • @seighartmercury 8 months ago

      @marcialabrahantes3369 When the optimization is literally just one line of code, you absolutely should.

    • @choilive 8 months ago +81

      Yep. There’s a good chance they might not ever have gotten 100K users. Why worry about good problems to have?

    • @flodderr 7 months ago +20

      @choilive I mean, adding total_likes seems like the obvious thing to do from the start. Insta had to have a business plan when they started out, so yes, they should have anticipated 100k.

  • @Hobson474 5 months ago +175

    This is so untrue. Relational databases are very good at counting rows and will typically not even read the table, instead using a btree index to count. What he should say is that the hash join of two tables is slow.
    This is actually bad advice: UPDATE is a very expensive operation on a key table and will generate logging data to deal with concurrency. Ideally there should be a separate post_statistics table, which can be unlogged for high-performance writes while trading off durability.
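
    A sketch of that separate-statistics-table idea, assuming PostgreSQL, where UNLOGGED skips write-ahead logging (cheaper writes, but the table is truncated after a crash); the schema is illustrative:

      CREATE UNLOGGED TABLE post_statistics (
        post_id     bigint PRIMARY KEY,
        total_likes bigint NOT NULL DEFAULT 0
      );

      -- Hot path: touch only the small statistics table, not posts.
      INSERT INTO post_statistics (post_id, total_likes)
      VALUES (7, 1)
      ON CONFLICT (post_id) DO UPDATE
        SET total_likes = post_statistics.total_likes + 1;

    Because an unlogged table can be lost on a crash, it would be rebuilt from the likes table, which remains the durable source of truth.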

    • @UpToTheWhalesNow 4 months ago +32

      yes... it's a real problem but his "solution" is in fact worse than the original

    • @csbshaw1 4 months ago +19

      You raise some valid points about the efficiency of counting rows using indexed columns and the expense of update operations. I completely agree. I was happy to see btree mentioned in an Instagram comment, lol.
      While btree indexes can indeed speed up count operations, this efficiency diminishes as the dataset grows and queries become more complex. Instagram deals with massive volumes of data and very high traffic, which means that even indexed counts can become a bottleneck.
      I personally would use in-memory caching (Redis) and a message queue like Kafka. Also probably batch processing and composite indexing.

    • @xutiTrinh 3 months ago

      His intro is so misleading.

    • @UnfrogRS 2 months ago +1

      The real fun begins when a single node is not fast enough to update the count at the rate users like the post 😅

    • @wilkesreid 1 month ago +1

      You aren't just counting total rows in the likes table; you're having to do the join, or at least limit to only likes having the matching foreign key. So either way, you're probably going to end up with a slow operation.
      Adding the total_likes property to the posts table introduces the risk that the number could fall out of sync with the real number of likes if they have code that updates one but not the other, but with the advantage that reading the likes is extremely fast. If they have tolerance in their business requirements for slightly inaccurate numbers on the likes, this is a perfectly fine solution. They could even have a cron job run something like once a day to re-count the actual like rows and update the number on the post entry.
      You could also have a trigger in the database that updates the number on the post row whenever a like is created, further reducing the risk that the numbers would be out of sync.
      His solution is perfectly reasonable.
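
      A minimal sketch of the trigger idea mentioned above, assuming PostgreSQL and the illustrative posts/likes schema from earlier:

        CREATE OR REPLACE FUNCTION sync_like_count() RETURNS trigger AS $$
        BEGIN
          IF TG_OP = 'INSERT' THEN
            UPDATE posts SET total_likes = total_likes + 1 WHERE id = NEW.post_id;
            RETURN NEW;
          ELSE  -- DELETE, i.e. an unlike
            UPDATE posts SET total_likes = total_likes - 1 WHERE id = OLD.post_id;
            RETURN OLD;
          END IF;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER likes_count_sync
        AFTER INSERT OR DELETE ON likes
        FOR EACH ROW EXECUTE FUNCTION sync_like_count();

      With this in place the counter can only drift if rows change while triggers are disabled, and the daily re-count job described above would still catch that.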

  • @wywarren 8 months ago +98

    It's a bit more complicated for Instagram, where, in order to drive engagement, it also shows which of the people you're following liked the post. So besides the count, which is global, there's an aggregate query contextual to the logged-in user's first connections to the post.

    • @WeirdDuck781 6 months ago +2

      Welp, they can select the top 3 from the likes table, which should be rather fast.

    • @MGraczyk 5 months ago +1

      The most expensive part on the read side is showing the preview comments on the post, but there are fallbacks for very high volume posts and during times of high overall usage. I worked on ranking for these surfaces

    • @ianhpete 5 months ago +1

      Just query the friends' likes from the last couple of days, and take just the 5 friends with the most engagement, or who communicate with the user the most.

    • @wywarren 5 months ago +1

      @ianhpete That seems like a good heuristic to start with, but it assumes those n friends (in this case 5) have actually engaged with said post. If they didn't, you'd have to widen your circle to a bigger subset, and then that becomes a unique set for each post.

    • @w花b 5 months ago

      @wywarren Which it is.

  • @Bcs-Mohtisham 9 months ago +115

    Reminds me of a NoSQL rule:
    "Store related data together in a single structure to optimize retrieval, a principle particularly emphasized in NoSQL databases for efficient querying."

    • @JackSack-w5h 8 months ago +3

      That's called "embedding"

    • @yuriiholenko6764 8 months ago +7

      The video starts from "counting rows on a RELATIONAL database"; why are you mentioning NoSQL rules?

    • @w花b 5 months ago

      The term NoSQL is so misleading

  • @aquiace 5 months ago +24

    It makes me feel like a less bad programmer knowing this was the solution

    • @perfect.stealth 3 months ago +2

      Same lol

    • @DK281ify 2 months ago +4

      No, you should still feel bad lol.
      This "story" is popular because the demonstration involves a celebrity.
      You cannot simply keep an incremental count; you still need to track which post to "unlike".
      To mitigate the performance hit, you can only slow down updates of the counter by scheduling them, e.g. every 2-5 seconds: one count is much faster than a queue of 1000 updates to the same record in 5 s. This video is mostly BS.
      Though a new field acting as a cached counter helps, yes. But you wouldn't actually increment it one like at a time; that would be very dumb and even slower.

  • @jackdavenport5011 8 months ago +29

    Some databases allow you to do something like that automatically, so when a like row gets inserted/removed it automatically updates the count. Or you could just run the count operation every 1-2 minutes and cache the result in the post row :)
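
    A sketch of that periodic re-count-and-cache approach, assuming PostgreSQL and run from a scheduler every minute or two (the recently_liked_posts list is a hypothetical way to bound the batch):

      -- Recompute the cached counter from the source of truth.
      UPDATE posts p
      SET total_likes = (
        SELECT count(*) FROM likes l WHERE l.post_id = p.id
      )
      WHERE p.id IN (SELECT post_id FROM recently_liked_posts);

    Counts are then stale by at most the job interval, and any drift in the counter is repaired on every run.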

  • @emagenation6409 1 month ago

    Amazing. I don't know whether it does what you said it does, but this is fixing a real-life problem that definitely cost the company money. The fact that you show how to fix it, for free, is amazing, even if it doesn't necessarily help Instagram. I'm subbing.

  • @algeriennesaffaires7017 4 months ago +4

    When I created an e-commerce website and app, at first I thought, okay, I'm just going to create a column named total_likes; that way I can show how many likes each product gets. But later, for marketing reasons, I needed to know who liked each product so we could send marketing emails, so I had to create a new table, liked_by_user_id for example. If we want to send a marketing email about some product, we just go to that table, find who liked it, and send the email. So I suggest you do both: create a separate table that stores the IDs of the users who liked the post or product, and for speed also add a total_likes column to the product post. That way you have everything, though it may make the database bigger, because now you have another big table.

  • @christopherparke9878 6 months ago +2

    I would prefer a "total likes" table with a foreign key to the post ID. Less bloat on the posts table, and it achieves the same thing. (ID, postId, Total)

  • @luca0898 4 months ago +1

    In addition, you can place a trigger in the DB so that the counter is incremented on every write operation.

  • @haxney 5 months ago +2

    It's more complicated than even that. As others have noted, you don't need (or want) to display a perfectly up-to-date like count, so it can be eventually consistent. The other important factor is despamming. A user clicking "like" and the like counter incrementing is not a single, atomic operation. If the user is identified as a spam bot, then their like should not count for the purpose of promoting the post.
    The real workflow is more like:
    1. User clicks "like." Store that.
    2. Start a batch despam pipeline to figure out whether the like is legitimate.
    3. If so, update the like count on the post.
    The despam pipeline is going to take some time. In this case, nobody but spammers actually cares about accuracy down to a single like, so lazily updating is just fine.
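
    A sketch of that lazy, despam-aware flow, assuming PostgreSQL; the is_legit flag and the known_spam_accounts table are illustrative assumptions, not Instagram's actual pipeline:

      ALTER TABLE likes ADD COLUMN is_legit boolean;  -- NULL until despam runs

      -- Step 2 (batch): the despam pipeline fills in the verdicts.
      UPDATE likes SET is_legit = NOT EXISTS (
        SELECT 1 FROM known_spam_accounts s WHERE s.user_id = likes.user_id
      )
      WHERE is_legit IS NULL;

      -- Step 3 (batch): lazily refresh the displayed count from legitimate likes only.
      UPDATE posts p
      SET total_likes = (
        SELECT count(*) FROM likes l
        WHERE l.post_id = p.id AND l.is_legit
      );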

    • @CombobreakerDe 4 months ago

      Normally you batch the "despam" process. You won't do it whenever someone clicks on the button. It's a separate process. You identify bots, check which posts they liked and update the counts accordingly.

  • @deynohmuturi1069 4 months ago

    I did this with my project, a Twitter clone web app. You're making me feel like a genius.

  • @JTCF 5 months ago +1

    It may seem like an unreliable system that might produce inaccurate results, but with proper structuring you won't be able to mess it up. And it saves a lot of resources, too.

  • @ckennedy0323 9 months ago +61

    Now tell me what happens when 1000000 people like the photo nearly simultaneously

    • @rakinar2 9 months ago +5

      Race condition, if not handled correctly

    • @NabeelFarooqui 9 months ago +12

      Simultaneous likes/unlikes will always be significantly fewer than simultaneous views.

    • @SamMeechWard 9 months ago +71

      Queue the updates, and if it takes a while, no big deal, because no one cares if the like count is off by a bit.

    • @SamMeechWard 9 months ago +19

      Low stakes though. You could avoid the race condition, or just account for it later by updating the value directly from the count.

    • @lozyodella4178 9 months ago +4

      Just serialize it; no one cares if your like isn't instantly reflected in the app.

  • @frontendtony 7 months ago +13

    I've just gotta say, that's a really high-quality camera you've got there.

  • @georgeyoung2684 9 months ago +5

    Read optimised vs write optimised

  • @chrisdiehl8452 4 months ago +1

    It depends on the system and the environment you are working with.
    If your count is slow on the database, you can change the operation from a post to a present operation.

  • @morph442 5 months ago +1

    One of the most underrated pieces of advice for large-scale systems!

  • @MarkWatsonY 1 month ago

    There's still a bottleneck on writes going to a single row, so at very high scale you might get too much traffic going to whichever primary DB instance is responsible. But this is only a problem at, like, modern Instagram or YouTube scale, so most people never have to worry about it. Still a fun problem to think about. Who knew counting numbers efficiently at scale could be so complicated.

  • @davidgillies620 20 days ago

    Adding a total_likes column doesn't break normalisation. It's an attribute of the posts table (so it's functionally dependent on the primary key posts.id), not the likes table, and it's scalar. So the sample schema shown here is in 3NF. It's not in BCNF, but no one really insists on that.

  • @ViciuSRO 7 months ago +15

    What did you use to draw the diagrams? It looks so good!

    • @boekka 7 months ago +1

      MS Paint

    • @AldMD-cm2mz 6 months ago

      Gold

  • @Andris_Briedis 8 months ago +7

    100% normalization is just theoretical fun. In a normal production DB, some intermediate results should always be stored so as not to overload the DB unnecessarily.

  • @edocms 5 months ago +5

    The real solution is to normalize it further by adding a new table to keep the count. Never de-normalize!!!

  • @fullstack_journey 8 months ago +1

    Thanks for your music, and SQL query optimisation, Freddy!

  • @ololh4xx 8 months ago +12

    Counting rows has never been slow. It's basically a core feature of almost all relational DB systems to offer ultra-fast counting. *BUT*: counting several thousand times per second (!) introduces race conditions and counting-algorithm overhead, which can become absolutely deadly for performance because of suboptimal locking or incorrect usage of mutexes, semaphores and the like. It's not the counting itself killing the performance, it's the low scalability of it. There are databases out there which offer ultra-fast counting that is also scalable. In some cases this is realized via auto-updates of internal count fields, some use dark math wizardry, and some are simply that fast by design.

    • @phill6859 7 months ago +1

      Counting rows has always been slow. Though this problem would be counting indexes.

    • @ololh4xx 7 months ago +2

      @phill6859 I recommend looking into current (and past) database technologies. Performance problems with row counting have been solved for *decades* now. You need to get with the times, expand your knowledge and try out new stuff. It isn't slow anymore, and it hasn't been for decades, except in cases where the database has been misconfigured or is being misused.

    • @FilipCordas 5 months ago

      This sounds like a materialized view problem.

    • @Xcess007 4 months ago

      Precisely. Elasticsearch uses HyperLogLog to trade off a bit of accuracy for lightning-fast counting and fixed-size memory usage.

    • @zaza-ik5ws 4 months ago

      Oh my god, your comment blew my mind. Master, teach me your ways.

  • @hotwaff 4 months ago +9

    You don't even necessarily need to put the counter in the *posts* table, either. 1,000,000 likes would mean 1,000,000 UPDATE calls on a table that is being read very often. Having another stats table, keeping the value hot in a cache and then reading from the cache if it's present will be vroom vroom.

  • @carlosalba9690 5 months ago +2

    Being able to use that as a proper workaround makes me appreciate ACID so much more

  • @dmxd 2 months ago

    Actually, that's exactly the question I was asked at a job interview: how would I structure the likes for posts?

  • @Llorx 9 months ago +85

    The answer is always "cache it".

    • @petrlaskevic1948 9 months ago

      Yeah, update the database once a minute with the net sum of likes and dislikes over that time period.

    • @chawza8402 8 months ago +6

      Yeah, but to "cache it" you're required to count the rows before storing the result in the cache.

    • @Llorx 8 months ago

      @chawza8402 No: each time you add a new row, increase the number by 1; if you delete one, decrease it (like the video explains). That's also a cache, because it's still a "copy of the real data", which is the like count.

    • @danielcook237 8 months ago

      @chawza8402 Not necessarily. I think you could count the new likes and dislikes, add the cached value to the value already stored in the database, and then reset the cached value to 0.

    • @Kasukke 8 months ago

      @chawza8402 No, just increment it. You don't need the cache to be 100% accurate, just fast.

  • @whatnowZA 2 months ago +1

    What happens when you want to see a list of users who liked the post? You've just added a new column, but in reality you still need the normalized table to capture the users, both so that the same user can't like the same picture more than once and so that you have a record of who liked which post. So basically, denormalizing the posts table was not a solution.

  • @earl.grey_ 7 months ago +2

    The DB really should handle this kind of caching itself.

    • @jc-aguilar 6 months ago

      I know that Snowflake caches it; it's really fast with count(*).

  • @mycode0 6 months ago

    The pattern is called counter caching: you cache the count of the related rows to avoid the expensive computation.

  • @abraham7966 2 months ago

    I have done this the entire f***ing time and even got scolded a few times by mindless co-workers who have never worked with a high-traffic system.

  • @nielsvanderveer 5 months ago

    The only thing to keep in mind, though, is to use transactions. But these are supported by virtually all database systems, so it should not be a big deal.

  • @ДенисНовогродский-е6ш 3 months ago

    We applied the same solution in a similar case before real load testing, but we had more than one property to calculate, and this place became the performance bottleneck of the app. It turned out that we didn't need it: just counting from the DB (read nodes) didn't affect the performance of the app (in the context of our requirements). But we implemented a "snapshots" solution, where you take the last snapshot and just add all the likes for the post created after the snapshot's timestamp.
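
    A sketch of that snapshot approach, assuming PostgreSQL, an illustrative like_count_snapshots table, and a created_at column on likes:

      CREATE TABLE like_count_snapshots (
        post_id  bigint NOT NULL,
        count    bigint NOT NULL,
        taken_at timestamptz NOT NULL,
        PRIMARY KEY (post_id, taken_at)
      );

      -- Current count = last snapshot + likes that arrived since it was taken.
      SELECT s.count + (
               SELECT count(*) FROM likes l
               WHERE l.post_id = s.post_id AND l.created_at > s.taken_at
             ) AS total_likes
      FROM like_count_snapshots s
      WHERE s.post_id = 7
      ORDER BY s.taken_at DESC
      LIMIT 1;

    The inner COUNT only scans likes newer than the snapshot, so a periodic snapshot job keeps the scanned range small.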

  • @hakami1426 1 month ago

    I just did this for my todo app lol. I didn't know it was called denormalization.

  • @thought-provoker 3 months ago

    You don't even need to denormalize.
    You could add another table, post_counts, that has post_id and a counter, which you increase on a like and decrease on a withdrawn like.
    And if you've gone completely overboard on microservices, you could even make a likes-counter service.
    Not that I'd advise going this route, but if it IS a problem in your infrastructure, that becomes an extremely easy way of scaling and stabilizing things.

    • @SamMeechWard 3 months ago

      Adding another table to track the count violates 3rd normal form.

  • @kamranbashir4842 5 months ago

    A denormalised database has a higher risk of invalid or conflicting data. A process incrementing the count could be interrupted or terminated prematurely, or an edge case can create a pattern of conflicting data. At high scale one might not observe a difference between 29877765 and 29877785, because both would be shown as 29M, but at lower scale it's the difference between 325 and 290. At lower scale developers are afraid of this, because the mistake would easily be noticed, so they choose the safer route of keeping the database normalised and denormalising later if needed.
    These "later" things pile up and take the form of big technical debt, which bites back when the app needs to be scaled up.

  • @abuzain859 5 months ago

    I am using this technique with a non-relational DB.

  • @PaulSebastianM 7 months ago +1

    That's not really denormalizing the normal forms, as it's not removing a foreign key. It's simply adding a counter that is basically an automatically computed, cached value.

  • @thombrown 6 months ago

    You would have thought they had caching in the application layer. Strange to push that into the database.

  • @gpzim981 6 months ago

    This is not how it's done, though.
    Likes on a post still keep a reference to who liked it and when. They still have that related table, but what actually happens is that they have a read-only projection of the post with the aggregated data, and that is only updated from time to time, not every time someone likes the post. The UI creates the illusion, but on the backend the projection is updated after some delay.

  • @alexties6933 5 months ago

    Most sites don't even do that; YouTube updates views periodically.

  • @steventhawe1598 6 months ago

    In Power Apps we call this a rollup column, and its role is to aggregate the items in the child normalized table.

  • @armorkinggaming1933 18 days ago

    But how did Instagram deal with thousands of people liking a post simultaneously?

  • @zoltanzorgo 8 months ago +1

    Normalization is great in theory, and as a diagram on paper. But you shouldn't go to 3NF or BCNF in the first place if you intend to use a real-world RDBMS in a real-world application... like... ever.

  • @69k_gold 4 months ago

    I like how perfect normalization led to this, when it could've been easily fixed with a NoSQL database like Mongo that supports aggregates.

  • @WheredMyPiggyGo 1 month ago

    Wow, this is relational databases 101; they should really have fired the person who put this together.

  • @whattimeisitnow124 1 month ago

    Which is why pre-aggregates are necessary. Relational databases are terrible at on-the-fly aggregations.

  • @yassinesafraoui 26 days ago

    I suppose they would also need to queue likes if the isolation level isn't set correctly; I suppose it should be repeatable read here.
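
    In practice a relative, in-place increment avoids the lost-update race even at the default read-committed level, because the row lock serializes the writers (a sketch, assuming PostgreSQL):

      -- Atomic read-modify-write in one statement: concurrent likers queue on
      -- the row lock and each adds exactly 1; no repeatable read is needed.
      UPDATE posts SET total_likes = total_likes + 1 WHERE id = 7;

      -- By contrast, reading the value with SELECT and then writing
      -- SET total_likes = <value read> + 1 can lose updates at read committed.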

  • @guptahimsocool 5 months ago

    It would not be a long-term solution. The solution would be: up to a certain number, the query runs conventionally, and only once the threshold is breached does the application write the count to the column. That covers the bulk of cases and saves CPU cycles.

  • @HoneyCombAI 7 months ago

    Great explanation, so solid.

  • @anacsgo 1 month ago

    Hypothetically, if millions of people all opened and scrolled through the likes list of a post with millions of likes, could we still crash it the same way?

  • @roodood 2 months ago

    Aka a counter cache; Rails has this built in, for example.

  • @ebregains 2 months ago

    Whaaaaat, who in their sane mind would use a whole table just to store two foreign keys and a UUID? ☠️☠️☠️☠️☠️

  • @shemmo 9 months ago +3

    Also, use a graph database for such problems; it is much faster at counting edges.

  • @gokukakarot6323 6 months ago

    They could also parse the output of EXPLAIN SELECT count(*) to get an approximation. Above a certain number, 1 million is the same as 1000002.

  • @geteducatedyoufool4563 8 months ago

    I really like your explanations and examples
    Subbed

  • @ReSpawNnL 1 month ago

    Oof. Imagine the deadlocks. Better off logging each unique like, much like a transaction ledger. Every X minutes, or when the total gets too out of hand, tally them and update the number. Don't have direct user interaction influence a database like that.

  • @jtms1200 4 months ago

    Counter caches are definitely a good thing.

  • @MGraczyk 5 months ago

    Instagram no longer uses a relational database for these sorts of things. I believe this specific case would have changed around 2017, but for sure by 2019.

  • @chief-long-john 5 months ago

    What about maintaining a queue of the requests and updating the value in the database with an async job that runs every 10-20 items OR every 5 seconds, whichever comes first? That way you can do a batch update, keep one copy of the count, and keep normalization. And periodically you could correct this count by doing another async check against the actual COUNT value from the database.
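
    A sketch of draining such a queue into one batched write, assuming PostgreSQL (the pending_likes staging table stands in for whatever queue is actually used):

      CREATE TABLE pending_likes (post_id bigint NOT NULL);  -- appended to on each like

      -- Async job, every N items or 5 seconds: fold the backlog into one
      -- UPDATE per post instead of one per like, then clear the backlog.
      WITH drained AS (
        DELETE FROM pending_likes RETURNING post_id
      )
      UPDATE posts p
      SET total_likes = p.total_likes + d.n
      FROM (SELECT post_id, count(*) AS n FROM drained GROUP BY post_id) d
      WHERE p.id = d.post_id;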

  • @Dynamiclink2 3 months ago

    I believe it is important to mention the database. That idea, for example, is not a good one with PostgreSQL and its locks.

  • @luiscarbonel1991 7 months ago

    The better option for this is to use optimistic updates on the frontend side and send events using SNS, topics, SQS... whatever. And if you want better results, process those events in batches for celebrity users like JB.

  • @aghilannathan8169 2 months ago

    Problem is, now you have two sources of truth for the post's like count.

  • @j031t8 2 months ago

    This makes me feel better about my code

  • @lawalrasheed_ 6 months ago

    I've found that generic statements about SQL optimization/performance are not very useful in practice. There are numerous things that affect the performance of a relational DB, not least of which is the specific DBMS, e.g. MySQL with InnoDB optimizes count(*) to use an index scan, which is quite performant.

  • @SametAkalan 21 days ago

    If you queue the updates, then unless you need the exact total likes, yes, it works. Otherwise it sucks: I see huge row-lock contention in that scenario.

  • @Dr_Larken 8 months ago +1

    Oh ok! I figured that since we're solving a problem, and the problem was essentially that every time a celebrity uploaded the infrastructure crashed, my problem-solving skills would kick in. Immediately arriving at: well, if we remove celebrities from the equation, problem solved!
    You could apply that solution to the problem here in the video, and in the real world! Bada bing bada BOOM, Bob's your uncle!
    Isn't coding fun!

  • @thebeastsclips 7 months ago

    I remember when I deleted my Facebook comments: I still saw the comment count even though I had deleted all of them.

    • @SMILEPANT 7 months ago

      That is because of soft deletion of comments in the DB.

  • @mahadevovnl 4 months ago

    They optimized it even more by not updating that record so often: they would cache it in Redis in batches and only make one update to the central database for every 1000 likes or so. That would be distributed across servers, so you might get 5 updates per minute saying: "A thousand more, please."
    That "thousand" number was also based on how often the like button was pressed for that post. Lots of likes? It would update every 5k likes. Slowing down? It would stop caching.

  • @savagesarethebest7251 5 months ago

    Seriously, they were worse at database design than I was as a 10-year-old in the year 2000?!

  • @aGj2fiebP3ekso7wQpnd1Lhd 4 months ago

    Tell me the person who designed the IG architecture had no idea what they were doing without telling me

  • @oakvillian5 4 months ago

    This is a simplification. Nowadays we're using big-data processing like MapReduce with probabilistic data structures for the real-time counts.

    • @jorgeconsulting 4 months ago

      Could you elaborate on “probabilistic data structures” and what’s out there to achieve that?

  • @AndrewARitz 5 months ago

    Their biggest mistake was using a relational database for this kind of application.

  • @coda-n6u 8 months ago

    This is amazing Shorts content: nice, sweet, technical, an interesting story. You learn a small fact, and it's accessible.

  • @shogrran 6 months ago

    Hence they started calling it "breaking the internet"

  • @missionpupa 1 month ago

    My question is, why wasn't the counter the solution in the first place?

  • @UpToTheWhalesNow 4 months ago

    1) Counting rows in a relational database is one of the fastest possible ways to count something.
    2) You're missing the other half of this problem, where millions of people are requesting the count (as well as increasing it themselves).
    3) A better term for fixing this problem is "materialization", not "denormalization".

  • @pranayvelisoju6144 9 months ago +29

    This is wrong on many levels.
    1. Having just a likes counter will not work; we also need to save who liked a post so that we don't increment the view/like counter on repeated liking/unliking.
    2. Writing to the post table every time someone likes is not a good approach when millions can like a post in a few minutes.
    To address these, we need to save both who liked a post and the reference to the post, and to make sure this data takes less space we use probabilistic data structures like HyperLogLog. These can estimate cardinalities of > 10^9 with an accuracy of 2% using around 1.5 KB of memory.
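
    A sketch of approximate distinct counting with HyperLogLog, assuming PostgreSQL with the postgresql-hll extension installed; the schema is illustrative:

      CREATE EXTENSION IF NOT EXISTS hll;

      CREATE TABLE post_like_hll (
        post_id bigint PRIMARY KEY,
        likers  hll NOT NULL DEFAULT hll_empty()
      );

      -- Fold each liker into the post's small fixed-size sketch; re-adding
      -- the same user is a no-op, so repeated likes don't inflate the count.
      UPDATE post_like_hll
      SET likers = hll_add(likers, hll_hash_bigint(42))
      WHERE post_id = 7;

      -- Approximate distinct-liker count (error of a few percent).
      SELECT round(hll_cardinality(likers)) AS approx_likes
      FROM post_like_hll
      WHERE post_id = 7;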

    • @d0cx 9 months ago +1

      Not to mention any sort of race condition... I'm not familiar with how modern databases handle race conditions.

    • @oxelf 9 months ago +2

      @d0cx Most of them with locks: they lock the table for writing while they do their transaction and then unlock it for other processes to write again.

    • @usernamesrbacknowthx 9 months ago +8

      You have no idea what you're talking about.
      1. "We need to also save who liked a post" - yeah, that's exactly what denormalization is doing. Avoiding a COUNT query over the likes table every time a user views a post is exactly the problem that denormalization solves. The user only needs a ROUGH ESTIMATE of how many users liked the post, and the only accurate information they need from the likes table is whether THEY liked the post or not.
      2. This is just technobabble bullshit. You're just doing denormalization with extra steps. Did you even watch the video?

    • @usernamesrbacknowthx 9 months ago +2

      @d0cx A race condition on what, exactly? Where's the race condition?

    • @usernamesrbacknowthx 9 months ago +1

      @oxelf Table locking is extremely rare; most are using row locking nowadays.

  • @gamer_theegyad339 3 months ago

    Thing is, they should have known how many users Instagram had and expanded the cache or storage... 😊 But they've done a great job 👏

  • @MrTStat 7 months ago

    I did that with the balance field in my app.
    Yes, I keep all the transactions for integrity, but I update the balance for faster processing.

  • @matt55346 5 months ago

    Yesterday I was telling a colleague that I don't really like SQL; today my feed is flooded with SQL tutorials.

  • @nrgstudios612 1 month ago

    Dude, why would a like be an entity unto itself anyway? They were probably storing a datetime on each like too.

  • @mgs_4k198 8 months ago

    Damn son,
    your videos are hella practical.

  • @czarlito_ 3 months ago

    Right, but what about concurrent access?
    Without the column there is no problem, as each backend instance inserts its row within its own connection, so the inserts can happen out of order and concurrently.
    But if all of these connections want the value of the new column, they cannot get the same value, as they would all return n+1. So does the database queue these requests?
    How does it manage millions of updates per second where each update has to see the newest version of the value?
    I think YouTube had a similar problem with counting likes; that's why they did it in some clever way, and right after an upload the count would stop at around 300 views for some time.
    I don't know whether they still do it like this, or whether the times when YouTube views got stuck at 300 are now long gone, but my heart is with you if you remember that.

  • @doxologist 8 months ago +5

    Damn, what ER software is that?

    • @guiolly 6 months ago

      I wanna know too

  • @BoominGame 5 months ago

    Yep, counting roses can be prickly....

  • @aditya234567 3 months ago

    The real question here is why Justin Bieber has so many fans.
    Everyone is talented in their own way! I'll never understand celebrity worship.

  • @rbettsx 8 months ago

    The exact number of likes is not business-critical for large numbers. A new, time-stamped entity, "estimated number of likes", does *not* denormalize the database. All it takes is a semantic confession that the cache is not necessarily up to date.

  • @conundrum2u 1 month ago

    This depends entirely on the DB server; not all work like this.

  • @nimamc 7 months ago

    This is standard practice in Firestore for example.

  • @darak2 2 months ago

    You shouldn't update that count every time someone likes, which could end up being slower than doing the real count due to locking. Instead, you live with an outdated counter and update it periodically with a job that actually does the count, but less often.

  • @andrek2478 7 months ago

    Love that example! Thank you

  • @Markus-iq4sm 5 months ago

    We were taught that at university years ago; it is just logical.

  • @mdahsanraza 5 months ago

    Why would you ever need a separate likes table?! 😅

  • @verbedr 5 months ago

    Yeah, and then 10,000 people like that post at the same time: how many updates will contend for the lock on that row? This is not a solution to this problem. I hope there is a view-optimized version of their database that is updated asynchronously based on an event queue.