Understanding Aggregate Functions Performance | The Backend Engineering Show

  • Published Sep 12, 2024

Comments • 34

  • @hnasr
    @hnasr  2 years ago +5

    Head to database.husseinnasser.com for a discount coupon to my Introduction to Database Engineering course. Link redirects to udemy with coupon applied.

    • @joshua.hintze
      @joshua.hintze 2 years ago

      Already in the class. Definitely recommend to all!

  • @YourMakingMeNervous
    @YourMakingMeNervous 2 years ago +4

    I am a data engineer and your channel has been invaluable for my learning lately

  • @SinskariBoi3366three
    @SinskariBoi3366three 2 years ago +1

    I've been a database specialist for years and I did not know these intimate details. Great, thanks. You are simply awesome.

  • @md.hussainulislamsajib7189
    @md.hussainulislamsajib7189 2 years ago +4

    I always enjoy the way you dissect things and get to the bare metal to fully understand and explain it! You're amazing!! 👏

    • @hnasr
      @hnasr  2 years ago +2

      Thank you!

  • @briankimutai515
    @briankimutai515 2 years ago +9

    Thanks a lot @Hussein for this informative video.
    I do not have a PhD but I'm going to attempt to solve the problems of clustered sequential and random writes. Opinions from the community will be appreciated :)
    1) Clustered Sequential Writes: How about we use an in-memory, queue-like data structure for fresh new writes? For every new write, we allocate some memory for the new data and atomically append it at the tail end of the queue. Since the writes are strictly ordered by the time they arrive at the database, we can use a lock-free primitive like compare-and-swap to atomically append data to the queue (see the sketch after this comment). This eliminates the lock overhead that mutexes introduce. Then, when it comes to actually adding the writes to the B-tree index structure, with this ordering there is a big potential for the tree to rebalance on every new write, which can cause a big performance overhead. Since the writes are batched up in the queue, we can add them to the tree at a configurable threshold so that we rebalance the tree much less often than the rate at which the writes arrive. Finally, at some point we need to flush our tree to disk. Since the right subtree of our B-tree is the one that mostly grows, most of the changes happen sequentially in the same region on disk. We can use fadvise when flushing our changes to disk, just to get some extra performance. The trade-off is superior sequential write performance but poor random reads, since we need to search both the B-tree (O(log N)) and the in-memory queue (O(N) for a size equal to the configurable threshold).
    2) Clustered Random Writes: @Hussein, can you elaborate on how exactly flushing the entire contents of memory causes write amplification? That is, for each record received, how many writes on disk need to take place for it to be persisted?
    For this problem, there is one extreme where every write results in a disk write, which can be very expensive. At the other extreme, we store up data in memory until we run out, then flush everything to disk and rebuild the tree afresh if needed. This is also a problem, since writes need to be paused during flushes. I can anecdotally say that in most storage engines, at least in RocksDB, writes can be batched at a configurable threshold. This threshold will lie somewhere in the middle, where the user weighs the amount of memory they have against the flush frequency they can tolerate, so that writes can be flushed to disk in a manner that won't make performance suffer.
    I welcome all opinions from all you database enthusiasts ;)
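
    A minimal sketch of the compare-and-swap append in point 1, assuming Go's sync/atomic: writers push records onto a shared list with a CAS loop instead of a mutex, and the flusher detaches the whole list atomically to apply it to the B-tree as one batch. The push goes onto the head, with a reverse at drain time (the usual lock-free shortcut for getting a FIFO batch); the record struct and the drain wiring are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type record struct {
	key  int64
	next *record
}

var head atomic.Pointer[record] // shared list head; nil when empty

// push appends a record with a compare-and-swap loop instead of a mutex.
func push(r *record) {
	for {
		old := head.Load()
		r.next = old
		if head.CompareAndSwap(old, r) {
			return // our node is now the head
		}
		// another writer won the race; retry against the new head
	}
}

// drain atomically detaches the whole list and reverses it, so the batch
// comes back in arrival (CAS-success) order, ready for one bulk B-tree insert.
func drain() []*record {
	list := head.Swap(nil)
	var batch []*record
	for r := list; r != nil; r = r.next {
		batch = append(batch, r)
	}
	for i, j := 0, len(batch)-1; i < j; i, j = i+1, j-1 {
		batch[i], batch[j] = batch[j], batch[i] // LIFO -> FIFO
	}
	return batch
}

func main() {
	var wg sync.WaitGroup
	for i := int64(0); i < 100; i++ {
		wg.Add(1)
		go func(k int64) { defer wg.Done(); push(&record{key: k}) }(i)
	}
	wg.Wait()
	fmt.Println("batched", len(drain()), "records") // batched 100 records
}
```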

  • @overflow3789
    @overflow3789 2 years ago

    The way you explain and dissect things is really amazing. I haven't found a video as detailed as this one. It's really useful for understanding such beautiful things.

  • @jomarmontuya1516
    @jomarmontuya1516 2 years ago +1

    I'm a front end developer but I'm starting to like backend because of you!

  • @squirrel1620
    @squirrel1620 2 years ago +6

    For the estimates, when guessing within 5-10 rows of error on a table with 100M rows, it would be easier to get the size of a row in bytes and do some math with the total bytes taken by the table

    • @hnasr
      @hnasr  2 years ago +6

      Good idea. One thing that might throw the estimate off as well is if many rows were deleted from the table. Databases usually don't return the space back to the OS, which causes the table to over-report its size compared to how many rows it actually has.
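
      A quick sketch of the arithmetic in this thread, in Go. The inputs are assumptions you would pull from the database (on-disk table size and an average row width from statistics), and the dead-row fraction models the deleted-but-unreclaimed space this reply mentions, which is why the raw division over-estimates:

```go
package main

import "fmt"

func main() {
	tableBytes := int64(8_192_000_000) // on-disk table size (hypothetical)
	avgRowBytes := int64(80)           // average row width (hypothetical)
	deadFraction := 0.15               // space still held by deleted rows (hypothetical)

	naive := tableBytes / avgRowBytes // over-reports after heavy deletes
	adjusted := int64(float64(naive) * (1 - deadFraction))
	fmt.Println("naive estimate:", naive, "adjusted:", adjusted)
}
```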

  • @mrvaibh0
    @mrvaibh0 2 years ago +2

    I literally made notes, such core details 🙌

  • @michaelangelovideos
    @michaelangelovideos 2 years ago +1

    Love your content, man. Can you talk about the Okta breach?

  • @pieter5466
    @pieter5466 1 year ago

    5:50 At first I interpreted "smallest" as "of least disk size in bytes", but it more likely refers to the rows' sequential ordering within the overall range.

  • @semperfiArs
    @semperfiArs 2 years ago

    I would really love to know where you get the ideas for these videos and how much time you put into reading and researching them, because this is really awesome and it sounds like you put in a lot of time. Thanks a lot for the awesome content you post here.

  • @vendetta3953
    @vendetta3953 2 years ago

    always a pleasure to watch your informative videos :)

  • @young-ceo
    @young-ceo 2 years ago

    Thank you for the video! Amazing advice.

  • @tafarakasparovmhangami2356
    @tafarakasparovmhangami2356 2 years ago +1

    Quite Insightful😎

  • @subhamprasad1373
    @subhamprasad1373 2 years ago

    Thanks for your effort, I really learned a lot.

  • @henrikschmidt8178
    @henrikschmidt8178 2 years ago +1

    Did you make a video about varchar and how it is stored in different systems? I have always wondered how a varchar(255) is handled on disk and how changes are handled. Is all data saved at maximum size so records have a fixed total size, are records stored at variable length, or are all varchars stored in a separate space altogether?

    • @hnasr
      @hnasr  2 years ago +2

      I believe varchar works similarly to how Postgres TOASTs text. The actual text is stored in another table and the row has a pointer to it. This keeps the row from overflowing the page, since most databases have a fixed page size. I talked about it here: th-cam.com/video/_UUFMAZswhU/w-d-xo.html
      And also in my course.
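
      A small companion sketch, assuming Postgres: tables with TOAST-able columns get a hidden companion relation for the out-of-line values, and the catalog lets you find it. The table name "t" is hypothetical; the query itself can be run directly in psql.

```go
package main

import "fmt"

func main() {
	// The main row stores a pointer into this TOAST relation instead of the
	// full text, which keeps the row within the fixed-size page.
	fmt.Println(`SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 't';`)
}
```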

  • @HarshKapadia
    @HarshKapadia 2 years ago

    Loved this video!

  • @sarathkumar3749
    @sarathkumar3749 2 years ago

    Thanks

  • @marslogics
    @marslogics 2 years ago +1

    Thank you...

  • @nguyentanphat7754
    @nguyentanphat7754 2 years ago

    Thank you for your awesome discussion in this video, @Hussein!
    However, I have a question about the part where you talk about the cons of UUID random writes exhausting the RAM: what is the solution for this specific case? Should we just use a clustered index with the sequential writes that you mentioned, or is there a particular solution for this? 🙏🙏

    • @hnasr
      @hnasr  2 years ago

      I think we should just understand that this might happen. Sometimes there is no escape and you need a UUID to be your primary key, so in that case you would just increase the buffer pool size. Otherwise it's more optimal to use a sequential, lightweight primary key to avoid this.
      Using a non-clustered table and trying your workload against it is also another idea.
      Percona wrote a nice blog about this:
      www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
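
      A sketch of the two options in this reply, as Postgres-flavored DDL printed from Go (table names are hypothetical): the identity column produces sequential, buffer-pool-friendly writes, while gen_random_uuid() scatters inserts across the whole index.

```go
package main

import "fmt"

func main() {
	fmt.Print(`
CREATE TABLE events_seq (
  id      BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- sequential writes
  payload TEXT
);

CREATE TABLE events_uuid (
  id      UUID DEFAULT gen_random_uuid() PRIMARY KEY,      -- random writes
  payload TEXT
);
`)
}
```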

  • @shiewhun1772
    @shiewhun1772 2 years ago +1

    Also, Hussein, I know it will be time-consuming and will probably not be of much interest to you, but please do more videos on web3, Ethereum, and the underlying tech. I've seen the videos you did on IPFS and Bitcoin mining, but it'd be nice if you did more.
    Also, how is it possible that in ML they're able to process so much data in their pandas tables, when doing the same sort of manipulations on similar data in a regular database would be slower?

  • @dhawaljoshi
    @dhawaljoshi 2 years ago

    Would it be the best of both? Like write to a buffer, and then one process continuously writes to the actual disk to prevent the buffer checkpoint situation?
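
    A minimal sketch of this "best of both" idea, in Go: writes land in a bounded in-memory channel and a single background goroutine drains it to disk continuously, so no single giant checkpoint ever has to pause writers. flushToDisk and the tick interval are hypothetical stand-ins for a real storage engine's write path.

```go
package main

import (
	"fmt"
	"time"
)

// flushToDisk is a hypothetical stand-in for the storage engine's write path.
func flushToDisk(batch []string) {
	fmt.Println("flushed", len(batch), "records")
}

func main() {
	writes := make(chan string, 1024) // bounded buffer: backpressure instead of OOM

	go func() { // a single background flusher drains continuously
		ticker := time.NewTicker(50 * time.Millisecond)
		defer ticker.Stop()
		var batch []string
		for {
			select {
			case w := <-writes:
				batch = append(batch, w) // absorb the write in memory
			case <-ticker.C:
				if len(batch) > 0 {
					flushToDisk(batch) // small, frequent flushes...
					batch = nil        // ...instead of one giant checkpoint
				}
			}
		}
	}()

	for i := 0; i < 10; i++ {
		writes <- fmt.Sprintf("row-%d", i)
	}
	time.Sleep(200 * time.Millisecond) // demo only: let the flusher run
}
```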

  • @letsmusic3341
    @letsmusic3341 2 years ago

    Will SELECT COUNT(A) FROM T; use an index, or will it go for a table scan?
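
    One way to check this on a specific table, assuming Postgres: EXPLAIN shows whether the planner picks an index-only scan (possible when column A is indexed, since COUNT(A) only needs to know which values of A are non-NULL) or falls back to a sequential table scan. T and A are the names from the comment.

```go
package main

import "fmt"

func main() {
	// Run the printed statement in psql to see the chosen plan.
	fmt.Println(`EXPLAIN SELECT COUNT(A) FROM T;`)
}
```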

  • @mohamadabdelrida2866
    @mohamadabdelrida2866 2 years ago

    If count is slow then how does auto-increment impact performance?

  • @officialpowergamerz
    @officialpowergamerz 2 years ago +1

    Hi ☺️ Can you please teach us how Facebook, Twitter, and WhatsApp hide their API secrets in their Android or Apple apps from reverse engineering?

  • @causality5698
    @causality5698 2 years ago

    3:00 Atomic Integers?

  • @mudassirhasan9022
    @mudassirhasan9022 2 years ago

    Thanks