Avoid premature Database Sharding

แชร์
ฝัง
  • เผยแพร่เมื่อ 4 มิ.ย. 2024
  • Someone asked a question on Twitter and I thought it's interesting to answer it here in the show.
    I have a 2 million row table used in my CRUD python app, I’m worried that as the table grows my inserts will slow down, should I consider sharding my database or partition the table? thank you
    * inserts are fast, queries are slow 0:00
    * inserts can be slow 3:00
    * indexes/stored procedures
    * selects, updates and deletes can be slow 12:00
    * add proper indexes.
    * simplicity wins, premature optimization is bad 15:20
    * crazy things that people say as microservices day 1 scares me
    🎙️Listen to the Backend Engineering Podcast
    husseinnasser.com/podcast
    🏭 Backend Engineering Videos
    backend.husseinnasser.com
    💾 Database Engineering Videos
    • Database Engineering
    🏰 Load Balancing and Proxies Videos
    • Proxies
    🏛️ Software Archtiecture Videos
    • Software Architecture
    📩 Messaging Systems
    • Message Queues & PubSu...
    Become a Member
    / @hnasr
    Support me on PayPal
    bit.ly/33ENps4
    Stay Awesome,
    Hussein
  • วิทยาศาสตร์และเทคโนโลยี

ความคิดเห็น • 60

  • @hnasr
    @hnasr  3 ปีที่แล้ว +3

    Learn the fundamentals of database systems to understand and build performant backend apps
    Check out my udemy Introduction to Database Engineering
    husseinnasser.com/courses

  • @eslama-elwafa2
    @eslama-elwafa2 3 ปีที่แล้ว +20

    summary:
    "only solve problems that you have, don't solve the problem you don't have"

  • @sathyajithps013
    @sathyajithps013 3 ปีที่แล้ว +9

    Thanks for doing this video Hussein. As a new backend dev, I've always wondered about this and no one has answered it in detail as you have.

  • @mihailnesterov
    @mihailnesterov 3 ปีที่แล้ว

    Hussein, thanks for your videos!

  • @hnasr
    @hnasr  3 ปีที่แล้ว +20

    🍑 Hussein 🍑 Hussein 🍑 Hussein 🍑 Hussein

    • @hardikp5192
      @hardikp5192 3 ปีที่แล้ว

      My answer to "what's going on guys" is, It's my first day on the job, I got an Internship in a startup company, BIG THANKS to you sir, because of you I became passionate about Backend Engineering, so, I'm doing great and I'm grateful. Bless you and your family

  • @patacon_king
    @patacon_king 3 ปีที่แล้ว

    Great video, also, not related to the video but still wanted to ask it.
    Do you have some sort of list with recommended books?

  • @isaacfrost9798
    @isaacfrost9798 3 ปีที่แล้ว

    I think using hash Partitioning is a good idea if your table has the potential of adding up more rows on the client growth for example a URL shorter table.

  • @hazemabdelalim5189
    @hazemabdelalim5189 3 ปีที่แล้ว +2

    Hello Hussein , I am working with Couchbase Database which can update the indexes asynchronously , so even if you have indexes insertion will be still O(1) , however it will take time for the index tree to be re-built but it will be re-balanced eventually .

    • @hnasr
      @hnasr  3 ปีที่แล้ว

      Interesting Hazem! Thanks for sharing

  • @romantsyupryk3009
    @romantsyupryk3009 3 ปีที่แล้ว +1

    Thanks very much for this video.

  • @LearnToCodeAcademy
    @LearnToCodeAcademy 3 ปีที่แล้ว

    Just purchased the course... I love your contents honestly... They are beautiful

    • @hnasr
      @hnasr  3 ปีที่แล้ว +1

      Thank you so much ❤️

  • @AleksandarT10
    @AleksandarT10 3 ปีที่แล้ว +2

    You got a good point Hussein. Partitioning/Sharding does not make sense for most of the use cases. Where it makes sense as you said is when you have Billions of Rows and Terabytes of data (which only Big Enterprises do). I work for a Big ECommerce Company Headquartered in Silicon Valley and before migrating to Cloud, we were using Oracle DB with Sharding. We were using several physical servers where we had the UserId going through a "Hashing" Function which gave us the number of the shard. Based on our query "patterns" analysis, we decided that we can go to "DynamoDB" (NoSQL) and we moved the data there in a single table. DynamoDB Internally uses partitioning/sharding itself, but this is not visible to the end user, which means that the complexity of that gets offloaded to AWS.

    • @hnasr
      @hnasr  3 ปีที่แล้ว +1

      Thanks for sharing ! Great use case

  • @cse2335
    @cse2335 3 ปีที่แล้ว +3

    Hey Nasser,
    Which one we should prefer ORM OR Non-ORM(Query) can you light on performances and complexities!
    Thank you🙏

    • @hnasr
      @hnasr  3 ปีที่แล้ว +8

      Spoke about that before, I prefer raw SQL where I have full control, ORM can introduce extra layer of complexity and invisibility to what really is happening..

    • @cse2335
      @cse2335 3 ปีที่แล้ว

      @@hnasr Okay Thanks!

  • @GaurangDhorda
    @GaurangDhorda 3 ปีที่แล้ว

    what about mongoDB sharding ? They provide sharding default in mongoDB atlas ?

  • @hnasr
    @hnasr  3 ปีที่แล้ว

    If you enjoyed this video consider checking out my Introduction to Database Engineering course. Its a bestseller on udemy with over 14 hours worth of content and students love it !
    husseinnasser.com/courses

  • @ThinkSoftware
    @ThinkSoftware 3 ปีที่แล้ว

    Amazing video :)

  • @sherbineyb7166
    @sherbineyb7166 3 ปีที่แล้ว

    hussein , is it possible to update indexes asynchronously or it is set by default synchronous and cant be changed ?
    i understand the drawbacks of this , but is this even possible ?

    • @hnasr
      @hnasr  3 ปีที่แล้ว +1

      Technically it’s possible of course but I am not sure if any database engine exposes such feature. Its even more efficient because you can batch those index updated and do them at once. As long as the engineer understands the consequences
      its a complex thing to implement though

  • @nishantkamboj5973
    @nishantkamboj5973 2 ปีที่แล้ว +1

    Hey Hussein. I have a question. If we have a clustered index for email column or any other non auto incremented column in MySQL then the write operation will need to find the correct place to insert the row. Is it correct? Is so then I think that's another reason to use auto incremented primary keys for the tables.

    • @hnasr
      @hnasr  2 ปีที่แล้ว +1

      That is correct, the moment you cluster on a "random" field like email or UUID, you risk slow down inserts because the db need to find where to insert this new row so it fits the clustered page. You also risk something called page splits which drastically decrease performance.
      unclustered tables don't have this problem, everything goes in the end unsorted.

    • @javaprogramming7235
      @javaprogramming7235 4 หลายเดือนก่อน

      I want to ask following this question, does it mean to create PK based on some information from the record would help in performance of querying
      For example Inserting a new subscription record in database we can create pk something like this SUB_02022024_HASH does this will help in querying at the cost time the server will take to create this PK?

  • @saiavinashduddupudi8975
    @saiavinashduddupudi8975 3 ปีที่แล้ว

    @Hussein at 00:58, what do you mean by a VANILLA INSERT?

    • @hnasr
      @hnasr  3 ปีที่แล้ว +1

      Inserting a single row where all values are in the statement is simple doesn’t have performance impact (vanilla insert)
      However you can do an insert that queries the table (insert into select) that has huge impact on the size of the table

    • @saiavinashduddupudi8975
      @saiavinashduddupudi8975 3 ปีที่แล้ว

      @@hnasr got it...it means a simple insert at end of table

  • @javisartdesign
    @javisartdesign 3 ปีที่แล้ว

    Thanks for the video. The first statement is not totally right, inserts with lot of constraints and foreign keys with hundred or millions can be also slow.

  • @mehtabalamkhan2275
    @mehtabalamkhan2275 3 ปีที่แล้ว

    Hussein bro please make a video on ,everything about live video streaming app. how can we implement it and what would be the best approach and which data base would be the best. for both live video streaming and normal video streaming like twitch and netflix . please explain about the complete backend architecture with example thank you. i am pretty sure it would help a lot of people. theres not much insightful video out there about this topic on youtube

    • @hnasr
      @hnasr  3 ปีที่แล้ว +3

      This should help
      th-cam.com/video/1-KmLc0c2sk/w-d-xo.html
      th-cam.com/video/px0i9ihcjuM/w-d-xo.html

    • @mehtabalamkhan2275
      @mehtabalamkhan2275 3 ปีที่แล้ว +1

      @@hnasr omg how did i not see it 😳 tysmmm ☺☺ absolutely love your content

  • @colinmaharaj
    @colinmaharaj 3 ปีที่แล้ว

    I want to write my own database engine. Only update the index for crying out loud, if and only iff you do a select query. Or you can have a sub index as a temporary thing then you merge after a while

    • @hnasr
      @hnasr  3 ปีที่แล้ว +1

      I see where your going but I don’t know if that will work. Your select will be slower and more importantly it will have to acquire row exclusive locks on the parts of the index that needs to get updated. This will slow down things even more because of waiting events.

    • @colinmaharaj
      @colinmaharaj 3 ปีที่แล้ว

      @@hnasr Hmm, I hear you. Ill figure it out one day.

  • @Flankymanga
    @Flankymanga 3 ปีที่แล้ว

    16:03 YAGNI priniple and KISS... implement the advanced stuff only when you know you are going to need it....

    • @hnasr
      @hnasr  3 ปีที่แล้ว +1

      Yes! I talked about YAGNI here th-cam.com/video/zHSbMe15c2Q/w-d-xo.html

  • @FISS007
    @FISS007 3 ปีที่แล้ว +3

    TH-cam, just give me the possibility to give more than one like !!! This guy deserves tons of likes :D

  • @dibyaranjanmishra4272
    @dibyaranjanmishra4272 3 ปีที่แล้ว +2

    Can you make a video on Rocks DB?

    • @hnasr
      @hnasr  3 ปีที่แล้ว

      Good idea

    • @bisakhmondal8371
      @bisakhmondal8371 3 ปีที่แล้ว

      If possible, HyPer as well in future :)

  • @Nexjsdeveloper
    @Nexjsdeveloper 3 ปีที่แล้ว

    Its amaizing

  • @tikz.-3738
    @tikz.-3738 3 ปีที่แล้ว

    Hey Hussein I have a query. Say I'm building a app. There are many topics and those topics contain different content . The user subscribe to a topic now I only want to fetch the stuff that user doesn't have means newly added content how can that be done? The way I think it is to compare array of content ids(on user and inside topic table) by length for each topic and then fetch the remaining content but this sounds very bad due to multiple query per every user. Im using mongoDB which does support look ups to different tables should that be needed

    • @hnasr
      @hnasr  3 ปีที่แล้ว

      I would suggest using a queue or a pub/sub system such as RabbitMQ or Kafka. This feature is there by default

    • @tikz.-3738
      @tikz.-3738 3 ปีที่แล้ว

      @@hnasr so basically I'm new to this stuff so I was thinking this whole method itself is slow but as u say I need to use this only right I tried for days but I couldn't find a better solution, even tried aggregation pipelines but all require me to do multiple queries and finding diffrence of arrays

    • @tikz.-3738
      @tikz.-3738 3 ปีที่แล้ว

      @@hnasr I have seen ur videos on all those systems but as u say I wanted to keep it as simple as I can so I thought to myself the method to query itself is soo bad I must try redesigning but I can't get a better idea on model I thought of doing some courses on data modelling 😂

    • @tikz.-3738
      @tikz.-3738 3 ปีที่แล้ว

      Also one more follow up question the apps supposed to inform users about new contents as notification websockets would be expensive for my app(I think) so I thought about short polling every hr(set of users come and ask in 1hr and then another set comes) or something by doing some distribution of users with latitude and time zone on client (no idea how to do it) just a thought yet
      Thanks for ur replies and time ❤️

    • @willi1978
      @willi1978 3 ปีที่แล้ว

      @@tikz.-3738 queuing system (kafka, rabbitmq) or periodic polling by clients are possible. I guess if you already have an existing app and want to add near time functionality then adding polling is easier. but this will mean you have periodic requests for each online user. you have to weigh the pros and cons of the two approaches.

  • @teunohooijer6788
    @teunohooijer6788 3 ปีที่แล้ว

    is this the same lsm tree as the one on Wikipedia? that one was invented in 1991 an published in a paper in 1996. seems like it can't be a google invention then. love your videos.

    • @hnasr
      @hnasr  3 ปีที่แล้ว

      Correct the original paper is 1996, but google pushed the tech with Leveldb and made it practical

    • @teunohooijer6788
      @teunohooijer6788 3 ปีที่แล้ว

      @@hnasr that explains why you say google, a database in use is of course practical. thanks for the reply.

  • @0xfaizan
    @0xfaizan 3 ปีที่แล้ว

    Plz make short videos of about 5 to 8 minutes

    • @mugiwaranoadi5932
      @mugiwaranoadi5932 3 ปีที่แล้ว

      Why should he? What are the benefits for him and his viewers? Give him context, otherwise it only seems like a selfish wish.

  • @AnasHmamouch
    @AnasHmamouch 3 ปีที่แล้ว

    سلام عليكم حسين اخوك أنس من المغرب أنا كمان مبرمج و أتمنى تكون سعيد تشوف كومنت بالعربي حياك الله

    • @hnasr
      @hnasr  3 ปีที่แล้ว

      حبيبي تاج راسي اهل المغرب 🇲🇦 تحياتي وشكرا جزيلا