How Shopify’s engineering improved database writes by 50% with ULID

Hussein Nasser

มุมมอง 110 776

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 21 พ.ย. 2024

ความคิดเห็น • 108

@hnasr ปีที่แล้ว ⁺¹⁰
Get my fundamentals of database engineering udemy course database.husseinnasser.com
@kapssul หลายเดือนก่อน
20:56 ULID composed of 2 parts, first part is ordred but the second part is still random, so it will lead to non-sequential writes.. and this will cause page splits and rearrangement in the database, impacting performance.. am i missing something
@Winnetou17 ปีที่แล้ว ⁺⁷¹
Hussein: "A GET request by definition must be idempotent [...] Nothing chages on the backend, nothing should change"
View counters: 😨😰
@torocatala ปีที่แล้ว ⁺¹⁸
Logs: 😨😰
@jerynmathew ปีที่แล้ว ⁺³
One could always use POST for fetching counters and logs. Like how elasticsearch exposes /_search API
@abdelrahmandwedar 8 หลายเดือนก่อน ⁺¹
Well... Yeah that didn't change the response of GET request for most cases... 😕
There's cases like YT watch counter but mostly it doesn't increase on your GET request itself.
@gwapolaub ปีที่แล้ว ⁺⁴¹
Pure database architectural gold. Thank you for this analysis.
@pqnet84 ปีที่แล้ว ⁺¹⁶
24:30 you can tradeoff cache memory for mutex contention by adding some random bits at the START of the id so that for a certain timestamp you have multiple pages
@johannsebastianbach3411 ปีที่แล้ว ⁺²
Could you elaborate on that a little further for a simpleton like me? 😂
@pqnet84 ปีที่แล้ว ⁺¹⁵
@@johannsebastianbach3411 by having random bits at the start of the key you are distributing the objects on various parts of the btree. Example, if your first bit is random half of the keys will start with 0 then a timestamp, other half will start with 1 then a timestamp. Because of the timestamp all the 0s are going to be grouped together and all the 1s are going to be grouped together but they will be in different part of the btree. Like having the same street name, but different cities. So if you have a cache big enough for 2 pages you can keep both in memory and both will be fast (you are still limited to 2 pages), but you won't always access the same page (they are randomly load balanced to the two pages) so you reduce the contemption on the page mutex
@johannsebastianbach3411 ปีที่แล้ว ⁺⁵
@@pqnet84 omg, that makes perfect sense! Thanks. We use something similar at work for foldering strategy, like a/b/abab-abab-ababa-restOfUUID, where we get the first two folder names from the first two chars of the uuid, never would have thought about doing that for indexing though!
@justinsullard1519 ปีที่แล้ว ⁺⁵
I've been doing something like this for almost a decade now. The first 12 hex digits a time stamp, the next 4 he'd digits used as either a shard ID or a sequence, and then the rest random. Useful for so many reasons.
@JAlejandroMG 9 หลายเดือนก่อน
Hey @justinsullard1519, from somebody who has not worked a lot in DB, how secure is this approach? Have you had any security risk issue using this? I've been looking for info about this aspect, but haven't found anything at all.
@dhillaz ปีที่แล้ว ⁺¹⁴
Having the timestamp embedded in the UID also means you can use the sorted index for deleting/archiving old rows when doing housekeeping, without the overhead of maintaining an additional index.
@nemtudom5074 ปีที่แล้ว ⁺⁵
This is the kinda optimizing we used to see in the 90's when you *HAD* to optimize your code because you couldnt just buy more performance, not like this anyway.
Not to mention that even if you did, you didnt get much. Things perform so well now that you can just let your code be unoptimized and people wont care.
Im glad they are putting in the effort.
@lfsr1288 ปีที่แล้ว ⁺⁷
It's a great approach. The only case when you ended up inserting "old time request" is with the retry logic retrying some requests since they have a timeout of 24hs or less. But again, this is a great approach to improve the insert
@amitmahadik4653 9 หลายเดือนก่อน ⁺¹
For race condition, we can think of synchronization (allow one thread to write), but it will generally slow down insertion
@abdelrahmandwedar 8 หลายเดือนก่อน
That's exactly what I thought about. How would it increase the speed of insert by 50 percent while having everything synchronized?
@guss77 ปีที่แล้ว ⁺⁶
Correction about UUID randomness - please note that only UUID version 4 are random. There are other UUID versions, notably version 1 - which is time based and has lexicographic sorting, just like ULID, but it is standard and have a lot of supported implementations. The main problem with UUIDv1 is that the bit order is a bit weird and if you look at it as an opaque bit set, it looks quite random as the least significant bits of the time (time low in UUID spec) are in front. This was done purposefully.
MysQL version 8 improved support for using UUIDv1 as primary keys, with good index locality, by adding specific functions to reorder the time fields in a UUID so that they are ordered from most significant to less significant - similar to cardinals, and that convert the UUID string representation to a compact binary storage that is almost as efficient as using an integer.
@gsilva877 ปีที่แล้ว
Thanks for the coment, is useful ❤
@javedutube10 ปีที่แล้ว ⁺²
Just bought 'Fundamentals of Networking for Effective Backend Design' from UDEMY, even though I could learn it from different channels that also suggested by you : ), including yours, of course. That's your magic. Thanks. Love you.
@ankgaur ปีที่แล้ว ⁺¹⁰
Can anyone name youtube channels like this
with this kind of detailed backend videos
@prathameshbhat9816 ปีที่แล้ว
Please tag me as well if someone is replying
@tesla1772 ปีที่แล้ว ⁺⁴
You can check arpit bhayani and bytebytego
@ankgaur ปีที่แล้ว ⁺¹
@@prathameshbhat9816 FYI
@ankgaur ปีที่แล้ว
@@tesla1772 thanks
@siddharthsingh7281 ปีที่แล้ว ⁺¹
Theo brown is the best
@abhaygoswami7021 ปีที่แล้ว ⁺²³
One can also use UUIDv7 which also use timestamp and is sortable in nature and 62 bits randomness as compare to 80 bits in ULID
@DrBatroni ปีที่แล้ว ⁺¹
I also wonder why?
They come from UUID v4, so in both cases the new timespamt would be new.
So i wonder whats the big diffrence between UUID v7 and ULID?
@Tordek ปีที่แล้ว ⁺²
ULIDs are (slightly) shorter, the encoding is trivial (and easier to split: you split the first 8 characters for the timestamp without fiddling with dashes or variable length fields). Standards only matter if you need to interact with someone else, and you shouldn't depend on other people parsing your IDs. There doesn't seem to be any downsides to ULID over UUID.
@CTimmerman ปีที่แล้ว ⁺⁴
TL;DW: UUID tends to be too long and random. ULID is 10 byes shorter and starts with a sortable timestamp.
@EngineerNick ปีที่แล้ว
Thankyou for this :) I love the idea that data can be 'tucked in' to a db 🧸
@kenbee85 ปีที่แล้ว
I just stumble upon your video as suggested by YT. Though I merely focusing on Frontend I really got curious about this.
@Stig12 ปีที่แล้ว ⁺¹
This is either the same or very similar to a Comb Guid. Not sure if you mentioned it, but you'll want to modify the byte order before storing in MS SQL Server to get the same sequential benefit.
@MaxPrehl 5 หลายเดือนก่อน
This is a great breakdown! Wish you also had some visual aids!
@sanzharsuleimenov6380 ปีที่แล้ว
Certainly! Here's a simpler version of your text:
Thanks!
I found your video about Discord switching from Cassandra, and I really like how you share your thoughts.
@abcdef-fo1tf ปีที่แล้ว ⁺²
I feel like in url shortener, the ULID would still help reads as you're more likely to read newer URLs which might be in the buffer pool
@Aleks-fp1kq 11 หลายเดือนก่อน ⁺¹
I didn't understand the part about 24hrs indempotency. How is it related to ulid?
@nathansgreen ปีที่แล้ว ⁺¹
Use UUIDv7, as defined in the drafted revisions to RFC 4122. Yes, it's slightly different, but it enables compatibility with existing UUID code, which already exists in a lot of places.
@Milchmann2 ปีที่แล้ว ⁺¹⁸
Using a "timestamp" as key will then also redirect all writes to the same node of the distributed database and thus creating a hotstpot. I wish the article and you talked a bit more why this is not an issue for them.
@fishoil5310 ปีที่แล้ว ⁺⁶
The database is probably sharded using user id. The write is still effectively distributed across different nodes.
@colt4by5 ปีที่แล้ว ⁺⁵
Edit: I forgot about the 80 bits of random data! I think partitioners will hash the partition key, so any randomness will cause the data to be spread across nodes and avoid hotspots.
I was also going to make this point... good insight! Using a timestamp as the partition key in a distributed database can indeed create a hotspot and degrade performance. I think it's not an issue for them because they are using MySQL, which is not a true distributed database, so that caveat doesn't apply. I guess the potential issue for them is they're limited to the write performance of a SQL database, which can't scale up as high as a distributed database can.
@miladin4023 ปีที่แล้ว
Can someone elaborate more on this? Thanks :)
@colt4by5 ปีที่แล้ว ⁺⁵
@@miladin4023 I think the main point is that distributed databases use the partition key to "route" data to a physical node, so if you use just a timestamp, then all data will be routed to the same partition (and therefore node) for each time slice, which can create a hotspot and remove much of the value of using a distributed database. A 48-bit timestamp should allow for millisecond granularity, so maybe it's not a huge issue, but probably better to avoid it. The 80 bits of random data in the ULID should prevent the issue altogether though. You can find a quick explanation if you search for "DynamoDB designing partition keys to distribute your workload". On the relational database side, Hussein already did a much better job of explaining how data is routed than I could :).
@junaid1555 ปีที่แล้ว ⁺²
@@colt4by5 if the partition key is hashed to avoid hotspots, won't the range queries be very expensive?
@figloalds ปีที่แล้ว ⁺¹
The RID mechanism I use on my program is similar to that, but I use 32 byte array containing a machine specific signature, a timestamp, a random number and a in-memory sequential number, it is designed to absolutely never conflict
@ashishbtech ปีที่แล้ว
That explains well why SQL-type DBs do not perform well with the long tail of data.
@tesla1772 ปีที่แล้ว ⁺⁴²
So can we say that mongodb ids are more efficient compared to ulids as it is 96 bits and are also sequential
@tesla1772 ปีที่แล้ว ⁺³
@@JamesSmith-cm7sg we can genrate same ids locally
@btom1990 ปีที่แล้ว
@@JamesSmith-cm7sg but MongoDB is webscale th-cam.com/video/b2F-DItXtZs/w-d-xo.html
@atmrar ปีที่แล้ว ⁺¹²
@@tesla1772 local generation of ulid is error prone. If system clocks are not synchronised the local id can go back in the time .
@tesla1772 ปีที่แล้ว ⁺¹
@@atmrar yeah. But i think machine id and process ids are aslo included in it right?
@MrOlivm ปีที่แล้ว ⁺¹
Sequential importantly implies that mongo’s “web scale” ain’t so web scale. I don’t know why you would use mongo when you could use /dev/null
@engineeranonymous ปีที่แล้ว ⁺¹⁰
Never use GUID's for anything if your database have the ability to use b-tree. Eventually some day you have to use the GUID field for something and you will regret it. Use integer keys and hashid library in order to use it with controllers.
@nerdative ปีที่แล้ว ⁺¹
Do they assume that clients have the correct time? What happens if the client have the wrong time? Then it would generate "old" or "incorrect" ULIDs. Unless they somehow synchronize the time between server and client, or it happens so rarely that it doesn't matter.
@mosespeter9711 ปีที่แล้ว ⁺⁶
I have been using the sorted UUID strategy since MYSQL 8.0 with UUID_TO_BIN("uuid",1) and this function further convert it to BINARY(16) instead of storing it as string... this is super fast!
@DavisTibbz ปีที่แล้ว
Yes! More people should upgrade to mysql 8.0
@oah8465 ปีที่แล้ว ⁺²
fantastic video but why are we going with GUIDs, UUID, ULID, if we have our beloved auto-increment primary key in MySQL? Let me put it this way, in which scenario is the primary key (auto-increment) by MySQL is no good and we have to go either the UUID or the ULID way?
@SaeedAbdulRaheem ปีที่แล้ว ⁺²
7:05
@DavisTibbz ปีที่แล้ว
Because it is Predictable
@skreibb 10 หลายเดือนก่อน
UUID v4 could be beneficial for databases with partitions under the hood (dynamo db). ULID could make writes to the last partition really expensive
@DurgaShiva7574 ปีที่แล้ว ⁺¹
Amazing video, but, one doubt, who is actually generating the UUID or ULID's .. in both the scenarios.. client or Server ??
@pollathajeeva23 ปีที่แล้ว ⁺²
One video on Search engine indepth please
@ahmedelzubair6907 ปีที่แล้ว
Why do you think that ULID is not gonna help in reads?
At the end, it's sorted, right ? So it can be easily indexed, and if we indexed our ULIDs, it will help in reading a well because we are going to use the index in reading this data.
@eddie_dane 6 หลายเดือนก่อน
Which book or resource would you recommend to learn in depth knowledge on database
@JohnZakaria ปีที่แล้ว
That's my first time hearing about ULID
@athar-q 4 หลายเดือนก่อน
Tip: Don't forgot to watch at 1.75x speed.
@davidvultur8704 ปีที่แล้ว
Any plans on doing a new twitter architecture/speed video with the introduced updates? (compared to the old "Twitter Backend is slow" vide)
Thanks
@svencowart ปีที่แล้ว
Fantastic explanation
@kickeddroid ปีที่แล้ว
Great video man keep up the great work!!!!
@1Eagler ปีที่แล้ว
So either threads will go sequentially either in memory or in disk
@danielbocelli 5 หลายเดือนก่อน
Please explain to me why a second column with a timestamp couldn’t serve this purpose?
@abcdef-fo1tf ปีที่แล้ว
I'm curious why we couldn't just use the auto-incr feature on the SQL DB if we wanted sequential primary keys? Is this because this would make it harder to make it idempotent? If we're not using it for that in the case with URL shortener, we can just use auto-incr right? Also are there cases where ULID is worse than UUID? You talked about downsides, but it seems like they were also the case for UUID
@Zuriki09 ปีที่แล้ว
The drawbacks of sequential IDs are discussed early on in the video.
ULID can be worse in the following cases:
Where you specifically don't want to expose the relative position of the ID - the timestamp gives some indication of when the ULID was created in relation to other ULIDs, this may be important for cryptographic reasons.
When you are interfacing with systems that expect UUIDs. ULIDs can be converted to UUIDs but that may be more burdensome than simply using UUIDs to begin with.
UUID spec is considered more stable and future proof, this may be important in large enterprise environments where change is extremely difficult to implement - you are unlikely to encounter this problem, but that is "a" criticism of ULIDs.
@esra_erimez ปีที่แล้ว
The more things change, the more things stay the same. IBM ISAM/VSAM
@urrahman196 ปีที่แล้ว ⁺¹
Can you please make a video on refresh token? Why do we need this along with JWT? Pros n cons.
@fedemtz6 ปีที่แล้ว
if the first 48 bits are time why is the random part so long? the chances of two identical random parts being generated at the exact same time are very low
@hirisraharjo ปีที่แล้ว
What about CUID ? is it better than ULID and UUID?
@nuculabs ปีที่แล้ว
Great content, thank you!
@raunaquepatra3966 ปีที่แล้ว
What about BTree getting skewed
@mayyar123 ปีที่แล้ว
Would the same problem of inefficient indexing occur if we use Non-clustered indexing w/ UUID ?
@hnasr ปีที่แล้ว
Still exists but not as bad as clustered indexes. Clustered indexes leaf pages are rich with data which means UUID inserts are more likely to cause page splits and encores more random IO and dirty pages flushing.
non clustered uuids still has the same behavior but it will just take more rows to get there.
@akovbovich ปีที่แล้ว
Why not use UUIDv4 with hash indexes for fast lookups and a separate timestamp column for range selection?
@hck1bloodday ปีที่แล้ว
In my little knowledge of databases I feel that will cause slower inserts and they want to maximize insert speed
@-Jason-L ปีที่แล้ว
If you're selecting a range, it better be indexed on that tange
@jitxhere 5 หลายเดือนก่อน
what hash function would you use??
@Vikasptl07 ปีที่แล้ว
Great stuff !!
@harshsahu7825 ปีที่แล้ว
just joined you DB course on udemy :)
@megalodon4272 ปีที่แล้ว
This is gold
@guerra_dos_bichos ปีที่แล้ว
Thats not a type universally, university universe should be proceeded by A not AN
@billywang3829 ปีที่แล้ว
I love this man
@johnsailor3590 ปีที่แล้ว ⁺²
Yay
@cvoges12 ปีที่แล้ว
Be this guy and spend 30 minutes on talking about how sorted data makes things predictable.
@awnion ปีที่แล้ว
ULID is not something new btw
@alikoc4713 ปีที่แล้ว
Nothing new at all.
@cvoges12 ปีที่แล้ว
Stop playing with the mic. It's both highly distracting and messing with the audio by getting louder and quieter at near random. And aim the mic at your chest/jaw, not your forehead.

ต่อไป

เล่นอัตโนมัติ

Python UUID - Explanation and UUID generation