10: Design Uber/Lyft | Systems Design Interview Questions With Ex-Google SWE

Comments • 55

  • @levimatheri7682
    @levimatheri7682 a month ago +2

    You don't need to apologize 😅 Your videos are top notch and I'm learning a lot! Keep it up man 👍🏽

  • @pieter5466
    @pieter5466 4 months ago +2

    48:10 Great use case for a combo of LB, queues, and processing nodes.

  • @vishalagnihotri6431
    @vishalagnihotri6431 3 months ago +3

    Great vid. One piece of feedback: please link the prerequisites (like the geospatial vid you mentioned) for easier discovery.

  • @dandogamer
    @dandogamer a month ago +1

    Love that there is this whole detailed section about finding the exact route and times but in reality we just use the Google Distance Matrix API LOL

    • @jordanhasnolife5163
      @jordanhasnolife5163 a month ago

      Interview: tell me how to do an inorder binary search tree traversal
      IRL: treeSet.iterator().forEach

  • @RickyGarcia_Learning
    @RickyGarcia_Learning 2 months ago +1

    The DJ Khaled of system design. Just when I think the last problem you solved was your best yet, you hit me with an "And another one!"

  • @prashlovessamosa
    @prashlovessamosa 4 months ago +1

    Great Explanation as always.

  • @jonwong8097
    @jonwong8097 a month ago +1

    hey! Incredible video as always. I didn't fully understand the routing part of the video and checked the description for the article - would appreciate it.

  • @kunalkumar7936
    @kunalkumar7936 3 months ago +1

    Accidentally found this playlist and loving it. Kate's asking if Jordan can share a link to the notes on these videos too?

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago +1

      Damn Kate needs to stop hitting me up I'm too busy with Sydney Sweeney these days. Tell her I'll post the notes once I'm done with the series though.

  • @meenalgoyal8933
    @meenalgoyal8933 2 months ago +1

    Hey Jordan! Thank you for consistently making these awesome videos and answering queries in the comments. I want to clarify a few things:
    1. We want to update driver location every 3 seconds. If we were to do that directly in the geohash in-memory server, wouldn't that be too many updates to process for each partition? It could also potentially trigger repartitioning, or a change in partition, if a lot of drivers move in/out of a location.
    The Grokking course suggests maintaining a separate hash table server for each driver's old and new location, and then updating the main geohash every 15 minutes, or maybe less, say 5 minutes (which I guess we could do through a Kafka + Spark Streaming combination to micro-batch). What are your thoughts on that?
    One minor potential downside is that the driver has to maintain 2 websocket connections: one with the matching server and one with this hash table server.
    2. Minor question - for the secondary shard on userId hash range in the geohash server, userId is the customer requesting the ride, right? And why not use consistent hashing instead of userId range partitioning?

    • @jordanhasnolife5163
      @jordanhasnolife5163 2 months ago

      1) Depends how big your partitions are :) . If you find that there are too many operations per partition, you should probably be using smaller partition sizes. Not sure why we would see lots of repartitioning if drivers change locations - it's just that the driver begins to connect to a different partition.
      I'm a bit confused by your description in part 1, since you make it seem like the main geohash server will only see location updates every 15 minutes. Feel free to elaborate; when I have some more time I can go back to Grokking and see what they said for myself.
      2) That's correct, userId or driverId (we can call this entity ID or something). As for using userId range, what benefit does that get us? At least with consistent hashing, we can ensure that if we change the number of partitions, fewer keys have to be moved around in the rebalancing.

    • @meenalgoyal8933
      @meenalgoyal8933 2 months ago +1

      @@jordanhasnolife5163 Thanks for responding.
      1. Yes, smaller partitions can help.
      Repartitioning can happen in the case where we are using dynamic partitioning to keep, let's say, 500 drivers per partition. So if a location becomes hot (due to some event, maybe) we might see more drivers there and our system might want to repartition in those cases.
      About the Grokking solution: they used a separate hash table for the 3-second driver location updates to prevent frequent updates to the main geohash server. But I also had the same confusion that the main server only gets an update every 15 minutes from this hash table. This might not be ideal if we want to show customers all available drivers.
      I was trying to see if there was a middle ground where we can avoid frequent updates to the geohash partitions but still serve reasonably fresh data. I guess the Grokking solution can work if we reduce the sync time from 15 minutes to maybe 2 minutes.
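
A minimal sketch of that middle ground, assuming a small in-process buffer in front of the geohash index: drivers keep pinging every 3 seconds, but only the latest cell per driver is flushed to the index on a shorter interval. GeoIndex, LocationBuffer, and the 2-minute flush interval are illustrative assumptions, not anything from the video or the Grokking course.

```python
import time
from collections import defaultdict

class GeoIndex:
    """Hypothetical stand-in for one partition of the main in-memory geohash index."""
    def __init__(self):
        self.drivers_by_cell = defaultdict(set)
        self.cell_by_driver = {}

    def upsert(self, driver_id, cell):
        old = self.cell_by_driver.get(driver_id)
        if old == cell:
            return
        if old is not None:
            self.drivers_by_cell[old].discard(driver_id)
        self.drivers_by_cell[cell].add(driver_id)
        self.cell_by_driver[driver_id] = cell

class LocationBuffer:
    """Absorbs 3-second pings; only the latest cell per driver hits the index each flush."""
    def __init__(self, index, flush_interval_s=120):
        self.index = index
        self.flush_interval_s = flush_interval_s
        self.pending = {}                 # driver_id -> most recent geohash cell
        self.last_flush = time.time()

    def record_ping(self, driver_id, cell):
        self.pending[driver_id] = cell    # overwrite: older pings no longer matter
        if time.time() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        for driver_id, cell in self.pending.items():
            self.index.upsert(driver_id, cell)
        self.pending.clear()
        self.last_flush = time.time()
```

The trade-off is exactly the staleness discussed above: riders only see driver positions that are at most one flush interval old.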

  • @antonvityazev4823
    @antonvityazev4823 20 days ago +1

    Hello Jordan!
    Thanks for all the content you provide, it is really insightful and I appreciate your work
    I have a question: why do you never consider DynamoDB as an option for storage? I often hear about Cassandra, MySQL, and Redis, but never about Dynamo.
    Are there some concerns with it from your perspective?

    • @jordanhasnolife5163
      @jordanhasnolife5163 19 days ago

      I don't talk about it as it isn't open source, so it's hard for me to speak about its implementation under the hood. Plus, then you're locked into AWS.

    • @antonvityazev4823
      @antonvityazev4823 19 days ago +1

      Gotcha, thanks

  • @bengillman1123
    @bengillman1123 4 months ago +1

    Could anyone comment if they have been asked to compute the driver's exact route in an interview? Another great video Jordan, thanks! Looking forward to the Google Maps one

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago

      Also interested in the above - I doubt it, but I still think it's a solid case study in stream processing

  • @Luzkan
    @Luzkan 4 months ago +1

    16:23 For later, as we are on the "Finding a ride" part - I think we want to make it persistent-ish from the moment the client and driver accept a ride up until the end of the ride. Maybe every Nth signal, but we have to keep at least some of it for user history/compliance/legal reasons.

  • @inAndOut323
    @inAndOut323 2 months ago +1

    Thank you man!

  • @ingenieroriquelmecagardomo4067
    @ingenieroriquelmecagardomo4067 4 months ago +4

    Clear and insightful video. However, I think it's inappropriate that you doxxed my wife's house (Kate Upton).

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago +1

      Sorry, to be fair everyone knows where she lives already...

  • @kaneezsk4405
    @kaneezsk4405 4 months ago +1

    For the Viterbi algorithm as well, can we avoid the cross-partition computation by having all the location pings go to the same DB partition? We will be tracking the path only when a ride is in progress, right? I.e., only after the trip ID is created and the driver and rider are matched. So we want persistent data from that point, so we start writing the location pings to the DB instead of just Redis. And all those pings could go to the trips DB, which can be partitioned by trip ID (or rider ID) but not geohash ID. So there is no scenario where state is stored across partitions?

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago

      I don't think we can for the Viterbi algorithm part. It would be nice to have all of the pings for a trip ID go to the same node; however, I need to know, based on wherever I am, what the transition probability is that I end up at the road next to me. To get those, I'd have to make a ton of network calls to some sort of "transition probability service", which I think would take quite a bit longer and require more data sent over the network.
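
For readers curious what that snapping computation looks like, below is a rough log-space Viterbi map-matching sketch. The emission and transition probability functions are assumed inputs (emission from GPS-to-segment distance, transition from road connectivity); the names and signatures are made up for illustration, not taken from the video.

```python
import math

def viterbi_map_match(pings, candidates, emission_p, transition_p):
    """pings[t] is a GPS point; candidates[t] is the list of road segments near it.
    emission_p(ping, seg) ~ P(observing ping | on seg); transition_p(a, b) ~ P(a -> b).
    Returns the most likely sequence of road segments."""
    log = lambda p: math.log(max(p, 1e-12))          # guard against zero probabilities
    best = {seg: log(emission_p(pings[0], seg)) for seg in candidates[0]}
    back = [{} for _ in pings]

    for t in range(1, len(pings)):
        new_best = {}
        for seg in candidates[t]:
            # best predecessor for this segment, using road-connectivity transition scores
            prev, score = max(
                ((p, best[p] + log(transition_p(p, seg))) for p in candidates[t - 1]),
                key=lambda x: x[1],
            )
            new_best[seg] = score + log(emission_p(pings[t], seg))
            back[t][seg] = prev
        best = new_best

    # backtrack from the best final segment to recover the full path
    seg = max(best, key=best.get)
    path = [seg]
    for t in range(len(pings) - 1, 0, -1):
        seg = back[t][seg]
        path.append(seg)
    return list(reversed(path))
```

The reason for keeping this near geographically partitioned data, per the reply above, is that evaluating transition_p needs the local road graph around each candidate segment.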

  • @varshard0
    @varshard0 4 months ago +1

    Regarding the trip distribution: when the service waits 5 seconds before sending the trip to the 2nd driver, it's going to have to sort the drivers again every 5 seconds and exclude the 1st driver. How does this work?
    I guess it's some kind of scoring system that will push the 1st driver down the rank.

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago +1

      Can just sort once and be stateful!

    • @varshard0
      @varshard0 4 months ago

      @@jordanhasnolife5163 Ah, I can store each trip in Redis with a sorted list of driver IDs and keep popping drivers off of it.
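
A minimal redis-py sketch of that approach, assuming a made-up key format and TTL: rank the candidates once, persist the ordered list in Redis keyed by trip, and pop the next driver on each ~5-second retry.

```python
import redis

r = redis.Redis()  # assumes a reachable Redis instance

def save_ranked_drivers(trip_id: str, ranked_driver_ids: list[str]) -> None:
    """Sort the candidates once, then stay stateful by persisting the ranking."""
    key = f"trip:{trip_id}:candidates"
    r.rpush(key, *ranked_driver_ids)   # head of the list = best-ranked driver
    r.expire(key, 600)                 # drop abandoned requests after 10 minutes

def next_candidate(trip_id: str) -> str | None:
    """Called when the current driver hasn't accepted within ~5 seconds."""
    value = r.lpop(f"trip:{trip_id}:candidates")
    return value.decode() if value else None
```

Because LPOP is atomic, two workers retrying the same trip won't offer it to the same driver twice.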

  • @nithinsastrytellapuri291
    @nithinsastrytellapuri291 2 months ago +2

    hi Jordan, during the capacity estimation you mentioned that we receive around 15k ride requests per second and said "it's a lot". But I am surprised that you have chosen MySQL for such a write-heavy system. What about picking Cassandra here (as it allows leaderless replication) to increase write throughput? We are not performing any reads on the DB once the data is written, so I don't think we need strong consistency here. Please clarify. Thank you.

    • @Anonymous-ym6st
      @Anonymous-ym6st a month ago +2

      Also curious about the DB selection here. My understanding is we need some consistency to make sure the driver won't be matched to two rides -> maybe that's the reason for MySQL? But maybe Cassandra / Dynamo + a row lock would work for the level of consistency needed?

    • @jordanhasnolife5163
      @jordanhasnolife5163 a month ago +1

      Yeah, fair point, I think Cassandra would be fine here. Typically, my main consideration for Cassandra is when we need really fast ingestion on a single partition (like a given chat for Facebook Messenger or something), whereas here I guess we could just add more MySQL partitions, but yeah, Cassandra totally works fine.
      I wasn't too concerned with consistency here at the time, but this is a fair point too. The problem is, if we're not sharding by rider/driver IDs for the trips table, it's hard to see if we already have an open trip. Maybe the best option would be to use a two-phase commit or something to make sure to record that a driver/rider has an active trip, and this way we ensure that they don't double up. Then those tables can be SQL and the trips table itself can be Cassandra.
      Nice question!

  • @kaneezsk4405
    @kaneezsk4405 4 months ago +2

    I did not understand why you have to jump the websocket from one server to another as the driver moves. We can have the driver always connect to the same websocket server, and the location is pinged through this websocket server onto Redis. During Redis writes we make sure the driver ID in the old geohash partition is erased (we could also have a TTL to handle the race condition where it doesn't get deleted) and the new location in the new partition is written.

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago +1

      Sure, but in your design now the server has to connect to each Redis partition (of which there are many) via a websocket or poll it indefinitely, as it also has to send driver locations back to all of the riders

    • @dmitrigekhtman1082
      @dmitrigekhtman1082 4 months ago

      I also found the websocket creation/deletion in this video and in the web crawler video kind of odd. IMO, coordinating websocket connections based on the internal partition scheme couples the frontend a little too tightly to the backend. Just keeping a persistent websocket connection and handling the routing internally feels like a better separation of concerns. For example, you could use a Kafka queue to pipe top-K driver matches computed in the backend to the websockets in the frontend.
      On the other hand, maybe connecting and disconnecting websockets is simpler and would give you lower latencies?? I have no idea.
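
As a sketch of the Kafka-based alternative described above, assuming kafka-python, a hypothetical "driver-matches" topic, and a made-up JSON payload shape: the backend publishes top-K matches, and each frontend server holding the persistent websockets consumes them and forwards the ones belonging to its connected riders.

```python
import json
from kafka import KafkaConsumer  # kafka-python

# rider_id -> live websocket connection object (with a .send() method) held by this server
active_sockets = {}

def forward_matches_forever():
    """Each frontend server consumes match events and forwards the ones for its riders."""
    consumer = KafkaConsumer(
        "driver-matches",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda raw: json.loads(raw.decode()),
    )
    for message in consumer:
        match = message.value                     # e.g. {"rider_id": "...", "drivers": [...]}
        socket = active_sockets.get(match["rider_id"])
        if socket is not None:                    # this rider is connected to *this* server
            socket.send(json.dumps(match["drivers"]))
```

In practice you would likely partition the topic so each frontend server doesn't have to consume every message, which reintroduces some of the coupling being debated above - that's the trade-off.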

  • @varshard0
    @varshard0 4 months ago +1

    Why is the ClaimedTrips table needed?
    I think that just the Trips table with a driver_id column should do the job.

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago +2

      Sorry ClaimedTrips == Trips. I just called it that on one slide haha.

  • @NBetweenStations
    @NBetweenStations a month ago +1

    Would there be a cache in front of the trips DB? When the HMMs update the location of the ride periodically, it's also being read periodically. Wondering if that's too many read and write requests for a MySQL trips DB?

    • @jordanhasnolife5163
      @jordanhasnolife5163 a month ago +1

      If you noticed that there was a ton of load on it I don't see why you couldn't add a cache.

    • @NBetweenStations
      @NBetweenStations a month ago

      @@jordanhasnolife5163 thanks very much!!!

  • @GeorgeDicu-hs5yp
    @GeorgeDicu-hs5yp 4 months ago +1

    You don't talk about costs in any of these videos. Cost is a very important aspect which can redefine the entire solution. But good video - you bring new dimensions to my thought process.

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago +1

      Duplicate comment, responded to the last one lol

  • @zoebmithaiwala3961
    @zoebmithaiwala3961 3 months ago +1

    If we are sharding the Trips DB on trip ID, who creates the unique trip IDs? Should there be a central service creating unique trip IDs? Or does the sharding need to happen on something that already exists, for example user ID?

    • @jordanhasnolife5163
      @jordanhasnolife5163 2 months ago

      We could use a central service, but that could become a bottleneck. We could do something similar to TinyURL where you partition the servers by trip ID, randomly choose a leader based on load or round robin or something, and then assign the ID on the leader (sketched after this thread).

    • @zoebmithaiwala3961
      @zoebmithaiwala3961 2 months ago +1

      @@jordanhasnolife5163 That makes sense. Thanks for the clarification! Great videos :)
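
One way to read that reply is range-based ID assignment, TinyURL-style: each ID-assigning server leases a disjoint block of trip IDs and hands them out locally, so there's no central coordinator on the request path. A hypothetical sketch, where the block size and the "lease a new block" step are assumptions:

```python
import itertools
import random

class TripIdAllocator:
    """One per ID-assigning server; owns a disjoint block of IDs so no call to a
    central service is needed while handling a new-trip request."""
    def __init__(self, server_id: int, block_size: int = 1_000_000):
        start = server_id * block_size
        self._counter = itertools.count(start)
        self._end = start + block_size

    def next_trip_id(self) -> int:
        trip_id = next(self._counter)
        if trip_id >= self._end:
            # in a real system you'd lease a fresh block here (e.g. from a config store)
            raise RuntimeError("ID block exhausted")
        return trip_id

# A new trip request is routed to any allocator (round robin / by load); the ID it
# hands out then determines which Trips partition the row lands on.
allocators = [TripIdAllocator(server_id=i) for i in range(4)]
print(random.choice(allocators).next_trip_id())
```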

  • @tunepa4418
    @tunepa4418 15 days ago +1

    Why does a rider need to be connected to a matching service close to its location using the geohash load balancing? I am quite confused. Can you please clarify?

    • @jordanhasnolife5163
      @jordanhasnolife5163 14 days ago

      The server itself doesn't need to be physically near its location, but it should be responsible for the area of the map that the potential rider and all nearby drivers could connect to.
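
Concretely, that routing can be as simple as geohashing the rider's coordinates and looking up which matching server owns that cell. The encoder below is the standard base32 geohash; the prefix length and the server ownership map are made-up examples, not the video's actual scheme.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 4) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits, 5 bits per char."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    bits, bit_count, even, out = 0, 0, True, []
    while len(out) < precision:
        if even:   # longitude bit
            mid = (lon_lo + lon_hi) / 2
            bits = (bits << 1) | (lon >= mid)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:      # latitude bit
            mid = (lat_lo + lat_hi) / 2
            bits = (bits << 1) | (lat >= mid)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even, bit_count = not even, bit_count + 1
        if bit_count == 5:
            out.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(out)

# Hypothetical ownership map: each matching server is responsible for some cells.
SERVER_BY_CELL = {"dr5r": "matcher-07", "9q8y": "matcher-12"}

def matching_server_for(lat: float, lon: float) -> str:
    return SERVER_BY_CELL.get(geohash_encode(lat, lon), "matcher-default")

print(matching_server_for(40.7580, -73.9855))  # Times Square -> cell dr5r -> matcher-07
```

In a real deployment the ownership map would come from whatever partition-assignment layer the matching servers use, not a hard-coded dict.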

  • @KrunalSaija-ld5sn
    @KrunalSaija-ld5sn 4 months ago +1

    How do we make sure that a driver will not accept two requests concurrently? (Race condition on the driver side)

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago

      When a driver clicks accept on a ride, we can basically just tell their client to temporarily stop sending location updates to the cache until we confirm the acceptance. On the front end, just don't allow them to click accept again while there is one outgoing. Technically, we could have some wild race condition where they click accept, it takes 20 seconds to go through, and in that time they somehow accept another. If you really needed to, you could either
      1) Just let it happen, the second rider will get mad and cancel
      2) Shard all driver rides to the same database node and perform a check that they don't have a ride in progress (see the sketch below)
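
A minimal sketch of option 2, letting the database on the node that owns a driver's rides enforce "at most one active trip per driver". sqlite3 stands in for that node here, and the table and index names are made up.

```python
import sqlite3

# In-memory stand-in for the database node that owns this driver's rides.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trips (
        trip_id   TEXT PRIMARY KEY,
        driver_id TEXT NOT NULL,
        status    TEXT NOT NULL      -- 'active' or 'completed'
    )
""")
# At most one active trip per driver, enforced by the database itself.
conn.execute("CREATE UNIQUE INDEX one_active_trip ON trips(driver_id) WHERE status = 'active'")

def try_accept(trip_id: str, driver_id: str) -> bool:
    """Returns False if this driver already has an active trip (the second accept loses)."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO trips (trip_id, driver_id, status) VALUES (?, ?, 'active')",
                (trip_id, driver_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(try_accept("t1", "d42"))  # True  - first accept wins
print(try_accept("t2", "d42"))  # False - the concurrent second accept is rejected
```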

  • @beyondlimits8159
    @beyondlimits8159 4 months ago +1

    What do I do if I don't understand a lot of this?

  • @TheTacoReview
    @TheTacoReview 4 months ago +1

    The capacity is off. Uber only did about 500M trips globally last year. If you're using localization to load balance, you can optimize for a population, like NYC, which only does about 600k trips a day.
    You don't need location check-ins with the DB on a time interval; offload that to the client and only write to the DB on a significant change in position.
    Most modern DBs can process haversine queries natively. If you really wanted to build your own, you could work with arc seconds to find plane intersections and plot from there.
    Uber's REST API code was dumped a few years back and it's actually pretty simple. They have a lot of complexity around rate and surge calculations.
    They do have some proprietary routing and GPS accuracy systems that I'm not aware of.
    Good work.
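
For reference, the haversine distance mentioned here (which databases with geospatial support can compute natively) is only a few lines; this is a generic sketch, not Uber's or the video's implementation.

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two lat/lon points using the haversine formula."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

print(round(haversine_km(40.6413, -73.7781, 40.7580, -73.9855), 1))  # JFK -> Times Square, ~21.8 km
```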

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago

      Thanks! Interesting point with regards to the surge pricing. Definitely a combination there of demand as well as distance calculation.
      For these videos, agreed on the capacity being off, but in an interview, they can basically tell you any capacity that they want. But yeah mine was super crude lol

    • @TheTacoReview
      @TheTacoReview 4 months ago +2

      @@jordanhasnolife5163 Yea... I like your videos! You're on the right track. I have a couple of tips: 1. Don't boil the ocean. Lots of juniors try to exaggerate loads and overbuild systems; just make reasonable usage assumptions and marry that to a good n-tier design that you can scale. Bandwidth and hardware are cheap, refactoring isn't. 2. Stick to the scope: you spent a lot of time on geocoding, GPS location, etc. There are companies that make billions doing that. It's not in scope to build a routing or mapping engine; just say, "Here, I'll call X service to get the lon/lat, route, waypoints, etc." and then stick to the core of the app. Just my .03.

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 months ago

      @@TheTacoReview Certainly not in scope to build them in reality, though for whatever reason certain companies do ask about it!
      Appreciate the feedback!