So are you guys interested in working at Twitter? 😅 Btw, don't forget to "Batch" click the like & subscribe buttons. 🚀 neetcode.io/ - Get lifetime access to every course I ever create!
@@NeetCode you should look at your website, I tried to go pro but for some reason the Google API won’t let me sign up. I don’t know if I’m the only one with the problem or if it is general.
Man, finding a job as a Software Engineer is just crazy. You need to go through at least 4 to 6 rounds of interviews, starting with a technical take home challenge, then a follow up discussion about that challenge, and then another live technical coding interview, and then a live behavioral interview, and then a live system design interview, and then maybe a product delivery interview, and probably a chat with CTO or VP at the end of that. And then, once you're hired, you're just gonna be focused on fixing bugs and building features, it is very rare that you are creating a fresh system from scratch, unless you're working at a start-up, and even then, you're going to be working with other Engineers to design that system. In most other industries, you typically learn what you need for the job over time, through hands on experience. Only in Software Engineering do all Companies just expect you to be a data structure and algorithm wiz, have previous experience so you can answer those behavioral questions, and then design some abstract system from scratch within 1 hour, just to get hired.
Being an SWE these days is just insane. Any other job, you'd get hired then learn the system over time and by working with people at the company. As a SWE you have to already know how Twitter works just to get through one of the six or so interviews to get a job fixing bugs or writing new features. Does every other SWE know this shit just from going to school or working in the field for a few years? I've been a SWE for 10 years and these are all semi-new concepts to me. I've never once had to design a system like this but I guess now companies want you to be an expert on day one. I thought I could avoid cramming algorithms and system design stuff if I didn't try to get a job at FAANG but now every little startup expects you to be a senior level engineer just to make 140k. I feel like my 10 years of experience count for literally nothing.
10 Years and you barely did system design? Typically getting up in seniority means having to take a higher level approach to problems and leaving the implementation to juniors
@@garlicpress6121 I feel like web based software has skewed everyone's perception and it makes people think that this is the only kind of software dev in the world. There are so many other domains which would never need to know this sort of stuff for interviews or even for their work. For example, someone working on low level programming for drivers, or OS level software or desktop applications.
I hear you. I have 15 years of experience and I am finding this very strange. I functioned without knowing Leetcode algos and this insane System Design stuff. And I did pretty well! I don't know what value these things are adding, TBH.
@@garlicpress6121 I think for a typical SWE, doing system design is common, but not to this degree. Normally it's working on top of or improving existing systems to add features or improve performance/scale/reliability etc.
Starting from 4:26 to 7:38, that's pretty much superfluous arithmetic you're going to be doing during a systems design interview. The time you spend mentally calculating those numbers is going to be wasted, just to arrive at a conclusion of "it's a lot", which is almost a given in any systems design interview. Your time will be better spent calculating those numbers while you're doing your high-level design portion, if needed. One example of needing to calculate those numbers is a TopK system for trending topics in a social media feed (which doesn't pertain to a basic Twitter implementation). Ask your interviewer for DAUs and if it's anything over 100M, move on to the core components section (Tweet, User, Feed), rather than calculate capacity estimations.
This is a fantastic example of a realistic architecture screen. I would note for viewers that you will almost certainly not be able to think of and describe everything that was covered here and as someone who conducts 3 or 4 of these every week, I don't expect candidates to cover everything here in the 20-30 minutes I have with them. But as you go through this video, the issues presented scale really well with the expectations that go along with the seniority of the candidate and position. We actually skip a lot of the preliminary setup so that we can delve into the more complex issues for more senior candidates. If you're a mid level, I'm not expecting you to come at me talking about batching out feeds and dynamically updating them based on high popularity tweets.
No, with such a test you already filter for ex-Twitter employees. That would be fine if you are building a social network, but you'd miss out on all the brilliant devs who, for example, designed large e-commerce or data-pipeline architectures, because that requires a very different approach.
It will also be a case study on if these software companies are truly over staffed or not. If Twitter survives after laying off so many people it may inspire other companies to consider down staffing
@@KennethBoneth I think the main issue with scaling down on employees is that the remaining employees will essentially have to monitor and handle the same amount of work as before scaling down, which will cause additional stress and probably a less than healthy work life balance.
Not really, Tesla and SpaceX both are well known for the horrendous work environment. So it depends on the management and the owner of the company in this case.
@@Mattarii That is true if you were properly staffed to begin with. If twitter is as overstaffed as many people believe, then a large chunk of employees are effectively doing nothing. IF twitter goes from properly staffed to understaffed, you are correct. If twitter is going from overstaffed to properly staffed, then that won't happen.
Love your content, your videos helped me land a position at Twitter one year ago. But I just got laid off from Twitter and will start checking your videos again 😅
Great video! One question (or perhaps a mistake), in 18:20, you say all the people this guy follows should be on one shard but I don't think that's possible. If person A follows B and C, then B and C should be on one shard. if person E follows C and D, C and D should be on one shard, but its already on a different shard. Maybe B,C,D are all one shard, but as long as each person follows another different person, we will only have one shard.
Thanks for this comment, I really didn't get this sharding thing :) Sharding per user looks impossible. I thought that maybe I misunderstood this point but, after your comment, it's clear.
Sounds like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard". The phrasing isn't great.
Once I had an interview where I had to explain how to design something. I totally missed the point. This definitely gives us a clear idea. It's not about writing a user story, and not even building the actual application, but identifying the most critical points and possible components and coming up with how to solve it. Thanks again.
I can't help but find it slightly hilarious that you released this video during the ongoing controversies happening at Twitter. But in all seriousness, amazing content!
Loved it! The only issue I see is sharding having all the people who follow each other in the same shard. That's just not possible, as a friend of yours will follow someone in another shard group at some point. I haven't got a good answer for that yet, apart from saying we should use a GraphDB here that hopefully is optimised for sharding this kind of data...
Yes, that seems like a big oversight. Each shard will have a subset of a users followees, so the proposed user id as a shard key really doesn't do anything for us.
Just paused at that part, seems incorrect. The best sharding I think may be tweet id (assuming using chronological IDs like snowflake) as people are generally accessing the latest tweets so can grab them in a single request if it misses cache
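For anyone wondering what "chronological IDs like snowflake" could look like, here is a minimal sketch. The bit layout (41 bits of milliseconds since a custom epoch, 10 bits of worker id, 12 bits of sequence) follows the commonly described Snowflake scheme; the epoch value and the function names are illustrative assumptions, not something taken from the video.

```python
# Assumed Snowflake-style layout: [41 bits ms since epoch][10 bits worker][12 bits sequence]
EPOCH_MS = 1_288_834_974_657  # illustrative custom epoch, in milliseconds
WORKER_BITS = 10
SEQUENCE_BITS = 12


def make_tweet_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Pack the fields so ids sort by creation time at millisecond granularity."""
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    elapsed_ms = timestamp_ms - EPOCH_MS
    return (elapsed_ms << (WORKER_BITS + SEQUENCE_BITS)) | (worker_id << SEQUENCE_BITS) | sequence


def timestamp_of(tweet_id: int) -> int:
    """Recover the creation time (ms) from the high bits of the id."""
    return (tweet_id >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS
```

Because the timestamp sits in the high bits, sorting by id is the same as sorting by time across different milliseconds, which is what makes "grab the latest tweets" a range scan on the id.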
You've got it wrong. The idea is to have all the _followers_ of the user in one shard. This way, when the user posts a tweet, you would get all their followers ids from one shard with one query. Then you'd use this list of ids, to update their respective feeds with the tweet. When the user request their feed, they get it pre-computed from the cache, not built on-the-fly.
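The fan-out-on-write flow described above can be sketched roughly like this. Everything here is an in-memory stand-in: `followers` plays the role of the follower lookup and `feeds` the role of the feed cache, and all names are hypothetical.

```python
from collections import defaultdict

followers = defaultdict(set)  # author id -> ids of the users who follow them
feeds = defaultdict(list)     # user id -> pre-computed feed, newest tweet first


def follow(follower_id, author_id):
    followers[author_id].add(follower_id)


def post_tweet(author_id, tweet_id):
    # One lookup for the follower ids, then one feed update per follower.
    for follower_id in followers[author_id]:
        feeds[follower_id].insert(0, tweet_id)


def get_feed(user_id):
    # Reads are served from the pre-computed feed, not built on the fly.
    return list(feeds[user_id])
```

The cost moves from read time to write time: a post by an account with N followers does N feed updates, which is exactly why very popular accounts need separate handling.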
Very good tutorial as always from NeetCode. Kudos. One confusion though: I am aware of the publisher/subscriber pattern and I am also aware of message queues. What is new is "Pub/Sub message queue". Not sure what that is. It looks like the author is describing message queue behaviour rather than pub/sub. The impact you are creating is far greater than that of anyone working for FAANG.
DDIA is the most comprehensive resource (assuming you have at least some experience). Also, most companies (including twitter) release blog posts and white papers about technical challenges they faced and how they overcame them. I think many beginners miss these, but they are an extremely valuable and free resource, which is why they are commonly referenced by system design textbooks.
I just got asked this question in an interview, but with the added feature of following interests too. I was surprised that I answered with pretty much the same thing that is stated here, and I passed the interview! One thing to mention is that some companies/interviewers want to see SQL queries written out to see how you join the tables, so be prepared for that, I would say.
This level of quality content is available for free, it blows my mind! Also, I am churning through your Blind 75 list of questions and I am loving your solution videos.
There shouldn't be any userId in the POST /v1/tweet/create endpoint. This is because we will get the id of the user initiating the request from the authentication token in the request header. Putting sensitive information like authentication tokens in the request body is a security risk
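A rough sketch of that idea, assuming a bearer token in the `Authorization` header and a made-up in-memory `SESSIONS` lookup (a real service would validate the token properly). The handler name and response shape are hypothetical.

```python
SESSIONS = {"token-abc": "user-42"}  # hypothetical token -> user id lookup


def create_tweet(headers: dict, body: dict) -> dict:
    """Handle POST /v1/tweet/create; the author comes from the token, not the body."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    user_id = SESSIONS.get(token) if scheme == "Bearer" else None
    if user_id is None:
        return {"status": 401, "error": "unauthorized"}
    # Any userId the client put in the body is deliberately ignored.
    return {"status": 201, "tweet": {"user_id": user_id, "text": body.get("text", "")}}
```

With this shape a client cannot post as someone else just by changing a field in the request body.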
There's no difference in security, whether you put the token in the headers or the body. But it's better to put it in the headers because your gateway can start checking it or sending the request to the destination API before it downloads the body. Putting the userId in the body doesn't make sense here, but it would allow you to have other features like "postponed tweets". And another service with an internal token (without the userId) could call the existing API to post those messages.
If user A follows B and C, and B follows A back, then all three should be on the same shard. In the same way, if B follows 10 more people and even one of them follows back, then all 10 of those should be on the same shard too, and so it goes on until all the data is on a single shard. It looks like a very abstract explanation; I am not sure why people don't think it through a little more rather than explaining it in such an abstract way.
Caching the Feed page in the CDN and purge it on update(feed is tagged with User_ids), the infrastructure is basically a multi layer data retrieval, uid->followee->tweets(sorted by timestamps) and then merge to get the final result. The uid->followee mapping can be compactly stored and updated if needed. (K/V or RDB) followee->tweets would be a sharded DB with all tweets posted. (K/V). it would just be a simple backend and most of the load would be handled by the CDN.
That more or less is I think what he described for his feed cache description. But it doesn't solve the problem he brings up where we don't want to update all the followers' feed cache whenever a popular user posts a tweet. Also, I don't know how to do it, but when you say "on update", I'm assuming that whenever a person posts a tweet, all the users following that person gets "updated". In that case, then only thing that needs to be changed is inserting that new tweet into the feed (and probably popping out whatever oldest or least important tweet that is in the feed that this new tweet will replace). In that case, I don't think retrieving and merging all the relevant tweets each time there is an "update" makes sense. I think that's why he brought up pub/sub. So it's just a queue where whenever a new one comes the least important one gets popped out.
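The "insert the new tweet and pop the oldest" behaviour can be sketched with a bounded deque. This assumes eviction is purely by age (no importance ranking), and the size limit and names are made up.

```python
from collections import deque


def make_feed(max_items: int) -> deque:
    # A deque with maxlen drops items from the opposite end once it is full.
    return deque(maxlen=max_items)


def push_tweet(feed: deque, tweet_id: int) -> None:
    # Newest tweet goes to the front; the oldest falls off the back when full.
    feed.appendleft(tweet_id)


feed = make_feed(3)
for tweet_id in [1, 2, 3, 4]:
    push_tweet(feed, tweet_id)
```

After the loop the feed holds tweets 4, 3, 2 and tweet 1 has been evicted, so an update never requires re-fetching and re-merging the whole feed.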
@@marspark6351 Maybe it's possible to determine a "popular" user and when those users create a tweet, only cache that tweet instead of allowing a message to go through the pub/sub when they post a tweet.
If sharding by user id then, to retrieve a single tweet (e.g. by a direct link), you would need to request all shards. Is it something tolerable or how do you overcome it? And what about hot user problem? Sharding by user id does not work well in this case.
So if we want to return only the cache, then the feed will not be up to date if the user follows a celebrity. That means that every time the user comes we still need to query the list of people the user is subscribed to, right? To check whether there is a celebrity among them.
Don't guess the capacity, there are infinite servers, infinite ram, infinite disk. Don't calculate. Only poor calculate. Is the design horizontally scalable? Yes. Go home now
If you have the capacity for asynchronously pre-building timelines for all (active) users, why don't you increase the capacity of the cache layer for the RLDB, or store the tweets in a fast KV NoSQL?
Probably, having NoSQL KV-store with such massive reads you'd have to deal with its sharding anyways. Don't think you'd just set up Cassandra and start throwing in nodes to the cluster mindlessly. So, author, choosing SQL DB, just makes that logic explicit.
I think the best way to test a senior developer is to ask them to explain their own projects, and how they solved certain problems. That way you can see if they have the necessary experience and knowledge to do the job. Such random architecture tests are pointless for senior positions. I would not be able to list all those things, because interviews are stressful and I am not the best communicator. However, give me a few hours to prepare, and I would design a system that is even better, and if I had a day or two to code I could even create a proof-of-concept (although that would be unreasonable to ask). Testing such things during an interview mainly tells you how well a person's memory works in a stressful situation. But real developers never have stressful situations, because they already made some code for that. The way to pass such tests is to mention as many buzzwords as possible, but that filters out the real scientists who do not play such games. I think questions should focus more on actual problem solving abilities and past work, because that is a better indicator of success in the software engineering world.
Splendid! Solid content with crystal clear pronunciation and comfortable speed. How did you practice your speaking? I wish I could speak without those meaningless filler words (er, en, aa) in a system design interview.
Ngl as an aspiring software engineer, I find this video helpful in terms of macro design. New video style over the different duties of a software engineer? 👀👀
Use a DB like Cassandra: users, tweets, followers, follows, feed. Everything sharded by user ID to colocate relevant data. Fan out to followers feeds on tweet. For celebrity users, fetch the celebrity tweets from cache when building the feed. Have some background jobs pre-populate some other good feed candidates, Rank the feed by some scoring system. Push likes, retweets to an event stream and update cached like counters in Redis from the stream every so often. Shard on tweet ID and spin up some read replicas if needed
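The "push likes to an event stream and update cached counters every so often" part could look something like this. The list and the `Counter` are in-memory stand-ins for the event stream and the Redis counters; the function names are hypothetical.

```python
from collections import Counter

like_events = []         # stand-in for the event stream
like_counts = Counter()  # stand-in for the cached like counters


def record_like(tweet_id: int) -> None:
    # Writes are cheap appends; no counter is touched on the hot path.
    like_events.append(tweet_id)


def flush_likes() -> int:
    """Fold the pending events into the counters; returns how many tweets changed."""
    batch = Counter(like_events)
    like_events.clear()
    like_counts.update(batch)
    return len(batch)
```

The trade-off is that displayed like counts lag by up to one flush interval, in exchange for one counter update per tweet per batch instead of one per like.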
Something I didn't understand: You suggest sharding on user ID as then the people a user follows will be grouped on the same shard. However, users can have a lot of followers and their followers will be distributed across different shards. So you have to duplicate a user's tweets across every shard that has someone following them in it, in which case you probably have enough fanout that you're not really sharding anymore, it's just replicas with more steps (at least for the read case, writes would be meaningfully sharded). Am I missing something here? It feels like to get any value out of sharding you'd have to do something MUCH more complicated like assign users to shards based off similarity graphs.
I have a question. In most read-intensive applications, the design adds a cache layer like Redis to shield the DB from traffic. Could I skip the cache and instead add as many read-only MySQL replicas as needed to distribute the traffic? A cache also has to deal with the sync problem between Redis and MySQL, but read-only replicas avoid that hassle.
I would believe this has less to do with whether it's SQL or noSQL, but probably more to do with that Redis makes better use of RAM than mysql. Don't take my word tho. Just a possible assumption
Why do you need a relational DB? All this relationship data can be stored in a document DB as JSON for high performance, low latency and scalability. A relational DB will not be efficient in this scenario because we need high availability and eventual consistency, so a NoSQL DB like MongoDB is preferred over a relational DB here. Correct me if I am wrong.
It will be helpful if you present it in a more realistic interview Q&A kind of scenario, as often the interviewer interrupts the process, asks a different question, or asks to take a different approach. The mocks presented only as a monologue do not serve that purpose.
I appreciate the effort and care you put into this video but I think it could use a little more focus. Especially at the sharding-for-writes portion. You jumped around a lot to digressions that made that line of thought hard to follow.
I would send the tweet timestamp from the client. If you handle it server-side and something breaks and delays the server-side ingestion of the tweet, you'd have an incorrect timestamp. ("Wow, what an amazing touchdown!" posted 2 hours after the touchdown and way out of context on feeds etc)
At 18:12, how do you manage to get all the users one follows in a single shard? It seems strange to me. If that can work, then all the users need to be in a single shard, which fails the purpose of sharding.
Sounds to me like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard"
@@arthur723 For a given user following users a, b, c with shards 1, 2, 3:
User a posts -> shard 2
User b posts -> shard 3
User c posts -> shard 1
But not: User a -> some posts in shard 1 and some posts in shard 2.
So for any given user, all of their data will be in a single shard. However, not all of the people you follow will have data in the same shard.
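That reading can be sketched with plain hash sharding. The shard count, the choice of SHA-256, and the function names are assumptions for illustration; the video does not specify how the shard is picked.

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count


def shard_for(user_id: str) -> int:
    """Stable mapping: the same user id always lands on the same shard."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_for_feed(followee_ids) -> set:
    """Shards that must be queried to read tweets from all of these followees."""
    return {shard_for(user_id) for user_id in followee_ids}
```

All tweets by one author come from a single shard, while building a feed touches at most one shard per followee and never more than `NUM_SHARDS` in total.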
I wouldn't combine reads of tweets with "reads" of videos into a single number of data we're going to read from our "storage" as storing videos and streaming videos and storing and reading text tweets + meta data are completely different tasks which access and deals with data in a completely different way.
One of the defining features of twitter is timely notifications about new tweets from people you follow. Could you please describe how could it be implemented in this architecture? Likes and comments allow users to attach their content to a potentially popular tweet. How would it affect our storage layer? What challenges, if any, we would face with multi-az deployment of such system? Thank you for your time and interest in our company.
Speaking of popular users. We can separate tweet data by some follower threshold (say 10k followers) and, when popular profile post a new tweet, we only need to update that feed. Every normal profile will check that feed in case they follow popular profiles.
End of min 8, a relation between a follower and a followee is not the same as a relation in a relational database. So it's not a reason to think of one.
Agree on the part that, the data is more on relational side. But why can't we put the tweet in any NoSql db like cassandra, scylla. As from our follow table i know which followee's tweet i have to fetch. Now that i know, i simply have to search in shards the followee's tweet stored.
Seems to be missing the peak throughput in the initial design. 20 billion reads per day is not as useful a metric as how many per minute or second at peak load.
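The conversion itself is quick. A small sketch, where the 20 billion reads per day comes from the comment above and the peak factor of 3 is purely an assumed example (no peak figure is given anywhere in the thread):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(requests_per_day: float) -> float:
    return requests_per_day / SECONDS_PER_DAY


def peak_per_second(requests_per_day: float, peak_factor: float) -> float:
    # peak_factor is an assumption: how much busier the peak is than the average.
    return average_per_second(requests_per_day) * peak_factor


avg_reads = average_per_second(20e9)     # roughly 231,000 reads per second
peak_reads = peak_per_second(20e9, 3.0)  # roughly 694,000 reads per second
```

So 20 billion reads per day averages out to about 231 thousand per second, and the capacity question is really about whatever multiple of that the peak turns out to be.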
That's correct, I meant that while NoSQL is easier to scale (automatically or by specifying a shard key), we can still scale relational DBs via sharding.
Thank you for the interesting video. However, I doubt that a relational database can store the tweets. I was just asked to design Twitter during a job interview and constructed something very similar. But I suggested using Aerospike for messages with the following schema: id -> list of messages. Aerospike scales horizontally, so there is no need to think about sharding.
How about starting from the data model rather than the architecture? And let architecture emerge from constraints on use cases? Enterprise applications 101.
What this seems to miss is that what's important about twitter is the datastructure that represents a timeline of tweets. There are systems that generate that then there are systems that inject tweets into that timeline. These can be ads or they can be other tweets that the ML systems want to promote... etc. That's the DS that is at the heart of what twitter is. The tech here should be built around that concept NOT the other way around.
I understood why the userId helps as a shard key, but I did not understand why choosing "tweetId" as the shard key does not help. Why do we have to query all the shards if we shard based on "tweetId"? Can someone explain please?
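One way to see it, as a toy sketch with made-up ids and simple modulo sharding (an assumption, not the video's scheme): with tweetId as the key, one author's tweets scatter over every shard, so "get the tweets of user 7" has to ask all of them.

```python
NUM_SHARDS = 4  # illustrative shard count

# Eight tweets, all written by the same (hypothetical) author with id 7.
tweets = [(tweet_id, 7) for tweet_id in range(100, 108)]  # (tweet_id, author_id)


def shards_holding_author(tweets, author_id, key):
    """Shards that contain at least one tweet by this author under a given shard key."""
    if key == "tweet_id":
        return {tweet_id % NUM_SHARDS for tweet_id, author in tweets if author == author_id}
    return {author % NUM_SHARDS for _, author in tweets if author == author_id}


by_tweet_id = shards_holding_author(tweets, 7, "tweet_id")  # every shard is involved
by_user_id = shards_holding_author(tweets, 7, "user_id")    # a single shard
```

Looking up one specific tweet by its id is the opposite case: tweetId sharding answers that from one shard, while userId sharding needs the author to be known first.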
Pretty sure he means that someone shouldn't be able to use something like Postman to send a request with someone else's user id and retrieve all of their tweets.
I have a question about 23:46, on how that "update of the feed upon request instead of when a tweet is created" would work. So would a user's feed be continuously updated via the message queue whenever there's a new tweet, except for tweets from the popular accounts? And when that user requests the feed, it will somehow just fetch that missing tweet and fill it into the feed? How would that work? Isn't that the same issue as what's described at 19:57, where 19 of your 20 tweets could be cached but you'll have to go to the disk to find that one tweet?
Before returning the feed, the app server would check if the user follows any celebrity (one query to follow table). Then get the tweets of the celebrities which user follows from the cache, and inject them into the feed based on the timestamp. This approach has significant downsides like increasing latency for all users, so I believe this problem is addressed differently in real world.
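The "inject them into the feed based on the timestamp" step is a merge of two lists that are each already sorted newest first. A minimal sketch, assuming tweets are `(timestamp, text)` tuples and that both inputs arrive sorted that way; the function name and the limit are made up.

```python
import heapq
from itertools import islice


def build_feed(cached_feed, celebrity_tweets, limit=20):
    """Merge two newest-first lists of (timestamp, text) into one newest-first feed."""
    merged = heapq.merge(
        cached_feed,
        celebrity_tweets,
        key=lambda tweet: tweet[0],
        reverse=True,  # both inputs must already be sorted newest first
    )
    return list(islice(merged, limit))
```

The merge is lazy, so only the first `limit` items are ever pulled; the extra latency mentioned above comes from the follow-table query and the celebrity cache lookup, not from the merge itself.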
I loved your video very much, and thanks a lot for the effort you made. These are the questions we actually face when working on the BE side. One small question: if someone asks you what kind/type of architecture this is, what will be your answer?
Yes, you must be really disliking Elon Musk so much (to say it mildly ). > Who is most popular on Twitter? Kim Kardashian. probably over 100 million followers . ..... -- Putting the subject aside - you made a good content - thank you!
That’s amazing how this kind of large-scale system can grow and become so complex with amount of components and “moving parts”, also it’s impressive how it works with a massive amount of users and data storage like petabytes. In the end, I didn’t understand if your solution was using sharding or not on the database, if it is using, how do you solve the issue about the sharding-key, ‘cause it looks like not possible to use the “every account followed by someone” strategy due the reasons you even talked about. Is it possible to have sharding and reading replicas at the same time? And how to handle it, using many load balancers, each one after sharding for a single replicas cluster?
It doesn't seem like there's a good general solution. What works for the average user, doesn't work well for super big users. I heard that Twitter separated users with more than a certain number of followers into a separate use case. Not sure if they ever solved it.
I don't think that sharding on UID is a good approach, I would go with sharding based on date (a shard for each month). thus making feed fetching simpler and faster. at most you will hit two shards at the same time. and maybe you only need replicas for the last 3 shards (last 3 months). And I'm not sure about saving pre-computed feeds. is it really good practice?
You should leave Google for Twitter
tweet this video to Elon , he might make you CEO, he is weird like that.
@@BhargavSushant lol maybe i should
Yes hire then next day fire
@@vhchoang if you work at a startup and a new product is created from scratch, you get the opportunity to design these kinds of things.
the biggest thing about sharding is that we could potentially lose the joins, and it adds a huge layer of complexity on the application.
wow !!! from algorithms to system design, love to see more on system design videos
I guess twitter will be a case study in “does talent matter” and “how interchangeable/disposable are sw engineers”.
@@bryanyang7626 that's true, might not work well with other companies once people start realizing their lives are worth more than slaving away
I'm sorry to hear that, wish you the best - it's only a matter of time!!!
me too😂😂
Great video! One question (or perhaps a mistake): at 18:20, you say all the people this guy follows should be on one shard, but I don't think that's possible. If person A follows B and C, then B and C should be on one shard. If person E follows C and D, then C and D should be on one shard, but C is already on a shard with B. So B, C, and D all have to share one shard, and as long as follow lists keep overlapping like this, we end up with only one shard.
Thanks for this comment, I really didn't get this sharding thing :) Sharding per user looks impossible. I thought maybe I had misunderstood this point, but after your comment it's clear.
Sounds like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard". The phrasing isn't great.
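The objection at the top of this thread can be checked mechanically. Below is a minimal sketch, assuming the rule really were "everyone a user follows must sit on the same shard". The function name `colocated_groups` and the input shape are made up for illustration. It merges followees with a small union-find and returns the groups that would be forced to share a shard.

```python
def colocated_groups(follows):
    """Return groups of users forced onto one shard if each user's
    followees must all be co-located (the rule being questioned)."""
    parent = {}

    def find(user):
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    for followees in follows.values():
        followees = list(followees)
        for followee in followees:
            find(followee)  # register every followee, even a lone one
        for other in followees[1:]:
            # Merge each followee into the group of the first one.
            parent[find(other)] = find(followees[0])

    groups = {}
    for user in list(parent):
        groups.setdefault(find(user), set()).add(user)
    return sorted(sorted(group) for group in groups.values())


# A follows B and C; E follows C and D (the example from the comment).
print(colocated_groups({"A": ["B", "C"], "E": ["C", "D"]}))
```

With the example from the comment, B, C and D collapse into one group, which is the "we will only have one shard" outcome once follow lists overlap enough.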
Once I had an interview explaining how to design something. I totally missed the point. This definitely gives us a clear idea.
It's not about writing a user story, and not even building the actual application, but identifying the most critical points and possible components and to come up with how to solve it.
Thanks again.
I would love to see more System Design content !! nice video man
Thank you, more to come!
@@NeetCode I think he meant on your youtube channel haha..
@@sanskarkaazi3830 obviously, what else could he mean?
@@indiging8330 neetcode has premium courses on his website as well so not there but here.. you get what i mean?
Can you talk about Pinterest, or someone link some available content.
I can't help but find it slightly hilarious that you released this video during the ongoing controversies happening at Twitter.
But in all seriousness, amazing content!
Musk will hire him
It's because Musk tweeted the HLD of Twitter on Twitter. You can see that in the thumbnail of this video too.
Loved it!
The only issue I see is sharding having all the people who follow each other in the same shard. That's just not possible, as a friend of yours will follow someone in another shard group at some point.
I haven't got a good answer for that yet, apart from saying we should use a GraphDB here that hopefully is optimised for sharding this kind of data...
Yes, that seems like a big oversight. Each shard will have a subset of a user's followees, so the proposed user id as a shard key really doesn't do anything for us.
Yeah I felt like I was missing something when he said sharding and scrolled down to the comments to confirm
Just paused at that part, seems incorrect. The best sharding I think may be tweet id (assuming using chronological IDs like snowflake) as people are generally accessing the latest tweets so can grab them in a single request if it misses cache
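For anyone wondering what "chronological IDs like snowflake" buys you, here is a rough sketch. The bit layout (41 bits of milliseconds, 10 bits of worker id, 12 bits of sequence) follows the publicly described Snowflake scheme as I understand it. The epoch constant and the function names are assumptions for illustration, not Twitter's actual code.

```python
# Assumed layout: [41 bits ms since custom epoch][10 bits worker][12 bits sequence]
EPOCH_MS = 1288834974657  # assumed custom epoch, only used for the demo
WORKER_BITS = 10
SEQUENCE_BITS = 12


def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Pack a timestamp, worker id and sequence number into one integer."""
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    elapsed = timestamp_ms - EPOCH_MS
    return (elapsed << (WORKER_BITS + SEQUENCE_BITS)) | (worker_id << SEQUENCE_BITS) | sequence


def timestamp_of(tweet_id: int) -> int:
    """Recover the creation time in milliseconds from an id."""
    return (tweet_id >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS


older = make_id(EPOCH_MS + 1000, worker_id=5, sequence=0)
newer = make_id(EPOCH_MS + 1001, worker_id=1, sequence=0)
print(older < newer)  # ids sort by creation time, whichever worker made them
```

Because the timestamp sits in the high bits, sorting by id is sorting by time, which is why "grab the latest tweets" becomes a simple range query on the id.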
@@salient244 yeah, but still you'd need to store the friend relationships somehow, and you'd run into the sharding issue when it scales up
You've got it wrong. The idea is to have all the _followers_ of the user in one shard. This way, when the user posts a tweet, you would get all their followers' ids from one shard with one query. Then you'd use this list of ids to update their respective feeds with the tweet. When the user requests their feed, they get it pre-computed from the cache, not built on-the-fly.
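A toy version of that fan-out-on-write flow, in case it helps. Everything here is in-memory and hypothetical: the `followers` and `feeds` dicts stand in for the follow table and the feed cache, and `FEED_LIMIT` is a made-up cap kept tiny for the demo.

```python
from collections import defaultdict, deque

FEED_LIMIT = 3  # hypothetical cap on cached feed length

followers = defaultdict(set)  # followee id -> follower ids (the "one query")
feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))  # user id -> newest-first tweet ids


def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)


def post_tweet(author_id, tweet_id):
    # Look up the author's followers once, then push the tweet onto each
    # follower's cached feed. A full deque drops its oldest entry.
    for follower_id in followers[author_id]:
        feeds[follower_id].appendleft(tweet_id)


def get_feed(user_id):
    # Reads only touch the pre-computed feed, nothing is built on the fly.
    return list(feeds[user_id])


follow("alice", "bob")
follow("carol", "bob")
for tweet_id in (1, 2, 3, 4):
    post_tweet("bob", tweet_id)
print(get_feed("alice"))
```

The write does work proportional to the follower count, which is exactly why the celebrity case gets handled separately elsewhere in this comment section.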
Very much enjoyed the video, the explanation, the simplicity and the clarity it brought out. Thank you
Glad it was helpful!
Very good tutorial as always from NeetCode. Kudos.
One confusion though: I am aware of the publisher/subscriber pattern and I am also aware of message queues. What is new is "Pub/Sub message queue". Not sure what that is. It looks like the author is describing message queue behaviour rather than pub/sub.
The impact you are creating is far better and huge than anyone working for FAANG.
What books / sources did you refer to get a strong grip on system design?
DDIA is the most comprehensive resource (assuming you have at least some experience).
Also, most companies (including twitter) release blog posts and white papers about technical challenges they faced and how they overcame them. I think many beginners miss these, but they are an extremely valuable and free resource, which is why they are commonly referenced by system design textbooks.
Thanks!!
@@NeetCode Is there a central url where you find those blog posts or do you just google them?
A good place to start is by learning the classic OOP design patterns. It's less about the OOP and more about the patterns.
I can't believe that I just found this channel now. Great content
Your content is way, WAY better than the others on YouTube! Great work!
I just got asked this question in an interview, but with the added feature to follow interests too, and I am surprised I answered pretty much the same thing that is stated here, and I passed the interview! One thing to mention is that some companies/interviewers want to see SQL queries written out to see how you join the tables, so be prepared for that, I would say.
If we shard based on a user id, won't it become a hotspot (if the user is a celebrity or has a large number of tweets)?
This level of quality content is available for free, it blows my mind! Also, I am churning through your Blind 75 list of questions and I am loving your solution videos.
how is this a quality content?
@@umarqureshi8499 what's wrong with it?
First time Kim Kardashian has come up in any tech video I've watched
Wow! That's a lot to take in maybe because I'm sleepy but sparked at the same time. Put out more of this please.
There shouldn't be any userId in the POST /v1/tweet/create endpoint. This is because we will get the id of the user initiating the request from the authentication token in the request header. Putting sensitive information like authentication tokens in the request body is a security risk
There's no difference in security, whether you put the token in the headers or the body. But it's better to put it in the headers because your gateway can start checking it or sending the request to the destination API before it downloads the body. Putting the userId in the body doesn't make sense here, but it would allow you to have other features like "postponed tweets". And another service with an internal token (without the userId) could call the existing API to post those messages.
If user A follows B and C, and B follows A back, then all three should be on the same shard. In the same way, if B follows 10 more people and even one of them follows back, then all those 10 should be on the same shard too, and it goes on until all the data is on a single shard. It looks like a very abstract approach. I'm not sure why people don't think a little more rather than explaining it in that abstract way.
Caching the Feed page in the CDN and purge it on update(feed is tagged with User_ids), the infrastructure is basically a multi layer data retrieval, uid->followee->tweets(sorted by timestamps) and then merge to get the final result.
The uid->followee mapping can be compactly stored and updated if needed. (K/V or RDB)
followee->tweets would be a sharded DB with all tweets posted. (K/V).
it would just be a simple backend and most of the load would be handled by the CDN.
That more or less is I think what he described for his feed cache description.
But it doesn't solve the problem he brings up where we don't want to update all the followers' feed cache whenever a popular user posts a tweet.
Also, I don't know how to do it, but when you say "on update", I'm assuming that whenever a person posts a tweet, all the users following that person gets "updated". In that case, then only thing that needs to be changed is inserting that new tweet into the feed (and probably popping out whatever oldest or least important tweet that is in the feed that this new tweet will replace). In that case, I don't think retrieving and merging all the relevant tweets each time there is an "update" makes sense. I think that's why he brought up pub/sub. So it's just a queue where whenever a new one comes the least important one gets popped out.
@@marspark6351 Maybe it's possible to determine a "popular" user and when those users create a tweet, only cache that tweet instead of allowing a message to go through the pub/sub when they post a tweet.
Thank you for explaining in such detail. I learned about sharding, definitely will use in my projects.
If sharding by user id then, to retrieve a single tweet (e.g. by a direct link), you would need to request all shards. Is it something tolerable or how do you overcome it?
And what about hot user problem? Sharding by user id does not work well in this case.
Yep, but there is no requirement in this case to be able to request tweet by id directly without knowing the author of the tweet.
So if we return only the cache, but the user follows a celebrity, then it will not be up to date. That means every time the user comes we still need to query the list of people that user is subscribed to, right? To check whether there is a celebrity.
Don't guess the capacity, there are infinite servers, infinite ram, infinite disk. Don't calculate. Only poor calculate. Is the design horizontally scalable? Yes. Go home now
If we have a read-heavy system, why are we not using a master/slave design?
If you have the capacity for asynchronously pre-building timelines for all (active) users, why don't you increase the capacity of the cache layer for the RLDB, or store the tweets in a fast KV NoSQL?
Probably, having NoSQL KV-store with such massive reads you'd have to deal with its sharding anyways. Don't think you'd just set up Cassandra and start throwing in nodes to the cluster mindlessly. So, author, choosing SQL DB, just makes that logic explicit.
Literally Amazing man. Take a bow 🙇‍♂️
Extremely good discussion in this video, more of this please!
I think the best way to test a senior developer is to ask them to explain their own projects, and how they solved certain problems. That way you can see if they have the necessary experience and knowledge to do the job.
Such random architecture tests are pointless for senior positions. I would not be able to list all those things, because interviews are stressful and I am not the best communicator. However, give me a few hours to prepare, and I would design a system that is even better, and if I had a day or two to code I could even create a proof-of-concept (although that would be unreasonable to ask).
Testing such things during an interview mainly tells you how well a person's memory works in a stressful situation. But real developers never have stressful situations, because they already made some code for that. The way to pass such tests is to mention as many buzzwords as possible, but that filters out the real scientists who do not play such games. I think questions should focus more on actual problem solving abilities and past work, because that is a better indicator of success in the software engineering world.
Splendid! Solid content with crystal clear pronunciation and comfortable speed. How did you practice your speaking? I wish I could speak no er----en-----aa those no meaning words in a system design interview.
Ngl as an aspiring software engineer, I find this video helpful in terms of macro design. New video style over the different duties of a software engineer? 👀👀
Use a DB like Cassandra: users, tweets, followers, follows, feed. Everything sharded by user ID to colocate relevant data.
Fan out to followers feeds on tweet. For celebrity users, fetch the celebrity tweets from cache when building the feed. Have some background jobs pre-populate some other good feed candidates, Rank the feed by some scoring system.
Push likes, retweets to an event stream and update cached like counters in Redis from the stream every so often. Shard on tweet ID and spin up some read replicas if needed
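The "update cached like counters from the stream every so often" part could look roughly like this. It is only a sketch: the list stands in for the event stream and the `Counter` stands in for Redis, and the class and method names are made up.

```python
from collections import Counter


class LikeCounter:
    def __init__(self):
        self.pending = []        # stand-in for the event stream
        self.cached = Counter()  # stand-in for the cached counters

    def like(self, tweet_id):
        # A like is just appended as an event, no counter is touched yet.
        self.pending.append(tweet_id)

    def flush(self):
        # Collapse buffered events into one increment per tweet.
        batch = Counter(self.pending)
        self.pending.clear()
        self.cached.update(batch)
        return len(batch)  # number of counter writes performed


counter = LikeCounter()
for _ in range(5):
    counter.like("t1")
counter.like("t2")
print(counter.flush(), counter.cached["t1"])
```

Six like events turn into two counter writes, so a viral tweet costs one increment per flush interval instead of one per click. The trade-off is that the displayed count lags by up to one interval.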
Something I didn't understand:
You suggest sharding on user ID, as then the people a user follows will be grouped on the same shard.
However, users can have a lot of followers and their followers will be distributed across different shards. So you have to duplicate a user's tweets across every shard that has someone following them in it in which case you probably have enough fanout that you're not really sharding anymore, it's just replicas with more steps (at least for the read case, writes would be meaningfully sharded).
Am I missing something here? It feels like to get any value out of sharding you'd have to do something MUCH more complicated like assign users to shards based off similarity graphs.
I have a question. In most read-intensive applications, the design adds a cache layer like Redis to shield the DB from traffic. Could I skip the cache and instead add as many read-only MySQL replicas as needed to distribute the traffic? A cache also brings the problem of keeping Redis and MySQL in sync, but read-only replicas get rid of that hassle.
I would believe this has less to do with whether it's SQL or noSQL, but probably more to do with that Redis makes better use of RAM than mysql. Don't take my word tho. Just a possible assumption
The abstract design is vital! Now I have realized this point.
Amazing! This one of the best System Design videos I watched :) Great job!
considering how many joins you would have to do in a relational DB, it would be hard to justify that for twitter.
Awesome video, what are you using as your board?
Why do you need a relational DB? All this relationship data can be stored in a document DB as JSON for high performance, low latency and scalability. A relational DB will not be efficient in this scenario because we need high availability and eventual consistency, so a NoSQL store like MongoDB is preferred over a relational DB here. Correct me if I am wrong.
It will be helpful if you present it in a more realistic interview Q&A kind of scenario, as often the interviewer interrupts the process, asks a different question, or asks to take a different approach. The mocks presented only as a monologue do not serve that purpose.
I appreciate the effort and care you put into this video but I think it could use a little more focus. Especially at the sharding-for-writes portion. You jumped around a lot to digressions that made that line of thought hard to follow.
I almost spilled my coffee when i heard the word "How hard can it be?" LOL
I would send the tweet timestamp from the client. If you handle it server-side and something breaks and delays the server-side ingestion of the tweet, you'd have an incorrect timestamp. ("Wow, what an amazing touchdown!" posted 2 hours after the touchdown and way out of context on feeds etc)
At 18:12, how do you manage to get all the users one follows in a single shard? It seems strange to me. If that can work, then all the users need to be in a single shard, which defeats the purpose of sharding.
Sounds to me like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard"
@@Squigglybiggly the two sentences you said mean the same to me. Or maybe my English is bad.
@@arthur723 for a given user following users a,b,c with shards 1,2,3:
User a posts -->> shard 2
User b posts -->> shard 3
User c posts --->> shard 1
But not
User a--> some posts shard1 and some posts in shard2
So for any given user, all of their data will be in a single shard. However, not all of the people you follow will all have data in the same shard
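The same idea as a small sketch, assuming plain hash-based routing on the author's user id. `NUM_SHARDS` and both function names are made up for the example.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(user_id: str) -> int:
    # A stable hash of the author's id picks exactly one shard, so all of
    # one user's tweets land together.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_for_feed(followee_ids):
    # Building a feed may still touch several shards: one per distinct
    # shard among the people you follow.
    return {shard_for(user_id) for user_id in followee_ids}


print(shard_for("a") == shard_for("a"))
print(len(shards_for_feed(["a", "b", "c"])) <= 3)
```

So each followee is a single lookup on a single shard, but nothing guarantees that all of your followees share one.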
What a nice video, I learnt a lot even being a junior developer.
Btw, how can I find the official twitter engineering paper you mentioned at the end?
I’d try checking their engineering blog for leads.
14:11, why put an index on follower? I think once we index by followee, the follower list would be grouped inside the DB.
I really enjoy watching your video!!
I wouldn't combine reads of tweets with "reads" of videos into a single number of data we're going to read from our "storage" as storing videos and streaming videos and storing and reading text tweets + meta data are completely different tasks which access and deals with data in a completely different way.
I would argue you need an index on both followee and follower because in twitter you can see both ways
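A quick sketch of what "index on both" could mean, using SQLite from the Python standard library. The table and column names are assumptions, not the schema from the video. The primary key covers lookups by follower, and the extra index covers lookups by followee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE follows (
    follower_id INTEGER NOT NULL,
    followee_id INTEGER NOT NULL,
    PRIMARY KEY (follower_id, followee_id)  -- serves "who do I follow?"
);
-- Second index for the other direction: "who follows me?"
CREATE INDEX idx_follows_followee ON follows (followee_id, follower_id);
""")
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3), (4, 2)])


def following(user_id):
    rows = conn.execute(
        "SELECT followee_id FROM follows WHERE follower_id = ? ORDER BY followee_id",
        (user_id,),
    )
    return [row[0] for row in rows]


def followers_of(user_id):
    rows = conn.execute(
        "SELECT follower_id FROM follows WHERE followee_id = ? ORDER BY follower_id",
        (user_id,),
    )
    return [row[0] for row in rows]


print(following(1), followers_of(2))
```

The cost is the usual one: every follow or unfollow now has to update two index structures instead of one.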
Don't forget ads. Imagine how complex this whole thing becomes when we add in ads.
One of the defining features of twitter is timely notifications about new tweets from people you follow. Could you please describe how could it be implemented in this architecture? Likes and comments allow users to attach their content to a potentially popular tweet. How would it affect our storage layer? What challenges, if any, we would face with multi-az deployment of such system? Thank you for your time and interest in our company.
which tool are you using to draw the diagrams?
The problem right now is not about designing a workable system but a system that works smoothly without spending much $$$ on the infrastructure.
Bro how do you draw so good with the mouse
Speaking of popular users. We can separate tweet data by some follower threshold (say 10k followers) and, when popular profile post a new tweet, we only need to update that feed. Every normal profile will check that feed in case they follow popular profiles.
So...use the average Twitterer's tweets as load dampening. They should do that. It will make Twitter even less popular.
Amazing video, this has made me curious about systems design roles in industry
my IT classes coming in clutch
Looking forward to part 2!!! More in-depth
End of min 8, a relation between a follower and a followee is not the same as a relation in a relational database. So it's not a reason to think of one.
Nice video! Gotta love some systems design
Agree on the part that, the data is more on relational side. But why can't we put the tweet in any NoSql db like cassandra, scylla. As from our follow table i know which followee's tweet i have to fetch. Now that i know, i simply have to search in shards the followee's tweet stored.
The initial design seems to be missing the peak throughput. 20 billion reads per day is not as useful a metric as how many per minute or second at peak load.
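For reference, the conversion is only a couple of lines. The 20 billion figure comes from the video. The peak factor of 3 is purely an assumption for illustration, since real traffic curves would have to be measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(per_day: float) -> float:
    return per_day / SECONDS_PER_DAY


def peak_per_second(per_day: float, peak_factor: float) -> float:
    # peak_factor is the assumed ratio of peak load to average load.
    return average_per_second(per_day) * peak_factor


reads_per_day = 20_000_000_000
print(round(average_per_second(reads_per_day)))    # about 231 thousand reads/s on average
print(round(peak_per_second(reads_per_day, 3.0)))  # about 694 thousand reads/s at an assumed 3x peak
```

The per-second peak number is what actually drives how many replicas and cache nodes are needed.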
Correction 9:01 We can also implement sharding in most nosql databases.
That's correct, I meant that while NoSQL is easier to scale (automatically or by specifying a shard key), we can still scale relational DBs via sharding.
Nice catch boss
Thank you for the interesting video. However, I doubt that a relational database can store the tweets. I was just asked to design Twitter during a job interview and constructed something very similar. But I suggested using Aerospike for messages with the following schema: id -> list of messages. Aerospike scales horizontally, so there is no need to think about sharding.
How about starting from the data model rather than the architecture? And let architecture emerge from constraints on use cases? Enterprise applications 101.
What this seems to miss is that what's important about twitter is the datastructure that represents a timeline of tweets. There are systems that generate that then there are systems that inject tweets into that timeline. These can be ads or they can be other tweets that the ML systems want to promote... etc. That's the DS that is at the heart of what twitter is. The tech here should be built around that concept NOT the other way around.
isn't what you are describing the "feed" part that he's describing? I'm confused why else you need a database ordered by timeline
If the interviewer is Elon, all you need to do is remember the word “turboencabulator”.
I understood why the userId helps as a shard key, but I did not understand why choosing "tweetId" as the shard key does not help. Why do we have to query all the shards if we shard based on "tweetId"? Can someone explain, please?
don't know much about sharding, but I do have a lot of experience with *sharting*
12:48
Can you clarify what you meant by "I shouldn't be able to pass in your uid"?
Are you saying that function should actually not take uid as input?
Pretty sure he means that someone shouldn't be able to use something like Postman to send a request with someone else's user id and retrieve all of their tweets.
I have a question about 23:46, on how "updating the feed upon request instead of when a tweet is created" would work.
So would a user's feed keep getting updated via the message queue whenever there's a new tweet, except for tweets from popular accounts? And when that user requests the feed, it will somehow just fetch the missing tweets and fill them into the feed? How would that work?
Isn't that the same issue as what's described at 19:57, where 19 of your 20 tweets could be cached but you'll have to go to the disk to find that one tweet?
Before returning the feed, the app server would check if the user follows any celebrity (one query to follow table). Then get the tweets of the celebrities which user follows from the cache, and inject them into the feed based on the timestamp. This approach has significant downsides like increasing latency for all users, so I believe this problem is addressed differently in real world.
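A sketch of that read-time injection, assuming both the cached feed and each celebrity's recent tweets are already sorted newest-first as `(timestamp, tweet_id)` pairs. The function name and data shapes are hypothetical.

```python
import heapq
from itertools import islice


def build_feed(cached_feed, celebrity_tweet_lists, limit=20):
    """Merge the pre-computed feed with celebrity tweets by timestamp.

    All inputs must be sorted newest-first, which lets heapq.merge
    interleave them lazily without re-sorting everything.
    """
    merged = heapq.merge(
        cached_feed,
        *celebrity_tweet_lists,
        key=lambda tweet: tweet[0],
        reverse=True,
    )
    return [tweet_id for _, tweet_id in islice(merged, limit)]


cached = [(50, "a"), (30, "b"), (10, "c")]
celebrity = [[(40, "k1"), (20, "k2")]]
print(build_feed(cached, celebrity, limit=4))
```

The merge itself is cheap. The latency cost mentioned above comes from the extra lookups (the follow-table check plus one cache read per followed celebrity) that every feed request now pays.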
That initial diss on twitter is everything 😂😂
I don't even have a Twitter account, nor did I get the real need for one.
So do the interviewers give input on what Twitter is used for?
I loved your video very much, and thanks a lot for the effort you made. These are the questions we actually face when working on the BE side.
One small question,
If someone asks you, what kind/type of architecture is this? What will be your answer?
Yes, you must be really disliking Elon Musk so much (to say it mildly ).
> Who is most popular on Twitter? Kim Kardashian. probably over 100 million followers .
.....
--
Putting the subject aside - you made a good content - thank you!
People watch netflix, I watch neetcode.
It's amazing how this kind of large-scale system can grow and become so complex, with so many components and “moving parts”. It's also impressive how it works with a massive number of users and petabytes of data storage. In the end, I didn't understand whether your solution uses sharding on the database or not. If it does, how do you solve the sharding-key issue? It doesn't look possible to use the “every account followed by someone” strategy, for the reasons you yourself talked about.
Is it possible to have sharding and reading replicas at the same time? And how to handle it, using many load balancers, each one after sharding for a single replicas cluster?
I was left with the same impression. I don't see how this sharding could work
It doesn't seem like there's a good general solution. What works for the average user, doesn't work well for super big users. I heard that Twitter separated users with more than a certain number of followers into a separate use case. Not sure if they ever solved it.
why do you index on follower, instead of making 2 db index on both
This is great. I loled at 0:48 .This video is neet.
omg, you're insane. thank you!
What software and device do you use for the drawing?
I don't think that sharding on UID is a good approach, I would go with sharding based on date (a shard for each month). thus making feed fetching simpler and faster. at most you will hit two shards at the same time. and maybe you only need replicas for the last 3 shards (last 3 months).
And I'm not sure about saving pre-computed feeds. is it really good practice?
so what's a batch RPC? Asking for a friend...
inserting/retrieving multiple things at once rather than separately
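To make that concrete, here is a toy store that counts round trips. `TweetStore`, `get` and `get_many` are invented names for illustration, not any real client API.

```python
class TweetStore:
    def __init__(self, tweets):
        self.tweets = tweets
        self.round_trips = 0  # counts simulated network calls

    def get(self, tweet_id):
        # One call per tweet: N tweets cost N round trips.
        self.round_trips += 1
        return self.tweets.get(tweet_id)

    def get_many(self, tweet_ids):
        # Batched call: any number of tweets cost one round trip.
        self.round_trips += 1
        return {i: self.tweets[i] for i in tweet_ids if i in self.tweets}


store = TweetStore({1: "hello", 2: "world", 3: "!"})
one_by_one = [store.get(i) for i in (1, 2, 3)]
print(store.round_trips)

store.round_trips = 0
batched = store.get_many([1, 2, 3])
print(store.round_trips)
```

Same data either way, but the batched version pays the per-request overhead once instead of three times.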
This is great! Thank you!
Too bad that you didn't mention how the likes system works. I was always curious how it is done that someone clicks like and immediately gets the total amount.
All followers are created equal. Some followers are more equal than others #AnimalFarmTwitter
Where do those caches live? Are they separate servers? Or are we caching on the app servers?
Twitter uses redis. Separate servers, sharded by tweet ID, with read replicas
Nice Video, how about design of an online bank ?
which hardware you use for writing?
we need more of these for sure
I'm confused about how pub/sub works. Can anyone explain to me what it's supposed to do? If you can explain like I'm five, that would be great! THX
thank uuuuuuu can you please upload more videos on system design and object oriented design. I know you might be busyy but would mean a LOTT!!!!