It's worth clarifying that with proper sharding, indexing, and powerful enough machines (which a company with a billion users can afford), a single DB lookup can be done in sub-millisecond time. So the lookup itself is not expensive. The real reason a better solution like an in-memory cache is necessary is that the number of simultaneous lookups (i.e., the number of users trying to sign up) is huge in such a scenario, so even sub-millisecond per-query lookups are not enough on their own.
Databases have in-memory caches.
I was thinking the same: a binary search would take about 50 steps, i.e., O(log n), to search a database of a quadrillion rows, so why is any of this needed? But simultaneous lookups are a valid reason to do so...
What about network delay, @@Cassp0nk?
What would your shard key be? We are trying to find whether any user exists with a given email address.
@@ayeameen we shard a DB, not a table. So sharding here would make sense only if the DB has just one table, or only user-related tables. Anyway, if we keep it simple and the data just keeps growing, we'd be better off with partitioning. It creates virtual sub-tables of a table, each with its own B-tree for indexing. We could also partition by first letter, maybe. Even with this model we'd have to search among tens of millions of rows instead of billions.
Unlike the countless OOP-concept videos on YouTube tech channels, I found this very, very useful because she explains a real-world use case... thank you...
Thank you so much 🙂 🙏
- Unique indexing or hashing: Standard and most effective for quick lookups.
- Sharding: Ideal for distributed systems and extremely large datasets.
- Bloom filters: Fast in-memory probabilistic checks to avoid unnecessary database lookups (see the sketch after this list).
- In-memory caching: Extremely fast for frequently queried user data.
- Partitioning: Optimizes database lookups by reducing the size of search spaces.
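To make the Bloom filter entry above concrete, here is a minimal sketch, not production code: a plain bit array probed at k positions derived from one SHA-256 digest (the sizes and double-hashing scheme are illustrative assumptions).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array with k hash probes per item."""

    def __init__(self, m_bits: int = 1 << 20, k_hashes: int = 7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Standard double-hashing trick: derive k probe positions as
        # (h1 + i*h2) mod m from the two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

emails = BloomFilter()
emails.add("alice@example.com")
print(emails.might_contain("alice@example.com"))  # True
print(emails.might_contain("bob@example.com"))    # False (or a rare false positive)
```

A "no" from the filter skips the database entirely; only a "yes" (possibly a false positive) needs the authoritative lookup.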
This was brilliant. As beginners, people don't think of advanced techniques, which is fine, but after a while in the industry you do need to look into the advanced techniques used by the top players.
Agreed! Let me know what you think about my latest video on HyperLogLog (HLL) usage in Redis.
The video itself is great but the comments are gold. Learnt so much.
Glad to hear it! Please check our other videos too 🙏
Please upload more videos of this type, where you teach how tech giants optimize their APIs. One of the best videos on YouTube; please keep it up!
Sure, pls share it in your circle too and support this channel. 🙏
This is the first time I've seen one of your videos, and I'm saying this on behalf of all viewers: you are an amazing teacher with brilliant communication and visual presentation skills...
Subscribed🎉
Wow, thank you! Means a lot to me! Please share it in your circle 🙏
Agree.
Even before watching this video I can give my thoughts on it and what I use on a daily basis.
1. Hash partitioning - generate a hash of your data (say, 1 to 100,000) and then a secondary hash from that, say 1 to 1,000. Create a composite index on both. Generate the hashes before querying and you can look the user up efficiently; your search space is reduced (sketched below this list).
2. Sharding is a good way to augment the above strategy, with, say, the hash of the hash as the shard key.
3. On top of that, a persistent cache like Redis can be very useful.
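A rough sketch of that two-level hashing idea, assuming the commenter's illustrative bucket counts (100,000 and 1,000) and a CRC32 hash; none of these choices are prescriptive.

```python
import zlib

PRIMARY_BUCKETS = 100_000   # first-level hash range, per the comment
SECONDARY_BUCKETS = 1_000   # second-level "hash of hash" range

def partition_keys(email: str) -> tuple[int, int]:
    # The primary hash narrows the dataset; the secondary hash of that
    # value narrows it again. Both go into a composite index, and the
    # secondary value could double as a shard key.
    h1 = zlib.crc32(email.encode()) % PRIMARY_BUCKETS
    h2 = zlib.crc32(str(h1).encode()) % SECONDARY_BUCKETS
    return h1, h2

h1, h2 = partition_keys("alice@example.com")
print(h1, h2)
# A hypothetical query against a composite index on (h1, h2, email):
#   SELECT 1 FROM users WHERE h1 = ? AND h2 = ? AND email = ? LIMIT 1
```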
Thanks, YouTube, for recommending this.
Thank you, Ma'am; you could very well name your channel "Tech Goldmine"!
🙂 🙏
I'll surely recommend your channel to my team at the office as well. Thanks a lot for this type of video.
My first thought is to optimize your query to cut down your dataset. I think that might be after/within step 3. Great video, it’s awesome to see what bigger companies do.
Wow, Ma'am. Not many people will watch your content because it is very niche, but please, please, please do not stop sharing the knowledge you hold. I am amazed by your knowledge, which I know is the result of many, many years of hard work in the industry. Thank you very much for making this content and sharing your knowledge for free.
Thanks 🙏
- Use a hash-based index to map user information (e.g., an in-game username) to hash values. Hash tables have an average time complexity of O(1) for lookups (tiny illustration below this list).
- Cache with something like Redis if the project's finances allow.
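A tiny illustration of that O(1) average-case behavior: Python's built-in set is itself a hash table, so membership checks don't scan the data.

```python
# Python's set is hash-based: average O(1) membership checks.
taken_usernames = {"dragonslayer", "pixel_mage", "night_owl"}

def is_username_taken(name: str) -> bool:
    return name in taken_usernames  # hash lookup, not a linear scan

print(is_username_taken("pixel_mage"))  # True
print(is_username_taken("new_player"))  # False
```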
This bloom filter stuff is ingenious.
Wow. Please upload more of such system design videos
Definitely. Please share it in your circle 🙏
Understood the concept. Keep sharing your precious knowledge with us.
Thank you, I will🙏
Thank you for providing such a clear explanation with examples of production services. Great content, keep up the amazing work, Ma'am
Much appreciated!
Starting my day with good learning through this video and comments.
Greatly explained. Just subscribed.
Awesome, thank you!
Even if you use an increment/decrement (counting) bit array, it won't solve the false-positive problem; it still depends heavily on the hash functions, which are what we should focus on more.
Have you ever looked for a word in the dictionary? A dictionary has hundreds of thousands of words, but you still take less than 30 seconds to find your word. Say it's "luck": you go to the section for L, then within it to U, then to C, then to K, and there is your word. This works only on a sorted database, and the database should be kept sorted so that newly added, deleted, and modified data stays in order and lookup time stays low. (See the binary-search sketch below.)
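The dictionary analogy is exactly binary search over sorted data; here's a quick sketch using Python's standard bisect module (the word list is made up).

```python
import bisect

# A sorted "dictionary" of words; real database indexes (B-trees)
# generalize this same idea.
words = sorted(["ladder", "lucid", "luck", "lucky", "luggage", "lunar"])

def contains(sorted_words: list[str], target: str) -> bool:
    # Binary search: O(log n) comparisons instead of scanning every word.
    i = bisect.bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

print(contains(words, "luck"))   # True
print(contains(words, "lucre"))  # False
```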
You made all that killer bit-array stuff simple, man... Thank you!!
Isn't this database partitioning in a nutshell?
Thnxxx
Yeah, that is why databases use an index: to speed up data lookup. What you explained is the process used to find data in an index-organized table. For other tables, if they are indexed and the DB thinks the index is going to be useful, the same lookup process is run on the index, and that yields the location where the data is stored. (Quick demo below.)
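You can watch a database choose the index yourself; a small demo with Python's built-in sqlite3 (the table and column names are made up for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")
conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")

# The planner reports an index search rather than a full-table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM users WHERE email = ?",
    ("alice@example.com",),
).fetchall()
print(plan)
# The detail column reads something like:
#   SEARCH users USING COVERING INDEX idx_users_email (email=?)
```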
@@anothermouth7077 Not partitioning; it's indexing.
Superb explanation!
Keep posting such tutorials.
Sure 👍 🙏
Very simple, explained in a plain, understandable way!!! Excellent explanation. Please keep more videos coming. Subscribed!!!
Thanks, will do! Please share it in your circle 🙏
Thanks for the video. It surely expands the knowledge of engineering with all the conversation going on in the comments.
Learned a nice concept and strategy today. Thank you.
Excellent explanation thanks a lot..
Glad you liked it. Please check our other videos too 🙏
Never thought this video would be this informative!!!!
Glad it was helpful! Please share it in your circle and support this channel 🙏
Excellent technical presentation. Very good
Thanks! Please check my other videos too!
Thank you for the nice and detailed explanation. A question: what is the max length of the values fed to the hash function, so in this case, what would the max email length be? Is there any rule of thumb for that?
SHA-256 always generates a fixed 256-bit (32-byte) hash, regardless of input length, so it doesn’t limit the maximum length of an email. Email length limits are typically defined by standards or application constraints - usually up to 254 characters as per the RFC guidelines.
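You can verify the fixed output size directly with Python's standard hashlib: inputs of any length map to a 32-byte digest.

```python
import hashlib

short = "a@b.co"
long = "a.very.long.address." + "x" * 200 + "@example.com"

for email in (short, long):
    digest = hashlib.sha256(email.encode()).digest()
    print(f"{len(email):3d} chars -> {len(digest)} bytes")  # always 32 bytes
```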
Good one. It could have been a YouTube Short. Good luck!
Thanks, YouTube algo. Very well explained video; subbed.
Thanks for the sub! 🙏
Wow, I really loved it. Just watched it out of curiosity and learned a lot.
Happy to hear that! Please share it in your circle 🙏
This is very informative. Thank you. Hope to see more videos like this.
instantly subscribed 🙏
system design and concepts for optimal performance
Thanks 🙏
Great, Please upload more videos on these concepts !!!!
Thank you, I will. Please share it in your circle and support this channel 🙏
Straight to the point, instant sub :)
Thanks! Do check out our other videos 🙏
@@TechCareerBytes Ya
Thanks for a great explanation.
First time watching your videos and it was very informative. Thank you for your efforts and clear explanation.
Glad it was helpful! 🙏
What an explanation! Subscribed. So basically we need Bloom filters only when we have data in the lakhs or crores, not in the thousands or hundreds, where caching can be used efficiently, right?
Also, will this Bloom filter be used for signup only, or are there other scenarios where it's beneficial?
Yes, for the data size you mentioned, caching with database sharding and indexing would be good enough. Better to check with your architect.
I have mentioned a few other scenarios in which companies and systems like Google, Facebook, and HBase use Bloom filters. Please check.
@@TechCareerBytes ok, thanks, will check that.
What an insightful video, thank you for sharing such an amazing knowledge. Subscribed!
Awesome, thank you! Please share in your circle 🙏
This was so insightful, thank you so much.
Glad it was helpful! Please share it in your circle 🙏
Excellent, Ma'am. You have explained real-world scenarios, and I am expecting more such videos. It would be even better if you created a Spring Boot application implementing the scenarios you explained; that would be so helpful. Thank you..🙏
I will try my best.
Thanks Rupa mam
AI has two meanings.
Artificial Indian and An Instructor.
Great work ma'am!!
Thanks a lot 😊 🙏
I would say handling multiple users logging in at the same time is the only concern. Most of the time Redis will take care of this, and for a smaller application the application's own in-memory database would be sufficient.
Also, many people try to solve the problem at the application level while their own database doesn't even have indexes. A well-designed table is far more efficient than hand-rolled hash functions.
Insightful video tutorial with very good real-world examples. Thank you, Ma'am. Keep sharing your knowledge and experience.
Thanks for liking. Please share it in your circle 🙏
Very good and straight-to-the-point video. Could you also show how to implement these methods in other languages?
You have a new subscriber, ma'am. 🎉
Wow, I learned quite some new things from this! Thanks
Glad it was helpful! Please share it in your circle 🙏
To find a user in milliseconds, we need a combination of geo-location-based routing, caching, and database sharding.
Yes, if the objective is only to check whether the user exists, a Bloom filter may be the way to go.
The video was really helpful mam. Thank you for the video.
Glad it was helpful! 🙏
Thanks 4 the info, You've now got a new sub😁
Thanks for the sub! Please check my other videos too 🙏
Very informative video, thanks, Ma'am.
Thanks for liking 🙏
Explained so well!
Such a great concept in such a short video; really appreciate it. ❤
Glad you liked it! Please share it in your circle 🙏
Nice tutorial, learnt something new today, thank you so much Mam...
Glad to hear that. Please share it in your circle too! 🙏
Sharding the database also helps with query speed and performance!
It does, though it also adds complexity to querying. Database sharding definitely helps with scaling, as it distributes the load across multiple servers. However, even with sharding, a cache and Bloom filters add an extra layer of speed by reducing direct database queries, which is crucial for minimizing latency at massive scale.
This was impressive
Please post more videos like this.
Great video. I'd suggest improving the quality of the screenshots and investing in a good-quality microphone.
Working on it
Awesome explanation and an informative video, but I have a doubt: what if we add a unique constraint on the email column itself? How would the DB behave then? Would it check all the entries, the same as querying over all the records manually? Thank you.
Super Madam.
Great info 🙌🙌
Thank you 🙏 please upload video in 4K resolution if possible
Ok next time
Love your videos mam. Can you please make a video on sorted sets data structure?
Great. Useful.
Nice video 😊
amazing video thanks for sharing
Glad you liked it! 🙏
Great lesson! keep it up ! Thanks! :)
Thanks! Please share it in your circle and support this channel 🙏
What a way to explain💐📈
Thanks 🙏
Helpful 🙏🏼
Wow, unique video, great content, nice explanation. Thank you so much, Madam. Please make more unique videos like this. Subscribed... ♥
Thanks and welcome 🙏
Just awesome 💯💯
I appreciate the amazing knowledge you shared, but please buy a good-quality mic; your audio should be cleaner.
Noted. Thanks.
Keep it up ma'am 👏
Thank you, I will. Please share it in your circle.
@@TechCareerBytes sure ma'am
When should we use consistent hashing? Please explain with real use cases. Thanks, Rupa.
Sure. You can check my videos on data partition and data replication. They cover consistent hashing.
Really helpful
The Bloom filter is quite interesting. However, if we combine those three, don't we inherit the downsides of each? Imagine we use a Bloom filter for its low memory footprint and Redis for the other validations; doesn't Redis still need to store those records? And how can we query the database for the other validations with fast responses?
Great question! Combining Bloom filters, Redis, and database queries balances trade-offs. Bloom filters reduce unnecessary queries, while Redis stores frequently accessed data for faster lookups. Redis doesn't need to store all records, just recent or frequently used ones. For database queries, we rely on sharding and indexing to maintain speed, with Redis acting as a buffer to reduce load.
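A sketch of that layered flow under stated assumptions: the third-party redis-py client with a locally running Redis server, a Bloom filter object exposing might_contain (like the sketch earlier in the thread), and a check_db callback standing in for whatever sharded, indexed query the backend actually runs.

```python
import redis  # assumes the redis-py package and a local Redis server

r = redis.Redis()
CACHE_TTL_SECONDS = 3600  # cache only recent lookups, not all records

def email_exists(email: str, bloom, check_db) -> bool:
    # Layer 1: Bloom filter. A "no" is definitive, so most brand-new
    # signups never touch Redis or the database at all.
    if not bloom.might_contain(email):
        return False
    # Layer 2: Redis caches recent/frequent answers for a while.
    cached = r.get(f"email:{email}")
    if cached is not None:
        return cached == b"1"
    # Layer 3: the authoritative (sharded, indexed) database query.
    exists = check_db(email)
    r.setex(f"email:{email}", CACHE_TTL_SECONDS, b"1" if exists else b"0")
    return exists
```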
Incredible!!!! Spot on.
Thank you! Pls share it in your circle 🙏
The video is great, but I wish the quality of the example images were better.
Noted. Will work on it
Super Awesome video, make more like it.
I will try my best. Thanks. Please share it in your circle 🙏
great video mam....Thanks
Thanks! Please share it in your circle 🙏
Thank you
Extraordinary mam.
Thanks a lot 🙏 please share it in your circle.
Most startups (99%) don't have a billion customers. Those that do have already implemented a one-time custom solution to this problem. I don't understand the reason for such interview questions. Just do a DB query on an index.
Finally some insights.
Thank you so much for the video, Ma'am. Can you please provide the link to the code? It's not clear.
Yes, sure. Please check the description box for the link. Don't forget to share the video in your circle 🙏
Great Video
Thanks! Please share it in your circle and support this channel 🙏
Nice content Mam.
Thanks a lot. Please share it in your circle and support this channel. 🙏
❤ Nice. Please make videos on schema migration and database migration.
That's where Cassandra and HBase come in.
Please talk about sharding and other such concepts.
Please check this video - th-cam.com/video/EoHh1NMeUJM/w-d-xo.html
I don't understand how caching helps for this particular problem: if I want to check whether an email is already in use, hardly anyone else will try to check the same email around the same time (before the cache expires).
Good point! Caching helps mainly with frequently checked or popular usernames/emails. For unique queries, cache hits are rare, but it still reduces load on the database for cases where multiple users may check the same email (e.g., typos or common names). Caching shines more in scenarios with repeated access patterns, but other techniques like Bloom filters handle the uniqueness aspect efficiently.
I'd assume there would be a lot of queries checking the most common emails, like Max@mail, John@mail, etc. Of course it's not effective for very specific email IDs, and that's why I agree with the solution that combines multiple approaches.
Bloom filter
What about indexing? Wouldn't it be helpful, and the best approach?
Indexing and database sharding will definitely help. But at large scale we also need a cache and a Bloom filter to speed up the process.
One of the great videos!
Glad you think so! Pls share it in your circle 🙏
You would never put all users in one database. You would create one database for every 1M users or something like that, and then you just need a lookup to find which database the user would be in, e.g., keyed by the first 3 digits of the customer number or whatever (sketched below).
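A sketch of that routing idea; the naming scheme and digit count are hypothetical.

```python
def pick_database(customer_number: str) -> str:
    # Hypothetical scheme: the first 3 digits of the customer number select
    # one of up to 1000 databases, each holding on the order of 1M users.
    return f"users_db_{customer_number[:3]}"

print(pick_database("0421998877"))  # users_db_042
```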
Really awesome.. (y)
Nice