It's worth clarifying that with proper sharding, indexing and powerful enough machines (which a company with a billion users can afford) a single db lookup can be done in sub-milliseconds. So it's not expensive. The actual reason why a better solution like an in-memory cache is necessary is because of the number of simultaneous lookups (i.e. number of users trying to signup) is huge in such a scenario thus making even a sub-millisecond lookup time per query infeasible.
Databases have in memory caches.
I was thinking same, that a binary search will take O(50) to search on a database of Quadrillion, so what is the need of these, but simultaneous lookups are a valid reason to do so...
What about network delay @@Cassp0nk
What will be your shard key? We are trying to find if any user exists with the email address.
@@ayeameen we shard a DB, not a table. So sharding here would make sense only if the DB has a single table or only user-related tables. Anyway, I think if we keep it simple and just grow the data, then we're better off using partitioning. It creates virtual sub-tables of a table, each with its own B-tree for indexing. And we could also partition by starting letter, maybe. Even with this model we'd have to search among dozens of millions of rows instead of billions.
Instead of the endless OOPs-concepts videos on YouTube tech channels, I found this very, very useful as she is explaining a real-world use case. Thank you.
Thank you so much 🙂 🙏
- Unique indexing or hashing: Standard and most effective for quick lookups.
- Sharding: Ideal for distributed systems and extremely large datasets.
- Bloom filters: Fast in-memory probabilistic checks to avoid unnecessary database lookups.
- In-memory caching: Extremely fast for frequently queried user data.
- Partitioning: Optimizes database lookups by reducing the size of search spaces.
Please upload more videos like this where you teach how tech giants optimize their APIs. One of the best videos on YouTube, please keep it up!
Sure, pls share it in your circle too and support this channel. 🙏
This was brilliant. As beginners, people do not think of advanced techniques, which is fine, but after a while in the industry you do need to look at these advanced techniques used by top players.
Agreed! Let me know what you think about my latest video on HyperLogLog (HLL) usage in Redis.
The video itself is great but the comments are gold. Learnt so much.
Glad to hear it! Please check our other videos too 🙏
Thanks YouTube for recommending this.
This is the first time I saw your video. I'm saying this on behalf of all viewers: you are an amazing teacher with brilliant communication and visual presentation skills...
Subscribed🎉
Wow, thank you! Means a lot to me! Please share it in your circle 🙏
Agree.
Will surely recommend your channel to my team at the office as well. Thanks a lot for this type of video.
Superb exploration!
Keep posting such tutorials.
Sure 👍 🙏
My first thought is to optimize your query to cut down your dataset. I think that might be after/within step 3. Great video, it’s awesome to see what bigger companies do.
Understood the concept. Keep sharing your precious knowledge with us.
Thank you, I will🙏
This bloom filter stuff is ingenious.
Even before watching this video I can give my thoughts on it and what I use on a daily basis.
1. Hash partitioning - you generate a hash of your data (say 1 to 100000) and then a secondary hash from that, say 1 to 1000. Create a composite index on both. Generate the hashes before querying and look up the user efficiently; your data set is reduced.
2. Sharding is a good way to augment the above strategy, with, let's say, the hash-of-the-hash as the shard key.
3. On top of that, persistent caching like Redis can be very useful.
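The two-level hashing described above can be sketched roughly like this. A Python sketch; the bucket sizes, SHA-256 as the hash, and the email examples are all illustrative assumptions, not the commenter's exact scheme:

```python
import hashlib

def buckets(email, primary=100_000, secondary=1_000):
    """Derive a primary bucket (1..primary) from the email and a
    secondary bucket (1..secondary) as a hash of that primary hash,
    per the "hash of the hash" idea above."""
    h1 = int(hashlib.sha256(email.lower().encode()).hexdigest(), 16)
    p = h1 % primary + 1
    h2 = int(hashlib.sha256(str(p).encode()).hexdigest(), 16)
    s = h2 % secondary + 1
    return p, s

p, s = buckets("alice@example.com")
```

A query would then filter on the composite index `(p, s)` plus the email itself, so only a small fraction of rows is ever scanned, and `s` can double as a shard key.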
Wow. Please upload more of such system design videos
Definitely. Please share it in your circle 🙏
- Use a hash-based index to map user information, e.g., the in-game username, to hash values. Hash tables have an average time complexity of O(1) for lookups.
- Cache with something like Redis if the project's finances allow.
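The O(1) hash-lookup idea in this comment can be illustrated with a tiny stand-in. The usernames are made up; a real system would use a DB hash index or a Redis set:

```python
# Toy stand-in for a hash index: a Python set gives average O(1) membership.
taken_usernames = {"neo", "trinity", "morpheus"}

def is_taken(username):
    # One hash computation plus a bucket probe, independent of table size.
    return username.lower() in taken_usernames

print(is_taken("Neo"))    # True
print(is_taken("smith"))  # False
```

The cost of the check does not grow with the number of users, which is exactly why hash-based structures suit this problem.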
Never thought this video would be this informative!!!!
Glad it was helpful! Please share it in your circle and support this channel 🙏
Even if you use an increment/decrement bit array (a counting Bloom filter), it won't solve the false-positive problem; the filter still relies heavily on its hash functions, and that is what we should focus on.
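For context, the increment/decrement idea is a counting Bloom filter. A minimal sketch, with the array size, number of hashes, and SHA-256 derivation all as illustrative assumptions: counters make deletes possible, but false positives remain a function of sizing and hash quality, as the comment says.

```python
import hashlib

class CountingBloom:
    """Bit array replaced by counters so deletions are possible;
    false positives still occur and depend on m, k and the hashes."""
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _positions(self, item):
        # k index positions derived from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for j in self._positions(item):
            self.counts[j] += 1

    def remove(self, item):
        for j in self._positions(item):
            if self.counts[j] > 0:
                self.counts[j] -= 1

    def might_contain(self, item):
        # True means "maybe present"; False means "definitely absent"
        return all(self.counts[j] > 0 for j in self._positions(item))
```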
Thanks for the video. It surely expands the knowledge of engineering with all the conversation going on in the comments.
Learned a nice concept and strategy today. Thank you.
Starting my day with good learning through this video and comments.
Greatly explained. Just subscribed.
Awesome, thank you!
Straight to the point , instant sub :)
Thanks! Do check out our other videos 🙏
@@TechCareerBytes Ya
Excellent technical presentation. Very good
Thanks! Please check my other videos too!
Thanks YouTube algo, very well explained video, subbed
Thanks for the sub! 🙏
Thank you Mam, you can very well name your channel as "Tech Goldmine"!
🙂 🙏
This is very informative. Thank you. Hope to see more videos like this.
Excellent explanation thanks a lot..
Glad you liked it. Please check our other videos too 🙏
Very simple and in plain, understandable way!!! Excellent explanation. Please keep more videos coming. Subscribed!!!
Thanks, will do! Please share it in your circle 🙏
Thanks for a great explanation.
instantly subscribed 🙏
system design and concepts for optimal performance
Thanks 🙏
Have you ever looked for a word in the dictionary?
A dictionary has hundreds of thousands of words, but you still take less than 30 seconds to find your word. Let's say it's "luck":
you go to the section for L,
then narrow to LU,
then to LUC,
then to LUCK,
and then you find your word.
This works only on sorted data, and the database should be re-sorted periodically so newly added, deleted and modified data stays in order, keeping lookup times low.
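The lookup described in this comment is essentially binary search over sorted data, halving the search space at each step. A minimal sketch; the word list is made up:

```python
def find_word(sorted_words, target):
    """Dictionary-style lookup on sorted data: O(log n) comparisons."""
    lo, hi = 0, len(sorted_words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_words[mid] == target:
            return mid
        if sorted_words[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1  # not found

words = sorted(["luck", "apple", "zebra", "mango"])
print(find_word(words, "luck"))  # 1 (its index in the sorted list)
```

With a billion sorted entries this takes about 30 comparisons, which is why sorted structures (and the B-tree indexes built on them) scale so well.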
You made that whole killer bit process simple, man... Thank you!!
Isn't this the database partition in the nutshell?
Thnxxx
Yeah, that is why databases use an index, to speed up data lookup.
What you explained is the process used to find data in an index-organized table.
For other tables, if they are indexed and the DB thinks the index is going to be useful, the same lookup process is used on the index, and that yields the location where the data is stored.
@@anothermouth7077 not partition. It's indexing
Thank you for providing such a clear explanation with examples of production services. Great content, keep up the amazing work, Ma'am
Much appreciated!
You have a new subscriber, ma'am. 🎉
Thanks 4 the info, You've now got a new sub😁
Thanks for the sub! Please check my other videos too 🙏
Wow, I learned quite some new things from this! Thanks
Glad it was helpful! Please share it in your circle 🙏
This was so insightful, thank you so much.
Glad it was helpful! Please share it in your circle 🙏
wow , I really loved it just watched it out of curiosity and learned a lot
Happy to hear that! Please share it in your circle 🙏
What an insightful video, thank you for sharing such an amazing knowledge. Subscribed!
Awesome, thank you! Please share in your circle 🙏
The video was really helpful mam. Thank you for the video.
Glad it was helpful! 🙏
very knowledgeable video, thanks mam.
Thanks for liking 🙏
Great info 🙌🙌
Good one. It could have been a YouTube Short. Good luck.
Such concept in this short video, really appreciate it. ❤
Glad you liked it!. Please share it in your circle 🙏
What a way to explain💐📈
Thanks 🙏
Just awesome 💯💯
Great work ma'am!!
Thanks a lot 😊 🙏
I don't understand how caching will help for this particular problem. If I want to check whether an email is already in use, hardly anyone else will check the same email around the same time (before the cache entry expires).
Good point! Caching helps mainly with frequently checked or popular usernames/emails. For unique queries, cache hits are rare, but it still reduces load on the database for cases where multiple users may check the same email (e.g., typos or common names). Caching shines more in scenarios with repeated access patterns, but other techniques like Bloom filters handle the uniqueness aspect efficiently.
I'd assume there would be lot of queries to check most common emails like
Max@mail
John@mail etc
Of course it's not effective for very specific email IDs, and that's why I agree with the solution that combines multiple approaches.
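The cache-aside pattern this thread is discussing can be sketched as follows. The emails are made up and a plain dict stands in for Redis; note that cached negative answers must be invalidated once a signup actually completes:

```python
cache = {}                             # stand-in for Redis
users_db = {"john@mail", "max@mail"}   # stand-in for the users table
db_lookups = 0

def email_taken(email):
    """Cache-aside check: repeated queries for the same (popular or
    retried) email skip the database entirely."""
    global db_lookups
    email = email.lower()
    if email in cache:
        return cache[email]
    db_lookups += 1
    result = email in users_db
    cache[email] = result  # cache negative answers too; invalidate on signup
    return result

email_taken("john@mail")  # first check hits the DB
email_taken("john@mail")  # second check is served from the cache
print(db_lookups)         # 1
```

As the reply notes, this only pays off for repeated access patterns; for truly unique emails the Bloom filter is what absorbs the load.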
Explained so well!
Insightful video tutorial with very good real-world examples. Thank you, Ma'am. Keep sharing your knowledge and experience.
Thanks for liking. Please share it in your circle 🙏
Great, Please upload more videos on these concepts !!!!
Thank you, I will. Please share it in your circle and support this channel 🙏
Great lesson! keep it up ! Thanks! :)
Thanks! Please share it in your circle and support this channel 🙏
amazing video thanks for sharing
Glad you liked it! 🙏
Wow, Ma'am. Not many people will watch your content because it is very niche, but please, please, please do not stop sharing the knowledge you hold. I am amazed by your knowledge, which I know is the result of hard work in the industry over many, many years. Thank you very much for making this content and sharing your knowledge for free.
Thanks 🙏
Nice tutorial, learnt something new today, thank you so much Mam...
Glad to hear that. Please share it in your circle too! 🙏
Helpful 🙏🏼
First time watching your videos and it was very informative. Thank you for your efforts and clear explanation.
Glad it was helpful! 🙏
Very good and straight-to-the-point video. Also, how do we implement these methods in other languages?
Quite interested in the Bloom filter. However, if we combine those three, don't we also get the downsides of each? Imagine we use the Bloom filter for its low memory footprint and Redis for the other validations: doesn't Redis still need to store those records? And how can we query the database for the remaining validations with fast responses?
Great question! Combining Bloom filters, Redis, and database queries balances trade-offs. Bloom filters reduce unnecessary queries, while Redis stores frequently accessed data for faster lookups. Redis doesn't need to store all records, just recent or frequently used ones. For database queries, we rely on sharding and indexing to maintain speed, with Redis acting as a buffer to reduce load.
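To the reply's point, the three layers compose without Redis having to hold everything. A sketch, where the filter sizing, SHA-256 derivation, and plain-dict "Redis" are all illustrative assumptions:

```python
import hashlib

M, K = 1024, 3                 # Bloom filter sizing, purely illustrative
bloom = [False] * M
cache = {}                     # stand-in for Redis: holds only queried entries
users_db = set()               # stand-in for the sharded, indexed database

def _positions(item):
    for i in range(K):
        yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M

def record_signup(email):
    users_db.add(email)
    for j in _positions(email):
        bloom[j] = True

def exists(email):
    # Layer 1: Bloom filter answers "definitely not" for most fresh emails
    if not all(bloom[j] for j in _positions(email)):
        return False
    # Layer 2: cache serves repeated checks without touching the DB
    if email in cache:
        return cache[email]
    # Layer 3: authoritative database lookup, reached only on a Bloom "maybe"
    found = email in users_db
    cache[email] = found
    return found

record_signup("alice@example.com")
```

Most brand-new emails are rejected at layer 1 with zero I/O; the cache only ever accumulates entries that were actually queried, which is why combining the three does not simply sum their costs.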
Thank you for the nice and detailed explanation. Have a question - what is the max length of values use in hash function so in this case what would be max email length? Is there any thumb rule for that
SHA-256 always generates a fixed 256-bit (32-byte) hash, regardless of input length, so it doesn’t limit the maximum length of an email. Email length limits are typically defined by standards or application constraints - usually up to 254 characters as per the RFC guidelines.
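The fixed-width property from this reply is easy to verify; the sample emails below are made up, with the long one sized near the 254-character RFC limit:

```python
import hashlib

short_email = "a@b.co"
long_email = "x" * 240 + "@example.com"  # near the 254-char RFC limit

for email in (short_email, long_email):
    digest = hashlib.sha256(email.encode()).digest()
    print(len(digest))  # 32 -- always 32 bytes, whatever the input length
```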
Super Awesome video, make more like it.
I will try my best. Thanks. Please share it in your circle 🙏
I would say handling multiple users logging in at the same time is the only concern. Most of the time Redis will take care of this, and for a smaller application the application's own in-memory cache would be sufficient.
Also, many people try to solve the problem at the application level while their own database doesn't even have indexes. A well-designed table is far more efficient than creating hash functions.
AI has two meanings.
Artificial Indian and An Instructor.
Nice video 😊
Great. Useful.
Sharding the database also helps in querying speed & performance!
It does. It also adds complexity to querying. Database sharding definitely helps with scaling, as it distributes the load across multiple servers. However, even with sharding, a cache and Bloom filters add an extra layer of speed by reducing direct database queries, which is crucial for minimizing latency at massive scale.
Incredible !!!! spot on
Thank you! Pls share it in your circle 🙏
Finally some insights.
What an explanation, subscribed! So basically we will need these Bloom filters only when we have data in the lakhs or crores, not in the thousands or hundreds, where caching can be used efficiently, right?
Also, will this Bloom filter be used for signup only, or are there other scenarios where it's beneficial?
Yes, for the data size you mentioned, caching with database sharding and indexing would be good enough. Better to check with your architect.
I have mentioned a few other scenarios in which companies like Google and Facebook, and systems like HBase, use Bloom filters. Please check.
@@TechCareerBytes ok, thanks, will check that.
Great video. I would suggest improving the quality of the screenshots and also investing in a good-quality microphone.
Working on it
Excellent, Ma'am, you have explained real-world scenarios. I am expecting more such videos. It would be even better if you created a Spring Boot application and implemented the scenarios you explained; it would be so helpful. Thank you..🙏
I will try my best.
This was impressive
Wow, unique video, great content, nice explanation. Thank you so much, madam. Please make more unique videos like this. Subscribed.... ♥
Thanks and welcome 🙏
Thanks and welcome
Keep it up ma'am 👏
Thank you, I will. Please share it in your circle.
@@TechCareerBytes sure ma'am
Love your videos mam. Can you please make a video on sorted sets data structure?
The code in the snippets is not clearly visible; the fonts are blurry.
Please check the description for the link to the code.
Thank you 🙏 please upload video in 4K resolution if possible
Ok next time
Please post more videos like this
great video mam....Thanks
Thanks! Please share it in your circle 🙏
What about indexing? Will it not be helpful, and the best approach?
Indexing and database sharding will definitely help. But, at large scale we also need a cache and bloom filter to speed up the process.
Awesome explanation and informative video, but I have a doubt: what if we add a unique constraint on the email column itself? How would the DB behave then? Will it check all the entries, and would that be the same as querying over all the records manually? Thank you.
To find a user in milliseconds, we need a combination of geo-location-based routing, caching, and database sharding.
And yes, if the objective is only to check whether the user exists, a Bloom filter may be the way to go.
❤ Nice. Please make a video on schema migration and database migration.
Thanks Rupa mam
When to use consistent hashing ? Please explain with real use cases. Thanks Ruba
Sure. You can check my videos on data partition and data replication. They cover consistent hashing.
What about sharding the database?
Database sharding definitely helps with scaling, as it distributes the load across multiple servers. However, even with sharding, cache and Bloom filters add an extra layer of speed by reducing direct database queries, which is crucial for minimizing latency at a massive scale.
Please tell about sharding and other concepts
Please check this video - th-cam.com/video/EoHh1NMeUJM/w-d-xo.html
Can you make a video on my question: I have 10 billion mobile numbers, and 1000 of them need to be deleted. What is an efficient way to do this in Java?
Let me try.
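One common pattern for the question above: hold the small delete list in a hash set and make a single streaming pass over the big dataset. Sketched in Python; the comment asked about Java, where the same idea is a `java.util.HashSet` plus a stream over the file, or in SQL a `DELETE ... WHERE number IN (...)` against an indexed column. The numbers are made up:

```python
# Small delete list held as a hash set -> O(1) membership test per record.
to_delete = {"9990001111", "9990002222"}

def purge(numbers):
    """Single streaming pass over the big dataset; never loads it whole,
    so memory stays proportional to the 1000-entry delete list."""
    for number in numbers:
        if number not in to_delete:
            yield number

kept = list(purge(["1112223333", "9990001111", "4445556666"]))
print(kept)  # ['1112223333', '4445556666']
```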
Really helpful
If the data is not too big, but still around 10-15M, can we just apply a cache plus an index on the table? Please guide us.
For 10-15 million users, an index and cache would likely work well for performance, especially with a well-tuned database. Please check with your architect to understand the future data growth and scale expected.
Caching in a caching DB is the same as the direct DB query approach.
I appreciate the amazing knowledge shared, but please buy a good-quality mic; your audio should be cleaner.
Noted. Thanks.
Thank you so much for the video ma'am. Can you please provide the link to the code? Its not clear.
Yes, sure. Please check the description box for the link. Don't forget to share the video in your circle 🙏
One of the great videos.
Glad you think so! Pls share it in your circle 🙏
Thank you
Great Video
Thanks! Please share it in your circle and support this channel 🙏
Nice content Mam.
Thanks a lot. Please share it in your circle and support this channel. 🙏
Extraordinary mam.
Thanks a lot 🙏 please share it in your circle.
The video is great, but I wish you'd improve the quality of the images provided as examples.
Noted. Will work on it
Most startups (99%) don't have a billion customers. Those that do have already implemented a one-time custom solution to this problem. I don't understand the reason for such interview questions. Just do a DB query on an index.
Caching won't help, I guess, because every time a new user comes and enters their email ID, there's a high chance that data is not in the cache.
We can use binary search with memoization...
Database queries on indexed columns effectively do a binary-search-style lookup through the index's B-tree by default.