8: Design a Web Crawler | Systems Design Interview Questions With Ex-Google SWE

Comments • 84

  • @bchandra72
    @bchandra72 4 months ago +10

    The videos were highly beneficial for my FAANG system design interview. I purchased several paid system design courses, yet your videos surpassed them all.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      That's really nice to hear, thanks a lot!

  • @dare2expressmonicakothari415
    @dare2expressmonicakothari415 a month ago +2

    Excellent video and explanation. Every trade-off that you speak about is an invaluable nugget and made it super easy to understand why we should choose a particular strategy in a design. One of the best channels on system design!! Keep up the great work!

  • @panhejia
    @panhejia 4 months ago +3

    Best discussion of a crawler that I have seen. You abstract the whole HLD into the frontier, content deduper, and URL deduper, and talk about each piece in detail: frontier strategies, content-deduper DB/memcache selection, URL deduping and partitioning. You even touched on the failure modes at the end. Nice job!

  • @reedomrahman8799
    @reedomrahman8799 5 months ago +3

    Yo, been watching your system design 2.0 vids. They have been a great supplement to DDIA.
    Have you made a video on how to do estimations in system design?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      I have not, but I do them in just about every video if there's anything to be learned there haha. I don't know that there's a cut-and-dried way to do "estimations"; I think it's sort of problem specific.

  • @maxvettel7337
    @maxvettel7337 5 months ago +5

    Good support for my preparation

  • @anuj9538
    @anuj9538 4 months ago +1

    Hi @jordanhasnolife5163
    Your content is one of the best out there in system design. I have a few doubts regarding the architecture:
    Load balancer - is it just the Kafka cluster, where the partition key is a hash of the host and each Flink instance always reads the data from that particular partition?
    Won't that cause an issue once this particular Flink instance dies?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Hey! So if our Flink node dies, another one can come up and restore its state from the S3 checkpoint. Or an existing one can take over and grab that share of the keys from that Kafka partition.
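
      For illustration, a minimal sketch of the host-keyed enqueue under discussion, assuming kafka-python and a hypothetical "urls" topic. Keying each message by hostname means Kafka's default partitioner hashes the key, so all URLs for a host land on the same partition, and therefore on the same consumer instance:

      ```python
      from urllib.parse import urlparse
      from kafka import KafkaProducer  # assumes kafka-python is installed

      producer = KafkaProducer(bootstrap_servers="localhost:9092")

      def enqueue_url(url: str) -> None:
          host = urlparse(url).netloc
          # Same key -> same partition -> same consumer, until a rebalance
          # (or a checkpoint restore) hands the partition to another node.
          producer.send("urls", key=host.encode(), value=url.encode())
      ```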

  • @JoseEGuerra
    @JoseEGuerra a month ago +1

    Great video as always Jordan! Can't thank you enough for all the value you have been providing.
    On the topic of having nodes working at full capacity... Hashing by hostname evenly distributes URLs across partitions, but some URLs may require more work than others to fetch. This can be amplified if larger hostnames end up in the same partition. What would be your strategy so that, for the current URLs in the queue, we don't end up with idle nodes and nodes having more work than they can handle? Is there a feasible solution, considering you are keeping the same number of servers with similar capacity and still want to finish the task in one week?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago +1

      Yeah, this is definitely a problem for sure! I think in such a situation, you'd want a couple of nodes continuously pinging the Flink nodes to see how backed up their queues are, and potentially modifying hashing configurations in ZooKeeper accordingly. This certainly isn't easy, as it requires querying some state from the backed-up node to get it onto the idle node in order to run there!

  • @LeoLeo-nx5gi
    @LeoLeo-nx5gi 5 months ago +1

    Hi Jordan, awesome in-depth explanation, thanks!!

  • @canijustgetanamealre
    @canijustgetanamealre 3 months ago +2

    These are great breakdowns. I much prefer your explanations to what Grokking has.

  • @erictsai1592
    @erictsai1592 5 months ago +3

    Hey, great videos as always! Just wondering if you will cover one popular design question some time: ads aggregation in near real time plus batch processing?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      Can you elaborate on this question? I'm not entirely sure what we're trying to design here.

    • @erictsai1592
      @erictsai1592 5 months ago

      @@jordanhasnolife5163 Sure! Given the requirements below, design an ad-click event aggregation system for online advertising, so that campaign managers can adjust bidding strategies, control the budget, change targeted audience groups, etc.
      - Events are sent to the distributed message queue.
      - Facebook or Google scale.
      - Design for as much granularity as possible, with a max of 1 min.
      - Events can arrive later than expected.
      - Need to view both real time and archival information.
      - Two query requirements: 1. Return the number of click events in the last M minutes for a particular advertisement. 2. Return the top N ads clicked in the last M minutes.
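
      A minimal sketch of the core aggregation such a system needs, assuming in-memory state and event-time bucketing (all names hypothetical). Bucketing on event time rather than arrival time is what lets late events still land in the correct minute:

      ```python
      from collections import defaultdict

      counts = defaultdict(int)  # (ad_id, minute_bucket) -> click count

      def on_click(ad_id: str, event_ts: float) -> None:
          minute = int(event_ts // 60)  # event-time bucket, not arrival time
          counts[(ad_id, minute)] += 1

      def clicks_last_m_minutes(ad_id: str, now_ts: float, m: int) -> int:
          current = int(now_ts // 60)
          return sum(counts.get((ad_id, b), 0) for b in range(current - m + 1, current + 1))
      ```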

  • @KannanMavila
    @KannanMavila a month ago +1

    Great content, as always! Thank you Jordan :)
    One small piece of feedback: it would be nice to maintain a crawl depth with every URL in the queue to avoid crawler traps (infinite loops).
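
    A minimal sketch of that suggestion, with a hypothetical MAX_DEPTH cutoff: carry a depth alongside each URL and refuse to enqueue past the limit.

    ```python
    MAX_DEPTH = 20  # hypothetical cutoff; tune to the crawl's needs

    def enqueue_links(parent_depth: int, links: list[str], queue: list) -> None:
        if parent_depth + 1 > MAX_DEPTH:
            return  # likely a crawler trap; stop following this chain
        for link in links:
            queue.append((link, parent_depth + 1))
    ```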

  • @WINDSORONFIRE
    @WINDSORONFIRE a day ago +1

    Your video is the best for sure. Two questions: wouldn't the load balancer become a single point of failure? Also, I don't see any kind of mechanism to prevent crawling a particular website too fast.

    • @jordanhasnolife5163
      @jordanhasnolife5163  a day ago +1

      Yep - didn't draw it in the diagram but this is why we use an active-active or active-passive configuration for fault tolerance with load balancers. See my load balancing concept video for reference.

    • @WINDSORONFIRE
      @WINDSORONFIRE a day ago

      @@jordanhasnolife5163 Thanks. How much would you charge, if you were willing that is, to do a video call to review one of my designs? 1 hr? I'm not rich and I'm sure you're super busy, but I thought I'd ask.

  • @meenalgoyal8933
    @meenalgoyal8933 3 months ago +2

    Thank you for the video! :D
    For document content dedup, I was thinking of another option: maybe have the Redis cluster partitioned by document location and still use a single leader. This way the cached checksums for document content from a given geographic location will stay close to that location, so more and faster cache hits. And in the background these cache partitions can sync up. What are your thoughts?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      I think the main thing to note here is that websites with different URLs can have the same content

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      So location-based partitioning doesn't necessarily help us
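
      For illustration, one way the content-dedup check could look, sketched with redis-py (the key naming is made up). SET with NX is atomic, so two crawlers fetching identical content from different URLs can't both claim it as new:

      ```python
      import hashlib

      import redis  # assumes redis-py is installed

      r = redis.Redis(host="localhost", port=6379)

      def is_new_content(body: bytes) -> bool:
          digest = hashlib.sha256(body).hexdigest()
          # True only if this checksum was not already present anywhere.
          return bool(r.set(f"content:{digest}", 1, nx=True))
      ```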

  • @KarthikVasudevan27
    @KarthikVasudevan27 29 days ago +1

    Hey Jordan, thanks for the content. Very useful. Got a question, buddy: is there a need for the LB? Can't we just direct the messages to the respective Kafka topics ourselves, with a local hash function determining the node and topic?

    • @jordanhasnolife5163
      @jordanhasnolife5163  28 days ago

      You can, we just then risk inconsistency in the state of each publisher with respect to which partitions handle which keys.
      A single load balancer makes it so that this isn't a problem, which can be useful for Kafka: since pretty much any of our servers can publish there, they'd otherwise all have to listen to ZooKeeper.

  • @KENTOSI
    @KENTOSI 5 months ago +3

    Hey Jordan, this was an excellent coverage of this interview question. I was asked this once and it didn't even occur to me to think about the robots.txt file. Nice work!

  • @IiAzureI
    @IiAzureI 4 months ago +2

    Is Kafka your load balancer? I don't see how Kafka is doing anything if it's completely local to each node. Doesn't the LB have to do the partitioning of the URLs? (Yeah, I don't understand Kafka; reading more now.)

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Kafka is itself a distributed system with many nodes. There's typically a ZooKeeper and a load balancer, which don't have to live on the same nodes as the queues.

  • @Hollowpoint321
    @Hollowpoint321 2 months ago +1

    Hi Jordan - just wondering how you would modify this system to ensure stuff is re-crawled with approximate accuracy? E.g. in your intro you said you'd be aiming to complete this within 1 week - if your crawler is running persistently, how would you enqueue sites to be re-crawled a week later, given that there is no message-visibility equivalent in Kafka, versus a traditional message broker?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Can you elaborate on this question? You'd typically just enqueue the same seed websites that you enqueued into Kafka before (maybe we add a date_id or a crawl_id to show that this is a different iteration of the crawl).
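
      A sketch of that idea, reusing a Kafka producer like the one above and a hypothetical seed list; tagging each message with a crawl_id scopes dedup state to a single iteration:

      ```python
      import json
      import time

      SEEDS = ["https://example.com"]  # placeholder seed list

      def start_crawl(producer) -> None:
          crawl_id = time.strftime("%Y-%m-%d")  # e.g. one iteration per week
          for url in SEEDS:
              msg = json.dumps({"url": url, "crawl_id": crawl_id})
              producer.send("urls", value=msg.encode())
      ```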

  • @ahmedkhaled7960
    @ahmedkhaled7960 4 months ago +1

    If the content of the web pages that we are crawling is text, I think we can just use a relational database, put a UNIQUE index over the content column, and call it a day.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago +1

      A couple of things:
      1) It's a lot of text, and comparing text for equality can only be done in O(n) time relative to the length of the text (to be fair, it takes the same amount of time to generate all of those hashes, but at least then we don't have to store the full documents).
      2) To actually do the indexing and ordering over bigger pieces of text, we need to compare pieces of text against one another, which takes longer when they are full documents.
      3) More data to send over the network.

  • @reddy5095
    @reddy5095 3 months ago +1

    Shouldn't we use Cassandra to store the 1 MB files? HDFS is usually for large files, ~128 MB or more.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      I think that's fair, or alternatively every 128 URL crawls you can aggregate the results into a file and upload to HDFS. Either way works.
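
      A minimal sketch of that aggregation approach, with upload_to_hdfs standing in for whatever HDFS client would actually be used: buffer fetched pages until roughly one HDFS block is accumulated, then flush them as a single file.

      ```python
      BLOCK_BYTES = 128 * 1024 * 1024  # ~one default HDFS block

      class PageBuffer:
          def __init__(self, upload_to_hdfs):
              self.upload = upload_to_hdfs  # injected stand-in client
              self.pages, self.size = [], 0

          def add(self, page: bytes) -> None:
              self.pages.append(page)
              self.size += len(page)
              if self.size >= BLOCK_BYTES:
                  self.upload(b"".join(self.pages))  # one ~128 MB file
                  self.pages, self.size = [], 0
      ```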

  • @user-bq1iu1sl7h
    @user-bq1iu1sl7h 3 months ago +1

    Hey Jordan, great and in-depth content. Just one quick suggestion: can you please apply a framework to every problem? Like, solve problems using the same framework so that we can draw out a pattern - something like functional req -> non-functional req -> capacity estimation -> data model -> high level -> deep dive -> tradeoffs/future, etc. Also, if you can explain using images and design flows more, it sticks better. I know this is extra work but it will really help us. Your content is really useful but a little verbose. Anyhow, it's really helpful and free, so no complaints :)

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +1

      Hey I appreciate it! I try to generally follow this format actually, but sometimes I feel that it can be too rigid. For example, if I talk about one component, I may have to talk about other components before it makes sense to discuss tradeoffs.
      Thanks for the feedback, will try to take this into account.

    • @user-bq1iu1sl7h
      @user-bq1iu1sl7h 3 months ago +1

      @@jordanhasnolife5163 Will it be okay for you to share your notes that you use during these videos?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      @@user-bq1iu1sl7h Yep! Going to do so once I finish up with the series

  • @bogdax
    @bogdax 2 months ago +1

    If we partition data by host, how do we deal with hot partitions then? Some nodes could run forever, while others may be idle.

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      While I'd hope that no individual host is overly heavy, I believe Kafka can perform dynamic load balancing, so while each host is limited to one consumer, a consumer with far fewer messages gets assigned more hosts.

  • @0xhhhhff
    @0xhhhhff 5 months ago +5

    Chaddest system designer I know

  • @priteshacharya
    @priteshacharya a month ago +1

    On your final design, can you explain the path for writing to S3? The diagram shows writes going from Flink -> LB -> S3, but shouldn't Flink be able to write to S3 directly? Why would it go through the LB?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago +1

      That's correct; I mainly had the load balancer there to represent how the crawling consumers would communicate with one another. I agree that we'd write straight to S3.

  • @leocambridge
    @leocambridge a month ago +1

    Could you explain how to send URLs from node to node? At 10:00 you said a DB requires a network call, but isn't node to node also a network call? If there are 10 nodes, then 1 node needs to send URLs to the other 9 nodes?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago

      Fair point.
      1) You're just putting incoming URLs into Kafka (using the built-in load balancer, partitioned on host ID), which I'd expect to be very fast since it's just writing to the end of a log.
      2) The main thing here is to avoid duplicate work, hence we want the same URLs being handled on the same node. If I can do this, I can avoid things like having to reach out to a DNS service each time I look to load a link. If we're performing load balancing *anyways*, because we want to ensure that each node has equal amounts of load, then sending certain links to certain hosts based on the URL itself allows us to cache state locally on each processor node without having to reach out to a database to see if we've already read that URL.
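
      To make point 2 concrete, a sketch of the purely local DNS cache that this host affinity enables; invalidating on errors is an assumption, since DNS mappings can change:

      ```python
      import socket

      dns_cache: dict[str, str] = {}  # host -> IP, local to this node

      def resolve(host: str) -> str:
          if host not in dns_cache:
              dns_cache[host] = socket.gethostbyname(host)  # one real lookup
          return dns_cache[host]

      def invalidate(host: str) -> None:
          dns_cache.pop(host, None)  # e.g. after connection errors
      ```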

  • @time_experiment
    @time_experiment 3 months ago +2

    Why can't we just cache the domain name mappings? Earlier in the problem we used the assumption that we would index about 1B webpages. If we assume all of those are uniquely hosted, then we can map all of the domain names to IPs with about 8 GB of memory.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +2

      DNS can change :) But yeah we basically are caching them until we get 404s

    • @time_experiment
      @time_experiment 3 months ago

      @@jordanhasnolife5163 Thanks for your videos. Love the depth.

  • @LouisDuran
    @LouisDuran a month ago +1

    I was asked this question by Amazon a couple years ago... and failed because I hadn't watched this video first. 😞

  • @seemapandey2020
    @seemapandey2020 3 months ago +1

    How are we achieving prioritisation among different URLs here?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      In this video, I don't talk about this. However, presumably, we'd need to build a custom distributed priority queue, which is really just a partitioned database index on whatever field it is that you care about. Ideally we only care about making a local index, and not a global one, though, or else that becomes the "top k" problem, which is a whole separate video.
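
      For illustration, a per-partition frontier sketched as a heap - the "local index" idea, with no global ordering attempted. It's a min-heap, so lower priority values pop first:

      ```python
      import heapq

      class LocalFrontier:
          def __init__(self):
              self._heap: list[tuple[float, str]] = []  # (priority, url)

          def push(self, priority: float, url: str) -> None:
              heapq.heappush(self._heap, (priority, url))

          def pop(self) -> str:
              return heapq.heappop(self._heap)[1]  # URL with lowest value
      ```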

  • @ramannanda
    @ramannanda 13 days ago +1

    Also, things will be I/O bound, so CPU estimates are not that important.

    • @jordanhasnolife5163
      @jordanhasnolife5163  13 days ago

      I suppose it depends on the task actually; most databases these days are CPU bound, funny enough. But yeah, hard to say empirically without doing some profiling.

  • @jshaikh3897
    @jshaikh3897 5 months ago +1

    Is this a good video to start with for total beginners in sys design...?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      I'd probably start from episode 1 of my systems design concepts 2.0 series.

  • @sauhard12
    @sauhard12 18 days ago +1

    It's not clear to me why DNS needs to be mapped to an IP. I mean, my server can make an HTTP request to the URL directly.

    • @jordanhasnolife5163
      @jordanhasnolife5163  18 days ago

      What do you think that does internally?

    • @sauhard12
      @sauhard12 18 days ago +1

      @@jordanhasnolife5163 Yeah, I understand that. But are you trying to imply that the latency would be lower if I hit the IP directly, rather than having the call resolve DNS (which internally maps to an IP) each time? And to achieve this, we cache the DNS-to-IP mapping and use the IP.

    • @jordanhasnolife5163
      @jordanhasnolife5163  18 days ago

      @@sauhard12 As far as I understand it, yes, but let me know if I'm incorrect.

  • @maxmanzhos8411
    @maxmanzhos8411 3 months ago +1

    GOAT

  • @yrfvnihfcvhjikfjn
    @yrfvnihfcvhjikfjn 5 months ago +1

    Are you using a bath towel in your kitchen?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      Nope I just have a bunch of hand towels lol

  • @GeorgeDicu-hs5yp
    @GeorgeDicu-hs5yp 4 months ago +1

    You don't talk about costs in any of these videos. Cost is a very important aspect which can redefine the entire solution. But good video; you bring new dimensions to my thought process.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Duplicate comment, responded to the last one lol

  • @mridulsetia4974
    @mridulsetia4974 5 months ago +1

    Great content, just one suggestion: please type out notes instead of writing them by hand.

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago +1

      I think I'm in a bit too deep for this one but I appreciate the feedback!

    • @AlbertLeng
      @AlbertLeng 4 months ago

      At around 3:12, how do you determine 500 threads are needed?

    • @AR-jx3cs
      @AR-jx3cs 2 months ago

      @@AlbertLeng It's based on the assumption that an average web-page takes 1/3 of a second to load.
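
      For what it's worth, that figure is consistent with a Little's-law estimate, assuming the video's stated target of roughly 1B pages crawled in one week:

      ```python
      pages = 1_000_000_000
      seconds_per_week = 7 * 24 * 3600       # 604,800 s
      throughput = pages / seconds_per_week  # ~1,650 fetches/sec needed
      latency = 1 / 3                        # assumed ~0.33 s per page load
      threads = throughput * latency         # Little's law: ~550, i.e. ~500
      ```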

  • @AP-eh6gr
    @AP-eh6gr 3 months ago +1

    my IQ shot up to 145 watching this content 😏

    • @zen5882
      @zen5882 3 months ago

      something else of mine shot up

  • @joealtona2532
    @joealtona2532 4 months ago +1

    Your choice of Flink is questionable: it's overkill for simple URI fetching and not really practical for JavaScript-driven websites. The crawler worker must be more sophisticated and carry a sort of headless browser, Python libs like Beautiful Soup, etc. Wrapping it into a Docker image and running it on Kubernetes or another container orchestrator might work better.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago +1

      I don't necessarily see why the two can't be used in conjunction, to be honest. You make a good point about needing certain libraries, but even if we don't run those directly within Flink, I think that having a stateful consumer that guarantees message replay still makes it a viable job orchestrator for a bunch of Docker images deployed on Kubernetes (as you mentioned).
      Appreciate the comment!

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago +1

      From my understanding of actual web-crawling-as-a-service companies, much of the work is done via a bunch of proxies (unfortunately on user devices, without even explicit consent) in order to avoid violating robots.txt policies and/or to seem like a genuine user to the site.

    • @canijustgetanamealre
      @canijustgetanamealre 3 months ago

      @@jordanhasnolife5163 this is correct

    • @ronin4518
      @ronin4518 7 days ago

      I don't understand what using Flink has to do with the crawler's implementation or even JavaScript. If you watched the video, you'd know Flink was used for resilience, i.e. local state management to help an instance recover and continue consumption from Kafka from the last known state. It has very little to do with how fetching page content is implemented.