Problem: Design a distributed web crawler to crawl billions of pages.
Design: Here are our main architecture goals:
- Scalability
- Extensibility
- Freshness of pages
- Priority
- Politeness
Here is what our design looks like:
Let's jump into every component to see the details:
1. URL Provider: The URL Provider supplies URLs to the URL Fetcher workers based on priority and politeness. Here, politeness means how frequently we should hit a specific domain and fetch its pages. Here is how we can implement it:
Let's see the flow:
- The Prioritizer prioritizes the URLs to be crawled. The priority could be based on the freshness of the page, the importance of the URL, or how frequently the site is updated (news sites, for example), and each URL is given a priority within some range, say [1...n].
- There will be n Priority Queues, and the Prioritizer puts each URL into its corresponding priority queue: a priority 1 URL goes to priority queue 1, and so on. This handles the priority requirement.
- The Polite Queue Router fetches URLs from the highest priority queue first, then the second highest, then the third highest, and so on, and puts them into the Polite Queues. Note that it puts all of one domain's URLs into the same polite queue; this is what ensures politeness. That means it maintains a mapping from domain to queue number.
- The number of Polite Queues is equal to the number of URL Fetcher workers.
- Whenever a URL Fetcher asks for a URL, the URL Provider serves one from the Polite Queues, handing out URLs one by one across the queues. (If needed, we can use a heap when the fetch times of URLs differ widely.) A minimal sketch of this flow is shown right after this list.
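To make the flow concrete, here is a minimal, single-process sketch of the URL Provider, assuming class and method names of my own (URLProvider, add_url, route_to_polite_queues, get_url are not part of the design above). It shows n priority queues feeding domain-sticky polite queues; a real implementation would also enforce a delay between hits to the same domain, which is omitted here.

```python
import collections
from urllib.parse import urlparse

class URLProvider:
    """Sketch: n priority queues feeding domain-sticky polite queues."""

    def __init__(self, num_priorities, num_polite_queues):
        self.priority_queues = [collections.deque() for _ in range(num_priorities)]
        self.polite_queues = [collections.deque() for _ in range(num_polite_queues)]
        self.domain_to_queue = {}   # pins each domain to one polite queue
        self.next_polite_queue = 0  # round-robin cursor for serving fetchers

    def add_url(self, url, priority):
        # The Prioritizer has already assigned a priority in [1...n].
        self.priority_queues[priority - 1].append(url)

    def route_to_polite_queues(self):
        # Polite Queue Router: drain higher-priority queues first,
        # always sending a given domain to the same polite queue.
        for pq in self.priority_queues:
            while pq:
                url = pq.popleft()
                domain = urlparse(url).netloc
                if domain not in self.domain_to_queue:
                    self.domain_to_queue[domain] = hash(domain) % len(self.polite_queues)
                self.polite_queues[self.domain_to_queue[domain]].append(url)

    def get_url(self):
        # Hand out URLs one by one across the polite queues (round robin).
        for _ in range(len(self.polite_queues)):
            q = self.polite_queues[self.next_polite_queue]
            self.next_polite_queue = (self.next_polite_queue + 1) % len(self.polite_queues)
            if q:
                return q.popleft()
        return None  # nothing ready to crawl right now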
2. URL Fetcher: These are worker threads that get a URL from the URL Provider and fetch the content of that URL. Each fetcher can use its own DNS resolver to resolve IP addresses faster.
Once it has fetched the content, it stores it in the DB and also puts it in the cache so that the next components can get the content quickly.
Once the content is in the cache, it pushes the URL onto the queue for the next component.
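A minimal sketch of one fetcher's loop could look like the following. The HTTP client (requests) and the db, cache, and extract_queue interfaces (save, set, push) are assumptions for illustration, not part of the design above.

```python
import socket
from urllib.parse import urlparse

import requests  # assuming an HTTP client such as requests is available

def fetch_worker(url_provider, db, cache, extract_queue):
    """Hypothetical URL Fetcher worker loop."""
    dns_cache = {}  # fetcher-local DNS cache, as mentioned above
    while True:
        url = url_provider.get_url()
        if url is None:
            continue  # nothing ready yet; a real worker would sleep/back off here
        host = urlparse(url).netloc
        if host not in dns_cache:
            # Resolve and remember the IP; actually connecting through it
            # needs extra plumbing, so this only illustrates the cache.
            dns_cache[host] = socket.gethostbyname(host)
        content = requests.get(url, timeout=10).text
        db.save(url, content)        # durable copy of the page content
        cache.set(url, content)      # hot copy for the URL Extractor
        extract_queue.push(url)      # signal downstream that content is ready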
3. URL Extractor: The URL Extractor gets the URL from the queue and uses it to read the content from the cache. From the content it extracts all the unique URLs.
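As a sketch of that step, a simple regex over href attributes works for illustration (a real crawler would use a proper HTML parser); the cache.get interface is again an assumption.

```python
import re
from urllib.parse import urljoin

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def extract_urls(url, cache):
    """Hypothetical URL Extractor step: read cached content, return the unique links."""
    content = cache.get(url)
    if content is None:
        return set()
    # Resolve relative links against the page URL; a set de-duplicates them.
    return {urljoin(url, link) for link in HREF_RE.findall(content)}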
4. There could be one more component here called the URL Filter; basically, it filters out the URLs we are not interested in. Say we are only interested in images: then it can filter out all the URLs that do not point to an image.
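For the image-only example, a minimal filter (extension list chosen only for illustration) might look like this:

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.webp')

def filter_image_urls(urls):
    """Hypothetical URL Filter: keep only URLs that point to an image."""
    return {u for u in urls if urlparse(u).path.lower().endswith(IMAGE_EXTENSIONS)}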
5. URL Loader: The URL Loader gets the list of URLs from the URL Extractor / URL Filter and may store them in the DB. It also checks the input list for URLs that have already been crawled and removes them. It then sends the filtered list of URLs to the URL Provider.
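A minimal sketch of that loop follows; crawled_store (with contains/add) and prioritizer.priority_for are hypothetical interfaces, while add_url matches the URL Provider sketch shown earlier.

```python
def load_urls(candidate_urls, crawled_store, prioritizer, url_provider):
    """Hypothetical URL Loader: drop already-crawled URLs, feed the rest back."""
    fresh = []
    for url in candidate_urls:
        if crawled_store.contains(url):   # already crawled (or already queued): skip
            continue
        crawled_store.add(url)            # remember it so it is never enqueued twice
        fresh.append(url)
    for url in fresh:
        url_provider.add_url(url, prioritizer.priority_for(url))
    return fresh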
That's all about the system design of a web crawler.