How 80legs Crawls URLs: Depth-first vs Breadth-first vs Greedy

Traditional implementation for web crawlers typically dictates a choice between depth-first or breadth-first crawling.

To understand these concepts, you should visualize your crawl as an upside-down tree. The trunk of the tree is the first URL you crawl, and each branch extending out is a URL linked from the previous URL.

In breadth-first crawling, every URL at the current depth will be crawled before proceeding to the next depth level, as illustrated here:

In depth-first crawling, the web crawler will travel as far down one branch before proceeding back up the graph to the next set of URL, like so:

80legs actually uses neither of these approaches. Instead, it uses a "greedy" approach. A strict depth or breadth-first approach would be ok if every URL had the same response time. In reality, this is not the case, and requiring a URL to return data before moving to the next URL would result in very slow web crawling.

Because of this, 80legs will do depth or breadth-first crawling for the set of URLs it crawls next, depending on whichever URLs return the most quickly. We call this a greedy approach, because 80legs is doing whatever seems fastest or easiest at the time.