How 80legs Crawls URLs: Depth-first vs Breadth-first vs Greedy

Traditional implementation for web crawlers typically dictates a choice between depth-first or breadth-first crawling.

To understand these concepts, you should visualize your crawl as an upside-down tree.  The trunk of the tree is the first URL you crawl, and each branch extending out is a URL linked from the previous URL.

In breadth-first crawling, every URL at the current depth will be crawled before proceeding to the next depth level, as illustrated here:


In depth-first crawling, the web crawler will travel as far down one branch before proceeding back up the graph to the next set of URL, like so:


80legs actually uses neither of these approaches.  Instead, it uses a "greedy" approach.  A strict depth or breadth-first approach would be ok if every URL had the same response time.  In reality, this is not the case, and requiring a URL to return data before moving to the next URL would result in very slow web crawling.

Because of this, 80legs will do depth or breadth-first crawling for the set of URLs it crawls next, depending on whichever URLs return the most quickly.  We call this a greedy approach, because 80legs is doing whatever seems fastest or easiest at the time.