Best Practices

As you become more proficient at building custom web scrapers with 80apps, you'll probably want to learn how to make your 80apps as efficient as possible. This matters because the time spent running an 80app can account for a significant portion of total crawl run time, so leaner 80apps mean faster crawls.

We'll be using this URL to illustrate each approach: http://www.supplyhouse.com/Zurn-GT2700-50-100-Grease-Trap-50gpm-4385000-p

Here's the specific part of the page we'll be focusing on: the product's specifications table (the #feature_list element), which lists its dimensions.

We want to find the fastest way to scrape just the length, width, and height. Dimensions will be an object containing length, width, and height properties, each with its respective value.

Our goal is to produce this object:  
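As a sketch, the result has this shape. The property names come from the description above; the string values are placeholders, since the actual measurements come from the product page itself:

```javascript
// Shape of the target result. The values below are placeholders,
// not the product's actual measurements.
var dimensions = {
  length: "<length from the page>",
  width: "<width from the page>",
  height: "<height from the page>"
};
```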

We've found four different ways you can approach this:

1. Brute Force Approach

Time Taken: 4181 milliseconds

This approach is very slow because the DOM has to be parsed multiple times. You also want to avoid the :contains pseudo-selector, because it searches both innerHTML and text.

2. Cached Approach

Time Taken: 913 milliseconds

In the first approach, our starting point was always #feature_list. Here we cache #feature_list by saving the result of $html.find("#feature_list") in a variable and parsing only that selection, instead of re-parsing the entire DOM for #feature_list every time. Caching a section of the DOM and parsing just that section, as opposed to the entire document, is significantly faster.

3. For-Loop Iterative Approach

Time Taken: 2 milliseconds

Instead of searching through "#feature_list" multiple times, this approach iterates over the table rows inside that div once, checking whether each row matches one of our desired property names.

4. .each Iterative Approach

Time Taken: 1 millisecond

This approach is similar to the for-loop approach, except that it uses Cheerio's .each method to loop through the table rows. It shaves off another millisecond.

Takeaways

  1. Cache sections of the DOM that you find yourself repeatedly parsing, so you don't have to parse a giant tree over and over again.
  2. Use loops whenever you can to further limit the number of times you parse the DOM, or even sections of it.