Creating Your First Crawl with the Web Portal

In this walkthrough, we'll cover each step needed to run your first web crawl with the web portal.

The steps you'll take are:

  1. Upload a URL list
  2. Create a crawl
  3. Download the crawl results

1. Upload a URL list

Every web crawl needs to start from one or more URLs. On 80legs, URL lists are the collections of URLs from which your web crawls start.

To get started, log in to the web portal and click on "My URL Lists". Then click on "Create a URL list".

Enter a name for your URL list and then enter one or more URLs. Click "Create URL list" when you're done.
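
Before pasting URLs into the portal, it can help to clean up your seed list locally. The sketch below is a minimal example of that; the cleanup rules (http/https schemes only, deduplication) are our own assumptions, not 80legs requirements.

```python
import json

def make_url_list(urls):
    """Deduplicate and keep only well-formed http(s) URLs, preserving order."""
    seen, cleaned = set(), []
    for url in urls:
        url = url.strip()
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned

seeds = make_url_list([
    "http://www.example.com/",
    "not-a-url",                # dropped: no http(s) scheme
    "http://www.example.com/",  # dropped: duplicate
    "https://www.example.org/news",
])
print(json.dumps(seeds, indent=2))
```

The cleaned list can then be pasted into the "Create a URL list" form one URL at a time.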

2. Create a crawl

Now it's time to create the web crawl. Go back to the "Crawl the Web" page and click on "Create a New Crawl". Fill out the form by entering a name for your web crawl and selecting the URL list you just uploaded.

You'll also need to select an 80app to specify what data you want returned from your crawl.  You can learn more about 80apps here.  If you choose "Fullpagecontent.js", the crawl will download the HTML source of each URL crawled.

Click on "Create Crawl" when you're done.
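
The same crawl can also be set up programmatically. The sketch below only builds the request URL and JSON body rather than sending anything; the endpoint path and field names are assumptions based on the 80legs v2 REST API and should be verified against the current API documentation before use.

```python
import json

API_BASE = "https://api.80legs.com/v2"  # assumed base URL; check the API docs

def build_crawl_request(name, urllist, app="Fullpagecontent.js",
                        max_urls=1000, max_depth=2):
    """Return the URL and JSON body for a hypothetical 'create crawl' call."""
    body = {
        "urllist": urllist,    # the URL list created in step 1
        "app": app,            # the 80app that decides what data is returned
        "max_urls": max_urls,  # assumed limit on how many URLs to crawl
        "max_depth": max_depth,
    }
    return f"{API_BASE}/crawls/{name}", json.dumps(body)

url, payload = build_crawl_request("my-first-crawl", "my-first-url-list")
print(url)
print(payload)
```

An HTTP PUT of `payload` to `url` (with your API credentials) would then stand in for clicking "Create Crawl" in the portal.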

The crawl will start off in the queue, but will soon move to "started" status. When this happens, 80legs will begin processing your crawl, and if you occasionally refresh the page, you'll see the crawled-URL count increase. When the crawl is complete, you'll see its status set to "completed".
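
Refreshing the page by hand works fine, but the wait-and-check loop can also be expressed in code. In the sketch below, `get_status` is a stand-in for however you fetch the crawl's status (a portal refresh or an API call); the status names mirror the ones shown in the portal.

```python
import time

TERMINAL_STATES = {"completed", "canceled"}

def poll_until_done(get_status, interval=30.0, max_checks=120):
    """Call get_status() every `interval` seconds until a terminal state."""
    status = get_status()
    for _ in range(max_checks):
        if status.lower() in TERMINAL_STATES:
            break
        time.sleep(interval)
        status = get_status()
    return status

# Example with a stubbed status source:
fake_states = iter(["queued", "started", "started", "completed"])
print(poll_until_done(lambda: next(fake_states), interval=0))
```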

If you want to stop the crawl, just click "Cancel". Even if you stop a crawl while it's running, you'll still get results for what has been crawled so far.

3. Download the crawl results

Once your crawl is complete, or after you've canceled it, you'll see one or more result files associated with it. Click each result file link to download your crawl data. Crawl results from 80legs are stored in JSON format; each record contains the URL crawled and the data retrieved from that URL, like this example: