Creating Your First Crawl with the API

In this walkthrough, we'll explain all the necessary steps to create your first crawl from scratch.

The steps you'll take are:

  1. Upload a URL list
  2. Upload an 80app
  3. Create a crawl
  4. Download the results of the crawl

IMPORTANT: This walkthrough assumes you are familiar with RESTful services.  If you're not, we recommend reading this introduction.

NOTE: If you use cURL to make these API calls, you may need to update how cURL handles SSL certificates.

1. Upload a URL list

The URL list consists of one or more URLs from which the crawl will begin.  We need to provide 80legs with this list.

To do this, we'll send a PUT request to the API with the contents of our URL list.  Here's what that request looks like:
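As a sketch in Python (the base URL, `urllists` path, and `Authorization` header are assumptions here; consult the 80legs API reference for the exact endpoint and authentication scheme):

```python
import json
import urllib.request

BASE = "https://api.80legs.com/v2"  # assumed base URL; check the API docs
TOKEN = "YOUR_API_TOKEN"            # placeholder for your 80legs API token

# The body is a JSON array of seed URLs; the list name ("my_urls") goes in
# the path and must not include a file extension.
urls = ["http://www.example.com/", "http://www.example.org/"]
req = urllib.request.Request(
    f"{BASE}/urllists/my_urls",
    data=json.dumps(urls).encode(),
    method="PUT",
    headers={"Content-Type": "application/json",
             "Authorization": TOKEN},  # auth scheme is an assumption
)
# urllib.request.urlopen(req)  # a successful upload returns HTTP 204
```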

A few important notes:

  • Do not include a file extension (e.g., .txt, .doc) as part of the URL list name.
  • The entire URL list must be less than 40 MB in size.

If your request was successful, you'll receive a 204 status code as a response.  You can check the status of your URL list by doing a GET request like this:
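A minimal sketch, again assuming a `urllists` resource and a token-based `Authorization` header:

```python
import urllib.request

BASE = "https://api.80legs.com/v2"  # assumed base URL; check the API docs
TOKEN = "YOUR_API_TOKEN"            # placeholder for your 80legs API token

req = urllib.request.Request(
    f"{BASE}/urllists",
    method="GET",
    headers={"Authorization": TOKEN},  # auth scheme is an assumption
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```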

The response you'll get back is a list of all your URL lists:
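The exact field layout depends on the API version; as a hypothetical example of handling such a response body:

```python
import json

# Hypothetical response body -- the real API may return richer objects:
body = '["my_urls", "another_list"]'
url_lists = json.loads(body)
print(url_lists)  # → ['my_urls', 'another_list']
```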

2. Upload an 80app

The next step is to upload an 80app.  For each URL crawled, the 80app decides which links to follow and what data to return from that URL.

80apps are written in JavaScript.  To see example 80apps, take a look at our public 80app repo.  We'll upload the full page content 80app.  To upload it, we issue a command like:
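A sketch in Python, assuming 80apps are uploaded to an `apps` resource under the same v2 API (the endpoint name and auth header are assumptions):

```python
import urllib.request

BASE = "https://api.80legs.com/v2"  # assumed base URL; check the API docs
TOKEN = "YOUR_API_TOKEN"            # placeholder for your 80legs API token

# The request body is the raw JavaScript source of the 80app.
js_source = b"// contents of FullPageContent.js from the public 80app repo"
req = urllib.request.Request(
    f"{BASE}/apps/FullPageContent.js",
    data=js_source,
    method="PUT",
    headers={"Content-Type": "application/octet-stream",
             "Authorization": TOKEN},  # auth scheme is an assumption
)
# urllib.request.urlopen(req)  # a successful upload returns HTTP 204
```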
If the upload was successful, you'll get a 204 response.

3. Create a crawl

Now it's time to create a crawl using the URL list and 80app we just created.  To do that, we'll issue another PUT request, but this time it will be to the crawl resource.  It looks like:
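A sketch, assuming crawls are created under a `crawls` resource and the parameters are sent as a JSON body (the endpoint shape and auth header are assumptions):

```python
import json
import urllib.request

BASE = "https://api.80legs.com/v2"  # assumed base URL; check the API docs
TOKEN = "YOUR_API_TOKEN"            # placeholder for your 80legs API token

params = {
    "app": "FullPageContent.js",  # the 80app uploaded earlier
    "urllist": "my_urls",         # the URL list uploaded earlier
    "max_depth": 2,
    "max_urls": 100,
}
req = urllib.request.Request(
    f"{BASE}/crawls/my_first_crawl",
    data=json.dumps(params).encode(),
    method="PUT",
    headers={"Content-Type": "application/json",
             "Authorization": TOKEN},  # auth scheme is an assumption
)
# urllib.request.urlopen(req)  # a successful request returns HTTP 204
```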

Here's what each of the parameters means:

  • app: Name of the JavaScript app file that will process page contents.
  • urllist: Name of the URL list used to start the crawl.
  • data: Name of a data file (optional; if used, see the upload data instructions).
  • max_depth: Maximum depth you want the crawl to reach.
  • max_urls: Maximum number of URLs you want the crawl to process.

A few important notes:

  • The app and urllist must be uploaded prior to creating a crawl.
  • The data field is optional.
  • In this case, we're using the full_page_content app, which returns the full HTML source of each URL crawled.  Every account has access to this app by default, and you can also upload your own apps.

If the request was successful, you'll get a 204 response.

4. Check the status of the crawl and download result files when available

As the crawl is running, you can check on its status by issuing a GET request like so:
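A sketch, assuming crawl status is read back from the same `crawls` resource (endpoint shape and auth header are assumptions):

```python
import urllib.request

BASE = "https://api.80legs.com/v2"  # assumed base URL; check the API docs
TOKEN = "YOUR_API_TOKEN"            # placeholder for your 80legs API token

req = urllib.request.Request(
    f"{BASE}/crawls/my_first_crawl",
    method="GET",
    headers={"Authorization": TOKEN},  # auth scheme is an assumption
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```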

This will generate a response like:
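The real field names may differ; a hypothetical example of handling such a status body:

```python
import json

# Hypothetical response body -- the actual field names are assumptions:
body = '{"name": "my_first_crawl", "status": "RUNNING", "urls_crawled": 57}'
status = json.loads(body)
print(status["status"])  # → RUNNING
```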

Crawl results will post as the crawl is running.  Issue the following GET request to get the link to download the results:
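A sketch, assuming results are fetched from a `results` resource keyed by crawl name (the path and auth header are assumptions):

```python
import urllib.request

BASE = "https://api.80legs.com/v2"  # assumed base URL; check the API docs
TOKEN = "YOUR_API_TOKEN"            # placeholder for your 80legs API token

req = urllib.request.Request(
    f"{BASE}/results/my_first_crawl",  # the "results" path is an assumption
    method="GET",
    headers={"Authorization": TOKEN},
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```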

You'll get a response like:
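As a hypothetical example (the link format shown is illustrative, not the real download URL):

```python
import json

# Hypothetical response body -- a JSON array of result download links:
body = '["https://example.com/results/my_first_crawl_1.txt"]'
links = json.loads(body)
print(len(links))  # → 1
```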

If you've run a very large crawl, you'll get multiple links, since we split large result sets across several result files.

Your result file will have a list of objects, with each object having a URL and result attribute, like in this example:
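An illustrative sketch of reading such a file (the sample contents are made up; each object pairs a crawled URL with whatever the 80app returned for it):

```python
import json

# Illustrative result file contents -- not real crawl output:
body = '[{"url": "http://www.example.com/", "result": "<html>...</html>"}]'
for item in json.loads(body):
    print(item["url"])  # → http://www.example.com/
```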

Crawl results expire 7 days after they are created, so for larger crawls it's a good idea to keep checking for new result files and downloading them right away.

Wrapping Up

That's all there is to it!  You can issue the calls described above through any programming language, or even just by using curl.

For full documentation, refer to the 80legs API documentation.