Accessing your Giant Web Crawl Data


Before accessing the Giant Web Crawl (GWC), you will need the following:

  1. An 80legs API token for your GWC data.  This will be provided by your client manager.
  2. Familiarity with RESTful APIs.  Currently, the GWC can only be accessed via the 80legs API, which uses a RESTful interface.
  3. Familiarity with JSON.  The GWC produces results in JSON format.

Step 1: Request the list of available results

As the GWC runs, it posts relevant results to your account's results directory.  You can list the contents of that directory by sending a request to the 80legs API, authenticated with your API token.
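
A minimal sketch of such a request in Python, using only the standard library.  Two things here are assumptions and should be checked against your 80legs API documentation: the results URL (RESULTS_URL below is a hypothetical placeholder, not a real endpoint) and the way the token is sent (here, as the HTTP Basic auth username with an empty password).

```python
import base64
import urllib.request

# Hypothetical placeholder: substitute the results URL from your 80legs API documentation.
RESULTS_URL = "https://api.example.com/gwc/results"


def build_results_request(url: str, api_token: str) -> urllib.request.Request:
    """Build a GET request that carries the API token via HTTP Basic auth.

    Assumption: the token is the Basic auth username and the password is empty.
    """
    credentials = base64.b64encode(f"{api_token}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Basic {credentials}",
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_results_listing(url: str, api_token: str, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw response body."""
    request = build_results_request(url, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch_results_listing(RESULTS_URL, "YOUR_API_TOKEN") would then return the raw listing, once the placeholder URL has been replaced with the real one.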

Step 2: Download the available result files

The response to your request in Step 1 is a list of links to result files.  Download each of these files to get the data posted to your GWC account.
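
The download step can be sketched as follows.  This is an illustration under stated assumptions, not the documented response format: it assumes the listing is a JSON array of URL strings and that each link can be fetched directly.  If your links require the API token, pass an opener that adds the appropriate headers.  The helper names are hypothetical.

```python
import json
import os
import urllib.request
from urllib.parse import urlparse


def parse_result_links(body: bytes) -> list:
    """Return the links in a listing response (assumed: a JSON array of URL strings)."""
    links = json.loads(body.decode("utf-8"))
    if not isinstance(links, list) or not all(isinstance(link, str) for link in links):
        raise ValueError("expected a JSON array of URL strings")
    return links


def local_name(link: str) -> str:
    """Derive a local file name from the last path segment of a link."""
    name = os.path.basename(urlparse(link).path)
    return name or "result"


def download_results(links, dest_dir, opener=urllib.request.urlopen):
    """Download each link into dest_dir and return the list of saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for link in links:
        path = os.path.join(dest_dir, local_name(link))
        # Read the whole response and write it to disk unchanged.
        with opener(link) as response, open(path, "wb") as out:
            out.write(response.read())
        saved.append(path)
    return saved
```

Keeping the parsing separate from the downloading makes it easy to adjust parse_result_links if the actual listing has a different shape.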

Each file will contain one or more records matching the specifications supplied to us for your GWC account.  For example, if you requested to receive any emails found on URLs crawled by the GWC, your data will look something like:

Please note that tabs and line breaks have been added to this example to make it more readable.  The actual data will not contain such separation.
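
Because the actual files are compact, it can help to re-indent records when inspecting them.  The sketch below assumes each result file holds a single JSON document (an array of records or one record).  The sample record and its field names are hypothetical and are not the actual GWC schema.

```python
import json


def load_records(raw: str) -> list:
    """Parse a result file's text; wrap a single object in a list for uniform handling."""
    data = json.loads(raw)
    return data if isinstance(data, list) else [data]


def pretty(record) -> str:
    """Re-indent a record for reading; the data itself is unchanged."""
    return json.dumps(record, indent=2, sort_keys=True)


# Hypothetical compact record, for illustration only.
sample = '[{"url":"http://www.example.com/","emails":["info@example.com"]}]'

for record in load_records(sample):
    print(pretty(record))
```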

Result files expire after 7 days.  Once they have expired, they cannot be retrieved.  Because of this, you should check for newly available files regularly and download them well before the 7-day window closes.
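
One way to avoid missing files is to poll the results list on a schedule and download only the links not handled before.  The following is a minimal sketch, not a prescribed method: the hourly default interval is an assumption, and list_links and download are hypothetical callables you would supply (for example, functions that perform Steps 1 and 2).

```python
import time


def unseen_links(listed, seen):
    """Return the links not yet downloaded, preserving listing order."""
    return [link for link in listed if link not in seen]


def poll_for_results(list_links, download, interval_seconds=3600,
                     max_polls=None, sleep=time.sleep):
    """Repeatedly list results and download any link not handled before.

    list_links: callable returning the current list of result links.
    download:   callable taking one link; a link is marked seen only after it returns.
    max_polls:  stop after this many polls (None means run until interrupted).
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for link in unseen_links(list_links(), seen):
            download(link)
            seen.add(link)
        polls += 1
        # Wait between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
    return seen
```

In a long-running deployment, the set of seen links would normally be persisted to disk so that a restart does not download every file again.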