Job Lifecycle

Job Creation

You can initiate a Crawler Job by sending a POST request to https://crawler.scraperapi.com/job with a request body that looks like this:

{
  "api_key": "<YOUR API KEY>",
  "start_url": "https://www.zillow.com/homes/44269_rid/",
  "max_depth": 5,
  "crawl_budget": 50,
  "url_regexp_include": ".*", //Use .* to crawl all pages on the site.
  "url_regexp_exclude": ".*/product/.*", //Optional paramter. Leave empty to include all pages on the site.
  "api_params": {
    "country_code": "us"
  },
  "callback": {
    "type": "webhook",
    "url": "<YOUR CALLBACK WEBHOOK URL>"
  },
  "enabled": true,
  "schedule": {
    "name": "NAME_OF_CRAWLER", // Name of the crawler.
    "interval": "once" //once, hourly, daily, weekly, monthly.
  }
}

Here's a full example in Python:

import requests

url = "https://crawler.scraperapi.com/job"

payload = {
    "api_key": "API_KEY",
    "start_url": "https://www.zillow.com/homes/44269_rid/",
    "max_depth": 5,
    "crawl_budget": 50,
    "url_regexp_include": ".*",
    "api_params": {
        "country_code": "us"
    },
    "callback": {
        "type": "webhook",
        "url": "YYYYYY"
    }
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response content:", response.text)

Control Properties

API_KEY

REQUIRED

Your API key. You can grab it from your Dashboard.

START_URL

REQUIRED

The URL that serves as the starting point of the crawl.

MAX_DEPTH

EITHER THIS OR CRAWL_BUDGET MUST BE SET

Maximum depth level of the crawling task. The start URL is at depth 0.

CRAWL_BUDGET

EITHER THIS OR MAX_DEPTH MUST BE SET

The maximum number of ScraperAPI credits that the crawling task should consume.

URL_REGEXP_INCLUDE

REQUIRED

A regular expression used to extract additional URLs to crawl from each page the crawler visits. Use .* to crawl all pages on the site. Tools like regex101 can help you build and debug your pattern.

URL_REGEXP_EXCLUDE

OPTIONAL

A regex pattern for skipping certain URLs. Any URL that matches this pattern will not be crawled (for example: .*/product/.*). Leave the field empty to crawl all URLs.

API_PARAMS

OPTIONAL

Control parameters for each individual scrape attempt. The list of supported parameters can be found here.

CALLBACK

REQUIRED

Currently, only webhook callbacks are supported. The results of both successful and failed scrape attempts throughout the crawling job will be streamed to the specified webhook. Once the job is complete, a summary of the entire crawling task will also be sent.
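
Since results are streamed to your webhook as they arrive, your callback endpoint needs to accept POST requests and answer quickly. Below is a minimal receiver sketch; Flask, the /crawler-results path, and the port are assumptions for illustration, and the shape of individual result payloads is not documented on this page (only the final summary, shown further down, is).

from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical endpoint path; register whatever public URL you expose here
# as the "callback.url" when creating the job.
@app.route("/crawler-results", methods=["POST"])
def crawler_results():
    payload = request.get_json(force=True, silent=True) or {}

    # The final summary (see Job Summary below) carries a "jobState" field;
    # treat anything else as a streamed scrape result.
    if "jobState" in payload:
        print("Job summary received for job:", payload.get("jobId"))
    else:
        print("Scrape result received:", str(payload)[:200])

    # Reply with a 2xx status promptly; webhook senders generally treat
    # non-2xx responses as delivery failures.
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(port=8000)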

ENABLED

OPTIONAL

When set to true, the crawler runs according to the schedule/interval settings. If false, the crawler does not run; only the crawler configuration is created. Defaults to true if not specified.

SCHEDULE

OPTIONAL

Defines an optional crawl schedule. Includes name (the name of the crawler) and interval (when it should run: once, hourly, daily, weekly, or monthly). Refer to Job Creation above for a full example.
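
As a rough sketch of how these properties combine, the payload below uses the exclude regex, api_params, and a weekly schedule. The start URL, crawler name, and webhook URL are placeholders, and the local re check merely illustrates what the exclude pattern matches; it does not claim to reproduce the crawler's own matching rules.

import re
import requests

# Illustrative payload combining the optional properties described above.
payload = {
    "api_key": "<YOUR API KEY>",
    "start_url": "https://www.example.com/",
    "max_depth": 3,
    "url_regexp_include": ".*",
    "url_regexp_exclude": ".*/product/.*",
    "api_params": {"country_code": "us"},
    "callback": {"type": "webhook", "url": "<YOUR CALLBACK WEBHOOK URL>"},
    "enabled": True,
    "schedule": {"name": "example-weekly-crawler", "interval": "weekly"},
}

# Quick local check of the exclude pattern against a sample URL.
sample_url = "https://www.example.com/product/123"
print(bool(re.search(payload["url_regexp_exclude"], sample_url)))  # True -> would be skipped

response = requests.post("https://crawler.scraperapi.com/job", json=payload)
print(response.status_code, response.json())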


Job Management

Starting a Job

When you initiate a crawler job, you'll receive a response with the following format:

{
    "status": "initiated",
    "jobId": "<UNIQUE_JOB_ID>"
}
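
A small sketch for handling this response: the field names (status, jobId) come straight from the example above, and extract_job_id is just an illustrative helper name.

# Illustrative helper; field names follow the creation response shown above.
def extract_job_id(creation_response: dict) -> str | None:
    if creation_response.get("status") == "initiated":
        return creation_response["jobId"]
    return None

# Usage after the POST in the Python example above:
# job_id = extract_job_id(response.json())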

Cancelling a Job

You can cancel a running job by sending a DELETE request to:

DELETE https://crawler.scraperapi.com/job/<JOB_ID>

This will return the following response:

{
    "status": "OK",
    "message": "Job ID: <JOB_ID> has been cancelled"
}
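
A minimal cancellation sketch in Python, assuming you kept the jobId returned at creation; any authentication requirements beyond the URL are not shown on this page.

import requests

# Cancel a running crawler job by its jobId (returned when the job was created).
job_id = "<JOB_ID>"

cancel_response = requests.delete(f"https://crawler.scraperapi.com/job/{job_id}")
print(cancel_response.status_code, cancel_response.json())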

Job States

A crawler job can be in one of the following states:

  • delayed: The job is waiting to be processed.

  • running: The job is currently being processed.

  • completed: The job has finished successfully.

  • failed: The job has failed.

  • cancelled: The job was cancelled.

  • in delivery: The job results are being delivered.

  • delivered: The job results have been delivered.

Job Summary

After the crawler job finishes, a summary is sent to the webhook specified during job setup. Here’s what that might look like:

{
    "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76",
    "jobState": "finished",
    "jobCost": 42,
    "completed": [
        {
            "url": "https://www.zillow.com/homedetails/1-E-Pier-13th-ST-DOCK-B-Boston-MA-02129/2055047115_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homes/44269_rid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/",
            "cost": 1
        },
        ...
    ],
    "cancelled": [],
    "failed": [
        {
            "url": "<https://www.zillow.com/homedetails/49-Prince-St-2-Boston-MA-02130/59131187_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        },
        {
            "url": "<https://www.zillow.com/homedetails/55-Devon-St-APT-6-Dorchester-MA-02121/71497287_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        }],
    "crawlBudget": 55
}
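
Once this summary arrives at your webhook, it can be processed with nothing more than the fields shown above. The sketch below is illustrative; process_summary is a hypothetical helper name.

# Illustrative handler for the job summary payload shown above.
def process_summary(summary: dict) -> None:
    print(f"Job {summary['jobId']} ended in state '{summary['jobState']}' "
          f"and consumed {summary['jobCost']} credits")

    for page in summary.get("completed", []):
        print(f"  crawled: {page['url']} (cost: {page['cost']})")

    for page in summary.get("failed", []):
        print(f"  failed:  {page['url']} ({page['failReason']})")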
