Job Lifecycle

Job Creation

You can initiate a Crawler Job by sending a POST request to https://crawler.scraperapi.com/job with a request body that looks like this:

{
  "api_key": "<YOUR API KEY>",
  "start_url": "https://www.zillow.com/homes/44269_rid/",
  "max_depth": 5,
  "crawl_budget": 50,
  "url_regexp": "\\"(?<full_url>https:\\/\\/(www.)?zillow.com\\/homedetails\\/[^\\"]+)|href=\\"(?<relative_url>\\/homedetails\\/[^\\"]+)",
  "api_params": {
    "country_code": "us"
  },
  "callback": {
    "type": "webhook",
    "url": "<YOUR CALLBACK WEBHOOK URL>"
   }
}

Here's a full example in Python:

import requests

url = "https://crawler.scraperapi.com/job"

payload = {
    "api_key": "API_KEY",
    "start_url": "https://www.zillow.com/homes/44269_rid/",
    "max_depth": 5,
    "crawl_budget": 50,
    "url_regexp": "https://www\\.zillow\\.com/homedetails/[^\"\\s>]+|/homedetails/[^\"\\s>]+",
    "api_params": {
        "country_code": "us"
    },
    "callback": {
        "type": "webhook",
        "url": "YYYYYY"
    }
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response content:", response.text)

Control Properties

API_KEY

REQUIRED

Your API key. You can grab it from your Dashboard.

START_URL

REQUIRED

The URL that serves as the starting point of the crawl.

MAX_DEPTH

EITHER THIS OR CRAWL_BUDGET MUST BE SET

Maximum depth level of the crawling task. The start URL is at depth 0.

CRAWL_BUDGET

EITHER THIS OR MAX_DEPTH MUST BE SET

The maximum number of ScraperAPI credits the crawling task should consume.

URL_REGEXP

REQUIRED

A regular expression used to extract additional URLs to crawl from each page the crawler visits.
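To see how the pattern behaves, you can test it locally with Python's re module before submitting a job. The sketch below uses the pattern from the Python example payload above against a hypothetical HTML fragment (the fragment itself is made up for illustration):

```python
import re

# The pattern from the example payload: the first alternative matches
# absolute Zillow listing URLs, the second matches relative ones.
pattern = r'https://www\.zillow\.com/homedetails/[^"\s>]+|/homedetails/[^"\s>]+'

# Hypothetical HTML fragment, for illustration only.
html = '''
<a href="/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/">A</a>
<a href="https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/">B</a>
'''

matches = re.findall(pattern, html)
for m in matches:
    print(m)
```

Note that the `[^"\s>]+` character class stops each match at the closing quote of the href attribute, so both relative and absolute listing URLs are captured cleanly.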

API_PARAMS

OPTIONAL

Control parameters for each individual scrape attempt. The list of supported parameters can be found here.

CALLBACK

REQUIRED

Currently, only webhook callbacks are supported. The results of both successful and failed scrape attempts throughout the crawling job will be streamed to the specified webhook. Once the job is complete, a summary of the entire crawling task will also be sent.
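Your webhook endpoint needs to accept POST requests and acknowledge them with a 2xx status. Here's a minimal receiver sketch using Python's standard library; the `classify_message` heuristic is an assumption based on the example summary shown later in this document (the summary is the only message carrying a `jobState` field):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify_message(body: dict) -> str:
    """Distinguish a per-page scrape result from the final job summary.

    Assumption: only the summary payload carries a "jobState" field,
    as in the example summary shown in these docs.
    """
    return "summary" if "jobState" in body else "result"

class CrawlerWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        print("Received:", classify_message(body))
        # Acknowledge receipt so the crawler does not retry or drop results.
        self.send_response(200)
        self.end_headers()

def serve(port: int = 8000) -> None:
    HTTPServer(("", port), CrawlerWebhook).serve_forever()

# serve()  # uncomment to run locally, e.g. behind an HTTPS tunnel
```

In production you would typically run this behind a publicly reachable HTTPS URL and persist each result rather than printing it.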


Job Management

Starting a Job

When you initiate a crawler job, you'll receive a response with the following format:

{
    "status": "initiated",
    "jobId": "<UNIQUE_JOB_ID>"
}

Cancelling a Job

You can cancel a running job by sending a DELETE request to:

DELETE https://crawler.scraperapi.com/job/<JOB_ID>

This will return the following response:

{
    "status": "OK",
    "message": "Job ID: <JOB_ID> has been cancelled"
}
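A cancellation sketch in Python, using the `jobId` returned when the job was created (whether the DELETE endpoint requires additional authentication parameters is not shown here, so check your account docs if the request is rejected):

```python
import requests

def cancel_url(job_id: str) -> str:
    """Build the cancellation endpoint URL for a given job ID."""
    return f"https://crawler.scraperapi.com/job/{job_id}"

def cancel_job(job_id: str) -> requests.Response:
    """Cancel a running crawler job via the DELETE endpoint."""
    return requests.delete(cancel_url(job_id))

# Example (requires a live job):
# resp = cancel_job("e25c4cf9-b521-4f97-8e0d-ff2220756a76")
# print(resp.status_code, resp.json())
```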

Job States

A crawler job can be in one of the following states:

  • delayed: The job is waiting to be processed.

  • running: The job is currently being processed.

  • completed: The job has finished successfully.

  • failed: The job has failed.

  • cancelled: The job was cancelled.

  • in delivery: The job results are being delivered.

  • delivered: The job results have been delivered.

Job Summary

After the crawler job finishes, a summary is sent to the webhook specified during job setup. Here’s what that might look like:

{
    "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76",
    "jobState": "finished",
    "jobCost": 42,
    "completed": [
        {
            "url": "https://www.zillow.com/homedetails/1-E-Pier-13th-ST-DOCK-B-Boston-MA-02129/2055047115_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homes/44269_rid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/",
            "cost": 1
        },
        ...
    ],
    "cancelled": [],
    "failed": [
        {
            "url": "<https://www.zillow.com/homedetails/49-Prince-St-2-Boston-MA-02130/59131187_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        },
        {
            "url": "<https://www.zillow.com/homedetails/55-Devon-St-APT-6-Dorchester-MA-02121/71497287_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        }],
    "crawlBudget": 55
}
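Since the summary lists every completed, cancelled, and failed URL, a common pattern is to condense it into a few headline numbers on receipt. A minimal sketch, assuming the field names from the example above:

```python
def summarize(summary: dict) -> dict:
    """Condense a job-summary payload into headline numbers.

    Field names ("jobId", "jobState", "jobCost", "completed",
    "cancelled", "failed") follow the example summary above.
    """
    return {
        "jobId": summary["jobId"],
        "state": summary["jobState"],
        "creditsUsed": summary["jobCost"],
        "pagesCompleted": len(summary.get("completed", [])),
        "pagesCancelled": len(summary.get("cancelled", [])),
        "pagesFailed": len(summary.get("failed", [])),
    }

# Usage against a trimmed-down version of the example payload:
example = {
    "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76",
    "jobState": "finished",
    "jobCost": 42,
    "completed": [{"url": "https://www.zillow.com/homes/44269_rid/", "cost": 1}],
    "cancelled": [],
    "failed": [{"url": "https://example.com", "failReason": "<ERROR MESSAGE>"}],
}
print(summarize(example))
```

Comparing `pagesFailed` against `pagesCompleted` is a quick way to decide whether a crawl needs to be re-run with different parameters.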
