Callbacks, Errors & Best Practices

Callback Details

The callback webhook receives an update for each crawled page, plus a final summary when the job is complete. Each callback includes:

  1. For individual page results:

    • jobId: The unique identifier of the crawler job.

    • url: The URL that was crawled.

    • status: The status of the crawl attempt.

    • credits: The number of credits used for this request.

    • currentDepth: The current depth level of the crawl.

    • currentCost: The total cost of the job so far.

    • failReason: (if failed) The reason for the failure.

    • responseStream: The actual content of the crawled page (for successful requests).

Example individual page result:

{
  "url": "https://zillow.com/homedetails/55-Eleanor-St-APT-17-Chelsea-MA-02150/63440284_zpid/\\",
  "jobId": "3283eab9-f0f9-4f5d-b081-43339643b8c2",
  "status": "finished",
  "credits": 1,
  "attempts": 0,
  "startUrl": "https://www.zillow.com/homes/44269_rid/",
  "currentCost": 26,
  "currentDepth": 2,
  "response": {
    "statusCode": 200,
    "headers": {
      "x-powered-by": "Express",
      "access-control-allow-origin": "undefined",
      "access-control-allow-headers": "Origin, X-Requested-With, Content-Type, Accept",
      "access-control-allow-methods": "HEAD,GET,POST,DELETE,OPTIONS,PUT",
      "access-control-allow-credentials": "true",
      "x-robots-tag": "none",
      "content-type": "text/html; charset=utf-8",
      "sa-final-url": "https://www.zillow.com/homedetails/5-Lambert-St-5-Roxbury-MA-02119/2069696289_zpid/",
      "sa-statuscode": "200",
      "sa-credit-cost": "1",
      "sa-proxy-hash": "undefined",
      "etag": "W/\"fe810-G9Mn1GIfG51ph9p18EVHwluw8Xo\"",
      "vary": "Accept-Encoding",
      "date": "Thu, 31 Jul 2025 14:00:57 GMT",
      "connection": "keep-alive",
      "keep-alive": "timeout=5",
      "transfer-encoding": "chunked"
    },
    "body":...........
     "credits": 1
  }
}
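
A minimal sketch of processing one of these page-result payloads in Python, assuming the webhook body has already been parsed into a dictionary; the in-memory results store is only a stand-in for whatever persistence you use:

results = {}  # in-memory store for successful pages; swap in your own persistence

def handle_page_result(payload: dict) -> None:
    """Process a single page-result callback; field names follow the example above."""
    if payload.get("status") == "finished":
        response = payload.get("response", {})
        results[payload["url"]] = {
            "status_code": response.get("statusCode"),
            "headers": response.get("headers", {}),
            "body": response.get("body"),
            "credits": payload.get("credits"),
            "depth": payload.get("currentDepth"),
        }
    else:
        # Failed pages carry a failReason instead of a response body.
        print(f"{payload['url']} failed: {payload.get('failReason')}")
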
  2. For the final job summary:

    • jobId: The unique identifier of the crawler job

    • jobState: The final state of the job

    • jobCost: The total cost of the job

    • completed: Array of successfully crawled URLs with their individual costs

    • cancelled: Array of cancelled URLs

    • failed: Array of failed URLs with their failure reasons

    • crawlBudget: The total crawl budget that was set

Example final job summary result:

{
    "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76",
    "jobState": "finished",
    "jobCost": 42,
    "completed": [
        {
            "url": "https://www.zillow.com/homedetails/1-E-Pier-13th-ST-DOCK-B-Boston-MA-02129/2055047115_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homes/44269_rid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/",
            "cost": 1
        },
        ...
    ],
    "cancelled": [],
    "failed": [
        {
            "url": "<https://www.zillow.com/homedetails/49-Prince-St-2-Boston-MA-02130/59131187_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        },
        {
            "url": "<https://www.zillow.com/homedetails/55-Devon-St-APT-6-Dorchester-MA-02121/71497287_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        }],
    "crawlBudget": 55
}
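
A minimal sketch of a webhook endpoint that tells the two payload types apart, assuming Flask and assuming that the presence of jobState marks the final summary, as in the two examples above; the route path is only an illustration:

from flask import Flask, request

app = Flask(__name__)

@app.route("/crawler-callback", methods=["POST"])  # hypothetical path; use the callback URL you registered
def crawler_callback():
    payload = request.get_json()
    if "jobState" in payload:
        # Final job summary (second example above).
        print(f"Job {payload['jobId']} {payload['jobState']}, total cost {payload['jobCost']}")
        for page in payload.get("completed", []):
            print(f"  done   {page['url']} (cost {page['cost']})")
        for page in payload.get("failed", []):
            print(f"  failed {page['url']}: {page['failReason']}")
    else:
        handle_page_result(payload)  # per-page handler from the earlier sketch
    return "", 200  # acknowledge quickly so the delivery is not retried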

Response Details

Each successful crawl response includes:

  • HTTP status code.

  • Response headers.

  • Response body.

  • Number of credits used.

Error Handling

The crawler implements several error handling mechanisms:

  1. Failed Requests:

    • Failed requests don't cost any API credits.

    • Failed requests are included in the final summary with their failure reasons.

    • The crawler continues with the remaining URLs even if some requests fail.

  2. Duplicate URLs:

    • The crawler automatically detects and skips duplicate URLs. This helps prevent unnecessary credit usage and infinite loops.

  3. Budget Exceeded:

    • If a single URL's cost exceeds the crawl budget, the job will fail to start.

    • If the cumulative cost exceeds the budget during crawling, the job will stop gracefully.

    • A final summary will be sent with all successfully crawled URLs.

  4. Webhook Failures:

    • The system will retry failed webhook deliveries.

    • Webhook timeouts are handled gracefully.

    • Failed webhook deliveries don't affect the crawling process.
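
Because webhook deliveries may be retried, it helps to make your endpoint idempotent. A minimal dedup sketch, assuming that jobId plus url is enough to identify a delivery (url is absent for the final summary):

processed = set()  # delivery keys already handled; use durable storage in production

def is_new_delivery(payload: dict) -> bool:
    """Return False for repeat deliveries so webhook retries don't create duplicates."""
    key = (payload.get("jobId"), payload.get("url"))
    if key in processed:
        return False
    processed.add(key)
    return True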

Best Practices

  1. URL Regex Pattern:

    • Make sure your regex pattern is specific enough to only match the URLs you want to crawl.

    • Consider extracting both full URLs and relative URLs.

    • Test your regex pattern with a tool like regex101 before starting a large crawl job (see also the sketch after this list).

  2. Crawl Budget vs Max Depth:

    • Use crawl_budget when you want to control costs.

    • Use max_depth when you want to control how deep the crawler goes.

    • Consider using both to ensure you don't exceed your budget while maintaining depth control.

  3. API Parameters:

    • Use country_code to specify the country for the requests.

    • Add any other ScraperAPI parameters as needed in the api_params object.

  4. Callback Webhook:

    • Ensure your webhook endpoint can handle the expected load.

    • Implement proper error handling for failed requests.

    • Store the results as they come in, as the final summary might be delayed.

    • Handle both individual page results and the final summary.

    • Implement proper timeout handling (default timeout is 2 hours).
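
For the regex best practice above, it can also help to sanity-check a candidate pattern locally against a few sample links before launching the job. A short sketch; the pattern and the URLs below are only illustrations:

import re

# Hypothetical pattern: match home-detail pages whether the link is absolute or relative.
pattern = re.compile(r"(https?://(www\.)?zillow\.com)?/homedetails/\S+")

samples = [
    "https://www.zillow.com/homedetails/55-Devon-St-APT-6-Dorchester-MA-02121/71497287_zpid/",
    "/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/",
    "https://www.zillow.com/homes/44269_rid/",  # should not match
]

for url in samples:
    print("match   " if pattern.fullmatch(url) else "no match", url)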
