> For the complete documentation index, see [llms.txt](https://docs.scraperapi.com/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.scraperapi.com/scraperapi-crawler-v2.0/crawler-api/callbacks-errors-and-best-practices.md). # Callbacks, Errors & Best Practices ## Callback Details The callback `webhook` will receive updates for each crawled page and a final summary, when the job is complete. Each callback includes: 1. **For individual page results:** * `jobId`: The unique identifier of the crawler job. * `url`: The URL that was crawled. * `status`: The status of the crawl attempt. * `credits`: The number of credits used for this request. * `currentDepth`: The current depth level of the crawl. * `currentCost`: The total cost of the job so far. * `failReason`: (if failed) The reason for the failure. * `responseStream`: The actual content of the crawled page (for successful requests). **Example individual page result:** ```json { "url": "https://zillow.com/homedetails/55-Eleanor-St-APT-17-Chelsea-MA-02150/63440284_zpid/\\", "jobId": "3283eab9-f0f9-4f5d-b081-43339643b8c2", "status": "finished", "credits": 1, "attempts": 0, "startUrl": "https://www.zillow.com/homes/44269_rid/", "currentCost": 26, "currentDepth": 2, "response": { "statusCode": 200, "headers": { "x-powered-by": "Express", "access-control-allow-origin": "undefined", "access-control-allow-headers": "Origin, X-Requested-With, Content-Type, Accept", "access-control-allow-methods": "HEAD,GET,POST,DELETE,OPTIONS,PUT", "access-control-allow-credentials": "true", "x-robots-tag": "none", "content-type": "text/html; charset=utf-8", "sa-final-url": "https://www.zillow.com/homedetails/5-Lambert-St-5-Roxbury-MA-02119/2069696289_zpid/", "sa-statuscode": "200", "sa-credit-cost": "1", "sa-proxy-hash": "undefined", "etag": "W/\"fe810-G9Mn1GIfG51ph9p18EVHwluw8Xo\"", "vary": "Accept-Encoding", "date": "Thu, 31 Jul 2025 14:00:57 GMT", "connection": "keep-alive", "keep-alive": "timeout=5", "transfer-encoding": "chunked" }, "body":........... "credits": 1 } } ``` 1. **For the final job summary:** * `jobId`: The unique identifier of the crawler job * `jobState`: The final state of the job * `jobCost`: The total cost of the job * `completed`: Array of successfully crawled URLs with their individual costs * `cancelled`: Array of cancelled URLs * `failed`: Array of failed URLs with their failure reasons * `crawlBudget`: The total crawl budget that was set **Example final job summary result:** ```json { "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76", "jobState": "finished", "jobCost": 42, "completed": [ { "url": "https://www.zillow.com/homedetails/1-E-Pier-13th-ST-DOCK-B-Boston-MA-02129/2055047115_zpid/", "cost": 1 }, { "url": "https://www.zillow.com/homes/44269_rid/", "cost": 1 }, { "url": "https://www.zillow.com/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/", "cost": 1 }, { "url": "https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/", "cost": 1 }, ... ], "cancelled": [], "failed": [ { "url": "", "failReason": "" }, { "url": "", "failReason": "< ``` ## Response Details Each successful crawl response includes: * `HTTP` status code. * Response headers. * Response body. * Number of credits used. ## Error Handling The crawler implements several error handling mechanisms: 1. **Failed Requests**: * Failed requests **`don't cost`** any API credits. * Failed requests are included in the final summary with their failure reasons. * The crawler moves forward with other URLs even if some requests fail. 2. **Duplicate URLs**: * The crawler **automatically detects** and **skips** duplicate URLs. This helps prevent unnecessary credit usage and infinite loops. 3. **Budget Exceeded**: * If a single URL's cost **exceeds** the crawl budget, the job will fail to start. * If the cumulative cost exceeds the budget during crawling, the job will stop gracefully. * A final summary will be sent with all successfully crawled URLs. 4. **Webhook Failures**: * The system will retry failed webhook deliveries. * Webhook timeouts are handled gracefully. * Failed webhook deliveries don't affect the crawling process. 5. **Plan-tier Restrictions**: * If a request from a **Free plan** includes a `max_depth` greater than 1 or a `schedule` field, the API will return a `403` error message. To lift these restrictions, the user should upgrade to a paid plan. ## Best Practices 1. **URL Regex Pattern**: * Make sure your regex pattern is **specific enough** to only match the URLs you want to crawl. * Consider extracting both full URLs and relative URLs. * Test your regex pattern before starting a large crawl job with tools like [regex101](https://regex101.com/). 2. **Crawl Budget vs Max Depth**: * Use `crawl_budget` when you want to control costs. * Use `max_depth` when you want to control how deep the crawler goes. * Consider using **both** to ensure you don't exceed your budget while maintaining depth control. 3. **API Parameters**: * Use `country_code` to specify the country for the requests. * Add any other [ScraperAPI parameters](/asynchronous-api/callbacks-and-api-params.md#api-params) as needed in the `api_params` object. 4. **Callback Webhook**: * Ensure your webhook endpoint can handle the expected load. * Implement proper error handling for failed requests. * Store the results as they come in, as the final summary might be delayed. * Handle both individual page results and the final summary. * Implement proper timeout handling (default timeout is 2 hours). --- # Agent Instructions This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com. ## Querying This Documentation If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question. Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter: ``` GET https://docs.scraperapi.com/scraperapi-crawler-v2.0/crawler-api/callbacks-errors-and-best-practices.md?ask=&goal= ``` `ask` is the immediate question: it should be specific, self-contained, and written in natural language. `goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation. Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.