Job Lifecycle
Job Creation
You can initiate a Crawler Job by sending a POST request to https://crawler.scraperapi.com/job with a request body that looks like this:
{
  "api_key": "<YOUR API KEY>",
  "start_url": "https://www.zillow.com/homes/44269_rid/",
  "max_depth": 5,
  "crawl_budget": 50,
  "url_regexp_include": ".*", // Use .* to crawl all pages on the site.
  "url_regexp_exclude": ".*/product/.*", // Optional parameter. Leave empty to include all pages on the site.
  "api_params": {
    "country_code": "us"
  },
  "callback": {
    "type": "webhook",
    "url": "<YOUR CALLBACK WEBHOOK URL>"
  },
  "enabled": true,
  "schedule": {
    "name": "NAME_OF_CRAWLER", // Name of the crawler.
    "interval": "once" // once, hourly, daily, weekly, monthly.
  }
}

Here's a full example in Python:
import requests

url = "https://crawler.scraperapi.com/job"

payload = {
    "api_key": "API_KEY",
    "start_url": "https://www.zillow.com/homes/44269_rid/",
    "max_depth": 5,
    "crawl_budget": 50,
    "url_regexp_include": ".*",
    "api_params": {
        "country_code": "us"
    },
    "callback": {
        "type": "webhook",
        "url": "<YOUR CALLBACK WEBHOOK URL>"
    }
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response content:", response.text)

Control Properties
START_URL
REQUIRED
The URL that serves as the starting point of the crawl.
MAX_DEPTH
EITHER THIS OR CRAWL_BUDGET MUST BE SET
Maximum depth level of the crawling task. The start URL is at depth 0.
CRAWL_BUDGET
EITHER THIS OR MAX_DEPTH MUST BE SET
The maximum number of ScraperAPI credits that the crawling task should consume.
URL_REGEXP_INCLUDE
REQUIRED
This regexp is used to extract additional URLs to crawl from each page the crawler visits. Use .* to crawl all pages on the site. You can use tools like regex101 to debug your pattern (see the sketch after these property descriptions for a quick local check).
URL_REGEXP_EXCLUDE
OPTIONAL
Enter a regex pattern to skip certain URLs. Any URL that matches this pattern will not be crawled. Example:
.*/product/.*
Leave the field empty to crawl all URLs.
API_PARAMS
OPTIONAL
Control parameters for each individual scrape attempt. The list of supported parameters can be found here.
CALLBACK
REQUIRED
Currently, only webhook callbacks are supported. The results of both successful and failed scrape attempts throughout the crawling job will be streamed to the specified webhook. Once the job is complete, a summary of the entire crawling task will also be sent.
ENABLED
OPTIONAL
When set to true, the crawler runs according to the schedule/interval settings. If set to false, the crawler will not run; only the crawler configuration will be created.
Defaults to true if not specified.
SCHEDULE
OPTIONAL
Defines an optional crawl schedule. Includes name (the name of the crawler) and interval (when it should run: once, hourly, daily, weekly, or monthly). Refer to Job Creation for an example.
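Before submitting a job, it can help to sanity-check your include and exclude patterns against a few URLs you expect the crawler to discover. The sketch below does this locally with Python's re module; the sample URLs are illustrative, and the crawler evaluates the patterns server-side, so treat this only as a rough preview.

import re

url_regexp_include = ".*"             # crawl all pages on the site
url_regexp_exclude = ".*/product/.*"  # skip product pages

# Illustrative URLs you expect the crawler to discover.
candidate_urls = [
    "https://www.example.com/category/shoes/",
    "https://www.example.com/product/blue-sneaker/",
]

for candidate in candidate_urls:
    included = re.search(url_regexp_include, candidate) is not None
    excluded = re.search(url_regexp_exclude, candidate) is not None
    print(candidate, "-> crawled" if included and not excluded else "-> skipped")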
Job Management
Starting a Job
When you initiate a crawler job, you'll receive a response with the following format:
{
  "status": "initiated",
  "jobId": "<UNIQUE_JOB_ID>"
}
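Continuing the Python example from Job Creation, you might capture the returned jobId for later use, for instance to cancel the job as shown below (a minimal sketch; the variable names are illustrative):

# Parse the creation response shown above and keep the job ID.
job = response.json()
job_id = job["jobId"]
print("Created crawler job:", job_id)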
Cancelling a Job
You can cancel a running job by sending a DELETE request to:
DELETE https://crawler.scraperapi.com/job/<JOB_ID>
This will return the following response:
{
  "status": "OK",
  "message": "Job ID: <JOB_ID> has been cancelled"
}
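In Python, the cancellation call might look like the following minimal sketch. It reuses the jobId captured above; only the DELETE request itself is shown, without any additional parameters.

import requests

# job_id is the jobId returned when the crawler job was created.
job_id = "<JOB_ID>"
cancel_url = f"https://crawler.scraperapi.com/job/{job_id}"

response = requests.delete(cancel_url)
print("Status code:", response.status_code)
print("Response:", response.text)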
Job States
A crawler job can be in one of the following states:
delayed: The job is waiting to be processed.
running: The job is currently being processed.
completed: The job has finished successfully.
failed: The job has failed.
cancelled: The job was cancelled.
in delivery: The job results are being delivered.
delivered: The job results have been delivered.
Job Summary
After the crawler job finishes, a summary is sent to the webhook specified during job setup. Here’s what that might look like:
{
  "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76",
  "jobState": "finished",
  "jobCost": 42,
  "completed": [
    {
      "url": "https://www.zillow.com/homedetails/1-E-Pier-13th-ST-DOCK-B-Boston-MA-02129/2055047115_zpid/",
      "cost": 1
    },
    {
      "url": "https://www.zillow.com/homes/44269_rid/",
      "cost": 1
    },
    {
      "url": "https://www.zillow.com/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/",
      "cost": 1
    },
    {
      "url": "https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/",
      "cost": 1
    },
    ...
  ],
  "cancelled": [],
  "failed": [
    {
      "url": "https://www.zillow.com/homedetails/49-Prince-St-2-Boston-MA-02130/59131187_zpid/",
      "failReason": "<ERROR MESSAGE>"
    },
    {
      "url": "https://www.zillow.com/homedetails/55-Devon-St-APT-6-Dorchester-MA-02121/71497287_zpid/",
      "failReason": "<ERROR MESSAGE>"
    }
  ],
  "crawlBudget": 55
}

Note: Crawled results are stored for up to 7 days. For scheduled crawls, new results replace the previously stored ones.
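If you want to process callbacks programmatically, a minimal webhook receiver might look like the sketch below. It uses Python's built-in http.server purely for illustration; the port, the host, and the assumption that streamed results and the final summary arrive as JSON POST bodies are illustrative rather than guaranteed by the API.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlerWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the POST body sent by the crawler callback.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)

        try:
            payload = json.loads(body)
        except ValueError:
            payload = None

        # A job summary (as shown above) carries jobState plus completed/failed lists.
        if isinstance(payload, dict) and "jobState" in payload:
            print("Job", payload.get("jobId"), "finished, cost:", payload.get("jobCost"))
            for item in payload.get("failed", []):
                print("Failed:", item.get("url"), "-", item.get("failReason"))
        else:
            # Otherwise treat it as a streamed per-page scrape result.
            print("Received scrape result payload")

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Expose this endpoint publicly and use its URL as the callback webhook URL.
    HTTPServer(("0.0.0.0", 8000), CrawlerWebhook).serve_forever()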