Job Lifecycle
Job Creation
You can initiate a Crawler Job by sending a POST request to https://crawler.scraperapi.com/job with a request body that looks like this:
{
  "api_key": "<YOUR API KEY>",
  "start_url": "https://www.zillow.com/homes/44269_rid/",
  "max_depth": 5,
  "crawl_budget": 50,
  "url_regexp": "\\"(?<full_url>https:\\/\\/(www.)?zillow.com\\/homedetails\\/[^\\"]+)|href=\\"(?<relative_url>\\/homedetails\\/[^\\"]+)",
  "api_params": {
    "country_code": "us"
  },
  "callback": {
    "type": "webhook",
    "url": "<YOUR CALLBACK WEBHOOK URL>"
  }
}
Here's a full example in Python:
import requests

url = "https://crawler.scraperapi.com/job"

# Job definition: where to start, how far and how much to crawl, which links
# to follow, and where to deliver the results.
payload = {
    "api_key": "API_KEY",  # your ScraperAPI key
    "start_url": "https://www.zillow.com/homes/44269_rid/",
    "max_depth": 5,
    "crawl_budget": 50,
    "url_regexp": "https://www\\.zillow\\.com/homedetails/[^\"\\s>]+|/homedetails/[^\"\\s>]+",
    "api_params": {
        "country_code": "us"
    },
    "callback": {
        "type": "webhook",
        "url": "YYYYYY"  # your callback webhook URL
    }
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response content:", response.text)
Control Properties
START_URL
REQUIRED
The URL that serves as the starting point of the crawl.
MAX_DEPTH
EITHER THIS OR CRAWL_BUDGET MUST BE SET
Maximum depth level of the crawling task. The start URL is at depth 0.
CRAWL_BUDGET
EITHER THIS OR MAX_DEPTH MUST BE SET
The maximum number of ScraperAPI credits the crawling task may consume.
URL_REGEXP
REQUIRED
A regular expression used to extract additional URLs to crawl from each page the crawler visits (see the sketch after this list).
API_PARAMS
OPTIONAL
Control parameters for each individual scrape attempt. The list of supported parameters can be found here.
CALLBACK
REQUIRED
Currently, only webhook callbacks are supported. The results of both successful and failed scrape attempts throughout the crawling job will be streamed to the specified webhook. Once the job is complete, a summary of the entire crawling task will also be sent.
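To illustrate how url_regexp drives link discovery, here is a minimal sketch that applies the pattern from the Python example above to a fragment of HTML. The HTML snippet and its hrefs are made up for demonstration:
import re

# The same pattern used in the Python example above: it matches absolute
# zillow.com/homedetails/ URLs as well as relative /homedetails/ paths.
url_regexp = r'https://www\.zillow\.com/homedetails/[^"\s>]+|/homedetails/[^"\s>]+'

# A made-up HTML fragment standing in for a crawled page.
html = (
    '<a href="https://www.zillow.com/homedetails/123-Main-St/111_zpid/">A</a>'
    '<a href="/homedetails/456-Oak-Ave/222_zpid/">B</a>'
)

# Every match becomes a candidate URL for the crawler to visit next.
for url in re.findall(url_regexp, html):
    print(url)
# https://www.zillow.com/homedetails/123-Main-St/111_zpid/
# /homedetails/456-Oak-Ave/222_zpid/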
Job Management
Starting a Job
When you initiate a crawler job, you'll receive a response with the following format:
{
  "status": "initiated",
  "jobId": "<UNIQUE_JOB_ID>"
}
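Continuing the Python example above, you can capture the jobId from this response for the management calls that follow (a minimal sketch):
# The POST response from the job-creation example above carries the jobId.
job_id = response.json()["jobId"]
print("Started crawler job:", job_id)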
Cancelling a Job
You can cancel a running job by sending a DELETE request to:
DELETE https://crawler.scraperapi.com/job/<JOB_ID>
This will return the following response:
{
  "status": "OK",
  "message": "Job ID: <JOB_ID> has been cancelled"
}
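Here is a minimal sketch of the same call in Python, assuming you saved the jobId returned when the job was initiated:
import requests

# The jobId returned when the job was created.
job_id = "<JOB_ID>"

# Cancel the running job; the endpoint responds with a confirmation message.
response = requests.delete(f"https://crawler.scraperapi.com/job/{job_id}")
print(response.json())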
Job States
A crawler job can be in one of the following states:
delayed: The job is waiting to be processed.
running: The job is currently being processed.
completed: The job has finished successfully.
failed: The job has failed.
cancelled: The job was cancelled.
in delivery: The job results are being delivered.
delivered: The job results have been delivered.
Job Summary
After the crawler job finishes, a summary is sent to the webhook specified during job setup. Here’s what that might look like:
{
  "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76",
  "jobState": "finished",
  "jobCost": 42,
  "completed": [
    {
      "url": "https://www.zillow.com/homedetails/1-E-Pier-13th-ST-DOCK-B-Boston-MA-02129/2055047115_zpid/",
      "cost": 1
    },
    {
      "url": "https://www.zillow.com/homes/44269_rid/",
      "cost": 1
    },
    {
      "url": "https://www.zillow.com/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/",
      "cost": 1
    },
    {
      "url": "https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/",
      "cost": 1
    },
    ...
  ],
  "cancelled": [],
  "failed": [
    {
      "url": "https://www.zillow.com/homedetails/49-Prince-St-2-Boston-MA-02130/59131187_zpid/",
      "failReason": "<ERROR MESSAGE>"
    },
    {
      "url": "https://www.zillow.com/homedetails/55-Devon-St-APT-6-Dorchester-MA-02121/71497287_zpid/",
      "failReason": "<ERROR MESSAGE>"
    }
  ],
  "crawlBudget": 55
}
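If you need a starting point for the receiving side, here is a minimal sketch of a webhook endpoint using Flask (an assumption; any HTTP server works, and the route path is illustrative). It distinguishes the final summary from streamed per-page results by checking for the jobState field shown above; the exact shape of per-page payloads is not documented here, so inspect what arrives in practice:
from flask import Flask, request

app = Flask(__name__)

# Minimal receiver: the crawler streams scrape results here during the job
# and posts the summary shown above once the job completes.
@app.route("/crawler-webhook", methods=["POST"])
def crawler_webhook():
    payload = request.get_json(force=True, silent=True) or {}
    if "jobState" in payload:
        # The final summary carries jobState, jobCost, completed, failed, etc.
        completed = payload.get("completed", [])
        failed = payload.get("failed", [])
        print(f"Job {payload.get('jobId')} finished: "
              f"{len(completed)} pages completed, {len(failed)} failed, "
              f"cost {payload.get('jobCost')}")
    else:
        # Treated here as a streamed per-page result (shape is an assumption).
        print("Received scrape result payload")
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)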