# Job Lifecycle

## Job Creation

You can initiate a Crawler Job by sending a **`POST`** request to <https://crawler.scraperapi.com/job> with a request body that looks like this:

```json
{
  "api_key": "<YOUR API KEY>",
  "start_url": "https://www.zillow.com/homes/44269_rid/",
  "max_depth": 5, // Free plan supports max_depth = 1 only.
  "crawl_budget": 50,
  "url_regexp_include": ".*", //Use .* to crawl all pages on the site.
  "url_regexp_exclude": ".*/product/.*", //Optional parameter. Leave empty to include all pages on the site.
  "api_params": {
    "country_code": "us"
  },
  "callback": {
    "type": "webhook",
    "url": "<YOUR CALLBACK WEBHOOK URL>"
  },
  "enabled": true,
  "schedule": {
    "name": "NAME_OF_CRAWLER", // Name of the crawler.
    "interval": "weekly" //once, hourly, daily, weekly, monthly. Scheduling is available on paid plans only.
  }
}
```

Note that the inline `//` comments above are for illustration only; plain JSON does not support comments, so remove them from an actual request body. Here's a full example in Python:

```python
import requests

url = "https://crawler.scraperapi.com/job"

payload = {
    "api_key": "API_KEY",
    "start_url": "https://www.zillow.com/homes/44269_rid/",
    "max_depth": 5,
    "crawl_budget": 50,
    "url_regexp_include": ".*",
    "api_params": {
        "country_code": "us"
    },
    "callback": {
        "type": "webhook",
        "url": "YYYYYY"
    }
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response content:", response.text)
```

## Control Properties

| Parameter                | Requirement                                   | Description                                                                                                                                                                                                                                                                                                            |
| ------------------------ | --------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`API_KEY`**            | REQUIRED                                      | Your API key. You can grab it from your [Dashboard](https://dashboard.scraperapi.com/).                                                                                                                                                                                                                               |
| **`START_URL`**          | REQUIRED                                      | The URL that serves as the starting point of the crawl.                                                                                                                                                                                                                                                               |
| **`MAX_DEPTH`**          | EITHER THIS OR **`CRAWL_BUDGET`** MUST BE SET | Maximum depth level of the crawling task. The start URL is at depth 0. On the **Free Plan**, `max_depth` cannot exceed **1**. Paid plans can configure deeper crawls up to the supported maximum.                                                                                                                      |
| **`CRAWL_BUDGET`**       | EITHER THIS OR **`MAX_DEPTH`** MUST BE SET    | The **maximum** number of ScraperAPI credits that the crawling task may consume.                                                                                                                                                                                                                                       |
| **`URL_REGEXP_INCLUDE`** | REQUIRED                                      | A regular expression used to extract additional URLs to crawl from each page the crawler visits. Use `.*` to crawl all pages on the site. Tools like [regex101](https://regex101.com/) are handy for debugging patterns.                                                                                              |
| **`URL_REGEXP_EXCLUDE`** | OPTIONAL                                      | <p>A regex pattern for URLs to skip. Any URL that matches this pattern will not be crawled. Example:<br><code>.*/product/.*</code><br>Leave the field empty to crawl all URLs.</p>                                                                                                                                     |
| **`API_PARAMS`**         | OPTIONAL                                      | Control parameters applied to each individual scrape attempt. The list of supported parameters can be found [here](/asynchronous-api/callbacks-and-api-params.md#api-params).                                                                                                                                         |
| **`CALLBACK`**           | REQUIRED                                      | Currently, only `webhook` callbacks are supported. The results of both **successful** and **failed** scrape attempts throughout the crawling job are streamed to the specified webhook. Once the job is complete, a summary of the entire crawling task is also sent.                                                 |
| **`ENABLED`**            | OPTIONAL                                      | <p>When set to <code>true</code>, the crawler runs per the schedule/interval settings. If <code>false</code>, the crawler will not run; only the crawler config is created.<br>Defaults to <code>true</code> if not specified.</p>                                                                                     |
| **`SCHEDULE`**           | OPTIONAL                                      | <p>Defines an optional crawl schedule. Includes <code>name</code> (name of the crawler) and <code>interval</code> (when it should run: once, hourly, daily, weekly, or monthly). Refer to <a data-mention href="#job-creation">#job-creation</a>.<br><br>Scheduling is <strong>only</strong> available on paid plans.</p> |

{% hint style="warning" %}
**Free Plan limitations**

* **max\_depth:** cannot exceed `1` (seed URL plus direct links). Deeper crawls require a **paid plan**.
* **schedule:** recurring schedules (hourly/daily/weekly/monthly) are not available. Only one-time crawls are supported.

*Attempting to exceed these limits will return a `403` error. Upgrade to a paid plan to run deeper crawls and enable scheduling.*
{% endhint %}

***

## Job Management

### Starting a Job

When you initiate a crawler job, you'll receive a response with the following format:

```json
{
    "status": "initiated",
    "jobId": "<UNIQUE_JOB_ID>"
}
```
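
If you are using the Python example above, you can pull the `jobId` out of this response for later management calls (such as cancelling the job). A minimal sketch, assuming the creation request succeeded and returned the JSON shown:

```python
# Continuing the Python example above: keep the job ID for later management calls.
data = response.json()   # e.g. {"status": "initiated", "jobId": "<UNIQUE_JOB_ID>"}
job_id = data["jobId"]

print("Job state:", data["status"])
print("Job ID:", job_id)
```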

### Cancelling a Job

You can cancel a running job by sending a `DELETE` request to:

```bash
DELETE https://crawler.scraperapi.com/job/<JOB_ID>
```

This will return the following response:

```json
{
    "status": "OK",
    "message": "Job ID: <JOB_ID> has been cancelled"
}
```
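
The same cancellation call can be issued from Python. A minimal sketch, assuming `job_id` holds the identifier returned at job creation; no additional parameters are documented for this endpoint on this page, so none are added here:

```python
import requests

job_id = "<JOB_ID>"  # the jobId returned when the job was initiated

# Send the DELETE request to cancel the running job.
response = requests.delete(f"https://crawler.scraperapi.com/job/{job_id}")

print("Status code:", response.status_code)
print("Response:", response.text)
```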

## Job States

A crawler job can be in one of the following states:

* `delayed`: The job is waiting to be processed.
* `running`: The job is currently being processed.
* `completed`: The job has finished successfully.
* `failed`: The job has failed.
* `cancelled`: The job was cancelled.
* `in delivery`: The job results are being delivered.
* `delivered`: The job results have been delivered.

## Job Summary

After the crawler job finishes, a summary is sent to the webhook specified during job setup. Here’s what that might look like:

```json
{
    "jobId": "e25c4cf9-b521-4f97-8e0d-ff2220756a76",
    "jobState": "finished",
    "jobCost": 42,
    "completed": [
        {
            "url": "https://www.zillow.com/homedetails/1-E-Pier-13th-ST-DOCK-B-Boston-MA-02129/2055047115_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homes/44269_rid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/17-19-Beech-Glen-St-Roxbury-MA-02119/452661181_zpid/",
            "cost": 1
        },
        {
            "url": "https://www.zillow.com/homedetails/230-232-Washington-St-2-Boston-MA-02108/2084793714_zpid/",
            "cost": 1
        },
        ...
    ],
    "cancelled": [],
    "failed": [
        {
            "url": "<https://www.zillow.com/homedetails/49-Prince-St-2-Boston-MA-02130/59131187_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        },
        {
            "url": "<https://www.zillow.com/homedetails/55-Devon-St-APT-6-Dorchester-MA-02121/71497287_zpid/>",
            "failReason": "<ERROR MESSAGE>"
        }],
    "crawlBudget": 55
}
```
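
Because both the per-page results and this final summary are delivered to the webhook configured at job creation, your callback URL must accept `POST` requests. Below is a minimal sketch of such a receiver using Flask; Flask itself, the route path, and the port are assumptions, while the `jobId`, `jobState`, `completed`, and `failed` fields come from the summary format above:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/crawler-webhook", methods=["POST"])  # hypothetical callback path
def crawler_webhook():
    payload = request.get_json(force=True, silent=True) or {}

    # Per-page results and the final summary both arrive here; the summary
    # carries the fields shown above (jobId, jobState, completed, failed, ...).
    if "jobState" in payload:
        print("Job", payload.get("jobId"), "finished in state", payload.get("jobState"))
        print("Completed pages:", len(payload.get("completed", [])))
        print("Failed pages:", len(payload.get("failed", [])))

    return "", 200


if __name__ == "__main__":
    app.run(port=8000)  # assumption: expose this port via your public webhook URL
```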

{% hint style="warning" %}
**Note:** Crawled results are stored for up to 7 days. For scheduled crawls, new results replace the previously stored ones.
{% endhint %}

