Crawler UI
Crawler Workspace
You can access the Crawler UI from the navigation menu on the left after logging in with your account.

This section is the starting point for creating a new crawler, and it’s also where all your active and inactive crawlers will be listed. You can quickly review each crawler’s details and status, making it easy to track what’s running and manage what needs attention.

Crawler Setup
To launch a new Crawler, start by selecting the "New Crawler" button in the Crawler Workspace. This will redirect you to the Crawler Setup page, where you specify the crawler’s parameters and behavior.
At the top of the setup page, specify the URL to Crawl, which defines the starting point for your crawler, and optionally give it a name.

Crawler Options
The Crawler Options view gives you full control over how your crawler behaves. Here you can set the link depth, define which paths to include and/or exclude, choose a callback URL, set cost limits, specify how often the crawler should run, and decide the output format. This section is where you shape the actual crawling logic.

Link depth
REQUIRED
Controls how many levels of links the crawler follows from the starting URL. Deeper crawls cover more pages but use more credits.
Include path
REQUIRED
A regex pattern for URLs to include. The crawler only follows URLs that match this pattern. Use .* to crawl all pages on the site. Tools such as regex101 are helpful for building and debugging your pattern.
Exclude path
OPTIONAL
A regex pattern for URLs the crawler should skip. Any URL matching this pattern won’t be crawled.
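To see how the include and exclude patterns combine, the short sketch below checks a few sample URLs against one of each. The patterns and URLs are illustrative placeholders, and the crawler's own matching may differ in detail (for example, full-match versus substring matching).

```python
import re

# Illustrative patterns: include product pages, exclude anything under /blog/.
include_pattern = re.compile(r"https://example\.com/products/.*")
exclude_pattern = re.compile(r".*/blog/.*")

sample_urls = [
    "https://example.com/products/shoes",
    "https://example.com/blog/news",
    "https://example.com/about",
]

for url in sample_urls:
    # A URL is followed only if it matches the include pattern
    # and does not match the exclude pattern.
    followed = bool(include_pattern.search(url)) and not exclude_pattern.search(url)
    print(f"{url} -> {'crawl' if followed else 'skip'}")
```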
Callback URL
OPTIONAL
Specify a webhook URL that should be called after the crawler completes. The crawler will send the results to this endpoint.
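If you want to test the callback flow before pointing it at a production system, a tiny local receiver like the one below can log what arrives. This is a minimal sketch that assumes the results are delivered as an HTTP POST (possibly JSON); check the API reference for the exact payload format, and remember the URL must be publicly reachable for the crawler to call it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body sent to the callback URL.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)

        # Pretty-print JSON payloads; fall back to a byte count otherwise.
        try:
            print(json.dumps(json.loads(body), indent=2))
        except (ValueError, UnicodeDecodeError):
            print(f"Received {length} bytes (non-JSON payload)")

        # Acknowledge receipt so the sender does not retry.
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Expose this publicly (e.g. via a tunnel) and use that URL as the Callback URL.
    HTTPServer(("0.0.0.0", 8000), CallbackHandler).serve_forever()
```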
Maximum cost in credits
REQUIRED
Sets a maximum number of credits the crawler may use. The job stops once this limit is reached to avoid unexpected costs.
Crawling frequency
REQUIRED
Defines the crawler’s run schedule. Available options:
- Run crawler job once (now)
- Crawler scheduling disabled (the crawler will not run; only the crawler configuration is created)
- Hourly
- Daily
- Weekly
- Monthly
Output format
REQUIRED
Choose how results are returned. Raw HTML is the default. JSON/CSV are available for Amazon, Google, eBay, Redfin, and Walmart. Markdown/Text are LLM-friendly and ideal for model training.
Notification Preferences
This section allows you to choose how often you want to be notified when a crawler run completes. You can opt out entirely (Never), receive an alert after each job (With every run), or get Daily/Weekly summaries that aggregate multiple runs into a single email.

Advanced Options
The Advanced Options section exposes request-level parameters that are also available in the API. Here you can disable redirect-following, enable retries for 404 responses (useful if the target domain is known to return 'fake' 404s), or turn on JavaScript rendering for JS-heavy domains. Advanced bypassing adds extra anti-bot bypass logic; use it only when necessary. You can also route traffic through residential or mobile IPs, specify a session for session-based crawling, and set a country_code for geo-targeting. Device Type lets you simulate either a Desktop or a Mobile device.
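To give a rough sense of how these options translate into request-level parameters, the sketch below builds a hypothetical API call. The endpoint and almost all parameter names are placeholders invented for illustration (only country_code appears on this page), so the real names, values, and request shape should be taken from the API reference.

```python
import requests

# Illustrative only: the endpoint and most parameter names below are placeholders,
# not the actual API contract. country_code is the one name mentioned on this page.
API_ENDPOINT = "https://api.example.com/crawl"  # placeholder endpoint

params = {
    "url": "https://example.com",
    "render_js": True,               # placeholder: JavaScript rendering
    "follow_redirects": False,       # placeholder: disable redirect-following
    "retry_404": True,               # placeholder: retry 'fake' 404 responses
    "proxy_type": "residential",     # placeholder: residential/mobile IP routing
    "session": "my-session-1",       # placeholder: session-based crawling
    "country_code": "us",            # geo-targeting, as described above
    "device_type": "desktop",        # placeholder: Desktop/Mobile simulation
}

response = requests.post(API_ENDPOINT, json=params, timeout=60)
print(response.status_code)
```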

Crawler Job Details
When you open an existing crawler from the Crawler Workspace, you’re taken to a page that shows an overview of the crawler along with its recent activity and configuration. At the top, you'll see the latest Status, Created time, Avg. cost per job, and Total credits spent. The table below shows all jobs associated with this crawler. It includes the job ID, last run timestamp, status (Done, Running or Failed) and the target domain. You can sort the table or use the search bar to find specific jobs.

Clicking on a Job ID takes you to a results page that shows an overview of the crawl. At the very top you'll see the job's status, completion time, and the crawl summary (URLs crawled, Credits used, and cancelled or failed items). Underneath that is a list of every URL processed in the job, including when it was crawled and its status.

Note: Crawled results are stored for up to 7 days. For scheduled crawls, new results replace the previously stored ones.