To ensure a higher rate of successful requests when using our scraper, we’ve built a new product, Async Scraper. Rather than making a request to our endpoint and waiting for the response, you submit a scraping job to this endpoint and later collect the data from our status endpoint.
Scraping websites can be a difficult process; it takes numerous steps and significant effort to get through some sites’ protection, which can be hard to fit within the timeout constraints of synchronous APIs. The Async Scraper will keep working on your requested URLs until we have achieved a 100% success rate (when applicable), then return the data to you.
Async Scraping is the recommended way to scrape pages when success rate on difficult sites is more important to you than response time (e.g. you need a set of data periodically).
How to use
Submit an async job
A simple example showing how to submit a job for scraping and receive a status endpoint URL through which you can poll for the status (and later the result) of your scraping job:
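For instance, here is a minimal sketch in Python using the requests library, assuming the job endpoint at https://async.scraperapi.com/jobs accepts apiKey and url fields in the POST body (YOUR_API_KEY is a placeholder for your own key):

import requests

# Submit an async scraping job.
response = requests.post(
    "https://async.scraperapi.com/jobs",
    json={
        "apiKey": "YOUR_API_KEY",
        "url": "https://example.com",
    },
)

job = response.json()
print(job["statusUrl"])  # poll this URL for the status and, later, the result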
You can also send POST requests to the Async scraper by using the parameter “method”: “POST”. Here is an example of how to make a POST request to the Async scraper:
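A sketch of the same call with the method parameter added (the target URL is illustrative):

import requests

# Ask the Async scraper to issue a POST request against the target URL.
response = requests.post(
    "https://async.scraperapi.com/jobs",
    json={
        "apiKey": "YOUR_API_KEY",
        "url": "https://example.com/submit",  # illustrative target
        "method": "POST",
    },
)

print(response.json()["statusUrl"])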
Note the statusUrl field in the response. That is a personal URL to retrieve the status and results of your scraping job. Invoking that endpoint provides you with the status first:
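While the job is still in progress, the status response looks along these lines (the "running" status value and field contents are illustrative):

{
  "id": "0962a8e0-5f1a-4e14-bf8c-5efcc18f0953",
  "status": "running",
  "statusUrl": "https://async.scraperapi.com/jobs/0962a8e0-5f1a-4e14-bf8c-5efcc18f0953",
  "url": "https://example.com"
}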
You can include a meta object in your request to store custom data (your own request ID for example), which will be returned in the response as well.
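For example (the contents of meta are entirely yours; requestId is just a hypothetical key):

import requests

# Attach custom metadata to the job; it is echoed back in the job responses.
response = requests.post(
    "https://async.scraperapi.com/jobs",
    json={
        "apiKey": "YOUR_API_KEY",
        "url": "https://example.com",
        "meta": {"requestId": "my-internal-id-123"},
    },
)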
Once your job is finished, the response will change and will contain the results of your scraping job:
{
  "id": "0962a8e0-5f1a-4e14-bf8c-5efcc18f0953",
  "status": "finished",
  "statusUrl": "https://async.scraperapi.com/jobs/0962a8e0-5f1a-4e14-bf8c-5efcc18f0953",
  "url": "https://example.com",
  "response": {
    "headers": {
      "date": "Thu, 14 Apr 2022 11:10:44 GMT",
      "content-type": "text/html; charset=utf-8",
      "content-length": "1256",
      "connection": "close",
      "x-powered-by": "Express",
      "access-control-allow-origin": "undefined",
      "access-control-allow-headers": "Origin, X-Requested-With, Content-Type, Accept",
      "access-control-allow-methods": "HEAD,GET,POST,DELETE,OPTIONS,PUT",
      "access-control-allow-credentials": "true",
      "x-robots-tag": "none",
      "sa-final-url": "https://example.com/",
      "sa-statuscode": "200",
      "etag": "W/\"4e8-Sjzo7hHgkd15I/TYxuW15B7HwEc\"",
      "vary": "Accept-Encoding"
    },
    "body": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset=\"utf-8\" />\n <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n <style type=\"text/css\">\n body {\n background-color: #f0f0f2;\n margin: 0;\n padding: 0;\n font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n \n }\n div {\n width: 600px;\n margin: 5em auto;\n padding: 2em;\n background-color: #fdfdff;\n border-radius: 0.5em;\n box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n }\n a:link, a:visited {\n color: #38488f;\n text-decoration: none;\n }\n @media (max-width: 700px) {\n div {\n margin: 0 auto;\n width: auto;\n }\n }\n </style> \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.</p>\n <p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>\n</div>\n</body>\n</html>\n",
    "statusCode": 200
  }
}
Please note that the response for an async job is stored for up to 72 hours (24 hours guaranteed) or until you retrieve the results, whichever comes first. If you do not retrieve the results in time, they will be deleted from our side and you will have to submit another request for the same job.
If callbacks are used and the results are successfully delivered, we automatically delete the results.
Callbacks
Using a status URL is a great way to test the API or get started quickly, but some customer environments require more robust solutions, so we implemented callbacks. Currently only webhook callbacks are supported, but we are planning to introduce more over time (e.g. direct database callbacks, AWS S3, etc.).
With a callback you don’t need to poll the status URL (although you still can) to fetch the status and results of the job. Once the job is finished, our system will invoke the provided webhook URL with the same content the status URL provides.
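A sketch of a job submission with a webhook callback, assuming the callback is declared as an object with type and url fields (check the API reference for the exact shape):

import requests

# Submit a job whose results will be delivered to the webhook URL.
response = requests.post(
    "https://async.scraperapi.com/jobs",
    json={
        "apiKey": "YOUR_API_KEY",
        "url": "https://example.com",
        "callback": {
            "type": "webhook",
            "url": "https://yourcompany.com/scraperapi",
        },
    },
)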
Just replace the https://yourcompany.com/scraperapi URL with your preferred endpoint. You can even add basic auth to the URL in the following format: https://user:pass@yourcompany.com/scraperapi
By default, we'll call the webhook URL you provide for successful requests only. If you'd like to receive data on failed attempts as well, include the expectsUnsuccessReport: true parameter in your request structure.
An example of using callbacks that report on the failed attempts as well:
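Assuming the same callback shape as in the previous sketch:

import requests

# Same webhook setup, plus reports on failed attempts.
response = requests.post(
    "https://async.scraperapi.com/jobs",
    json={
        "apiKey": "YOUR_API_KEY",
        "url": "https://example.com",
        "expectsUnsuccessReport": True,
        "callback": {
            "type": "webhook",
            "url": "https://yourcompany.com/scraperapi",
        },
    },
)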
Note: The system will try to invoke the webhook URL 3 times, then cancel the job. Please make sure that the webhook URL is reachable from the public internet and can handle the traffic you need.
Hint: Webhook.site is a free online service to test webhooks without requiring you to build a complex infrastructure.
API Parameters
You can use the usual API parameters the same way you’d use them with our synchronous API. These parameters should go into an apiParams object inside the POST data, e.g.:
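A sketch using parameter names from the synchronous API (country_code and device_type are assumptions here; pass whichever parameters you already rely on):

import requests

# Forward regular API parameters through the apiParams object.
response = requests.post(
    "https://async.scraperapi.com/jobs",
    json={
        "apiKey": "YOUR_API_KEY",
        "url": "https://example.com",
        "apiParams": {
            "country_code": "us",
            "device_type": "desktop",
        },
    },
)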
Batch jobs
We have created a separate endpoint that accepts an array of URLs instead of just one to initiate scraping of multiple URLs at the same time: https://async.scraperapi.com/batchjobs. The API is almost the same as the single-job endpoint, but we expect an array of strings in the urls field instead of a string in url.
We recommend sending a maximum of 50,000 URLs in one batch job.
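A sketch of a batch submission; we're assuming here that the response contains one job object (with its own statusUrl) per submitted URL:

import requests

# Submit several URLs in one batch job via the batchjobs endpoint.
response = requests.post(
    "https://async.scraperapi.com/batchjobs",
    json={
        "apiKey": "YOUR_API_KEY",
        "urls": [
            "https://example.com/page-1",
            "https://example.com/page-2",
        ],
    },
)

# Assumption: one job object per URL.
for job in response.json():
    print(job["statusUrl"])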
Decoding
Responses returned by the Async API for binary requests are Base64-encoded, so you need to decode the data before using it. This encoding allows the binary data to be sent as a text string, which can then be decoded back into its original form when you want to use it.
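A minimal Python sketch, assuming the Base64 payload arrives in the response.body field just like the HTML body shown earlier (YOUR_JOB_ID is a placeholder; use the statusUrl of your own job):

import base64
import requests

# Retrieve a finished binary job and decode its Base64-encoded body.
status = requests.get(
    "https://async.scraperapi.com/jobs/YOUR_JOB_ID"
).json()

with open("output.bin", "wb") as f:
    f.write(base64.b64decode(status["response"]["body"]))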