Page2API is a powerful tool designed for scraping web pages and converting HTML into a well-organized JSON structure.
It can launch long-running scraping sessions through Asynchronous Scraping.
It also supports executing complex browser scenarios and handling pagination.
After you create your account, the first thing you need in order to authenticate and start using the API is your api_key.
It is a randomly generated string that you will find on your Dashboard page, and it looks like this:
0e72feee16180ef1f3f190ae350d74705d6ebec1
https://www.page2api.com/api/v1/scrape
POST
{
"api_key": "YOUR_API_KEY",
"url": "https://www.example.com",
"real_browser": true,
"parse": {
"title_html": "h1",
"link_text": "/html/body/div/p[2]/a >> text",
"link_href": "/html/body/div/p[2]/a >> href"
}
}
{
"result": {
"title_html": "<h1>Example Domain</h1>",
"link_text": "More information...",
"link_href": "https://www.iana.org/domains/example"
},
"request": {
"parse": {
"title_html": "h1",
"link_text": "/html/body/div/p[2]/a >> text",
"link_href": "/html/body/div/p[2]/a >> href"
},
"url": "https://www.example.com",
"real_browser": true
},
"id": 123456,
"pages_parsed": 1,
"cost": 0.002,
"success": true,
"duration": 2.14
}
{
"error" : "Api key was not found."
}
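The request above can be sent from any HTTP client. The following is a minimal Python sketch using only the standard library. The helper names `build_payload` and `scrape` are illustrative and not part of Page2API, and the sketch assumes the endpoint accepts a JSON-encoded body sent with a `Content-Type: application/json` header.

```python
import json
import urllib.request

SCRAPE_URL = "https://www.page2api.com/api/v1/scrape"


def build_payload(api_key, url, parse, real_browser=False):
    """Assemble the request body shown above as a plain dict."""
    return {
        "api_key": api_key,
        "url": url,
        "real_browser": real_browser,
        "parse": parse,
    }


def scrape(payload, timeout=120):
    """POST the payload and return the decoded JSON response.

    Assumption: the API accepts a JSON-encoded request body.
    """
    request = urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    "YOUR_API_KEY",
    "https://www.example.com",
    {
        "title_html": "h1",
        "link_text": "/html/body/div/p[2]/a >> text",
        "link_href": "/html/body/div/p[2]/a >> href",
    },
    real_browser=True,
)
# response = scrape(payload)  # uncomment once a real API key is in place
```

A caller would typically check the `success` field of the response, or the `error` field when the API key is rejected.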
https://www.page2api.com/api/v1/scrape/encoded/{base64_urlsafe_encoded_payload}
GET
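As a rough sketch of how the encoded URL could be produced in Python: the payload is serialized to JSON and then encoded with URL-safe base64. This assumes the encoded payload is the same JSON object used for the POST endpoint and that base64 padding is kept; the helper names are illustrative.

```python
import base64
import json

ENCODED_ENDPOINT = "https://www.page2api.com/api/v1/scrape/encoded/"


def encoded_scrape_url(payload):
    """Return the GET URL carrying the payload as URL-safe base64 JSON."""
    raw = json.dumps(payload).encode("utf-8")
    token = base64.urlsafe_b64encode(raw).decode("ascii")
    return ENCODED_ENDPOINT + token


def decode_token(url):
    """Inverse of encoded_scrape_url, handy for checking what was encoded."""
    token = url[len(ENCODED_ENDPOINT):]
    return json.loads(base64.urlsafe_b64decode(token))


example_payload = {"api_key": "YOUR_API_KEY", "url": "https://www.example.com"}
example_url = encoded_scrape_url(example_payload)
```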
Name | Type | Required | Description |
---|---|---|---|
api_key | string | required | Your Page2API API key. |
url | string | required | The URL of the page that will be scraped. |
user_agent | string | optional | Set a custom user agent that will be used for the request. |
locale | string | optional | Set a custom locale that will be used for the request. Ex: es or pt-BR. |
parse | object | optional | An object that consists of field names and selectors that will extract the data and build the result. If empty, the HTML of the page will be returned. |
batch | object | optional | An object that contains the following properties: urls, concurrency, and merge_results. It makes it possible to scrape web pages in batches with a specific concurrency. |
scenario | array(objects) | optional | A collection of instructions that the browser will execute. |
real_browser | boolean | optional | Use a headless Chrome instance to open the URL. Default: false |
javascript | boolean | optional | Render the JavaScript on the page when using a headless browser (real_browser). Default: true |
import_jquery | boolean | optional | Import the latest version of the jQuery library into the browser instance. Default: false |
window_size | array(integer) | optional | Set a custom window size for the browser. Format: [width, height]. Default: [1920, 1080]. |
wait | integer (seconds) | optional | Just wait, and give the browser some time to rest and meditate on the meaning of life. Max value: 10 (seconds) |
wait_for | string (css/xpath selector) | optional | Wait for a specific element to appear on the page. Max wait: 10 seconds |
wait_until | string (JS snippet) | optional | Wait for a JavaScript snippet to return a truthy value. Max wait: 10 seconds |
cookies | object | optional | Set custom cookies that will be used for the request. |
sanitize | boolean | optional | Remove all whitespaces from the parsed content. Default: true |
raw | boolean / object | optional | Return only the scraping result in the response with a custom format (CSV, JSON, TEXT, HTML). |
absolute_urls | boolean | optional | Ensure that all parsed attributes that contain a URL have absolute paths. Supported attributes: action, archive, background, cite, classid, codebase, data, dsync, formaction, href, icon, longdesc, manifest, poster, profile, src, usemap. Default: true |
log_requests | boolean | optional | Return all network requests. Default: false |
async | boolean | optional | Perform the request asynchronously. Receive the response via the callback URL specified on the profile page. Default: false |
callback_url | string | optional | A custom callback URL for a specific scrape request. Default: the callback URL from the user's profile |
passthrough | string / integer / object | optional | Any data added to this parameter will be returned in the response or sent in any subsequent callbacks. |
request_method | string | optional | Set a custom request method for the request. Possible values: GET, POST, PUT, PATCH, DELETE, HEAD. Default: GET |
post_body | object / string (json) | optional | Set a post body for the request. Example: { "post_body": { "query": "web scraping" }} |
headers | object | optional | Set custom headers that will be used for the request. Example: { "headers": { "Content-Type": "application/json" }} |
refresh_interval | integer (minutes) | optional | Create a scheduled parsing that will run every n minutes. Min: 1, Max: 2592000 |
merge_loops | boolean | optional | Merge the results obtained during the parsing of the paginated views (loops). Default: false |
datacenter_proxy | string | optional | The code of the datacenter proxy to be used for the request. Default: auto |
premium_proxy | string | optional | The code of the premium proxy to be used for the request. |
custom_proxy | string | optional | Provide your own proxy to be used for the request. |
The parse parameter represents an object, where the keys are the names of the fields you want to create and the values are the selectors that will extract the data.
Description | Required | Examples |
---|---|---|
css/xpath selector | required | h1, /html/body/div/p[2]/a |
'>>' concatenated with a selector function | optional | a >> href, a >> text, body >> json |
The selector function can be the name of any attribute of the element, as well as one of the special ones:
text - extracts the text from the element
json - parses the content of an element that contains a JSON object
Note: If no function is specified, the HTML of the element will be returned.
<a href="https://example.com">Example</a>
// The 'parse' parameter:
"parse": {
"link": "a"
}
// The result:
{
"link": "<a href='https://example.com'>Example</a>"
}
// The 'parse' parameter:
"parse": {
"link_href": "a >> href"
}
// The result:
{
"link_href": "https://example.com"
}
In order to extract all elements that share a selector, you must wrap the selector in [ ], like in the following examples:
["/html/body/div/p[*]/a"]
/* without a selector function */
["a >> href"]
/* with a selector function (extract all hrefs) */
<a href='https://example.com'>Example</a>
<a href='https://www.page2api.com'>Page2API</a>
<a href='https://ipapi.co/api'>IpApi</a>
// The 'parse' parameter:
"parse": {
"links": ["a"]
}
// The result:
{
"links": [
"<a href='https://example.com'>Example</a>",
"<a href='https://www.page2api.com'>Page2API</a>",
"<a href='https://ipapi.co/api'>IpApi</a>"
]
}
// The 'parse' parameter:
"parse": {
"links_text": ["a >> text"]
}
// The result:
{
"links_text": [
"Example",
"Page2API",
"IpApi"
]
}
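To make the selector syntax above concrete, here is a small Python sketch that splits a parse value into its selector and selector function. It only illustrates the documented syntax; `split_selector` and `describe` are hypothetical helpers and say nothing about how Page2API parses these values internally.

```python
def split_selector(value):
    """Split "selector >> function" into its two parts.

    Returns (selector, function). The function is None when no ">>" is
    present, which per the text above means the element's HTML is returned.
    """
    selector, separator, function = value.partition(">>")
    return selector.strip(), (function.strip() if separator else None)


def describe(parse_value):
    """Report whether a parse value targets one element or all matches."""
    if isinstance(parse_value, list):
        # A selector wrapped in [ ] extracts every matching element.
        selector, function = split_selector(parse_value[0])
        return {"all": True, "selector": selector, "function": function}
    selector, function = split_selector(parse_value)
    return {"all": False, "selector": selector, "function": function}
```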
This scenario is used if you want to parse elements from repeating structures, for example a list of articles, products, posts and so on. In order to extract the elements mentioned above, you must wrap the whole { name1: selector1, name2: selector2 } structure in [ ], like in the following example:
"parse": {
"posts": [
{
"_parent": ".feed-item",
"title": ".feed-item__title-link >> text",
"link": ".feed-item__title-link >> href",
"author": "span.user-link__name >> text"
}
]
}
Please note that each structure must have a _parent key that will define the parent for the parsed elements:
"_parent": ".feed-item"
<div class='all-posts'>
<div class='post'>
<a class='title' href='/posts/123'>Post one title</a>
<span class='comments'>(3 comments)</span>
<a class='author' href='/author/757'>Author One</a>
</div>
<div class='post'>
<a class='title' href='/posts/234'>Post two title</a>
<span class='comments'>(no comments)</span>
<a class='author' href='/author/347'>Author Two</a>
</div>
<div class='post'>
<a class='title' href='/posts/456'>Post three title</a>
<span class='comments'>(1 comment)</span>
<a class='author' href='/author/923'>Author Three</a>
</div>
</div>
// The 'parse' parameter:
"parse": {
"posts": [
{
"_parent": ".post",
"title": "a.title >> text",
"link": "a.title >> href",
"author": "a.author >> text",
"comments": "span.comments >> text"
}
]
}
// The result:
{
"posts": [
{
"title": "Post one title",
"link": "/posts/123",
"author": "Author One",
"comments": "(3 comments)"
},
{
"title": "Post two title",
"link": "/posts/234",
"author": "Author Two",
"comments": "(no comments)"
},
{
"title": "Post three title",
"link": "/posts/456",
"author": "Author Three",
"comments": "(1 comment)"
}
]
}
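Conceptually, a structure with a _parent key finds every element that matches the parent selector and then resolves the remaining selectors relative to each match. The sketch below reproduces that idea locally with Python's `xml.etree.ElementTree` on a shortened copy of the sample markup. It is only an illustration of the scoping behaviour, not Page2API's implementation, and it relies on the sample being well-formed XML.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class='all-posts'>
  <div class='post'>
    <a class='title' href='/posts/123'>Post one title</a>
    <span class='comments'>(3 comments)</span>
    <a class='author' href='/author/757'>Author One</a>
  </div>
  <div class='post'>
    <a class='title' href='/posts/234'>Post two title</a>
    <span class='comments'>(no comments)</span>
    <a class='author' href='/author/347'>Author Two</a>
  </div>
</div>
"""

root = ET.fromstring(HTML)
posts = []
# Each element matching the "_parent" selector scopes the lookups below it.
for parent in root.findall(".//div[@class='post']"):
    title = parent.find("a[@class='title']")
    posts.append({
        "title": title.text,                               # "a.title >> text"
        "link": title.get("href"),                         # "a.title >> href"
        "author": parent.find("a[@class='author']").text,  # "a.author >> text"
        "comments": parent.find("span[@class='comments']").text,
    })
```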
Tables are parsed automatically; there is no need to specify any selector function for them.
<table class='people'>
<thead>
<tr>
<th>Firstname</th>
<th>Lastname</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td>Smith</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
</tr>
</tbody>
</table>
// The 'parse' parameter:
"parse": {
"people": "table.people"
}
// The result:
{
"people": [
{
"Firstname": "Jill",
"Lastname": "Smith"
},
{
"Firstname": "Eve",
"Lastname": "Jackson"
}
]
}
{
"api_key": "YOUR_API_KEY",
"url": "https://www.indiehackers.com",
"sanitize": true,
"parse": {
"footer": "/html/body/div[1]/div/footer/div[1]/form/p[1] >> text",
"sections": [".posts-section__nav-content >> text"],
"posts": [
{
"_parent": ".feed-item",
"title": ".feed-item__title-link >> text",
"link": ".feed-item__title-link >> href",
"author": "span.user-link__name >> text"
}
],
"side_links": [
{
"_parent": ".news-section__item",
"title": ".news-section__item-title >> text",
"link": "_parent >> href",
"category": "/html/body/div[1]/div/div[2]/div[1]/div/a[*]/div/span[1] >> text"
}
]
}
}
The request_method, post_body, and headers parameters allow you to build custom scraping requests with a specific request method, body, and headers.
For better flexibility, the post_body parameter can be either an object or a JSON string.
// object
"post_body": { "test": "success" }
// JSON string
"post_body": "{ \"test\": \"success\" }"
{
"api_key": "YOUR_API_KEY",
"url": "http://httpbin.org/anything",
"post_body": { "testing": "true" },
"request_method": "POST",
"parse": {
"data": "body >> json"
}
}
{
"result": {
"data": {
"form": {
"testing": "true"
},
"method": "POST",
"headers": {
"Content-Type": "application/x-www-form-urlencoded",
...
},
...
}
},
"request": {
"parse": {
"data": "body >> json"
},
"url": "http://httpbin.org/anything",
"request_method": "POST",
"post_body": {
"testing": "true"
},
"scenario": [
{
"execute": "parse"
}
]
},
"id": 12345,
"pages_parsed": 1,
"cost": 0.00025,
"success": true,
"duration": 1.22
}
{
"api_key": "YOUR_API_KEY",
"url": "http://httpbin.org/anything",
"post_body": "{ \"testing\": \"true\" }",
"request_method": "POST",
"headers": {
"Content-Type": "application/json"
},
"parse": {
"data": "body >> json"
}
}
{
"result": {
"data": {
"data": "{ \"testing\": \"true\" }",
"method": "POST",
"json": {
"testing": "true"
},
"headers": {
"Content-Type": "application/json",
...
},
...
}
},
"request": {
"parse": {
"data": "body >> json"
},
"url": "http://httpbin.org/anything",
"request_method": "POST",
"post_body": "{ \"testing\": \"true\" }",
"headers": {
"Content-Type": "application/json"
},
"scenario": [
{
"execute": "parse"
}
]
},
"id": 12345,
"pages_parsed": 1,
"cost": 0.00025,
"success": true,
"duration": 1.22
}
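The two post_body forms shown above can be built from the same data. The Python sketch below is illustrative only: it constructs both payloads and does not send them. In the sample responses above, the object form arrived at httpbin as form data, while the JSON string combined with a Content-Type header arrived as JSON.

```python
import json

body = {"testing": "true"}

# Form 1: pass the body as an object.
payload_object = {
    "api_key": "YOUR_API_KEY",
    "url": "http://httpbin.org/anything",
    "request_method": "POST",
    "post_body": body,
    "parse": {"data": "body >> json"},
}

# Form 2: pass the body as a JSON string plus an explicit Content-Type header.
payload_string = {
    **payload_object,
    "post_body": json.dumps(body),
    "headers": {"Content-Type": "application/json"},
}
```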
The raw parameter allows the customization of the scraping response.
By default, it returns only the scraping result, in JSON format, without any additional properties.
Name | Type | Required | Description |
---|---|---|---|
format | string | optional | The response format. Possible values: csv, auto. Default: auto (will return JSON, HTML, or TEXT, depending on the presence of the parse parameter) |
key | string | optional | The key that should be used as a source for the response. |
{
"api_key": "YOUR_API_KEY",
"url": "https://www.page2api.com",
"raw": true,
"parse": {
"features": [
{
"_parent": ".feature-container",
"title": "h2 >> text"
}
]
}
}
{
"features": [
{ "title": "Intuitive and powerful API" },
{ "title": "Asynchronous scraping" },
{ "title": "Javascript rendering" },
{ "title": "Scheduled scraping" },
{ "title": "Custom browser scenarios" },
{ "title": "Fast and reliable proxies" }
]
}
Note: As you can see, the response has only the features key, without any additional properties such as duration, cost, etc.
{
"api_key": "YOUR_API_KEY",
"url": "https://www.page2api.com",
"raw": {
"key": "features"
},
"parse": {
"features": [
{
"_parent": ".feature-container",
"title": "h2 >> text"
}
]
}
}
[
{ "title": "Intuitive and powerful API" },
{ "title": "Asynchronous scraping" },
{ "title": "Javascript rendering" },
{ "title": "Scheduled scraping" },
{ "title": "Custom browser scenarios" },
{ "title": "Fast and reliable proxies" }
]
{
"api_key": "YOUR_API_KEY",
"url": "https://www.page2api.com",
"raw": {
"key": "features",
"format": "csv"
},
"parse": {
"features": [
{
"_parent": ".feature-container",
"title": "h2 >> text"
}
]
}
}
title
Intuitive and powerful API
Asynchronous scraping
Javascript rendering
Scheduled scraping
Custom browser scenarios
Fast and reliable proxies
Hint: The example above is helpful when we want to import the result directly into a Spreadsheet without any code.
=IMPORTDATA("https://www.page2api.com/api/v1/scrape/encoded/{urlsafe_base64_encoded_params}")
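Outside a spreadsheet, the same CSV response can be read with a few lines of code. The Python sketch below parses a shortened copy of the sample CSV output with the standard `csv` module; in practice the `csv_text` variable (an illustrative name) would hold the body returned by the API.

```python
import csv
import io

# Shortened copy of the CSV response shown above.
csv_text = """title
Intuitive and powerful API
Asynchronous scraping
Javascript rendering
"""

# The first line is the header, so each row becomes {"title": ...}.
rows = list(csv.DictReader(io.StringIO(csv_text)))
titles = [row["title"] for row in rows]
```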
When scraping a page with real_browser set to true, the API will automatically execute all the JavaScript on that page.
This can be useful for scraping websites that load content dynamically with plain JavaScript or any framework or library such as React, Angular, Vue, or jQuery.
To scrape the page with a headless browser but with JavaScript disabled, set the javascript parameter to false.
{
"api_key": "YOUR_API_KEY",
"url": "https://www.whatismybrowser.com/",
"real_browser": true,
"javascript": false,
"parse": {
"browser": ".string-major >> text",
"javascript": "#javascript-detection >> text"
}
}
Note: this parameter is available only when real_browser is set to true.
Keep in mind that even if javascript is set to false, you can still run your own JavaScript snippets on the page.
A request without a real browser will cost the same as a request with a real browser but with disabled JavaScript.
// This request will use a rest client to fetch the web page.
// It will be faster than the example below but can sometimes be detected.
"real_browser": false
// This request will use a headless chrome with the JavaScript disabled.
// It will be slightly slower than the previous example but will be harder to detect.
"real_browser": true,
"javascript": false
The scenario parameter represents a collection of browser instructions, such as wait, wait_for, execute_js, fill_in, click, execute, and loop.
Note: this parameter is available only when real_browser is set to true and javascript is not disabled.
"scenario" : [
{ "execute_js": "$($('select')[5]).val('yes').change()" },
{ "wait": 0.1 },
{
"loop" : [
{ "wait_for": "li.next a" },
{ "execute": "parse" },
{ "execute_js": "document.getElementById('proxylisttable_next').click()" },
{ "wait": 0.1 }
],
"iterations": 10, // in this case - this parameter is optional
"stop_condition": "document.getElementById('proxylisttable_next').classList.contains('disabled')"
}
]
Note: a loop is just a collection of instructions that are executed in a cycle.
Hint: The most relevant use case for a loop is parsing paginated views.
Command | Description |
---|---|
wait | Tells the browser to take a small break. The value is any integer between 1 and 10 (seconds). |
wait_for | Waits until a specific element appears on the page. The timeout is 10 seconds. |
execute_js | Executes a JS snippet. All JS errors are ignored. |
fill_in | Fills an input natively. Each character is sent separately. |
click | Clicks an element natively. |
execute | Initiates the parsing with the current HTML on the page (used as "execute": "parse"). |
loop | Executes a set of commands in a cycle. |
The wait parameter gives the web page some time (in seconds) to render before the HTML is captured.
This is typically useful when interacting with the web page via the scenario parameter
or when some content is rendered asynchronously after the page loads.
Maximum value: 10 (seconds).
"wait": 2
Note: this parameter is available only when real_browser is set to true and javascript is not disabled.
The wait_for parameter gives the web page some time (in seconds) to render a particular element before the HTML is captured.
Maximum timeout: 10 (seconds).
"wait_for": "li.next"
Note: this parameter is available only when real_browser is set to true and javascript is not disabled.
The wait_until parameter makes the browser wait until a JavaScript snippet returns a truthy value before capturing the HTML.
Maximum timeout: 10 (seconds).
"wait_until": "document.querySelectorAll('.element').length == 10"
// or base64 encoded:
"wait_until": "ZG9jdW1lbnQucXVlcnlTZWxlY3RvckFsbCgnLmVsZW1lbnQnKS5sZW5ndGggPT0gMTA="
Note: this parameter is available only when real_browser is set to true and javascript is not disabled.
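The base64 form of a snippet can be generated with any base64 encoder. As a small Python sketch (the helper name `encode_snippet` is illustrative, and the standard base64 alphabet is an assumption, since the sample value contains no characters that would distinguish it from the URL-safe alphabet):

```python
import base64


def encode_snippet(js):
    """Base64-encode a JavaScript snippet for use in wait_until."""
    return base64.b64encode(js.encode("utf-8")).decode("ascii")


snippet = "document.querySelectorAll('.element').length == 10"
encoded = encode_snippet(snippet)
```

The value of `encoded` matches the base64 string in the wait_until example above.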
The cookies parameter allows you to send custom cookies that will be used for the request.
Format: an object with string values.
"cookies": { "test": "success", "session": "123asdqwe" }
When using this parameter with real_browser set to true, the response will contain the information about all cookies that were set, like in the example below:
{
"api_key": "YOUR_API_KEY",
"url": "http://httpbin.org/cookies?json",
"cookies": { "testing": "true" },
"real_browser": true,
"parse": {
"cookies": "body >> json"
}
}
{
"result": {
"cookies": {
"cookies": {
"testing": "true"
}
}
},
"request": {
"parse": {
"cookies": "body >> json"
},
"url": "http://httpbin.org/cookies?json",
"cookies": {
"testing": "true"
},
"real_browser": true,
"scenario": [
{
"execute": "parse"
}
]
},
"id": 12345,
"pages_parsed": 1,
"cost": 0.002,
"success": true,
"extra": {
"cookies": [
{
"name": "testing",
"value": "true",
"path": "/",
"domain": "httpbin.org",
"expires": null,
"secure": false
}
]
},
"duration": 5.22
}
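The cookies reported under extra.cookies can be fed back into a later request. The Python sketch below collapses that list into the name/value object expected by the cookies parameter. The helper name is hypothetical, and the response shape is taken from the sample above.

```python
def cookies_from_response(response):
    """Collapse extra.cookies into the {name: value} shape of the cookies parameter."""
    entries = response.get("extra", {}).get("cookies", [])
    return {entry["name"]: entry["value"] for entry in entries}


# Trimmed copy of the sample response shown above.
sample_response = {
    "success": True,
    "extra": {
        "cookies": [
            {
                "name": "testing",
                "value": "true",
                "path": "/",
                "domain": "httpbin.org",
                "expires": None,
                "secure": False,
            }
        ]
    },
}

next_request_cookies = cookies_from_response(sample_response)
```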
Page2API can execute custom JavaScript code during the scraping session.
This is useful when you need to interact with the web page before or during parsing.
It is performed via the scenario parameter described earlier.
document.querySelector('.morelink').click()
ZG9jdW1lbnQucXVlcnlTZWxlY3RvcignLm1vcmVsaW5rJykuY2xpY2soKQ==
"scenario" : [
{ "execute_js": "document.querySelector('.morelink').click()" },
{ "wait": 0.1 },
{ "execute": "parse" }
]
Note: this parameter is available only when real_browser is set to true.
Page2API can execute custom JavaScript code during the scraping session.
The executed code can be used as a selector to ease the data extraction process.
This is useful when you need to access data that is stored in the JavaScript code of the page.
js >> raw_or_base64_js_snippet
{
"api_key": "YOUR_API_KEY",
"url": "https://www.page2api.com/",
"real_browser": true,
"parse": {
"title": "js >> $('h1').text().trim()",
"location": "js >> document.location.href",
"js_variable": "js >> let object = { arr: [1, 2] }; object",
"base64_js": "js >> bGV0IG9iaiA9IHsgb25lX3R3bzogWzEsIDJdIH07IG9iag=="
}
}
{
"result": {
"title": "The Ultimate Web Scraping API",
"location": "https://www.page2api.com/",
"js_variable": {
"arr": [1, 2]
},
"base64_js": {
"one_two": [1, 2]
}
}
}
Note: this parameter is available only when real_browser is set to true.
Page2API can fill inputs natively by using the fill_in scenario command.
The format is an array, where the first element is a css/xpath selector and the second one is the value.
"scenario" : [
{ "fill_in": ["input#search", "Page2API"] },
{ "wait_for": ".search-results" },
{ "execute": "parse" }
]
Note: this parameter is available only when real_browser is set to true.
Page2API can click natively on visible elements from the page by using the click scenario command.
The format is a string that represents a css/xpath selector of the element that must be clicked.
"scenario" : [
{ "fill_in": ["input#search", "Page2API"] },
{ "click": "button[type=submit]" },
{ "execute": "parse" }
]
Note: this parameter is available only when real_browser is set to true.
Page2API can handle paginated views, such as the classic ones with links to the pages, as well as infinite scrolls.
This is done via a loop command inside a scenario.
Note: a loop is just a collection of instructions that are executed in a cycle.
"scenario" : [
{
"loop" : [
{ "wait_for": "li.next a" },
{ "execute": "parse" },
{ "execute_js": "document.getElementById('proxylisttable_next').click()" }
],
"iterations": 10, // in this case - this parameter is optional
"stop_condition": "document.getElementById('proxylisttable_next').classList.contains('disabled')"
}
]
Note: this parameter is available only when real_browser is set to true.
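When several paginated sites are scraped, the loop structure above can be generated instead of written by hand. The following Python sketch builds the same scenario from its parts; `pagination_scenario` is a hypothetical helper, and the key names are taken from the example above.

```python
def pagination_scenario(wait_selector, next_click_js, stop_condition, iterations=None):
    """Build a scenario with one loop: wait, parse, then go to the next page."""
    loop = {
        "loop": [
            {"wait_for": wait_selector},
            {"execute": "parse"},
            {"execute_js": next_click_js},
        ],
        "stop_condition": stop_condition,
    }
    # The example above marks iterations as optional when a stop condition exists.
    if iterations is not None:
        loop["iterations"] = iterations
    return [loop]


scenario = pagination_scenario(
    "li.next a",
    "document.getElementById('proxylisttable_next').click()",
    "document.getElementById('proxylisttable_next').classList.contains('disabled')",
    iterations=10,
)
```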
Name | Type | Required | Description |
---|---|---|---|
urls | string / array | required | The URLs that need to be scraped. There are two ways of using this parameter: 1. By defining an array of hardcoded URLs. 2. By defining a URL generation rule, written in square brackets inside the URL string (for example "https://companiesmarketcap.com/page/[1, 3, 1]/"), where each element of the rule is an integer. |
concurrency | integer | required | The number of pages that should be scraped at the same time. For a Free Trial account it must be equal to 1; for a Paid one, between 1 and the maximum value allowed by your account settings. |
merge_results | boolean | optional | Merge the results obtained during the parsing of each page. Default: false |
{
"api_key": "YOUR_API_KEY",
"batch": {
"concurrency": 3,
"urls": [
"https://www.ebay.com/itm/334297414333",
"https://www.ebay.com/itm/392912936671",
"https://www.ebay.com/itm/174045421299"
]
},
"parse": {
"title": "h1 >> text",
"price": "#prcIsum >> text",
"url": "link[rel=canonical] >> href"
}
}
{
"api_key": "YOUR_API_KEY",
"batch": {
"merge_results": true,
"concurrency": 3,
"urls": "https://companiesmarketcap.com/page/[1, 3, 1]/"
},
"parse": {
"data": "table"
}
}
Note: when performing batch scraping, the url parameter that is usually used to scrape a single page is optional (see the examples above).
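A batch request like the ones above can be assembled programmatically. The Python sketch below is one possible shape for such a helper; `batch_payload` is a hypothetical name, and the only validation it performs is the lower bound on concurrency mentioned in the table.

```python
def batch_payload(api_key, urls, parse, concurrency=1, merge_results=False):
    """Build a batch scraping payload from a URL list or a URL rule string."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # A string is passed through unchanged (URL generation rule);
    # any other iterable is turned into a list of hardcoded URLs.
    batch_urls = urls if isinstance(urls, str) else list(urls)
    return {
        "api_key": api_key,
        "batch": {
            "urls": batch_urls,
            "concurrency": concurrency,
            "merge_results": merge_results,
        },
        "parse": parse,
    }


batch = batch_payload(
    "YOUR_API_KEY",
    ["https://www.ebay.com/itm/334297414333", "https://www.ebay.com/itm/392912936671"],
    {"title": "h1 >> text"},
    concurrency=2,
)
```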
Name | Type | Required | Description |
---|---|---|---|
payloads | array of objects | required | A collection of individual payloads, each assembled from the list of available parameters. |
concurrency | integer | required | The number of pages that should be scraped at the same time. For a Free Trial account it must be equal to 1; for a Paid one, between 1 and the maximum value allowed by your account settings. |
merge_results | boolean | optional | Merge the results obtained during the parsing of each page. Default: false |
{
"api_key": "YOUR_API_KEY",
"batch": {
"payloads": [
{
"url": "https://httpbin.org/anything?a=1",
"request_method": "POST",
"post_body": { "post": true }
},
{
"url": "https://httpbin.org/anything?a=2",
"request_method": "PUT",
"post_body": { "put": true }
}
],
"concurrency": 1,
"merge_results": false
},
"parse": {
"data": "body >> json"
}
}
{
"api_key": "YOUR_API_KEY",
"batch": {
"payloads": [
{
"url": "https://www.page2api.com",
"parse": {
"title": "h1 >> text"
},
"real_browser": true
},
{
"url": "https://www.example.com",
"parse": {
"description": "p >> text"
}
}
],
"concurrency": 1,
"merge_results": false
}
}
Usually, any request that takes more than 120 seconds will be interrupted.
To handle long-running scraping requests (up to 240 seconds), Page2API can scrape web pages in the background.
This is done by adding the async parameter to the request and setting it to true.
{
"api_key": "YOUR_API_KEY",
"url": "https://www.example.com",
"async": true,
"parse": {
"title": "h1"
}
}
{
"id": 123456,
"performed_async": true
}
After the scraping is done, the result will be sent to the callback URL that you set on your profile page.
{
"result": {
"title": "<h1>Example Domain</h1>"
},
"request": {
"parse": {
"title": "h1"
},
"url": "https://www.example.com",
"callback_url": "https://www.userapplication.com/callback"
},
"id": 123456,
"pages_parsed": 1,
"cost": 0.00025,
"success": true,
"duration": 1.85
}
You can set a custom callback URL for each of your asynchronous requests.
{
"api_key": "YOUR_API_KEY",
"url": "https://www.example.com",
"callback_url": "https://www.userapplication.com/custom_callback",
"async": true,
"parse": {
"title": "h1"
}
}
You can also use the passthrough field for your asynchronous requests and this field will be returned within the callback request.
{
"api_key": "YOUR_API_KEY",
"url": "https://www.example.com",
"passthrough": {
"custom_field": "qwe123asd",
"passthrough_can_be_integer_string_or_object": true
},
"async": true,
"parse": {
"title": "h1"
}
}
{
"result": {
"title": "<h1>Example Domain</h1>"
},
"request": {
"parse": {
"title": "h1"
},
"url": "https://www.example.com",
"passthrough": {
"custom_field": "qwe123asd",
"passthrough_can_be_integer_string_or_object": true
},
"callback_url": "https://www.userapplication.com/callback"
},
"id": 123456,
"pages_parsed": 1,
"cost": 0.00025,
"success": true,
"duration": 1.85
}
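On the receiving side, the callback body can be handled like any other JSON document. The Python sketch below extracts the fields an application would typically need. `handle_callback` is a hypothetical function that would be called from whatever web framework serves the callback URL, and the field layout (including passthrough nested under request) follows the sample above.

```python
import json


def handle_callback(raw_body):
    """Parse a callback body and pull out the fields used for bookkeeping."""
    data = json.loads(raw_body)
    request = data.get("request", {})
    return {
        "id": data.get("id"),
        "success": data.get("success", False),
        "result": data.get("result"),
        # In the sample above, passthrough is echoed inside "request".
        "passthrough": request.get("passthrough"),
    }


sample_body = json.dumps({
    "result": {"title": "<h1>Example Domain</h1>"},
    "request": {
        "url": "https://www.example.com",
        "passthrough": {"custom_field": "qwe123asd"},
    },
    "id": 123456,
    "success": True,
})
summary = handle_callback(sample_body)
```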
Page2API can create a schedule and scrape web pages automatically in the background.
To create a schedule, add the refresh_interval parameter (in minutes) to the request with a value between 1 and 2592000 (30 days).
We will run the schedule according to the interval you specified and send the results to your callback URL
or to a custom callback URL that you can set per request.
After creating a schedule, you will be able to view it, update the interval, and delete the schedule entirely.
{
"api_key": "YOUR_API_KEY",
"url": "https://www.example.com",
"async": true,
"refresh_interval": 5,
"parse": {
"title": "h1 >> text"
}
}
{
"id": 123456,
"performed_async": true,
"schedule_id": 1234
}
{
"result": {
"title": "Example Domain"
},
"request": {
"parse": {
"title": "h1 >> text"
},
"url": "https://www.example.com",
"callback_url": "https://www.userapplication.com/callback"
},
"id": 123456,
"schedule_id": 1234,
"pages_parsed": 1,
"cost": 0.00025,
"success": true,
"duration": 1.85
}
Note: The schedule_id will be returned in the response, regardless of the async parameter value.
https://www.page2api.com/api/v1/schedules
GET
[
{
"id": 1234,
"refresh_interval": 5,
"last_refresh_at": "2021-08-01T09:52:33Z",
"next_refresh_at": "2021-08-01T09:53:33Z",
"active": true,
"options": {
"url": "https://www.example.com",
"parse": {
"title": "h1 >> text"
},
"callback_url": "https://www.userapplication.com/callback",
"refresh_interval": 5
},
"created_at": "2021-08-01T09:51:21Z",
"updated_at": "2021-08-01T09:52:33Z",
"scrape_records_count": 14
}
]
For a specific schedule, you can update parameters such as refresh_interval and callback_url:
https://www.page2api.com/api/v1/schedules/:id
PUT
{
"api_key": "YOUR_API_KEY",
"refresh_interval": 1,
"callback_url": "https://www.userapplication.com/new_callback_url"
}
{
"id": 1234,
"refresh_interval": 1,
"last_refresh_at": "2021-08-01T09:52:33Z",
"next_refresh_at": "2021-08-01T09:53:33Z",
"active": true,
"options": {
"url": "https://www.example.com",
"parse": {
"title": "h1 >> text"
},
"callback_url": "https://www.userapplication.com/new_callback_url",
"refresh_interval": 1
},
"created_at": "2021-08-01T09:51:21Z",
"updated_at": "2021-08-01T09:52:33Z",
"scrape_records_count": 14
}
https://www.page2api.com/api/v1/schedules/:id
DELETE
{
"api_key": "YOUR_API_KEY"
}
{
"message": "Schedule with id: '1234' was deleted successfully."
}
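The update and delete calls above differ only in the HTTP method and body. The Python sketch below builds both requests with `urllib.request` without sending them. The helper names are illustrative, and sending the body as JSON with a `Content-Type: application/json` header is an assumption.

```python
import json
import urllib.request

SCHEDULES_URL = "https://www.page2api.com/api/v1/schedules"


def _json_request(method, schedule_id, body):
    """Create (but do not send) a JSON request for one schedule."""
    return urllib.request.Request(
        f"{SCHEDULES_URL}/{schedule_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def update_schedule_request(api_key, schedule_id, refresh_interval=None, callback_url=None):
    """Build the PUT request; only the fields that were given are included."""
    body = {"api_key": api_key}
    if refresh_interval is not None:
        body["refresh_interval"] = refresh_interval
    if callback_url is not None:
        body["callback_url"] = callback_url
    return _json_request("PUT", schedule_id, body)


def delete_schedule_request(api_key, schedule_id):
    """Build the DELETE request for a schedule."""
    return _json_request("DELETE", schedule_id, {"api_key": api_key})


update = update_schedule_request("YOUR_API_KEY", 1234, refresh_interval=1)
delete = delete_schedule_request("YOUR_API_KEY", 1234)
# urllib.request.urlopen(update) would send the request.
```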
The Datacenter Proxy is the default proxy used to scrape the web.
The default value is auto. With this value, the API will try to use the most suitable datacenter location for the scraped website.
You can also choose a specific location or set the value to random to pick a random datacenter proxy from the available locations.
{
"api_key": "YOUR_API_KEY",
"url": "https://whatismycountry.com",
"datacenter_proxy": "de",
"parse": {
"country": "h2#country >> text"
}
}
{
"result": {
"country": "Your country is Germany"
},
"request": {
"parse": {
"country": "h2#country >> text"
},
"url": "https://whatismycountry.com",
"datacenter_proxy": "de",
"scenario": [
{
"execute": "parse"
}
]
},
"id": 190,
"pages_parsed": 1,
"cost":0.00025,
"success": true,
"duration": 1.55
}
Location | API value |
---|---|
Auto (default) | auto |
Random | random |
EU | eu |
USA | us |
Germany | de |
Romania | ro |
Netherlands | nl |
United Kingdom | uk |
You can provide your own proxy for scraping the web with Page2API.
{
"api_key": "YOUR_API_KEY",
"url": "https://api.ipify.org?format=json",
"custom_proxy": "http://username:password@YOUR_PROXY_HOST:1234",
"parse": {
"ip": "body >> json"
}
}
{
"result": {
"ip": {
"ip": "192.168.13.14"
}
},
"request": {
"parse": {
"ip": "body >> json"
},
"url": "https://api.ipify.org?format=json",
"custom_proxy": "http://username:password@YOUR_PROXY_HOST:1234",
"scenario": [
{
"execute": "parse"
}
]
},
"id": 190,
"pages_parsed": 1,
"cost":0.00025,
"success": true,
"duration": 1.55
}