Page2API | Documentation

Introduction

What is Page2API?

Page2API is a powerful tool designed for scraping web pages and converting HTML into a well-organized JSON structure.
It can launch long-running scrape sessions using asynchronous scraping.
It also supports executing complex browser scenarios and handling pagination with ease.

Getting ready

Authentication

After you create your account, the first thing you will need in order to authenticate and start using the API is your api_key.
It is a randomly generated string that you will find on your Dashboard page, and it looks like this:

  0e72feee16180ef1f3f190ae350d74705d6ebec1

The scraping endpoint

URL

  https://www.page2api.com/api/v1/scrape

Method

  POST

Sample request payload

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "real_browser": true,
      "parse": {
        "title_html": "h1",
        "link_text": "/html/body/div/p[2]/a >> text",
        "link_href": "/html/body/div/p[2]/a >> href"
      }
    }

Sample response for a successful request

  
    {
      "result": {
        "title_html": "<h1>Example Domain</h1>",
        "link_text": "More information...",
        "link_href": "https://www.iana.org/domains/example"
      },
      "request": {
        "parse": {
          "title_html": "h1",
          "link_text": "/html/body/div/p[2]/a >> text",
          "link_href": "/html/body/div/p[2]/a >> href"
        },
        "url": "https://www.example.com",
        "real_browser": true
      },
      "id": 123456,
      "pages_parsed": 1,
      "cost": 0.002,
      "success": true,
      "duration": 2.14
    }

Sample response for a failed request

  
    {
      "error" : "Api key was not found."
    }
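
As a rough illustration of the request and response shapes above, the following Python sketch posts a payload to the scraping endpoint using only the standard library. The helper names (build_payload, extract_result, scrape) and the Content-Type: application/json header are assumptions made for this example; they are not taken from the Page2API documentation.

```python
import json
import urllib.request

SCRAPE_URL = "https://www.page2api.com/api/v1/scrape"


def build_payload(api_key, url, parse, real_browser=False):
    """Assemble a request payload with the fields shown in the sample above."""
    return {"api_key": api_key, "url": url, "real_browser": real_browser, "parse": parse}


def extract_result(response):
    """Return the 'result' object, or raise if the response carries an 'error' key."""
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["result"]


def scrape(payload):
    """POST the payload as JSON (assumed content type) and decode the JSON response."""
    request = urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

If the API signals a failure with a non-2xx HTTP status, urlopen raises urllib.error.HTTPError before extract_result is reached; this sketch does not handle that case.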

Accessing the scraping endpoint via GET with encoded payload:

URL

  https://www.page2api.com/api/v1/scrape/encoded/{base64_urlsafe_encoded_payload}

Method

GET

Sample request payload

{ "api_key": "YOUR_API_KEY", "url": "https://www.example.com", "real_browser": true, "parse": { "title_html": "h1", "link_text": "/html/body/div/p[2]/a >> text", "link_href": "/html/body/div/p[2]/a >> href" } }
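
The encoded URL can also be produced in code. The sketch below serializes a payload to JSON and encodes it with URL-safe Base64 from the Python standard library. The function names are hypothetical, and keeping the trailing "=" padding is an assumption, since the expected padding is not stated here.

```python
import base64
import json

ENCODED_ENDPOINT = "https://www.page2api.com/api/v1/scrape/encoded/"


def encode_payload(payload):
    """Serialize the payload to JSON and encode it with URL-safe Base64."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_payload(encoded):
    """Reverse encode_payload; handy for checking what a URL contains."""
    return json.loads(base64.urlsafe_b64decode(encoded).decode("utf-8"))


def encoded_url(payload):
    """Build the GET URL by appending the encoded payload to the endpoint."""
    return ENCODED_ENDPOINT + encode_payload(payload)
```

URL-safe Base64 replaces "+" and "/" with "-" and "_", so the encoded payload can be placed in the URL path without further escaping.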

Parameters overview

Name	Type	Required	Description
api_key	string	required	Your Page2API API key.
url	string	required	The URL of the page that will be scraped.
user_agent	string	optional	Set a custom user agent that will be used for the request.
locale	string	optional	Set a custom locale that will be used for the request. Ex: es or pt-BR.
parse	object	optional	An object of field names and selectors that extract the data and build the result. If empty, the HTML of the page is returned.
batch	object	optional	An object that contains the following properties: urls, concurrency, and merge_results. It makes it possible to scrape web pages in batches with a specific concurrency.
scenario	array(objects)	optional	A collection of instructions that the browser will execute.
real_browser	boolean	optional	Use a headless Chrome instance to open the URL. Default: false
javascript	boolean	optional	Render the JavaScript on the page when using a headless browser (real_browser). Default: true
import_jquery	boolean	optional	Import the latest version of the jQuery library into the browser instance. Default: false
window_size	array(integer)	optional	Set a custom window size for the browser. Format: [width, height]. Default: [1920, 1080].
wait	integer (seconds)	optional	Just wait, and give the browser some time to rest and meditate on the meaning of life. Max value: 10 (seconds)
wait_for	string (css/xpath selector)	optional	Wait for a specific element to appear on the page. Max wait: 10 seconds
wait_until	string (JS snippet)	optional	Wait for a JavaScript snippet to return a truthy value. Max wait: 10 seconds
cookies	object	optional	Set custom cookies that will be used for the request.
sanitize	boolean	optional	Remove all whitespaces from the parsed content. Default: true
raw	boolean / object	optional	Return only the scraping result in the response with a custom format (CSV, JSON, TEXT, HTML).
absolute_urls	boolean	optional	Ensure that all parsed attributes that contain a URL have absolute paths. Supported attributes: action, archive, background, cite, classid, codebase, data, dsync, formaction, href, icon, longdesc, manifest, poster, profile, src, usemap. Default: true
log_requests	boolean	optional	Return all network requests. Default: false
async	boolean	optional	Perform the request asynchronously. Receive the response via the callback URL specified on the profile page. Default: false
callback_url	string	optional	A custom callback URL for a specific scrape request. Default: the callback URL from the user's profile
passthrough	string / integer / object	optional	Any data added to this parameter will be returned in the response or sent in any subsequent callbacks.
request_method	string	optional	Set a custom request method for the request. Possible values: GET, POST, PUT, PATCH, DELETE, HEAD. Default: GET
post_body	object / string (json)	optional	Set a post body for the request. Example: { "post_body": { "query": "web scraping" }}
headers	object	optional	Set custom headers that will be used for the request. Example: { "headers": { "Content-Type": "application/json" }}
refresh_interval	integer (minutes)	optional	Create a scheduled parsing that will run every n minutes. Min: 1, Max: 2592000
merge_loops	boolean	optional	Merge the results obtained during the parsing of the paginated views (loops). Default: false
datacenter_proxy	string	optional	The code of the datacenter proxy to be used for the request. Default: auto
premium_proxy	string	optional	The code of the premium proxy to be used for the request.
custom_proxy	string	optional	Provide your own proxy to be used for the request.

Data extraction

Parameter: parse

The parse parameter represents an object, where the keys are the names of the fields you want to create and the values are the selectors that will extract the data.

A simple selector consists of 2 parts:

Description	Required	Examples
css/xpath selector	required	`a` `/html/body/div/p[2]/a` `/html/body/div/p[*]/a`
'>>' concatenated with a selector function	optional	`>> href` `>> title` `>> text` `>> json`

The selector function can be the name of any attribute of the element,
as well as one of the special ones:
text - extracts the text from the element
json - parses the content from the element that contains a JSON object

Note: If no function is specified, the HTML of the element will be returned.

1. Extracting one element per selector

Having the following element on the page:

  
    <a href="https://example.com">Example</a>

The simplest example of a parse parameter will look like:

  
    // The 'parse' parameter:

    "parse": {
      "link": "a"
    }

    // The result:

    {
      "link": "<a href='https://example.com'>Example</a>"
    }

A parse parameter where the selector function is present:

  
    // The 'parse' parameter:

    "parse": {
      "link_href": "a >> href"
    }

    // The result:

    {
      "link_href": "https://example.com"
    }
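
To make the two-part selector format concrete, here is a small Python sketch that splits a selector string into its selector and its optional function. It only illustrates the syntax described above; split_selector is a hypothetical helper, and this is not how Page2API itself parses selectors.

```python
def split_selector(value):
    """Split 'selector >> function' into its parts; function is None if absent."""
    selector, separator, function = value.partition(">>")
    return selector.strip(), (function.strip() if separator else None)
```

For example, split_selector("a >> href") yields the selector "a" with the function "href", while split_selector("a") yields "a" with no function, which corresponds to the case where the HTML of the element is returned.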

2. Extracting all elements with a specific selector

In order to extract all elements that share a selector, you must wrap the selector in [ ], like in the following examples:

  
    ["/html/body/div/p[*]/a"]
    /* without a selector function */

  
    ["a >> href"]
    /* with a selector function (extract all hrefs) */

Having the following elements on the page:

  
    <a href='https://example.com'>Example</a>
    <a href='https://www.page2api.com'>Page2API</a>
    <a href='https://ipapi.co/api'>IpApi</a>

A selector that will extract all links will look like:

  
    // The 'parse' parameter:

    "parse": {
      "links": ["a"]
    }

    // The result:

    {
      "links": [
        "<a href='https://example.com'>Example</a>",
        "<a href='https://www.page2api.com'>Page2API</a>",
        "<a href='https://ipapi.co/api'>IpApi</a>"
      ]
    }

And if you want to extract specific attributes/content:

  
    // The 'parse' parameter:

    "parse": {
      "links_text": ["a >> text"]
    }

    // The result:

    {
      "links_text": [
        "Example",
        "Page2API",
        "IpApi"
      ]
    }

3. Extracting nested elements with different selectors

This scenario is used if you want to parse elements from repeating structures, for example a list of articles, products, posts and so on. In order to extract the elements mentioned above, you must wrap the whole { name1: selector1, name2: selector2 } structure in [ ], like in the following example:

  
    "parse": {
      "posts": [
        {
          "_parent": ".feed-item",
          "title":  ".feed-item__title-link >> text",
          "link":   ".feed-item__title-link >> href",
          "author": "span.user-link__name >> text"
        }
      ]
    }

Please note that each structure must have a _parent key that will define the parent for the parsed elements:

  
    "_parent": ".feed-item"

Having the following structure of elements on the page:

  
  
    <div class='all-posts'>
      <div class='post'>
        <a class='title' href='/posts/123'>Post one title</a>
        <span class='comments'>(3 comments)</span>
        <a class='author' href='/author/757'>Author One</a>
      </div>
      <div class='post'>
        <a class='title' href='/posts/234'>Post two title</a>
        <span class='comments'>(no comments)</span>
        <a class='author' href='/author/347'>Author Two</a>
      </div>
      <div class='post'>
        <a class='title' href='/posts/456'>Post three title</a>
        <span class='comments'>(1 comment)</span>
        <a class='author' href='/author/923'>Author Three</a>
      </div>
    </div>

A selector that will extract the data about each post will look like:

  
    // The 'parse' parameter:

    "parse": {
      "posts": [
        {
          "_parent": ".post",
          "title": "a.title >> text",
          "link": "a.title >> href",
          "author": "a.author >> text",
          "comments": "span.comments >> text"
        }
      ]
    }

    // The result:

    {
      "posts": [
        {
          "title": "Post one title",
          "link": "/posts/123",
          "author": "Author One",
          "comments": "(3 comments)"
        },
        {
          "title": "Post two title",
          "link": "/posts/234",
          "author": "Author Two",
          "comments": "(no comments)"
        },
        {
          "title": "Post three title",
          "link": "/posts/456",
          "author": "Author Three",
          "comments": "(1 comment)"
        }
      ]
    }
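
When payloads are built in code, the nested structure can be generated by a small helper so that the required _parent key is never forgotten. The following Python sketch is one possible way to do it; nested_fields is a hypothetical name chosen for this example.

```python
import json


def nested_fields(parent, fields):
    """Return a one-element list holding the fields plus the required '_parent' key."""
    if "_parent" in fields:
        raise ValueError("pass the parent selector as the first argument")
    return [{"_parent": parent, **fields}]


parse = {
    "posts": nested_fields(
        ".post",
        {
            "title": "a.title >> text",
            "link": "a.title >> href",
            "author": "a.author >> text",
            "comments": "span.comments >> text",
        },
    )
}
parse_json = json.dumps(parse, indent=2)  # ready to embed in a request payload
```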

4. Extracting data from tables

Tables are parsed automatically, so there is no need to specify any selector function for them.

Having the following table on the page:

  
  
    <table class='people'>
      <thead>
        <tr>
          <th>Firstname</th>
          <th>Lastname</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Jill</td>
          <td>Smith</td>
        </tr>
        <tr>
          <td>Eve</td>
          <td>Jackson</td>
        </tr>
      </tbody>
    </table>

A selector that will extract the data from this table will look like:

  
    // The 'parse' parameter:

    "parse": {
      "people": "table.people"
    }

    // The result:

    {
      "people": [
        {
          "Firstname": "Jill",
          "Lastname": "Smith"
        },
        {
          "Firstname": "Eve",
          "Lastname": "Jackson"
        }
      ]
    }
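
Because each table row comes back as an object keyed by the column headers, the result is easy to post-process. As a minimal sketch, assuming the result shape shown above, the rows can be written out as CSV with the Python standard library (rows_to_csv is a hypothetical helper):

```python
import csv
import io


def rows_to_csv(rows):
    """Write a list of row objects (one dict per table row) as CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order follows the keys of the first row.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```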

To summarize, here is a sample request with a complex parse parameter:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.indiehackers.com",
      "sanitize": true,
      "parse": {
        "footer": "/html/body/div[1]/div/footer/div[1]/form/p[1] >> text",
        "sections": [".posts-section__nav-content >> text"],
        "posts": [
          {
            "_parent": ".feed-item",
            "title":  ".feed-item__title-link >> text",
            "link":   ".feed-item__title-link >> href",
            "author": "span.user-link__name >> text"
          }
        ],
        "side_links": [
          {
            "_parent": ".news-section__item",
            "title": ".news-section__item-title >> text",
            "link": "_parent >> href",
            "category": "/html/body/div[1]/div/div[2]/div[1]/div/a[*]/div/span[1] >> text"
          }
        ]
      }
    }

Custom request

Parameters: request_method post_body headers

The parameters above allow you to build custom scraping requests with a specific request method, body, and headers.

For better flexibility, the post_body parameter can be an object, as well as a JSON string.

    // object
    "post_body": { "test": "success" }

    // JSON string
    "post_body": "{ \"test\": \"success\" }"

Sample POST request with form data

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "http://httpbin.org/anything",
      "post_body": { "testing": "true" },
      "request_method": "POST",
      "parse": {
        "data": "body >> json"
      }
    }

Sample response

  
    {
      "result": {
        "data": {
          "form": {
            "testing": "true"
          },
          "method": "POST",
          "headers": {
            "Content-Type": "application/x-www-form-urlencoded",
            ...
          },
          ...
        }
      },
      "request": {
        "parse": {
          "data": "body >> json"
        },
        "url": "http://httpbin.org/anything",
        "request_method": "POST",
        "post_body": {
          "testing": "true"
        },
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 12345,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.22
    }

Sample POST request with JSON body and custom headers

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "http://httpbin.org/anything",
      "post_body": "{ \"testing\": \"true\" }",
      "request_method": "POST",
      "headers": {
        "Content-Type": "application/json"
      },
      "parse": {
        "data": "body >> json"
      }
    }

Sample response

  
    {
      "result": {
        "data": {
          "data": "{ \"testing\": \"true\" }",
          "method": "POST",
          "json": {
            "testing": "true"
          },
          "headers": {
            "Content-Type": "application/json",
            ...
          },
          ...
        }
      },
      "request": {
        "parse": {
          "data": "body >> json"
        },
        "url": "http://httpbin.org/anything",
        "request_method": "POST",
        "post_body": "{ \"testing\": \"true\" }",
        "headers": {
          "Content-Type": "application/json"
        },
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 12345,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.22
    }
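
A payload like the JSON-body sample above can be assembled in code by serializing the body before placing it in post_body. This is a minimal Python sketch; json_post_payload is a hypothetical helper, and it simply mirrors the fields used in that sample.

```python
import json


def json_post_payload(api_key, url, body, parse):
    """Build a POST payload whose post_body is a JSON string, as in the sample above."""
    return {
        "api_key": api_key,
        "url": url,
        "request_method": "POST",
        "post_body": json.dumps(body),  # JSON-string form of the body
        "headers": {"Content-Type": "application/json"},
        "parse": parse,
    }
```

Serializing with json.dumps avoids hand-escaping the quotes inside the post_body string.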

Custom response

Parameter: raw

This parameter allows you to customize the scraping response.
By default, it returns only the scraping result, in JSON format, without any additional properties.

There are 2 ways to use the raw parameter.

1. You can set it to true, and it will return only the scraping result, in JSON format.
2. If you want to customize the response, you can send it as an object that can contain the following properties:

Name	Type	Required	Description
format	string	optional	The response format. Possible values: csv, auto. Default: auto (will return JSON, HTML, or TEXT, depending on the presence of the parse parameter)
key	string	optional	The key that should be used as a source for the response.

The most simple example:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com",
      "raw": true,
      "parse": {
        "features": [
          {
            "_parent": ".feature-container",
            "title": "h2 >> text"
          }
        ]
      }
    }

The payload above will generate the following response:

  
    {
      "features": [
        { "title": "Intuitive and powerful API" },
        { "title": "Asynchronous scraping" },
        { "title": "Javascript rendering" },
        { "title": "Scheduled scraping" },
        { "title": "Custom browser scenarios" },
        { "title": "Fast and reliable proxies" }
      ]
    }

Note: As you can see, the response has only the features key, without any additional properties such as duration, cost, etc.

If we want to dig into our response and return a specific key:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com",
      "raw": {
        "key": "features"
      },
      "parse": {
        "features": [
          {
            "_parent": ".feature-container",
            "title": "h2 >> text"
          }
        ]
      }
    }

The payload above will generate the following response:

  
    [
      { "title": "Intuitive and powerful API" },
      { "title": "Asynchronous scraping" },
      { "title": "Javascript rendering" },
      { "title": "Scheduled scraping" },
      { "title": "Custom browser scenarios" },
      { "title": "Fast and reliable proxies" }
    ]

If we want to return our result as CSV:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com",
      "raw": {
        "key": "features",
        "format": "csv"
      },
      "parse": {
        "features": [
          {
            "_parent": ".feature-container",
            "title": "h2 >> text"
          }
        ]
      }
    }

The payload above will generate the following response:

  
    title
    Intuitive and powerful API
    Asynchronous scraping
    Javascript rendering
    Scheduled scraping
    Custom browser scenarios
    Fast and reliable proxies

Hint: The example above is helpful when we want to import the result directly into a Spreadsheet without any code.

The Spreadsheet snippet for this use case could look like the following:

  
    =IMPORTDATA("https://www.page2api.com/api/v1/scrape/encoded/{urlsafe_base64_encoded_params}")

JavaScript rendering

Parameter: javascript

When scraping a page with real_browser set to true, the API will automatically execute all the JavaScript on that page.
This can be useful for scraping websites that load content dynamically with plain JavaScript or with a framework or library such as React, Angular, Vue, or jQuery.

To scrape the page with a headless browser but with JavaScript disabled, set the javascript parameter to false.

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.whatismybrowser.com/",
      "real_browser": true,
      "javascript": false,
      "parse": {
        "browser": ".string-major >> text",
        "javascript": "#javascript-detection >> text"
      }
    }

Note: this parameter is available only when real_browser is set to true.

Keep in mind that even if javascript is set to false - you can still run your own JavaScript snippets on the page.

What is the advantage of using this parameter?

A request without a real browser will cost the same as a request with a real browser but with disabled JavaScript.

  
    // This request will use a REST client to fetch the web page.
    // It will be faster than the example below but may sometimes be detected.

    "real_browser": false

  
    // This request will use headless Chrome with JavaScript disabled.
    // It will be slightly slower than the previous example but will be harder to detect.

    "real_browser": true,
    "javascript": false

Both examples will cost the same.

Browser scenario

Parameter: scenario

The scenario parameter represents a collection of browser instructions, such as:

1. wait
2. wait for element
3. execute javascript
4. (native) fill input
5. (native) click
6. start a cycle (loop)
7. initiate the parsing

The instructions are used to interact with the web page, according to a specific scenario.

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.

The scenario parameter has the following format:

  
    "scenario" : [
      { "execute_js": "$($('select')[5]).val('yes').change()" },
      { "wait": 0.1 },
      {
        "loop" : [
          { "wait_for": "li.next a" },
          { "execute": "parse" },
          { "execute_js": "document.getElementById('proxylisttable_next').click()" },
          { "wait": 0.1 }
        ],
        "iterations": 10, // in this case - this parameter is optional
        "stop_condition": "document.getElementById('proxylisttable_next').classList.contains('disabled')"
      }
    ]

Note: a loop is just a collection of instructions that are executed in a cycle.

For a loop, an iterations or a stop_condition parameter is required.

iterations is the number of loop cycles. This parameter is optional if a stop_condition is present.
stop_condition is a JavaScript snippet that is executed after each iteration; if it returns true, the loop is stopped.

Hint: The most relevant use case for a loop is parsing paginated views.
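
The rule that a loop needs iterations, a stop_condition, or both can be enforced when scenarios are generated in code. The Python sketch below is one way to do that; build_loop is a hypothetical helper, and the validation happens locally before any request is sent.

```python
def build_loop(commands, iterations=None, stop_condition=None):
    """Build a loop instruction; at least one of iterations/stop_condition is required."""
    if iterations is None and stop_condition is None:
        raise ValueError("a loop needs iterations, a stop_condition, or both")
    loop = {"loop": list(commands)}
    if iterations is not None:
        loop["iterations"] = iterations
    if stop_condition is not None:
        loop["stop_condition"] = stop_condition
    return loop


scenario = [
    build_loop(
        [
            {"wait_for": "li.next a"},
            {"execute": "parse"},
            {"execute_js": "document.getElementById('proxylisttable_next').click()"},
        ],
        stop_condition="document.getElementById('proxylisttable_next').classList.contains('disabled')",
    )
]
```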

All available commands:

Command	Description
`{ "wait": 0.1 }`	Tells the browser to take a small break. The value is any integer between 1 and 10 (seconds).
`{ "wait_for": "li.next a" }`	Waits until a specific element appears on the page. The timeout is 10 seconds.
`{ "execute_js": "$('#proxylisttable_next').click()" }`	Executes a js snippet. All js errors are ignored.
`{ "fill_in": ["input#search", "Page2API"] }`	Fills an input, natively. Each character is sent separately.
`{ "click": "button[type=submit]" }`	Clicks an element, natively.
`{ "execute": "parse" }`	Initiate the parsing with the current HTML on the page.
`{ "loop": [/* commands */] }`	Executes a set of commands in a cycle.

Wait

Parameter: wait scenario.wait loop.wait

This parameter allows the browser to give the web page some time (seconds) to render before capturing the HTML.
This is typically needed when interacting with the web page via the scenario parameter, or when some content is rendered asynchronously after the page loads.
Maximum value: 10 (seconds).

    "wait": 2

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.

Wait for element

Parameter: wait_for scenario.wait_for loop.wait_for

This parameter allows the browser to give the web page some time (seconds) to render a particular element before capturing the HTML.
Maximum timeout: 10 (seconds).

    "wait_for": "li.next"

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.

Wait until

Parameter: wait_until scenario.wait_until loop.wait_until

This parameter allows the browser to wait until a JavaScript snippet returns a truthy value before capturing the HTML.
Maximum timeout: 10 (seconds).

    
      "wait_until": "document.querySelectorAll('.element').length == 10"

      // or base64 encoded:

      "wait_until": "ZG9jdW1lbnQucXVlcnlTZWxlY3RvckFsbCgnLmVsZW1lbnQnKS5sZW5ndGggPT0gMTA="

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.
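
The Base64 form of a snippet can be generated with standard Base64 encoding. The sketch below uses the Python standard library; encode_js is a hypothetical helper name. The encoded sample shown above decodes to exactly the raw snippet, which suggests the standard alphabet with padding.

```python
import base64


def encode_js(snippet):
    """Base64-encode a JavaScript snippet (standard alphabet, with padding)."""
    return base64.b64encode(snippet.encode("utf-8")).decode("ascii")


def decode_js(encoded):
    """Decode a Base64-encoded snippet back to its JavaScript source."""
    return base64.b64decode(encoded).decode("utf-8")
```

Encoding is mostly useful when a snippet contains quotes or other characters that are awkward to escape inside a JSON string.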

Cookies

Parameter: cookies

This parameter allows you to send custom cookies that will be used for the request.
Format: an object with string values.

    "cookies": { "test": "success", "session": "123asdqwe" }

When using this parameter with real_browser set to true, the response will contain the information about all cookies that were set, like in the example below:

Sample request using real_browser and cookies

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "http://httpbin.org/cookies?json",
      "cookies": { "testing": "true" },
      "real_browser": true,
      "parse": {
        "cookies": "body >> json"
      }
    }

Sample response

  
    {
      "result": {
        "cookies": {
          "cookies": {
            "testing": "true"
          }
        }
      },
      "request": {
        "parse": {
          "cookies": "body >> json"
        },
        "url": "http://httpbin.org/cookies?json",
        "cookies": {
          "testing": "true"
        },
        "real_browser": true,
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 12345,
      "pages_parsed": 1,
      "cost": 0.002,
      "success": true,
      "extra": {
        "cookies": [
          {
            "name": "testing",
            "value": "true",
            "path": "/",
            "domain": "httpbin.org",
            "expires": null,
            "secure": false
          }
        ]
      },
      "duration": 5.22
    }
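
The cookies reported under extra can be collected into a plain name/value object, which is the same format the cookies parameter accepts. This is a minimal Python sketch, assuming the response shape shown above; cookies_from_response is a hypothetical helper.

```python
def cookies_from_response(response):
    """Map cookie names to values from the 'extra.cookies' list, if present."""
    cookies = response.get("extra", {}).get("cookies", [])
    return {cookie["name"]: cookie["value"] for cookie in cookies}
```

The resulting object could then be passed as the cookies parameter of a follow-up request.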

Execute JavaScript

Parameter: scenario.execute_js loop.execute_js

Page2API can execute custom JavaScript code during the scraping session.
This is useful when you need to interact with the web page before or during parsing.

It is performed via the scenario parameter that was described earlier.

The JavaScript snippet can be sent in one of two formats:

Raw:

  
    document.querySelector('.morelink').click()

Base64 encoded:

  
    ZG9jdW1lbnQucXVlcnlTZWxlY3RvcignLm1vcmVsaW5rJykuY2xpY2soKQ==

The simplest way of using this parameter is shown below:

  
    "scenario" : [
      { "execute_js": "document.querySelector('.morelink').click()" },
      { "wait": 0.1 },
      { "execute": "parse" }
    ]

Note: this parameter is available only when real_browser is set to true.

JavaScript selectors

Parameter: parse

Page2API can execute custom JavaScript code during the scraping session.
The executed code can be used as a selector to ease the data extraction process.
This is useful when you need to access data that is stored in the page's JavaScript code.

A JavaScript selector has the following format:

  
    js >> raw_or_base64_js_snippet

The simplest way of using this parameter is shown below:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com/",
      "real_browser": true,
      "parse": {
        "title": "js >> $('h1').text().trim()",
        "location": "js >> document.location.href",
        "js_variable": "js >> let object = { arr: [1, 2] }; object",
        "base64_js": "js >> bGV0IG9iaiA9IHsgb25lX3R3bzogWzEsIDJdIH07IG9iag=="
      }
    }

The payload above will return the following result:

  
    {
      "result": {
        "title": "The Ultimate Web Scraping API",
        "location": "https://www.page2api.com/",
        "js_variable": {
          "arr": [1, 2]
        },
        "base64_js": {
          "one_two": [1, 2]
        }
      }
    }

Note: this parameter is available only when real_browser is set to true.

Fill input [native]

Parameter: scenario.fill_in loop.fill_in

Page2API can fill inputs natively by using the fill_in scenario command.
The format is an array, where the first element is a css/xpath selector and the second one is the value.

A simple way of using this parameter is shown below:

  
    "scenario" : [
      { "fill_in": ["input#search", "Page2API"] },
      { "wait_for": ".search-results" },
      { "execute": "parse" }
    ]

Note: this parameter is available only when real_browser is set to true.

Click [native]

Parameter: scenario.click loop.click

Page2API can click natively on visible elements from the page by using the click scenario command.
The format is a string that represents a css/xpath selector of the element that must be clicked.

A simple way of using this parameter is shown below:

  
    "scenario" : [
      { "fill_in": ["input#search", "Page2API"] },
      { "click": "button[type=submit]" },
      { "execute": "parse" }
    ]

Note: this parameter is available only when real_browser is set to true.

Handle pagination

Parameter: scenario.loop

Page2API can handle paginated views, both the classic ones with links to the pages and infinite scrolls.
This is done via a loop command inside a scenario.

Note: a loop is just a collection of instructions that are executed in a cycle.

The simplest way of handling a paginated view is shown below:

  
    "scenario" : [
      {
        "loop" : [
          { "wait_for": "li.next a" },
          { "execute": "parse" },
          { "execute_js": "document.getElementById('proxylisttable_next').click()" }
        ],
        "iterations": 10, // in this case - this parameter is optional
        "stop_condition": "document.getElementById('proxylisttable_next').classList.contains('disabled')"
      }
    ]

Note: this parameter is available only when real_browser is set to true.

Batch scraping

Parameter: batch

Page2API can scrape web pages in batches and handle concurrency for you.

The batch feature has two variants:

1. Basic batching (same payload, different URLs)

This feature is useful when scraping multiple web pages with the same selectors.

There are 2 common use cases for this feature:
1. Scraping a paginated view
2. Scraping a collection of individual pages, using the same selectors

The batch parameter represents an object that contains the following properties:

Name	Type	Required	Description
urls	string / array	required	The URLs that need to be scraped, given either as an array of hardcoded URLs or as a URL generation rule (both described below).
concurrency	integer	required	The number of pages that should be scraped at the same time. For a Free Trial account it must be equal to 1; for a Paid one, between 1 and the maximum value allowed by your account settings.
merge_results	boolean	optional	Merge the results obtained during the parsing of each page. Default: false

There are two ways of using the urls parameter:

1. By defining an array of hardcoded URLs, like in the example below:

  
    "urls": [
      "https://companiesmarketcap.com/page/1/",
      "https://companiesmarketcap.com/page/2/",
      "https://companiesmarketcap.com/page/3/"
    ]

2. By defining a URL generation rule:

  
    "urls": "https://companiesmarketcap.com/page/[1, 3, 1]/"

The URL generation rule has the following format:

  
    [START, END, STEP]

where each element of the rule is an integer.

An example of a payload with a predefined collection of URLs:

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "concurrency": 3,
        "urls": [
          "https://www.ebay.com/itm/334297414333",
          "https://www.ebay.com/itm/392912936671",
          "https://www.ebay.com/itm/174045421299"
        ]
      },
      "parse": {
        "title": "h1 >> text",
        "price": "#prcIsum >> text",
        "url": "link[rel=canonical] >> href"
      }
    }

An example of a payload with auto-generated URLs:

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "merge_results": true,
        "concurrency": 3,
        "urls": "https://companiesmarketcap.com/page/[1, 3, 1]/"
      },
      "parse": {
        "data": "table"
      }
    }

Note: when performing batch scraping, the url parameter, normally used to scrape a single page, is optional (see the examples above).
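
To make the `[START, END, STEP]` rule more concrete, here is a minimal Python sketch of how such a rule could be expanded on the client side. The function name `expand_url_rule` is hypothetical, and the sketch assumes that END is inclusive (so that `[1, 3, 1]` matches the three hardcoded URLs shown earlier) and that STEP is positive; the API's actual server-side behavior may differ.

```python
import re

# Matches a "[START, END, STEP]" rule embedded in a URL template.
RULE_PATTERN = re.compile(r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\]")


def expand_url_rule(template: str) -> list[str]:
    """Expand a "[START, END, STEP]" rule into a list of concrete URLs.

    Assumption: END is inclusive, so "[1, 3, 1]" yields 1, 2 and 3.
    """
    match = RULE_PATTERN.search(template)
    if match is None:
        # No rule present: treat the template as a single plain URL.
        return [template]

    start, end, step = (int(group) for group in match.groups())
    if step <= 0:
        # Assumption: only positive steps are meaningful in this sketch.
        raise ValueError("STEP must be a positive integer")

    prefix, suffix = template[:match.start()], template[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1, step)]


urls = expand_url_rule("https://companiesmarketcap.com/page/[1, 3, 1]/")
for url in urls:
    print(url)
```

A helper like this can be handy for previewing which pages a rule would cover before submitting the batch.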

2. Advanced batching (different payloads, different URLs)

This feature is useful when scraping multiple web pages with custom selectors or payloads.

As in the previous variant, the batch parameter represents an object that contains the following properties:

Name	Type	Required	Description
payloads	array of objects	required	This parameter is a collection of individual payloads, assembled from the list of available parameters. Sample value: `{ "payloads": [ { "url": "https://httpbin.org/anything?a=1", "request_method": "POST", "post_body": { "post": true } }, { "url": "https://httpbin.org/anything?a=2", "request_method": "PUT", "post_body": { "put": true } } ], ... }`
concurrency	integer	required	The number of pages to scrape at the same time. For a Free Trial account it must be 1; for a Paid account, between 1 and the maximum value allowed by your account settings.
merge_results	boolean	optional	Merge the results obtained from parsing each page. Default: false

An example of a payload with common parameters (parse):

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "payloads": [
          {
            "url": "https://httpbin.org/anything?a=1",
            "request_method": "POST",
            "post_body": { "post": true }
          },
          {
            "url": "https://httpbin.org/anything?a=2",
            "request_method": "PUT",
            "post_body": { "put": true }
          }
        ],
        "concurrency": 1,
        "merge_results": false
      },
      "parse": {
        "data": "body >> json"
      }
    }

An example of a payload with fully-customizable parameters:

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "payloads": [
          {
            "url": "https://www.page2api.com",
            "parse": {
              "title": "h1 >> text"
            },
            "real_browser": true
          },
          {
            "url": "https://www.example.com",
            "parse": {
              "description": "p >> text"
            }
          }
        ],
        "concurrency": 1,
        "merge_results": false
      }
    }
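
Payloads like the two above can also be assembled programmatically. The following Python sketch builds an advanced batch body from a list of per-page payloads; the helper name `build_advanced_batch` and its `common` argument are hypothetical, and the check that every payload carries a `url` is an assumption based on the examples, not a documented rule.

```python
import json


def build_advanced_batch(api_key, payloads, concurrency=1,
                         merge_results=False, common=None):
    """Assemble an advanced batch request body.

    `payloads` is a list of per-page dicts, each with its own "url".
    `common` holds optional top-level parameters shared by all pages,
    for example a single "parse" object.
    """
    for index, payload in enumerate(payloads):
        # Assumption: each individual payload needs its own URL.
        if "url" not in payload:
            raise ValueError(f"payload #{index} is missing the 'url' key")

    body = {
        "api_key": api_key,
        "batch": {
            "payloads": list(payloads),
            "concurrency": concurrency,
            "merge_results": merge_results,
        },
    }
    # Shared parameters go next to "batch", as in the common-parameters example.
    for key, value in (common or {}).items():
        body.setdefault(key, value)
    return body


body = build_advanced_batch(
    "YOUR_API_KEY",
    [
        {"url": "https://httpbin.org/anything?a=1",
         "request_method": "POST", "post_body": {"post": True}},
        {"url": "https://httpbin.org/anything?a=2",
         "request_method": "PUT", "post_body": {"put": True}},
    ],
    common={"parse": {"data": "body >> json"}},
)
print(json.dumps(body, indent=2))
```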

Async scraping

Parameter: async

Requests that take longer than 120 seconds are normally interrupted.
To handle long-running scraping requests (up to 240 seconds), Page2API can scrape web pages in the background.
To enable this, add the async parameter to the request and set it to true.

Sample async request payload

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "async": true,
      "parse": {
        "title": "h1"
      }
    }

Sample response for async request

  
    {
      "id": 123456,
      "performed_async": true
    }

After the scraping is done, the result is sent to the Callback URL that you set on your profile page.

Sample request to your callback URL

  
    {
      "result": {
        "title": "<h1>Example Domain</h1>"
      },
      "request": {
        "parse": {
          "title": "h1"
        },
        "url": "https://www.example.com",
        "callback_url": "https://www.userapplication.com/callback"
      },
      "id": 123456,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.85
    }

You can also set a custom Callback URL for each of your asynchronous requests.

Sample async request with custom callback_url

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "callback_url": "https://www.userapplication.com/custom_callback",
      "async": true,
      "parse": {
        "title": "h1"
      }
    }

You can also add a passthrough field to your asynchronous requests; it will be returned in the callback request.

Sample async request with passthrough

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "passthrough": {
        "custom_field": "qwe123asd",
        "passthrough_can_be_integer_string_or_object": true
      },
      "async": true,
      "parse": {
        "title": "h1"
      }
    }

Sample request to your callback URL with passthrough field

  
    {
      "result": {
        "title": "<h1>Example Domain</h1>"
      },
      "request": {
        "parse": {
          "title": "h1"
        },
        "url": "https://www.example.com",
        "passthrough": {
          "custom_field": "qwe123asd",
          "passthrough_can_be_integer_string_or_object": true
        },
        "callback_url": "https://www.userapplication.com/callback"
      },
      "id": 123456,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.85
    }
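
The round trip above can be sketched in Python as follows: one function builds the asynchronous request, and another extracts the useful fields from the callback body. The function names are hypothetical, the `Content-Type: application/json` header is an assumption, and the request is only constructed here, not sent, so the sketch runs offline.

```python
import json
import urllib.request

SCRAPE_URL = "https://www.page2api.com/api/v1/scrape"


def build_async_request(api_key, url, parse, passthrough=None, callback_url=None):
    """Build (but do not send) a POST request for an async scrape."""
    payload = {"api_key": api_key, "url": url, "async": True, "parse": parse}
    if passthrough is not None:
        payload["passthrough"] = passthrough
    if callback_url is not None:
        payload["callback_url"] = callback_url
    return urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption
        method="POST",
    )


def handle_callback(raw_body):
    """Pick the main fields out of a callback body shaped like the samples."""
    body = json.loads(raw_body)
    return {
        "id": body["id"],
        "success": body.get("success", False),
        "result": body.get("result"),
        # The passthrough value is echoed back inside the "request" object.
        "passthrough": body.get("request", {}).get("passthrough"),
    }


request = build_async_request(
    "YOUR_API_KEY",
    "https://www.example.com",
    {"title": "h1"},
    passthrough={"custom_field": "qwe123asd"},
)
# urllib.request.urlopen(request) would submit it; omitted in this sketch.

sample_callback = json.dumps({
    "result": {"title": "<h1>Example Domain</h1>"},
    "request": {
        "parse": {"title": "h1"},
        "url": "https://www.example.com",
        "passthrough": {"custom_field": "qwe123asd"},
    },
    "id": 123456,
    "success": True,
})
summary = handle_callback(sample_callback)
print(summary["id"], summary["passthrough"])
```

Using the passthrough value as a correlation key is one way to match a callback to the job that triggered it, since the callback arrives separately from the initial response.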

Scheduled scraping

Parameter: refresh_interval

Page2API can create a schedule and scrape web pages automatically in the background.
To create a schedule, add the refresh_interval (minutes) parameter to the request with a value between 1 and 2592000 (30 days).
The schedule runs at the interval you specify, and the results are sent to your Callback URL
or to a custom callback URL that you can set per request.

After creating a schedule, you can view it, update its interval, or delete it entirely.

Sample scheduled async request

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "async": true,
      "refresh_interval": 5,
      "parse": {
        "title": "h1 >> text"
      }
    }

Sample response for scheduled async request

  
    {
      "id": 123456,
      "performed_async": true,
      "schedule_id": 1234
    }

Sample scheduled request to your callback URL

  
    {
      "result": {
        "title": "Example Domain"
      },
      "request": {
        "parse": {
          "title": "h1 >> text"
        },
        "url": "https://www.example.com",
        "callback_url": "https://www.userapplication.com/callback"
      },
      "id": 123456,
      "schedule_id": 1234,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.85
    }

Note: The schedule_id will be returned in the response, regardless of the async parameter value.

1. Visualize all schedules

URL

  https://www.page2api.com/api/v1/schedules

Method

GET

Sample response

  
    [
        {
            "id": 1234,
            "refresh_interval": 5,
            "last_refresh_at": "2021-08-01T09:52:33Z",
            "next_refresh_at": "2021-08-01T09:53:33Z",
            "active": true,
            "options": {
                "url": "https://www.example.com",
                "parse": {
                    "title": "h1 >> text"
                },
                "callback_url": "https://www.userapplication.com/callback",
                "refresh_interval": 5
            },
            "created_at": "2021-08-01T09:51:21Z",
            "updated_at": "2021-08-01T09:52:33Z",
            "scrape_records_count": 14
        }
    ]

2. Update a schedule

For a specific Schedule, you can update any of the following parameters:

parse
batch
wait_for
wait_until
wait
scenario
url
refresh_interval
callback_url
user_agent
javascript
cookies
passthrough
merge_loops
log_requests
raw
headers
request_method
post_body
premium_proxy
datacenter_proxy
real_browser
absolute_urls
import_jquery
custom_proxy
locale
sanitize

URL

  https://www.page2api.com/api/v1/schedules/:id

Method

PUT

Sample payload

  
    {
      "api_key": "YOUR_API_KEY",
      "refresh_interval": 1,
      "callback_url": "https://www.userapplication.com/new_callback_url"
    }

Sample response

  
    {
      "id": 1234,
      "refresh_interval": 1,
      "last_refresh_at": "2021-08-01T09:52:33Z",
      "next_refresh_at": "2021-08-01T09:53:33Z",
      "active": true,
      "options": {
          "url": "https://www.example.com",
          "parse": {
              "title": "h1 >> text"
          },
          "callback_url": "https://www.userapplication.com/new_callback_url",
          "refresh_interval": 1
      },
      "created_at": "2021-08-01T09:51:21Z",
      "updated_at": "2021-08-01T09:52:33Z",
      "scrape_records_count": 14
    }

3. Delete a schedule

URL

  https://www.page2api.com/api/v1/schedules/:id

Method

  DELETE

Sample payload

  
    {
      "api_key": "YOUR_API_KEY"
    }

Sample response

  
    {
        "message": "Schedule with id: '1234' was deleted successfully."
    }
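
As an illustration of the update and delete calls, the Python sketch below builds the corresponding requests with the standard library. The helper name `schedule_request` is hypothetical, and sending the payload as a JSON body with a `Content-Type: application/json` header is an assumption; the requests are constructed but not sent.

```python
import json
import urllib.request

SCHEDULES_URL = "https://www.page2api.com/api/v1/schedules"


def schedule_request(method, schedule_id, api_key, changes=None):
    """Build (but do not send) a PUT or DELETE request for one schedule."""
    if method not in ("PUT", "DELETE"):
        raise ValueError("method must be 'PUT' or 'DELETE'")

    payload = {"api_key": api_key}
    if method == "PUT":
        # Only updates carry extra parameters, e.g. refresh_interval.
        payload.update(changes or {})

    return urllib.request.Request(
        f"{SCHEDULES_URL}/{schedule_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption
        method=method,
    )


update = schedule_request(
    "PUT", 1234, "YOUR_API_KEY",
    changes={
        "refresh_interval": 1,
        "callback_url": "https://www.userapplication.com/new_callback_url",
    },
)
delete = schedule_request("DELETE", 1234, "YOUR_API_KEY")

for req in (update, delete):
    print(req.get_method(), req.full_url)
# Passing either object to urllib.request.urlopen would perform the call.
```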

Datacenter Proxy

Parameter: datacenter_proxy

The Datacenter Proxy is the default proxy used to scrape the web.
The default value is auto. With this value, the API tries to use the most suitable datacenter location for the scraped website.
You can also choose a specific location or set the value to random to pick a random datacenter proxy from the available locations.

Sample request payload using datacenter proxy from a specific location

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://whatismycountry.com",
      "datacenter_proxy": "de",
      "parse": {
        "country": "h2#country >> text"
      }
    }

Sample response

  
    {
      "result": {
        "country": "Your country is Germany"
      },
      "request": {
        "parse": {
            "country": "h2#country >> text"
        },
        "url": "https://whatismycountry.com",
        "datacenter_proxy": "de",
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 190,
      "pages_parsed": 1,
      "cost":0.00025,
      "success": true,
      "duration": 1.55
    }

Supported locations (11)

Location	API value
Auto (default)	auto
Random	random
EU	eu
USA	us
Germany	de
Romania	ro
Netherlands	nl
United Kingdom	gb
China	cn
Hong Kong	hk
Brazil	br
South Korea	kr
Singapore	sg
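
Since only the values in the table above are listed as supported, it can be useful to validate the location on the client before sending a request. The Python sketch below does that; the helper name `with_datacenter_proxy` is hypothetical, and the set of values is copied from the table, so it would need updating if the list changes.

```python
# API values copied from the "Supported locations" table above.
DATACENTER_LOCATIONS = {
    "auto", "random", "eu", "us", "de", "ro", "nl",
    "gb", "cn", "hk", "br", "kr", "sg",
}


def with_datacenter_proxy(payload, location="auto"):
    """Return a copy of `payload` with a validated datacenter_proxy value."""
    value = location.strip().lower()
    if value not in DATACENTER_LOCATIONS:
        raise ValueError(f"unsupported datacenter location: {location!r}")
    # Build a new dict so the caller's payload is left unchanged.
    return {**payload, "datacenter_proxy": value}


base = {
    "api_key": "YOUR_API_KEY",
    "url": "https://whatismycountry.com",
    "parse": {"country": "h2#country >> text"},
}
payload = with_datacenter_proxy(base, "DE")
print(payload["datacenter_proxy"])
```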

Premium Proxy

Parameter: premium_proxy

For hard-to-scrape websites, we offer the possibility to use a Premium Proxy, also known as a Residential proxy.
A Premium Proxy allows you to choose a specific country (or a random one) and surf the web as a real user in that area.

If you set the value to auto, the API tries to use the most suitable premium location for the scraped website.
You can also choose a specific location or set the value to random to pick a random premium proxy from the available locations.

Sample request payload using premium proxy

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://whatismycountry.com",
      "premium_proxy": "de",
      "parse": {
        "country": "h2#country >> text"
      }
    }

Sample response

  
    {
      "result": {
        "country": "Your country is Germany"
      },
      "request": {
        "parse": {
            "country": "h2#country >> text"
        },
        "url": "https://whatismycountry.com",
        "premium_proxy": "de",
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 190,
      "pages_parsed": 1,
      "cost":0.0025,
      "success": true,
      "duration": 1.55
    }

Supported locations (139)

Location	API value
Auto	auto
Random	random
Andorra	ad
UAE	ae
Afghanistan	af
Albania	al
Armenia	am
Angola	ao
Argentina	ar
Austria	at
Australia	au
Aruba	aw
Azerbaijan	az
Bosnia and Herzegovina	ba
Bangladesh	bd
Belgium	be
Bulgaria	bg
Bahrain	bh
Benin	bj
Bolivia	bo
Brazil	br
Bahamas	bs
Bhutan	bt
Belarus	by
Belize	bz
Canada	ca
Central African Republic	cf
Switzerland	ch
Côte d'Ivoire	ci
Chile	cl
Cameroon	cm
China	cn
Colombia	co
Costa Rica	cr
Cuba	cu
Cyprus	cy
Czech Republic	cz
Germany	de
Djibouti	dj
Denmark	dk
Dominica	dm
Ecuador	ec
Estonia	ee
Egypt	eg
Spain	es
EU	eu
Ethiopia	et
Finland	fi
Fiji	fj
France	fr
Great Britain	gb
Georgia	ge
Ghana	gh
Gambia	gm
Greece	gr
Hong Kong	hk
Honduras	hn
Croatia	hr
Haiti	ht
Hungary	hu
Indonesia	id
Ireland	ie
Israel	il
India	in
Iraq	iq
Iran	ir
Iceland	is
Italy	it
Jamaica	jm
Jordan	jo
Japan	jp
Kenya	ke
Cambodia	kh
South Korea	kr
Kazakhstan	kz
Lebanon	lb
Liechtenstein	li
Liberia	lr
Lithuania	lt
Luxembourg	lu
Latvia	lv
Morocco	ma
Monaco	mc
Moldova	md
Montenegro	me
Madagascar	mg
Macedonia	mk
Mali	ml
Myanmar	mm
Mongolia	mn
Mauritania	mr
Mauritius	mu
Maldives	mv
Mexico	mx
Malaysia	my
Mozambique	mz
Nigeria	ng
Netherlands	nl
Norway	no
New Zealand	nz
Oman	om
Panama	pa
Peru	pe
Philippines	ph
Pakistan	pk
Poland	pl
Puerto Rico	pr
Portugal	pt
Paraguay	py
Qatar	qa
Romania	ro
Serbia	rs
Russia	ru
Saudi Arabia	sa
Seychelles	sc
Sudan	sd
Sweden	se
Singapore	sg
Slovenia	si
Slovakia	sk
Senegal	sn
South Sudan	ss
Syria	sy
Chad	td
Togo	tg
Thailand	th
Turkmenistan	tm
Tunisia	tn
Turkey	tr
Trinidad and Tobago	tt
Taiwan	tw
Ukraine	ua
Uganda	ug
USA	us
Uruguay	uy
Uzbekistan	uz
British Virgin Islands	vg
Yemen	ye
South Africa	za
Zambia	zm
Zimbabwe	zw

Custom Proxy

Parameter: custom_proxy

You can provide your own proxy for scraping the web with Page2API.

Sample request payload using custom proxy

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://api.ipify.org?format=json",
      "custom_proxy": "http://username:[email protected]:1234",
      "parse": {
        "ip": "body >> json"
      }
    }

Sample response

  
    {
      "result": {
        "ip": {
          "ip": "192.168.13.14"
        }
      },
      "request": {
        "parse": {
          "ip": "body >> json"
        },
        "url": "https://api.ipify.org?format=json",
        "custom_proxy": "http://username:[email protected]:1234",
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 190,
      "pages_parsed": 1,
      "cost":0.00025,
      "success": true,
      "duration": 1.55
    }
