Page2API Documentation

Let's explore how to use the most advanced web scraping API.

Introduction

What is Page2API?

Page2API is a powerful tool designed for scraping web pages and converting HTML into a well-organized JSON structure.
It offers the possibility to launch long-running scrape sessions by using Asynchronous Scraping.
Aside from that, it also supports executing complex browser scenarios and handling pagination with ease.

Getting ready

Authentication

After you create your account, the first thing you will need in order to authenticate and start using the API is your api_key.
It is a randomly generated string that you will find on your Dashboard page, and it looks like this:


  0e72feee16180ef1f3f190ae350d74705d6ebec1
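Hint: keep the api_key out of your source code. A minimal sketch in Python, assuming the key is stored in an environment variable (the variable name PAGE2API_KEY is our own choice):

    # Read the API key from an environment variable
    # (PAGE2API_KEY is an assumed name - use whatever fits your setup).
    import os

    API_KEY = os.environ["PAGE2API_KEY"]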

The scraping endpoint

URL

  https://www.page2api.com/api/v1/scrape

Method

  POST

Sample request payload

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "real_browser": true,
      "parse": {
        "title_html": "h1",
        "link_text": "/html/body/div/p[2]/a >> text",
        "link_href": "/html/body/div/p[2]/a >> href"
      }
    }
  

Sample response for a successful request

  
    {
      "result": {
        "title_html": "<h1>Example Domain</h1>",
        "link_text": "More information...",
        "link_href": "https://www.iana.org/domains/example"
      },
      "request": {
        "parse": {
          "title_html": "h1",
          "link_text": "/html/body/div/p[2]/a >> text",
          "link_href": "/html/body/div/p[2]/a >> href"
        },
        "url": "https://www.example.com",
        "real_browser": true
      },
      "id": 123456,
      "pages_parsed": 1,
      "cost": 0.002,
      "success": true,
      "duration": 2.14
    }
  

Sample response for a failed request

  
    {
      "error" : "Api key was not found."
    }
  
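For reference, here is a minimal sketch of calling the endpoint from Python with the requests library. Any HTTP client will do; the endpoint and payload are the ones shown above, and we assume the endpoint accepts a JSON body, as the sample payloads suggest.

    # Minimal sketch: POST the sample payload to the scraping endpoint.
    import requests

    payload = {
        "api_key": "YOUR_API_KEY",
        "url": "https://www.example.com",
        "real_browser": True,
        "parse": {
            "title_html": "h1",
            "link_text": "/html/body/div/p[2]/a >> text",
            "link_href": "/html/body/div/p[2]/a >> href",
        },
    }

    response = requests.post("https://www.page2api.com/api/v1/scrape", json=payload)
    data = response.json()

    if data.get("success"):
        print(data["result"])     # {'title_html': '<h1>Example Domain</h1>', ...}
    else:
        print(data.get("error"))  # e.g. 'Api key was not found.'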



Accessing the scraping endpoint via GET with encoded payload:

URL

  https://www.page2api.com/api/v1/scrape/encoded/{base64_urlsafe_encoded_payload}

Method

  GET

Sample request payload

{ "api_key": "YOUR_API_KEY", "url": "https://www.example.com", "real_browser": true, "parse": { "title_html": "h1", "link_text": "/html/body/div/p[2]/a >> text", "link_href": "/html/body/div/p[2]/a >> href" } }

Encode the payload above with URL-safe Base64 and append it to the endpoint URL.
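A minimal sketch of building such a URL in Python, assuming the payload is the JSON document encoded with URL-safe Base64 (as the endpoint path suggests):

    # Encode the JSON payload with URL-safe Base64 and build the GET URL.
    import base64
    import json

    payload = {
        "api_key": "YOUR_API_KEY",
        "url": "https://www.example.com",
        "real_browser": True,
        "parse": {"title_html": "h1"},
    }

    encoded = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    print(f"https://www.page2api.com/api/v1/scrape/encoded/{encoded}")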

Parameters overview

api_key (string, required)
  Your Page2API API key.

url (string, required)
  The URL of the page that will be scraped.

user_agent (string, optional)
  Set a custom user agent that will be used for the request.

locale (string, optional)
  Set a custom locale that will be used for the request. Ex: es or pt-BR.

parse (object, optional)
  The object that consists of field names and selectors that will extract the data and build the result.
  The HTML of the page will be returned if empty.
  (See "Data extraction" below.)

batch (object, optional)
  An object that contains the following properties: urls, concurrency, and merge_results.
  It provides the possibility to scrape web pages in batches with a specific concurrency.
  (See "Batch scraping" below.)

scenario (array of objects, optional)
  A collection of instructions that the browser will execute.
  (See "Browser scenario" below.)

real_browser (boolean, optional)
  Use a headless Chrome instance to open the URL.
  Default: false

javascript (boolean, optional)
  Render the JavaScript on the page when using a headless browser (real_browser).
  Default: true

import_jquery (boolean, optional)
  Import the latest version of the jQuery library into the browser instance.
  Default: false

window_size (array of integers, optional)
  Set a custom window size for the browser.
  Format: [width, height]. Default: [1920, 1080]

wait (integer, seconds, optional)
  Just wait, and give the browser some time to rest and meditate on the meaning of life.
  Max value: 10 (seconds)

wait_for (string, css/xpath selector, optional)
  Wait for a specific element to appear on the page.
  Max wait: 10 seconds

wait_until (string, JS snippet, optional)
  Wait for a JavaScript snippet to return a Truthy value.
  Max wait: 10 seconds

cookies (object, optional)
  Set custom cookies that will be used for the request.
  (See "Cookies" below.)

sanitize (boolean, optional)
  Remove all extra whitespace from the parsed content.
  Default: true

raw (boolean / object, optional)
  Return only the scraping result in the response, with a custom format (CSV, JSON, TEXT, HTML).
  (See "Custom response" below.)

absolute_urls (boolean, optional)
  Ensure that all parsed attributes that contain a URL have absolute paths.
  Supported attributes: action, archive, background, cite, classid, codebase, data, dsync,
  formaction, href, icon, longdesc, manifest, poster, profile, src, usemap.
  Default: true

log_requests (boolean, optional)
  Return all network requests.
  Default: false

async (boolean, optional)
  Perform the request asynchronously.
  Receive the response via the callback URL specified on the profile page.
  Default: false
  (See "Async scraping" below.)

callback_url (string, optional)
  A custom callback URL for a specific scrape request.
  Default: the callback URL from the user's profile

passthrough (string / integer / object, optional)
  Any data added to this parameter will be returned in the response
  or sent in any subsequent callbacks.

request_method (string, optional)
  Set a custom request method for the request.
  Possible values: GET, POST, PUT, PATCH, DELETE, HEAD. Default: GET
  (See "Custom request" below.)

post_body (object / string (json), optional)
  Set a post body for the request.
  Example: { "post_body": { "query": "web scraping" }}

headers (object, optional)
  Set custom headers that will be used for the request.
  Example: { "headers": { "Content-Type": "application/json" }}

refresh_interval (integer, minutes, optional)
  Create a scheduled parsing that will run every n minutes.
  Min: 1, Max: 2592000
  (See "Scheduled scraping" below.)

merge_loops (boolean, optional)
  Merge the results obtained during the parsing of the paginated views (loops).
  Default: false

datacenter_proxy (string, optional)
  The code of the datacenter proxy to be used for the request.
  Default: auto
  (See "Datacenter Proxy" below.)

premium_proxy (string, optional)
  The code of the premium proxy to be used for the request.
  (See "Premium Proxy" below.)

custom_proxy (string, optional)
  Provide your own proxy to be used for the request.
  (See "Custom Proxy" below.)

Data extraction

Parameter: parse

The parse parameter represents an object, where the keys are the names of the fields you want to create and the values are the selectors that will extract the data.


A simple selector consists of two parts:

1. A css/xpath selector (required)

   Examples:

     a
     /html/body/div/p[2]/a
     /html/body/div/p[*]/a

2. '>>' concatenated with a selector function (optional)

   The selector function can be the name of any attribute of the element,
   as well as one of the special ones:

     text - extracts the text from the element
     json - parses the content from the element that contains a JSON object

   Note: If no function is specified, the HTML of the element will be returned.

   Examples:

     >> href
     >> title
     >> text
     >> json

1. Extracting one element per selector

Having the following element on the page:

  
    <a href="https://example.com">Example</a>
  

The simplest example of a parse parameter will look like:

  
    // The 'parse' parameter:

    "parse": {
      "link": "a"
    }

    // The result:

    {
      "link": "<a href='https://example.com'>Example</a>"
    }
  

A parse parameter where the selector function is present:

  
    // The 'parse' parameter:

    "parse": {
      "link_href": "a >> href"
    }

    // The result:

    {
      "link_href": "https://example.com"
    }
  


2. Extracting all elements with a specific selector

In order to extract all elements that share a selector, you must wrap the selector in [ ], like in the following examples:


  
    ["/html/body/div/p[*]/a"]
    /* without a selector function */
  
  
    ["a >> href"]
    /* with a selector function (extract all hrefs) */
  

Having the following elements on the page:

  
    <a href='https://example.com'>Example</a>
    <a href='https://www.page2api.com'>Page2API</a>
    <a href='https://ipapi.co/api'>IpApi</a>
  

A selector that will extract all links will look like:

  
    // The 'parse' parameter:

    "parse": {
      "links": ["a"]
    }

    // The result:

    {
      "links": [
        "<a href='https://example.com'>Example</a>",
        "<a href='https://www.page2api.com'>Page2API</a>",
        "<a href='https://ipapi.co/api'>IpApi</a>"
      ]
    }
  

And if you want to extract specific attributes/content:

  
    // The 'parse' parameter:

    "parse": {
      "links_text": ["a >> text"]
    }

    // The result:

    {
      "links_text": [
        "Example",
        "Page2API",
        "IpApi"
      ]
    }
  


3. Extracting nested elements with different selectors

This scenario is used if you want to parse elements from repeating structures, for example a list of articles, products, posts and so on. In order to extract the elements mentioned above, you must wrap the whole { name1: selector1, name2: selector2 } structure in [ ], like in the following example:


  
    "parse": {
      "posts": [
        {
          "_parent": ".feed-item",
          "title":  ".feed-item_title-link >> text",
          "link":   ".feed-item_title-link >> href",
          "author": "span.user-link_name >> text"
        }
      ]
    }
  

Please note that each structure must have a _parent key that will define the parent for the parsed elements:


  
    "_parent": ".feed-item"
  

Having the following structure of elements on the page:

  
  
    <div class='all-posts'>
      <div class='post'>
        <a class='title' href='/posts/123'>Post one title</a>
        <span class='comments'>(3 comments)</span>
        <a class='author' href='/author/757'>Author One</a>
      </div>
      <div class='post'>
        <a class='title' href='/posts/234'>Post two title</a>
        <span class='comments'>(no comments)</span>
        <a class='author' href='/author/347'>Author Two</a>
      </div>
      <div class='post'>
        <a class='title' href='/posts/456'>Post three title</a>
        <span class='comments'>(1 comment)</span>
        <a class='author' href='/author/923'>Author Three</a>
      </div>
    </div>
  
  

A selector that will extract the data about each post will look like:

  
    // The 'parse' parameter:

    "parse": {
      "posts": [
        {
          "_parent": ".post",
          "title": "a.title >> text",
          "link": "a.title >> href",
          "author": "a.author >> text"
          "comments": "span.comments >> text"
        }
      ]
    }

    // The result:

    {
      "posts": [
        {
          "title": "Post one title",
          "link": "/posts/123",
          "author": "Author One"
          "comments": "(3 comments)"
        },
        {
          "title": "Post two title",
          "link": "/posts/234",
          "author": "Author Two"
          "comments": "(no comments)"
        },
        {
          "title": "Post one title",
          "link": "/posts/456",
          "author": "Author Three"
          "comments": "(1 comment)"
        },
      ]
    }
  


4. Extracting data from tables

Tables are parsed automatically; there is no need to specify any selector function for them.


Having the following table on the page:

  
  
    <table class='people'>
      <thead>
        <tr>
          <th>Firstname</th>
          <th>Lastname</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Jill</td>
          <td>Smith</td>
        </tr>
        <tr>
          <td>Eve</td>
          <td>Jackson</td>
        </tr>
      </tbody>
    </table>
  
  

A selector that will extract the data from this table will look like:

  
    // The 'parse' parameter:

    "parse": {
      "people": "table.people"
    }

    // The result:

    {
      "people": [
        {
          "Firstname": "Jill",
          "Lastname": "Smith"
        },
        {
          "Firstname": "Eve",
          "Lastname": "Jackson"
        }
      ]
    }
  

To summarize, here is a sample request with a complex parse parameter:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.indiehackers.com",
      "sanitize": true,
      "parse": {
        "footer": "/html/body/div[1]/div/footer/div[1]/form/p[1] >> text",
        "sections": [".posts-section__nav-content >> text"],
        "posts": [
          {
            "_parent": ".feed-item",
            "title":  ".feed-item__title-link >> text",
            "link":   ".feed-item__title-link >> href",
            "author": "span.user-link__name >> text"
          }
        ],
        "side_links": [
          {
            "_parent": ".news-section__item",
            "title": ".news-section__item-title >> text",
            "link": "_parent >> href",
            "category": "/html/body/div[1]/div/div[2]/div[1]/div/a[*]/div/span[1] >> text"
          }
        ]
      }
    }
  

Custom request

Parameters: request_method post_body headers

The parameters above allow you to build custom scraping requests with a specific request method, body, and headers.


For better flexibility, the post_body parameter can be an object, as well as a JSON string.

    // object
    "post_body": { "test": "success" }
  

    // JSON string
    "post_body": "{ \"test\": \"success\" }"
  

Sample POST request with form data

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "http://httpbin.org/anything",
      "post_body": { "testing": "true" },
      "request_method": "POST",
      "parse": {
        "data": "body >> json"
      }
    }
  

Sample response

  
    {
      "result": {
        "data": {
          "form": {
            "testing": "true"
          },
          "method": "POST",
          "headers": {
            "Content-Type": "application/x-www-form-urlencoded",
            ...
          },
          ...
        }
      },
      "request": {
        "parse": {
          "data": "body >> json"
        },
        "url": "http://httpbin.org/anything",
        "request_method": "POST",
        "post_body": {
          "testing": "true"
        },
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 12345,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.22
    }
  

Sample POST request with JSON body and custom headers

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "http://httpbin.org/anything",
      "post_body": "{ \"testing\": \"true\" }",
      "request_method": "POST",
      "headers": {
        "Content-Type": "application/json"
      },
      "parse": {
        "data": "body >> json"
      }
    }
  

Sample response

  
    {
      "result": {
        "data": {
          "data": "{ \"testing\": \"true\" }",
          "method": "POST",
          "json": {
            "testing": "true"
          },
          "headers": {
            "Content-Type": "application/json",
            ...
          },
          ...
        }
      },
      "request": {
        "parse": {
          "data": "body >> json"
        },
        "url": "http://httpbin.org/anything",
        "request_method": "POST",
        "post_body": "{ \"testing\": \"true\" }",
        "headers": {
          "Content-Type": "application/json"
        },
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 12345,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.22
    }
  

Custom response

Parameter: raw

This parameter allows the customization of the scraping response.
By default, it returns only the scraping result, in JSON format, without any additional properties.


There are 2 ways to use the raw parameter:

1. You can set it to true, and it will return only the scraping result, in JSON format.
2. If you want to customize the response, you can send it as an object that can contain the following properties:

format (string, optional)
  The response format. Possible values: csv, auto.
  Default: auto (will return JSON, HTML, or TEXT, depending on the presence of the parse parameter)

key (string, optional)
  The key that should be used as a source for the response.

The simplest example:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com",
      "raw": true,
      "parse": {
        "features": [
          {
            "_parent": ".feature-container",
            "title": "h2 >> text"
          }
        ]
      }
    }
  

The payload above will generate the following response:

  
    {
      "features": [
        { "title": "Intuitive and powerful API" },
        { "title": "Asynchronous scraping" },
        { "title": "Javascript rendering" },
        { "title": "Scheduled scraping" },
        { "title": "Custom browser scenarios" },
        { "title": "Fast and reliable proxies" }
      ]
    }
  

Note: As you can see, the response has only the features key, without any additional properties such as duration, cost, etc.




If we want to dig into our response and return a specific key:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com",
      "raw": {
        "key": "features"
      },
      "parse": {
        "features": [
          {
            "_parent": ".feature-container",
            "title": "h2 >> text"
          }
        ]
      }
    }
  

The payload above will generate the following response:

  
    [
      { "title": "Intuitive and powerful API" },
      { "title": "Asynchronous scraping" },
      { "title": "Javascript rendering" },
      { "title": "Scheduled scraping" },
      { "title": "Custom browser scenarios" },
      { "title": "Fast and reliable proxies" }
    ]
  



If we want to return our result as CSV:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com",
      "raw": {
        "key": "features",
        "format": "csv"
      },
      "parse": {
        "features": [
          {
            "_parent": ".feature-container",
            "title": "h2 >> text"
          }
        ]
      }
    }
  

The payload above will generate the following response:

  
    title
    Intuitive and powerful API
    Asynchronous scraping
    Javascript rendering
    Scheduled scraping
    Custom browser scenarios
    Fast and reliable proxies
  

Hint: The example above is helpful when we want to import the result directly into a spreadsheet without any code.


The spreadsheet formula for this use case could look like the following:

  
    =IMPORTDATA("https://www.page2api.com/api/v1/scrape/encoded/{urlsafe_base64_encoded_params}")
  


JavaScript rendering

Parameter: javascript

When scraping a page with real_browser set to true, the API will automatically execute all the JavaScript on that page.
This can be useful for scraping websites that load content dynamically with plain JavaScript or a framework such as React, Angular, Vue, or jQuery.

To scrape the page with a headless browser, but with JavaScript disabled, just set javascript to false.


  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.whatismybrowser.com/",
      "real_browser": true,
      "javascript": false,
      "parse": {
        "browser": ".string-major >> text",
        "javascript": "#javascript-detection >> text"
      }
    }
  

Note: this parameter is available only when real_browser is set to true.

Keep in mind that even if javascript is set to false, you can still run your own JavaScript snippets on the page.



What is the advantage of using this parameter?

A request without a real browser will cost the same as a request with a real browser but with JavaScript disabled.


  
    // This request will use a rest client to fetch the web page.
    // It will be faster than the example below but can sometimes be detected.

    "real_browser": false
  

  
    // This request will use a headless chrome with the JavaScript disabled.
    // It will be slightly slower than the previous example but will be harder to detect.

    "real_browser": true,
    "javascript": false
  

Both examples will cost the same.

Browser scenario

Parameter: scenario

The scenario parameter represents a collection of browser instructions, such as:


1. wait
2. wait for element
3. execute javascript
4. (native) fill input
5. (native) click
6. start a cycle (loop)
7. initiate the parsing

The instructions are used to interact with the web page, according to a specific scenario.

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.


The scenario parameter has the following format:

  
    "scenario" : [
      { "execute_js": "$($('select')[5]).val('yes').change()" },
      { "wait": 0.1 },
      {
        "loop" : [
          { "wait_for": "li.next a" },
          { "execute": "parse" },
          { "execute_js": "document.getElementById('proxylisttable_next').click()" },
          { "wait": 0.1 }
        ],
        "iterations": 10, // in this case - this parameter is optional
        "stop_condition": "document.getElementById('proxylisttable_next').classList.contains('disabled')"
      }
    ]
  

Note: a loop is just a collection of instructions that are executed in a cycle.


For a loop, an iterations or a stop_condition parameter is necessary.

iterations is the number of loop cycles. This parameter is optional if a stop_condition is present.
stop_condition is a js snippet that is executed after each iteration; if it returns true, the loop is stopped.

Hint: The most relevant use case for a loop is parsing paginated views.


All available commands:

  { "wait": 0.1 }
    Tells the browser to take a small break.
    The value is a number of seconds (max: 10).

  { "wait_for": "li.next a" }
    Waits until a specific element appears on the page.
    The timeout is 10 seconds.

  { "execute_js": "$('#proxylisttable_next').click()" }
    Executes a js snippet. All js errors are ignored.

  { "fill_in": ["input#search", "Page2API"] }
    Fills an input, natively. Each character is sent separately.

  { "click": "button[type=submit]" }
    Clicks an element, natively.

  { "execute": "parse" }
    Initiates the parsing with the current HTML on the page.

  { "loop": [/* commands */] }
    Executes a set of commands in a cycle.

Wait

Parameter: wait scenario.wait loop.wait

This parameter gives the web page some time (in seconds) to render before the browser captures the HTML.
The typical use case occurs when interacting with the web page via the scenario parameter, or when some content is rendered asynchronously after the page load.
Maximum value: 10 (seconds).


    "wait": 2
  

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.

Wait for element

Parameter: wait_for scenario.wait_for loop.wait_for

This parameter gives the web page some time (in seconds) to render a particular element before the browser captures the HTML.
Maximum timeout: 10 (seconds).


    "wait_for": "li.next"
  

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.

Wait until

Parameter: wait_until scenario.wait_until loop.wait_until

This parameter makes the browser wait until a JavaScript snippet returns a Truthy value before capturing the HTML.
Maximum timeout: 10 (seconds).


    
      "wait_until": "document.querySelectorAll('.element').length == 10"

      // or base64 encoded:

      "wait_until": "ZG9jdW1lbnQucXVlcnlTZWxlY3RvckFsbCgnLmVsZW1lbnQnKS5sZW5ndGggPT0gMTA="
    
  

Note: this parameter is available only when real_browser is set to true and javascript is not disabled.

Cookies

Parameter: cookies

This parameter allows you to send custom cookies that will be used for the request.
Format: an object with string values.


    "cookies": { "test": "success", "session": "123asdqwe" }
  

When using this parameter with real_browser set to true, the response will contain information about all cookies that were set, as in the example below:

Sample request using real_browser and cookies

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "http://httpbin.org/cookies?json",
      "cookies": { "testing": "true" },
      "real_browser": true,
      "parse": {
        "cookies": "body >> json"
      }
    }
  

Sample response

  
    {
      "result": {
        "cookies": {
          "cookies": {
            "testing": "true"
          }
        }
      },
      "request": {
        "parse": {
          "cookies": "body >> json"
        },
        "url": "http://httpbin.org/cookies?json",
        "cookies": {
          "testing": "true"
        },
        "real_browser": true,
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 12345,
      "pages_parsed": 1,
      "cost": 0.002,
      "success": true,
      "extra": {
        "cookies": [
          {
            "name": "testing",
            "value": "true",
            "path": "/",
            "domain": "httpbin.org",
            "expires": null,
            "secure": false
          }
        ]
      },
      "duration": 5.22
    }
  

Execute JavaScript

Parameter: scenario.execute_js loop.execute_js

Page2API can execute custom JavaScript code during the scraping session.
This is useful when you need to interact with the web page before or during parsing.

It is performed via the scenario parameter that was described earlier.


The javascript snippet can be sent in one of two formats:
Raw:

  
    document.querySelector('.morelink').click()
  

Base64 encoded:

  
    ZG9jdW1lbnQucXVlcnlTZWxlY3RvcignLm1vcmVsaW5rJykuY2xpY2soKQ==
  
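A minimal sketch of producing the Base64-encoded form of a snippet in Python:

    # Base64-encode a JavaScript snippet for use in "execute_js".
    import base64

    snippet = "document.querySelector('.morelink').click()"
    print(base64.b64encode(snippet.encode()).decode())
    # => ZG9jdW1lbnQucXVlcnlTZWxlY3RvcignLm1vcmVsaW5rJykuY2xpY2soKQ==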

The simplest way of using this parameter is shown below:

  
    "scenario" : [
      { "execute_js": "document.querySelector('.morelink').click()" },
      { "wait": 0.1 },
      { "execute": "parse" }
    ]
  

Note: this parameter is available only when real_browser is set to true.

JavaScript selectors

Parameter: parse

Page2API can execute custom JavaScript code during the scraping session.
The executed code can be used as a selector to ease the data extraction process.
This is useful when you need to access data that is stored in the JavaScript code on the page.

A javascript selector has the following format:

  
    js >> raw_or_base64_js_snippet
  

The simplest way of using this parameter is shown below:

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.page2api.com/",
      "real_browser": true,
      "parse": {
        "title": "js >> $('h1').text().trim()",
        "location": "js >> document.location.href",
        "js_variable": "js >> let object = { arr: [1, 2] }; object",
        "base64_js": "js >> bGV0IG9iaiA9IHsgb25lX3R3bzogWzEsIDJdIH07IG9iag=="
      }
    }
  

The payload above will return the following result:

  
    {
      "result": {
        "title": "The Ultimate Web Scraping API",
        "location": "https://www.page2api.com/",
        "js_variable": {
          "arr": [1, 2]
        },
        "base64_js": {
          "one_two": [1, 2]
        }
      }
    }
  

Note: this parameter is available only when real_browser is set to true.

Fill input [native]

Parameter: scenario.fill_in loop.fill_in

Page2API can fill inputs natively by using the fill_in scenario command.
The format is an array, where the first element is a css/xpath selector and the second one is the value.

A simple way of using this parameter is shown below:

  
    "scenario" : [
      { "fill_in": ["input#search", "Page2API"] },
      { "wait_for": ".search-results" },
      { "execute": "parse" }
    ]
  

Note: this parameter is available only when real_browser is set to true.

Click [native]

Parameter: scenario.click loop.click

Page2API can click natively on visible elements from the page by using the click scenario command.
The format is a string that represents a css/xpath selector of the element that must be clicked.

A simple way of using this parameter is shown below:

  
    "scenario" : [
      { "fill_in": ["input#search", "Page2API"] },
      { "click": "button[type=submit]" },
      { "execute": "parse" }
    ]
  

Note: this parameter is available only when real_browser is set to true.

Handle pagination

Parameter: scenario.loop

Page2API can handle paginated views, such as the classic ones with links to the pages, as well as infinite scrolls.
This is done via a loop command from inside a scenario.

Note: a loop is just a collection of instructions that are executed in a cycle.


For a loop, an iterations or a stop_condition parameter is necessary.

iterations is the number of loop cycles. This parameter is optional if a stop_condition is present.
stop_condition is a js snippet that is executed after each iteration; if it returns true, the loop is stopped.

The pagination is handled via the scenario parameter that was described earlier.

The simplest way of handling a paginated view is shown below:

  
    "scenario" : [
      {
        "loop" : [
          { "wait_for": "li.next a" },
          { "execute": "parse" },
          { "execute_js": "document.getElementById('proxylisttable_next').click()" }
        ],
        "iterations": 10, // in this case - this parameter is optional
        "stop_condition": "document.getElementById('proxylisttable_next').classList.contains('disabled')"
      }
    ]
  

Note: this parameter is available only when real_browser is set to true.

Batch scraping

Parameter: batch

Page2API can scrape web pages in batches and handle concurrency for you.

The batch feature has two variants:

1. Basic batching (same payload, different URLs)

This feature is useful when scraping multiple web pages with the same selectors.

There are 2 common use cases for this feature:
1. Scraping a paginated view
2. Scraping a collection of individual pages, using the same selectors

The batch parameter represents an object that contains the following properties:

urls (string / array, required)
  The URLs that need to be scraped.
  There are two ways of using this parameter:

  1. By defining an array of hardcoded URLs, like in the example below:

    "urls": [
      "https://companiesmarketcap.com/page/1/",
      "https://companiesmarketcap.com/page/2/",
      "https://companiesmarketcap.com/page/3/"
    ]

  2. By defining a URL generation rule:

    "urls": "https://companiesmarketcap.com/page/[1, 3, 1]/"

  The URL generation rule has the following format:

    [START, END, STEP]

  where each element of the rule is an integer.
  (The rule above generates the URLs for pages 1, 2, and 3.)

concurrency (integer, required)
  The number of pages that should be scraped at the same time.
  For a Free Trial account it must be equal to 1;
  for a Paid one, between 1 and the maximum value allowed by your account settings.

merge_results (boolean, optional)
  Merge the results obtained during the parsing of each page.
  Default: false

An example of a payload with a predefined collection of URLs:

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "concurrency": 3,
        "urls": [
          "https://www.ebay.com/itm/334297414333",
          "https://www.ebay.com/itm/392912936671",
          "https://www.ebay.com/itm/174045421299"
        ]
      },
      "parse": {
        "title": "h1 >> text",
        "price": "#prcIsum >> text",
        "url": "link[rel=canonical] >> href"
      }
    }
  

An example of a payload with auto-generated URLs:

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "merge_results": true,
        "concurrency": 3,
        "urls": "https://companiesmarketcap.com/page/[1, 3, 1]/"
      },
      "parse": {
        "data": "table"
      }
    }
  

Note: when performing batch scraping, the url parameter, which is normally used to scrape a single page, is optional (see the examples above).



2. Advanced batching (different payloads, different URLs)

This feature is useful when scraping multiple web pages with custom selectors or payloads.

As in the previous variant, the batch parameter represents an object that contains the following properties:

payloads (array of objects, required)
  A collection of individual payloads, assembled from the list of available parameters.

  Sample value:

    {
      "payloads": [
        {
          "url": "https://httpbin.org/anything?a=1",
          "request_method": "POST",
          "post_body": { "post": true }
        },
        {
          "url": "https://httpbin.org/anything?a=2",
          "request_method": "PUT",
          "post_body": { "put": true }
        }
      ], ...
    }

concurrency (integer, required)
  The number of pages that should be scraped at the same time.
  For a Free Trial account it must be equal to 1;
  for a Paid one, between 1 and the maximum value allowed by your account settings.

merge_results (boolean, optional)
  Merge the results obtained during the parsing of each page.
  Default: false

An example of a payload with common parameters (parse):

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "payloads": [
          {
            "url": "https://httpbin.org/anything?a=1",
            "request_method": "POST",
            "post_body": { "post": true }
          },
          {
            "url": "https://httpbin.org/anything?a=2",
            "request_method": "PUT",
            "post_body": { "put": true }
          }
        ],
        "concurrency": 1,
        "merge_results": false
      },
      "parse": {
        "data": "body >> json"
      }
    }
  

An example of a payload with fully-customizable parameters:

  
    {
      "api_key": "YOUR_API_KEY",
      "batch": {
        "payloads": [
          {
            "url": "https://www.page2api.com",
            "parse": {
              "title": "h1 >> text"
            },
            "real_browser": true
          },
          {
            "url": "https://www.example.com",
            "parse": {
              "description": "p >> text"
            }
          }
        ],
        "concurrency": 1,
        "merge_results": false
      }
    }
  

Async scraping

Parameter: async

Usually, any request that takes more than 120 seconds will be interrupted.
To handle long-running scraping requests (up to 240 seconds), Page2API can scrape web pages in the background.
This is done by adding the async parameter to the request and setting it to true.


Sample async request payload

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "async": true,
      "parse": {
        "title": "h1",
      }
    }
  

Sample response for async request

  
    {
      "id": 123456,
      "performed_async": true
    }
  


After the scraping is done, the result will be sent to the Callback url that you set on your profile page.


Sample request to your callback url

  
    {
      "result": {
        "title": "<h1>Example Domain</h1>",
      },
      "request": {
        "parse": {
          "title": "h1"
        },
        "url": "https://www.example.com",
        "callback_url": "https://www.userapplication.com/callback"
      },
      "id": 123456,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.85
    }
  
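On your side, the callback receiver only needs to accept a JSON POST request, as the sample above suggests. A minimal sketch using Flask (the framework choice and the /callback route are our own assumptions; any web framework will do):

    # Minimal callback receiver sketch - accepts the JSON POST shown above.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/callback", methods=["POST"])
    def page2api_callback():
        data = request.get_json()
        if data.get("success"):
            print(data["id"], data["result"])  # store or process the result
        return "", 204                         # acknowledge receipt

    if __name__ == "__main__":
        app.run(port=8000)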


You can set a custom Callback URL for each of your asynchronous requests.


Sample async request with custom callback_url

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "callback_url": "https://www.userapplication.com/custom_callback"
      "async": true,
      "parse": {
        "title": "h1",
      }
    }
  


You can also use the passthrough field for your asynchronous requests; this field will be returned within the callback request.


Sample async request with passthrough

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "passthrough": {
        "custom_field": "qwe123asd",
        "passthrough_can_be_integer_string_or_object": true
      },
      "async": true,
      "parse": {
        "title": "h1",
      }
    }
  

Sample request to your callback url with passthrough field

  
    {
      "result": {
        "title": "<h1>Example Domain</h1>",
      },
      "request": {
        "parse": {
          "title": "h1"
        },
        "url": "https://www.example.com",
        "passthrough": {
          "custom_field": "qwe123asd",
          "passthrough_can_be_integer_string_or_object": true
        },
        "callback_url": "https://www.userapplication.com/callback"
      },
      "id": 123456,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.85
    }
  

Scheduled scraping

Parameter: refresh_interval

Page2API can create a schedule and scrape web pages automatically in the background.
To create a schedule, just add the refresh_interval (minutes) parameter to the request, with a value between 1 and 2592000 (30 days).
We will run the schedule according to the interval you specified and send the results to your Callback url
or to a custom callback url that you can set per request.

After creating a schedule, you will be able to view it, update its interval, and delete it entirely.


Sample scheduled async request

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://www.example.com",
      "async": true,
      "refresh_interval": 5,
      "parse": {
        "title": "h1 >> text",
      }
    }
  

Sample response for scheduled async request

  
    {
      "id": 123456,
      "performed_async": true,
      "schedule_id": 1234
    }
  

Sample scheduled request to your callback url

  
    {
      "result": {
        "title": "Example Domain",
      },
      "request": {
        "parse": {
          "title": "h1 >> text"
        },
        "url": "https://www.example.com",
        "callback_url": "https://www.userapplication.com/callback"
      },
      "id": 123456,
      "schedule_id": 1234,
      "pages_parsed": 1,
      "cost": 0.00025,
      "success": true,
      "duration": 1.85
    }
  

Note: The schedule_id will be returned in the response, regardless of the async parameter value.


1. View all schedules

URL

  https://www.page2api.com/api/v1/schedules

Method

  GET

Sample response

  
    [
        {
            "id": 1234,
            "refresh_interval": 5,
            "last_refresh_at": "2021-08-01T09:52:33Z",
            "next_refresh_at": "2021-08-01T09:53:33Z",
            "active": true,
            "options": {
                "url": "https://www.example.com",
                "parse": {
                    "title": "h1 >> text"
                },
                "callback_url": "https://www.userapplication.com/callback"
                "refresh_interval": 5
            },
            "created_at": "2021-08-01T09:51:21Z",
            "updated_at": "2021-08-01T09:52:33Z",
            "scrape_records_count": 14
        }
    ]
  

2. Update a schedule

For a specific Schedule, you can update any of the following parameters:

  • parse
  • batch
  • wait_for
  • wait_until
  • wait
  • scenario
  • url
  • refresh_interval
  • callback_url
  • user_agent
  • javascript
  • cookies
  • passthrough
  • merge_loops
  • log_requests
  • raw
  • headers
  • request_method
  • post_body
  • premium_proxy
  • datacenter_proxy
  • real_browser
  • absolute_urls
  • import_jquery
  • custom_proxy
  • locale
  • sanitize
URL

  https://www.page2api.com/api/v1/schedules/:id

Method

  PUT

Sample payload

  
    {
      "api_key": "YOUR_API_KEY",
      "refresh_interval": 1,
      "callback_url": "https://www.userapplication.com/new_callback_url"
    }
  

Sample response

  
    {
      "id": 1234,
      "refresh_interval": 1,
      "last_refresh_at": "2021-08-01T09:52:33Z",
      "next_refresh_at": "2021-08-01T09:53:33Z",
      "active": true,
      "options": {
          "url": "https://www.example.com",
          "parse": {
              "title": "h1 >> text"
          },
          "callback_url": "https://www.userapplication.com/new_callback_url"
          "refresh_interval": 1
      },
      "created_at": "2021-08-01T09:51:21Z",
      "updated_at": "2021-08-01T09:52:33Z",
      "scrape_records_count": 14
    }
  

3. Delete a schedule

URL

  https://www.page2api.com/api/v1/schedules/:id

Method

  DELETE

Sample payload

  
    {
      "api_key": "YOUR_API_KEY",
    }
  

Sample response

  
    {
        "message": "Schedule with id: '1234' was deleted successfully."
    }
  
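A minimal sketch of updating and then deleting a schedule from Python with the requests library, using the endpoints and payloads shown above (schedule id 1234 is the sample value):

    # Update, then delete, schedule 1234 via the schedules endpoints.
    import requests

    BASE = "https://www.page2api.com/api/v1/schedules"
    auth = {"api_key": "YOUR_API_KEY"}

    # Update the refresh interval.
    updated = requests.put(f"{BASE}/1234", json={**auth, "refresh_interval": 1}).json()
    print(updated["refresh_interval"])  # => 1

    # Delete the schedule.
    print(requests.delete(f"{BASE}/1234", json=auth).json()["message"])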

Datacenter Proxy

Parameter: datacenter_proxy

The Datacenter Proxy is the default proxy used to scrape the web.
The default value is auto. With this value, the API will try to use the most suitable datacenter location for the scraped website.
You can also choose a specific location, or set the value to random to pick a random datacenter proxy from the available locations.


Sample request payload using datacenter proxy from a specific location

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://whatismycountry.com",
      "datacenter_proxy": "de",
      "parse": {
        "country": "h2#country >> text"
      }
    }
  

Sample response

  
    {
      "result": {
        "country": "Your country is Germany"
      },
      "request": {
        "parse": {
            "country": "h2#country >> text"
        },
        "url": "https://whatismycountry.com",
        "premium_proxy": "de",
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 190,
      "pages_parsed": 1,
      "cost":0.00025,
      "success": true,
      "duration": 1.55
    }
  

Supported locations (6)

  Location         API value
  Auto (default)   auto
  Random           random
  EU               eu
  USA              us
  Germany          de
  Romania          ro
  Netherlands      nl
  United Kingdom   uk

Premium Proxy

Parameter: premium_proxy

For hard-to-scrape websites, we offer the possibility to use a Premium Proxy, also known as a Residential Proxy.
Premium Proxies allow you to choose a specific country (or a random one) and surf the web as a real user in that area.

If you set the value to auto, the API will try to use the most suitable premium location for the scraped website.
You can also choose a specific location, or set the value to random to pick a random premium proxy from the available locations.


Sample request payload using premium proxy

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://whatismycountry.com",
      "premium_proxy": "de",
      "parse": {
        "country": "h2#country >> text"
      }
    }
  

Sample response

  
    {
      "result": {
        "country": "Your country is Germany"
      },
      "request": {
        "parse": {
            "country": "h2#country >> text"
        },
        "url": "https://whatismycountry.com",
        "premium_proxy": "de",
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 190,
      "pages_parsed": 1,
      "cost":0.0025,
      "success": true,
      "duration": 1.55
    }
  

Supported locations (139)

  Location                  API value
  Auto                      auto
  Random                    random
  Andorra                   ad
  UAE                       ae
  Afghanistan               af
  Albania                   al
  Armenia                   am
  Angola                    ao
  Argentina                 ar
  Austria                   at
  Australia                 au
  Aruba                     aw
  Azerbaijan                az
  Bosnia and Herzegovina    ba
  Bangladesh                bd
  Belgium                   be
  Bulgaria                  bg
  Bahrain                   bh
  Benin                     bj
  Bolivia                   bo
  Brazil                    br
  Bahamas                   bs
  Bhutan                    bt
  Belarus                   by
  Belize                    bz
  Canada                    ca
  Central African Republic  cf
  Switzerland               ch
  Côte d'Ivoire             ci
  Chile                     cl
  Cameroon                  cm
  China                     cn
  Colombia                  co
  Costa Rica                cr
  Cuba                      cu
  Cyprus                    cy
  Czech Republic            cz
  Germany                   de
  Djibouti                  dj
  Denmark                   dk
  Dominica                  dm
  Ecuador                   ec
  Estonia                   ee
  Egypt                     eg
  Spain                     es
  EU                        eu
  Ethiopia                  et
  Finland                   fi
  Fiji                      fj
  France                    fr
  Great Britain             gb
  Georgia                   ge
  Ghana                     gh
  Gambia                    gm
  Greece                    gr
  Hong Kong                 hk
  Honduras                  hn
  Croatia                   hr
  Haiti                     ht
  Hungary                   hu
  Indonesia                 id
  Ireland                   ie
  Israel                    il
  India                     in
  Iraq                      iq
  Iran                      ir
  Iceland                   is
  Italy                     it
  Jamaica                   jm
  Jordan                    jo
  Japan                     jp
  Kenya                     ke
  Cambodia                  kh
  South Korea               kr
  Kazakhstan                kz
  Lebanon                   lb
  Liechtenstein             li
  Liberia                   lr
  Lithuania                 lt
  Luxembourg                lu
  Latvia                    lv
  Morocco                   ma
  Monaco                    mc
  Moldova                   md
  Montenegro                me
  Madagascar                mg
  Macedonia                 mk
  Mali                      ml
  Myanmar                   mm
  Mongolia                  mn
  Mauritania                mr
  Mauritius                 mu
  Maldives                  mv
  Mexico                    mx
  Malaysia                  my
  Mozambique                mz
  Nigeria                   ng
  Netherlands               nl
  Norway                    no
  New Zealand               nz
  Oman                      om
  Panama                    pa
  Peru                      pe
  Philippines               ph
  Pakistan                  pk
  Poland                    pl
  Puerto Rico               pr
  Portugal                  pt
  Paraguay                  py
  Qatar                     qa
  Romania                   ro
  Serbia                    rs
  Russia                    ru
  Saudi Arabia              sa
  Seychelles                sc
  Sudan                     sd
  Sweden                    se
  Singapore                 sg
  Slovenia                  si
  Slovakia                  sk
  Senegal                   sn
  South Sudan               ss
  Syria                     sy
  Chad                      td
  Togo                      tg
  Thailand                  th
  Turkmenistan              tm
  Tunisia                   tn
  Turkey                    tr
  Trinidad and Tobago       tt
  Taiwan                    tw
  Ukraine                   ua
  Uganda                    ug
  USA                       us
  Uruguay                   uy
  Uzbekistan                uz
  British Virgin Islands    vg
  Yemen                     ye
  South Africa              za
  Zambia                    zm
  Zimbabwe                  zw

Custom Proxy

Parameter: custom_proxy

You can provide your own proxy for scraping the web with Page2API.


Sample request payload using custom proxy

  
    {
      "api_key": "YOUR_API_KEY",
      "url": "https://api.ipify.org?format=json",
      "custom_proxy": "http://username:[email protected]:1234",
      "parse": {
        "ip": "body >> json"
      }
    }
  

Sample response

  
    {
      "result": {
        "ip": {
          "ip": "192.168.13.14"
        }
      },
      "request": {
        "parse": {
          "ip": "body >> json"
        },
        "url": "https://api.ipify.org?format=json",
        "custom_proxy": "http://username:[email protected]:1234",
        "scenario": [
          {
            "execute": "parse"
          }
        ]
      },
      "id": 190,
      "pages_parsed": 1,
      "cost":0.00025,
      "success": true,
      "duration": 1.55
    }
  
