Introduction

API V1

The Scrapex.ai API (version 1) provides programmatic access to the Scrapex platform. The API is organized around RESTful HTTP endpoints. All requests and responses (including errors) are encoded in JSON format with UTF-8 encoding.

Whether you’re an API pro, a beginning developer or a Scrapex.ai partner, our extensive API is waiting for your imagination. Our API suite allows you to run scripts, access data and much more.

We have language bindings in Shell, Python, and JavaScript! You can view code examples in the dark area to the right, and you can switch the programming language of the examples with the tabs in the top right.

Authentication

Authenticated Requests:

import requests
from requests.auth import HTTPBasicAuth

try:
    r = requests.get('https://api.scrapex.ai/v1/user/self/',
                    auth=HTTPBasicAuth('YOUR_API_KEY', ''))
    r.raise_for_status()
    print(r.json())
except requests.exceptions.HTTPError as err:
    print(err.response.json())

# With curl, you can just pass the correct header with each request
curl "https://api.scrapex.ai/v1/user/self" -u "YOUR_API_KEY:"
import fetch from "node-fetch";

(async () => {
  try {
    let res = await fetch("https://api.scrapex.ai/v1/user/self", {
      method: "GET",
      headers: {
        Authorization: `Basic ${Buffer.from("YOUR_API_KEY" + ":" + "").toString(
          "base64"
        )}`,
      },
    });
    let data = await res.json();
    console.log(data);
  } catch (e) {
    console.log(e);
  }
})();

Make sure to replace YOUR_API_KEY with your API key.

Scrapex uses API keys to authenticate requests. You can view and manage your API keys in the account settings page.

Your API keys carry many privileges, so be sure to keep them secure! Do not share your secret API keys in publicly accessible areas such as GitHub, client-side code, and so forth.

Authentication to the API is performed via HTTP Basic Auth. Provide your API key as the basic auth username value. You do not need to provide a password.

All API requests must be made over HTTPS. Calls made over plain HTTP will be redirected to HTTPS. Unauthenticated API requests will fail.

On Demand Extract

Run an OnDemand Extract

import requests
from requests.auth import HTTPBasicAuth

try:
    r = requests.get('https://api.scrapex.ai/v1/scrapers/<SCRAPER_ID>/extract?url=<URL>',
                    auth=HTTPBasicAuth('YOUR_API_KEY', ''))
    r.raise_for_status()
    print(r.json())
except requests.exceptions.HTTPError as err:
    print(err.response.json())

curl "https://api.scrapex.ai/v1/scrapers/<SCRAPER_ID>/extract?url=<URL>" -u "YOUR_API_KEY:"
import fetch from "node-fetch";

(async () => {
  try {
    let res = await fetch(
      "http://api.scrapex.local/v1/scrapers/<SCRAPER_ID>/extract?url=<URL>",
      {
        method: "GET",
        headers: {
          Authorization: `Basic ${Buffer.from(
            "<YOUR_API_KEY>" + ":" + ""
          ).toString("base64")}`,
        },
      }
    );
    let data = await res.json();
    console.log(data);
  } catch (e) {
    console.log(e);
  }
})();

The above command returns JSON structured like this:

{
  "data": [
    ...
    ...
    ...
  ],
  "errors": [
    ...
    ...
    ...
  ]
}

Given a URL, this endpoint runs an on-demand extract.

HTTP Request

GET https://api.scrapex.ai/v1/scrapers/<SCRAPER_ID>/extract

URL Parameters

Parameter     Description
SCRAPER_ID    The ID of the scraper of interest

QUERY Parameters

Parameter     Description              Required
url           An encoded URL string    YES
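
Because the target page is passed as a query-string value, it must be URL-encoded so that its own query string survives as a single parameter. A minimal sketch in JavaScript (the scraper ID and target URL are placeholders):

// Hypothetical target page; replace with the URL you want to extract.
const targetUrl = "https://www.example.org/products?page=2&sort=price";

// Encode the target URL before appending it as the `url` query parameter.
const endpoint =
  "https://api.scrapex.ai/v1/scrapers/<SCRAPER_ID>/extract?url=" +
  encodeURIComponent(targetUrl);

// Request `endpoint` with Basic auth exactly as in the examples above.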

Collection

Get records in a collection

import requests
from requests.auth import HTTPBasicAuth

try:
    r = requests.get('https://api.scrapex.ai/v1/projects/project-store/<PROJECT_ID>/<COLLECTION_NAME>',
                    auth=HTTPBasicAuth('YOUR_API_KEY', ''))
    r.raise_for_status()
    print(r.json())
except requests.exceptions.HTTPError as err:
    print(err.response.json())

curl "https://api.scrapex.ai/v1/projects/project-store/<PROJECT_ID>/<COLLECTION_NAME>" -u "YOUR_API_KEY:"
import fetch from "node-fetch";

(async () => {
  try {
    let res = await fetch(
      "https://api.scrapex.ai/v1/projects/project-store/<PROJECT_ID>/<COLLECTION_NAME>",
      {
        method: "GET",
        headers: {
          Authorization: `Basic ${Buffer.from(
            "YOUR_API_KEY" + ":" + ""
          ).toString("base64")}`,
        },
      }
    );
    let data = await res.json();
    console.log(data);
  } catch (e) {
    console.log(e);
  }
})();

The above command returns JSON structured like this:

{
  "total_count": 2,
  "count": 2,
  "offset": 0,
  "data": [
    {
      "id": 0,
      "name": "..."
    },
    {
      "id": 1,
      "name": "..."
    }
  ]
}

This endpoint retrieves the records in a collection associated with a project.

HTTP Request

GET https://api.scrapex.ai/v1/projects/project-store/<PROJECT_ID>/<COLLECTION_NAME>

URL Parameters

Parameter          Description
PROJECT_ID         The ID of the project of interest
COLLECTION_NAME    The name of the collection of interest

Scripts

Run a script

import requests
from requests.auth import HTTPBasicAuth

try:
    r = requests.post('https://api.scrapex.ai/v1/scripts/<SCRIPT_ID>/start',
                      json={"params": {}},
                      auth=HTTPBasicAuth('YOUR_API_KEY', ''))
    r.raise_for_status()
    print(r.json())
except requests.exceptions.HTTPError as err:
    print(err.response.json())

curl "http://app.scrapex.local/api/v1/scripts/<SCRIPT_ID>/start" -u "YOUR_API_KEY:" -X POST -H "Content-type: application/json" -d '{"params": {}}'
import fetch from "node-fetch";

(async () => {
  let body = { params: {} };

  try {
    let res = await fetch(
      "http://app.scrapex.local/api/v1/scripts/<SCRIPT_ID>/start",
      {
        method: "POST",
        body: JSON.stringify(body),
        headers: {
          "Content-type": "application/json",
          Authorization: `Basic ${Buffer.from(
            "YOUR_API_KEY" + ":" + ""
          ).toString("base64")}`,
        },
      }
    );
    let data = await res.json();
    console.log(data);
  } catch (e) {
    console.log(e);
  }
})();

The above command returns JSON structured like this:

{
  "params": {},
  "script_id": "<SCRIPT_ID>",
  "run_id": "<RUN_ID>"
}

This endpoint starts a script run.

HTTP Request

POST https://api.scrapex.ai/v1/scripts/<SCRIPT_ID>/start

URL Parameters

Parameter    Description
SCRIPT_ID    The ID of the script to be run

BODY Parameters

Parameter    Description                               Required
params       An object containing the script params    YES
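
The examples above send an empty params object. A sketch with hypothetical parameters (the keys are illustrative; they must match whatever your script reads from params):

let body = {
  params: {
    query: "example",  // hypothetical parameter consumed by the script
    maxPages: 2,       // hypothetical parameter consumed by the script
  },
};
// POST JSON.stringify(body) to /v1/scripts/<SCRIPT_ID>/start as shown above.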

Check status of a script

import requests
from requests.auth import HTTPBasicAuth

try:
    r = requests.get('https://api.scrapex.ai/v1/scripts/<SCRIPT_ID>/runs/<RUN_ID>',
                  auth=HTTPBasicAuth('YOUR_API_KEY', ''))
    r.raise_for_status()
    print(r.json())
except requests.exceptions.HTTPError as err:
    print(err.response.json())

curl "https://api.scrapex.ai/v1/scripts/<SCRIPT_ID>/runs/<RUN_ID>" -u "YOUR_API_KEY:"
import fetch from "node-fetch";

(async () => {
  try {
    let res = await fetch(
      "http://app.scrapex.local/api/v1/scripts/<SCRIPT_ID>/runs/<RUN_ID>",
      {
        method: "GET",
        headers: {
          Authorization: `Basic ${Buffer.from(
            "28a8a1d2d6e4ca31cd0d9b303bf771cf9c1be470" + ":" + ""
          ).toString("base64")}`,
        },
      }
    );
    let data = await res.json();
    console.log(data);
  } catch (e) {
    console.log(e);
  }
})();

The above command returns JSON structured like this:

{
  "id": "<ID>",
  "script_id": "<SCRIPT_ID>",
  "metadata": {
    "content-type": "application/json"
  },
  "response": {},
  "status": 12,
  "console_logs": [],
  "ts_mod": "2022-01-12 10:43:49.924191+05:30",
  "ts": "2022-01-12 10:43:42.112979+05:30",
  "account_id": "6d84f178-20d2-11eb-beda-8752d362e34c",
  "ts_start": "2022-01-12 10:43:42.19+05:30",
  "ts_end": "2022-01-12 10:43:49.923+05:30",
  "type": 0,
  "script_job_id": null
}

This endpoint retrieves a specific run's details.

HTTP Request

GET https://api.scrapex.ai/v1/scripts/<SCRIPT_ID>/runs/<RUN_ID>

URL Parameters

Parameter    Description
SCRIPT_ID    The ID of the script of interest
RUN_ID       The ID of the run of interest
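
The start and run-status endpoints above can be combined to run a script and poll until it completes. A rough sketch, assuming a run is finished once ts_end is set in the response:

import fetch from "node-fetch";

const auth = `Basic ${Buffer.from("YOUR_API_KEY:").toString("base64")}`;
const base = "https://api.scrapex.ai/v1/scripts/<SCRIPT_ID>";

(async () => {
  // Start the script with empty params.
  let res = await fetch(`${base}/start`, {
    method: "POST",
    headers: { Authorization: auth, "Content-type": "application/json" },
    body: JSON.stringify({ params: {} }),
  });
  let { run_id } = await res.json();

  // Poll the run until ts_end is set (assumption: ts_end marks completion).
  let run;
  do {
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5 seconds
    run = await (
      await fetch(`${base}/runs/${run_id}`, { headers: { Authorization: auth } })
    ).json();
  } while (!run.ts_end);

  console.log(run.status, run.response);
})();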

Javascript API

Overview

Scripts are executable crawlers that can perform pre-configured automation along with scraping data from websites. Scripts can be accessed from the Scripts table on the project details page, which is reached from the Projects page.

User script structure

let {newPage, end, except, extract, extractAndSave, store, runStore, waitFor} = __sandbox;
let {params, } = OPTIONS;
(async () => { try { //---> prefix
    // -- START --

    const page = await newPage()

    // CUSTOM LOGIC ON page ---> user script

    // -- END --
    end()
} catch(e) { except(e) } })(); // ---> suffix

The user script can be edited on the script editor page. Only the logic part of the script is editable; the prefix and suffix are read-only.

Sample User Script

let {newPage, end, except, extract, extractAndSave, store, runStore, waitFor} = __sandbox;
let {params, } = OPTIONS;
(async () => { try {
    // -- START --

    const page = await newPage();   //creates new page
    await page.goto('https://www.example.org'); //opens example.org
    await store.saveOne('store', {id: 1, msg: 'data'})  //saves to store
    console.log(await store.getOne('store', 1)) //fetches from store
    if (await page.exists('a')) {   //check if anchor exists
        await page.click('a');  //click anchor
        await waitFor(2000);    //wait 2 seconds
        await page.saveSnapshot('clicked the anchor');  //save a snapshot of page

    }
    await page.close(); //close page

    // -- END --
    end()
} catch(e) { except(e) } })();

The script above performs a few representative actions: it opens a page, saves and reads a record from the store, clicks an anchor if one exists, saves a snapshot, and closes the page.

NOTE On script errors, snapshots of all valid open pages are saved. If none are saved, the pages most likely never had any context in the first place.

Script Objects

General functions

end()

Terminates script.

extractAndSave(scraper, url[, idFn])

Extracts the scrapex scraper's data from the given url and saves it to the project-level store under the name of the scraper.
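
A minimal usage sketch inside a user script, assuming the scraper is referenced by its name and the URL points to a page that scraper can handle (both values are placeholders):

  // Extract with an existing scrapex scraper and save the result to the
  // project-level store under the scraper's name.
  await extractAndSave('my-product-scraper', 'https://www.example.org/product/1');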

newPage()

Spawns a new page and returns a promise that resolves to the page object. The page is a Chromium tab instance.

waitFor(timeout)

Returns a promise that resolves after timeout milliseconds have passed. (Not to be confused with page.waitFor.)

class: Page

  const page = await newPage();
  await page.goto('https://example.com');
  await page.saveSnapshot('example-page');

A page instance can be spawned by calling await newPage(). Page provides methods to interact with a single tab in Chromium. You can spawn a maximum of two pages in parallel in one user script.

The example above creates a page, navigates it to a URL, and then saves a snapshot.

page.click(selector[, options])

const [response] = await Promise.all([
  page.waitForNavigation(waitOptions),
  page.click(selector, clickOptions),
]);

Click an element on the page specified by its CSS selector.

This method fetches an element with selector, scrolls it into view if needed, and then uses page.mouse to click in the center of the element. If there's no element matching selector, the method throws an error.

Bear in mind that if click() triggers a navigation event and there's a separate page.waitForNavigation() promise to be resolved, you may end up with a race condition that yields unexpected results. The correct pattern for click-and-wait-for-navigation is the Promise.all pattern shown above.

NOTE This race condition is handled by the page.clickAndWait API.

page.clickAndWait(selector[, options])

Click an element on the page specified by its CSS selector and wait for the resulting navigation to finish.

This method fetches an element with selector, scrolls it into view if needed, and then uses page.mouse to click in the center of the element. If there's no element matching selector, the method throws an error.
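
For example, a one-line sketch (the selector is illustrative):

  // Click a link and wait for the resulting navigation in a single call.
  await page.clickAndWait('a.next-page');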

page.clickTag(scraper, tag[, options])

const [response] = await Promise.all([
  page.waitForNavigation(waitOptions),
  page.clickTag(scraper, tag, clickOptions),
]);

Click an element on the page specified by a scrapex scraper tag.

This method fetches an element with selector, scrolls it into view if needed, and then uses page.mouse to click in the center of the element. If there's no element matching selector, the method throws an error.

Bear in mind that if clickTag() triggers a navigation event and there's a separate page.waitForNavigation() promise to be resolved, you may end up with a race condition that yields unexpected results. The correct pattern is the Promise.all pattern shown above.

NOTE This race condition is handled by the page.clickTagAndWait API.

page.clickTagAndWait(scraper, tag[, options])

Click an element on the page specified by a scrapex scraper tag and wait for the resulting navigation to finish.

This method fetches an element with selector, scrolls it into view if needed, and then uses page.mouse to click in the center of the element. If there's no element matching selector, the method throws an error.

page.close([options])

Closes the specified tab.

By default, page.close() does not run beforeunload handlers.

NOTE If runBeforeUnload is passed as true, a beforeunload dialog might be summoned.

page.exists(selector)

Checks if the specified selector exists on the page.

page.extract(scraper)

Extracts the content of the page using the given scrapex scraper.

page.goto(url[, options])

Navigates the page to the specified URL. options may include wait and other navigation options.

page.goto will throw an error if the navigation does not succeed (for example, the URL is invalid or the navigation times out).

page.goto will not throw an error when any valid HTTP status code is returned by the remote server, including 404 "Not Found" and 500 "Internal Server Error". The status code for such responses can be retrieved by calling response.status().

NOTE page.goto either throws an error or returns a main resource response. The only exceptions are navigation to about:blank or navigation to the same URL with a different hash, which would succeed and return null.
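
Because page.goto resolves with the main resource response, the HTTP status can be inspected directly. A small sketch:

  const response = await page.goto('https://www.example.org/missing');
  console.log(response.status()); // e.g. 200, 404 or 500; goto does not throw for these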

page.keyboard

The virtual keyboard object of the page (see class: Keyboard below).

page.mouse

The mouse object of the page (see class: Mouse below).

page.reload([options])

Reload the page. Options are waitOptions.

page.saveSnapshot(name)

Saves a snapshot of the current page under the given name.

page.scrape(scraper)

Extracts the content of the page.

NOTE This API is deprecated. Please check out the extract API.

page.scrapeSelector(selector[, options])

Extracts content from the page using a bare-bones scraper built from the given selector.

page.select(selector, ...values)

page.select('select#colors', 'blue'); // single selection
page.select('select#colors', 'red', 'green', 'blue'); // multiple selections

Selects one or more options of a <select> element on the page.

Triggers a change and input event once all the provided options have been selected. If there's no <select> element matching selector, the method throws an error.

page.tagExists(scraper, tag)

Checks if the specified scraper tag exists on the page.

page.waitFor(selectorOrFunctionOrTimeout[, options[, ...args]])

Explicit time wait on the page.

This method is deprecated. Prefer the more explicit APIs, such as page.waitForSelector or the top-level waitFor(timeout).

NOTE This method behaves differently with respect to the type of the first parameter.

page.waitForNavigation([options])

const [response] = await Promise.all([
  page.waitForNavigation(), // The promise resolves after navigation has finished
  page.click('a.my-link'), // Clicking the link will indirectly cause a navigation
]);

Wait for page Navigation to finish.

This resolves when the page navigates to a new URL or reloads. It is useful when you run code that will indirectly cause the page to navigate, as in the example above.

page.waitForSelector(selector[, options])

Wait for selector to be available on the page.

Wait for the selector to appear in page. If at the moment of calling the method the selector already exists, the method will return immediately. If the selector doesn't appear after the timeout milliseconds of waiting, the function will throw.

NOTE Usage of the History API to change the URL is considered a navigation.
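
A minimal usage sketch (the selector is illustrative):

  // Wait until the results container has rendered before interacting with it.
  await page.waitForSelector('div.results');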

page.waitForTag(scraper, tag[, options])

Wait for selector to be available on the page, defined by a scrapex scraper.

Wait for the selector to appear in page. If at the moment of calling the method the selector already exists, the method will return immediately. If the selector doesn't appear after the timeout milliseconds of waiting, the function will throw.

NOTE Usage of the History API to change the URL is considered a navigation.

class: Keyboard

await page.keyboard.type('Hello World!');
await page.keyboard.press('ArrowLeft');

await page.keyboard.down('Shift');
for (let i = 0; i < ' World'.length; i++)
  await page.keyboard.press('ArrowLeft');
await page.keyboard.up('Shift');

await page.keyboard.press('Backspace');
// Result text will end up saying 'Hello!'


await page.keyboard.down('Shift');
await page.keyboard.press('KeyA');
await page.keyboard.up('Shift');
// An example of pressing `A`

Keyboard provides an API for managing a virtual keyboard. The high-level API is keyboard.type, which takes raw characters and generates proper keydown, keypress/input, and keyup events on your page.

For finer control, you can use keyboard.down, keyboard.up, and keyboard.sendCharacter to manually fire events as if they were generated from a real keyboard.

The first example above holds down Shift in order to select and delete some text.

NOTE On macOS, keyboard shortcuts like ⌘ A -> Select All do not work. See #1313

keyboard.down(key[, options])

Dispatches a keydown event.

If key is a single character and no modifier keys besides Shift are being held down, a keypress/input event will also be generated. The text option can be specified to force an input event to be generated.

If key is a modifier key, Shift, Meta, Control, or Alt, subsequent key presses will be sent with that modifier active. To release the modifier key, use keyboard.up.

After the key is pressed once, subsequent calls to keyboard.down will have repeat set to true. To release the key, use keyboard.up.

NOTE Modifier keys DO influence keyboard.down. Holding down Shift will type the text in upper case.

keyboard.press(key[, options])

Shortcut for keyboard.down and keyboard.up.

If key is a single character and no modifier keys besides Shift are being held down, a keypress/input event will also be generated. The text option can be specified to force an input event to be generated.

NOTE Modifier keys DO affect keyboard.press. Holding down Shift will type the text in upper case.

keyboard.sendCharacter(char)

page.keyboard.sendCharacter('嗨');

Dispatches a keypress and input event. This does not send a keydown or keyup event.

NOTE Modifier keys DO NOT affect keyboard.sendCharacter. Holding down Shift will not type the text in upper case.

keyboard.type(text[, options])

await page.keyboard.type('Hello'); // Types instantly
await page.keyboard.type('World', { delay: 100 }); // Types slower, like a user

Sends a keydown, keypress/input, and keyup event for each character in the text.

To press a special key, like Control or ArrowDown, use keyboard.press.

NOTE Modifier keys DO NOT affect keyboard.type. Holding down Shift will not type the text in upper case.

keyboard.up(key)

Dispatches a keyup event.

class: Mouse

// Using ‘page.mouse’ to trace a 100x100 square.
await page.mouse.move(0, 0);
await page.mouse.down();
await page.mouse.move(0, 100);
await page.mouse.move(100, 100);
await page.mouse.move(100, 0);
await page.mouse.move(0, 0);
await page.mouse.up();

The Mouse class operates in main-frame CSS pixels relative to the top-left corner of the viewport. Every page object has its own Mouse, accessible with page.mouse. Note that the mouse events trigger synthetic MouseEvents. This means that it does not fully replicate the functionality of what a normal user would be able to do with their mouse.

mouse.click(x, y[, options])

Shortcut for mouse.move, mouse.down and mouse.up.

mouse.down([options])

Dispatches a mousedown event.

mouse.move(x, y[, options])

Dispatches a mousemove event.

mouse.up([options])

Dispatches a mouseup event.

mouse.wheel([options])

await page.goto(
  'https://mdn.mozillademos.org/en-US/docs/Web/API/Element/wheel_event$samples/Scaling_an_element_via_the_wheel?revision=1587366'
);
await page.mouse.wheel({ deltaY: -100 });

Dispatches a mousewheel event.

class: Store

  await store.saveOne('random-store', {id: 'some', data: 'other'});
  const data = await store.getOne('random-store', 'some');

Scrapex stores data in collections and serves it to the user through the script or the API as required. Store is an outer-level store that shares its contents with other scripts in the same project. Use this store when multiple scripts extract data from different websites but the data should be accessible from one store for uniformity.

This is the default store; the snippet above shows a sample interaction.

store.getOne(collection, id)

Retrieves a single record from the store.

store.getAll(collection [, options])

  await store.getAll('store', {
    limit: 100,
    offset: 500,
    only:  ['id']
  })

Retrieves multiple records from the store.

store.getIds(collection)

  await store.getIds('store');

Retrieves all ids from the store as a list.

store.saveOne(collection, data[, metadata, idFn])

  await store.saveOne('store', {data: "val"}, {metadata: 'json'}, () => {return 1;})

Saves one data item into the store.

store.saveMany(collection, records)

Saves several data items into the store.
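
A short sketch, following the same record shape used by saveOne (the collection name and records are illustrative):

  await store.saveMany('store', [
    {id: 1, msg: 'first'},
    {id: 2, msg: 'second'}
  ]);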

class: runStore

  await runStore.saveOne('random-store', {id: 'some', data: 'other'});
  await runStore.saveMany('random-store', [{id: 'some2', data: 'other'}]);

Scrapex stores data in collections and serves it to the user through the script or the API as required. runStore is an inner-level store intended for run-specific data. Although it can store any data, it is best used for debugging, because the user script has no access to the stored data other than fetching it manually through the UI; it lacks the fetch APIs.

The run store lacks the get APIs but retains both save APIs that the other stores have. The snippet above shows a sample interaction.

runStore.saveOne(collection, data[, metadata, idFn])

  await runStore.saveOne('store', {data: "val"}, {metadata: 'json'}, () => {return 1;})

Saves one data item into the run store.

runStore.saveMany(collection, records)

Saves several data items into the run store.

Tutorials