Scraping Hacker News using scrapex.ai

Goal:

Scrape all the posts from the first 10 pages of https://news.ycombinator.com/ according to the following specification.

Tags

Tag         DataType  Extractor
title       Text      Text
points      Text      Text
comments    Integer   Text
author_url  Text      Prop

Sample Data

title                                     author_url                                        comments  points
About Offline First                       https://news.ycombinator.com/user?id=thunderbong  84        131 points
Atari ST in daily use since 1985 [video]  https://news.ycombinator.com/user?id=pmarin       37        98 points
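
The specification above can be sketched as a small Python record type. This is only an illustration of the target data shape, not something scrapex.ai generates: the Post class and the points_as_int helper are hypothetical names. Since points is declared as Text, it is kept as a string such as "131 points" and converted only when a number is needed.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """One scraped post, mirroring the Tags table (hypothetical type)."""
    title: str
    author_url: str
    comments: int   # declared as Integer in the Tags table
    points: str     # declared as Text, e.g. "131 points"


def points_as_int(points_text: str) -> int:
    """Take the leading number from text such as '131 points'."""
    return int(points_text.split()[0])


# The two rows from the Sample Data table.
sample = [
    Post("About Offline First",
         "https://news.ycombinator.com/user?id=thunderbong", 84, "131 points"),
    Post("Atari ST in daily use since 1985 [video]",
         "https://news.ycombinator.com/user?id=pmarin", 37, "98 points"),
]

for post in sample:
    print(post.title, points_as_int(post.points), post.comments)
```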

Tutorial:

Step 1: Create a new project

Creating a project

  1. Head over to https://app.scrapex.ai/ and click on the New Project button.
  2. Enter a project name and enter https://news.ycombinator.com/ as the URL.
  3. Leave the proxy settings as they are and click on the Create button.

Step 2: Configure the scraper

  1. Click on the scraper titled Default Scraper to open up the Configurator

Creating a project

  2. The Configurator opens with the Hacker News website loaded in the cloud browser. The panel on the left contains 5 tools: Single Select, Multi Select, Group, Builtin and Meta.

Configuring a Scraper

  3. Since we want to obtain the data in grouped form, Group is the right tool for this job.

    • Click on Group (the 3rd tool in the left toolbar).
    • In the popup that opens, enter “posts” as the group name.
    • Click the confirm button to create the group.

Configuring a Scraper
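
To see why grouping matters, here is a minimal Python sketch with made-up data. It assumes that independent selections would each yield a flat list, while a group yields one record per post. That is an assumption about the general idea, not a description of scrapex.ai's actual output format.

```python
# Hypothetical result of three independent selections on the same page.
titles = ["Post A", "Job ad B", "Post C"]
points = ["131 points", "98 points"]   # the job ad shows no points
comments = [84, 37]                    # and no comment count either

# Zipping the flat lists pairs "Job ad B" with Post C's values
# and silently drops the third title.
misaligned = list(zip(titles, points, comments))

# Hypothetical result of a group: one record per post, with missing
# fields left empty instead of shifting the remaining values.
posts = [
    {"title": "Post A", "points": "131 points", "comments": 84},
    {"title": "Job ad B", "points": None, "comments": None},
    {"title": "Post C", "points": "98 points", "comments": 37},
]

print(misaligned)
print(posts)
```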

  4. A temporary Untitled Multi Select tag is created and displayed in the right panel. As shown in the figure above, click on the post title.

Using DOM Nested

  5. As soon as you click on the post title, all similar titles are selected automatically by our prediction algorithm.

Using DOM Nested

  6. Let us now rename the tag and give it a descriptive name.
    • Click on the icon to the right of the tag name.
    • Click on the edit option.
    • Enter the tag name title and press Enter.

Making a selection

Renaming a tag

  7. Let us now create another tag to extract the points associated with each post.

    • Click on Group again.
    • In the popup, select posts from the dropdown in the Select Existing Collection section.
    • This automatically creates a new temporary tag under the posts group.

Using DOM Nested

  8. Rename the tag to points. Click on one of the points values in the cloud browser, as shown in the figure below.

Using DOM Nested

  9. Sometimes the prediction algorithm needs more input before it can select all similar elements. Click once more on another points value, as illustrated in the figure below. After this second selection, all other similar elements are selected automatically.

Using DOM Nested

Using DOM Nested

  10. Next, let us extract the number of comments associated with each post. Repeat step 7 (click on Group and select posts under Select Existing Collection) to create a tag under the posts group. Now click on the number of comments, as illustrated in the figure above.

Improving selector prediction

  11. As you may have noticed, extraneous elements were selected along with the comments. This is because the algorithm tries to predict the most general CSS selector from the inputs provided by the user. Let us provide more inputs to refine the prediction.

    • Click on a 2nd comment count to give the prediction algorithm another selection.

    Improving selector prediction

    • Click X on the most irrelevant element. In the above example, we have chosen to click X on the submit button. Immediately, the number of highlighted elements decreases.

    Improving selector prediction

    • Once more, as the above screenshot illustrates, click X to remove the author element, as it is irrelevant.

Improving selector prediction

    • The prediction has improved considerably. Make one more click on the remaining extraneous element, as illustrated above.

Configurator Preview

    • This is how we can iteratively refine the prediction. Rename this tag to comments.
  12. Click on the preview button in the bottom-left corner. A popup opens with the data preview for the posts collection.

Improving selector prediction

Improving selector prediction

Improving selector prediction
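
The refinement loop used for the comments tag (extra clicks as positive examples, X clicks as negative examples) can be illustrated with a toy Python model. This is not scrapex.ai's prediction algorithm, which the tutorial does not describe. It only shows the idea of choosing the most general selector that still matches every clicked element and none of the rejected ones. The Element type, the predict function and the class name c-link are all made up for this example and do not reflect the real Hacker News markup.

```python
from typing import NamedTuple


class Element(NamedTuple):
    tag: str
    classes: frozenset
    label: str


def candidates(el):
    """Selectors for one element, most general first: tag, then tag.class."""
    yield (el.tag, None)
    for cls in sorted(el.classes):
        yield (el.tag, cls)


def matches(selector, el):
    tag, cls = selector
    return el.tag == tag and (cls is None or cls in el.classes)


def predict(positives, negatives):
    """Most general selector matching every positive and no negative."""
    for sel in candidates(positives[0]):
        if all(matches(sel, p) for p in positives) and not any(
            matches(sel, n) for n in negatives
        ):
            return sel
    return None


# A made-up page: two comment links, an author link and a submit link.
page = [
    Element("a", frozenset({"c-link"}), "84 comments"),
    Element("a", frozenset({"c-link"}), "37 comments"),
    Element("a", frozenset({"user"}), "thunderbong"),
    Element("a", frozenset(), "submit"),
]

first_guess = predict([page[0]], [])          # one click: every link matches
refined = predict(page[:2], [page[3]])        # second click plus X on submit

print(first_guess, [e.label for e in page if matches(first_guess, e)])
print(refined, [e.label for e in page if matches(refined, e)])
```

In this toy model a single click yields a selector that matches every link on the page, and rejecting just one extraneous element is enough to narrow it to the comment links.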

  13. The last task is to extract the author’s profile URL. Repeat step 7 (click on Group and select posts under Select Existing Collection) to create an untitled tag. Rename this tag to author_url. Click on the author name, as illustrated in the image below.

  14. All authors are selected automatically. However, the information we are interested in is the href property of the author link rather than its text.

  15. This can be accomplished by changing the extractor type to Prop and setting the attribute to href:

    • Click on the options button on the black bar in the right panel.
    • Click on the Extractor dropdown and select Prop.
    • Enter href in the Attributes input box.
    • Click confirm.
    • Click on the preview button to view the extracted data.
    • Click on the submit button in the bottom-right corner to save the configuration.

Improving selector prediction
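
The difference between the Text and Prop extractors for the author tag can be sketched in plain Python using only the standard library. The HTML snippet is a simplified, assumed version of an author link, and AnchorParser is a hypothetical helper. A browser exposes an anchor's href property as an absolute URL, which is consistent with the absolute author_url values in the sample data, so the sketch resolves the attribute against the page URL. Whether scrapex.ai does exactly this internally is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = "https://news.ycombinator.com/"
# Simplified, assumed markup for one author link.
SNIPPET = '<a href="user?id=thunderbong" class="hnuser">thunderbong</a>'


class AnchorParser(HTMLParser):
    """Collects the text and the href attribute of the first <a> seen."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text = ""
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


parser = AnchorParser()
parser.feed(SNIPPET)

text_value = parser.text                     # roughly what a Text extractor sees
prop_value = urljoin(PAGE_URL, parser.href)  # href resolved against the page URL

print(text_value)
print(prop_value)
```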

  16. Click on Submit to submit the URL.

Step 4: Extracting data from more URLs

  1. On clicking submit, we are redirected to the scrapers dashboard page. The actual scraping and extraction happen in the background, and the extracted data is displayed as shown in the image below.

Improving selector prediction

  2. Click on the Add URL button to add more URLs.

Improving selector prediction

  3. Let us now see how we can add more URLs and extract data from them.

Improving selector prediction

  4. When URLs are added, their content is downloaded in the background and the details are extracted.
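
To reach the goal of the first 10 pages, the URLs to add can be generated with a short Python sketch. It assumes that Hacker News paginates its front page as news?p=N. That pattern is not stated in this tutorial, so verify it in a browser before adding the URLs. The page_urls function is a hypothetical helper.

```python
BASE = "https://news.ycombinator.com/"


def page_urls(pages=10):
    """URLs for the first `pages` listing pages.

    Page 1 is the front page itself; later pages are assumed to use ?p=N.
    """
    return [BASE if n == 1 else f"{BASE}news?p={n}" for n in range(1, pages + 1)]


urls = page_urls()
for url in urls:
    print(url)
```

The first URL was already supplied when the project was created, so only the remaining nine need to be added through Add URL.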