Scraping Hacker News using scrapex.ai
This walkthrough scrapes all the posts from the first 10 pages of https://news.ycombinator.com/, extracting the title, points, number of comments and author profile URL of each post.
Step 1: Create a new project
- Head over to https://app.scrapex.ai/ and click on the New Project button
- Enter a project name and enter https://news.ycombinator.com/ as the URL
- Leave the proxy settings unchanged and click on the Create button.
Step 2: Configure the scraper
- Click on the scraper titled Default Scraper to open up the Configurator
- The Configurator opens with the Hacker News website loaded in the cloud browser. The panel on the left consists of 5 tools, namely Single Select, Multi Select, Group, Builtin and Meta.
As we are looking to obtain data in a grouped manner, Group is the right tool for this job.
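To see why grouping matters here, the following minimal Python sketch contrasts collecting each field into its own flat list with collecting the fields per post. The sample values and field names are made up for illustration, and the sketch only models the general idea; how scrapex.ai stores grouped data internally is not stated in this guide.

```python
# Hypothetical sample data: the second post (a job ad) has no points.
rows = [
    {"title": "Post A", "points": "120 points"},
    {"title": "Job ad B"},
    {"title": "Post C", "points": "8 points"},
]

# Ungrouped: each field is collected into its own flat list.
titles = [row["title"] for row in rows]
points = [row["points"] for row in rows if "points" in row]

# Pairing the flat lists by position silently misaligns the data:
# "Job ad B" ends up with the points that belong to "Post C".
misaligned = list(zip(titles, points))

# Grouped: each field is looked up inside its own post,
# so a missing value stays local to that post.
grouped = [{"title": row["title"], "points": row.get("points")} for row in rows]

print(misaligned)
print(grouped)
```

With a group, the title and points of a post stay together in one record even when some posts lack a field.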
- Click on the Group tool (the 3rd tool in the left toolbar)
- In the popup that opens up, type in the group name as “posts”
- Click the confirm button to create a group.
- A temporary Untitled Multi Select tag is created and displayed on the right panel. As shown in the above figure, click on the post title.
- As soon as you click on the post title, all similar titles are also selected intelligently using our prediction algorithm.
- Let us now rename the tag and provide it a descriptive name.
- Click on the icon situated on the right side of the tag name
- Click on the edit option.
- Enter title as the tag name and press Enter
Let us now create another tag to extract out the points associated with each post.
- Click on the Group tool again
- In the popup, select posts from the dropdown in the Select Existing Collection section.
- This automatically creates a new temporary tag under the posts group.
- Rename the tag as points. Click on one of the points in the cloud browser as shown in the below figure.
- Sometimes the prediction algorithm to select all similar elements requires more input. Please click once more on one of the points as illustrated in the figure below. After providing one more selection as input, we can see that all other similar elements are selected automatically.
- As a next task, let us extract the number of comments associated with each post. Create another tag under the posts group as before (click on Group and select posts under Select Existing Collection). Now click on the number of comments as illustrated in the above figure.
As you may have noticed, extraneous elements were selected along with the comment counts. This is because the algorithm tries to predict the most general CSS selector that fits the inputs given by the user. Let us provide more inputs to refine the prediction.
- Click on the 2nd comment number to provide one more selection to the prediction algorithm.
- Click X on the most irrelevant element. In the above example, we have chosen to click X on the submit button. Immediately, the number of highlighted elements decreases.
- Once more, as the above screenshot illustrates, click X to remove the author element, as it is irrelevant.
- The prediction has improved considerably. Click once more on the remaining extraneous element, as illustrated above.
- This is how we can iteratively refine our prediction. Rename this tag as comments.
- Click on the preview button on the bottom left corner. A popup opens up with the data preview for the posts collection.
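The refinement of the comments tag can be pictured as narrowing a CSS selector. The sketch below uses only the Python standard library and a simplified, hypothetical HTML fragment loosely modelled on a Hacker News post row; it is not the prediction algorithm used by scrapex.ai, only an illustration of why a general rule picks up extra elements and a narrower one does not.

```python
from html.parser import HTMLParser

# Simplified, hypothetical markup for two posts (not the real page source).
HTML = """
<td class="subtext">
  <span class="score">120 points</span>
  by <a href="user?id=alice">alice</a>
  <a href="hide?id=1">hide</a>
  <a href="item?id=1">34 comments</a>
</td>
<td class="subtext">
  <span class="score">8 points</span>
  by <a href="user?id=bob">bob</a>
  <a href="hide?id=2">hide</a>
  <a href="item?id=2">5 comments</a>
</td>
"""


class LinkCollector(HTMLParser):
    """Collect an (href, text) pair for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(HTML)
collector.close()

# General rule, comparable to the CSS selector "a": every link matches.
general = [text for _, text in collector.links]

# Narrower rule, comparable to 'a[href^="item?id="]': only comment links match.
refined = [text for href, text in collector.links if href.startswith("item?id=")]

print(general)
print(refined)
```

Each click on X in the Configurator plays the role of a negative example that pushes the prediction from the general rule towards the narrower one.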
The last task is to extract the author's profile URL. Create another untitled tag under the posts group as before and rename it to author_url. Click on the author name as illustrated in the below image.
All authors are selected automatically. But the information we are interested in is the href property of the author tag rather than the text itself.
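The difference between an element's text and its href property can be seen in this small Python sketch. The author link shown is a simplified, hypothetical snippet, and the use of urljoin assumes the href is relative; this guide does not say whether scrapex.ai returns relative or absolute URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical author link, simplified for illustration.
SNIPPET = '<a href="user?id=alice" class="hnuser">alice</a>'


class AnchorReader(HTMLParser):
    """Record the href attribute and the visible text of an <a> element."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        self.text += data


reader = AnchorReader()
reader.feed(SNIPPET)
reader.close()

# Text extraction yields the author name; attribute extraction yields the link.
author_name = reader.text
author_href = reader.href

# Resolve the relative link against the page URL to get a full profile URL.
profile_url = urljoin("https://news.ycombinator.com/", author_href)

print(author_name)
print(profile_url)
```

In the Configurator the same choice is made through the extractor settings rather than in code.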
To extract the href, change the extractor type to Prop and set the attribute to href:
- Click on the options button on the black bar on the right panel.
- Click on the Extractor dropdown and select Prop
- Enter href in the Attributes Input box.
- Click confirm.
- Click on the preview button to view the extracted data.
- Click on the submit button on the bottom right corner to save the configuration
- Click on Submit to submit the URL.
- On clicking Submit, we are redirected to the scrapers dashboard page. The actual scraping and extraction happen in the background, and the extracted data is displayed as shown in the below image.
- Let us now see how to add more URLs and extract data from them.
- Click on the Add URL button to add more URLs.
- Once the URLs are added, their content is downloaded in the background and the details are extracted.
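To cover the first 10 pages, the URLs to add can be generated rather than typed by hand. The sketch below assumes that Hacker News paginates its front page through the p query parameter (for example ?p=2 for the second page); the function name page_urls is hypothetical, and the resulting list would still be entered through the Add URL button.

```python
from urllib.parse import urlencode

BASE_URL = "https://news.ycombinator.com/"


def page_urls(first=1, last=10):
    """Build the listing URLs for pages first..last (inclusive)."""
    urls = []
    for page in range(first, last + 1):
        query = urlencode({"p": page})  # assumed pagination parameter
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = page_urls()
for url in urls:
    print(url)
```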