Web scraping for fun in NodeJS

Introduction

Web scraping is a data extraction technique performed by a crawler.

Also called a scraper or spider, a crawler is a piece of software that browses web pages and extracts their contents for various purposes, such as data analysis or indexing fresh content for search.

There are many reasons one may need to collect data exposed on the web: a search engine like Jobrapido uses crawlers to index job adverts for its users, keeping them up to date with all the job platforms. Someone else may want to keep a local collection of cooking recipes, hotel prices for the next holiday, and so on.

The majority of scraping frameworks are available as open-source projects. They are fully configurable and have a lot of features, so you just need to learn how they work.

However, some of them are quite cumbersome: for example, Scrapy (an open-source Python project) has plenty of features but is too bulky for a small home project. I would not want to spend too much time writing a crawler for myself, yet Scrapy takes a while to understand, configure, and tune.

While looking for another solution, we found that the NodeJS runtime, which is very good at managing asynchronous requests, lets us write a decent and simple web scraping engine in very little time.

In this article, we are going to explain how.

The theory behind web scraping

All web scrapers work more or less in the same way: they visit a web page, and from that page they can extract two kinds of information:

  • data to be stored, usually in the form of text
  • web URLs, which can be used to repeat the same process on another page

You just need to choose a starting web page. It must be a page where you can find links that allow you to navigate the website and reach the information that you need. Once you find this path, you have to tell the crawler engine what to do: which of these links to follow, and what to do with them.

The process is iterative: you may need more than one web request to reach the data you need, by following links through the various pages of the website. The crawler engine will take care of providing a way to repeat the process after every call.

Since a page contains multiple links, the number of pending requests usually grows every time you process a page. A good crawling engine should let you perform these requests as concurrently as possible, to reduce the execution time of the entire crawling process.

Finally, when all the URLs have been requested, you will end up with a lot of content, and you can do whatever you want with it: write it to a database, send it to an API, and so on.
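To make the idea concrete, here is a minimal sketch of that loop in TypeScript. The fetchAndParse function is a hypothetical helper (not part of any library we use later): it stands for "download one page and return the data and the links found on it".

type PageResult = { data: string[]; urls: string[] };

// Conceptual crawl loop: visit every URL in the queue, store the extracted
// data, and push the newly discovered links so they get visited as well.
async function crawl(
  startUrl: string,
  fetchAndParse: (url: string) => Promise<PageResult>
): Promise<string[]> {
  const toVisit: string[] = [startUrl];
  const collected: string[] = [];
  while (toVisit.length > 0) {
    const url = toVisit.pop()!;
    const { data, urls } = await fetchAndParse(url);
    collected.push(...data); // data to be stored
    toVisit.push(...urls);   // links to follow in the next iterations
  }
  return collected;
}

A real engine would process several URLs from the queue concurrently instead of one at a time, but the overall shape of the loop stays the same.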

Our approach

Let’s start with an example: we will try to get some information from a site composed of a homepage with links to a second page, which in turn links to a detail page containing the data we want to retrieve.

To begin with, we are going to use the NodeJS request library to fetch the HTML from the web.

It has been deprecated, but its callback-based API is useful to understand how the crawler engine works. Once the functionality is explained, we will replace it in the final version with node-fetch, which is actively maintained.

The response contains lots of HTML links, but we need to follow just some of them.

To select the links we want, we will use Cheerio (https://cheerio.js.org/), a library that can parse HTML into a DOM-like tree, filter it with jQuery-style selectors, and perform operations on each element found.

We don’t need to know how Cheerio works in detail; we just need to know that, in its simplest form, we specify the name of the tag we want to select, using a very simple syntax. In this case, we will select all links contained in a div with a particular id.
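As a quick, standalone illustration (the HTML snippet and its id are made up for this example), this is what that selection looks like:

import cheerio = require("cheerio");

const html = `
  <div id="links">
    <a href="/page/1">First page</a>
    <a href="/page/2">Second page</a>
  </div>`;

// Select every <a> tag inside the div with id "links".
cheerio
  .load(html)("div[id='links'] a")
  .each((index, element) => {
    console.log(element.attribs.href);    // "/page/1", "/page/2"
    console.log(element.firstChild.data); // "First page", "Second page"
  });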

As a first step, we will make a call to the homepage of our site and log to the console the list of HTML links we need:

import request = require("request");
import cheerio = require("cheerio");

// Fetch the homepage and list the links contained in the div with id "links".
request("http://www.crawlmeplease.com/", (error, response, body) => {
  if (error) {
    console.error(`error: ${error}`);
    return;
  }
  cheerio
    .load(body)("div[id='links'] a")
    .each((index, element) => {
      console.log(`link href: ${element.attribs.href}`);
      console.log(`link text: ${element.firstChild.data}`);
    });
});

For every URL we obtain, we have to make another call to parse the page it points to, and retrieve the final content.

Now we will do two things: refactor the code to extract a callback function, and create a similar callback function to handle the response of the subsequent page.

Then we will do the same thing for one more page, but this time we will just print the text of the links we find:

import request = require("request");
import cheerio = require("cheerio");

request("http://www.crawlmeplease.com/", parseFirstPage);

// Parse the homepage: follow every link in the div with id "links".
function parseFirstPage(error, response, body) {
  if (error) {
    console.error(`error: ${error}`);
    return;
  }
  cheerio
    .load(body)("div[id='links'] a")
    .each((index, element) => {
      request(element.attribs.href, parseSecondPage);
    });
}

// Parse the second page: follow every link in the div with id "otherlinks".
function parseSecondPage(error, response, body) {
  if (error) {
    console.error(`error: ${error}`);
    return;
  }
  cheerio
    .load(body)("div[id='otherlinks'] a")
    .each((index, element) => {
      request(element.attribs.href, parseDetailPage);
    });
}

// Parse the detail page: print the text of the links we find.
function parseDetailPage(error, response, body) {
  if (error) {
    console.error(`error: ${error}`);
    return;
  }
  cheerio
    .load(body)("div[id='otherlinks'] a")
    .each((index, element) => {
      console.log(`link text: ${element.firstChild.data}`);
    });
}

As you can see, we could go on and on, calling all the links in the page, until we find what we need.

Every time a page is parsed, a certain number of URLs are added and are expected to be called, but we do not have to store them ourselves: each URL lives inside the callback of a pending request, and the NodeJS event loop keeps track of them.

Wrapping it up in a library

Finally, we will take advantage of the similar layout of the parsing functions to create a little crawler engine that can manage all the calls for us.

The purpose of this refactoring is to keep the common code in a library that parses the page and iterates over the links. The client part is left only with the logic to locate the proper links and to use the information retrieved.

As an additional goal, only the engine will have a dependency on Cheerio, which is good as it is an implementation detail.

The result of the refactoring will be the following for the engine part:

(We also replaced the request library, as mentioned above.)

import cheerio = require("cheerio");
import fetch from "node-fetch";

// Fetch a URL, select the elements matching the given selector,
// and invoke the callback with the text and href of each link found.
export function callAndParse(url, path, callback) {
  fetch(url)
    .then((res) => res.text())
    .then((body) =>
      cheerio
        .load(body)(path)
        .each((index, element) => {
          callback(element.firstChild.data, element.attribs.href);
        })
    )
    .catch((e) => console.error(e));
}

And this will be the client code that can perform the same crawling as the previous one:

import { callAndParse } from "./crawler.engine";

// Start from the homepage and follow the links in the div with id "links".
callAndParse("http://www.crawlmeplease.com/", "div[id='links'] a", parseSecondPage);

function parseSecondPage(text, href) {
  // On the second page, follow the links in the div with id "otherlinks".
  callAndParse(href, "div[id='otherlinks'] a", parseDetailPage);
}

function parseDetailPage(text) {
  console.log(`found text: ${text}`);
}

As you can see, the client code is composed of just the parsing logic and nothing more.

This is of course a simplified implementation, because it can retrieve just the two pieces of information related to a link (href and text); but believe us, most of the time you're not going to need anything more than this.
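If you ever do need more, one possible extension (a sketch of ours, not the library's actual API) is to hand the whole Cheerio-wrapped element to the callback:

import cheerio = require("cheerio");
import fetch from "node-fetch";

// Variant that passes the Cheerio-wrapped element to the callback,
// so the client can read any attribute it wants via .attr() / .text().
export function callAndParseElement(url, path, callback) {
  fetch(url)
    .then((res) => res.text())
    .then((body) => {
      const $ = cheerio.load(body);
      $(path).each((index, element) => {
        callback($(element));
      });
    })
    .catch((e) => console.error(e));
}

This is more flexible, but it leaks the Cheerio dependency into the client code, breaking the goal stated above of keeping Cheerio as an implementation detail of the engine; that is one reason to prefer the plain (text, href) pair.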

But why are we using callbacks instead of promises or async functions?

The trick here is that, with callbacks, we can keep the client code much simpler than it would be if we used promises.

With callbacks, each time we give back to the client code just the minimum information it needs either to make another request or to use the data, so the client keeps only the crawling logic, without having to iterate over the links or import Cheerio.

Obviously, you could rewrite the code using promises, but you would have to make changes that bring more code into the client part, which we want to keep as lean as possible.

With promises, a result is returned only when the promise resolves. So, after processing a page, the library would have to pack all the links and text into a data structure and return it. That data would then have to be iterated over and processed in the client code, which is something we don't want.

The same is the case with async functions.

Another option could be to make the promise return a generator. Again, this would leave the client code with the task of iterating over the generator, making it heavier to maintain.
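To make the comparison concrete, here is a rough sketch of what a promise-based engine could look like (hypothetical code, not the actual library): the engine has to collect everything into an array, and the client has to loop over it.

import cheerio = require("cheerio");
import fetch from "node-fetch";

// Promise-based variant: the engine returns all the links of a page at once.
async function callAndParseAll(url: string, path: string): Promise<{ text: string; href: string }[]> {
  const res = await fetch(url);
  const body = await res.text();
  const $ = cheerio.load(body);
  return $(path)
    .map((index, element) => ({
      text: $(element).text(),
      href: $(element).attr("href") ?? "",
    }))
    .get();
}

// The client code now has to iterate over the structure itself:
async function crawlFirstPage() {
  const links = await callAndParseAll("http://www.crawlmeplease.com/", "div[id='links'] a");
  for (const link of links) {
    // ...call callAndParseAll again on link.href, and iterate once more...
  }
}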

Conclusions

The actual code of the library is a lot more complex than what you see here.

We covered scenarios such as following HTTP redirects, handling connection errors, and retrying failed requests.
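As an illustration (a sketch with our own naming, not the library's actual code), a retrying wrapper around node-fetch can cover redirects, connection errors, and retries in a few lines:

import fetch from "node-fetch";

// Fetch a page, following redirects and retrying on errors before giving up.
async function fetchWithRetry(url: string, retries = 3): Promise<string> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, { redirect: "follow" }); // follow HTTP redirects
      if (res.ok) {
        return res.text();
      }
      console.error(`attempt ${attempt}: HTTP ${res.status} for ${url}`);
    } catch (e) {
      console.error(`attempt ${attempt}: connection error for ${url}`, e);
    }
  }
  throw new Error(`giving up on ${url} after ${retries} attempts`);
}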

We also replaced the direct node-fetch calls with a custom request queue, in order to keep some interval between consecutive calls and avoid being blocked by DDoS protection systems.
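A minimal sketch of such a throttled queue could look like this (again, a simplification of the real thing; the names and the delay are ours):

import fetch from "node-fetch";

const queue: (() => Promise<void>)[] = [];
let draining = false;

// Enqueue a request; requests are executed one at a time with a pause in between.
export function enqueueRequest(url: string, onBody: (body: string) => void, delayMs = 1000) {
  queue.push(async () => {
    const res = await fetch(url);
    onBody(await res.text());
  });
  if (!draining) {
    void drain(delayMs);
  }
}

async function drain(delayMs: number) {
  draining = true;
  while (queue.length > 0) {
    const task = queue.shift()!;
    try {
      await task();
    } catch (e) {
      console.error(e);
    }
    // Pause before the next call, so we don't hammer the target site.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  draining = false;
}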

But the client code remains the same, and as you can see it is simple enough to let you write a lot of crawlers with little effort, so you can have a lot of fun collecting data from the web.

Thank you for reading.

Mirko Caruso - Software Engineer @ Jobrapido
Fabio Ranfi - Software Engineer @ Jobrapido
