A web scraped API using cheerio

01 November 2020 • ☕️ 3 min read
#Javascript#Typescript#node.js#toy project

Overview

Cheerio is a web scraping library written in JavaScript. Using this library, I have built a web-scraped API that provides data about online petitions in NZ. I was able to find the petitions data on the NZ Parliament website.
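
The scraping itself boils down to fetching the Parliament page and walking its HTML with cheerio's jQuery-like selectors. A minimal sketch of the idea (the URL, the CSS selector, and the use of axios are placeholders, not the exact ones the project uses):

import axios from 'axios';
import * as cheerio from 'cheerio';

// Placeholder URL and selector -- the real markup of the Parliament site
// will differ, so treat these as illustrative only.
const LIST_URL = 'https://petitions.parliament.nz/';

interface ScrapedPetition {
    title: string;
    link: string;
}

async function scrapePetitionLinks(): Promise<ScrapedPetition[]> {
    const { data: html } = await axios.get<string>(LIST_URL);
    const $ = cheerio.load(html);

    const petitions: ScrapedPetition[] = [];
    $('a.petition-title').each((_, el) => {
        petitions.push({
            title: $(el).text().trim(),
            link: $(el).attr('href') ?? '',
        });
    });
    return petitions;
}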


API

The Swagger doc is available here.

GET /api/v1/petitions/:status

Request

https://petitions-nz.herokuapp.com/api/v1/petitions/open

Response

{
    "status": "open",
    "currentPage": 1,
    "countPerPage": 50,
    "totalPage": 3,
    "totalNumber": 118,
    "petitions": [
        {
            "id": 99776,
            "requester": "Kim Hyunwoo",
            "title": "Express concern to South Korea at actions of diplomat",
            "documentId": "PET_99776",
            "status": "Open",
            "closingDate": "21 Sep 2020",
            "signatures": 3
        },
        ...
    ]
}

GET /api/v1/petition/:id

Request

https://petitions-nz.herokuapp.com/api/v1/petition/99776

Response

{
    "id": 99776,
    "requester": "Kim Hyunwoo",
    "title": "Express concern to South Korea at actions of diplomat",
    "documentId": "PET_99776",
    "status": "Closed",
    "closingDate": "21 Sep 2020",
    "signatures": 3
}
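
For reference, here is a small sketch of how a client could consume the list endpoint; the types simply mirror the example payloads above:

// Response shape taken from the example payloads above.
interface PetitionItem {
    id: number;
    requester: string;
    title: string;
    documentId: string;
    status: string;
    closingDate: string;
    signatures: number;
}

interface PetitionList {
    status: string;
    currentPage: number;
    countPerPage: number;
    totalPage: number;
    totalNumber: number;
    petitions: PetitionItem[];
}

async function fetchPetitions(status: string): Promise<PetitionList> {
    const res = await fetch(`https://petitions-nz.herokuapp.com/api/v1/petitions/${status}`);
    if (!res.ok) {
        throw new Error(`Request failed with status ${res.status}`);
    }
    return (await res.json()) as PetitionList;
}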

What I learned

Timeout issue

I encountered an issue while fetching the list of petitions: I received the following error message when running unit tests.

Timeout - Async callback was not invoked within the 5000 ms timeout specified by jest.setTimeout.

I decided to use an in-memory cache to boost performance. I came across node-cache and Redis, but I thought Redis would be an over-engineered solution for this project. Instead, I chose node-cache as it's simple and fast.

import NodeCache from 'node-cache';

class Cache{

    private cache : NodeCache;

    constructor(ttl : number) //I have set ttl to 1 hour
    {
        this.cache = new NodeCache({stdTTL: ttl, checkperiod: ttl * 0.2, useClones: false});
    }

    public async get(key: NodeCache.Key, func: Function, params: any) : Promise<any> {
        const value = this.cache.get(key);
        //if cached data
        if(value)
        {
            return Promise.resolve(value);
        }
        
        //if not cached, store data
        let result = await func(params);
        this.cache.set(key, result);
        return result;
    }

    public del(keys: string | number | NodeCache.Key[]): void
    {
        this.cache.del(keys);
    }

    public flush(): void {
        this.cache.flushAll();
    }
}

export default Cache;
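
A route handler can then wrap the scraping call in this cache, so repeated requests within the TTL are served from memory instead of re-scraping. A minimal sketch with placeholder names (getPetitions and the module path are assumptions):

import Cache from './cache';

// getPetitions stands in for the cheerio-based scraping function;
// the module path './cache' is also assumed.
declare function getPetitions(status: string): Promise<unknown>;

const cache = new Cache(60 * 60); // TTL of one hour, as noted above

async function listPetitions(status: string) {
    // First call scrapes and caches; later calls within the TTL return
    // the stored value without hitting the Parliament site again.
    return cache.get(`petitions-${status}`, getPetitions, status);
}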

But caching alone still did not resolve the timeout issue. I looked into other API docs and found query strings like ?page=1&limit=30, so I created a Paginate class to calculate the offset and the number of pages.

import { IPetitionItem, IPetitionList } from "../types/petitions.types";

class Paginate{
    private limit: number;
    constructor(limit: number)
    {
        this.limit = limit;
    }

    public getOffset(page: number): number
    {
        return (page ? (page - 1) : 0) * this.limit;
    }

    public getNumberOfPage(total: number): number
    {
        return Math.ceil(total / this.limit);
    }

    public getPaginatedItems(list: IPetitionItem[], page: number, total: number, status: string) : IPetitionList
    {
        const offset = this.getOffset(page);
        const count = this.getNumberOfPage(total);
        const paginated = list.slice(offset, offset + this.limit);

        return {
            status: status,
            currentPage: page,
            totalPage: count,
            countPerPage: paginated.length,
            totalNumber: total,
            petitions: paginated
        };
    }
}

export default Paginate;
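
Putting the two pieces together, a handler can cache the full scraped list and slice out only the requested page. Again a rough sketch with assumed names and module paths; the real wiring is in the repo:

import Cache from './cache';
import Paginate from './paginate';
import { IPetitionItem, IPetitionList } from '../types/petitions.types';

// scrapeAllPetitions stands in for the cheerio scraper that returns
// every petition for a given status.
declare function scrapeAllPetitions(status: string): Promise<IPetitionItem[]>;

const cache = new Cache(60 * 60);
const paginate = new Paginate(50); // at most 50 items per page

async function getPetitionsPage(status: string, page: number): Promise<IPetitionList> {
    const all: IPetitionItem[] = await cache.get(status, scrapeAllPetitions, status);
    return paginate.getPaginatedItems(all, page, all.length, status);
}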

The timeout issue went away once pagination was introduced in the API. Now the API fetches at most 50 items per page. You can find the code at my repo.