A web-scraping API using Cheerio
Overview
Cheerio is a web scraping library written in JavaScript. Using this library, I have built a web-scraping API that serves data on online petitions in New Zealand. I was able to find the petitions data on the NZ Parliament website.
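As a rough sketch of the scraping approach, the snippet below fetches a petitions listing page and pulls out each petition title with Cheerio. The URL, the axios HTTP client, and the CSS selector are illustrative assumptions, not the exact ones used in this project.
import axios from 'axios';
import * as cheerio from 'cheerio';

// Illustrative only: the real project scrapes the NZ Parliament petitions pages,
// but the exact URL and selector here are assumptions.
const LISTING_URL = 'https://www.parliament.nz/en/pb/petitions/current';

async function scrapePetitionTitles(): Promise<string[]> {
  const { data: html } = await axios.get<string>(LISTING_URL);
  const $ = cheerio.load(html);          // parse the HTML into a queryable DOM
  const titles: string[] = [];
  $('.petition-title').each((_, el) => { // hypothetical selector for petition titles
    titles.push($(el).text().trim());
  });
  return titles;
}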
API
The Swagger doc is available here.
GET /api/v1/petitions/:status
Request
https://petitions-nz.herokuapp.com/api/v1/petitions/open
Response
{
  "status": "open",
  "currentPage": 1,
  "countPerPage": 50,
  "totalPage": 3,
  "totalNumber": 118,
  "petitions": [
    {
      "id": 99776,
      "requester": "Kim Hyunwoo",
      "title": "Express concern to South Korea at actions of diplomat",
      "documentId": "PET_99776",
      "status": "Open",
      "closingDate": "21 Sep 2020",
      "signatures": 3
    },
    ...
  ]
}
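For reference, the response body above can be modelled with TypeScript interfaces roughly like the following. This is a sketch inferred from the example JSON; the project's actual definitions live in petitions.types and may differ in detail.
// Shape of one petition entry (cf. IPetitionItem in petitions.types).
interface IPetitionItem {
  id: number;
  requester: string;
  title: string;
  documentId: string;
  status: string;
  closingDate: string;
  signatures: number;
}

// Shape of the paginated list response (cf. IPetitionList in petitions.types).
interface IPetitionList {
  status: string;
  currentPage: number;
  countPerPage: number;
  totalPage: number;
  totalNumber: number;
  petitions: IPetitionItem[];
}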
GET /api/v1/petition/:id
Request
https://petitions-nz.herokuapp.com/api/v1/petition/99776
Response
{
  "id": 99776,
  "requester": "Kim Hyunwoo",
  "title": "Express concern to South Korea at actions of diplomat",
  "documentId": "PET_99776",
  "status": "Closed",
  "closingDate": "21 Sep 2020",
  "signatures": 3
}
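A minimal client call might look like this, assuming Node 18+ with a global fetch (any HTTP client would work) and the IPetitionItem shape shown above.
// Fetch one petition by id and log its title and signature count.
async function getPetition(id: number): Promise<IPetitionItem> {
  const res = await fetch(`https://petitions-nz.herokuapp.com/api/v1/petition/${id}`);
  return res.json() as Promise<IPetitionItem>;
}

getPetition(99776).then(p => console.log(`${p.title}: ${p.signatures} signatures`));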
What I learned
Timeout issue
I encountered an issue while fetching the list of petitions. I received the following error message when running unit tests.
Timeout - Async callback was not invoked within the 5000 ms timeout specified by jest.setTimeout.
I decided to use an in-memory cache to boost performance. I came across node-cache and Redis. I thought Redis would be an over-engineered solution for this project, so I chose node-cache instead, as it is simple and fast.
import NodeCache from 'node-cache';

class Cache {
  private cache: NodeCache;

  constructor(ttl: number) { // I have set ttl to 1 hour
    this.cache = new NodeCache({ stdTTL: ttl, checkperiod: ttl * 0.2, useClones: false });
  }

  public async get(key: NodeCache.Key, func: Function, params: any): Promise<any> {
    const value = this.cache.get(key);
    // if cached, return the stored data
    if (value) {
      return Promise.resolve(value);
    }
    // if not cached, fetch and store the data
    const result = await func(params);
    this.cache.set(key, result);
    return result;
  }

  public del(keys: string | number | NodeCache.Key[]): void {
    this.cache.del(keys);
  }

  public flush(): void {
    this.cache.flushAll();
  }
}

export default Cache;
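To show how this wrapper is intended to be used, here is a minimal sketch. The fetchPetitionsByStatus function is a hypothetical stand-in for the project's actual Cheerio scraper, the cache key is arbitrary, and the './cache' file name is an assumption.
import Cache from './cache'; // assuming the class above lives in cache.ts
import { IPetitionItem } from '../types/petitions.types';

// Hypothetical stand-in for the real Cheerio-based scraper.
declare function fetchPetitionsByStatus(status: string): Promise<IPetitionItem[]>;

const cache = new Cache(60 * 60); // TTL of 1 hour, as mentioned in the constructor comment

async function getOpenPetitions(): Promise<IPetitionItem[]> {
  // The first call scrapes and caches; later calls within the TTL return the cached list.
  return cache.get('petitions:open', fetchPetitionsByStatus, 'open');
}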
But caching alone did not resolve the timeout issue. Looking at other API docs, I found query parameters like ?page=1&limit=30, so I created a Paginate class to calculate the offset and the number of pages.
import { IPetitionItem, IPetitionList } from "../types/petitions.types";

class Paginate {
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  public getOffset(page: number): number {
    return (page ? page - 1 : 0) * this.limit;
  }

  public getNumberOfPage(total: number): number {
    return Math.ceil(total / this.limit);
  }

  public getPaginatedItems(list: IPetitionItem[], page: number, total: number, status: string): IPetitionList {
    const offset = this.getOffset(page);
    const totalPage = this.getNumberOfPage(total);
    // take only the slice of items that belongs to the requested page
    const paginated = list.slice(offset, offset + this.limit);
    return {
      status: status,
      currentPage: page,
      totalPage: totalPage,
      countPerPage: this.limit, // page size, e.g. 50 in the example response
      totalNumber: total,
      petitions: paginated
    };
  }
}

export default Paginate;
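Putting the pieces together, a route handler along these lines would wire the cache and paginator into the list endpoint. The Express setup, file paths, and scrapePetitions helper are a sketch under those assumptions, not the project's exact code.
import express from 'express';
import Cache from './cache';
import Paginate from './paginate';
import { IPetitionItem } from '../types/petitions.types';

// Hypothetical stand-in for the project's Cheerio-based fetcher.
declare function scrapePetitions(status: string): Promise<IPetitionItem[]>;

const app = express();
const cache = new Cache(60 * 60);  // cache scraped results for an hour
const paginate = new Paginate(50); // 50 items per page, as in the example response

app.get('/api/v1/petitions/:status', async (req, res) => {
  const status = req.params.status;
  const page = Number(req.query.page) || 1;

  // Scrape (or reuse the cached result), then slice out the requested page.
  const all = await cache.get(`petitions:${status}`, scrapePetitions, status);
  res.json(paginate.getPaginatedItems(all, page, all.length, status));
});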
The timeout issue was resolved by introducing pagination into the API. The API now returns at most 50 items per page. You can find the code in my repo.