Lightweight Docker image with NodeJS server to spit out HTML from loaded JS using Puppeteer and Chrome
Medium story: HTML from the Javascript world
| Image size | RAM usage | 
|---|---|
| 558MB | 110MB+ | 
The program is written in NodeJS with Typescript, in the src directory.
Runs a NodeJS server accepting HTTP requests with two URL parameters:

- `url`, which is the URL to prerender into HTML
- `wait`, which is the optional load event to wait for before stopping the prerendering. It can be:
  - `load` (wait for the `load` event)
  - `domcontentloaded` (wait for the `DOMContentLoaded` event)
  - `networkidle0` (default, wait until there are no network connections for at least 500 ms)
  - `networkidle2` (wait until there are fewer than 3 network connections for at least 500 ms)

For example:
http://localhost:8000/?url=https://github.com/qdm12/htmlspitter
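
When the target URL carries its own query string, it may help to percent-encode it before passing it as the `url` parameter. The sketch below builds such a request URL. `WaitEvent` and `buildRequestURL` are hypothetical names that are not part of this repository, and the sketch assumes the server decodes query parameters in the standard way.

```typescript
// The four wait values listed above; networkidle0 is the documented default.
type WaitEvent = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Hypothetical helper (not part of htmlspitter): builds the request URL
// for a running server, e.g. http://localhost:8000.
function buildRequestURL(server: string, target: string, wait?: WaitEvent): string {
  // Percent-encode the target so that its own "?" and "&" are not read as
  // parameters of the htmlspitter request itself.
  let result = `${server.replace(/\/+$/, "")}/?url=${encodeURIComponent(target)}`;
  if (wait !== undefined) {
    // Omitting wait leaves the server on its default (networkidle0).
    result += `&wait=${wait}`;
  }
  return result;
}

const requestURL = buildRequestURL(
  "http://localhost:8000",
  "https://example.com/search?q=a&page=2",
  "load",
);
console.log(requestURL);
// http://localhost:8000/?url=https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Da%26page%3D2&wait=load
```

The resulting string can be passed to any HTTP client, such as wget or a browser.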
- The server scales up Chromium instances if needed
 - It limits the number of opened pages per instance to prevent one page crashing all the other pages
 - It has a 1 hour cache for loaded HTML
 - It has a queue system for requests once the maximum number of pages/chromium instances is reached
 - Not compatible with architectures other than amd64, as Chrome-Beta is required and is only built for amd64 for now.
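
The scaling, page limit and queue behaviour listed above can be pictured as one decision made per incoming request. The sketch below is not the code in src. `Limits`, `BrowserState`, `Decision` and `decide` are hypothetical names, the single shared queue is a simplification, and rejecting a request once the queue is full is an assumption, since the list does not say what happens in that case.

```typescript
// Hypothetical limits; -1 means "no maximum".
interface Limits {
  maxPages: number;     // open pages per Chromium instance
  maxHits: number;      // pages opened per instance over its lifetime
  maxBrowsers: number;  // Chromium instances at any time
  maxQueueSize: number; // waiting requests
}

// Hypothetical snapshot of one running Chromium instance.
interface BrowserState {
  openPages: number;
  hits: number;
}

type Decision =
  | { action: "reuse"; index: number } // open a page in an existing instance
  | { action: "launch" }               // start a new Chromium instance
  | { action: "queue" }                // wait for capacity
  | { action: "reject" };              // assumption: queue is full

// True when value is still below max, or when max is -1 (no maximum).
const within = (value: number, max: number): boolean => max === -1 || value < max;

function decide(browsers: BrowserState[], queueLength: number, limits: Limits): Decision {
  // Prefer an instance that still has room for a page and has not
  // reached its lifetime page count.
  const index = browsers.findIndex(
    (b) => within(b.openPages, limits.maxPages) && within(b.hits, limits.maxHits),
  );
  if (index !== -1) {
    return { action: "reuse", index };
  }
  if (within(browsers.length, limits.maxBrowsers)) {
    return { action: "launch" };
  }
  if (within(queueLength, limits.maxQueueSize)) {
    return { action: "queue" };
  }
  return { action: "reject" };
}

const exampleLimits: Limits = { maxPages: 2, maxHits: 3, maxBrowsers: 2, maxQueueSize: 1 };
// One instance exists but is full, and a second instance is still allowed.
console.log(decide([{ openPages: 2, hits: 2 }], 0, exampleLimits)); // { action: 'launch' }
```

Capping pages per instance means a crash in one Chromium process only loses the pages that instance was loading.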
Run the container:
docker run -it --rm --init -p 8000:8000 qmcgaw/htmlspitter
You can also use docker-compose.yml.
 
| Name | Default | Possible values | Description |
|---|---|---|---|
| `MAX_PAGES` | `10` | `-1` or integer larger than 0 | Max number of pages per Chromium instance at any time, -1 for no max |
| `MAX_HITS` | `300` | `-1` or integer larger than 0 | Max number of pages opened per Chromium instance during its lifetime (before relaunch), -1 for no max |
| `MAX_AGE_UNUSED` | `60` | `-1` or integer larger than 0 | Max age in seconds of inactivity before the browser is closed, -1 for no max |
| `MAX_BROWSERS` | `10` | `-1` or integer larger than 0 | Max number of Chromium instances at any time, -1 for no max |
| `MAX_CACHE_SIZE` | `10` | `-1` or integer larger than 0 | Max number of MB stored in the cache, -1 for no max |
| `MAX_QUEUE_SIZE` | `100` | `-1` or integer larger than 0 | Max size of queue of pages per Chromium instance, -1 for no max |
| `LOG` | `normal` | `normal` or `json` | Format to use to print logs |
| `TIMEOUT` | `15000` | `-1` or integer larger than 0 | Timeout in ms to load a page, -1 for no timeout |
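
The numeric variables above share one rule: the value is either -1 (no maximum) or an integer larger than 0. The sketch below shows one way such a value could be read and checked. It is not taken from src: `Env`, `parseLimit` and `isBelowLimit` are hypothetical names, and treating an empty string as unset is an assumption.

```typescript
// The environment is passed in explicitly so the sketch stays self-contained.
type Env = Record<string, string | undefined>;

// Returns the default when the variable is unset (or empty, by assumption),
// and throws when the value is neither -1 nor an integer larger than 0.
function parseLimit(env: Env, name: string, defaultValue: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") {
    return defaultValue;
  }
  if (!/^(-1|[1-9][0-9]*)$/.test(raw)) {
    throw new Error(`${name} must be -1 or an integer larger than 0, got "${raw}"`);
  }
  return Number(raw);
}

// A limit of -1 never blocks; otherwise the current count must stay below it.
function isBelowLimit(current: number, limit: number): boolean {
  return limit === -1 || current < limit;
}

const exampleEnv: Env = { MAX_PAGES: "20", MAX_BROWSERS: "-1" };
const maxPages = parseLimit(exampleEnv, "MAX_PAGES", 10);       // 20, set explicitly
const maxBrowsers = parseLimit(exampleEnv, "MAX_BROWSERS", 10); // -1, no maximum
const timeout = parseLimit(exampleEnv, "TIMEOUT", 15000);       // 15000, the default
console.log(maxPages, maxBrowsers, timeout, isBelowLimit(500, maxBrowsers));
// 20 -1 15000 true
```

In a Node program the first argument would typically be `process.env`.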
If you obtain the error:

{"error":"Error: Failed to launch chrome!\nFailed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Operation not permitted\n\n\nTROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md\n"}

Then you might need to use seccomp with the chrome.json file of this repository:

wget https://raw.githubusercontent.com/qdm12/htmlspitter/master/chrome.json

docker run -it --rm --init --security-opt seccomp=$(pwd)/chrome.json -p 8000:8000 qmcgaw/htmlspitter

- A built-in local memory cache holds HTML content obtained in the last hour and is limited in the number of characters it contains.
 - A built-in pool of Chromium instances creates and removes Chromium instances according to the server load.
 - Each Chromium instance has a limited number of pages so that if one page crashes Chromium, not all page loads are lost.
 - As Chromium caches content, each instance is destroyed and re-created once it reaches a certain number of page loads.
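
The cache described in the first point above can be sketched as a map with a one hour expiry and a cap on the total number of stored characters. This is an illustration under stated assumptions rather than the code in src: `HTMLCache` and its methods are hypothetical names, evicting the oldest entries first is an assumption, and the conversion from the `MAX_CACHE_SIZE` value in MB to a character count is not shown.

```typescript
interface CacheEntry {
  html: string;
  storedAt: number; // milliseconds since epoch
}

// Hypothetical in-memory cache limited by age and by total characters.
class HTMLCache {
  private entries = new Map<string, CacheEntry>();
  private totalChars = 0;

  // maxChars of -1 means no size limit; ttlMs defaults to one hour.
  constructor(private maxChars: number, private ttlMs: number = 60 * 60 * 1000) {}

  get(url: string, now: number = Date.now()): string | undefined {
    const entry = this.entries.get(url);
    if (entry === undefined) {
      return undefined;
    }
    if (now - entry.storedAt >= this.ttlMs) {
      this.remove(url); // expired: drop it and report a miss
      return undefined;
    }
    return entry.html;
  }

  set(url: string, html: string, now: number = Date.now()): void {
    this.remove(url); // replace any previous entry for this URL
    if (this.maxChars !== -1 && html.length > this.maxChars) {
      return; // larger than the whole cache, so it is not stored
    }
    this.entries.set(url, { html, storedAt: now });
    this.totalChars += html.length;
    // Assumption: evict in insertion order (oldest first) until within the limit.
    for (const key of this.entries.keys()) {
      if (this.maxChars === -1 || this.totalChars <= this.maxChars) {
        break;
      }
      this.remove(key);
    }
  }

  private remove(url: string): void {
    const entry = this.entries.get(url);
    if (entry !== undefined) {
      this.totalChars -= entry.html.length;
      this.entries.delete(url);
    }
  }
}

const cache = new HTMLCache(1000);
cache.set("https://example.com", "<html></html>");
console.log(cache.get("https://example.com")); // <html></html>
```

Passing `now` explicitly is only there to make the expiry easy to exercise without waiting an hour.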
 
- chrome.json may be required depending on your host OS.
 - The `--init` flag is added to prevent zombie Chromium processes from lingering when the container stops the main NodeJS program.
 - A built-in healthcheck is implemented by running `node build/healthcheck.js` against a running instance.
- Chromium is written in C++ and is multithreaded, so it scales well with more CPU cores
 - The NodeJS program should not be the bottleneck because all the work is done by Chromium
 - The bottleneck will be CPU and especially RAM used by Chromium instance(s)
 - You can scale up by having multiple machines running the program, behind a load balancer
 
- Either use the Docker container development image with Visual Studio Code and the remote development extension
 - Or install Node and NPM on your machine
 
Install all dependencies:

npm i

Transpile the Typescript code to Javascript and run build/main.js with:

npm run start

Test it with, for example:

wget -qO- http://localhost:8000/?url=https://github.com/qdm12/htmlspitter

You can also:

- Run tests: `npm t`
- Run the server with hot reload (performs `npm run start` on each .ts change): `npx nodemon`
- Build Docker: `docker build -t qmcgaw/htmlspitter .`

  You can also specify the branch of Google Chrome from `beta` (default), `stable` and `unstable`: `docker build -t qmcgaw/htmlspitter --build-arg GOOGLE_CHROME_BRANCH=unstable .`
- There are two environment variables you might find useful:
  - `PORT` to set the HTTP server listening port
  - `CHROME_BIN`, which is the path to the Chrome binary or `Puppeteer-bundled`
 
- Show Chrome version at start
 - Fake user agents
 - Prevent recursive calls to localhost
 - Format JSON or raw HTML
 - Limit Chromium instances in terms of RAM
 - Gzip compression
 - Sync same URL with Redis (not fetching the same URL twice)
 - Sync Cache with Postgresql or Redis depending on size
 - Limit data size in Postgresql according to time created
 - Unit testing
 - ReactJS GUI
 - Static binary in Scratch Docker image
 
- Credits to jessfraz for chrome.json
 - The Google Chrome team
 - The Puppeteer developers
 
This repository is under an MIT license
