A fun Microservice for scraping stuff from sites for Hellomouse Apps
- Download webpages as HTML (with assets such as CSS, videos, and images embedded as base64), PDF, or WEBP (screenshot)
- Special handling for certain websites. Currently supported:
  - Twitter / X: Tweets are downloaded as HTML + attached media (images, videos)
  - Reddit: Posts and comments are downloaded with any attached assets
  - SoundCloud: Songs are downloaded with metadata (HTML + audio)
  - Newgrounds: Songs are downloaded with metadata (HTML + audio)
  - Imgur: Albums and galleries are downloaded with all images and metadata (HTML + images / videos)
  - YouTube: Videos are downloaded
  - Pixiv: Albums are downloaded
  - Bilibili: Videos are downloaded
Install dependencies:
npm install
Set up the config. You will need a PostgreSQL database running as well as the hellomouse-apps-api server (run the server first to generate the required tables).
There is an example config in the root directory. Copy it and rename the copy to config.js. Here are the properties:
export const dbUser = 'hellomouse_board'; // PostgreSQL user
export const dbIp = '127.0.0.1'; // Postgres Server location
export const dbPort = 5433; // Postgres Server port
export const dbPassword = 'my password'; // Postgres Server password
export const dbName = 'hellomouse_board'; // Postgres Server DB name
export const fileDir = './saves'; // Path to store all files; in general, web files are stored under this path/site_downloads/file.ext
To set up yt-dlp (optional), you can place your browser cookies in secret/yt-cookies.txt for downloading YouTube videos and in secret/bilibili-cookies.txt for downloading Bilibili videos.
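For reference, the database properties in config.js map onto a PostgreSQL connection URL roughly as shown here. This is only a sketch for sanity-checking your values: `buildConnectionUrl` is a hypothetical helper that is not part of this project, and the service itself may connect in a different way.

```javascript
// Hypothetical helper (not part of this project): builds a PostgreSQL
// connection URL from the same property names used in config.js.
function buildConnectionUrl({ dbUser, dbPassword, dbIp, dbPort, dbName }) {
    // Percent-encode the parts that may contain reserved characters,
    // e.g. the space in the example password.
    const user = encodeURIComponent(dbUser);
    const password = encodeURIComponent(dbPassword);
    const name = encodeURIComponent(dbName);
    return `postgres://${user}:${password}@${dbIp}:${dbPort}/${name}`;
}

// Values copied from the example config above.
const url = buildConnectionUrl({
    dbUser: 'hellomouse_board',
    dbIp: '127.0.0.1',
    dbPort: 5433,
    dbPassword: 'my password',
    dbName: 'hellomouse_board'
});

console.log(url);
```

A URL like this can be handed to a client such as `psql` to confirm that the user, port, and database name are correct before starting the service.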
To set up Pixiv cookies (optional, for bypassing rate limiting and age restrictions), export your browser cookies as a JS array of objects like [{ name: ... }] and save the result in secret/pixiv-cookies.txt.
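If you want to check an exported cookie file before using it, something like the following may help. It is a minimal sketch that assumes the file content is valid JSON; `parseCookieText` is a hypothetical name, and the `value` field in the sample is an assumption about what a browser export contains (only `name` is mentioned above).

```javascript
// Hypothetical validator (not part of this project): parses cookie text and
// checks that it is an array of objects that each have a string `name`.
function parseCookieText(text) {
    const cookies = JSON.parse(text); // throws SyntaxError on invalid JSON
    if (!Array.isArray(cookies)) {
        throw new TypeError('Expected an array of cookie objects');
    }
    for (const cookie of cookies) {
        if (cookie === null || typeof cookie !== 'object' || typeof cookie.name !== 'string') {
            throw new TypeError('Each cookie must be an object with a string "name"');
        }
    }
    return cookies;
}

// Sample input; the cookie name and value here are placeholders.
const sample = '[{ "name": "example_cookie", "value": "abc123" }]';
const cookies = parseCookieText(sample);
console.log(`Parsed ${cookies.length} cookie(s)`);
```

To check a real file, read it with `fs.readFileSync('secret/pixiv-cookies.txt', 'utf8')` and pass the resulting string to the function.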
Run the server:
node index.js