Utility to parse the Church Slavonic texts (books) published at https://ustavshik.ru/books.
These texts are rendered as HTML and use the obsolete USC encoding (more precisely, a variant of USC that is based on UTF-8 rather than cp1251).
The goal of this utility is to convert Ustavshik HTML into XML whose text is encoded according to the Unicode standard.
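To illustrate what a conversion from a legacy encoding to Unicode can look like, here is a minimal sketch of a table-driven character remapping. It is not the code this utility uses, and the two table entries are placeholders chosen only for illustration; the names `TOY_TABLE` and `remap` are hypothetical, and the real mapping is not reproduced here.

```python
# Toy table: these entries are placeholders, NOT the real legacy-to-Unicode mapping.
TOY_TABLE = str.maketrans({
    "~": "\u0483",  # placeholder: '~' becomes COMBINING CYRILLIC TITLO
    "w": "\u0461",  # placeholder: 'w' becomes CYRILLIC SMALL LETTER OMEGA
})


def remap(text: str, table=TOY_TABLE) -> str:
    """Replace every character found in the table; leave all others unchanged."""
    return text.translate(table)


if __name__ == "__main__":
    # Print the code points so the result is visible even without a suitable font.
    print([hex(ord(ch)) for ch in remap("w~")])
```

A lookup table keeps the conversion rules as data, so extending the mapping would not require touching the code that walks the text.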
Requires Python 3.
It is recommended to use a Python virtual environment, like this:
mkdir workdir
cd workdir
python3 -m venv .venv
source .venv/bin/activate
Then, install the package directly from GitHub:
pip install git+https://github.com/slavonic/ustavshik-parser
Once installed, you can run the utility like this:
python -m ustav
This will give you the basic usage information, like this:
usage: ustav [-h] source target
ustav: error: the following arguments are required: source, target
python -m ustav https://ustavshik.ru/books/chasoslov chasoslov.xml
The command above reads the web page, does the format conversion, and saves the XML as chasoslov.xml.
You can also download the HTML first and save it as a local file:
wget -O chasoslov.html https://ustavshik.ru/books/chasoslov
and then use this utility to do the conversion from HTML:
python -m ustav chasoslov.html chasoslov.xml
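The download-then-convert steps above can also be scripted. The following is a minimal sketch, assuming the package is installed in the active environment and that book pages live under https://ustavshik.ru/books/ followed by a short name, as in the chasoslov example. The helper names `book_url`, `convert_command`, and `download_and_convert` are hypothetical and are not part of the ustav package.

```python
import subprocess
import sys
import urllib.request

BASE_URL = "https://ustavshik.ru/books"  # taken from the examples above


def book_url(slug: str) -> str:
    """Build the page URL for a book, e.g. 'chasoslov'."""
    return f"{BASE_URL}/{slug}"


def convert_command(source: str, target: str) -> list:
    """Build the same command line as 'python -m ustav source target'."""
    return [sys.executable, "-m", "ustav", source, target]


def download_and_convert(slug: str) -> str:
    """Save the book's HTML locally, convert it to XML, and return the XML path."""
    html_path = f"{slug}.html"
    xml_path = f"{slug}.xml"
    # Download the page to a local file (the equivalent of the wget step).
    urllib.request.urlretrieve(book_url(slug), html_path)
    # Run the converter on the local file; raises CalledProcessError on failure.
    subprocess.run(convert_command(html_path, xml_path), check=True)
    return xml_path
```

Calling `download_and_convert("chasoslov")` would leave both chasoslov.html and chasoslov.xml in the current directory. Keeping the HTML on disk means the conversion can be repeated later without fetching the page again.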