This CLI tool encodes, decodes, and inspects text using byte pair encoding (BPE). After encoding some text, the tool can also generate new text (often gibberish) based on the original.
For example, here is some text generated from the Bee Movie script:
"I will be able live are you? A lot to floll the honey production this this is this?"
This project uses CMake to build.
git clone https://github.com/laurens-moonens/BytePairEncoding.git
cd BytePairEncoding
mkdir build
cd build
cmake ..
cmake --build .
Only tested on Arch Linux using Make. More testing to follow.
- CMake (version 3.14 or higher)
Run ./bpe -h for a full list of commands and options.
./bpe encode -i ../data/bee_movie_script.txt -b ../data/bee_movie.bpe
Optionally, add -t to the encode command to also write the encoded tokens to a file:
-t ../data/bee_movie.tkn
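For readers unfamiliar with byte pair encoding, the idea behind the encode step can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not this tool's implementation: the function name `bpe_encode`, the `num_merges` parameter, and the choice to start new token ids at 256 are all assumptions made for the example.

```python
from collections import Counter


def bpe_encode(data: bytes, num_merges: int):
    """Return (tokens, merges) after at most num_merges pair merges.

    Hypothetical helper for illustration; not part of the bpe tool.
    """
    tokens = list(data)   # start with one token per byte (ids 0..255)
    merges = {}           # new token id -> (left, right) pair it replaces
    next_id = 256         # assumption: merged tokens are numbered from 256
    for _ in range(num_merges):
        # Count every adjacent pair of tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break         # nothing repeats, so merging gains nothing
        # Replace each occurrence of the most frequent pair, left to right.
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges[next_id] = pair
        next_id += 1
    return tokens, merges


tokens, merges = bpe_encode(b"aaabdaaabac", 3)
print(tokens)  # [258, 100, 258, 97, 99]
```

In this sketch, the pair table (`merges`) plays the role of the .bpe file and the token list plays the role of the .tkn file, although the tool's actual file formats are not described here.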
./bpe decode -b ../data/bee_movie.bpe -t ../data/bee_movie.tkn -o ../data/bee_movie_decoded.txt
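Decoding reverses the merges: every token that stands for a pair is expanded until only raw bytes remain. The sketch below shows that idea in Python. The function name `bpe_decode` and the in-memory pair table are assumptions for the example, and the tool's own file formats may differ.

```python
def bpe_decode(tokens, merges):
    """Expand tokens back into bytes using a pair table.

    Hypothetical helper for illustration; not part of the bpe tool.
    merges maps a token id (>= 256) to the (left, right) pair it replaced.
    """
    out = bytearray()
    stack = list(reversed(tokens))  # pop() then yields tokens left to right
    while stack:
        tok = stack.pop()
        if tok < 256:
            out.append(tok)         # ids below 256 are literal bytes
        else:
            left, right = merges[tok]
            stack.append(right)     # push right first so left expands first
            stack.append(left)
    return bytes(out)


# Example pair table and token list (made up for this sketch).
merges = {256: (97, 97), 257: (256, 97), 258: (257, 98)}
tokens = [258, 100, 258, 97, 99]
print(bpe_decode(tokens, merges))  # b'aaabdaaabac'
```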
./bpe inspect -b ../data/bee_movie.bpe
./bpe generate -b ../data/bee_movie.bpe
Optionally, the generated text can be written to a file:
-o ../data/bee_movie_generated.txt
Optionally, the number of tokens to generate can be specified (defaults to 100):
-c 69
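This README does not say how the tool chooses the next token when generating. One simple approach that produces similar gibberish is to sample each next token from the tokens that followed the current one in the encoded input. The Python sketch below assumes that approach. The function `generate_tokens` and its `seed` parameter are hypothetical, and only the default count of 100 comes from the option described above.

```python
import random
from collections import defaultdict


def generate_tokens(tokens, count=100, seed=None):
    """Generate count tokens by following observed token-to-token transitions.

    Hypothetical helper; the bpe tool may use a different method.
    """
    if not tokens or count <= 0:
        return []
    rng = random.Random(seed)
    # For each token, record every token that directly followed it.
    followers = defaultdict(list)
    for cur, nxt in zip(tokens, tokens[1:]):
        followers[cur].append(nxt)
    current = rng.choice(tokens)
    out = [current]
    while len(out) < count:
        options = followers.get(current)
        if options:
            current = rng.choice(options)  # frequent followers are more likely
        else:
            current = rng.choice(tokens)   # dead end: restart at a random token
        out.append(current)
    return out


sample = generate_tokens([1, 2, 3, 1, 2, 4], count=10, seed=0)
print(len(sample))  # 10
```

The generated token ids would then be decoded back to text in the same way as any other token list.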