laurens-moonens/BytePairEncoding

BytePairEncoding

This CLI tool encodes, decodes, and inspects text using byte pair encoding. After encoding some text, the tool can also generate new text (often gibberish) based on the original text.

For example, here is some text generated from the Bee Movie script:
"I will be able live are you? A lot to floll the honey production this this is this?"


How to build

This project uses CMake to build.

git clone https://github.com/laurens-moonens/BytePairEncoding.git
cd BytePairEncoding

mkdir build
cd build
cmake ..
cmake --build .

Only tested on Arch Linux using Make. More testing to follow.

Requirements

  • CMake (version 3.14 or higher)

How to use

Run ./bpe -h for a full list of commands and options.

Encoding a file

./bpe encode -i ../data/bee_movie_script.txt -b ../data/bee_movie.bpe

Optionally, the encoded tokens can be written to a file as well:
-t ../data/bee_movie.tkn
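
For readers unfamiliar with byte pair encoding, the idea behind the encode step can be sketched in a few lines of Python. This is only an illustration of the general technique, not this tool's implementation: the function name `bpe_encode`, the `max_merges` parameter, and the in-memory table layout are all assumptions, and the on-disk format of the `.bpe` and `.tkn` files is not described here.

```python
from collections import Counter

def bpe_encode(data: bytes, max_merges: int = 1000):
    """Repeatedly replace the most frequent adjacent token pair with a new token.

    Hypothetical sketch: returns the token list and a table mapping each
    new token id to the (left, right) pair it stands for.
    """
    tokens = list(data)   # start with one token per byte (values 0..255)
    table = {}            # new token id -> (left, right)
    next_id = 256         # first id that is not a raw byte
    for _ in range(max_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break         # no pair repeats, so merging gains nothing
        table[next_id] = pair
        merged, i = [], 0
        while i < len(tokens):
            # Scan left to right, replacing non-overlapping occurrences.
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return tokens, table

tokens, table = bpe_encode(b"aaabdaaabac")
print(len(tokens), len(table))
```

In this sketch, the table corresponds loosely to what the `-b` file holds and the token list to what the `-t` file holds.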

Decoding an encoded file

./bpe decode -b ../data/bee_movie.bpe -t ../data/bee_movie.tkn -o ../data/bee_movie_decoded.txt 
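
Decoding needs both the table and the tokens because each merged token is only meaningful through the pair it replaced. A minimal Python sketch of that expansion follows, assuming a table that maps each merged token id to a `(left, right)` pair and raw bytes as ids below 256. The name `bpe_decode` and this layout are hypothetical and may differ from what the tool does internally.

```python
def bpe_decode(tokens, table):
    """Expand merged tokens back into the original bytes.

    Hypothetical sketch: `table` maps a merged token id to the
    (left, right) pair it replaced; ids not in the table are raw bytes.
    """
    out = bytearray()
    stack = list(reversed(tokens))  # pop() then yields tokens in order
    while stack:
        tok = stack.pop()
        if tok in table:
            left, right = table[tok]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return bytes(out)

# Token 256 stands for "aa", token 257 for (256, "b"), i.e. "aab".
example_table = {256: (97, 97), 257: (256, 98)}
print(bpe_decode([257, 99], example_table))
```

An explicit stack is used instead of recursion so that deeply nested merges do not hit Python's recursion limit.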

Inspecting the generated BPE table

./bpe inspect -b ../data/bee_movie.bpe

Generating new text using the BPE table

./bpe generate -b ../data/bee_movie.bpe

Optionally, the generated text can be written to a file:
-o ../data/bee_movie_generated.txt
Optionally, the number of tokens to generate can be specified (defaults to 100):
-c 69
