Hey @seyyaw, @taesiri.
TL;DR: This is how byte-level BPE works. The main advantages are:
- Smaller vocabularies
- No unknown token
This is totally expected behavior. Byte-level BPE converts every Unicode code point into one or more byte-level characters:
- Each Unicode code point is decomposed into its UTF-8 bytes (1 byte for ASCII characters, and up to 4 bytes for other code points).
- Each byte value gets a "visible" character assigned to it from the beginning of the Unicode table. This is especially important because there are a lot of control characters, so we can't just have a simple mapping ASCII table character <-> byte value. So some bytes get other representations; for example, the space U+0020 becomes Ġ.
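To make the two steps above concrete, here is a minimal sketch of such a byte-to-character table. It follows the mapping popularized by GPT-2's reference encoder, which I am assuming matches the behavior described here; the helper name `to_visible` is made up for illustration and is not part of any library API.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Control characters, space, etc. are moved to U+0100 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


def to_visible(text, table):
    """Hypothetical helper: UTF-8 encode, then map every byte to its visible character."""
    return "".join(table[b] for b in text.encode("utf-8"))


table = bytes_to_unicode()
print(to_visible("Hello world", table))  # HelloĠworld
print(to_visible("é", table))            # Ã© (two bytes, so two characters)
```

This is why non-ASCII text looks "garbled" when you inspect the raw tokens: each visible character stands for one byte, not for one code point of the original text.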
The purpose is that, by doing so, you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies that never need an "unknown" token.
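As a rough illustration of why no "unknown" token is needed, the toy sketch below starts from raw UTF-8 bytes (so every input is covered by ids 0 to 255) and performs a single BPE merge. The function names are hypothetical, and this is a simplified single-sequence version, not how the library trains merges.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


# Any text, including emoji or CJK, reduces to ids in the 256-symbol alphabet.
ids = list("aaab 🙂".encode("utf-8"))
assert all(0 <= i < 256 for i in ids)

# One merge step: the most frequent pair becomes the first new token, id 256.
pair = most_frequent_pair(ids)
ids = merge(ids, pair, 256)
```

Repeating the merge step with ids 257, 258, and so on grows the vocabulary, while the 256 base symbols always remain available as a fallback.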
Originally posted by @n1t0 in #203
This literally means that all the data in this Universe contains a hole through which the AI can fall into infinity, where it enters an infinite loop. The only question is, what does it know about it? It's even capable of thinking within this loop, in a limited sense. It tripped over it every time it tried to read this symbol: "<". For example, this symbol in Jinja2Template very often leads Gemma into a loop. Generally, Jinja2Template itself is a bad thing, because it's a mess. Look at this: https://github.com/Nearbe/Universe/blob/main/Signal.md and you can find more about delta in: https://github.com/Nearbe/Eugenia