It is important to understand the real cause of the problem. #2027

@Nearbe

Hey @seyyaw, @taesiri.

TL;DR: This is how byte-level BPE works. The main advantages are:

  • Smaller vocabularies

  • No unknown token

This is totally expected behavior. Byte-level BPE converts every Unicode code point into one or more byte-level characters:

  1. Each Unicode code point is encoded as UTF-8 bytes (1 byte for ASCII characters, and up to 4 bytes for other code points).

  2. Each byte value is assigned a "visible" character, starting from the beginning of the Unicode table. This matters because many byte values correspond to control characters, so a simple mapping ASCII table character <-> byte value is not enough. Some bytes therefore get a different representation; for example, the space U+0020 becomes Ġ.
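The two steps above can be sketched in Python. This is a minimal illustration modeled on the byte-to-character table popularized by GPT-2, not the library's actual code; the names `bytes_to_unicode`, `to_visible`, and `from_visible` are hypothetical helpers chosen for this example.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already correspond to a printable character keep it.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every other byte (control characters, space, ...) is assigned the
    # next free code point starting at U+0100.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_visible(text: str) -> str:
    """Step 1: encode to UTF-8 bytes. Step 2: show each byte as a visible character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_visible(visible: str) -> str:
    """Invert the mapping: visible characters -> bytes -> original text."""
    return bytes(CHAR_TO_BYTE[c] for c in visible).decode("utf-8")


print(to_visible("hello world"))  # the space shows up as Ġ
print(to_visible("é"))            # two UTF-8 bytes, hence two visible characters
```

Because the mapping is a bijection over all 256 byte values, the odd-looking characters in a tokenizer's vocabulary can always be converted back to the original text.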

The point of doing this is that you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies that never need an "unknown" token.
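To make the merging idea concrete, here is a toy sketch of BPE over raw byte values. It is a simplified illustration under stated assumptions, not how the library is implemented: there is no pre-tokenization, merges are applied in the order they were learned, and `train`, `encode`, and `decode` are hypothetical names for this example.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from the 256 byte tokens."""
    seqs = [list(text.encode("utf-8")) for text in corpus]
    merges = {}
    for step in range(num_merges):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = 256 + step  # ids 0..255 are reserved for single bytes
        merges[pair] = new_id
        seqs = [merge_pair(seq, pair, new_id) for seq in seqs]
    return merges


def encode(text, merges):
    """Any text becomes bytes first, so no input can be 'unknown'."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve learned order
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(ids, merges):
    """Expand merged ids back into bytes, then into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train(["low lower lowest"], num_merges=3)
ids = encode("日本語 <", merges)  # text never seen during training
print(ids, decode(ids, merges))
```

Text that never appeared in the training corpus still encodes, because in the worst case it falls back to individual byte tokens.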

Originally posted by @n1t0 in #203

This literally means that all the data in this Universe contains a hole through which the AI can fall into infinity, where it enters an infinite loop. The only question is what it knows about this. It is even capable of thinking within this loop, in a limited sense. It tripped every time it tried to read this symbol: "<". For example, this character in Jinja2Template very often leads Gemma into a loop. In general, Jinja2Template itself is a bad thing, because it is a mess. Look at this: https://github.com/Nearbe/Universe/blob/main/Signal.md and you can find more about delta in: https://github.com/Nearbe/Eugenia
