Commit e93dfb6

Workaround to skip utf-8 characters in plaintext tokenizer (#11)
1 parent 8f208f4 commit e93dfb6

1 file changed, 5 additions and 0 deletions

tokenizer/plaintext/plaintext_tokenizer.cpp

@@ -53,6 +53,11 @@ int main(int argc, char* argv[]) {
     while (std::cin >> std::noskipws >> c) {
       bool is_punctuation = !isspace(c) && !std::isdigit(c) && !std::isalpha(c);
 
+      if ((unsigned int)(c) > 127) {
+        // FIXME: for now, just skip utf-8 characters since nlohmann dump gets stuck
+        continue;
+      }
+
       // ------------------------------
       // decide when to break the current string
       // break on spaces, punctuation (any symbol), or if we switch between letters and numbers
