Invalid UTF8 bytes in default TAGS.txt #5406

Open
@s-kganz

Description

Short description
Some of the language tags in the default TAGS.txt cause a UnicodeDecodeError when a dataset builder is instantiated.

Environment information

  • Operating System: Windows 11

  • Python version: 3.10.13

  • tensorflow-datasets/tfds-nightly version: tfds-nightly 4.9.4.dev202405100044

  • tensorflow/tf-nightly version: tensorflow 2.10.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes

Reproduction instructions
Make a toy dataset with tfds new test. Then try to instantiate the builder.

from test.test_dataset_builder import *
b = Builder()

Link to logs
Stack trace here

Expected behavior
The builder should instantiate without error.

Additional context
Deleting lines 73, 79, 126, 128, 156, and 173 in TAGS.txt fixes the problem. These are all language tags.
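
To narrow down which lines are affected without deleting them by hand, something like the following may help. It is only a diagnostic sketch, not part of tfds: the helper name find_undecodable_lines is made up for illustration, and the path to TAGS.txt inside your installation is something you would need to fill in yourself. It reads the file as raw bytes and reports every line that cannot be decoded with a given encoding.

```python
from pathlib import Path


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) for each line that fails to decode.

    Hypothetical helper for diagnosis only; it is not a tfds API.
    """
    bad = []
    # Read raw bytes so that no implicit decoding happens on open.
    raw = Path(path).read_bytes()
    for lineno, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            bad.append((lineno, line))
    return bad
```

Calling find_undecodable_lines(path_to_tags, "utf-8") shows whether the file itself contains invalid UTF-8. Calling it again with encoding=locale.getpreferredencoding(False) (after import locale) checks the same lines against the platform default encoding, which on Windows is often a legacy code page rather than UTF-8. If only the second call reports failures, the bytes may be valid UTF-8 that is being read with the wrong encoding; that is a guess on my part, not something confirmed above.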

Labels: bug