Commit 7968ef0

Nayef211 and Nayef Ahmed authored
Fix UTF8 decoding error in GPT2BPETokenizer decode method (#2092)
Summary:
- PyBind11 throws an error when decoding a C++ `std::string` which contains incomplete UTF8 byte sequences since the default UTF8 conversion uses `"strict"` error handling ([ref](https://docs.python.org/3/library/codecs.html#error-handlers))
- To resolve user issues (see [post](https://fb.workplace.com/groups/pytorchtext/permalink/899318121386487/)) we set the error handling to `"ignore"` which ignores the malformed data and continues decoding the string

Differential Revision: D43361716

fbshipit-source-id: 4ac488e4b4b894c8049728941a2ee36b1799258a

Co-authored-by: Nayef Ahmed <[email protected]>
1 parent 282f1b2 commit 7968ef0
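
The difference between the `"strict"` and `"ignore"` error handlers named in the summary can be seen from plain Python, without building the extension. The following is a minimal sketch; the byte string is an illustrative assumption (a valid UTF-8 string followed by the first byte of a two-byte sequence) and is not taken from the commit or the linked post.

```python
# Illustrative input only: "héllo" encoded as UTF-8, followed by 0xC3,
# which starts a two-byte sequence that is never completed.
data = "héllo".encode("utf-8") + b"\xc3"

# "strict" is the default handler and raises on malformed input.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" drops the malformed bytes and keeps decoding.
ignored = data.decode("utf-8", errors="ignore")

# "replace" is shown for comparison; it substitutes U+FFFD instead of dropping.
replaced = data.decode("utf-8", errors="replace")
```

Note the trade-off: with `"ignore"` the malformed bytes disappear silently, so the decoded string gives no indication that the input was incomplete.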

1 file changed: +10 -1 lines changed

torchtext/csrc/register_pybindings.cpp

@@ -179,7 +179,16 @@ PYBIND11_MODULE(_torchtext, m) {
       .def_property_readonly("byte_encoder_", &GPT2BPEEncoder::GetByteEncoder)
       .def("encode", &GPT2BPEEncoder::Encode)
       .def("tokenize", &GPT2BPEEncoder::Tokenize)
-      .def("decode", &GPT2BPEEncoder::Decode)
+      .def(
+          "decode",
+          [](const c10::intrusive_ptr<GPT2BPEEncoder>& self,
+             const std::vector<int64_t>& tokens) {
+            std::string s = self->Decode(tokens);
+            PyObject* py_obj =
+                PyUnicode_DecodeUTF8(s.data(), s.length(), "ignore");
+            py::str py_s = py::reinterpret_steal<py::str>(py_obj);
+            return py_s;
+          })
       .def(
           "add_special_tokens",
           [](const c10::intrusive_ptr<GPT2BPEEncoder>& self,
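
The C API call used in the new `decode` binding, `PyUnicode_DecodeUTF8`, can also be exercised directly from Python through `ctypes`, which shows what the `"ignore"` argument does to a truncated buffer. This is a sketch under assumptions: it relies on CPython (where `ctypes.pythonapi` is available), the helper name `decode_utf8` is hypothetical, and the input bytes are made up for illustration rather than produced by `GPT2BPEEncoder::Decode`.

```python
import ctypes

# Bind the same CPython C API function the binding calls.
# ctypes.pythonapi holds the GIL during the call and re-raises any
# Python exception set by the C function.
_decode = ctypes.pythonapi.PyUnicode_DecodeUTF8
_decode.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t, ctypes.c_char_p]
_decode.restype = ctypes.py_object


def decode_utf8(raw: bytes, errors: bytes) -> str:
    """Hypothetical helper: decode raw bytes with the given error handler."""
    return _decode(raw, len(raw), errors)


# Illustrative buffer: "caf" followed by the first byte of a two-byte sequence.
raw = b"caf\xc3"

lenient = decode_utf8(raw, b"ignore")

try:
    decode_utf8(raw, b"strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

In the binding itself the returned `PyObject*` is a new reference, which is why the diff wraps it with `py::reinterpret_steal<py::str>` rather than borrowing it.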