Fix UTF8 decoding error in GPT2BPETokenizer decode method (#2092)

Nayef211 · Nayef Ahmed · web-flow · commit 7968ef0605ed · 2023-03-02T10:31:34.000-05:00
Summary:
- PyBind11 throws an error when decoding a C++ `std::string` which contains incomplete UTF8 byte sequences since the default UTF8 conversion uses `"strict"` error handling ([ref](https://docs.python.org/3/library/codecs.html#error-handlers))
- To resolve user issues (see [post](https://fb.workplace.com/groups/pytorchtext/permalink/899318121386487/)) we set the error handling to `"ignore"` which ignores the malformed data and continues decoding the string

Differential Revision: D43361716

fbshipit-source-id: 4ac488e4b4b894c8049728941a2ee36b1799258a

Co-authored-by: Nayef Ahmed <nayef211@meta.com>
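
The difference between the two error handlers can be sketched in plain Python. This is an illustration only, not code from this change: the helper names `decode_strict` and `decode_ignore` are hypothetical, and the sample bytes stand in for what a byte-level BPE decode might return when a token sequence ends partway through a multi-byte character.

```python
# Illustrative sketch: "strict" vs "ignore" UTF-8 error handling.
# decode_strict / decode_ignore are hypothetical helpers, not torchtext APIs.

def decode_strict(raw: bytes) -> str:
    # Equivalent to the default conversion: raises on malformed input.
    return raw.decode("utf-8", errors="strict")


def decode_ignore(raw: bytes) -> str:
    # Drops malformed bytes and keeps decoding the rest.
    return raw.decode("utf-8", errors="ignore")


complete = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
truncated = complete[:2]            # b'h\xc3': cut in the middle of 'é'
broken_middle = b"h\xc3llo"         # lead byte with no continuation byte

try:
    decode_strict(truncated)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored_truncated = decode_ignore(truncated)   # 'h'
ignored_middle = decode_ignore(broken_middle)  # 'hllo'
```

Note that `"ignore"` silently discards the malformed bytes, so the decoded string can be shorter than the input and the loss is not reported to the caller.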
diff --git a/torchtext/csrc/register_pybindings.cpp b/torchtext/csrc/register_pybindings.cpp
@@ -179,7 +179,16 @@ PYBIND11_MODULE(_torchtext, m) {
       .def_property_readonly("byte_encoder_", &GPT2BPEEncoder::GetByteEncoder)
       .def("encode", &GPT2BPEEncoder::Encode)
       .def("tokenize", &GPT2BPEEncoder::Tokenize)
-      .def("decode", &GPT2BPEEncoder::Decode)
+      .def(
+          "decode",
+          [](const c10::intrusive_ptr<GPT2BPEEncoder>& self,
+             const std::vector<int64_t>& tokens) {
+            std::string s = self->Decode(tokens);
+            PyObject* py_obj =
+                PyUnicode_DecodeUTF8(s.data(), s.length(), "ignore");
+            if (!py_obj) throw py::error_already_set();
+            return py::reinterpret_steal<py::str>(py_obj);
+          })
       .def(
           "add_special_tokens",
           [](const c10::intrusive_ptr<GPT2BPEEncoder>& self,