On my local machine, the following code snippet hangs after printing out a few examples (and cannot be killed via a keyboard interrupt; it must be killed with SIGKILL):
```python
import tensorflow as tf
import tensorflow_text

with tf.io.gfile.GFile("gs://t5-data/vocabs/cc_all.32000/sentencepiece.model", "rb") as f:
    tokenizer = tensorflow_text.SentencepieceTokenizer(model=f.read())

ds = tf.data.Dataset.from_tensor_slices({"a": ["b"] * 10})
ds = ds.map(lambda ex: tokenizer.tokenize(ex["a"]), num_parallel_calls=tf.data.experimental.AUTOTUNE)

for ex in ds.as_numpy_iterator():
    print(ex)
```
This does not freeze:
```python
import tensorflow as tf
import tensorflow_text

with tf.io.gfile.GFile("gs://t5-data/vocabs/cc_all.32000/sentencepiece.model", "rb") as f:
    tokenizer = tensorflow_text.SentencepieceTokenizer(model=f.read())

ds = tf.data.Dataset.from_tensor_slices({"a": ["b"] * 10})
ds = ds.map(lambda ex: tokenizer.tokenize(ex["a"]))

for ex in ds.as_numpy_iterator():
    print(ex)
```
nor does this:
```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
ds = ds.map(lambda a: a + 2, num_parallel_calls=tf.data.experimental.AUTOTUNE)

for ex in ds.as_numpy_iterator():
    print(ex)
```
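A variant that may help narrow this down is the same pipeline shape with a *fixed* `num_parallel_calls` instead of `AUTOTUNE`. The sketch below uses `tf.strings.length` as a stand-in for `tokenize` so it runs without `tensorflow_text`; substituting the tokenizer back in would show whether `AUTOTUNE` specifically is required to trigger the hang, or any parallelism is enough:

```python
import tensorflow as tf

# Same pipeline shape as the snippets above, but with a fixed
# parallelism level instead of AUTOTUNE. tf.strings.length is a
# stand-in for the tokenizer so this stays self-contained.
ds = tf.data.Dataset.from_tensor_slices({"a": ["b"] * 10})
ds = ds.map(lambda ex: tf.strings.length(ex["a"]), num_parallel_calls=2)

for ex in ds.as_numpy_iterator():
    print(ex)
```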
It appears that there is some problematic interaction between setting `num_parallel_calls=tf.data.experimental.AUTOTUNE` in a `map` and `tensorflow_text.SentencepieceTokenizer.tokenize`. Note that this happens only on my local machine; it does not appear to reproduce in, e.g., a public Colab kernel. The Python environment I am using was created via pyenv, with Python 3.8.5 and tensorflow/tensorflow-text==2.3.0. Here is the output of `pip freeze`:
```
absl-py==0.10.0
appnope==0.1.0
argon2-cffi==20.1.0
astunparse==1.6.3
attrs==20.1.0
Babel==2.8.0
backcall==0.2.0
bleach==3.1.5
boto==2.49.0
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
click==7.1.2
decorator==4.4.2
defusedxml==0.6.0
dill==0.3.2
distro==1.5.0
dm-tree==0.1.5
entrypoints==0.3
filelock==3.0.12
flake8==3.8.3
future==0.18.2
gast==0.3.3
gevent==20.6.2
gin-config==0.3.0
google-api-core==1.22.1
google-api-python-client==1.10.0
google-auth==1.20.1
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.1
google-cloud-core==1.4.1
google-cloud-storage==1.30.0
google-compute-engine==2.8.13
google-crc32c==0.1.0
google-pasta==0.2.0
google-resumable-media==0.7.1
googleapis-common-protos==1.52.0
greenlet==0.4.16
grpcio==1.31.0
h5py==2.10.0
httplib2==0.18.1
idna==2.10
importlib-resources==3.0.0
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
jedi==0.17.2
Jinja2==2.11.2
joblib==0.16.0
json5==0.9.5
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
jupyterlab==2.2.5
jupyterlab-server==1.2.0
Keras-Preprocessing==1.1.2
Markdown==3.2.2
MarkupSafe==1.1.1
mccabe==0.6.1
mesh-tensorflow==0.1.16
mistune==0.8.4
nbconvert==5.6.1
nbformat==5.0.7
nltk==3.5
notebook==6.1.3
numpy==1.19.1
oauth2client==4.1.3
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.4
pandas==1.1.0
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
portalocker==2.0.0
prometheus-client==0.8.0
promise==2.3
prompt-toolkit==3.0.6
protobuf==3.13.0
ptyprocess==0.6.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.6.0
pycparser==2.20
pyflakes==2.2.0
Pygments==2.6.1
pyparsing==2.4.7
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
pyzmq==19.0.2
regex==2020.7.14
requests==2.24.0
requests-oauthlib==1.3.0
rouge-score==0.0.4
rsa==4.6
sacrebleu==1.4.13
sacremoses==0.0.43
scikit-learn==0.23.2
scipy==1.5.2
Send2Trash==1.5.0
sentencepiece==0.1.91
six==1.15.0
t5==0.6.4
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.0
tensorflow-datasets==3.2.1
tensorflow-estimator==2.3.0
tensorflow-metadata==0.23.0
tensorflow-text==2.3.0
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
tfds-nightly==3.2.1.dev202008200105
threadpoolctl==2.1.0
tokenizers==0.8.1rc1
torch==1.6.0
tornado==6.0.4
tqdm==4.48.2
traitlets==4.3.3
transformers==3.0.2
uritemplate==3.0.1
urllib3==1.25.10
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
wrapt==1.12.1
zope.event==4.4
zope.interface==5.1.0
```