08 Dec 05:54

broken

5e0be92

v2.4.0-rc1 Pre-release

Release 2.4.0-rc1

Major Features and Improvements

Windows support!
Released our first TF Hub module for Chinese segmentation! Please visit the hub module page for more info, including instructions on how to use the model.
Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The change is meant to prevent confusion when future splitting operations that use this interface do not tokenize into words (for example, splitting into sentences or subwords).
With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
Added normalize_utf8_with_offsets and find_source_offsets ops.
Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
Added string_to_id to SentencepieceTokenizer.
Support Android build.
RegexSplit op now caches regular expressions between calls.
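To make the Splitter / SplitterWithOffsets relationship and the "end" offset wording concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the class names mirror the release notes, but the method names, the return shapes, and the WhitespaceSplitter class are assumptions chosen for illustration, and offsets here are character indices rather than the byte positions a real op would likely report.

```python
import abc
from typing import List, Tuple


class Splitter(abc.ABC):
    """Hypothetical stand-in: anything that splits a string into pieces."""

    @abc.abstractmethod
    def split(self, text: str) -> List[str]:
        ...


class SplitterWithOffsets(Splitter):
    """Hypothetical stand-in: a splitter that also reports piece positions."""

    @abc.abstractmethod
    def split_with_offsets(
        self, text: str
    ) -> Tuple[List[str], List[int], List[int]]:
        """Returns (pieces, starts, ends); each end is an exclusive position."""

    def split(self, text: str) -> List[str]:
        pieces, _, _ = self.split_with_offsets(text)
        return pieces


class WhitespaceSplitter(SplitterWithOffsets):
    """Illustrative subclass (not from the release notes)."""

    def split_with_offsets(self, text):
        pieces, starts, ends = [], [], []
        start = None  # index where the current piece began, if any
        for i, ch in enumerate(text):
            if ch.isspace():
                if start is not None:
                    pieces.append(text[start:i])
                    starts.append(start)
                    ends.append(i)
                    start = None
            elif start is None:
                start = i
        if start is not None:  # flush the trailing piece
            pieces.append(text[start:])
            starts.append(start)
            ends.append(len(text))
        return pieces, starts, ends
```

In this sketch, each end value is a position in the input (so text[start:end] recovers the piece), not a length, which is the distinction the "limit" to "end" rename is meant to express.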

Bug Fixes and Other Changes

Add a minimal count_words function to wordpiece_vocabulary_learner.
Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
Add dep on tensorflow_hub in pip_package/setup.py
Add filegroup BUILD target for test_data segmentation Hub module.
Extend documentation for class HubModuleSplitter.
Read SP model file in bytes mode in tests.
Update intro.ipynb colab.
Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
Update StateBasedSentenceBreaker handling of text input tensors.
Reduce over-broad dependencies in regex_split library.
Fix broken builds.
Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
Update README regarding versions.
Fixed bug in WordpieceTokenizer so that the end offset is preserved when a long unknown token is found.
Convert non-tensor inputs in pad along dimension op.
Note in the build instructions that coreutils must be installed when building on MacOS.
Add long and long long overloads for RegexSplit so that the C++ API is TF-agnostic.
Add Splitter / SplitterWithOffsets abstract base classes.
Update setup.py: TensorFlow's default package now includes GPU support, and users must explicitly request the CPU-only package.
Change variable names for token offsets: "limit" -> "end".
Fix presubmit failure on MacOS.
Allow dense tensor inputs for RegexSplit.
Fix imports in tools/.
BertTokenizer: Error out if the user passes a normalization_form that will be ignored.
Update documentation for Sentencepiece.tokenize_with_offsets.
Let WordpieceTokenizer read vocabulary files.
Numerous build improvements / adjustments (mostly to support Windows):
- Patch out googletest & glog dependencies from Sentencepiece.
- Switch to using Bazel's internal patching.
- ICU data is built statically for Windows.
- Remove reliance on tf_kernel_library.
- Patch TF to fix problematic Python executable searching.
- Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin

18 Nov 08:28

broken

v2.4.0-rc0

e200a15

v2.4.0-rc0 Pre-release

Release 2.4.0-rc0

Major Features and Improvements

Released our first TF Hub module for Chinese segmentation! Please visit the hub module page for more info, including instructions on how to use the model.
Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The change is meant to prevent confusion when future splitting operations that use this interface do not tokenize into words (for example, splitting into sentences or subwords).
With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
Added normalize_utf8_with_offsets and find_source_offsets ops.
Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
Added string_to_id to SentencepieceTokenizer.
Support Android build.
Support Windows build (Py3.6 & Py3.7 this release).
RegexSplit op now caches regular expressions between calls.

Bug Fixes and Other Changes

Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
Add dep on tensorflow_hub in pip_package/setup.py
Add filegroup BUILD target for test_data segmentation Hub module.
Extend documentation for class HubModuleSplitter.
Read SP model file in bytes mode in tests.
Update intro.ipynb colab.
Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
Update StateBasedSentenceBreaker handling of text input tensors.
Reduce over-broad dependencies in regex_split library.
Fix broken builds.
Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
Update README regarding versions.
Fixed bug in WordpieceTokenizer so that the end offset is preserved when a long unknown token is found.
Convert non-tensor inputs in pad along dimension op.
Note in the build instructions that coreutils must be installed when building on MacOS.
Add long and long long overloads for RegexSplit so that the C++ API is TF-agnostic.
Add Splitter / SplitterWithOffsets abstract base classes.
Update setup.py: TensorFlow's default package now includes GPU support, and users must explicitly request the CPU-only package.
Change variable names for token offsets: "limit" -> "end".
Fix presubmit failure on MacOS.
Allow dense tensor inputs for RegexSplit.
Fix imports in tools/.
BertTokenizer: Error out if the user passes a normalization_form that will be ignored.
Update documentation for Sentencepiece.tokenize_with_offsets.
Let WordpieceTokenizer read vocabulary files.
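The WordpieceTokenizer offset fix listed above concerns what happens when a word is too long and is replaced by the unknown token. A rough pure-Python sketch of greedy longest-match-first wordpiece splitting with offsets shows the intended behavior. The function name, the parameter names, and the use of character counts (rather than bytes) are assumptions for illustration, not the library's kernel.

```python
from typing import List, Set, Tuple


def wordpiece_with_offsets(
    word: str,
    vocab: Set[str],
    unknown_token: str = "[UNK]",
    suffix_indicator: str = "##",
    max_chars_per_word: int = 100,
) -> Tuple[List[str], List[int], List[int]]:
    """Returns (pieces, starts, ends) for a single word."""
    # An over-long word becomes one unknown token that still spans the
    # whole word, so its end offset is len(word) rather than being lost.
    if len(word) > max_chars_per_word:
        return [unknown_token], [0], [len(word)]

    pieces, starts, ends = [], [], []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unknown_token], [0], [len(word)]
        pieces.append(match)
        starts.append(start)
        ends.append(end)
        start = end
    return pieces, starts, ends
```

With the vocabulary {"un", "##want", "##ed"}, the word "unwanted" splits into three pieces with ends 2, 6 and 8; with max_chars_per_word=4, the same word yields a single unknown token whose end offset is still 8.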

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin

23 Oct 00:49

broken

v2.4.0-b0

8621b71

2.4.0-b0 Pre-release

Release 2.4.0-b0

Please note that this is a pre-release and meant to run with TF v2.3.x. We wanted to give access to some of the features we were adding to 2.4.x, but did not want to wait for the TF release.

Major Features and Improvements

Released our first TF Hub module for Chinese segmentation! Please visit the hub module page for more info, including instructions on how to use the model.
Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The change is meant to prevent confusion when future splitting operations that use this interface do not tokenize into words (for example, splitting into sentences or subwords).
With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.

Bug Fixes and Other Changes

Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
Add dep on tensorflow_hub in pip_package/setup.py
Add filegroup BUILD target for test_data segmentation Hub module.
Extend documentation for class HubModuleSplitter.
Read SP model file in bytes mode in tests.

Thanks to our Contributors

28 Jul 22:33

broken

v2.3.0

fa898b7

2.3.0

Release 2.3.0

Major Features and Improvements

Added UnicodeCharacterTokenizer
Tokenizers are now tf.Modules and can be saved from within Keras layers.
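As a rough picture of what a character-level tokenizer produces, the following pure-Python sketch splits a string into Unicode codepoints and tracks offsets. It assumes tokens are integer codepoints and offsets are UTF-8 byte positions; the function name is hypothetical and this is not the UnicodeCharacterTokenizer implementation.

```python
from typing import List, Tuple


def char_tokenize_with_offsets(text: str) -> Tuple[List[int], List[int], List[int]]:
    """Returns (codepoints, byte_starts, byte_ends) for each character."""
    codepoints, starts, ends = [], [], []
    offset = 0  # running position in the UTF-8 encoding of text
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        codepoints.append(ord(ch))
        starts.append(offset)
        offset += width
        ends.append(offset)
    return codepoints, starts, ends
```

Multi-byte characters advance the offsets by more than one, so the end of the last character equals the byte length of the encoded string.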

Bug Fixes and Other Changes

Allow wordpiece_tokenizer to output int32 tokens natively.
Tracks the Sentencepiece model resource via a TrackableResource.
oss-segmenter:
- fix end-offset error in split_merge_tokenizer_kernel.
TensorFlow text python ops wordshape:
- More comprehensive emoji handling
Other:
- Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
- Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
- Add normalize kernels test.
- Fix Sentencepiece tests.
- Add some metric logs to tokenizers.
- Fix documentation formatting for SplitMergeTokenizer
- Bug fix: make sure tokenize() method does not ignore itself.
- Improve logging efficiency.
- Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by TensorFlow. Also added a tf.unicode_script test to ensure that ICU is working correctly from within model server.
- Add the ability to define a user-defined destination directory to make testing easier.
- Fix typo in documentation of BertTokenizer
- Clarify docstring of UnicodeScriptTokenizer about splitting on space
- Add executable flag to the run_build.sh script.
- Clarify docstring of WordpieceTokenizer on unknown_token:
- Update protobuf library and point HEAD to build on tf 2.3.0-rc0

Thanks to our Contributors

15 Jul 22:20

broken

v2.3.0-rc1

27eefc5

2.3.0-rc1 Pre-release

Release 2.3.0-rc1

Major Features and Improvements

Added UnicodeCharacterTokenizer

Bug Fixes and Other Changes

oss-segmenter:
- fix end-offset error in split_merge_tokenizer_kernel.
TensorFlow text python ops wordshape:
- More comprehensive emoji handling
Other:
- Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
- Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
- Add normalize kernels test.
- Add some metric logs to tokenizers.
- Fix documentation formatting for SplitMergeTokenizer
- Bug fix: make sure tokenize() method does not ignore itself.
- Improve logging efficiency.
- Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by TensorFlow. Also added a tf.unicode_script test to ensure that ICU is working correctly from within model server.
- Add the ability to define a user-defined destination directory to make testing easier.
- Fix typo in documentation of BertTokenizer
- Clarify docstring of UnicodeScriptTokenizer about splitting on space
- Add executable flag to the run_build.sh script.
- Clarify docstring of WordpieceTokenizer on unknown_token:
- Update protobuf library and point HEAD to build on tf 2.3.0-rc0

Thanks to our Contributors

04 Jun 20:51

broken

v2.2.1

c3e4b15

2.2.1

Release 2.2

Major Features and Improvements

Python 3.8 release builds added

Bug Fixes and Other Changes

Add backup storage locations for some dependencies.

11 May 18:20

gregbillock

v2.2.0

c5701eb

2.2.0 release

Release 2.2

Major Features and Improvements

Breaking Changes

Bug Fixes and Other Changes

Update version

Thanks to our Contributors

10 Apr 22:59

tf-text-github-robot

v2.2.0-rc2

72be3dd

v2.2.0-rc2 Pre-release

Bug fixes

Force MacOS builds to build for OSX 10.9 so they can be installed to a wider range of MacOS versions.

17 Mar 20:52

broken

v2.2.0-rc1

1a14cec

v2.2.0-rc1 Pre-release

Release 2.2.0-rc1

Major Features and Improvements

Add op for solving max-spanning-tree (MST) problems. The code here is intended for NLP applications, but attempts to remain agnostic to particular NLP tasks (such as dependency parsing).
Add max_spanning_tree_gradient.
Add support for 'preserve_unused_tokens' options in BertTokenizer.
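For readers unfamiliar with the max-spanning-tree problem in a dependency-parsing setting, here is a small brute-force sketch. It is deliberately not the algorithm used by the op (exhaustive search is exponential and only workable for a handful of tokens), and the score layout is an assumption made for this example: scores[m][h] is the score of token h being the head of token m, and scores[m][m] is the score of token m being the root.

```python
import itertools
from typing import List, Optional, Sequence, Tuple


def _is_tree(heads: Sequence[int]) -> bool:
    """True if following heads from every token ends at a root (no cycles)."""
    n = len(heads)
    for token in range(n):
        node, steps = token, 0
        while heads[node] != node:
            node = heads[node]
            steps += 1
            if steps > n:  # walked longer than any simple path: a cycle
                return False
    return True


def max_spanning_tree(
    scores: Sequence[Sequence[float]],
) -> Tuple[float, Optional[List[int]]]:
    """Exhaustively finds the best single-rooted tree of head assignments."""
    n = len(scores)
    best_score, best_heads = float("-inf"), None
    for heads in itertools.product(range(n), repeat=n):
        # Exactly one token may be its own head (the root).
        if sum(1 for m, h in enumerate(heads) if m == h) != 1:
            continue
        if not _is_tree(heads):
            continue
        total = sum(scores[m][h] for m, h in enumerate(heads))
        if total > best_score:
            best_score, best_heads = total, list(heads)
    return best_score, best_heads
```

The result is the total score together with one head index per token, where the root token points at itself.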

Bug Fixes and Other Changes

Documentation updates.
Reorganize the BUILD file for keras layers.
Update model server testing. The test script now generates a model that integrates into tf serving's testing infra.
Remove unneeded heavy dependencies in regex_split library.
Turn TF text's ConstrainedSequence implementations into standalone callable functions.
Fix bug in ViterbiAnalysis computation triggered when not using transition_weights.
Removing testing_utils run_tf_function which is enabled by default now.
Update patch params to work with Bazel >=1.0.0
Remove circular dependencies by removing submodule imports from ragged package.
Prevent lack of ragged_ops.py being released in TF from breaking tf.Text
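The ViterbiAnalysis fix above involves the case where no transition_weights are supplied. A compact pure-Python Viterbi sketch shows one reasonable way to treat that case, namely as all-zero transition weights. The function name and argument layout are assumptions for illustration and do not reflect the library's ConstrainedSequence code.

```python
from typing import List, Optional, Sequence


def viterbi(
    scores: Sequence[Sequence[float]],
    transition_weights: Optional[Sequence[Sequence[float]]] = None,
) -> List[int]:
    """Returns the highest-scoring state sequence.

    scores[t][s] is the score of state s at step t;
    transition_weights[p][s] is the score of moving from state p to state s.
    """
    num_states = len(scores[0])
    if transition_weights is None:
        # Without transition weights, every transition scores zero.
        transition_weights = [[0.0] * num_states for _ in range(num_states)]

    best = list(scores[0])  # best path score ending in each state
    backpointers = []
    for step_scores in scores[1:]:
        new_best, pointers = [], []
        for cur in range(num_states):
            prev = max(
                range(num_states),
                key=lambda p: best[p] + transition_weights[p][cur],
            )
            new_best.append(
                best[prev] + transition_weights[prev][cur] + step_scores[cur]
            )
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With no transition weights the path simply follows the best state at each step; adding a penalty for switching states can make a constant path win instead.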

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Hyunwoo Cho

01 Feb 02:19

broken

v2.1.1

ded2905

v2.1.1

Minor Updates

BertTokenizer now accepts a string tensor for the vocab_lookup_table.

Bug Fixes

Update ICU data name so as to not conflict with core TF in model server.

Releases: tensorflow/text
