On-the-fly Batch-Time Tokenization Release

Pre-release

@ohmeow ohmeow released this 25 Sep 22:47
· 358 commits to master since this release

This release simplifies the API and introduces a new on-the-fly tokenization feature, whereby all tokenization happens during mini-batch creation. There are several upsides to this approach. First, it gets you training faster. Second, it reduces RAM utilization while reading your raw data (especially helpful with very large datasets that can cause problems on platforms like Colab). Lastly, I believe the approach provides some flexibility to include data augmentation and/or build adversarial models, among other things.
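To illustrate the general idea, here is a minimal sketch of batch-time tokenization: the dataset keeps only raw strings in memory, and a collate function tokenizes each mini-batch as it is built. This is not the library's actual API; the names (`RawTextDataset`, `BatchTokenizeCollate`) are hypothetical, and the whitespace tokenizer is a stand-in for a real one such as a Hugging Face tokenizer.

```python
# Hypothetical sketch of on-the-fly (batch-time) tokenization.
# The dataset stores raw strings only; tokenization is deferred until
# a mini-batch is assembled, so no tokenized tensors sit in RAM up front.

class RawTextDataset:
    """Holds raw strings only -- nothing is tokenized at load time."""
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]  # raw text; tokenized later by the collate fn


def toy_tokenize(text, max_len=8):
    """Stand-in tokenizer: whitespace split, then pad/truncate to max_len."""
    ids = [hash(tok) % 30000 for tok in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))  # 0 acts as the pad id


class BatchTokenizeCollate:
    """Collate function that tokenizes a whole mini-batch at creation time.

    Because tokenization happens here, augmentations applied to the raw
    text (or adversarial perturbations) naturally take effect per batch.
    """
    def __init__(self, tokenize, max_len=8):
        self.tokenize, self.max_len = tokenize, max_len

    def __call__(self, raw_batch):
        return [self.tokenize(text, self.max_len) for text in raw_batch]


ds = RawTextDataset(["hello world", "on the fly tokenization is neat"])
collate = BatchTokenizeCollate(toy_tokenize)
batch = collate([ds[0], ds[1]])  # tokenization happens here, per batch
print(len(batch), len(batch[0]))
```

In practice you would pass such a collate function to a DataLoader, so each worker tokenizes only the batch it is currently producing rather than the whole corpus up front.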