On-the-fly Batch-Time Tokenization Release

Pre-release

@ohmeow ohmeow released this 25 Sep 22:47
· 358 commits to master since this release

This release simplifies the API and introduces a new on-the-fly tokenization feature, whereby all tokenization happens during mini-batch creation. There are several upsides to this approach. First, it gets you training faster. Second, it reduces RAM utilization while reading your raw data (especially helpful with very large datasets that can cause problems on platforms like Colab). Lastly, I believe the approach provides some flexibility to include data augmentation and/or build adversarial models, among other things.
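To illustrate the general idea, here is a minimal sketch of batch-time tokenization: the dataset keeps only raw strings in memory, and a collate function tokenizes each mini-batch as it is built. This is not the library's actual API; the names (`RawTextDataset`, `BatchTokenizeCollate`) are hypothetical, and the whitespace tokenizer is a stand-in for a real one such as a Hugging Face tokenizer.

```python
# Hypothetical sketch of on-the-fly (batch-time) tokenization.
# The dataset stores raw strings only; tokenization is deferred until
# a mini-batch is assembled, so no tokenized tensors sit in RAM up front.

class RawTextDataset:
    """Holds raw strings only -- nothing is tokenized at load time."""
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]  # raw text; tokenized later by the collate fn


def toy_tokenize(text, max_len=8):
    """Stand-in tokenizer: whitespace split, then pad/truncate to max_len."""
    ids = [hash(tok) % 30000 for tok in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))  # 0 acts as the pad id


class BatchTokenizeCollate:
    """Collate function that tokenizes a whole mini-batch at creation time.

    Because tokenization happens here, augmentations applied to the raw
    text (or adversarial perturbations) naturally take effect per batch.
    """
    def __init__(self, tokenize, max_len=8):
        self.tokenize, self.max_len = tokenize, max_len

    def __call__(self, raw_batch):
        return [self.tokenize(text, self.max_len) for text in raw_batch]


ds = RawTextDataset(["hello world", "on the fly tokenization is neat"])
collate = BatchTokenizeCollate(toy_tokenize)
batch = collate([ds[0], ds[1]])  # tokenization happens here, per batch
print(len(batch), len(batch[0]))
```

In practice you would pass such a collate function to a DataLoader, so each worker tokenizes only the batch it is currently producing rather than the whole corpus up front.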