Yoyodyne provides small-vocabulary sequence-to-sequence generation with and without feature conditioning.
These models are implemented using PyTorch and Lightning.
Yoyodyne is inspired by FairSeq (Ott et al. 2019) but differs on several key points of design:
- It is for small-vocabulary sequence-to-sequence generation, and therefore includes no affordances for machine translation or language modeling. Because of this:
    - The architectures provided are intended to be reasonably exhaustive.
    - There is little need for data preprocessing; it works with TSV files.
- It has support for using features to condition decoding, with architecture-specific code for handling feature information.
- It supports the use of validation accuracy (not just loss) for model selection and early stopping.
- Models are specified using YAML configuration files.
- Releases are made regularly and bugs addressed.
- It has exhaustive test suites.
- 🚧 UNDER CONSTRUCTION 🚧: It has performance benchmarks.
Yoyodyne was created by Adam Wiemerslage, Kyle Gorman, Travis M. Bartley, and other contributors like yourself.
To install Yoyodyne and its dependencies, run the following command:

```
pip install .
```

Then, optionally install additional dependencies for developers and testers:

```
pip install -r requirements.txt
```
Yoyodyne is also compatible with Google Colab GPU runtimes.
- Click "Runtime" > "Change runtime type".
- Under the "Hardware accelerator", select a "GPU", then click "Save".
- You may be prompted to delete the old runtime. Do so if you wish.
- Then install and run using the
!as a prefix to shell commands.
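A minimal sketch of what a Colab cell might look like, assuming the current release is installed from PyPI under the name yoyodyne (if you are working from a clone instead, adjust the install line accordingly):

```
!pip install yoyodyne
!yoyodyne fit --config config.yaml
```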
Yoyodyne uses YAML for configuration; see the example configuration files in the configs directory.
Yoyodyne supports OmegaConf's variable interpolation syntax, which is useful for linking hyperparameters, particularly for setting the hyperparameters of the source and/or features encoders in a way that is compatible with the outer-level model arguments for the decoder. For instance, if one wants to use the same hidden size for encoders and decoders, one can simply set one value and then use variable interpolation for the others, as in the following configuration snippet:
```yaml
...
model:
  init_args:
    ...
    decoder_hidden_size: 512
    source_encoder:
      init_args:
        hidden_size: ${model.init_args.decoder_hidden_size}
    features_encoder:
      init_args:
        hidden_size: ${model.init_args.decoder_hidden_size}
...
```
Occasionally one may wish to set one hyperparameter as some (non-identity) function of another. For example, if one is using a bidirectional RNN source encoder and a linear features encoder, the latter's output size must be set to twice the source encoder's hidden size. For this, Yoyodyne registers the `multiply` custom resolver, as shown in the following snippet:
```yaml
...
model:
  class_path: yoyodyne.models.SoftAttentionLSTMModel
  init_args:
    decoder_hidden_size: 512
    source_encoder:
      class_path: yoyodyne.models.modules.LSTMEncoder
      init_args:
        hidden_size: ${model.init_args.decoder_hidden_size}
    features_encoder:
      class_path: yoyodyne.models.modules.LinearEncoder
      init_args:
        hidden_size: ${multiply:${model.init_args.decoder_hidden_size}, 2}
...
```
Other custom resolvers can be registered in the `main` method if desired.
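As a minimal sketch, registering a resolver uses OmegaConf's register_new_resolver; the resolver name halve below is hypothetical, not one Yoyodyne provides:

```python
from omegaconf import OmegaConf

# Hypothetical resolver that halves an integer hyperparameter; it must be
# registered before the YAML configuration is parsed.
OmegaConf.register_new_resolver("halve", lambda x: int(x) // 2)
```

It could then be used in a configuration file as, e.g., `${halve:${model.init_args.decoder_hidden_size}}`.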
Yoyodyne operates on basic tab-separated values (TSV) data files. The user can specify source, features, and target columns, and separators used to parse them.
The default data format is a two-column TSV file in which the first column is the source string and the second the target string.
```
source	target
```
To enable the use of a features column, one specifies a (non-zero) `data: features_col:` argument, and optionally also a `data: features_sep:` argument (the default features separator is ";"). For instance, the SIGMORPHON 2016 shared task data takes the form:

```
source	feat1,feat2,...	target
```

so the format is specified as:

```yaml
...
data:
  ...
  features_col: 2
  features_sep: ","
  target_col: 3
...
```
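For concreteness, a row in this format might look like the following (an invented Spanish example in the spirit of the task data, not copied from it):

```
soñar	pos=V,mood=IND,tense=FUT,per=3,num=SG	soñará
```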
Alternatively, in the CoNLL-SIGMORPHON 2017 shared task data, the first column is the source (a lemma), the second is the target (the inflection), and the third contains semicolon-delimited features strings:

```
source	target	feat1;feat2;...
```

so the format is specified as simply:

```yaml
...
data:
  ...
  features_col: 3
...
```
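Again for concreteness, a row in this format might look like the following (an invented German example in the spirit of the task data, not copied from it):

```
aalen	aalte	V;IND;PST;3;SG
```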
Yoyodyne reserves symbols of the form `<...>` for internal use. Feature-conditioned models also use `[...]` to avoid clashes between features symbols and source and target symbols, and in some cases, `{...}` to avoid clashes between source and target symbols. Therefore, users should not provide any symbols of the form `<...>`, `[...]`, or `{...}`.
The `yoyodyne` command-line tool uses a subcommand interface, with four different modes. To see the full set of options available for each subcommand, use the `--print_config` flag. For example:

```
yoyodyne fit --print_config
```

will show all configuration options (and their default values) for the `fit` subcommand.
For more detailed examples, see the configs directory.
In fit mode, one trains a Yoyodyne model, either from scratch or, optionally,
resuming from a pre-existing checkpoint. Naturally, most configuration options
need to be set at training time. E.g., it is not possible to switch modules
after training a model.
This mode is invoked using the `fit` subcommand, like so:

```
yoyodyne fit --config path/to/config.yaml
```
Alternatively, one can resume training from a pre-existing checkpoint, so long as it matches the specification of the configuration file:

```
yoyodyne fit --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt
```
Setting the `seed_everything:` argument to some fixed value ensures a reproducible experiment (modulo hardware non-determinism).
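For instance, the following top-level snippet (a standard Lightning CLI key) fixes the seed at an arbitrary value:

```yaml
seed_everything: 49
```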
A specification for a model includes a specification of the overall architecture and, for most models, a specification of the source encoder. One may also specify a separate features encoder, or use `model: features_encoder: true` to indicate that the source and features encoders should share parameters, as in the sketch below.
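A minimal sketch of the shared-parameter option; this assumes the features_encoder flag is passed like the other model arguments, and the class paths shown are merely illustrative:

```yaml
model:
  class_path: yoyodyne.models.SoftAttentionLSTMModel
  init_args:
    source_encoder:
      class_path: yoyodyne.models.modules.LSTMEncoder
    features_encoder: true
```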
Each model exposes its own hyperparameters; consult the example configuration files and model docstrings for more information.
The following are general-purpose models:
- `yoyodyne.models.SoftAttentionGRUModel`: a GRU decoder with an attention mechanism; the initial hidden state is treated as a learned parameter. This is most commonly used with `yoyodyne.models.modules.GRUEncoder`s.
- `yoyodyne.models.SoftAttentionLSTMModel`: the same as `yoyodyne.models.SoftAttentionGRUModel` but with an LSTM decoder instead. This is most commonly used with `yoyodyne.models.modules.LSTMEncoder`s.
- `yoyodyne.models.TransformerModel`: a transformer decoder; sinusoidal positional encodings and layer normalization are used. This is most commonly used with `yoyodyne.models.modules.TransformerEncoder`s.
- `yoyodyne.models.CausalTransformerModel`: a transformer decoder without separate encoder modules, also known as a prefix LM.
The following models are particularly appropriate for when source and target share symbols:
- `yoyodyne.models.PointerGeneratorGRUModel`: a GRU decoder with a pointer-generator mechanism; the initial hidden state is treated as a learned parameter. This is most commonly used with `yoyodyne.models.modules.GRUEncoder`s.
- `yoyodyne.models.PointerGeneratorLSTMModel`: the same as `yoyodyne.models.PointerGeneratorGRUModel` but with an LSTM decoder instead. This is most commonly used with `yoyodyne.models.modules.LSTMEncoder`s.
- `yoyodyne.models.PointerGeneratorTransformerModel`: a transformer decoder with a pointer-generator mechanism. This is most commonly used with `yoyodyne.models.modules.TransformerEncoder`s.
The following models are particularly appropriate for transductions which are largely monotonic:
- `yoyodyne.models.HardAttentionGRUModel`: a GRU decoder which models generation as a Markov process. By default it assumes a non-monotonic progression over the source, but with `model: enforce_monotonic: true` the model is made to progress over each source character in linear order. By specifying `model: attention_context: 1` (or larger values) one can widen the context window for state transitions; see the sketch after this list. This is most commonly used with `yoyodyne.models.modules.GRUEncoder`s.
- `yoyodyne.models.HardAttentionLSTMModel`: the same as `yoyodyne.models.HardAttentionGRUModel` but with an LSTM decoder instead. This is most commonly used with `yoyodyne.models.modules.LSTMEncoder`s.
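A minimal configuration sketch for the monotonicity and context options just described, assuming they are passed as ordinary model arguments like the other hyperparameters:

```yaml
model:
  class_path: yoyodyne.models.HardAttentionGRUModel
  init_args:
    enforce_monotonic: true
    attention_context: 1
```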
The following models are also appropriate for transductions which are largely monotonic, but require additional precomputation with the maxwell library:

- `yoyodyne.models.TransducerGRU`: a GRU decoder with a neural transducer mechanism trained with imitation learning. This is most commonly used with `yoyodyne.models.modules.GRUEncoder`s.
- `yoyodyne.models.TransducerLSTM`: the same as `yoyodyne.models.TransducerGRU` but with an LSTM decoder instead. This is most commonly used with `yoyodyne.models.modules.LSTMEncoder`s.
The following models are not recommended for most users; they generally perform poorly and are present only for historical or testing reasons:
- `yoyodyne.models.GRUModel`: a GRU decoder which uses the last non-padding hidden state(s) of the encoder(s) in lieu of attention; the initial hidden state is treated as a learned parameter. This is most commonly used with `yoyodyne.models.modules.GRUEncoder`s.
- `yoyodyne.models.LSTMModel`: the same as `yoyodyne.models.GRUModel` but with an LSTM decoder instead. This is most commonly used with `yoyodyne.models.modules.LSTMEncoder`s.
In RNN (e.g., GRU and LSTM) models and modules, information is passed between adjacent source, features, and target symbols, providing a sort of inductive bias towards locality. In contrast, transformer models and modules are in some sense global, and any bias towards locality must be injected via positional encodings.
For core transformer modules (including causal and pointer-generator variants), the user can specify the following positional encodings:
- `yoyodyne.models.modules.AbsolutePositionalEncoding`: a trainable positional encoding scheme with a unique representation for each position $i$ in a sequence.
- `yoyodyne.models.modules.NullPositionalEncoding`: this dummy module disables positional encoding; it has no parameters.
- `yoyodyne.models.modules.SinusodialPositionalEncoding`: a parameter-free (i.e., non-trainable) positional encoding; this is the default for most modules.
The following snippet, for example, enables absolute positional encoding for the source encoder and decoder of a transformer model:
```yaml
model:
  class_path: yoyodyne.models.TransformerModel
  init_args:
    source_encoder:
      class_path: yoyodyne.models.modules.TransformerEncoder
      init_args:
        positional_encoding:
          class_path: yoyodyne.models.modules.AbsolutePositionalEncoding
    decoder_positional_encoding:
      class_path: yoyodyne.models.modules.AbsolutePositionalEncoding
```
There is one additional positional encoding option: variants of the core transformer models and modules support rotary positional encoding (RoPE). RoPE is implemented as a variant form of multihead attention deep within the transformer model and cannot be selected using the `positional_encoding` or `decoder_positional_encoding` arguments. Rather, it gives rise to the following models and modules:

- `yoyodyne.models.RotaryCausalTransformerModel`
- `yoyodyne.models.RotaryPointerGeneratorTransformerModel`
- `yoyodyne.models.RotaryTransformerModel`
- `yoyodyne.models.modules.RotaryCausalTransformerDecoder`
- `yoyodyne.models.modules.RotaryFeatureInvariantTransformerEncoder`
- `yoyodyne.models.modules.RotaryPointerGeneratorTransformerDecoder`
- `yoyodyne.models.modules.RotaryTransformerDecoder`
- `yoyodyne.models.modules.RotaryTransformerEncoder`
Mixing rotary and non-rotary positional encodings within a single model is not recommended.
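Thus RoPE is selected by naming the rotary classes directly; a minimal sketch (with nesting as in the other model snippets above):

```yaml
model:
  class_path: yoyodyne.models.RotaryTransformerModel
  init_args:
    source_encoder:
      class_path: yoyodyne.models.modules.RotaryTransformerEncoder
```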
Yoyodyne requires an optimizer and a learning rate scheduler. The default optimizer is `yoyodyne.optimizers.Adam`, and the default scheduler is `yoyodyne.schedulers.Dummy`, which keeps the learning rate fixed at its initial value and takes no explicit configuration arguments.
The following YAML snippet shows the use of the Adam optimizer with a non-default initial learning rate and the `yoyodyne.schedulers.WarmupInverseSquareRoot` LR scheduler:

```yaml
...
model:
  ...
  optimizer:
    class_path: yoyodyne.optimizers.Adam
    init_args:
      lr: 1.0e-5
      beta2: 0.9
  scheduler:
    class_path: yoyodyne.schedulers.WarmupInverseSquareRoot
    init_args:
      warmup_epochs: 10
...
```
The ModelCheckpoint callback is used to control the generation of checkpoint files. A sample YAML snippet is given below:
```yaml
...
checkpoint:
  filename: "model-{epoch:03d}-{val_accuracy:.4f}"
  mode: max
  monitor: val_accuracy
  verbose: true
...
```
Alternatively, one can specify checkpointing that minimizes validation loss, as follows:
```yaml
...
checkpoint:
  filename: "model-{epoch:03d}-{val_loss:.4f}"
  mode: min
  monitor: val_loss
  verbose: true
...
```
A checkpoint config must be specified or Yoyodyne will not generate any checkpoints.
The user will likely want to configure additional callbacks. Some useful examples are given below.
The LearningRateMonitor callback records learning rates:

```yaml
...
trainer:
  callbacks:
    - class_path: lightning.pytorch.callbacks.LearningRateMonitor
      init_args:
        logging_interval: epoch
...
```
The EarlyStopping callback enables early stopping based on a monitored quantity and a fixed patience:

```yaml
...
trainer:
  callbacks:
    - class_path: lightning.pytorch.callbacks.EarlyStopping
      init_args:
        monitor: val_loss
        patience: 10
        verbose: true
...
```
By default, Yoyodyne performs some minimal logging to standard error and uses progress bars to keep track of progress during each epoch. However, one can enable additional logging facilities during training, using a syntax similar to the one shown above for callbacks.
The CSVLogger is enabled by default, and logs all monitored quantities to a CSV file.
The WandbLogger works similarly to the CSVLogger, but sends the data to the third-party website Weights & Biases, where it can be used to generate charts or share artifacts:

```yaml
...
trainer:
  logger:
    - class_path: lightning.pytorch.loggers.WandbLogger
      init_args:
        project: unit1
        save_dir: /Users/Shinji/models
...
```
Note that this functionality requires a working account with Weights & Biases.
Dropout probability and/or label smoothing are specified as arguments to the model and its encoders:

```yaml
...
model:
  init_args:
    source_encoder:
      class_path: ...
      init_args:
        ...
        dropout: 0.5
    decoder_dropout: 0.5
    label_smoothing: 0.1
...
```
Batch size is specified using `data: batch_size: ...` and defaults to 32.
By default, the source and target vocabularies share embeddings, so identical source and target symbols will have the same embedding. This can be disabled with `data: tie_embeddings: false`.
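For instance, the following sketch combines both options (both keys sit directly under data:, as indicated above; the batch size here is arbitrary):

```yaml
data:
  batch_size: 64
  tie_embeddings: false
```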
By default, training uses 32-bit precision. However, the `trainer: precision:` flag allows the user to perform training with half precision (16) or with mixed-precision formats like bf16-mixed, if supported by the accelerator. This may reduce the size of the model and batches in memory, allowing one to use larger batches, or it may simply provide small speed-ups.
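For instance (this is the standard Lightning trainer key; bf16-mixed requires hardware support):

```yaml
trainer:
  precision: bf16-mixed
```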
There are a number of ways to specify how long a model should train for. For example, the following YAML snippet specifies that training should run for 100 epochs or 6 wall-clock hours, whichever comes first:
```yaml
...
trainer:
  max_epochs: 100
  max_time: 00:06:00:00
...
```
In validation mode, one runs the validation step over labeled validation data (specified as `data: val: path/to/validation.tsv`) using a previously trained checkpoint (`--ckpt_path path/to/checkpoint.ckpt` from the command line), recording loss and other statistics for the validation set. In practice this is mostly useful for debugging.
This mode is invoked using the `validate` subcommand, like so:

```
yoyodyne validate --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt
```
In test mode, one computes accuracy over held-out test data (specified as `data: test: path/to/test.tsv`) using a previously trained checkpoint (`--ckpt_path path/to/checkpoint.ckpt` from the command line); it differs from validation mode in that it uses the test file rather than the validation file.
This mode is invoked using the `test` subcommand, like so:

```
yoyodyne test --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt
```
In predict mode, a previously trained model checkpoint (`--ckpt_path path/to/checkpoint.ckpt` from the command line) is used to label an input file. One must also specify the path where the predictions will be written:

```yaml
...
predict:
  path: path/to/predictions.txt
...
```

This mode is invoked using the `predict` subcommand, like so:

```
yoyodyne predict --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt
```
The examples directory contains interesting examples, including:

- `concatenate` provides sample code for concatenating source and features symbols à la Kann & Schütze (2016).
- `wandb_sweeps` shows how to use Weights & Biases to run hyperparameter sweeps.
- Maxwell is used to learn a stochastic edit distance model for the transducer models.
- Yoyodyne Pretrained provides a similar interface but uses large pre-trained models to initialize the encoder and decoder modules.
Yoyodyne is distributed under an Apache 2.0 license.
We welcome contributions using the fork-and-pull model.
In addition to releases available via GitHub and PyPI, the 0.3.3 version is available as the `legacy` branch.
Yoyodyne is beholden to the heavily object-oriented design of Lightning, and wherever possible uses Torch to keep computations on the user-selected accelerator. Furthermore, since it is developed at "low-intensity" by a geographically-dispersed team, consistency is particularly important. Some consistency decisions made thus far:
- Abstract class overrides are enforced using PEP 3119.
A model in Yoyodyne is a sequence-to-sequence architecture and inherits from
yoyodyne.models.BaseModel. These models in turn consist of ("have-a") one or
more encoders responsible for encoding the source (and features, where
appropriate), and a decoder responsible for predicting the target sequence
using the representation generated by the encoders. The encoders and decoder are
themselves Torch modules.
The model is responsible for constructing the encoders and decoders. The model
dictates the type of decoder. The model communicates with its modules by calling
them as functions (which invokes their forward methods); however, in some
cases it is also necessary for the model to call ancillary members or methods of
its modules.
When features are present, models are responsible for fusing the source and features encodings, and do so in a model-specific fashion. For example, ordinary RNNs and transformers concatenate the source and features encodings along the length dimension (and thus require that the encodings be the same size), whereas hard attention and transducer models average the features encoding across the length dimension and then concatenate the resulting tensor with the source encoding along the encoding dimension; by doing so they preserve the source length and make it impossible to attend directly to features symbols.
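As a schematic illustration of the two fusion strategies (this is not Yoyodyne's actual code; the shapes and names are invented):

```python
import torch

# Invented shapes: batch=8, source length=12, features length=3, hidden=64.
source = torch.randn(8, 12, 64)    # source encoding
features = torch.randn(8, 3, 64)   # features encoding

# RNN/transformer-style fusion: concatenate along the length dimension; the
# encodings must have the same hidden size, and the fused length is 12 + 3,
# so the decoder can attend directly to the features positions.
fused = torch.cat([source, features], dim=1)  # 8 x 15 x 64

# Hard-attention/transducer-style fusion: average the features encoding
# across the length dimension, then concatenate with the source encoding
# along the encoding dimension; the source length (12) is preserved, so no
# features positions remain to be attended to.
pooled = features.mean(dim=1, keepdim=True).expand(-1, source.size(1), -1)
fused = torch.cat([source, pooled], dim=2)  # 8 x 12 x 128
```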
Each model supports greedy decoding, implemented via a `greedy_decode` method; many models (vanilla RNNs, pointer-generator RNNs, and all transformers) support beam search during prediction (though not during training, validation, or testing) via a `beam_decode` method. Beam search decoding is enabled by setting `beam_width` to some value greater than 1; `batch_size` must also be set to 1:
```yaml
...
data:
  ...
  batch_size: 1
  ...
model:
  class_path: yoyodyne.models.SoftAttentionLSTMModel
  init_args:
    ...
    beam_width: 5
...
prediction:
  path: /Users/Shinji/predictions.tsv
...
```
The resulting predictions file will be a 10-column TSV file consisting of the top 5 target hypotheses and their log-likelihoods (collated together), rather than a one-column text file containing just the top hypothesis.
Some models can only be trained with teacher forcing, but others can also be trained with student forcing by setting `model: teacher_forcing: false`. When using student forcing with transformer models, one should set `data: max_target_length: ...` to a value appropriate for the data, to avoid unnecessary attention computations, which are quadratic in the maximum target length.
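For instance, the following sketch combines the two settings just named (the maximum target length here is arbitrary):

```yaml
model:
  init_args:
    teacher_forcing: false
data:
  max_target_length: 128
```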
The "units" of tests/yoyodyne_test.py are
essentially small integration tests running through training, prediction, and
evaluation.
There are two kinds of data sets here. "Toy" data sets consist of simple transductions over a small alphabet:

- `copy` (i.e., repeat the input string twice)
- `identity`
- `reverse`
- `upper` (i.e., map to uppercase)
These are configured to train for 20 epochs, training for no more than 2 minutes.
In contrast, the two "real" data sets target existing problems:
- `ice_g2p`: Icelandic G2P data from the 2021 SIGMORPHON shared task
- `tur_inflection`: Turkish inflection generation data from the CoNLL-SIGMORPHON 2017 shared task
These are instead configured to train for up to 50 epochs (with early stopping), training for no more than 10 minutes.
There are also a few tests which confirm that specific misconfigurations raise exceptions.
To run all tests, run the following:
```
pytest -vvv tests
```
Given this large number of units and the allotted amount of training time, which accounts for the vast majority of compute time, running the full set of tests could take as long as a few hours. Thus one may wish instead to specify a subset of tests using the `-k` flag. For example, to run all the "toy" tests, run the following:

```
pytest -vvv -k toy tests
```

Or, to just run the Icelandic G2P tests, run the following:

```
pytest -vvv -k g2p tests
```

Or, to just run the misconfiguration tests, run the following:

```
pytest -vvv -k misconfiguration tests
```
See the pytest
documentation for more
information on the test runner.
- Create a new branch; e.g., if you want to call this branch "release": `git checkout -b release`
- Sync your fork's branch to the upstream master branch; e.g., if the upstream remote is called "upstream": `git pull upstream master`
- Increment the version field in `pyproject.toml`.
- Stage your changes: `git add pyproject.toml`
- Commit your changes: `git commit -m "your commit message here"`
- Push your changes; e.g., if your branch is called "release": `git push origin release`
- Submit a PR for your release and wait for it to be merged into `master`.
- Tag the `master` branch's last commit. The tag should begin with `v`; e.g., if the new version is 3.1.4, the tag should be `v3.1.4`. This can be done:
    - on GitHub itself: click the "Releases" (or "Create a new release") link on the right-hand side of the Yoyodyne GitHub page and follow the dialogues, or
    - from the command line using `git tag`.
- Build the new release: `python -m build`
- Upload the result to PyPI: `twine upload dist/*`
Kann, K. and Schütze, H. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555-560.
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. 2019. fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48-53.
(See also yoyodyne.bib for more work used during the
development of this library.)