There are various ways one can provide "multiple sources" as inputs. For instance, you can use "tags" to separate out the pieces, or you can encode it into the source symbols ("I'm a g from source 1" vs. "I'm a g from source 2" via g_1 or g_2), or in Yoyodyne, if you have exactly two sources, the second one can be provided as the feature column. All of these can be done basically by hacking around on your TSV files. But there's one other way you might do this that can't be done in this fashion: appending "source embeddings" to each symbol in the source tape.
Formally, let us suppose that the encoder produces "contextual embeddings" or "encodings" of each symbol, giving us a tensor of shape $B \times L \times H$ where $B$ is the batch size, $L$ the max length of the batch, and $H$ the size of the hidden layer. (I put aside things like bidirectionality or multiple hidden layers for simplicity.) A source embedding is a learned $S \times D$ embedding table where $S$ is the number of sources and $D$ is the size of the embedding. E.g., if there are two sources and the size is 32, it'll be $2 \times 32$. An encoding augmented with source embeddings will then be of size $B \times L \times (H + D)$, with the source embedding of the corresponding symbol "appended" to the bottom of the size-$H$ encoding.
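Written out per position, the augmentation might look like the following. The notation is my own assumption for illustration: $e_{b,l} \in \mathbb{R}^{H}$ is the encoding of symbol $l$ in batch element $b$, $s_{b,l} \in \{0, \ldots, S - 1\}$ is its source index, and $E \in \mathbb{R}^{S \times D}$ is the source embedding table.

$$\tilde{e}_{b,l} = [\, e_{b,l} \,;\, E_{s_{b,l}} \,] \in \mathbb{R}^{H + D}$$

With $H = 64$, $S = 2$, and $D = 32$, for example, each augmented encoding would have $64 + 32 = 96$ dimensions.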
Compared to the tag system, this is more explicit, and feels like it might be a stronger inductive bias because the system doesn't have to learn to pass the tags through or attend to them (depending on the architecture).
Implementationally, one could imagine that the source indices tensor (0 if the symbol is from source 1, 1 if it's from source 2, etc.), of shape $B \times L$, is produced during the data loading phase and is attached to the batch object. Then, the encoder is run as before, producing the encoding of shape $B \times L \times H$. Then, if a source indices tensor is present in the batch, it is used to retrieve a source embedding tensor of shape $B \times L \times D$, and the encoding and source embedding tensors are concatenated along the last dimension. This could all live within the base model, I think, with a callback-like interface: if there is a source indices tensor in the batch and a source embedding table in the model, look up the source embeddings and concatenate them.
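The lookup-and-concatenate step could be sketched in PyTorch roughly as follows. This is a minimal sketch, not Yoyodyne code: the class name `SourceEmbeddingConcatenator` and the argument names are hypothetical, and the real thing would presumably hang off the base model and the batch object instead.

```python
import torch
from torch import nn


class SourceEmbeddingConcatenator(nn.Module):
    """Hypothetical helper (not part of Yoyodyne) that appends source
    embeddings to encoder outputs."""

    def __init__(self, num_sources: int, embedding_size: int = 32):
        super().__init__()
        # Learned S x D source embedding table.
        self.embedding = nn.Embedding(num_sources, embedding_size)

    def forward(self, encoded, source_indices=None):
        # encoded: B x L x H; source_indices: B x L (integer tensor) or None.
        if source_indices is None:
            # No source indices in the batch: leave the encoding untouched.
            return encoded
        source_embedded = self.embedding(source_indices)  # B x L x D.
        # Concatenates along the last dimension, giving B x L x (H + D).
        return torch.cat((encoded, source_embedded), dim=-1)


# Stand-in for the encoder output, with B = 4, L = 7, H = 64.
encoded = torch.randn(4, 7, 64)
# Stand-in for the source indices the data module would produce.
source_indices = torch.randint(0, 2, (4, 7))
concatenator = SourceEmbeddingConcatenator(num_sources=2, embedding_size=32)
augmented = concatenator(encoded, source_indices)
print(augmented.shape)
```

Note that whatever consumes the encoding downstream (attention, the decoder) would then have to expect inputs of size $H + D$ rather than $H$, and that padding positions also need some source index for the lookup to succeed.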
Interface-wise, I would suggest that something like `--source_col 1,2,5` means that there are three sources, in columns 1, 2, and 5 of the TSV files. When more than one source column is specified in this manner, the data module ought to 1) concatenate the streams of source symbols together and 2) generate source indices tensors, and the model ought to 3) generate a source embedding table. To size the source embedding table we should add a flag `--source_embedding_size` and set the default to something small, like 16 or 32. There may be model classes this whole thing won't work with, and they should just throw an informative exception to that effect.
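To make the data-module side concrete, here is a rough sketch of parsing the proposed flag and building the concatenated source with its source indices. Everything here is an assumption for illustration: the helper names `parse_source_cols` and `concatenate_sources` are hypothetical, the comma-separated form of the flag is the proposal above and not something that exists today, and I treat each character as one symbol.

```python
import argparse
from typing import List, Tuple


def parse_source_cols(value: str) -> List[int]:
    """Parses a comma-separated list of 1-based column numbers."""
    return [int(col) for col in value.split(",")]


def concatenate_sources(
    row: List[str], source_cols: List[int]
) -> Tuple[List[str], List[int]]:
    """Concatenates the source columns of one TSV row.

    Returns the concatenated symbols and a parallel list giving the
    0-based source index of each symbol.
    """
    symbols: List[str] = []
    source_indices: List[int] = []
    for source_index, col in enumerate(source_cols):
        # Columns are 1-based; each character is treated as one symbol.
        for symbol in row[col - 1]:
            symbols.append(symbol)
            source_indices.append(source_index)
    return symbols, source_indices


parser = argparse.ArgumentParser()
# Hypothetical flags mirroring the proposal; defaults are placeholders.
parser.add_argument("--source_col", type=parse_source_cols, default=[1])
parser.add_argument("--source_embedding_size", type=int, default=32)
args = parser.parse_args(["--source_col", "1,2,5"])

row = "ab\tcd\tx\ty\te".split("\t")
symbols, source_indices = concatenate_sources(row, args.source_col)
print(symbols)         # ['a', 'b', 'c', 'd', 'e']
print(source_indices)  # [0, 0, 1, 1, 2]
```

The number of sources the model needs for its embedding table is then just `len(args.source_col)`, and the per-example index lists would be padded and stacked into the $B \times L$ source indices tensor at collation time.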
There is another case where one might want concatenation of other embeddings: in true polyglot models where there is a single source and target for each example, but these differ in what "language" they are from. This can't be seen as a special case of the above design as far as I can tell, because there aren't really multiple source columns, or if there are they may be "ragged" rather than a fixed number of sources, and you may also want to have a separate "target embedding" and concatenate it to the decoder outputs as in this, section 2.4. So that's an issue for another time.
(I hate to submit feature request issues without a serious intent to take care of them any time soon, but I do have a student interested in this, and if he decides to go through with it I'll assign it to him.)