
Added the wrapper Bidirectional for RNN layers #1790

Open · wants to merge 9 commits into base: master
Conversation

dcecchini (Author) opened this pull request:

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • API changes require approval from a committer (different from the author, if applicable)

Description

The implementation is based on the BiLSTM_CNN_CRF_Model present in TextModels.jl, but has a few differences:

  • It keeps the Bidirectional usage closer to Keras, as it allows the creation of, for example, Bidirectional(Flux.LSTM, in, out) inside a Chain declaration.
  • It explicitly adds the dims=1 parameter to the reverse function, which was needed in my experiments, but a more generic implementation may be possible (I was not able to use flip, probably for the same reason).

I also updated the Base.show and _big_show functions so that the layer is printed in the same style as Flux, showing the number of trainable parameters. The implementation added for _big_show could be moved to src/show.jl later.

This is my first PR here; I read the information in CONTRIBUTING.md, but please let me know if it does not follow the correct standards. I really enjoy Julia even though I am still learning.

I hope I can contribute to this great project.
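For illustration, a minimal usage sketch of the Chain usage mentioned in the first bullet above; the layer sizes and the vocab_size/num_tags values below are made up for this example:

using Flux

vocab_size, num_tags = 100, 9                   # hypothetical sizes

model = Chain(
    Bidirectional(Flux.LSTM, vocab_size, 32),   # forward + backward LSTM, outputs concatenated
    Dense(64, num_tags),                        # 2 × 32 features after concatenation
    softmax,
)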


# Concatenate the forward and reversed backward weights
function (m::Bidirectional)(x::Union{AbstractVecOrMat{T},OneHotArray}) where {T}
    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=1)); dims=1))
ToucheSir (Member) commented on Nov 28, 2021:
Suggested change:
- return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=1)); dims=1))
+ return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=3)); dims=3))

Sorry, just found this. When applying Flux RNNs to dense sequences, the temporal dim is actually the last one. See `function (m::Recur)(x::AbstractArray{T, 3}) where T` and "Folding over a 3d Array of dimensions `(features, batch, time)` is also supported".

dcecchini (Author) replied:

I'm quite confused now because I was expecting the variable x to be the result of a OneHotMatrix operation on a sentence, and therefore a matrix. In my experiments, I was using an array of one-hot encoded sentences (padded to a fixed size) like this:

X_onehot = [OneHotMatrix(x, vocab_size) for x in X]
y_onehot = [OneHotMatrix(x, num_tags) for x in y]

where X is an array of padded sentences and y contains the corresponding labels of each word in the sentence (I am experimenting with Named Entity Recognition models). So the input data would be (batch, (time, features)) and not in the (features, batch, time) format.

I am not sure how to proceed.

ToucheSir (Member) replied:

https://fluxml.ai/Flux.jl/stable/models/recurrence/#Working-with-sequences has most of the nitty-gritty details about working with batched sequences for RNNs. In short, the only supported formats are ((features, batch), time) and (features, batch, time). Unlike Python frameworks, Flux puts the batch dim last for all layers because of column major layout. RNNs just add another time dim after that.

Since this definition of Bidirectional takes a contiguous array instead of a vector of arrays, m.forward() and m.backward() dispatch to (m::Recur)(x::AbstractArray{T, 3}). To support both, you'd need something like the following (note: untested!):

  1. (m::Bidirectional)(x::AbstractArray{T, 3}) where T for (features, batch, time)
  2. (m::Bidirectional)(x::Vector{<:AbstractVecOrMat{T}}) where T for ((features, batch), time)
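For reference, a hedged sketch of what method 1 could look like, combining the dims=3 suggestion above with this 3-d signature (untested, assuming Bidirectional stores forward and backward recurrent layers):

# Method 1: dense (features, batch, time) input; reverse along the time (3rd) dim
function (m::Bidirectional)(x::AbstractArray{T, 3}) where {T}
    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=3)); dims=3))
end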

mcabbott (Member) commented on Nov 29, 2021:
But the RNN layers do not literally accept a vector of matrices; at the link, this is iterated through by hand. Should method 2 of Bidirectional(fwd, rev)(x) handle that?

julia> LSTM(3,5)(randn(Float32, 3,)) |> size
(5,)

julia> LSTM(3,5)(randn(Float32, 3,7)) |> size
(5, 7)

julia> LSTM(3,5)(randn(Float32, 3,7,11)) |> size
(5, 7, 11)

julia> LSTM(3,5)([randn(Float32, 3,7) for _ in 1:11]) |> size
ERROR: MethodError

julia> function (b::Bidirectional)(xs::AbstractVector{<:AbstractVecOrMat})
         top = [b.forward(x) for x in xs]
         bot = reverse([b.backward(x) for x in reverse(xs)])
         vcat.(top, bot)
       end

ToucheSir (Member) replied:

Should method 2 of Bidirectional(fwd, rev)(x) handle that?

Yes, exactly.

mcabbott (Member) replied:

And applied to randn(Float32, 3,7), should it fail, or still reverse in the 3rd dimension?

ToucheSir (Member) replied:

The signatures specifically prohibit that, so it should fail. Did you have a specific edge case in mind that isn't caught by the above?

mcabbott (Member) replied:

No, I just wonder about the mismatch between what LSTM accepts and what this accepts. Prohibiting things which don't make sense is good; you want to know soon.

ToucheSir (Member) replied:

Right, the RNN layers can work with single timesteps but Bidirectional can't, as it needs to see the whole sequence up-front in order to reverse it. If anything the former should be brought in line with the latter, but that's a conversation for another issue (#1678 probably).

dcecchini (Author) replied:

Thank you for the great conversation and the suggestions. I just tested, and Bidirectional, as it is, is able to process randn(Float32, 3, 7):

Bidirectional(LSTM, 3, 5)(randn(Float32, 3, 7)) |> size
(10, 7)

But about the format, I am still confused about how to preprocess the text data so that it ends up in the (seq_length, (features, samples)) format. It seems counterintuitive to me.

I usually follow: read the data -> split sentences -> split words -> pad -> one-hot. So my data would be an array with N sentences, where every sentence is described as a one-hot matrix of its words. In this way, I end up with the (samples, (features, seq_length)) format. How should I preprocess the text data so that I end up with (seq_length, (features, samples))?

Also, by checking these formats I discovered that I should probably use dims=2 in the reverse call in my formulation (not the default that reverses all dimensions: it should reverse only the time dimension, which in my case is the second dimension of the one-hot matrix).
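As a quick illustration of that point (using a plain matrix in place of the one-hot matrix): reversing a (features, time) matrix along dims=2 flips the time steps, while dims=1 would flip the features:

x = Float32[1 2 3;
            4 5 6]     # (features, time) = (2, 3)

reverse(x; dims=2)     # gives [3 2 1; 6 5 4]: time steps reversed
reverse(x; dims=1)     # gives [4 5 6; 1 2 3]: features reversed instead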

ToucheSir (Member) commented on Nov 28, 2021:

@dcecchini do you have local (behavioural) tests that weren't pushed? I only see changes in src/.

dcecchini (Author) replied:

@dcecchini do you have local (behavioural) tests that weren't pushed? I only see changes in src/.

I don't think so; I made all the changes in src/ only and pushed all of them. I will make a few modifications as mentioned in the comments here and push them shortly.

Commit: …w method to the `show.jl` file.
Modified the constructor to accept `a...` and `ka...` instead of only the `in` and `out`.
ToucheSir (Member) commented:

Testing-wise, having something in https://github.com/FluxML/Flux.jl/blob/6168692ddb2680ebe02f461bf0c59c2b6cb0ff1e/test/layers/recurrent.jl that checks the logic (it need not be complex) would be good. You can see how this is done for other composite layers such as SkipConnection or Parallel.
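For reference, a minimal behavioural test sketch in that spirit, assuming the Bidirectional(LSTM, in, out) constructor from this PR and the concatenated (2 × out, batch) output shape shown earlier in this thread:

using Flux, Test

@testset "Bidirectional" begin
    m = Bidirectional(LSTM, 3, 5)
    x = rand(Float32, 3, 7)
    # forward and backward outputs are concatenated along the feature dimension
    @test size(m(x)) == (10, 7)
end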



"""
Bidirectional{A,B}
A reviewer (Member) commented:

Once we settle on the constructor, this line should show the various constructors, not the type definition.

ToucheSir (Member) commented on Nov 29, 2021 via email.

dcecchini (Author) commented on Nov 29, 2021:

To be clear, accepting that input is a bug, not a feature :P Workflow-wise, the key is to put the batch dim in between the time and feature ones. There are a few ways to do this, but one of the easiest would be Flux.onehotbatch(transpose(reduce(hcat, batch_of_padded_sentences)), vocab). See the Flux.onehotbatch https://fluxml.ai/Flux.jl/stable/data/onehot/#Flux.onehotbatch docs for more details. We're actively thinking about how to make working with RNNs more ergonomic (the old interface is not terribly so), but in the meantime getting your data into (samples, seq_length) or (samples, (seq_length)) first and then doing the one-hot transform is the easiest way to go about things.

Yes! Adding the batch dimension between time and features is not difficult; the problem is having the batch as the last dimension! I was able to transform the data into the (time, (samples, features)) format (an array of arrays of arrays rather than an array of matrices), but could not keep the samples at the end. I used:

[[X_onehot[i][:, j] for i in 1:length(X_onehot)] for j in 1:size(X_onehot[1])[2]]

but after checking your example I was able to do this:

aux = reduce(hcat, X_pad); # size(aux) = (time, samples)
Xs = [Flux.onehotbatch(aux[i, :], 1:vocab_size) for i in 1:size(aux)[1]];
size(Xs) # (time, )
size(Xs[1]) # (features, samples)

So, in the end, I had to use the one-hot encoding in a different way: instead of applying it to a sentence, apply it to arrays representing the first words, the second words, etc.

Edit:

After experimenting with this, I was unable to make it work with the Embedding layer. I would like to use (time, (features, samples)) as input to the Embedding layer, but it was not working. I also tried to pass it to LSTM cells, and that didn't work either.

So I really would prefer the "bugged" version, as it allows sending the whole sentence as a batch, and it works fine passing the data as (samples, (features, time)) to the embedding layer and onwards...

Does anyone have another solution?

ToucheSir (Member) replied:

(time, (features, samples)) as input to the Embedding layer, but it was not working

You'd have to pass each timestep in separately to the embedding layer, because it takes an input of size (samples,) or (features, samples). This holds true for the alternative format you posted as well, because Embedding doesn't know how to work with the outer vector otherwise.

Also, I tried to pass it to LSTM cells and it didn't work as well.

Same story here. LSTM takes an input of shape (features, samples) (one timestep; you'll have to pass each timestep separately if you're using ((features, samples), time)) or (features, samples, time) (multiple timesteps).

Alternatively, you could use dense arrays all the way through. Given a set of inputs of shape (samples, time), pass it to Embedding to get an output of (features, samples, time). As you can see, this is already in the right shape to be handled by LSTM or any following RNN layer.

...it allows to send the whole sentence in a batch

Just to clarify, using the time dimension as the batch dim means that each token in the sentence will be processed in parallel...which is probably not what you want with an RNN :P
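A sketch of the dense-array pipeline suggested above, with hypothetical sizes, assuming the Recur-based recurrent interface discussed in this thread and the `in => out` constructor syntax of later Flux versions:

using Flux

vocab_size, embed_dim, hidden = 100, 8, 5    # hypothetical sizes
batch, seq_len = 4, 10

tokens = rand(1:vocab_size, batch, seq_len)  # (samples, time) integer indices

embed = Flux.Embedding(vocab_size => embed_dim)
lstm  = LSTM(embed_dim => hidden)

e = embed(tokens)   # (features, samples, time) = (8, 4, 10)
h = lstm(e)         # (hidden, samples, time)   = (5, 4, 10)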

NeroBlackstone commented:

Any updates for this PR?
This is a very useful feature.
Are there any other examples of building a bi-LSTM in Flux.jl 0.14?

NeroBlackstone commented on Jun 3, 2024:

Questions about this implementation of Bidirectional RNN:

My question is: since a bidirectional RNN requires the full sequence of time steps (forward hidden-state calculation ➡️ plus backward hidden-state calculation ⬅️), for a single time step (1/seq_length) of input we cannot calculate the hidden state of the bidirectional RNN, right?

But if we initialize the model from this PR:

julia> model = Bidirectional(LSTM, 3, 5) 
Bidirectional(LSTMCell(3 => 5), LSTMCell(3 => 5))

julia> x = rand(Float32, 3)
3-element Vector{Float32}:
 0.40451288
 0.40546536
 0.7272219

julia> model(x)
10-element Vector{Float32}:
 -0.15061164
 -0.105684996
 -0.08399847
  0.113812014
 -0.03911737
  0.061232302
 -0.011352453
 -0.0018385417
 -0.06642
  0.006111855

It outputs a single-step result; is this the expected behavior?

I think the correct behavior is (for batch size 1):

# input size(all_time_step) =  (sequence_len * features) 
julia> model(all_time_step)
# output size = (sequence_len * outsize*2)

NeroBlackstone commented:
It seems that the original author has given up on implementing it, and I want to try to continue his work (this is related to my graduation,

    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=1)); dims=1))
end

@functor Bidirectional


Flux.@layer here?

NeroBlackstone commented on Jun 4, 2024:

I think a BRNN should take all the time steps, so we can't use it like the RNN we have already implemented, which means:

This usage is valid for RNN:

# 2 => feature, 3 => sequence_len, 4 => batch_size
x = [[1 2 1 2; 3 4 3 4] for i = 1:3]
rnn = RNN(2=>5)
# out_dim = 3(sequence_len) * 5(hidden) * 4(batch_size)
rnn_out = [rnn(xi) for xi in x]

But the same usage is invalid for BRNN:

# 2 => feature, 3 => sequence_len, 4 => batch_size
x = [[1 2 1 2; 3 4 3 4] for i = 1:3]
brnn = Bidirectional(RNN,2,5)
# out_dim = 3(sequence_len) * 10(hidden) * 4(batch_size)
brnn_out = [brnn(xi) for xi in x]

Because in (m::Bidirectional)(x::Union{AbstractVecOrMat{T},Flux.OneHotArray}) where {T}:

function (m::Bidirectional)(x::Union{AbstractVecOrMat{T},Flux.OneHotArray}) where {T}
    #size(x) = (2,4)(feature*batch_size)
    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=1)); dims=1))
end

the size of x is (feature * batch_size), so we actually "reverse the features", not "reverse the time steps", in this implementation.
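A quick illustration of that point, using the matrix from the example above:

x = [1 2 1 2; 3 4 3 4]   # (feature, batch_size) = (2, 4), a single time step

reverse(x; dims=1)       # gives [3 4 3 4; 1 2 1 2]: the feature rows are flipped;
                         # there is no time dimension in this array to reverse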

NeroBlackstone commented on Jun 4, 2024:

So for BRNN input, I think the correct dimensions are:

For 2D-matrix input or vector-of-vectors input, which means seq_length * feature with a batch size of one:

# seq_len = 3
# feature = 2

# vector of vector
x = [rand(Float32,2) for i = 1:3]
# or
x = rand(Float32,3,2)

# output all hidden states, with doubled size
# size = sequence_len * 2hidden_size
brnn(x)

For vector-of-2D-matrices input, which means seq_length * feature * batch_size:

# seq_len = 3
# feature = 2
# batch_size = 4

x = [[1 2 1 2; 3 4 3 4] for i = 1:3]

# output all hidden states, with doubled size
# size = sequence_len * 2hidden_size * batch_size
brnn_out = brnn(x)

NeroBlackstone commented on Jun 4, 2024:

I think the correct implementation is:

function (m::Bidirectional)(x::Union{AbstractVecOrMat{T},Flux.OneHotArray}) where {T}
    vec_for = [m.forward(xi) for xi in x]
    vec_back = reverse([m.backward(xi) for xi in reverse(x)])
    return vcat.(vec_for,vec_back)
end

For input without batches:

x = [rand(Float32,2) for i = 1:3]
# size = 3(seq_len) * 10(double hidden_size)
brnn(x)
# size = 3(seq_len) * 5(hidden_size)
[rnn(xi) for xi in x]

For input with batches:

x = [[1 2 1 2; 3 4 3 4] for i = 1:3]
# size = 3(seq_len) * 10(double hidden_size) * 4(batch_size)
brnn(x)
# size = 3(seq_len) * 5(hidden_size) * 4(batch_size)
[rnn(xi) for xi in x]

The output sizes of RNN and BRNN are the same.

dcecchini (Author) replied:

Hi @NeroBlackstone, at the time I could not finish the implementation due to the difficulty of getting something to work with the designed structure for the RNN cells, as you can see in the history above. I would be glad to work on this, but I have little time to spare.
