Hello, great work on this project. I have some questions and hope to get answers.
I noticed this note in your code: "Given the initial states of the decoder, decoder inputs are initialized to zero and are updated using backpropagation. Outputs of the decoder and passed on to the output layer"
Why do you initialize the decoder inputs this way? It is not the same as the traditional LSTM framework.
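
To make the question concrete, here is a minimal sketch of how I understand the zero-input part. It is not taken from your code: it uses a plain tanh RNN cell instead of an LSTM, and all names (`rnn_step`, `decode_with_zero_inputs`, `W_x`, `W_h`, `b`, `h0`) are hypothetical. It only shows that with all-zero decoder inputs the decoder trajectory is determined by the initial state and the recurrent weights. It does not cover the "updated using backpropagation" part, which is the part I do not understand.

```python
import math
import random


def rnn_step(x, h, W_x, W_h, b):
    """One step of a plain tanh RNN cell: h' = tanh(W_x x + W_h h + b)."""
    new_h = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h[j] for j in range(len(h)))
        new_h.append(math.tanh(total))
    return new_h


def decode_with_zero_inputs(h0, steps, W_x, W_h, b, input_size):
    """Unroll the decoder, feeding an all-zero vector at every step."""
    zero_input = [0.0] * input_size
    h, states = list(h0), []
    for _ in range(steps):
        h = rnn_step(zero_input, h, W_x, W_h, b)
        states.append(h)
    return states


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden_size, input_size, steps = 3, 2, 4
W_h = random_matrix(hidden_size, hidden_size, rng)
b = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
h0 = [0.5, -0.2, 0.1]  # stands in for the encoder's final state

# Two different input weight matrices give the same trajectory,
# because W_x is always multiplied by a zero vector.
W_x_a = random_matrix(hidden_size, input_size, rng)
W_x_b = random_matrix(hidden_size, input_size, rng)
states_a = decode_with_zero_inputs(h0, steps, W_x_a, W_h, b, input_size)
states_b = decode_with_zero_inputs(h0, steps, W_x_b, W_h, b, input_size)
print(states_a == states_b)
```

If this reading is right, the decoder inputs carry no information at all, which is why I am asking how they differ from the usual setup where the previous output or target is fed back in.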