I am using Ossian to train a Bangla (Bengali) voice. My dataset consists of ~4000 sentences (about 7 hours of speech). The error graph I obtained after training the acoustic model looks like this:

I have used (almost) all the default settings, except changing some hyper-parameters as follows:
- batch_size : 128
- training_epochs : 15
- L2_regularization: 0.003
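For reference, since Ossian drives Merlin for acoustic-model training, these overrides sit in a Merlin-style config file; the fragment below is only a rough sketch of how mine looks (the section name and formatting are from memory, so treat them as approximate):

```ini
; Approximate Merlin-style acoustic-model config fragment (not verbatim)
[Architecture]
batch_size          : 128
training_epochs     : 15
L2_regularization   : 0.003
```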
The synthesized speech does not sound bad, but judging by the error graph I think there is still a lot of room for improvement. Can anyone suggest changes to improve the acoustic model? Do I need more data (I am working on that), or should I reduce the size/number of layers of the NN? Any suggestions about the hyper-parameters? Thanks.