SaShiMi-796/goals.txt at master · necrashter/SaShiMi-796 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
Paper title: It’s Raw! Audio Generation with State-Space Models
Paper link: https://proceedings.mlr.press/v162/goel22a/goel22a.pdf

Qualitative results: Examples on section 1, https://hazyresearch.stanford.edu/sashimi-examples/

Quantitative results: Table 4, column 1. Reproducing columns 2 and 3 is not feasible because it relies on human feedback.

—— version 1 submission ——

First, we have tested the model on MNIST and obtained good results.
However, we have failed to reproduce the targeted results.
The samples from the model resemble piano sounds but they lack musicality. It sounds as if someone was pressing the piano keys randomly.
Our average NLL on test dataset is 5.192 (in base 2) which is nowhere near our target 1.294.

We have already outlined the challenges we faced at the end of main.ipynb.
Based on these, our future work plan is:
- Check whether there are any remaining bugs in our implementation.
- Train with different hyperparameters and optimizers.
- Train for longer durations if our resources allow.
- If all else fails, train on a different dataset. In particular, SC09 may be simpler.

—— version 2 submission ——

With an 8-layer SaShiMi model, we managed to achieve an NLL of 1.325 (in base 2) in our target dataset after 160 epochs.
For comparison, the result reported in the paper is 1.294.
Although our result is slightly higher, the model in the paper was trained longer (600K steps on page 19, which would be about 400 epochs in our setup).
We believe it's reasonable to expect that our model can achieve the same or better NLL value with longer training and/or better hyperparameter choices.
Furthermore, our generated samples are similar to the ones provided by the authors.
Therefore, we think that we've successfully reproduced the paper.

We've made many critical changes and bugfixes since the version 1 submission.
These are explained in the "Challenges" section of main.ipynb in detail.