Description
Hi @urialon,
As mentioned in one of the previous issues, I am trying to train and test Code2Seq for the code summarization task on our own Python dataset. I am able to train the model, but the predictions don't seem to be correct. This issue looks similar to #62, which was also never fully resolved. Here is what I have tried so far:
- The first time, I trained with the default config; after a couple of epochs the predicted text for all cases was like `the|the|the|the|the|the`.
- Following the suggestions in #17 (Code Captioning Task) and #45 (reproducing the code-documentation results from the paper), I updated the model config to make it suitable for predicting longer sequences (see the config sketch after this list). The predictions were still similar, though the lengths of the predicted texts varied, probably because I changed `MAX_TARGET_PARTS` in the config.
- Next, following the suggestions in #62 (Empty hypothesis when periods are included in dataset), I made sure there are no extra delimiters (`,`, `|`, or spaces), no punctuation or numbers, and no other non-alphanumeric characters (using a `str.isalpha()` check over both the docs and the paths), and I removed extra pipes (`||`); see the cleaning sketch after this list. This time the hypothesis was empty for all validation data points, exactly as in #62.
- To check whether there is an issue with my setup, I trained the model on the python150k dataset, and it trains properly there, so I assume it is a problem with my dataset.
- I have also observed that during the first 1 or 2 epochs there is some text in the predictions, but with more epochs the output degrades until it is empty for all data points.
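For reference, here is a minimal sketch of the config change from the second bullet. The field names are from the repo's `config.py` as I understand it; the exact values are illustrative rather than the precise ones I used:

```python
# Inside Config.get_default_config() in config.py
# The defaults target short method names; for longer summaries the
# decoder needs a larger budget of target sub-tokens per prediction.
config.MAX_TARGET_PARTS = 30  # illustrative; the default is much smaller
config.BEAM_WIDTH = 0         # 0 = greedy decoding; > 0 enables beam search
```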
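And here is a minimal sketch of the cleaning from the third bullet. The function names are mine, just to show the checks; it assumes the code2seq data format of `<target> <ctx1> <ctx2> ...`, with `|`-joined sub-tokens and `,`-separated context pieces:

```python
from typing import Optional

def is_clean(tok: str) -> bool:
    # str.isalpha() drops punctuation, digits, and other non-alphabetic
    # characters; it is also False for the empty strings left behind
    # by doubled pipes ('||') or extra spaces.
    return tok.isalpha()

def clean_line(line: str) -> Optional[str]:
    # Each line is '<target> <ctx1> <ctx2> ...': the target is
    # 'word|word|...' and each context is 'token,path,token'.
    target, *contexts = line.strip().split(' ')
    target_parts = [t for t in target.split('|') if is_clean(t)]
    kept = [
        ctx for ctx in contexts
        if all(is_clean(part)
               for piece in ctx.split(',')
               for part in piece.split('|'))
    ]
    if not target_parts or not kept:
        return None  # drop examples that end up empty after cleaning
    return '|'.join(target_parts) + ' ' + ' '.join(kept)
```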
Here are some of the training logs from my experiments:
training-logs-1.txt
training-logs-2(config change).txt
training-logs-3(alnum).txt
Thanks & Regards,
Tamal Mondal