I have been going through the implementation in the NASCaps repo. To understand how the algorithm searches the architecture space, I followed the README.md and ran main.py with the arguments listed there, and I ran into the issue described below.
Once a gene is created and the corresponding CapsNet model is built, training the model while evaluating the population (call chain evaluate_population > wrap_train_test > train) fails with the following error:
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
run_metadata_ptr)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/engine/training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
class_weight=class_weight)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/ak11263/nascaps/nsga/main.py", line 893, in train
callbacks=[timeout_call, log, checkpoint, lr_decay])
File "/home/ak11263/nascaps/nsga/main.py", line 652, in wrap_train_test
runid, _ = train(model=model, data=((x_train_current, y_train), (x_test_current, y_test)), args=args)
File "/home/ak11263/nascaps/nsga/main.py", line 525, in evaluate_population
p["runid"], train_acc = wrap_train_test(p["gene"])
File "/home/ak11263/nascaps/nsga/main.py", line 711, in run_NSGA2
evaluate_population(parent)
File "/home/ak11263/nascaps/nsga/main.py", line 1065, in <module>
rets = run_NSGA2(metrics=["accuracy_drop", "energy", "memory", "latency"], inshape=inshape, p_size=args.population, q_size=args.offsprings, generations=args.generations)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/runpy.py", line 193, in _run_module_as_main (Current frame)
"__main__", mod_spec)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(100, 160), b.shape=(160, 784), m=100, n=784, k=160
[[{{node decoder/dense_1/MatMul}}]]
[[{{node loss/decoder_loss/Mean_3}}]]
After commenting out the training and testing of the generated model and replacing them with a dummy model that produces a random test_acc, the program runs successfully.
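To make that workaround concrete, here is a minimal sketch of a stub of this kind. It is not the code from the repo: the names dummy_wrap_train_test and the "test_acc" key are hypothetical, the run id format is a placeholder, and the only thing assumed from the trace above is that wrap_train_test takes a gene and returns a (runid, accuracy) pair.

```python
import random
import uuid


def dummy_wrap_train_test(gene, rng=None):
    """Hypothetical stand-in for wrap_train_test: no model is built or trained."""
    if rng is None:
        rng = random.Random()
    runid = uuid.uuid4().hex          # placeholder run id, not the repo's format
    test_acc = rng.uniform(0.0, 1.0)  # random accuracy in [0, 1]
    return runid, test_acc


def evaluate_population(population, evaluate=dummy_wrap_train_test):
    """Assign a run id and a (random) accuracy to every individual."""
    for p in population:
        p["runid"], p["test_acc"] = evaluate(p["gene"])
    return population


if __name__ == "__main__":
    # Toy genes; the real gene encoding in NASCaps is different.
    population = [{"gene": [1, 2, 3]}, {"gene": [4, 5]}]
    for p in evaluate_population(population):
        print(p["runid"], round(p["test_acc"], 3))
```

Since the search loop completes with a stub like this, the failure seems to be confined to the GPU training step and not to the NSGA-II logic itself.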
Searching online, I found suggestions that the use of TensorFlow v1 is causing the issue (I also see plenty of deprecation warnings).
I have also started migrating the project to TensorFlow 2, though without much success so far.
Any suggestions would be greatly appreciated.