This repository was archived by the owner on Dec 26, 2018. It is now read-only.

GPU OOM bug when running on more than one Spark worker node #12


Description

@younfor

For example: I changed the files to load my own images with shape [None, 32, 32, 3]. Everything is OK with one partition, but the problem appears when I set partition = 2, 4, 8, and so on. My machine: GTX 1070, Ubuntu 14.04, 8 GB. I also changed the model init code to:
```python
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.allocator_type = 'BFC'
# config.gpu_options.per_process_gpu_memory_fraction = 0.2
session = tf.Session(config=config)
```
The above allows several processes to run on one GPU.
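(As a general TF 1.x note, not anything specific to this repo: `allow_growth` makes each process start with a small allocation and grow on demand, while the commented-out `per_process_gpu_memory_fraction` line would instead pin each process to a fixed slice of GPU memory.)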
The bug: after the program runs for some epochs, nvidia-smi shows the GPU memory growing without stopping, from 800 MB to 2 GB, 4 GB, 8 GB... until it finally fails with a CUDA OOM error.
My way to solve it: after checking and trying fixes, I found the function that leads to the GPU memory leak:
```python
def reset_gradients(self):
    # with self.session.as_default():
    #     self.gradients = [tf.zeros(g[1].get_shape()).eval() for g in self.compute_gradients]
    self.gradients = [0.0] * len(self.compute_gradients)  # my modification
    self.num_gradients = 0
```
Though I don't know the details of why this change works, it did.
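My guess at why (an assumption on my part, not something I verified in this codebase): the original line builds a new `tf.zeros` op in the default graph on every call, so the graph, and the memory behind it, grows with every reset, while the patched line creates no TF ops at all. A minimal TF 1.x sketch of the difference, with a made-up `shapes` list standing in for `self.compute_gradients`:

```python
import numpy as np
import tensorflow as tf

shapes = [(32, 32, 3), (10,)]  # made-up stand-ins for the gradient shapes

sess = tf.Session()

def reset_gradients_leaky():
    # Every call adds fresh tf.zeros nodes to the default graph,
    # so the graph (and the memory backing it) grows without bound.
    return [tf.zeros(s).eval(session=sess) for s in shapes]

def reset_gradients_fixed():
    # Plain numpy zeros: no new graph nodes are created.
    return [np.zeros(s, dtype=np.float32) for s in shapes]

for _ in range(3):
    reset_gradients_leaky()
    print(len(tf.get_default_graph().get_operations()))  # keeps rising

for _ in range(3):
    reset_gradients_fixed()
    print(len(tf.get_default_graph().get_operations()))  # stays flat
```

Whether you reset with 0.0 scalars as in my patch or with numpy arrays, the point is the same: don't create new TensorFlow ops inside a function that runs every epoch.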
email: younfor@yeah.net
