Is your feature request related to a problem? Please describe.
CPU offloading in backward works by iterating the following steps over the layers:
- Load weights from CPU
- Run backward from head gradients
- Store gradients to CPU
These steps are currently run sequentially.
Describe the solution you'd like
Can we run these steps in parallel?
- Switch between two shards on GPU
- Run backward on one shard while loading the weights for the next layer into the other shard and storing the gradients of the previous layer
This needs a clear understanding of how asynchronous CPU <-> GPU transfers work. We already know how transfers between GPUs work.
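
To make the proposed schedule concrete, here is a rough, framework-agnostic sketch in plain Python. Everything in it is an assumption for illustration: `load_weights`, `run_backward`, `store_gradients` and `pipelined_backward` are hypothetical names, the two dicts stand in for the two shards on GPU, and single-worker thread pools stand in for whatever asynchronous copy mechanism ends up being used. It only shows the ordering of the three steps and says nothing about how real async CPU <-> GPU transfers behave.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 2  # two GPU-resident shards, used alternately (assumption)


def load_weights(layer, shard):
    """Stand-in for an asynchronous CPU -> GPU weight copy."""
    shard["weights"] = f"weights[{layer}]"


def run_backward(layer, shard, head_grad):
    """Stand-in for the backward computation of one layer."""
    # The weights of this layer must already be resident in the shard.
    assert shard["weights"] == f"weights[{layer}]"
    shard["grads"] = f"grads[{layer}]"
    return head_grad + 1  # placeholder for the gradient passed to the next layer


def store_gradients(layer, shard, cpu_grads):
    """Stand-in for an asynchronous GPU -> CPU gradient copy."""
    cpu_grads[layer] = shard["grads"]


def pipelined_backward(num_layers, head_grad=0):
    cpu_grads = {}
    order = list(reversed(range(num_layers)))  # backward visits the last layer first
    if not order:
        return cpu_grads, head_grad
    shards = [{"weights": None, "grads": None} for _ in range(NUM_SHARDS)]

    # One worker each, so loads (and stores) are issued strictly in order.
    with ThreadPoolExecutor(max_workers=1) as loader, \
            ThreadPoolExecutor(max_workers=1) as storer:
        pending_load = loader.submit(load_weights, order[0], shards[0])
        pending_store = None
        for i, layer in enumerate(order):
            shard = shards[i % NUM_SHARDS]
            pending_load.result()  # wait until this layer's weights have arrived
            if i + 1 < len(order):
                # Prefetch the next layer into the other shard; its previous
                # occupant has already finished its backward computation.
                nxt = shards[(i + 1) % NUM_SHARDS]
                pending_load = loader.submit(load_weights, order[i + 1], nxt)
            # Overlaps with the prefetch above and with the store issued
            # at the end of the previous iteration.
            head_grad = run_backward(layer, shard, head_grad)
            if pending_store is not None:
                pending_store.result()  # frees the other shard's gradient buffer
            pending_store = storer.submit(store_gradients, layer, shard, cpu_grads)
        pending_store.result()  # drain the last store
    return cpu_grads, head_grad


if __name__ == "__main__":
    grads, final_head_grad = pipelined_backward(4)
    print(sorted(grads), final_head_grad)
```

Note that each shard needs separate weight and gradient buffers in this scheme: the prefetch overwrites the weights of a shard whose gradients may still be on their way to the CPU.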