Rewrite matrices in the CUDA implementation #3

Open
@sbaldu

Description

The CUDA implementation is very slow because of how the CUDA matrices are implemented. Every host method that calls a global kernel makes multiple calls to cudaMalloc and cudaMemcpy, which greatly increases execution times.
The class should be reimplemented so that the device pointers are class attributes. That way, kernels can be launched on them directly, without having to allocate and copy memory on every call.
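A minimal sketch of the proposed design (the class name, member names, and kernel below are illustrative assumptions, not the library's actual API): the matrix owns its device buffer for its whole lifetime, so host methods launch kernels on the stored pointer and transfers happen only when explicitly requested.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the device pointer is a class attribute,
// allocated once in the constructor and freed in the destructor,
// instead of being allocated/copied inside every host method.
class DeviceMatrix {
public:
  DeviceMatrix(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), d_data_(nullptr) {
    cudaMalloc(&d_data_, rows_ * cols_ * sizeof(double));
  }
  ~DeviceMatrix() { cudaFree(d_data_); }

  // Non-copyable: the device buffer has single ownership.
  DeviceMatrix(const DeviceMatrix&) = delete;
  DeviceMatrix& operator=(const DeviceMatrix&) = delete;

  // Explicit transfers, performed once, not on every kernel call.
  void upload(const std::vector<double>& host) {
    cudaMemcpy(d_data_, host.data(), host.size() * sizeof(double),
               cudaMemcpyHostToDevice);
  }
  void download(std::vector<double>& host) const {
    cudaMemcpy(host.data(), d_data_, host.size() * sizeof(double),
               cudaMemcpyDeviceToHost);
  }

  double* data() { return d_data_; }
  std::size_t size() const { return rows_ * cols_; }

private:
  std::size_t rows_, cols_;
  double* d_data_;
};

// Example kernel operating directly on the persistent device buffer.
__global__ void scale(double* data, std::size_t n, double factor) {
  std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Usage: launch on the stored pointer, no per-call malloc/memcpy:
//   scale<<<blocks, threads>>>(m.data(), m.size(), 2.0);
```

Making the class non-copyable (or giving it proper move semantics) avoids double-free bugs once the raw device pointer becomes a member.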

Metadata

Assignees

Labels

improving — Improving an already existing feature
performance — Regarding the performance of the library

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
