Rewrite matrices in the CUDA implementation #3

Open
@sbaldu

Description

The CUDA implementation is very slow because of how the CUDA matrices are implemented. Every host method that calls a global kernel makes multiple calls to cudaMalloc and cudaMemcpy, which greatly increases execution times.
The class should be reimplemented so that the device pointers are class attributes. That way, kernels can be launched on them directly, without having to allocate and copy memory on every call.
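A minimal sketch of the proposed design (the class name, member names, and kernel below are illustrative assumptions, not the library's actual API): the matrix owns its device buffer for its whole lifetime, so host methods launch kernels on the stored pointer and transfers happen only when explicitly requested.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the device pointer is a class attribute,
// allocated once in the constructor and freed in the destructor,
// instead of being allocated/copied inside every host method.
class DeviceMatrix {
public:
  DeviceMatrix(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), d_data_(nullptr) {
    cudaMalloc(&d_data_, rows_ * cols_ * sizeof(double));
  }
  ~DeviceMatrix() { cudaFree(d_data_); }

  // Non-copyable: the device buffer has single ownership.
  DeviceMatrix(const DeviceMatrix&) = delete;
  DeviceMatrix& operator=(const DeviceMatrix&) = delete;

  // Explicit transfers, performed once, not on every kernel call.
  void upload(const std::vector<double>& host) {
    cudaMemcpy(d_data_, host.data(), host.size() * sizeof(double),
               cudaMemcpyHostToDevice);
  }
  void download(std::vector<double>& host) const {
    cudaMemcpy(host.data(), d_data_, host.size() * sizeof(double),
               cudaMemcpyDeviceToHost);
  }

  double* data() { return d_data_; }
  std::size_t size() const { return rows_ * cols_; }

private:
  std::size_t rows_, cols_;
  double* d_data_;
};

// Example kernel operating directly on the persistent device buffer.
__global__ void scale(double* data, std::size_t n, double factor) {
  std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Usage: launch on the stored pointer, no per-call malloc/memcpy:
//   scale<<<blocks, threads>>>(m.data(), m.size(), 2.0);
```

Making the class non-copyable (or giving it proper move semantics) avoids double-free bugs once the raw device pointer becomes a member.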

Metadata

Assignees

Labels

improving — Improving an already existing feature
performance — Regarding the performance of the library

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
