Releases: necst/grcuda
grcuda-0.4.6
What's Changed
- Hotfix scripts generic by @ian-ofgod in #54
Full Changelog: grcuda-0.4.3...grcuda-0.4.6
grcuda-0.4.5
What's Changed
- Hotfix scripts generic by @ian-ofgod in #54
Full Changelog: grcuda-0.4.3...grcuda-0.4.5
grcuda-0.4.3
What's Changed
- [GRCUDA-hotfix] added generic install scripts + fallback configuration in #53
Full Changelog: grcuda-0.4.1...grcuda-0.4.3
grcuda-0.4.2
What's Changed
- Added Java implementation of the benchmark suite in #49
- Grcuda 132 refactor deviceselectionpolicy in grcudastreampolicy in #50
- [GRCUDA-hotfix] updated Java benchmarks configs in #51
- [GRCUDA-hotfix] move B9M into the Java benchmark suite in #52
Full Changelog: grcuda-0.4.0...grcuda-0.4.2
GrCUDA 0.4.1 - July 2023
Miscellaneous
- Bumped graal and mx versions
- Added Java implementation of the benchmark suite (a minimal sketch of a benchmark class following this template appears after this list):
  - Implemented in Java the multi-GPU benchmarks present in the Python suite.
  - The class `Benchmark.java` provides a template for future use cases.
  - Created configuration files to easily adapt the benchmarks to different types of workloads.
  - The suite is built as a Maven project: running `mvn test` executes all the benchmarks based on the configuration file of the appropriate GPU architecture.
  - In the default configuration, all benchmarks are executed, all benchmark input sizes are tested, and all scheduling policies are run.
- Refactored `DeviceSelectionPolicy` in `GrCUDAStreamPolicy`:
  - Each policy type has a separate class.
  - Kept only `retrieveImpl`.
  - `TransferTimeDeviceSelectionPolicy` now extends `DeviceSelectionPolicy`.
  - Deleted previously commented-out methods and cleaned up the code.
- Added license for each file
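Below is a minimal, hypothetical sketch of what a benchmark built on this template could look like from Java. The class and method names (other than `DeviceArray` and the `grcuda.ExecutionPolicy` option) are illustrative assumptions, not the actual API of `Benchmark.java` or of the suite's configuration files.

```java
// Hypothetical sketch only: the real template is Benchmark.java in the suite.
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

abstract class BenchmarkSketch {
    // One polyglot context per benchmark, with the scheduling policy under test.
    protected final Context context = Context.newBuilder()
            .allowAllAccess(true)
            .allowExperimentalOptions(true)
            .option("grcuda.ExecutionPolicy", "async")
            .build();

    abstract void init(int size);  // allocate device arrays and inputs
    abstract void reset();         // restore inputs between iterations
    abstract void run();           // launch the computation being measured

    void runBenchmark(int size, int iterations) {
        init(size);
        for (int i = 0; i < iterations; i++) {
            reset();
            long start = System.nanoTime();
            run();
            System.out.printf("iteration %d: %.3f ms%n", i, (System.nanoTime() - start) / 1e6);
        }
        context.close();
    }
}

class VectorSumSketch extends BenchmarkSketch {
    private Value x;
    private int n;

    @Override void init(int size) {
        n = size;
        x = context.eval("grcuda", "DeviceArray").execute("float", n);
    }

    @Override void reset() {
        for (int i = 0; i < n; i++) x.setArrayElement(i, 1.0f);
    }

    @Override void run() {
        // A real benchmark would launch CUDA kernels here; this placeholder only reads the array.
        double sum = 0;
        for (int i = 0; i < n; i++) sum += x.getArrayElement(i).asFloat();
    }

    public static void main(String[] args) {
        new VectorSumSketch().runBenchmark(1 << 20, 5);
    }
}
```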
GrCUDA 0.4.0 (multi-GPU support) - June 2022
New features
- Enabled support for multiple GPUs in the asynchronous scheduler:
  - Added the `GrCUDADeviceManager` component, which encapsulates the status of the multi-GPU system. It tracks the currently active GPUs, the streams and the currently active computations associated with each GPU, and which data is up-to-date on each device.
  - Added the `GrCUDAStreamPolicy` component, which encapsulates new scheduling heuristics to select the best device for each new computation (CUDA streams are uniquely associated with a GPU), using information such as data locality and the current load of each device. We currently support 5 scheduling heuristics of increasing complexity (a standalone sketch of the data-locality idea follows after this list):
    - `ROUND_ROBIN`: simply rotate the scheduling between GPUs. Used as the initialization strategy of other policies;
    - `STREAM_AWARE`: assign the computation to the device with the fewest busy streams, i.e. the device with the fewest ongoing computations;
    - `MIN_TRANSFER_SIZE`: select the device that requires the least amount of bytes to be transferred, maximizing data locality;
    - `MINMIN_TRANSFER_TIME`: select the device whose minimum total transfer time is the smallest;
    - `MINMAX_TRANSFER_TIME`: select the device whose maximum total transfer time is the smallest.
  - Modified the `GrCUDAStreamManager` component to select the stream using the heuristics provided by the policy manager.
  - Extended the `CUDARuntime` component with APIs for selecting and managing multiple GPUs.
- Added the possibility to export the computation DAG obtained with a given policy. If the `ExportDAG` startup option is enabled, the graph is exported in .dot format, before the context's cleanup, to the path specified by the user as the option's argument.
- Added support for Graal 22.1 and CUDA 11.7.
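To make the data-locality heuristics above concrete, here is a standalone Java sketch of the idea behind `MIN_TRANSFER_SIZE`. This is illustrative logic only, not GrCUDA's actual `DeviceSelectionPolicy` implementation: pick the device that already holds the most input data, so the fewest bytes have to be moved before the kernel can start.

```java
import java.util.Map;
import java.util.Set;

final class MinTransferSizeSketch {

    /**
     * inputSizes maps each input array name to its size in bytes;
     * upToDate maps each device id to the inputs whose latest copy already resides there.
     * Returns the device id that minimizes the bytes still to be transferred.
     */
    static int selectDevice(Map<String, Long> inputSizes, Map<Integer, Set<String>> upToDate) {
        int bestDevice = -1;
        long bestMissingBytes = Long.MAX_VALUE;
        for (Map.Entry<Integer, Set<String>> device : upToDate.entrySet()) {
            long missing = 0;
            for (Map.Entry<String, Long> input : inputSizes.entrySet()) {
                if (!device.getValue().contains(input.getKey())) {
                    missing += input.getValue();  // this input would have to be transferred
                }
            }
            if (missing < bestMissingBytes) {
                bestMissingBytes = missing;
                bestDevice = device.getKey();
            }
        }
        return bestDevice;
    }

    public static void main(String[] args) {
        Map<String, Long> inputSizes = Map.of("x", 4_000_000L, "y", 4_000_000L);
        Map<Integer, Set<String>> upToDate = Map.of(
                0, Set.of("x"),        // GPU 0 already holds the latest copy of x
                1, Set.<String>of());  // GPU 1 holds none of the inputs
        // GPU 0 only needs y (4 MB) transferred, GPU 1 needs both (8 MB): GPU 0 is selected.
        System.out.println("Selected device: " + selectDevice(inputSizes, upToDate));
    }
}
```

The transfer-time variants follow the same pattern, but rank devices by estimated transfer time instead of raw size.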
GrCUDA MultiGPU Pre-release
New features
- Enabled support for multiple GPUs in the asynchronous scheduler:
  - Added the `GrCUDADeviceManager` component, which encapsulates the status of the multi-GPU system. It tracks the currently active GPUs, the streams and the currently active computations associated with each GPU, and which data is up-to-date on each device.
  - Added the `GrCUDAStreamPolicy` component, which encapsulates new scheduling heuristics to select the best device for each new computation (CUDA streams are uniquely associated with a GPU), using information such as data locality and the current load of each device. We currently support 5 scheduling heuristics of increasing complexity:
    - `ROUND_ROBIN`: simply rotate the scheduling between GPUs. Used as the initialization strategy of other policies;
    - `STREAM_AWARE`: assign the computation to the device with the fewest busy streams, i.e. the device with the fewest ongoing computations;
    - `MIN_TRANSFER_SIZE`: select the device that requires the least amount of bytes to be transferred, maximizing data locality;
    - `MINMIN_TRANSFER_TIME`: select the device whose minimum total transfer time is the smallest;
    - `MINMAX_TRANSFER_TIME`: select the device whose maximum total transfer time is the smallest.
  - Modified the `GrCUDAStreamManager` component to select the stream using the heuristics provided by the policy manager.
  - Extended the `CUDARuntime` component with APIs for selecting and managing multiple GPUs.
GrCUDA 0.3.0 - December 2021
New features
- Enabled support for cuBLAS and cuML in the asynchronous scheduler
  - Stream management is now supported for both cuBLAS and cuML
  - This feature can potentially be applied to any library, by extending the `LibrarySetStreamFunction` class
- Enabled support for cuSPARSE
  - Added support for CSR and COO `spmv` and `gemvi`
    - Known limitation: `Tgemvi` works only with single-precision floating-point arithmetic
- Added support for precise timing of kernels, for debugging and complex scheduling policies
  - A CUDA event is associated with the start of each computation, to measure the elapsed time from start to end
  - Added an `ElapsedTime` function to compute the elapsed time between events, i.e. the total execution time
  - Logging of kernel timers is controlled by the `grcuda.TimeComputation` option (false by default)
  - Implemented with the `ProfilableElement` class, which stores timing values in a hash table and supports future business logic
  - Updated the README documentation for the new `TimeComputation` option
- Added a read-only polyglot map to retrieve GrCUDA options. Retrieve it with `getoptions`; option names and values are provided as strings. The full list of options is in `GrCUDAOptions` (a minimal usage sketch follows after this list)
- Enabled the usage of TruffleLoggers for logging the execution of GrCUDA code
  - GrCUDA has different types of loggers, each with its own functionality
  - Implemented the `GrCUDALogger` class to provide access to the loggers of interest when specific features are needed
  - Changed all prints in the source code into log events, with different logging levels
  - Added documentation about logging in docs
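A minimal sketch of how these options might be used from Java via the GraalVM polyglot API: `grcuda.TimeComputation` is enabled at context creation, and `getoptions` is then evaluated and called like other GrCUDA built-in functions to obtain the read-only option map. The retrieval pattern is an assumption; only the option and function names come from the notes above.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class OptionsAndTimingSketch {
    public static void main(String[] args) {
        try (Context ctx = Context.newBuilder()
                .allowAllAccess(true)
                .allowExperimentalOptions(true)
                // Log kernel timers, as controlled by the option described above.
                .option("grcuda.TimeComputation", "true")
                .build()) {

            // Assumed pattern: evaluate the built-in, then call it to obtain the
            // read-only map of option names and values (provided as strings).
            Value getOptions = ctx.eval("grcuda", "getoptions");
            Value options = getOptions.execute();
            System.out.println(options);
        }
    }
}
```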
Miscellaneous
- Removed deprecation warning for Truffle's ArityException.
- Set TensorRT support to experimental
  - TensorRT is currently not supported on CUDA 11.4, making it impossible to use along with a recent version of cuML
  - Known limitation: due to this incompatibility, TensorRT is currently not available on the async scheduler
GrCUDA 0.2.1 - October 2021
Minor fixes:
- Fixed path in installation script
- Fixed creation of result directory in Python benchmark suite
- Fixed Makefile for CUDA benchmarks
GrCUDA 0.2.0 - October 2021
API Changes
- Added the option to specify arguments in NFI kernel signatures as `const` (see the sketch after this list)
  - The effect is the same as marking them as `in` in the NIDL syntax
  - It is not strictly required to mark the corresponding arguments in the CUDA kernel as `const`, although that is recommended
  - Marking arguments as `const` or `in` enables the async scheduler to overlap kernels that use the same read-only arguments
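A short sketch of the idea from Java, assuming the polyglot API and GrCUDA's `buildkernel` built-in; the exact placement of `const` in the signature string is an assumption, so check the README for the authoritative syntax.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class ConstSignatureSketch {
    public static void main(String[] args) {
        // The input array x is read-only in the kernel, so it can be marked const.
        String code =
            "__global__ void saxpy(const float* x, float* y, float a, int n) {\n" +
            "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n" +
            "  if (i < n) y[i] = a * x[i] + y[i];\n" +
            "}";
        try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) {
            Value buildkernel = ctx.eval("grcuda", "buildkernel");
            Value deviceArray = ctx.eval("grcuda", "DeviceArray");
            int n = 1000;
            Value x = deviceArray.execute("float", n);
            Value y = deviceArray.execute("float", n);
            // NFI-style signature with the first argument marked const (placement assumed).
            Value saxpy = buildkernel.execute(code, "saxpy", "const pointer, pointer, float, sint32");
            saxpy.execute(8, 128).execute(x, y, 2.0f, n);  // kernel(grid, block)(args)
        }
    }
}
```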
New asynchronous scheduler
- Added a new asynchronous scheduler for GrCUDA; enable it with `--experimental-options --grcuda.ExecutionPolicy=async` (a usage sketch follows after this list)
  - With this scheduler, GPU kernels are executed asynchronously. Once they are launched, the host execution resumes immediately
  - The computation is synchronized (i.e. the host thread is stalled and waits for the kernel to finish) only once GPU data is accessed by the host thread
  - Execution of multiple kernels (operating on different data, e.g. distinct DeviceArrays) is overlapped using different streams
  - Data transfer and execution (on different data, e.g. distinct DeviceArrays) are overlapped using different streams
  - The scheduler supports different options; see `README.md` for the full list
  - It is the scheduler presented in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
- Enabled partial support for cuBLAS and cuML in the async scheduler
  - Known limitation: functions in these libraries work with the async scheduler, although they still run on the default stream (i.e. they are not asynchronous)
  - They do benefit from prefetching
- Set TensorRT support to experimental
  - TensorRT is currently not supported on CUDA 11.4, making it impossible to use along with a recent version of cuML
  - Known limitation: due to this incompatibility, TensorRT is currently not available on the async scheduler
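A minimal sketch from Java of what the async behavior looks like in practice. The option name comes from the list above; the kernel source, launch sizes, and signature string are illustrative assumptions. Two kernels touching distinct DeviceArrays are launched back to back without blocking the host, and reading an element from the host is what triggers synchronization.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class AsyncSchedulerSketch {
    public static void main(String[] args) {
        // Enable the async scheduler, as described above (experimental option).
        try (Context ctx = Context.newBuilder()
                .allowAllAccess(true)
                .allowExperimentalOptions(true)
                .option("grcuda.ExecutionPolicy", "async")
                .build()) {

            String code =
                "__global__ void fill(float* x, float v, int n) {\n" +
                "  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)\n" +
                "    x[i] = v;\n" +
                "}";

            int n = 1 << 20;
            Value deviceArray = ctx.eval("grcuda", "DeviceArray");
            Value buildkernel = ctx.eval("grcuda", "buildkernel");
            Value a = deviceArray.execute("float", n);
            Value b = deviceArray.execute("float", n);
            // Signature string is illustrative; see the GrCUDA README for the exact syntax.
            Value fill = buildkernel.execute(code, "fill", "pointer, float, sint32");

            // The two launches operate on distinct arrays, so the async scheduler can
            // overlap them on different streams; the host is not blocked here.
            fill.execute(128, 128).execute(a, 1.0f, n);
            fill.execute(128, 128).execute(b, 2.0f, n);

            // Accessing the data from the host synchronizes the computations on these arrays.
            System.out.println(a.getArrayElement(0).asFloat() + b.getArrayElement(0).asFloat());
        }
    }
}
```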
New features
- Added the generic `AbstractArray` data structure, which is extended by `DeviceArray`, `MultiDimDeviceArray` and `MultiDimDeviceArrayView`, and provides high-level array interfaces
- Added API for prefetching
  - If enabled (and using a GPU with architecture Pascal or newer), data is prefetched to the GPU before executing a kernel, instead of relying on page faults for data transfer. It can greatly improve performance
- Added API for stream attachment
  - Always enabled when the GPU architecture is older than Pascal and the async scheduler is active. With the sync scheduler, it can be enabled manually
  - It restricts the visibility of GPU data to the specified stream
  - On architectures Pascal or newer it can provide a small performance benefit
- Added `copyTo`/`copyFrom` functions on generic arrays (Truffle-interoperable objects that expose the array API); a usage sketch follows after this list
  - Internally, the copy is implemented as a for loop, instead of using CUDA's `memcpy`
  - It is still faster than copying with loops in the host language in many cases, especially if the host code is not JIT-compiled
  - It is also used for copying data to/from DeviceArrays with column-major layout, as `memcpy` cannot copy non-contiguous data
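A sketch of `copyFrom`/`copyTo` usage from Java, assuming the polyglot API; the exact member signatures (for example, whether an explicit element count is required) are an assumption, so check the GrCUDA documentation for the authoritative form.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class CopyToFromSketch {
    public static void main(String[] args) {
        try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) {
            int n = 1000;
            float[] host = new float[n];
            for (int i = 0; i < n; i++) host[i] = i;

            Value x = ctx.eval("grcuda", "DeviceArray").execute("float", n);

            // Bulk copy host -> device; faster than setting elements one by one from the host,
            // especially when the host code is not JIT-compiled. Signature assumed: (array, count).
            x.invokeMember("copyFrom", host, n);

            float[] out = new float[n];
            x.invokeMember("copyTo", out, n);   // device -> host
            System.out.println(out[n - 1]);     // expected: 999.0
        }
    }
}
```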
Demos, benchmarks and code samples
- Added the demo used at SeptembeRSE 2021 (`demos/image_pipeline_local` and `demos/image_pipeline_web`)
  - It shows an image processing pipeline that applies a retro look to images. We have a local version and a web version that displays results in a web page
- Added the benchmark suite written in GraalPython, used in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
  - It is a collection of complex multi-kernel benchmarks meant to show the benefits of asynchronous scheduling
Miscellaneous
- Added a dependency on the `grcuda-data` submodule, used to store data, results and plots used in publications and demos
- Updated the name "grCUDA" to "GrCUDA". It looks better, doesn't it?
- Added support for Java 11 along with Java 8
- Added option to specify the location of cuBLAS and cuML with environment variables (`LIBCUBLAS_DIR` and `LIBCUML_DIR`)
- Refactored the package hierarchy to reflect changes to the current GrCUDA (e.g. `gpu -> runtime`)
- Added basic support for TruffleLogger
- Removed a number of existing deprecation warnings
- Added around 800 unit tests, with support for extensive parametrized testing and GPU mocking
- Updated documentation
- Bumped GraalVM version to 21.2
- Added scripts to set up a new machine from scratch (e.g. on OCI), plus other OCI-specific utility scripts (see `oci_setup/`)
- Added documentation on setting up IntelliJ IDEA for GrCUDA development
- Added documentation about Python benchmark suite
- Added documentation on asynchronous scheduler options