Draft of NVIDIA cuDSS (GPU sparse linear solver) integration in Ipopt #849
base: stable/3.14
Conversation
…er options (not finished yet). Will also need to modify the configure script and makefiles.
… cuDSS handles, config, data, etc... to actually call the factor/solve functions.
…echeck to be sure that everything is OK. Now trying to play with the configure scripts to understand how to link the correct libraries and provide the necessary header files (CUDA and cuDSS).
…configuration scripts to install.
…pCuDSS linear solver.
…files to build with cuDSS! Still to be solved: seg fault in the C example and with warm restart (maybe from reallocating the matrix and vectors twice?)
…) - cuDSS fails with error CUDSS_STATUS_ALLOC_FAILED (due to pointers being reallocated but not freed before Ipopt's second restart?). See the call-sequence sketch below.
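For context on the commits above, here is a minimal, self-contained sketch of the cuDSS call sequence they refer to (handle, solver config, solver data, the analysis/factorization/solve phases, and teardown). This is not the PR's code: the matrix is a toy example, the names are illustrative, and the calls follow the cuDSS 0.x C API as I understand it from the documentation, so double-check them against the version actually used.

```cpp
// Minimal cuDSS usage sketch (illustrative only, not the PR's linear-solver interface).
// Assumed build: nvcc cudss_sketch.cu -lcudss -o cudss_sketch
#include <cudss.h>
#include <cuda_runtime.h>
#include <cstdio>

#define CUDSS_CHECK(call)                                                       \
    do {                                                                        \
        cudssStatus_t s_ = (call);                                              \
        if (s_ != CUDSS_STATUS_SUCCESS) {                                       \
            std::printf("cuDSS error %d at line %d\n", (int)s_, __LINE__);      \
            return 1;                                                           \
        }                                                                       \
    } while (0)

int main()
{
    // Lower triangle of a 3x3 symmetric matrix in 0-based CSR:
    //   [ 4  1  0 ]
    //   [ 1  3  0 ]
    //   [ 0  0  2 ]
    const int64_t n = 3, nnz = 4;
    int    h_rowptr[] = {0, 1, 3, 4};
    int    h_colind[] = {0, 0, 1, 2};
    double h_val[]    = {4.0, 1.0, 3.0, 2.0};
    double h_b[]      = {1.0, 2.0, 3.0};

    int    *d_rowptr, *d_colind;
    double *d_val, *d_b, *d_x;
    cudaMalloc(&d_rowptr, sizeof(h_rowptr));
    cudaMalloc(&d_colind, sizeof(h_colind));
    cudaMalloc(&d_val,    sizeof(h_val));
    cudaMalloc(&d_b,      sizeof(h_b));
    cudaMalloc(&d_x,      n * sizeof(double));
    cudaMemcpy(d_rowptr, h_rowptr, sizeof(h_rowptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_colind, h_colind, sizeof(h_colind), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val,    h_val,    sizeof(h_val),    cudaMemcpyHostToDevice);
    cudaMemcpy(d_b,      h_b,      sizeof(h_b),      cudaMemcpyHostToDevice);

    // Handle, solver configuration and solver data (the objects mentioned in the commits).
    cudssHandle_t handle;  CUDSS_CHECK(cudssCreate(&handle));
    cudssConfig_t config;  CUDSS_CHECK(cudssConfigCreate(&config));
    cudssData_t   data;    CUDSS_CHECK(cudssDataCreate(handle, &data));

    // Wrap the CSR data (note the int64_t dimensions) and the dense rhs/solution vectors.
    cudssMatrix_t A, x, b;
    CUDSS_CHECK(cudssMatrixCreateCsr(&A, n, n, nnz, d_rowptr, NULL, d_colind, d_val,
                                     CUDA_R_32I, CUDA_R_64F,
                                     CUDSS_MTYPE_SYMMETRIC, CUDSS_MVIEW_LOWER, CUDSS_BASE_ZERO));
    CUDSS_CHECK(cudssMatrixCreateDn(&x, n, 1, n, d_x, CUDA_R_64F, CUDSS_LAYOUT_COL_MAJOR));
    CUDSS_CHECK(cudssMatrixCreateDn(&b, n, 1, n, d_b, CUDA_R_64F, CUDSS_LAYOUT_COL_MAJOR));

    // The three phases: reordering/symbolic analysis, numerical factorization, triangular solves.
    CUDSS_CHECK(cudssExecute(handle, CUDSS_PHASE_ANALYSIS,      config, data, A, x, b));
    CUDSS_CHECK(cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, config, data, A, x, b));
    CUDSS_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE,         config, data, A, x, b));
    cudaDeviceSynchronize();

    double h_x[3];
    cudaMemcpy(h_x, d_x, sizeof(h_x), cudaMemcpyDeviceToHost);
    std::printf("x = %g %g %g\n", h_x[0], h_x[1], h_x[2]);

    // Teardown: destroy the cuDSS objects before a re-solve recreates them.
    cudssMatrixDestroy(A); cudssMatrixDestroy(x); cudssMatrixDestroy(b);
    cudssDataDestroy(handle, data);
    cudssConfigDestroy(config);
    cudssDestroy(handle);
    cudaFree(d_rowptr); cudaFree(d_colind); cudaFree(d_val); cudaFree(d_b); cudaFree(d_x);
    return 0;
}
```

The teardown order at the end is the part that seems relevant to the re-solve problem: the data/matrix objects hold device allocations, so recreating them without destroying the old ones would leak device memory, which is consistent with the ALLOC_FAILED hypothesis in the last commit above.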
Yes, definitely. There have been requests for such an interface before. It's great to have someone doing the actual work for this. I'll need to find some time (and machine) to look into this in more detail and try it out.
Great then! I will continue to work on this (testing with the provided examples and trying to add the other available cuDSS features).
It would actually require a complete revamp of Ipopt and a rewrite of the nonlinear solver. I developed the solver described in the paper you mentioned, and many components would need to be adapted, including the automatic differentiation and the way memory is allocated and provided from the GPU.

Even adapting the KKT systems to be solved without an LBL' factorization would already be challenging with your PR, since saddle-point systems are not guaranteed to admit such a factorization (unlike MA27, MA57, PARDISO, or MUMPS). What the solver typically does internally is perturb the pivots to ensure they are nonzero, but that has a strong impact on the numerical factors and on the quality of the descent direction. As a result, more iterations might be needed, if convergence can even be reached.

For Ipopt, the main benefit would only come from solving the linear systems on GPU, but with all the caveats mentioned above and the additional CPU–GPU communication overhead. Note that solving sparse linear systems is mostly limited by communication rather than computation, so the main advantage of multi-GPU support would be having more memory to store all factors on the device. I'm not saying it's impossible, but given the development effort, it would essentially mean building a new solver from scratch.
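For readers less familiar with Ipopt's internals, the saddle-point (augmented) system being discussed here is, roughly, the one from the Wächter–Biegler paper:

$$
\begin{bmatrix} W_k + \Sigma_k + \delta_w I & A_k \\ A_k^{T} & -\delta_c I \end{bmatrix}
\begin{bmatrix} d_x \\ d_\lambda \end{bmatrix}
= -\begin{bmatrix} \nabla \varphi_\mu(x_k) + A_k \lambda_k \\ c(x_k) \end{bmatrix},
\qquad A_k = \nabla c(x_k),
$$

where $W_k$ is the Hessian of the Lagrangian, $\Sigma_k$ the barrier term, and $\delta_w, \delta_c \ge 0$ regularization parameters. Ipopt asks the factorization for the inertia of this matrix and increases $\delta_w$ until the inertia is $(n, m, 0)$; perturbing pivots inside the solver changes the factors this decision is based on, which is the impact on the descent direction mentioned above.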
Yes, I guess that this could then be more of a long-term project (especially for the AD part). Maybe it could be possible (and easier) to adapt some AD source-code transformation tool to evaluate the gradients and Hessian directly on the GPU.
Thank you very much for the help and the feedback! Do you think that using the LDU factorization (with global pivoting) available in cuDSS (even if this means switching from a symmetric matrix to a general matrix) would improve robustness?
If we can evaluate the objective and constraints on the GPU, it should not be too hard to perform "forward" AD on them to compute gradients, Jacobians, and Hessians. The coloring itself is purely CPU-friendly (like the ordering for KKT systems), but the directional derivatives can be evaluated on the GPU. Vector-mode AD can easily exploit the GPU to compute compressed columns of Jacobians and Hessians. The same applies to the decompression phase, which can also be efficiently vectorized on GPU.

For reference, I'm adding a blog post about these topics: https://iclr-blogposts.github.io/2025/blog/sparse-autodiff/

Another paradigm would be to use a dedicated modeling language or DSL (like AMPL, JuMP, or GAMS) where we can define the objective and constraints as generators, such that we can "kernelize" the Jacobian and Hessian when we have redundant sub-expressions. In both cases, one issue we will face is how to tell Ipopt whether the vectors are in host, device, or unified memory.
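To make the vector-mode picture concrete, here is a toy sketch (not code from this PR or from the blog post): each GPU thread evaluates the forward directional derivatives of one residual component for a small set of seed vectors, producing compressed Jacobian columns that are decompressed afterwards. The residual, the coloring, and all names are made up for illustration, and the derivative rules are hand-written rather than generated by an AD tool.

```cpp
// Toy vector-mode forward sketch: evaluate J(x) * S on the GPU, where S holds one seed
// vector per "color".  Residual: F_i(x) = x_i^2 + x_{i+1}  (F_{n-1} = x_{n-1}^2), so row i
// of the Jacobian only touches columns i and i+1, and two colors suffice.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void jacobian_times_seeds(const double* x, const double* S, double* C,
                                     int n, int ncolors)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one output row per thread
    if (i >= n) return;
    for (int c = 0; c < ncolors; ++c) {
        double sx_i  = S[i * ncolors + c];                               // seed of x_i
        double sx_ip = (i + 1 < n) ? S[(i + 1) * ncolors + c] : 0.0;     // seed of x_{i+1}
        // Forward rule for F_i = x_i^2 + x_{i+1}:  dF_i = 2*x_i*dx_i + dx_{i+1}
        C[i * ncolors + c] = 2.0 * x[i] * sx_i + sx_ip;
    }
}

int main()
{
    const int n = 8, kColors = 2;
    std::vector<double> h_x(n), h_S(n * kColors, 0.0), h_C(n * kColors);
    for (int i = 0; i < n; ++i) {
        h_x[i] = 1.0 + 0.1 * i;
        h_S[i * kColors + (i % kColors)] = 1.0;   // coloring: variable i gets color i mod 2
    }

    double *d_x, *d_S, *d_C;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_S, n * kColors * sizeof(double));
    cudaMalloc(&d_C, n * kColors * sizeof(double));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_S, h_S.data(), n * kColors * sizeof(double), cudaMemcpyHostToDevice);

    jacobian_times_seeds<<<(n + 127) / 128, 128>>>(d_x, d_S, d_C, n, kColors);
    cudaMemcpy(h_C.data(), d_C, n * kColors * sizeof(double), cudaMemcpyDeviceToHost);

    // Decompression (could also run on the GPU): entry (i, j) of J lives in the column
    // of C that corresponds to the color of variable j.
    for (int i = 0; i < n; ++i) {
        std::printf("dF%d/dx%d = %g", i, i, h_C[i * kColors + (i % kColors)]);
        if (i + 1 < n)
            std::printf(", dF%d/dx%d = %g", i, i + 1, h_C[i * kColors + ((i + 1) % kColors)]);
        std::printf("\n");
    }
    cudaFree(d_x); cudaFree(d_S); cudaFree(d_C);
    return 0;
}
```

In a real setup, the seeds would come from a coloring of the Jacobian/Hessian sparsity pattern computed on the CPU, exactly as described in the comment above.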
Yes, I highly expect the LDU factorization of cuDSS to be more robust for the KKT systems in Ipopt.
Great, then I will proceed to change the interface so that the user can choose between the two algorithms (LDL' on the symmetric matrix and LDU on the general matrix) via Ipopt options, as sketched below.
I guess this remains the main issue to be solved for "porting" it to GPU... Maybe it would be simpler to just fork Ipopt and try to develop a GPU-only version rather than trying to adapt everything to work on GPU.
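Just as a sketch of what that switch could boil down to on the cuDSS side (the option name, values, and function below are hypothetical, not from this PR): the choice mainly affects which matrix type and view are passed when the KKT matrix object is created.

```cpp
// Sketch: map a hypothetical Ipopt option value ("ldlt" vs "ldu") to the cuDSS matrix
// type/view used when wrapping the KKT matrix.  Not the PR's actual option handling.
#include <cudss.h>
#include <cstring>

static cudssStatus_t create_kkt_matrix(
    cudssMatrix_t* A, int64_t n, int64_t nnz,
    void* d_rowptr, void* d_colind, void* d_val,
    const char* algo /* "ldlt" or "ldu" (hypothetical option values) */)
{
    if (std::strcmp(algo, "ldu") == 0)
        // General matrix: both triangles must be stored (roughly doubling the nonzeros),
        // and cuDSS applies the pivoted LDU factorization discussed above.
        return cudssMatrixCreateCsr(A, n, n, nnz, d_rowptr, nullptr, d_colind, d_val,
                                    CUDA_R_32I, CUDA_R_64F,
                                    CUDSS_MTYPE_GENERAL, CUDSS_MVIEW_FULL, CUDSS_BASE_ZERO);
    // Default: symmetric indefinite LDL' from the lower triangle only.
    return cudssMatrixCreateCsr(A, n, n, nnz, d_rowptr, nullptr, d_colind, d_val,
                                CUDA_R_32I, CUDA_R_64F,
                                CUDSS_MTYPE_SYMMETRIC, CUDSS_MVIEW_LOWER, CUDSS_BASE_ZERO);
}
```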
… for AppVeyor tests).
… to work with cuDSS and avoid ambiguities (some calls need int64_t values for the matrix dimensions, for example; I think it's better to always use int64_t).
…lls with the same problem. Compiled and tested with the IPOPT_SINGLE option.
Hi, I am currently trying to integrate NVIDIA cuDSS 0.7.0 using CUDA 13 (https://docs.nvidia.com/cuda/cudss/) into Ipopt as an additional sparse linear solver option (running on the GPU).
Would you be interested in adding such a feature to Ipopt?
Unfortunately, I have no experience with autotools and how to correctly integrate/compile the new files using the existing toolchain (nvcc is needed to compile at least one file). For now I have added some code to the configure script and the Makefiles to compile everything. The procedure I am currently using is to first build a shared library with nvcc (libCuDSS folder) and then link this library directly to Ipopt.
I was able to run the tests shipped with Ipopt and everything seems to work fine (correct results). There is still an issue when trying to re-solve a problem (solving two consecutive times, as is done in hs071_c.c, for example); a minimal sketch of that calling pattern is shown below. I am still working on this and performing more checks to verify the correctness of the implementation.
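To make the re-solve scenario concrete, here is a stripped-down sketch of that pattern against Ipopt's C interface (assuming the 3.14 type names ipindex/ipnumber/bool; the one-variable toy problem and the "cudss" option value are placeholders, not taken from this PR). Calling IpoptSolve twice on the same IpoptProblem is the case that currently fails.

```cpp
/* Re-solve reproducer sketch: min (x - 1)^2, 0 <= x <= 5, no constraints,
 * solved twice with the same IpoptProblem. */
#include <cstdio>
#include "IpStdCInterface.h"

static bool eval_f(ipindex, ipnumber* x, bool, ipnumber* obj_value, UserDataPtr)
{
    *obj_value = (x[0] - 1.0) * (x[0] - 1.0);
    return true;
}

static bool eval_grad_f(ipindex, ipnumber* x, bool, ipnumber* grad_f, UserDataPtr)
{
    grad_f[0] = 2.0 * (x[0] - 1.0);
    return true;
}

static bool eval_g(ipindex, ipnumber*, bool, ipindex, ipnumber*, UserDataPtr)
{
    return true;   /* m = 0: nothing to do */
}

static bool eval_jac_g(ipindex, ipnumber*, bool, ipindex, ipindex,
                       ipindex*, ipindex*, ipnumber*, UserDataPtr)
{
    return true;   /* empty constraint Jacobian */
}

static bool eval_h(ipindex, ipnumber*, bool, ipnumber obj_factor, ipindex,
                   ipnumber*, bool, ipindex, ipindex* iRow, ipindex* jCol,
                   ipnumber* values, UserDataPtr)
{
    if (values == NULL) { iRow[0] = 0; jCol[0] = 0; }   /* sparsity of the 1x1 Hessian */
    else                { values[0] = 2.0 * obj_factor; }
    return true;
}

int main()
{
    ipnumber x_L[1] = {0.0}, x_U[1] = {5.0};
    ipnumber g_L[1] = {0.0}, g_U[1] = {0.0};   /* unused, m = 0 */
    ipnumber x[1]   = {3.0};
    ipnumber obj;

    IpoptProblem nlp = CreateIpoptProblem(1, x_L, x_U, 0, g_L, g_U, 0, 1, 0,
                                          &eval_f, &eval_g, &eval_grad_f, &eval_jac_g, &eval_h);
    AddIpoptStrOption(nlp, (char*)"linear_solver", (char*)"cudss");   /* placeholder name */

    /* First solve works; the second solve on the same problem object is the case that
     * currently ends in a segfault / CUDSS_STATUS_ALLOC_FAILED. */
    enum ApplicationReturnStatus status1 = IpoptSolve(nlp, x, NULL, &obj, NULL, NULL, NULL, NULL);
    enum ApplicationReturnStatus status2 = IpoptSolve(nlp, x, NULL, &obj, NULL, NULL, NULL, NULL);
    std::printf("first solve: %d, second solve: %d, x = %g\n", (int)status1, (int)status2, x[0]);

    FreeIpoptProblem(nlp);
    return (status1 == Solve_Succeeded && status2 == Solve_Succeeded) ? 0 : 1;
}
```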
I would like to test it with the IPOPT_SINGLE option activated, too.
P.S.: In the future it would be nice to add the multi-GPU and multi-GPU multi-node support provided by cuDSS, and maybe port Ipopt to work natively on the GPU (avoiding expensive device-to-host and host-to-device data copies), as was done in https://arxiv.org/html/2405.14236v2.