Conversation

@tonino102008

Hi, I am currently trying to integrate NVIDIA cuDSS 0.7.0 using CUDA 13 (https://docs.nvidia.com/cuda/cudss/) in Ipopt as a further sparse linear solver option (working on GPU).

Would you be interested in adding such a feature to Ipopt?

Unfortunately I have no experience with autotools or with how to correctly integrate and compile the new files using the existing toolchain (nvcc is needed to compile at least one file). For now I have added some code to the configure script and to the Makefiles to compile everything. The procedure I am currently using is to first generate a shared library with nvcc (the libCuDSS folder) and then link this library directly to Ipopt.

I was able to run the tests shipped with Ipopt and everything seems to work fine (correct results). There is still an issue when trying to re-solve a problem (solving two consecutive times, as is done in hs071_c.c, for example). I am still working on this and performing more checks to verify the correctness of the implementation.

I would like to test it with the IPOPT_SINGLE option activated, too.

P.S.: In the future it would be nice to add the multi-GPU and multi-GPU-multi-node support provided by cuDSS, and maybe port Ipopt to natively work on GPU (avoiding expensive device-to-host, and vice versa, data copies), as was done in https://arxiv.org/html/2405.14236v2.

tonino102008 and others added 12 commits October 19, 2025 20:03
…er options (not finished yet). Will need to modify also the configure and makefile
… cuDSS handles, config, data, etc... to actually call the factor/solve functions.
…echeck to be sure that everything is ok. Now trying to play with configure scripts to understand how to link correct libraries and provide necessary header files (CUDA and cuDSS).
…files to build with cuDSS! Still remains to be solved seg fault in c and with warm restart (maybe trying to reallocate matrix and vector 2 times?)
…) - cudss fails with error CUDSS_STATUS_ALLOC_FAILED (due to pointers being reallocated but not freed before ipopt 2nd restart?).
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@svigerske
Member

Would you be interested in adding such a feature to Ipopt?

Yes, definitely. There have been requests for such an interface before. It's great to have someone doing the actual work for this.

I'll need to find some time (and machine) to look into this in more detail and try it out.

@tonino102008
Author

Great then! I will continue to work on this (testing with the provided examples and trying to add the other available cuDSS features).
I am still waiting to sign the contributor agreement (I have asked the company I work for whether it is ok to contribute... I think that by the start of next week I will have a - positive! - answer).

@amontoison
Contributor

amontoison commented Oct 31, 2025

P.s.: In the future it would be nice to add the multi-gpu and the multi-gpu-multi-node support provided by cuDSS and maybe port ipopt to natively work on GPU (avoiding expensive device to host - and viceversa - data copies) as was done in https://arxiv.org/html/2405.14236v2.

It would actually require a complete revamp of Ipopt and a rewrite of the nonlinear solver. I developed the solver described in the paper you mentioned, and many components would need to be adapted, including the automatic differentiation and the way memory is allocated and provided from the GPU.

Even adapting the KKT systems to be solved without an LBL' factorization would already be challenging with your PR, since saddle-point systems are not guaranteed to admit such a factorization (unlike MA27, MA57, PARDISO, or MUMPS).
cuDSS only supports LDL' factorization (where D is diagonal) with 1x1 pivoting.

What the solver typically does internally is perturb the pivots to ensure they are nonzero, but that has a strong impact on the numerical factors and on the quality of the descent direction. As a result, more iterations might be needed, if convergence can even be reached.

For Ipopt, the main benefit would only come from solving the linear systems on GPU, but with all the caveats mentioned above and the additional CPU–GPU communication overhead. Note that solving sparse linear systems is mostly limited by communication rather than computation, so the main advantage of multi-GPU support would be having more memory to store all factors on the device.

I’m not saying it's impossible, but given the development effort, it would essentially mean building a new solver from scratch.

@tonino102008
Author

tonino102008 commented Nov 1, 2025

It would actually require a complete revamp of Ipopt and a rewrite of the nonlinear solver. I developed the solver described in the paper you mentioned, and many components would need to be adapted, including the automatic differentiation and the way memory is allocated and provided from the GPU.

Yes, I guess that this could then be more of a long-term project (especially for the AD part). Maybe it would be possible (and easier) to adapt an AD source-code transformation tool to evaluate the gradients and Hessians directly on GPU.

Even adapting the KKT systems to be solved without an LBL' factorization would already be challenging with your PR, since saddle-point systems are not guaranteed to admit such a factorization (unlike MA27, MA57, PARDISO, or MUMPS).
cuDSS only supports LDL' factorization (where D is diagonal) with 1x1 pivoting.

What the solver typically does internally is perturb the pivots to ensure they are nonzero, but that has a strong impact on the numerical factors and on the quality of the descent direction. As a result, more iterations might be needed, if convergence can even be reached.

Thank you very much for the help and the feedback! Do you think that using the LDU factorization (with global pivoting) available in cuDSS (even if this means passing from a symmetric matrix to a general matrix) would improve robustness?

@amontoison
Contributor

It would actually require a complete revamp of Ipopt and a rewrite of the nonlinear solver. I developed the solver described in the paper you mentioned, and many components would need to be adapted, including the automatic differentiation and the way memory is allocated and provided from the GPU.

Yes, I guess that this could then be more of a long-term project (especially for the AD part). Maybe it would be possible (and easier) to adapt an AD source-code transformation tool to evaluate the gradients and Hessians directly on GPU.

If we can evaluate the objective and constraints on the GPU, it should not be too hard to perform "forward" AD on them to compute gradients, Jacobians, and Hessians.
Note that we recently worked on a feature in Tapenade to compute the sparsity patterns of Jacobians and Hessians for any C or Fortran code.
We don't need a GPU version of the original code to recover the sparsity pattern on GPU, because this is a kind of boolean AD where the only operations involved are bitwise XOR, which can easily be GPU-kernelized.

The coloring itself is purely CPU-friendly (like the ordering for KKT systems), but the directional derivatives can be evaluated on the GPU. Vector-mode AD can easily exploit the GPU to compute compressed columns of Jacobians and Hessians. The same applies to the decompression phase, which can also be efficiently vectorized on GPU.

For reference, I'm adding a blog post about these topics: https://iclr-blogposts.github.io/2025/blog/sparse-autodiff/

Another paradigm would be to use a dedicated modeling language or DSL (like AMPL, JuMP, or GAMS) where we can define the objective and constraints as generators, such that we can "kernelize" the Jacobian and Hessian when we have redundant sub-expressions.
This happens quite often when the constraints are discretized ODEs or PDEs (optimal control is a nice example!), or in energy problems (DCOPF / ACOPF / SCOPF), where the physical equations (Kirchhoff's laws) lead to a small subset of operations that are well suited for SIMD parallelism.
This is what we did in ExaModels.jl, although of course it remains problem-dependent.

In both cases, one issue we will face is how to tell Ipopt whether the vectors are in host, device, or unified memory.
Ipopt was not originally designed for this, and everything passed as void* is expected to be host memory.

Even adapting the KKT systems to be solved without an LBL' factorization would already be challenging with your PR, since saddle-point systems are not guaranteed to admit such a factorization (unlike MA27, MA57, PARDISO, or MUMPS).
cuDSS only supports LDL' factorization (where D is diagonal) with 1x1 pivoting.
What the solver typically does internally is perturb the pivots to ensure they are nonzero, but that has a strong impact on the numerical factors and on the quality of the descent direction. As a result, more iterations might be needed, if convergence can even be reached.

Thank you very much for the help and the feedback! Do you think that using the LDU factorization (with global pivoting) available in cuDSS (even if this means passing from a symmetric matrix to a general matrix) would improve robustness?

Yes, I highly expect the LDU factorization of cuDSS to be more robust for the KKT systems in Ipopt.

@tonino102008
Author

Yes, I highly expect the LDU factorization of cuDSS to be more robust for the KKT systems in Ipopt.

Great, then I will proceed to change the interface so that the user can choose between the two algorithms (LDL' for the symmetric matrix and LDU for the general matrix) via Ipopt options.

In both cases, one issue we will face is how to tell Ipopt whether the vectors are in host, device, or unified memory.
Ipopt was not originally designed for this, and everything passed as void* is expected to be host memory.

I guess this remains the main issue to be solved for "porting" it to GPU... Maybe it would be simpler to just fork Ipopt and develop a GPU-only version rather than trying to adapt everything to work on GPU.

… to work with cuDSS and avoid ambiguities (some calls need int64_t values for the matrix dimension for example... think it's better to always use int64_t)
…lls with same problem. Compiled and tested with IPOPT_SINGLE option.