ModelStar merge; needs review due to divergent histories #284
m-usama-z wants to merge 28 commits into verivital:master from
Conversation
Pull Request Overview
This PR adds significant new functionality for neural network verification and weight perturbation analysis, particularly focused on RUL (Remaining Useful Life) prediction for Collins turbofan engines. The changes include new verification capabilities, memory management improvements, GPU support enhancements, and parallelization of verification tasks.
- Added weight perturbation support for robustness analysis of neural networks
- Implemented parallelized verification with memory-aware LP solver management
- Enhanced GPU/CPU device switching and memory handling for large star sets
Reviewed Changes
Copilot reviewed 50 out of 54 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| monotonicity_CI_shift10_w20.vnnlib | VNN-LIB specification file for RUL monotonicity property verification with 400 input variables |
| collins_RUL_test.m | Test script for verifying RUL CNN with weight perturbations |
| verify_specification.m | Added parallelized verification with memory management and LP scheduling |
| lpsolver.m | Changed default solver to Gurobi with fallback logic and improved error handling |
| Star.m | Added static helper methods and optimized range estimation for unconstrained neurons |
| ImageStar.m | Optimized range computation using vectorized operations and improved GPU handling |
| NN.m | Major additions: weight perturbation, sampling, memory management, and evaluation improvements |
| Layer files | Added toCPU methods across all layers for CPU/GPU switching |
| FeatureInputLayer.m | Contains typo in affineMap call |
code/nnv/engine/utils/lpsolver.m
Outdated
if ~any(strcmp(varargin{1}, ["valid"]))
    error("Should not need linprog; did gurobi cause a problem, or was it not used at all? Maybe adjust gurobi's feasibility tolerance...")
end
The error check assumes varargin{1} exists, but varargin could be empty. This will cause an indexing error if lpsolver is called without the optional 10th argument. Add a length check before accessing varargin{1}.
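A minimal sketch of the suggested guard (MATLAB; this assumes the surrounding lpsolver context and is not the actual patch):

```matlab
% Sketch: guard against an empty varargin before indexing into it.
% The error condition itself is unchanged from the snippet above.
if isempty(varargin) || ~any(strcmp(varargin{1}, "valid"))
    error("Should not need linprog; did gurobi cause a problem, or was it not used at all? Maybe adjust gurobi's feasibility tolerance...")
end
```

With short-circuit evaluation (`||`), `varargin{1}` is only touched when the cell array is non-empty.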
code/nnv/engine/utils/lpsolver.m
Outdated
% solve using glpk (linprog as backup)
elseif strcmp(lp_solver, 'glpk')
    error("Should not need to use glpk. Did gurobi and linprog both fail?")
Error message is overly restrictive. The glpk solver path may be legitimately needed in some cases, and this unconditional error prevents its use entirely. Consider making this a warning or removing it if glpk is meant to be disabled.
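One way to realize the reviewer's suggestion is to demote the error to a warning so the glpk path remains usable as a last-resort fallback (a sketch of the fragment above, not the actual lpsolver.m code; the glpk solve call itself is elided):

```matlab
% Sketch: warn instead of erroring, keeping glpk available as a fallback.
% solve using glpk (linprog as backup)
elseif strcmp(lp_solver, 'glpk')
    warning("Falling back to glpk; gurobi and/or linprog may have failed or been skipped.");
    % ... the existing glpk solve path would follow here ...
```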
|
First comment: @m-usama-z, please make sure the tests pass (the action failed); please update on your end until they pass, or let us know more about why they don't. Also, please put all your email text here in the request so we don't have a combination of emails and other discussions happening in the git. Once that is done, @sammsaski @atumlin @HCWDavid (and @Navidhashemicodes and @mldiego if you have any cycles in your off time as we prepare the NNV 3 paper), please review the changes, especially the points raised by @m-usama-z regarding the hard-to-merge parts.
fixed typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Thank you for initiating the review process, Professor @ttj.

Failed Workflow

Regarding the failed workflow action, it was due to the error:
I updated the CI file to use v2 and re-ran the workflow in the fork. Upon that, the workflow failed due to lack of space during Python package installation. I am not sure about the best way to deal with this, so any advice would be appreciated.

Code Review Before Merging

The differences between my code and NNV include, besides the ModelStar additions for weight perturbation analysis, many differences that arose due to divergent development in scripts that are not directly concerned with the base implementation of ModelStar. Hence, while the number of affected files is large, the number of changes per file is small for most files. Only a few files are changed extensively:
I would appreciate it if members of the team could spare some time to look at these changes and coordinate with me where further changes are required. I have not resolved a number of differences that appeared due to divergent development because I am not sure what to do about them; I am not listing all minor differences here, as it is more convenient to just look at them.

However, it is important to note that my pull request will "appear mergeable" because it is a simple change of code devoid of the past history of the repository, which would have assisted merging through pull-and-push (for my development, I had cloned/copied the code instead of forking and keeping track of the original git history). It would therefore be necessary for a member of your team who is familiar with the recent development history to eyeball, at the very least, the deleted/modified functions, without relying on automatic merge to take care of these. The functions that are entirely new should be safer to process/prune later.

Some Miscellaneous Points
Thank you very much for your time and support, and I look forward to working with you.
|
thanks @m-usama-z! I guess the CI stuff is currently broken due to some environment issues, so don't worry about that; we'll have to look into it on our end. However, can you please make sure all the tests run fine locally, and/or confirm you've done that? If there are any problems, please let us know. We don't need a ton of tests, just testing anything new you've added ideally (the example you've added on RUL of course helps as a regression test), as well as ensuring none of the existing tests break (especially given the number of changes).
|
Thank you for letting me know about the CI problem, Professor @ttj. I ran the existing tests locally and found a few issues that I could not resolve because they already exist in a fresh clone of NNV:
I will work on adding tests for the new features in parallel with my other deadlines.
|
Hello Professor @ttj, I have added tests for the three major functionalities added to NNV:
I hope this covers the major changes. I would be grateful if a member of your team could take a quick look at the changes in this PR and let me know if further tests or modifications are needed. Thank you for your time and consideration.
|
With regard to an example for inclusion in the paper: we may use an MNIST CNN. I do not know of any existing framework for verification against convolutional-layer weight perturbations, so we should not need to compare these results with any other approaches. The code for the Conv2D test,
Pull request overview
Copilot reviewed 51 out of 55 changed files in this pull request and generated no new comments.
|
thanks Usama! Re some of the questions:
Hopefully Samuel/Diego/Anne will have some time to look over the other changes and see if there are any concerns, or whether we can just merge. I am working on a new set of tests that will also help us with the merge and soundness regression checking; probably we will merge your changes first, then I will merge those and see if any problems are introduced. But it will be good to have the other set of eyes on this first.
|
testing review @claude |
|
Claude encountered an error.

I'll analyze this and get back to you.
|
I have removed the toCPU functions for all the cases where I found them to be absent in NNV's code, except the Conv2D layer: layer-wide perturbations in Conv2D require many convolutions, and GPU usage makes them faster, but parallelizing them can cause GPU memory to run out, so my code switches its parameters to CPU if GPU memory runs low. I also deleted some utility function files that were absent from the NNV code and were not used by me either. I have also added polyfitn in the utils folder. I have, in general, checked the logical coherence of changes to avoid breaking existing functionality. There are some changes, however, that I would suggest at least a quick review of before merging:
Other than these, I do not have any logical confusions or any observations to make about things breaking.
|
thanks Usama, sounds great! We will try to check it over soon; I have been working on an improved test suite that will hopefully also help us identify whether any problems are introduced. Hopefully @mldiego @sammsaski @atumlin can help check some of the details. For the solver, I thought we had made this globally configurable, but I will check on it. One quick question as we start to try to merge: can you please point me at a simple weight perturbation example I can run, to see if everything is looking appropriate? I think you have this set up in tests or examples from the conversation, but direct pointer(s) to what we should run will help make sure things are working properly for your additions.
|
Sure, Professor. Running the tests in

Regarding the lpsolver being globally configurable, I was unaware of that, in part because linprog was hardcoded in some locations. I appreciate your team checking up on that. Let me list the hardcoded functions I know of here:
I will make whatever changes are advised with regard to global configurability of lpsolver.
Pull request overview
Copilot reviewed 29 out of 34 changed files in this pull request and generated no new comments.
|
For the upcoming tool paper, will using Gurobi introduce any issues for the artifact submission part of it? That is another license we will be responsible for. It might make the most sense to just use linprog for now and test with Gurobi later on. Were you able to reproduce the accuracy issues you were seeing? |
|
Hello @sammsaski,

Regarding linprog: unfortunately, I would have to spend some time trying to reconstruct a case where linprog was failing, so I went ahead and changed the defaults back to linprog for the time being. Later on, it should be sufficient to simply be able to change the solver globally as needed (which is not the case right now, due to it being hardcoded in some locations); the user of NNV could then use Gurobi as desired for better accuracy. This global configurability update could be done after the submission.

As for the experiment, I have added the multi-layer MNIST perturbation experiment you requested in the

Last but not least, this code should work as-is if the compute has 120GB of free memory (the code appears to be using 104GB in my testing); otherwise, please change the memory management configuration in lines 96-101. Using the commented config (lines 96-98) instead of the current config (lines 99-101) enables automatic memory management, but that can result in frequent recreation of the parallel pool to adjust the number of workers, which can make the code rather slow. Hopefully, we should not need to worry about it, and the current setting (64 workers throughout, without automatic memory management) should work because this is a rather small, fully-connected network. The reason automatic memory management can make the code slow is that I stayed conservative in the data points used for polynomial estimation of memory usage (based on the size of the inequalities matrix), in order to be on the safer side and avoid crashing MATLAB. Also, memory management has only been implemented in the

The simpler alternative to automatic memory management on low-memory machines is to simply reduce the number of workers from 64 to a quantity appropriate for the memory available, in lines 99-101 (if parallelization is not desired at all, all of lines 96-101 should be commented out).

However, I just thought I would take this chance to explain some important characteristics of the automatic memory management implementation.
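As a rough illustration of the mechanism described above (all variable names and numbers here are hypothetical, not the actual NN.m implementation): conservative (matrix size, memory) samples are fit with a polynomial, and the predicted per-LP memory then bounds the number of parallel workers.

```matlab
% Hypothetical sketch of memory-aware worker sizing (not the actual NN.m code).
% Conservative samples: observed peak memory (GB) for LPs whose inequality
% (constraint) matrices have the given number of rows.
n_rows = [1e3; 1e4; 5e4; 1e5];        % assumed measurements
mem_gb = [0.2; 1.1; 5.0; 9.8];        % assumed measurements
p = polyfit(n_rows, mem_gb, 2);       % quadratic fit, as an example

n_current   = 8e4;                    % rows of the current LP's constraint matrix
per_lp_gb   = polyval(p, n_current);  % predicted memory per LP instance
free_gb     = 120;                    % free memory available on the machine
max_workers = 64;
n_workers   = min(max_workers, max(1, floor(free_gb / per_lp_gb)));
```

Recreating the parallel pool whenever the computed worker count changes is what makes the automatic mode slow, as noted above.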
|
@sammsaski, apologies, just one more important thing I should have mentioned (although you would notice it soon during testing): the experiment as it stands is going to take hours to verify 100 images (maybe around 4 hours). If a smaller compute time is desirable while keeping the number of images large, the option to perturb only a single layer instead of 2 layers is still there (I provided this code because you selected the two-layer perturbation experiment, but the single-layer experiment would be faster). However, if a compute time of a few hours is not a concern, then there should be nothing to worry about. Please let me know if a different experiment is required.
|
thanks Usama and Samuel! For the purpose of repeatability, I would suggest we definitely have something with lower compute requirements (both memory and total runtime) as a demonstration in the paper. The artifact evaluation reviewers are not going to be able to do that; generally it is reasonable to assume they have a laptop and will spend maybe a few hours on this in total. If any other existing example is sufficient for that, great, but needing >32GB RAM (and maybe even >16GB) is a stretch, and for total runtime we should aim for an hour max (as there will also be other experiments that take some time). Basically, we just need a demo and sanity check to illustrate the capabilities of the weight perturbations, not a full experimental analysis.
|
Thank you for that explanation, Professor @ttj. In that case, one option is to simply compare output ranges of our framework with ranges from an existing work for a single image; that works as a demo as well as a check. Another option is layer-wise perturbation of ACAS Xu and verification of vnnlib properties. Since it is a small NN, this quickly covers many layers (the caveat is that these properties' input specifications are not properly used; rather, the input upper and lower bounds are averaged to get a single input, for which the property is verified in the presence of weight perturbations). Last but not least, we could perhaps perturb only the last 2 or 3 layers of the MNIST MLP, one at a time, if image classification is preferred over ACAS Xu. @sammsaski, please let me know which is preferable, and I can help set it up.
|
I think only perturbing the last 2 or 3 layers for the MNIST MLP would be best |
|
I have added code for single-layer perturbation in the directory

The code for plotting is in the
|
great, thanks Usama! I will plan to merge things tonight or tomorrow morning. The test suite I've made is about ready, so I'll try to run your perturbation examples and the regression test suite to identify any issues, but @sammsaski please let me know if you notice anything that needs further review. I have fixed some minor issues while creating the test suite as well, so it will be good if my changes are also reviewed, as there are quite a few updates. As the paper deadline is coming up quickly, please focus on finalizing that and any further updates it needs based on the status; the artifact evaluation will come afterward, with more time for that, but I think we'll be in reasonable shape.
|
Sounds good. I've been taking a look and am set up to run tests tonight to verify that no soundness issues were introduced by the changes.
code/nnv/engine/nn/NN.m
Outdated
        end
    end

    function p = fit_poly_to_mem_consumption_of_LP(obj)
seems like this maybe shouldn't belong in this file...?
    end

    % for ModelStar experiments. Can be removed.
    function net = acas_combine_affine_layer_with_fc(obj, args)
yes, this definitely doesn't belong here, this is for an example, not the general NN class
I've gone over this one in a bit of detail. I think quite a number of these changes may be problematic, unfortunately, and I am not sure we should merge in the current state. Quite a few things appear to have been temporary changes, or don't make sense to be in this file, along with some other exemplars (eg the modifications to some layers), as I've gotten into the details.
As the FM paper is going to be due shortly, I would suggest we focus just on the paper right now and include the high-level description of the perturbations; then we will need to significantly refactor the weight perturbation parts before the artifact evaluation.
I would suggest that, for layers that have weight perturbations, a new layer type is introduced as a subclass or extension of the existing layer types (eg, something like Conv2DLayerWP.m, which would subclass from Conv2DLayer and allow for more easily adding the weight perturbation fields). I think this will allow for cleaner handling of the weight-perturbation-specific parts, versus modifying the existing codebase all over to handle this specific use case. Further, a lot of the parallel processing handling I think needs to be handled better, or appears like it was for some temporary experimentation. But maybe I am missing something.
If all the tests are passing with all these changes, though, maybe it is okay; @sammsaski please weigh in.
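A minimal sketch of the subclassing idea (the class name Conv2DLayerWP and the property names are hypothetical; only Conv2DLayer is an existing NNV class):

```matlab
% Hypothetical sketch: keep weight-perturbation state out of the base class
% by subclassing the existing layer type.
classdef Conv2DLayerWP < Conv2DLayer
    properties
        WeightLB   % assumed field: lower bounds of the perturbed weights
        WeightUB   % assumed field: upper bounds of the perturbed weights
    end
    methods
        function obj = Conv2DLayerWP(baseArgs, wLB, wUB)
            % Reuse the base-class construction, then attach the bounds.
            obj = obj@Conv2DLayer(baseArgs{:});
            obj.WeightLB = wLB;
            obj.WeightUB = wUB;
        end
        % reach(...) would be overridden here to account for the weight
        % intervals, leaving Conv2DLayer's behavior untouched otherwise.
    end
end
```

This keeps the perturbation-specific fields and reachability logic in one file, so the unperturbed Conv2DLayer code path stays unmodified.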
|
Thank you for the feedback, Professor @ttj. I had a similar idea early on: the weight-perturbation-specific changes could be moved over, and the rest, like memory management, could be worked out later, especially if it is not going to be mentioned in the paper. The memory usage prediction for LPs using a polynomial is indeed experimental; it is not a finalized feature, and it has been implemented in one variant of the reachability function only.

If the only experiment that is going to be in the paper is the aforementioned MNIST MLP experiment, it appears that all that would have to be finalized for now is the fully-connected (FC) layer weight perturbation processing for the single-layer case, which is relatively simple (the multi-layer perturbation part of the code is a bit more involved). If I'm not mistaken, the artifact submission deadline is Feb 13? That should give quite some time to integrate the rest of the desirable changes more cleanly, perhaps over multiple PRs, to streamline the review and integration process. Even for now, though, the code for the MNIST MLP experiment discussed above would have to be changed in sync with the changes in the implementation of the FC layer weight perturbations.

I'm not sure if creating subclasses is better (I don't have much experience with inheritance in MATLAB), but if your team finds it easier to modify the NN layers in that fashion, that may be fine. One option is to try implementing just the FC layer single-layer perturbation case first, and testing it. If that change can be made quickly, the experiment code,
|
thanks Usama! Yes, I think for the paper, given the deadline, if it is simple to include just a small set of changes for the example we include in the paper, that would be ideal; then we can focus on merging and refactoring the full set of changes afterward. As all the proof of concept is there and implemented, I think it is fine to include and highlight this new support, but we should refactor before incorporating all these changes. I will go ahead and finalize my test suite and merge that first; this will also hopefully help with your changes when we merge the full ones, as I have at least some actual regression tests now (previously most of the tests were visual, with not so many checking results). For how to implement, I think anything is fine, but the subclass suggestion was partly to hopefully help keep the code separate: it seems there are special-purpose things that need to be done for handling the weight perturbations, which as of now are getting mixed in with the rest of the analysis code, but maybe I have missed something. Similarly, all examples and things that are not general features should definitely be in example/test files tied to specific benchmarks like acas, mnist, etc., and likewise utilities should be kept separate where appropriate, and not in the main classes.
|
Noted, Professor @ttj. I will try to create a separate pull request with changes for the fully-connected layer only by tonight. However, the modification of the experiment's code will likely get pushed to tomorrow; I hope to finish it by tomorrow. I will keep the FC layer's code in the same class (

For the cleaning up of the rest of the code of
No description provided.