Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf-optimized WrapCylinder::_make_spiral_path #3164

Closed
wants to merge 5 commits into from

Conversation

adamkewley
Copy link
Contributor

@adamkewley adamkewley commented Mar 16, 2022

This is purely a perf change that aims to be a logically equivalent implementation, but with a bunch of computation skipped.

When I was doing the perf passes on recent PRs I noticed that RajagopalModel, which is wrapping heavy, spends a lot of computation in WrapCylinder::_make_spiral_path. I then investigated the method and found that it's doing what appears to be a large amount of computation redundantly. Things like re-fetching radiuses, matrix multiplications, computing closest points to lines in a suboptimal way (if some vectors are known to be unit vectors, some steps of that alg can be skipped).

So I carefully refactored the method--anyone who's worked near this part of the codebase knows how fraught with terror that process is 😉--and I believe I managed to figure out what it's probably doing (it looks like it's just generating points on a cylinder's surface). I then gradually const-qualified things, renamed them for clarity, jumbled things around, etc. and ended up with this PR.

The overall story is that the method should be (hopefully) easier to read now--which is nice--but it's also faster, which has an impact on macro-scale benchmarks:

Test Name lhs [secs] σ [secs] rhs [secs] σ [secs] Speedup
RajagopalModel 5.57 0.01 5.12 0.00 1.09
Arm26 0.15 0.00 0.15 0.00 1.01
ToyDropLanding 10.49 0.00 10.58 0.03 0.99
Gait2354 0.18 0.00 0.19 0.00 1.00
passive_dynamic 2.62 0.00 2.65 0.00 0.99
passive_dynamic_noanalysis 1.75 0.00 1.76 0.00 1.00
ToyDropLanding_nomuscles 0.27 0.00 0.28 0.00 1.00

Effectively, it buys a ~10 % perf-bump in a model that uses a lot of cylinders. I imagine with a little bit more, uh, massaging I'd be able to improve on this a little bit.

Brief summary of changes

  • Added CalcDistanceSquaredPointToLine that takes a UnitVec (faster)
  • const-qualified some private methods in WrapCylinder
  • const-qualified some WrapMath methods
  • Heavily refactored _make_spiral_path

Testing I've completed

  • Test suite, perf benchmark - need to double-check it in the UI etc

Looking for feedback on...

  • Usual stuff, correctness check (if you dare)

CHANGELOG.md (choose one)

  • Can update with a generic-sounding message along the lines of "Improved wrapping performance for WrapCylinder"

This change is Reviewable

Copy link
Member

@aymanhab aymanhab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamkewley thanks for taking this on. Other than changing variable names and const correctness I wouldn't touch this code in any way since we have no way to test the jungle of corner cases that are "supported" by the older code. If you revert these other changes/refactoring it will be easier to review/approve. Thank you. We can rewrite, actually I did write the math to "solve" for the points without any of the gymnastics involved here but didn't proceed for the same reason.

@adamkewley
Copy link
Contributor Author

adamkewley commented Mar 17, 2022

The primary goal of this PR is performance, where dropping those refactors would turn this into a fairly trivial tidy-up.

I am not particularly interested in a trivial tidy-up with this PR specifically. I want the 10pct performance boost it is able to yield without having to change the overall algorithim. This is a mathematical rearrangement of the same alg, keeping warts like the 2mm wrapping placement (now on my future hitlist, because it wont work with insect models) and goto's (also on the hitlist) in place.

This PR was created by gradually mathematically rearranging the existing implementation to do the same thing, but without some of the intermediate steps that are particularly harsh on the CPU.

Example: The existing implementation was generating a rotation matrix (m), generating another rotation matrix (R), inverting it (~R), and multiplying it (n = m * ~R), followed by only using one row in the result (n[1]). The mathematical rearrangement notices that the result vector (n[1]) is effectively the dot product of m[1] with each column of R (a right-multiplication: m[1] * R), this is equivalent to left-multiplication with ~R, so we can drop the inverse (~~R == R) and rewrite n = m * ~R as v = R * m[1], where v == n[1]. After this change, only one row of m is ever used (m[1]). An earlier step of calc_wrap_spiral sets m[1] = axis, so drop m completely and calculate v = R*axis. Dropping m removes a lot of the code that was setting it up m (eg cross products for generating an orthogonal axis set, Matrix::set calls, etc.) and some transpose ops.

Similar deal with the point-on-line alg. There's steps in it that mean that you can skip work if you know some preconditions hold (eg one vector is of unit length, because it's always the unit cylinder vector).

Mathematically, the main concern I have with this refactor is ensuring that NormalizeOrZero wasn't injecting unusual behavior into parts of it - that is something that should be reviewed.

You are more familiar with the pitfalls, though, so what should I be looking out for, specifically? What problems did you encounter when you gave this code area a swing? Are there some example files/scripts I can pump through this to see what explodes?

I'm extremely keen to get this perf uptick, given that it's almost entirely isolated to one small block of maths that everyone can review mostly in isolation. The alternative is something like a wide-ranging perf change that hits various APIs, this is comparatively simple, once we all review it.

I do not believe it is wise policy, long-term, for us to avoid these code sections. I think there needs to be a dialogue about what kind of tests, verification, etc. would be "good enough" (keeping in mind what's already lurking in there) for us to confidently start refactoring this wrapping code. Clearly, there's a general ambition to sort it out :)

Copy link
Member

@tkuchida tkuchida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to upgrade algorithms that are major bottlenecks in the current workflows, particularly the ones that affect lots of users/models/scenarios. One question: How much of this code is checked by regression tests? Someone could go through and confirm that all the math is the same as before, but it would be a lot easier to review this PR if it were preceded by one that adds regression tests for any cases that aren't currently covered. This PR would then be 👍 as long as all these tests pass. Perhaps @aymanhab already has tests for the "jungle 🌴 of corner cases" that need to be checked, or has a list of the tests that are missing?

static double CalcWrapAngle(const UnitVec3& a, const UnitVec3& b, bool reverse)
{
double ang = acos(SimTK::dot(a, b));
return !reverse ? ang : -(2.0*SimTK_PI - ang);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see the logic but return reverse ? (ang - 2.0*SimTK_PI) : ang; might be easier to read. Also not sure what reverse is used for… if it comes from the old line 644

if (far_side_wrap)
    theta = 2.0 * SimTK_PI - theta;

then why is it now ang - 2.0*SimTK_PI rather than 2.0*SimTK_PI - ang?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this one, it's because the original code was using a variable called sense to flip the magnitude of the angle so, in effect, it was being double-flipped by the time it reached the rotation step.

The reason I reversed the if was a silly premature optimization: most compilers have a concept of a "happy" path on branching code (assumed to be the true branch if the compiler can't guess the weighting). The happy branch usually does the comparison and then just keeps moving through the code, the unhappy branch requires a jump (this is why there's the concept of "unlikely" annotations in compilers).

It's not an applicable optimization here because this isn't a short method where skipping a jump and immediately returning would have advantages. I'll reverse it :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I wasn't aware of the likely/unlikely attributes.

WrapResult& aWrapResult) const
void WrapCylinder::_make_spiral_path(const SimTK::Vec3& p1,
const SimTK::Vec3& p2,
bool doFarSideWrap,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should update comment on line 623 to match change of variable name. (Doxygen-style comments do not belong in the cpp file but that's another story 📘.)

// Each muscle segment on the surface of the cylinder should be
// 0.002 meters long. This assumes the model is in meters, of course.
int numWrapSegments = (int)(aWrapResult.wrap_path_length / 0.002);
if (numWrapSegments < 1) numWrapSegments = 1;
int numWrapSegments = static_cast<int>(out.wrap_path_length / 0.002);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest marking this hard-coded 0.002 "TODO"

p2,
g_CylinderDirection,
out.r2,
temp_wrap_pt);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time in a long time I've seen a go-to :D

Although on my local branch (which is even more refactored, for a ~15pct boost, by doing stuff like optimized quaternion rotations) I tried to rewrite the code in an iteration loop, which worked fine but (imho) is actually kind of uglier (I hate to admit this) because the original algorithim only logically performs an adjustment if n>=1. It then has to perform two method calls (point adjustment), and this only happens if it's not the final iteration of the loop. Also, the loop block contains the necessary state to finalize the algorithm. This means that you end up with something like a continue (jump) on the adjustment step (the goto) and a break at the bottom of the loop (the fall through), which also felt gross :(

@adamkewley
Copy link
Contributor Author

@tkuchida thanks for reading through this :). I understand that comparing the before/after of this code section is particularly tricky - especially because I made things a little harder by renaming variables (for clarity, it will hopefully make it easier in the future, but it makes your job harder because you have to map r1a onto r1AxialPos etc.)

I've done a quick pass through with your comments and I'm ready for whatever you + @aseth1 spot next :)

I also have a personal branch with more extensive changes (more dramatic rearrangements, slightly bigger perf boost), but I'm going to park that as a local thing and focus on this. This is a ~10 % boost for a relatively small amount of changes. Later rearrangements yield maybe another 5 % but require (e.g.) putting the spiral implementation into a SpiralPath class (to separate "building a spiral" from "iterating on tangent points"), and performing higher-perf quaternion rotations directly (rather than simbody's implementation, which does axis+angle -> SimTK::Quaternion -> SimTK::Rotation -> matrix_multiplication). Some of those changes may ideally involve (e.g.) PRing simbody, reorganizing WrapMath, etc.

@nickbianco
Copy link
Member

Hi @adamkewley, I haven't taken a line-by-line look through this yet, but at a high-level the changes you've made seem reasonable. Of course, the devil is in the details here, but I'm also interested in making these changes provided we can accompany them with the right tests. A lot of the changes here are things I noticed when making the wrapping parallelization fix a while ago.

Speaking of that PR (#2910), I found it helpful to create a set of temporary unit tests to double check the math changes (you helped improve them with randomization too). It could be useful to do that here too; they don't cover the regression tests we would need, but I think it was helpful to give everyone confidence that the math was sound.

If you do find time to get around to the suite of regression tests, I'd be happy to give them a look or a formal code review, whatever's helpful.

@adamkewley
Copy link
Contributor Author

Thanks @nickbianco :) - I'll try and cook up a reasonable test suite for this

@adamkewley
Copy link
Contributor Author

Closing this - it's been open for a while and I believe #3301 includes some parts of the optimization.

If we're going to use anything from this then it should be sliced out and put into a separate PR etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants