Implement initial regression detection #383

Open
wants to merge 13 commits into
base: main

Contributor

@rwstauner rwstauner commented Jan 27, 2025

  • Analyze ratio_in_yjit to detect regressions
  • Add a few comments to old code
  • Add cli to show full analysis report
  • Upload analysis txt to website so slack message can link to it
  • Send slack notification with regression info
  • Take slack channel from ENV to simplify testing
  • Convert actual markdown links to slack mrkdwn format
  • Sort textual analysis for consistency

This sets up regression detection for the ratio_in_yjit metric.
It looks for streaks to identify stable values.
It notifies for a benchmark if, within the last 30 days of data, its most recent value is lower than a previous streak.
I included the data that goes into the analysis to help us fine-tune the algorithm and to make it easier to decide whether a notification is worth investigating.
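
As a rough illustration of the `streaks` lists in the reports that follow, here is a minimal sketch that run-length encodes a series of values into `[value, count]` pairs. It assumes values are rounded to two decimals and grouped by exact equality; the method name `streaks` is hypothetical, and the code in this PR may group values differently.

```ruby
# Hypothetical helper: collapse consecutive equal (rounded) values
# into [value, count] pairs, like the "streaks" lists in the reports.
def streaks(vals, digits = 2)
  vals
    .map { |v| v.round(digits) }
    .chunk_while { |a, b| a == b }   # group runs of identical values
    .map { |run| [run.first, run.size] }
end

p streaks([89.3, 89.3, 89.26, 89.3])
```

A list that is mostly long runs suggests a stable benchmark, while a list of single-item runs (as with hexapdf further down) suggests a noisy one.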

This is what the slack notification looks like with recent data:
[image: screenshot of the Slack notification]

This is what the latest data from 2025-01-27 would report:

RatioInYJIT aarch64_yjit_stats
        object-new streaks: [[89.3, 8], [89.26, 1], [89.3, 5], [89.26, 1], [88.71, 4], [88.76, 1], [88.5, 2], [88.56, 1], [88.5, 2], [88.56, 2], [88.61, 3]]
                   highest_streak_value: 89.3
                   longest_streak: [89.3, 8]
                   geomean: 88.95031326213396
                   regression: dropped from 89.30 to 88.61
    setivar_object streaks: [[83.6, 15], [83.61, 2], [83.48, 1], [83.61, 5], [83.48, 1], [83.61, 2], [83.48, 1], [82.68, 1], [82.53, 1], [82.68, 1]]
                   highest_streak_value: 83.61
                   longest_streak: [83.6, 15]
                   geomean: 83.4934889154428
                   regression: dropped from 83.61 to 82.68
     setivar_young streaks: [[83.73, 9], [83.6, 2], [83.73, 5], [83.61, 2], [83.73, 5], [83.61, 1], [83.73, 2], [83.61, 2], [83.73, 1], [83.61, 1]]
                   highest_streak_value: 83.73
                   longest_streak: [83.73, 9]
                   geomean: 83.69731576465095
                   regression: dropped from 83.73 to 83.61

RatioInYJIT x86_64_yjit_stats
        object-new streaks: [[86.47, 15], [85.34, 1], [85.43, 1], [85.34, 2], [85.43, 4], [85.34, 2], [84.01, 1], [85.43, 1], [85.52, 3]]
                   highest_streak_value: 86.47
                   longest_streak: [86.47, 15]
                   geomean: 85.89437406692068
                   regression: dropped from 86.47 to 85.52
    setivar_object streaks: [[80.28, 1], [80.09, 1], [80.28, 12], [80.09, 1], [79.91, 1], [80.09, 2], [79.52, 1], [79.91, 1], [79.52, 3], [80.09, 1], [79.91, 3], [79.52, 3]]
                   highest_streak_value: 80.28
                   longest_streak: [80.28, 12]
                   geomean: 80.00876774948351
                   regression: dropped from 80.28 to 79.52
     setivar_young streaks: [[80.28, 15], [80.09, 1], [80.28, 1], [80.09, 1], [79.91, 5], [80.09, 1], [79.91, 2], [80.09, 2], [79.91, 1], [80.09, 1]]
                   highest_streak_value: 80.28
                   longest_streak: [80.28, 15]
                   geomean: 80.14317695071685
                   regression: dropped from 80.28 to 80.09

and here's one that doesn't currently register a regression (because the last value goes back up):

           hexapdf streaks: [[98.26, 1], [98.22, 1], [97.84, 1], [98.82, 1], [97.33, 1], [97.29, 1], [97.98, 1], [97.84, 2], [99.74, 1], [97.84, 1], [97.87, 1], [97.92, 1], [97.46, 1], [98.33, 1], [97.52, 1], [98.65, 1], [99.04, 1], [97.74, 1], [97.44, 1], [97.59, 1], [97.19, 1], [97.11, 1], [97.73, 1], [98.83, 1], [96.98, 1], [97.99, 1], [97.54, 1], [97.34, 1], [98.49, 1]]
                   highest_streak_value: 97.84
                   longest_streak: [97.84, 2]
                   geomean: 97.92337515531628
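
For readers unfamiliar with the summary fields, the following sketch shows one way `geomean` and `longest_streak` could be derived. Both method names are hypothetical, and it assumes the geomean is taken over the raw daily values rather than over the streak list; the PR's own code may differ.

```ruby
# Geometric mean of a list of positive numbers, computed via logs
# to avoid overflow from multiplying many values together.
def geomean(vals)
  Math.exp(vals.sum { |v| Math.log(v) } / vals.size)
end

# Pick the [value, count] pair with the largest count.
def longest_streak(streaks)
  streaks.max_by { |_value, count| count }
end

p geomean([2.0, 8.0])
p longest_streak([[99.7, 2], [99.81, 4], [99.7, 6], [99.48, 16]])
```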

refs #366

@rwstauner rwstauner self-assigned this Jan 27, 2025
@rwstauner rwstauner changed the title from "Implement regression detection" to "Implement initial regression detection" Jan 27, 2025
@maximecb
Contributor

Nice! Code changes look relatively compact too 👌

It looks for streaks to identify stable values.

Neat! I don't fully understand how the streaks are computed, though. Is it some average +/- some delta percentage?

Small details wrt the slack notification:

When showing regressions, it would be nice to use the same capitalization and formatting as the stat name, e.g. ratio_in_yjit. It might also be nice to show a percentage first, e.g. -2.1%, from xxx% to yyy% (otherwise you kinda have to calculate the percentage mentally).
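
A minimal sketch of the suggested message format, with the signed percentage first. The helper name `format_regression` and the exact layout are assumptions for illustration, not part of this PR.

```ruby
# Hypothetical helper: put the relative change first, then the raw values.
def format_regression(stat, from, to)
  pct = (to - from) / from * 100   # relative change in percent (negative for a drop)
  format("%s: %+.1f%%, from %.2f%% to %.2f%%", stat, pct, from, to)
end

puts format_regression("ratio_in_yjit", 89.30, 88.61)
# prints "ratio_in_yjit: -0.8%, from 89.30% to 88.61%"
```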

Comment on lines +140 to +164
# Iterate looking for contiguous streaks of values that are within the tolerance.
(1...vals.size).each do |i|
  prev, curr = vals.values_at(i-1, i)
  delta = curr - prev

  # If this iteration is within the defined tolerance
  # from the last iteration track it as a streak.
  if delta.abs <= TOLERANCE
    # Keep track of the highest value that we've seen more than once in a row.
    max = [prev, curr].max
    high_streak = max if !high_streak || high_streak < max
  end

  # If we have seen any streaks (meaning there has been some consistency)...
  if high_streak
    delta = curr - high_streak

    # If this iteration was lower than the highest streak and
    # outside the tolerance range record it as a regression.
    regression = if delta < -TOLERANCE
      diff_pct = 0 - delta / high_streak * 100
      sprintf "dropped %.*f%% from %.*f to %.*f", ROUND, diff_pct, ROUND, high_streak, ROUND, curr
    end
  end
end
Contributor Author

This is the streak detection with tolerance
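
To try the quoted loop in isolation, here is a self-contained sketch that wraps it in a method. The `check_one(vals)` name and the constant values come from elsewhere in this thread; the variable initialization, the `next unless` guard, and the return value are assumptions about the surrounding code.

```ruby
ROUND = 2        # digits printed after the decimal point
TOLERANCE = 0.1  # allowed difference between values, in the metric's own units

# Returns nil, or a string describing the regression (sketch only).
def check_one(vals)
  high_streak = nil
  regression = nil

  (1...vals.size).each do |i|
    prev, curr = vals.values_at(i - 1, i)

    # Two neighbors within the tolerance count as a streak;
    # remember the highest value seen in any streak.
    if (curr - prev).abs <= TOLERANCE
      max = [prev, curr].max
      high_streak = max if !high_streak || high_streak < max
    end

    next unless high_streak

    # Re-evaluated on every iteration, so only the latest value decides
    # whether a regression is reported.
    delta = curr - high_streak
    regression = if delta < -TOLERANCE
      diff_pct = 0 - delta / high_streak * 100
      sprintf("dropped %.*f%% from %.*f to %.*f", ROUND, diff_pct, ROUND, high_streak, ROUND, curr)
    end
  end

  regression
end

dropped = [99.7] * 2 + [99.81] * 4 + [99.7] * 6 + [99.48] * 16
p check_one(dropped)

recovered = [99.6] * 3 + [99.82] * 2
p check_one(recovered)
```

Because `regression` is reassigned each time through the loop, a series whose last value climbs back to within the tolerance of the highest streak yields nil.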

@rwstauner
Contributor Author

This shows it would have caught the regression last September (#366):

$ bin/analysis --benchmarks=railsbench build/raw-benchmark-data.prod/raw_benchmark_data/x86_64/2024-09
ratio_in_yjit x86_64_yjit_stats
 railsbench streaks: [[99.7, 2], [99.81, 4], [99.7, 6], [99.48, 16]]
            highest_streak_value: 99.81
            longest_streak: [99.48, 16]
            geomean: 99.58991325387635
            regression: dropped 0.33% from 99.81 to 99.48
$ bin/analysis --benchmarks=railsbench build/raw-benchmark-data.prod/raw_benchmark_data/x86_64/2024-10
ratio_in_yjit x86_64_yjit_stats
 railsbench streaks: [[99.48, 13], [99.6, 1], [99.57, 2], [99.6, 14]]
            highest_streak_value: 99.6
            longest_streak: [99.6, 14]
            geomean: 99.54598300131984
$ bin/analysis --benchmarks=railsbench build/raw-benchmark-data.prod/raw_benchmark_data/x86_64/2024-11
ratio_in_yjit x86_64_yjit_stats
 railsbench streaks: [[99.6, 26], [99.82, 4]]
            highest_streak_value: 99.82
            longest_streak: [99.6, 26]
            geomean: 99.62930529510919

and in late November the number goes back up.

COUNT = 30
NAME = :ratio_in_yjit
ROUND = 2
TOLERANCE = 0.1
Contributor

Can you comment on what is the unit of the tolerance?
What is count?
What is round?
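
One way the constants could be documented, as a sketch. The comments are inferences from how the values appear to be used in this thread (the "last 30 days" in the description, the `sprintf` precision, and the tolerance comparison in the quoted loop), not the author's confirmed definitions.

```ruby
# Number of most recent results to analyze per benchmark
# (assumed from "the last 30 days" in the PR description).
COUNT = 30

# Name of the stat being analyzed.
NAME = :ratio_in_yjit

# Number of decimal places used when printing values
# (assumed; it is passed as the precision to sprintf in the quoted loop).
ROUND = 2

# Largest difference between two values that still counts as "the same level",
# in the stat's own units. For ratio_in_yjit that would be percentage points
# (assumed), i.e. 99.70 and 99.78 are within tolerance, 99.70 and 99.48 are not.
TOLERANCE = 0.1
```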


# Check the list of values for one benchmark.
# Returns either nil or string description of regression.
def check_one(vals)
Contributor

Why is it called check_one?

Comment on lines +142 to +143
prev, curr = vals.values_at(i-1, i)
delta = curr - prev
Contributor

But is this just comparing each value to the previous one?

What if you have a sequence of gradually decreasing values? 98.1, 98.0, 97.9, 97.8?
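
To make the question concrete, the sketch below contrasts two checks on a slowly declining series: one that only compares neighbors, and one that compares the latest value with the highest value seen in any in-tolerance pair, which is how the larger snippet quoted earlier in this thread appears to work. Both method names are hypothetical, and the series uses steps of 0.05 so that no comparison sits on a floating-point boundary.

```ruby
TOLERANCE = 0.1

# Neighbor-only check: true only if a single step drops by more than the tolerance.
def pairwise_drop?(vals)
  vals.each_cons(2).any? { |prev, curr| curr - prev < -TOLERANCE }
end

# High-water check: remember the highest value from any pair of neighbors
# within the tolerance, then compare the latest value against it.
def drift_from_high?(vals)
  high = nil
  vals.each_cons(2) do |prev, curr|
    high = [high, prev, curr].compact.max if (curr - prev).abs <= TOLERANCE
  end
  !high.nil? && vals.last - high < -TOLERANCE
end

slow_decline = [98.1, 98.05, 98.0, 97.95, 97.9, 97.85]
p pairwise_drop?(slow_decline)    # no single step exceeds the tolerance
p drift_from_high?(slow_decline)  # the total drift from 98.1 does
```

Under this reading, a gradual decline would be reported once the latest value falls more than the tolerance below the highest streak value, even though every individual step stays within the tolerance.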
