Hello there :)
I work on Datadog's Ruby profiler, and I arrived here while investigating a customer issue: they were using the rugged gem, and it breaks when combined with our profiler because of a libssh2 bug.
The TL;DR is that libssh2 doesn't yet correctly handle the interruption of system calls caused by profilers that use unix signals, and thus having such a profiler running breaks network calls using rugged.
This issue is not specific to the Datadog Ruby profiler 😭; the also-great stackprof profiler gem is unfortunately affected too. For example, this script:
```ruby
require 'rugged'
require 'stackprof'

puts "Cloning..."

creds = Rugged::Credentials::SshKey.new(username: 'git', publickey: '...', privatekey: '...')

StackProf.run(mode: :wall, out: 'tmp/stackprof-cpu-myapp.dump') do
  Rugged::Repository.clone_at('ssh://git@github.com/libgit2/rugged.git', '/tmp/some-directory', credentials: creds)
end

puts "Cloned!"
```
gets this output:
```
Cloning...
Traceback (most recent call last):
        3: from repro-rugged.rb:8:in `<main>'
        2: from repro-rugged.rb:8:in `run'
        1: from repro-rugged.rb:9:in `block in <main>'
repro-rugged.rb:9:in `clone_at': remote rejected authentication: Error waiting on socket (Rugged::SshError)
```
...but removing `StackProf` makes the clone work fine. It's not actually an authentication error; it's the system call interruption at work.
Ok so why am I double-reporting this issue when I've already reported it to the libssh2 developers as well?
Since it's common to use a system libssh2 with rugged, even if a fix to libssh2 were released today, it would take months or years to reach Linux distros.
Furthermore, it's hard to detect from Ruby code whether rugged is linked with a broken libssh2: while rugged provides a `Rugged.libgit2_version`, there's no corresponding API to probe the libssh2 version (at least I didn't find one in libgit2 directly or in rugged).
Currently, when rugged is detected, the Datadog profiler needs to fall back to an alternative code path that yields lower-quality data.
I would love for this to not be the case!
Thus my question is: would you consider accepting a pull request that wraps the rugged calls that can trigger network operations (I think it's only clone/fetch/pull/push/submodule stuff?) with a method that temporarily disables signal handling (for `SIGPROF` and `SIGALRM`) during that call?
Something similar to:
```diff
 static VALUE rb_git_repo_clone_at(int argc, VALUE *argv, VALUE klass)
 {
     VALUE url, local_path, rb_options_hash;
     git_clone_options options = GIT_CLONE_OPTIONS_INIT;
     struct rugged_remote_cb_payload remote_payload = { Qnil, Qnil, Qnil, Qnil, Qnil, Qnil, Qnil, 0 };
     git_repository *repo;
     int error;

     rb_scan_args(argc, argv, "21", &url, &local_path, &rb_options_hash);
     Check_Type(url, T_STRING);
     FilePathValue(local_path);

     parse_clone_options(&options, rb_options_hash, &remote_payload);

+    block_profiling_signals();
     error = git_clone(&repo, StringValueCStr(url), StringValueCStr(local_path), &options);
+    unblock_profiling_signals();

     if (RTEST(remote_payload.exception))
         rb_jump_tag(remote_payload.exception);
     rugged_exception_check(error);

     return rugged_repo_new(klass, repo);
 }
```
This would make rugged work great under stackprof and the Datadog profiler. As a bonus, we could detect the fixed version and skip the fallback to the lower-quality code path that we currently rely on.
Thoughts? I'm shamelessly tagging @tenderlove here since you're also a maintainer of stackprof ;)