Skip to content

WIP: RSDK-10435: Dial direct #430

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 44 commits into
base: main
Choose a base branch
from

Conversation

lia-viam
Copy link
Collaborator

@lia-viam lia-viam commented May 2, 2025

Enables dialing directly to gRPC, potentially circumventing rust-utils entirely.

This includes some QoL/drive-by changes, such as structifying DialOptions and removing the shared_ptr from RobotClient

@lia-viam lia-viam requested a review from a team as a code owner May 2, 2025 18:16
@lia-viam lia-viam requested review from njooma and stuqdog and removed request for a team May 2, 2025 18:16
server_->shutdown();

if (parent_) {
try {
parent_->close();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was causing a double call to stop_all, the second one happening after disconnecting from the robot

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does conan now give us a vehicle for building the entire dependency tree under ASAN and UBSAN?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes in theory but it's probably pretty tricky, see here for discussion (old conan docs) https://docs.conan.io/1/howtos/sanitizers.html

Copy link
Member

@stuqdog stuqdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! A couple small things but generally looks good to me.

@@ -1,3 +1,4 @@
#include <grpc/grpc.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) should move this to a separate include block

Comment on lines 15 to 23
#include <viam/api/proto/rpc/v1/auth.grpc.pb.h>
#include <viam/api/proto/rpc/v1/auth.pb.h>
#include <viam/api/robot/v1/robot.grpc.pb.h>
#include <viam/api/robot/v1/robot.pb.h>
#include <viam/sdk/common/client_helper.hpp>
#include <viam/sdk/common/exception.hpp>
#include <viam/sdk/common/private/instance.hpp>
#include <viam/sdk/rpc/private/viam_grpc_channel.hpp>
#include <viam/sdk/rpc/private/viam_rust_utils.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Hm we did this wrong here (my bad!) but in most places in the SDK we've got the api includes in a separate block before the sdk includes. Could we do the same here?


return grpc::CreateChannel(address, tls_creds);
}

std::shared_ptr<grpc::Channel> create_viam_channel(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not worth fixing here, but it's unfortunate and a little confusing that create_viam_channel creates a grpc::Channel, not a ViamChannel. :(

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it :)

Comment on lines +143 to +155
using namespace proto::rpc::v1;

auto auth_stub = AuthService::NewStub(channel_for_auth);
ClientContext ctx;
AuthenticateRequest req;

*req.mutable_entity() = opts.auth_entity.get();
*req.mutable_credentials()->mutable_payload() = opts.credentials->payload();
*req.mutable_credentials()->mutable_type() = opts.credentials->type();

AuthenticateResponse resp;

auto status = auth_stub->Authenticate(ctx, req, &resp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we factor this out into a helper function? I think it's a bit cleaner, and also will be relevant if/when we want to have app client support in C++.

Comment on lines 83 to 85
/// @brief Bypass WebRTC and connect directly to the robot.
/// This dials directly through grpc bypassing rust utils.
bool disable_webrtc = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(opt) Maybe not a big deal (I don't think we do this in other SDKs), but it might be good to add a note here about the specialized uri form that is required (<part>.<location>.local.viam.cloud:8080). I could definitely imagine that being a gotcha.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah I wanted to ask you about that, should we try to validate the URI under this code path and error out of it's not of that form?

Copy link
Member

@stuqdog stuqdog May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good question. I think probably no. My thinking is:

  1. that format might change in the future and that could lead to some issues that might be annoying to track down, and
  2. that format is currently required assuming the user isn't doing anything weird or bespoke. If, e.g., they were to use a different app instance than app.viam.com then maybe the format would be different? I have no idea but I don't want to just break things for someone if such a case were to arise

On the other hand, it is a little annoying if we have users saying "why doesn't this work??" and we say "oh it says in the comments" and they say "that wasn't helpful, if you'd just given me an error message I would've known much faster". So while I'm inclined not to error out, I'm open to being convinced.

Maybe we add a warning or info log message? Probably excessive 😅

@lia-viam lia-viam changed the title RSDK-10435: Dial direct WIP: RSDK-10435: Dial direct May 8, 2025
Copy link
Member

@acmorrow acmorrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look out of interest. Please don't let these comments get in the way of moving this forward, they are just observations and notes on things that have bugged me about the code that is under modification here, so this seemed like a time to raise them. I'm happy for any or none of these issues to be deferred into future work.

// TODO (RSDK-917): We currently don't provide a flag for disabling webRTC, instead relying on a
// `local` uri. We should update dial logic to consider such a flag.

struct DialOptions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what was the motivation for making this a struct over a class? Keeping it as a class would make it easier in the future to hide the implementation details behind a pimpl in order to make it not part of the ABI.

I also liked the fluent style setters, so you could write:

ViamChannel::dial(where, DialOptions().set_entity(...).set_credentials()...));

Admittedly, the structure of some of the existing setters is a little off. For instance, it isn't clear to me why set_credentials takes an optional. When would you call set_credentials with a disengaged one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so personally I was applying the C++ core guideline against trivial getters and setters in a class which hint that something probably should have just been a struct. With newer C++ (or older C++ with gnu extensions) I'm much more in favor of using designated initializers, eg

ViamChannel::dial(where, {
    .entity = ...,
    .credentials = ...,
});

this PR has exposed a bunch of weird stuff around these classes and optional, I was going to try to clean up the example code because there's a lot of unnecessary instances of constructing an options and then a optional<options> which is set to the previous options

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Designated initializers are definitely great (though C++20 IIRC), but it does create a very tightly bound interface: not only can you not change the sizes / orders of the members of the struct, but you can't even interpose on the act of setting the values to add behavior or checks. I have some doubts that they are the right thing to use for a public interface in an SDK.

On the other hand, we aren't there yet with worrying about ABI, so I'm not going to push too hard either way on it.


/// @brief Timeout of connection attempts when initially dialing a robot
/// Defaults to 20sec to match the default timeout duration
std::chrono::duration<float> initial_connection_attempt_timeout_{20};
std::chrono::duration<float> initial_connection_attempt_timeout{20};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⬇️ - I've always wondered: what is the point of having Options separate from DialOptions? And if they are specifically part of RobotClient, maybe they belong as part of that class (i.e. RobotClient::Options) or at least in that header? Having a SDK namespace level type named Options when it is only associated with RobotClienthas always bothered me.

I'm also curious if having the default should be to have a refresh interval (what would it be?), and provide an API to opt out:

class [RobotClient::]Options {
    Options();
    Options(DialOptions dial_options);

    Options& disable_refresh();
    Options& set_refresh_interval(std::chrono...);
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect a lot of this may be cruft that was not thought over a ton and hasn't been updated. I agree no refresh seems like an odd default

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doing an overhaul of this API (commit forthcoming) and made the troubling discovery that a lot of our example code says that passing a refresh interval of zero seconds does auto refresh. I'm not sure what auto refresh means or if this was ever the case!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: documentation has been updated to reflect that zero seconds means 'only on config update'

@@ -11,7 +11,8 @@
namespace viam {
namespace sdk {

class DialOptions;
struct DialOptions;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these really ViamChannelOptions? Or ViamChannel::Options? In which case, this header could be viam_channel.h?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they probably are indeed ViamChannel::Options

}

const std::chrono::duration<float> float_timeout = opts.timeout;
void* ptr = init_rust_runtime();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was captured by unique_ptr with free_rust_runtime as the deleter type, some things could be simplified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good idea i'll try to do it that way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you decide to do it, you could probably just create a nice shim class in sdk/rpc/private/viam_rust_utils.h that called init_rust_runtime in the ctor, and presented dial as a member.


const std::string& token =
Instance::current(Instance::Creation::open_existing).impl_->direct_dial_token;
if (!token.empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little surprised to see this as a global, but maybe I'm misunderstanding. But if you had one channel that was doing direct dial and another that wasn't, wouldn't this by trying to add this metadata to the contexts for both channels?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes, although I'm not sure how common/feasible such a situation would be in practice, or if we should error out if we detect that kind of thing? cc @stuqdog

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this is a good question. I'm not sure how common such a situation is in practice currently, but I do think it's certainly feasible and I could imagine a case where, e.g., someone is connecting to two machines at once and would prefer to use direct gRPC but is required to use webRTC for some of its added functionality on one of the machines only.

For that matter, even if both channels were dialing direct, if they were dialing to different machines then the token would only be valid for one of them! So yeah, probably tokens should either not be global or (if that's not feasible) should be keyed to a particular channel.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok so this is probably something we should handle better. i'll try to look into the architecture and see how to make this work probably on a per channel basis

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acmorrow after much wailing and gnashing of teeth we figured out a way to do this without a global, thereby ballooning the surface area of this PR considerably.

server_->shutdown();

if (parent_) {
try {
parent_->close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does conan now give us a vehicle for building the entire dependency tree under ASAN and UBSAN?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants