-
Notifications
You must be signed in to change notification settings - Fork 5.2k
mobile: support QUIC connection migration #42104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
|
@abeyad can you take a first pass? |
Signed-off-by: Dan Zhang <[email protected]>
abeyad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks @danzh2010 !
mobile/library/cc/engine_builder.cc
Outdated
| } | ||
| } | ||
|
|
||
| if (use_quic_platform_packet_writer_ || enable_connection_migration_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would use_quic_platform_packet_writer_ be set if connection migration is not enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_quic_platform_packet_writer_ can be used on Android regardless of enabling migration or not. As long as the platform support network handles.
|
|
||
| private: | ||
| Network::ConnectionSocketPtr | ||
| createConnectionSocketOnGivenNetwork(Network::Address::InstanceConstSharedPtr peer_addr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the definition of this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed.
| } | ||
|
|
||
| void DefaultSystemHelper::bindSocketToNetwork(Network::ConnectionSocket&, int64_t) { | ||
| PANIC("unreachable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why panicking here for Apple? wouldn't this get called by platform_packet_writer_factory.cc on any platform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 choice here: 1) Apple will continue using the default packet writer. It doesn't propagate the network handle to QUICHE anyway. 2) using the extension packet writer as well, but because of lack of network handle, this function won't be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like this deserves a comment in the code. Since this class explicitly has "Apple"in the name, it is assumed to work on Apple platforms, I think. So if we think choice 1 is a valid option, then perhaps we should nuke the class? But option 2 seems plausible and worth a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"nuke the class" means leave this method body empty?
Anyway, I added comment about why this is unreachable.
|
|
||
| struct CreationResult { | ||
| // Not null. | ||
| std::unique_ptr<EnvoyQuicPacketWriter> writer_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of the using declarations with EnvoyQuicPacketWriterPtr, but up to you if you want to do that here for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me either, especially this class name isn't too long. I'll leave as is.
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
abeyad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, will let @RyanTheOptimist review as an Envoy maintainer
|
/assign @botengyao for API review |
|
/assign @adisuissa for API review |
adisuissa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Left a few comments about the API.
| // picked by the platform. A connection is this special state is only allowed to | ||
| // serve new requests for a certain period of time before being drained, and | ||
| // meanwhile, QUIC will try to migrate to the default network if possible. | ||
| message ConnectionMigrationSettings { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is QUIC only settings (and will not be used in other places), I think it should be inside the QuicProtocolOptions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // network change events from the platform, i.e. the current network gets | ||
| // disconnected, or upon the QUIC detecting a bad connection. After migration, the | ||
| // connection may be on a different network other than the default network | ||
| // picked by the platform. A connection is this special state is only allowed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // picked by the platform. A connection is this special state is only allowed to | |
| // picked by the platform. A connection in this special state is only allowed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // if it hasn't been idle for longer than this idle period. Otherwise, the | ||
| // connection will be closed instead. | ||
| // Default to 30s. | ||
| google.protobuf.Duration max_idle_time_before_migration = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this field is highly coupled with migrate_idle_connections being set to true.
If that's the case, will it make sense to create a new type (e.g., IdleConnectionsMigration) that has the max_idle_time... field there?
There will be an idle_connections_migration field that if not set, the idle connections will be closed upon a migration signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| bool migrate_idle_connections = 1; | ||
|
|
||
| // If idle connections are allowed to be migrated, only migrate the connection | ||
| // if it hasn't been idle for longer than this idle period. Otherwise, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just top make sure I understand correctly, this implies that once a migration signal was received, the implementation will look "backwards" at all the connections and see if any of them has been idle (according to the set timeout), and if they are not, they will be migrated. Is that correct?
(I just want to make sure that this doesn't imply that once a migration signal arrives, a timer starts for all the connections to see which is idle and which is not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just top make sure I understand correctly, this implies that once a migration signal was received, the implementation will look "backwards" at all the connections and see if any of them has been idle (according to the set timeout), and if they are not, they will be migrated. Is that correct?
Yes
| // After migrating to a non-default network interface, the connection will | ||
| // only be allowed to stay on that network for up to this period of time before | ||
| // being drained unless it migrates to the default network or that network | ||
| // gets picked as the default by the device by then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like many edge cases and races may be introduced by this :)
I think one thing that is missing is the definition of "default network". I guess it is a QUIC internal definition.
Can you please add a link in the first mention of default network in this proto, that explains what's its definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default network is not a QUIC terminology, but one for mobile platforms. Both Android and iOS has the concept of the default network to interact with the internet, usually prefer unmetered network (WIFI) over metered ones (cellular). I inlined some explanation here. I couldn't find an external link to explain this concept.
RyanTheOptimist
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small comments.
(It might have been simpler to do a stand-alone PR which added the new writer and the ability to use it, but c'est la vie :>)
| } | ||
|
|
||
| void DefaultSystemHelper::bindSocketToNetwork(Network::ConnectionSocket&, int64_t) { | ||
| PANIC("unreachable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like this deserves a comment in the code. Since this class explicitly has "Apple"in the name, it is assumed to work on Apple platforms, I think. So if we think choice 1 is a valid option, then perhaps we should nuke the class? But option 2 seems plausible and worth a comment?
| virtual std::vector<std::pair<int64_t, ConnectionType>> getAllConnectedNetworks() PURE; | ||
|
|
||
| /** | ||
| * Binds the given socket to the network interface associated with the handle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a comment about what happens if the underlying socket doesn't support networking binding. (Should it be a no-op, or is it the callers responsibility to not call it in this case. If the latter, should we expose "supportsNetworkBinding()" or somesuch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller (platform writer) will check against invalid network handle using that as a signal of whether the socket supports network binding or not. The network handle can be invalid even on platforms that support network binding, i.e. the default Android network monitor doesn't propagate the network handle to native code. So whether the network binding is meaningful or not doesn't purely depends on the SystemHelper interface.
| * Set whether to use a platform specific APIs to create UDP socket and the associated QUIC packet | ||
| * writer. Note that `setUseV2NetworkMonitor()` also needs to be called to take effect. This is a | ||
| * temporary API which will be deprecated once the platform specific extension is verified to work | ||
| * and will be used as the default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the relationship between this method and setUseV2NetworkMonitor? I would have thought that using this method would enable the new writer regardless of any migration / network monitoring things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, this API can be called regardless of whether the other migration API is called or not. But if the latter is called the engine builder will behave as if the former is also called.
| } | ||
|
|
||
| /** | ||
| * Set whether to migrate idle connections to a different network upon network events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other similar methods use the phrase "QUIC connections". We should probably be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| * Note that `setUseV2NetworkMonitor()` also needs to be called to take effect. | ||
| * If enabled, the engine will automatically be configured to use platform packet writer. * | ||
| */ | ||
| public NativeCronvoyEngineBuilderImpl setEnableConnectionMigration(boolean enable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this and the following methods are QUIC-specific, should they have QUIC in the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed them to mention Quic
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
0eba0f0
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
|
/retest |
Signed-off-by: Dan Zhang <[email protected]>
|
/retest |
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
|
/retest |
|
/wait @danzh2010 is OOO |
|
Looks like the only remaining CI issue is test coverage: |
adisuissa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm api
Commit Message: this PR adds support of QUIC connection migration. To achieve this, this PR made several change:
createPersistentQuicInfoForCluster()for a custom QUIC client packet writer. The config API isQuicProtocolOptions.client_packet_writer. And provided an extension implementationQuicPlatformPacketWriterFactoryfor mobile which knows how to bind sockets to given network handles provided by Android.QuicNetworkConnectivityObserverImplto actually propagate network change events (connected, disconnected, became defualt) of a specific network to QuicSession's migration manager interface which makes decision whether and how to migrate the connection. Added 2 new interfacesgetDefaultNetwork()andgetAlternativeNetwork()toEnvoyQuicNetworkObserverRegistryand implemented them in the subclassEnvoyMobileQuicNetworkObserverRegistryusing the network states cached inConnectivityManagerImpl. These interfaces are used bycreateQuicNetworkConnection()andEnvoyQuicMigrationHelperto pass network information to the packet writer factory and QUICHE respectively.QuicProtocolOptions.connection_migrationfor upstream cluster to configure several migration options, i.e. whether to migrate an idle QUIC connection or not, etc. (Currently hidden from doc) And add mobile APIssetEnableConnectionMigration()to Android engine builders to populate this config knob in the bootstrap config. Note that if this knob is configured,QuicProtocolOptions.client_packet_writermust be configured with a packet writer extension that supports binding socket to a given network handle of the platform's own definition. And Android engine builders will automatically plug inQuicPlatformPacketWriterFactoryextension in such case.Currently we handles 4 network change events with connection migration if there is alternative network to use in Android:
Migration between different ports on the same network and to a different server address is already supported.
To enable connection migration in Cronvoy engine, these APIs need to be called:
setEnableConnectionMigration(true)
setUseV2NetworkMonitor(true)
addRuntimeGuard("drain_pools_on_network_change", true) // default false, needed to disable request rehash during conn pool picking
The feature also depends on these default-true runtime guards:
envoy.reloadable_features.mobile_use_network_observer_registryenvoy.reloadable_features.decouple_explicit_drain_pools_and_dns_refresh// needed to disable request rehash during conn pool pickingIf any of them are turned off, migration needs to be turned off as well.
Additional Description: use
QuicPlatformPacketWriterFactoryextension by default inexamples/java/hello_world:hello_envoyJava app to ensure correct interaction with Android APIs.Risk Level: low, the feature is default off
Testing: new integration tests added
Docs Changes: Y
Release Notes: N
Platform Specific Features: QUIC connection migration on Android