@danzh2010 danzh2010 commented Nov 18, 2025

Commit Message: This PR adds support for QUIC connection migration. To achieve this, it makes several changes:

  • Added an extension point in createPersistentQuicInfoForCluster() for a custom QUIC client packet writer. The config API is QuicProtocolOptions.client_packet_writer. Also provided an extension implementation for mobile, QuicPlatformPacketWriterFactory, which knows how to bind sockets to the network handles provided by Android.
  • Changed QuicNetworkConnectivityObserverImpl to actually propagate network change events (connected, disconnected, became default) for a specific network to QuicSession's migration manager interface, which decides whether and how to migrate the connection. Added 2 new interfaces, getDefaultNetwork() and getAlternativeNetwork(), to EnvoyQuicNetworkObserverRegistry and implemented them in the subclass EnvoyMobileQuicNetworkObserverRegistry using the network states cached in ConnectivityManagerImpl. These interfaces are used by createQuicNetworkConnection() and EnvoyQuicMigrationHelper to pass network information to the packet writer factory and QUICHE respectively.
  • Also added a config API, QuicProtocolOptions.connection_migration, for upstream clusters to configure several migration options, e.g. whether to migrate an idle QUIC connection (currently hidden from docs). Added a mobile API, setEnableConnectionMigration(), to the Android engine builders to populate this config knob in the bootstrap config. Note that if this knob is configured, QuicProtocolOptions.client_packet_writer must be configured with a packet writer extension that supports binding a socket to a network handle as defined by the platform. The Android engine builders automatically plug in the QuicPlatformPacketWriterFactory extension in that case.

Currently we handle 4 network change events with connection migration on Android, provided there is an alternative network to use:

  1. the current network gets disconnected;
  2. a different network gets picked as the default by the platform;
  3. a packet write error occurs;
  4. QuicConnection detects path degrading.

Migration between different ports on the same network and to a different server address is already supported.
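
As a rough illustration of the four triggers above, here is a minimal C++ sketch of how a migration target might be chosen. All names (`MigrationTrigger`, `NetworkState`, `pickMigrationTarget`) are hypothetical and not taken from this PR. The rule that a default-network change targets the new default, while the other three triggers target the alternative network, is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical names for illustration; these are not the PR's actual types.
using NetworkHandle = int64_t;

enum class MigrationTrigger {
  CurrentNetworkDisconnected, // 1. the current network gets disconnected
  DefaultNetworkChanged,      // 2. the platform picks a different default network
  PacketWriteError,           // 3. a packet write error occurs
  PathDegrading,              // 4. QuicConnection detects path degrading
};

struct NetworkState {
  NetworkHandle current;
  std::optional<NetworkHandle> default_network;     // as reported by the platform
  std::optional<NetworkHandle> alternative_network; // another connected network, if any
};

// Returns the network to migrate to, or std::nullopt to stay on the current one.
// Assumption: a default-network change targets the new default; the other
// triggers target whatever alternative network is available.
std::optional<NetworkHandle> pickMigrationTarget(MigrationTrigger trigger,
                                                 const NetworkState& state) {
  const std::optional<NetworkHandle> target =
      trigger == MigrationTrigger::DefaultNetworkChanged ? state.default_network
                                                         : state.alternative_network;
  if (!target.has_value() || *target == state.current) {
    return std::nullopt; // no other network to move to
  }
  return target;
}
```

With no alternative network the function returns `std::nullopt`, which matches the "if there is an alternative network" precondition stated above.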

To enable connection migration in Cronvoy engine, these APIs need to be called:
setEnableConnectionMigration(true)
setUseV2NetworkMonitor(true)
addRuntimeGuard("drain_pools_on_network_change", true) // default false, needed to disable request rehash during conn pool picking

The feature also depends on these default-true runtime guards:
envoy.reloadable_features.mobile_use_network_observer_registry
envoy.reloadable_features.decouple_explicit_drain_pools_and_dns_refresh // needed to disable request rehash during conn pool picking
If either of them is turned off, migration needs to be turned off as well.
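
The prerequisites listed above can be summarized as a small consistency check. This is only a sketch: `MigrationKnobs` and `missingPrerequisites` are hypothetical helpers that mirror the knob names from the description, not part of Envoy or Envoy Mobile.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mirror of the knobs named in the PR description.
struct MigrationKnobs {
  bool enable_connection_migration = false;   // setEnableConnectionMigration()
  bool use_v2_network_monitor = false;        // setUseV2NetworkMonitor()
  bool drain_pools_on_network_change = false; // runtime guard, default false
  // Runtime guards that default to true:
  bool mobile_use_network_observer_registry = true;
  bool decouple_explicit_drain_pools_and_dns_refresh = true;
};

// Returns the unmet prerequisites; an empty result means the setup is consistent.
std::vector<std::string> missingPrerequisites(const MigrationKnobs& knobs) {
  std::vector<std::string> missing;
  if (!knobs.enable_connection_migration) {
    return missing; // migration is off, so nothing else is required
  }
  if (!knobs.use_v2_network_monitor) {
    missing.emplace_back("setUseV2NetworkMonitor(true)");
  }
  if (!knobs.drain_pools_on_network_change) {
    missing.emplace_back("drain_pools_on_network_change");
  }
  if (!knobs.mobile_use_network_observer_registry) {
    missing.emplace_back("envoy.reloadable_features.mobile_use_network_observer_registry");
  }
  if (!knobs.decouple_explicit_drain_pools_and_dns_refresh) {
    missing.emplace_back(
        "envoy.reloadable_features.decouple_explicit_drain_pools_and_dns_refresh");
  }
  return missing;
}
```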

Additional Description: uses the QuicPlatformPacketWriterFactory extension by default in the examples/java/hello_world:hello_envoy Java app to ensure correct interaction with Android APIs.

Risk Level: low, the feature is default off
Testing: new integration tests added
Docs Changes: Y
Release Notes: N
Platform Specific Features: QUIC connection migration on Android

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #42104 was opened by danzh2010.

see: more, trace.

@RyanTheOptimist
Contributor

@abeyad can you take a first pass?

Signed-off-by: Dan Zhang <[email protected]>
Contributor

@abeyad abeyad left a comment

This is great, thanks @danzh2010 !

}
}

if (use_quic_platform_packet_writer_ || enable_connection_migration_) {
Contributor

why would use_quic_platform_packet_writer_ be set if connection migration is not enabled?

Contributor Author

use_quic_platform_packet_writer_ can be used on Android regardless of whether migration is enabled, as long as the platform supports network handles.


private:
Network::ConnectionSocketPtr
createConnectionSocketOnGivenNetwork(Network::Address::InstanceConstSharedPtr peer_addr,
Contributor

where is the definition of this method?

Contributor Author

This is no longer needed.

}

void DefaultSystemHelper::bindSocketToNetwork(Network::ConnectionSocket&, int64_t) {
PANIC("unreachable");
Contributor

Why panic here for Apple? Wouldn't this get called by platform_packet_writer_factory.cc on any platform?

Contributor Author

2 choices here: 1) Apple will continue using the default packet writer. It doesn't propagate the network handle to QUICHE anyway. 2) Apple uses the extension packet writer as well, but because of the lack of a network handle, this function won't be called.

Contributor

Sounds like this deserves a comment in the code. Since this class explicitly has "Apple" in the name, it is assumed to work on Apple platforms, I think. So if we think choice 1 is a valid option, then perhaps we should nuke the class? But option 2 seems plausible and worth a comment?

Contributor Author

"nuke the class" means leave this method body empty?
Anyway, I added a comment about why this is unreachable.


struct CreationResult {
// Not null.
std::unique_ptr<EnvoyQuicPacketWriter> writer_;
Contributor

I'm not a big fan of the using declarations with EnvoyQuicPacketWriterPtr, but up to you if you want to do that here for consistency

Contributor Author

Me neither, especially since this class name isn't too long. I'll leave it as is.

abeyad previously approved these changes Nov 26, 2025
Contributor

@abeyad abeyad left a comment

LGTM, will let @RyanTheOptimist review as an Envoy maintainer

@abeyad
Contributor

abeyad commented Nov 26, 2025

/assign @botengyao

for API review

@abeyad
Contributor

abeyad commented Nov 26, 2025

/assign @adisuissa

for API review

Contributor

@adisuissa adisuissa left a comment

Thanks. Left a few comments about the API.

// picked by the platform. A connection is this special state is only allowed to
// serve new requests for a certain period of time before being drained, and
// meanwhile, QUIC will try to migrate to the default network if possible.
message ConnectionMigrationSettings {
Contributor

If these are QUIC-only settings (and will not be used in other places), I think they should be inside QuicProtocolOptions.

Contributor Author

done

// network change events from the platform, i.e. the current network gets
// disconnected, or upon the QUIC detecting a bad connection. After migration, the
// connection may be on a different network other than the default network
// picked by the platform. A connection is this special state is only allowed to
Contributor

Suggested change
// picked by the platform. A connection is this special state is only allowed to
// picked by the platform. A connection in this special state is only allowed to

Contributor Author

done

// if it hasn't been idle for longer than this idle period. Otherwise, the
// connection will be closed instead.
// Default to 30s.
google.protobuf.Duration max_idle_time_before_migration = 2
Contributor

IIUC this field is highly coupled with migrate_idle_connections being set to true.
If that's the case, would it make sense to create a new type (e.g., IdleConnectionsMigration) that holds the max_idle_time... field?
There would then be an idle_connections_migration field; if it is not set, idle connections are closed upon a migration signal.

Contributor Author

done

bool migrate_idle_connections = 1;

// If idle connections are allowed to be migrated, only migrate the connection
// if it hasn't been idle for longer than this idle period. Otherwise, the
Contributor

Just to make sure I understand correctly, this implies that once a migration signal is received, the implementation will look "backwards" at all the connections and see whether any of them has been idle (according to the set timeout), and if they have not, they will be migrated. Is that correct?
(I just want to make sure that this doesn't imply that once a migration signal arrives, a timer starts for all the connections to see which is idle and which is not).

Contributor Author

Just to make sure I understand correctly, this implies that once a migration signal is received, the implementation will look "backwards" at all the connections and see whether any of them has been idle (according to the set timeout), and if they have not, they will be migrated. Is that correct?

Yes
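
To make the "look backwards" semantics concrete, here is a minimal sketch of the decision for an idle connection when a migration signal arrives. It assumes the `IdleConnectionsMigration` shape proposed earlier in this thread; the struct layout and the function name are hypothetical, and the 30s default is taken from the proto comment quoted above.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

enum class IdleAction { Migrate, Close };

// Hypothetical stand-in for the proposed idle_connections_migration message.
struct IdleConnectionsMigration {
  std::chrono::seconds max_idle_time_before_migration{30}; // "Default to 30s"
};

// Decides what to do with an idle connection at the moment a migration signal
// arrives. No timer is started: the idle duration is measured backwards from
// `now` to the connection's last activity.
IdleAction decideIdleConnection(
    std::chrono::steady_clock::time_point now,
    std::chrono::steady_clock::time_point last_activity,
    const std::optional<IdleConnectionsMigration>& idle_migration) {
  if (!idle_migration.has_value()) {
    return IdleAction::Close; // idle migration not configured
  }
  const auto idle_for = now - last_activity;
  return idle_for > idle_migration->max_idle_time_before_migration ? IdleAction::Close
                                                                   : IdleAction::Migrate;
}
```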

Comment on lines 76 to 79
// After migrating to a non-default network interface, the connection will
// only be allowed to stay on that network for up to this period of time before
// being drained unless it migrates to the default network or that network
// gets picked as the default by the device by then.
Contributor

Sounds like many edge cases and races may be introduced by this :)
I think one thing that is missing is the definition of "default network". I guess it is a QUIC internal definition.
Can you please add a link at the first mention of default network in this proto that explains its definition?

Contributor Author

Default network is not QUIC terminology but a mobile platform concept. Both Android and iOS have the concept of a default network for interacting with the internet, usually preferring unmetered networks (Wi-Fi) over metered ones (cellular). I inlined some explanation here. I couldn't find an external link that explains this concept.
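
For readers following this thread, the rule in the proto comment under discussion can be sketched as follows. The function and parameter names are hypothetical (the actual field name is not shown in this excerpt), and treating the limit as exclusive is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

enum class NonDefaultNetworkAction { KeepServing, Drain };

// Sketch of the rule: a connection on a non-default network is drained once it
// has stayed there longer than the configured period, unless it is on the
// default network by then (either because it migrated back or because the
// platform promoted its network to default).
NonDefaultNetworkAction checkNonDefaultNetwork(
    int64_t current_network, int64_t default_network,
    std::chrono::seconds time_on_current_network,
    std::chrono::seconds max_time_on_non_default_network) {
  if (current_network == default_network) {
    return NonDefaultNetworkAction::KeepServing;
  }
  return time_on_current_network > max_time_on_non_default_network
             ? NonDefaultNetworkAction::Drain
             : NonDefaultNetworkAction::KeepServing;
}
```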

Contributor

@RyanTheOptimist RyanTheOptimist left a comment

Just a few small comments.

(It might have been simpler to do a stand-alone PR which added the new writer and the ability to use it, but c'est la vie :>)

virtual std::vector<std::pair<int64_t, ConnectionType>> getAllConnectedNetworks() PURE;

/**
* Binds the given socket to the network interface associated with the handle.
Contributor

Maybe add a comment about what happens if the underlying socket doesn't support network binding. (Should it be a no-op, or is it the caller's responsibility not to call it in this case? If the latter, should we expose "supportsNetworkBinding()" or somesuch?)

Contributor Author

The caller (platform writer) checks for an invalid network handle and uses that as the signal of whether the socket supports network binding. The network handle can be invalid even on platforms that support network binding, e.g. the default Android network monitor doesn't propagate the network handle to native code. So whether network binding is meaningful doesn't depend purely on the SystemHelper interface.
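
A minimal sketch of the caller-side check described in this reply. The sentinel value `kInvalidNetworkHandle = -1` and the helper name are assumptions for illustration. The callback stands in for `SystemHelper::bindSocketToNetwork()`, with the socket argument omitted to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Assumed sentinel for "no usable network handle"; the real value may differ.
constexpr int64_t kInvalidNetworkHandle = -1;

// Binds only when the handle is valid. Returns true if binding was attempted.
// An invalid handle means binding is skipped entirely, so an implementation
// that cannot bind is never reached on platforms without network handles.
bool bindIfHandleValid(int64_t network_handle,
                       const std::function<void(int64_t)>& bind_socket_to_network) {
  if (network_handle == kInvalidNetworkHandle) {
    return false;
  }
  bind_socket_to_network(network_handle);
  return true;
}
```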

* Set whether to use a platform specific APIs to create UDP socket and the associated QUIC packet
* writer. Note that `setUseV2NetworkMonitor()` also needs to be called to take effect. This is a
* temporary API which will be deprecated once the platform specific extension is verified to work
* and will be used as the default.
Contributor

What's the relationship between this method and setUseV2NetworkMonitor? I would have thought that using this method would enable the new writer regardless of any migration / network monitoring things.

Contributor Author

That's right, this API can be called regardless of whether the other migration API is called or not. But if the latter is called the engine builder will behave as if the former is also called.
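
The relationship between the two builder options can be sketched like this. `BuilderFlags` and `shouldUsePlatformPacketWriter` are hypothetical names that mirror the `use_quic_platform_packet_writer_ || enable_connection_migration_` condition from the diff quoted earlier in this review.

```cpp
#include <cassert>

// Hypothetical mirror of the two engine builder flags discussed above.
struct BuilderFlags {
  bool use_quic_platform_packet_writer = false;
  bool enable_connection_migration = false;
};

// The platform packet writer is configured when it is requested directly, or
// implicitly when connection migration is enabled.
bool shouldUsePlatformPacketWriter(const BuilderFlags& flags) {
  return flags.use_quic_platform_packet_writer || flags.enable_connection_migration;
}
```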

}

/**
* Set whether to migrate idle connections to a different network upon network events.
Contributor

The other similar methods use the phrase "QUIC connections". We should probably be consistent.

Contributor Author

done

* Note that `setUseV2NetworkMonitor()` also needs to be called to take effect.
* If enabled, the engine will automatically be configured to use platform packet writer. *
*/
public NativeCronvoyEngineBuilderImpl setEnableConnectionMigration(boolean enable) {
Contributor

Since this and the following methods are QUIC-specific, should they have QUIC in the name?

Contributor Author

Renamed them to mention Quic

@repokitteh-read-only repokitteh-read-only bot removed the deps Approval required for changes to Envoy's external dependencies label Nov 26, 2025
@danzh2010 danzh2010 dismissed stale reviews from RyanTheOptimist and abeyad via 0eba0f0 November 26, 2025 21:06
@repokitteh-read-only repokitteh-read-only bot added the deps Approval required for changes to Envoy's external dependencies label Nov 26, 2025
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
Signed-off-by: Dan Zhang <[email protected]>
@danzh2010
Contributor Author

/retest

Signed-off-by: Dan Zhang <[email protected]>
@danzh2010
Contributor Author

/retest

@abeyad
Contributor

abeyad commented Dec 1, 2025

/retest

@abeyad
Contributor

abeyad commented Dec 1, 2025

/wait

@danzh2010 is OOO

@abeyad
Contributor

abeyad commented Dec 1, 2025

Looks like the only remaining CI issue is test coverage:

✗ source/common/quic: 93.2% (threshold: 93.3%)

@abeyad abeyad added the no stalebot Disables stalebot from closing an issue label Dec 1, 2025
Contributor

@adisuissa adisuissa left a comment

/lgtm api

@repokitteh-read-only repokitteh-read-only bot removed the api label Dec 1, 2025

Labels

deps Approval required for changes to Envoy's external dependencies no stalebot Disables stalebot from closing an issue waiting

7 participants