Monitor CPU usage and internally rate limit #1545

Open
@carver

Description

What's wrong?

trin can redline the CPU while accepting offers and re-offering to peers, especially on the state network and when onboarding a new node. (My fans tended to spin up once trin reached about 10 accepted offers per second.)

Just like with storage, we don't want to overuse CPU. We prefer the client to feel light, and as something that can be constantly running in the background.

Possible solution

Monitor cpu_time::ProcessTime inside trin, and cap CPU usage at some small percentage by default (2%? 5%? 10%?). Add a CLI flag to configure the cap manually.
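As a rough sketch of the measurement step: sample the process CPU time at the start and end of a window, then divide the difference by the wall-clock time that passed. The function below takes plain `Duration` values so it stays standard-library only. In trin the two CPU-time samples would presumably come from cpu_time::ProcessTime, but that wiring is not shown here. The names `cpu_usage_fraction` and `over_limit` are hypothetical. Treating the cap as a fraction of a single core is also an assumption, since the proposal does not say whether the percentage is per core or across all cores.

```rust
use std::time::Duration;

/// Fraction of one core used during a sampling window.
///
/// `cpu_start` and `cpu_end` are cumulative process CPU-time readings taken
/// at the start and end of the window (assumed to come from something like
/// cpu_time::ProcessTime). `wall_elapsed` is the wall-clock window length.
/// The result can exceed 1.0 when the process keeps more than one core busy.
fn cpu_usage_fraction(cpu_start: Duration, cpu_end: Duration, wall_elapsed: Duration) -> f64 {
    if wall_elapsed.is_zero() {
        // No time has passed, so there is nothing meaningful to report.
        return 0.0;
    }
    // saturating_sub guards against a reading that appears to go backwards.
    let cpu_delta = cpu_end.saturating_sub(cpu_start);
    cpu_delta.as_secs_f64() / wall_elapsed.as_secs_f64()
}

/// True when the measured usage exceeds the configured cap
/// (for example 0.05 for a 5% cap; the default value is still undecided).
fn over_limit(usage: f64, limit: f64) -> bool {
    usage > limit
}
```

For example, 500 ms of CPU time consumed over a 10 s window works out to 5% of one core, which would sit exactly at a 5% cap without exceeding it.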

Every 10s that the CPU is above the limit, shrink the data radius of the state network by 10% of its current level. (This assumes, based on current experience, that state is the only offender for high CPU usage.) Every 10s that the CPU is at or below half the limit, and the data storage is under target, grow the radius by 10%. I think we want to be quite responsive, which this accomplishes: six consecutive shrinks leave 0.9^6 ≈ 53% of the radius, so we are willing to cut the radius roughly in half in 1 minute.
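The adjustment rule could be sketched roughly as follows. This is only an illustration of the proposal, not existing trin code. It assumes the radius is represented as an `f64` fraction of the keyspace in [0, 1] (the real radius type is likely different), and the names `STEP` and `adjust_radius` are hypothetical. The function is meant to be called once per 10s tick with the CPU usage measured over that tick.

```rust
/// Relative change applied per tick (10% of the current radius).
const STEP: f64 = 0.10;

/// Returns the new radius after one 10s tick.
///
/// `radius` is assumed to be a fraction of the keyspace in [0.0, 1.0].
/// `cpu_usage` and `cpu_limit` are fractions in the same units (e.g. 0.05 = 5%).
/// `storage_under_target` says whether stored data is below the storage target.
fn adjust_radius(radius: f64, cpu_usage: f64, cpu_limit: f64, storage_under_target: bool) -> f64 {
    if cpu_usage > cpu_limit {
        // Over the cap: shrink by 10% of the current radius.
        radius * (1.0 - STEP)
    } else if cpu_usage <= cpu_limit / 2.0 && storage_under_target {
        // Comfortably under the cap and with storage headroom: grow by 10%,
        // never beyond the full keyspace.
        (radius * (1.0 + STEP)).min(1.0)
    } else {
        // Between half the cap and the cap, or storage is full: hold steady.
        radius
    }
}
```

The band between half the cap and the cap, where the radius is left alone, keeps it from oscillating on every tick when usage hovers near the limit.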

This approach might actually help state nodes find their natural "true" radius point faster, and mitigate the fill & dump behavior when launching a fresh client (which is slower and more painful on state than on history).

Challenges

trin could use CPU for other reasons, such as user interaction (e.g. when someone is using the RPC API). It is easy to imagine CPU usage spiking then, and it would be wrong to adjust the state radius in response. We probably cannot punt on this, and will need to include a solution in the first implementation.

Another awkward aspect of this approach is that it's hard to tell which network is using too much CPU. We probably don't want to adjust every network's radius at once if CPU is high. Right now, it's only ever state, so I think we can punt on this challenge. But if multiple networks start using a lot of CPU, we might want a clever way to measure computation that isn't just checking cpu_time::ProcessTime.

Timeline

I won't try to get this into the imminent stable release, of course, just planning ahead. This is another good reason not to enable state by default.
