
feat: Cluster Sharding telemetry #32878

Open
sebastian-alfers wants to merge 4 commits into main from sharding-telemetry

Conversation

@sebastian-alfers (Contributor):

No description provided.

private val verboseDebug = context.system.settings.config.getBoolean("akka.cluster.sharding.verbose-debug-logging")

private val instrumentation =
  ClusterShardingInstrumentationProvider.get(context.system).instrumentation("shard_region", typeName)
Contributor:

What is the purpose of the scope parameter?
Wouldn't it be easier to include the typeName in the shardBufferSize and increaseShardBufferSize methods?
Then a single ClusterShardingInstrumentationProvider instance can be used, instead of creating a new one for each shard.

Contributor Author:

> What is the purpose of the scope parameter?

The idea is to have it as an attribute on the metric, to drill down by component ("shard_region" or "shard").

Yes, I can move it to a single instance in the extension and pass in the params.
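For illustration, a minimal sketch of that single-instance shape (the trait name and exact parameter order are assumptions; only shardBufferSize, scope, and typeName appear in this PR):

// Sketch (assumption): one instrumentation instance held by the extension,
// with scope and typeName passed per call instead of captured per shard.
trait ClusterShardingInstrumentation {
  // scope distinguishes the reporting component, e.g. "shard_region" or "shard",
  // and becomes an attribute on the metric for drill-down
  def shardBufferSize(scope: String, typeName: String, size: Int): Unit
}

// Hypothetical call site in ShardRegion:
// instrumentation.shardBufferSize("shard_region", typeName, totBufSize + 1)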

} else {
  shardBuffers.append(shardId, msg, snd)

  instrumentation.shardBufferSize(totBufSize + 1)
Contributor:

increaseShardBufferSize isn't used?
What's the plan: should it always report the size, or increment/decrement?

Contributor Author:

The idea was to use +1 / -1 where possible, instead of having to calculate the size each time.
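As a hedged sketch, the buffering call site from the diff above could then report a delta (increaseShardBufferSize is the name questioned above; the decrease counterpart is an assumption):

shardBuffers.append(shardId, msg, snd)
instrumentation.increaseShardBufferSize(typeName, 1) // +1, no size recalculation
// ... and when a buffered message is delivered or dropped:
// instrumentation.decreaseShardBufferSize(typeName, 1) // -1 on drain (assumed name)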

}

val typeName = "GiveMeYourHome"
val initiallyOnForth = "on-fourth"
Contributor:

not used?

runOn(second) {
  val probe = TestProbe()
  (1 to 100).foreach { n =>
    shardRegion.tell(Get(s"id-$n"), probe.ref)
Contributor:

In the warmup you use "id1" and here "id-1". Those are different shard ids, but perhaps make it clearer by using something completely different in the warmup, such as "a", "b", "c".
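For example, a warmup along these lines (reusing Get, shardRegion, and probe from the snippet above) could not be confused with the measured "id-$n" entities:

// Warmup with ids that clearly differ from the ones counted later
List("a", "b", "c").foreach { id =>
  shardRegion.tell(Get(id), probe.ref)
}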

Contributor:

I think there is a race condition in this test. You use a shared counter for all shards, which is reset to 0 when any shard is started. So it could increase the counter for id-1, but then the shard actor for id-2 is started, and resetting the counter to 0 again.

Contributor Author:

That would only trigger if we created more than one node in addition to the coordinator? Or is there more than one instance of ShardRegion in this test?

Contributor:

Right, I was thinking wrong; it's reset when the region is started, and we only have one region (typeName) per JVM here.

}
eventually {
  ClusterShardInstrumentatioSpecConfig.counter.get() shouldBe 100
}
Contributor:

The test could continue by removing the blackhole and seeing that the buffer size decreases to 0 again.

Contributor Author:

Yes, good! And also dropping messages if the buffer is full.
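A rough sketch of that continuation, assuming the multi-node TestConductor's passThrough to lift the blackhole and a first role alongside second (the counter access follows the pattern suggested later in this review):

// Sketch: lift the blackhole so buffered messages are delivered again
runOn(first) {
  testConductor.passThrough(first, second, Direction.Both).await
}
runOn(second) {
  eventually {
    counter.get() shouldBe 0 // buffer drained back to zero
  }
}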

@InternalStableApi
class EventsourcedInstrumentationProvider(system: ExtendedActorSystem) extends Extension {
  private val fqcnConfigPath = "akka.persistence.telemetry.eventsourced.instrumentations"
  private val fqcnConfigPath = "akka.persistence.telemetry.eventsourced.instrumentatiffons"
Contributor Author:

Fixing!

val second = role("second")
testTransport(on = true)

val counter = new AtomicInteger()
@patriknw (Contributor), Feb 5, 2026:

Even though we have isolation by separate JVMs for each node here, it would be nice not to use a global counter, but to place the counter inside SpecClusterShardingTelemetry.

From the test you can access it with

ClusterShardingInstrumentationProvider(system).instrumentation.asInstanceOf[SpecClusterShardingTelemetry].counter

Contributor Author:

👍
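A minimal sketch of that shape (SpecClusterShardingTelemetry and the access expression are from the comment above; the constructor and the typeName/size method signature follow the SPI proposed below and are otherwise assumptions):

// Sketch: the counter lives in the spec's telemetry implementation, not in a global
class SpecClusterShardingTelemetry(system: ExtendedActorSystem) extends ClusterShardingInstrumentation {
  val counter = new AtomicInteger()
  override def shardBufferSize(typeName: String, size: Int): Unit =
    counter.set(size)
}

// Accessed from the test as:
// ClusterShardingInstrumentationProvider(system).instrumentation
//   .asInstanceOf[SpecClusterShardingTelemetry].counter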

dropped,
shard)
// better to decrease by "dropped" to avoid calculating the size?
instrumentation.shardBufferSize(scope, typeName, shardBuffers.size)
Contributor:

What is the purpose of the scope parameter? Isn't that always "shard_region"? Is it some kind of "might be good in the future"? If we have more buffers in sharding that we want to instrument, we can have explicit methods for them in the SPI?

Contributor:

Shouldn't this be shardBuffers.totalSize instead of shardBuffers.size?

> // better to decrease by "dropped" to avoid calculating the size?

This drop should be rare, so the performance of calculating the size is not a reason, but it might be better to have symmetry in the SPI with increase and decrease:

def shardBufferSize(typeName: String, size: Int): Unit
def incrementShardBufferSize(typeName: String, delta: Int): Unit
def decrementShardBufferSize(typeName: String, delta: Int): Unit

Then you can use decrement from deliver too.
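With that symmetry, the call sites could look roughly like this (dropped comes from the diff above; deliveredCount is a hypothetical name for the number of flushed messages):

// Drop path: report the delta instead of recalculating the total buffer size
instrumentation.decrementShardBufferSize(typeName, dropped)

// Deliver path: decrement as buffered messages are flushed to their shard
instrumentation.decrementShardBufferSize(typeName, deliveredCount) // hypothetical count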

Contributor Author:

Sounds good.

But then this would be shardRegionBufferSize for now, and later we'd add shardBufferSize once we add instrumentation to akka.cluster.sharding.Shard?

Contributor Author:

> shouldn't this be shardBuffers.totalSize instead of shardBuffers.size?

Ouch, yeah! Good catch. I think this scenario is just not triggered yet in this test case?

@patriknw changed the title from "wip: Cluster Sharding telemetry" to "feat: Cluster Sharding telemetry" on Feb 9, 2026
@patriknw marked this pull request as ready for review on February 9, 2026 at 15:08
@patriknw (Contributor) left a comment:

looking good


override def shardRegionBufferSize(
    selfAddress: Address,
    shardRegionActor: ActorRef,
Contributor:

Do we need the selfAddress and shardRegionActor? Is that because Cinnamon has that existing structure?

Contributor Author:

Yes, in the current version both are used. I'll let @pvlugter share his thoughts.

Member:

I didn't work on cluster sharding instrumentation originally, but if you're integrating with what's there already, these will be for the identity and for accessing metadata.

But you can also have this new telemetry be completely separate. You'll mostly just want the entity type for a metric label.

@patriknw (Contributor), Feb 12, 2026:

To me it feels like Cinnamon should already know the ActorSystem, and thereby the address, and I don't see why this metric should be coupled to shardRegionActor. The address + typeName should be enough to create a unique key. However, if that is needed because it makes it easier on the Cinnamon side, then so be it.

Member:

Also not sure why self address is being passed and why the actor ref is being used. I see that the address is already accessed automatically from the actor system for some cluster instrumentation, and agree that the entity type name is what should be used for identifying.

Contributor Author:

OK, then I'll remove it again (I thought Cinnamon needed it somewhere).

Member:

You'll likely need it if you're cross-integrating this instrumentation with the existing sharding instrumentation. Otherwise you can define this SPI cleanly, which I think is preferable.

@patriknw (Contributor) left a comment:

LGTM, even though I don't see the full rationale for the address and shardRegionActor parameters.


@johanandren (Contributor) left a comment:

LGTM with a few leftover TODOs dropped.

typeName,
dropped,
shard)
// better to decrease by "dropped" to avoid calculating the size?
Contributor:

Suggested change (remove the line):
- // better to decrease by "dropped" to avoid calculating the size?

sebastian-alfers and others added 2 commits February 17, 2026 09:28
* feat: metric for dropped messages in Shard Region buffer
  Co-authored-by: Johan Andrén <johan@markatta.com>
@sebastian-alfers (Contributor Author):

Running a nightly build based on this branch: https://github.com/akka/akka-core/actions/runs/22094044273
