
feat: Cluster Sharding telemetry #32878

Open
sebastian-alfers wants to merge 4 commits into main from sharding-telemetry

Conversation

@sebastian-alfers (Contributor):

No description provided.

private val verboseDebug = context.system.settings.config.getBoolean("akka.cluster.sharding.verbose-debug-logging")

private val instrumentation =
  ClusterShardingInstrumentationProvider.get(context.system).instrumentation("shard_region", typeName)
Contributor:

What is the purpose of the scope parameter?
Wouldn't it be easier to include the typeName in the shardBufferSize and increaseShardBufferSize methods?
Then a single ClusterShardingInstrumentationProvider instance can be used, instead of creating a new one for each shard.

Contributor Author:

> What is the purpose of the scope parameter?

The idea is to have it as an attribute on the metric, to drill down by component ("shard_region" or "shard").

Yes, I can move it to a single instance in the extension and pass in the params.
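For illustration, a minimal sketch of that single-instance shape (the trait name and exact parameter order are assumptions; only shardBufferSize, scope, and typeName appear in this PR):

// Sketch (assumption): one instrumentation instance held by the extension,
// with scope and typeName passed per call instead of captured per shard.
trait ClusterShardingInstrumentation {
  // scope distinguishes the reporting component, e.g. "shard_region" or "shard",
  // and becomes an attribute on the metric for drill-down
  def shardBufferSize(scope: String, typeName: String, size: Int): Unit
}

// Hypothetical call site in ShardRegion:
// instrumentation.shardBufferSize("shard_region", typeName, totBufSize + 1)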

} else {
  shardBuffers.append(shardId, msg, snd)

  instrumentation.shardBufferSize(totBufSize + 1)
Contributor:

increaseShardBufferSize isn't used?
What's the plan: should it always report the size, or increment/decrement?

Contributor Author:

The idea was to use +1 / -1 where possible, instead of having to calculate the size each time.
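As a hedged sketch, the buffering call site from the diff above could then report a delta (increaseShardBufferSize is the name questioned above; the decrease counterpart is an assumption):

shardBuffers.append(shardId, msg, snd)
instrumentation.increaseShardBufferSize(typeName, 1) // +1, no size recalculation
// ... and when a buffered message is delivered or dropped:
// instrumentation.decreaseShardBufferSize(typeName, 1) // -1 on drain (assumed name)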

}

val typeName = "GiveMeYourHome"
val initiallyOnForth = "on-fourth"
Contributor:

not used?

runOn(second) {
  val probe = TestProbe()
  (1 to 100).foreach { n =>
    shardRegion.tell(Get(s"id-$n"), probe.ref)
Contributor:

In the warmup you use "id1" and here "id-1". Those are different shard ids, but perhaps make it clearer by using something completely different in the warmup, such as "a", "b", "c".
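For example, a warmup along these lines (reusing Get, shardRegion, and probe from the snippet above) could not be confused with the measured "id-$n" entities:

// Warmup with ids that clearly differ from the ones counted later
List("a", "b", "c").foreach { id =>
  shardRegion.tell(Get(id), probe.ref)
}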

Contributor:

I think there is a race condition in this test. You use a shared counter for all shards, which is reset to 0 when any shard is started. So it could increase the counter for id-1, but then the shard actor for id-2 is started, and resetting the counter to 0 again.

Contributor Author:

That would only trigger if we created more than one node in addition to the coordinator? Or is there more than one instance of ShardRegion in this test?

Contributor:

Right, I was thinking wrong; it's reset when the region is started, and we only have one region (typeName) per JVM here.

}
eventually {
  ClusterShardInstrumentatioSpecConfig.counter.get() shouldBe 100
}
Contributor:

The test could continue by removing the blackhole and seeing that the buffer size decreases to 0 again.

Contributor Author:

Yes, good! And also dropping messages if the buffer is full.
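A rough sketch of that continuation, assuming the multi-node TestConductor's passThrough to lift the blackhole and a first role alongside second (the counter access follows the pattern suggested later in this review):

// Sketch: lift the blackhole so buffered messages are delivered again
runOn(first) {
  testConductor.passThrough(first, second, Direction.Both).await
}
runOn(second) {
  eventually {
    counter.get() shouldBe 0 // buffer drained back to zero
  }
}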

@InternalStableApi
class EventsourcedInstrumentationProvider(system: ExtendedActorSystem) extends Extension {
  private val fqcnConfigPath = "akka.persistence.telemetry.eventsourced.instrumentations"
  private val fqcnConfigPath = "akka.persistence.telemetry.eventsourced.instrumentatiffons"
Contributor Author:

Fixing!

val second = role("second")
testTransport(on = true)

val counter = new AtomicInteger()
@patriknw (Contributor), Feb 5, 2026:

Even though we have isolation by separate JVMs for each node here, it would be nice not to use a global counter, but to place the counter inside SpecClusterShardingTelemetry.

From the test you can access it with

ClusterShardingInstrumentationProvider(system).instrumentation.asInstanceOf[SpecClusterShardingTelemetry].counter

Contributor Author:

👍
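A minimal sketch of that shape (SpecClusterShardingTelemetry and the access expression are from the comment above; the constructor and the typeName/size method signature follow the SPI proposed below and are otherwise assumptions):

// Sketch: the counter lives in the spec's telemetry implementation, not in a global
class SpecClusterShardingTelemetry(system: ExtendedActorSystem) extends ClusterShardingInstrumentation {
  val counter = new AtomicInteger()
  override def shardBufferSize(typeName: String, size: Int): Unit =
    counter.set(size)
}

// Accessed from the test as:
// ClusterShardingInstrumentationProvider(system).instrumentation
//   .asInstanceOf[SpecClusterShardingTelemetry].counter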

dropped,
shard)
// better to decrease by "dropped" to avoid calculating the size?
instrumentation.shardBufferSize(scope, typeName, shardBuffers.size)
Contributor:

What is the purpose of the scope parameter? Isn't that always "shard_region"? Is it some kind of "might be good in the future"? If we have more buffers in sharding that we want to instrument, we can have explicit methods for them in the SPI?

Contributor:

Shouldn't this be shardBuffers.totalSize instead of shardBuffers.size?

> // better to decrease by "dropped" to avoid calculating the size?

This drop should be rare, so the performance of calculating the size is not a reason, but it might be better to have symmetry in the SPI with increase and decrease:

def shardBufferSize(typeName: String, size: Int): Unit
def incrementShardBufferSize(typeName: String, delta: Int): Unit
def decrementShardBufferSize(typeName: String, delta: Int): Unit

Then you can use decrement from deliver too.
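With that symmetry, the call sites could look roughly like this (dropped comes from the diff above; deliveredCount is a hypothetical name for the number of flushed messages):

// Drop path: report the delta instead of recalculating the total buffer size
instrumentation.decrementShardBufferSize(typeName, dropped)

// Deliver path: decrement as buffered messages are flushed to their shard
instrumentation.decrementShardBufferSize(typeName, deliveredCount) // hypothetical count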

Contributor Author:

Sounds good.

But then this would be shardRegionBufferSize for now, and later we'd add shardBufferSize once we add instrumentation to akka.cluster.sharding.Shard?

Contributor Author:

> shouldn't this be shardBuffers.totalSize instead of shardBuffers.size?

Ouch, yeah! Good catch. I think this scenario is just not triggered yet in this test case?

@patriknw changed the title from "wip: Cluster Sharding telemetry" to "feat: Cluster Sharding telemetry" on Feb 9, 2026
@patriknw marked this pull request as ready for review on February 9, 2026 at 15:08
@patriknw (Contributor) left a comment:

looking good


override def shardRegionBufferSize(
    selfAddress: Address,
    shardRegionActor: ActorRef,
Contributor:

Do we need the selfAddress and shardRegionActor? Is that because Cinnamon has that existing structure?

Contributor Author:

Yes, in the current version both are used. I'll let @pvlugter share his thoughts.

Member:

I didn't work on cluster sharding instrumentation originally, but if you're integrating with what's there already, these will be for the identity and for accessing metadata.

But you can also have this new telemetry be completely separate. You'll mostly just want the entity type for a metric label.

@patriknw (Contributor), Feb 12, 2026:

To me it feels like Cinnamon should already know the ActorSystem, and thereby the address, and I don't see why this metric should be coupled to shardRegionActor. The address + typeName should be enough to create a unique key. However, if that is needed because it makes it easier on the Cinnamon side, then so be it.

Member:

Also not sure why self address is being passed and why the actor ref is being used. I see that the address is already accessed automatically from the actor system for some cluster instrumentation, and agree that the entity type name is what should be used for identifying.

Contributor Author:

OK, then I'll remove it again (I thought Cinnamon needed it somewhere).

Member:

You'll likely need it if you're cross-integrating this instrumentation with the existing sharding instrumentation. Otherwise you can define this SPI cleanly, which I think is preferable.

@patriknw (Contributor) left a comment:

LGTM, even though I don't see the full rationale for the address and shardRegionActor parameters.


@johanandren (Contributor) left a comment:

LGTM with a few leftover TODOs dropped.

typeName,
dropped,
shard)
// better to decrease by "dropped" to avoid calculating the size?
Contributor:

Suggested change (remove the line):
- // better to decrease by "dropped" to avoid calculating the size?

sebastian-alfers and others added 2 commits February 17, 2026 09:28
* feat: metric for dropped messages in Shard Region buffer
  Co-authored-by: Johan Andrén <johan@markatta.com>
@sebastian-alfers (Contributor Author):

Running a nightly build based on this branch: https://github.com/akka/akka-core/actions/runs/22094044273
