Skip to content

RFC: Add a state machine implementation #218

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 13 commits into
base: v310
Choose a base branch
from

Conversation

JeroenSoeters
Copy link
Contributor

@JeroenSoeters JeroenSoeters commented Apr 13, 2025

RFC: Add a state machine implementation

Background

This draft PR proposes a StateMachine to be added to the standard actors library. The implementation is loosely based on gen_statem. It operates similarly to gen_statem in callback mode state_functions, but instead of a single callback per state we use a callback per (state, messageType) tuple.

Implementation

The StateMachine is implemented as an Actor in the standard actors library. It follows the same pattern as the Supervisor where the Init function implementation is mandatory and returns the specification of the StateMachine. The specification in this draft PR contains the initial state as well as the callback functions for the states. States are of type gen.Atom, similar to their Erlang/Elixir counterparts, and the StateMachine has a generic type parameter D for the data associated with the state machine (again akin to the Erlang implementation). A difference with the Erlang implementation is that the data and the new state are not returned from the state callback functions, instead they are part of the mutable state of the actor.

For creating the StateMachineSpec there are several helper functions and types as we need to capture the type of the message being processed by the callbacks. We also want to reduce unnecessary copying of the spec, therefore we use the functional option pattern.

When attempting an invalid state transition, following Erlang/OTP's "let it crash" philosophy, the process terminates.

API

Below is an example of what the API for the consumer of the state machine looks like. There is a [PR (https://github.com/ergo-services/examples/pull/8) in the examples repo as well

// the data associated with the state machine
type OrderData struct {
	items     []string
	processed time.Time
	shipped   time.Time
	delivered time.Time
	canceled  time.Time
}

// the actor
type Order struct {
	act.StateMachine[OrderData]
}

// the factory
func factoryOrder() gen.ProcessBehavior {
	return &Order{}
}

// setting up the spec
func (order *Order) Init(args ...any) (act.StateMachineSpec[OrderData], error) {
	spec := act.NewStateMachineSpec(gen.Atom("new"),
                // initial data
                act.WithData(OrderData{ items: []string {"shoes"}})

	        // state enter callback, invoked for every state transition
                act.WithStateEnterCallback(logStateChanges)	

		// an order in the `new` state can be processed and canceled
		act.WithStateMessageHandler(gen.Atom("new"), process),
		act.WithStateMessageHandler(gen.Atom("new"), cancel),

		// an order in the `processing` state can be shipped and canceled
		act.WithStateMessageHandler(gen.Atom("processing"), ship),
		act.WithStateMessageHandler(gen.Atom("processing"), cancel),

		// an order in the `shipped` state can be delivered
		act.WithStateMessageHandler(gen.Atom("shipped"), deliver),

                // for an order in the `delivered` state we can query the delivery time
                act.WithStateCallHandler(gen.Atom("delivered"), deliveryTime),

                // can't do anything with orders in the `canceled` state


	)

	return spec, nil
}

// messages for the state callbacks
type Process struct{}

type Ship struct {
	priority bool
}

type Deliver struct {}

type Cancel struct {
	reason string
}

type DeliveryTime struct {}

type ContactCarrier struct {}

type SendCustomerSurvey struct {}

// state callbacks
func process(state gen.Atom, data OrderData, message Process, proc gen.Process) (gen.Atom, OrderData, []act.Action, error) {
	if len(data.items) < 1 {
		proc.Log().Info("cannot process order without items")
		return state, data, nil
	}
	data.processed = time.Now()
	return gen.Atom("processing"), data, nil, nil
}

func ship(state gen.Atom, data OrderData, message Ship, proc gen.Process) (gen.Atom, OrderData, []act.Action, error) {
	// if the order is not delivered within 3 days we contact the carrier
        timeout := StateTimeout {
		Duration: 3 * 24 * time.Hours, // setting a 3-day timer is probably not a good idea until we support code hot-loading ;)
		Message: ContactCarrier {},
        }
	data.shipped = time.Now()
	return gen.Atom("shipped"), data, []act.Action{ timeout }, nil
}

func deliver(state gen.Atom, data OrderData, message Deliver, proc gen.Process) (gen.Atom, OrderData, []act.Action, error) {
	// send a customer survey 4 hours after delivery
        timeout := GenericTimeout {
                Name: gen.Atom("customerSurvey"),
		Duration: 4 * time.Hours,
		Message: SendCustomerSurvey {},
        }
	data.delivered= time.Now()
	return gen.Atom("delivered"), data, []act.Action{ timeout }, nil
}

func deliveryTime(state gen.Atom, data OrderData, message DeliveryTime, proc gen.Process) (gen.Atom, OrderData, time.Time, []act.Action, error) {
	return state, data, data.delivered, nil, nil
}

// state enter callback (invoked for every state transition)
func logStateChanges(oldState gen.Atom, newState gen.Atom, data OrderData, proc gen.Process) (gen.Atom, OrderData, error) {
	proc.Log().Info("order now in %s state, was %s", newState, oldState)
	return newState, data, nil
}
...

Open questions

  • Since Ergo doesn't support code hot-loading, any timeouts currently do not survive an application restart. Is this acceptable?
  • What minimum feature set do we want before merging this into a release candidate branch? The Erlang version is quite feature-rich, but I'd like to know if we want to be on par or if we are okay with deferring some features for future versions. Some notable gaps right now:
    • All-state handlers (I'll probably add this soon-ish)
    • Support for updating timers
    • Event postponing
    • No EventContent is delivered to timeout callbacks
    • Timers are not persistent (in Erlang, the runtime manages timers, and they are designed to survive hot code swapping)
    • Internal events
    • Hibernation

TODO

  • have a means of injecting the initial data
  • support for events
  • state enter callbacks (for code you want to run on every state transition)
  • state timeouts
  • generic timeouts
  • event timeouts
  • documentation

@CLAassistant
Copy link

CLAassistant commented Apr 13, 2025

CLA assistant check
All committers have signed the CLA.

@JeroenSoeters JeroenSoeters changed the title Initial StateMachine implementation RFC: Add a state machine implementation Apr 13, 2025
@halturin
Copy link
Collaborator

Hi Jeroen,

Thanks for the great contribution — I really appreciate it! That said, I’m still a bit uncertain about the design of the FSM implementation. My first attempt was actually quite similar to your approach, using reflection. However, in the end, I found it wasn’t as clean and elegant as I’d hoped.

I'll take a deeper look into your code - maybe we can find an even better solution together.

@JeroenSoeters
Copy link
Contributor Author

Hi Jeroen,

Thanks for the great contribution — I really appreciate it! That said, I’m still a bit uncertain about the design of the FSM implementation. My first attempt was actually quite similar to your approach, using reflection. However, in the end, I found it wasn’t as clean and elegant as I’d hoped.

I'll take a deeper look into your code - maybe we can find an even better solution together.

Thanks Taras. I copied an example of what the consumer API looks like in the PR description as well. I'm not a big fan of the reflection code either, but I couldn't find another way, especially whilst keeping the API of the FSM reasonably elegant and type-safe. That being said, I am relatively new to Go, so maybe there are better ways of doing this.

Let me know! Once we agree on an approach, I would be happy to expand on this PR and add some of the additional functionality available in gen_statem.

@JeroenSoeters JeroenSoeters force-pushed the state-machine branch 2 times, most recently from c63516a to 039be32 Compare April 16, 2025 15:49
@JeroenSoeters
Copy link
Contributor Author

JeroenSoeters commented Apr 16, 2025

Hi @halturin,

I have cleaned up this PR to make it a bit more readable. I've also changed the API to be more similar to the other actors (replacing StateCallback with MessageHandler and CallHandler) and we now support returning values from synchronous calls as well (which was one of the open questions above). I've also added a test for the non-happy path (ensuring the process terminates) and updated the Spec so you can provide the initial data for the StateMachine.

Let me know what you think and if you have any specific concerns you want me to look into. Happy to have a more synchronous conversation about this as well :)

PS: the discord link on GH is invalid for me?

@halturin
Copy link
Collaborator

[4/16/25] After thinking about this more I am leaning towards not coupling events to states. In the case of messages and calls, a process explicitly asks the StateMachine to do work

agree with it. no need to couple.

[4/16/25] I am pretty convinced you do want to be able to query the state machine. Therefore, I am introducing a second callback type with a return value. I am type aliasing these as StateMessageHandler and StateCallHandler to keep the naming in line with the rest of the framework

glad to see how you aligning with the framework.

the only think, atm, not sure about the functional approach With.... I usually prefer the way like Spec, Options etc. but still, it's not a mandatory thing.

PS: great progress 👍

@JeroenSoeters
Copy link
Contributor Author

the only think, atm, not sure about the functional approach With.... I usually prefer the way like Spec, Options etc. but still, it's not a mandatory thing.

Is it the naming or the functional options pattern in general? If it is the naming, we can remove the With... or name it Add... or Register.... If your hesitation concerns the pattern itself, I agree it would be nicer to declare these functions in the struct initializer for the StateMachineSpec. You can actually do that, however, there will be no compile-time guarantees around the methods you'd register, and if you were to make a mistake and mix up the order of parameters or return values, it would be very painful to debug this.

PS: I just pushed support for state timeouts. State enter callback support was already added. I will add generic timeouts and event timeouts next and update the example to demonstrate the newly added features. I might just port the lock example from the gen_statem documentation.

@JeroenSoeters JeroenSoeters force-pushed the state-machine branch 2 times, most recently from 092f4fa to ccfdc65 Compare April 24, 2025 16:20
@halturin
Copy link
Collaborator

If your hesitation concerns the pattern itself

this :) not the naming. just to align with the common style.

what do you think about using an event as a trigger for state transitions without requiring a callback function?
if needed, we could define a callback that runs upon entering the new state. haven't seen all the code, but noticed it has this feature.

PS: i’ve also noticed a panic when Monitor events fail. perhaps we should retry until it succeeds? not sure which approach is best yet — just an idea...

@halturin
Copy link
Collaborator

halturin commented Apr 25, 2025

forgot to mention… let's create a separate example in ergo.services/examples for the StateMachine, with a clearer and more visual demonstration of its functionality

I can also give you access to docs.ergo.services for writing documentation

@JeroenSoeters
Copy link
Contributor Author

If your hesitation concerns the pattern itself

this :) not the naming. just to align with the common style.

what do you think about using an event as a trigger for state transitions without requiring a callback function? if needed, we could define a callback that runs upon entering the new state. haven't seen all the code, but noticed it has this feature.

Not sure how this would work with an event trigger, could you elaborate more on this approach?

One other approach I was thinking of, sacrificing type safety for consistency: we could allow for registering a single handler per state instead of per state and message type. This is actually what Erlang does as well. Of course, Erlang has pattern matching, so you're still effectively dispatching on the type of the message. In this approach, we would deliver any incoming message to the handler for the state the state machine is currently in. Then it is up to the user to do the type assertions on the incoming message (now of type any instead of type M) and handle all the logic for a single state in one function. You can register one MessageHandler and one CallHandler per state. For the latter, it's again the user's responsibility to return the correct type (which would now also be of any instead of type R). We wouldn't have to capture the generic type anymore for the map key, as the map key would be the state, and the callbacks would be of type func (data D, message any, proc gen.Process) (gen.Atom, D, []act.Action, error). As they are all of the same type you could now register them in the struct initializer for the spec and get rid of the functional options pattern.

I want to clean up the current branch so that this is finished. We could try the alternate approach on a separate branch. I'm not 100% sure yet this would work, I'd have to see the code first.

Take a look at the updated example with the existing API and let me know what you think.

One other thought: we could also align the rest of ergo with this PR. It all comes down to whether we want to dispatch messages based on their type or if a single catch-all handler is preferred, and it is up to the user to interrogate the type with assertions and dispatch to the correct logic themselves.

PS: i’ve also noticed a panic when Monitor events fail. perhaps we should retry until it succeeds? not sure which approach is best yet — just an idea...

Yea unsure about this. If I register a handler for an event, do I expect that event to be registered or not? The semantics become a bit fuzzy if the event cannot be registered. What if I made a typo in the event, how will I find out? What will be the retry strategy for the actor to call the MonitorEvent function? What if the event got registered and an event got published just before we retry the MonitorEvent function, is it acceptable to register and still potentially miss events?

PS: docs access would be good, I want to finalize the API first though before writing those

@JeroenSoeters
Copy link
Contributor Author

JeroenSoeters commented Apr 25, 2025

The following code compiles, it doesn't actually work yet, but should give a good idea what the handler per state solution would look like:

func factoryOrder() gen.ProcessBehavior {
	return &Order{}
}

type Process struct{}

type Ship struct {
	priority bool
}

type Deliver struct{}

type Cancel struct {
	reason string
}

type DeliveryTime struct {
}

func (order *Order) Init(args ...any) (act.StateMachineSpec[OrderData], error) {
	spec := act.StateMachineSpec[OrderData]{
		InitialState: gen.Atom("new"),
		Data:         OrderData{items: []string{"socks"}},
		StateMessageHandlers: map[gen.Atom]act.StateMessageHandler[OrderData]{
			gen.Atom("new"):        new_order, // a single method for each state
			gen.Atom("processing"): processing,
			gen.Atom("shipped"):    shipped,
		},
		StateCallHandlers: map[gen.Atom]act.StateCallHandler[OrderData]{
			gen.Atom("delivered"): delivered,
		},
	} // no more functional options, just a regular struct initializer

	return spec, nil
}

func new_order(state gen.Atom, data OrderData, message any, proc gen.Process) (gen.Atom, OrderData, []act.Action, error) {
	switch message.(type) { // up to the user to query the message type
	case Process:
		proc.Log().Info("processing order")
		if len(data.items) < 1 {
			proc.Log().Info("cannot process order without items")
			return state, data, nil, nil
		}
		data.processed = time.Now()
		return gen.Atom("processing"), data, nil, nil
	case Cancel:
		proc.Log().Info("canceling order...")
		data.canceled = time.Now()
		return gen.Atom("canceled"), data, nil, nil
	default: // we need some kind of error here that users need to return in the default case
		return state, data, nil, fmt.Errorf("some error")
	}
}

func processing(state gen.Atom, data OrderData, message any, proc gen.Process) (gen.Atom, OrderData, []act.Action, error) {
	switch message.(type) {
	case Ship:
		proc.Log().Info("shiping order...")
		data.shipped = time.Now()
		return gen.Atom("shipped"), data, nil, nil
	case Cancel:
		proc.Log().Info("canceling order...")
		data.canceled = time.Now()
		return gen.Atom("canceled"), data, nil, nil
	default:
		return state, data, nil, fmt.Errorf("some error")
	}
}

func shipped(state gen.Atom, data OrderData, message any, proc gen.Process) (gen.Atom, OrderData, []act.Action, error) {
	switch message.(type) {
	case Deliver:
		proc.Log().Info("delivering order...")
		data.delivered = time.Now()
		return gen.Atom("delivered"), data, nil, nil
	default:
		return state, data, nil, fmt.Errorf("some error")
	}
}

func delivered(state gen.Atom, data OrderData, message any, proc gen.Process) (gen.Atom, OrderData, any, []act.Action, error) { // notice the return type is now `any`, so no compile-time guarantees
	switch message.(type) {
	case DeliveryTime:
		return state, data, data.delivered, nil, nil
	default:
		return state, data, data.delivered, nil, fmt.Errorf("some error")
	}
}

FWIW, with this approach, I don't think we need to use any reflection anymore

@JeroenSoeters
Copy link
Contributor Author

@halturin did you get a chance to look at these approaches? what do you think?

@halturin
Copy link
Collaborator

Not yet, sorry. I've had a tough time over the last couple of weeks. This week, I hope.

PS: I sent an invitation to the gitbook. Pls, add your page in the Actors section.

@halturin
Copy link
Collaborator

may I ask you to change the target to v310 branch. I'm thinking of including it as a part of the 3.1 release

@JeroenSoeters JeroenSoeters changed the base branch from master to v310 May 1, 2025 13:14
@JeroenSoeters
Copy link
Contributor Author

may I ask you to change the target to v310 branch. I'm thinking of including it as a part of the 3.1 release

Just rebased v310 and updated the base branch.

Have you given the 2 different approaches some thought? I will write up the docs once we decided which one we'll go with.

Summarizing my thoughts about the 2 approaches:

Approach 1 (the current implementation)

Pros:

  • This approach is more type-safe.
    • There is no need for type assertions on the message being passed to the handler.
    • For synchronous calls the result is typed.
    • Registering the handlers in the spec is type-safe, which is arguably the biggest benefit of this approach. The type signatures of the handlers are relatively complex and it is easy to get the ordering wrong or omit one of the input or output parameters. This would be a run-time error.

Cons:

  • Having type-safe handlers is not how the other actors behave right now. Existing message handlers take messages of type any and call handlers return a result of type any.
  • Because of the type-safe approach, it is not possible to initialize the spec in a struct initializer. We need helper methods that are generic. In existing actors the spec can be initialized without the functional options pattern. There is however only one other actor (the Supervisor) that requires you to provide a spec afaik.

Approach 2 (as suggested in the comment)

Pros:

  • The signatures of the message handler and call handler are similar to those of the existing actors.
  • No need for a functional options pattern as the spec can be initialized in the struct initializer.
  • No need for reflection.

Cons:

  • Less type-safe and will be hard to debug issues.

I do think if we were to go with approach 1, that it would be worth considering whether (at some point in the future) the other actors could follow a similar pattern, meaning in gen.ProcessBehavior we would make HandleMessage and HandleCall generic over the message and return type:

HandleMessage[M any](from gen.PID, message M) error
HandleCall[M any, R any](from gen.PID, ref gen.Ref, request M) (R, error)

@halturin
Copy link
Collaborator

halturin commented May 1, 2025

I’ll take a closer look at your PR tomorrow (or the day after).

As for the two approaches... The main issue with generics is that they significantly impact performance, so I'm not sure the second one would work well for the high load scenario (in terms of performance, I mean).

UPD: forgot to mention - that’s actually the reason I use any in lib.QueueMPSC

@halturin
Copy link
Collaborator

halturin commented May 4, 2025

I’ve reviewed your implementation carefully — you’ve done a great and impressive job!
At this point, I have a few general questions and suggestions:

  • The use of the act namespace feels a bit speculative. You've introduced Action, MessageTimeout, and Option types that, by name, don’t clearly relate to a StateMachine.
    It might make more sense to move this actor into a separate library, such as ergo.services/actor (currently private).
    I’m planning to publish predefined actors there — CQRS (almost done), Saga (in progress), and Raft (to be ported from ergo v2).
    This way, the StateMachine would get its own namespace ergo.services/actor/statemachine, and naming would be cleaner — no need to prefix everything with State, Machine, etc.

  • You're using goroutines and context.Context, but as far as I understand the logic, this isn’t necessary.
    You could just use gen.Process.SendAfter() instead.

  • Also, keep in mind that calling gen.Process methods from external goroutines is generally not supported.
    Most of those methods will return gen.ErrNotAllowed if called outside the actor's goroutine.

  • As I mentioned earlier, generics have a noticeable impact on performance — but that impact is minor compared to using reflection to invoke methods on objects.
    Reflective method calls are extremely costly and should be avoided in any performance-sensitive path.

Overall, it's a solid implementation. We just need to think a bit more about how to avoid relying on reflection for method dispatch

@halturin
Copy link
Collaborator

halturin commented May 5, 2025

In the best scenario, using generic costs you around 30% of performance https://planetscale.com/blog/generics-can-make-your-go-code-slower

@JeroenSoeters
Copy link
Contributor Author

Thanks Taras!

* The use of the `act` namespace feels a bit speculative. You've introduced `Action`, `MessageTimeout`, and `Option` types that, by name, don’t clearly relate to a StateMachine.
  It might make more sense to move this actor into a separate library, such as `ergo.services/actor` (currently private).
  I’m planning to publish predefined actors there — CQRS (almost done), Saga (in progress), and Raft (to be ported from ergo v2).
  This way, the StateMachine would get its own namespace `ergo.services/actor/statemachine`, and naming would be cleaner — no need to prefix everything with `State`, `Machine`, etc.

Agree on moving this over. Could you give me access to this repo?

* You're using goroutines and `context.Context`, but as far as I understand the logic, this isn’t necessary.
  You could just use `gen.Process.SendAfter()` instead.

I initially wanted to do this, but if I use gen.Process.SendAfter() I wouldn't be able to cancel the timeout. For a generic timeout, this would be fine, but the state timeout gets canceled in the SetCurrentState function, where I use the context to see if there is a timeout active, and the cancelFunc to cancel that timeout if there is one. Similarly, the message timeout gets canceled in the ProcessRun method.

* Also, keep in mind that calling `gen.Process` methods from external goroutines is generally not supported.
  Most of those methods will return `gen.ErrNotAllowed` if called outside the actor's goroutine.

I believe the only place where this is happening is in the timeouts. It is working, at least, the tests are passing? Are there any concerns with the current implementation.

* As I mentioned earlier, generics have a noticeable impact on performance — but that impact is minor compared to using reflection to invoke methods on objects.
  Reflective method calls are extremely costly and should be avoided in any performance-sensitive path.

Overall, it's a solid implementation. We just need to think a bit more about how to avoid relying on reflection for method dispatch

Agreed that the reflection code is suboptimal. By far the easiest way around this is to use the alternative approach I suggested in the comments. I think by keeping the functional options pattern, we can still have decent type safety (the same level of type safety as users get throughout the rest of the framework). By not making StateMessageHandler and StateCallHandler generic over their message and return types, we wouldn't need any reflection anymore. There would be a single callback function per state, and the user would have to do the type assertions for the various messages they expect in that state in the handler. Alternatively, we could see if there are unsafe ways around this, but I need to think a bit more about this. Honestly, I feel introducing this pattern of dispatching by message type (and result type) should be considered for all actors instead of just one. It is not directly related to the StateMachine. Instead, it is about having to do type assertions in general in every handler. So maybe it is best we split this out in a separate RFC, and settle for a "one handler per state" approach.

What do you think?

@JeroenSoeters
Copy link
Contributor Author

Hi @halturin, friendly nudge here to check what direction you want me to take with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants