Skip to content

Commit d7d371d

Browse files
committed
Add some notes about this
1 parent db9ecc1 commit d7d371d

2 files changed

Lines changed: 122 additions & 23 deletions

File tree

ARCHITECTURE.md

Lines changed: 71 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -3,48 +3,91 @@ Core SDK Architecture
33

44
## High level description
55

6-
The below diagram depicts how future SDKs are split into two parts. The `sdk-core` common code, which is written in Rust, and a `sdk-lang` package specific to the language the user is writing their workflow/activity in. For example a user writing workflows in Rust would be pulling in (at least) two crates - `temporal-sdk-core` and `temporal-sdk-rust`.
6+
The below diagram depicts how future SDKs are split into two parts. The `sdk-core` common code,
7+
which is written in Rust, and a `sdk-lang` package specific to the language the user is writing
8+
their workflow/activity in. For example a user writing workflows in Rust would be pulling in (at
9+
least) two crates - `temporal-sdk-core` and `temporal-sdk-rust`.
710

811
![Arch Diagram](https://lucid.app/publicSegments/view/7872bb33-d2b9-4b90-8aa1-bac111136aa5/image.png)
912

10-
The `core` communicates with the Temporal service in the same way that existing SDKs today do, via gRPC. It's responsible for polling for tasks, processing those tasks according to our state machine logic, and then driving the language-specific code and shuttling events to it and commands back.
13+
The `core` communicates with the Temporal service in the same way that existing SDKs today do, via
14+
gRPC. It's responsible for polling for tasks, processing those tasks according to our state machine
15+
logic, and then driving the language-specific code and shuttling events to it and commands back.
1116

12-
The `sdk-lang` side communicates with `sdk-core` via either C bindings, IPC, or (later) bindings to a WASM interface. Many languages already have nice support for calling into Rust code - generally speaking these implementations are using C bindings under the hood. For example, we use [neon](https://neon-bindings.com/) to support the TS/JS sdk, and we will likely use [PyO3](https://github.com/PyO3/pyo3) for Python. It is expected that such usages will layer another crate on top of `core` which brings in these language specific libraries to expose `core` to that language in an ergonomic manner. IPC will exist as a thin layer on top of the C bindings. Care should be taken here to avoid unnecessary copying and [de]serialization. Then `sdk-lang` is responsible for dispatching tasks to the appropriate user code (to whatever extent parts of this can be reasonably put in the core code, we desire that to make lang-specific SDKs as small as possible).
17+
The `sdk-lang` side communicates with `sdk-core` via either C bindings, IPC, or (later) bindings to
18+
a WASM interface. Many languages already have nice support for calling into Rust code - generally
19+
speaking these implementations are using C bindings under the hood. For example, we
20+
use [neon](https://neon-bindings.com/) to support the TS/JS sdk, and we will likely
21+
use [PyO3](https://github.com/PyO3/pyo3) for Python. It is expected that such usages will layer
22+
another crate on top of `core` which brings in these language specific libraries to expose `core` to
23+
that language in an ergonomic manner. IPC will exist as a thin layer on top of the C bindings. Care
24+
should be taken here to avoid unnecessary copying and [de]serialization. Then `sdk-lang` is
25+
responsible for dispatching tasks to the appropriate user code (to whatever extent parts of this can
26+
be reasonably put in the core code, we desire that to make lang-specific SDKs as small as possible).
1327

14-
As a general note, the more we can push from `sdk-lang` into `sdk-core`, the easier our ecosystem is to maintain in the long run as we will have less semantically identical code.
28+
As a general note, the more we can push from `sdk-lang` into `sdk-core`, the easier our ecosystem is
29+
to maintain in the long run as we will have less semantically identical code.
1530

1631
### Glossary of terms
1732

18-
There are many concepts involved in the chain of communication from server->core->lang that all roughly mean "Do something". Unfortunately this leads to quite a lot of overloaded terminology in the code. This list should help to disambiguate:
19-
20-
* `HistoryEvent` (often referred to simply as an `Event`): These events come from the server and represent the history of the workflow. They are defined in the protobuf definitions for the Temporal service itself.
21-
* `Command`: These are the commands defined in the temporal service protobufs that are returned by clients upon completing a `WorkflowTask`. For example, starting a timer or an activity.
22-
* `WorkflowTask`: These are how the server represents the need to run user workflow code (or the result of it executing). See the `HistoryEvent` proto definition for more.
23-
* `WorkflowActivation`: These are produced by the Core SDK when the lang sdk needs to "activate" the user's workflow code, either running it from the beginning or resuming a cached workflow.
24-
* `WorkflowActivationJob` (shorthand: `Job`s): These are included in `WorkflowActivation`s and represent the actual things that have happened since the last time the workflow was activated (if ever). EX: Firing a timer, proceeding with the result of an activity, etc. They are typically derived from `HistoryEvent`s, but also include things like evicting a run from the cache.
25-
* `WorkflowActivationCompletion`: Provided by the lang side when completing an activation. The (successful) completion contains one or more `WorkflowCommand`s, which are often translated into `Command`s as defined in the Temporal service protobufs, but also include things like query responses.
33+
There are many concepts involved in the chain of communication from server->core->lang that all
34+
roughly mean "Do something". Unfortunately this leads to quite a lot of overloaded terminology in
35+
the code. This list should help to disambiguate:
36+
37+
* `HistoryEvent` (often referred to simply as an `Event`): These events come from the server and
38+
represent the history of the workflow. They are defined in the protobuf definitions for the
39+
Temporal service itself.
40+
* `Command`: These are the commands defined in the temporal service protobufs that are returned by
41+
clients upon completing a `WorkflowTask`. For example, starting a timer or an activity.
42+
* `WorkflowTask`: These are how the server represents the need to run user workflow code (or the
43+
result of it executing). See the `HistoryEvent` proto definition for more.
44+
* `WorkflowActivation`: These are produced by the Core SDK when the lang sdk needs to "activate" the
45+
user's workflow code, either running it from the beginning or resuming a cached workflow.
46+
* `WorkflowActivationJob` (shorthand: `Job`s): These are included in `WorkflowActivation`s and
47+
represent the actual things that have happened since the last time the workflow was activated (if
48+
ever). EX: Firing a timer, proceeding with the result of an activity, etc. They are typically
49+
derived from `HistoryEvent`s, but also include things like evicting a run from the cache.
50+
* `WorkflowActivationCompletion`: Provided by the lang side when completing an activation. The (
51+
successful) completion contains one or more `WorkflowCommand`s, which are often translated into
52+
`Command`s as defined in the Temporal service protobufs, but also include things like query
53+
responses.
2654

2755
Additional clarifications that are internal to Core:
28-
* `StateMachine`s also handle events and produce commands, which often map directly to the above `HistoryEvent`s and `Command`s, but are distinct types. The state machine library is Temporal agnostic - but all interactions with the machines pass through a `TemporalStateMachine` trait, which accepts `HistoryEvent`s, and produces `WorkflowTrigger`s.
29-
* `WorkflowTrigger`: These allow the state machines to trigger things to happen to the workflow. Including pushing new `WfActivationJob`s, or otherwise advancing workflow state.
3056

57+
* `StateMachine`s also handle events and produce commands, which often map directly to the above
58+
`HistoryEvent`s and `Command`s, but are distinct types. The state machine library is Temporal
59+
agnostic - but all interactions with the machines pass through a `TemporalStateMachine` trait,
60+
which accepts `HistoryEvent`s, and produces `WorkflowTrigger`s.
61+
* `WorkflowTrigger`: These allow the state machines to trigger things to happen to the workflow.
62+
Including pushing new `WfActivationJob`s, or otherwise advancing workflow state.
3163

3264
### Core SDK Responsibilities
3365

34-
- Communication with Temporal service using a generated gRPC client, which is wrapped with somewhat more ergonomic traits.
35-
- Provide interface for language-specific SDK to drive event loop and handle returned commands. The lang sdk will continuously call/poll the core SDK to receive new tasks, which either represent workflows being started or awoken (`WorkflowActivation`) or activities to execute (`ActivityTask`). It will then call its workflow/activity functions with the provided information as appropriate, and will then push completed tasks back into the core SDK.
36-
- Advance state machines and report back to the temporal server as appropriate when handling events and commands
66+
- Communication with Temporal service using a generated gRPC client, which is wrapped with somewhat
67+
more ergonomic traits.
68+
- Provide interface for language-specific SDK to drive event loop and handle returned commands. The
69+
lang sdk will continuously call/poll the core SDK to receive new tasks, which either represent
70+
workflows being started or awoken (`WorkflowActivation`) or activities to execute (
71+
`ActivityTask`). It will then call its workflow/activity functions with the provided information
72+
as appropriate, and will then push completed tasks back into the core SDK.
73+
- Advance state machines and report back to the temporal server as appropriate when handling events
74+
and commands
3775

3876
### Language Specific SDK Responsibilities
3977

4078
- Periodically poll Core SDK for tasks
41-
- Call workflow and activity functions as appropriate, using information in events it received from Core SDK
79+
- Call workflow and activity functions as appropriate, using information in events it received from
80+
Core SDK
4281
- Return results of workflows/activities to Core SDK
43-
- Manage concurrency using language appropriate primitives. For example, it is up to the language side to decide how frequently to poll, and whether or not to execute worklows and activities in separate threads or coroutines, etc.
82+
- Manage concurrency using language appropriate primitives. For example, it is up to the language
83+
side to decide how frequently to poll, and whether or not to execute worklows and activities in
84+
separate threads or coroutines, etc.
4485

4586
### Example Sequence Diagrams
4687

47-
Here we consider what the sequence of API calls would look like for a simple workflow executing a happy path. The hello-world workflow & activity in imaginary Rust (don't pay too much attention to the specifics, just an example) is below. It is meant to be like our most basic hello world samples.
88+
Here we consider what the sequence of API calls would look like for a simple workflow executing a
89+
happy path. The hello-world workflow & activity in imaginary Rust (don't pay too much attention to
90+
the specifics, just an example) is below. It is meant to be like our most basic hello world samples.
4891

4992
```rust
5093
#[workflow]
@@ -64,13 +107,18 @@ async fn hello_activity(name: &str) -> String {
64107

65108
## API Definition
66109

67-
We define the interface between the core and lang SDKs in terms of gRPC service definitions. The actual implementations of this "service" are not generated by gRPC generators, but the messages themselves are, and make it easier to hit the ground running in new languages.
68-
69-
See the latest API definition [here](https://github.com/temporalio/sdk-core/tree/master/sdk-core-protos/protos/local/temporal/sdk/core)
110+
We define the interface between the core and lang SDKs (mostly) in terms of gRPC service
111+
definitions. The actual implementations of this "service" are not generated by gRPC generators, but
112+
the messages themselves are, and make it easier to hit the ground running in new languages.
70113

114+
See the latest API
115+
definition [here](https://github.com/temporalio/sdk-core/tree/master/sdk-core-protos/protos/local/temporal/sdk/core)
71116

72117
## Other topics
118+
73119
- [Sticky task queues](arch_docs/sticky_queues.md)
120+
- [Workflow task chunking](arch_docs/workflow_task_chunking.md)
74121

75122
## Workflow Processing Internals
123+
76124
![Workflow Internals](arch_docs/diagrams/workflow_internals.svg)
Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
# Workflow Task Chunking
2+
3+
One source of complexity in Core is the chunking of history into "logical" Workflow Tasks.
4+
5+
Workflow tasks (WFTs) always take the following form in event history:
6+
7+
* \[Preceding Events\] (optional)
8+
* WFT Scheduled
9+
* WFT Started
10+
* WFT Completed
11+
* \[Commands\] (optional)
12+
13+
In the typical case, the "logical" WFT consists of all the commands from the last workflow task,
14+
any events generated in the interrim, and the scheduled/started preamble. So:
15+
16+
* WFT Completed
17+
* \[Commands\] (optional)
18+
* \[Events\] (optional)
19+
* WFT Scheduled
20+
* WFT Started
21+
22+
Commands and events are both "optional" in the sense that:
23+
24+
Workflow code, after being woken up, might not do anything, and thus generate no new commands
25+
26+
There may be no events for more nuanced reasons:
27+
28+
1. The workflow might have been running a long-running local activity. In such cases, the workflow
29+
must "workflow task heartbeat" in order to avoid timing out the workflow task. This means
30+
completing the WFT with no commands while the LA is ongoing.
31+
2. The workflow might have received an update, which does not come as an event in history, but
32+
rather as a "protocol message" attached to the task.
33+
3. Server can forcibly generate a new WFT with some obscure APIs
34+
35+
Core does not consider such empty WFT sequences as worthy of waking lang (on replay - as a new
36+
task, they always will), since nothing meaningful has happened. Thus, they are grouped together
37+
as part of a "logical" WFT with the last WFT that had any real work in it.
38+
39+
## Possible issues as of this writing (5/25)
40+
41+
The "new WFT force-issued by server" case would, currently, not cause a wakeup on replay for the
42+
reasons discussed above. In some obscure edge cases (inspecting workflow clock) this could cause
43+
NDE.
44+
45+
### Possible solutions
46+
47+
* Core can attach a flag on WFT completes in order to be explicit that that WFT may be skipped on
48+
replay. IE: During WFT heartbeating for LAs.
49+
* We could legislate that server should never send empty WFTs. Seemingly the only case of this
50+
is
51+
the [obscure api](https://github.com/temporalio/temporal/blob/d189737aa2ed1b07c221abb9fbdd28ecf68f0492/proto/internal/temporal/server/api/adminservice/v1/service.proto#L151)

0 commit comments

Comments
 (0)