
docs: added orchestration feature request #857


Open · wants to merge 1 commit into base: main

Conversation

Contributor

@qor-lb qor-lb commented Apr 5, 2025

closes: #273


github-actions bot commented Apr 5, 2025

License Check Results

🚀 The license check job ran with the Bazel command:

bazel run //:license-check

Status: ⚠️ Needs Review

[License Check Output]
2025/04/07 19:19:08 Downloading https://releases.bazel.build/7.4.0/release/bazel-7.4.0-linux-x86_64...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
Computing main repo mapping:
Loading: 0 packages loaded
Analyzing: target //:license-check (1 packages loaded, 0 targets configured)
Analyzing: target //:license-check (149 packages loaded, 4611 targets configured)
INFO: Analyzed target //:license-check (150 packages loaded, 4737 targets configured).
[11 / 13] [Prepa] JavaToolchainCompileBootClasspath external/rules_java~/toolchains/platformclasspath.jar
[12 / 13] [Prepa] Building license.check.license_check.jar ()
INFO: Found 1 target...
Target //:license.check.license_check up-to-date:
  bazel-bin/license.check.license_check
  bazel-bin/license.check.license_check.jar
INFO: Elapsed time: 23.342s, Critical Path: 2.85s
INFO: 13 processes: 9 internal, 3 processwrapper-sandbox, 1 worker.
INFO: Build completed successfully, 13 total actions
INFO: Running command line: bazel-bin/license.check.license_check ./formatted.txt -review -project automotive.score -repo https://github.com/eclipse-score/score -token otyhZ4eaRYK1tKLNNF-Y
[main] INFO Querying Eclipse Foundation for license data for 76 items.
[main] INFO Found 52 items.
[main] INFO Querying ClearlyDefined for license data for 25 items.
[main] INFO Found 25 items.
[main] INFO License information could not be automatically verified for the following content:
[main] INFO 
[main] INFO pypi/pypi/-/docutils/0.21.2
[main] INFO 
[main] INFO This content is either not correctly mapped by the system, or requires review.
[main] INFO A review is required for pypi/pypi/-/docutils/0.21.2.
[main] INFO A review request already exists https://gitlab.eclipse.org/eclipsefdn/emo-team/iplab/-/issues/19880 .


github-actions bot commented Apr 5, 2025

The created documentation from the pull request is available at: docu-html

@qor-lb force-pushed the lb_runtime_feature_request branch 2 times, most recently from 5e0ad98 to e4bccd9 on April 7, 2025 at 19:18

Abstract
========

Contributor

A few basic questions which I did not find completely answered:

  1. Does the feature intend to provide a mechanism to globally manage ("orchestrate") the compute resources within an ECU as a whole?
  2. Or does it intend to provide means for user-level scheduling and configuration of tasks and threads within a single application potentially consisting of multiple processes? (Similar to what an async framework -- e.g. with Tokio in Rust -- does for a single process.)
  3. Or is it both?

If both, I would propose splitting the feature request into at least two separate requests: one for 1) and another for 2).

In existing platforms for microprocessors (µP), each application is expected to interact with approximately 15 system services or daemons - such as ``ara::diag`` for diagnostics and ``ara::com`` for SOME/IP communication. Under a straightforward implementation, this interaction model results in the creation of around 15 threads per application. When scaled to 100-150 applications, this amounts to roughly 1500 to 2250 threads solely managing inter-process communication, excluding those performing the core application tasks.

Given that the HPC's µP typically provides between 2 and 16 cores, only a limited number of threads can be processed in parallel. In POSIX-like operating systems, threads become the currency for concurrency, meaning that when the thread count far exceeds available cores, the system must rely on context switching. With context switching times estimated between 2µs and 4.5µs [#f1]_ [#f2]_ [#f3]_, even a 100ms time slice could spend between 3% and 10% of its duration on context switching alone - assuming each thread is scheduled once. This overhead increases dramatically if threads are forced to switch more frequently due to competition for processing time.

Contributor

If I use your worst-case figures (2250 threads, only 2 cores, 4.5 us task switching time), I end up with 5 ms task-switching time for all tasks. This is 5% of 100 ms. With your lower-end figures (1500 threads -- still a lot --, 16 cores and 2 us task switching time) I get a total of 188 us, which is ~0.2% of the 100ms cycle time.
How did you calculate the 3% ... 10% figure?
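One reading that reproduces the 3% to 10% range is to charge the summed switch time of all threads against a single 100 ms slice without dividing by the core count; dividing by cores gives the smaller figures in the comment above. A quick back-of-envelope sketch (all figures taken from this thread, not measured):

```rust
// Back-of-envelope reproduction of the figures in this thread, assuming
// each thread is context-switched exactly once per 100 ms time slice.
const CYCLE_US: f64 = 100_000.0;

// Total switch time summed over all threads, as a share of one slice.
// This appears to be how the 3%..10% range in the text was obtained.
fn total_pct(threads: f64, switch_us: f64) -> f64 {
    threads * switch_us / CYCLE_US * 100.0
}

// Per-core switch time: the same total spread across the available
// cores, matching the calculation in the review comment.
fn per_core_pct(threads: f64, cores: f64, switch_us: f64) -> f64 {
    total_pct(threads, switch_us) / cores
}

fn main() {
    println!("{:.0}%", total_pct(1500.0, 2.0));          // 3%
    println!("{:.0}%", total_pct(2250.0, 4.5));          // 10%
    println!("{:.1}%", per_core_pct(2250.0, 2.0, 4.5));  // 5.1%
    println!("{:.1}%", per_core_pct(1500.0, 16.0, 2.0)); // 0.2%
}
```

Whether the per-thread or per-core reading is the right cost model depends on whether the 100 ms slice is counted per core or system-wide.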

The broader objective of the HPC platform is to support the concurrent integration and maintenance of multiple applications, potentially sourced from different vendors. Unlike traditional microcontroller (µC) platforms that are statically configured, this platform must allow dynamic updates, upgrades, and new deployments throughout the vehicle's lifecycle. This introduces significant complexity in managing concurrency, particularly in mixed-criticality environments where applications of varying criticality levels must coexist without interference.

Currently, application developers are expected to manage runtime scheduling directly via the operating system, controlling thread lifecycles, priorities, and affinities. Such an approach ties the configuration too closely to a specific deployment scenario and often leads to discrepancies between behaviors observed during development and those on the target system. This misalignment complicates integration efforts, as integrators must repeatedly iterate with the application developer to meet the system's reliability requirements. An orchestrator that abstracts these complexities could alleviate these challenges by offering a uniform, deterministic interface for managing runtime scheduling, thus ensuring that lower-criticality applications do not adversely affect the performance or reliability of higher-criticality ones.

Contributor

@armin-acn Apr 11, 2025

Doesn't the integrator still need to figure out the minimum requirements (e.g. compute resources) of each application with the developers, which might be an iterative process if resources are very limited?

How would the integrator combine the different concurrent programs they get from different suppliers? Is it intended to merge all of them into a huge program? Or would the integrator still have to use operating system mechanisms and tools to distribute the resources across applications?

- Cooperative multi-tasking, allowing multiple concurrent tasks to use the same OS thread.
- Provision of a configurable thread pool to enable parallelism for concurrent tasks.
- Introduction of additional thread pools only when necessary, such as when tasks differ in criticality or require separation by process boundaries.
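As a rough illustration of the first bullet, a cooperative round-robin executor can interleave several tasks on a single OS thread. This is a deliberately minimal sketch with invented names; production frameworks such as Tokio build the same idea on the Future/Waker machinery instead:

```rust
use std::collections::VecDeque;

// Minimal cooperative multi-tasking sketch: a task is a state machine
// that is polled until done; switching tasks is just a function return.
enum Poll { Ready, Pending }

trait Task {
    fn name(&self) -> &'static str;
    fn poll(&mut self) -> Poll;
}

// A task that cooperatively yields `yields` times before completing.
struct Countdown { label: &'static str, yields: u32 }

impl Task for Countdown {
    fn name(&self) -> &'static str { self.label }
    fn poll(&mut self) -> Poll {
        if self.yields == 0 {
            Poll::Ready
        } else {
            self.yields -= 1; // do a slice of work, then yield
            Poll::Pending
        }
    }
}

// Round-robin executor: runs all tasks to completion on the current
// thread and returns the order in which they finished.
fn run(mut tasks: VecDeque<Box<dyn Task>>) -> Vec<&'static str> {
    let mut finished = Vec::new();
    while let Some(mut t) = tasks.pop_front() {
        match t.poll() {
            Poll::Ready => finished.push(t.name()),
            Poll::Pending => tasks.push_back(t), // let the others run first
        }
    }
    finished
}

fn main() {
    let tasks: VecDeque<Box<dyn Task>> = VecDeque::from(vec![
        Box::new(Countdown { label: "a", yields: 2 }) as Box<dyn Task>,
        Box::new(Countdown { label: "b", yields: 1 }),
    ]);
    println!("{:?}", run(tasks)); // ["b", "a"]: b yields less, finishes first
}
```

The point of the sketch is the cost model: a "task switch" here is an ordinary return plus a queue operation, which is why user-level switches can land in the nanosecond range rather than the microsecond range of an OS context switch.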

Contributor

How does this differ from what the various available async frameworks offer?

------------------------------------------------------------------------------------------------

To harness the benefits of user-space multi-tasking while still providing a user-friendly and deterministic interface for concurrent programs, this proposal advocates for a nested task-based programming framework. The choice for a nested structure over a graph-based one is driven by the need to design reliable programs and enable straightforward control flow analysis - a requirement that becomes critical during safety inspections. Although graph-based structures may have a gentler learning curve and offer rapid initial results, they often become limiting when more complex scheduling descriptions are needed, such as conditional branching, time monitoring, and error handling paths.
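To make the contrast concrete, here is a hypothetical sketch (all names invented for illustration) of what a nested program description could look like: control flow is an explicit tree of combinators, so a tool can statically walk every branch, which is what makes control-flow analysis for safety inspections straightforward:

```rust
// Hypothetical nested program description: the schedule is a tree of
// combinators rather than a free-form graph of nodes and edges.
enum Step {
    Action(&'static str),
    Sequence(Vec<Step>),
    Concurrent(Vec<Step>),
    IfElse {
        cond: &'static str,
        then_branch: Box<Step>,
        else_branch: Box<Step>,
    },
}

// Example of static control-flow analysis: count every action reachable
// on any path, by exhaustively walking the tree.
fn count_actions(s: &Step) -> usize {
    match s {
        Step::Action(_) => 1,
        Step::Sequence(v) | Step::Concurrent(v) => v.iter().map(count_actions).sum(),
        Step::IfElse { then_branch, else_branch, .. } =>
            count_actions(then_branch) + count_actions(else_branch),
    }
}

fn main() {
    let program = Step::Sequence(vec![
        Step::Action("read_sensors"),
        Step::Concurrent(vec![Step::Action("plan"), Step::Action("log")]),
        Step::IfElse {
            cond: "fault_detected",
            then_branch: Box::new(Step::Action("enter_safe_state")),
            else_branch: Box::new(Step::Action("actuate")),
        },
    ]);
    println!("{}", count_actions(&program)); // 5
}
```

A graph-based description would instead wire named nodes together by edges, which is harder to enumerate exhaustively once conditional branches, timeouts, and error paths are added.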

Contributor

@armin-acn Apr 11, 2025

A safety-certified framework for task-based asynchronous programming could be very beneficial. However, it should at least enable application development in the usual async programming model using async functions, spawning of tasks, and "await"ing of futures (as, e.g. with "Tokio" in Rust).

A "nested" programming model could be added on top for applications actually needing it.

If a nested language is part of the feature request, it should be specified in that feature request, so that the programming model can be understood on a conceptual level.

Also: The nested approach appears to me like introducing a kind of interpreted language on top of the native language. This might make it harder to verify the correctness of programs using it and thus make safety evaluations more difficult.


In response to the increasing complexity of modern centralized E/E architectures and the need to support hundreds of applications, this feature request proposes a comprehensive orchestration framework for managing concurrency in high-performance computing (HPC) systems. The motivation for this proposal is rooted in the significant performance penalties incurred by conventional thread-based approaches, where an excessive number of threads leads to costly context switching in operating systems.

The proposed solution introduces user-level scheduling through cooperative multi-tasking, allowing task switches to occur in the nanosecond range instead of microseconds. By treating tasks as the fundamental unit of concurrency and enabling multiple tasks to share the same OS thread, the framework significantly reduces overhead and simplifies resource allocation.
Contributor

There are existing frameworks for user-level scheduling based on cooperative multitasking in many languages including Rust and C++. As you know, Rust even provides language elements for it. Unfortunately, the frameworks I know of (e.g. "Tokio" for Rust) are not safety certified.

Having a (safety-certified) framework for task-based asynchronous programming could be very beneficial for the reasons you are pointing out below. I think, such a framework should enable application development in the common async programming model using async functions, spawning of tasks, and "await"-ing of futures (as, e.g. with "Tokio" in Rust).

According to my understanding, the functionality requested in this feature request more or less needs such a framework (at least a minimal one) as a basis. Thus, it could make sense to create a separate feature request for a safety-certified async framework and then base the remainder of the present feature request on top of that.




I think we had a broad agreement in score that, if possible, the native means of the programming language or programming environment should be used. In the case of user-space scheduling and Rust, Rust async and the corresponding well-known and widespread API already solve this problem. Of course, this needs a safe async runtime, and working on that would be a very useful thing to do.

:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** implement user-level scheduling for task management so that task switches occur in the nanosecond range.
Contributor

As mentioned above, this could be achieved by providing an async framework comparable to "Tokio" in Rust, i.e. supporting the usual async programming model using async functions, spawning of tasks, and "await"ing of futures. There are similar frameworks for C++.

:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** support cooperative multi-tasking, allowing multiple concurrent tasks to share the same OS thread.
Contributor

See above: This could be achieved by providing an async framework similar to the known ones.

:status: invalid

The system **SHALL** provide a configurable thread pool for executing concurrent tasks. Additional thread pools **MAY** be introduced only when necessary (e.g., when tasks differ in criticality or require separation by process boundaries).
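A minimal sketch of what such a configurable pool could look like, using only the Rust standard library (the API names are hypothetical; the channel-plus-workers structure follows the classic thread-pool pattern). The pool size is a deployment-time parameter, not something individual tasks decide:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

// A fixed-size pool of worker threads draining a shared job queue.
struct ThreadPool {
    tx: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl ThreadPool {
    fn new(size: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..size)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // Lock only long enough to take one job off the queue.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // channel closed: shut down
                    }
                })
            })
            .collect();
        ThreadPool { tx: Some(tx), workers }
    }

    fn submit(&self, job: Job) {
        self.tx.as_ref().unwrap().send(job).unwrap();
    }

    fn shutdown(mut self) {
        drop(self.tx.take()); // close the channel so workers exit
        for w in self.workers {
            w.join().unwrap();
        }
    }
}

fn main() {
    let pool = ThreadPool::new(4); // pool size chosen by the integrator
    let counter = Arc::new(AtomicUsize::new(0));
    for _ in 0..100 {
        let c = Arc::clone(&counter);
        pool.submit(Box::new(move || {
            c.fetch_add(1, Ordering::SeqCst);
        }));
    }
    pool.shutdown();
    println!("{}", counter.load(Ordering::SeqCst)); // 100
}
```

Additional pools for differing criticality would be further `ThreadPool` instances with their own OS-level priority and affinity settings.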

Contributor

See above: This could be achieved by providing an async framework similar to the known ones.

:status: invalid

The programming framework **SHALL** allow developers to express concurrent and sequential dependencies, conditional branching, timing constraints, and error handling paths while abstracting explicit thread management and complex synchronization.

Contributor

See above: This could be achieved by providing an async framework similar to the known ones. There is no need for a "nested" language to achieve this.

:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** decouple algorithm design from deployment specifics, allowing dynamic updates, upgrades, and new deployments.
Contributor

Could you describe in more detail, how the intended orchestrator approach can solve this without closely coupling the applications into a huge meta-program?

:satisfies: stkh_req__execution_model__processes, stkh_req__dev_experience__tracing_of_exec
:status: invalid

The system **SHALL** provide hooks for tracing and profiling task execution to verify behavior and control flow of the integrated system.
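One possible shape for such hooks, sketched with hypothetical names: the executor invokes user-supplied callbacks around each task it runs, so external tooling can reconstruct the executed control flow of the integrated system:

```rust
// Hypothetical sketch: the executor fires user-registered hooks around
// every task, producing an event stream a tracer can consume.
#[derive(Default)]
struct Tracer {
    events: Vec<String>,
}

impl Tracer {
    fn on_task_start(&mut self, task: &str) {
        self.events.push(format!("start:{task}"));
    }
    fn on_task_end(&mut self, task: &str) {
        self.events.push(format!("end:{task}"));
    }
}

// Stand-in for the executor's main loop: run each task body with the
// hooks fired before and after it.
fn run_traced(tasks: &[(&str, fn())], tracer: &mut Tracer) {
    for (name, body) in tasks {
        tracer.on_task_start(name);
        body(); // the task's actual work
        tracer.on_task_end(name);
    }
}

fn main() {
    let mut tracer = Tracer::default();
    let tasks: [(&str, fn()); 2] = [("sense", || {}), ("plan", || {})];
    run_traced(&tasks, &mut tracer);
    println!("{:?}", tracer.events);
    // ["start:sense", "end:sense", "start:plan", "end:plan"]
}
```

In a real system the hook sink would presumably feed the platform's tracing framework rather than an in-memory vector; the point is only that the hook points live inside the executor, where the actual task interleaving is known.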
Contributor

@armin-acn Apr 14, 2025

Don't the tracing and logging frameworks already provide means for this? If not, could you please specify in more detail the features the orchestration framework needs to add?

:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__security_features
:status: invalid

The orchestration feature **SHALL** assume that all code executing within a process is trusted.
Contributor

I think this is not a requirement to the orchestrator but rather a requirement on how developers shall partition their applications (i.e., independent of the existence of the orchestration feature).

:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__automotive_safety
:status: invalid

All tasks within a single process **SHALL** share the same ASIL level.
Contributor

I think this is not a requirement to the orchestrator but rather a requirement on how developers shall partition their applications (i.e., independent of the existence of the orchestration feature).

:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__automotive_safety
:status: invalid

The system **SHALL** implement priority-based preemption between thread pools to ensure that lower-priority programs cannot interfere with higher-priority programs.
Contributor

Is it intended to provide means for that in the orchestration feature? Or is there a separate mechanism for that (e.g. based on OS tools)?


Concurrent programming in our target environment spans multiple scopes. At one level, concurrency exists within the algorithms of individual applications or system services, while at another level, multiple applications must execute concurrently across the platform. The challenge is to offer an interface that is not only simple and expressive but also deterministic and reliable - an important requirement in safety-critical systems.

Traditional thread-based concurrency in POSIX-like environments introduces complexities such as deadlocks, livelocks, and starvation. These issues, coupled with the inherent difficulties in debugging and validating thread-based systems, can compromise both performance and reliability. [#f4]_ [#f5]_ [#f6]_ Moreover, current designs often separate the management of timing requirements, monitoring, and error handling from the control flow. Integrating these aspects closer to the application logic would promote higher cohesion and lower coupling, enabling more effective debugging and validation, particularly when addressing application-specific scenarios.


Can you explain how the orchestrator helps to avoid deadlocks, livelocks, and starvation? In particular, also in the case where priority-based OS scheduling is used as described below.




Can you explain how the Orchestrator helps to integrate multiple applications and other services? For example, what does an integrator have to do when adding another orchestrator-enabled application to an existing system? I assume such an application comes with a description according to the "nested task-based programming framework". How will this new description be merged into the existing descriptions? How can it be ensured that the existing applications will still meet their timing requirements?




Besides the potential upsides of an algorithm-independent description of scheduling requirements, there is the downside that the application logic is split into two parts: the part in the algorithm and the part in the scheduling description. I think this does not improve user friendliness from the perspective of an application developer. Therefore, the pros and cons have to be weighed carefully against each other.

- Free from complex synchronization mechanisms.
- Capable of expressing both concurrent and sequential dependencies.
- Capable of expressing conditional branching within the program.
- Capable of expressing timing constraints and error handling paths directly within the program.


Could you please explain what timing constraints you envision and what exactly the error handling paths would do? Is the error handling the reaction to a failed timing constraint check or is this error handling something more general? How does user code interact with the error handling?

Development

Successfully merging this pull request may close these issues.

Feature Request Orchestration
3 participants