docs: added orchestration feature request #857
Abstract
========
A few basic questions which I did not find completely answered:
1. Does the feature intend to provide a mechanism to globally manage ("orchestrate") the compute resources within an ECU as a whole?
2. Or does it intend to provide means for user-level scheduling and configuration of tasks and threads within a single application potentially consisting of multiple processes? (Similar to what an async framework - e.g. Tokio in Rust - does for a single process.)
3. Or is it both?

If both, I would propose to split the feature request into at least two different requests, namely one for 1) and another one for 2).
In existing platforms for microprocessors (µP), each application is expected to interact with approximately 15 system services or daemons - such as ``ara::diag`` for diagnostics and ``ara::com`` for SOME/IP communication. Under a straightforward implementation, this interaction model results in the creation of around 15 threads per application. When scaled to 100-150 applications, this amounts to roughly 1500 to 2250 threads solely managing inter-process communication, excluding those performing the core application tasks.

Given that the HPC's µP typically provides between 2 and 16 cores, only a limited number of threads can be processed in parallel. In POSIX-like operating systems, threads become the currency for concurrency, meaning that when the thread count far exceeds available cores, the system must rely on context switching. With context switching times estimated between 2 µs and 4.5 µs [#f1]_ [#f2]_ [#f3]_, even a 100 ms time slice could spend between 3% and 10% of its duration on context switching alone - assuming each thread is scheduled once. This overhead increases dramatically if threads are forced to switch more frequently due to competition for processing time.
If I use your worst-case figures (2250 threads, only 2 cores, 4.5 µs task-switching time), I end up with about 5 ms of task-switching time for all tasks. This is 5% of 100 ms. With your lower-end figures (1500 threads - still a lot - 16 cores, and 2 µs task-switching time) I get a total of 188 µs, which is ~0.2% of the 100 ms cycle time.
How did you calculate the 3% ... 10% figure?
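For reference, both sets of figures can be reproduced with a few lines of arithmetic. This is a sketch, assuming each thread is scheduled exactly once per cycle and the switching work is spread evenly across all cores:

```rust
// Context-switch overhead per cycle, as a percentage of the cycle time.
// Assumptions: every thread is scheduled exactly once per cycle, and the
// switching work is distributed evenly across all cores.
fn switch_overhead_percent(threads: u32, cores: u32, switch_us: f64, cycle_us: f64) -> f64 {
    let per_core_us = threads as f64 * switch_us / cores as f64;
    100.0 * per_core_us / cycle_us
}

fn main() {
    // Worst case: 2250 threads, 2 cores, 4.5 µs per switch.
    let worst = switch_overhead_percent(2250, 2, 4.5, 100_000.0);
    // Lower end: 1500 threads, 16 cores, 2 µs per switch.
    let best = switch_overhead_percent(1500, 16, 2.0, 100_000.0);
    println!("worst ~ {worst:.2}%, best ~ {best:.2}%"); // ~5.06% and ~0.19%
}
```

With these assumptions, the worst case comes out near 5% and the best case near 0.2%, matching the reviewer's numbers rather than the 3%-10% range in the text.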
The broader objective of the HPC platform is to support the concurrent integration and maintenance of multiple applications, potentially sourced from different vendors. Unlike traditional microcontroller (µC) platforms that are statically configured, this platform must allow dynamic updates, upgrades, and new deployments throughout the vehicle's lifecycle. This introduces significant complexity in managing concurrency, particularly in mixed-criticality environments where applications of varying criticality levels must coexist without interference.

Currently, application developers are expected to manage runtime scheduling directly via the operating system, controlling thread lifecycles, priorities, and affinities. Such an approach ties the configuration too closely to a specific deployment scenario and often leads to discrepancies between behaviors observed during development and those on the target system. This misalignment complicates integration efforts, as integrators must repeatedly iterate with the application developer to meet the system's reliability requirements. An orchestrator that abstracts these complexities could alleviate these challenges by offering a uniform, deterministic interface for managing runtime scheduling, thus ensuring that lower-criticality applications do not adversely affect the performance or reliability of higher-criticality ones.
Doesn't the integrator still need to figure out the minimum requirements (e.g. compute resources) of each application with the developers, which might be an iterative process if resources are very limited?
How would the integrator combine the different concurrent programs they get from different suppliers? Is it intended to merge all of them into a huge program? Or would the integrator still have to use operating system mechanisms and tools to distribute the resources across applications?
- Cooperative multi-tasking, allowing multiple concurrent tasks to use the same OS thread.
- Provision of a configurable thread pool to enable parallelism for concurrent tasks.
- Introduction of additional thread pools only when necessary, such as when tasks differ in criticality or require separation by process boundaries.
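The first bullet can be sketched with a minimal, dependency-free cooperative executor. This is illustrative only and not part of the proposal: many tasks share one OS thread, and a "task switch" is an ordinary function return from ``poll`` rather than a kernel context switch (a production runtime such as Tokio adds real wakers, I/O, and thread pools on top of this idea):

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing: this toy executor polls in a loop anyway.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// A future that suspends once before completing: a cooperative yield point.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending // hand control back to the executor
        }
    }
}

fn main() {
    let finished = Rc::new(Cell::new(0u32));
    let mut tasks: Vec<Pin<Box<dyn Future<Output = ()>>>> = Vec::new();
    for _ in 0..100 {
        let finished = Rc::clone(&finished);
        tasks.push(Box::pin(async move {
            YieldOnce(false).await; // task switch in user space, no syscall
            finished.set(finished.get() + 1);
        }));
    }
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Round-robin over all tasks on this single OS thread until done.
    while !tasks.is_empty() {
        tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
    }
    assert_eq!(finished.get(), 100);
    println!("{} cooperative tasks completed on one OS thread", finished.get());
}
```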
How does this differ from what the various available async frameworks offer?
------------------------------------------------------------------------------------------------
To harness the benefits of user-space multi-tasking while still providing a user-friendly and deterministic interface for concurrent programs, this proposal advocates for a nested task-based programming framework. The choice for a nested structure over a graph-based one is driven by the need to design reliable programs and enable straightforward control flow analysis - a requirement that becomes critical during safety inspections. Although graph-based structures may have a gentler learning curve and offer rapid initial results, they often become limiting when more complex scheduling descriptions are needed, such as conditional branching, time monitoring, and error handling paths.
A safety-certified framework for task-based asynchronous programming could be very beneficial. However, it should at least enable application development in the usual async programming model using async functions, spawning of tasks, and "await"ing of futures (as, e.g. with "Tokio" in Rust).
A "nested" programming model could be added on top for applications actually needing it.
If a nested language is part of the feature request, it should be specified in that feature request, so that the programming model can be understood on a conceptual level.
Also: the nested approach appears to me like introducing a kind of interpreted language on top of the native language. This might make it harder to verify the correctness of programs using it and thus make safety evaluations more difficult.
In response to the increasing complexity of modern centralized E/E architectures and the need to support hundreds of applications, this feature request proposes a comprehensive orchestration framework for managing concurrency in high-performance computing (HPC) systems. The motivation for this proposal is rooted in the significant performance penalties incurred by conventional thread-based approaches, where an excessive number of threads leads to costly context switching in operating systems.

The proposed solution introduces user-level scheduling through cooperative multi-tasking, allowing task switches to occur in the nanosecond range instead of microseconds. By treating tasks as the fundamental unit of concurrency and enabling multiple tasks to share the same OS thread, the framework significantly reduces overhead and simplifies resource allocation.
There are existing frameworks for user-level scheduling based on cooperative multitasking in many languages including Rust and C++. As you know, Rust even provides language elements for it. Unfortunately, the frameworks I know of (e.g. "Tokio" for Rust) are not safety certified.
Having a (safety-certified) framework for task-based asynchronous programming could be very beneficial for the reasons you are pointing out below. I think, such a framework should enable application development in the common async programming model using async functions, spawning of tasks, and "await"-ing of futures (as, e.g. with "Tokio" in Rust).
According to my understanding, the functionality requested in this feature request more or less needs such a framework (at least a minimal one) as a basis. Thus, it could make sense to create a separate feature request for a safety-certified async framework and then base the remainder of the present feature request on top of that.
I think we had a broad agreement in Score that, if possible, the native means of the programming language or programming environment should be used. In the case of user-space scheduling and Rust, Rust async and the corresponding well-known and widespread API already solve this problem. Of course, this needs a safe async runtime, and working on that would be a very useful thing to do.
:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** implement user-level scheduling for task management so that task switches occur in the nanosecond range.
As mentioned above, this could be achieved by providing an async framework comparable to "Tokio" in Rust, i.e. supporting the usual async programming model using async functions, spawning of tasks, and "await"ing of futures. There are similar frameworks for C++.
:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** support cooperative multi-tasking, allowing multiple concurrent tasks to share the same OS thread.
See above: This could be achieved by providing an async framework similar to the known ones.
:status: invalid

The system **SHALL** provide a configurable thread pool for executing concurrent tasks. Additional thread pools **MAY** be introduced only when necessary (e.g., when tasks differ in criticality or require separation by process boundaries).
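As one possible reading of "configurable thread pool", here is a minimal standard-library sketch (not the proposed API; the names and the fixed task set are purely illustrative): a fixed number of OS threads drain a shared queue of tasks, so the pool size is a deployment knob independent of the number of tasks.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Task = Box<dyn FnOnce() -> u64 + Send>;

// Minimal fixed-size worker pool: `workers` OS threads drain a shared
// queue of tasks and send their results back over a channel.
fn run_pool(workers: usize, tasks: Vec<Task>) -> u64 {
    let (task_tx, task_rx) = mpsc::channel::<Task>();
    let task_rx = Arc::new(Mutex::new(task_rx));
    let (res_tx, res_rx) = mpsc::channel::<u64>();

    for t in tasks {
        task_tx.send(t).expect("queue closed");
    }
    drop(task_tx); // close the queue so workers stop once it is drained

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&task_rx);
            let tx = res_tx.clone();
            thread::spawn(move || loop {
                // Take the next task; Err means the queue is closed and empty.
                let task = rx.lock().unwrap().recv();
                match task {
                    Ok(t) => tx.send(t()).unwrap(),
                    Err(_) => break,
                }
            })
        })
        .collect();
    drop(res_tx); // so the result iterator below terminates

    let total: u64 = res_rx.iter().sum();
    for h in handles {
        h.join().unwrap();
    }
    total
}

fn main() {
    let tasks: Vec<Task> = (1..=10u64).map(|i| Box::new(move || i) as Task).collect();
    // Ten tasks serviced by four OS threads; changing the `4` re-deploys
    // the same program onto a different amount of parallelism.
    let total = run_pool(4, tasks);
    assert_eq!(total, 55); // 1 + 2 + ... + 10
    println!("total = {total}");
}
```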
See above: This could be achieved by providing an async framework similar to the known ones.
:status: invalid

The programming framework **SHALL** allow developers to express concurrent and sequential dependencies, conditional branching, timing constraints, and error handling paths while abstracting explicit thread management and complex synchronization.
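The feature request does not define a concrete API, but a purely hypothetical sketch shows how a nested (tree-shaped) program description could cover the listed capabilities while staying amenable to simple control-flow analysis - here, a recursive walk counting reachable tasks (all type and field names are invented for illustration):

```rust
// Hypothetical nested program description; every name here is invented
// for illustration and is not part of the feature request.
enum Step {
    Task(&'static str),                  // a unit of work
    Sequence(Vec<Step>),                 // sequential dependencies
    Concurrent(Vec<Step>),               // concurrent execution
    Branch {                             // conditional branching
        cond: &'static str,
        then_s: Box<Step>,
        else_s: Box<Step>,
    },
    Timeout {                            // timing constraint + error path
        budget_ms: u64,
        body: Box<Step>,
        on_error: Box<Step>,
    },
}

// Because the description is a tree, control-flow analysis is a simple
// recursive walk - here, counting the reachable task nodes.
fn task_count(s: &Step) -> usize {
    match s {
        Step::Task(_) => 1,
        Step::Sequence(v) | Step::Concurrent(v) => v.iter().map(task_count).sum(),
        Step::Branch { then_s, else_s, .. } => task_count(then_s) + task_count(else_s),
        Step::Timeout { body, on_error, .. } => task_count(body) + task_count(on_error),
    }
}

fn main() {
    let program = Step::Sequence(vec![
        Step::Task("read_sensors"),
        Step::Concurrent(vec![Step::Task("filter"), Step::Task("predict")]),
        Step::Timeout {
            budget_ms: 10,
            body: Box::new(Step::Task("plan")),
            on_error: Box::new(Step::Task("fallback")),
        },
    ]);
    assert_eq!(task_count(&program), 5);
    println!("program contains {} tasks", task_count(&program));
}
```

A graph-based description could express the same program, but the nested form makes exhaustive traversal (and hence static inspection) trivial, which is the rationale the proposal gives for choosing it.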
See above: This could be achieved by providing an async framework similar to the known ones. There is no need for a "nested" language to achieve this.
:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** decouple algorithm design from deployment specifics, allowing dynamic updates, upgrades, and new deployments.
Could you describe in more detail, how the intended orchestrator approach can solve this without closely coupling the applications into a huge meta-program?
:satisfies: stkh_req__execution_model__processes, stkh_req__dev_experience__tracing_of_exec
:status: invalid

The system **SHALL** provide hooks for tracing and profiling task execution to verify behavior and control flow of the integrated system.
Don't the tracing and logging frameworks already provide means for this? If not, could you please specify in more detail the features the orchestration framework needs to add?
:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__security_features
:status: invalid

The orchestration feature **SHALL** assume that all code executing within a process is trusted.
I think this is not a requirement on the orchestrator but rather a requirement on how developers shall partition their applications (i.e., independent of the existence of the orchestration feature).
:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__automotive_safety
:status: invalid

All tasks within a single process **SHALL** share the same ASIL level.
I think this is not a requirement on the orchestrator but rather a requirement on how developers shall partition their applications (i.e., independent of the existence of the orchestration feature).
:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__automotive_safety
:status: invalid

The system **SHALL** implement priority-based preemption between thread pools to ensure that lower-priority programs cannot interfere with higher-priority programs.
Is it intended to provide means for that in the orchestration feature? Or is there a separate mechanism for that (e.g. based on OS tools)?
Concurrent programming in our target environment spans multiple scopes. At one level, concurrency exists within the algorithms of individual applications or system services, while at another level, multiple applications must execute concurrently across the platform. The challenge is to offer an interface that is not only simple and expressive but also deterministic and reliable - an important requirement in safety-critical systems.

Traditional thread-based concurrency in POSIX-like environments introduces complexities such as deadlocks, livelocks, and starvation. These issues, coupled with the inherent difficulties in debugging and validating thread-based systems, can compromise both performance and reliability. [#f4]_ [#f5]_ [#f6]_ Moreover, current designs often separate the management of timing requirements, monitoring, and error handling from the control flow. Integrating these aspects closer to the application logic would promote higher cohesion and lower coupling, enabling more effective debugging and validation, particularly when addressing application-specific scenarios.
Can you explain how the orchestrator helps to avoid deadlocks, livelocks, and starvation? In particular also in the case where priority-based OS scheduling is used as described below.
Can you explain how the Orchestrator helps to integrate multiple applications and other services? For example, what does an integrator have to do when adding another orchestrator-enabled application to an existing system? I assume such an application comes with a description according to the "nested task-based programming framework". How will this new description be merged into the existing descriptions? How can it be ensured that the existing applications will still meet their timing requirements?
Besides the potential upsides of an algorithm-independent description of scheduling requirements, there is the downside that the application logic is split into two parts: the part in the algorithm and the part in the scheduling description. I think this does not improve user friendliness from the perspective of an application developer. Therefore, the pros and cons have to be weighed carefully against each other.
- Free from complex synchronization mechanisms.
- Capable of expressing both concurrent and sequential dependencies.
- Capable of expressing conditional branching within the program.
- Capable of expressing timing constraints and error handling paths directly within the program.
Could you please explain what timing constraints you envision and what exactly the error handling paths would do? Is the error handling the reaction to a failed timing constraint check or is this error handling something more general? How does user code interact with the error handling?
closes: #273