
Update OS onboarding docs #115501

Open: wants to merge 4 commits into `main`

40 changes: 16 additions & 24 deletions docs/project/os-onboarding.md
@@ -14,8 +14,9 @@ Continuing with the idea of pragmatism, if you only read this far, you've got th

References:
Review comment (Member): It feels like the above section is leading with the "why" rather than the "what" -- that is, it is explaining our reasoning before mentioning the strategy. Should we invert that for people who are only interested in the "what"?


- [New Operating System Version Onboarding Guide](https://github.com/dotnet/dnceng/blob/main/Documentation/ProjectDocs/OS%20Onboarding/Guidance.md)
- [OS Support Matrix](./os-support.md)
- [.NET OS Support Tracking](https://github.com/dotnet/core/issues/9638)

## Context

@@ -25,43 +26,34 @@ Nearly all the APIs that touch native code (networking, cryptography) and deal w

## Approach

Our rule is that we declare support (for all [supported .NET releases](https://github.com/dotnet/core/blob/main/releases.md)) for a new OS version after it is validated in dotnet/runtime `main`. We make support contingent on additional testing only in special cases, which are uncommon.
For every OS we intend to support, we aim to provide support on OS release day. We only require validation in dotnet/runtime `main` (for all [supported .NET releases](https://github.com/dotnet/core/blob/main/releases.md)), often relying on [non-final OS builds](https://github.com/dotnet/runtime/pull/111768#issuecomment-2617229139).

We aim to have "day of" support for about half the OSes we support, including Azure Linux, Ubuntu LTS, and Windows. This means we need to perform ahead-of-time signoff on [non-final builds](https://github.com/dotnet/runtime/pull/111768#issuecomment-2617229139).

Our testing philosophy is based on perceived risk and past experience. The effective test matrix is huge, the product of OSes \* supported versions \* architectures. We try to make smart choices to **skip testing most of the matrix** while retaining much of the **practical coverage**. We also know where we tend to get bitten most when we don't pay sufficient attention. For example, our bug risk across Linux, macOS, and Windows is not uniform.
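To make the size of the matrix concrete, here is a minimal Python sketch. The OS, version, and architecture lists are hypothetical and much smaller than the real ones, and the pruning rule is one illustrative choice, not the rule our pipelines use. It shows how quickly the product grows and how a small selection can still touch every OS.

```python
from itertools import product

# Hypothetical, deliberately small inputs; the real matrix is much larger.
oses = ["Alpine", "Azure Linux", "Debian", "Ubuntu", "Windows", "macOS"]
dotnet_versions = ["8.0", "9.0", "10.0"]
architectures = ["x64", "arm64", "arm"]

# The full matrix is the Cartesian product: OSes * versions * architectures.
full_matrix = list(product(oses, dotnet_versions, architectures))

# Hypothetical pruning rule: every OS on x64 for a single .NET version,
# plus arm64 for one representative OS.
selected = [
    (os_name, version, arch)
    for (os_name, version, arch) in full_matrix
    if version == "10.0"
    and (arch == "x64" or (arch == "arm64" and os_name == "Ubuntu"))
]

print(len(full_matrix))  # 54
print(len(selected))     # 7
```

The selection covers every OS while running 7 of 54 combinations, which is the kind of trade-off the paragraph describes.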

We use pragmatism and efficiency to drive our decision making. All things being equal, we'll choose the lowest cost approach.

## OS Lifecycle

We update `main` to bleeding edge OS versions, even pre-release versions. This approach provides us with confidence for new OS releases and reduces remediation cost in release branches. We also find that new OS releases require product and test updates to support significant changes in foundational components.

There are special considerations when `main` is the next .NET LTS (odd years):

### Adding coverage
- New Debian releases tend to ship in the middle of odd years (our LTS year). It is best to [add coverage as early as possible](https://github.com/dotnet/runtime/pull/111768), in part because [Preview 1 ships a pre-release Debian version](https://github.com/dotnet/dotnet-docker/discussions/6272) in container images.
- Ubuntu LTS releases ship 6 months after a .NET LTS. It is important that this combination has excellent support. We recommend moving forward with Ubuntu interim releases (examples: [24.10](https://github.com/dotnet/runtime/pull/111504), [25.04](https://github.com/dotnet/runtime/pull/113405)), which requires us to update to the next LTS during servicing (and then not update again).
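The odd-year rule follows from the .NET release cadence: a new major version ships every November (starting with .NET 5 in 2020), and even-numbered major versions are LTS. The Python sketch below assumes that cadence continues; the function names are hypothetical and not part of any repo tooling.

```python
def dotnet_major_for_year(year: int) -> int:
    """Major version shipping in a given year, assuming the annual cadence
    that started with .NET 5 in 2020 continues."""
    return year - 2015


def is_lts(major: int) -> bool:
    """Even-numbered .NET major versions (6, 8, 10, ...) are LTS releases."""
    return major >= 6 and major % 2 == 0


def main_is_next_lts(ship_year: int) -> bool:
    """True when the version in `main` ships in an odd year, making it an LTS."""
    return is_lts(dotnet_major_for_year(ship_year))


print(main_is_next_lts(2025))  # True (.NET 10)
print(main_is_next_lts(2024))  # False (.NET 9)
```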

New OS coverage should be added and tested first in `main`. If changes are required, we should prove them out in `main` before committing to shipping them in a servicing release, if one is necessary.
We update `release` branches primarily to accommodate EOL OS references. Alpine, Azure Linux, and Fedora are examples of OSes with shorter release cycles than .NET that require regular remediation.

There are multiple reasons to add a new OS reference in a release branch:
We avoid testing multiple versions of an OS in a single branch. We will often have multiple versions of an OS across branches, with older branches having references to older OSes. We believe that this approach provides sufficient coverage and is most likely to align with user behavior. It also aligns with the container images that we publish.

- Known product (as opposed to test) breaks that require validation and regression testing.
- Past experience suggests that coverage is required to protect against risk.
- OS version is or [will soon go EOL](https://github.com/dotnet/runtime/issues/111818#issuecomment-2613642202) and should be replaced by a newer version.

For example, we frequently need to backport Alpine updates to release branches to avoid EOL references, but we need to do so less often for Ubuntu, given the vast difference in support length.

A good strategy is to keep `main` at the bleeding edge of new OS versions. That way those references have a decent chance of never needing remediation once they end up in release branches.
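The reasoning behind keeping `main` on the newest OS versions reduces to a date comparison: a reference needs remediation in a release branch only if the OS version reaches EOL before that .NET branch does. This is a minimal Python sketch; the function name and all dates are hypothetical.

```python
from datetime import date


def needs_remediation(os_eol: date, branch_eol: date) -> bool:
    # The reference turns into an EOL reference during the branch's lifetime
    # if the OS version's support ends before the .NET branch's support does.
    return os_eol < branch_eol


# Hypothetical dates for illustration only.
branch_eol = date(2026, 11, 30)
older_os_version_eol = date(2025, 6, 30)   # older version picked for `main`
newest_os_version_eol = date(2027, 6, 30)  # bleeding-edge version picked for `main`

print(needs_remediation(older_os_version_eol, branch_eol))   # True
print(needs_remediation(newest_os_version_eol, branch_eol))  # False
```

Picking the newest version while the branch is still `main` pushes the OS EOL date later, which is what makes the second case more likely.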

### Updating or removing coverage
## Testing

We will often replace an older OS version with a new one when it becomes available. This approach is an effective way to maintain the same level of coverage and to remediate EOL OSes ahead of time. For the most part, we don't care about a specific version. We just want coverage for the OS, like Alpine.
Testing is the bread and butter of OS onboarding, particularly for a mature runtime like ours. New OS support always needs some form of test enablement.

We should remediate any EOL OS references in our codebase. They provide no benefit and come with some risk. They are also likely to result in compliance tickets (which come with a deadline) that we want to avoid.
Linux, Wasm, and some Windows testing is done in container images. This approach enables us to test many and regularly changing OS versions in a fixed/limited VM environment. The container image creation/update process is self-service (discussed later).

In the case that a .NET version will be EOL in <6 (and certainly <3) months, new coverage can typically be skipped. We may even be able to skip remediating EOL OS references. For related reasons, we often opt to stop updating [supported OSes](https://github.com/dotnet/core/blob/main/os-lifecycle-policy.md) late in the support period. A lazy approach is often the best approach late in the game. Don't upset what's working.
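One way to read the late-in-support guidance is as a threshold rule. The sketch below is a hypothetical Python encoding: the 6- and 3-month thresholds come from the text, but tying the "may skip EOL remediation" outcome to the 3-month threshold is an assumption made here for illustration. In practice this is a judgment call, not an automated check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoveragePlan:
    add_new_coverage: bool
    may_skip_eol_remediation: bool


def plan_for_branch(months_until_dotnet_eol: float) -> CoveragePlan:
    # Under 3 months: skip new coverage; (assumption) EOL remediation may be skipped too.
    if months_until_dotnet_eol < 3:
        return CoveragePlan(add_new_coverage=False, may_skip_eol_remediation=True)
    # Under 6 months: new coverage can typically be skipped; still remediate EOL references.
    if months_until_dotnet_eol < 6:
        return CoveragePlan(add_new_coverage=False, may_skip_eol_remediation=False)
    # Otherwise: normal policy, add coverage and remediate EOL references.
    return CoveragePlan(add_new_coverage=True, may_skip_eol_remediation=False)


print(plan_for_branch(12))
print(plan_for_branch(2))
```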
We use VMs (Linux and Windows) and bare-metal hardware (Android and Apple) in cases where containers are not practical or where direct testing is desired. This is the primary model for Apple and Windows OSes. The VMs and mobile/Apple hardware are relatively slow to change and require support from dnceng (discussed later).

## Building

@@ -71,7 +63,7 @@ We use both containers and VMs for building, depending on the OS. If we test in

Our primary concern is ensuring that we are using [supported operating systems and tools for our build](https://github.com/dotnet/runtime/tree/main/docs/workflow/requirements).

Our Linux build containers are based on Azure Linux. We [typically need to update them](https://github.com/dotnet/runtime/issues/112191) with a new version of Azure Linux once per release. We do not update the toolset, however. That's fixed, per release.
Our Linux build containers are based on Azure Linux. We [typically need to update them](https://github.com/dotnet/runtime/issues/112191) with a new version of Azure Linux once per release. Toolset updates are [limited to patch versions](https://github.com/dotnet/dotnet-buildtools-prereqs-docker/pull/1422).

For Apple, we likely need to make an adjustment at each macOS or iOS release to account for an Xcode version no longer being supported.

38 changes: 38 additions & 0 deletions docs/project/os-support.md
@@ -0,0 +1,38 @@
# OS Support Matrix

.NET is a cross-platform product, requiring good coverage across [supported OSes](https://github.com/dotnet/core/blob/main/release-notes/10.0/supported-os.md).

This document describes our coverage intent. It is a higher-level description than the pipeline definitions themselves.

Pipelines:

- [Runtime](https://github.com/dotnet/runtime/blob/main/eng/pipelines/coreclr/templates/helix-queues-setup.yml)
- [Libraries](https://github.com/dotnet/runtime/blob/main/eng/pipelines/libraries/helix-queues-setup.yml)

## Run types

We rely on multiple levels of testing to provide good coverage at reasonable cost. Our testing uses OSes both as a vehicle to test .NET itself and to discover distro-specific breakage. This is why we test multiple OSes for each run type.

- Inner loop -- Baseline set of tests that validate correct functional behavior, for PRs and branch builds.
Review comment (Member): Maybe worth calling out that inner-loop for main means lots and lots of test coverage. For release branches it's just servicing fixes.

FWIW, it may be simpler to remove this distinction between inner-loop only/outer-loop included tiers. Main will run everything eventually, just not on PR. Release branches could be changed to do the same thing.

Reply (Member Author): We need to explain:

- What these different "loops" are for / why they exist, so that others can make reasonable future changes to their pipeline definitions. It was not at all clear to me.
- Why they have different OSes in them and what informs that choice
- Any difference between main and release

We do not need to use tiers to do that. That was the most obvious approach.

- Outer loop -- A much larger set of tests that validate expected edge-case behavior, for [on-demand builds](https://github.com/dotnet/runtime/pull/115415#issuecomment-2864759316).
- Extra platforms -- Additional OSes that are run in a rolling build that can target either inner or outer loop tests.

The libraries pipeline defines these run types. The runtime pipeline has only the inner loop run type.

The remainder of the document defines the tiers we apply to each operating system family. The tiers will be adapted over time, as needed. Architecture is another aspect of coverage. It isn't covered in this document.

## Linux Tiers

The following tiers apply for Linux.

GitHub Actions / lint: check failure on line 26 in docs/project/os-support.md: Trailing spaces [Expected: 0; Actual: 1]

- Tier 1: Azure Linux, Debian, Ubuntu (in inner and outer loop)
Review comment (Member): I'd invert this language. That is, define what each tier means, then classify each OS as being in a particular tier.

- Tier 2: Alpine and CentOS Stream (in inner loop)
- Tier 3: Fedora and OpenSUSE (in extra platforms)
Comment on lines +28 to +30
Suggested change (Member Author):
- Tier 1: Azure Linux, Debian, Ubuntu (in inner and outer loop)
- Tier 2: Alpine and CentOS Stream (in inner loop)
- Tier 3: Fedora and OpenSUSE (in extra platforms)
- Tier 1: present in inner and outer loop (Azure Linux, Debian, Ubuntu)
- Tier 2: present in inner loop only (Alpine and CentOS Stream)
- Tier 3: present in extra platforms only (Fedora and OpenSUSE)
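The tier definitions in the suggested change amount to two lookups: OS to tier, and tier to run types. The Python sketch below is hypothetical (the names are not taken from the pipeline YAML) and reflects the libraries pipeline, which defines all three run types.

```python
# Hypothetical encoding of the Linux tiers; not derived from the pipeline YAML.
RUN_TYPES_BY_TIER: dict[int, frozenset[str]] = {
    1: frozenset({"inner loop", "outer loop"}),
    2: frozenset({"inner loop"}),
    3: frozenset({"extra platforms"}),
}

LINUX_TIERS: dict[str, int] = {
    "Azure Linux": 1,
    "Debian": 1,
    "Ubuntu": 1,
    "Alpine": 2,
    "CentOS Stream": 2,
    "Fedora": 3,
    "OpenSUSE": 3,
}


def run_types_for(distro: str) -> frozenset[str]:
    """Return the run types (in the libraries pipeline) that include this distro."""
    return RUN_TYPES_BY_TIER[LINUX_TIERS[distro]]


print(sorted(run_types_for("Debian")))  # ['inner loop', 'outer loop']
print(sorted(run_types_for("Fedora")))  # ['extra platforms']
```

Defining tiers by run type first, then assigning each OS to a tier, keeps the classification separate from the tier meaning, which is the inversion the review comment asks for.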


## Windows Tiers

The following tiers apply for Windows.

- Tier 1: Windows Server 2016, Windows Server 2022, Windows Server 2025, Windows 11

Note: "Windows 10" references in pipelines are really Windows Server 2016.