TELCODOCS-1874: Adding hub cluster RDS - Tech Preview #93098

2 changes: 2 additions & 0 deletions _topic_maps/_topic_map.yml
@@ -3332,6 +3332,8 @@ Topics:
File: telco-core-rds
- Name: Telco RAN DU reference design specifications
File: telco-ran-du-rds
- Name: Telco hub reference design specifications
File: telco-hub-rds
- Name: Comparing cluster configurations
Dir: cluster-compare
Distros: openshift-origin,openshift-enterprise
Binary file added images/telco-hub-cluster-rds-architecture.png
78 changes: 78 additions & 0 deletions modules/telco-hub-acm-observability.adoc
@@ -0,0 +1,78 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-acm-observability_{context}"]
= {rh-rhacm} Observability

Cluster observability is provided by the multicluster engine and {rh-rhacm}.

* Observability storage requires several `PV` resources and S3-compatible object storage for long-term retention of metrics.
* Calculating storage requirements is complex and depends on the specific workloads and characteristics of the managed clusters.
Requirements for `PV` resources and the S3 bucket depend on many factors, including data retention, the number of managed clusters, managed cluster workloads, and so on.
* Estimate the required storage for observability by using the observability sizing calculator in the {rh-rhacm} capacity planning repository.
See the Red Hat Knowledgebase article link:https://access.redhat.com/articles/7103886[Calculating storage need for MultiClusterHub Observability on telco environments] for an explanation of using the calculator to estimate observability storage requirements.
The following table uses inputs derived from the telco RAN DU RDS and the hub cluster RDS as representative values.

[NOTE]
====
The following numbers are estimates.
Tune the input values for more accurate results.
Add an engineering margin, for example +20%, to the results to account for potential estimation inaccuracies.
====

.Observability sizing calculator inputs
[cols="42%,42%,16%",options="header"]
|====
|Capacity planner input
|Data source
|Example value

|Number of control plane nodes
|Hub cluster RDS (scale) and telco RAN DU RDS (topology)
|3500

|Number of additional worker nodes
|Hub cluster RDS (scale) and telco RAN DU RDS (topology)
|0

|Days for storage of data
|Hub cluster RDS
|15

|Total Number of pods per cluster
|Telco RAN DU RDS
|120

|Number of namespaces (excluding {product-title} namespaces)
|Telco RAN DU RDS
|4

|Number of metric samples per hour
|Default value
|12

|Number of hours of retention in Receiver PV
|Default value
|24
|====

With these input values, the sizing calculator as described in the Red Hat Knowledgebase article link:https://access.redhat.com/articles/7103886[Calculating storage need for MultiClusterHub Observability on telco environments] indicates the following storage needs:

.Storage requirements
[options="header"]
|====
2+|alertmanager PV 2+|thanos-receive PV 2+|thanos-compactor PV

|*Per replica* |*Total* |*Per replica* |*Total* 2+|*Total*

|10 GiB |30 GiB |10 GiB |30 GiB 2+|100 GiB
|====

.Storage requirements (continued)
[options="header"]
|====
2+|thanos-rule PV 2+|thanos-store PV 2+|Object bucket^[1]^

|*Per replica* |*Total* |*Per replica* |*Total* |*Per day* |*Total*

|30 GiB |90 GiB |100 GiB |300 GiB |15 GiB |101 GiB
|====
[1] For the object bucket, the estimate assumes that downsampling is disabled, so only storage for raw data needs to be calculated.
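
As an illustrative sketch, the resulting sizes might be applied through the `MultiClusterObservability` CR storage configuration as follows; the storage class, secret name, and field values are assumptions derived from the preceding tables, and you should confirm them against the {rh-rhacm} Observability documentation and the reference CRs.

[source,yaml]
----
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  enableDownsampling: false   # matches the object bucket estimate, which assumes downsampling is disabled
  storageConfig:
    metricObjectStorage:
      name: thanos-object-storage   # Secret with the S3-compatible bucket configuration
      key: thanos.yaml
    storageClass: <storage_class>
    alertmanagerStorageSize: 10Gi   # per replica
    receiveStorageSize: 10Gi        # per replica
    compactStorageSize: 100Gi
    ruleStorageSize: 30Gi           # per replica
    storeStorageSize: 100Gi         # per replica
----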
7 changes: 7 additions & 0 deletions modules/telco-hub-acmMCH-yaml.adoc
@@ -0,0 +1,7 @@
[id="telco-hub-acmMCH-yaml"]
.acmMCH.yaml
[source,yaml]
----
link:https://raw.githubusercontent.com/openshift-kni/telco-reference/release-4.19/telco-hub/configuration/reference-crs/required/acm/acmMCH.yaml[role=include]
----

35 changes: 35 additions & 0 deletions modules/telco-hub-architecture-overview.adoc
@@ -0,0 +1,35 @@
:_mod-docs-content-type: CONCEPT
[id="telco-hub-architecture-overview_{context}"]
= Hub cluster architecture overview


Use the features and components running on the management hub cluster to manage many other clusters in a hub-and-spoke topology.
The hub cluster provides a highly-available and centralized interface for managing the configuration, lifecycle, and observability of the fleet of deployed clusters.

[NOTE]
====
All management hub functionality can be deployed on a dedicated {product-title} cluster or as applications that are co-resident on an existing cluster.
====

Managed cluster lifecycle::
Using a combination of Day 2 Operators, the hub cluster provides the necessary infrastructure to deploy and configure the fleet of clusters by using a GitOps methodology.
Over the lifetime of the deployed clusters, further management of upgrades, scaling the number of clusters, node replacement, and other lifecycle management functions can be declaratively defined and rolled out.
You can control the timing and progression of the rollout across the fleet.

Monitoring::
+
--
The hub cluster provides monitoring and status reporting for the managed clusters through the Observability pillar of the {rh-rhacm} Operator.
This includes aggregated metrics, alerts, and compliance monitoring through the Governance policy framework.
--

The telco management hub reference design specifications (RDS) and the associated reference CRs describe the telco engineering- and QE-validated method for deploying, configuring, and managing the lifecycle of telco managed cluster infrastructure.
The reference configuration includes the installation and configuration of the hub cluster components on top of {product-title}.


.Hub cluster reference design components
image::telco-hub-cluster-reference-design-components.png[]

.Hub cluster reference design architecture
image::telco-hub-cluster-rds-architecture.png[]

21 changes: 21 additions & 0 deletions modules/telco-hub-assisted-service.adoc
@@ -0,0 +1,21 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-assisted-service_{context}"]
= Assisted Service

The Assisted Service is deployed with the multicluster engine and {rh-rhacm}.

.Assisted Service storage requirements
[cols="1,2", options="header"]
|====
|Persistent volume resource
|Size (GB)

|`imageStorage`
|50

|`filesystemStorage`
|700

|`databaseStorage`
|20
|====
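
The following is a minimal `AgentServiceConfig` sketch that requests the storage sizes in the preceding table; the storage class is a placeholder and the claim layout is an assumption to validate against the reference CRs.

[source,yaml]
----
apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  name: agent   # the CR must be named "agent"
spec:
  imageStorage:   # discovery ISO and image storage
    storageClassName: <storage_class>
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
  filesystemStorage:   # installation artifacts and logs
    storageClassName: <storage_class>
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 700Gi
  databaseStorage:   # Assisted Service database
    storageClassName: <storage_class>
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
----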
17 changes: 17 additions & 0 deletions modules/telco-hub-cluster-topology.adoc
@@ -0,0 +1,17 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-cluster-topology_{context}"]
= Cluster topology

In production settings, the {product-title} hub cluster must be highly available to maintain high availability of the management functions.

Limits and requirements::
Use a highly available cluster topology for the hub cluster, for example:
* Compact (3 nodes combined control plane and compute nodes)
* Standard (3 control plane nodes + N compute nodes)

Engineering considerations::
* In non-production settings, a {sno} cluster can be used for limited hub cluster functionality.
* Certain capabilities, for example {rh-storage}, are not supported on {sno}.
In this configuration some hub cluster features might not be available.
* The number of optional compute nodes can vary depending on the scale of the specific use case.
* Compute nodes can be added later as required.
5 changes: 5 additions & 0 deletions modules/telco-hub-engineering-considerations.adoc
@@ -0,0 +1,5 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-engineering-considerations_{context}"]
= Hub cluster engineering considerations

The following sections describe the engineering considerations for hub cluster resource scaling targets and utilization.
27 changes: 27 additions & 0 deletions modules/telco-hub-git-repository.adoc
@@ -0,0 +1,27 @@
:_mod-docs-content-type: CONCEPT
[id="telco-hub-git-repository_{context}"]
= Git repository

The telco management hub cluster supports a GitOps driven methodology for installing and managing the configuration of OpenShift clusters for various telco applications.
This methodology requires an accessible Git repository that serves as the authoritative source of truth for cluster definitions and configuration artifacts.

Red Hat does not offer a commercially supported Git server.
An existing Git server provided in the production environment can be used.
Gitea and Gogs are examples of self-hosted Git servers that you can use.

The Git repository is typically provided in the production network external to the hub cluster.
In a large-scale deployment, multiple hub clusters can use the same Git repository for maintaining the definitions of managed clusters.
Using this approach, you can easily review the state of the complete network.
As the source of truth for cluster definitions, the Git repository should be highly available and recoverable in disaster scenarios.

[NOTE]
====
For disaster recovery and multi-hub considerations, run the Git repository separately from the hub cluster.
====

Limits and requirements::
* A Git repository is required to support the {ztp} functions of the hub cluster, including installation, configuration, and lifecycle management of the managed clusters.
* The Git repository must be accessible from the management cluster.

Engineering considerations::
* The Git repository is used by the GitOps Operator to ensure continuous deployment and a single source of truth for the applied configuration, as shown in the following example.

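The following is a minimal sketch of an Argo CD `Application` that consumes the external Git repository; the repository URL, path, and namespaces are placeholder assumptions.

[source,yaml]
----
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clusters
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.com/telco/site-configs.git   # external Git repository (placeholder)
    targetRevision: main
    path: site-configs   # directory containing cluster definitions (placeholder)
  destination:
    server: https://kubernetes.default.svc
    namespace: clusters-sub
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
----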
30 changes: 30 additions & 0 deletions modules/telco-hub-gitops-operator-and-ztp-plugins.adoc
@@ -0,0 +1,30 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-gitops-operator-and-ztp-plugins_{context}"]
= GitOps Operator and {ztp}

New in this release::
* No reference design updates in this release

Description::
GitOps Operator and {ztp} provide a GitOps-based infrastructure for managing cluster deployment and configuration.
Cluster definitions and configurations are maintained as a declarative state in Git.
You can apply `ClusterInstance` CRs to the hub cluster where the `SiteConfig` Operator renders them as installation CRs.
In earlier releases, a {ztp} plugin supported the generation of installation CRs from `SiteConfig` CRs.
This plugin is now deprecated.
A separate {ztp} plugin is available to enable automatic wrapping of configuration CRs into policies based on the `PolicyGenerator` or `PolicyGenTemplate` CR.
+
You can deploy and manage multiple versions of {product-title} on managed clusters by using the baseline reference configuration CRs.
You can use custom CRs alongside the baseline CRs.
To maintain multiple per-version policies simultaneously, use Git to manage the versions of the source and policy CRs by using `PolicyGenerator` or `PolicyGenTemplate` CRs.


Limits and requirements::
* 300 single node `SiteConfig` CRs can be synchronized for each ArgoCD application.
You can use multiple applications to achieve the maximum number of clusters supported by a single hub cluster.
* To ensure consistent and complete cleanup of managed clusters and their associated resources during cluster or node deletion, you must configure ArgoCD to use background deletion mode.

Engineering considerations::
* To avoid confusion or unintentional overwrite when updating content, use unique and distinguishable names for custom CRs in the `source-crs` directory and extra manifests.
* Keep reference source CRs in a separate directory from custom CRs.
This facilitates easy update of reference CRs as required.
* To help with multiple versions, keep all source CRs and policy creation CRs in versioned Git repositories to ensure consistent generation of policies for each {product-title} version.
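
The following is a minimal `PolicyGenerator` sketch that wraps reference and custom source CRs into a policy; the names, namespace, placement label, and directory paths are placeholder assumptions.

[source,yaml]
----
apiVersion: policy.open-cluster-management.io/v1
kind: PolicyGenerator
metadata:
  name: common-policies
policyDefaults:
  namespace: ztp-common   # placeholder namespace
  remediationAction: inform
  placement:
    labelSelector:
      common: "true"   # placeholder cluster label
policies:
- name: common-config-policy
  manifests:
  - path: source-crs/reference/   # reference source CRs kept separate from custom CRs
  - path: source-crs/custom/      # custom CRs with unique, distinguishable names
----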
22 changes: 22 additions & 0 deletions modules/telco-hub-hub-cluster-day-2-operators.adoc
@@ -0,0 +1,22 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-hub-cluster-day-2-operators_{context}"]
= Day 2 Operators in the hub cluster

The management hub cluster relies on a set of Day 2 Operators to provide critical management services and infrastructure.
Use Operator versions that match the set of managed cluster versions in your fleet.

Install Day 2 Operators by using Operator Lifecycle Manager (OLM) and `Subscription` CRs.
`Subscription` CRs identify the specific Day 2 Operator to install, the catalog in which the Operator is found, and the appropriate version channel for the Operator.
By default, OLM installs Operators and attempts to keep them updated with the latest z-stream version available in the channel.
All `Subscription` CRs are set with an `installPlanApproval: Automatic` value by default.
In this mode, OLM automatically installs new Operator versions when they are available in the catalog and channel.

[NOTE]
====
Setting `installPlanApproval` to `Automatic` exposes the risk of an Operator being updated outside of defined maintenance windows if the catalog index is updated to include newer Operator versions.
In a disconnected environment where you build and maintain a curated set of Operators and versions in the catalog, and where you create a new catalog index for updated versions, the risk of Operators being inadvertently updated is largely removed.
However, to further reduce this risk, you can set the `Subscription` CRs to `installPlanApproval: Manual`, which prevents Operators from being updated without explicit administrator approval.
====
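
The following is a minimal `Subscription` sketch showing the manual approval setting; the Operator name, namespace, channel, and catalog source are placeholder assumptions.

[source,yaml]
----
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-day2-operator   # placeholder Operator name
  namespace: example-operator-namespace
spec:
  channel: stable   # version channel appropriate for the Operator
  name: example-day2-operator
  source: redhat-operator-index   # curated catalog source (placeholder)
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual   # prevents updates without explicit administrator approval
----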

Limits and requirements::
* When upgrading a Telco hub cluster, the versions of {product-title} and Operators must meet the requirements of all relevant compatibility matrixes.
52 changes: 52 additions & 0 deletions modules/telco-hub-hub-cluster-openshift-deployment.adoc
@@ -0,0 +1,52 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-hub-cluster-openshift-deployment_{context}"]
= {product-title} installation on the hub cluster

Description::
+
--
The reference method for installing {product-title} for the hub cluster is through the Agent-based Installer.

Agent-based Installer provides installation capabilities without additional centralized infrastructure.
The Agent-based Installer creates an ISO image that you mount on the server to be installed.
When you boot the server, {product-title} is installed alongside optionally supplied extra manifests, such as {ztp} custom resources.

[NOTE]
====
You can also install {product-title} in the hub cluster by using other installation methods.
====

If hub cluster functions are being applied to an existing {product-title} cluster, the Agent-based Installer installation is not required.
The remaining steps to install Day 2 Operators and configure the cluster for these functions remain the same.
When {product-title} installation is complete, the set of additional Operators and their configuration must be installed on the hub cluster.

The reference configuration includes all of these CRs, which you can apply manually, for example:

[source,terminal]
----
$ oc apply -f <reference_cr>
----

You can also add the reference configuration to the Git repository and apply it using ArgoCD.

[NOTE]
====
If you apply the CRs manually, take care to apply them in the order indicated by the ArgoCD wave annotations.
Any CRs without annotations are in the initial wave.
====
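
For reference, the ordering is expressed with the `argocd.argoproj.io/sync-wave` annotation on each CR, as in the following minimal sketch; the kind, name, and namespace are placeholders.

[source,yaml]
----
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: example-operator-group   # placeholder
  namespace: <target_namespace>
  annotations:
    argocd.argoproj.io/sync-wave: "2"   # CRs in lower waves are applied first
----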
--

Limits and requirements::
* Agent-based Installer requires an accessible image repository containing all required {product-title} and Day 2 Operator images.
* The Agent-based Installer builds an ISO image based on a specific {product-title} release and specific cluster details.
Installing a second hub cluster requires building a separate ISO image.

Engineering considerations::
* Agent-based Installer provides a baseline {product-title} installation.
You apply Day 2 Operators and other configuration CRs after the cluster is installed.
* The reference configuration supports Agent-based Installer installation in a disconnected environment.
* A limited set of additional manifests can be supplied at installation time.
* Include any `MachineConfig` CRs that you require as extra manifests during installation.



4 changes: 4 additions & 0 deletions modules/telco-hub-hub-components.adoc
@@ -0,0 +1,4 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-hub-components_{context}"]
= Hub cluster components

16 changes: 16 additions & 0 deletions modules/telco-hub-hub-disaster-recovery.adoc
@@ -0,0 +1,16 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-hub-disaster-recovery_{context}"]
= Hub cluster disaster recovery

Loss of the hub cluster does not typically create a service outage on the managed clusters.
However, functions provided by the hub cluster, such as observability and configuration and lifecycle management (LCM) updates driven through the hub cluster, are unavailable until the hub cluster is recovered.

Limits and requirements::

* Backup, restore, and disaster recovery are provided by the cluster backup and restore Operator, which depends on the OpenShift API for Data Protection (OADP) Operator.

Engineering considerations::

* The cluster backup and restore Operator can be extended to back up third-party resources of the hub cluster based on user configuration.
* The cluster backup and restore Operator is not enabled by default in {rh-rhacm}.
The reference configuration enables this feature, as shown in the example that follows.
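
The following is a minimal sketch of how the cluster backup and restore feature might be enabled through the `MultiClusterHub` CR overrides; confirm the exact settings against the reference `acmMCH.yaml` CR.

[source,yaml]
----
apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec:
  overrides:
    components:
    - name: cluster-backup   # not enabled by default; the reference configuration enables it
      enabled: true
----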
17 changes: 17 additions & 0 deletions modules/telco-hub-local-storage-operator.adoc
@@ -0,0 +1,17 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-local-storage-operator_{context}"]
= Local Storage Operator

New in this release::
* No reference design updates in this release

Description::
With the Local Storage Operator, you can create persistent volumes that applications can consume through `PVC` resources.
The number and type of `PV` resources that you create depend on your requirements.

Engineering considerations::
* Create backing storage for `PV` CRs before creating the `PV`.
This can be a partition, a local volume, an LVM volume, or a full disk.
* Refer to the device listing in `LocalVolume` CRs by the hardware path used to access each device to ensure correct allocation of disks and partitions, for example, `/dev/disk/by-path/<id>`, as shown in the example after this list.
Logical names (for example, `/dev/sda`) are not guaranteed to be consistent across node reboots.

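The following is a minimal `LocalVolume` sketch that references a device by its hardware path; the storage class name and device path are placeholder assumptions.

[source,yaml]
----
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-disks
  namespace: openshift-local-storage
spec:
  storageClassDevices:
  - storageClassName: <storage_class>   # placeholder storage class name
    volumeMode: Filesystem
    fsType: xfs
    devicePaths:
    - /dev/disk/by-path/<id>   # hardware path is stable across node reboots
----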
21 changes: 21 additions & 0 deletions modules/telco-hub-logging.adoc
@@ -0,0 +1,21 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-logging_{context}"]
= Logging

New in this release::
* No reference design updates in this release

Description::
The Cluster Logging Operator enables collection and shipping of logs off the node for remote archival and analysis.
The reference configuration uses Kafka to ship audit and infrastructure logs to a remote archive.

Limits and requirements::
* The reference configuration does not include local log storage.
* The reference configuration does not include aggregation of managed cluster logs at the hub cluster.

Engineering considerations::
* The impact on cluster CPU use is based on the number and size of logs generated and the amount of log filtering configured.
* The reference configuration does not include shipping of application logs.
Including application logs in the configuration requires you to evaluate the application logging rate and allocate sufficient additional CPU resources to the reserved set.
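
The following is a minimal sketch of a `ClusterLogForwarder` CR that ships audit and infrastructure logs to a Kafka archive, as described in the reference configuration; the broker URL, topic, and service account name are placeholder assumptions, and the exact API version depends on the logging release in use.

[source,yaml]
----
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  serviceAccount:
    name: collector   # service account with the required log collection permissions
  outputs:
  - name: kafka-output
    type: kafka
    kafka:
      url: tcp://kafka.example.com:9092/hub-logs   # placeholder broker and topic
  pipelines:
  - name: audit-infra-logs
    inputRefs:
    - audit
    - infrastructure
    outputRefs:
    - kafka-output
----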


20 changes: 20 additions & 0 deletions modules/telco-hub-managed-cluster-deployment.adoc
@@ -0,0 +1,20 @@
:_mod-docs-content-type: REFERENCE
[id="telco-hub-managed-cluster-deployment_{context}"]
= Managed cluster deployment

Description::
As of {rh-rhacm} 2.12, using the SiteConfig Operator is the recommended method for deploying managed clusters.
The SiteConfig Operator introduces a unified `ClusterInstance` API that decouples the parameters that define the cluster from the manner in which it is deployed.
The SiteConfig Operator uses a set of cluster templates that are instantiated with the data from a `ClusterInstance` CR to dynamically generate installation manifests.
Following the GitOps methodology, the `ClusterInstance` CR is sourced from a Git repository through ArgoCD.
You can use the `ClusterInstance` CR to initiate cluster installation by using either the Assisted Installer or the image-based installation available in the multicluster engine.

Limits and requirements::
* The SiteConfig ArgoCD plugin which handles `SiteConfig` CRs is deprecated from {product-title} 4.18.


Engineering considerations::
* You must create a `Secret` CR with the login information for the cluster baseboard management controller (BMC).
This secret is then referenced in the `ClusterInstance` CR, as shown in the example after this list.
You can use integration with a secret store, such as Vault, to manage the secrets.
* Besides offering deployment method isolation and unification of Git and non-Git workflows, the SiteConfig Operator provides better scalability, greater flexibility with the use of custom templates, and an enhanced troubleshooting experience.
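
The following is a minimal `ClusterInstance` sketch illustrating these considerations; the cluster names, image set, BMC address, and template references are placeholder assumptions that you must adapt to your environment.

[source,yaml]
----
apiVersion: siteconfig.open-cluster-management.io/v1alpha1
kind: ClusterInstance
metadata:
  name: example-sno
  namespace: example-sno
spec:
  clusterName: example-sno
  baseDomain: example.com
  clusterImageSetNameRef: <cluster_image_set>   # placeholder image set name
  pullSecretRef:
    name: assisted-deployment-pull-secret
  templateRefs:
  - name: ai-cluster-templates-v1   # cluster-level templates for Assisted Installer based installation
    namespace: open-cluster-management
  nodes:
  - hostName: example-node1.example.com
    bmcAddress: <bmc_address>   # placeholder BMC address
    bmcCredentialsName:
      name: example-node1-bmh-secret   # Secret CR with the BMC login information
    bootMACAddress: "AA:BB:CC:DD:EE:11"
    templateRefs:
    - name: ai-node-templates-v1   # node-level templates for Assisted Installer based installation
      namespace: open-cluster-management
----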