The redhat-operators CatalogSource in openshift-marketplace remained in TRANSIENT_FAILURE instead of becoming
READY.
For OCS/ODF installs, ocs-ci disables the default OperatorHub redhat-operators source and creates a CatalogSource
with the same name that points to the OCS registry image (for example, quay.io/rhceph-dev/ocs-registry:…).
That image is large. The catalog pod takes a long time to start listening on gRPC port 50051. OpenShift’s Operator
Lifecycle Manager (OLM) uses startup probes on that port. If the server is not ready in time, Kubernetes keeps
restarting the container. The CatalogSource then never reports a good connection and shows TRANSIENT_FAILURE.
So the failure is not mainly “bad image pull” or “wrong kubeconfig”; it is a race between how long a very heavy
catalog index takes to start serving and how long OLM’s startup probe is willing to wait.
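A quick way to confirm this diagnosis on a live cluster (read-only `oc` commands; a sketch — the `olm.catalogSource=<name>` label is what OLM puts on catalog pods, but exact event/probe output can vary by OLM version):

```shell
# Connection state reported by the CatalogSource (shows TRANSIENT_FAILURE in this case)
oc -n openshift-marketplace get catalogsource redhat-operators \
  -o jsonpath='{.status.connectionState.lastObservedState}{"\n"}'

# Catalog pod restart count: repeated restarts with a failing startup
# probe on gRPC port 50051 match the symptom described above
oc -n openshift-marketplace get pods -l olm.catalogSource=redhat-operators
oc -n openshift-marketplace describe pod -l olm.catalogSource=redhat-operators \
  | grep -i -A 3 'startup'
```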
What helped on the cluster (workaround)
• Using the default Red Hat operator index for redhat-operators again (OperatorHub) made the CatalogSource go to
READY quickly.
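For reference, re-enabling the default redhat-operators source goes through the OperatorHub cluster CR (a sketch; assumes the source was disabled via `disableAllDefaultSources` or a per-source `disabled` entry, which is how ocs-ci turns it off):

```shell
# Re-enable only the redhat-operators default source in the OperatorHub CR
oc patch operatorhub cluster --type merge \
  -p '{"spec":{"sources":[{"name":"redhat-operators","disabled":false}]}}'
```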
Ideas for a proper fix
• In ocs-ci / config: set something like grpcPodConfig.memoryTarget on the CatalogSource so the catalog pod
gets enough memory / GOMEMLIMIT for a big index (helps in many cases).
• Design: avoid replacing the real redhat-operators with the OCS image; use a separate CatalogSource name (e.g.
ocs-catalog) for the OCS registry and point only ODF/OCS subscriptions at it, so the normal Red Hat catalog
keeps working.
• Upstream: if the OCS index still cannot start within OLM’s startup window, that may need OLM changes (longer
tolerance) or image changes (faster startup).
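The first two ideas can be sketched as `oc` commands (the 4Gi value, the ocs-catalog name, and the image tag placeholder are illustrative assumptions, not tested values):

```shell
# Idea 1: set a memory target on the existing CatalogSource; OLM uses it
# to size the pod's memory request and GOMEMLIMIT for the registry server
oc -n openshift-marketplace patch catalogsource redhat-operators --type merge \
  -p '{"spec":{"grpcPodConfig":{"memoryTarget":"4Gi"}}}'

# Idea 2: keep redhat-operators untouched and add a separate CatalogSource
# for the OCS registry image; point only ODF/OCS Subscriptions at it
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ocs-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/rhceph-dev/ocs-registry:<tag>   # fill in the build tag
  displayName: OCS registry
EOF
```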