The redhat-operators CatalogSource in openshift-marketplace remained in TRANSIENT_FAILURE instead of becoming
READY.
For OCS/ODF installs, ocs-ci disables the default OperatorHub redhat-operators source and creates a CatalogSource
with the same name that points to the OCS registry image (for example, quay.io/rhceph-dev/ocs-registry:…).
That image is large. The catalog pod takes a long time to start listening on gRPC port 50051. OpenShift’s Operator
Lifecycle Manager (OLM) uses startup probes on that port. If the server is not ready in time, Kubernetes keeps
restarting the container. The CatalogSource then never reports a good connection and shows TRANSIENT_FAILURE.
So the failure is not mainly “bad image pull” or “wrong kubeconfig”; it is a race between how long a very heavy
catalog index takes to start serving and how long OLM’s startup probe is willing to wait.
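A quick way to confirm this diagnosis on a live cluster (read-only `oc` commands; a sketch — the `olm.catalogSource=<name>` label is what OLM puts on catalog pods, but exact event/probe output can vary by OLM version):

```shell
# Connection state reported by the CatalogSource (shows TRANSIENT_FAILURE in this case)
oc -n openshift-marketplace get catalogsource redhat-operators \
  -o jsonpath='{.status.connectionState.lastObservedState}{"\n"}'

# Catalog pod restart count: repeated restarts with a failing startup
# probe on gRPC port 50051 match the symptom described above
oc -n openshift-marketplace get pods -l olm.catalogSource=redhat-operators
oc -n openshift-marketplace describe pod -l olm.catalogSource=redhat-operators \
  | grep -i -A 3 'startup'
```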
What helped on the cluster (workaround)
• Using the default Red Hat operator index for redhat-operators again (OperatorHub) made the CatalogSource go to
READY quickly.
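For reference, re-enabling the default redhat-operators source goes through the OperatorHub cluster CR (a sketch; assumes the source was disabled via `disableAllDefaultSources` or a per-source `disabled` entry, which is how ocs-ci turns it off):

```shell
# Re-enable only the redhat-operators default source in the OperatorHub CR
oc patch operatorhub cluster --type merge \
  -p '{"spec":{"sources":[{"name":"redhat-operators","disabled":false}]}}'
```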
Ideas for a proper fix
• In ocs-ci / config: set something like grpcPodConfig.memoryTarget on the CatalogSource so the catalog pod
gets enough memory / GOMEMLIMIT for a big index (helps in many cases).
• Design: avoid replacing the real redhat-operators with the OCS image; use a separate CatalogSource name (e.g.
ocs-catalog) for the OCS registry and point only ODF/OCS subscriptions at it, so the normal Red Hat catalog
keeps working.
• Upstream: if the OCS index still cannot start within OLM’s startup window, that may need OLM changes (longer
tolerance) or image changes (faster startup).
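The first two ideas can be sketched as `oc` commands (the 4Gi value, the ocs-catalog name, and the image tag placeholder are illustrative assumptions, not tested values):

```shell
# Idea 1: set a memory target on the existing CatalogSource; OLM uses it
# to size the pod's memory request and GOMEMLIMIT for the registry server
oc -n openshift-marketplace patch catalogsource redhat-operators --type merge \
  -p '{"spec":{"grpcPodConfig":{"memoryTarget":"4Gi"}}}'

# Idea 2: keep redhat-operators untouched and add a separate CatalogSource
# for the OCS registry image; point only ODF/OCS Subscriptions at it
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ocs-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/rhceph-dev/ocs-registry:<tag>   # fill in the build tag
  displayName: OCS registry
EOF
```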