Intermittent CI error: DNS fails to resolve the service #247

Open
@sunya-ch

What happened?

CI fails intermittently with the errors below. Rerunning the job several times helps it pass.

error: connection error: Post "http://kepler-model-server.kepler.svc.cluster.local:8100/model": dial tcp: lookup kepler-model-server.kepler.svc.cluster.local on 10.96.0.10:53: no such host (http://kepler-model-server.kepler.svc.cluster.local:8100/model))
Error from server (InternalError): error when creating "tasks/train-task.yaml": Internal error occurred: failed calling webhook "webhook.pipeline.tekton.dev": failed to call webhook: Post "https://tekton-pipelines-webhook.tekton-pipelines.svc:443/defaulting?timeout=10s": dial tcp 10.96.111.114:443: connect: connection refused
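
Both errors look like a step running before the cluster service it depends on is reachable (the service name does not resolve yet, or the webhook is not accepting connections yet). As a possible stopgap while the root cause is investigated, a CI step could be wrapped in a small retry helper. This is only a sketch: `retry_until_ok`, the attempt count, and the delay are hypothetical names and values, not taken from the Kepler CI scripts.

```shell
#!/usr/bin/env bash
# Sketch only: retry a command until it succeeds or attempts run out.
# retry_until_ok and its arguments are illustrative assumptions,
# not part of the existing CI scripts.
#
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  local max_attempts="$1"
  local delay_seconds="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the wrapped command exactly as given.
    if "$@"; then
      return 0
    fi
    # Stop once the attempt budget is used up.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt}/${max_attempts} failed, retrying in ${delay_seconds}s: $*" >&2
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_until_ok 30 5 kubectl apply -f tasks/train-task.yaml` would retry the step that hit the webhook error for up to 30 attempts, 5 seconds apart. A step such as `kubectl wait --for=condition=Available deployment/tekton-pipelines-webhook -n tekton-pipelines --timeout=120s` could run before it; the deployment name there is an assumption inferred from the service name in the log. Retrying hides the flakiness rather than explaining it, so the root cause still needs investigation.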

What did you expect to happen?

Investigate the root cause and fix it.

How can we reproduce it (as minimally and precisely as possible)?

Push a PR to trigger CI. The failure is intermittent, so it may take several runs to appear.

Anything else we need to know?

No response

Kepler image tag

Deployment

  • Model server
  • Estimator
  • Online trainer
  • Offline trainer
  • Profiler

Kepler model server image tag if deployed

Kepler estimator image tag if deployed

Kepler online trainer image tag if deployed

Kepler offline trainer image tag if deployed

Kepler profiler image tag if deployed

Kubernetes version

$ kubectl version
# paste output here

Install tools

Kepler deployment config

For Kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE} 
# paste output here

# provide kepler model server configmap if Kepler Model Server is deployed 
$ kubectl get configmap kepler-model-server-cfm -n ${KEPLER_NAMESPACE} 
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE} 

For standalone:

put your Kepler command argument here

Metadata

Assignees

No one assigned

Labels

kind/bug (Something isn't working)
