
fix(proxy): replace panic with graceful error handling in getRepo #2419

Closed
Storm1289 wants to merge 1 commit into kubeflow:main from Storm1289:fix-proxy-getrepo-panic

Conversation

Contributor

@Storm1289 Storm1289 commented Mar 17, 2026

This PR addresses a reliability issue in cmd/proxy.go where getRepo would panic if it failed to retrieve a repository from the repoSet. Panicking on expected configuration or connection errors can cause the entire proxy server process to crash abruptly in production, leaving little room for graceful degradation or clear logging of the root cause up the initialization chain.

Changes Made

1. Refactored getRepo signature

  • Before: func getRepo[T any](repoSet datastore.RepoSet) T
  • After: func getRepo[T any](repoSet datastore.RepoSet) (T, error)

2. Error handling instead of panic

  • Before:
  panic(fmt.Sprintf("unable to get repository: %v", err))
  • After:
  var zero T
  return zero, fmt.Errorf("unable to get repository: %w", err)

3. Return value updated

  • Before: return repo.(T)
  • After: return repo.(T), nil

4. Refactored newModelRegistryService — eager repo resolution with error checks

Previously, all 14 getRepo calls were passed inline as arguments directly into core.NewModelRegistryService(...), meaning any failure would panic
mid-call with no recovery path:

  // Before — all 14 repos resolved inline, panic on any failure
  modelRegistryService := core.NewModelRegistryService(
      getRepo[models.ArtifactRepository](repoSet),
      getRepo[models.ModelArtifactRepository](repoSet),
      getRepo[models.DocArtifactRepository](repoSet),
      getRepo[models.RegisteredModelRepository](repoSet),
      getRepo[models.ModelVersionRepository](repoSet),
      getRepo[models.ServingEnvironmentRepository](repoSet),
      getRepo[models.InferenceServiceRepository](repoSet),
      getRepo[models.ServeModelRepository](repoSet),
      getRepo[models.ExperimentRepository](repoSet),
      getRepo[models.ExperimentRunRepository](repoSet),
      getRepo[models.DataSetRepository](repoSet),
      getRepo[models.MetricRepository](repoSet),
      getRepo[models.ParameterRepository](repoSet),
      getRepo[models.MetricHistoryRepository](repoSet),
      repoSet.TypeMap(),
  )

Now each repository is resolved individually with an immediate error check. core.NewModelRegistryService is only called once all 14 repos are confirmed
healthy:

  // After — each repo resolved and checked before proceeding
  dataSetRepo, err := getRepo[models.DataSetRepository](repoSet)
  if err != nil {
      return nil, err
  }
  metricRepo, err := getRepo[models.MetricRepository](repoSet)
  if err != nil {
      return nil, err
  }
  parameterRepo, err := getRepo[models.ParameterRepository](repoSet)
  if err != nil {
      return nil, err
  }
  metricHistoryRepo, err := getRepo[models.MetricHistoryRepository](repoSet)
  if err != nil {
      return nil, err
  }
  // ... (same pattern for all 14 repos)

  modelRegistryService := core.NewModelRegistryService(
      artifactRepo,
      modelArtifactRepo,
      docArtifactRepo,
      registeredModelRepo,
      modelVersionRepo,
      servingEnvRepo,
      inferenceServiceRepo,
      serveModelRepo,
      experimentRepo,
      experimentRunRepo,
      dataSetRepo,
      metricRepo,
      parameterRepo,
      metricHistoryRepo,
      repoSet.TypeMap(),
  )
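Putting the pieces of changes 1–3 together, the refactored helper looks roughly like the sketch below. This is a self-contained illustration, not the actual cmd/proxy.go code: the real getRepo resolves a repository from datastore.RepoSet by its type parameter alone, whereas this sketch uses a hypothetical string-keyed RepoSet and Repository method for simplicity.

```go
package main

import "fmt"

// Minimal stand-in for datastore.RepoSet — just enough surface to show the
// pattern. The string key is a simplification; the real code resolves by
// the type parameter alone.
type RepoSet struct {
	repos map[string]any
}

func (s RepoSet) Repository(name string) (any, error) {
	repo, ok := s.repos[name]
	if !ok {
		return nil, fmt.Errorf("no repository registered for %q", name)
	}
	return repo, nil
}

// getRepo mirrors the refactor: on failure it returns the zero value of T
// plus a %w-wrapped error instead of panicking.
func getRepo[T any](repoSet RepoSet, name string) (T, error) {
	var zero T
	repo, err := repoSet.Repository(name)
	if err != nil {
		return zero, fmt.Errorf("unable to get repository: %w", err)
	}
	typed, ok := repo.(T)
	if !ok {
		return zero, fmt.Errorf("repository %q has unexpected type %T", name, repo)
	}
	return typed, nil
}

type MetricRepository interface{ Kind() string }

type metricRepo struct{}

func (metricRepo) Kind() string { return "metric" }

func main() {
	set := RepoSet{repos: map[string]any{"metric": metricRepo{}}}

	if r, err := getRepo[MetricRepository](set, "metric"); err == nil {
		fmt.Println("resolved:", r.Kind())
	}

	// A missing repository now surfaces as a returnable error, not a panic.
	if _, err := getRepo[MetricRepository](set, "parameter"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Each caller then handles the error at its own level, which is what makes the per-repository `if err != nil { return nil, err }` checks above possible.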

Why This Matters

  • Any datastore connection or configuration problem now surfaces as a clean, returnable error instead of a process-killing panic.
  • Errors are wrapped with %w, making them inspectable via errors.Is / errors.As up the call stack.
  • The service is only constructed when all dependencies are confirmed available, preventing partial initialization states.

This commit addresses a reliability issue in cmd/proxy.go where getRepo
would panic if it failed to retrieve a repository from the repoSet.
Panicking on expected configuration or connection errors can cause the
entire proxy server process to crash abruptly in production.

Changes Made:
- Refactored getRepo to return (T, error) instead of panicking.
- Refactored newModelRegistryService to extract each of the 14
repositories individually, aggressively checking for errors after each
lookup and returning the error back to the caller.

Signed-off-by: divakarsharma2934 <divakarsharma2934@gmail.com>
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rareddy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pboyd
Member

pboyd commented Mar 17, 2026

There's a new policy for AI-generated PRs coming: kubeflow/website#4336. Some of the finer points are still being debated, but there's general consensus that AI-generated code should be marked as such.

If this was AI-generated, can you update the commit message with Co-authored-by: [Agent Name] as requested in the new policy?

@Storm1289
Contributor Author

Hi @pboyd, I will look into this.

Member

@pboyd pboyd left a comment


I'm not sure this solves a real problem. If getRepo fails we have a programming logic error, which should only happen during development. And in development, the stacktrace is helpful. Whether it's an error or a panic, we can't recover from the error and the service should stop. Personally, I'd prefer the stacktrace, but if we need a friendlier message for some reason, that's fine with me.

Did you encounter this problem in the wild? If so, there's a serious bug that we need to address.

@Storm1289
Contributor Author

You're right, this was not encountered in the wild. I identified it during a code review but didn't fully consider that getRepo failing is a programming logic error rather than a runtime issue. The stacktrace from the panic is indeed more useful here. I'll close this PR.
Thank you for the feedback

@Storm1289 Storm1289 closed this Mar 17, 2026
@pboyd
Member

pboyd commented Mar 17, 2026

OK, @divakarsharma2934-a11y, thanks for the PR anyway.

If you'd like to contribute, we have a few "good first issues" (I realized we had run out, but I just tagged a couple more). Also, feel free to reach out on the CNCF slack (#kubeflow-model-registry) if something isn't clear.

@Storm1289 Storm1289 deleted the fix-proxy-getrepo-panic branch March 18, 2026 22:58
