Skip to content

Added try-catch block in REST batch v2 openBatchSession#7409

Open
oh0873 wants to merge 10 commits intoapache:masterfrom
oh0873:hoonoh/openBatchSessionExceptionCatch
Open

Added try-catch block in REST batch v2 openBatchSession#7409
oh0873 wants to merge 10 commits intoapache:masterfrom
oh0873:hoonoh/openBatchSessionExceptionCatch

Conversation

@oh0873
Copy link
Copy Markdown
Contributor

@oh0873 oh0873 commented Apr 16, 2026

Why are the changes needed?

In Kyuubi Batch v2 version, Batch Service creates batchExecutor to submit a batch job. However, when openBatchSession fails, the executor dies and never recovers.

This results in a jobs stuck at PENDING state and no executor to submit any jobs at INITIALIZED state.

This PR added error handling logic to handle openBatchSession.

This also fixes withUpdateCount call changes fromState to targetState.

How was this patch tested?

Tested in our environment, openBatchSession exceptions like connections per user or DB connection error no longer leaves jobs stuck at PENDING state. Instead all those jobs are labeled as ERROR.

Was this patch authored or co-authored using generative AI tooling?

Test case and bug finding were assisted with Cursor agent.

oh0873 added 3 commits April 16, 2026 10:01
---
**Work Item:** #11316434 #11406267

---

**Summary**
--

Kyuubi Batch Service creates `batchExecutor` to submit a batch job. However, when `openBatchSession` fails the executor dies and never recovers. This can result in a jobs stuck at PENDING state and no executor to submit any jobs stuck at INITIALIZED state.

We added  error handling logic to handle openBatchSession failures

**Problem**
--

`openBatchSession` may fail if connections.per.users are exceeded or metedata query fails. When `openBatchSession` fails, the executor dies and does not recover unless we restart kyuubi server pods.

---

**Approach**
---

Add try block to catch exception when `openBatchSession` fails. We mark the job from PENDING to FAILED to avoid PENDING jobs taking up submitter executors for long period of time.

**Code Change**
--

Fixed incorrect behavior of `withUpdateCount`.

Added Try and Except block in `KyuubiBatchService` to catch all open batch session exceptions, then it continues to the next job.

---

**Concern**

It's not clear whether we should set the failed job with ERROR or INITIALZIED (INITIALIZED would allow it to retry).

**Test**

Related work items: #11406267
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Kyuubi Batch v2 resiliency by preventing the batch submitter thread(s) from dying permanently when openBatchSession fails, and fixes an argument-order bug in JDBC metadata state transitions.

Changes:

  • Add error-handling around batch picking/opening in KyuubiBatchService to keep the submit loop alive and attempt to mark failed scheduled batches as ERROR.
  • Fix JDBCMetadataStore.transformMetadataState parameter ordering so state transitions update the intended column/value.
  • Add a REST suite test intended to cover failures during metadata update while opening a batch session.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
kyuubi-server/src/main/scala/org/apache/kyuubi/server/KyuubiBatchService.scala Wraps submit loop with try/catch and attempts to fail scheduled batches on open failures.
kyuubi-server/src/main/scala/org/apache/kyuubi/server/metadata/MetadataManager.scala Adds failScheduledBatch helper to transition PENDING -> ERROR.
kyuubi-server/src/main/scala/org/apache/kyuubi/server/metadata/jdbc/JDBCMetadataStore.scala Corrects SQL parameter order in transformMetadataState.
kyuubi-server/src/test/scala/org/apache/kyuubi/server/api/v1/BatchesResourceSuite.scala Adds a test scenario for openBatchSession failing during metadata update.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread kyuubi-server/src/main/scala/org/apache/kyuubi/server/KyuubiBatchService.scala Outdated
oh0873 and others added 4 commits April 20, 2026 09:41
Grammar

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
metadataManager.failScheduledBatch(batchId)
} catch {
case ex: Exception =>
error(s"Unable to modify metadata for $batchId to ERROR", ex)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested in our environment, openBatchSession exceptions like connections per user or DB connection error no longer leaves jobs stuck at PENDING state. Instead all those jobs are labeled as ERROR.

if there are transient network issues, and code reaches here, will job still be stuck at the PENDING state?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they would still stuck at PENDING state. However, since the thread doesn't die, submitter will continue to pick up another job.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it requires the admin to reset the state manually? if so, let's say that in the error message

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, error message is modified.

if (!submitted) Thread.sleep(1000)
}
info(s"$batchId is submitted or finished.")
Thread.sleep(1000)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the intention of this sleep

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove the sleep. We mostly observe the issue when there's server-wide network issue. Slowing down the submit process would reduce error rate. However, I do see this is not the only failure.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I guess this is mostly caused by transient network issues, then sleep makes sense, let's keep it and add a comment to explain the intention.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Comment and sleep is added.

@pan3793 pan3793 changed the title Added Try Catch block in openBatchSession Added try-catch block in REST batch v2 openBatchSession Apr 21, 2026
@pan3793
Copy link
Copy Markdown
Member

pan3793 commented Apr 21, 2026

Side notes: link your email used to commit (e.g., 80ff8b3) to your GitHub account, then you don't need to wait for approval to run CI on pushing new changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants