Added try-catch block in REST batch v2 openBatchSession#7409
Added try-catch block in REST batch v2 openBatchSession#7409oh0873 wants to merge 10 commits intoapache:masterfrom
Conversation
--- **Work Item:** #11316434 #11406267 --- **Summary** -- Kyuubi Batch Service creates `batchExecutor` to submit a batch job. However, when `openBatchSession` fails the executor dies and never recovers. This can result in a jobs stuck at PENDING state and no executor to submit any jobs stuck at INITIALIZED state. We added error handling logic to handle openBatchSession failures **Problem** -- `openBatchSession` may fail if connections.per.users are exceeded or metedata query fails. When `openBatchSession` fails, the executor dies and does not recover unless we restart kyuubi server pods. --- **Approach** --- Add try block to catch exception when `openBatchSession` fails. We mark the job from PENDING to FAILED to avoid PENDING jobs taking up submitter executors for long period of time. **Code Change** -- Fixed incorrect behavior of `withUpdateCount`. Added Try and Except block in `KyuubiBatchService` to catch all open batch session exceptions, then it continues to the next job. --- **Concern** It's not clear whether we should set the failed job with ERROR or INITIALZIED (INITIALIZED would allow it to retry). **Test** Related work items: #11406267
There was a problem hiding this comment.
Pull request overview
This PR improves Kyuubi Batch v2 resiliency by preventing the batch submitter thread(s) from dying permanently when openBatchSession fails, and fixes an argument-order bug in JDBC metadata state transitions.
Changes:
- Add error-handling around batch picking/opening in
KyuubiBatchServiceto keep the submit loop alive and attempt to mark failed scheduled batches asERROR. - Fix
JDBCMetadataStore.transformMetadataStateparameter ordering so state transitions update the intended column/value. - Add a REST suite test intended to cover failures during metadata update while opening a batch session.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kyuubi-server/src/main/scala/org/apache/kyuubi/server/KyuubiBatchService.scala | Wraps submit loop with try/catch and attempts to fail scheduled batches on open failures. |
| kyuubi-server/src/main/scala/org/apache/kyuubi/server/metadata/MetadataManager.scala | Adds failScheduledBatch helper to transition PENDING -> ERROR. |
| kyuubi-server/src/main/scala/org/apache/kyuubi/server/metadata/jdbc/JDBCMetadataStore.scala | Corrects SQL parameter order in transformMetadataState. |
| kyuubi-server/src/test/scala/org/apache/kyuubi/server/api/v1/BatchesResourceSuite.scala | Adds a test scenario for openBatchSession failing during metadata update. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Grammar Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| metadataManager.failScheduledBatch(batchId) | ||
| } catch { | ||
| case ex: Exception => | ||
| error(s"Unable to modify metadata for $batchId to ERROR", ex) |
There was a problem hiding this comment.
Tested in our environment, openBatchSession exceptions like connections per user or DB connection error no longer leaves jobs stuck at PENDING state. Instead all those jobs are labeled as ERROR.
if there are transient network issues, and code reaches here, will job still be stuck at the PENDING state?
There was a problem hiding this comment.
Yes they would still stuck at PENDING state. However, since the thread doesn't die, submitter will continue to pick up another job.
There was a problem hiding this comment.
so it requires the admin to reset the state manually? if so, let's say that in the error message
There was a problem hiding this comment.
Thank you, error message is modified.
| if (!submitted) Thread.sleep(1000) | ||
| } | ||
| info(s"$batchId is submitted or finished.") | ||
| Thread.sleep(1000) |
There was a problem hiding this comment.
what's the intention of this sleep
There was a problem hiding this comment.
I can remove the sleep. We mostly observe the issue when there's server-wide network issue. Slowing down the submit process would reduce error rate. However, I do see this is not the only failure.
There was a problem hiding this comment.
yeah, I guess this is mostly caused by transient network issues, then sleep makes sense, let's keep it and add a comment to explain the intention.
There was a problem hiding this comment.
Thank you, Comment and sleep is added.
|
Side notes: link your email used to commit (e.g., 80ff8b3) to your GitHub account, then you don't need to wait for approval to run CI on pushing new changes |
Why are the changes needed?
In Kyuubi Batch v2 version, Batch Service creates
batchExecutorto submit a batch job. However, whenopenBatchSessionfails, the executor dies and never recovers.This results in a jobs stuck at PENDING state and no executor to submit any jobs at INITIALIZED state.
This PR added error handling logic to handle openBatchSession.
This also fixes
withUpdateCountcall changes fromState to targetState.How was this patch tested?
Tested in our environment, openBatchSession exceptions like connections per user or DB connection error no longer leaves jobs stuck at PENDING state. Instead all those jobs are labeled as ERROR.
Was this patch authored or co-authored using generative AI tooling?
Test case and bug finding were assisted with Cursor agent.