Skip to content

Commit fc738ec

Browse files
fix(api-v1): return real task.status in CreateTaskResponse, not hardcoded "pending" (#426)
* fix(v1): return real task.status in CreateTaskResponse, not hardcoded "pending" POST /v1/chat/tasks returned ``status="pending"`` in the response body even though ``begin_turn`` had already atomically claimed the row as RUNNING inside the same handler before the response was sent. An SDK client doing POST followed by an immediate GET on the same task would see two contradictory status values from back-to-back calls (``pending`` then ``running``). Read ``task.status.value`` after ``begin_turn`` returns instead -- ``begin_turn`` refreshes the in-memory ``task`` after committing the atomic claim, so the post-handler view reflects the row's real state. ``AppendMessageResponse`` already followed this contract; the create path now matches it. Schema docstrings for CreateTaskResponse and AppendMessageResponse updated to describe the post-claim ``running`` semantics; the ``test_create_task_happy_path`` assertion was checking for the old ``pending`` value and has been updated. No behavior change in the orchestrator or scheduler -- this is a response-payload correctness fix only. Twenty-eight v1 task tests pass; pre-commit (ruff / mypy / isort / codespell) green. * fix(v1): update create_chat_task docstring + unify AppendMessageResponse status read Follow-up to the CreateTaskResponse status fix in the previous commit. Two consistency cleanups within the same handler module: 1. ``create_chat_task`` function docstring still described the return value as ``status='pending'`` -- inconsistent with the schema docstring and the implementation that the previous commit already moved to ``task.status.value`` (i.e. 'running'). Updated the Returns section to match. 2. ``append_message_to_task`` returned ``status="running"`` hard- coded while ``create_chat_task`` reads ``task.status.value``. Unified both endpoints on the same pattern -- reading from the refreshed in-memory row is defensive (any future ``begin_turn`` status-machine change is picked up automatically) and removes the asymmetry where one endpoint reflected the DB and the other asserted a fixed string. No behavior change for the current contract -- the previous commit already returned 'running' from CREATE; this commit makes APPEND share the same expression form and fixes the docstring drift. Twenty-eight v1 task tests pass; pre-commit green. * fix(web): persist real exception text to task.error_message on bg failure ``execute_task_background`` previously logged and broadcast the exception text on failure but never wrote it to ``task.error_message``. The row's status stayed RUNNING, and ``finish_turn``'s RUNNING-fallback branch then unconditionally set ``error_message`` to a generic placeholder ("Task execution failed without status update; see /steps."), forcing SDK and web clients to fetch ``/steps`` to discover what actually went wrong. The fix writes the real exception text to ``task.error_message`` and flips ``status`` to ``FAILED`` in the exception handler, using a fresh session because the original may be in a failed-transaction state. ``finish_turn``'s FAILED branch then only fills in a placeholder when ``error_message`` is empty, so the real message is preserved through to the final row state. Same family of fix as the ``status="pending"`` correction above: make the persisted task row reflect the real outcome rather than a generic placeholder. Affects both SDK consumers (GET /v1/chat/tasks/ {id}) and web/WebSocket consumers reading the same row. Twenty-eight v1 task tests pass; pre-commit (ruff / mypy / isort / codespell) green. * fix(web): scope bg-task failure handling by status, not by exception site execute_task_background's outer except spans post-terminal steps (assistant-message persistence and the completion/paused broadcasts) that run after the task status was already committed COMPLETED. The recently added FAILED-persistence wrote the row unconditionally, so a failure in one of those best-effort steps -- e.g. the completion broadcast losing its websocket -- rewrote an already-completed task as FAILED and stored the broadcast error in error_message. Branch the handler on the task's current status instead. Only a task still RUNNING is a genuine execution failure: record the real exception text, flip to FAILED, and emit task_error. A task already in a terminal state tripped here in a best-effort post-completion step, so observe it without touching the row or emitting a contradictory task_error; finish_turn still reconciles the terminal fields afterward. * fix(web): commit terminal task status only once the turn is durable The success path committed COMPLETED/FAILED before persisting the assistant message, which is a separate durable write. If that write failed, the row was left COMPLETED with no message and no error_message -- the status-gated failure handler treated it as a best-effort post-completion step and left it untouched. Leave the terminal status pending and let persist_assistant_message's commit land it atomically with the message. A failure in that durable write now leaves the status RUNNING, so the outer handler surfaces a real task failure instead of a contradictory empty COMPLETED row. Only notification broadcasts remain best-effort. * fix(web): land terminal status on empty-reply turns too The previous change rode the terminal-status commit on persist_assistant_message's internal commit. But that helper early-returns without committing when the assistant content is empty (a valid empty-reply turn), which left the status pending -> RUNNING -> finish_turn flipping a successful empty turn to FAILED. Add an explicit commit after persistence. It lands the terminal status whether or not a message row was written, while still surfacing a real failure when persistence raises (control never reaches the explicit commit, so the status stays uncommitted and the outer except fails it).
1 parent 7e568ce commit fc738ec

5 files changed

Lines changed: 371 additions & 36 deletions

File tree

src/xagent/web/api/v1/tasks.py

Lines changed: 22 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -95,8 +95,10 @@ async def create_chat_task(
9595
9696
Returns:
9797
:class:`CreateTaskResponse` with the new ``task_id``,
98-
``agent_id``, ``status='pending'``, and ``created_at`` for the
99-
caller to start polling from.
98+
``agent_id``, ``status='running'`` (the atomic claim inside
99+
the handler flips the row from PENDING to RUNNING before the
100+
response is sent), and ``created_at`` for the caller to
101+
start polling from.
100102
101103
Raises:
102104
V1ApiError 401: missing/invalid/revoked key (raised inside
@@ -160,10 +162,18 @@ async def create_chat_task(
160162
except TaskTurnError:
161163
raise V1ApiError(V1ErrorCode.TASK_BUSY, 409)
162164

165+
# ``status=task.status.value`` (i.e. 'running'), not 'pending':
166+
# ``begin_turn`` ran an atomic UPDATE that flipped the row to
167+
# RUNNING and ``db.refresh(task)``'d the in-memory object before
168+
# returning. Returning 'pending' would lie to the SDK client --
169+
# an immediately-following GET would see 'running' and the caller
170+
# would have to reconcile two contradictory values from
171+
# back-to-back calls. This matches the AppendMessageResponse
172+
# contract below.
163173
return CreateTaskResponse(
164174
task_id=int(task.id),
165175
agent_id=int(agent.id),
166-
status="pending",
176+
status=task.status.value,
167177
created_at=task.created_at,
168178
)
169179

@@ -311,16 +321,18 @@ async def append_message_to_task(
311321
# see a value that matches what they'd read from the DB directly
312322
# via GET /v1/chat/tasks/{id}, with no clock-skew between the two.
313323
#
314-
# ``status='running'`` (not 'pending') because the atomic UPDATE
315-
# above already flipped the row to RUNNING in the same transaction.
316-
# Returning 'pending' here would lie to the SDK client: an
317-
# immediately-following GET would see 'running' and the client
318-
# would have to reconcile two contradictory values from
319-
# back-to-back calls.
324+
# ``status=task.status.value`` (i.e. 'running'), read from the
325+
# refreshed in-memory row rather than hardcoded, mirrors the
326+
# CreateTaskResponse contract above: the atomic UPDATE inside
327+
# ``begin_turn`` flipped the row to RUNNING in the same
328+
# transaction. Returning 'pending' here would lie to the SDK
329+
# client -- an immediately-following GET would see 'running' and
330+
# the caller would have to reconcile two contradictory values
331+
# from back-to-back calls.
320332
return AppendMessageResponse(
321333
task_id=int(task.id),
322334
agent_id=int(agent.id),
323-
status="running",
335+
status=task.status.value,
324336
accepted_at=task.updated_at,
325337
)
326338

src/xagent/web/api/websocket.py

Lines changed: 71 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -1442,9 +1442,18 @@ async def execute_task_background(
14421442
else:
14431443
task_updated.status = TaskStatus.FAILED
14441444
sync_workforce_run_status(db_new, task_updated, task_updated.status)
1445-
db_new.commit()
1445+
# Do NOT commit the terminal status here. Leave it
1446+
# pending so the assistant-message persistence below
1447+
# commits it atomically: the task is marked terminal
1448+
# only once the turn is durably complete. If that write
1449+
# fails, the status stays RUNNING and the outer except
1450+
# surfaces a real failure -- instead of leaving a
1451+
# COMPLETED row with no assistant message. Control
1452+
# statuses (PAUSED / WAITING_FOR_USER) above commit
1453+
# themselves; they have no assistant message to persist.
14461454
logger.info(
1447-
f"Updated task {task_id} status to {task_updated.status.value}"
1455+
f"Task {task_id} marked {task_updated.status.value} "
1456+
"(pending commit with assistant message)"
14481457
)
14491458
else:
14501459
logger.info(
@@ -1482,6 +1491,17 @@ async def execute_task_background(
14821491
if isinstance(chat_response, dict)
14831492
else None,
14841493
)
1494+
# Commit the pending terminal status. ``persist_assistant_message``
1495+
# commits internally when it writes a row, but it
1496+
# early-returns WITHOUT committing when the assistant
1497+
# content is empty (a valid empty-reply turn). This
1498+
# explicit commit lands the terminal status in that
1499+
# case too, so an empty successful turn stays COMPLETED
1500+
# rather than being left RUNNING (and later flipped to
1501+
# FAILED by finish_turn). If persistence raised, control
1502+
# never reaches here -- the status stays uncommitted and
1503+
# the outer except surfaces a real failure.
1504+
db_new.commit()
14851505

14861506
# Materialize broadcast metadata into primitives BEFORE the
14871507
# ``finally`` block closes ``db_new``. ``task_updated`` is
@@ -1596,25 +1616,57 @@ async def execute_task_background(
15961616
logger.info(f"Background task {task_id} execution completed")
15971617

15981618
except Exception as e:
1599-
logger.error(f"Background task {task_id} execution failed: {e}", exc_info=True)
1600-
# Send error event
1619+
# The outer try also spans the post-terminal steps -- assistant
1620+
# message persistence and the completion / paused broadcasts --
1621+
# that run *after* the task status was already committed terminal
1622+
# (COMPLETED above). ``_terminal_task_error_payload`` writes FAILED
1623+
# + the real error_message unconditionally, so gate it on the
1624+
# task's current status: only a task still RUNNING is a genuine
1625+
# execution failure. Otherwise a failed post-completion broadcast
1626+
# would rewrite an already-COMPLETED task as FAILED and store the
1627+
# broadcast error as the task's failure cause.
1628+
status_db = get_session_local()()
16011629
try:
1602-
message = str(e)
1603-
await manager.broadcast_to_task(
1604-
{
1605-
**_terminal_task_error_payload(
1606-
task_id,
1607-
message,
1608-
event_type="task_error",
1609-
),
1610-
"task_id": task_id,
1611-
"error": message,
1612-
"timestamp": datetime.now(timezone.utc).timestamp(),
1613-
},
1614-
task_id,
1630+
current = status_db.query(Task).filter(Task.id == task_id).first()
1631+
still_running = current is not None and (
1632+
current.status == TaskStatus.RUNNING
1633+
)
1634+
finally:
1635+
status_db.close()
1636+
1637+
if not still_running:
1638+
# Terminal state already committed; the exception came from a
1639+
# best-effort post-completion step. Observe it without touching
1640+
# the row or emitting a contradictory task_error. ``finish_turn``
1641+
# still reconciles the terminal fields afterward.
1642+
logger.warning(
1643+
f"Background task {task_id} post-terminal step failed; "
1644+
f"task state left unchanged: {e}",
1645+
exc_info=True,
1646+
)
1647+
else:
1648+
logger.error(
1649+
f"Background task {task_id} execution failed: {e}", exc_info=True
16151650
)
1616-
except Exception as broadcast_error:
1617-
logger.error(f"Failed to send error notification: {broadcast_error}")
1651+
# Genuine failure: _terminal_task_error_payload persists FAILED
1652+
# + the real error_message and builds the notification payload.
1653+
try:
1654+
message = str(e)
1655+
await manager.broadcast_to_task(
1656+
{
1657+
**_terminal_task_error_payload(
1658+
task_id,
1659+
message,
1660+
event_type="task_error",
1661+
),
1662+
"task_id": task_id,
1663+
"error": message,
1664+
"timestamp": datetime.now(timezone.utc).timestamp(),
1665+
},
1666+
task_id,
1667+
)
1668+
except Exception as broadcast_error:
1669+
logger.error(f"Failed to send error notification: {broadcast_error}")
16181670
except asyncio.CancelledError:
16191671
logger.info(f"Background task {task_id} cancelled")
16201672
raise

src/xagent/web/schemas/v1.py

Lines changed: 13 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -177,18 +177,21 @@ class V1TemplateDetail(V1TemplateSummary):
177177
class CreateTaskResponse(BaseModel):
178178
"""``POST /v1/chat/tasks`` -> 202 Accepted response.
179179
180-
The task has been persisted and queued for background execution;
181-
callers poll ``GET /v1/chat/tasks/{task_id}`` to observe the
182-
transition pending -> running -> completed/failed.
180+
The task has been persisted, claimed as RUNNING in the same
181+
transaction, and queued for background execution; callers poll
182+
``GET /v1/chat/tasks/{task_id}`` to observe the transition
183+
running -> completed/failed.
183184
"""
184185

185186
task_id: int = Field(..., description="Newly created task primary key.")
186187
agent_id: int = Field(..., description="Agent the task is bound to.")
187188
status: str = Field(
188189
...,
189190
description=(
190-
"Initial status, always 'pending' in the 202 response. "
191-
"Use GET /v1/chat/tasks/{task_id} to observe later transitions."
191+
"Initial status, 'running' in the 202 response (the atomic "
192+
"claim inside POST commits the status flip before the "
193+
"response is sent). Use GET /v1/chat/tasks/{task_id} to "
194+
"observe later transitions."
192195
),
193196
)
194197
created_at: datetime = Field(..., description="UTC creation timestamp.")
@@ -230,7 +233,11 @@ class AppendMessageResponse(BaseModel):
230233
agent_id: int = Field(..., description="Agent the task is bound to.")
231234
status: str = Field(
232235
...,
233-
description="Initial status of the new turn, always 'pending'.",
236+
description=(
237+
"Initial status of the new turn, 'running' in the 202 "
238+
"response (the atomic claim inside POST commits the status "
239+
"flip before the response is sent)."
240+
),
234241
)
235242
accepted_at: datetime = Field(
236243
...,

tests/web/api/v1/test_tasks.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -113,7 +113,9 @@ def test_create_task_happy_path(mock_start_task):
113113
assert resp.status_code == 202, resp.text
114114
body = resp.json()
115115
assert body["agent_id"] == agent_id
116-
assert body["status"] == "pending"
116+
# POST atomically claims RUNNING before returning 202, so the
117+
# response body reports the post-claim state, not 'pending'.
118+
assert body["status"] == "running"
117119
assert "task_id" in body
118120
assert "created_at" in body
119121
task_id = body["task_id"]

0 commit comments

Comments
 (0)