outbox event publishing failure, queue management in erroneous cases#21
outbox event publishing failure, queue management in erroneous cases#21sbilge wants to merge 2 commits into
Conversation
Coverage Report for CI Build 27622313161Coverage increased (+0.3%) to 92.478%Details
Uncovered Changes
Coverage RegressionsNo coverage regressions found. Coverage Stats
💛 - Coveralls |
| input_config_path: input_dummy.yaml | ||
| original_aem_pack_topic: original-aempacks | ||
| derived_aem_pack_topic: derived-aempacks | ||
| aem_pack_processing_event_topic: aempack-processing-events |
There was a problem hiding this comment.
The event in the names is redundant, so a more compact
| aem_pack_processing_event_topic: aempack-processing-events | |
| aem_pack_processing_topic: aempack-processing |
is better or, slightly less generic, aem_pack_processing_status_topic
|
|
||
| version: int = Field( | ||
| default=..., | ||
| description="Version assigned by the publishing service (RS), incremented on each republish.", |
There was a problem hiding this comment.
This should not point to the concrete service and republishing actually doesn't increase the version, only a genuinely new version of the AEMPack does.
| description="Version assigned by the publishing service (RS), incremented on each republish.", | |
| description="Current version of the AEMPack. Used to resolve republishing conflicts.", |
or something like that.
| async def get_failed_event_dao( | ||
| *, dao_publisher_factory: DaoPublisherFactoryProtocol, topic: str | ||
| ) -> FailedEventDao: | ||
| """Construct an outbox DAO for AEMPack processing-failure events.""" | ||
| return await dao_publisher_factory.get_dao( | ||
| name="aem_pack_failed_events", | ||
| id_field="id", | ||
| dto_model=AEMPackFailedEvent, | ||
| dto_to_event=lambda event: event.model_dump(mode="json"), | ||
| event_topic=topic, | ||
| autopublish=True, | ||
| ) |
There was a problem hiding this comment.
This should not be coupled to only the failure event, there should be events fired when an AEMPack has been successfully put into the queue and when one has successfully been processed.
We don't need events on picking up an AEMPack from the queue, only when it reaches a final state.
The collection name and model should be more more generic, something in the direction of status_events.
| aem_pack_processing_event_topic: str = Field( | ||
| default=..., | ||
| description=( | ||
| "Topic for AEMPack processing-lifecycle (status) events, e.g. processing" | ||
| " failures, and later successes." | ||
| ), | ||
| examples=["aempack-processing-events"], | ||
| ) |
There was a problem hiding this comment.
As mentioned on the yaml file:
Drop the redundant event in the name and change it to aem_pack_processing_status_topic
| class AEMPackProcessingEvent(BaseModel): | ||
| """Base for AEMPack processing-lifecycle events published on the status channel. | ||
|
|
||
| Siblings (e.g. a future succeeded event) share these fields so they can sit on | ||
| the same topic and be correlated back to the originating incoming AEMPack. | ||
| """ | ||
|
|
||
| id: UUID4 = Field( | ||
| default_factory=uuid4, description="Unique identifier of the event." | ||
| ) | ||
| pid: str = Field( | ||
| default=..., | ||
| description="Shared identifier of the incoming AEMPack and its derived packs.", | ||
| ) | ||
| model_name: str = Field( | ||
| default=..., | ||
| description="Name of the model the AEMPack being processed conforms to.", | ||
| ) | ||
| version: int = Field( | ||
| default=..., description="Version of the incoming AEMPack this event concerns." | ||
| ) | ||
|
|
||
|
|
||
| class AEMPackFailedEvent(AEMPackProcessingEvent): | ||
| """Published when data derivation fails for an incoming AEMPack.""" | ||
|
|
||
| transformation_step: str | None = Field( | ||
| default=None, | ||
| description="Name of the workflow step that failed, if known.", | ||
| ) | ||
| error_type: str = Field( | ||
| default=..., description="Class name of the error that caused the failure." | ||
| ) | ||
| error_message: str = Field( | ||
| default=..., description="Human-readable message of the underlying error." | ||
| ) |
There was a problem hiding this comment.
Could be merged into one model to be used by the outbox by making all error related fields optional.
For easier disambiguation at the recipient, an enum or literal valued field for the processing step could be included covering "Queued", "Failed" and "Processed" states.
| try: | ||
| return runner.run_workflow( | ||
| data=aem_pack.data, | ||
| annotation=_AnnotationModel.model_validate(aem_pack.annotation), | ||
| ) | ||
| except WorkflowExecutionError as error: | ||
| raise DataDerivationError( | ||
| pid=aem_pack.pid, | ||
| model_name=aem_pack.model_name, | ||
| error=error, | ||
| ) from error |
There was a problem hiding this comment.
Is the step_name accessible on the WorkflowExecutionError?
Promoting that to a field on the DataDerivationError would be a bit more transparent than the current approach.
| runner: WorkflowRunner = WorkflowRunner( | ||
| workflow=workflow.workflow, input_model=input_schema | ||
| ) |
There was a problem hiding this comment.
So this is the step that could potentially fail because it internally rederives the schemas we've already produced during model derivation?
Can't we just wrap that in a try-except and set the step_name to runner_init or something like that?
Can we assume schema equivalence between this and and the previously derived schema or do we need to re-check that?
Also, we could theoretically cache the naive runner, but don't we also store the data transformation results on the runner?
This could cause some problems depending on how it's accessed.
Seems like this needs a solution at two different levels:
- Basic error handling and potentially caching in the EMTS
- Some level of internal redesign on the metldata level, further decoupling data and model related functionality.
Can we get away with just wrapping this in a try-except here for now and deal with the redundancy/potential but probably unlikely inconsistencies later?
| AEMPack being transformed for logging and audit context; the failing | ||
| workflow step is available on the wrapped ``error``. |
There was a problem hiding this comment.
As mentioned elsewhere:
It would make some sense to make this slightly more transparent by also exposing the workflow step directly on this error.
There's some code in the current diff that tries to get it from the wrapped exception via getattr with a None fallback, but this seems like it should always be present, so this requirement should be encoded directly into the error type.
| await self._collection.update_one( | ||
| {"_id": aem_pack_id}, | ||
| { | ||
| "$set": { | ||
| FAILED_AT_FIELD: now_utc_ms_prec(), | ||
| } | ||
| }, | ||
| ) | ||
| await self.mark_processed(aem_pack_id) |
There was a problem hiding this comment.
To guard against possible race conditions and make it one call instead of two, inline mark_processed into the update_one call above.
| PROCESSED_AT_FIELD: { | ||
| "$cond": { | ||
| "if": f"${PROCESSED_AT_FIELD}", | ||
| "then": f"${PROCESSED_AT_FIELD}", | ||
| "else": None, | ||
| } | ||
| }, | ||
| # A newer version clears any prior failure so the pack is | ||
| # reprocessed instead of staying parked as failed. | ||
| FAILED_AT_FIELD: None, |
There was a problem hiding this comment.
There's a bit of an inconsistency here with how marking as failed is treated:
This resets the failed_at field, but preserves the processed_at field, so the corresponding AEMPack is still marked as done, but the failure is masked now and it's not picked up for reprocessing.
The easiest way to solve this would probably be to extend the if clause in the conditional set for the PROCESSED_AT_FIELD to include a check that FAILED_AT_FIELD is currently None.
This would probably need another look in the mongo docs, so we can be sure the check is performed before FAILED_AT_FIELD is set to None.
If that doesn't work out, another way would be to include the FAILED_AT_FIELD in the checks for the claim_next and mark_all_for_reprocessing logic.
This PR adds failure handling and outside visibility for AEMPack data derivation.
Includes:
It does not include: