
CMR outages should not block all job types from being submitted #2761


Description

@jtherrmann

Jira: https://asfdaac.atlassian.net/browse/TOOL-3698

Note: The above link is accessible only to members of ASF.


As part of HyP3 v10.5.0, we changed the POST /jobs endpoint to return a 503 Service Unavailable response if any of the CMR queries fail during job validation, in order to fix #2742.

This change successfully addressed the requirement for OPERA_RTC_S1 jobs to fail the static coverage validator if the CMR query within the validator function fails. However, it also means that all jobs now fail to submit if the main CMR query in hyp3_api.validation._get_cmr_metadata fails. I thought this was desirable because I mistakenly assumed that most or all of our plugins depend on CMR at runtime and would therefore fail during a CMR outage anyway. But @asjohnston-asf explained in Mattermost during today's CMR outage:

> Most jobs, RTC_GAMMA and INSAR_GAMMA in particular, do not query cmr at runtime. We deliberately eliminated it as a dependency (with this scenario in mind)

So most of the HyP3 jobs that people run would actually succeed during a CMR outage, and we should probably allow those jobs to be submitted.

This could be as simple as raising a custom exception type from the check_opera_rtc_s1_static_coverage validator and converting it into a 503 response within the post_jobs handler. However, I suspect that for OPERA_RTC_S1, and perhaps some other job types, we still want to reject the job outright if we can't get its CMR record. In that case, we would need some job types to return a fatal error when the CMR query fails during validation, while other job types would just log a warning and still be submitted.
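A minimal sketch of that first option, assuming a Flask-style handler; CmrUnavailableError and _query_cmr_for_static_coverage are made-up names for illustration, and the real validator and handler bodies differ:

```python
# Sketch only: CmrUnavailableError is a new (hypothetical) exception type, and
# _query_cmr_for_static_coverage stands in for the validator's existing CMR query.
import requests
from flask import abort


class CmrUnavailableError(Exception):
    """Raised when a CMR query required for job validation fails."""


def _query_cmr_for_static_coverage(job: dict) -> list[dict]:
    raise NotImplementedError  # placeholder for the validator's existing CMR query


def check_opera_rtc_s1_static_coverage(job: dict, granule_metadata: list[dict]) -> None:
    try:
        coverage = _query_cmr_for_static_coverage(job)
    except requests.RequestException as e:
        raise CmrUnavailableError('CMR is unavailable; cannot verify static coverage') from e
    # ... existing static-coverage checks against `coverage` ...


def post_jobs(body: dict, user: str) -> dict:
    try:
        for job in body['jobs']:
            if job['job_type'] == 'OPERA_RTC_S1':
                check_opera_rtc_s1_static_coverage(job, granule_metadata=[])
    except CmrUnavailableError as e:
        abort(503, str(e))  # convert the validation failure into a 503 response
    # ... existing job submission logic ...
    return body
```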

Such a requirement might be difficult to implement while we're making one big CMR query in _get_cmr_metadata for all jobs in the payload, but we also don't want to make a separate CMR query for each job, for performance reasons. Maybe we could group jobs by type and then make one CMR query per job type present in the payload, which would allow us to customize the error handling behavior for each job type?
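A rough sketch of that grouping idea; CMR_REQUIRED_JOB_TYPES and _query_cmr are illustrative assumptions rather than existing HyP3 symbols:

```python
import logging
from collections import defaultdict

import requests

# Hypothetical policy: these job types are rejected outright if CMR is down.
CMR_REQUIRED_JOB_TYPES = {'OPERA_RTC_S1'}


def _query_cmr(granules: list[str]) -> list[dict]:
    raise NotImplementedError  # placeholder for the existing CMR search


def get_cmr_metadata_by_job_type(jobs: list[dict]) -> dict[str, list[dict]]:
    # One CMR query per job type present in the payload, instead of one query
    # per job or one query for the entire payload.
    granules_by_type: dict[str, set[str]] = defaultdict(set)
    for job in jobs:
        granules_by_type[job['job_type']].update(job['job_parameters'].get('granules', []))

    metadata_by_type: dict[str, list[dict]] = {}
    for job_type, granules in granules_by_type.items():
        try:
            metadata_by_type[job_type] = _query_cmr(sorted(granules))
        except requests.RequestException:
            if job_type in CMR_REQUIRED_JOB_TYPES:
                raise  # fatal for this job type: POST /jobs returns a 503
            logging.warning(f'CMR query failed for {job_type} jobs; submitting without CMR metadata')
            metadata_by_type[job_type] = []
    return metadata_by_type
```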

Before addressing this issue, it might be prudent to add API integration tests for #2740.

Edit: Actually, we could probably just add a validator function that explicitly checks that all of the job's granules are present in the granule metadata, and any job types that require that behavior could use that validator, though I think it would require modifying _get_cmr_metadata to return an empty list if the query fails, rather than propagating the failure via response.raise_for_status().
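For example, a validator along those lines might look roughly like this; the GranuleValidationError stub and the 'name' key in the granule metadata are assumptions for illustration:

```python
class GranuleValidationError(Exception):
    # Stand-in for the existing validation error type (name assumed here).
    pass


def check_granules_exist(job: dict, granule_metadata: list[dict]) -> None:
    """Reject the job if any of its granules are missing from the CMR results.

    If _get_cmr_metadata returned an empty list on a failed CMR query, this
    validator would also reject the job during an outage, while job types
    that don't use it could still be submitted.
    """
    found = {granule['name'] for granule in granule_metadata}  # assumed metadata shape
    missing = [g for g in job['job_parameters']['granules'] if g not in found]
    if missing:
        raise GranuleValidationError(f'Granules not found in CMR: {", ".join(missing)}')
```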
