
CMR outages should not block all job types from being submitted #2761


Description

@jtherrmann

Jira: https://asfdaac.atlassian.net/browse/TOOL-3698

Note: The above link is accessible only to members of ASF.


As part of HyP3 v10.5.0, we changed the POST /jobs endpoint to return a 503 Service Unavailable response if any of the CMR queries fail during job validation, in order to fix #2742.

This change successfully addressed the requirement for OPERA_RTC_S1 jobs to fail the static coverage validator if the CMR query within the validator function fails. However, it also means that all jobs now fail to submit if the main CMR query in hyp3_api.validation._get_cmr_metadata fails. I thought this was desirable because I mistakenly assumed that most or all of our plugins depend on CMR at runtime and would therefore fail during a CMR outage anyway. But @asjohnston-asf explained in Mattermost during today's CMR outage:

> Most jobs, RTC_GAMMA and INSAR_GAMMA in particular, do not query cmr at runtime. We deliberately eliminated it as a dependency (with this scenario in mind)

So most of the HyP3 jobs that people run would actually succeed during a CMR outage, and we should probably allow those jobs to be submitted.

This could be as simple as raising a custom exception type from the check_opera_rtc_s1_static_coverage validator and converting it into a 503 response within the post_jobs handler. However, I suspect that for OPERA_RTC_S1, and perhaps some other job types, we still want to reject the job outright if we can't get its CMR record. In that case, we would need some job types to return a fatal error when the CMR query fails during validation, while other job types would just log a warning and still be submitted.
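A minimal sketch of that first option, assuming a Flask-style handler; CmrUnavailableError and _query_cmr_for_static_coverage are made-up names for illustration, and the real validator and handler bodies differ:

```python
# Sketch only: CmrUnavailableError is a new (hypothetical) exception type, and
# _query_cmr_for_static_coverage stands in for the validator's existing CMR query.
import requests
from flask import abort


class CmrUnavailableError(Exception):
    """Raised when a CMR query required for job validation fails."""


def _query_cmr_for_static_coverage(job: dict) -> list[dict]:
    raise NotImplementedError  # placeholder for the validator's existing CMR query


def check_opera_rtc_s1_static_coverage(job: dict, granule_metadata: list[dict]) -> None:
    try:
        coverage = _query_cmr_for_static_coverage(job)
    except requests.RequestException as e:
        raise CmrUnavailableError('CMR is unavailable; cannot verify static coverage') from e
    # ... existing static-coverage checks against `coverage` ...


def post_jobs(body: dict, user: str) -> dict:
    try:
        for job in body['jobs']:
            if job['job_type'] == 'OPERA_RTC_S1':
                check_opera_rtc_s1_static_coverage(job, granule_metadata=[])
    except CmrUnavailableError as e:
        abort(503, str(e))  # convert the validation failure into a 503 response
    # ... existing job submission logic ...
    return body
```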

Such a requirement might be difficult to implement while we're making one big CMR query in _get_cmr_metadata for all jobs in the payload, but we also don't want to make a separate CMR query for each job, for performance reasons. Maybe we could group jobs by type and then make one CMR query per job type present in the payload, which would allow us to customize the error handling behavior for each job type?
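A rough sketch of that grouping idea; CMR_REQUIRED_JOB_TYPES and _query_cmr are illustrative assumptions rather than existing HyP3 symbols:

```python
import logging
from collections import defaultdict

import requests

# Hypothetical policy: these job types are rejected outright if CMR is down.
CMR_REQUIRED_JOB_TYPES = {'OPERA_RTC_S1'}


def _query_cmr(granules: list[str]) -> list[dict]:
    raise NotImplementedError  # placeholder for the existing CMR search


def get_cmr_metadata_by_job_type(jobs: list[dict]) -> dict[str, list[dict]]:
    # One CMR query per job type present in the payload, instead of one query
    # per job or one query for the entire payload.
    granules_by_type: dict[str, set[str]] = defaultdict(set)
    for job in jobs:
        granules_by_type[job['job_type']].update(job['job_parameters'].get('granules', []))

    metadata_by_type: dict[str, list[dict]] = {}
    for job_type, granules in granules_by_type.items():
        try:
            metadata_by_type[job_type] = _query_cmr(sorted(granules))
        except requests.RequestException:
            if job_type in CMR_REQUIRED_JOB_TYPES:
                raise  # fatal for this job type: POST /jobs returns a 503
            logging.warning(f'CMR query failed for {job_type} jobs; submitting without CMR metadata')
            metadata_by_type[job_type] = []
    return metadata_by_type
```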

Before addressing this issue, it might be prudent to add API integration tests for #2740.

Edit: Actually, we could probably just add a validator function that explicitly checks that all of the job's granules are present in the granule metadata, and any job types that require that behavior could use that validator, though I think it would require modifying _get_cmr_metadata to return an empty list if the query fails, rather than propagating the failure via response.raise_for_status().
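For example, a validator along those lines might look roughly like this; the GranuleValidationError stub and the 'name' key in the granule metadata are assumptions for illustration:

```python
class GranuleValidationError(Exception):
    # Stand-in for the existing validation error type (name assumed here).
    pass


def check_granules_exist(job: dict, granule_metadata: list[dict]) -> None:
    """Reject the job if any of its granules are missing from the CMR results.

    If _get_cmr_metadata returned an empty list on a failed CMR query, this
    validator would also reject the job during an outage, while job types
    that don't use it could still be submitted.
    """
    found = {granule['name'] for granule in granule_metadata}  # assumed metadata shape
    missing = [g for g in job['job_parameters']['granules'] if g not in found]
    if missing:
        raise GranuleValidationError(f'Granules not found in CMR: {", ".join(missing)}')
```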
