Description
Jira: https://asfdaac.atlassian.net/browse/TOOL-3621
Note: The above link is accessible only to members of ASF.
There are some challenges to supporting fan-out/fan-in jobs with an arbitrary number of inputs (i.e. granules).
For one, throughput is limited by the maximum number of parallel iterations for the step function's Map state. The Map state defaults to Inline mode, which only accepts the input list as a JSON array and, according to the docs, runs at most 40 parallel iterations. However, we can use Distributed mode, which allows the Map state to read its input from a CSV, JSON, or JSONL file in S3 (or from an S3 object list) and runs up to 10,000 parallel iterations. For SRG_TIME_SERIES in particular, this would allow us to take full advantage of the G-instance vCPU quota in the LAVAS account. Also see:
- https://states-language.net/#map-state
- https://docs.aws.amazon.com/step-functions/latest/dg/state-map.html
- https://docs.aws.amazon.com/step-functions/latest/dg/state-map-inline.html
- https://docs.aws.amazon.com/step-functions/latest/dg/state-map-distributed.html
- https://aws.amazon.com/blogs/compute/introducing-jsonl-support-with-step-functions-distributed-map/
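As a rough illustration of the Distributed mode option described above, here is a minimal sketch of what such a Map state definition could look like, built as a Python dict. The `ItemReader`, `ItemProcessor`, `ProcessorConfig`, and `MaxConcurrency` fields come from the Step Functions docs linked above; the state name `SUBMIT_BATCH_JOB`, the input paths (`$.bucket`, `$.key`, `$.job_queue`, `$.job_definition`), the item shape `{"granule": ...}`, and both helper functions are hypothetical and are not taken from the HyP3 code base.

```python
import json

# Documented concurrency ceilings for the two Map state modes.
INLINE_MAX_CONCURRENCY = 40
DISTRIBUTED_MAX_CONCURRENCY = 10_000


def build_distributed_map_state(max_concurrency: int = 1_000) -> dict:
    """Return a sketch of a Distributed Map state that reads its items from S3."""
    if not 1 <= max_concurrency <= DISTRIBUTED_MAX_CONCURRENCY:
        raise ValueError(f'max_concurrency must be between 1 and {DISTRIBUTED_MAX_CONCURRENCY}')
    return {
        'Type': 'Map',
        'MaxConcurrency': max_concurrency,
        # Read the item list from a JSON file in S3 instead of the state input.
        'ItemReader': {
            'Resource': 'arn:aws:states:::s3:getObject',
            'ReaderConfig': {'InputType': 'JSON'},
            # Hypothetical: bucket and key are supplied in the state input.
            'Parameters': {'Bucket.$': '$.bucket', 'Key.$': '$.key'},
        },
        'ItemProcessor': {
            'ProcessorConfig': {'Mode': 'DISTRIBUTED', 'ExecutionType': 'STANDARD'},
            'StartAt': 'SUBMIT_BATCH_JOB',  # hypothetical state name
            'States': {
                'SUBMIT_BATCH_JOB': {
                    'Type': 'Task',
                    'Resource': 'arn:aws:states:::batch:submitJob.sync',
                    # Hypothetical: a real definition would reference actual
                    # job queue and job definition ARNs.
                    'Parameters': {
                        'JobName.$': '$.granule',
                        'JobQueue.$': '$.job_queue',
                        'JobDefinition.$': '$.job_definition',
                    },
                    'End': True,
                },
            },
        },
        'End': True,
    }


def granules_to_items_json(granules: list[str]) -> str:
    """Serialize granules as the JSON array that the ItemReader would load from S3."""
    return json.dumps([{'granule': granule} for granule in granules])
```

The string returned by `granules_to_items_json` would be uploaded to S3 before the execution starts, so the granule list itself never has to pass through the state input.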
There are also Step Functions service quotas that we may bump up against, depending on the number of inputs we want to support. We've already addressed one of these in HyP3 v10.0.1, but there are other opportunities for reducing the size of the JSON data being passed between states. For example, we could stop passing around the job parameters (which include the full granules list) once they’re no longer needed, and we could write the processing_times list (which includes one float value for each Batch job) directly to the database.
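To make the payload-size concern concrete, the sketch below estimates the serialized size of a state payload before and after dropping the job parameters. The 256 KiB figure is the documented Step Functions limit on data passed between states; the field names (`job_id`, `job_parameters`, `granules`) and both helper functions are hypothetical, and `json.dumps` only approximates how the service measures payload size.

```python
import json

# Documented Step Functions limit on task/state input or output size.
MAX_PAYLOAD_BYTES = 256 * 1024


def payload_size_bytes(payload: dict) -> int:
    """Approximate the payload size as the length of its UTF-8 encoded JSON."""
    return len(json.dumps(payload).encode('utf-8'))


def drop_job_parameters(payload: dict) -> dict:
    """Return a copy of the payload without the (hypothetical) job_parameters key."""
    return {key: value for key, value in payload.items() if key != 'job_parameters'}


# Hypothetical payload: 20,000 made-up granule names of 40 characters each.
granules = [f'GRANULE_{index:032d}' for index in range(20_000)]
payload = {'job_id': 'example-job', 'job_parameters': {'granules': granules}}

before = payload_size_bytes(payload)
after = payload_size_bytes(drop_job_parameters(payload))

print(f'with job_parameters:    {before} bytes (over limit: {before > MAX_PAYLOAD_BYTES})')
print(f'without job_parameters: {after} bytes (over limit: {after > MAX_PAYLOAD_BYTES})')
```

The same kind of check could be applied to the processing_times list to judge how much room is gained by writing it directly to the database instead of returning it through the state output.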