Description
I ran a job that reads out of an S3 store and puts the data into Elasticsearch. The bucket has 57 files in it, and each file averages anywhere from 1-5 gigabytes. When the s3_reader goes to make slices, it first lists all the objects under the provided path. The max keys on this request is set to 1000 (the default for the listObjects method), meaning a single request will return metadata for up to 1000 objects.
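For reference, here is a minimal sketch of that listing behavior using the AWS SDK v3 (`@aws-sdk/client-s3`) directly; the bucket and prefix names are illustrative, and this is not the file-asset-apis code itself:

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Illustrative sketch: list every object under a prefix, one page (<= 1000 keys) at a time.
async function listAllKeys(bucket: string, prefix: string): Promise<string[]> {
    const client = new S3Client({});
    const keys: string[] = [];
    let token: string | undefined;

    do {
        const resp = await client.send(new ListObjectsV2Command({
            Bucket: bucket,
            Prefix: prefix,
            // MaxKeys defaults to 1000; each page returns at most that many object summaries
            ContinuationToken: token,
        }));
        for (const obj of resp.Contents ?? []) {
            if (obj.Key) keys.push(obj.Key);
        }
        token = resp.NextContinuationToken;
    } while (token);

    return keys;
}
```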
My job file:
```json
{
    "name": "grab-taxi-data",
    "lifecycle": "persistent",
    "workers": 1,
    "log_level": "trace",
    "memory_execution_controller": 1073741824,
    "assets": [
        "elasticsearch",
        "file"
    ],
    "operations": [
        {
            "_op": "s3_reader",
            "path": "datasets-documentation/nyc-taxi",
            "size": 50000,
            "format": "tsv"
        },
        {
            "_op": "elasticsearch_bulk",
            "size": 50000,
            "index": "nyc-taxi-data"
        }
    ]
}
```

The s3-slicer:
file-assets/packages/file-asset-apis/src/s3/s3-slicer.ts, lines 26 to 35 in f6f215c:

```ts
private async getObjects(): Promise<FileSlice[]> {
    const data = await s3RequestWithRetry({
        client: this.client,
        func: listS3Objects,
        params: {
            Bucket: this.bucket,
            Prefix: this.prefix,
            ContinuationToken: this._nextToken
        }
    });
```
It will push a promise for each key in the list onto an actions array; each promise segments its file based on size and creates slice records for it. In my case it pushes 57 promises onto the array and ends up creating 1,949,527 slice objects. It then runs the createSlice() function on each of those, which adds metadata to each record and pushes the record into the slicer queue. I added 1GB of memory to the execution controller pod, and it OOM'ed at approximately 336,891 slice records in the queue.
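Back-of-envelope, the slice count is consistent with the job config: 1,949,527 slices across 57 files is roughly 34,000 slices per file, and if the configured `size` of 50,000 is the per-slice byte size, that works out to roughly 1.7 GB per file, which matches the 1-5 GB file sizes. Below is a simplified sketch of the eager segmentation pattern described above; the names and types are illustrative, not the actual file-asset-apis implementation:

```ts
// Illustrative sketch only: split every listed object into fixed-size byte-range slices
// up front, which makes memory usage grow with total data size rather than file count.
interface FileSlice {
    path: string;
    offset: number;
    length: number;
}

function segmentFile(key: string, fileSize: number, sliceSize: number): FileSlice[] {
    const slices: FileSlice[] = [];
    for (let offset = 0; offset < fileSize; offset += sliceSize) {
        slices.push({
            path: key,
            offset,
            length: Math.min(sliceSize, fileSize - offset),
        });
    }
    return slices;
}

// A ~1.7 GB file at 50,000-byte slices => ~34,000 slice records held in memory for that one file,
// and every file's slices are created before any of them are processed.
```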
Potential solutions:
- We could add a configuration setting to the `s3_reader` that lets us manually set how many files we create slices for at a time. This would set `maxKeys` to limit how many objects a single request returns and paginate through the remaining pages (see the sketch after this list). The issue is that it doesn't really resolve the problem, the user would have to be aware of this "workaround", and a single massive file (say 100GB) would still OOM.
- Add logic around the `s3_slicer` that would only allow it to submit a maximum number of slices at a time. There are potentially a handful of issues with this.
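
As a rough illustration of the first option, here is a hedged sketch of what paginated, bounded slicing could look like; the `maxKeysPerBatch` setting, the helper signatures, and the control flow are assumptions for discussion, not the existing API:

```ts
// Hypothetical sketch: only list and segment a bounded batch of objects at a time,
// holding on to the S3 continuation token so the next batch is fetched lazily.
interface FileSlice { path: string; offset: number; length: number }

interface ListPage {
    keys: { key: string; size: number }[];
    nextToken?: string;
}

async function getNextBatchOfSlices(
    listPage: (token?: string) => Promise<ListPage>,
    segment: (key: string, size: number) => FileSlice[],
    state: { token?: string; done: boolean },
    maxKeysPerBatch = 100, // assumed new s3_reader setting
): Promise<FileSlice[]> {
    if (state.done) return [];

    const page = await listPage(state.token);
    const batch = page.keys.slice(0, maxKeysPerBatch);
    const slices = batch.flatMap(({ key, size }) => segment(key, size));

    state.token = page.nextToken;
    state.done = page.nextToken == null;
    return slices;
}
```

Even with this kind of batching, a single very large file still produces all of its slices in one go, which is the remaining OOM risk called out in the first bullet.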