Description
I ran a job that reads out of an S3 store and puts the data into Elasticsearch. The bucket has 57 files in it, and each file averages anywhere from 1-5 gigabytes. When the s3_reader goes to make slices, it first lists all the objects under the provided path. The max keys on this request is set to 1000 (the default for the listObjects method), meaning a single request will return metadata for up to 1000 objects.
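For reference, here is a minimal sketch of that listing behavior using the AWS SDK v3 (`@aws-sdk/client-s3`) directly; the bucket and prefix names are illustrative, and this is not the file-asset-apis code itself:

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Illustrative sketch: list every object under a prefix, one page (<= 1000 keys) at a time.
async function listAllKeys(bucket: string, prefix: string): Promise<string[]> {
    const client = new S3Client({});
    const keys: string[] = [];
    let token: string | undefined;

    do {
        const resp = await client.send(new ListObjectsV2Command({
            Bucket: bucket,
            Prefix: prefix,
            // MaxKeys defaults to 1000; each page returns at most that many object summaries
            ContinuationToken: token,
        }));
        for (const obj of resp.Contents ?? []) {
            if (obj.Key) keys.push(obj.Key);
        }
        token = resp.NextContinuationToken;
    } while (token);

    return keys;
}
```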
My job file:
```json
{
    "name": "grab-taxi-data",
    "lifecycle": "persistent",
    "workers": 1,
    "log_level": "trace",
    "memory_execution_controller": 1073741824,
    "assets": [
        "elasticsearch",
        "file"
    ],
    "operations": [
        {
            "_op": "s3_reader",
            "path": "datasets-documentation/nyc-taxi",
            "size": 50000,
            "format": "tsv"
        },
        {
            "_op": "elasticsearch_bulk",
            "size": 50000,
            "index": "nyc-taxi-data"
        }
    ]
}
```

The s3-slicer:
file-assets/packages/file-asset-apis/src/s3/s3-slicer.ts, lines 26 to 35 in f6f215c:

```ts
private async getObjects(): Promise<FileSlice[]> {
    const data = await s3RequestWithRetry({
        client: this.client,
        func: listS3Objects,
        params: {
            Bucket: this.bucket,
            Prefix: this.prefix,
            ContinuationToken: this._nextToken
        }
    });
```
It will push a promise for each key in the list onto an actions array; each promise segments its file based on size and creates slice records for it. In my case it pushes 57 promises onto the array and ends up creating 1,949,527 slice objects. It then runs the createSlice() function on each of those, which adds metadata to each record and pushes the record into the slicer queue. I added 1GB of memory to the execution controller pod, and it OOM'ed at approximately 336,891 slice records in the queue.
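Back-of-envelope, the slice count is consistent with the job config: 1,949,527 slices across 57 files is roughly 34,000 slices per file, and if the configured `size` of 50,000 is the per-slice byte size, that works out to roughly 1.7 GB per file, which matches the 1-5 GB file sizes. Below is a simplified sketch of the eager segmentation pattern described above; the names and types are illustrative, not the actual file-asset-apis implementation:

```ts
// Illustrative sketch only: split every listed object into fixed-size byte-range slices
// up front, which makes memory usage grow with total data size rather than file count.
interface FileSlice {
    path: string;
    offset: number;
    length: number;
}

function segmentFile(key: string, fileSize: number, sliceSize: number): FileSlice[] {
    const slices: FileSlice[] = [];
    for (let offset = 0; offset < fileSize; offset += sliceSize) {
        slices.push({
            path: key,
            offset,
            length: Math.min(sliceSize, fileSize - offset),
        });
    }
    return slices;
}

// A ~1.7 GB file at 50,000-byte slices => ~34,000 slice records held in memory for that one file,
// and every file's slices are created before any of them are processed.
```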
Potential solutions:
- We could add a configuration setting to the `s3_reader` that lets us manually set how many files we create slices for at a time. This would set `maxKeys` to limit how many objects a single request returns and paginate through the remaining pages (see the sketch after this list). The issue is that it doesn't really resolve the problem, the user would have to be aware of this "workaround", and a single massive file (say 100GB) would still OOM.
- Add logic around the `s3_slicer` that would only allow it to submit a maximum number of slices at a time. There are potentially a handful of issues with this.
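
As a rough illustration of the first option, here is a hedged sketch of what paginated, bounded slicing could look like; the `maxKeysPerBatch` setting, the helper signatures, and the control flow are assumptions for discussion, not the existing API:

```ts
// Hypothetical sketch: only list and segment a bounded batch of objects at a time,
// holding on to the S3 continuation token so the next batch is fetched lazily.
interface FileSlice { path: string; offset: number; length: number }

interface ListPage {
    keys: { key: string; size: number }[];
    nextToken?: string;
}

async function getNextBatchOfSlices(
    listPage: (token?: string) => Promise<ListPage>,
    segment: (key: string, size: number) => FileSlice[],
    state: { token?: string; done: boolean },
    maxKeysPerBatch = 100, // assumed new s3_reader setting
): Promise<FileSlice[]> {
    if (state.done) return [];

    const page = await listPage(state.token);
    const batch = page.keys.slice(0, maxKeysPerBatch);
    const slices = batch.flatMap(({ key, size }) => segment(key, size));

    state.token = page.nextToken;
    state.done = page.nextToken == null;
    return slices;
}
```

Even with this kind of batching, a single very large file still produces all of its slices in one go, which is the remaining OOM risk called out in the first bullet.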