Description
Summary
checkDatasetFiles is used when submitting a Job to check if the required files exist in the origdatablocks. Whenever one does not exist, it throws an error with the missing file(s).
Steps to Reproduce
Current Behaviour
The current logic loops over the datasetList from the job payload, finds the corresponding origs, extracts the file paths from dataFileList[].path, and compares them with the job payload's datasetList[].files. It throws an error whenever any of the requested files is not found in the origs of its dataset.
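As a rough illustration of the behaviour described above, here is a minimal sketch, not the actual implementation. The interfaces are reduced to the fields mentioned in this issue, and `lookupOrigs` is a hypothetical stand-in for the per-dataset database lookup.

```typescript
// Hypothetical shapes, reduced to the fields the check reads.
interface DataFile { path: string }
interface OrigDatablock { datasetId: string; dataFileList: DataFile[] }
interface JobDataset { pid: string; files: string[] }

// `lookupOrigs` is an assumption: it stands in for one lookup per dataset.
function checkDatasetFilesSketch(
  datasetList: JobDataset[],
  lookupOrigs: (datasetId: string) => OrigDatablock[],
): Record<string, string[]> {
  const missing: Record<string, string[]> = {};
  for (const ds of datasetList) {
    // Nothing requested for this dataset, so nothing to verify.
    if (ds.files.length === 0) continue;
    // Collect every path known to the origs of this dataset.
    const known = new Set<string>();
    for (const orig of lookupOrigs(ds.pid)) {
      for (const f of orig.dataFileList) known.add(f.path);
    }
    // Requested files that are absent from all origs of the dataset.
    const notFound = ds.files.filter((f) => !known.has(f));
    if (notFound.length > 0) missing[ds.pid] = notFound;
  }
  return missing;
}
```

In this sketch the number of lookups grows with the number of datasets that request files, which is the overhead the proposal in this issue tries to remove.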
Expected Behaviour
I believe this can be simplified with this (or similar):
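// All file paths requested across the datasets of the job payload.
const requestedFiles = datasetList.flatMap((ds) => ds.files);
const origs = await this.origDatablocksService.findAll({
  // Both conditions in one object are combined with an implicit AND.
  where: { datasetId: { $in: ids }, "dataFileList.path": { $in: requestedFiles } },
});
// Map each datasetId to the set of paths found in its origs.
// A Set is needed here: a WeakSet cannot hold strings.
const origsDict = origs.reduce((acc, orig) => {
  const paths = (acc[orig.datasetId] ??= new Set());
  orig.dataFileList.forEach((f) => paths.add(f.path));
  return acc;
}, {});
const nonExisting = {};
for (const ds of datasetList) {
  const missing = ds.files.filter((f) => !origsDict[ds.pid]?.has(f));
  if (missing.length > 0) nonExisting[ds.pid] = missing;
}
if (Object.keys(nonExisting).length > 0) throw nonExisting; // needs a loop for formatting here

This needs only one Mongo query, which reduces the round-trip overhead. The dataset query is removed (I'm not sure I understood the need for it). It keeps the use of dicts and sets for O(n*m) complexity (very similar to what's implemented already, nice!).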
The main improvement is reducing the mongo queries.
This could be improved further if await this.origDatablocksService.findAll returns too much data: one could use a cursor and remove the found files from datasetList[].files as the results stream in, something like this:
// Map each dataset pid to the set of files still to be found.
const dsDict = Object.fromEntries(datasetList.map((ds) => [ds.pid, new Set(ds.files)]));
const query = {
  datasetId: { $in: ids },
  "dataFileList.path": { $in: datasetList.flatMap((ds) => ds.files) },
};
for await (const orig of OrigDatablocks.find(query).cursor()) {
  const pending = dsDict[orig.datasetId];
  for (const f of orig.dataFileList) {
    if (pending.size === 0) break; // everything for this dataset was found
    pending.delete(f.path);
  }
}
// Whatever is left after the cursor is exhausted was not found in any orig.
const nonExisting = Object.fromEntries(
  Object.entries(dsDict).filter(([, files]) => files.size > 0),
);
if (Object.keys(nonExisting).length > 0) throw nonExisting; // needs a loop for formatting here
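The "needs loop for formatting here" note could be handled by something like the following sketch. The function name `formatMissingFiles` and the message wording are assumptions, not part of the existing code. It accepts arrays or Sets of file paths per dataset, so it would fit either variant above.

```typescript
// Hypothetical helper: turns the per-dataset missing files into one message.
function formatMissingFiles(nonExisting: Record<string, Iterable<string>>): string {
  const lines: string[] = [];
  for (const pid of Object.keys(nonExisting)) {
    // Array.from handles both string[] and Set<string>.
    lines.push(`${pid}: ${Array.from(nonExisting[pid]).join(", ")}`);
  }
  return `The following files were not found in the origdatablocks:\n${lines.join("\n")}`;
}
```

The result could then be passed to whichever exception type the service already throws, for example `throw new Error(formatMissingFiles(nonExisting))`.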