Refactor of checkDatasetFiles #2180

@minottic

Summary

checkDatasetFiles is used when submitting a Job to check if the required files exist in the origdatablocks. Whenever one does not exist, it throws an error with the missing file(s).

Steps to Reproduce

Current Behaviour

The current logic loops over the datasetList from the job payload, finds the corresponding origdatablocks for each dataset, extracts their file paths from dataFileList[].path, and intersects them with the files requested in the job payload's datasetList[].files. It throws an error whenever any requested file is not found in the origdatablocks of its dataset.

Expected Behaviour

I believe this can be simplified with this (or similar):

// Assumes ids = datasetList.map((ds) => ds.pid)
const origs = await this.origDatablocksService.findAll({
  where: {
    $and: [
      { datasetId: { $in: ids } },
      { "dataFileList.path": { $in: datasetList.flatMap((ds) => ds.files) } },
    ],
  },
});

// datasetId -> Set of file paths found across all of its origdatablocks
const origsDict = {};
for (const orig of origs) {
  origsDict[orig.datasetId] ??= new Set();
  for (const f of orig.dataFileList) origsDict[orig.datasetId].add(f.path);
}

const nonExisting = {};
for (const ds of datasetList) {
  const found = origsDict[ds.pid] ?? new Set();
  const missing = ds.files.filter((f) => !found.has(f));
  if (missing.length > 0) nonExisting[ds.pid] = missing;
}
// needs a loop here to format the error message
if (Object.keys(nonExisting).length > 0) throw nonExisting;

Only one Mongo query is issued, which cuts the per-query round-trip overhead. The dataset query is removed (I am not sure I understood the need for it). Dicts and sets are still used, which keeps the O(n*m) complexity (very similar to what's implemented already, nice!).

The main improvement is reducing the mongo queries.
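
To make the single-query idea easier to review, here is a minimal, database-free sketch of the comparison step only. The interfaces, the function name findMissingFiles, and the sample file names are hypothetical and reduced to the fields the check needs; the origs array stands in for whatever the one query returns.

```typescript
// Hypothetical shapes, reduced to the fields the check needs.
interface DataFile { path: string }
interface OrigDatablock { datasetId: string; dataFileList: DataFile[] }
interface JobDataset { pid: string; files: string[] }

// Returns pid -> requested files that appear in none of that dataset's origdatablocks.
function findMissingFiles(
  datasetList: JobDataset[],
  origs: OrigDatablock[],
): Record<string, string[]> {
  // Merge the paths of all origdatablocks belonging to the same dataset.
  const pathsByDataset = new Map<string, Set<string>>();
  for (const orig of origs) {
    let paths = pathsByDataset.get(orig.datasetId);
    if (!paths) {
      paths = new Set<string>();
      pathsByDataset.set(orig.datasetId, paths);
    }
    for (const f of orig.dataFileList) paths.add(f.path);
  }

  const nonExisting: Record<string, string[]> = {};
  for (const ds of datasetList) {
    const found = pathsByDataset.get(ds.pid);
    // A dataset with no matching origdatablock has every requested file missing.
    const missingFiles = ds.files.filter((f) => !found || !found.has(f));
    if (missingFiles.length > 0) nonExisting[ds.pid] = missingFiles;
  }
  return nonExisting;
}

const result = findMissingFiles(
  [
    { pid: "ds1", files: ["a.h5", "b.h5"] },
    { pid: "ds2", files: [] },
    { pid: "ds3", files: ["c.h5"] },
  ],
  [
    { datasetId: "ds1", dataFileList: [{ path: "a.h5" }] },
    { datasetId: "ds1", dataFileList: [{ path: "x.h5" }] },
  ],
);
// ds1 is missing b.h5, ds3 is missing c.h5, ds2 requested nothing.
console.log(result);
```

Two cases are worth covering in tests of the real method: a dataset whose files are spread over several origdatablocks, and a dataset for which the query returns no origdatablock at all.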

This could be improved further: if await this.origDatablocksService.findAll returns too much data, one could use a cursor and pop the matched paths from datasetList[].files as the origdatablocks stream in.

Something like this:

// pid -> Set of requested files not yet found in any origdatablock
const dsDict = Object.fromEntries(datasetList.map((ds) => [ds.pid, new Set(ds.files)]));

const filter = {
  $and: [
    { datasetId: { $in: ids } },
    { "dataFileList.path": { $in: datasetList.flatMap((ds) => ds.files) } },
  ],
};
for await (const orig of OrigDatablocks.find(filter).cursor()) {
  const remaining = dsDict[orig.datasetId];
  for (const f of orig.dataFileList) {
    if (remaining.size === 0) break;
    remaining.delete(f.path);
  }
}

// Whatever is left once the cursor is exhausted was not found anywhere
const nonExisting = {};
for (const [pid, remaining] of Object.entries(dsDict)) {
  if (remaining.size > 0) nonExisting[pid] = [...remaining];
}
// needs a loop here to format the error message
if (Object.keys(nonExisting).length > 0) throw nonExisting;
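
The cursor variant can be tried out without MongoDB by letting an async generator play the role of the cursor. This is only a sketch under that assumption: fakeCursor, findMissingStreaming, the interfaces, and the sample file names are all hypothetical. The missing files are collected only after the stream ends, because one dataset can own several origdatablocks.

```typescript
interface OrigBlock { datasetId: string; dataFileList: { path: string }[] }
interface RequestedDataset { pid: string; files: string[] }

// Stand-in for a database cursor: yields one origdatablock at a time.
async function* fakeCursor(blocks: OrigBlock[]): AsyncGenerator<OrigBlock> {
  for (const b of blocks) yield b;
}

async function findMissingStreaming(
  datasetList: RequestedDataset[],
  cursor: AsyncIterable<OrigBlock>,
): Promise<Record<string, string[]>> {
  // pid -> requested files not yet seen in any origdatablock
  const remainingByPid = new Map<string, Set<string>>();
  for (const ds of datasetList) remainingByPid.set(ds.pid, new Set(ds.files));

  for await (const orig of cursor) {
    const remaining = remainingByPid.get(orig.datasetId);
    if (!remaining) continue;
    for (const f of orig.dataFileList) {
      if (remaining.size === 0) break; // everything for this dataset was found
      remaining.delete(f.path);
    }
  }

  // Only after the stream ends is it known which files were never found.
  const nonExisting: Record<string, string[]> = {};
  remainingByPid.forEach((remaining, pid) => {
    if (remaining.size > 0) nonExisting[pid] = Array.from(remaining);
  });
  return nonExisting;
}

const streamed = findMissingStreaming(
  [
    { pid: "ds1", files: ["a.h5", "b.h5"] },
    { pid: "ds2", files: ["c.h5"] },
  ],
  fakeCursor([
    { datasetId: "ds1", dataFileList: [{ path: "a.h5" }] },
    { datasetId: "ds1", dataFileList: [{ path: "b.h5" }, { path: "z.h5" }] },
    { datasetId: "ds2", dataFileList: [{ path: "y.h5" }] },
  ]),
);
// ds1 is fully covered by its two origdatablocks; ds2 is missing c.h5.
streamed.then((r) => console.log(r));
```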
