-
Notifications
You must be signed in to change notification settings - Fork 0
Fix iterate
and greedyIterate
methods
#16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
src/clients/dataset-client.ts
Outdated
while (currentPage.items.length > 0) { | ||
totalItems += currentPage.items.length; | ||
for (const item of currentPage.items) { | ||
yield item; | ||
} | ||
|
||
offset += pageSize; | ||
currentPage = await this.superClient.listItems({ offset, limit: pageSize }); | ||
currentOffset += pageSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to update currentOffset
with how many things has actually been returned to us, instead of how many items we asked for - if this is called while the run is still in progress we may see the following situation: pageSize === 10
, currentPage.items.length === 2
, but before the next iteration 10 more items has been added to the dataset, so we output another 2 items, not knowing we missed 8.
src/clients/dataset-client.ts
Outdated
offset += pageSize; | ||
currentPage = await this.superClient.listItems({ offset, limit: pageSize }); | ||
currentOffset += pageSize; | ||
currentPage = await this.superClient.listItems({ offset: currentOffset, limit: pageSize }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it on purpose we don't forward listItemOptions
here?
src/clients/dataset-client.ts
Outdated
let currentOffset = listItemOptions.offset ?? 0; | ||
let currentPage = await this.superClient.listItems({ | ||
...listItemOptions, | ||
offset: currentOffset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue for creating a const offset = () => totalItems + (listItemOptions.offset ?? 0)
such we have an offset which always reflects how many elements we have already seen.
while (runStatus && ['READY', 'RUNNING'].includes(runStatus)) { | ||
const datasetIterator = this.iterate({ ...iterateOptions, offset: currentOffset }); | ||
for await (const item of datasetIterator) { | ||
currentOffset++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this, and postponed adding my review till I had let it settle a bit. I ended up having to do pretty much the same when using the library to do "reasonable segmentation". I think this shows that our abstraction is leaky - whether there will actually be an await before we get to the next element. I think it would be better if we just returned AsyncGenerator<T[], void, void>
, only yielding if there actually were any elements. I think iterate
should have the same signature, even though it may often be used without a page size such it only actually yields once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would simplify both this code, and most (if not all) usages
I would argue that it doesn't really matter to us whether the run status is up to date - as long as it is not ahead of time (and as long as we can assume it at least has gone to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good - I'd still argue for changing the API to return AsyncGenerator<T[]>
in a future release, but this implementation seems sound.
@@ -26,33 +26,35 @@ export class ExtDatasetClient<T extends DatasetItem> extends DatasetClient<T> im | |||
const { pageSize, ...listItemOptions } = options; | |||
this.customLogger.info('Iterating Dataset', { pageSize }, { url: this.url }); | |||
|
|||
let totalItems = 0; | |||
const initialOffset = listItemOptions.offset ?? 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I feel weird about this being in the outer scope as it really doesn't matter when we don't have a pageSize
- that would entail duplicating the log, but TBH I'm not against that as it's actually two implementations masquerading as one function..
BREAKING CHANGES:
itemsThreshold
was removed.FIXES:
offset
is actually respected;greedyIterate
does not check anymore for the items count, but tries to download the remaining items at each iteration, removing the racing condition, due to the fact that the item count may be outdated.TODO:
greedyIterate
tests