Fix iterate and greedyIterate methods #16

Open
wants to merge 2 commits into
base: dev/0.6.0
from

Conversation

gullmar
Collaborator

@gullmar gullmar commented Mar 3, 2025

BREAKING CHANGES:

  • the option itemsThreshold was removed.

FIXES:

  • the offset option is now actually respected;
  • greedyIterate no longer checks the item count; instead, it tries to download the remaining items at each iteration, which removes the race condition caused by a possibly outdated item count.

TODO:

  • is the run status always up to date?
  • changelog
  • greedyIterate tests

@gullmar gullmar requested a review from halvko March 3, 2025 12:34
@gullmar gullmar added draft and removed not-ready labels Mar 4, 2025
while (currentPage.items.length > 0) {
totalItems += currentPage.items.length;
for (const item of currentPage.items) {
yield item;
}

offset += pageSize;
currentPage = await this.superClient.listItems({ offset, limit: pageSize });
currentOffset += pageSize;
Contributor

I think it makes more sense to advance currentOffset by the number of items that have actually been returned to us, rather than by the number of items we asked for. If this is called while the run is still in progress, we may see the following situation: pageSize === 10 and currentPage.items.length === 2, but before the next iteration 10 more items have been added to the dataset, so we output another 2 items without knowing we missed 8.
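
To illustrate the scenario in this comment, here is a minimal sketch of paging that advances the offset by the number of items received. The `Page`, `ListOptions`, and `ListItems` shapes and the function name `iterateByReceived` are hypothetical stand-ins, not the library's actual API.

```typescript
// Hypothetical minimal shapes; the real superClient.listItems may differ.
interface Page<T> {
  items: T[];
}
interface ListOptions {
  offset: number;
  limit: number;
}
type ListItems<T> = (options: ListOptions) => Promise<Page<T>>;

async function* iterateByReceived<T>(
  listItems: ListItems<T>,
  startOffset: number,
  pageSize: number,
): AsyncGenerator<T, void, void> {
  let currentOffset = startOffset;
  let currentPage = await listItems({ offset: currentOffset, limit: pageSize });

  while (currentPage.items.length > 0) {
    for (const item of currentPage.items) {
      yield item;
    }
    // Advance by what was actually received, not by what was requested,
    // so a short page followed by newly added items skips nothing.
    currentOffset += currentPage.items.length;
    currentPage = await listItems({ offset: currentOffset, limit: pageSize });
  }
}
```

With pageSize === 10 and a first page of 2 items, the next request starts at offset 2 instead of offset 10, so items added in the meantime are still picked up.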

offset += pageSize;
currentPage = await this.superClient.listItems({ offset, limit: pageSize });
currentOffset += pageSize;
currentPage = await this.superClient.listItems({ offset: currentOffset, limit: pageSize });
Contributor

Is it on purpose that we don't forward listItemOptions here?

let currentOffset = listItemOptions.offset ?? 0;
let currentPage = await this.superClient.listItems({
...listItemOptions,
offset: currentOffset,
Contributor

I would argue for creating a const offset = () => totalItems + (listItemOptions.offset ?? 0), so that we have an offset which always reflects how many elements we have already seen.
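
A rough sketch of that suggestion, assuming hypothetical `ListItemOptions`, `ItemPage`, and `ListItemsFn` shapes (including an illustrative `fields` option) rather than the library's real types. It also forwards the caller's remaining options on every request:

```typescript
// Hypothetical option and client shapes, for illustration only.
interface ListItemOptions {
  offset?: number;
  limit?: number;
  fields?: string[]; // illustrative extra option that should be forwarded
}
interface ItemPage<T> {
  items: T[];
}
type ListItemsFn<T> = (options: ListItemOptions) => Promise<ItemPage<T>>;

async function* iterateWithDerivedOffset<T>(
  listItems: ListItemsFn<T>,
  listItemOptions: ListItemOptions = {},
): AsyncGenerator<T, void, void> {
  let totalItems = 0;
  // Derived from the number of items seen so far, so it cannot drift
  // away from the real position in the dataset.
  const offset = () => totalItems + (listItemOptions.offset ?? 0);

  while (true) {
    // Spread first so the caller's other options reach every request,
    // then override only the offset.
    const page = await listItems({ ...listItemOptions, offset: offset() });
    if (page.items.length === 0) {
      return;
    }
    for (const item of page.items) {
      yield item;
    }
    totalItems += page.items.length;
  }
}
```

Because the offset is computed on demand, there is no separate counter that has to be kept in sync by hand.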

while (runStatus && ['READY', 'RUNNING'].includes(runStatus)) {
const datasetIterator = this.iterate({ ...iterateOptions, offset: currentOffset });
for await (const item of datasetIterator) {
currentOffset++;
Contributor

@halvko halvko Mar 5, 2025

I don't like this, and I postponed adding my review until I had let it settle a bit. I ended up having to do pretty much the same thing when using the library to do "reasonable segmentation". I think this shows that our abstraction is leaky: the caller has to care whether there will actually be an await before we get to the next element. I think it would be better if we just returned AsyncGenerator<T[], void, void>, only yielding if there actually were any elements. I think iterate should have the same signature, even though it may often be used without a page size, in which case it only actually yields once.

Contributor

I think that would simplify both this code and most (if not all) usages.
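
One way the proposed page-level signature could look, as a sketch only. `Batch`, `FetchBatch`, and `iteratePages` are hypothetical names, and the PR may well settle on something different:

```typescript
// Hypothetical shapes; not the library's actual types.
interface Batch<T> {
  items: T[];
}
type FetchBatch<T> = (options: { offset: number; limit: number }) => Promise<Batch<T>>;

async function* iteratePages<T>(
  fetchBatch: FetchBatch<T>,
  startOffset: number,
  pageSize: number,
): AsyncGenerator<T[], void, void> {
  let offset = startOffset;
  while (true) {
    const { items } = await fetchBatch({ offset, limit: pageSize });
    if (items.length === 0) {
      return; // never yield an empty page
    }
    // Each yield corresponds to exactly one awaited request.
    yield items;
    offset += items.length;
  }
}
```

A caller that needs to track its position can then do `currentOffset += page.length` once per yielded page, instead of incrementing a counter for every item.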

@halvko
Contributor

halvko commented Mar 5, 2025

I would argue that it doesn't really matter to us whether the run status is up to date, as long as it is not ahead of time (and as long as we can assume it has at least reached the Ready state when start returns; our usage of the library on the Socials Team wouldn't work if that weren't the case).
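
The argument can be sketched as follows. This is an assumed shape of a greedy loop, not the PR's implementation: `greedyPages`, `getStatus`, and `fetchPage` are hypothetical names, and only the 'READY' and 'RUNNING' statuses come from the diff above. The status is read before each drain, so a status that lags behind only costs an extra pass, whereas a status that ran ahead could end the loop before the last items are visible.

```typescript
// Statuses treated as "still producing items", taken from the diff above.
const ACTIVE_STATUSES = ['READY', 'RUNNING'];

async function* greedyPages<T>(
  getStatus: () => Promise<string>, // hypothetical status lookup
  fetchPage: (offset: number) => Promise<T[]>, // hypothetical page fetch
): AsyncGenerator<T[], void, void> {
  let offset = 0;
  let finished = false;

  while (!finished) {
    // Read the status BEFORE draining. If it is stale (still active after
    // the run ended), we simply do one more pass than strictly necessary.
    const status = await getStatus();
    finished = !ACTIVE_STATUSES.includes(status);

    // Drain everything that is currently available.
    while (true) {
      const items = await fetchPage(offset);
      if (items.length === 0) {
        break;
      }
      yield items;
      offset += items.length;
    }
    // A real implementation would presumably wait between passes;
    // that delay is omitted here to keep the sketch short.
  }
}
```

Since the last drain happens after a non-active status was observed, every item written before that status became visible is returned.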
