@daniel-j-h
Collaborator

For #7. Work in progress.

This changeset implements reservoir sampling, a randomized online algorithm for uniformly sampling k items from a stream of unknown length n. We can use it to randomly sample e.g. k building features in the osmium handlers without having to store all features first or make two passes.
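
For readers new to the technique, here is a minimal standalone sketch of the idea in its common "Algorithm R" form. The function name `reservoir_sample` and its `rng` parameter are made up for illustration and are not the class added in this changeset.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration; not the class in this changeset.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    rng = rng or random.Random()
    reservoir = []

    # seen counts the items consumed so far, including the current one.
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items are kept unconditionally.
            reservoir.append(item)
        else:
            # Replacement phase: draw a slot in [0, seen). Only indices
            # below k fall inside the reservoir, so the item is kept
            # with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item

    return reservoir


# Single pass over a generator; the length is never needed up front.
sample = reservoir_sample(iter(range(1000)), k=5, rng=random.Random(0))
print(sample)
```

Memory stays at k items regardless of how long the stream is, which is the property that matters for large extracts.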

Tasks:

  • Hook up to osmium handlers
  • Let users pass number of samples for randomly sampling
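
A rough sketch of what the two tasks could look like once combined. Everything here is an assumption: `BuildingSampler`, `num_samples` and `seed` are hypothetical names, and the class is kept free of the osmium import so it runs standalone. With pyosmium it would presumably derive from `osmium.SimpleHandler`, whose `way()` callback receives each way in turn.

```python
import random
from types import SimpleNamespace


class BuildingSampler:
    """Keeps a uniform random sample of at most num_samples building ways.

    Hypothetical sketch, not the handler in this repository.
    """

    def __init__(self, num_samples, seed=None):
        self.num_samples = num_samples  # user-supplied sample count
        self.rng = random.Random(seed)
        self.sample = []
        self.seen = 0  # building ways seen so far

    def way(self, w):
        # Only ways tagged as buildings are part of the sampled stream.
        if "building" not in w.tags:
            return

        self.seen += 1

        if len(self.sample) < self.num_samples:
            # Store only the id rather than the way object itself.
            self.sample.append(w.id)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.num_samples:
                self.sample[j] = w.id


# Stand-in objects with .id and .tags, so no OSM file is required.
ways = [
    SimpleNamespace(id=i, tags={"building": "yes"} if i % 2 == 0 else {"highway": "residential"})
    for i in range(100)
]

handler = BuildingSampler(num_samples=10, seed=0)
for w in ways:
    handler.way(w)

print(handler.seen, handler.sample)
```

Counting only the ways that pass the tag filter keeps the sample uniform over buildings rather than over all ways.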

Refs:

Contributor

@bkowshik left a comment

Such a beautiful implementation! ❤️

                i = random.randint(0, size - 1)
                self.reservoir[i] = v

        self.pushed += 1

First we designate a counter, which will be incremented for every data point seen.

    '''Randomly samples k items from a stream of unknown n items.
    '''

    def __init__(self, capacity):

The reservoir is generally a list or array of predefined size.

        self.reservoir = []
        self.pushed = 0

    def push(self, v):

Now we can begin adding data.

        size = len(self.reservoir)

        if size < self.capacity:
            self.reservoir.append(v)

Until the reservoir holds capacity elements, incoming elements are appended directly to it.

            assert size == self.capacity
            assert size <= self.pushed

            p = self.capacity / self.pushed

Once the reservoir is full, each incoming data point has a size / counter chance of replacing an existing sample point.
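
One way to sanity-check the size / counter rule is to run the sampler many times and see how often each item ends up in the reservoir. The sketch below is a standalone experiment, not code from this changeset. `sample_once` and `inclusion_frequencies` are made-up names, and it assumes the counter already includes the incoming item when the probability is computed. Under that assumption every item should be retained with frequency close to k / n.

```python
import random
from collections import Counter


def sample_once(n, k, rng):
    """Run one pass over the items 0..n-1 and return the reservoir."""
    reservoir = []
    for seen in range(1, n + 1):  # seen includes the incoming item
        item = seen - 1
        if len(reservoir) < k:
            reservoir.append(item)
        elif rng.random() < k / seen:
            # Accepted: overwrite a uniformly chosen slot.
            reservoir[rng.randrange(k)] = item
    return reservoir


def inclusion_frequencies(n, k, trials, seed=0):
    """Fraction of trials in which each item ended up in the sample."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(sample_once(n, k, rng))
    return [counts[i] / trials for i in range(n)]


freqs = inclusion_frequencies(n=10, k=3, trials=20000)
print([round(f, 2) for f in freqs])  # each entry should sit near k / n
```

If the counter were read before counting the incoming item, the first item after the fill phase would be accepted with probability 1, which would skew these frequencies.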
