Public IPNI Performance Metrics #2731

@dennis-tra

Description


Hi everyone,

We at ProbeLab are currently revamping our IPNI monitoring. We've been monitoring the retrieval performance of cid.contact since 2023 and have, of course, also observed the degradations that have been a point of discussion recently.

However, we've also identified a few gaps in visibility. Our goal is to show the following metrics on a public dashboard:

  • Cached/Uncached retrieval performance of provider records for individual CIDs from different geographical regions
  • Time to index. This includes:
    1. After an advertisement announcement, how long does it take until IPNI reaches out to fetch the information? (as daily CDFs and daily p50/p90 over time)
    2. After an advertisement announcement, how long does it take until the CID becomes available on the read path? (as daily CDFs and daily p50/p90 over time)
  • Daily error rates for the two indexing timings above.
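For reference, the daily p50/p90 series could be derived from each day's raw timing samples with a simple nearest-rank percentile. A minimal sketch, not tied to any particular aggregation pipeline:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q% of all samples are <= it. Returns None for empty input."""
    if not samples:
        return None
    s = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(s)))
    return s[rank - 1]

# Example: p50/p90 of one day's time-to-index samples (seconds)
day_samples = [12.4, 3.1, 8.0, 45.2, 9.9, 7.3]
p50 = percentile(day_samples, 50)
p90 = percentile(day_samples, 90)
```

Computing both percentiles per day over the sample windows then yields the p50/p90-over-time series; the per-day sample lists themselves are what the daily CDFs would be drawn from.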

If you think anything is missing, feedback is welcome!

A brief description of the methodology:

  1. Peer A generates a batch of random CIDs (on the order of 100), creates an advertisement with a new, unique contextID, and HTTP-announces it to cid.contact.
  2. We are using an HTTP publisher (as opposed to libp2phttp), so cid.contact will reach out to peer A's HTTP server to fetch the information associated with the advertisement (this is timestamp 1 of the "time to index"). The timeout is 1 minute.
  3. To detect when the advertisement is available on the read path, peer A iterates through the list of CIDs generated in step 1 and queries cid.contact for the corresponding provider records (we need multiple unique CIDs because responses are cached). This is done at a 1 s interval, which also limits the resolution, but it's the best we can do. When we detect that one of the CIDs is available, we assume the advertisement has been fully indexed (this is timestamp 2 of the "time to index"). The timeout is 2 minutes.
  4. We instruct other peers in different geographical locations to request an unused CID from cid.contact twice. The first request measures uncached performance, the second cached performance.
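Step 3 can be sketched as a small polling loop. Here `is_indexed(cid)` is a placeholder for the actual read-path lookup (e.g. an HTTP GET for the CID's provider records); the function name and the injection are illustrative, not part of any existing tooling:

```python
import time

def wait_for_read_path(cids, is_indexed, interval=1.0, timeout=120.0):
    """Poll until any of the freshly announced CIDs resolves on the
    read path. `is_indexed(cid)` performs one lookup and returns True
    on a hit. Returns elapsed seconds, or None on timeout."""
    start = time.monotonic()
    i = 0
    while time.monotonic() - start < timeout:
        # Rotate through unique CIDs so cached negative responses
        # don't mask a successful index.
        cid = cids[i % len(cids)]
        i += 1
        if is_indexed(cid):
            return time.monotonic() - start
        time.sleep(interval)
    return None
```

The 1 s `interval` and 2 min `timeout` match the values described above; the returned elapsed time is the candidate for "timestamp 2" relative to the announcement.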

Is this a sound methodology?

As mentioned above, we've been probing cid.contact since 2023 but have neglected the measurement tooling for quite some time. Here are some things I noticed while updating the dependencies (the main challenge for me was moving from a graphsync publisher to an HTTP one):

  1. While testing locally, I occasionally saw "429 Too Many Requests" and "403 Forbidden" errors when calling /ingest/announce. What are the policies behind these two error codes? When is a 429 triggered, and when a 403?
  2. Is the first "time to index" measurement correct? The request pattern looks more like: 1) the indexer requests the advertisement from the HTTP publisher, 2) the indexer requests the "Entries" CID from the publisher. I guess the second request is the one indicative of the "time to index"?
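Until the rate-limiting policy behind the 429s is clarified, a client-side mitigation could be exponential backoff around the announce call. A sketch with the HTTP call injected as `send()` (a hypothetical helper returning the status code, not part of any existing tooling); it deliberately does not retry 403s, on the assumption that those signal a policy rejection rather than rate limiting:

```python
import time

def announce_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry an announce on 429 with exponential backoff.

    `send()` performs the actual HTTP call (e.g. to /ingest/announce)
    and returns the status code. Any status other than 429 is returned
    as-is; after max_attempts the last 429 is returned."""
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        # Back off 1s, 2s, 4s, ... between attempts.
        sleep(base_delay * 2 ** attempt)
    return status
```

`sleep` is injectable only to keep the sketch testable; in real use the defaults apply.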

Side note: the measurement infrastructure is agnostic to the target indexer (it doesn't have to be cid.contact), and we can easily replicate the probing setup to target a different deployment as well.
