Skip to content

identity.CacheDirectory should not cache context.Canceled / context.DeadlineExceeded errors#1345

Open
arushibandi wants to merge 4241 commits into
bluesky-social:mainfrom
arushibandi:abandi/fix-cache-dir-ctx-poison
Open

identity.CacheDirectory should not cache context.Canceled / context.DeadlineExceeded errors#1345
arushibandi wants to merge 4241 commits into
bluesky-social:mainfrom
arushibandi:abandi/fix-cache-dir-ctx-poison

Conversation

@arushibandi

@arushibandi arushibandi commented Feb 25, 2026

Copy link
Copy Markdown

I'm using the identity.DefaultDirectory() and noticed that I was seeing failed lookups due to context cancelled. This also seemed to coincide with logins/oauth redirect flows or refreshed. Restarting my process fixed this. After digging in, it seems that this is because the CacheDirectory caches all errors, not filtering for context cancelled / deadline exceeded.

My workaround for this is always pass in context.Background() into the dir.Lookup() function, but this behavior seems unintended.

The tests fail before the fix and pass after.

closes #1346

bnewbold and others added 30 commits December 2, 2025 19:13
Pulls in fixed MST key height examples, and adds a simple test to
confirm heights.

See: bluesky-social/atproto-interop-tests#11
dholms and others added 23 commits February 10, 2026 15:54
This fixes a deadlock in the outbox that was causing Tap to permanently
stall for a single DID when multiple live events for that DID would end
up in the outbox.

After processing, we check pending events before waiting on the
`notifChan` to see if there are more events that we should immediately
kick into processing.

Also, in `processPendingEvents`, we could fail to set `blockedInLive` if
there were already in-flight historical events for the DID.

Resolves bluesky-social#1331
Test suite for Tap. This isn't fully-featured. Specifically, it doesn't
do full integration testing for event/repo processing. So doesn't test
most sync 1.1 semantics.

However it does test:
- outbox, delivery, and event ordering
- the basics of the firehose state machine
- the resync system
  - claiming resync jobs
  - draining the resync buffer

Snagged a couple of bug fixes along the way as well.
…l#1330)

This should:

- Fix a potential `blockedOnLive` stall for individual DIDs when the
wakeup signal has already been consumed.
- Prevent historical events from bypassing pending live events via the
fast path.
This fixes the issue that was getting repos into a "desynchronized"
state. I was accidentally using `r.Commit`, not realizing that it
created a new commit, not return the existing commit details. This would
cause us to store & compare incorrect, locally-generated revs rather
than the rev from the commit message and subsequently end up skipping
valid commits, thinking they were an old rev.

Resolves bluesky-social#1315
Need to kick off the CI to test this, I *think* this is all that's
needed
@jazware gave the all-clear on this. This was used to do some tuning for
the original postgres bsky appview, and this is no longer relevant.
There is Go code in the new did-method-plc/go-didplc package which
supports PLC generation. For now I don't want to create a dependency
loop between the two packages though, so just removing this
functionality here.
…l#1342)

These have been deprecated for a while. The same functionality is
covered by:

- `atproto/identity`: DID resolution and parsing
- `atproto/atcrypto`: cryptographic keys
- https://github.com/did-method-plc/did-method-plc: creating and
verifying DID PLC operations

The one piece of dropped functionality here was the ability for `gosky`
to generate PLC operations and submit them to the directory. Currently
the `go-didplc` package depends on `indigo` (for crypto and syntax
packages), and I don't want to create a dependency loop (even if it
isn't a package-level dependency loop). Also the `goat plc ...` commands
support this functionality (using the `go-didplc` package).

This is part of general cleanups.
}
return nil, false, fmt.Errorf("identity not found in cache after coalesce returned")
case <-ctx.Done():
return nil, false, ctx.Err()

@arushibandi arushibandi Feb 25, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be cached right?

arushibandi added a commit to habitat-network/habitat that referenced this pull request Feb 26, 2026
… on shutdown. (#146)

Context errors are cached. We see this specifically during an OAuth
flow, where the browser re-directs a number of times, and then on the
redirect back to habitat, on page load, we fail to load because a
context cancelled error is cached. This type of error should not be
cached. I opened a PR in my fork of indigo to demonstrate the issue:
bluesky-social/indigo#1345.

Seems like these should never be cached, but I'm also wondering if we're
doing something wrong in our flow that is causing this. This is a
temporary workaround (I see this like 30 times a day, on each login,
which I'm doing frequently for testing reasons); either we should stop
this from happening or merge the fix in the upstream repo (maybe both?)

Co-authored-by: Arushi Bandi <arushibandi@users.noreply.github.com>
@arushibandi arushibandi changed the title patch identity.CacheDirectory should not cache context.Canceled / context.DeadlineExceeded errors Feb 27, 2026
@bnewbold bnewbold self-requested a review March 2, 2026 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug]: identity.CacheDirectory caches context.Canceled/context.DeadlineExceeded errors