
Fix sqlite3 "database is locked" in pipelined shards cache#926

Merged
jezdez merged 3 commits into main from fix/924-shards-db-locked
Apr 30, 2026

Conversation

@jezdez
Member

@jezdez jezdez commented Apr 30, 2026

Description

Fixes #924.

The pipelined shard traversal runs a cache_fetch_thread (reads) and a network_fetch_thread (writes) against separate sqlite3.Connections to the same repodata_shards.db. With the default rollback journal, a reader cannot proceed while a writer holds the exclusive lock. Under CI load the 5 s default busy timeout expires and SQLite raises sqlite3.OperationalError: database is locked.

This PR applies the following changes to shards_cache.connect():

  • Enable WAL mode (PRAGMA journal_mode = WAL), which allows readers to proceed against a snapshot while a writer is appending. The pragma is wrapped in a try/except sqlite3.DatabaseError so it degrades gracefully on locking-hostile filesystems (see #891, Test sharded repodata on locks-hostile filesystem) or corrupt databases.
  • Bump the connection timeout to 30 s (from the 5 s default), giving the busy handler more headroom when WAL is unavailable.
  • When WAL is confirmed active, also set PRAGMA synchronous = NORMAL (sufficient for a cache, avoids unnecessary fsync).

The dev script dev/scripts/requests-fetch-all-shards.py already used these same pragmas; this brings the production connect() in line.
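
For illustration, a minimal sketch of what the adjusted connect() could look like (the function signature, module layout, and naming below are assumptions, not the exact conda code):

    import sqlite3


    def connect(db_path: str) -> sqlite3.Connection:
        # Open the shards cache with lock-friendly settings (illustrative sketch).
        # 30 s busy timeout: headroom for the busy handler when WAL is unavailable.
        conn = sqlite3.connect(db_path, timeout=30)
        conn.row_factory = sqlite3.Row
        with conn as c:
            try:
                # WAL lets the cache reader work from a snapshot while the network
                # thread appends; the pragma returns the mode actually in effect.
                mode = c.execute("PRAGMA journal_mode = WAL").fetchone()[0]
            except sqlite3.DatabaseError:
                # Locking-hostile filesystem or corrupt database: keep the default
                # rollback journal and rely on the longer timeout instead.
                mode = None
            if mode == "wal":
                # Durability is sufficient for a cache; avoids extra fsyncs.
                c.execute("PRAGMA synchronous = NORMAL")
        return conn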

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

Enable WAL journal mode and a 30s busy timeout on repodata_shards.db so the
cache reader thread no longer races with the network writer thread. Falls back
gracefully on filesystems where WAL is unsupported.
@jezdez jezdez requested a review from a team as a code owner April 30, 2026 11:55
@github-project-automation github-project-automation Bot moved this to 🆕 New in 🔎 Review Apr 30, 2026
@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label Apr 30, 2026
@jezdez jezdez requested review from danyeaw and dholth April 30, 2026 12:01
Contributor

@dholth dholth left a comment


I'm surprised this came up, since (I thought) we had short transactions in the shard cache system.

I've considered a different solution: give the network thread a handle to the cache thread's incoming queue, or connect the two through the main thread, which would receive the bytes and forward them to the cache thread. The cache thread would then work through its queue, either looking up or storing requests as they came in.

conn.row_factory = sqlite3.Row
with conn as c:
    try:
        mode = c.execute("PRAGMA journal_mode = WAL").fetchone()[0]
Contributor


I'm a big fan of WAL mode. It could fail not because of filesystem locks, which are used by all sqlite3 modes, but because shared memory is not available (if conda's cache is on a shared filesystem, accessed by two computers). conda-index users requested we drop WAL mode for this reason.
On the other hand, this is only a cache, and there are lots of reasons why conda might not work if two conda processes try to use the same cache concurrently.
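
For example, the pragma returns the journal mode actually in effect, so a silent fallback can be detected without an exception (an illustrative fragment, assuming an already-open conn):

    mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    if mode != "wal":
        # The change did not take effect; the cache stays on its previous
        # journal mode and relies on the longer busy timeout instead.
        ...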

Contributor


I notice that we turn foreign_keys = ON, but we only have one table and no foreign keys. Probably good practice anyway.

Member Author


Yeah, that's fine, I think. I remembered that PR. I think in this case we need it.

Comment thread on news/924-fix-shards-db-locked (Outdated)
@github-project-automation github-project-automation Bot moved this from 🆕 New to ✅ Approved in 🔎 Review Apr 30, 2026
Co-authored-by: Daniel Holth <dholth@anaconda.com>
@jezdez jezdez enabled auto-merge (squash) April 30, 2026 12:16
@dholth
Contributor

dholth commented Apr 30, 2026

#927 is an outline of what a queue for cache insertion could look like
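
A rough sketch of what such a queue-based, single-writer arrangement could look like (illustrative only; the function names, the shards table, and the stop sentinel are assumptions, not the actual design in #927):

    STOP = object()  # sentinel telling the cache thread to drain and exit

    def network_fetch_thread(urls, cache_queue, fetch):
        # Download shards and hand the bytes to the cache thread via the queue,
        # so only one thread ever touches the sqlite3 connection.
        for url in urls:
            cache_queue.put((url, fetch(url)))
        cache_queue.put(STOP)

    def cache_fetch_thread(conn, cache_queue):
        # Single writer: work through the queue, storing shards as they arrive.
        while True:
            item = cache_queue.get()
            if item is STOP:
                break
            url, data = item
            with conn:  # one short transaction per shard
                conn.execute(
                    "INSERT OR REPLACE INTO shards (url, data) VALUES (?, ?)",
                    (url, data),
                )

Here cache_queue would be a queue.Queue shared by both threads and fetch a callable performing the HTTP request; the lookup path of the original pipeline is omitted.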

Contributor

@dholth dholth left a comment


Thanks

@jezdez jezdez merged commit 9f3f73b into main Apr 30, 2026
75 checks passed
@jezdez jezdez deleted the fix/924-shards-db-locked branch April 30, 2026 20:26
@github-project-automation github-project-automation Bot moved this from ✅ Approved to 🏁 Done in 🔎 Review Apr 30, 2026

Labels

cla-signed [bot] added once the contributor has signed the CLA

Projects

Status: 🏁 Done

Development

Successfully merging this pull request may close these issues.

sqlite3.OperationalError: database is locked in pipelined shards cache_fetch_thread

3 participants