Fix sqlite3 "database is locked" in pipelined shards cache #926
Conversation
Enable WAL journal mode and a 30s busy timeout on repodata_shards.db so the cache reader thread no longer races with the network writer thread. Falls back gracefully on filesystems where WAL is unsupported.
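The reader/writer race this PR addresses can be reproduced in miniature with two connections on one database file. A hedged sketch (the `shards` table name and schema here are illustrative, not taken from the PR):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "repodata_shards.db")

writer = sqlite3.connect(path, timeout=0.1)  # short busy timeout to fail fast
writer.execute("CREATE TABLE shards (name TEXT PRIMARY KEY, data BLOB)")
writer.commit()
reader = sqlite3.connect(path, timeout=0.1)

# Default rollback journal: an exclusive write transaction blocks readers,
# and once the busy timeout expires SQLite raises OperationalError.
writer.execute("BEGIN EXCLUSIVE")
err = ""
try:
    reader.execute("SELECT count(*) FROM shards").fetchone()
except sqlite3.OperationalError as exc:
    err = str(exc)
writer.commit()
print(err)  # database is locked

# In WAL mode the same read proceeds against a snapshot while a write is open.
writer.execute("PRAGMA journal_mode = WAL")
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO shards VALUES ('a', x'00')")
count = reader.execute("SELECT count(*) FROM shards").fetchone()[0]
writer.commit()
print(count)  # 0 -- the uncommitted insert is invisible to the reader
```

The second half shows why WAL resolves the race: the reader sees the last committed snapshot instead of waiting on the writer's lock.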
dholth
left a comment
I'm surprised this came up, since (I thought) we had short transactions in the shard cache system.
I've considered a different solution. It involves giving the network thread a handle to the cache thread's incoming queue; or connecting them through the main thread, which would receive bytes and forward them to the cache thread. Then the cache thread would work through its queue, either looking up or storing requests as they came in.
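The queue idea in the comment above could be sketched roughly as follows: only a single cache thread ever touches sqlite3, so reader/writer lock contention cannot occur. All names here (`cache_thread`, the `"store"`/`"lookup"` operations, the `shards` table) are illustrative assumptions, not the PR's actual API:

```python
import queue
import sqlite3
import threading

requests = queue.Queue()  # the cache thread's incoming queue

def cache_thread(db_path):
    # The only sqlite3 connection in the process; created and used
    # entirely within this thread.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS shards (name TEXT PRIMARY KEY, data BLOB)")
    while True:
        op, name, payload, reply = requests.get()
        if op == "stop":
            break
        if op == "store":  # bytes forwarded from the network thread
            with conn:
                conn.execute("INSERT OR REPLACE INTO shards VALUES (?, ?)", (name, payload))
        elif op == "lookup":
            row = conn.execute("SELECT data FROM shards WHERE name = ?", (name,)).fetchone()
            reply.put(row[0] if row else None)
    conn.close()

t = threading.Thread(target=cache_thread, args=(":memory:",))
t.start()
requests.put(("store", "shard-a", b"\x00\x01", None))
reply = queue.Queue()
requests.put(("lookup", "shard-a", None, reply))
result = reply.get()
requests.put(("stop", None, None, None))
t.join()
print(result)  # b'\x00\x01'
```

Because the queue serializes stores and lookups in arrival order, no busy timeout or journal-mode tuning is needed at all.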
```python
conn.row_factory = sqlite3.Row
with conn as c:
    try:
        mode = c.execute("PRAGMA journal_mode = WAL").fetchone()[0]
```
I'm a big fan of WAL mode. It could fail not because of filesystem locks, which are used by all sqlite3 modes, but because shared memory is not available (if conda's cache is on a shared filesystem, accessed by two computers). conda-index users requested we drop WAL mode for this reason.
On the other hand, this is only a cache, and there are lots of reasons why conda might not work if two condas try to use the same cache concurrently.
I notice that we turn foreign_keys = ON but we only have one table and no foreign keys. Probably good practice anyway.
Yeah, that's fine, I think. I remembered that PR. I think in this case we need it.
Co-authored-by: Daniel Holth <dholth@anaconda.com>
#927 is an outline of what a queue for cache insertion could look like
Description
Fixes #924.
The pipelined shard traversal runs a `cache_fetch_thread` (reads) and a `network_fetch_thread` (writes) against separate `sqlite3.Connection`s to the same `repodata_shards.db`. With the default rollback journal, a reader cannot proceed while a writer holds the exclusive lock. Under CI load the 5 s default busy timeout expires and SQLite raises `sqlite3.OperationalError: database is locked`.

This PR applies two changes to `shards_cache.connect()`:

- Enable WAL mode (`PRAGMA journal_mode = WAL`), which allows readers to proceed against a snapshot while a writer is appending. The pragma is wrapped in a `try/except sqlite3.DatabaseError` so it degrades gracefully on locking-hostile filesystems (see "Test sharded repodata on locks-hostile filesystem" #891) or corrupt databases.
- `PRAGMA synchronous = NORMAL` (sufficient for a cache, avoids unnecessary `fsync`).

The dev script `dev/scripts/requests-fetch-all-shards.py` already used these same pragmas; this brings the production `connect()` in line.

Checklist - did you ...

- [ ] Add a file to the `news` directory (using the template) for the next release's release notes?
- [ ] Add / update outdated documentation?
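For reference, the connection setup this PR describes could look roughly like the following. This is a sketch only; the real `shards_cache.connect()` signature and error handling in the PR may differ:

```python
import sqlite3

def connect(db_path):
    """Open the shard cache with WAL mode and a long busy timeout (sketch)."""
    conn = sqlite3.connect(db_path, timeout=30)  # 30 s busy timeout
    conn.row_factory = sqlite3.Row
    try:
        # WAL lets the cache reader proceed while the network thread writes.
        # The pragma returns the resulting mode; on WAL-hostile filesystems
        # it stays e.g. "delete" and the cache simply keeps the default
        # rollback journal.
        mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    except sqlite3.DatabaseError:
        pass  # locking-hostile filesystem or corrupt database: stay on defaults
    conn.execute("PRAGMA synchronous = NORMAL")  # enough durability for a cache
    return conn
```

Checking the pragma's return value rather than assuming success matters because, as noted in the review, WAL can be refused when shared memory is unavailable (e.g. a cache on a network filesystem).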