Post-Mortem 1: @juniusfree incomplete save #590
Replies: 3 comments 9 replies
-
Some technologies make it possible to write changes more quickly by saving them one at a time instead of all at once. Using a write-ahead log, as Postgres and SQLite do, lets you write each change immediately to a file, which can then be folded into the main database at leisure later.
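The append-only-log idea can be sketched in a few lines. This is an illustration, not Athens code: the `Change` shape, file name, and helper names are invented for the example, and Athens' real transactions are datascript datoms rather than JSON records.

```typescript
import { appendFileSync, readFileSync, existsSync } from "node:fs";

// Hypothetical change record; Athens' real unit would be a datascript transaction.
interface Change {
  id: number;
  op: string;
  value: string;
}

// Append one change as a single JSON line. Because each write is a small
// append rather than a rewrite of the whole database, a crash mid-session
// loses at most the final partial line, not the whole file.
export function appendChange(logPath: string, change: Change): void {
  appendFileSync(logPath, JSON.stringify(change) + "\n");
}

// Rebuild state later by replaying the log, skipping a trailing partial
// line left by an interrupted write.
export function replayLog(logPath: string): Change[] {
  if (!existsSync(logPath)) return [];
  const changes: Change[] = [];
  for (const line of readFileSync(logPath, "utf8").split("\n")) {
    if (!line) continue;
    try {
      changes.push(JSON.parse(line));
    } catch {
      // Partial final line from an interrupted write: drop it.
    }
  }
  return changes;
}
```

Resolving the log into the main database "at leisure" would then mean periodically replaying it and writing a fresh snapshot, after which the log can be truncated.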
-
Is there some infrastructure that could reduce the reliance on particular individuals noticing and remembering to follow up on errors? For example, error triage rules that prioritize high-severity areas like serialization.
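As a toy illustration of what such triage rules might look like (the severity levels, patterns, and function names below are invented for the example, not an existing Athens or error-tracker API):

```typescript
// Hypothetical severity levels and triage rules; the keywords are
// illustrative, not Athens' actual error taxonomy.
type Severity = "critical" | "high" | "low";

const triageRules: Array<{ pattern: RegExp; severity: Severity }> = [
  // Serialization/persistence failures risk data loss: surface them first.
  { pattern: /transit|JSON\.parse|serializ/i, severity: "critical" },
  // Filesystem write failures are also suspect.
  { pattern: /fs\/write|EACCES|ENOSPC/i, severity: "high" },
];

// Classify an error message; anything unmatched defaults to "low".
export function triage(message: string): Severity {
  for (const rule of triageRules) {
    if (rule.pattern.test(message)) return rule.severity;
  }
  return "low";
}
```

The point is that the rules, not an individual's memory, decide which errors demand follow-up.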
-
athens/src/cljs/athens/electron.cljs, lines 369 to 383 at 200b84a.

Directly overwriting a file here is always going to be a potential race condition. Perhaps it would be safer to first write to a temporary file. With this approach, if the program exits for any reason during the writing process, the DB is not corrupted (it may just not be the most up-to-date).

Going even a step further: startup can check whether a temp file exists. If so, it means the program exited during a filesystem sync, and one can alert the user about potential ways of recovering the unsaved data from the temp file.

PS. This is irrespective of the idea to add a separate append-only log of transacted changes (which would be a great data-redundancy strategy).

PPS. I would be wary of relying solely on Dropbox as my backup strategy for frequently updated files after having seen John Hughes give a presentation about Dropbox and QuickCheck: https://vimeo.com/158002499. Hopefully Dropbox has fixed all the identified issues, but guaranteeing atomic filesystem operations in a distributed system is generally a hard problem. ;)
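The write-temp-then-rename pattern suggested here can be sketched as follows. This is an illustrative sketch, not Athens' actual code; the function names are invented. It relies on the fact that on POSIX filesystems a rename within one directory is atomic.

```typescript
import { writeFileSync, renameSync, existsSync } from "node:fs";

// Write the new contents to a sibling temp file, then atomically rename it
// over the real DB file. At every instant, dbPath is either the old complete
// database or the new complete database, never a torn write.
export function atomicWrite(dbPath: string, data: string): void {
  const tmpPath = dbPath + ".tmp";
  writeFileSync(tmpPath, data); // a crash here leaves dbPath untouched
  renameSync(tmpPath, dbPath);  // atomic replacement of the old file
}

// Startup check suggested above: a leftover temp file means the previous
// session exited between the write and the rename, and the temp file may
// hold data worth offering to the user for recovery.
export function crashedLastSession(dbPath: string): boolean {
  return existsSync(dbPath + ".tmp");
}
```

A production version would also fsync the temp file before the rename, so the data is durable and not just ordered.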
-
Date: 01/31/2021
Author: @tangjeff0
Detection: @juniusfree
Impact: Date pages between January 25-31, 2021 lost. Potentially other lost blocks and pages (to be confirmed by Junius).
Status: Complete, with additional action items for future optimizations and error detection.
Summary: After writing notes down in Athens, Junius exited Athens. When he re-opened Athens 5 minutes later, it failed to start. Within 24 hours, beta.33 was released, which creates a new backup on each write, waiting until the last possible moment instead of trying to overwrite the existing file from the beginning. Each backup is named {TIMESTAMP}-index.transit.bkp. This means the user's db folder will eventually hold many backups. This is obviously not space-efficient, and it expects users to delete/manage their old backups, but it effectively removes any possibility of data loss. We will eventually come up with other approaches to writing and redundancy that are more performant and space-efficient, but the first principle should be no user data loss, which this solution accomplishes.

Action Items
- Attempt to recover the lost data via index.transit backups, IndexedDB backups, or Electron logs of datascript transactions

Root Cause
At the time of this writing, Athens persists data by writing the entire datascript database to the filesystem as a single file named index.transit. However, this is an expensive operation. Two optimizations were made to address it:

1. Debounce. Each time an action occurs that produces a datascript transaction, Athens dispatches a :save event, which leads to a :fs/write! effect (both re-frame constructs). The event and effect are debounced by 15 seconds (15000 milliseconds), so Athens writes at most once every 15 seconds, and only after the last event (source).

2. Node.js streams. Node.js is available because Athens is an Electron app, giving access to node.js modules like fs and stream. Streams are used because they are non-blocking/async (preventing the UI from freezing) and because they process data incrementally (keeping total in-memory usage low).

Given 1 and 2, if a user writes to Athens and exits (or refreshes) before the stream has finished writing to the filesystem, the write will be incomplete. This led to a
JSON.parse error when Junius opened Athens, because the end of the file was missing the closing brackets necessary to complete the nested JSON data structure.

Lessons Learned
What went well
What went wrong
I should have thought more deeply about these issues when I first encountered them. Data loss is essentially the only unacceptable bug for a knowledge management tool.
Where we got lucky