Post-Mortem 1: @juniusfree incomplete save #590
Replies: 3 comments 9 replies
-
Some technologies make it possible to write changes more quickly by saving them one at a time instead of all at once. Using a write-ahead log, as Postgres and SQLite do, lets you write each change immediately to a file, which can then be folded into the main database at leisure later.
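The append-only-log idea can be sketched in a few lines. This is an illustration, not Athens code: the `Change` shape, file name, and helper names are invented for the example, and Athens' real transactions are datascript datoms rather than JSON records.

```typescript
import { appendFileSync, readFileSync, existsSync } from "node:fs";

// Hypothetical change record; Athens' real unit would be a datascript transaction.
interface Change {
  id: number;
  op: string;
  value: string;
}

// Append one change as a single JSON line. Because each write is a small
// append rather than a rewrite of the whole database, a crash mid-session
// loses at most the final partial line, not the whole file.
export function appendChange(logPath: string, change: Change): void {
  appendFileSync(logPath, JSON.stringify(change) + "\n");
}

// Rebuild state later by replaying the log, skipping a trailing partial
// line left by an interrupted write.
export function replayLog(logPath: string): Change[] {
  if (!existsSync(logPath)) return [];
  const changes: Change[] = [];
  for (const line of readFileSync(logPath, "utf8").split("\n")) {
    if (!line) continue;
    try {
      changes.push(JSON.parse(line));
    } catch {
      // Partial final line from an interrupted write: drop it.
    }
  }
  return changes;
}
```

Resolving the log into the main database "at leisure" would then mean periodically replaying it and writing a fresh snapshot, after which the log can be truncated.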
-
Is there some infrastructure that could reduce the reliance on particular individuals noticing and remembering to follow up on errors? For example, error triage rules that prioritize high-severity areas like serialization.
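As a toy illustration of what such triage rules might look like (the severity levels, patterns, and function names below are invented for the example, not an existing Athens or error-tracker API):

```typescript
// Hypothetical severity levels and triage rules; the keywords are
// illustrative, not Athens' actual error taxonomy.
type Severity = "critical" | "high" | "low";

const triageRules: Array<{ pattern: RegExp; severity: Severity }> = [
  // Serialization/persistence failures risk data loss: surface them first.
  { pattern: /transit|JSON\.parse|serializ/i, severity: "critical" },
  // Filesystem write failures are also suspect.
  { pattern: /fs\/write|EACCES|ENOSPC/i, severity: "high" },
];

// Classify an error message; anything unmatched defaults to "low".
export function triage(message: string): Severity {
  for (const rule of triageRules) {
    if (rule.pattern.test(message)) return rule.severity;
  }
  return "low";
}
```

The point is that the rules, not an individual's memory, decide which errors demand follow-up.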
-
athens/src/cljs/athens/electron.cljs, lines 369 to 383 at 200b84a.

Directly overwriting a file here is always going to be a potential race condition. Perhaps it would be safer to first write to a temporary file. With this approach, if the program exits for any reason during the writing process, the DB is not corrupted (it may just not be the most up-to-date).

Going even a step further: startup can check whether a temp file exists. If so, it means the program exited during a filesystem sync, and one can alert the user about potential ways of recovering the unsaved data from the temp file.

PS. This is irrespective of the idea to add a separate append-only log of transacted changes (which would be a great data-redundancy strategy).

PPS. I would be wary of relying solely on Dropbox as my backup strategy for frequently updated files after having seen John Hughes give a presentation about Dropbox and QuickCheck: https://vimeo.com/158002499. Hopefully Dropbox has fixed all the identified issues, but guaranteeing atomic filesystem operations in a distributed system is generally a hard problem. ;)
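The write-temp-then-rename pattern suggested here can be sketched as follows. This is an illustrative sketch, not Athens' actual code; the function names are invented. It relies on the fact that on POSIX filesystems a rename within one directory is atomic.

```typescript
import { writeFileSync, renameSync, existsSync } from "node:fs";

// Write the new contents to a sibling temp file, then atomically rename it
// over the real DB file. At every instant, dbPath is either the old complete
// database or the new complete database, never a torn write.
export function atomicWrite(dbPath: string, data: string): void {
  const tmpPath = dbPath + ".tmp";
  writeFileSync(tmpPath, data); // a crash here leaves dbPath untouched
  renameSync(tmpPath, dbPath);  // atomic replacement of the old file
}

// Startup check suggested above: a leftover temp file means the previous
// session exited between the write and the rename, and the temp file may
// hold data worth offering to the user for recovery.
export function crashedLastSession(dbPath: string): boolean {
  return existsSync(dbPath + ".tmp");
}
```

A production version would also fsync the temp file before the rename, so the data is durable and not just ordered.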
-
Date: 01/31/2021
Author: @tangjeff0
Detection: @juniusfree
Impact: Date pages between January 25-31, 2021 lost. Potentially other lost blocks and pages (to be confirmed by Junius).
Status: Complete, with additional action items for future optimizations and error detection.
Summary: After writing notes down in Athens, Junius exited Athens. When he re-opened Athens 5 minutes later, it failed to start. Within 24 hours, beta.33 was released, which creates a new backup on each write, waiting until the last possible moment instead of trying to overwrite the existing file from the beginning. Each backup is named {TIMESTAMP}-index.transit.bkp. This means the user's db folder will eventually hold many backups. This is obviously not space-efficient, and it expects users to delete/manage their old backups, but it effectively removes any possibility of data loss. We will eventually come up with other approaches to writing and redundancy that are more performant and space-efficient, but the first principle should be no user data loss, which this solution accomplishes.

Action Items
- Attempt to recover the lost data via index.transit backups, IndexedDB backups, or Electron logs of datascript transactions

Root Cause
At the time of this writing, Athens persists data by writing the entire datascript database to the filesystem as a single file named index.transit. However, this is an expensive operation. Two optimizations were made to address it:

1. Debounce. Each time an action occurs that produces a datascript transaction, Athens dispatches a :save event, which leads to a :fs/write! effect (both re-frame constructs). The event and effect are debounced by 15 seconds (15000 milliseconds), so Athens writes at most once every 15 seconds, and only after the last event (source).

2. Node.js streams. Node.js is available because Athens is an Electron app, giving access to node.js modules like fs and stream. Streams are used because they are non-blocking/async (preventing the UI from freezing) and because they process data incrementally (keeping total in-memory usage low).

Given 1 and 2, if a user writes to Athens and exits (or refreshes) before the stream has finished writing to the filesystem, the write will be incomplete. This led to a
JSON.parse error when Junius opened Athens, because the end of the file was missing the closing brackets necessary to complete the nested JSON data structure.

Lessons Learned
What went well
What went wrong
I should have thought more deeply about these issues when I first encountered them. Data loss is essentially the only unacceptable bug for a knowledge management tool.
Where we got lucky