File & Dir tracking data v2
As our uses of watchman have evolved, and the size of our repos has increased (and is projected to increase), there are some limitations that we are starting to feel:
- Our internal data structures require an exclusive lock to access
- Moderately abusive watchman clients can accidentally deny service to other clients by querying too frequently
- The amount of memory used to track repos is a function of the size of the repo
In the future we want to support out-sourcing some of the watchman file operations to an external source of truth. For example, rather than querying the filesystem there may be some other service available that can provide the equivalent information. In some cases we may wish to defer fetching data until it becomes visible in a client query.
This document has some thoughts on how we might make changes to accommodate these points.
Watchman maintains an in-memory view of each filesystem root. This consists of:
- A `map<dirname, dirinfo>`
- Each `dirinfo` holds a `map<filename, fileinfo>`
- Each `fileinfo` holds a subset of the `struct stat` information for a file
- There is a recency index linking through the `fileinfo`s such that we can walk from the most recently modified end towards the eldest end
- There is a suffix index linking through the `fileinfo`s such that, for a given (case insensitive) filename suffix, we can walk through the list of files with the same suffix
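For reference, a minimal sketch of that in-memory shape; the field names and layout here are illustrative stand-ins rather than the actual `watchman_dir` / `watchman_file` definitions:

```
#include <cstdint>
#include <map>
#include <string>
#include <sys/stat.h>

// Illustrative only; the real structures are watchman_dir / watchman_file.
struct fileinfo {
  struct stat st;          // the subset of stat data we track
  uint64_t otime_tick;     // tick at which we last observed a change
  bool exists;             // false once the file is observed to be deleted
  fileinfo* recency_next;  // link for the recency index (most recent first)
  fileinfo* suffix_next;   // link for the per-suffix index
};

struct dirinfo {
  std::map<std::string, fileinfo> files;  // filename -> fileinfo
};

// Per-root in-memory view: dirname -> dirinfo
using root_view = std::map<std::string, dirinfo>;
```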
Queries over the data obtain a lock and then walk over one or more of the indices and match fields against the data in the `fileinfo`.
Each observed filesystem event increments a `tick` counter that is used to mark that logical point in time. The recency index is therefore ordered such that the most recent end always has the largest `tick` value. Moving backwards through this index, the `tick` value of a prior node is always `<=` that of the current node.
Queries are (optionally, but typically) synchronized with the filesystem view; by generating a change to a cookie file and waiting for its notifications to arrive through the kernel notification interface, we know that our view of the filesystem data is current up to the point in time where we start the query.
A client will typically generate a change cookie, wait for it to appear and then obtain a lock over the state before executing.
Waiting for the cookie and the lock can take a non-trivial time if the filesystem is actively being mutated or there are many clients.
One path forwards is to restructure how we track this data such that the state of the world at a given `tick` can be referenced by a `snapshot` object:
- The `fileinfo` data can be journaled to a database (sqlite is an obvious candidate) that is stored in the `STATEDIR`; each watched root will have its own journal (a rough schema sketch follows this list)
- Each time we observe a file change, we write the record for that file to the journal along with its `tick`
- A given generation of `fileinfo` is uniquely identified by its fileid (which implies its filename - the intent is to literally store a mapping of filename <-> id for this) and its tick value
- There may be multiple entries in the journal; one for each observed `tick` that is live and reachable by a snapshot
- Unreachable entries will be pruned when the associated snapshot(s) are no longer live
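As a rough illustration, a sqlite-backed journal could be shaped something like the following; the table and column names are assumptions made for this sketch, not a finalized schema:

```
#include <sqlite3.h>
#include <stdexcept>
#include <string>

// Hypothetical schema: a filename <-> id mapping plus a journal of
// fileinfo generations keyed by (fileid, tick).
static const char* kJournalSchema = R"sql(
  CREATE TABLE IF NOT EXISTS filenames (
    fileid   INTEGER PRIMARY KEY,
    filename TEXT UNIQUE NOT NULL
  );
  CREATE TABLE IF NOT EXISTS journal (
    fileid  INTEGER NOT NULL REFERENCES filenames(fileid),
    tick    INTEGER NOT NULL,
    exists_ INTEGER NOT NULL,   -- 0/1; records deletions too
    mode    INTEGER,
    size    INTEGER,
    mtime   INTEGER,            -- the stat subset we care about
    PRIMARY KEY (fileid, tick)
  );
)sql";

// Open (or create) the per-root journal stored under the STATEDIR.
sqlite3* open_journal(const std::string& path) {
  sqlite3* db = nullptr;
  if (sqlite3_open(path.c_str(), &db) != SQLITE_OK) {
    throw std::runtime_error("failed to open journal db");
  }
  char* err = nullptr;
  if (sqlite3_exec(db, kJournalSchema, nullptr, nullptr, &err) != SQLITE_OK) {
    std::string msg = err ? err : "unknown error";
    sqlite3_free(err);
    sqlite3_close(db);
    throw std::runtime_error("failed to create schema: " + msg);
  }
  return db;
}
```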
In addition to this stored data we can maintain an in-memory recency index, but this can be reduced to an ordered list of `(fileid, tick)` rather than `(fileinfo)`, which should dramatically reduce the amount of memory that we need just to keep track of when files change.
The data in the journal can be loaded via a cache so that we won't need to run an explicit query to the storage layer each time we need to look up the data for a file.
When we need to execute a query over the journal, we can create a snapshot for the current `tick` value.
The snapshot will include an immutable copy of the recency index. The cost of creating the snapshot would be largely a function of making a copy of the index, but once created, since the data model is write-or-delete-not-modify, there should be no need to obtain an exclusive lock to yield the results for a given query. We just need to make sure that we never delete a record that may be referenced by a live snapshot.
As we observe changes, we'll need to log the new view of the `fileinfo` for a given file. This can be achieved by simply writing out a `fileinfo` record with the new `tick` value. By definition this operation will be a create operation for that row, as no other actor can generate different data for the file at the present tick value.
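Continuing the hypothetical schema sketch above, appending one such generation to the `journal` table could look like this:

```
#include <cstdint>
#include <sqlite3.h>
#include <stdexcept>

// Append one fileinfo generation to the journal described earlier.
// Uses the hypothetical `journal` table from the schema sketch above.
void journal_file_change(sqlite3* db, int64_t fileid, int64_t tick,
                         bool exists, int64_t mode, int64_t size,
                         int64_t mtime) {
  static const char* kInsert =
      "INSERT INTO journal (fileid, tick, exists_, mode, size, mtime) "
      "VALUES (?, ?, ?, ?, ?, ?)";
  sqlite3_stmt* stmt = nullptr;
  if (sqlite3_prepare_v2(db, kInsert, -1, &stmt, nullptr) != SQLITE_OK) {
    throw std::runtime_error("failed to prepare journal insert");
  }
  sqlite3_bind_int64(stmt, 1, fileid);
  sqlite3_bind_int64(stmt, 2, tick);
  sqlite3_bind_int64(stmt, 3, exists ? 1 : 0);
  sqlite3_bind_int64(stmt, 4, mode);
  sqlite3_bind_int64(stmt, 5, size);
  sqlite3_bind_int64(stmt, 6, mtime);
  int rc = sqlite3_step(stmt);
  sqlite3_finalize(stmt);
  if (rc != SQLITE_DONE) {
    throw std::runtime_error("failed to write journal record");
  }
}
```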
For efficiency we will likely want to write through a cache to the storage layer; there is a good chance that recently changed files will be accessed in a subsequent query, either via a subscription or via a `since`-based query from tools such as the mercurial fsmonitor extension.
After journalling the file data, we will need to update the relevant indices:
Logically we want the recency index to be a set of `(fileid, tick)` ordered by decreasing values of `tick`. We have some slight tension here: we want this to be cheap to update and cheap to copy. We want to bias towards cheap to update so that we can make forward progress on consuming incoming changes with low latency.
We'll likely need to use some kind of Log Structured Merge representation for this stuff; we can bake a flattened copy of the recency index into something that is conceptually:
```
struct baked_index {
  tick_t tick;                          // Tick at the time this was baked
  vector<pair<file_id, tick_t>> files;  // The baked index
};
```
To reduce the amount of data that we need to bake into the index each time, we'd like to make a chain out of a range of baked values:
```
struct baked_index {
  tick_t tick;
  vector<pair<file_id, tick_t>> files;  // The baked index fragment
  struct baked_index *prior;            // The index that we build on
  // accelerator to know which file_ids are present in this index fragment
  set<file_id> members;
};
```
The nature of our data is such that the `baked_index::tick` value in the most recent index is always greater than the `baked_index::tick` value linked to by its `prior` member. We can exploit this to limit queries that walk backwards over the time index. Each baked index segment is considered immutable once it has been baked. That allows clients to own a reference to it for as long as they like without fear of hindering progress in other parts of the system.
We will periodically need to merge these chains to prevent them from growing too long and bloated. This is particularly important for the other types of index that we'll maintain (described in sections below). We should be able to merge the indices when we settle, so the common case should be a short chain of 1 or 2 entries. A busy period may result in us creating longer chains, making the subsequent query a bit more expensive, but we should be able to merge that shortly afterwards when we settle. We also have the option of using a heuristic and allowing a client to choose to preemptively merge if they are using any generator other than the `since` generator.
At merge time we have the opportunity to age out dead files from the index. The merge can proceed from the head of the list through to the tail; for the head we want to copy it into the destination struct, then for each `prior` segment, for each `files` entry, if its `file_id` is not in the `dest->members` set, we would append it to `dest->files` and insert it into `dest->members`. The merge can happen asynchronously; the critical section is publishing the merged index as the head. If the head at publish time has the same `tick` value as the merged data, we can just swap it out. Otherwise we will need to either recompute the whole merge(!) or merge the new deltas and make those point back to our previous merge, and repeat until we win the race and replace the head.
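A sketch of that merge, under the assumptions that the chain is held via `shared_ptr` (so live snapshots keep old segments alive) and that each fragment's `files` vector is ordered newest-first; this illustrates the composition rule rather than the real implementation:

```
#include <cstdint>
#include <memory>
#include <set>
#include <utility>
#include <vector>

using tick_t = uint64_t;
using file_id = uint64_t;

// Minimal restatement of the chained index from above.
struct baked_index {
  tick_t tick;
  std::vector<std::pair<file_id, tick_t>> files;  // newest-first
  std::set<file_id> members;
  std::shared_ptr<const baked_index> prior;
};

// Collapse a chain into a single segment.  Walking head -> tail, the first
// (newest) entry we see for a given file_id wins; older generations and
// files that have aged out of all fragments are dropped.
std::shared_ptr<baked_index> merge_chain(
    std::shared_ptr<const baked_index> head) {
  auto merged = std::make_shared<baked_index>();
  merged->tick = head ? head->tick : 0;
  for (auto seg = head; seg; seg = seg->prior) {
    for (const auto& entry : seg->files) {
      if (merged->members.insert(entry.first).second) {
        // First time we've seen this file_id; keep its newest record.
        merged->files.push_back(entry);
      }
    }
  }
  merged->prior = nullptr;  // the merged segment stands alone
  return merged;
}
```

Publishing `merged` as the new head (or merging the new deltas and retrying if we lose the race) remains the critical section described above.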
With all of that in mind, when processing a series of updates, we can build a new `baked_index`; each time we mark a file as changed, we add it to `baked_index::members` and record the current `tick` value. At the end of processing for a given tick we would link that item in as the head of the index.
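Building on the illustrative definitions in the merge sketch above, the per-tick update path might look roughly like this:

```
// Accumulate the changes observed during one tick into a new segment and
// link it in as the head once the tick has been fully processed.
// Reuses the illustrative baked_index definition from the merge sketch.
std::shared_ptr<const baked_index> record_tick(
    std::shared_ptr<const baked_index> head,
    tick_t tick,
    const std::vector<file_id>& changed_files) {
  auto seg = std::make_shared<baked_index>();
  seg->tick = tick;
  for (file_id id : changed_files) {
    if (seg->members.insert(id).second) {
      seg->files.emplace_back(id, tick);
    }
  }
  seg->prior = std::move(head);  // the previous head becomes our prior
  return seg;  // caller publishes this as the root's new head
}
```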
A snapshot would just addref on the current `baked_index` head for a given root.
Consuming the index is then a fairly simple matter of walking it something like this:
```
tick_t since;  // lower bound for the query (the "since" parameter)

for (auto index = head; index; index = index->prior) {
  if (index->tick < since) {
    // This fragment, and everything prior to it, is older than the
    // desired range
    break;
  }
  // Entries within a fragment are ordered newest-first, so we can stop
  // as soon as we fall below the lower bound
  for (const auto& pair : index->files) {
    if (pair.tick < since) {
      // Data is older than the desired range
      break;
    }
    // Do something
    apply(pair.tick, pair.fileid);
  }
}
```
The all-files index is logically a `set<fileid>`. When we observe a new file or prune out a dead file we will update this set.
We can take the composition of `baked_index::members` through the index chain and use that to obtain the complete list of files:
```
set<fileid> all_files;
for (auto index = head; index; index = index->prior) {
  for (auto fileid : index->members) {
    all_files.insert(fileid);
  }
}
```
We want to maintain a `map<suffix, set<fileid>>` to effectively locate all the files with a given filename suffix.
Similar to all_files, we can accumulate deltas as we observe new files being added:
```
struct baked_index {
  // ... fields from above, plus:
  map<suffix, set<fileid>> suffixes;
};
```
To compute the effective value of a set of suffixes:
```
set<fileid> all_files_with_suffix;
for (auto index = head; index; index = index->prior) {
  for (const auto& suffix : interesting_suffixes) {
    for (auto fileid : index->suffixes[suffix]) {
      all_files_with_suffix.insert(fileid);
    }
  }
}
```
For the `path` generator we want to be able to walk portions of the directory structure. This index is logically a `map<directory-file-id, set<fileid>>` where `directory-file-id` is just the id that corresponds to the directory name. The set value in the map is the list of files/dirs that are directly contained within that directory.
For incremental updates to a given dir we will store the complete new set of contents in the paths entry. For example, if at `tick=1` the paths index is logically:
paths["some/dir"] = set("foo");
and at `tick=2` we observe `some/dir/bar` being created, the paths index in the new `baked_index` that we produce for this tick will hold:
paths["some/dir"] = set("foo", "bar");
We'll add this structure to the baked index:
```
struct baked_index {
  // ... fields from above, plus:
  map<fileid, set<fileid>> paths;
};
```
To compute the full paths index we would walk back over the chain and compose them together, similar to the prior examples.
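One reading of "compose" here is that the newest fragment's entry for a directory supersedes older ones, since each entry already carries the complete contents; a plain union could resurrect deleted entries. A sketch, assuming the illustrative `baked_index` above has grown a `std::map<file_id, std::set<file_id>> paths` member:

```
// Compose the full paths index by walking newest -> oldest; the first
// fragment that mentions a directory supplies its complete contents.
std::map<file_id, std::set<file_id>> compose_paths(
    std::shared_ptr<const baked_index> head) {
  std::map<file_id, std::set<file_id>> full;
  for (auto seg = head; seg; seg = seg->prior) {
    for (const auto& [dir_id, contents] : seg->paths) {
      // Only take this entry if a newer fragment hasn't already supplied
      // the complete contents for this directory.
      full.try_emplace(dir_id, contents);
    }
  }
  return full;
}
```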
We will maintain a bounded MRU cache of fileinfo objects; the cache will be keyed by `(fileid, tick)`. A cache miss just means that the client will need to request the data from the journal.
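For illustration, a bounded MRU cache along these lines could be the classic list-plus-map arrangement below; this is a sketch rather than the real cache, and an ordered map is used so that a `(fileid, tick)` pair key works without a custom hash:

```
#include <cstddef>
#include <list>
#include <map>
#include <optional>
#include <utility>

// Illustrative bounded MRU cache.  In the fileinfo case the key would be
// (fileid, tick) and the value the materialized fileinfo record; a miss
// just means falling back to the journal.
template <typename Key, typename Value>
class MruCache {
 public:
  explicit MruCache(std::size_t capacity) : capacity_(capacity) {}

  std::optional<Value> get(const Key& key) {
    auto it = index_.find(key);
    if (it == index_.end()) {
      return std::nullopt;  // miss
    }
    // Move the entry to the front (most recently used).
    order_.splice(order_.begin(), order_, it->second);
    return it->second->second;
  }

  void put(const Key& key, Value value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(value);
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    order_.emplace_front(key, std::move(value));
    index_[key] = order_.begin();
    if (index_.size() > capacity_) {
      index_.erase(order_.back().first);  // evict the least recently used
      order_.pop_back();
    }
  }

 private:
  using Entry = std::pair<Key, Value>;
  std::size_t capacity_;
  std::list<Entry> order_;  // MRU order; front is the newest
  std::map<Key, typename std::list<Entry>::iterator> index_;
};
```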
In the most common watchman use cases, the files that we just observed change will be most likely to be in the cache when a subsequent query is made (most queries are time bounded to the more recently changed items).
Once a client has a snapshot handle it can run an arbitrary number of queries largely uncontested by other clients.
Here's how the various generators translate to this model; recall that the purpose of a generator is to yield a series of `file` data nodes that we can pass through the expression evaluator:
- The `since` generator just walks the snapshot of the recency index from the most recent end
- The `suffix` generator walks the suffix index for the specified suffixes
- The `path` generator would walk the path index
- The `all` generator would walk the all-files index
For each `file_id` yielded by the generator, the query evaluator would obtain the `fileinfo` from the cache and then feed the node to the expression evaluator.
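A sketch of that evaluation loop for the `since` generator, reusing the illustrative `fileinfo` and `baked_index` types from earlier sketches; `lookup_fileinfo` and `emit_if_matching` are hypothetical stand-ins for the cache/journal read path and the existing expression evaluator:

```
#include <functional>
#include <memory>

// Walk the snapshot from the most recent end, materialize each
// (fileid, tick) via the cache/journal, and hand it to the evaluator.
void run_since_query(
    std::shared_ptr<const baked_index> snapshot,
    tick_t since,
    const std::function<std::shared_ptr<const fileinfo>(file_id, tick_t)>&
        lookup_fileinfo,
    const std::function<void(file_id, const fileinfo&)>& emit_if_matching) {
  for (auto seg = snapshot; seg && seg->tick >= since; seg = seg->prior) {
    for (const auto& [id, tick] : seg->files) {
      if (tick < since) {
        break;  // remainder of this fragment predates the query window
      }
      if (auto info = lookup_fileinfo(id, tick)) {
        emit_if_matching(id, *info);
      }
    }
  }
}
```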
How do we incrementally switch our internal data structures over to this new model? It's going to be a bit awkward to try this holistically, so the high level plan is the stages below:
The `file_id` is really just a smaller, cheaper placeholder for the filename string. This is logically implemented as a bi-directional map of id and string value. The simplistic implementation is to have this map be memory resident, but we may want to consider caching just the hot MRU values in RAM to keep memory usage down.
In the short term, we can refactor the `w_root_t` structure so that it maintains this mapping, and then refactor the `watchman_dir` and `watchman_file` structures so that they are keyed by, and otherwise use, the `w_fileid_t` type in place of the names. We'll add an API for producing a `w_fileid_t` from a `w_string_t` or cstring, and also for producing a `w_string_t` from a `w_fileid_t`. We may want to also add a bulk API where we can pass in some kind of collection (perhaps in the form of the pending list) so that we can do a bulk load from storage in the case where we are using an MRU cache. We can skip that in the initial work.
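A minimal sketch of that bi-directional mapping in its memory-resident form; the real API would work in terms of `w_string_t` / `w_fileid_t`, so `std::string` and the alias below are placeholders:

```
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using w_fileid_t = uint32_t;  // illustrative placeholder; actual width TBD

// Memory-resident bi-directional filename <-> id map.  A production
// version might keep only hot entries in RAM and spill to the journal.
class FileIdMap {
 public:
  w_fileid_t intern(const std::string& name) {
    auto it = ids_.find(name);
    if (it != ids_.end()) {
      return it->second;
    }
    w_fileid_t id = static_cast<w_fileid_t>(names_.size());
    names_.push_back(name);
    ids_.emplace(name, id);
    return id;
  }

  const std::string& name(w_fileid_t id) const {
    return names_.at(id);
  }

 private:
  std::vector<std::string> names_;                    // id -> filename
  std::unordered_map<std::string, w_fileid_t> ids_;   // filename -> id
};
```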
- Add a pluggable API for materializing a `(fileid, tick)` pair as a `watchman_file` struct (a sketch of such an interface follows below)
- We obviously need to be able to publish an instance into this layer, otherwise where will the data come from?
- We want an in-memory implementation and a sqlite based implementation (the latter is not available on win32 by default)
- Add an MRU cache for this layer
- This must be configurable per-root; we will want to tune the cache differently for different repos and workloads
These will all have unit tests as none of this will be connected to the main watchman flow at this stage.
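A sketch of what that pluggable interface might look like; the names and types are placeholders rather than a committed API:

```
#include <cstdint>
#include <memory>

// Placeholder aliases standing in for the real watchman types, as in the
// sketches above.
struct watchman_file;
using w_fileid_t = uint32_t;
using tick_t = uint64_t;

// Pluggable storage layer: one implementation backed by memory, another
// backed by the sqlite journal; an MRU cache can wrap either of them.
class FileStorage {
 public:
  virtual ~FileStorage() = default;

  // Materialize the generation of a file identified by (fileid, tick);
  // returns nullptr if no such generation is stored.
  virtual std::shared_ptr<const watchman_file> get(
      w_fileid_t id, tick_t tick) const = 0;

  // Publish a new generation into the store.
  virtual void put(w_fileid_t id, tick_t tick,
                   std::shared_ptr<const watchman_file> file) = 0;
};
```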
We'll want a better name than the one used throughout this doc, but the steps are:
- Define the chainable structure
- Implement the reference counting
- Snapshotting is just adding a ref to the reference count
- Placeholder merge function
- Implement the recency (and thus all-files) indices + merge implementation
- Implement the suffix index + merge implementation
- Implement the path index + merge implementation
These will all have unit tests as none of this will be connected to the main watchman flow at this stage.
At this stage we can have watchman double-write to its current set of structures and the new index structure.
- When we record a changed file we want to publish it to the caching layer and update the current set of indices
- When we settle we want to run a merge on the index chain
Note that this update path doesn't require the root lock to be held.
With the data being updated in both sets of data structures, we now want to be able to query either or both.
- Refactor the executor so that we can provide alternate implementations of the generators that use the new index as the data source
- Add a flag to the extended query to indicate whether we should use the new or old generators, or use both, emitting them under different keys in the result set, so that integration tests can verify the results.
- Add similar flag/code to the subscription generator
- Augment the test environment to compare the results
Note that the new query path doesn't require that the root lock be held longer than is needed to snapshot the index.
- Add a flag to turn off writing to the old data set
- We can start to roll this out and administratively set which set of data to use as the source of truth and let it soak before we ramp up.
After a find/fix period, we can remove the old code and associated options and tidy the code up.