Description
An alternative name for this issue is "ripgrep at scale."
I've been thinking about some kind of indexing feature in ripgrep for a long
time. In particular, I created #95 with the hope of adding fulltext support to
ripgrep. But every time I think about it, I'm stymied by the following hurdles:
- The work required to make code oriented relevance search work well.
- The difficulty of correctly synchronizing the state of the index with the
state of the file system.
I think what I'm coming to realize is that we could just declare that we will
do neither of the above two things. My claim is that we will still end up
with something quite useful!
So I'd like to describe my vision and I would love to get feedback from folks.
I'm going to start by talking about the feature at a very high level and end
with a description of the new flags that would be added to ripgrep. I have
not actually built this yet, so none of this is informed by actually using
the tool and figuring out what works. I have built
indexing tools
before, and my day job revolves around information retrieval, so I do have some
relevant background that I'm starting with.
While the below is meant more to describe the UX of the feature, I have added a
few implementation details in places mostly for clarity purposes.
I'd very much appreciate any feedback folks have. I'm especially interested in
whether the overall flow described below makes sense and feels natural (sans
the fact that this initially won't come with anything that keeps the index in
sync). Are there better ways of exposing an indexing feature?
I'd also like to hear feedback if something below doesn't make sense or if a question about how things work isn't answered. In particular, please treat the section that describes flags as if it were docs in a man page. Is what I wrote sufficient? Or does it omit important details?
Search is exhaustive with no ranking
I think my hubris is exposed by ever thinking that I could do code aware
relevance ranking on my own. It's likely something that requires a paid team of
engineers to develop, at least initially. Indeed,
GitHub is supposedly working on this.
The key to simplifying this is to just declare that searching with an index
will do an exhaustive search. There should be no precision/recall trade off and
no ranking. This also very nicely clarifies the target audience for ripgrep
indexing: it is specifically for folks that want searches to go faster and are
willing to put up with creating an index. Typically, this corresponds to
frequently searching a corpus that does not fit into memory. If your corpus
fits into memory, then ripgrep's existing search is probably fast enough.
Use cases like semantic search, ranking results or filtering out noise are
specifically not supported by this simplification. While I think these things
are very useful, I am now fully convinced that they don't belong in ripgrep.
They really require some other kind of tool.
ripgrep will not synchronize your index for you
I personally find the task of getting this correct to be rather daunting. While
I think this could potentially lead to a much better user experience, I think
an initial version of indexing support in ripgrep should avoid trying to do
this, in part to make it easier to ship the feature, and in part so that we
have time to collect feedback about usage patterns.
I do however think that the index should very explicitly support fast
incremental updates that are hard to get wrong. For example:
- ripgrep should avoid creating two different index entries for the same file.
  That is, ripgrep should not allow duplicates in its index.
- By default, ripgrep should not re-index files if their last modified date
  precedes their indexed date. That is, when you re-index a directory, the only
  cost you should pay is the overhead of traversing that directory and the cost
  of checking for and indexing files that have been changed since the last time
  they were indexed.
- Re-indexing a single additional file should have a similar performance
  overhead as re-indexing many additional files at once. While it's likely
  impossible to make the performance identical between these, the big idea here
  is that the indexing process itself should use a type of batching that makes
  re-indexing files easier for the user.
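The second property above amounts to a small, pure decision: compare a file's
last modified time against the time recorded in the index. Here is a sketch of
that rule; the function name and the `force` parameter (standing in for a
force-reindex option) are hypothetical, not ripgrep's actual API:

```rust
use std::time::{Duration, SystemTime};

/// Decide whether a file's contents need to be re-read and re-indexed.
/// `indexed_at` is the time the file was last indexed, or `None` if it
/// has never been indexed. `force` models a force-reindex option.
fn needs_reindex(
    modified: SystemTime,
    indexed_at: Option<SystemTime>,
    force: bool,
) -> bool {
    if force {
        return true;
    }
    match indexed_at {
        // Never indexed before: index it.
        None => true,
        // Re-index only if the file changed after we last indexed it.
        Some(t) => modified > t,
    }
}

fn main() {
    let t0 = SystemTime::UNIX_EPOCH;
    let t1 = t0 + Duration::from_secs(100);
    // Modified before it was indexed: skipped...
    assert!(!needs_reindex(t0, Some(t1), false));
    // ...unless re-indexing is forced.
    assert!(needs_reindex(t0, Some(t1), true));
    // Modified after it was indexed: re-indexed.
    assert!(needs_reindex(t1, Some(t0), false));
}
```

The cost of an incremental re-index is then dominated by directory traversal
and this cheap metadata check, rather than by re-reading file contents.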
If I manage to satisfy these things, then I think it would be fairly
straight-forward to build a quick-n-dirty index synchronizer on top of
something like the notify
crate. So, if
it's easy, why not just put this synchronization logic into ripgrep itself?
Because I have a suspicion that I'm wrong, and moreover, I'm not sure if I want
to get into the business of maintaining a daemon.
Indexes are only indexes, they do not contain file contents
An index generated by ripgrep only contains file paths, potentially some file
metadata and an inverted index identifying which ngrams occur in each file. In
order for ripgrep to actually execute a search, it will still need to read each
file.
An index does not store any offset information, so the only optimizations that
an index provides over a normal search are the following:
- ripgrep does not need to traverse any file hierarchy. It can read all of the
  file paths to search from the index.
- As mentioned above, ripgrep does not need to read and search the contents of
  every file. It only needs to read files that the index believes may
  contain a match. An index may report a false positive (but never a false
  negative), so ripgrep will still need to confirm the match.
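The false-positive-but-never-false-negative behavior falls out of how an
inverted ngram index works. Below is a minimal in-memory sketch using trigrams,
with stored strings standing in for re-reading real files from disk; none of
these type or method names come from ripgrep, and real postings would live
on disk:

```rust
use std::collections::{HashMap, HashSet};

/// A toy inverted index: each trigram maps to the set of file ids whose
/// contents contain that trigram.
#[derive(Default)]
struct Index {
    postings: HashMap<[u8; 3], HashSet<usize>>,
    contents: Vec<String>, // stands in for re-reading files from disk
}

impl Index {
    fn add(&mut self, contents: &str) {
        let id = self.contents.len();
        for w in contents.as_bytes().windows(3) {
            self.postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
        }
        self.contents.push(contents.to_string());
    }

    /// Files that *may* contain the literal `needle`: the intersection of
    /// the postings lists for each of the needle's trigrams. This can
    /// report false positives, but never false negatives.
    fn candidates(&self, needle: &str) -> HashSet<usize> {
        let mut result: Option<HashSet<usize>> = None;
        for w in needle.as_bytes().windows(3) {
            let set = self
                .postings
                .get(&[w[0], w[1], w[2]])
                .cloned()
                .unwrap_or_default();
            result = Some(match result {
                None => set,
                Some(acc) => acc.intersection(&set).cloned().collect(),
            });
        }
        result.unwrap_or_default()
    }

    /// The actual search: confirm each candidate by reading its contents.
    fn search(&self, needle: &str) -> Vec<usize> {
        let mut hits: Vec<usize> = self
            .candidates(needle)
            .into_iter()
            .filter(|&id| self.contents[id].contains(needle))
            .collect();
        hits.sort();
        hits
    }
}

fn main() {
    let mut idx = Index::default();
    idx.add("fn main() {}");
    idx.add("let foobar = 1;");
    // Only the candidate files are read and confirmed.
    assert_eq!(idx.search("foobar"), vec![1]);
    assert!(idx.search("zzz").is_empty());
}
```

A real implementation must also turn an arbitrary regex into trigram
requirements (as codesearch does), but the confirm-by-reading step is the same.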
Moreover, in terms of incrementally updating the index for a particular
directory tree, ripgrep should only need to read the contents of files that
either weren't indexed previously, or are reported by the file system to have
been modified since the last time it was indexed. (ripgrep will have an option
to forcefully re-index everything, in case the last modified time is for some
reason unreliable.)
Also, as mentioned above, in addition to the file path, ripgrep will associate
some metadata with each file. This will include the time at which the file was
indexed, but might also include other metadata (like a file type, other
timestamps, etc.) that might be useful for post-hoc sorting/filtering.
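As an illustration only, the per-file record might look something like the
following; every field name here is hypothetical, and only the indexed-at time
is actually required by the design above:

```rust
use std::path::PathBuf;
use std::time::SystemTime;

/// Hypothetical per-file record stored alongside the inverted index.
struct FileEntry {
    path: PathBuf,
    /// When this file was last indexed; drives incremental re-indexing.
    indexed_at: SystemTime,
    /// Optional extras for post-hoc sorting/filtering.
    modified: Option<SystemTime>,
    file_type: Option<String>, // e.g. one of ripgrep's type names
}

fn main() {
    let entry = FileEntry {
        path: PathBuf::from("src/main.rs"),
        indexed_at: SystemTime::now(),
        modified: None,
        file_type: Some("rust".to_string()),
    };
    assert_eq!(entry.file_type.as_deref(), Some("rust"));
    assert!(entry.path.ends_with("main.rs"));
}
```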
Indexes will probably not be durable
What I mean by this is that an index should never contain critical data. It
should always be able to be re-generated from the source data. This is not to
say that ripgrep will be reckless about index corruption, but its commitment
to durability will almost certainly not rise to the level of an embedded
database. That is, it will likely be possible to cut the power (or network) at
an inopportune moment that will result in a corrupt index.
It's not clear to me whether it's feasible or worth it to detect index
corruption. Needing to, for example, checksum the contents of indexed files on
disk would very likely eat deeply into the performance budget at search time.
There are certainly some nominal checks we can employ that are cheap but will
not rise to the level of robustness that one gets from a checksum.
This is primarily because ripgrep will not provide a daemon that can amortize
this cost. Instead, ripgrep must be able to quickly read the index off disk and
start searching immediately. However, it may be worthwhile to provide the
option to check the integrity of the index.
The primary implication from an end user level here isn't great, but hopefully
it's rare: it will be possible for ripgrep to panic as part of reading an index
in a way that is not a bug in ripgrep. (It is certainly possible to treat
reading the index as a fallible operation and return a proper error instead
when a corrupt index is found, but the fst
crate currently does not do this
and making it do it would be a significant hurdle to overcome.)
In order to prevent routine corruption, ripgrep will adopt Lucene's "segment
indexing" strategy. A more in depth explanation of how this works can be found
elsewhere, but effectively, it represents a partitioning of the index. Once a
segment is written to disk, it is never modified. In order to search the index,
one must consult all of the active segments. Over time, segments can be merged.
(Segmenting is not just done to prevent corruption, but also for decreasing
the latency of incremental index updates.)
Additionally, ripgrep will make use of advisory file locking to synchronize
concurrent operations.
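A toy model of the segment strategy: incremental updates append immutable
segments, a search consults every live segment, and a merge rewrites several
small segments into one. This sketches Lucene's idea in general, not ripgrep's
implementation; real segments would be on-disk files rather than vectors:

```rust
/// A segment is immutable once written.
struct Segment {
    docs: Vec<String>, // stands in for an on-disk index file
}

struct Index {
    segments: Vec<Segment>,
}

impl Index {
    /// An incremental update never rewrites old segments; it appends a
    /// new one. This keeps update latency low, and a crash can only
    /// corrupt the segment currently being written.
    fn add_segment(&mut self, docs: Vec<String>) {
        self.segments.push(Segment { docs });
    }

    /// Searching consults all active segments.
    fn search(&self, needle: &str) -> Vec<&str> {
        self.segments
            .iter()
            .flat_map(|s| s.docs.iter())
            .filter(|d| d.contains(needle))
            .map(|d| d.as_str())
            .collect()
    }

    /// Merging rewrites the small segments into one bigger one, after
    /// which the old segments can be deleted.
    fn merge(&mut self) {
        let docs: Vec<String> =
            self.segments.drain(..).flat_map(|s| s.docs).collect();
        self.segments = vec![Segment { docs }];
    }
}

fn main() {
    let mut idx = Index { segments: vec![] };
    idx.add_segment(vec!["alpha".into(), "beta".into()]);
    idx.add_segment(vec!["beta gamma".into()]);
    assert_eq!(idx.search("beta").len(), 2);
    // Merging changes the segment layout but not the search results.
    idx.merge();
    assert_eq!(idx.segments.len(), 1);
    assert_eq!(idx.search("beta").len(), 2);
}
```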
Indexes may not have a stable format at first
I expect this feature will initially be released in an "experimental" state.
That is, one should expect newer versions of ripgrep to potentially change the
index format. ripgrep will present an error when this happens, where the remedy
will be to re-index your files.
New flags
I don't think this list of flags is exhaustive, but I do think these cover
the main user interactions with an index. I do anticipate other flags being
added, but I think those will be more along the lines of config knobs that
tweak how indexing works.
I've tried to write these as if they were the docs that will end up in
ripgrep's man page. For now, I've prefixed the flags with --x-, since I
suspect there will be a lot of them, and this will help separate them from the
rest of ripgrep's flags.
-X/--index
Enables index search in ripgrep. Without this flag set, ripgrep will never
use an index to search.
ripgrep will find an index to search using the following procedure. Once a
step has found at least one index, subsequent steps are skipped. If an index
could not be found, then ripgrep behaves as if -X
was not given.
1. If any indexes are provided on the command line as directory paths to the
   index itself, then all of them are searched, in sequence.
2. If RIPGREP_INDEX_PATH is set to a valid index, then it is searched.
3. If there exists a .ripgrep directory in the current working directory and
   it contains a valid index, then it is searched.
4. Step (3) is repeated, but for each parent directory of the current working
   directory.
If -X is given twice, then ripgrep will stop searching entirely when an index
is present but the query makes ripgrep unable to use it.
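Steps (3) and (4) of the discovery procedure amount to walking up the
directory tree looking for a .ripgrep directory. A sketch, assuming a
hypothetical find_index helper and using a plain is_dir check where a real
implementation would validate the index:

```rust
use std::path::{Path, PathBuf};

/// Walk from `start` up through its parents, returning the first
/// `.ripgrep` directory found. A real implementation would also check
/// that the directory is a valid index, not merely that it exists.
fn find_index(start: &Path) -> Option<PathBuf> {
    let mut dir = Some(start);
    while let Some(d) = dir {
        let candidate = d.join(".ripgrep");
        if candidate.is_dir() {
            return Some(candidate);
        }
        dir = d.parent();
    }
    None
}

fn main() {
    // Demonstrate with a throwaway directory tree in the temp dir.
    let root = std::env::temp_dir().join("rg-index-demo");
    let nested = root.join("a").join("b");
    std::fs::create_dir_all(nested.join(".ripgrep")).unwrap();
    std::fs::create_dir_all(root.join(".ripgrep")).unwrap();
    // Discovery from the nested directory finds the nearest index first.
    assert_eq!(find_index(&nested), Some(nested.join(".ripgrep")));
    let _ = std::fs::remove_dir_all(&root);
}
```

If the walk reaches the filesystem root without finding anything, ripgrep
behaves as if -X had not been given.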
--x-crud
Indexes one or more directories or files. If a file has already been indexed,
then it is re-indexed. By default, a file is only re-indexed if its last
modified time is more recent than the time it was last indexed. To force
re-indexing, use the --x-force flag. If a file that was previously indexed is
no longer accessible, then it is removed from the index.
The files that are indexed are determined by ripgrep's normal filtering
options.
The location of the index itself is determined by the following procedure. Once
a step has found an index, subsequent steps are skipped.
1. If --x-index-path is set, then the index is written to that path.
2. If RIPGREP_INDEX_PATH is set, then its value is used as the path.
3. If the current working directory contains a .ripgrep path that is a valid
   existing index, then it is updated.
4. Step (3) is repeated, but for each parent directory of the current working
   directory.
5. If an index could not otherwise be determined, then it is written to the
   current working directory at .ripgrep. If .ripgrep already exists and is
   not a valid index, then ripgrep returns an error.
If another ripgrep process is writing to the same index identified via the
above process, then this process will wait until the other is finished.
As an example, the following will create an index of all files (recursively)
in the current directory that pass ripgrep's normal smart filtering:
$ rg --x-crud
And this will search that index:
$ rg -X foobar
Note that the index created previously can now be updated incrementally. For
example, if you know that foo/bar/quux
has changed, then you can run:
$ rg --x-crud foo/bar/quux
to tell ripgrep to re-index just that file (or directory). This works even if
foo/bar/quux
has been deleted, where ripgrep will remove that entry from its
index.
Prior work
I believe there are three popularish tools that accomplish something similar to
what I'm trying to achieve here. That is, indexed search where the input is a
regex.
- Russ Cox's codesearch, which is famously described in his
  "Regular Expression Matching with a Trigram Index" article. Other tools,
  like Hound, are based on codesearch.
- qgrep, which is described in more detail here.
- livegrep, which also provides a very nice front-end for searching. It is
  described in more detail here.
(There are other similarish tools, like
Recoll
and
zoekt,
but these appear to be information retrieval systems, and thus not exhaustive
searching like what I've proposed above.)
ripgrep's index system will most closely resemble Russ Cox's codesearch. That
is, at its core, it is an inverted index with ngrams as its terms. Each ngram
will point to a postings list, which lists all of the files that contain that
ngram. The main downside of codesearch is that it's closer to a proof of
concept of an idea than a productionized thing that will scale. Its most
compelling downside is its performance with respect to incremental updates.
qgrep and livegrep both represent completely different ways to tackle this
problem. qgrep actually keeps a compressed copy of the data from every file
in its index. This makes its index quite a bit larger than codesearch's, but
it can do well with spinning rust hard drives due to sequential reading. qgrep
does support incremental updates, but will eventually require the entire index
to be rebuilt after too many of them. My plan with ripgrep is to make
incremental updates a core part of the design such that a complete re-index is
never necessary.
livegrep uses suffix arrays. I personally haven't played with this tool yet,
mostly because it seems like incremental updates are slow or hard here.