We have a lot of fields in the search index. Not all of them are necessarily useful, and not all of them reflect the best way to use Elasticsearch. We should clean them up and then document them on:
A lot of these are documented here, but, especially because of Gutenberg, I am not certain that we should keep all of them. Also, any fields that we decide not to document should probably be removed from the index.
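One way to start the audit is to diff the field names in the index mapping against the list of documented fields. The following is only a rough sketch: it assumes the mapping's `properties` section has already been fetched as a plain dict (for example from the Elasticsearch get-mapping API), and the helper names, the sample mapping, and the field types in it are made up for illustration, not taken from our actual index.

```python
def flatten_mapping(properties, prefix=""):
    """Yield dotted field paths from the 'properties' section of a mapping."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        yield path
        # Object and nested fields keep their children under "properties".
        if "properties" in spec:
            yield from flatten_mapping(spec["properties"], prefix=f"{path}.")


def undocumented_fields(mapping_properties, documented):
    """Return the mapped field paths that are missing from the documented set."""
    return sorted(set(flatten_mapping(mapping_properties)) - set(documented))


# Hypothetical sample data: the types here are placeholders, not our real mapping.
example_properties = {
    "title": {"type": "text"},
    "location": {"type": "keyword"},
    "langs": {"properties": {"en": {"type": "float"}}},
}
documented = {"title"}

print(undocumented_fields(example_properties, documented))
```

Anything this reports is a candidate for either documenting or removing from the index.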
Fields that exist but are not documented:
- the "deprecated" url fields
- location
- langs field (probability of each language)
- all of the content extraction stuff: