
Contributor

@tokee tokee commented May 29, 2018

We have two timestamps: crawl_date, which is authoritative from the crawler, and last_modified, which is extracted by Tika from the source data. This pull request adds weekday (Monday, Tuesday, Wednesday...) and time_of_day (16:44:30) for both of these fields. As Solr has no concept of time without a full date, the time_of_day is prefixed with 0001-01-01. Indexing it this way means that date math works as expected. An alternative would be a field such as second_of_day, but that would require translation in the user interface.
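As a rough illustration of the derivation described above (not the code in this pull request), the following Python sketch computes a weekday name and a time of day pinned to 0001-01-01 from a timestamp. The function name `derive_fields` and the dictionary keys are assumptions made for the example; the field names actually used by the indexer may differ.

```python
from datetime import datetime, timezone

# English weekday names, indexed by datetime.weekday() (Monday == 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Dummy date used as a prefix so that Solr can store a time without a real date.
TIME_PREFIX = "0001-01-01T"


def derive_fields(ts: datetime) -> dict:
    """Return hypothetical weekday/time_of_day values for a timezone-aware timestamp.

    The timestamp is normalised to UTC first, matching the fixed-UTC
    behaviour described in this pull request.
    """
    ts_utc = ts.astimezone(timezone.utc)
    return {
        "weekday": WEEKDAYS[ts_utc.weekday()],
        "time_of_day": TIME_PREFIX + ts_utc.strftime("%H:%M:%S") + "Z",
    }


if __name__ == "__main__":
    example = datetime(2018, 5, 29, 16, 44, 30, tzinfo=timezone.utc)
    print(derive_fields(example))
```

Because every time_of_day value shares the same dummy date, ordinary date range queries over the field compare times only.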

The use-cases for last_modified are fairly obvious: e.g. with this, it is possible to find images taken from Friday evening to Saturday morning.
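To make the Friday evening to Saturday morning use case concrete, here is a hedged sketch that builds a Solr query string from two clauses. The field names `last_modified_weekday` and `last_modified_time_of_day`, and the 18:00 and 06:00 cut-offs, are assumptions chosen for illustration; only the standard Solr range syntax `field:[A TO B]` is relied upon.

```python
# Dummy date shared by all time_of_day values, as described in this pull request.
TIME_PREFIX = "0001-01-01T"


def time_range(field: str, start: str, end: str) -> str:
    """Build an inclusive Solr range clause over a time_of_day field."""
    return f"{field}:[{TIME_PREFIX}{start}Z TO {TIME_PREFIX}{end}Z]"


def friday_evening_to_saturday_morning(prefix: str = "last_modified") -> str:
    """Return a query matching Friday 18:00 to Saturday 06:00.

    The "<prefix>_weekday" and "<prefix>_time_of_day" field names are
    hypothetical; substitute whatever the schema actually defines.
    """
    weekday = f"{prefix}_weekday"
    tod = f"{prefix}_time_of_day"
    friday = f"({weekday}:Friday AND {time_range(tod, '18:00:00', '23:59:59')})"
    saturday = f"({weekday}:Saturday AND {time_range(tod, '00:00:00', '06:00:00')})"
    return f"{friday} OR {saturday}"


if __name__ == "__main__":
    print(friday_evening_to_saturday_morning())
```

The interval crosses midnight, so it has to be split into one clause per weekday; a single range over time_of_day cannot express it.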

It is more dubious for crawl_date, as that timestamp does not say much about when the material was created. It might be useful for debugging crawls? I would appreciate input on whether the crawl_date additions should be included or not.

This pull request closes #161.

@tokee tokee self-assigned this May 29, 2018
Contributor Author

tokee commented Aug 21, 2018

I just realized that weekday is highly locale dependent. In this pull request it is fixed at UTC, but that would have to be configurable to be really usable.

Or (the heavy option): Index 24 different terms (one for each 1-hour timezone offset) in the field for each document.
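A minimal sketch of what the heavy option could look like, assuming whole-hour offsets from UTC-11 to UTC+12 (24 values) and a made-up term format of `<offset>:<weekday>`. Neither the offset range nor the term format comes from this pull request.

```python
from datetime import datetime, timedelta, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumption: 24 whole-hour offsets, UTC-11 through UTC+12.
OFFSETS = range(-11, 13)


def weekday_terms(ts: datetime) -> list:
    """Return one hypothetical "<offset>:<weekday>" term per whole-hour offset.

    ts must be timezone-aware. Each term records which weekday the instant
    falls on when viewed at that offset from UTC.
    """
    terms = []
    for offset in OFFSETS:
        local = ts.astimezone(timezone(timedelta(hours=offset)))
        terms.append(f"{offset:+03d}:{WEEKDAYS[local.weekday()]}")
    return terms


if __name__ == "__main__":
    late_tuesday = datetime(2018, 5, 29, 23, 30, 0, tzinfo=timezone.utc)
    for term in weekday_terms(late_tuesday):
        print(term)
```

A query would then pick the term for the user's offset, for example `+01:Wednesday`, at the cost of 24 terms per document and no handling of daylight saving time or fractional-hour zones.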

@tokee tokee added the question label Aug 21, 2018
@anjackson
Contributor

I think this is a good step to help explore this kind of thing. If the locale is made configurable, that should make it usable by folks outside of GMT.

Development

Successfully merging this pull request may close these issues.

Index weekday