Currently, Histograph does the following:
- API queries Elasticsearch (e.g. q=utrecht), ES returns list of PITs
- List of PITs probably contains many forms and spellings of Utrecht, and maybe some results like Abcoude bij Utrecht
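- Those PITs are sent to the Neo4j Plugin, a BFS is computed for each PIT, and the resulting subgraphs/concepts/klonten (clumps) are returned, ordered by number of PITs per concept
- This may cause Abcoude to show up first in the list of results, for example when the Abcoude concept happens to contain more PITs than the Utrecht concept.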
- This is wrong!
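
The current ordering can be sketched roughly as follows. This is only an illustration and not the actual Neo4j Plugin code: the concept names, the PIT names and the `rank_by_pit_count` helper are all made up, and the data is chosen so that the Abcoude concept happens to be the larger one.

```python
from typing import Dict, List

# Hypothetical BFS output: concept name -> PITs that ended up in that concept.
# The PIT names are invented for illustration only.
concepts: Dict[str, List[str]] = {
    "utrecht": ["Utrecht", "Utrecht (stad)", "Uutrecht"],
    "abcoude": ["Abcoude", "Abcoude bij Utrecht", "Abcou", "Abkoude", "Abcoude-Baambrugge"],
}


def rank_by_pit_count(concepts: Dict[str, List[str]]) -> List[str]:
    """Order concepts by number of PITs per concept, largest first."""
    return sorted(concepts, key=lambda name: len(concepts[name]), reverse=True)


# The query was q=utrecht, but the larger Abcoude concept is ranked first.
print(rank_by_pit_count(concepts))  # ['abcoude', 'utrecht']
```

The ordering only looks at concept size, so how well each PIT matched the query plays no role.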
Possible solution:
- API queries Elasticsearch (e.g. q=utrecht), ES returns list of PITs
- List of PITs probably contains many forms and spellings of Utrecht, and maybe some results like Abcoude bij Utrecht
- Those PITs are sent to Neo4j Plugin, together with their respective Elasticsearch score
- BFSs are computed for each PIT, just like before, but now the Neo4j Plugin orders the list of resulting concepts by (ES hits * ES score per concept) / PITs per concept
- This way, the concept of Utrecht will have many ES hits (and high ES scores, too), while the concept of Abcoude will have at least one ES hit (Abcoude bij Utrecht), but probably not many more. The new ordering will make sure the Abcoude concept is not returned first.
- This is better!
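
A minimal sketch of the proposed ordering, under stated assumptions. The formula above leaves open what "ES score per concept" means; here it is read as the mean ES score of the PITs in the concept that matched the query, and "ES hits" as the number of such PITs. The `Concept` class, the `relevance` and `rank` helpers and all numbers are hypothetical and not taken from the Histograph code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Concept:
    name: str
    pit_count: int           # total number of PITs in the concept (from the BFS)
    hit_scores: List[float]  # ES scores of the PITs in this concept that matched the query


def relevance(concept: Concept) -> float:
    """(ES hits * ES score per concept) / PITs per concept.

    Assumption: "ES score per concept" is the mean score of the matching PITs.
    """
    if concept.pit_count == 0 or not concept.hit_scores:
        return 0.0
    hits = len(concept.hit_scores)
    mean_score = sum(concept.hit_scores) / hits
    return hits * mean_score / concept.pit_count


def rank(concepts: List[Concept]) -> List[Concept]:
    """Order concepts by relevance, most relevant first."""
    return sorted(concepts, key=relevance, reverse=True)


# Invented example for q=utrecht: Utrecht has many good hits in a small concept,
# Abcoude has a single weak hit ("Abcoude bij Utrecht") in a larger concept.
concepts = [
    Concept("abcoude", pit_count=8, hit_scores=[0.7]),
    Concept("utrecht", pit_count=5, hit_scores=[2.0, 1.8, 1.9, 2.1]),
]

print([c.name for c in rank(concepts)])  # ['utrecht', 'abcoude']
```

Under this reading the numerator is just the sum of the ES scores of the matching PITs, so a concept ranks high when a large share of its PITs match the query well. If "ES score per concept" is meant as something else (for example the maximum score), only `relevance` would need to change.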