Add autcomplete search for concepts on landing page#98
Conversation
| Entry = tuple[str, str, str] | ||
|
|
||
|
|
||
| def get_concept_nodes(client) -> NodeData: |
There was a problem hiding this comment.
isn't this a bit memory intensive? would it be possible to just use the fact that there's a client available to endpoints in the app and just to write the right Cypher query to do text search over nodes' properties?
MATCH (n:concept)
WHERE lower(n.name) contains lower("substring")
RETURN n
or do something like the full text index in https://neo4j.com/developer/kb/fulltext-search-in-neo4j/
There was a problem hiding this comment.
This is actually a design choice that I recommended to make the search responsive but I agree we should check that this isn't too large for the full database, in which case we should change to an active query solution.
There was a problem hiding this comment.
I tested doing something like the suggested query for text search (instead of the trie-based lookup) on the cell and cell line landscape data, and I couldn't tell the difference in the autocomplete response time when typing in search box. This was also without creating an index on the names.
I do worry that an index might be needed for larger datasets to make it fast enough and that would require more memory usage.
An alternative is to make it a pure search where the button is pushed and then the user waits for the search results rather than getting suggestions while typing.
There was a problem hiding this comment.
See branch https://github.com/kkaris/semra/tree/fulltext-index for implemented fulltext search.
For the raw data, building this index takes ~15 GB out of the total memory usage of ~48 GB, so we'd have to decide if that's worth it.
There was a problem hiding this comment.
The fulltext-index branch is now incorporated into this branch (search-box).
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #98 +/- ##
===========================================
+ Coverage 29.79% 47.58% +17.78%
===========================================
Files 31 46 +15
Lines 2292 3451 +1159
Branches 412 487 +75
===========================================
+ Hits 683 1642 +959
- Misses 1572 1695 +123
- Partials 37 114 +77 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| exist_ok=True, | ||
| ) | ||
| # Create btree index for concept curies and evidence mapping_justification | ||
| self.create_single_property_node_index( |
There was a problem hiding this comment.
are the second two non-fulltext indexes used?
There was a problem hiding this comment.
Yes, to speed up the edge query when getting the connected component:
https://github.com/kkaris/semra/blob/search-box/src/semra/client.py#L600-L614
// There is a mapping between the two concepts
MATCH p=(a:concept)-[r]->(b:concept)
// We look up all mappings connecting them, making sure that we maintain
// source-target semantics to only pull out paths for p that correspond
// to the direction of the mapping
MATCH (a)<-[:`owl:annotatedSource`]-(m:mapping)-[:`owl:annotatedTarget`]->(b)
// Traverse the mapping to get to the evidence supporting it
MATCH q=(m)-[:hasEvidence]-(e:evidence)
WHERE a <> b
AND a.curie in $curies AND b.curie in $curies
// Make sure the evidence is not an inversion of chaining
AND NOT (e.mapping_justification IN ['{CHAIN_MAPPING.curie}', '{INVERSION_MAPPING.curie}'])
RETURN p|
|
||
| if not relation_constraint: | ||
| relation_constraint = self._rel_q | ||
| if ":" in relation_constraint: |
There was a problem hiding this comment.
can you add a test for this, please?
cthoyt
left a comment
There was a problem hiding this comment.
Thanks @kkaris!
I did a bit of refactoring this to simplify it and better reuse pydantic models. I also added the CURIE to the autocomplete dropdown.
Time permitting, could you do the following:
- Add some tests
- Make autocomplete work on CURIEs, too. If you do it this way, please also show the names in the autocomplete box
| OPTIONS {{ | ||
| indexConfig: {{ | ||
| `fulltext.analyzer`: 'unicode_whitespace', | ||
| `fulltext.eventually_consistent`: true |
There was a problem hiding this comment.
can we make this case insensitive?
There was a problem hiding this comment.
Not without doing some significant changes, the problem is that the case-insensitive analyzers also tokenize on ':' so we would basically lose search for curies (it still matches on say bto:0123 if you provide bto, but not when bto: with : is provided). I checked and this trade-off exists for all analyzers available in the community edition of Neo4j. We can hack around this by adding a new property n.name_lc = toLower(n.name), then search on n.name_lc instead of n.name, but still return n.name.
We could also switch back to just using the b-tree index for names or curies instead.
Here is an enumeration of the options:
- full text search with case sensitive match and match on curies (current implementation)
- full text search with case insensitive match but no match on curies
- b-tree based prefix search: case insensitive and match on curies, but only pure prefix available, so no fuzzy matching or full word match (see below).
- (hack) Add
n.name_lc = toLower(n.name)and keep the current fulltext search function but search on the lowecase version with lowercased search term and returnn.name.
The difference in results from b-tree based and fulltext is if I search blood cell I would get these results:
-
Fulltext:
- "blood cell"
- "blood platelet"
- "peripheral blood cell"
-
b-tree based:
- "blood cell"
- "blood cells"
- "blood cellular component"
What should we do?
There was a problem hiding this comment.
I think I would go with the hack so we can have it both ways - I want to be able to do case insensitive search and also CURIE search. It's not great, but this can be added as part of the python code that builds the index after the fact, right? Then we don't have to change the pipeline that creates the database.
If it's not possible without changing the DB generation code, then I'd go for option 1 when we just leave this for future work
There was a problem hiding this comment.
I wasn't able to ad-hoc add the lowecase curies with a MATCH ... SET n.curie_lc = toLower(n.curie) on the raw data as I tried for 10+ minutes and nothing happened (I believe all curies have lowercase prefixes, but I tried it to check how expensive the operation would be).
The way to add this would be to add the lowercase names/curies at build time, so I think case-insensitive matching would have to be a future feature.
There was a problem hiding this comment.
great, then let's address this as a low priority future feature, and not worry about it for the revision of the paper
It should already work on curies. |
|
@kkaris please let me know when you're done, and this can be merged thanks again for the effort |
All done from my end. |
Summary
Per title:
This PR adds a search box on the landing page (at
/), with autocomplete capabilities.Files changed/added:
autocompleteinsemra.webcontaining:__init__.pywith classConceptsTrie, to hold the autocomplete lookup andget_concept_nodesto query concept nodes from the graph dbautocomplete_blueprint.pycontaining the blueprint that handles GET requests to the prefix searchwsgi.pyto add the autocomplete blueprintpyproject.tomlto addpytrieas dependencyhome.htmlwith search box and JS to handle the prefix searchResolves #94