Add autocomplete search for concepts on landing page #98

Merged
cthoyt merged 71 commits into biopragmatics:main from kkaris:search-box
Jul 17, 2025
Conversation

Contributor

@kkaris kkaris commented Jul 8, 2025

Summary

Per title:
This PR adds a search box on the landing page (at /), with autocomplete capabilities.

Files changed/added:

  • Add module autocomplete in semra.web containing:
    • __init__.py with class ConceptsTrie, to hold the autocomplete lookup and get_concept_nodes to query concept nodes from the graph db
    • autocomplete_blueprint.py containing the blueprint that handles GET requests to the prefix search
  • Update wsgi.py to add the autocomplete blueprint
  • Update pyproject.toml to add pytrie as dependency
  • Update home.html with search box and JS to handle the prefix search
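
To illustrate the kind of in-memory prefix lookup the summary describes, here is a minimal sketch of a character trie in plain Python. It is not the PR's `ConceptsTrie` and does not use the pytrie API; the class names `TrieNode` and `PrefixIndex` and the `ex:` CURIEs are hypothetical placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Child nodes keyed by the next character of the (lowercased) name
    children: dict[str, TrieNode] = field(default_factory=dict)
    # (name, curie) pairs whose lowercased name ends exactly at this node
    entries: list[tuple[str, str]] = field(default_factory=list)


class PrefixIndex:
    """Toy character trie mapping lowercased names to (name, curie) entries."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, name: str, curie: str) -> None:
        node = self.root
        for char in name.lower():
            node = node.children.setdefault(char, TrieNode())
        node.entries.append((name, curie))

    def search(self, prefix: str, limit: int = 10) -> list[tuple[str, str]]:
        # Walk down to the node that represents the prefix
        node = self.root
        for char in prefix.lower():
            if char not in node.children:
                return []
            node = node.children[char]
        # Depth-first walk of the subtree, visiting children alphabetically
        results: list[tuple[str, str]] = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.entries)
            for char in sorted(current.children, reverse=True):
                stack.append(current.children[char])
        return results[:limit]


index = PrefixIndex()
index.add("blood cell", "ex:1")   # placeholder CURIEs, not real identifiers
index.add("blood cells", "ex:2")
index.add("B cell", "ex:3")
print(index.search("BLOOD c"))
```

Every name is held in memory, which is what makes the lookup fast and also what raises the memory question for large graphs.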

Resolves #94

Comment thread src/semra/web/autocomplete/autocomplete_blueprint.py Outdated
Comment thread src/semra/web/autocomplete/__init__.py Outdated
Entry = tuple[str, str, str]


def get_concept_nodes(client) -> NodeData:
Member

isn't this a bit memory intensive? would it be possible to just use the fact that there's a client available to endpoints in the app and just to write the right Cypher query to do text search over nodes' properties?

MATCH (n:concept)
WHERE lower(n.name) contains lower("substring")
RETURN n

or do something like the full text index in https://neo4j.com/developer/kb/fulltext-search-in-neo4j/
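
A hedged sketch of how the substring query above could be issued with parameters rather than string interpolation. The function name `build_substring_query` and the returned column names are assumptions for illustration, not part of the PR; it uses `toLower`, the Cypher function that appears later in this thread.

```python
from __future__ import annotations


def build_substring_query(term: str, limit: int = 10) -> tuple[str, dict[str, object]]:
    """Return a parameterized Cypher query and its parameters.

    The search term is lowercased in Python and passed as $term, so user
    input never ends up inside the query text itself.
    """
    query = (
        "MATCH (n:concept) "
        "WHERE toLower(n.name) CONTAINS $term "
        "RETURN n.curie AS curie, n.name AS name "
        "LIMIT $limit"
    )
    return query, {"term": term.lower(), "limit": limit}


query, params = build_substring_query("Blood")
print(query)
print(params)
```

The query text and the parameter dictionary would then be handed to whatever client executes Cypher in the app. Without an index, this scans every concept node on each keystroke.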

Contributor

This is actually a design choice that I recommended to make the search responsive, but I agree we should check that this isn't too large for the full database. If it is, we should switch to an active query solution.

Contributor Author
@kkaris kkaris Jul 9, 2025

I tested something like the suggested text-search query (instead of the trie-based lookup) on the cell and cell line landscape data, and I couldn't tell the difference in autocomplete response time when typing in the search box. This was also without creating an index on the names.

I do worry that an index might be needed for larger datasets to make it fast enough and that would require more memory usage.

An alternative is to make it a pure search where the button is pushed and then the user waits for the search results rather than getting suggestions while typing.

Contributor Author

See branch https://github.com/kkaris/semra/tree/fulltext-index for implemented fulltext search.

For the raw data, building this index takes ~15 GB out of the total memory usage of ~48 GB, so we'd have to decide if that's worth it.

Contributor Author
@kkaris kkaris Jul 10, 2025

The fulltext-index branch is now incorporated into this branch (search-box).

codecov Bot commented Jul 14, 2025

Codecov Report

Attention: Patch coverage is 52.11268% with 34 lines in your changes missing coverage. Please review.

Project coverage is 47.58%. Comparing base (8888095) to head (ae79db6).
Report is 101 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/semra/client.py | 31.11% | 30 Missing and 1 partial ⚠️ |
| src/semra/wsgi.py | 88.88% | 1 Missing and 1 partial ⚠️ |
| src/semra/web/fastapi_components.py | 83.33% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #98       +/-   ##
===========================================
+ Coverage   29.79%   47.58%   +17.78%     
===========================================
  Files          31       46       +15     
  Lines        2292     3451     +1159     
  Branches      412      487       +75     
===========================================
+ Hits          683     1642      +959     
- Misses       1572     1695      +123     
- Partials       37      114       +77     

Comment thread src/semra/client.py Outdated
Comment thread src/semra/client.py Outdated
Comment thread src/semra/client.py
exist_ok=True,
)
# Create btree index for concept curies and evidence mapping_justification
self.create_single_property_node_index(
Member

are the second two non-fulltext indexes used?

Contributor Author
@kkaris kkaris Jul 16, 2025

Yes, to speed up the edge query when getting the connected component:
https://github.com/kkaris/semra/blob/search-box/src/semra/client.py#L600-L614

// There is a mapping between the two concepts
MATCH p=(a:concept)-[r]->(b:concept)
// We look up all mappings connecting them, making sure that we maintain
// source-target semantics to only pull out paths for p that correspond
// to the direction of the mapping
MATCH (a)<-[:`owl:annotatedSource`]-(m:mapping)-[:`owl:annotatedTarget`]->(b)
// Traverse the mapping to get to the evidence supporting it
MATCH q=(m)-[:hasEvidence]-(e:evidence)
WHERE a <> b
AND a.curie in $curies AND b.curie in $curies
// Make sure the evidence is not an inversion or chaining
AND NOT (e.mapping_justification IN ['{CHAIN_MAPPING.curie}', '{INVERSION_MAPPING.curie}'])
RETURN p

Comment thread src/semra/client.py

if not relation_constraint:
relation_constraint = self._rel_q
if ":" in relation_constraint:
Member

can you add a test for this, please?

Contributor Author
@kkaris kkaris Jul 16, 2025

Resolved in 07fdd9a?

Member
@cthoyt cthoyt left a comment

Thanks @kkaris!

I refactored this a bit to simplify it and better reuse pydantic models. I also added the CURIE to the autocomplete dropdown.

Time permitting, could you do the following:

  1. Add some tests
  2. Make autocomplete work on CURIEs, too. If you do it this way, please also show the names in the autocomplete box

Comment thread src/semra/client.py
OPTIONS {{
indexConfig: {{
`fulltext.analyzer`: 'unicode_whitespace',
`fulltext.eventually_consistent`: true
Member

can we make this case insensitive?

Contributor Author
@kkaris kkaris Jul 16, 2025

Not without significant changes. The problem is that the case-insensitive analyzers also tokenize on ':', so we would basically lose search on curies (a query for bto still matches, say, bto:0123, but a query for bto: with the colon does not). I checked, and this trade-off exists for all analyzers available in the community edition of Neo4j. We can hack around this by adding a new property n.name_lc = toLower(n.name), then searching on n.name_lc instead of n.name while still returning n.name.

We could also switch back to just using the b-tree index for names or curies instead.

Here is an enumeration of the options:

  1. full text search with case sensitive match and match on curies (current implementation)
  2. full text search with case insensitive match but no match on curies
  3. b-tree based prefix search: case insensitive and match on curies, but only pure prefix available, so no fuzzy matching or full word match (see below).
  4. (hack) Add n.name_lc = toLower(n.name) and keep the current fulltext search function, but search on the lowercase version with a lowercased search term and return n.name.

The difference between the b-tree based and fulltext results: if I search for "blood cell", I would get these results:

  • Fulltext:

    • "blood cell"
    • "blood platelet"
    • "peripheral blood cell"
  • b-tree based:

    • "blood cell"
    • "blood cells"
    • "blood cellular component"

    What should we do?
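
The trade-off listed above can be mimicked with a toy model in plain Python. This only illustrates the behaviour described in the comment: the two tokenizer functions are simplified stand-ins and not Lucene's or Neo4j's actual analyzers, and the matching functions do no ranking.

```python
from __future__ import annotations

import re

# Names taken from the example results quoted in this thread
NAMES = [
    "blood cell",
    "blood cells",
    "blood cellular component",
    "blood platelet",
    "peripheral blood cell",
]


def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only and keep case (toy case-sensitive analyzer)."""
    return text.split()


def lowercase_alnum_tokens(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def prefix_matches(query: str, names: list[str]) -> list[str]:
    """Pure prefix search: case-insensitive, but no whole-word matching."""
    return [name for name in names if name.lower().startswith(query.lower())]


def any_token_matches(query: str, names: list[str]) -> list[str]:
    """Token search: a name matches if it shares any whole token with the query."""
    wanted = set(whitespace_tokens(query))
    return [name for name in names if wanted & set(whitespace_tokens(name))]


# A CURIE survives whitespace tokenization but is split by the lowercasing one
print(whitespace_tokens("bto:0123"))
print(lowercase_alnum_tokens("BTO:0123"))

# Prefix search only finds names that start with the query;
# token search also finds names that merely contain one of its words
print(prefix_matches("blood cell", NAMES))
print(any_token_matches("blood cell", NAMES))
```

In this model, the whitespace tokenizer keeps `bto:0123` intact but treats `Blood` and `blood` as different tokens, while the lowercasing tokenizer fixes case and breaks the CURIE apart.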

Member

I think I would go with the hack so we can have it both ways - I want to be able to do case insensitive search and also CURIE search. It's not great, but this can be added as part of the python code that builds the index after the fact, right? Then we don't have to change the pipeline that creates the database.

If it's not possible without changing the DB generation code, then I'd go for option 1 and just leave this for future work.

Contributor Author
@kkaris kkaris Jul 16, 2025

I wasn't able to add the lowercase curies ad hoc with a MATCH ... SET n.curie_lc = toLower(n.curie) on the raw data: I tried for 10+ minutes and nothing happened. (I believe all curies have lowercase prefixes, but I tried it to check how expensive the operation would be.)

The way to add this would be to add the lowercase names/curies at build time, so I think case-insensitive matching would have to be a future feature.
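
As a rough sketch of what adding the lowercase fields at build time could look like, the helper below enriches a node record before it is written out. The function name `with_lowercase_fields` and the dictionary-shaped record are assumptions for illustration; the actual database build pipeline is not shown in this thread. Only the property names `name_lc` and `curie_lc` come from the discussion above.

```python
from __future__ import annotations


def with_lowercase_fields(node: dict[str, str]) -> dict[str, str]:
    """Return a copy of a node record with lowercased search fields added.

    The original "name" and "curie" values are kept unchanged so they can
    still be returned to the user; only the *_lc copies are meant for search.
    """
    enriched = dict(node)
    enriched["name_lc"] = node.get("name", "").lower()
    enriched["curie_lc"] = node.get("curie", "").lower()
    return enriched


# Placeholder record; the name is made up for the example
record = {"curie": "bto:0123", "name": "Example Cell Line"}
print(with_lowercase_fields(record))
```

Computing the lowercase copies while the records are generated avoids a separate pass over the finished graph.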

Contributor Author

Added issue for this: #100.

Member

great, then let's address this as a low-priority future feature and not worry about it for the revision of the paper

Contributor Author

kkaris commented Jul 16, 2025

@cthoyt

> 1. Make autocomplete work on CURIEs, too. If you do it this way, please also show the names in the autocomplete box

It should already work on curies.

Member

cthoyt commented Jul 16, 2025

@kkaris please let me know when you're done, and this can be merged

thanks again for the effort

Contributor Author

kkaris commented Jul 16, 2025

> @kkaris please let me know when you're done, and this can be merged
>
> thanks again for the effort

All done from my end.

@cthoyt cthoyt enabled auto-merge (squash) July 17, 2025 11:40
@cthoyt cthoyt merged commit 47e555c into biopragmatics:main Jul 17, 2025
11 checks passed
@kkaris kkaris deleted the search-box branch July 17, 2025 15:16

Development

Successfully merging this pull request may close these issues.

Add search functionality to SeMRA web application

3 participants