Add autocomplete search for concepts on landing page by kkaris · Pull Request #98 · biopragmatics/semra

kkaris · 2025-07-08T20:54:07Z

Summary

Per title:
This PR adds a search box on the landing page (at /), with autocomplete capabilities.

Files changed/added:

- Add module autocomplete in semra.web containing:
  - __init__.py with class ConceptsTrie, to hold the autocomplete lookup, and get_concept_nodes, to query concept nodes from the graph db
  - autocomplete_blueprint.py containing the blueprint that handles GET requests to the prefix search
- Update wsgi.py to add the autocomplete blueprint
- Update pyproject.toml to add pytrie as a dependency
- Update home.html with search box and JS to handle the prefix search
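
For readers unfamiliar with prefix-based autocomplete, the following is a minimal sketch of the kind of lookup that a class like ConceptsTrie could provide. It is not the code from this PR: the `PrefixIndex` class, its methods, and the `ex:` CURIEs are hypothetical, and it uses a sorted list with `bisect` from the standard library as a stand-in for a real trie such as pytrie.

```python
import bisect

# Hypothetical entry layout: (lowercased name, display name, CURIE).
Entry = tuple[str, str, str]


class PrefixIndex:
    """Toy prefix lookup over concept names (a stand-in for a real trie)."""

    def __init__(self, concepts: list[tuple[str, str]]) -> None:
        # concepts holds (curie, name) pairs. Sorting on the lowercased name
        # makes all names that share a prefix contiguous.
        self._entries: list[Entry] = sorted(
            (name.lower(), name, curie) for curie, name in concepts
        )
        self._keys = [entry[0] for entry in self._entries]

    def search(self, prefix: str, top_n: int = 10) -> list[Entry]:
        """Return up to top_n entries whose name starts with the prefix."""
        prefix = prefix.lower()
        # First position whose key is >= prefix; matches start here, if any.
        start = bisect.bisect_left(self._keys, prefix)
        results: list[Entry] = []
        for entry in self._entries[start:]:
            if not entry[0].startswith(prefix) or len(results) >= top_n:
                break
            results.append(entry)
        return results


# Made-up example data; the CURIEs are placeholders, not real identifiers.
index = PrefixIndex([
    ("ex:1", "blood cell"),
    ("ex:2", "Blood platelet"),
    ("ex:3", "peripheral blood cell"),
])
matches = index.search("blood")
```

Everything in this sketch lives in memory, which is the cost the review discussion below is concerned with.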

Resolves #94

cthoyt · 2025-07-08T21:18:27Z

+Entry = tuple[str, str, str]
+
+
+def get_concept_nodes(client) -> NodeData:


Isn't this a bit memory intensive? Would it be possible to use the fact that a client is already available to endpoints in the app and write the right Cypher query to do text search over nodes' properties?

MATCH (n:concept) WHERE lower(n.name) contains lower("substring") RETURN n

or do something like the full text index in https://neo4j.com/developer/kb/fulltext-search-in-neo4j/
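
The substring query above could be issued from Python with the search term passed as a query parameter rather than interpolated into the Cypher string. The sketch below only builds the query text and its parameters. The function name `build_name_search` and the `limit` argument are hypothetical, and it uses `toLower`, the Cypher function that appears later in this thread.

```python
def build_name_search(substring: str, limit: int = 10) -> tuple[str, dict[str, object]]:
    """Return a Cypher query and parameters for a case-insensitive substring match."""
    query = (
        "MATCH (n:concept) "
        "WHERE toLower(n.name) CONTAINS toLower($substring) "
        "RETURN n LIMIT $limit"
    )
    # The user's text travels as a parameter, never as part of the query string.
    return query, {"substring": substring, "limit": limit}


query, params = build_name_search("blood cell")
```

With the official Neo4j Python driver, such a pair could be passed as `session.run(query, params)`. How it would be wired into the client used by this app is not shown here.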

This is actually a design choice that I recommended to make the search responsive, but I agree we should check that this isn't too large for the full database. If it is, we should switch to an active query solution.

I tested a query like the one suggested (instead of the trie-based lookup) on the cell and cell line landscape data, and I couldn't tell the difference in the autocomplete response time when typing in the search box. This was without creating an index on the names.

I do worry that an index might be needed to make it fast enough for larger datasets, and that would require more memory.

An alternative is to make it a pure search where the button is pushed and then the user waits for the search results rather than getting suggestions while typing.

See branch https://github.com/kkaris/semra/tree/fulltext-index for implemented fulltext search.

For the raw data, building this index takes ~15 GB out of the total memory usage of ~48 GB, so we'd have to decide if that's worth it.

The fulltext-index branch is now incorporated into this branch (search-box).

codecov · 2025-07-14T17:31:43Z

Codecov Report

Attention: Patch coverage is 52.11268% with 34 lines in your changes missing coverage. Please review.

Project coverage is 47.58%. Comparing base (8888095) to head (ae79db6).
Report is 101 commits behind head on main.

Files with missing lines	Patch %	Lines
src/semra/client.py	31.11%	30 Missing and 1 partial ⚠️
src/semra/wsgi.py	88.88%	1 Missing and 1 partial ⚠️
src/semra/web/fastapi_components.py	83.33%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #98       +/-   ##
===========================================
+ Coverage   29.79%   47.58%   +17.78%     
===========================================
  Files          31       46       +15     
  Lines        2292     3451     +1159     
  Branches      412      487       +75     
===========================================
+ Hits          683     1642      +959     
- Misses       1572     1695      +123     
- Partials       37      114       +77

cthoyt · 2025-07-16T14:41:06Z

+            exist_ok=True,
+        )
+        # Create btree index for concept curies and evidence mapping_justification
+        self.create_single_property_node_index(


are the second two non-fulltext indexes used?

Yes, to speed up the edge query when getting the connected component:
https://github.com/kkaris/semra/blob/search-box/src/semra/client.py#L600-L614

// There is a mapping between the two concepts
MATCH p=(a:concept)-[r]->(b:concept)
// We look up all mappings connecting them, making sure that we maintain
// source-target semantics to only pull out paths for p that correspond
// to the direction of the mapping
MATCH (a)<-[:`owl:annotatedSource`]-(m:mapping)-[:`owl:annotatedTarget`]->(b)
// Traverse the mapping to get to the evidence supporting it
MATCH q=(m)-[:hasEvidence]-(e:evidence)
WHERE a <> b
    AND a.curie in $curies
    AND b.curie in $curies
    // Make sure the evidence is not an inversion of chaining
    AND NOT (e.mapping_justification IN ['{CHAIN_MAPPING.curie}', '{INVERSION_MAPPING.curie}'])
RETURN p

cthoyt · 2025-07-16T14:41:28Z


        if not relation_constraint:
            relation_constraint = self._rel_q
+        if ":" in relation_constraint:


can you add a test for this, please?

Resolved in 07fdd9a ?
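
The check for ":" is presumably there because a relationship type such as owl:annotatedSource is only valid in Cypher when wrapped in backticks. A helper along the following lines would cover that case. This is a sketch, not the helper added in the PR: the name `escape_identifier` is hypothetical, and the pattern for "safe" names is deliberately conservative, since quoting a name that did not need it is harmless. It relies on Cypher's rule that a backtick inside a quoted identifier is escaped by doubling it.

```python
import re

# Names made only of ASCII letters, digits, and underscores (starting with a
# letter) are left unquoted; everything else is wrapped in backticks.
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def escape_identifier(name: str) -> str:
    """Quote a label or relationship type with backticks when needed."""
    if _SAFE_IDENTIFIER.match(name):
        return name
    # Double any embedded backtick so it cannot terminate the quoted name.
    return "`" + name.replace("`", "``") + "`"


quoted = escape_identifier("owl:annotatedSource")
plain = escape_identifier("hasEvidence")
```

A test for such a helper would check at least three cases: a name with a colon, a name that needs no quoting, and a name that itself contains a backtick.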

cthoyt

Thanks @kkaris!

I did a bit of refactoring to simplify this and better reuse the pydantic models. I also added the CURIE to the autocomplete dropdown.

Time permitting, could you do the following:

- Add some tests
- Make autocomplete work on CURIEs, too. If you do it this way, please also show the names in the autocomplete box

cthoyt · 2025-07-16T16:30:06Z

+        OPTIONS {{
+            indexConfig: {{
+                `fulltext.analyzer`: 'unicode_whitespace',
+                `fulltext.eventually_consistent`: true


can we make this case insensitive?

Not without significant changes. The problem is that the case-insensitive analyzers also tokenize on ':', so we would essentially lose search on CURIEs: a search for bto still matches bto:0123, but a search for bto: (with the colon) does not. I checked, and this trade-off exists for all analyzers available in the community edition of Neo4j. We could hack around this by adding a new property n.name_lc = toLower(n.name), searching on n.name_lc instead of n.name, and still returning n.name.

We could also switch back to just using the b-tree index for names or curies instead.

Here is an enumeration of the options:

1. Full-text search with case-sensitive matching and matching on CURIEs (current implementation).

2. Full-text search with case-insensitive matching but no matching on CURIEs.

3. B-tree based prefix search: case insensitive and matches on CURIEs, but only pure prefix matching is available, so no fuzzy matching or full-word matching (see below).

4. (Hack) Add n.name_lc = toLower(n.name) and keep the current full-text search function, but search on the lowercase version with a lowercased search term and return n.name.
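
The trade-off between the first two options comes down to how an analyzer splits text into tokens. The toy functions below are not Neo4j's analyzers. They only mimic the two behaviours described in this thread (split on whitespace and keep case, versus lowercase and split on anything that is not a letter or digit) to show why a query for bto: stops matching under the second. The function names are hypothetical, and the query is compared as raw text, which a real full-text index would not do.

```python
import re


def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only and keep case, so a CURIE stays one token."""
    return text.split()


def lowercase_alnum_tokens(text: str) -> list[str]:
    """Lowercase, then split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def prefix_match(query: str, tokens: list[str]) -> bool:
    """True if any indexed token starts with the query string."""
    return any(token.startswith(query) for token in tokens)


curie = "bto:0123"
case_sensitive = whitespace_tokens(curie)         # ['bto:0123']
case_insensitive = lowercase_alnum_tokens(curie)  # ['bto', '0123']

colon_hit_sensitive = prefix_match("bto:", case_sensitive)      # True
colon_hit_insensitive = prefix_match("bto:", case_insensitive)  # False
bare_hit_insensitive = prefix_match("bto", case_insensitive)    # True
```

Once the colon is treated as a separator, no indexed token contains it, so a query that includes the colon has nothing to match against.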

The difference between the b-tree based and full-text results shows up if I search for blood cell. I would get these results:

Fulltext:

"blood cell"

"blood platelet"

"peripheral blood cell"

b-tree based:

"blood cell"

"blood cells"

"blood cellular component"
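
The two result lists differ because the approaches answer different questions: a prefix search asks whether the whole name starts with the query, while a token-based search scores names by the query words they contain. The sketch below imitates both on the names from the lists above. The scoring is a deliberately crude stand-in (a count of shared whole words), not the ranking used by a real full-text index, so it reproduces the kind of difference rather than the exact results shown.

```python
NAMES = [
    "blood cell",
    "blood cells",
    "blood cellular component",
    "blood platelet",
    "peripheral blood cell",
]


def prefix_search(query: str, names: list[str]) -> list[str]:
    """Names that start with the query string (prefix lookup)."""
    return [name for name in names if name.startswith(query)]


def token_search(query: str, names: list[str]) -> list[str]:
    """Names sharing at least one whole word with the query, best match first."""
    wanted = set(query.split())
    hits = []
    for name in names:
        score = len(wanted & set(name.split()))
        if score > 0:
            hits.append((score, name))
    # Highest score first; the sort is stable, so ties keep their list order.
    hits.sort(key=lambda pair: -pair[0])
    return [name for _, name in hits]


by_prefix = prefix_search("blood cell", NAMES)
by_token = token_search("blood cell", NAMES)
```

In this toy version, "peripheral blood cell" is found only by the token search, because it contains both query words but does not start with them, while "blood cells" ranks lower there because only "blood" matches as a whole word.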

What should we do?

I think I would go with the hack so we can have it both ways - I want to be able to do case insensitive search and also CURIE search. It's not great, but this can be added as part of the python code that builds the index after the fact, right? Then we don't have to change the pipeline that creates the database.

If it's not possible without changing the DB generation code, then I'd go for option 1 and leave this for future work.

I wasn't able to add the lowercase CURIEs ad hoc with MATCH ... SET n.curie_lc = toLower(n.curie) on the raw data: I let it run for 10+ minutes and nothing happened. (I believe all CURIEs already have lowercase prefixes, but I tried it to check how expensive the operation would be.)

The way to add this would be to add the lowercase names/curies at build time, so I think case-insensitive matching would have to be a future feature.

Added issue for this: #100.

great, then let's address this as a low priority future feature, and not worry about it for the revision of the paper

kkaris · 2025-07-16T21:04:32Z

@cthoyt

> Make autocomplete work on CURIEs, too. If you do it this way, please also show the names in the autocomplete box

It should already work on curies.

cthoyt · 2025-07-16T23:27:28Z

@kkaris please let me know when you're done, and this can be merged

thanks again for the effort

kkaris · 2025-07-16T23:48:59Z

> @kkaris please let me know when you're done, and this can be merged
>
> thanks again for the effort

All done from my end.

cthoyt reviewed Jul 8, 2025

Comment thread src/semra/web/autocomplete/autocomplete_blueprint.py Outdated

cthoyt reviewed Jul 8, 2025

kkaris force-pushed the search-box branch from 32f5199 to e882329 Compare July 10, 2025 05:34

kkaris added 26 commits July 14, 2025 11:36

Start autocomplete module

c803a59

Add autocomplete blueprint

0d82859

Rename class method

2955761

Add logging

d58ff48

Fix code handling duplicates

4915ad5

Optionally load autocomplete blueprint, load by default

39430ff

Add search box and autocomplete JS

7969e03

Cleanup and docstrings

191c691

Change button text

226256a

Open page in new tab

98e97ec

WIP: Add mockdata to show functionality

ba55991

Uncomment indexing data, comment out mock data

5649d0f

Remove example since what concepts are available is dynamic

5c70de2

replace Flask Blueprint with FastAPI router

8d3bbf4

Add print for building index and adding autocomplete

ae3d2cf

Display curie when name is not available

58968ae

Increase heading sizes, center top header

150e3e8

Add some logging when web app starts

e381e1c

Do print instead of logger

d848583

Add examples to search label

8aa1a85

Clean up from rebase conflict

d5cfd78

Reimplement search with full-text index

d008db2

Add method to create full-text index

851675a

Add full-text index on app startup

72d1257

Use regular call instead of CALL db.index

b8b43b3

Also create index for curies

779ae9f

bgyori reviewed Jul 14, 2025

Comment thread src/semra/client.py Outdated

bgyori reviewed Jul 14, 2025

Comment thread src/semra/client.py Outdated

bgyori and others added 11 commits July 14, 2025 15:34

Remove backticks inside quoted relation type

7301a4a

Remove long comment

eea96bb

Set up logging separately from uvicorn

8d7adc5

Lint

d3cab0f

Refactor autocompletion abstraction

b330ea7

Update tests

8940cb7

Merge branch 'main' into pr/98

f7cb9b3

Switch to reusing pydantic data models

a85f007

Update home.html

9d31693

Add curie to autocomplete box

45e6eb1

Update pyproject.toml

fb53c9a

cthoyt reviewed Jul 16, 2025

cthoyt approved these changes Jul 16, 2025

cthoyt reviewed Jul 16, 2025

Fix typing

7ae16f1

kkaris added 3 commits July 16, 2025 15:11

Add helper to safely escape special characters in labels or types

b8d4904

Test helper

07fdd9a

Lint

b09278f

kkaris mentioned this pull request Jul 16, 2025

Make autocomplete search case insensitive #100

Open

2 tasks

cthoyt added 2 commits July 17, 2025 07:39

Consolidate additional code

6de2391

Update test_web.py

ae79db6

cthoyt enabled auto-merge (squash) July 17, 2025 11:40

cthoyt merged commit 47e555c into biopragmatics:main Jul 17, 2025
11 checks passed

kkaris deleted the search-box branch July 17, 2025 15:16
