[ENH] Score Documents - Use SBERT embedding instead of FastText #930

Merged
janezd merged 2 commits into biolab:master from PrimozGodec:score-documents-bert on Mar 6, 2023

@PrimozGodec

Collaborator
Issue

SBERT embeddings are (in our opinion) better suited to measuring distances between the embeddings of documents and words. SBERT embeds the complete text rather than each word separately, so the embedding better represents the context of the whole document.
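
As a rough illustration of what "measuring distances between embeddings of documents and words" can look like, here is a minimal sketch in plain Python. The function names, the toy vectors, and the choice to average cosine similarity over all words are assumptions made for this example; the widget's actual scoring code may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_scores(doc_embeddings, word_embeddings):
    """Score each document by its mean cosine similarity to the word embeddings.

    Averaging over words is an assumption for illustration only.
    """
    return [
        sum(cosine_similarity(d, w) for w in word_embeddings) / len(word_embeddings)
        for d in doc_embeddings
    ]


# Toy 3-dimensional vectors, invented for illustration. Real SBERT
# embeddings have hundreds of dimensions and come from the model.
doc_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
word_embeddings = [[1.0, 0.0, 0.0]]
scores = similarity_scores(doc_embeddings, word_embeddings)
```

The first toy document points in almost the same direction as the word vector and therefore receives the higher score.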

Description of changes

This PR replaces the FastText embedding with SBERT in Score Documents. It also addresses some weaknesses of the widget:

  • It adds the option to send a list of documents to the server. This is also used when sending words to the server, so that each term is not sent in a separate request.
  • When embedding fails, the widget now shows a warning that similarity cannot be computed because some embeddings were unsuccessful. Previously, the widget failed and showed no scores at all; now it shows all scores except similarity.
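
The two behaviours above could be sketched roughly as follows. This is not the PR's code: `score_documents`, `fake_send_batch`, `word_count`, and the toy vectors are hypothetical names and data invented for illustration, and the fake server stands in for the real embedding service by returning `None` for any text it cannot embed.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for the unit-length toy vectors below."""
    return sum(x * y for x, y in zip(a, b))


def score_documents(documents, words, send_batch, other_scorers):
    """Return (scores, warning); similarity is left out if any embedding failed."""
    scores = {
        name: [scorer(doc, words) for doc in documents]
        for name, scorer in other_scorers.items()
    }
    # One request for all documents and one for all words,
    # instead of one request per term.
    doc_emb = send_batch(list(documents))
    word_emb = send_batch(list(words))
    if any(e is None for e in doc_emb + word_emb):
        # Keep the other scores and report why similarity is missing.
        return scores, "Similarity cannot be computed: some embeddings failed."
    scores["similarity"] = [
        sum(dot(d, w) for w in word_emb) / len(word_emb) for d in doc_emb
    ]
    return scores, None


# Hypothetical stand-in for the embedding server (assumption, not a real API).
requests = []
TOY_VECTORS = {
    "bicycle": [1.0, 0.0],
    "bicycle ride": [1.0, 0.0],
    "stock market": [0.0, 1.0],
}


def fake_send_batch(texts):
    """Record the request and return one vector, or None on failure, per text."""
    requests.append(texts)
    return [TOY_VECTORS.get(text) for text in texts]


def word_count(doc, words):
    """A simple non-embedding score: how often the words occur in the document."""
    return sum(doc.split().count(w) for w in words)


scorers = {"word_count": word_count}
ok_scores, ok_warning = score_documents(
    ["bicycle ride", "stock market"], ["bicycle"], fake_send_batch, scorers
)
partial_scores, partial_warning = score_documents(
    ["bicycle ride", "unknown text"], ["bicycle"], fake_send_batch, scorers
)
```

In the second call one document has no embedding, so the result still contains `word_count` but no `similarity` entry, and a warning message is returned alongside it.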
Includes
  • Code changes
  • Tests
  • Documentation

@PrimozGodec PrimozGodec force-pushed the score-documents-bert branch 2 times, most recently from c26aaca to d6e2f75 Compare January 18, 2023 16:14
@PrimozGodec

Collaborator Author

Comparison of fastText (first image) and SBERT (second image) similarity for the keyword "bicycle". The results look similar, but SBERT seems to focus more on documents that actually contain the searched content (on the right side of the image, fewer dots are bright green and yellow).

[Image: similarity-fasttext-bike]
[Image: similarity-bert-bike]

@PrimozGodec

Collaborator Author

/rebase

@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.68%. Comparing base (ef1dd29) to head (9e01a2e).
⚠️ Report is 260 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #930   +/-   ##
=======================================
  Coverage   77.68%   77.68%           
=======================================
  Files          86       86           
  Lines       12281    12297   +16     
  Branches     1607     1613    +6     
=======================================
+ Hits         9540     9553   +13     
- Misses       2442     2444    +2     
- Partials      299      300    +1     
