FTSearchEngine

FTSearchEngine is a simple Full-Text Search Engine. It is part of a series of projects that I am working on to learn more about how algorithms we use in the everyday life works.

This work is based on the talk of Michele Riva about Writing a full-text search engine

How it works

Tokenization

The idea behind tokenization is to normalize the data we want to search. It reduces memory usage by removing duplicates, meaningless words and so on. It is typically done in multiple steps.

Break the text down into individual words.
Remove any punctuation, accents, diacritics or special characters.
Lowercase the tokens (optional, but disallows case-sensitive search).
Remove stop-words (words that does not carry meaning)
Stemming - Normalize the tokens' meaning.
Remove duplicates

Fuzzy Matching

One of the core components of this fuzzy matching algorithm is the Levenshtein Distance algorithm. It give the edit distance between two words given 3 operations (Deletion, Insertion, Replacement) Here's an example with the words "kitten" and "sitting":

replace k with s.
replace e with i.
insert g at the end.

The edit distance between these two words is 3.

How to use

Make sure you have C# and .NET Core 8 installed on your machine. You can check if you have it installed by running the following command in your terminal.

dotnet --version

If you don't have it installed, you can download it from the official website .NET Core

Then, clone the repository and run the project

git clone https://github.com/Lxdovic/FTSearchEngine
cd FTSearchEngine
dotnet run

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
FTS.Tests		FTS.Tests
FTSearchEngine		FTSearchEngine
.gitignore		.gitignore
FTSearchEngine.sln		FTSearchEngine.sln
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

FTSearchEngine

How it works

Tokenization

Fuzzy Matching

How to use

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Lxdovic/FTSearchEngine

Folders and files

Latest commit

History

Repository files navigation

FTSearchEngine

How it works

Tokenization

Fuzzy Matching

How to use

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages