Hoksolinvan/Text-Based-Search-Engine-with-Term-Document-Matrix

This project is the backend of a search tool designed to enhance search engine optimization (SEO). It builds a term-document matrix from preprocessed terms and uses cosine similarity to rank web pages against a query.

Term-Document Matrix Construction

Generation: Create a term-document matrix from a collection of text documents. Each row represents a term, and each column represents a document. The entries in the matrix capture the frequency of terms within each document.
Preprocessing: Before counting, apply tokenization, stop-word removal, and stemming to each document. This standardizes the terms, so that variants such as "running" and "runs" are counted as the same term.
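
The two steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the stop-word list, the crude `stem` suffix stripper (a stand-in for a real stemmer), and the names `preprocess` and `build_term_document_matrix` are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to"}


def stem(token):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on runs of letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_term_document_matrix(documents):
    """Return (terms, matrix); matrix[i][j] is the count of terms[i] in document j."""
    counts = [Counter(preprocess(doc)) for doc in documents]
    terms = sorted(set().union(*counts))  # one row per distinct term
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


docs = ["The cat chases the mice", "Dogs and cats are chasing balls"]
terms, matrix = build_term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

Each printed row is one term followed by its frequency in every document, matching the row-per-term, column-per-document layout described above.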

Query Handling and Vectorization

Query Input: Receive a text query from the user, which is preprocessed in the same manner as the documents to ensure consistency.
Query Vector Construction: Convert the preprocessed query into a query vector. This vector aligns with the dimensions of the term-document matrix, enabling direct comparison with document vectors.
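
As a rough sketch of query vectorization, the snippet below maps an already preprocessed query onto the row order of the matrix. The vocabulary, the query tokens, and the function name `query_to_vector` are hypothetical; dropping query terms that do not occur in any document is an assumption, since the source does not say how such terms are handled.

```python
from collections import Counter


def query_to_vector(query_tokens, terms):
    """Count each vocabulary term in the query, in the matrix's row order."""
    counts = Counter(query_tokens)
    # Tokens absent from the vocabulary have no row in the matrix and are ignored.
    return [counts[term] for term in terms]


terms = ["ball", "cat", "chas", "dog", "mice"]  # hypothetical vocabulary (matrix row order)
query_tokens = ["cat", "chas", "cat", "bird"]   # hypothetical, already preprocessed query
query_vector = query_to_vector(query_tokens, terms)
print(query_vector)
```

The resulting vector has one entry per matrix row, so it can be compared directly with any document column.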

Similarity Measurement

Cosine Similarity: Calculate the cosine similarity between the query vector and each document vector (column) in the term-document matrix. Cosine similarity is the cosine of the angle between two vectors, so it compares their direction rather than their length, and a long document is not favored merely for containing more words.

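Formula: The cosine similarity between vectors A and B is given by:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$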

Implementation: Compute the dot product of the query vector and each document vector, then divide by the product of their magnitudes. Because term frequencies are non-negative, the resulting similarity score lies between 0 and 1.
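
A minimal sketch of this computation in plain Python is shown below. The function name `cosine_similarity` is chosen for the example, and returning 0.0 when either vector has zero magnitude is an assumed convention; the source does not say how that case is handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty query or document matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # no shared terms
```

Vectors that point in the same direction score close to 1 regardless of length, while vectors that share no terms score 0.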

Ranking and Output

Document Ranking: Rank the documents based on their similarity scores to the query vector. Higher scores indicate greater relevance to the query.
Output Format: Present the ranked documents along with their similarity scores. The output is designed to be processed by a colleague's web interface, ensuring seamless integration and display.
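
The ranking and output steps might look like the sketch below. The source does not specify the output format expected by the web interface, so the JSON layout, the field names `document` and `score`, the document names, and the function `rank_documents` are all assumptions made for illustration.

```python
import json
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(matrix, query_vector, doc_names):
    """Score every document column against the query, highest score first."""
    columns = zip(*matrix)  # rows are terms, so each column is one document
    scored = [
        (name, cosine_similarity(query_vector, column))
        for name, column in zip(doc_names, columns)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: 5 terms (rows) by 2 documents (columns).
matrix = [[0, 1], [1, 1], [1, 1], [0, 1], [1, 0]]
query_vector = [0, 1, 1, 0, 0]
doc_names = ["doc1.html", "doc2.html"]

ranking = rank_documents(matrix, query_vector, doc_names)
payload = json.dumps(
    [{"document": name, "score": round(score, 4)} for name, score in ranking]
)
print(payload)
```

Serializing the ranked list as JSON is one straightforward way to hand results to a separate front end, but any format agreed on with the interface's author would work equally well.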

Repository files: README.md, Term-Document_Matrix.py, Term-frequency-Counter.py
