Naive implementation of Bag of Words Model for NLP

How Bag of Words in NLP works:

Todo: We must preprocess the text

Converting text to lowercase
Removing non-word characters
Removing extra spaces

How:
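A minimal sketch of the three preprocessing steps, using only Python's standard `re` module. The function name `preprocess` is an assumption made for illustration, not necessarily what the repository uses. Because removing non-word characters also removes sentence-ending punctuation, this sketch is meant to be applied to each sentence after the text has been split.

```python
import re

def preprocess(text: str) -> str:
    """Lowercase, drop non-word characters, and collapse extra spaces."""
    text = text.lower()                 # convert to lowercase
    text = re.sub(r"\W", " ", text)     # replace every non-word character with a space
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace into one space
    return text.strip()                 # drop leading and trailing spaces

print(preprocess("  Hi,   my name is Brandon!  "))  # hi my name is brandon
```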

Todo: We break the text down into sentences, and then the sentences into words

Example: Hi my name is Brandon. Nice to meet you. How is your family doing?

The text contains three sentences, so we split it into:

Hi my name is Brandon
Nice to meet you
How is your family doing

How:
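One possible way to split the example into sentences, sketched with a regular expression. The helper name `split_sentences` is hypothetical, and the rule is deliberately naive: it splits after `.`, `!`, or `?` followed by whitespace, so abbreviations such as "Dr." would be split incorrectly.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split wherever sentence-ending punctuation is followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Remove the trailing punctuation from each sentence and skip empty pieces
    return [p.rstrip(".!?").strip() for p in parts if p.strip()]

text = "Hi my name is Brandon. Nice to meet you. How is your family doing?"
sentences = split_sentences(text)
print(sentences)
# ['Hi my name is Brandon', 'Nice to meet you', 'How is your family doing']
```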

Once the text is split into sentences, we break each sentence into lowercase words

[hi,my,name,is,brandon]
[nice,to,meet,you]
[how,is,your,family,doing]

How:
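Word splitting can be sketched as follows, assuming that a "word" is any run of word characters. The `tokenize` name is an assumption for illustration.

```python
import re

def tokenize(sentence: str) -> list[str]:
    # Lowercase the sentence, then extract each run of word characters
    return re.findall(r"\w+", sentence.lower())

sentences = ["Hi my name is Brandon", "Nice to meet you", "How is your family doing"]
tokens = [tokenize(s) for s in sentences]
for words in tokens:
    print(words)
# ['hi', 'my', 'name', 'is', 'brandon']
# ['nice', 'to', 'meet', 'you']
# ['how', 'is', 'your', 'family', 'doing']
```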

Todo: We must remove stopwords from our word lists

Stopwords are common words that add little meaning, such as "is", "the", "a", and "to"

After removing stopwords:

[hi,name,brandon]
[nice,meet]
[family,doing]

How:
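A small sketch of stopword filtering. The `STOPWORDS` set below is a hypothetical, hand-picked list chosen only so that it reproduces the example above; a real project would more likely load a longer list from an NLP library.

```python
# Hypothetical stopword list, just large enough for the running example
STOPWORDS = {"is", "the", "a", "to", "my", "you", "how", "your"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only the words that are not in the stopword set
    return [t for t in tokens if t not in stopwords]

tokens = [
    ["hi", "my", "name", "is", "brandon"],
    ["nice", "to", "meet", "you"],
    ["how", "is", "your", "family", "doing"],
]
filtered = [remove_stopwords(words) for words in tokens]
for words in filtered:
    print(words)
# ['hi', 'name', 'brandon']
# ['nice', 'meet']
# ['family', 'doing']
```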

Todo: We build our vocabulary from all unique words

From our example, our vocabulary would be: [brandon,doing,family,hi,meet,name,nice]

The vocabulary is sorted alphabetically and contains each unique word exactly once

We also create a frequency hashmap to track how many times each word appears across all sentences: {brandon:1, doing:1, family:1, hi:1, meet:1, name:1, nice:1}

How:
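The vocabulary and the frequency map can both be derived from one pass over the filtered word lists. This sketch uses `collections.Counter` from the standard library; the variable names are assumptions for illustration.

```python
from collections import Counter

filtered = [["hi", "name", "brandon"], ["nice", "meet"], ["family", "doing"]]

# Count every word across all sentences
frequency = Counter(word for words in filtered for word in words)

# The vocabulary is the alphabetically sorted set of unique words
vocabulary = sorted(frequency)

print(vocabulary)
# ['brandon', 'doing', 'family', 'hi', 'meet', 'name', 'nice']
print(frequency["brandon"])  # 1
```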

Todo: We create vectors for each sentence based on our vocabulary

Each sentence is represented as a vector in which each position corresponds to a word in our vocabulary. Each position holds the number of times that word occurs in the sentence, or 0 if it does not occur

Hi my name is Brandon → [1,0,0,1,0,1,0]
Nice to meet you → [0,0,0,0,1,0,1]
How is your family doing → [0,1,1,0,0,0,0]

How:
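Building the vectors can be sketched as a per-sentence count looked up in vocabulary order. The `vectorize` helper is a hypothetical name; it relies on `Counter` returning 0 for words that are missing from a sentence.

```python
from collections import Counter

def vectorize(tokens, vocabulary):
    counts = Counter(tokens)                       # word -> occurrences in this sentence
    return [counts[word] for word in vocabulary]   # missing words count as 0

vocabulary = ["brandon", "doing", "family", "hi", "meet", "name", "nice"]
filtered = [["hi", "name", "brandon"], ["nice", "meet"], ["family", "doing"]]

vectors = [vectorize(words, vocabulary) for words in filtered]
for vector in vectors:
    print(vector)
# [1, 0, 0, 1, 0, 1, 0]
# [0, 0, 0, 0, 1, 0, 1]
# [0, 1, 1, 0, 0, 0, 0]
```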

Todo: These vectors can now be used for machine learning

Each sentence is now represented as a numerical vector that can be:

Compared with other sentences
Used for classification tasks
Used for similarity measurements
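
As one illustration of a similarity measurement, the sketch below computes cosine similarity between two bag-of-words vectors. Cosine similarity is a common choice but is an assumption here, since the text above does not name a specific measure. Returning 0.0 for an all-zero vector is also a convention chosen for this sketch.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: an empty vector matches nothing
    return dot / (norm_a * norm_b)

v1 = [1, 0, 0, 1, 0, 1, 0]  # "Hi my name is Brandon"
v2 = [0, 0, 0, 0, 1, 0, 1]  # "Nice to meet you"
print(cosine_similarity(v1, v2))  # 0.0, the sentences share no vocabulary words
```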

Difference between my implementation and the TensorFlow implementation:

(.venv) C:\Users\Chinzilla\bag_of_words>python main.py
2025-09-26 15:53:36.594281: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-09-26 15:53:37.658635: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.

My Bag of words implementation output:

[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 0, 0, 1, 1, 0, 2, 0, 0, 1, 1, 1, 1]

Tensorflow Bag of words output

2025-09-26 15:53:37.895705: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Vocabulary: ['[UNK]', np.str_('the'), np.str_('bus'), np.str_('samantha'), np.str_('mary'), np.str_('for'), np.str_('and'), np.str_('waited'), np.str_('train'), np.str_('station')]
Bag-of-Words Representation:
 [[1 1 0 0 0 1 0 1 1 0]
 [2 1 0 0 0 0 0 0 1 0]
 [1 1 1 1 1 0 1 0 0 0]
 [3 1 1 1 1 1 1 0 0 1]
 [6 2 2 1 1 1 1 1 0 1]]

(.venv) C:\Users\Chinzilla\bag_of_words>

