Project-Herme

Herme is a sentiment text classifier for profanity detection, paired with an inappropriate-image classifier.

`Introduction`

The idea behind this repository is to explore how content filtering works on places like forums and social media.

It is also part of a series of projects delivered by the team in the AI scientific student group at Havana University.

`Description`

`Text Classifier`

The text classifier is an LSTM text classifier with multi-hot-encoded labels covering the following categories: toxic, severe toxic, obscene, threat, insult and identity hate.
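To make the multi-hot encoding concrete, here is a minimal sketch of how a comment's labels could be turned into a six-element target vector. The `LABELS` order and the `to_multi_hot` helper are illustrative assumptions, not taken from the repository.

```python
# Hypothetical label order; the repository may order the categories differently.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_multi_hot(active_labels):
    """Return a 0/1 vector with a 1 for every category that applies."""
    unknown = set(active_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in active_labels else 0 for label in LABELS]


# A comment can belong to several categories at once, or to none.
print(to_multi_hot({"toxic", "insult"}))  # [1, 0, 0, 0, 1, 0]
print(to_multi_hot(set()))                # [0, 0, 0, 0, 0, 0]
```

Unlike one-hot encoding, more than one position can be 1 at the same time, which is why the network ends in six independent sigmoid outputs.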

Full code here

First, the text is "cleaned". This is done by the clean method, which removes stopwords, IP addresses, non-alphabetic symbols and so on, so that only the "important" words get through and unnecessary noise is kept out of the network.
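The cleaning step could look roughly like the sketch below. It is not the repository's clean method: the regular expressions and the tiny `STOPWORDS` set are assumptions made for illustration (the project lists NLTK as a requirement, which provides a much larger stopword list).

```python
import re

# Tiny illustrative stopword list (an assumption, not the project's actual list).
STOPWORDS = {"a", "an", "and", "are", "from", "is", "it", "of", "the", "this", "to", "you"}

# IPv4-looking tokens such as 192.168.0.1
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
# Anything that is not a lowercase letter or whitespace
NON_ALPHA_PATTERN = re.compile(r"[^a-z\s]")


def clean(text):
    """Lowercase, drop IP addresses and non-alphabetic symbols, then remove stopwords."""
    text = text.lower()
    text = IP_PATTERN.sub(" ", text)
    text = NON_ALPHA_PATTERN.sub(" ", text)
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)


print(clean("You are the WORST editor!!! Reported from 192.168.0.1"))
# worst editor reported
```

IP addresses are removed before the non-alphabetic filter so that they are matched as whole tokens rather than being broken apart digit by digit.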

The model is defined as follows:

from tensorflow.keras.layers import Dense, Dropout, Embedding, GlobalMaxPool1D, Input, LSTM
from tensorflow.keras.models import Model

# each comment is fed in as a sequence of 100 token indices
inp = Input(shape=(100, ))

# size of the vector space
embed_size = 128
x = Embedding(20000, embed_size)(inp)

output_dimension = 60
x = LSTM(output_dimension, return_sequences=True, name='lstm_layer')(x)
# reduce dimension by keeping the maximum over the time steps
x = GlobalMaxPool1D()(x)
# disable 10% of the nodes
x = Dropout(0.1)(x)
# pass output through a ReLU function
x = Dense(50, activation="relu")(x)
# another 10% dropout
x = Dropout(0.1)(x)
# pass the output through a sigmoid layer, since each of the
# 6 categories is an independent binary (0, 1) classification
x = Dense(6, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=x)
# we use binary_crossentropy because of binary classification
# optimise loss with the Adam optimiser
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
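As a sanity check on this architecture, the output shape and trainable-parameter count of each layer can be worked out by hand. The sketch below does the arithmetic in plain Python using the standard formulas for embedding, LSTM and dense layers. The helper names are illustrative, and the totals are derived from the layer sizes shown here rather than copied from the repository.

```python
def embedding_params(vocab_size, embed_size):
    # one embed_size-dimensional vector per vocabulary entry
    return vocab_size * embed_size


def lstm_params(input_size, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_size + units) * units + units)


def dense_params(input_size, units):
    # weight matrix plus one bias per unit
    return input_size * units + units


# (layer name, output shape without the batch dimension, trainable parameters)
layers = [
    ("embedding", (100, 128), embedding_params(20000, 128)),
    ("lstm_layer", (100, 60), lstm_params(128, 60)),
    ("global_max_pool", (60,), 0),
    ("dropout", (60,), 0),
    ("dense_relu", (50,), dense_params(60, 50)),
    ("dropout", (50,), 0),
    ("dense_sigmoid", (6,), dense_params(50, 6)),
]

for name, shape, params in layers:
    print(f"{name:16s} output={shape!s:10s} params={params:,}")

total = sum(params for _, _, params in layers)
print(f"total trainable parameters: {total:,}")  # 2,608,716
```

Almost all of the parameters sit in the embedding table (2,560,000 of them), so the vocabulary size and embedding dimension dominate the model's size.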

Given a dataset of around 160k Wikipedia forum comments, the network reached an overall validation accuracy of 98%.

Here we can see how the loss and accuracy behaved over the course of 2 training epochs, on both training and validation data.

The full dataset can be found here. Once downloaded, it must be placed in the Text/Raw Dataset folder before testing; otherwise, change the path in the code.

The testing script can be found here

`Image Classifier`

The image classifier is a binary classifier: output values <= 0.5 mean inappropriate, and anything above 0.5 means appropriate.
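The decision rule can be written as a small helper. This is a minimal sketch; `label_from_score` and its `threshold` parameter are hypothetical names, not functions from the repository.

```python
def label_from_score(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a label using the rule described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    # values at or below the threshold count as inappropriate
    return "inappropriate" if score <= threshold else "appropriate"


for score in (0.12, 0.5, 0.93):
    print(score, label_from_score(score))
```

Note that a score of exactly 0.5 falls on the inappropriate side, matching the `<=` in the description.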

The full code can be found here

To load the data, it builds a tensorflow.data.Dataset pipeline that labels the images according to the two folders inside the data folder, batches them and resizes them to a default of 256x256. The only manual step is filtering the data so that only files the model can read remain, which is done with the cv2 and imghdr modules.
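The filtering step can be illustrated with a short sketch. The repository uses cv2 and imghdr for this; the version below swaps in a dependency-free check of the file header instead, partly because imghdr was removed from the standard library in Python 3.13. The function names and the `SIGNATURES` table are assumptions for illustration only.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "bmp": b"BM",
}


def detect_image_type(path):
    """Return the image type suggested by the file header, or None."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for name, signature in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


def find_unreadable_images(data_dir):
    """List files under data_dir whose header is not a known image type."""
    unreadable = []
    for root, _dirs, files in os.walk(data_dir):
        for filename in sorted(files):
            path = os.path.join(root, filename)
            if detect_image_type(path) is None:
                unreadable.append(path)
    return unreadable


# Demo on a throwaway folder: one file with a PNG header, one that is not an image.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "ok.png"), "wb") as handle:
        handle.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
    with open(os.path.join(data_dir, "broken.jpg"), "wb") as handle:
        handle.write(b"not really an image")
    unreadable = find_unreadable_images(data_dir)
    print([os.path.basename(path) for path in unreadable])  # ['broken.jpg']
```

The sketch only reports suspicious files; deleting or moving them is left to the caller so that nothing is removed by accident. A header check is weaker than actually decoding the image, which is what a cv2-based check adds.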

The model can be summarized as follows:

import tensorflow as tf

model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Conv2D(16, (3,3), 1, activation='relu', input_shape=(256,256,3)))
model.add(tf.keras.layers.MaxPooling2D())

model.add(tf.keras.layers.Conv2D(32, (3,3), 1, activation='relu'))
model.add(tf.keras.layers.MaxPooling2D())

model.add(tf.keras.layers.Conv2D(16, (3,3), 1, activation='relu'))
model.add(tf.keras.layers.MaxPooling2D())

model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(256,activation='relu'))
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

model.compile('adam', loss='binary_crossentropy')
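To see what the convolution and pooling stack does to a 256x256 input, the spatial sizes can be traced by hand. The sketch below assumes the Keras defaults that the code relies on (no padding for Conv2D, 2x2 pooling with stride 2); the helper names are illustrative.

```python
def conv_output(size, kernel=3, stride=1):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1


def pool_output(size, pool=2):
    """Spatial size after max pooling whose stride equals the pool size."""
    return size // pool


size = 256
for filters in (16, 32, 16):
    size = conv_output(size)
    print(f"Conv2D({filters})    -> {size}x{size}x{filters}")
    size = pool_output(size)
    print(f"MaxPooling2D  -> {size}x{size}x{filters}")

# The last convolution has 16 filters, so Flatten yields size * size * 16 values.
flattened = size * size * 16
print("Flatten       ->", flattened)  # 14400

# Weights plus biases of the Dense(256) layer that follows.
dense_256_params = flattened * 256 + 256
print("Dense(256) parameters:", dense_256_params)  # 3686656
```

Each block roughly halves the width and height (256 to 127 to 62 to 30), so the first dense layer, with about 3.7 million parameters, holds most of the model's weights.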

The dataset is made up entirely of random Google Images results that fit each label. In fact, the model is flexible enough that the same repository can be used for any binary image classification task, not only the one it is currently used for.

Here we can see how the loss and accuracy behaved over the course of 20 training epochs, on both training and validation data.

`Using`

`Requirements`

Tensorflow
Numpy
OpenCV
NLTK

`Instructions`

Clone the repo and download the necessary datasets as indicated in the sections above.
Open the TrainingModel.ipynb of the classifier you wish to test and run all cells. This should create the model.h5 file.
The repo root folder should contain Hermes.py, which has two functions: predict_text and predict_images. Follow the instructions in that file and you should be good to go.

`Thank you notes`

This project is based on @iamhosseindhv's work on LSTM text classification, which is the main inspiration behind the text classifier. The image classifier is simply the image counterpart of the same idea.
