My quick investigation shows that bag-of-words might be a good algorithm: https://en.wikipedia.org/wiki/Bag-of-words_model
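
For anyone who wants to see what that looks like in practice, here is a minimal sketch of the idea using only the Python standard library. The function names (`bag_of_words`, `vectorize`), the simple regex tokenizer, and the sample sentences are my own assumptions for illustration, not something taken from a particular library; a real setup would probably want better tokenization and stop-word handling.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences in text, ignoring case, order, and punctuation.

    Hypothetical helper: the tokenizer is a simple regex, chosen for brevity.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def vectorize(bags):
    """Turn a list of word-count bags into count vectors over a shared vocabulary."""
    # The vocabulary is every distinct word seen in any bag, in sorted order.
    vocab = sorted(set().union(*bags))
    # A Counter returns 0 for missing words, so absent words get a zero entry.
    vectors = [[bag[word] for word in vocab] for bag in bags]
    return vocab, vectors


docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]
bags = [bag_of_words(doc) for doc in docs]
vocab, vectors = vectorize(bags)
```

Each document ends up as a vector of word counts, so two documents can be compared by how much their counts overlap, regardless of word order.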