Support "Pandas Series Representation" #43

@jbesomi

This is one of the most interesting (future) aspects of Texthero: the ability to represent any text dataset with ease, even very large ones.

Motivation
One of the big limitations of the current version of Texthero is that the output of the tfidf function, or of any other "representation" function, is not particularly interpretable. The user does not even know which tf-idf weight is associated with which word/token.

The solution is to return a MultiIndex Pandas Series where the first level represents the document and the second level represents the word. See the example below:

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["I am GROOT", "Flame on"])
>>> s = hero.tokenize(s)
>>> hero.tfidf(s)
document  word
0         GROOT    0.577350
          I        0.577350
          am       0.577350
1         Flame    0.707107
          on       0.707107
dtype: Sparse[float64, nan]

The advantages of this approach are:

  1. The result is much more interpretable
  2. The result is a sparse Pandas Series! This matters because the output of TF-IDF (especially when max_features=None) is often a very large and very sparse matrix.

The drawback is that this Pandas Series cannot be appended directly to a Pandas DataFrame.

We refer to this MultiIndex Series, where the first level is the document and the second level is the term, as the "Pandas Series Representation" (a better name is welcome!).
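The structure described above can be sketched with plain pandas. This is only an illustration, not Texthero's implementation: the tokens are hard-coded instead of coming from hero.tokenize, raw term counts stand in for tf-idf weights, and the result is a dense (not sparse) Series.

```python
import pandas as pd

# Tokenized documents, hard-coded here for illustration.
docs = [["I", "am", "GROOT"], ["Flame", "on"]]

# One (document, word) -> weight entry per distinct token.
# Raw counts are used as placeholder weights instead of tf-idf.
records = {}
for doc_id, tokens in enumerate(docs):
    for token in tokens:
        key = (doc_id, token)
        records[key] = records.get(key, 0.0) + 1.0

index = pd.MultiIndex.from_tuples(list(records.keys()), names=["document", "word"])
rep = pd.Series(list(records.values()), index=index, dtype="float64").sort_index()

print(rep.loc[0])          # every weight of document 0, indexed by word
print(rep.loc[(1, "on")])  # the single weight of "on" in document 1
```

Selecting on the first level returns all terms of a document, and selecting on both levels returns one weight, which is what makes this layout easy to inspect.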

Texthero 2.0

Starting from Texthero 2.0, all (?) "representation" functions will return such a Pandas Representation Series. The pca/nmf functions will accept a Pandas Representation Series as input and will (probably) return a flat representation, as it no longer makes sense to have a second level called "pca-component-1".
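A rough sketch of what "accept a representation, return a flat Series" could look like. The function name pca_flat and its signature are hypothetical, and the PCA is done with a plain NumPy SVD on the centered document-term matrix rather than with whatever Texthero ends up using.

```python
import numpy as np
import pandas as pd

def pca_flat(rep: pd.Series, n_components: int = 2) -> pd.Series:
    """Hypothetical helper: PCA on a (document, word) Series, one vector per document."""
    matrix = rep.unstack(fill_value=0.0)            # documents x terms, dense
    values = matrix.to_numpy(dtype="float64")
    centered = values - values.mean(axis=0)         # center each term column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # principal component scores
    return pd.Series(scores.tolist(), index=matrix.index)

index = pd.MultiIndex.from_tuples(
    [(0, "GROOT"), (0, "I"), (0, "am"), (1, "Flame"), (1, "on")],
    names=["document", "word"],
)
rep = pd.Series([0.57735, 0.57735, 0.57735, 0.707107, 0.707107], index=index)

flat = pca_flat(rep, n_components=2)
print(flat)
```

The output has a single index level (the document) and each value is a list of component scores, so no second level named after components is needed.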

From Pandas Representation Series to Pandas Series

A function to_flat_series, or something similar, will transform the Pandas Representation Series into a flat Pandas Series (like the current output of tfidf). This will make it possible to append the Series to the initial df.
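One possible shape for such a function, as a minimal sketch. The name to_flat_series comes from the proposal, but the signature and behaviour here are assumptions: each document gets a list of weights ordered by the vocabulary, with 0.0 for absent terms, and the input is a dense Series for simplicity.

```python
import pandas as pd

def to_flat_series(rep: pd.Series) -> pd.Series:
    """Assumed behaviour: one list of weights per document, missing terms filled with 0.0."""
    matrix = rep.unstack(fill_value=0.0)  # documents x terms
    return pd.Series(matrix.to_numpy().tolist(), index=matrix.index)

index = pd.MultiIndex.from_tuples(
    [(0, "GROOT"), (0, "I"), (0, "am"), (1, "Flame"), (1, "on")],
    names=["document", "word"],
)
rep = pd.Series([0.57735, 0.57735, 0.57735, 0.707107, 0.707107], index=index)

df = pd.DataFrame({"text": ["I am GROOT", "Flame on"]})
df["tfidf"] = to_flat_series(rep)  # aligned on the document index
print(df)
```

Because the flat Series is indexed by document only, it aligns with the original DataFrame index and can be assigned as a regular column.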

From Pandas Representation Series to a document-term matrix

Just by calling .unstack() on the Pandas Representation Series, it will be possible to convert it to a Pandas DataFrame where rows are the documents and every column is a term. Nice, right? We will need to explain clearly how to deal with MultiIndex (the basics are not particularly hard).
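In plain pandas, Series.unstack pivots the innermost index level into columns, which yields the document-term layout. A small sketch on a dense Series with the values from the example above (the fill_value of 0.0 for absent terms is a choice made here, not something the proposal specifies):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, "GROOT"), (0, "I"), (0, "am"), (1, "Flame"), (1, "on")],
    names=["document", "word"],
)
rep = pd.Series([0.57735, 0.57735, 0.57735, 0.707107, 0.707107], index=index)

# Pivot the "word" level into columns: one row per document, one column per term.
dtm = rep.unstack(fill_value=0.0)
print(dtm)
```

Terms that do not occur in a document show up as 0.0 in its row; without fill_value they would be NaN.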

Interested in helping out?
Most of the code has already been written. If you are interested in helping out with these important changes, leave a comment. We will be glad to have you on board!

Your opinion
Your opinion matters; let us know your thoughts!
