Support "Pandas Series Representation"

This is one of the most interesting (future) aspects of Texthero: the ability to represent any text dataset with ease, even a very large one.

**Motivation**
One of the big limitations of the current version of Texthero is that the output of the `tfidf` function, or of any other "representation" function, is not particularly interpretable. The user does not even know which tf-idf weight is associated with which word/token.

The solution is to return a MultiIndex Pandas Series where the first level represents the document and the second level represents the word. See the example below:

```
>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["I am GROOT", "Flame on"])
>>> s = hero.tokenize(s)
>>> hero.tfidf(s)
document  word 
0        GROOT    0.577350
          I        0.577350
          am       0.577350
1        Flame    0.707107
          on       0.707107
dtype: Sparse[float64, nan]
```

The advantages of this approach are:
1. The result is much more interpretable.
2. The result is a **Sparse** Pandas Series! This matters because the output of TF-IDF (especially when `max_features=None`) is often a very large and very sparse matrix.

The drawback is that this Pandas Series cannot be added directly as a column of the original Pandas DataFrame.
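To make the structure concrete, here is a minimal sketch that builds such a Series by hand with plain pandas and reads weights back out of it. It is only a stand-in for the proposed `hero.tfidf` output: the values are copied from the example above, and the variable names (`s_repr`, `doc0`, `groot_weight`) are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for the proposed output (not produced by Texthero).
index = pd.MultiIndex.from_tuples(
    [(0, "GROOT"), (0, "I"), (0, "am"), (1, "Flame"), (1, "on")],
    names=["document", "word"],
)
s_repr = pd.Series(
    pd.arrays.SparseArray([0.577350, 0.577350, 0.577350, 0.707107, 0.707107]),
    index=index,
)

# All weights of document 0, selected through the first index level.
doc0 = s_repr.loc[0]

# The weight of one specific (document, word) pair.
groot_weight = s_repr.loc[(0, "GROOT")]

print(doc0)
print(groot_weight)
```

Each weight stays attached to its word, which is exactly the interpretability gain described above.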

We refer to this MultiIndex Series, where the first level is the document and the second level is the term, as the "Pandas Series Representation" (a better name is welcome!).

**Texthero 2.0**

Starting from Texthero 2.0, all (?) "representation" functions will return such a Pandas Representation Series. The `pca`/`nmf` functions will accept a Pandas Representation Series as input and will (probably) return a flat Series, since a second level named "pca-component-1" would no longer make sense.

**From Pandas Representation Series to Pandas Series**

A function `to_flat_series`, or something similar, will transform the Pandas Representation Series into a flat Pandas Series (like the current output of `tfidf`). This will make it possible to add the Series as a column of the initial DataFrame.
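A rough sketch of what such a function could look like, using only plain pandas. Everything here is an assumption: the name `to_flat_series` is the one floated above, but the signature, the `fill_value` parameter and the choice of one list of weights per document are ours, not a settled design. The sample Series uses dense values to keep the sketch small.

```python
import pandas as pd


def to_flat_series(s_repr: pd.Series, fill_value: float = 0.0) -> pd.Series:
    """Hypothetical helper: one list of weights per document."""
    # Pivot the second index level into columns; (document, word) pairs
    # that are absent from s_repr get fill_value.
    matrix = s_repr.unstack(fill_value=fill_value)
    # One Python list per row, indexed by document.
    return pd.Series(matrix.to_numpy().tolist(), index=matrix.index)


# Hand-built stand-in for a Pandas Representation Series (dense for brevity).
index = pd.MultiIndex.from_tuples(
    [(0, "GROOT"), (0, "I"), (0, "am"), (1, "Flame"), (1, "on")],
    names=["document", "word"],
)
s_repr = pd.Series([0.577350, 0.577350, 0.577350, 0.707107, 0.707107], index=index)

# The flat Series aligns with the original DataFrame on the document index.
df = pd.DataFrame({"text": ["I am GROOT", "Flame on"]})
df["tfidf"] = to_flat_series(s_repr)
print(df)
```

Note that the flat form loses the word labels again and materializes every zero, so it only makes sense as a final step before storing the vectors next to the text.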

**From Pandas Representation Series to a document-term matrix**

Calling `.unstack()` on the Pandas Representation Series converts it into a Pandas DataFrame where the rows are the documents and every column is a term. Nice, right? We will need to explain clearly how to work with a MultiIndex (the basics are not particularly hard).
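As an illustration, in pandas the Series method that moves an index level into the columns is `unstack`. The sketch below applies it to a hand-built stand-in Series (dense values for brevity, and `doc_term` is just an illustrative name); filling missing pairs with `0.0` is our assumption about what a document-term matrix should contain.

```python
import pandas as pd

# Hand-built stand-in for a Pandas Representation Series.
index = pd.MultiIndex.from_tuples(
    [(0, "GROOT"), (0, "I"), (0, "am"), (1, "Flame"), (1, "on")],
    names=["document", "word"],
)
s_repr = pd.Series([0.577350, 0.577350, 0.577350, 0.707107, 0.707107], index=index)

# Move the "word" level into the columns: one row per document, one column
# per term. Words that do not occur in a document get 0.0.
doc_term = s_repr.unstack(level="word", fill_value=0.0)
print(doc_term)
```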

**Interested in helping out?**
Most of the code has already been written. If you are interested in helping out with this **important** change, leave a comment. We will be glad to have you on board!

**Your opinion**
Your opinion matters; let us know your thoughts!
 
 
