Skip to content

When column is of type int, histogram acts different from pandas dataframe #22

@itamar-otonomo

Description

@itamar-otonomo

When I summon a hist from a Pandas column (Series) containing integers I get a proper histogram where the x axis is divided to bins of value ranges.
When I do the same using a handy DataFrame I get a categorical histogram.

I dug into the code and the reason for the way handy acts is that the column of integers is not defined as a member of the self._continuous group of columns.

hist uses the continuous list as an indication of using categorical for non continuous. This is why a hist of integers in handy is not what one would expect from a hist of integers in Pandas.

a workaround is to cast the integer column to floats. I think this is a bug (couldn't find anything in the docs).

Here's a quick repro code..

pdf = pd.DataFrame({'bobo': np.random.randint(0, 100, 5000)})
df = spark.createDataFrame(pdf).withColumn('float_bobo', F.col('bobo').astype('float'))
hdf = df.toHandy()
pdf.bobo.hist()
hdf.cols['bobo'].hist()
hdf.cols['float_bobo'].hist()

I forgot to congratulate you on this great lib, it really is cool!

Itamar

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions