Description
When I call hist on a Pandas column (Series) containing integers, I get a proper histogram whose x axis is divided into bins of value ranges.
When I do the same on a handy DataFrame, I get a categorical histogram.
I dug into the code, and the reason handy behaves this way is that the integer column is not included in the self._continuous group of columns.
hist uses the continuous list to decide how to plot: any column not in it is treated as categorical. This is why a hist of integers in handy is not what one would expect from a hist of integers in Pandas.
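To make the difference concrete without needing Spark, here is a minimal sketch of the kind of dispatch described above. It is not handy's actual code: `sketch_hist_counts` and its `continuous_names` argument are hypothetical names I made up to stand in for the check against self._continuous.

```python
import numpy as np
import pandas as pd


def sketch_hist_counts(series, continuous_names, bins=10):
    """Hypothetical dispatch: bin columns listed as continuous,
    count each distinct value for every other column."""
    if series.name in continuous_names:
        # Continuous path: the value range is split into `bins` intervals.
        counts, _edges = np.histogram(series.to_numpy(), bins=bins)
        return "binned", counts
    # Categorical path: one bar per distinct value.
    return "categorical", series.value_counts().sort_index().to_numpy()


rng = np.random.default_rng(0)
ints = pd.Series(rng.integers(0, 100, 5000), name="bobo")

# Integer column left out of the continuous list: one bar per distinct value.
kind_a, counts_a = sketch_hist_counts(ints, continuous_names=set())
# Same column listed as continuous: ten range bins.
kind_b, counts_b = sketch_hist_counts(ints, continuous_names={"bobo"})

print(kind_a, len(counts_a))
print(kind_b, len(counts_b))
```

With the column missing from the continuous list, the same data yields one bar per distinct integer instead of a handful of range bins, which matches what I see in the plot.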
A workaround is to cast the integer column to float. I think this is a bug (I couldn't find anything about it in the docs).
Here's a quick repro:
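import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from handyspark import *  # adds toHandy() to Spark DataFrames
# assumes an active SparkSession named `spark`
pdf = pd.DataFrame({'bobo': np.random.randint(0, 100, 5000)})
df = spark.createDataFrame(pdf).withColumn('float_bobo', F.col('bobo').astype('float'))
hdf = df.toHandy()
pdf.bobo.hist()                # Pandas: binned histogram
hdf.cols['bobo'].hist()        # handy: categorical histogram (unexpected)
hdf.cols['float_bobo'].hist()  # handy: binned histogram after the float cast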
I forgot to congratulate you on this great lib, it really is cool!
Itamar