DISCUSSION: Data for pandas examples #150

Open
@datapythonista

The pandas documentation very often creates simple DataFrame objects to show examples, and many of them just use random data. See for example https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#id1

>>> import numpy
>>> import pandas
>>> df = pandas.DataFrame(numpy.random.randn(5, 2), columns=list('AB'))
>>> df
          A         B
0  0.469112 -0.282863
1 -1.509059 -1.135632
2  1.212112 -0.173215
3  0.119209 -1.044236
4 -0.861849 -2.104569

Then, if I want to show an operation, I can get something like:

>>> df + 2
          A         B
0  2.469112  1.717137
1  0.490941  0.864368
2  3.212112  1.826785
3  2.119209  0.955764
4  1.138151 -0.104569

And in my opinion the example is quite useless (other than for showing the syntax), because if you don't know what the operation does, the example doesn't help you understand it.

The best example I could find to overcome that (probably not great, but the best I could find) is:

>>> df = pandas.DataFrame({"num_arms": [0, 0, 2],
...                        "num_legs": [4, 4, 2]},
...                       index=["dog", "cat", "monkey"])
>>> df
        num_arms  num_legs
dog            0         4
cat            0         4
monkey         2         2

Then, when performing an operation, it is easy to guess what it's doing, or to double-check a guess you already have:

>>> df + 2
        num_arms  num_legs
dog            2         6
cat            2         6
monkey         4         4

We are already using some of those in some examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html

While this worked well in some places, we found this dataset insufficient to show all of pandas' functionality. And while we initially wanted to standardize the data used in the examples, so that things are easier for returning users, we eventually forgot about it.

But while it's surely not simple, I think it'd be ideal if we could find a small number of datasets that can be used in all pandas examples. The ones I think we surely need are:

  • A simple example like the one proposed
  • One with a MultiIndex (probably on both axes)
  • A time series dataset
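
As a rough sketch of what those three datasets could look like, the block below builds one of each kind. Only the animals frame comes from this issue; the MultiIndex labels, the `visitors` column, the dates, and all the values are made up purely for illustration and are not a proposal for the actual data.

```python
import numpy as np
import pandas as pd

# 1. Simple labelled frame: the animals example from this issue.
animals = pd.DataFrame(
    {"num_arms": [0, 0, 2], "num_legs": [4, 4, 2]},
    index=["dog", "cat", "monkey"],
)

# 2. MultiIndex on both axes (labels and values are hypothetical).
row_index = pd.MultiIndex.from_product(
    [["mammal", "bird"], ["adult", "young"]], names=["class", "age"]
)
col_index = pd.MultiIndex.from_product(
    [["body"], ["length_cm", "weight_kg"]], names=["group", "measure"]
)
multi = pd.DataFrame(
    np.arange(8).reshape(4, 2), index=row_index, columns=col_index
)

# 3. Small daily time series (column name and values are hypothetical).
dates = pd.date_range("2020-01-01", periods=5, freq="D")
visitors = pd.DataFrame({"visitors": [10, 12, 11, 15, 14]}, index=dates)

# Small integer values keep results easy to check by eye.
print(animals + 2)
print(multi.loc["mammal"])
print(visitors.loc["2020-01-02":"2020-01-03"])
```

Small integers are used on purpose, so that the result of an operation can be verified mentally, which is the point of the animals example.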

If we're able to find the ones we need, I think it'd also be great if we could have something like:

>>> import pandas
>>> animals = pandas.sample_data('animals')
>>> animals
        num_arms  num_legs
dog            0         4
cat            0         4
monkey         2         2

That should make the examples much simpler, and let them directly show the point they are trying to make. See for example the MultiIndex example here, where creating the DataFrame distracts from the operation shown: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
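
To make the idea concrete, here is a minimal sketch of how such a helper might work. Note that `pandas.sample_data` does not exist; the standalone `sample_data` function and the `_SAMPLE_DATA` registry below are hypothetical names used only to illustrate the proposal, and the real design could look quite different.

```python
import pandas as pd

# Hypothetical registry: dataset name -> function that builds a fresh DataFrame.
# Building on each call means an example cannot mutate data seen by another.
_SAMPLE_DATA = {
    "animals": lambda: pd.DataFrame(
        {"num_arms": [0, 0, 2], "num_legs": [4, 4, 2]},
        index=["dog", "cat", "monkey"],
    ),
}


def sample_data(name):
    """Return a fresh copy of the named example dataset (hypothetical helper)."""
    try:
        builder = _SAMPLE_DATA[name]
    except KeyError:
        available = ", ".join(sorted(_SAMPLE_DATA))
        raise ValueError(
            f"unknown sample dataset {name!r}; available: {available}"
        ) from None
    return builder()


animals = sample_data("animals")
result = animals + 2
print(result)
```

With a helper along these lines, a docstring example would need a single line to get its data, and the rest of the example could focus on the operation being documented.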

@python-sprints/pandas-mentoring thoughts? Ideas on datasets?
