Description
Examples in the pandas documentation very often create simple DataFrame
objects, and many of them just use random data. See for example https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#id1
>>> df = pandas.DataFrame(numpy.random.randn(5, 2), columns=list('AB'))
>>> df
A B
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
3 0.119209 -1.044236
4 -0.861849 -2.104569
Then, if I want to show an operation, I can get something like:
>>> df + 2
A B
0 2.469112 1.717137
1 0.490941 0.864368
2 3.212112 1.826785
3 2.119209 0.955764
4 1.138151 -0.104569
In my opinion this example is quite useless (other than for showing the syntax), because if you don't know what the operation does, the example doesn't help you understand it.
The best example I could find to overcome that (probably not great, but the best I could find) is:
>>> df = pandas.DataFrame({"num_arms": [0, 0, 2],
...                        "num_legs": [4, 4, 2]},
...                       index=["dog", "cat", "monkey"])
>>> df
num_arms num_legs
dog 0 4
cat 0 4
monkey 2 2
Then, when performing an operation, it's easy to guess what it's doing, or to double-check a guess you already have:
>>> df + 2
num_arms num_legs
dog 2 6
cat 2 6
monkey 4 4
We are already using some of those in some examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html
While this worked well in some places, we found this dataset insufficient to show all pandas functionality. And while we initially wanted to standardize the data used in the examples, so things are easier for recurring users, we eventually forgot about it.
It's surely not simple, but I think it'd be ideal if we could find a very small number of datasets that can be used in all pandas examples. The ones I think we surely need are:
- A simple example like the one proposed
- One with a MultiIndex (probably on both axes)
- A timeseries dataset
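To make the last two items more concrete, here is a rough sketch of what such datasets could look like. The names `multi` and `temperatures`, the labels and all the values are made up for illustration, not a proposal for the actual data:

```python
import pandas

# Hypothetical MultiIndex dataset: rows are (class, animal) pairs and
# columns are (measure, stat) pairs. All values are made up.
index = pandas.MultiIndex.from_tuples(
    [("mammal", "dog"), ("mammal", "cat"), ("bird", "falcon")],
    names=["class", "animal"],
)
columns = pandas.MultiIndex.from_product(
    [["speed", "weight"], ["max", "typical"]],
    names=["measure", "stat"],
)
multi = pandas.DataFrame(
    [[45, 30, 40, 30], [48, 25, 5, 4], [320, 250, 1, 1]],
    index=index,
    columns=columns,
)

# Hypothetical time series dataset: one made-up reading per day.
dates = pandas.date_range("2019-01-01", periods=5, freq="D")
temperatures = pandas.Series([3, 5, 4, 7, 6], index=dates, name="temperature")
```

Small integer values keep the result of an operation easy to check by eye, which is the main point of moving away from random data.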
If we're able to find the ones we need, I think it'd also be great if we could have something like:
>>> import pandas
>>> animals = pandas.sample_data('animals')
>>> animals
num_arms num_legs
dog 0 4
cat 0 4
monkey 2 2
That should make the examples much simpler, and let them get straight to the point they are trying to make. See for example the MultiIndex example here, and how creating the DataFrame distracts from the operation shown: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
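As a rough idea of how such a loader could work, here is a minimal standalone sketch. `sample_data` and `_SAMPLE_DATA` are hypothetical names; nothing like this exists in pandas today, and the real thing could just as well read the data from files:

```python
import pandas

# Hypothetical registry mapping a dataset name to a function that builds it.
# Building on every call means callers always get a fresh object that is
# safe to modify.
_SAMPLE_DATA = {
    "animals": lambda: pandas.DataFrame(
        {"num_arms": [0, 0, 2], "num_legs": [4, 4, 2]},
        index=["dog", "cat", "monkey"],
    ),
}


def sample_data(name):
    """Return a fresh copy of the named example dataset (hypothetical helper)."""
    try:
        builder = _SAMPLE_DATA[name]
    except KeyError:
        available = ", ".join(sorted(_SAMPLE_DATA))
        raise ValueError(
            f"unknown dataset {name!r}; available: {available}"
        ) from None
    return builder()


animals = sample_data("animals")
result = animals + 2  # the example can focus on the operation itself
```

With a helper like this, a docstring example needs only one line of setup before the operation it is documenting.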
@python-sprints/pandas-mentoring thoughts? Ideas on datasets?