Reading data into Pandas in Uproot 5; awkward-pandas versus MultiIndex #803

ast0815 · 2022-12-14T17:18:22Z

ast0815
Dec 14, 2022

In the getting started guide it says that nested structures will automatically be flattened into a DataFrame with MultiIndex when importing using Pandas. This does not seem to be the case any more:

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
print(df)

      NMuon  ...                                   Muon_Pz
0         2  ...  [-8.16079330444336, -11.307581901550293]
1         1  ...                      [20.199968338012695]
2         2  ...   [11.168285369873047, 36.96519088745117]
3         2  ...   [403.84844970703125, 335.0942077636719]
4         2  ...  [-89.69573211669922, 20.115053176879883]
...     ...  ...                                       ...
2416      1  ...                      [61.715789794921875]
2417      1  ...                       [160.8179168701172]
2418      1  ...                      [-52.66374969482422]
2419      1  ...                       [162.1763153076172]
2420      1  ...                       [54.71943664550781]

So I guess the guide should be updated, and maybe an example added showing how to use the awkward accessor on the structured columns, assuming that is the intended way of going about things.

Answered by jpivarski

jpivarski · 2022-12-14T19:37:07Z

jpivarski
Dec 14, 2022
Maintainer

You're right: this is a new feature and the docs are out of date. Thanks for the heads-up!

What's happening now is that any non-flat data uses the Awkward dtype provided by awkward-pandas.

If you want to explode the data as before, you can get it as an Awkward Array and use ak.to_dataframe:

>>> import uproot
>>> import awkward as ak
>>> ak.to_dataframe(events.arrays(filter_name="/(Jet|Muon)_P[xyz]/", library="ak"))
                   Jet_Px     Jet_Py      Jet_Pz    Muon_Px    Muon_Py     Muon_Pz
entry subentry                                                                    
1     0        -38.874714  19.863453   -0.894942  -0.816459 -24.404259   20.199968
3     0        -71.695213  93.571579  196.296432  22.088331 -85.835464  403.848450
      1         36.606369  21.838793   91.666283  76.691917 -13.956494  335.094208
4     0          3.880162 -75.234055 -359.601624  45.171322  67.248787  -89.695732
      1          4.979580 -39.231731   68.456718  39.750957  25.403667   20.115053
...                   ...        ...         ...        ...        ...         ...
2414  0         33.961163  58.900467  -17.006561  -9.204197 -42.204014  -64.264900
2416  0         37.071465  20.131996  225.669037 -39.285824 -14.607491   61.715790
2417  0        -33.196457 -59.664749  -29.040150  35.067146 -14.150043  160.817917
2418  0         -3.714818 -37.202377   41.012222 -29.756786 -15.303859  -52.663750
2419  0        -36.361286  10.173571  226.429214   1.141870  63.609570  162.176315

[2038 rows x 6 columns]

1 reply

ast0815 Dec 15, 2022
Author

~~Thanks! This works like a charm.~~ ~~Spoke too soon. :(~~ Your method works after all.

ast0815 · 2022-12-15T11:48:16Z

ast0815
Dec 15, 2022
Author

Oops, I take it back. Something weird is going on:

import uproot
import pandas as pd

structured_tree = uproot.open("HZZ.root")["events"]

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library='pd')
with open("structured_df.txt", "w") as f:
    print(df, file=f)

import awkward as ak

arr = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"])
df = ak.to_dataframe(arr)
with open("flattened_df.txt", "w") as f:
    print(df, file=f)

idx = pd.IndexSlice
df = df.loc[idx[:,0], :]
with open("sliced_df.txt", "w") as f:
    print(df, file=f)

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
df = df[df.NMuon > 0]
df["Muon_Px"] = df.Muon_Px.ak[:,0]
df["Muon_Py"] = df.Muon_Py.ak[:,0]
df["Muon_Pz"] = df.Muon_Pz.ak[:,0]
with open("sliced_alt_df.txt", "w") as f:
    print(df, file=f)

$ head *df.txt 

==> flattened_df.txt <==
                NMuon    Muon_Px    Muon_Py     Muon_Pz
entry subentry                                         
0     0             2 -52.899456 -11.654672   -8.160793
      1             2  37.737782   0.693474  -11.307582
1     0             1  -0.816459 -24.404259   20.199968
2     0             2  48.987831 -21.723139   11.168285
      1             2   0.827567  29.800508   36.965191
...               ...        ...        ...         ...
2416  0             1 -39.285824 -14.607491   61.715790
2417  0             1  35.067146 -14.150043  160.817917

==> sliced_alt_df.txt <==
      NMuon    Muon_Px    Muon_Py    Muon_Pz
0         2 -52.899456 -11.654672  -8.160793
1         1  -0.816459 -24.404259  20.199968
2         2  48.987831 -21.723139  11.168285
3         2  22.088331 -85.835464  403.84845
4         2  45.171322  67.248787 -89.695732
...     ...        ...        ...        ...
2416      1  23.913206 -35.665077  54.719437
2417      1  23.913206 -35.665077  54.719437
2418      1  23.913206 -35.665077  54.719437

==> sliced_df.txt <==
                NMuon    Muon_Px    Muon_Py     Muon_Pz
entry subentry                                         
0     0             2 -52.899456 -11.654672   -8.160793
1     0             1  -0.816459 -24.404259   20.199968
2     0             2  48.987831 -21.723139   11.168285
3     0             2  22.088331 -85.835464  403.848450
4     0             2  45.171322  67.248787  -89.695732
...               ...        ...        ...         ...
2416  0             1 -39.285824 -14.607491   61.715790
2417  0             1  35.067146 -14.150043  160.817917

==> structured_df.txt <==
      NMuon  ...                                   Muon_Pz
0         2  ...  [-8.16079330444336, -11.307581901550293]
1         1  ...                      [20.199968338012695]
2         2  ...   [11.168285369873047, 36.96519088745117]
3         2  ...   [403.84844970703125, 335.0942077636719]
4         2  ...  [-89.69573211669922, 20.115053176879883]
...     ...  ...                                       ...
2416      1  ...                      [61.715789794921875]
2417      1  ...                       [160.8179168701172]
2418      1  ...                      [-52.66374969482422]

It looks like things work fine at the beginning, but then things go awkward (pun intended). The last Muon_Pz in the flattened df has the value of the second-to-last entry in the original.

And it gets even weirder when I use the awkward-pandas accessor to slice things myself directly: the last few rows all have identical values.

I think I have the newest versions of awkward installed too:

$ pip list | grep awkward
awkward                        2.0.0
awkward-cpp                    2
awkward-pandas                 2022.12a1

1 reply

jpivarski Dec 15, 2022
Maintainer

You lost me at this line:

idx = pd.IndexSlice
df = df.loc[idx[:,0], :]

pd.IndexSlice makes a 2-tuple of (slice(None), 0), so this loc is getting ((slice(None), 0), slice(None)). I guess I don't understand Pandas loc semantics well enough to know what's supposed to happen there.

But it seems like the bottom line of what you're trying to do works:

import uproot, skhep_testdata
structured_tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library='pd')
arr = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library='ak')

from_df = list(df[df.NMuon > 0].Muon_Px.ak[:, 0])
from_arr = list(arr[arr.NMuon > 0].Muon_Px[:, 0])

assert from_df == from_arr

What the above doesn't test is the Awkward → Pandas transformation through ak.to_dataframe, but that's an old function and my Bayesian prior is to trust it because I think if there were a glaring bug, it would have been found already.

Ah heck, I'll try it. Starting from the above,

>>> import awkward as ak
>>> df2 = ak.to_dataframe(arr)
>>> (df2.NMuon == 0).any()
False

There cannot be any rows with NMuon == 0 because then there would be no Muon_Px, Muon_Py, Muon_Pz data to show. This is a consequence of the exploding: you can't represent empty lists. (That's part of why I didn't like this representation.)
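To illustrate why empty lists vanish in the exploded layout, here is a minimal sketch in plain pandas. It builds the (entry, subentry) MultiIndex by hand from a small made-up jagged column; the values and the `exploded` name are hypothetical, and this is not how ak.to_dataframe is implemented, only the same idea on a toy scale.

```python
import pandas as pd

# Hypothetical jagged column: one list of muon momenta per event.
# Event 1 has no muons at all.
muon_px = [[-52.9, 37.7], [], [-0.8]]

# One row per (entry, subentry) pair; an empty list contributes no rows.
rows = [
    (entry, subentry, value)
    for entry, values in enumerate(muon_px)
    for subentry, value in enumerate(values)
]
exploded = pd.DataFrame(
    rows, columns=["entry", "subentry", "Muon_Px"]
).set_index(["entry", "subentry"])

print(exploded)

# Entry 1 is missing from the index because it had nothing to explode.
entries_present = [
    int(e) for e in sorted(set(exploded.index.get_level_values("entry")))
]
print(entries_present)
```

The event without muons leaves no trace in the exploded frame, which is presumably why a flat per-event column such as NMuon can never be 0 there.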

To do the next step of selecting just the rows with subentry == 0, I learned what the nested slice syntax you used means (from this documentation). Now I understand why you were doing it.

>>> from_df2 = list(df2.loc[(slice(None), 0), :].Muon_Px)
>>> assert from_df2 == from_df

But the result is the same. You said this was an issue for pz, but I tried all three components and the lists are all the same.
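For anyone else puzzled by the nested slice, here is a small sketch on a made-up MultiIndex frame (the numbers are hypothetical, not from HZZ.root). It shows that pd.IndexSlice only builds the tuple, that the tuple selects subentry 0 at every entry, and that DataFrame.xs is an alternative that drops the selected level.

```python
import pandas as pd

# Toy frame with the same (entry, subentry) layout as the exploded data.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (2, 0), (2, 1)], names=["entry", "subentry"]
)
df = pd.DataFrame({"Muon_Px": [-52.9, 37.7, -0.8, 49.0, 0.8]}, index=index)

# pd.IndexSlice is only a convenience for writing the tuple of slicers.
idx = pd.IndexSlice
key = idx[:, 0]
print(key == (slice(None), 0))

# Row key: "every entry, subentry 0"; column key: all columns.
first_muons = df.loc[key, :]
same = df.loc[(slice(None), 0), :]

# Cross-section on the subentry level; the level is dropped from the result.
xs = df.xs(0, level="subentry")

print(first_muons)
print(xs)
```

Both loc forms keep the two-level index, while xs leaves only the entry level.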

I don't see a bug here. My awkward-pandas is the latest version, checked out of GitHub. The version you're using is from last week, not too long ago, and I don't think @douglasdavis was working on awkward-pandas in the past week. (No, no changes other than documentation.)

~~Could you make a reproducer that shows exactly~~

Whoa, I actually see different numbers than you do when I run the same command:

>>> ak.to_dataframe(structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"]))
                NMuon    Muon_Px    Muon_Py     Muon_Pz
entry subentry                                         
0     0             2 -52.899456 -11.654672   -8.160793
      1             2  37.737782   0.693474  -11.307582
1     0             1  -0.816459 -24.404259   20.199968
2     0             2  48.987831 -21.723139   11.168285
      1             2   0.827567  29.800508   36.965191
...               ...        ...        ...         ...
2416  0             1 -39.285824 -14.607491   61.715790
2417  0             1  35.067146 -14.150043  160.817917
2418  0             1 -29.756786 -15.303859  -52.663750
2419  0             1   1.141870  63.609570  162.176315
2420  0             1  23.913206 -35.665077   54.719437

[3825 rows x 4 columns]

Your sliced_alt_df.txt only goes up to entry 2418 and it has my entry 2420 appearing 3 times. It's also odd that your Pandas print-outs are not symmetric around the "...". They usually show as many entries at the end as they do at the beginning, but yours show more entries at the beginning. Could it be that you've found a Pandas bug? That's implausible but—oh, you're using head to print them out, and that's why the ends are being cut off. It doesn't explain why you see the last entry repeated, though.

I'm going to leave it at that. There are other mysteries going on here. Could you try them in an interactive prompt or in Jupyter, or as many ways as possible because maybe the display-method that you've chosen is garbling things?
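The effect of head on a printed DataFrame can be reproduced without any ROOT file. The sketch below uses a hypothetical 100-row frame and pins the display options to what I believe are the pandas defaults; it mimics head by keeping the first 10 lines of the text that print would write.

```python
import pandas as pd

# Hypothetical stand-in for the real data: long enough to be truncated.
df = pd.DataFrame({"x": [i * 1.5 for i in range(100)]})

# Pin the display options so the sketch does not depend on local settings.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    full_lines = repr(df).splitlines()

# `head` prints only the first 10 lines of a file by default.
shown = full_lines[:10]

print("\n".join(shown))
print(f"{len(full_lines) - len(shown)} lines of the printout were cut off")
```

The truncated repr is symmetric around the "..." row, but head then cuts off its tail, so the last rows and the "[N rows x M columns]" footer never appear.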

ast0815 · 2022-12-15T15:12:12Z

ast0815
Dec 15, 2022
Author

Right, I forgot about the head. Sorry about that. So the output of the flattening method that you suggested actually works as it should.

Only my alternative approach, using the awkward-pandas accessor, produces garbled data.

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
df = df[df.NMuon > 0]
df["Muon_Px"] = df.Muon_Px.ak[:,0]
df["Muon_Py"] = df.Muon_Py.ak[:,0]
df["Muon_Pz"] = df.Muon_Pz.ak[:,0]
print(df)

      NMuon    Muon_Px    Muon_Py    Muon_Pz
0         2 -52.899456 -11.654672  -8.160793
1         1  -0.816459 -24.404259  20.199968
2         2  48.987831 -21.723139  11.168285
3         2  22.088331 -85.835464  403.84845
4         2  45.171322  67.248787 -89.695732
...     ...        ...        ...        ...
2416      1  23.913206 -35.665077  54.719437
2417      1  23.913206 -35.665077  54.719437
2418      1  23.913206 -35.665077  54.719437
2419      1  23.913206 -35.665077  54.719437
2420      1  23.913206 -35.665077  54.719437

[2362 rows x 4 columns]

And if I do not filter out the muonless events, I get an indexing error. The structured DataFrame can contain rows without muons; those just have empty "lists" in the momentum columns:

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
df = df[df.NMuon == 0]
print(df)

      NMuon Muon_Px Muon_Py Muon_Pz
43        0      []      []      []
44        0      []      []      []
71        0      []      []      []
85        0      []      []      []
186       0      []      []      []
...     ...     ...     ...     ...
2194      0      []      []      []
2207      0      []      []      []
2231      0      []      []      []
2243      0      []      []      []
2408      0      []      []      []

[59 rows x 4 columns]

So I guess this is a bug in awkward-pandas?

7 replies

agoose77 Dec 16, 2022
Collaborator

I'm not a Pandas expert, but here's my understanding of what's happening.

The __getitem__ method on awkward-pandas returns a pd.Series. This returned series does not refer to the original series index.

Your code, rewritten into separate phases, is

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")

df2 = df[df.NMuon > 0]

df3 = df2.copy()
df3["Muon_Px"] = df2.Muon_Px.ak[:, 0]
df3["Muon_Py"] = df2.Muon_Py.ak[:, 0]
df3["Muon_Pz"] = df2.Muon_Pz.ak[:, 0]

print(df3)

By slicing df, df2 has a non-consecutive index; there are gaps that follow from the False elements of the boolean slice. Slicing the .ak accessor, however, returns a series with a different index: the default Pandas index of consecutive integers. Pandas then, I presume, aligns the elements by index label and fills the remainder.
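The alignment can be seen in plain pandas, without awkward-pandas. In this sketch the frame and the `first_px` series are hypothetical stand-ins: `first_px` plays the role of the accessor result, carrying a default 0, 1, 2 index instead of the filtered frame's labels.

```python
import pandas as pd

df = pd.DataFrame({"NMuon": [2, 0, 1, 1]})

# Boolean filtering keeps the original labels: 0, 2, 3.
df2 = df[df.NMuon > 0].copy()

# Stand-in for the accessor result: correct values in positional order,
# but labelled with a fresh default index 0, 1, 2.
first_px = pd.Series([10.0, 20.0, 30.0])

# Column assignment aligns on labels, not on position.
df2["Muon_Px"] = first_px

print(df2)
```

Label 0 receives 10.0, label 2 receives 30.0 (the third value, not the second), and label 3 has no match and becomes NaN. With the awkward extension dtype the filled values may look different, as the repeated rows in this thread suggest, but the label mismatch is the same.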

If you don't need the df3 index to correspond to the df index, you can reset it:

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")

df2 = df[df.NMuon > 0]

df3 = df2.reset_index()
df3["Muon_Px"] = df2.Muon_Px.ak[:, 0]
df3["Muon_Py"] = df2.Muon_Py.ak[:, 0]
df3["Muon_Pz"] = df2.Muon_Pz.ak[:, 0]

print(df3)
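Two variations on this workaround, sketched in plain pandas under the same assumption that the accessor result comes back with a default index. The data and the `first_px` stand-in are hypothetical. The first drops the old labels entirely, so no extra "index" column appears; the second keeps the original labels and assigns the values positionally.

```python
import pandas as pd

df = pd.DataFrame({"NMuon": [2, 0, 1, 1]})
selected = df[df.NMuon > 0]  # labels 0, 2, 3

# Stand-in for selected.Muon_Px.ak[:, 0]: default index 0, 1, 2.
first_px = pd.Series([-52.9, -0.8, 49.0])

# Option 1: discard the old labels so both sides share 0, 1, 2.
opt1 = selected.reset_index(drop=True)
opt1["Muon_Px"] = first_px

# Option 2: keep the original labels and bypass alignment by
# assigning a plain NumPy array, which is matched by position.
opt2 = selected.copy()
opt2["Muon_Px"] = first_px.to_numpy()

print(opt1)
print(opt2)
```

Option 2 is only safe when the values are known to be in the same row order as the filtered frame, which is the case for a slice that keeps one element per row.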

ast0815 Dec 16, 2022
Author

Ok, with the reset_index in there it seems to work as intended:

      index  NMuon    Muon_Px    Muon_Py     Muon_Pz
0         0      2 -52.899456 -11.654672   -8.160793
1         1      1  -0.816459 -24.404259   20.199968
2         2      2  48.987831 -21.723139   11.168285
3         3      2  22.088331 -85.835464   403.84845
4         4      2  45.171322  67.248787  -89.695732
...     ...    ...        ...        ...         ...
2357   2416      1 -39.285824 -14.607491    61.71579
2358   2417      1  35.067146 -14.150043  160.817917
2359   2418      1 -29.756786 -15.303859   -52.66375
2360   2419      1    1.14187   63.60957  162.176315
2361   2420      1  23.913206 -35.665077   54.719437

[2362 rows x 5 columns]

I would still classify this as a bug though. It certainly is unexpected behaviour. Whether it is in pandas or in awkward-pandas I cannot say. I guess I should just open an issue on the latter and see what happens if someone with expert knowledge takes a look.

In any case, thanks all!

agoose77 Dec 16, 2022
Collaborator

If this were a bug, it's firmly in awkward-pandas; we return a pd.Series without an index.

I'm mulling over whether this is a bug. At the user-level, I might expect slices of the ak accessor to return a series with the appropriate index. I suspect it should be feasible; if we determine the kind of indexing being performed, then we could also apply that to the series index. It would only matter for slices that touch the first dimension.

@douglasdavis pinging you, as you probably want to be involved in this conversation! :) Does awkward-pandas support multi-index too, or are they mutually exclusive?

douglasdavis Dec 16, 2022

awkward-pandas is young enough to say we haven't thought deeply about indexing yet! It looks like an issue has been opened in the awkward-pandas repository, and that is probably the best place to discuss this!

jpivarski Dec 16, 2022
Maintainer

I'm surprised that it didn't automatically cross-link: intake/akimbo#27

Seeing this in GitHub now, rather than email, I realize that @ast0815 opened that issue: it wasn't coincidentally discovered at the same time!
