-
Notifications
You must be signed in to change notification settings - Fork 101
feat: consider unions of ints and floats as valid layouts #3780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
🚀 New features to boost your workflow:
|
|
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3780 |
ariostas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! It makes sense to me that if someone writes a = ak.Array([1, 1.5, 2]) then it should default to making everything a float, but if they explicitly construct it to keep the two different types then it should still be considered valid.
In fact, I would go as far as saying that if they explicitly want to keep e.g. f32 separate from f64 it should still be considered valid, but that's probably less common and more controversial, so I think mergecastable="family" is a good middle ground.
|
Yeah I think But explicit union creation indeed I think it should be considered valid if floats and ints are separate. |
|
@ikrommyd and @ariostas - I still think that we should follow the strict data typing and semantic integrity. IMHO the "union" of integers and floats is a logical error rather than a convenience. NumPy and Pandas typically "upcast" integers to floats when they meet in an array or column. This is done to maintain a single, uniform data type for the hardware to process quickly. Because missing data ( |
|
@ianna I feel like the comparison with NumPy or Pandas is a bit too restrictive. Both of those libraries can only support unions if they use So something like >>> import awkward as ak
>>> ak.Array([1, "two", 3])
<Array [1, 'two', 3] type='3 * union[int64, string]'>is perfectly fine. Whereas in NumPy >>> import numpy as np
>>> np.array([1, "two", 3])
array(['1', 'two', '3'], dtype='<U21')upcasts everything to a string, and the only way to keep the original types (as far as I know) is to use So in the same way, I think it makes sense to allow integers and floats to be a union of separate types. But I totally agree with you that in most cases it makes sense to upcast to floats to have a uniform type and make the computations more efficient. That's why I think the upcasting to floats should stay as it is right now. So this would not change the result of |
|
Yeah I think that's the difference too. Numpy, Pandas n friends upcast because there is no union type. And that happens in awkward two when an array is created from an iterator. We are talking about explicit unions here where one manually creates a union of ints n floats and I think that's the right thing to do since a union type exists. The only thing we're changing here is whether we call unions of ints and floats valid and I think that is actually valid layout because yes we can pretty accurately represent ints as floats but not exactly in IEE 754. There is no other interface change apart from what the validity check returns. I would go as far as to argue that in the case of |
@ariostas - A union of Let me unpack that in a way that aligns with awkward’s design philosophy and the realities of columnar data systems. Why unions of
|
@ikrommyd - you are correct. A union of It is not a heterogeneous domain. |
|
I will not fight this a lot because I don't have super strong opinions about this, I just think telling that an array is "invalid" with a highlevel function sounds scary because the user thinks that they need to do something about it while in reality, you can create such unions in pyarrow to and there is nothing that complains about validity. That's all, so I thought that we just shouldn't tell that "hey this is invalid" for the same reason. Regarding kernels and dispatching, it is understandable but I think we generally simplify before dispatching to kernels.
Not in numpy btw. And it's also in the original numpy guide from 2000 or something. Bool is its own thing there. It is only a subdtype of I pose a another question though here. What about unused contents? Should a union that has unused contents be invalid? I would assume yes because you are using extra memory and your union can drop buffers without any problem. There is currently no check for this in the validity check. |
@ikrommyd — Thanks! If the wording “invalid” is too strong, we can absolutely soften it — e.g., “non‑minimal,” “contains unused contents,” or “not semantically compact.” The important part is that the check reflects the state of the layout, not that we alarm users. Or perhaps prompt them to act on it?
@ikrommyd - You’re completely right about NumPy — What I meant is that in typed columnar systems (Arrow, Parquet, DuckDB, Polars, Spark, etc.), On the very good question of unused contents: I think a union that carries content arrays never referenced by any tag should be treated as non‑minimal. Unused contents:
There’s no semantic reason to keep an unreferenced content array around, so a validity check should surface it. A simple rule would be: every content must appear at least once in the tag buffer. That keeps unions meaningful, predictable, and compact, and avoids users unknowingly carrying dead buffers. That said, there’s always a real trade‑off between memory usage and speed, especially given awkward’s immutability and zero‑copy goals. It’s probably worth revisiting this in more depth next year to decide where the balance should land. |
|
@ianna okay, I don't have such a strong opinion about this, so I'm fine with leaving it as it is. At the end of the day, Awkward will still work with these, even if they are considered "invalid", so that's what matters. And I guess if a user is using lower-level functionality to create these things they'll probably be knowledgeable and be fine with Awkward telling them it's invalid. |
I can change this PR to do that instead so that the validity check reports something like "contents 1, 3, and 5 are unused" for unused contents. |
As suggested by @ariostas in #3773 (review), I also don't think unions of floats and ints should be considered invalid. Usually, there is a very good reason to keep ints and floats separate and it's not just for fun. This PR changes just that.
Before:
With this PR:
@ianna @ariostas, what do you think about this?