@ikrommyd
Collaborator

As suggested by @ariostas in #3773 (review), I also don't think unions of floats and ints should be considered invalid. Usually, there is a very good reason to keep ints and floats separate and it's not just for fun. This PR changes just that.

Before:

In [3]: array = ak.contents.UnionArray(
   ...:     ak.index.Index8([0, 1, 1, 0]),
   ...:     ak.index.Index64([0, 1, 1, 0]),
   ...:     [
   ...:         ak.contents.NumpyArray(np.array([1.0, 2.0], dtype=np.float64)),
   ...:         ak.contents.NumpyArray(np.array([1, 2], dtype=np.int64)),
   ...:     ],
   ...: )

In [4]: ak.validity_error(array)
Out[4]: 'at highlevel: content(1) is mergeable with content(0)'

With this PR:

In [3]: array = ak.contents.UnionArray(
   ...:     ak.index.Index8([0, 1, 1, 0]),
   ...:     ak.index.Index64([0, 1, 1, 0]),
   ...:     [
   ...:         ak.contents.NumpyArray(np.array([1.0, 2.0], dtype=np.float64)),
   ...:         ak.contents.NumpyArray(np.array([1, 2], dtype=np.int64)),
   ...:     ],
   ...: )

In [4]: ak.validity_error(array)
Out[4]: ''
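For readers less familiar with the layout above, here is a minimal plain-Python sketch of how such a union is read: `tags` picks which content an element comes from and `index` picks the position inside that content. The helper name `decode_union` is hypothetical and this is not Awkward's implementation, only an illustration of the encoding.

```python
# Plain-Python illustration of a tagged union; decode_union is a hypothetical helper.
tags = [0, 1, 1, 0]       # which content each element comes from
index = [0, 1, 1, 0]      # position inside the selected content
contents = [
    [1.0, 2.0],           # content 0: the float64 values
    [1, 2],               # content 1: the int64 values
]


def decode_union(tags, index, contents):
    """Return the logical elements of the union as a Python list."""
    return [contents[t][i] for t, i in zip(tags, index)]


elements = decode_union(tags, index, contents)
print(elements)                               # [1.0, 2, 2, 1.0]
print([type(x).__name__ for x in elements])   # ['float', 'int', 'int', 'float']
```

Each element keeps its original type, which is the distinction the validity check currently flags as mergeable.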

@ianna @ariostas, what do you think about this?

@ikrommyd ikrommyd requested review from ariostas and ianna December 22, 2025 23:43
@ikrommyd ikrommyd changed the title from "do not consider unions of ints and floats as invalid" to "feat: do not consider unions of ints and floats as invalid" Dec 22, 2025
@ikrommyd ikrommyd changed the title from "feat: do not consider unions of ints and floats as invalid" to "feat: consider unions of ints and floats as valid layouts" Dec 22, 2025
@codecov

codecov bot commented Dec 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.75%. Comparing base (f2416b5) to head (f949e8d).

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/contents/unionarray.py 86.56% <100.00%> (ø)

... and 2 files with indirect coverage changes

@github-actions

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3780

Member

@ariostas ariostas left a comment

Looks good to me! It makes sense to me that if someone writes a = ak.Array([1, 1.5, 2]) then it should default to making everything a float, but if they explicitly construct it to keep the two different types then it should still be considered valid.

In fact, I would go as far as saying that if they explicitly want to keep e.g. f32 separate from f64 it should still be considered valid, but that's probably less common and more controversial, so I think mergecastable="family" is a good middle ground.

@ikrommyd
Collaborator Author

Yeah, I think a = ak.Array([1, 1.5, 2]) is fine because it behaves the same as NumPy:

In [7]: a = ak.Array([1, 1.5, 2])

In [8]: a
Out[8]: <Array [1, 1.5, 2] type='3 * float64'>

In [9]: a.layout
Out[9]: <NumpyArray dtype='float64' len='3'>[1.  1.5 2. ]</NumpyArray>

In [11]: np.array([1, 1.5, 2])
Out[11]: array([1. , 1.5, 2. ])

In [12]: np.array([1, 1.5, 2]).dtype
Out[12]: dtype('float64')

But for explicit union creation, I do think it should be considered valid to keep floats and ints separate.

@ianna
Member

ianna commented Dec 23, 2025

@ikrommyd and @ariostas - I still think that we should follow the strict data typing and semantic integrity. IMHO the "union" of integers and floats is a logical error rather than a convenience. NumPy and Pandas typically "upcast" integers to floats when they meet in an array or column. This is done to maintain a single, uniform data type for the hardware to process quickly. Because missing data (NaN) is mathematically a float, many libraries have historically forced entire integer datasets into float formats just to accommodate a single empty cell. Libraries follow a "lowest common denominator" logic. Since a float can approximate an integer but an integer cannot represent a decimal, the float always "wins" the conversion.

@ariostas
Member

@ianna I feel like the comparison with NumPy or Pandas is a bit too restrictive. Both of those libraries can only support unions if they use dtype=object, but in Awkward there are proper unions with all the options being specified.

So something like

>>> import awkward as ak
>>> ak.Array([1, "two", 3])
<Array [1, 'two', 3] type='3 * union[int64, string]'>

is perfectly fine. Whereas in NumPy

>>> import numpy as np
>>> np.array([1, "two", 3])
array(['1', 'two', '3'], dtype='<U21')

upcasts everything to a string, and the only way to keep the original types (as far as I know) is to use dtype=object.

So in the same way, I think it makes sense to allow integers and floats to be a union of separate types.

But I totally agree with you that in most cases it makes sense to upcast to floats to have a uniform type and make the computations more efficient. That's why I think the upcasting to floats should stay as it is right now. So this would not change the result of ak.Array([1, 1.5, 2]). It's only in cases when someone goes out of their way to keep integers and floats separate, which seems to me like it's reasonable sometimes.

@ikrommyd
Collaborator Author

ikrommyd commented Dec 23, 2025

Yeah, I think that's the difference too. NumPy, Pandas, and friends upcast because they have no union type. That also happens in Awkward when an array is created from an iterable. We are talking about explicit unions here, where one manually creates a union of ints and floats, and I think allowing that is the right thing to do since a union type exists.
NumPy automatically upcasts bools too, but Awkward doesn't because it has a union type. And in the validity check, by the way, we have mergebool=False, which means a union of ints and bools is valid.
See below:

In [1]: import numpy as np

In [2]: np.array([1, True, 2])
Out[2]: array([1, 1, 2])

In [3]: import awkward as ak

In [4]: ak.Array([1, True, 2])
Out[4]: <Array [1, True, 2] type='3 * union[int64, bool]'>

In [5]: np.array([1, True, 2.0])
Out[5]: array([1., 1., 2.])

In [6]: ak.Array([1, True, 2.0])
Out[6]: <Array [1, True, 2] type='3 * union[float64, bool]'>

The only thing we're changing here is whether we call unions of ints and floats valid, and I think that is actually a valid layout: we can represent ints as floats fairly accurately, but not always exactly, in IEEE 754. There is no other interface change apart from what the validity check returns.
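To make the exactness point concrete, here is a small sketch in plain Python (whose `float` is an IEEE 754 double, the same format as float64). A double has a 53-bit significand, so not every integer above 2**53 survives a round trip through a float. The variable names are only for illustration.

```python
# Integers above 2**53 are not all representable as IEEE 754 doubles.
big = 2**53 + 1                       # 9007199254740993, well within int64 range
as_float = float(big)                 # rounds to the nearest representable double

print(big)                            # 9007199254740993
print(int(as_float))                  # 9007199254740992
print(int(as_float) == big)           # False: the conversion lost information

small = 12345
print(int(float(small)) == small)     # True: small integers convert exactly
```

So merging an int64 content into a float64 content is safe for small values but can silently change large ones.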

I would go as far as to argue that in the case of [1, 2.0, 3], Awkward should probably give a union when parsing the iterable instead of converting ints to floats, since it supports that internally, unlike NumPy. But that's a public behavior change, so we wouldn't want to do that now.

@ianna
Member

ianna commented Dec 23, 2025

It's only in cases when someone goes out of their way to keep integers and floats separate, which seems to me like it's reasonable sometimes.

@ariostas - A union of int and float is almost always accidental in awkward array.

Let me unpack that in a way that aligns with awkward’s design philosophy and the realities of columnar data systems.

Why unions of int + float are accidental in practice

1. There is no semantic reason to mix them

In real data models:

  • integers represent counts, IDs, categories, indices
  • floats represent measurements, probabilities, continuous values

If a column contains both, it’s almost never because the domain is “a number that is sometimes discrete and sometimes continuous.” It’s because something upstream was inconsistent.

This is why Arrow, Parquet, Polars, DuckDB, and every typed system treat this as a schema smell.

2. It usually comes from Python’s dynamic typing leaking into a typed system

Typical accidental sources:

  • Python lists like [1, 2.5, 3]
  • JSON where numbers are not consistently typed
  • user code that appends floats to an int list
  • missing-value sentinels (-1 vs nan)
  • reading data from untyped sources

awkward preserves what it sees — so if the input is inconsistent, you get a union.

But the intent was not “I want a sum type of int|float.” The intent was “I want numbers.”

3. It breaks vectorization and GPU kernels

A union of int and float forces awkward to:

  • dispatch per element
  • branch on tags
  • lose SIMD opportunities
  • lose kernel fusion
  • lose type stability
  • complicate CUDA/CCCL kernels

No one intends this. It’s a performance cliff caused by inconsistent input.

4. It is not a meaningful sum type

A real union type is something like:

  • int | string (e.g., ID vs name)
  • recordA | recordB (variant schemas)
  • None | record (optional)

Those have semantic meaning.

But int | float is not a meaningful variant. It’s just two numeric storage types that should be unified.

5. awkward’s philosophy is: preserve meaning, not accidents

awkward preserves structure and type because it matters. But it does not encourage meaningless heterogeneity.

If the user intends a numeric column, the correct representation is:

  • float (if any floats appear)
  • int (if all values are integers)

A union is a sign that the input was inconsistent, not that the domain is heterogeneous.

So what is intentional?

Intentional unions look like:

  • {"muon": {...}} | {"electron": {...}}
  • {"type": "A", ...} | {"type": "B", ...}
  • None | record
  • int | string

These represent real variant types.

int | float does not.

To summarize

If the two types differ only in numeric precision, the union is accidental.

awkward’s job is to preserve the distinction when it’s meaningful — not when it’s noise.

If this isn’t clearly documented yet, we should make sure it is.

Users can create unions of ints and floats — awkward will faithfully represent whatever they give it — but our job is to teach them that this is not a valid or meaningful domain type for performant columnar computation.

This distinction is essential:

  • Allowed (awkward will not forbid it)
  • Representable (UnionArray can encode it)
  • But not valid in the sense of:
    • not semantically meaningful
    • not stable for schemas
    • not good for vectorization
    • not good for GPU kernels
    • not good for Arrow/Parquet interoperability
    • not good for performance

@ianna
Member

ianna commented Dec 23, 2025

union of ints and bools is valid

@ikrommyd - you are correct. A union of bool and int is valid because bool is a subtype of integer in every typed columnar system.

It is not a heterogeneous domain.
It is a refinement of the same domain.

@ikrommyd
Collaborator Author

I won't fight this much because I don't have super strong opinions about it. I just think that telling users an array is "invalid" from a high-level function sounds scary, because they will think they need to do something about it, while in reality you can create such unions in pyarrow too and nothing complains about validity. That's all, so I thought we just shouldn't say "hey, this is invalid" for the same reason.

Regarding kernels and dispatching, it is understandable but I think we generally simplify before dispatching to kernels.

bool is a subtype of integer in every typed columnar system.

Not in numpy btw. And it's also in the original numpy guide from 2000 or something. Bool is its own thing there. It is only a subdtype of np.generic.

>>> import numpy as np
>>> np.issubdtype(np.bool_, np.integer)
False

I pose another question here, though. What about unused contents? Should a union that has unused contents be invalid? I would assume yes, because you are using extra memory and your union can drop buffers without any problem. There is currently no check for this in the validity check.

@ianna
Member

ianna commented Dec 24, 2025

I won't fight this much because I don't have super strong opinions about it. I just think that telling users an array is "invalid" from a high-level function sounds scary, because they will think they need to do something about it, while in reality you can create such unions in pyarrow too and nothing complains about validity. That's all, so I thought we just shouldn't say "hey, this is invalid" for the same reason.

@ikrommyd — Thanks! If the wording “invalid” is too strong, we can absolutely soften it — e.g., “non‑minimal,” “contains unused contents,” or “not semantically compact.” The important part is that the check reflects the state of the layout, not that we alarm users. Or perhaps prompt them to act on it?

Regarding kernels and dispatching, it is understandable but I think we generally simplify before dispatching to kernels.

bool is a subtype of integer in every typed columnar system.

Not in numpy btw. And it's also in the original numpy guide from 2000 or something. Bool is its own thing there. It is only a subdtype of np.generic.

>>> import numpy as np
>>> np.issubdtype(np.bool_, np.integer)
False

I pose another question here, though. What about unused contents? Should a union that has unused contents be invalid? I would assume yes, because you are using extra memory and your union can drop buffers without any problem. There is currently no check for this in the validity check.

@ikrommyd - You’re completely right about NumPy — bool_ is not a subtype of np.integer, and that’s been consistent since the early NumPy documentation. NumPy treats bool as its own scalar type with only np.generic as the common ancestor, so in that ecosystem it really is separate.

What I meant is that in typed columnar systems (Arrow, Parquet, DuckDB, Polars, Spark, etc.), bool is implemented as a specialized integer storage type rather than a distinct numeric family. That’s why bool | int forms a coherent domain in those systems, whereas int | float does not.

On the very good question of unused contents: I think a union that carries content arrays never referenced by any tag should be treated as non‑minimal. Unused contents:

  • waste memory
  • complicate the schema
  • prevent buffer dropping
  • introduce unnecessary dispatch paths
  • and perhaps indicate something upstream wasn’t intentional (well, slicing is intentional :-)

There’s no semantic reason to keep an unreferenced content array around, so a validity check should surface it. A simple rule would be: every content must appear at least once in the tag buffer.

That keeps unions meaningful, predictable, and compact, and avoids users unknowingly carrying dead buffers.

That said, there’s always a real trade‑off between memory usage and speed, especially given awkward’s immutability and zero‑copy goals. It’s probably worth revisiting this in more depth next year to decide where the balance should land.

@ariostas
Member

@ianna okay, I don't have such a strong opinion about this, so I'm fine with leaving it as it is. At the end of the day, Awkward will still work with these, even if they are considered "invalid", so that's what matters. And I guess if a user is using lower-level functionality to create these things they'll probably be knowledgeable and be fine with Awkward telling them it's invalid.

@ikrommyd
Collaborator Author

There’s no semantic reason to keep an unreferenced content array around, so a validity check should surface it. A simple rule would be: every content must appear at least once in the tag buffer.

I can change this PR to do that instead so that the validity check reports something like "contents 1, 3, and 5 are unused" for unused contents.
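A rough sketch of what such a check could look like, written in plain Python over a list of tags. The function names `unused_contents` and `unused_contents_error` and the exact message wording are assumptions for illustration, not existing Awkward API. Returning an empty string for a clean layout mirrors what ak.validity_error returns for valid arrays.

```python
def unused_contents(tags, num_contents):
    """Return the indices of contents that no tag refers to (hypothetical helper)."""
    used = set(tags)
    return [i for i in range(num_contents) if i not in used]


def unused_contents_error(tags, num_contents):
    """Return '' if every content is referenced, else a short message (hypothetical)."""
    unused = unused_contents(tags, num_contents)
    if not unused:
        return ""
    noun = "content" if len(unused) == 1 else "contents"
    verb = "is" if len(unused) == 1 else "are"
    return f"{noun} {', '.join(str(i) for i in unused)} {verb} unused"


# Six contents, but the tags only ever select contents 0, 2, and 4.
print(unused_contents_error([0, 2, 2, 4, 0], 6))   # contents 1, 3, 5 are unused
# Every content is referenced at least once, so nothing is reported.
print(repr(unused_contents_error([0, 1, 1, 0], 2)))  # ''
```

A real implementation would read the tags from the union's tag buffer and would likely use a vectorized operation instead of a Python set.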
