
[SPARK-56561][DOCS] Document order preservation for array_distinct, array_intersect, array_union, array_except #55549

Open

shrirangmhalgi wants to merge 1 commit into apache:master from shrirangmhalgi:SPARK-56561-doc-array-order

Conversation

@shrirangmhalgi

What changes were proposed in this pull request?

This change documents the order preservation behavior of array_distinct, array_intersect, array_union, and array_except in:

  • SQL function descriptions (@ExpressionDescription)
  • Scala API scaladoc (functions.scala)
  • PySpark docstrings (builtin.py)

Also fixes an incorrect statement in array_except's scaladoc, which said "The order of elements in the result is not determined"; the implementation in fact preserves the order of elements from the first array.
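As a plain-Scala sketch of the corrected semantics (illustrative only; this is not Spark's implementation, which uses its own internal hash structures), array_except keeps the surviving elements in their first-occurrence order within the first array:

```scala
// Illustrative sketch of the documented array_except ordering semantics.
object ArrayExceptSketch {
  // Keep elements of `a` that do not appear in `b`, de-duplicated,
  // in their first-occurrence order within `a`.
  def arrayExcept[T](a: Seq[T], b: Seq[T]): Seq[T] =
    a.distinct.filterNot(b.toSet.contains)

  def main(args: Array[String]): Unit = {
    // Mirrors the spark-shell check below: only 1 survives, in a's order.
    println(arrayExcept(Seq(3, 1, 2, 1, 3), Seq(2, 4, 3)))
  }
}
```

`Seq.distinct` already preserves first-occurrence order in the Scala standard library, which is what makes this a faithful model of the documented behavior.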

Why are the changes needed?

With this change, users no longer need to read the implementation to know whether these functions preserve element order. This is useful during code review and helps AI coding agents understand the behavior.

Does this PR introduce any user-facing change?

No. This is a documentation-only change.

How was this patch tested?

  1. Verified unit tests using SBT - CollectionExpressionsSuite and DataFrameFunctionsSuite pass:
  • build/sbt 'catalyst/testOnly *CollectionExpressionsSuite -- -z "Array Distinct" -z "Array Union" -z "Array Except" -z "Array Intersect"'
  • build/sbt 'sql/testOnly *DataFrameFunctionsSuite -- -z "array_distinct" -z "array_intersect" -z "array_union" -z "array_except"'
  2. Runtime verification in spark-shell:
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq((Array(3,1,2,1,3), Array(2,4,3)))).toDF("a","b")

val r1 = df.select(array_distinct(col("a"))).collect()(0).getSeq[Int](0)
println(s"array_distinct([3,1,2,1,3]) = $r1")

Result - array_distinct([3,1,2,1,3]) = ArraySeq(3, 1, 2)

val r2 = df.select(array_union(col("a"), col("b"))).collect()(0).getSeq[Int](0)
println(s"array_union([3,1,2,1,3], [2,4,3]) = $r2")

Result - array_union([3,1,2,1,3], [2,4,3]) = ArraySeq(3, 1, 2, 4)

val r3 = df.select(array_intersect(col("a"), col("b"))).collect()(0).getSeq[Int](0)
println(s"array_intersect([3,1,2,1,3], [2,4,3]) = $r3")

Result - array_intersect([3,1,2,1,3], [2,4,3]) = ArraySeq(3, 2)

val r4 = df.select(array_except(col("a"), col("b"))).collect()(0).getSeq[Int](0)
println(s"array_except([3,1,2,1,3], [2,4,3]) = $r4")

Result - array_except([3,1,2,1,3], [2,4,3]) = ArraySeq(1)
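The four logged results can be cross-checked against a small order-preserving reference model in plain Scala (a sketch of the documented semantics under the assumption of first-occurrence ordering, not Spark's code):

```scala
// Order-preserving reference model for the four array functions
// (illustrative sketch; Spark's actual implementation differs internally).
object ArrayOrderModel {
  // De-duplicate, keeping first-occurrence order.
  def distinctOrd[T](a: Seq[T]): Seq[T] = a.distinct
  // Elements of a, then new elements of b, each in first-occurrence order.
  def unionOrd[T](a: Seq[T], b: Seq[T]): Seq[T] = (a ++ b).distinct
  // Elements of a also present in b, in a's first-occurrence order.
  def intersectOrd[T](a: Seq[T], b: Seq[T]): Seq[T] = a.distinct.filter(b.toSet)
  // Elements of a absent from b, in a's first-occurrence order.
  def exceptOrd[T](a: Seq[T], b: Seq[T]): Seq[T] = a.distinct.filterNot(b.toSet)

  def main(args: Array[String]): Unit = {
    val (a, b) = (Seq(3, 1, 2, 1, 3), Seq(2, 4, 3))
    println(distinctOrd(a))
    println(unionOrd(a, b))
    println(intersectOrd(a, b))
    println(exceptOrd(a, b))
  }
}
```

Running this model on the same inputs reproduces the element orders shown in the spark-shell output above.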


Was this patch authored or co-authored using generative AI tooling?

No.

shrirangmhalgi force-pushed the SPARK-56561-doc-array-order branch from 8e03f20 to 534cf1b on April 25, 2026 06:54
shrirangmhalgi force-pushed the SPARK-56561-doc-array-order branch from 534cf1b to 0b421b0 on April 25, 2026 06:56