feat(csharp): improve handling of StructArrays #2587
base: main
davidhcoe commented on Mar 7, 2025
- Improves the handling of structs so they can be returned as objects or as a JsonString (defaults to JsonString so that existing callers do not break)
- Adds testing for each return type
- Updates the ADO.NET wrapper to support both struct return types
- Fixes #2586 (csharp: ValueAt extension causes error when StringArray length = 0)
     public ReadRowsStream(IAsyncEnumerator<ReadRowsResponse> response)
     {
     if (!response.MoveNextAsync().Result) { }
-    this.currentBuffer = response.Current.ArrowSchema.SerializedSchema.Memory;

+    if (response.Current != null)
A NullReferenceException occurs if the query returns no results, because response.Current is null. This change uses a "HasRows" indicator to decide how the stream behaves.
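A minimal sketch of the guard described in this comment, assuming nothing about the real ReadRowsStream beyond what the diff shows. `FirstItemProbe<T>` and `FirstItemProbeDemo` are hypothetical names introduced only for illustration; the point is that `Current` is read only after confirming an element exists, and the outcome is recorded in a `HasRows` flag.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for ReadRowsStream: probes the first element of an
// async sequence and records whether anything was produced.
public sealed class FirstItemProbe<T> where T : class
{
    public bool HasRows { get; }
    public T? First { get; }

    public FirstItemProbe(IAsyncEnumerator<T> source)
    {
        // Block on the first MoveNextAsync, since a constructor cannot await.
        bool moved = source.MoveNextAsync().AsTask().GetAwaiter().GetResult();

        // Only touch Current when an element actually exists.
        if (moved && source.Current != null)
        {
            First = source.Current;
            HasRows = true;
        }
    }
}

// Hypothetical sources used to exercise both paths.
public static class FirstItemProbeDemo
{
    public static async IAsyncEnumerable<string> Empty()
    {
        await Task.CompletedTask;
        yield break;
    }

    public static async IAsyncEnumerable<string> One()
    {
        await Task.CompletedTask;
        yield return "schema";
    }
}
```

With an empty source, `HasRows` stays false and `First` stays null, so later reads can return "no data" instead of dereferencing a null `Current`.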
@@ -208,7 +226,7 @@ private IArrowType GetType(TableFieldSchema field, IArrowType type)
     return type;
     }

-    static IArrowReader ReadChunk(BigQueryReadClient readClient, string streamName)
+    static IArrowReader? ReadChunk(BigQueryReadClient readClient, string streamName)
This changes the internal contract slightly: ReadChunk can now return null if the stream doesn't have any rows.
@@ -112,7 +112,13 @@ public override QueryResult ExecuteQuery()
     ReadSession rrs = readClient.CreateReadSession("projects/" + results.TableReference.ProjectId, rs, maxStreamCount);

     long totalRows = results.TotalRows == null ? -1L : (long)results.TotalRows.Value;
-    IArrowArrayStream stream = new MultiArrowReader(TranslateSchema(results.Schema), rrs.Streams.Select(s => ReadChunk(readClient, s.Name)));

+    var readers = rrs.Streams
To make up for the internal contract change that lets ReadChunk return null, only valid (non-null) readers are passed to the MultiArrowReader. If the resulting set is empty, no errors occur.
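The filtering step can be sketched in plain LINQ. This is an illustration under stated assumptions, not the code in the PR: `ReaderFilter.OpenNonEmpty` is a hypothetical helper, and its `open` parameter stands in for ReadChunk, which may return null for a stream with no rows.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReaderFilter
{
    // Hypothetical helper: `open` plays the role of ReadChunk and may return
    // null when a stream has no rows. Null results are dropped before the
    // sequence is handed to a consumer.
    public static IEnumerable<TReader> OpenNonEmpty<TReader>(
        IEnumerable<string> streamNames,
        Func<string, TReader?> open) where TReader : class
    {
        return streamNames
            .Select(open)                      // may yield null entries
            .Where(reader => reader != null)   // keep only real readers
            .Select(reader => reader!);        // narrow TReader? to TReader
    }
}
```

The sequence stays lazy, so each stream is opened only when the consumer reaches it. If every stream is empty, the consumer simply receives an empty sequence.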
Thanks for the change! I've left a few comments and questions to consider, but nothing I'd think of as seriously blocking.
In hindsight, I think this is arguably the wrong approach to dealing with "nonstandard" values, whether they're structured or decimal. It would have been better to keep all conversions in Arrow "vector" space instead of dealing with them one-by-one in a "get scalar" function. That way, if I'm a consumer who wants to deal with the results as an array and doesn't want to handle values one at a time, I can say "convert this struct array into a string array"; the result is then just a regular Arrow string vector and I can keep going in vector space. For full generality, this might require a change to the C# Arrow implementation to support a common interface between C# arrays and Arrow arrays, but that's probably worth doing or at least thinking about.
(And we can obviously move in those directions over time.)
@@ -76,7 +83,9 @@ public static class IArrowArrayExtensions
     case ArrowTypeId.Int64:
     return ((Int64Array)arrowArray).GetValue(index);
     case ArrowTypeId.String:
-    return ((StringArray)arrowArray).GetString(index);
+    StringArray sArray = (StringArray)arrowArray;
+    if (sArray.Length == 0) { return null; }
How can we get here? Why is this not an error, and why does it impact only StringArray and not other arrays?
csharp/src/Apache.Arrow.Adbc/Extensions/IArrowArrayExtensions.cs (outdated, resolved thread)
-    Assert.True(ctv.ExpectedValue.Equals(value), Utils.FormatMessage($"Expected value [{ctv.ExpectedValue}] does not match actual value [{value}] for {ctv.Name} for query [{query}]", environmentName));
+    bool areEqual = false;

+    if (value is ExpandoObject)
Is this missing a case for when the result is an array?
No. There is a check above, if (type.BaseType?.Name.Contains("PrimitiveArray") == false), that puts us in a different path for comparing arrays.
-    return (array, index) => ((StringArray)array).GetString(index);
+    return (array, index) =>
+    {
+    StringArray? sArray = array as StringArray;
This gives up some of the performance benefit of the approach. Instead of having to add a check to each invocation, can we have the caller tell us in advance what the expected source type is and return one of two different delegates?
I couldn't figure out an elegant way to have two separate delegates, so I went with one that checks the DataType of the array that's passed in.
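For reference, one way the reviewer's suggestion could look is sketched below. It is not the PR's code and uses no Arrow types: `GetterFactory` and `SourceKind` are hypothetical, plain .NET arrays stand in for StringArray and StructArray, and System.Text.Json stands in for whatever serializer produces the JsonString. The idea is that the caller declares the source type once, and the returned delegate carries no per-call type check.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class GetterFactory
{
    // Hypothetical source kinds; real code would inspect the Arrow data type once.
    public enum SourceKind { Text, Struct }

    // Selects the delegate up front, so each invocation is a direct cast
    // with no branching on the runtime type of the array.
    public static Func<object, int, string?> Create(SourceKind kind)
    {
        switch (kind)
        {
            case SourceKind.Text:
                // Stand-in for reading a value from a StringArray.
                return (array, index) => ((string?[])array)[index];
            case SourceKind.Struct:
                // Stand-in for serializing one struct row to a JSON string.
                return (array, index) =>
                    JsonSerializer.Serialize(((Dictionary<string, object?>[])array)[index]);
            default:
                throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }
}
```

The trade-off is that the caller must know the source type before requesting the getter, which is the part the author found awkward to arrange.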