tableFromIPC incorrectly types columns in schema when there are duplicate column names #288

@danielstreit

Description


Describe the bug, including details regarding any error messages, version, and platform.

Version: [email protected] (also reproduced in 17.0.0)
I have an Arrow file that contains two columns with the same name but different types: in this example, both columns are named "id", but one is an Int64 and the other a Utf8 string.
When I load this file with tableFromIPC, the table schema reports both columns as strings.
Simple base64 arrow file used in this example:

/////7ABAAAQAAAAAAAKAA4ABgANAAgACgAAAAAABAAQAAAAAAEKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAIAAADAAAAABAAAAFr///8UAAAAiAAAAIwAAAAAAAAFiAAAAAIAAABAAAAABAAAABT///8IAAAAFAAAAAgAAAAic3RyaW5nIgAAAAAXAAAAU3Bhcms6RGF0YVR5cGU6SnNvblR5cGUATP///wgAAAAQAAAABgAAAFNUUklORwAAFgAAAFNwYXJrOkRhdGFUeXBlOlNxbE5hbWUAAAAAAAAEAAQABAAAAAIAAABpZAAAAAASABgAFAAAABMADAAAAAgABAASAAAAFAAAAIwAAACUAAAAAAAAApgAAAACAAAARAAAAAQAAADM////CAAAABAAAAAGAAAAImxvbmciAAAXAAAAU3Bhcms6RGF0YVR5cGU6SnNvblR5cGUACAAMAAgABAAIAAAACAAAABAAAAAGAAAAQklHSU5UAAAWAAAAU3Bhcms6RGF0YVR5cGU6U3FsTmFtZQAAAAAAAAgADAAIAAcACAAAAAAAAAFAAAAAAgAAAGlkAAD/////yAAAABQAAAAAAAAADAAWAAYABQAIAAwADAAAAAADBAAYAAAAKAAAAAAAAAAAAAoAGAAMAAQACAAKAAAAbAAAABAAAAABAAAAAAAAAAAAAAAFAAAAAAAAAAAAAAABAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAQAAAAAAAAAAEAAAAAAAAAGAAAAAAAAAAIAAAAAAAAACAAAAAAAAAAAgAAAAAAAAAAAAAAAgAAAAEAAAAAAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAIAAAAxMAAAAAAAAP////8AAAAA

Below are the details of the Table object returned by tableFromIPC. Notice that the schema marks both id columns as Utf8, while the per-column details show their actual, different types.

=== Arrow Table Information ===
Rows: 1
Columns: 2
Schema: Schema<{ 0: id: Utf8, 1: id: Utf8 }>
=== Table Contents ===
[
  {"id": 0, "id": "10"}
]
=== Column Details ===
Column 0: undefined (Int64)
Column 1: undefined (Utf8)
Code to generate the above result:

import { tableFromIPC } from 'apache-arrow';

// Decode the base64 string above into the raw Arrow IPC bytes
const base64ToUint8Array = (s) =>
  Uint8Array.from(atob(s), (c) => c.charCodeAt(0));

const arrowBuffer = base64ToUint8Array(base64String);
// Parse the Arrow IPC data into a table
const table = tableFromIPC(arrowBuffer);
console.log('\n=== Arrow Table Information ===');
console.log(`Rows: ${table.numRows}`);
console.log(`Columns: ${table.numCols}`);
console.log(`Schema: ${table.schema}`);
console.log('\n=== Table Contents ===');
console.log(table.toString());
console.log('\n=== Column Details ===');
for (let i = 0; i < table.numCols; i++) {
  const column = table.getChildAt(i);
  console.log(`Column ${i}: ${column.name} (${column.type})`);
}

I have tested this same arrow file with pyarrow, and it shows the expected result:

Schema:
id: int64 not null
  -- field metadata --
  Spark:DataType:SqlName: 'BIGINT'
  Spark:DataType:JsonType: '"long"'
id: string not null
  -- field metadata --
  Spark:DataType:SqlName: 'STRING'
  Spark:DataType:JsonType: '"string"'
Batch 0:
pyarrow.RecordBatch
id: int64 not null
id: string not null
----
id: [0]
id: ["10"]
