Issues reading created files with repeated fields or LIST #151

@johanfunnel

Description

We're using parquet-cli (brew install parquet-cli) to read files that were created with this lib, but we're running into issues: either errors or empty values for fields with repeated: true and/or type: 'LIST'. Reading the same files using ParquetReader.openFile from this lib works fine, though!

Steps to reproduce

Example 1 - repeated: true

Using the following schema and code, based on this README example

const schema = new ParquetSchema({
  id: { type: 'UTF8' },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    },
  },
});

const writer = await ParquetWriter.openFile(
  schema,
  'repeated-example.parquet'
);

await writer.appendRow({
  id: 'Row1',
  stock: [
    { price: 100, quantity: 10 },
    { price: 200, quantity: 20 },
  ],
});

// Close the writer so the file footer is written to disk.
await writer.close();

Example 2 - type: 'LIST'

Using the following schema and code, based on the tests for array list

const schema = new ParquetSchema({
  id: { type: 'UTF8' },
  test: {
    type: 'LIST',
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            type: 'UTF8',
          },
        },
      },
    },
  },
});

const writer = await ParquetWriter.openFile(schema, 'list-example.parquet');

await writer.appendRow({
  id: 'Row1',
  test: { list: [{ element: 'abcdef' }, { element: 'fedcba' }] },
});

// Close the writer so the file footer is written to disk.
await writer.close();

1. Generate the files using the examples above.
2. Read these files with parquet-cli using parquet cat <path-to-file>.

Expected behaviour

Example 1
Being able to read the file without errors.

Example 2
The result having { list: [ { element: 'abcdef' }, { element: 'fedcba' } ] } in the test field, like when reading the file using ParquetReader.openFile.

Actual behaviour

Example 1
An error is thrown; see Error logs below.

Example 2
Getting the result {"id": "Row1", "test": null}

Error logs

From Example 1

Unknown error
java.lang.RuntimeException: Failed on record 0 in <omitted>/output-basic.parquet
	at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:75)
	at org.apache.parquet.cli.Main.run(Main.java:163)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.parquet.cli.Main.main(Main.java:191)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:<omitted>/output-basic.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
	at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:70)
	... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required group stock (LIST) {
  repeated group array {
    required double price;
    required int64 quantity;
  }
} != repeated group stock {
  required double price;
  required int64 quantity;
}
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:104)
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
	at org.apache.parquet.schema.MessageType.accept(MessageType.java:52)
	at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:167)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:155)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
	... 9 more
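
To make the mismatch in the last "Caused by" easier to see, here is a minimal sketch that prints the two shapes side by side. It only restates what the error message reports: the reader requests a LIST-annotated group wrapping a repeated group named array, while the file contains a bare repeated group. renderGroup and the plain-object schema description are hypothetical helpers written for this illustration; they are not part of this lib or of parquet-cli.

```javascript
// Hypothetical helper (not a library API): renders a small schema description
// as Parquet message-type text so the two shapes can be compared.
function renderGroup(node, indent = '') {
  const annotation = node.annotation ? ` (${node.annotation})` : '';
  if (!node.fields) {
    // Leaf column: "<repetition> <type> <name>;"
    return `${indent}${node.repetition} ${node.type} ${node.name};`;
  }
  const children = node.fields.map((f) => renderGroup(f, indent + '  '));
  return [
    `${indent}${node.repetition} group ${node.name}${annotation} {`,
    ...children,
    `${indent}}`,
  ].join('\n');
}

// The two leaf columns shared by both shapes, as shown in the error message.
const leaves = [
  { repetition: 'required', type: 'double', name: 'price' },
  { repetition: 'required', type: 'int64', name: 'quantity' },
];

// Shape found in the file (right-hand side of "!=" in the error).
const fileShape = { repetition: 'repeated', name: 'stock', fields: leaves };

// Shape requested by the reader (left-hand side of "!=" in the error).
const requestedShape = {
  repetition: 'required',
  name: 'stock',
  annotation: 'LIST',
  fields: [{ repetition: 'repeated', name: 'array', fields: leaves }],
};

console.log('requested:\n' + renderGroup(requestedShape));
console.log('file:\n' + renderGroup(fileShape));
```

The two printed schemas differ only in the extra LIST wrapper and the inner array group, which is exactly the difference that the incompatible-types check reports.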
