Core: Interface based DataFile reader and writer API #12298

Open

pvary wants to merge 7 commits into main

Contributor

@pvary pvary commented Feb 17, 2025

Instead of base classes, just use interfaces

public Key(FileFormat fileFormat, String dataType, String builderType) {
this.fileFormat = fileFormat;
this.dataType = dataType;
this.builderType = builderType;
Member

I was thinking of defining the default one using a priority (int) based approach and let the one with the highest priority be the default one. WDYT?

Contributor Author

@pvary pvary Feb 19, 2025

We have a concrete example for this: Comet vectorized parquet reader spark.sql.iceberg.parquet.reader-type

I think it is good if the reader/writer choice is a conscious decision, and not happening based on some behind the scenes algorithm.

Contributor

+1 for simplicity. This code should not determine things like whether Comet is used. This should have a single purpose, which is to standardize how object models plug in.

Contributor Author

Moved the config to properties, and the builder method will create the different readers based on this config
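
A minimal sketch of the explicit, property-driven choice discussed in this thread. Only the property name `spark.sql.iceberg.parquet.reader-type` comes from the comments above; the `ReaderTypeSelection` class, the `ReaderType` values, and the default are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the reader type is an explicit configuration choice,
// not the result of a priority-based lookup.
final class ReaderTypeSelection {
  // Property name quoted in the thread; the allowed values below are assumptions.
  static final String READER_TYPE = "spark.sql.iceberg.parquet.reader-type";

  enum ReaderType {
    ICEBERG,
    COMET
  }

  private ReaderTypeSelection() {}

  // Returns the configured reader type, or the assumed default when the property is absent.
  static ReaderType fromProperties(Map<String, String> properties) {
    String value = properties.get(READER_TYPE);
    if (value == null) {
      return ReaderType.ICEBERG;
    }
    // Unknown values fail loudly with IllegalArgumentException instead of being guessed.
    return ReaderType.valueOf(value.toUpperCase(Locale.ROOT));
  }
}
```

With this shape the builder method only has to branch on the returned value, and a misspelled setting fails instead of silently selecting another reader.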

import org.apache.iceberg.io.FileAppender;

/** Builder API for creating {@link FileAppender}s. */
public interface AppenderBuilder extends InternalData.WriteBuilder {
Member

I wonder why AppenderBuilder has a base interface but the other builders don't.
I guess it might help to have a common DataFileIoBuilder interface defining the common builder attributes (table, schema, properties, meta). It's a bit of an "adventure in Java generics", but doable.

Contributor Author

@pvary pvary Feb 19, 2025

If you take a look at the other PRs (#12164, #12069), you can see that first, I took that adventurous route, but the result was too many classes/interfaces and casts.

This PR is aiming for the minimal set of changes, and the InternalData.WriteBuilder is already introduced to Iceberg by #12060. We either need to widen that interface or inherit from it here.

Contributor

I'm also confused by this inheritance. We're extending but overriding everything and it's not clear to me what we really gain by going with this approach. It looks like it ends up as a completely different builder that produces the same build result.

Contributor Author

The goal with the PR was to show the minimal changes required to make the idea work.
We either create a different builder class for the InternalData.WriteBuilder and the DataFile.WriteBuilder, or we need to have inheritance of the interfaces.

Based on our discussion below we might end up using a different strategy, so we should revisit this comment later.

Comment on lines 91 to 97
    return DataFileServiceRegistry.read(
            task.file().format(), Record.class.getName(), input, fileProjection, partition)
        .split(task.start(), task.length())
        .caseSensitive(caseSensitive)
        .reuseContainers(reuseContainers)
        .filter(task.residual())
        .build();
Member

I like these simplifications!

@pvary pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from c528a52 to 9975b4f Compare February 20, 2025 09:45
@pvary pvary changed the title WIP: Interface based FileFormat API WIP: Interface based DataFile reader and writer API Feb 20, 2025
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @pvary for this proposal, I left some comments.


/** Enables reusing the containers returned by the reader. Decreases pressure on GC. */
@Override
default ReaderBuilder reuseContainers() {
Contributor

Seems it should not be here? These are parquet reader specific.

Contributor Author

It is also used by Avro.
See:

this.reuseContainers = reuseContainers;

* @param rowType of the native input data
* @return {@link DataWriterBuilder} for building the actual writer
*/
public static <S> DataWriterBuilder dataWriterBuilder(
Contributor

I don't quite understand in which case we need this. I think append would be enough?

Contributor Author

I will check this. We might be able to remove this.

Contributor Author

Based on the current approach, the file format api implementation creates the appender, and the PR creates the writers for the different data/delete files

* @return {@link AppenderBuilder} for building the actual writer
*/
public static <S, B extends EqualityDeleteWriterBuilder<B>>
EqualityDeleteWriterBuilder<B> equalityDeleteWriterBuilder(
Contributor

I don't think the file format should consider equality deletion/pos deletion here.

Contributor Author

Current Avro positional delete writer behaves differently than Parquet/ORC positional delete writers.
In case of the positional delete files the schema provided to the Avro writer should omit the PATH and the POS fields, and only needs the actual table schema. The writer handles the PATH/POS fields by static code:

    public void write(PositionDelete<D> delete, Encoder out) throws IOException {
      PATH_WRITER.write(delete.path(), out);
      POS_WRITER.write(delete.pos(), out);
      rowWriter.write(delete.row(), out);
    }

The Parquet/ORC positional delete writers behave in the same way. They expect the same input.

If we are ready for a more invasive change we can harmonize the writers.
I have aimed for a minimal changeset to allow easier acceptance for the PR.

Contributor Author

The appender doesn't need to know about these, but the file formats and the writer implementations need this

* issues.
*/
private static final class Registry {
private static final Map<Key, ReaderService> READ_BUILDERS = Maps.newConcurrentMap();
Contributor

This is more like a convention problem. I think maybe we just need to store a FileFormatService in the registry?

Contributor Author

Sometimes we don't have writers (Arrow), or we have multiple readers (vectorized/non-vectorized). Also, Parquet has the Comet reader. So I kept the writers and the readers separate.

/** Key used to identify readers and writers in the {@link DataFileServiceRegistry}. */
public static class Key {
private final FileFormat fileFormat;
private final String dataType;
Contributor

Is this for things like Arrow or InternalRow?

Contributor Author

Yeah,
Currently we have:

  • Record - generic readers/writers
  • ColumnarBatch (arrow) - arrow
  • RowData - Flink
  • InternalRow - Spark
  • ColumnarBatch (spark) - Spark batch
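
To make the key discussion concrete, here is a minimal sketch of a registry keyed by file format and object model name, with readers and writers stored separately. All names (`ObjectModelRegistry`, `Format`, the `Supplier<String>` stand-ins for real builder factories) are hypothetical and are not the PR's actual classes.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a registry keyed by (file format, object model name).
final class ObjectModelRegistry {
  enum Format {
    AVRO,
    PARQUET,
    ORC
  }

  // Value-based key, so a lookup with an equal format/data type pair finds the same entry.
  static final class Key {
    private final Format format;
    private final String dataType;

    Key(Format format, String dataType) {
      this.format = Objects.requireNonNull(format, "format");
      this.dataType = Objects.requireNonNull(dataType, "dataType");
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof Key)) {
        return false;
      }
      Key that = (Key) other;
      return format == that.format && dataType.equals(that.dataType);
    }

    @Override
    public int hashCode() {
      return Objects.hash(format, dataType);
    }
  }

  // Separate maps: an object model may have a reader without a writer, or the reverse.
  private final Map<Key, Supplier<String>> readers = new ConcurrentHashMap<>();
  private final Map<Key, Supplier<String>> writers = new ConcurrentHashMap<>();

  void registerReader(Format format, String dataType, Supplier<String> reader) {
    readers.put(new Key(format, dataType), reader);
  }

  void registerWriter(Format format, String dataType, Supplier<String> writer) {
    writers.put(new Key(format, dataType), writer);
  }

  Optional<Supplier<String>> reader(Format format, String dataType) {
    return Optional.ofNullable(readers.get(new Key(format, dataType)));
  }

  Optional<Supplier<String>> writer(Format format, String dataType) {
    return Optional.ofNullable(writers.get(new Key(format, dataType)));
  }
}
```

Keeping `Key` private to the registry methods, as here, is one way to avoid exposing it to callers.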

Contributor Author

pvary commented Feb 21, 2025

I will start to collect the differences here between the different writer types (appender/dataWriter/equalityDeleteWriter/positionalDeleteWriter) for reference:

  • Writer context is different between delete and data files. This contains TableProperties/Configurations which could be different between delete and data files. For example for parquet: RowGroupSize/PageSize/PageRowLimit/DictSize/Compression etc. For ORC and Avro we have some similar changing configs
  • Specific writer functions for position deletes to write out the PositionDelete records
  • Positional delete PathTransformFunction to convert writer data type for the path to file format data type
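
The first difference can be sketched as follows: the same writer setting is resolved from different table properties depending on whether a data file or a delete file is written. This is a hedged illustration only; the `WriterContextSketch` class, the property strings, the default value, and the fallback rule are assumptions, not the PR's implementation.

```java
import java.util.Map;

// Hypothetical sketch: one writer setting, resolved differently for data and delete files.
final class WriterContextSketch {
  // Assumed property names and default, for illustration only.
  static final String ROW_GROUP_SIZE = "write.parquet.row-group-size-bytes";
  static final String DELETE_ROW_GROUP_SIZE = "write.delete.parquet.row-group-size-bytes";
  static final long ROW_GROUP_SIZE_DEFAULT = 128L * 1024 * 1024;

  enum Content {
    DATA,
    DELETES
  }

  private WriterContextSketch() {}

  static long rowGroupSize(Map<String, String> tableProperties, Content content) {
    long dataSize = parse(tableProperties.get(ROW_GROUP_SIZE), ROW_GROUP_SIZE_DEFAULT);
    if (content == Content.DATA) {
      return dataSize;
    }
    // Assumed rule: delete files fall back to the data file setting when no
    // delete-specific value is configured.
    return parse(tableProperties.get(DELETE_ROW_GROUP_SIZE), dataSize);
  }

  private static long parse(String value, long defaultValue) {
    return value == null ? defaultValue : Long.parseLong(value);
  }
}
```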

import org.apache.iceberg.io.DataWriter;

/** Builder API for creating {@link DataWriter}s. */
public interface DataWriterBuilder {
Contributor

Why not put the builder interface into the DataWriter class and put it in the same package? It seems odd to me that we're introducing this new datafile package.

Contributor Author

The classes which have to be implemented by the file formats are kept in the io package, but I moved the others to the data package

Contributor

rdblue commented Feb 22, 2025

While I think the goal here is a good one, the implementation looks too complex to be workable in its current form.

The primary issue that we currently have is adapting object models (like Iceberg's internal StructLike, Spark's InternalRow, or Flink's RowData) to file formats so that you can separately write object model to format glue code and have it work throughout support for an engine. I think a diff from the InternalData PR demonstrates it pretty well:

-    switch (format) {
-      case AVRO:
-        AvroIterable<ManifestEntry<F>> reader =
-            Avro.read(file)
-                .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
-                .createResolvingReader(this::newReader)
-                .reuseContainers()
-                .build();
+    CloseableIterable<ManifestEntry<F>> reader =
+        InternalData.read(format, file)
+            .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
+            .reuseContainers()
+            .build();
 
-        addCloseable(reader);
+    addCloseable(reader);
 
-        return CloseableIterable.transform(reader, inheritableMetadata::apply);
+    return CloseableIterable.transform(reader, inheritableMetadata::apply);
-
-      default:
-        throw new UnsupportedOperationException("Invalid format for manifest file: " + format);
-    }

This shows:

  • Rather than a switch, the format is passed to create the builder
  • There is no longer a callback passed to create readers for the object model (createResolvingReader)

In this PR, there are a lot of other changes as well. I'm looking at one of the simpler Spark cases in the row reader.

The builder is initialized from DataFileServiceRegistry and now requires a format, class name, file, projection, and constant map:

    return DataFileServiceRegistry.readerBuilder(
            format, InternalRow.class.getName(), file, projection, idToConstant)

There are also new static classes in the file. Each creates a new service and each service creates the builder and object model:

  public static class AvroReaderService implements DataFileServiceRegistry.ReaderService {
    @Override
    public DataFileServiceRegistry.Key key() {
      return new DataFileServiceRegistry.Key(FileFormat.AVRO, InternalRow.class.getName());
    }

    @Override
    public ReaderBuilder builder(
        InputFile inputFile,
        Schema readSchema,
        Map<Integer, ?> idToConstant,
        DeleteFilter<?> deleteFilter) {
      return Avro.read(inputFile)
          .project(readSchema)
          .createResolvingReader(schema -> SparkPlannedAvroReader.create(schema, idToConstant));
    }

The createResolvingReader line is still there, just moved into its own service class instead of in branches of a switch statement.

In addition, there are now a lot more abstractions:

  • A builder for creating an appender for a file format
  • A builder for creating a data file writer for a file format
  • A builder for creating an equality delete writer for a file format
  • A builder for creating a position delete writer for a file format
  • A builder for creating a reader for a file format
  • A "service" registry (what is a service?)
  • A "key"
  • A writer service
  • A reader service

I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that:

  • Focus on removing boilerplate and hiding the internals. For instance, Key, if needed, should be an internal abstraction and not complexity that is exposed to callers
  • The format-specific data and delete file builders typically wrap an appender builder. Is there a way to handle just the reader builder and appender builder?
  • Is the extra "service" abstraction helpful?
  • Remove ServiceLoader and use a simpler solution. I think that formats could simply register themselves like we do for InternalData. I think it would be fine to have a trade-off that Iceberg ships with a list of known formats that can be loaded, and if you want to replace that list it's at your own risk.
  • Standardize more across the builders for FileFormat. How idToConstant is handled is a good example. That should be passed to the builder instead of making the whole API more complicated. Projection is the same.
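
A rough sketch of two of these suggestions combined: formats register themselves in a plain static registry (no ServiceLoader), and the constant map is passed through the builder rather than through the factory method. Everything here (`FormatReaders`, `ReadBuilder`, the string returned by `build()`) is a hypothetical stand-in, not Iceberg's API.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: self-registration plus standardized builder options.
final class FormatReaders {
  enum Format {
    AVRO,
    PARQUET,
    ORC
  }

  // Constants are an optional builder setting, not an argument of the factory method.
  static final class ReadBuilder {
    private final String formatName;
    private final String file;
    private Map<Integer, Object> constants = Collections.emptyMap();

    ReadBuilder(String formatName, String file) {
      this.formatName = formatName;
      this.file = file;
    }

    ReadBuilder constants(Map<Integer, ?> idToConstant) {
      this.constants = new HashMap<>(idToConstant);
      return this;
    }

    // Returns a description instead of a real reader to keep the sketch small.
    String build() {
      return formatName + " reader for " + file + " with " + constants.size() + " constant(s)";
    }
  }

  private static final Map<Format, Function<String, ReadBuilder>> REGISTRY =
      new EnumMap<>(Format.class);

  private FormatReaders() {}

  // Called by each format to register itself; a later registration replaces an earlier one.
  static synchronized void register(Format format, Function<String, ReadBuilder> factory) {
    REGISTRY.put(format, factory);
  }

  // The format is passed in to create the builder; there is no switch at the call site.
  static synchronized ReadBuilder read(Format format, String file) {
    Function<String, ReadBuilder> factory = REGISTRY.get(format);
    if (factory == null) {
      throw new UnsupportedOperationException("No reader registered for format: " + format);
    }
    return factory.apply(file);
  }
}
```

A caller would then write something like `FormatReaders.read(format, file).constants(idToConstant).build()`, with the unsupported-format case handled once inside the registry.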

Contributor Author

pvary commented Feb 24, 2025

While I think the goal here is a good one, the implementation looks too complex to be workable in its current form.

I'm happy that we agree on the goals. I created a PR to start the conversation. If there are willing reviewers we can introduce more invasive changes to achieve a better API. I'm all for it!

The primary issue that we currently have is adapting object models (like Iceberg's internal StructLike, Spark's InternalRow, or Flink's RowData) to file formats so that you can separately write object model to format glue code and have it work throughout support for an engine.

I think we need to keep these direct transformations to prevent the performance loss which would be caused by multiple transformations between object model -> common model -> file format.

We have a matrix of transformations which we need to encode somewhere:

Source    Target
Parquet   StructLike
Parquet   InternalRow
Parquet   RowData
Parquet   Arrow
Avro      ...
ORC       ...
[..]

  • Rather than a switch, the format is passed to create the builder
  • There is no longer a callback passed to create readers for the object model (createResolvingReader)

The InternalData reader has one advantage over the data file readers/writers. The internal object model is static for these readers/writers. For the DataFile readers/writers we have multiple object models to handle.

[..]
I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that:

  • Focus on removing boilerplate and hiding the internals. For instance, Key, if needed, should be an internal abstraction and not complexity that is exposed to callers

If we allow adding new builders for the file formats we can remove a good chunk of the boilerplate code. Let me see what this would look like.

  • The format-specific data and delete file builders typically wrap an appender builder. Is there a way to handle just the reader builder and appender builder?

We need to refactor the Avro positional delete writer for this, or add a positionalWriterFunc. We also need to consider the format-specific configurations, which are different for the appenders and the delete files (DELETE_PARQUET_ROW_GROUP_SIZE_BYTES vs. PARQUET_ROW_GROUP_SIZE_BYTES).

  • Is the extra "service" abstraction helpful?

If we are ok with having a new Builder for the readers/writers, then we don't need the service. It was needed to keep the current APIs and the new APIs compatible.

  • Remove ServiceLoader and use a simpler solution. I think that formats could simply register themselves like we do for InternalData. I think it would be fine to have a trade-off that Iceberg ships with a list of known formats that can be loaded, and if you want to replace that list it's at your own risk.

Will do

  • Standardize more across the builders for FileFormat. How idToConstant is handled is a good example. That should be passed to the builder instead of making the whole API more complicated. Projection is the same.

Will see what could be achieved

@pvary pvary force-pushed the file_Format_api_without_base branch 5 times, most recently from c488d32 to 71ec538 Compare February 25, 2025 16:53
outputFile ->
Avro.write(outputFile)
.writerFunction(
(schema, engineSchema) -> new SparkAvroWriter((StructType) engineSchema))
Contributor

Why is there a cast here if the builder is parameterized by this type?

Contributor Author

I had to make Avro.WriteBuilder parameterized. I created a deprecation and new classes to make this work, which added serious size to the PR.
Please check, and if you think it is worth it, then we can keep it.

ORC.write(outputFile)
.writerFunction(
(schema, messageType, engineSchema) -> new SparkOrcWriter(schema, messageType))
.pathTransformFunc(path -> UTF8String.fromString(path.toString())));
Contributor

What is this?

Contributor Author

The delete writer needs this to convert the path to strings in the delete records.
All of these are file format specific, so I think this should be fine here as this is not exposed to the users
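
As an illustration of what such a path transform does, here is a minimal sketch of a position delete writer that converts the incoming path into the engine's own string type before recording it. The class and method names are hypothetical, and an in-memory list stands in for the real file writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: the writer converts the file path into the engine's
// string representation (type P) before storing the delete record.
final class PositionDeleteSketch<P> {
  // Simplified delete record: the converted path plus the row position.
  static final class Entry<T> {
    final T path;
    final long pos;

    Entry(T path, long pos) {
      this.path = path;
      this.pos = pos;
    }
  }

  private final Function<CharSequence, P> pathTransform;
  private final List<Entry<P>> written = new ArrayList<>();

  PositionDeleteSketch(Function<CharSequence, P> pathTransform) {
    this.pathTransform = pathTransform;
  }

  // Applies the transform once per record, then "writes" the entry.
  void delete(CharSequence path, long pos) {
    written.add(new Entry<>(pathTransform.apply(path), pos));
  }

  List<Entry<P>> written() {
    return written;
  }
}
```

In the Spark snippet above the transform is `path -> UTF8String.fromString(path.toString())`; an engine that already works with `String` could pass `CharSequence::toString`.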

try {
return DataFileServiceRegistry.writeBuilder(deleteFileFormat, inputType, file)
.set(properties)
.set(writeProperties)
Contributor

This doesn't seem right. Why are table and write properties both being set?

Contributor Author

@pvary pvary Mar 18, 2025

This is how it is done currently. Just the set calls are in different methods (SparkFileWriterFactory for the writeProperties and for the actual table properties)

* @param properties a map of writer config properties
* @return this for method chaining
*/
public WriteBuilder<A, E> set(Map<String, String> properties) {
Contributor

Is this set in other builders or could we use setAll?

Contributor Author

This was a hard decision.
We have meta(String, String), meta(Map), and set(String, String) in InternalData. We have set(String, String) and setAll(Map) in the current writers. I opted for consistency in the new API, and went for set. Since it is a new API, I think it is better to be consistent.
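
A small sketch of the naming choice, assuming a hypothetical `PropertyBuilderSketch`: both the single-entry and the map variant are called `set`, and later calls override earlier ones, which is what makes setting table properties first and write properties second behave predictably.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one method name, two overloads, last write wins.
final class PropertyBuilderSketch {
  private final Map<String, String> config = new LinkedHashMap<>();

  // Sets a single configuration value.
  PropertyBuilderSketch set(String key, String value) {
    config.put(key, value);
    return this;
  }

  // Sets all values from the map; existing keys are overwritten.
  PropertyBuilderSketch set(Map<String, String> properties) {
    config.putAll(properties);
    return this;
  }

  // Returns a copy so callers cannot mutate the builder's state.
  Map<String, String> build() {
    return new LinkedHashMap<>(config);
  }
}
```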


/** Sets the metrics configuration used for collecting column metrics for the created file. */
public WriteBuilder<A, E> metricsConfig(MetricsConfig newMetricsConfig) {
appenderBuilder.metricsConfig(newMetricsConfig);
Contributor

Is this ever customized? Can we set it automatically from table properties?

Contributor Author

The current GenericAppenderFactory customizes it for backward compatibility reasons.
I'm not sure requiring a full table object as a parameter is better anyways.. (I just removed based on your suggestion 😄)

}

/** Sets the equality field ids for the equality delete writer. */
public WriteBuilder<A, E> withEqualityFieldIds(List<Integer> fieldIds) {
Contributor

What is the relationship between this builder and the builder for data or delete files? Do we need to pass this here?

I think the latest changes may have gone too far in combining interfaces. I think the object models should handle appender builders and those builders should be used by data file builders and delete file builders that are shared. Combining the delete and data file builders creates confusing options, like this for a data file builder.

Contributor Author

Split up the Writer API interfaces (kept the single writer implementation for code simplicity reasons)

@pvary pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from 132d19b to 050b70a Compare March 18, 2025 14:05
Member

@snazy snazy left a comment

Just left some comments, nothing that's really blocking

@@ -287,14 +405,17 @@ CodecFactory codec() {
}
}

@Deprecated
Member

Are these just deprecated or deprecated-for-removal?

Contributor Author

Definitely deprecated for removal, but not sure about the timing of the PR. Also they are used pretty extensively in tests, which we need to check later.

Based on @danielcweeks comments on the community sync we can remove after 1 minor release, so when we are near to finalizing the PR, I will add the javadoc comments based on this.

* @param <B> type of the builder
* @param <E> engine specific schema of the input records used for appender initialization
*/
interface WriterBuilderBase<B extends WriterBuilderBase<B, E>, E> {
Member

Wonder if this interface should be public, or its functions explicitly overridden in the public interfaces that extend this one - to avoid potential visibility issues.

Contributor Author

I prefer to make interfaces public only if they could be directly called.
The interfaces are created only to remove code duplication. I'm actually considering removing them, and directly copying the methods to the public interfaces, but I don't have a strong preference here.
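
The pattern under discussion, a self-bounded base interface that only shares method declarations, can be sketched as below. The names are hypothetical and the interfaces are simplified; this is not the PR's `WriterBuilderBase`.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical base interface: exists only to avoid repeating method declarations.
// B is the concrete builder type, so chained calls keep returning that type.
interface WriterBuilderBaseSketch<B extends WriterBuilderBaseSketch<B, E>, E> {
  B set(String key, String value);

  B engineSchema(E schema);
}

// Stands in for a caller-facing interface; it fixes the self type B to itself.
interface DataWriterBuilderSketch<E>
    extends WriterBuilderBaseSketch<DataWriterBuilderSketch<E>, E> {
  String build();
}

// Single implementation; build() returns a description instead of a real writer.
final class SimpleDataWriterBuilder<E> implements DataWriterBuilderSketch<E> {
  private final Map<String, String> properties = new TreeMap<>();
  private E schema;

  @Override
  public DataWriterBuilderSketch<E> set(String key, String value) {
    properties.put(key, value);
    return this;
  }

  @Override
  public DataWriterBuilderSketch<E> engineSchema(E newSchema) {
    this.schema = newSchema;
    return this;
  }

  @Override
  public String build() {
    return "writer(schema=" + schema + ", properties=" + properties + ")";
  }
}
```

Copying `set` and `engineSchema` directly into `DataWriterBuilderSketch` would remove the base interface at the cost of duplicating the declarations in every caller-facing interface.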

* @param <E> engine specific schema of the input records used for appender initialization
*/
@SuppressWarnings("unchecked")
class WriteBuilder<B extends WriteBuilder<B, A, E>, A extends DataFileAppenderBuilder<A, E>, E>
Member

Nit: rename to WriteBuilderImpl to distinguish it e.g. from the public ReadBuilder interface.

Contributor Author

This is again an interesting question. We have 2 different apis here:

  1. APIs implemented by the file formats - currently in the core module and in the org.apache.iceberg.io package
  2. APIs used by the data file readers and writers - currently in the data module in the org.apache.iceberg.data package

The ReadBuilder is an exception, because it is implemented by the file format, but directly exposed to the readers. I was considering creating a wrapper around it to decouple the two. I decided against it, because I don't see any direct current benefit from it and it makes it harder to expose the SupportsDeleteFilter API through the 2 interfaces.

@github-actions github-actions bot added the build label Mar 18, 2025
@pvary pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from b460f91 to a549bda Compare March 19, 2025 09:57
Contributor Author

pvary commented Mar 19, 2025

Here are a few important open questions:

  1. We should decide on the expected filtering behavior. Currently the filters are applied as best effort by the file format readers. We might decide on stricter behavior, and require the file formats to apply all filters when provided. I would suggest doing it in another PR even if we choose to change the current state.
  2. Batch sizes are currently parameters which could be set for non-vectorized readers too. We could put the batch size as a reader property, and tell the readers to parse the reader properties when batch read happens. I would prefer the current solution as the expectation for the readers is self documented.
  3. Parquet/Orc configuration. Currently the Spark batch reader uses different configuration objects for Parquet and ORC as requested by @aokolnychyi. @rdblue suggested to use a common configuration instead. I'm still learning the Spark code, so I don't have a strong opinion here
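
To illustrate the first question, here is a hedged sketch of the difference between best-effort and strict filtering. It assumes a simplified model in which the reader uses the filter only to skip whole batches; the class, the batch model, and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: best-effort filtering versus strict filtering.
final class ResidualFilterSketch {
  private ResidualFilterSketch() {}

  // Best effort: the filter only skips whole batches. A batch is returned in
  // full if any of its rows matches, so non-matching rows can leak through.
  static List<Integer> bestEffortRead(List<List<Integer>> batches, Predicate<Integer> filter) {
    List<Integer> rows = new ArrayList<>();
    for (List<Integer> batch : batches) {
      if (batch.stream().anyMatch(filter)) {
        rows.addAll(batch);
      }
    }
    return rows;
  }

  // Strict: the filter is applied again row by row on top of the best-effort read.
  static List<Integer> strictRead(List<List<Integer>> batches, Predicate<Integer> filter) {
    List<Integer> rows = new ArrayList<>();
    for (Integer row : bestEffortRead(batches, filter)) {
      if (filter.test(row)) {
        rows.add(row);
      }
    }
    return rows;
  }
}
```

Under the best-effort contract the per-row check is the caller's job; a stricter contract would move it into every file format reader.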

@pvary pvary force-pushed the file_Format_api_without_base branch from a549bda to 50ed143 Compare March 24, 2025 18:06
@pvary pvary changed the title WIP: Interface based DataFile reader and writer API Core: Interface based DataFile reader and writer API Mar 25, 2025
@pvary pvary force-pushed the file_Format_api_without_base branch from 6ba23be to c8e2e33 Compare March 26, 2025 12:34
@github-actions github-actions bot added the MR label Mar 30, 2025
@pvary pvary force-pushed the file_Format_api_without_base branch from 2ee67b3 to 2e86ee9 Compare March 30, 2025 21:07
5 participants