
API, Core: Add scan planning apis to REST Catalog #11180

Open
wants to merge 11 commits into base: main

Conversation


@rahil-c rahil-c commented Sep 20, 2024

This PR adds the new scan planning APIs based on the recent REST spec changes made in #9695.

In addition, there is a spec change to loadTable (https://github.com/apache/iceberg/pull/11156/files) that allows the service to notify the client when to call these new APIs.

cc @amogh-jahagirdar @nastra @rdblue @danielcweeks @jackye1995 @singhpk234 @geruh

@amogh-jahagirdar amogh-jahagirdar changed the title Add scan planning api request and response models API, Core: Add scan planning api request and response models Sep 20, 2024

@amogh-jahagirdar amogh-jahagirdar left a comment


Thanks @rahil-c for kicking off the implementation! I wonder what others think but IMO this can be combined with serializer implementations that get registered in RESTSerializers. Then we can have more robust tests which exercise all the validation logic in the model.


nastra commented Sep 23, 2024

@rahil-c can you also please add the respective JSON parser(s) with tests for those parser(s), as those request/response classes won't be serialized/deserialized otherwise.


rahil-c commented Sep 23, 2024

> Thanks @rahil-c for kicking off the implementation! I wonder what others think but IMO this can be combined with serializer implementations that get registered in RESTSerializers. Then we can have more robust tests which exercise all the validation logic in the model.

> can you also please add respective JSON parser(s) with tests for those parser(s) as those request/response classes won't be serialized/deserialized

Thanks @amogh-jahagirdar and @nastra for reviewing; will follow up on this PR with serializers, parsers, and the tests for them.

    Preconditions.checkArgument(
        planStatus() != null, "invalid, status can not be null: ", planStatus());
    Preconditions.checkArgument(
        planStatus() != PlanStatus.COMPLETED && (planTasks() != null || fileScanTasks() != null),
        "invalid response, tasks can only be returned in a 'completed' status");
Contributor

Why does this throw IllegalArgumentException? Should this be some other REST exception because the service sent an invalid response? It isn't an argument that was the problem, it was the response itself.

@amogh-jahagirdar amogh-jahagirdar Sep 27, 2024

Not sure I follow, but I think it can be considered an IllegalArgumentException, since planStatus and tasks can be considered arguments to the request builder, and if the combination of arguments is logically in a state where it's not spec compliant, I feel like it makes sense to fail here with that exception. Another compelling part of this implementation is that anyone using this response builder from the reference implementation gets immediate validation that their response is spec compliant or not.

Contributor Author

Thanks @amogh-jahagirdar for sharing your thoughts; that was my understanding as well. The request and response require certain arguments to be present that the spec mandates; otherwise it is considered invalid, and the IllegalArgumentException seems appropriate.

When checking Iceberg's core REST module, we already do this in several places, so it seems to be the pattern.

Examples in the request classes
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/requests/CommitTransactionRequest.java#L40
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java#L48

Examples in parser classes
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/requests/CommitTransactionRequestParser.java#L44

https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/requests/RegisterTableRequestParser.java#L43

@rdblue If this is an issue though, I can try to follow your suggestion as well; let me know what you think.
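The builder-side validation pattern being debated here can be sketched in a minimal, self-contained form; the `PlanResponseSketch` class and its fields below are simplified stand-ins for illustration, not the PR's actual classes:

```java
import java.util.List;

// Minimal sketch of validating spec invariants in a response constructor/builder
// and failing with IllegalArgumentException, mirroring the Preconditions pattern
// discussed above. Names here are hypothetical stand-ins.
public class PlanResponseSketch {
  enum PlanStatus { SUBMITTED, COMPLETED, CANCELLED, FAILED }

  final PlanStatus planStatus;
  final List<String> planTasks;

  PlanResponseSketch(PlanStatus planStatus, List<String> planTasks) {
    if (planStatus == null) {
      throw new IllegalArgumentException("invalid response, status can not be null");
    }
    // tasks may only accompany a 'completed' status
    if (planStatus != PlanStatus.COMPLETED && planTasks != null) {
      throw new IllegalArgumentException(
          "invalid response, tasks can only be returned in a 'completed' status");
    }
    this.planStatus = planStatus;
    this.planTasks = planTasks;
  }

  public static void main(String[] args) {
    // Valid: completed status with tasks
    new PlanResponseSketch(PlanStatus.COMPLETED, List.of("task-1"));
    // Invalid: tasks present while still submitted
    try {
      new PlanResponseSketch(PlanStatus.SUBMITTED, List.of("task-1"));
      throw new AssertionError("expected IllegalArgumentException");
    } catch (IllegalArgumentException expected) {
      System.out.println("rejected: " + expected.getMessage());
    }
  }
}
```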

    Preconditions.checkArgument(planStatus() != null, "invalid response, status can not be null");
    Preconditions.checkArgument(
        planStatus() != PlanStatus.CANCELLED,
        "invalid response, 'cancelled' is not a valid status for this endpoint");
Contributor

This should also not be an IllegalArgumentException.

Contributor Author

Commented here https://github.com/apache/iceberg/pull/11180/files#r1779211148

If this is an issue, I'm assuming you would like this to also be some specific new REST exception for the actual field's value being invalid?


    @Nullable
    @Value.Default
    default Boolean useSnapShotSchema() {
Contributor

Same problem here.

Also, I don't think this is the correct place to add a default. The parser should add the default. When constructing a request the caller should be required to set this.

Contributor Author

Will fix this

@rahil-c rahil-c Oct 1, 2024

Actually, when reading this again, I agree that the parser can add the default value, but I'm not understanding why a caller should be required to set this? In the spec it's not a required field: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L4210

Update: Can still update this for useSnapShotSchema and caseSensitive.

cc @amogh-jahagirdar

@amogh-jahagirdar amogh-jahagirdar added this to the Iceberg 1.7.0 milestone Oct 2, 2024

rahil-c commented Oct 9, 2024

@rdblue @danielcweeks @amogh-jahagirdar @nastra @jackye1995 @singhpk234 When implementing the parsers, one thing I noticed was that not having a partition-spec sent back from the service at the time of response serialization/deserialization causes several issues from a client implementation standpoint.

I have marked these areas with TODO comments so it's easier for reviewers to see the complexities behind this, and would recommend that we have a discussion around the pros/cons of the service sending back a partition-spec in the FileScanTask model.

The spec-id from the DataFile alone is not enough to find the correct partition-spec, because we do not have the TableMetadata at the time of serialization/deserialization to obtain the relevant partition spec.

@rahil-c rahil-c changed the title API, Core: Add scan planning api request and response models API, Core: Add scan planning api request and response models and parsers Oct 9, 2024
@rahil-c rahil-c changed the title API, Core: Add scan planning api request and response models and parsers API, Core: Add scan planning api models and parsers Oct 10, 2024
    @@ -26,7 +26,7 @@
     import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
     import org.apache.iceberg.types.Types;

    -class GenericDataFile extends BaseFile<DataFile> implements DataFile {
    +public class GenericDataFile extends BaseFile<DataFile> implements DataFile {
Contributor

This and GenericDeleteFile shouldn't be made public; based on discussion with @rahil-c, this is just an artifact of having the parser implementation in the REST package, which can't access this class. I think the real solution should be to uplevel the new parser to core and reuse as much of the existing content file parsers as possible.

Contributor

If we do my suggestion below, this along with the revapi changes should go away, and the diff gets a lot smaller by reusing ContentFileParser as it is.

Contributor Author

Will fix this thanks @amogh-jahagirdar

Comment on lines 118 to 120
    // TODO at the time of deserialization we don't have the partition spec, we just have the spec id.
    // we will need to get the spec from table metadata using the spec id.
    // we will need to send the partition spec I believe; put a placeholder (null) here for now
@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

Yeah, I think there's a compelling case to update the REST spec so we send back the full serialized partition spec. In case we were worried about payload sizes going over the wire, we could follow the same concept we do for delete file references, where there's a top-level list in which it's only listed once, and in the tasks itself it's just an integer index. But this complicates the protocol.

I think payload size is more of a concern for sending schemas, which I'm pretty against. I feel like resolving schemas is probably worth it on the client. Besides clients have full context on what's being projected and can just send the right schema to the file scan task. There's also an argument for the server sending back a schema ID so that the client doesn't have to do the work of resolving a schema to use for passing to the file scan task, but the schema in FileScanTask is ultimately for any client side grouping the client wants to do and we could just use the current schema or the snapshot schema.
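For orientation, the index-based delete file referencing described above looks roughly like this on the wire; this is an illustrative sketch based on the discussion, and the field names are assumptions rather than the exact spec payload:

```json
{
  "plan-status": "completed",
  "delete-files": [
    { "content": "position-deletes", "file-path": "s3://bucket/deletes/d0.parquet" },
    { "content": "position-deletes", "file-path": "s3://bucket/deletes/d1.parquet" }
  ],
  "file-scan-tasks": [
    { "data-file": { "file-path": "s3://bucket/data/f0.parquet" }, "delete-file-references": [0] },
    { "data-file": { "file-path": "s3://bucket/data/f1.parquet" }, "delete-file-references": [0, 1] }
  ]
}
```

Each delete file is serialized once in the top-level list, and tasks refer to entries by index.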

@singhpk234 singhpk234 Oct 10, 2024

> we send back the full serialized partition spec. In case we were worried about payload sizes going over the wire we could follow the same concept we do for delete file references where there's a top level list where it's only listed once, and in the tasks itself it's just an integer index. But this complicates the protocol.

presently this leaves us with no option but to bind the partition spec with the current schema of the snapshot being queried, which is tbh not a bad option, as schema/partition evolution takes care of not allowing backward-incompatible schemas by default. Interesting would be the cases where we allow this flag in schema evolution?

Having a schema_id would be beneficial tbh, as if for now we want to project col a and col b, a data file not having any of them would still be returned to the client, which the engines/caller will eventually discard. This makes me think more in terms of column-level access control, but I totally understand we lack this mechanism today.

Comment on lines 75 to 84
    public PartitionData() {
      // This constructor has a niche use at the time of REST serialization/deserialization,
      // where we do not have the partition spec and can not populate these values.
      // These values will be refreshed before returning to the client engine.
      this.partitionType = null;
      this.size = 0;
      this.data = new Object[size];
      this.schema = null;
      this.stringSchema = null;
    }
Contributor

Discussed with @rahil-c: technically we could avoid this constructor by passing in a null value to the GenericDataFile and rebinding that later, but the bigger question is whether we want the server to just send back a serialized spec.

Contributor Author

Will fix this in next revision.

Comment on lines 48 to 55
    public Boolean caseSensitive() {
      return caseSensitive;
    }

    public Boolean useSnapshotSchema() {
      return useSnapshotSchema;
    }
Contributor

I think these should always be boolean. From a model perspective, I think you'll want these methods to return some value, especially since we have defaults defined. This will also simplify the parser logic, where you don't have to if/else these cases.

Contributor Author

Will fix this.

Comment on lines 62 to 71
    if (request.caseSensitive() != null) {
      gen.writeBooleanField(CASE_SENSITIVE, request.caseSensitive());
    } else {
      gen.writeBooleanField(CASE_SENSITIVE, true);
    }
    if (request.useSnapshotSchema() != null) {
      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, request.useSnapshotSchema());
    } else {
      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, false);
    }
@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

See my comment above, I think we should write the model in the library in a manner where request.useSnapshotSchema() always returns a concrete boolean value and can't be null.

The spec says it's optional but that doesn't mean the reference implementation for the spec should enable that behavior for the in memory representation for these structures. The spec just gives that flexibility to clients and defines what the server should do in case these boolean fields are missing.

In this case, having these well defined in the API makes it so callers have to set this when building, and servers can have the expectation when handling such a request that the field exists, instead of doing some sort of null check.
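A minimal sketch of that direction, under the assumption of simplified hypothetical names (the real PR uses immutables and Jackson): the model only ever exposes concrete booleans, and the parser applies the spec defaults when a field is absent.

```java
// Sketch: the in-memory model exposes primitive booleans (never null); the
// parser supplies the spec defaults (caseSensitive=true, useSnapshotSchema=false)
// when the JSON field is missing. Names are illustrative stand-ins.
public class PlanTableScanRequestSketch {
  private final boolean caseSensitive;
  private final boolean useSnapshotSchema;

  public PlanTableScanRequestSketch(boolean caseSensitive, boolean useSnapshotSchema) {
    this.caseSensitive = caseSensitive;
    this.useSnapshotSchema = useSnapshotSchema;
  }

  public boolean caseSensitive() { return caseSensitive; }
  public boolean useSnapshotSchema() { return useSnapshotSchema; }

  // Parser-side defaulting: the parsed Boolean is null when the field is absent.
  static PlanTableScanRequestSketch fromParsed(Boolean caseSensitive, Boolean useSnapshotSchema) {
    return new PlanTableScanRequestSketch(
        caseSensitive != null ? caseSensitive : true,
        useSnapshotSchema != null ? useSnapshotSchema : false);
  }

  public static void main(String[] args) {
    PlanTableScanRequestSketch req = fromParsed(null, null);
    System.out.println(req.caseSensitive() + " " + req.useSnapshotSchema()); // prints "true false"
  }
}
```

Callers of the builder set both flags explicitly; only the parser path ever deals with absent fields.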

Contributor

Also newlines after if blocks please. I think this is a case where it's really helpful for readability.

Contributor Author

Will fix this thanks

Comment on lines 1061 to 1065
    org.apache.iceberg:iceberg-api:
      - code: "java.class.defaultSerializationChanged"
        old: "class org.apache.iceberg.expressions.ResidualEvaluator"
        new: "class org.apache.iceberg.expressions.ResidualEvaluator"
        justification: "local testing"
Contributor

Seeing how we can get rid of this...

Contributor

If we do what I mentioned below we should be able to get rid of this.

Contributor

Why was it necessary to modify ResidualEvaluator?

Contributor

It's not really needed (#11180 (comment)); we can just set the residual evaluator once we have that info.

@rahil-c rahil-c Oct 11, 2024

@rdblue Will use the example of the existing FileScanTaskParser. Currently, when deserializing FileScanTasks in the fromJson of this parser (which is similar to RESTFileScanTaskParser), constructing a BaseFileScanTask takes in a ResidualEvaluator: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/FileScanTaskParser.java#L117

The current ResidualEvaluator constructor requires a spec as well as a caseSensitive flag: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java#L84. At the time of deserialization we have neither, hence an additional constructor was created which only passes in the filter.

Contributor Author

Following @amogh-jahagirdar's suggestion; will remove this.

    import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
    import org.apache.iceberg.util.JsonUtil;

    public class RESTContentFileParser {
Contributor

A big chunk of this PR is this class and the other classes, which are quite similar to the existing parsers. Discussed with @rahil-c that we should reuse as much as possible.

Contributor Author

Will fix this.

Comment on lines 76 to 92
    List<Integer> deleteFileReferences = null;
    if (jsonNode.has(DELETE_FILE_REFERENCES)) {
      ImmutableList.Builder deleteFileReferencesBuilder = ImmutableList.builder();
      JsonNode deletesArray = jsonNode.get(DELETE_FILE_REFERENCES);
      for (JsonNode deleteRef : deletesArray) {
        deleteFileReferencesBuilder.add(deleteRef);
      }
      deleteFileReferences = deleteFileReferencesBuilder.build();
    }

    if (deleteFileReferences != null) {
      ImmutableList.Builder matchedDeleteFilesBuilder = ImmutableList.builder();
      for (Integer deleteFileIdx : deleteFileReferences) {
        matchedDeleteFilesBuilder.add(allDeleteFiles.get(deleteFileIdx));
      }
      matchedDeleteFiles = (DeleteFile[]) matchedDeleteFilesBuilder.build().stream().toArray();
    }
@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

I think the logic here for resolving the delete files for the task is incorrect and can also just be simplified to something like the following:

    Set<Integer> deleteFileReferences = Sets.newHashSet();
    if (jsonNode.has(DELETE_FILE_REFERENCES)) {
      deleteFileReferences.addAll(JsonUtil.getIntegerList(DELETE_FILE_REFERENCES, jsonNode));
    }

    List<DeleteFile> deleteFilesForTask = Lists.newArrayList();
    for (int i = 0; i < deleteFiles.size(); i++) {
      if (deleteFileReferences.contains(i)) {
        deleteFilesForTask.add(deleteFiles.get(i));
      }
    }

Contributor

+1; should we add tests for these parsers as well, or are we skipping them for now?

Contributor Author

Thanks @amogh-jahagirdar for pointing this out and for the concrete suggestion.
@singhpk234 will look into adding tests for the parsers; was first just trying to get most of the src implementation out so others can comment.

Comment on lines 53 to 60
    // TODO revisit this logic
    if (deleteFiles != null) {
      generator.writeArrayFieldStart(DELETE_FILE_REFERENCES);
      for (int delIndex = 0; delIndex < deleteFiles.size(); delIndex++) {
        generator.writeNumber(delIndex);
      }
      generator.writeEndArray();
    }
@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

Yeah, this seems incorrect; why do we need to pass the full delete files list when serializing the scan task? I think we just need to pass in the references, as that's really what's relevant for serializing the scan task.

Contributor Author

Will fix this thanks @amogh-jahagirdar

    if (response.fileScanTasks() != null) {
      gen.writeArrayFieldStart(FILE_SCAN_TASKS);
      for (FileScanTask fileScanTask : response.fileScanTasks()) {
        RESTFileScanTaskParser.toJson(fileScanTask, deleteFiles, gen);
@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

See my comment #11180 (comment); passing the deleteFiles like this is incorrect. We should be passing only the indices of the top-level list that are relevant for the task we are serializing.

@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

    Map<String, Integer> deleteFilePathToIndex = Maps.newHashMap();
    if (response.deleteFiles() != null) {
      deleteFiles = response.deleteFiles();
      gen.writeArrayFieldStart(DELETE_FILES);
      for (int i = 0; i < deleteFiles.size(); i++) {
        deleteFilePathToIndex.put(deleteFiles.get(i).location(), i);
        RESTContentFileParser.toJson(deleteFiles.get(i), gen);
      }
      gen.writeEndArray();
    }

    if (response.fileScanTasks() != null) {
      List<Integer> deleteFileReferences = Lists.newArrayList();
      gen.writeArrayFieldStart(FILE_SCAN_TASKS);
      for (FileScanTask fileScanTask : response.fileScanTasks()) {
        if (deleteFiles != null) {
          for (DeleteFile taskDelete : taskDeletes) {
            deleteFileReferences.add(deleteFilePathToIndex.get(taskDelete.location()));
          }
        }

        RESTFileScanTaskParser.toJson(fileScanTask, deleteFileReferences, gen);
      }

Contributor Author

Thanks @amogh-jahagirdar for the clarification here.

Contributor

I was looking at this again locally, correction in the second half:


    if (response.fileScanTasks() != null) {
      gen.writeArrayFieldStart(FILE_SCAN_TASKS);
      for (FileScanTask fileScanTask : response.fileScanTasks()) {
      // the list should be initialized here, the delete file references are per file scan task
      List<Integer> deleteFileReferences = Lists.newArrayList();

        if (deleteFiles != null) {
          for (DeleteFile taskDelete : taskDeletes) {
            deleteFileReferences.add(deleteFilePathToIndex.get(taskDelete.location()));
          }
        }

        RESTFileScanTaskParser.toJson(fileScanTask, deleteFileReferences, gen);
      }

Anyways, with tests this will all be made clear

      // Needed for Jackson Deserialization.
    }

    public FetchScanTasksResponse(
Contributor

Since these are following a builder pattern, I think these constructors should be private.

      this.planTasks = withPlanTasks;
      return this;
    }

@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

I think the builder API for specifying file scan tasks should also accept another argument specifying a list of partition specs which corresponds to the list of file scan tasks. This would get used in serialization, and then we can avoid all the custom stuff in RESTContentFileParser around serializing partition values and just use ContentFileParser as it is today.

Contributor

It's not like the builder/in-memory model can't have extra fields compared to the spec, by the way. As long as we're serializing according to the spec during serialization, that's good.

Contributor Author

Synced offline with @amogh-jahagirdar on this, will try this idea out thanks for the helpful suggestion!

      return this;
    }

    public Builder withDeleteFiles(List<DeleteFile> withDeleteFiles) {
@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

Double check the names of these args; I think you just copied the method name. Should just be deleteFiles.

Contributor

This may have been done due to checkstyle complaining about hidden fields, my bad. It'd be better if we could have different names

Contributor Author

Yea this was due to checkstyle shadow/hidden field. Will use different names

Comment on lines 90 to 120
    public Builder withPlanStatus(PlanStatus withPlanStatus) {
      this.planStatus = withPlanStatus;
      return this;
    }

    public Builder withPlanTasks(List<String> withPlanTasks) {
      this.planTasks = withPlanTasks;
      return this;
    }

    public Builder withFileScanTasks(List<FileScanTask> withFileScanTasks) {
      this.fileScanTasks = withFileScanTasks;
      return this;
    }

    public Builder withDeleteFiles(List<DeleteFile> withDeleteFiles) {
      this.deleteFiles = withDeleteFiles;
      return this;
    }

@amogh-jahagirdar amogh-jahagirdar Oct 10, 2024

Same as below: double check some of the names on these args; they shouldn't have the with prefix.

Comment on lines 68 to 70
    public PlanTableScanRequest() {
      // Needed for Jackson Deserialization.
    }
Contributor

These shouldn't be needed anymore.

Contributor Author

Will remove these.

Comment on lines 105 to 106
    BaseFileScanTask baseFileScanTask =
        new BaseFileScanTask(dataFile, matchedDeleteFiles, null, null, residualEvaluator);
@amogh-jahagirdar amogh-jahagirdar Oct 11, 2024

We can introduce an UnboundFileScanTask which extends BaseFileScanTask and overrides spec/schema to throw Unsupported. That class could have a bind(PartitionSpec spec, Schema schema) API which returns a BaseFileScanTask with the updated spec/schema and the data file partition data updated correctly.

Binding would happen when planning the tasks
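The bind-later idea can be sketched generically; the types below are simplified stand-ins for Iceberg's PartitionSpec, Schema, and BaseFileScanTask, not the actual API:

```java
// Sketch of the "unbound task" idea: the deserialized task refuses to answer
// spec/schema questions until bind(...) supplies them at planning time.
// All types here are hypothetical stand-ins for illustration.
public class UnboundTaskSketch {
  static class Spec { final int specId; Spec(int id) { this.specId = id; } }
  static class Schema { final int schemaId; Schema(int id) { this.schemaId = id; } }

  static class UnboundTask {
    final String dataFilePath;
    UnboundTask(String dataFilePath) { this.dataFilePath = dataFilePath; }

    // Unbound tasks cannot answer spec/schema questions yet.
    Spec spec() { throw new UnsupportedOperationException("task is not bound yet"); }

    // Binding produces a fully usable task once the planner knows the spec/schema.
    BoundTask bind(Spec spec, Schema schema) {
      return new BoundTask(dataFilePath, spec, schema);
    }
  }

  static class BoundTask {
    final String dataFilePath;
    final Spec spec;
    final Schema schema;
    BoundTask(String path, Spec spec, Schema schema) {
      this.dataFilePath = path;
      this.spec = spec;
      this.schema = schema;
    }
  }

  public static void main(String[] args) {
    UnboundTask task = new UnboundTask("s3://bucket/data/file.parquet");
    try {
      task.spec();
      throw new AssertionError("expected UnsupportedOperationException");
    } catch (UnsupportedOperationException expected) {
      // task not bound yet, as intended
    }
    BoundTask bound = task.bind(new Spec(0), new Schema(1));
    System.out.println("bound to spec " + bound.spec.specId);
  }
}
```

As the follow-up comment notes, the residual evaluator would need to be bound at the same point.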

Contributor Author

Will try this idea out thanks again @amogh-jahagirdar !

Contributor Author

I think we also need to bind the residual as well.

@singhpk234 singhpk234 Oct 14, 2024

[doubt] These fileScanTasks can belong to different snapshots (let's say we define both a start and end snapshot), and we would have to bind this UnboundFileScanTask to get a FileScanTask by first getting the schema of the snapshot this file scan task belongs to, then getting the PartitionSpec from binding that schema with the UnboundPartitionSpec. Do we have metadata for mapping a FileScanTask to a snapshot id?
ref this interesting case: #11162

        jsonNode.isObject(), "Invalid JSON node for file scan task: non-object (%s)", jsonNode);

    GenericDataFile dataFile =
        (GenericDataFile) ContentFileParser.unboundContentFileFromJson(jsonNode.get(DATA_FILE));
Contributor

please use JsonUtil.get(..) where possible, since it already includes error handling and will show if this field is missing

    String planningModeValue = response.config().get(REST_SERVER_PLANNING_MODE_KEY);
    RESTPlanningMode planningMode = RESTPlanningMode.fromName(planningModeValue);
    if (planningMode == RESTPlanningMode.REQUIRED
        || ((planningMode == RESTPlanningMode.SUPPORTED) && restServerPlanningEnabled)) {
Contributor

Suggested change:

    -    || ((planningMode == RESTPlanningMode.SUPPORTED) && restServerPlanningEnabled)) {
    +    || (planningMode == RESTPlanningMode.SUPPORTED && restServerPlanningEnabled)) {

    @@ -125,4 +125,38 @@ public String view(TableIdentifier ident) {
       public String renameView() {
         return SLASH.join("v1", prefix, "views", "rename");
       }

    +  public String planTableScan(TableIdentifier ident) {
Contributor

I believe all places that use these new endpoints need to do an Endpoint.check(...), as it's still possible that some of these endpoints don't exist, even if rest-server-planning-enabled was set by a client.

Rahil Chertara and others added 2 commits October 21, 2024 09:53

rahil-c commented Oct 21, 2024

Please see the following PR: #11369, which contains only the changes for the model and parsers.


rahil-c commented Oct 22, 2024

@amogh-jahagirdar @rdblue @danielcweeks @nastra
Pushed the following commit, "Utilize ParallelIterable, and use table prop instead of rest planning", which addressed some of the planning logic feedback.

Moving on to adding tests to check RESTTableScan.planFiles() logic.

cc @singhpk234 @jackye1995

    planTasks.addAll(response.planTasks());
    // TODO need to confirm recursive call here
    if (response.planTasks() != null) {
      getScanTasksIterable(response.planTasks());
Contributor

Seems like we're not using this result anywhere at the moment to populate the top level list? unless i'm missing something

Contributor Author

Thanks for catching this; will fix it and add a test to exercise this code path.
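One way the nested plan-task fetching could be handled is with a worklist loop instead of recursion. This is a simplified sketch with hypothetical types and a stand-in for the REST call, not the PR's actual RESTTableScan code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Sketch: drain nested plan tasks with a worklist instead of recursion,
// accumulating file scan task results as we go. The service map is a
// stand-in for calls to the fetch-scan-tasks endpoint.
public class PlanTaskWorklistSketch {
  record FetchResult(List<String> morePlanTasks, List<String> fileScanTasks) {}

  static List<String> drain(List<String> initialPlanTasks, Map<String, FetchResult> service) {
    List<String> fileScanTasks = new ArrayList<>();
    Deque<String> pending = new ArrayDeque<>(initialPlanTasks);
    while (!pending.isEmpty()) {
      FetchResult response = service.get(pending.poll()); // stand-in for the REST call
      if (response.morePlanTasks() != null) {
        pending.addAll(response.morePlanTasks()); // nested plan tasks go back on the worklist
      }
      fileScanTasks.addAll(response.fileScanTasks());
    }
    return fileScanTasks;
  }

  public static void main(String[] args) {
    Map<String, FetchResult> service = Map.of(
        "p1", new FetchResult(List.of("p2"), List.of("f1")),
        "p2", new FetchResult(null, List.of("f2", "f3")));
    System.out.println(drain(List.of("p1"), service)); // prints [f1, f2, f3]
  }
}
```

The loop makes the termination condition explicit and avoids the dropped-result issue the review points out.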

@RussellSpitzer
Member

It looks like this won't be ready this week so I'll move it out to the next release milestone.

@rahil-c rahil-c force-pushed the scan-plan-request-response-impl branch from 53e475d to 136ca3b on November 5, 2024 00:22

rahil-c commented Dec 10, 2024

@rdblue @danielcweeks Was wondering if you guys can take a look whenever you get a chance?

github-actions bot

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 10, 2025

rahil-c commented Jan 10, 2025

> This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

Commenting to avoid this being marked as stale.
cc @rdblue @danielcweeks

@nastra nastra added not-stale and removed stale labels Jan 10, 2025