Fix query without equality delete fields #485
Conversation
@Tishj By the way, how do I configure Iceberg to use relative paths for the test cases in the data directory? I checked Iceberg's configuration and there doesn't seem to be such an option. Thanks.
I'm not sure I understand what you're asking. Iceberg works with absolute paths; that's a limitation of the format.
In the …, for example, the …
@zhangjun0x01 Check out some of the scripts, like …; I imagine it's because the …
I tested it: when we create an Iceberg Hadoop catalog with a relative path, the table will use relative paths.
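As context for the point above, here is a minimal sketch of building a Hadoop catalog whose warehouse location is a relative path; the class name and the `data/warehouse` path are hypothetical, not from this PR:

```java
// Hypothetical sketch: a Hadoop catalog rooted at a relative warehouse path,
// so newly created tables record locations under that relative root.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

import java.util.HashMap;
import java.util.Map;

public class RelativeWarehouseCatalog {
    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        // Relative path: resolved against the process working directory.
        props.put(CatalogProperties.WAREHOUSE_LOCATION, "data/warehouse");
        props.put(CatalogUtil.ICEBERG_CATALOG_TYPE, CatalogUtil.ICEBERG_CATALOG_TYPE_HADOOP);
        Catalog catalog = CatalogUtil.buildIcebergCatalog("hadoop_catalog", props, new Configuration());
        System.out.println("Catalog created: " + catalog.name());
    }
}
```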
Hi @Tishj, could you help review again? Thanks.

When CI is green I'll have a look, thanks 👍
This is indeed quite strange. The test program works correctly on my computer, and it is fine on both Linux and Windows.
For reference, this is what was used to generate the test data:

```java
// Place this file under src/main/java/com/example/IcebergEqualDeleteExample.java
package com.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.*;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import com.google.common.collect.ImmutableMap;
import org.apache.iceberg.deletes.EqualityDeleteWriter;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.Parquet;

import java.util.*;
import java.util.stream.Collectors;

import static org.apache.iceberg.TableProperties.FORMAT_VERSION;

public class IcebergEqualDeleteExample {
    public static void main(String[] args) {
        // Hadoop catalog rooted at the test warehouse directory.
        Map<String, String> props = new HashMap<>();
        props.put(CatalogProperties.WAREHOUSE_LOCATION, "data/persistent/equality_deletes/warehouse");
        props.put(CatalogUtil.ICEBERG_CATALOG_TYPE, CatalogUtil.ICEBERG_CATALOG_TYPE_HADOOP);
        Catalog catalog = CatalogUtil.buildIcebergCatalog("hadoop_catalog", props, new Configuration());

        TableIdentifier tableId = TableIdentifier.of("mydb", "mytable");
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.IntegerType.get()),
            Types.NestedField.optional(2, "name", Types.StringType.get()));
        if (!catalog.tableExists(tableId)) {
            // Equality deletes require format version 2.
            catalog.createTable(tableId, schema, PartitionSpec.unpartitioned(), ImmutableMap.of(FORMAT_VERSION, "2"));
            System.out.println("Created table mydb.mytable");
        }
        Table table = catalog.loadTable(tableId);

        // Insert some records
        List<Record> records_1 =
            Arrays.asList(createRecord(table, 1, "a"), createRecord(table, 2, "b"), createRecord(table, 3, "b"));
        insertRecords(table, records_1);

        // Delete records where name="b"
        Map<String, Object> deleteConditions_1 = new HashMap<>();
        deleteConditions_1.put("name", "b");
        deleteRecordsByEquality(table, deleteConditions_1);

        // Insert some records
        List<Record> records_2 =
            Arrays.asList(createRecord(table, 1, "a"), createRecord(table, 2, "b"), createRecord(table, 1, "b"));
        insertRecords(table, records_2);

        // Delete records where id=1 AND name="a"
        Map<String, Object> deleteConditions_2 = new HashMap<>();
        deleteConditions_2.put("id", 1);
        deleteConditions_2.put("name", "a");
        deleteRecordsByEquality(table, deleteConditions_2);
    }

    private static Record createRecord(Table table, int id, String name) {
        Record record = GenericRecord.create(table.schema());
        record.setField("id", id);
        record.setField("name", name);
        return record;
    }

    private static void insertRecords(Table table, List<Record> records) {
        Transaction insertTxn = table.newTransaction();
        try {
            // Write the rows to a new Parquet data file under the table location.
            String dataPath = String.format("%s/data/data-%s.parquet", table.location(), UUID.randomUUID());
            OutputFile dataOut = table.io().newOutputFile(dataPath);
            FileAppender<Record> writer = Parquet.write(dataOut)
                .schema(table.schema())
                .createWriterFunc(GenericParquetWriter::buildWriter)
                .build();
            for (Record record : records) {
                writer.add(record);
            }
            writer.close();
            DataFile dataFile = DataFiles.builder(table.spec())
                .withPath(dataPath)
                .withFormat(FileFormat.PARQUET)
                .withFileSizeInBytes(dataOut.toInputFile().getLength())
                .withMetrics(writer.metrics())
                .withSplitOffsets(writer.splitOffsets())
                .build();
            insertTxn.newAppend().appendFile(dataFile).commit();
            insertTxn.commitTransaction();
            System.out.println("Data inserted successfully.");
        } catch (Exception e) {
            throw new RuntimeException("Failed to insert data", e);
        }
    }

    private static void deleteRecordsByEquality(Table table, Map<String, Object> fieldValues) {
        Transaction txn = table.newTransaction();
        try {
            List<String> eqFields = new ArrayList<>(fieldValues.keySet());
            Record deleteRec = GenericRecord.create(table.schema().select(eqFields));
            // Set all field values in the delete record
            fieldValues.forEach(deleteRec::setField);
            String deletePath = String.format("%s/data/delete-%s.parquet", table.location(), UUID.randomUUID());
            OutputFile out = table.io().newOutputFile(deletePath);
            EqualityDeleteWriter<Record> writer = Parquet.writeDeletes(out)
                .forTable(table)
                .withSpec(table.spec())
                .rowSchema(table.schema().select(eqFields))
                .createWriterFunc(GenericParquetWriter::buildWriter)
                .equalityFieldIds(getFieldIds(table, eqFields))
                .buildEqualityWriter();
            writer.write(deleteRec);
            writer.close();
            DeleteFile df = writer.toDeleteFile();
            // Commit the equality delete file as a row delta.
            txn.newRowDelta().addDeletes(df).commit();
            txn.commitTransaction();
            System.out.println("Equality delete file written successfully.");
        } catch (Exception e) {
            throw new RuntimeException("Failure writing equality deletes", e);
        }
    }

    private static List<Integer> getFieldIds(Table table, List<String> names) {
        return names.stream().map(n -> table.schema().findField(n).fieldId()).collect(Collectors.toList());
    }
}
```
I created a partitioned table and inserted data with Spark, generated the equality delete files with the Java API using this example, and then queried the table.
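To make the failing query shape concrete, here is a hedged sketch (not from the PR) that reads the table back with Iceberg's generic reader while projecting only `id`, i.e. without selecting the `name` column that the equality deletes reference; the class name is hypothetical:

```java
// Hypothetical sketch: read back mydb.mytable while projecting only "id".
// The reader must still fetch "name" internally to evaluate the equality deletes.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.CloseableIterable;

import java.util.HashMap;
import java.util.Map;

public class ReadWithoutDeleteColumns {
    public static void main(String[] args) throws Exception {
        Map<String, String> props = new HashMap<>();
        props.put(CatalogProperties.WAREHOUSE_LOCATION, "data/persistent/equality_deletes/warehouse");
        props.put(CatalogUtil.ICEBERG_CATALOG_TYPE, CatalogUtil.ICEBERG_CATALOG_TYPE_HADOOP);
        Catalog catalog = CatalogUtil.buildIcebergCatalog("hadoop_catalog", props, new Configuration());
        Table table = catalog.loadTable(TableIdentifier.of("mydb", "mytable"));

        // Project only "id": the equivalent of SELECT id FROM mydb.mytable.
        try (CloseableIterable<Record> rows = IcebergGenerics.read(table).select("id").build()) {
            for (Record row : rows) {
                System.out.println(row);
            }
        }
    }
}
```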
I regenerated the table.
Hi @Tishj, could you help me test if the … ?
This is the full code to generate the … (see the example above).
When we run a query that does not select the equality delete fields, it throws an exception: equality deletes need the relevant columns to be selected. This PR fixes this issue in ApplyEqualityDeletes. Steps: …
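For intuition about why the columns must be read, here is a self-contained sketch of the equality-delete rule from the Iceberg spec (this is illustrative Java, not the extension's actual code): a data row is deleted when its values in the delete file's equality fields all match a delete row, so those fields are needed even if the query never selects them.

```java
// Illustrative sketch of equality-delete matching; names are hypothetical.
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class EqualityDeleteSketch {
    // Returns true if `row` matches `deleteRow` on every equality field.
    static boolean isDeleted(Map<String, Object> row,
                             Map<String, Object> deleteRow,
                             List<String> equalityFields) {
        for (String field : equalityFields) {
            if (!Objects.equals(row.get(field), deleteRow.get(field))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("id", 1, "name", "a");
        Map<String, Object> deleteRow = Map.of("id", 1, "name", "a");
        // Even a query like SELECT id needs "name" to evaluate this match.
        System.out.println(isDeleted(row, deleteRow, List.of("id", "name"))); // true
    }
}
```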