Conversation

@pizzaeueu (Collaborator)

Relates to #237

item
.entrySet()
.stream()
.filter(e => e.getKey != operationTypeColumn)
@pizzaeueu (Collaborator, Author)

filtered out
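For reference, the filtering being discussed can be sketched in plain Scala. This is a minimal, self-contained sketch rather than the PR's actual code: String values stand in for AttributeValueV1, and only the column name mirrors the operationTypeColumn ("_dynamo_op_type") from the diff.

```scala
import scala.jdk.CollectionConverters._

object FilterOpType {
  // Mirrors the internal metadata column discussed in this PR.
  val operationTypeColumn = "_dynamo_op_type"

  // Drop the internal op-type column before the item is written to the target.
  // String values stand in for AttributeValueV1 to keep the sketch self-contained.
  def withoutOpType(item: java.util.Map[String, String]): java.util.Map[String, String] =
    item.asScala.filter { case (k, _) => k != operationTypeColumn }.asJava

  def main(args: Array[String]): Unit = {
    val item = new java.util.HashMap[String, String]()
    item.put("id", "42")
    item.put(operationTypeColumn, "PUT")
    val cleaned = withoutOpType(item)
    assert(!cleaned.containsKey(operationTypeColumn))
    assert(cleaned.get("id") == "42")
  }
}
```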

val log = LogManager.getLogger("com.scylladb.migrator.writers.DynamoDB")

def deleteRDD(target: TargetSettings.DynamoDB,
@pizzaeueu (Collaborator, Author)

Should there be a dedicated function for delete operations?

Contributor

this is exactly the big question

@tarzanek (Contributor), Jul 31, 2025

I was told that the current version removes rows that should be deleted but keeps the row key together with the "_dynamo_op_type" column, while all other cells are gone.
That is what confuses me. I haven't tried it myself to confirm it really behaves that way (the person who reported it might not know the full picture).

@pizzaeueu (Collaborator, Author), Jul 31, 2025

Is there any stage where we can test this end to end?
Within the integration test in this PR, I can observe that the row is not removed without this additional delete operation.

Contributor

It is possible, since the record event type is not used later, and the DynamoDB writable item doesn't seem to carry an op type.

I am curious whether the RDD we already have could be reused for both operations, to avoid creating a new client connection.

@pizzaeueu (Collaborator, Author)

Hi @tarzanek,
I can try to investigate the issue with the unreachable port 8000.

@tarzanek (Contributor)

tarzanek commented Aug 2, 2025

Yes please. My expectation is that the Docker container with Scylla Alternator doesn't start (a version upgrade issue?), or the DynamoDB mock service needs a version upgrade.
I will look too.

@tarzanek tarzanek requested a review from Copilot August 15, 2025 09:58

Copilot AI left a comment


Pull Request Overview

This PR implements a fix to prevent saving the _dynamo_op_type column during DynamoDB stream replication, addressing issue #237. The change ensures that internal operation metadata is filtered out when writing to the target database.

  • Refactored stream processing logic to separate PUT and DELETE operations
  • Added filtering to exclude _dynamo_op_type from saved data
  • Implemented dedicated DELETE operation handling

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Summary per file:

  • DynamoStreamReplicationIntegrationTest.scala: new integration test verifying that the operation type column is excluded
  • DynamoStreamReplication.scala: refactored to filter the operation type and handle DELETE operations separately
  • DynamoDB.scala: added a new deleteRDD method for handling DELETE operations
  • .gitignore files: updated to ignore Docker data directories
  • tests.yml: added a permissions fix for Docker directories
  • ToRemove.yml: temporary test workflow file


@tarzanek tarzanek requested a review from Copilot September 16, 2025 16:02

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



.collect { case Some(item) => item: util.Map[String, AttributeValueV1] }
.repartition(Runtime.getRuntime.availableProcessors() * 2)

val allOperationsSorted = rdd.collect().sortBy(_.get(sequenceNumberColumn).getS)
Copilot AI, Sep 16, 2025

Collecting all operations into the driver memory could cause OOM errors for large datasets. Consider using distributed sorting with sortBy() on the RDD instead of collecting first.

Suggested change:

- val allOperationsSorted = rdd.collect().sortBy(_.get(sequenceNumberColumn).getS)
+ val allOperationsSorted = rdd.sortBy(_.get(sequenceNumberColumn).getS).collect()
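Spark aside, the ordering concern can be illustrated with plain collections; think of each inner Seq as one partition. A minimal sketch (the map shape and the "_seq" key are assumptions, not the PR's types):

```scala
object GlobalSortSketch {
  val seqKey = "_seq" // assumed stand-in for sequenceNumberColumn

  def main(args: Array[String]): Unit = {
    // Two "partitions" whose local order differs from the global order.
    val partitions = Seq(
      Seq(Map(seqKey -> "003"), Map(seqKey -> "001")),
      Seq(Map(seqKey -> "002"))
    )

    // Equivalent of rdd.sortBy(...).collect(): one global sort over all elements.
    val sorted = partitions.flatten.sortBy(_(seqKey))
    assert(sorted.map(_(seqKey)) == Seq("001", "002", "003"))
  }
}
```

Note that collect() still materializes everything on the driver either way; the suggestion only moves the sorting work onto the cluster.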

Comment on lines 63 to 68
(lastType, lastBatch :+ item) :: rest
} else {
(currentType, List(item)) :: (lastType, lastBatch) :: rest
}
}
.reverse
Copilot AI, Sep 16, 2025

Using :+ for list append is an O(n) operation. Consider using a mutable buffer, or reversing the accumulation strategy to use :: (prepend), which is O(1).

Suggested change:

-       (lastType, lastBatch :+ item) :: rest
-     } else {
-       (currentType, List(item)) :: (lastType, lastBatch) :: rest
-     }
-   }
-   .reverse
+       (lastType, item :: lastBatch) :: rest
+     } else {
+       (currentType, List(item)) :: (lastType, lastBatch) :: rest
+     }
+   }
+   .reverse
+   .map { case (opType, batch) => (opType, batch.reverse) }

Comment on lines +52 to +66
if (!keyToDelete.isEmpty) {
try {
dynamoDB.deleteItem(
DeleteItemRequest
.builder()
.tableName(target.table)
.key(keyToDelete)
.build()
)
} catch {
case e: Exception =>
log.error(
s"Failed to delete item with key ${keyToDelete} from table ${target.table}",
e)
}
Copilot AI, Sep 16, 2025

[nitpick] The isEmpty check is redundant since the forEach loop will naturally skip if keySchema is empty, and the delete operation will fail gracefully if keyToDelete is empty.

Suggested change:

- if (!keyToDelete.isEmpty) {
-   try {
-     dynamoDB.deleteItem(
-       DeleteItemRequest
-         .builder()
-         .tableName(target.table)
-         .key(keyToDelete)
-         .build()
-     )
-   } catch {
-     case e: Exception =>
-       log.error(
-         s"Failed to delete item with key ${keyToDelete} from table ${target.table}",
-         e)
-   }
+ try {
+   dynamoDB.deleteItem(
+     DeleteItemRequest
+       .builder()
+       .tableName(target.table)
+       .key(keyToDelete)
+       .build()
+   )
+ } catch {
+   case e: Exception =>
+     log.error(
+       s"Failed to delete item with key ${keyToDelete} from table ${target.table}",
+       e)

@tarzanek (Contributor)

@pizzaeueu won't the batching change the ordering if we parallelize by batch?
I mean, parallelize will create tasks per batch, so batches will run in parallel and the sequential order might be broken, or do I misunderstand the function?
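To make the batching semantics concrete, here is a minimal pure-Scala sketch of the consecutive-run grouping used in the diff (the names and the (opType, payload) pair shape are illustrative, not the PR's exact types). Order is preserved within each batch and across batches, provided the batches themselves are then applied sequentially rather than in parallel:

```scala
object BatchByOpType {
  // Group consecutive items that share an operation type, preserving source order.
  def batch[A](items: List[(String, A)]): List[(String, List[A])] =
    items
      .foldLeft(List.empty[(String, List[A])]) {
        case (Nil, (tpe, item)) =>
          List((tpe, List(item)))
        case ((lastType, lastBatch) :: rest, (tpe, item)) =>
          if (tpe == lastType) (lastType, item :: lastBatch) :: rest // O(1) prepend
          else (tpe, List(item)) :: (lastType, lastBatch) :: rest    // start a new batch
      }
      .reverse
      .map { case (t, b) => (t, b.reverse) } // restore in-batch order

  def main(args: Array[String]): Unit = {
    val ops = List("PUT" -> 1, "PUT" -> 2, "DELETE" -> 3, "PUT" -> 4)
    assert(batch(ops) == List("PUT" -> List(1, 2), "DELETE" -> List(3), "PUT" -> List(4)))
  }
}
```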

@tarzanek tarzanek requested a review from Copilot September 19, 2025 07:39

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Comment on lines 63 to 73
val itemWithoutOp = item
.entrySet()
.stream()
.filter(e => e.getKey != operationTypeColumn)
.collect(
Collectors.toMap(
(e: util.Map.Entry[String, AttributeValueV1]) => e.getKey,
(e: util.Map.Entry[String, AttributeValueV1]) =>
AttributeValueUtils.fromV1(e.getValue)
)
)
Copilot AI, Sep 19, 2025

[nitpick] This Java-style stream processing is unnecessarily complex for Scala. Consider using Scala collections for better readability: item.asScala.filter(_._1 != operationTypeColumn).map { case (k, v) => k -> AttributeValueUtils.fromV1(v) }.asJava

Suggested change:

- val itemWithoutOp = item
-   .entrySet()
-   .stream()
-   .filter(e => e.getKey != operationTypeColumn)
-   .collect(
-     Collectors.toMap(
-       (e: util.Map.Entry[String, AttributeValueV1]) => e.getKey,
-       (e: util.Map.Entry[String, AttributeValueV1]) =>
-         AttributeValueUtils.fromV1(e.getValue)
-     )
-   )
+ val itemWithoutOp = item.asScala
+   .filter { case (k, _) => k != operationTypeColumn }
+   .map { case (k, v) => k -> AttributeValueUtils.fromV1(v) }
+   .asJava


DynamoDB.writeRDD(target, renamesMap, writableRdd, targetTableDesc)(spark)
run(
msgs.asInstanceOf[RDD[Option[util.Map[String, AttributeValueV1]]]],
Copilot AI, Sep 19, 2025

Using asInstanceOf for type casting is unsafe and can lead to runtime ClassCastException. The type should already be correct based on the DStream definition - consider removing the cast or adding proper type validation.

Suggested change:

- msgs.asInstanceOf[RDD[Option[util.Map[String, AttributeValueV1]]]],
+ msgs,

@pizzaeueu pizzaeueu requested a review from Copilot September 19, 2025 10:38

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



tableDesc
)

val finalItems = scanAll(sourceDDb(), tableName).sortBy(m => m("id").s)
Copilot AI, Sep 19, 2025

The test is scanning the source DynamoDB table instead of the target table. This should scan targetAlternator() to verify that the operations were correctly applied to the target database.

Suggested change:

- val finalItems = scanAll(sourceDDb(), tableName).sortBy(m => m("id").s)
+ val finalItems = scanAll(targetAlternator(), tableName).sortBy(m => m("id").s)

Contributor

This is actually a valid comment: why is the source being validated? @pizzaeueu

@tarzanek tarzanek requested a review from Copilot September 24, 2025 09:07

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.



Comment on lines +65 to +67
val itemWithoutOp = item.asScala.collect {
case (k, v) if k != operationTypeColumn => k -> AttributeValueUtils.fromV1(v)
}.asJava
Copilot AI, Sep 24, 2025

This creates a new collection for each item processing. Consider pre-filtering the operation type column or using a more efficient approach to avoid repeated collection transformations.

Suggested change:

- val itemWithoutOp = item.asScala.collect {
-   case (k, v) if k != operationTypeColumn => k -> AttributeValueUtils.fromV1(v)
- }.asJava
+ val itemWithoutOp = {
+   val m = new java.util.HashMap[String, AttributeValueV2]()
+   val it = item.entrySet().iterator()
+   while (it.hasNext) {
+     val entry = it.next()
+     if (entry.getKey != operationTypeColumn) {
+       m.put(entry.getKey, AttributeValueUtils.fromV1(entry.getValue))
+     }
+   }
+   m
+ }

.tableName(tableName)
.item(
Map(
"id" -> AttributeValue.builder.s("toDelete").build,
Copilot AI, Sep 24, 2025

Missing parentheses after builder. It should be AttributeValue.builder().s("toDelete").build() to properly chain the method calls.

.item(
Map(
"id" -> AttributeValue.builder.s("toDelete").build,
"value" -> AttributeValue.builder.s("value1").build
Copilot AI, Sep 24, 2025

Missing parentheses after builder. It should be AttributeValue.builder().s("value1").build() to properly chain the method calls.

Comment on lines +69 to +70
"id" -> AttributeValue.builder.s("toUpdate").build,
"value" -> AttributeValue.builder.s("value2").build
Copilot AI, Sep 24, 2025

Missing parentheses after builder. It should be AttributeValue.builder().s("toUpdate").build() and AttributeValue.builder().s("value2").build() to properly chain the method calls.

@tarzanek tarzanek merged commit 3013d6a into scylladb:master Sep 25, 2025
2 checks passed