do not save _dynamo_op_type during dynamo stream replication #239
Conversation
```scala
item
  .entrySet()
  .stream()
  .filter(e => e.getKey != operationTypeColumn)
```
filtered out
```scala
val log = LogManager.getLogger("com.scylladb.migrator.writers.DynamoDB")

def deleteRDD(target: TargetSettings.DynamoDB,
```
Should there be a function for delete operations?
this is exactly the big question
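For reference, a dedicated delete pass could look roughly like the sketch below. This is only a sketch, not the PR's implementation: the DeleteItemRequest shape is taken from the excerpt quoted further down, while buildDynamoClient and extractKey are hypothetical helpers standing in for client construction and key projection.

```scala
// Sketch only. Deletes each item's key from the target table, one client per partition.
def deleteRDD(target: TargetSettings.DynamoDB,
              rdd: RDD[util.Map[String, AttributeValueV1]]): Unit =
  rdd.foreachPartition { items =>
    val dynamoDB = buildDynamoClient(target) // hypothetical: client for the target endpoint
    items.foreach { item =>
      // hypothetical: keep only the table's key attributes, converted to V2 values
      val keyToDelete = extractKey(item)
      dynamoDB.deleteItem(
        DeleteItemRequest
          .builder()
          .tableName(target.table)
          .key(keyToDelete)
          .build()
      )
    }
  }
```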
I was told that the current version removes rows that should be deleted, but keeps the row key together with the "_dynamo_op_type" column, while all other cells are gone.
That is what confuses me, and I haven't tried it myself to see whether it really behaves that way (the person who reported it might not have the full picture).
Is there any stage where we can test it end-to-end?
Within the integration test in this PR I can observe that the row is not removed without this additional delete operation.
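As a sketch of the kind of end-to-end check involved, reusing the scanAll/targetAlternator() helpers and the "toDelete" test item that appear later in this thread:

```scala
// Sketch: after replication, the deleted row should be absent from the target table.
val remainingIds = scanAll(targetAlternator(), tableName).map(m => m("id").s)
assert(!remainingIds.contains("toDelete"), s"expected 'toDelete' to be removed, got $remainingIds")
```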
It is possible, since the record event type is not used later and the writable DynamoDB item doesn't seem to carry an op type.
I am curious whether the RDD we already have could be reused for both operations, to avoid creating a new client connection (see the sketch below).
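A minimal sketch of what that reuse could look like, assuming the items still carry the operation type column at this point; putOperation, deleteOperation, and toWritable are hypothetical placeholders:

```scala
// Sketch only: split one cached RDD by operation type instead of building a second pipeline.
val cached = rdd.cache()

val puts    = cached.filter(_.get(operationTypeColumn) == putOperation)    // hypothetical marker value
val deletes = cached.filter(_.get(operationTypeColumn) == deleteOperation) // hypothetical marker value

DynamoDB.writeRDD(target, renamesMap, toWritable(puts), targetTableDesc)(spark) // toWritable: hypothetical
DynamoDB.deleteRDD(target, deletes) // signature assumed from this PR's DynamoDB.scala excerpt

cached.unpersist()
```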
Hi @tarzanek
Yes please; my expectation is that either the Docker container with Scylla Alternator doesn't start (version upgrade?) or the DynamoDB mock service needs a version upgrade.
Pull Request Overview
This PR implements a fix to prevent saving the _dynamo_op_type column during DynamoDB stream replication, addressing issue #237. The change ensures that internal operation metadata is filtered out when writing to the target database.
- Refactored stream processing logic to separate PUT and DELETE operations
- Added filtering to exclude `_dynamo_op_type` from saved data
- Implemented dedicated DELETE operation handling
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| DynamoStreamReplicationIntegrationTest.scala | New integration test verifying operation type column exclusion |
| DynamoStreamReplication.scala | Refactored to filter operation type and handle DELETE operations separately |
| DynamoDB.scala | Added new deleteRDD method for handling DELETE operations |
| .gitignore files | Updated to ignore Docker data directories |
| tests.yml | Added permissions fix for Docker directories |
| ToRemove.yml | Temporary test workflow file |
Resolved (outdated) review threads:
migrator/src/main/scala/com/scylladb/migrator/writers/DynamoStreamReplication.scala
tests/src/test/scala/com/scylladb/migrator/writers/DynamoStreamReplicationIntegrationTest.scala
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
```scala
.collect { case Some(item) => item: util.Map[String, AttributeValueV1] }
.repartition(Runtime.getRuntime.availableProcessors() * 2)

val allOperationsSorted = rdd.collect().sortBy(_.get(sequenceNumberColumn).getS)
```
Copilot AI · Sep 16, 2025
Collecting all operations into the driver memory could cause OOM errors for large datasets. Consider using distributed sorting with sortBy() on the RDD instead of collecting first.
```diff
- val allOperationsSorted = rdd.collect().sortBy(_.get(sequenceNumberColumn).getS)
+ val allOperationsSorted = rdd.sortBy(_.get(sequenceNumberColumn).getS).collect()
```
```scala
      (lastType, lastBatch :+ item) :: rest
    } else {
      (currentType, List(item)) :: (lastType, lastBatch) :: rest
    }
  }
  .reverse
```
Copilot AI · Sep 16, 2025
Using :+ for list append is an O(n) operation. Consider using a mutable buffer or reversing the accumulation strategy to use :: (prepend), which is O(1).
```diff
-       (lastType, lastBatch :+ item) :: rest
-     } else {
-       (currentType, List(item)) :: (lastType, lastBatch) :: rest
-     }
-   }
-   .reverse
+       (lastType, item :: lastBatch) :: rest
+     } else {
+       (currentType, List(item)) :: (lastType, lastBatch) :: rest
+     }
+   }
+   .reverse
+   .map { case (opType, batch) => (opType, batch.reverse) }
```
```scala
if (!keyToDelete.isEmpty) {
  try {
    dynamoDB.deleteItem(
      DeleteItemRequest
        .builder()
        .tableName(target.table)
        .key(keyToDelete)
        .build()
    )
  } catch {
    case e: Exception =>
      log.error(
        s"Failed to delete item with key ${keyToDelete} from table ${target.table}",
        e)
  }
```
Copilot AI · Sep 16, 2025
[nitpick] The isEmpty check is redundant since the forEach loop will naturally skip if keySchema is empty, and the delete operation will fail gracefully if keyToDelete is empty.
```diff
- if (!keyToDelete.isEmpty) {
-   try {
-     dynamoDB.deleteItem(
-       DeleteItemRequest
-         .builder()
-         .tableName(target.table)
-         .key(keyToDelete)
-         .build()
-     )
-   } catch {
-     case e: Exception =>
-       log.error(
-         s"Failed to delete item with key ${keyToDelete} from table ${target.table}",
-         e)
-   }
+ try {
+   dynamoDB.deleteItem(
+     DeleteItemRequest
+       .builder()
+       .tableName(target.table)
+       .key(keyToDelete)
+       .build()
+   )
+ } catch {
+   case e: Exception =>
+     log.error(
+       s"Failed to delete item with key ${keyToDelete} from table ${target.table}",
+       e)
```
@pizzaeueu won't the batching change the order if we parallelize by "batch"?
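For illustration: order is preserved only if the grouped batches are applied sequentially in sequence-number order; distributing the batches themselves would lose the cross-batch ordering guarantee. A sketch, with applyBatch as a hypothetical stand-in for the PUT/DELETE writers:

```scala
// Order-preserving: apply batches one after another on the driver.
batches.foreach { case (opType, batch) => applyBatch(opType, batch) } // applyBatch: hypothetical

// NOT order-preserving: executors may run batches concurrently.
spark.sparkContext.parallelize(batches).foreach { case (opType, batch) => applyBatch(opType, batch) }
```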
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
```scala
val itemWithoutOp = item
  .entrySet()
  .stream()
  .filter(e => e.getKey != operationTypeColumn)
  .collect(
    Collectors.toMap(
      (e: util.Map.Entry[String, AttributeValueV1]) => e.getKey,
      (e: util.Map.Entry[String, AttributeValueV1]) =>
        AttributeValueUtils.fromV1(e.getValue)
    )
  )
```
Copilot AI · Sep 19, 2025
[nitpick] This Java-style stream processing is unnecessarily complex for Scala. Consider using Scala collections for better readability: item.asScala.filter(_._1 != operationTypeColumn).map { case (k, v) => k -> AttributeValueUtils.fromV1(v) }.asJava
```diff
- val itemWithoutOp = item
-   .entrySet()
-   .stream()
-   .filter(e => e.getKey != operationTypeColumn)
-   .collect(
-     Collectors.toMap(
-       (e: util.Map.Entry[String, AttributeValueV1]) => e.getKey,
-       (e: util.Map.Entry[String, AttributeValueV1]) =>
-         AttributeValueUtils.fromV1(e.getValue)
-     )
-   )
+ val itemWithoutOp = item.asScala
+   .filter { case (k, _) => k != operationTypeColumn }
+   .map { case (k, v) => k -> AttributeValueUtils.fromV1(v) }
+   .asJava
```
```scala
DynamoDB.writeRDD(target, renamesMap, writableRdd, targetTableDesc)(spark)
run(
  msgs.asInstanceOf[RDD[Option[util.Map[String, AttributeValueV1]]]],
```
Copilot AI · Sep 19, 2025
Using asInstanceOf for type casting is unsafe and can lead to runtime ClassCastException. The type should already be correct based on the DStream definition - consider removing the cast or adding proper type validation.
```diff
- msgs.asInstanceOf[RDD[Option[util.Map[String, AttributeValueV1]]]],
+ msgs,
```
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
```scala
  tableDesc
)

val finalItems = scanAll(sourceDDb(), tableName).sortBy(m => m("id").s)
```
Copilot AI · Sep 19, 2025
The test is scanning the source DynamoDB table instead of the target table. This should scan targetAlternator() to verify that the operations were correctly applied to the target database.
```diff
- val finalItems = scanAll(sourceDDb(), tableName).sortBy(m => m("id").s)
+ val finalItems = scanAll(targetAlternator(), tableName).sortBy(m => m("id").s)
```
This is actually a valid comment; why is the source validated? @pizzaeueu
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
```scala
val itemWithoutOp = item.asScala.collect {
  case (k, v) if k != operationTypeColumn => k -> AttributeValueUtils.fromV1(v)
}.asJava
```
Copilot AI · Sep 24, 2025
This creates a new collection for each item processing. Consider pre-filtering the operation type column or using a more efficient approach to avoid repeated collection transformations.
```diff
- val itemWithoutOp = item.asScala.collect {
-   case (k, v) if k != operationTypeColumn => k -> AttributeValueUtils.fromV1(v)
- }.asJava
+ val itemWithoutOp = {
+   val m = new java.util.HashMap[String, AttributeValueV2]()
+   val it = item.entrySet().iterator()
+   while (it.hasNext) {
+     val entry = it.next()
+     if (entry.getKey != operationTypeColumn) {
+       m.put(entry.getKey, AttributeValueUtils.fromV1(entry.getValue))
+     }
+   }
+   m
+ }
```
```scala
.tableName(tableName)
.item(
  Map(
    "id" -> AttributeValue.builder.s("toDelete").build,
```
Copilot AI · Sep 24, 2025
Missing parentheses after builder. It should be AttributeValue.builder().s("toDelete").build() to properly chain the method calls.
```scala
.item(
  Map(
    "id" -> AttributeValue.builder.s("toDelete").build,
    "value" -> AttributeValue.builder.s("value1").build
```
Copilot AI · Sep 24, 2025
Missing parentheses after builder. It should be AttributeValue.builder().s("value1").build() to properly chain the method calls.
| "id" -> AttributeValue.builder.s("toUpdate").build, | ||
| "value" -> AttributeValue.builder.s("value2").build |
Copilot AI · Sep 24, 2025
Missing parentheses after builder. It should be AttributeValue.builder().s("toUpdate").build() and AttributeValue.builder().s("value2").build() to properly chain the method calls.
Relates to #237