Commit b9be9fb

Merge pull request #194 from julienrf/add-tutorial
Add tutorial showing how to migrate from DynamoDB
2 parents 0d9d818 + 24eb472 commit b9be9fb

19 files changed: +341 -8

+52
@@ -0,0 +1,52 @@
name: "Tests / Tutorials"
on:
  push:
    branches:
      - master
  pull_request:

env:
  TUTORIAL_DIR: docs/source/tutorials/dynamodb-to-scylladb-alternator

jobs:
  test:
    name: DynamoDB migration
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Cache Docker images
        uses: ScribeMD/[email protected]
        with:
          key: docker-${{ runner.os }}-${{ hashFiles('docker-compose-tests.yml') }}
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: 8
          cache: sbt
      - name: Build migrator
        run: |
          ./build.sh
          mv migrator/target/scala-2.13/scylla-migrator-assembly.jar "$TUTORIAL_DIR/spark-data"
      - name: Set up services
        run: |
          cd $TUTORIAL_DIR
          docker compose up -d
      - name: Wait for the services to be up
        run: |
          .github/wait-for-port.sh 8000 # DynamoDB
          .github/wait-for-port.sh 8001 # ScyllaDB Alternator
          .github/wait-for-port.sh 8080 # Spark master
          .github/wait-for-port.sh 8081 # Spark worker
      - name: Run tutorial
        run: |
          cd $TUTORIAL_DIR
          aws configure set region us-west-1
          aws configure set aws_access_key_id dummy
          aws configure set aws_secret_access_key dummy
          sed -i 's/seq 1 40000/seq 1 40/g' ./create-data.sh
          ./create-data.sh
          . ./run-migrator.sh
      - name: Stop services
        run: |
          cd $TUTORIAL_DIR
          docker compose down
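The workflow above waits for each service with a ``.github/wait-for-port.sh`` helper, whose content is not part of this diff. A minimal sketch of such a helper (assuming ``nc`` is available, and that the helper takes the port number as its first argument; the retry count and delay parameters are our own additions) might look like:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a wait-for-port helper: poll a TCP port on
# localhost until it accepts connections, or fail after a maximum
# number of attempts.
wait_for_port() {
  port="$1"
  max_attempts="${2:-60}"
  attempt=0
  # `nc -z` exits 0 as soon as the port accepts a connection
  while ! nc -z localhost "$port" 2>/dev/null; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Timed out waiting for port $port" >&2
      return 1
    fi
    sleep "${3:-1}"
  done
  echo "Port $port is up"
}
```

The real helper in the repository may differ; this only illustrates the polling idea.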

docs/source/getting-started/ansible.rst
+1 -1

@@ -35,7 +35,7 @@ The Ansible playbook expects to be run in an Ubuntu environment where the direct
 - Ensure networking is configured to allow you access spark master node via TCP ports 8080 and 4040
 - visit ``http://<spark-master-hostname>:8080``

-9. `Review and modify config.yaml <../#configure-the-migration>`_ based whether you're performing a migration to CQL or Alternator
+9. `Review and modify config.yaml <./#configure-the-migration>`_ based whether you're performing a migration to CQL or Alternator

 - If you're migrating to ScyllaDB CQL interface (from Apache Cassandra, ScyllaDB, or other CQL source), make a copy review the comments in ``config.yaml.example``, and edit as directed.
 - If you're migrating to Alternator (from DynamoDB or other ScyllaDB Alternator), make a copy, review the comments in ``config.dynamodb.yml``, and edit as directed.

docs/source/getting-started/aws-emr.rst
+2 -2

@@ -12,7 +12,7 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c
      --output-document=config.yaml

-2. `Configure the migration <../#configure-the-migration>`_ according to your needs.
+2. `Configure the migration <./#configure-the-migration>`_ according to your needs.

 3. Download the latest release of the Migrator.

@@ -67,7 +67,7 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

      spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar

-   See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
+   See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

 - Add a Bootstrap action to download the Migrator and the migration configuration:

docs/source/getting-started/docker.rst
+2 -2

@@ -33,7 +33,7 @@ This page describes how to set up a Spark cluster locally on your machine by usi

      http://localhost:8080

-5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure <../#configure-the-migration>`_ it according to your needs.
+5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure <./#configure-the-migration>`_ it according to your needs.

 6. Finally, run the migration.

@@ -47,7 +47,7 @@ This page describes how to set up a Spark cluster locally on your machine by usi

    The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` dir on ``/jars`` and the repository root on ``/app``.

-   See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
+   See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

 7. You can monitor progress by observing the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``.

docs/source/getting-started/index.rst
+4

@@ -12,6 +12,10 @@ A Spark cluster is made of several *nodes*, which can contain several *workers*

 We recommend provisioning at least 2 GB of memory per CPU on each node. For instance, a cluster node with 4 CPUs should have at least 8 GB of memory.

+.. caution::
+
+   Make sure the Spark version, the Scala version, and the Migrator version you use are `compatible together <../#compatibility-matrix>`_.
+
 The following pages describe various alternative ways to set up a Spark cluster:

 * :doc:`on your infrastructure, using Ansible </getting-started/ansible>`,

docs/source/getting-started/spark-standalone.rst
+2 -2

@@ -21,7 +21,7 @@ This page describes how to set up a Spark cluster on your infrastructure and to
    wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
      --output-document=config.yaml

-4. `Configure the migration <../#configure-the-migration>`_ according to your needs.
+4. `Configure the migration <./#configure-the-migration>`_ according to your needs.

 5. Finally, run the migration as follows from the Spark master node.

@@ -32,6 +32,6 @@ This page describes how to set up a Spark cluster on your infrastructure and to
      --conf spark.scylla.config=<path to config.yaml> \
      <path to scylla-migrator-assembly.jar>

-   See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
+   See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

 6. You can monitor progress from the `Spark web UI <https://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging>`_.

docs/source/index.rst
+2 -1

@@ -9,7 +9,7 @@ The ScyllaDB Migrator is a Spark application that migrates data to ScyllaDB. Its
 * It can rename columns along the way.
 * When migrating from DynamoDB it can transfer a snapshot of the source data, or continuously migrate new data as they come.

-Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster for a migration.
+Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster and to configure your migration. Alternatively, follow our :doc:`step-by-step tutorial to perform a migration between fake databases using Docker </tutorials/dynamodb-to-scylladb-alternator/index>`.

 --------------------
 Compatibility Matrix

@@ -33,3 +33,4 @@ Migrator Spark Scala
    rename-columns
    validate
    configuration
+   tutorials/index
docs/source/tutorials/dynamodb-to-scylladb-alternator/create-25-items.sh
+27

@@ -0,0 +1,27 @@
#!/usr/bin/env sh

generate_25_items() {
  local items=""
  for i in `seq 1 25`; do
    items="${items}"'{
      "PutRequest": {
        "Item": {
          "id": { "S": "'"$(uuidgen)"'" },
          "col1": { "S": "'"$(uuidgen)"'" },
          "col2": { "S": "'"$(uuidgen)"'" },
          "col3": { "S": "'"$(uuidgen)"'" },
          "col4": { "S": "'"$(uuidgen)"'" },
          "col5": { "S": "'"$(uuidgen)"'" }
        }
      }
    },'
  done
  echo "${items%,}" # remove trailing comma
}

aws \
  --endpoint-url http://localhost:8000 \
  dynamodb batch-write-item \
  --request-items '{
    "Example": ['"$(generate_25_items)"']
  }' > /dev/null
docs/source/tutorials/dynamodb-to-scylladb-alternator/create-data.sh
+14

@@ -0,0 +1,14 @@
#!/usr/bin/env sh

# Create table
aws \
  --endpoint-url http://localhost:8000 \
  dynamodb create-table \
  --table-name Example \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

# Add items in parallel
# Change 40000 into 400 below for a faster demo (10,000 items instead of 1,000,000)
seq 1 40000 | xargs --max-procs=8 --max-args=1 ./create-25-items.sh
docs/source/tutorials/dynamodb-to-scylladb-alternator/docker-compose.yaml
+40

@@ -0,0 +1,40 @@
services:

  dynamodb:
    command: "-jar DynamoDBLocal.jar -sharedDb -inMemory"
    image: "amazon/dynamodb-local:2.5.2"
    ports:
      - "8000:8000"
    working_dir: /home/dynamodblocal

  spark-master:
    build: dockerfiles/spark
    command: master
    environment:
      SPARK_PUBLIC_DNS: localhost
    ports:
      - 4040:4040
      - 8080:8080
    volumes:
      - ./spark-data:/app

  spark-worker:
    build: dockerfiles/spark
    command: worker
    environment:
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 4G
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    ports:
      - 8081:8081
    depends_on:
      - spark-master

  scylla:
    image: scylladb/scylla:6.0.1
    expose:
      - 8001
    ports:
      - "8001:8001"
    command: "--smp 1 --memory 2048M --alternator-port 8001 --alternator-write-isolation only_rmw_uses_lwt"
docs/source/tutorials/dynamodb-to-scylladb-alternator/dockerfiles (symbolic link)
+1

@@ -0,0 +1 @@
../../../../dockerfiles
docs/source/tutorials/dynamodb-to-scylladb-alternator/index.rst
+151

@@ -0,0 +1,151 @@
=========================================================
Migrate from DynamoDB to ScyllaDB Alternator Using Docker
=========================================================

In this tutorial, you will replicate 1,000,000 items from a DynamoDB table to ScyllaDB Alternator.

All the scripts and configuration files shown in this tutorial can be found in our `GitHub repository <https://github.com/scylladb/scylla-migrator/tree/master/docs/source/tutorials/dynamodb-to-scylladb-alternator>`_.

The whole system is composed of the DynamoDB service, a Spark cluster with a single worker node, and a ScyllaDB cluster with a single node, as illustrated below:

.. image:: architecture.png
   :alt: Architecture
   :width: 600

To follow this tutorial, you need to install `Docker <https://docker.com>`_ and the `AWS CLI <https://aws.amazon.com/cli/>`_.

----------------------------------------------------
Set Up the Services and Populate the Source Database
----------------------------------------------------

In an empty directory, create the following ``docker-compose.yaml`` file to define all the services:

.. literalinclude:: docker-compose.yaml
   :language: YAML

Let's break down this Docker Compose file.

1. We define the DynamoDB service by reusing the official image ``amazon/dynamodb-local``. We use the TCP port 8000 for communicating with DynamoDB.
2. We define the Spark master and Spark worker services by using a custom image (see below). Indeed, the official Docker images for Spark 3.5.1 only support Scala 2.12 for now, but we need Scala 2.13. We mount the local directory ``./spark-data`` to the Spark master container path ``/app`` so that we can supply the Migrator jar and configuration to the Spark master node. We expose the ports 8080 and 4040 of the master node to access the Spark UIs from our host environment. We allocate 2 cores and 4 GB of memory to the Spark worker node. As a general rule, we recommend allocating 2 GB of memory per core on each worker.
3. We define the ScyllaDB service by reusing the official image ``scylladb/scylla``. We use the TCP port 8001 for communicating with ScyllaDB Alternator.

Create the ``Dockerfile`` required by the Spark services at the path ``./dockerfiles/spark/Dockerfile``, with the following content:

.. literalinclude:: dockerfiles/spark/Dockerfile
   :language: Dockerfile

This ``Dockerfile`` installs Java and a Spark distribution. It uses a custom shell script as its entry point. Create the file ``./dockerfiles/spark/entrypoint.sh`` with the following content:

.. literalinclude:: dockerfiles/spark/entrypoint.sh
   :language: sh

The entry point takes an argument that can be either ``master`` or ``worker`` to control whether to start a master node or a worker node.

Prepare your system for building the Spark Docker image with the following commands, which create the ``spark-data`` directory and make the entry point executable:

.. code-block:: sh

  mkdir spark-data
  chmod +x entrypoint.sh

Finally, start all the services with the following command:

.. code-block:: sh

  docker compose up

Your system's Docker daemon will download the DynamoDB and ScyllaDB images and build your Spark Docker image.

Check that you can access the Spark cluster UI by opening http://localhost:8080 in your browser. You should see your worker node in the workers list.

.. image:: spark-cluster.png
   :alt: Spark UI listing the worker node
   :width: 883

Once all the services are up, you can access your local DynamoDB instance and your local ScyllaDB instance with the standard AWS CLI. Make sure to configure the AWS CLI as follows before running the ``dynamodb`` commands:

.. code-block:: sh

  # Set dummy region and credentials
  aws configure set region us-west-1
  aws configure set aws_access_key_id dummy
  aws configure set aws_secret_access_key dummy
  # Access DynamoDB
  aws --endpoint-url http://localhost:8000 dynamodb list-tables
  # Access ScyllaDB Alternator
  aws --endpoint-url http://localhost:8001 dynamodb list-tables

The last preparatory step consists of creating a table in DynamoDB and filling it with random data. Create a file named ``create-data.sh``, make it executable, and write the following content into it:

.. literalinclude:: create-data.sh
   :language: sh

This script creates a table named ``Example`` and adds 1 million items to it. It does so by invoking another script, ``create-25-items.sh``, which uses the ``batch-write-item`` command to insert 25 items in a single call:

.. literalinclude:: create-25-items.sh
   :language: sh

Every added item contains an id and five columns, all filled with random data.
Run the script ``./create-data.sh`` and wait for a couple of hours until all the data is inserted (or change the last line of ``create-data.sh`` to insert fewer items).
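The project's CI workflow shortens this step by rewriting the ``seq`` invocation in ``create-data.sh`` before running it, and you can do the same locally (guarded here so the command is a no-op when the script is absent):

```shell
# Shrink the dataset: 40 batches of 25 items (1,000 items) instead of
# 40,000 batches (1,000,000 items), as the CI workflow does.
if [ -f create-data.sh ]; then
  sed -i 's/seq 1 40000/seq 1 40/g' create-data.sh
fi
```

With 1,000 items the whole run takes seconds instead of hours, at the cost of a less realistic demo.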
---------------------
Perform the Migration
---------------------

Once you have set up the services and populated the source database, you are ready to perform the migration.

Download the latest stable release of the Migrator into the ``spark-data`` directory:

.. code-block:: sh

  wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
    --directory-prefix=./spark-data

Create a configuration file at ``./spark-data/config.yaml`` with the following content:

.. literalinclude:: spark-data/config.yaml
   :language: YAML

This configuration tells the Migrator to read the items from the table ``Example`` in the ``dynamodb`` service, and to write them to the table of the same name in the ``scylla`` service.

Finally, start the migration with the following command:

.. literalinclude:: run-migrator.sh
   :language: sh

This command calls ``spark-submit`` in the ``spark-master`` service with the file ``scylla-migrator-assembly.jar``, which bundles the Migrator and all its dependencies.

In the ``spark-submit`` command invocation, we explicitly tell Spark to use 4 GB of memory; otherwise, it would default to 1 GB only. We also explicitly tell Spark to use 2 cores. This is not strictly necessary, as the default behavior is to use all the available cores, but we set it for the sake of illustration. If the Spark worker node had 20 cores, it would be better to use only 10 cores per executor to optimize the throughput (big executors require more memory management operations, which decrease the overall application performance). We would achieve this by passing ``--executor-cores 10``, and the Spark engine would allocate two executors for our application to fully utilize the resources of the worker node.
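The 2 GB of memory per core rule of thumb given earlier can be turned into a quick sizing calculation before editing the ``spark-submit`` flags; the snippet and its variable names are ours, not part of the tutorial:

```shell
# Illustrative sizing helper: derive --executor-memory from
# --executor-cores using the tutorial's 2 GB per core rule of thumb.
executor_cores=10                       # e.g. half of a 20-core worker
executor_memory="$((2 * executor_cores))G"
echo "--executor-cores $executor_cores --executor-memory $executor_memory"
```

For a 10-core executor this prints ``--executor-cores 10 --executor-memory 20G``, which you would substitute into ``run-migrator.sh``.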
The migration process inspects the source table, replicates its schema to the target database if it does not already exist, and then migrates the data. The data migration uses the Hadoop framework under the hood to leverage the Spark cluster resources. The migration process breaks down the data to transfer into chunks of about 128 MB each, and processes all the partitions in parallel. Since the source is a DynamoDB table in our example, each partition translates into a `scan segment <https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan>`_ to maximize the parallelism level when reading the data. Here is a diagram that illustrates the migration process:

.. image:: process.png
   :alt: Migration process
   :width: 700

During the execution of the command, a lot of logs are printed, mostly related to Spark scheduling. Still, you should be able to spot the following relevant lines:

.. code-block:: text

  24/07/22 15:46:13 INFO migrator: ScyllaDB Migrator 0.9.2

  24/07/22 15:46:20 INFO alternator: We need to transfer: 2 partitions in total
  24/07/22 15:46:20 INFO alternator: Starting write…
  24/07/22 15:46:20 INFO DynamoUtils: Checking for table existence at destination

And when the migration ends, you will see the following line printed:

.. code-block:: text

  24/07/22 15:46:24 INFO alternator: Done transferring table snapshot

During the migration, it is possible to monitor the underlying Spark job by opening the Spark UI available at http://localhost:4040.

.. image:: stages.png
   :alt: Spark stages
   :width: 900

`Example of a migration broken down into 6 tasks. The Spark UI allows us to follow the overall progress, and it can also show specific metrics such as the memory consumption of an executor`.

In our example, the size of the source table is ~200 MB. In practice, it is common to migrate tables containing several terabytes of data. If necessary, and as long as your DynamoDB source supports a higher read throughput level, you can increase the migration throughput by adding more Spark worker nodes. The Spark engine will automatically spread the workload between all the worker nodes.
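In this local demo, the analogue of adding a worker node is adding a worker service. Note that ``docker compose up --scale spark-worker=2`` would fail with the Compose file used here, because the worker maps the fixed host port 8081; a second worker needs its own service entry instead. A hypothetical sketch (the service name and port 8082 are illustrative, not part of the tutorial):

```yaml
# Hypothetical extra worker service for docker-compose.yaml; each worker
# needs a distinct host port for its web UI.
  spark-worker-2:
    build: dockerfiles/spark
    command: worker
    environment:
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 4G
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    ports:
      - 8082:8082
    depends_on:
      - spark-master
```

On a single machine this only helps if spare cores and memory are available; real throughput gains come from additional hosts.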
151+
Loading
docs/source/tutorials/dynamodb-to-scylladb-alternator/run-migrator.sh
+9

@@ -0,0 +1,9 @@
docker compose exec spark-master \
  /spark/bin/spark-submit \
    --executor-memory 4G \
    --executor-cores 2 \
    --class com.scylladb.migrator.Migrator \
    --master spark://spark-master:7077 \
    --conf spark.driver.host=spark-master \
    --conf spark.scylla.config=/app/config.yaml \
    /app/scylla-migrator-assembly.jar