Commit 6e48e07

[dagster-aws] [docs] add docs for PipesEMRServerlessClient

1 parent dc7262f


File tree: 10 files changed, +386 -0 lines changed


docs/content/api/modules.json.gz (2.5 KB): Binary file not shown.

docs/content/api/searchindex.json.gz (24 Bytes): Binary file not shown.

docs/content/api/sections.json.gz (789 Bytes): Binary file not shown.

@@ -0,0 +1,169 @@
---
title: "Integrating AWS EMR Serverless with Dagster Pipes | Dagster Docs"
description: "Learn to integrate Dagster Pipes with AWS EMR Serverless to launch external code from Dagster assets."
---

# AWS EMR Serverless & Dagster Pipes

This tutorial gives a short overview of how to use [Dagster Pipes](/concepts/dagster-pipes) with [AWS EMR Serverless](https://aws.amazon.com/emr-serverless/).

The [dagster-aws](/\_apidocs/libraries/dagster-aws) integration library provides the <PyObject object="PipesEMRServerlessClient" module="dagster_aws.pipes" /> resource, which can be used to launch AWS EMR Serverless jobs from Dagster assets and ops. Dagster can receive events such as logs, asset checks, and asset materializations from jobs launched with this client. Using it requires minimal code changes to your EMR jobs.

---

## Prerequisites

- **In the orchestration environment**, you'll need to:

  - Install the following packages:

    ```shell
    pip install dagster dagster-webserver dagster-aws
    ```

    Refer to the [Dagster installation guide](/getting-started/install) for more info.

- **AWS authentication credentials configured.** If you don't have this set up already, refer to the [boto3 quickstart](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html). A quick way to verify that credentials are picked up is shown in the snippet after this list.

- **In AWS**:

  - An existing AWS account
  - An AWS EMR Serverless job. AWS CloudWatch logging has to be enabled in order to receive logs from the job:

    ```json
    {
      "monitoringConfiguration": {
        "cloudWatchLoggingConfiguration": { "enabled": true }
      }
    }
    ```
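
To verify credentials, a minimal boto3 sketch such as the following can be run from the orchestration environment (the STS call is read-only and cheap):

```python
import boto3

# STS GetCallerIdentity succeeds only if boto3 can locate valid
# credentials, making it a quick sanity check
sts = boto3.client("sts")
print(sts.get_caller_identity()["Account"])
```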

---

## Step 1: Install the dagster-pipes module

There are a [few options](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html) available for shipping Python packages to a PySpark job. For example, you can [build a custom Docker image](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html) for your EMR job and install the `dagster-pipes` module in it with `pip` in the image's Dockerfile:

```Dockerfile
# start from EMR image
FROM public.ecr.aws/emr-serverless/spark/emr-7.2.0:latest

USER root

RUN python -m pip install dagster-pipes

# copy the job script
COPY . .

USER hadoop
```
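
After the image is built and pushed to ECR, it has to be attached to an EMR Serverless application. A minimal boto3 sketch, assuming the image has already been pushed (the application name and ECR URI are placeholders):

```python
import boto3

emr_serverless = boto3.client("emr-serverless")

# create an application backed by the custom image;
# the release label should match the image's base EMR release
app = emr_serverless.create_application(
    name="dagster-pipes-example",  # hypothetical name
    releaseLabel="emr-7.2.0",
    type="SPARK",
    imageConfiguration={
        "imageUri": "<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>"
    },
)
print(app["applicationId"])
```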

---

## Step 2: Add dagster-pipes to the EMR Serverless job script

Call `open_dagster_pipes` in the EMR Serverless script to create a context that can be used to send messages to Dagster:

```python file=/guides/dagster/dagster_pipes/emr-serverless/script.py
from dagster_pipes import open_dagster_pipes
from pyspark.sql import SparkSession


def main():
    with open_dagster_pipes() as pipes:
        pipes.log.info("Hello from AWS EMR Serverless!")

        spark = SparkSession.builder.appName("HelloWorld").getOrCreate()

        df = spark.createDataFrame(
            [(1, "Alice", 34), (2, "Bob", 45), (3, "Charlie", 56)],
            ["id", "name", "age"],
        )

        # calculate a really important statistic
        avg_age = float(df.agg({"age": "avg"}).collect()[0][0])

        # attach it to the asset materialization in Dagster
        pipes.report_asset_materialization(
            metadata={"average_age": {"raw_value": avg_age, "type": "float"}},
            data_version="alpha",
        )

        spark.stop()


if __name__ == "__main__":
    main()
```
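
Since the intro mentioned asset checks, note that the same Pipes context can report those as well. A short sketch, assuming the corresponding Dagster asset defines a check named `age_is_positive` (a hypothetical name):

```python
# inside the `with open_dagster_pipes() as pipes:` block
pipes.report_asset_check(
    check_name="age_is_positive",  # hypothetical check name
    passed=bool(avg_age > 0),
)
```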

---

## Step 3: Create an asset using the PipesEMRServerlessClient to launch the job

In the Dagster asset/op code, use the `PipesEMRServerlessClient` resource to launch the job:

```python file=/guides/dagster/dagster_pipes/emr-serverless/dagster_code.py startafter=start_asset_marker endbefore=end_asset_marker
import os

import boto3
from dagster_aws.pipes import PipesEMRServerlessClient

from dagster import AssetExecutionContext, asset


@asset
def emr_serverless_asset(
    context: AssetExecutionContext,
    pipes_emr_serverless_client: PipesEMRServerlessClient,
):
    return pipes_emr_serverless_client.run(
        context=context,
        start_job_run_params={
            "applicationId": "<app-id>",
            "executionRoleArn": "<emr-role>",
            "clientToken": context.run_id,  # idempotency identifier for the job run
            "configurationOverrides": {
                "monitoringConfiguration": {
                    "cloudWatchLoggingConfiguration": {"enabled": True}
                }
            },
        },
    ).get_results()
```

This will launch the AWS EMR Serverless job and wait for its completion. If the job fails, the Dagster process will raise an exception. If the Dagster process is interrupted while the job is still running, the job will be terminated.
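
The `start_job_run_params` dictionary is passed through to the boto3 EMR Serverless `start_job_run` call, so any of its fields can be supplied. For example, a `jobDriver` pointing at the script from Step 2 might look like the following sketch (the S3 path is a placeholder):

```python
start_job_run_params = {
    "applicationId": "<app-id>",
    "executionRoleArn": "<emr-role>",
    # jobDriver tells EMR Serverless which script to execute
    "jobDriver": {
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket>/script.py",  # placeholder path
        }
    },
}
```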

---

## Step 4: Create Dagster definitions

Next, add the `PipesEMRServerlessClient` resource to your project's <PyObject object="Definitions" /> object:

```python file=/guides/dagster/dagster_pipes/emr-serverless/dagster_code.py startafter=start_definitions_marker endbefore=end_definitions_marker
from dagster import Definitions  # noqa


defs = Definitions(
    assets=[emr_serverless_asset],
    resources={"pipes_emr_serverless_client": PipesEMRServerlessClient()},
)
```

Dagster will now be able to launch the AWS EMR Serverless job from the `emr_serverless_asset` asset and receive logs and events from it. With the default `message_reader`, `PipesCloudwatchLogReader`, driver logs will be forwarded to the Dagster process.
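
To try things end to end outside of the webserver, one option is Dagster's `materialize` helper. A sketch, assuming the placeholders in Step 3 have been filled in with a real application ID and role:

```python
from dagster import materialize
from dagster_aws.pipes import PipesEMRServerlessClient

# runs the asset in-process, launching the EMR Serverless job
# and blocking until it completes
result = materialize(
    [emr_serverless_asset],
    resources={"pipes_emr_serverless_client": PipesEMRServerlessClient()},
)
assert result.success
```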

---

## Related

<ArticleList>
  <ArticleListItem
    title="Dagster Pipes"
    href="/concepts/dagster-pipes"
  ></ArticleListItem>
  <ArticleListItem
    title="AWS EMR Serverless Pipes API reference"
    href="/_apidocs/libraries/dagster-aws#dagster_aws.pipes.PipesEMRServerlessClient"
  ></ArticleListItem>
</ArticleList>

docs/next/public/objects.inv (6 Bytes): Binary file not shown.

@@ -0,0 +1,19 @@
FROM public.ecr.aws/emr-serverless/spark/emr-7.2.0

USER root

COPY --from=ghcr.io/astral-sh/uv:0.4.7 /uv /bin/uv

ENV UV_SYSTEM_PYTHON=1 \
    UV_BREAK_SYSTEM_PACKAGES=true \
    UV_COMPILE_BYTECODE=1 \
    UV_PYTHON=/usr/bin/python

WORKDIR /app

COPY python_modules/dagster-pipes ./dagster-pipes

RUN uv pip install ./dagster-pipes

# EMR Serverless will run the image as hadoop
USER hadoop:hadoop

@@ -0,0 +1,42 @@
# start_asset_marker
import os

import boto3
from dagster_aws.pipes import PipesEMRServerlessClient

from dagster import AssetExecutionContext, asset


@asset
def emr_serverless_asset(
    context: AssetExecutionContext,
    pipes_emr_serverless_client: PipesEMRServerlessClient,
):
    return pipes_emr_serverless_client.run(
        context=context,
        start_job_run_params={
            "applicationId": "<app-id>",
            "executionRoleArn": "<emr-role>",
            "clientToken": context.run_id,  # idempotency identifier for the job run
            "configurationOverrides": {
                "monitoringConfiguration": {
                    "cloudWatchLoggingConfiguration": {"enabled": True}
                }
            },
        },
    ).get_results()


# end_asset_marker

# start_definitions_marker

from dagster import Definitions  # noqa


defs = Definitions(
    assets=[emr_serverless_asset],
    resources={"pipes_emr_serverless_client": PipesEMRServerlessClient()},
)

# end_definitions_marker

@@ -0,0 +1,62 @@
import os
import sys

import boto3
import pyspark.sql.functions as F
from dagster_pipes import (
    PipesCliArgsParamsLoader,
    PipesS3ContextLoader,
    open_dagster_pipes,
)
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    output_path = None

    if len(sys.argv) > 1:
        output_path = sys.argv[1]
    else:
        print(  # noqa
            "S3 output location not specified, printing top 10 results to output stream"
        )

    region = os.getenv("AWS_REGION")
    text_file = spark.sparkContext.textFile(
        "s3://" + region + ".elasticmapreduce/emr-containers/samples/wordcount/input"
    )
    counts = (
        text_file.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
        .sortBy(lambda x: x[1], False)
    )
    counts_df = counts.toDF(["word", "count"])

    if output_path:
        counts_df.write.mode("overwrite").csv(output_path)
        print(  # noqa
            "WordCount job completed successfully. Refer to output at S3 path: "
            + output_path
        )
    else:
        counts_df.show(10, False)
        print("WordCount job completed successfully.")  # noqa

    spark.stop()


if __name__ == "__main__":
    """
    Usage: wordcount [destination path]
    """

    with open_dagster_pipes() as pipes:
        pipes.log.info("Hello from AWS EMR Serverless job!")
        pipes.report_asset_materialization(
            metadata={"some_metric": {"raw_value": 0, "type": "int"}},
            data_version="alpha",
        )
        main()

@@ -0,0 +1,29 @@
from dagster_pipes import open_dagster_pipes
from pyspark.sql import SparkSession


def main():
    with open_dagster_pipes() as pipes:
        pipes.log.info("Hello from AWS EMR Serverless!")

        spark = SparkSession.builder.appName("HelloWorld").getOrCreate()

        df = spark.createDataFrame(
            [(1, "Alice", 34), (2, "Bob", 45), (3, "Charlie", 56)],
            ["id", "name", "age"],
        )

        # calculate a really important statistic
        avg_age = float(df.agg({"age": "avg"}).collect()[0][0])

        # attach it to the asset materialization in Dagster
        pipes.report_asset_materialization(
            metadata={"average_age": {"raw_value": avg_age, "type": "float"}},
            data_version="alpha",
        )

        spark.stop()


if __name__ == "__main__":
    main()

@@ -0,0 +1,65 @@
# this script can be used to pack and upload a python virtualenv to an s3 bucket
# requires `uv` and `tar`

import argparse
import os
import subprocess
import tempfile
from pathlib import Path

SCRIPT_DIR = Path(__file__).parent
REQUIREMENTS_TXT = SCRIPT_DIR / "requirements.txt"
DAGSTER_DIR = Path(*SCRIPT_DIR.parts[: SCRIPT_DIR.parts.index("examples")])

DAGSTER_PIPES_DIR = DAGSTER_DIR / "python_modules/dagster-pipes"

parser = argparse.ArgumentParser(description="Upload a python virtualenv to an s3 path")
parser.add_argument(
    "--python", type=str, help="python version to use", default="3.10.8"
)
parser.add_argument(
    "--requirements",
    type=str,
    help="path to the requirements.txt file",
    default=str(REQUIREMENTS_TXT),
)
parser.add_argument("--s3-path", type=str, help="s3 path to copy to", required=True)


def main():
    args = parser.parse_args()

    with tempfile.TemporaryDirectory() as temp_dir:
        os.chdir(temp_dir)
        subprocess.run(
            f"uv python install --python-preference only-managed {args.python}",
            shell=True,
            check=True,
        )
        subprocess.run(
            f"uv venv --seed --relocatable --python-preference only-managed --python {args.python}",
            shell=True,
            check=True,
        )
        # setting VIRTUAL_ENV is enough to point uv at the new venv;
        # "sourcing" the activate script in a subprocess would have no
        # effect on subsequent subprocess calls
        os.environ["VIRTUAL_ENV"] = str(Path(temp_dir) / ".venv")
        subprocess.run(
            f"uv pip install --link-mode clone {DAGSTER_PIPES_DIR}",
            shell=True,
            check=True,
        )
        subprocess.run(
            "tar -czf pyspark_venv.tar.gz -C .venv .",
            shell=True,
            check=True,
        )
        subprocess.run(
            f"aws s3 cp {temp_dir}/pyspark_venv.tar.gz {args.s3_path}",
            shell=True,
            check=True,
        )


if __name__ == "__main__":
    main()
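
The archive produced by this script is consumed by EMR Serverless through Spark's archive mechanism. A sketch of the relevant `start_job_run` parameters, following the `spark.archives` pattern from the AWS docs (bucket and paths are placeholders):

```python
venv_s3_path = "s3://<your-bucket>/pyspark_venv.tar.gz"  # placeholder

start_job_run_params = {
    "applicationId": "<app-id>",
    "executionRoleArn": "<emr-role>",
    "jobDriver": {
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket>/script.py",  # placeholder
            # unpack the archive as ./environment and point PySpark at
            # the Python interpreter inside it
            "sparkSubmitParameters": (
                f"--conf spark.archives={venv_s3_path}#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            ),
        }
    },
}
```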
