
Commit 72d469a

danielgafni authored and alangenfeld committed
[docs] add docs for PipesEMRContainersClient (#27159)
## Summary & Motivation

This PR adds docs for the new PipesEMRContainersClient. It sounds confusing, but AWS EMR Containers and AWS EMR on EKS are actually the same thing. The former is the name of the AWS API, and the latter is the service name (sounds like the same thing, huh?), which is used more in human language than in automation. For example, the `boto3` client is called `emr-containers`. I tried to preserve this differentiation in these docs.

## How I Tested These Changes

The snippets were tested with a real EMR on EKS cluster.
1 parent 7449e78 commit 72d469a

File tree

14 files changed, +240 / -55 lines changed


docs/content/_navigation.json

+4
```diff
@@ -398,6 +398,10 @@
       "title": "Dagster Pipes + AWS EMR",
       "path": "/concepts/dagster-pipes/aws-emr"
     },
+    {
+      "title": "Dagster Pipes + AWS EMR on EKS",
+      "path": "/concepts/dagster-pipes/aws-emr-containers"
+    },
     {
       "title": "Dagster Pipes + AWS EMR Serverless",
       "path": "/concepts/dagster-pipes/aws-emr-serverless"
```

docs/content/api/modules.json.gz

2.55 KB
Binary file not shown.

docs/content/api/searchindex.json.gz

45 Bytes
Binary file not shown.

docs/content/api/sections.json.gz

120 Bytes
Binary file not shown.

docs/content/concepts.mdx

+4
```diff
@@ -236,6 +236,10 @@ Dagster Pipes is a toolkit for building integrations between Dagster and externa
     title="Dagster Pipes + AWS EMR"
     href="/concepts/dagster-pipes/aws-emr"
   ></ArticleListItem>
+  <ArticleListItem
+    title="Dagster Pipes + AWS EMR on EKS"
+    href="/concepts/dagster-pipes/aws-emr-containers"
+  ></ArticleListItem>
   <ArticleListItem
     title="Dagster Pipes + AWS EMR Serverless"
     href="/concepts/dagster-pipes/aws-emr-serverless"
```

docs/content/concepts/dagster-pipes.mdx

+2 / -1
```diff
@@ -79,7 +79,8 @@ Ready to get started with Dagster Pipes? Depending on what your goal is, how you
 - [AWS ECS](/concepts/dagster-pipes/aws-ecs)
 - [AWS Lambda](/concepts/dagster-pipes/aws-lambda)
 - [AWS Glue](/concepts/dagster-pipes/aws-glue)
-- [AWS EMR Serverless](/concepts/dagster-pipes/aws-emr-serverless)
 - [AWS EMR](/concepts/dagster-pipes/aws-emr)
+- [AWS EMR on EKS](/concepts/dagster-pipes/aws-emr-containers)
+- [AWS EMR Serverless](/concepts/dagster-pipes/aws-emr-serverless)

 - **If you don’t see your integration or you want to fully customize your Pipes experience**, check out the [Dagster Pipes details and customization guide](/concepts/dagster-pipes/dagster-pipes-details-and-customization) to learn how to create a custom experience.
```
docs/content/concepts/dagster-pipes/aws-emr-containers.mdx

+199 (new file)

---
title: "Integrating AWS EMR on EKS with Dagster Pipes | Dagster Docs"
description: "Learn to integrate Dagster Pipes with AWS EMR Containers to launch external code from Dagster assets."
---

# AWS EMR on EKS & Dagster Pipes

This tutorial gives a short overview of how to use [Dagster Pipes](/concepts/dagster-pipes) with [AWS EMR on EKS](https://aws.amazon.com/emr/features/eks/) (the corresponding AWS API is called `emr-containers`).
The [dagster-aws](/\_apidocs/libraries/dagster-aws) integration library provides the <PyObject object="PipesEMRContainersClient" module="dagster_aws.pipes" /> resource, which can be used to launch EMR jobs from Dagster assets and ops. Dagster can receive regular events such as logs, asset checks, or asset materializations from jobs launched with this client. Using it requires minimal code changes to your EMR jobs.

---

## Prerequisites

- **In the Dagster environment**, you'll need to:

  - Install the following packages:

    ```shell
    pip install dagster dagster-webserver dagster-aws
    ```

    Refer to the [Dagster installation guide](/getting-started/install) for more info.

  - **AWS authentication credentials configured.** If you don't have this set up already, refer to the [boto3 quickstart](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html).

- **In AWS**:

  - An existing AWS account
  - An [EMR Virtual Cluster](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/virtual-cluster.html) set up
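
Before wiring up Dagster, it can help to confirm that your credentials and virtual cluster are visible to `boto3`. A minimal sketch (the `RUNNING` state filter is just one sensible choice):

```python
import boto3

# The boto3 client for EMR on EKS is named "emr-containers".
client = boto3.client("emr-containers")

# List virtual clusters to confirm credentials and cluster visibility.
response = client.list_virtual_clusters(states=["RUNNING"])
for cluster in response["virtualClusters"]:
    print(cluster["id"], cluster["name"])
```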
---
## Step 1: Install the dagster-pipes module in your EMR environment

There are [a few options](https://aws.github.io/aws-emr-containers-best-practices/submit-applications/docs/spark/pyspark/#python-code-with-python-dependencies) for deploying Python code & dependencies for PySpark jobs. In this tutorial, we are going to build a custom Docker image for this purpose.

Install the `dagster-pipes`, `dagster-aws`, and `boto3` Python packages in your image:

```Dockerfile file=/guides/dagster/dagster_pipes/emr-containers/Dockerfile
# start from EMR image
FROM public.ecr.aws/emr-containers/spark/emr-7.2.0:latest

USER root

RUN python -m pip install dagster-pipes

# copy the job script
COPY . .

USER hadoop
```

<Note>
  It's also recommended to upgrade the default Python version included in the
  base EMR image (as is done in the `Dockerfile` above).
</Note>

The last `COPY` step copies the EMR job script (`script.py`) into the image.

---

## Step 2: Invoke dagster-pipes in the EMR job script

Call `open_dagster_pipes` in the EMR script to create a context that can be used to send messages to Dagster:

```python file=/guides/dagster/dagster_pipes/emr-containers/script.py
import sys

import boto3
from dagster_pipes import PipesS3MessageWriter, open_dagster_pipes
from pyspark.sql import SparkSession


def main():
    s3_client = boto3.client("s3")
    with open_dagster_pipes(
        message_writer=PipesS3MessageWriter(client=s3_client),
    ) as pipes:
        pipes.log.info("Hello from AWS EMR Containers!")

        spark = SparkSession.builder.appName("HelloWorld").getOrCreate()

        df = spark.createDataFrame(
            [(1, "Alice", 34), (2, "Bob", 45), (3, "Charlie", 56)],
            ["id", "name", "age"],
        )

        # calculate a really important statistic
        avg_age = float(df.agg({"age": "avg"}).collect()[0][0])

        # attach it to the asset materialization in Dagster
        pipes.report_asset_materialization(
            metadata={"average_age": {"raw_value": avg_age, "type": "float"}},
            data_version="alpha",
        )

        print("Hello from stdout!")
        print("Hello from stderr!", file=sys.stderr)


if __name__ == "__main__":
    main()
```

<Note>
  It's best to use the `PipesS3MessageWriter` with EMR on EKS, because this
  message writer can capture the Spark driver logs and send them to Dagster.
</Note>
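
The same `pipes` context can also report asset checks back to Dagster. As a hedged sketch (the check name and bounds are made up, not part of the tested example), something like the following could be added inside the `with open_dagster_pipes(...)` block after `avg_age` is computed. Note that the corresponding check has to be declared on the Dagster side (for example, via `check_specs` on the asset) for it to show up:

```python
# Illustrative asset check: verify the computed statistic is plausible.
# The check name and bounds are hypothetical.
pipes.report_asset_check(
    check_name="average_age_in_range",
    passed=bool(0 < avg_age < 120),
    metadata={"average_age": avg_age},
)
```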

---

## Step 3: Create an asset using the PipesEMRContainersClient to launch the job

In the Dagster asset/op code, use the `PipesEMRContainersClient` resource to launch the job:
```python file=/guides/dagster/dagster_pipes/emr-containers/dagster_code.py startafter=start_asset_marker endbefore=end_asset_marker
from dagster_aws.pipes import PipesEMRContainersClient

import dagster as dg


@dg.asset
def emr_containers_asset(
    context: dg.AssetExecutionContext,
    pipes_emr_containers_client: PipesEMRContainersClient,
):
    image = (
        ...
    )  # it's likely the image can be taken from context.run_tags["dagster/image"]

    return pipes_emr_containers_client.run(
        context=context,
        start_job_run_params={
            "releaseLabel": "emr-7.5.0-latest",
            "virtualClusterId": ...,
            "clientToken": context.run_id,  # idempotency identifier for the job run
            "executionRoleArn": ...,
            "jobDriver": {
                "sparkSubmitJobDriver": {
                    "entryPoint": "local:///app/script.py",
                    "sparkSubmitParameters": f"--conf spark.kubernetes.container.image={image}",
                }
            },
        },
    ).get_materialize_result()
```

<Note>
  Setting `include_stdio_in_messages` to `True` in the `PipesS3MessageReader`
  will allow the driver logs to be forwarded to the Dagster process.
</Note>

Materializing this asset will launch the AWS EMR on EKS job and wait for it to complete. If the job fails, the Dagster process will raise an exception. If the Dagster process is interrupted while the job is still running, the job will be terminated.
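
Job runs can also be parameterized from the Dagster side via Pipes "extras". A hedged sketch, assuming `PipesEMRContainersClient.run` forwards an `extras` mapping to the external process the way other Pipes clients do (the `sample_rate` key is made up for illustration):

```python
import dagster as dg
from dagster_aws.pipes import PipesEMRContainersClient


@dg.asset
def parameterized_emr_containers_asset(
    context: dg.AssetExecutionContext,
    pipes_emr_containers_client: PipesEMRContainersClient,
):
    return pipes_emr_containers_client.run(
        context=context,
        start_job_run_params=...,  # same parameters as in the example above
        extras={"sample_rate": 0.1},  # illustrative run-time parameter
    ).get_materialize_result()


# In script.py, the value could then be read from the Pipes context:
# sample_rate = pipes.get_extra("sample_rate")
```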

---
## Step 4: Create Dagster definitions

Next, add the `PipesEMRContainersClient` resource to your project's <PyObject object="Definitions" /> object:
```python file=/guides/dagster/dagster_pipes/emr-containers/dagster_code.py startafter=start_definitions_marker endbefore=end_definitions_marker
import boto3
from dagster_aws.pipes import PipesS3ContextInjector, PipesS3MessageReader

from dagster import Definitions

defs = Definitions(
    assets=[emr_containers_asset],
    resources={
        "pipes_emr_containers_client": PipesEMRContainersClient(
            message_reader=PipesS3MessageReader(
                client=boto3.client("s3"),
                bucket=...,
                include_stdio_in_messages=True,
            ),
        )
    },
)
```

Dagster will now be able to launch the AWS EMR Containers job from the `emr_containers_asset` asset, and receive logs and events from the job. If `include_stdio_in_messages` is set to `True`, the logs will be forwarded to the Dagster process.
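
To smoke-test the integration end to end, the asset can also be materialized in-process. A minimal sketch, assuming valid AWS credentials, a real virtual cluster, and an S3 bucket you own (the bucket name below is illustrative):

```python
import boto3
import dagster as dg
from dagster_aws.pipes import PipesEMRContainersClient, PipesS3MessageReader

if __name__ == "__main__":
    # Materialize the asset in-process; this blocks until the EMR job finishes.
    result = dg.materialize(
        [emr_containers_asset],
        resources={
            "pipes_emr_containers_client": PipesEMRContainersClient(
                message_reader=PipesS3MessageReader(
                    client=boto3.client("s3"),
                    bucket="my-pipes-messages-bucket",  # illustrative name
                    include_stdio_in_messages=True,
                ),
            )
        },
    )
    assert result.success
```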

---

## Related

<ArticleList>
  <ArticleListItem
    title="Dagster Pipes"
    href="/concepts/dagster-pipes"
  ></ArticleListItem>
  <ArticleListItem
    title="AWS EMR Containers API reference"
    href="/_apidocs/libraries/dagster-aws#dagster_aws.pipes.PipesEMRContainersClient"
  ></ArticleListItem>
</ArticleList>

docs/content/integrations/spark.mdx

+2 / -1
```diff
@@ -19,8 +19,9 @@ You can either use one of the available Pipes Clients or make your own. The avai

 - [Databricks](/concepts/dagster-pipes/databricks)
 - [AWS Glue](/concepts/dagster-pipes/aws-glue)
-- [AWS EMR Serverless](/concepts/dagster-pipes/aws-emr-serverless)
 - [AWS EMR](/concepts/dagster-pipes/aws-emr)
+- [AWS EMR on EKS](/concepts/dagster-pipes/aws-emr-containers)
+- [AWS EMR Serverless](/concepts/dagster-pipes/aws-emr-serverless)

 Existing Spark jobs can be used with Pipes without any modifications. In this case, Dagster will be receiving logs from the job, but not events like asset checks or attached metadata.
```
docs/next/public/objects.inv

27 Bytes
Binary file not shown.

docs/sphinx/sections/api/apidocs/libraries/dagster-aws.rst

+2
```diff
@@ -127,6 +127,8 @@ Clients

 .. autoclass:: dagster_aws.pipes.PipesEMRClient

+.. autoclass:: dagster_aws.pipes.PipesEMRContainersClient
+
 .. autoclass:: dagster_aws.pipes.PipesEMRServerlessClient

 Legacy
```

examples/docs_snippets/docs_snippets/guides/dagster/dagster_pipes/emr-containers/Dockerfile

+7 / -6
```diff
@@ -6,15 +6,16 @@ USER root
 RUN mkdir /python && chown hadoop:hadoop /python

 USER hadoop
-ENV UV_PYTHON_INSTALL_DIR=/python
+ENV UV_PYTHON_INSTALL_DIR=/python \
+    UV_BREAK_SYSTEM_PACKAGES=1

 RUN uv python install --python-preference only-managed 3.9.16
-ENV PATH="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/bin:${PATH}"
-ENV UV_PYTHON="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/bin/python" \
-    UV_BREAK_SYSTEM_PACKAGES=1 \
+
+ENV PATH="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/bin:${PATH}" \
+    PYTHONPATH="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/lib/python3.9/site-packages" \
+    UV_PYTHON="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/bin/python" \
     PYSPARK_PYTHON="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/bin/python" \
-    PYSPARK_DRIVER_PYTHON="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/bin/python" \
-    PYTHONPATH="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/lib/python3.9/site-packages"
+    PYSPARK_DRIVER_PYTHON="${UV_PYTHON_INSTALL_DIR}/cpython-3.9.16-linux-x86_64-gnu/bin/python"

 RUN uv pip install --system dagster-pipes boto3 pyspark
```
examples/docs_snippets/docs_snippets/guides/dagster/dagster_pipes/emr-containers/dagster_code.py

+15 / -32
```diff
@@ -2,50 +2,31 @@

 from dagster_aws.pipes import PipesEMRContainersClient

-from dagster import AssetExecutionContext, asset
+import dagster as dg


-@asset
+@dg.asset
 def emr_containers_asset(
-    context: AssetExecutionContext,
+    context: dg.AssetExecutionContext,
     pipes_emr_containers_client: PipesEMRContainersClient,
 ):
+    image = (
+        ...
+    )  # it's likely the image can be taken from context.run_tags["dagster/image"]
+
     return pipes_emr_containers_client.run(
         context=context,
         start_job_run_params={
             "releaseLabel": "emr-7.5.0-latest",
-            "virtualClusterId": "uqcja50dzo7v1meie1wa47wa3",
+            "virtualClusterId": ...,
             "clientToken": context.run_id,  # idempotency identifier for the job run
-            "executionRoleArn": "arn:aws:iam::467123434025:role/emr-dagster-pipes-20250109135314655100000001",
+            "executionRoleArn": ...,
             "jobDriver": {
                 "sparkSubmitJobDriver": {
                     "entryPoint": "local:///app/script.py",
-                    # --conf spark.kubernetes.container.image=
-                    "sparkSubmitParameters": "--conf spark.kubernetes.file.upload.path=/tmp/spark --conf spark.kubernetes.container.image=467123434025.dkr.ecr.eu-north-1.amazonaws.com/dagster/emr-containers:12",  # --conf spark.pyspark.python=/home/hadoop/.local/share/uv/python/cpython-3.9.16-linux-x86_64-gnu/bin/python --conf spark.pyspark.driver.python=/home/hadoop/.local/share/uv/python/cpython-3.9.16-linux-x86_64-gnu/bin/python",
+                    "sparkSubmitParameters": f"--conf spark.kubernetes.container.image={image}",
                 }
             },
-            "configurationOverrides": {
-                "monitoringConfiguration": {
-                    "cloudWatchMonitoringConfiguration": {
-                        "logGroupName": "/aws/emr/containers/pipes",
-                        "logStreamNamePrefix": str(context.run_id),
-                    }
-                },
-                # "applicationConfiguration": [
-                #     {
-                #         "Classification": "spark-env",
-                #         "Configurations": [
-                #             {
-                #                 "Classification": "export",
-                #                 "Properties": {
-                #                     "PYSPARK_PYTHON": "/home/hadoop/.local/share/uv/python/cpython-3.9.16-linux-x86_64-gnu/bin/python",
-                #                     "PYSPARK_DRIVER_PYTHON": "/home/hadoop/.local/share/uv/python/cpython-3.9.16-linux-x86_64-gnu/bin/python",
-                #                 }
-                #             }
-                #         ]
-                #     }
-                # ]
-            },
         },
     ).get_materialize_result()

@@ -54,16 +35,18 @@ def emr_containers_asset(

 # start_definitions_marker
 import boto3
-from dagster_aws.pipes import PipesCloudWatchMessageReader, PipesS3ContextInjector
+from dagster_aws.pipes import PipesS3ContextInjector, PipesS3MessageReader

 from dagster import Definitions

 defs = Definitions(
     assets=[emr_containers_asset],
     resources={
         "pipes_emr_containers_client": PipesEMRContainersClient(
-            message_reader=PipesCloudWatchMessageReader(
-                client=boto3.client("logs"),
+            message_reader=PipesS3MessageReader(
+                client=boto3.client("s3"),
+                bucket=...,
+                include_stdio_in_messages=True,
             ),
         )
     },
```

examples/docs_snippets/docs_snippets/guides/dagster/dagster_pipes/emr-containers/script.py

+1 / -11
```diff
@@ -1,16 +1,14 @@
 import sys

 import boto3
-from dagster_pipes import PipesS3ContextLoader, PipesS3MessageWriter, open_dagster_pipes
+from dagster_pipes import PipesS3MessageWriter, open_dagster_pipes
 from pyspark.sql import SparkSession


 def main():
     s3_client = boto3.client("s3")
-
     with open_dagster_pipes(
         message_writer=PipesS3MessageWriter(client=s3_client),
-        context_loader=PipesS3ContextLoader(client=s3_client),
     ) as pipes:
         pipes.log.info("Hello from AWS EMR Containers!")

@@ -36,11 +34,3 @@ def main():

 if __name__ == "__main__":
     main()
-
-# import os
-# import sys
-
-# print(os.getcwd())
-# print(os.environ)
-# print(sys.path)
-# print(sys.executable)
```
