Commit d163d93

Add option to download input files using a local MinIO server
Why these changes are being introduced:
* Downloading extract files improves the performance of the app by reducing requests sent to AWS S3 and avoiding repeated downloads of extract files used across multiple container runs. Having extract files available on local disk also minimizes the occurrence of network issues or AWS credentials timing out during a transform. These changes introduce a locally hosted MinIO server to act as a "local S3 bucket" as part of the A/B diff workflow.

How this addresses that need:
* Add a Docker Compose YAML file to run local MinIO server
* Add Makefile commands for starting and stopping local MinIO server
* Add option '--download-files' to run-diff CLI command
* Implement download_input_files core function
* Update run_ab_transforms to support use of local MinIO server

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-353

1 parent fd3a0c8 commit d163d93

10 files changed

+602
-331
lines changed


Makefile

Lines changed: 8 additions & 0 deletions
@@ -1,5 +1,6 @@
 SHELL=/bin/bash
 DATETIME:=$(shell date -u +%Y%m%dT%H%M%SZ)
+MINIO_COMPOSE_FILE=abdiff/helpers/minio/docker-compose.yaml
 
 help: # Preview Makefile commands
 	@awk 'BEGIN { FS = ":.*#"; print "Usage: make <target>\n\nTargets:" } \
@@ -54,3 +55,10 @@ black-apply: # Apply changes with 'black'
 
 ruff-apply: # Resolve 'fixable errors' with 'ruff'
 	pipenv run ruff check --fix .
+
+# Development commands
+start-minio-server:
+	docker compose --env-file .env -f $(MINIO_COMPOSE_FILE) up -d
+
+stop-minio-server:
+	docker compose -f $(MINIO_COMPOSE_FILE) stop

Pipfile

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ boto3 = "*"
 
 [dev-packages]
 black = "*"
+boto3-stubs = {version = "*", extras = ["s3"]}
 coveralls = "*"
 freezegun = "*"
 ipython = "*"

Pipfile.lock

Lines changed: 381 additions & 323 deletions
Generated file; diff not rendered by default.

README.md

Lines changed: 37 additions & 0 deletions
@@ -15,6 +15,38 @@ Compare transformed TIMDEX records from two versions (A,B) of Transmogrifier.
 - To lint the repo: `make lint`
 - To run the app: `pipenv run abdiff --help`
 
+### Storing Files in a Local MinIO Server
+
+TIMDEX extract files from S3 (i.e., input files to use in transformations) can be downloaded to a local MinIO server hosted in a Docker container. [MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features](https://min.io/docs/minio/kubernetes/upstream/). Downloading extract files improves the runtime of a diff by reducing the number of requests sent to S3 and avoiding repeated downloads of extract files.
+
+1. Configure your `.env` file. In addition to the [required environment variables](#required), the following environment variables must also be set:
+
+```text
+MINIO_S3_LOCAL_STORAGE="/Users/jcuerdo/Documents/repos/transmogrifier-ab-diff/output/input_files"
+TIMDEX_BUCKET="timdex-extract-dev-222053980223"
+```
+
+Note: There are additional variables required by the local MinIO server (see vars prefixed with "MINIO" in [optional environment variables](#optional)). For these variables, defaults are provided in [abdiff.config](abdiff/config.py).
+
+2. Create an AWS profile `minio`. When prompted for an "AWS Access Key ID" and "AWS Secret Access Key", pass the values set for the `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD` environment variables, respectively.
+```shell
+aws configure --profile minio
+```
+
+3. Launch a local MinIO server via Docker container by running the Makefile command:
+```shell
+make start-minio-server
+```
+
+The API is accessible at: http://127.0.0.1:9000.
+The WebUI is accessible at: http://127.0.0.1:9001.
+
+4. In your browser, navigate to the WebUI and sign into the local MinIO server. Create a bucket in the local MinIO server named after the S3 bucket containing the TIMDEX extract files that will be used in the A/B Diff.
+
+5. Proceed with A/B Diff CLI commands as needed!
+
+Once a diff run is complete, you can stop the local MinIO server using the Makefile command: `make stop-minio-server`. If you're planning to run another diff using the same files -- good news! All you have to do is restart the local MinIO server. Your data will persist as long as the files exist in the directory you specified for `MINIO_S3_LOCAL_STORAGE`.
+
 ## Concepts
 
 A **Job** in `abdiff` represents the A/B test for comparing the results from two versions of Transmogrifier. When a job is first created, a working directory and a JSON file `job.json` with an initial set of configurations is created.
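
Step 4 of the MinIO instructions in this README hunk creates the bucket by hand in the WebUI. As a rough alternative, the bucket could also be created from Python. This is a minimal sketch and not part of the commit: `ensure_bucket` and `make_minio_client` are hypothetical helper names, and the endpoint and `minioadmin` credentials are assumed to match the defaults shown in `abdiff/config.py`.

```python
from typing import Any


def ensure_bucket(s3_client: Any, bucket: str) -> bool:
    """Create `bucket` if it is missing; return True only when it was created.

    `s3_client` is any object exposing the boto3 S3 client methods
    `list_buckets()` and `create_bucket(Bucket=...)`.
    """
    existing = {b["Name"] for b in s3_client.list_buckets().get("Buckets", [])}
    if bucket in existing:
        return False
    s3_client.create_bucket(Bucket=bucket)
    return True


def make_minio_client(
    endpoint_url: str = "http://localhost:9000/",
    access_key: str = "minioadmin",  # assumed default for MINIO_ROOT_USER
    secret_key: str = "minioadmin",  # assumed default for MINIO_ROOT_PASSWORD
) -> Any:
    """Build a boto3 S3 client pointed at the local MinIO server."""
    import boto3  # imported lazily so ensure_bucket works without boto3 installed

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
```

With the MinIO server running, `ensure_bucket(make_minio_client(), "timdex-extract-dev-222053980223")` would create the bucket on the first call and do nothing on later calls.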
@@ -90,6 +122,11 @@ AWS_SESSION_TOKEN=# passed to Transmogrifier containers for use
 ### Optional
 
 ```text
+MINIO_S3_LOCAL_STORAGE=# full file system path to the directory where MinIO stores its object data on the local disk
+MINIO_S3_URL=# endpoint for MinIO server API; default is "http://localhost:9000/"
+MINIO_S3_CONTAINER_URL=# endpoint for the MinIO server when accessed from inside a Docker container; default is "http://host.docker.internal:9000/"
+MINIO_ROOT_USER=# username for root user account for MinIO server
+MINIO_ROOT_PASSWORD=# password for root user account for MinIO server
 WEBAPP_HOST=# host for flask webapp
 WEBAPP_PORT=# port for flask webapp
 TRANSMOGRIFIER_MAX_WORKERS=# max number of Transmogrifier containers to run in parallel; default is 6

abdiff/cli.py

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@
1414
calc_ab_diffs,
1515
calc_ab_metrics,
1616
collate_ab_transforms,
17+
download_input_files,
1718
init_run,
1819
run_ab_transforms,
1920
)
@@ -148,7 +149,12 @@ def init_job(
148149
help="Message to describe Run.",
149150
default="Not provided.",
150151
)
151-
def run_diff(job_directory: str, input_files: str, message: str) -> None:
152+
@click.option(
153+
"--download-files", is_flag=True, help="Pass to skip download of extract files"
154+
)
155+
def run_diff(
156+
job_directory: str, input_files: str, message: str, *, download_files: bool
157+
) -> None:
152158

153159
job_data = read_job_json(job_directory)
154160
run_directory = init_run(job_directory, message=message)
@@ -160,11 +166,15 @@ def run_diff(job_directory: str, input_files: str, message: str) -> None:
160166
else:
161167
input_files_list = [filepath.strip() for filepath in input_files.split(",")]
162168

169+
if download_files:
170+
download_input_files(input_files_list)
171+
163172
ab_transformed_file_lists = run_ab_transforms(
164173
run_directory=run_directory,
165174
image_tag_a=job_data["image_tag_a"],
166175
image_tag_b=job_data["image_tag_b"],
167176
input_files=input_files_list,
177+
use_local_s3=download_files,
168178
)
169179
collated_dataset_path = collate_ab_transforms(
170180
run_directory=run_directory,

abdiff/config.py

Lines changed: 21 additions & 0 deletions
@@ -11,6 +11,11 @@ class Config:
         "WORKSPACE",
     )
     OPTIONAL_ENV_VARS = (
+        "MINIO_S3_LOCAL_STORAGE",
+        "MINIO_S3_URL",
+        "MINIO_S3_CONTAINER_URL",
+        "MINIO_ROOT_USER",
+        "MINIO_ROOT_PASSWORD",
         "WEBAPP_HOST",
         "WEBAPP_PORT",
         "TRANSMOGRIFIER_MAX_WORKERS",
@@ -25,6 +30,22 @@ def __getattr__(self, name: str) -> Any:  # noqa: ANN401
         message = f"'{name}' not a valid configuration variable"
         raise AttributeError(message)
 
+    @property
+    def minio_s3_url(self) -> str:
+        return self.MINIO_S3_URL or "http://localhost:9000/"
+
+    @property
+    def minio_s3_container_url(self) -> str:
+        return self.MINIO_S3_CONTAINER_URL or "http://host.docker.internal:9000/"
+
+    @property
+    def minio_root_user(self) -> str:
+        return self.MINIO_ROOT_USER or "minioadmin"
+
+    @property
+    def minio_root_password(self) -> str:
+        return self.MINIO_ROOT_PASSWORD or "minioadmin"
+
     @property
     def webapp_host(self) -> str:
         return self.WEBAPP_HOST or "localhost"
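
The new properties follow the same pattern as `webapp_host`: read the upper-case environment variable and fall back to a default when it is unset or empty. The sketch below illustrates that pattern in isolation. It is a simplified stand-in, not the real `abdiff.config.Config`; in particular, the `os.getenv` lookup inside `__getattr__` is an assumption, since the diff shows only the error branch of that method.

```python
import os
from typing import Any


class Config:
    """Simplified stand-in for abdiff's Config (assumed behavior, not the real class)."""

    OPTIONAL_ENV_VARS = ("MINIO_S3_URL", "MINIO_ROOT_USER")

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal attribute lookup fails, so the
        # property below takes precedence over this hook.
        if name in self.OPTIONAL_ENV_VARS:
            return os.getenv(name)
        message = f"'{name}' not a valid configuration variable"
        raise AttributeError(message)

    @property
    def minio_s3_url(self) -> str:
        # An unset (None) or empty variable falls back to the default.
        return self.MINIO_S3_URL or "http://localhost:9000/"
```

One consequence of the `or` fallback is that an empty string in `.env` is treated the same as a missing variable.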

abdiff/core/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
 from abdiff.core.calc_ab_diffs import calc_ab_diffs
 from abdiff.core.calc_ab_metrics import calc_ab_metrics
 from abdiff.core.collate_ab_transforms import collate_ab_transforms
+from abdiff.core.download_input_files import download_input_files
 from abdiff.core.init_job import init_job
 from abdiff.core.init_run import init_run
 from abdiff.core.run_ab_transforms import run_ab_transforms
@@ -15,6 +16,7 @@
     "init_job",
     "init_run",
     "build_ab_images",
+    "download_input_files",
     "run_ab_transforms",
     "collate_ab_transforms",
     "calc_ab_diffs",

abdiff/core/download_input_files.py

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+import logging
+import subprocess
+
+import boto3
+from botocore.exceptions import ClientError
+from mypy_boto3_s3.client import S3Client
+
+from abdiff.config import Config
+
+logger = logging.getLogger(__name__)
+
+CONFIG = Config()
+
+
+def download_input_files(input_files: list[str]) -> None:
+    """Download extract files from S3 to a local MinIO server.
+
+    For each file download, two AWS CLI commands are run by subprocess.
+    The output from the first command is piped to the second command.
+    These commands are further explained below:
+
+    1. Copy the contents from the input file and direct to stdout.
+    ```
+    aws s3 cp <input_file> -
+    ```
+
+    2. Given the stdout from the previous command as input, copy the contents
+    to a similarly named file on the local MinIO server.
+    ```
+    aws s3 cp --endpoint-url <minio_s3_url> --profile minio - <input_file>
+    ```
+
+    Note: An S3 client connected to the local MinIO server will check whether the file exists
+    prior to any download.
+    """
+    s3_client = boto3.client(
+        "s3",
+        endpoint_url=CONFIG.minio_s3_url,
+        aws_access_key_id=CONFIG.minio_root_user,
+        aws_secret_access_key=CONFIG.minio_root_password,
+    )
+
+    for input_file in input_files:
+        if check_object_exists(CONFIG.TIMDEX_BUCKET, input_file, s3_client):
+            logger.info(f"File found for input: {input_file}. Skipping download.")
+            continue
+
+        logger.info(f"Downloading input file from {CONFIG.TIMDEX_BUCKET}: {input_file}")
+        copy_command = ["aws", "s3", "cp", input_file, "-"]
+        upload_command = [
+            "aws",
+            "s3",
+            "cp",
+            "--endpoint-url",
+            CONFIG.minio_s3_url,
+            "--profile",
+            "minio",
+            "-",
+            input_file,
+        ]
+
+        try:
+            copy_process = subprocess.run(
+                args=copy_command, check=True, capture_output=True
+            )
+            subprocess.run(
+                args=upload_command,
+                check=True,
+                input=copy_process.stdout,
+            )
+        except subprocess.CalledProcessError:
+            logger.exception(f"Failed to download input file: {input_file}")
+
+
+def check_object_exists(bucket: str, input_file: str, s3_client: S3Client) -> bool:
+    key = input_file.replace(f"s3://{bucket}/", "")
+    try:
+        s3_client.head_object(Bucket=bucket, Key=key)
+    except ClientError as exception:
+        if exception.response["Error"]["Code"] == "404":
+            return False
+        logger.exception(f"Cannot determine if object exists for key {key}.")
+        return False
+    else:
+        return True
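
The docstring describes the first command as piped to the second, while the code runs them one after the other: `capture_output=True` holds the whole file in memory before it is passed to the upload as `input`. For large extract files, a streaming variant might be preferable. The following is only a sketch of that idea and not part of the commit; `pipe_commands` is a hypothetical helper, and the demo uses two Python one-liners in place of the two `aws s3 cp` commands.

```python
import subprocess
import sys


def pipe_commands(first: list[str], second: list[str]) -> bytes:
    """Run `first | second`, streaming data between them; return second's stdout."""
    producer = subprocess.Popen(first, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        second, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close the parent's copy of the pipe so the producer is notified
    # if the consumer exits early.
    assert producer.stdout is not None
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    # Mirror `check=True`: fail if either side exited with a non-zero code.
    for process, args in ((producer, first), (consumer, second)):
        if process.returncode != 0:
            raise subprocess.CalledProcessError(process.returncode, args)
    return output


# Demo: stand-ins for "aws s3 cp <input_file> -" and the upload command.
result = pipe_commands(
    [sys.executable, "-c", "print('hello')"],
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
)
print(result)
```

In `download_input_files`, the two arguments would be `copy_command` and `upload_command`, and the existing `except subprocess.CalledProcessError` handler would still apply.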

abdiff/core/run_ab_transforms.py

Lines changed: 32 additions & 7 deletions
@@ -35,6 +35,8 @@ def run_ab_transforms(
     image_tag_b: str,
     input_files: list[str],
     docker_client: docker.client.DockerClient | None = None,
+    *,
+    use_local_s3: bool = False,
 ) -> tuple[list[str], ...]:
     """Run Docker containers with versioned images of Transmogrifier.
 
@@ -59,6 +61,10 @@ def run_ab_transforms(
             URIs for input files on S3 are accepted.
         docker_client (docker.client.DockerClient | None, optional): Docker client.
             Defaults to None.
+        use_local_s3 (bool): Boolean indicating whether the container should
+            access input files from a local MinIO server (i.e., "local S3 bucket")
+            or from AWS S3. This flag determines the appropriate environment variables
+            to set for the Docker containers. Default is False.
 
     Returns:
         tuple[list[str], ...]: A tuple containing two lists, where each list contains
@@ -95,7 +101,9 @@ def run_ab_transforms(
     ]
 
     # run containers and collect results
-    futures = run_all_docker_containers(docker_client, input_files, run_configs)
+    futures = run_all_docker_containers(
+        docker_client, input_files, run_configs, use_local_s3=use_local_s3
+    )
     containers, exceptions = collect_container_results(futures)
     logger.info(
         f"Successful containers: {len(containers)}, failed containers: {len(exceptions)}"
@@ -129,6 +137,8 @@ def run_all_docker_containers(
     docker_client: docker.client.DockerClient,
     input_files: list[str],
     run_configs: list[tuple],
+    *,
+    use_local_s3: bool = False,
 ) -> list[Future]:
     """Invoke Docker containers to run in parallel via threads.
 
@@ -152,7 +162,11 @@ def run_all_docker_containers(
                     get_transformed_filename(filename_details),
                     docker_client,
                 )
-                tasks.append(executor.submit(run_docker_container, *args))
+                tasks.append(
+                    executor.submit(
+                        run_docker_container, *args, use_local_s3=use_local_s3
+                    )
+                )
 
     logger.info(f"All {len(tasks)} containers have exited.")
     return tasks
@@ -166,12 +180,27 @@ def run_docker_container(
     output_file: str,
     docker_client: docker.client.DockerClient,
     timeout: int = CONFIG.transmogrifier_timeout,
+    *,
+    use_local_s3: bool = False,
 ) -> tuple[Container, Exception | None]:
     """Run Transmogrifier via Docker container to transform input file.
 
     The container is run in a detached state to capture a container handle for later use
     but this function waits for the container to exit before returning.
     """
+    if use_local_s3:
+        environment_variables = {
+            "AWS_ENDPOINT_URL": CONFIG.minio_s3_container_url,
+            "AWS_ACCESS_KEY_ID": CONFIG.minio_root_user,
+            "AWS_SECRET_ACCESS_KEY": CONFIG.minio_root_password,
+        }
+    else:
+        environment_variables = {
+            "AWS_ACCESS_KEY_ID": CONFIG.AWS_ACCESS_KEY_ID,
+            "AWS_SECRET_ACCESS_KEY": CONFIG.AWS_SECRET_ACCESS_KEY,
+            "AWS_SESSION_TOKEN": CONFIG.AWS_SESSION_TOKEN,
+        }
+
     container = docker_client.containers.run(
         docker_image,
         command=[
@@ -180,11 +209,7 @@ def run_docker_container(
             f"--source={source}",
         ],
         detach=True,
-        environment={
-            "AWS_ACCESS_KEY_ID": CONFIG.AWS_ACCESS_KEY_ID,
-            "AWS_SECRET_ACCESS_KEY": CONFIG.AWS_SECRET_ACCESS_KEY,
-            "AWS_SESSION_TOKEN": CONFIG.AWS_SESSION_TOKEN,
-        },
+        environment=environment_variables,
         labels={
             "docker_image": docker_image,
             "source": source,

abdiff/helpers/minio/docker-compose.yaml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Settings and configurations that are common for all containers
+x-minio-common: &minio-common
+  image: quay.io/minio/minio:RELEASE.2024-10-29T16-01-48Z
+  command: server --console-address ":9001" /mnt/data
+  ports:
+    - "9000:9000" # API port
+    - "9001:9001" # Console port
+  environment:
+    MINIO_ROOT_USER: ${MINIO_ROOT_USER}
+    MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
+  healthcheck:
+    test: ["CMD", "mc", "ready", "local"]
+    interval: 5s
+    timeout: 5s
+    retries: 5
+
+services:
+  minio:
+    <<: *minio-common
+    volumes:
+      - ${MINIO_S3_LOCAL_STORAGE}:/mnt/data
+
+
+
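
The compose healthcheck retries every 5 seconds, up to 5 times, inside the container. A caller on the host that wants to wait for the server before downloading files could apply the same retry idea from Python. The sketch below is an assumption-laden illustration, not part of the commit: `wait_until_ready` and `minio_is_live` are hypothetical helpers, and the probe assumes MinIO's liveness endpoint `/minio/health/live` is reachable on the published API port.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    retries: int = 5,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `retries` times, pausing `interval` seconds between tries."""
    for attempt in range(1, retries + 1):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts, and urllib errors are OSError subclasses.
            pass
        if attempt < retries:
            sleep(interval)
    return False


def minio_is_live(base_url: str = "http://localhost:9000") -> bool:
    """Probe MinIO's liveness endpoint; True when it answers HTTP 200."""
    from urllib.request import urlopen

    with urlopen(f"{base_url}/minio/health/live", timeout=5) as response:
        return response.status == 200
```

A call such as `wait_until_ready(minio_is_live)` after `make start-minio-server` would block for at most roughly the same window the compose healthcheck allows.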
