Commit aa34e49

[wip]
1 parent fd3a0c8 commit aa34e49

10 files changed: 557 additions and 330 deletions

Makefile

Lines changed: 8 additions & 0 deletions
@@ -1,5 +1,6 @@
 SHELL=/bin/bash
 DATETIME:=$(shell date -u +%Y%m%dT%H%M%SZ)
+MINIO_COMPOSE_FILE=abdiff/helpers/minio/docker-compose.yaml

 help: # Preview Makefile commands
 	@awk 'BEGIN { FS = ":.*#"; print "Usage: make <target>\n\nTargets:" } \
@@ -54,3 +55,10 @@ black-apply: # Apply changes with 'black'

 ruff-apply: # Resolve 'fixable errors' with 'ruff'
 	pipenv run ruff check --fix .
+
+# Development commands
+start-minio-server:
+	docker compose --env-file .env -f $(MINIO_COMPOSE_FILE) up -d
+
+stop-minio-server:
+	docker compose -f $(MINIO_COMPOSE_FILE) stop

Pipfile

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ boto3 = "*"

 [dev-packages]
 black = "*"
+boto3-stubs = {version = "*", extras = ["s3"]}
 coveralls = "*"
 freezegun = "*"
 ipython = "*"

Pipfile.lock

Lines changed: 381 additions & 323 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 22 additions & 0 deletions
@@ -15,6 +15,23 @@ Compare transformed TIMDEX records from two versions (A,B) of Transmogrifier.
 - To lint the repo: `make lint`
 - To run the app: `pipenv run abdiff --help`

+### Storing Files in a Local MinIO Server
+
+TIMDEX extract files from S3 (i.e., input files to use in transformations) can be downloaded to a local MinIO server hosted via Docker container. [MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features](https://min.io/docs/minio/kubernetes/upstream/). Downloading extract files improves the runtime of a diff by reducing the number of requests sent to S3 and avoids repeated downloads of extract files.
+
+1. Create an AWS profile `minio`. When prompted for an "AWS Access Key ID" and "AWS Secret Access Key", pass the values set for the `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD` environment variables in the Docker Compose YAML file.
+   ```shell
+   aws configure --profile minio
+   ```
+
+2. Launch a local MinIO server via Docker container: `make start-minio-server`.
+   The API is accessible at: http://127.0.0.1:9000.
+   The WebUI is accessible at: http://127.0.0.1:9001.
+
+3. In your browser, navigate to the WebUI and sign into the local MinIO server using the credentials set in the Docker Compose YAML file.
+
+4. Through the UI, create a bucket in the local MinIO server named after the S3 bucket containing the TIMDEX extract files that will be used in the A/B Diff.
+
 ## Concepts

 A **Job** in `abdiff` represents the A/B test for comparing the results from two versions of Transmogrifier. When a job is first created, a working directory and a JSON file `job.json` with an initial set of configurations is created.
@@ -90,6 +107,11 @@ AWS_SESSION_TOKEN=# passed to Transmogrifier containers for use
 ### Optional

 ```text
+MINIO_S3_LOCAL_STORAGE=# full file system path to the directory where MinIO stores its object data on the local disk
+MINIO_S3_URL=# endpoint for MinIO server API; default is "http://localhost:9000/"
+MINIO_S3_CONTAINER_URL=# endpoint for the MinIO server when accessed from inside a Docker container; default is "http://host.docker.internal:9000/"
+MINIO_ROOT_USER=# username for root user account for MinIO server
+MINIO_ROOT_PASSWORD=# password for root user account for MinIO server
 WEBAPP_HOST=# host for flask webapp
 WEBAPP_PORT=# port for flask webapp
 TRANSMOGRIFIER_MAX_WORKERS=# max number of Transmogrifier containers to run in parallel; default is 6

abdiff/cli.py

Lines changed: 11 additions & 1 deletion
@@ -11,6 +11,7 @@
 from abdiff.config import Config, configure_logger
 from abdiff.core import (
     build_ab_images,
+    download_input_files,
     calc_ab_diffs,
     calc_ab_metrics,
     collate_ab_transforms,
@@ -148,7 +149,12 @@ def init_job(
     help="Message to describe Run.",
     default="Not provided.",
 )
-def run_diff(job_directory: str, input_files: str, message: str) -> None:
+@click.option(
+    "--download-files", is_flag=True, help="Pass to download extract files to MinIO"
+)
+def run_diff(
+    job_directory: str, input_files: str, message: str, download_files: bool
+) -> None:

     job_data = read_job_json(job_directory)
     run_directory = init_run(job_directory, message=message)
@@ -160,11 +166,15 @@ def run_diff(job_directory: str, input_files: str, message: str) -> None:
     else:
         input_files_list = [filepath.strip() for filepath in input_files.split(",")]

+    if download_files:
+        download_input_files(input_files_list)
+
     ab_transformed_file_lists = run_ab_transforms(
         run_directory=run_directory,
         image_tag_a=job_data["image_tag_a"],
         image_tag_b=job_data["image_tag_b"],
         input_files=input_files_list,
+        download_files=download_files,
     )
     collated_dataset_path = collate_ab_transforms(
         run_directory=run_directory,
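
The control flow this hunk adds to `run_diff` can be pictured without Click or Docker. The sketch below is a simplification, not the project's code: `run_diff_sketch` and the two stub callables are hypothetical names, and it covers only the comma-separated branch of the input parsing. It shows that the flag gates the download step and is then forwarded to the transform step.

```python
from typing import Callable


def run_diff_sketch(
    input_files: str,
    download_files: bool,
    download: Callable[[list[str]], None],
    transform: Callable[[list[str], bool], list[str]],
) -> list[str]:
    """Hypothetical, simplified stand-in for the flow added to run_diff."""
    # Split the comma-separated CLI value into individual S3 URIs.
    input_files_list = [filepath.strip() for filepath in input_files.split(",")]

    # The download step only runs when the flag was passed.
    if download_files:
        download(input_files_list)

    # The flag is forwarded so the transform step can pick its S3 endpoint.
    return transform(input_files_list, download_files)


# Stub callables (assumptions) that record what they were asked to do.
download_calls: list[list[str]] = []
transform_calls: list[tuple[list[str], bool]] = []


def fake_download(files: list[str]) -> None:
    download_calls.append(files)


def fake_transform(files: list[str], flag: bool) -> list[str]:
    transform_calls.append((files, flag))
    return files


outputs = run_diff_sketch(
    "s3://bucket/a.xml, s3://bucket/b.xml", True, fake_download, fake_transform
)
```

With `download_files=False`, `fake_download` would never be called and the transform step would receive `False`.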

abdiff/config.py

Lines changed: 21 additions & 0 deletions
@@ -11,6 +11,11 @@ class Config:
         "WORKSPACE",
     )
     OPTIONAL_ENV_VARS = (
+        "MINIO_S3_LOCAL_STORAGE",
+        "MINIO_S3_URL",
+        "MINIO_S3_CONTAINER_URL",
+        "MINIO_ROOT_USER",
+        "MINIO_ROOT_PASSWORD",
         "WEBAPP_HOST",
         "WEBAPP_PORT",
         "TRANSMOGRIFIER_MAX_WORKERS",
@@ -25,6 +30,22 @@ def __getattr__(self, name: str) -> Any: # noqa: ANN401
         message = f"'{name}' not a valid configuration variable"
         raise AttributeError(message)

+    @property
+    def minio_s3_url(self) -> str:
+        return self.MINIO_S3_URL or "http://localhost:9000/"
+
+    @property
+    def minio_s3_container_url(self) -> str:
+        return self.MINIO_S3_CONTAINER_URL or "http://host.docker.internal:9000/"
+
+    @property
+    def minio_root_user(self) -> str:
+        return self.MINIO_ROOT_USER or "minioadmin"
+
+    @property
+    def minio_root_password(self) -> str:
+        return self.MINIO_ROOT_PASSWORD or "minioadmin"
+
     @property
     def webapp_host(self) -> str:
         return self.WEBAPP_HOST or "localhost"
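
The `or`-based defaults in the properties above can be tried in isolation. The following is a minimal sketch, assuming optional variables are read from the process environment. `SketchConfig` is a hypothetical stand-in and not the project's `Config` class. It shows that an unset or empty variable falls back to the default, while a set variable wins.

```python
import os
from typing import Optional


class SketchConfig:
    """Hypothetical stand-in for Config; illustrates only the fallback pattern."""

    OPTIONAL_ENV_VARS = ("MINIO_S3_URL",)

    def __getattr__(self, name: str) -> Optional[str]:
        # Called only when normal lookup fails, so properties take precedence.
        if name in self.OPTIONAL_ENV_VARS:
            return os.getenv(name)
        message = f"'{name}' not a valid configuration variable"
        raise AttributeError(message)

    @property
    def minio_s3_url(self) -> str:
        # `or` falls back when the variable is unset (None) or empty ("").
        return self.MINIO_S3_URL or "http://localhost:9000/"


config = SketchConfig()

# Unset variable: the default is returned.
os.environ.pop("MINIO_S3_URL", None)
default_url = config.minio_s3_url

# Set variable: the environment value is returned.
os.environ["MINIO_S3_URL"] = "http://127.0.0.1:9000/"
override_url = config.minio_s3_url

# Raw access bypasses the default and may be None when the variable is unset.
os.environ.pop("MINIO_S3_URL", None)
raw_value = config.MINIO_S3_URL
```

Note the difference between `config.minio_s3_url` (always a string) and `config.MINIO_S3_URL` (possibly `None`): code that reads the upper-case attribute directly does not get the default.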

abdiff/core/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
 from abdiff.core.calc_ab_diffs import calc_ab_diffs
 from abdiff.core.calc_ab_metrics import calc_ab_metrics
 from abdiff.core.collate_ab_transforms import collate_ab_transforms
+from abdiff.core.download_input_files import download_input_files
 from abdiff.core.init_job import init_job
 from abdiff.core.init_run import init_run
 from abdiff.core.run_ab_transforms import run_ab_transforms
@@ -15,6 +16,7 @@
     "init_job",
     "init_run",
     "build_ab_images",
+    "download_input_files",
     "run_ab_transforms",
     "collate_ab_transforms",
     "calc_ab_diffs",

abdiff/core/download_input_files.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+import logging
+import subprocess
+
+import boto3
+from botocore.exceptions import ClientError
+from mypy_boto3_s3.client import S3Client
+
+from abdiff.config import Config
+
+
+logger = logging.getLogger(__name__)
+
+CONFIG = Config()
+
+
+def download_input_files(input_files: list[str]) -> None:
+    s3_client = boto3.client(
+        "s3",
+        endpoint_url=CONFIG.minio_s3_url,
+        aws_access_key_id=CONFIG.MINIO_ROOT_USER,
+        aws_secret_access_key=CONFIG.MINIO_ROOT_PASSWORD,
+    )
+
+    for input_file in input_files:
+        if check_object_exists(CONFIG.TIMDEX_BUCKET, input_file, s3_client):
+            logger.info(f"File found for input: {input_file}. Skipping download.")
+            continue
+
+        logger.info(f"Downloading input file from {CONFIG.TIMDEX_BUCKET}: {input_file}")
+        copy_command = ["aws", "s3", "cp", input_file, "-"]
+        upload_command = [
+            "aws",
+            "s3",
+            "cp",
+            "--endpoint-url",
+            CONFIG.minio_s3_url,
+            "--profile",
+            "minio",
+            "-",
+            input_file,
+        ]
+
+        try:
+            copy_process = subprocess.run(
+                args=copy_command, check=True, capture_output=True
+            )
+            subprocess.run(
+                args=upload_command,
+                check=True,
+                input=copy_process.stdout,
+            )
+        except subprocess.CalledProcessError:
+            logger.exception(f"Failed to download input file: {input_file}")
+
+
+def check_object_exists(bucket: str, input_file: str, s3_client: S3Client) -> bool:
+    key = input_file.replace(f"s3://{bucket}/", "")
+    try:
+        s3_client.head_object(Bucket=bucket, Key=key)
+        return True
+    except ClientError as exception:
+        if exception.response["Error"]["Code"] == "404":
+            return False
+        logger.exception(f"Cannot determine if object exists for key {key}.")
+        return False
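
The copy-then-upload step in `download_input_files` captures the output of one command and feeds it to the standard input of a second. Below is a minimal sketch of that pattern. The two commands are placeholder Python one-liners (assumptions) standing in for the `aws s3 cp` invocations, so the example runs without AWS credentials or a MinIO server. `pipe_through` is a hypothetical helper name.

```python
import subprocess
import sys

# Placeholder for `aws s3 cp <s3-uri> -`: writes two records to stdout.
producer_command = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write('record-1\\nrecord-2\\n')",
]

# Placeholder for `aws s3 cp - <s3-uri>`: reads stdin and echoes it upper-cased.
consumer_command = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def pipe_through(producer: list[str], consumer: list[str]) -> bytes:
    """Run producer, then pass its captured stdout to consumer's stdin."""
    # check=True raises CalledProcessError on a non-zero exit code.
    produced = subprocess.run(args=producer, check=True, capture_output=True)
    consumed = subprocess.run(
        args=consumer,
        check=True,
        input=produced.stdout,
        capture_output=True,
    )
    return consumed.stdout


result = pipe_through(producer_command, consumer_command)
```

Because `capture_output=True` buffers the producer's entire output in memory before the consumer starts, this approach is simplest for files that fit comfortably in RAM. For very large extracts, connecting two `subprocess.Popen` objects through a pipe would stream the data instead.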

abdiff/core/run_ab_transforms.py

Lines changed: 22 additions & 6 deletions
@@ -34,6 +34,7 @@ def run_ab_transforms(
     image_tag_a: str,
     image_tag_b: str,
     input_files: list[str],
+    download_files: bool,
     docker_client: docker.client.DockerClient | None = None,
 ) -> tuple[list[str], ...]:
     """Run Docker containers with versioned images of Transmogrifier.
@@ -95,7 +96,9 @@ def run_ab_transforms(
     ]

     # run containers and collect results
-    futures = run_all_docker_containers(docker_client, input_files, run_configs)
+    futures = run_all_docker_containers(
+        docker_client, input_files, run_configs, download_files
+    )
     containers, exceptions = collect_container_results(futures)
     logger.info(
         f"Successful containers: {len(containers)}, failed containers: {len(exceptions)}"
@@ -129,6 +132,7 @@ def run_all_docker_containers(
     docker_client: docker.client.DockerClient,
     input_files: list[str],
     run_configs: list[tuple],
+    download_files: bool,
 ) -> list[Future]:
     """Invoke Docker containers to run in parallel via threads.

@@ -151,6 +155,7 @@ def run_all_docker_containers(
                     input_file,
                     get_transformed_filename(filename_details),
                     docker_client,
+                    download_files,
                 )
                 tasks.append(executor.submit(run_docker_container, *args))

@@ -165,13 +170,28 @@ def run_docker_container(
     input_file: str,
     output_file: str,
     docker_client: docker.client.DockerClient,
+    download_files: bool,
     timeout: int = CONFIG.transmogrifier_timeout,
 ) -> tuple[Container, Exception | None]:
     """Run Transmogrifier via Docker container to transform input file.

     The container is run in a detached state to capture a container handle for later use
     but this function waits for the container to exit before returning.
     """
+
+    if download_files:
+        environment_variables = {
+            "AWS_ENDPOINT_URL": CONFIG.minio_s3_container_url,
+            "AWS_ACCESS_KEY_ID": CONFIG.MINIO_ROOT_USER,
+            "AWS_SECRET_ACCESS_KEY": CONFIG.MINIO_ROOT_PASSWORD,
+        }
+    else:
+        environment_variables = {
+            "AWS_ACCESS_KEY_ID": CONFIG.AWS_ACCESS_KEY_ID,
+            "AWS_SECRET_ACCESS_KEY": CONFIG.AWS_SECRET_ACCESS_KEY,
+            "AWS_SESSION_TOKEN": CONFIG.AWS_SESSION_TOKEN,
+        }
+
     container = docker_client.containers.run(
         docker_image,
         command=[
@@ -180,11 +200,7 @@ def run_docker_container(
             f"--source={source}",
         ],
         detach=True,
-        environment={
-            "AWS_ACCESS_KEY_ID": CONFIG.AWS_ACCESS_KEY_ID,
-            "AWS_SECRET_ACCESS_KEY": CONFIG.AWS_SECRET_ACCESS_KEY,
-            "AWS_SESSION_TOKEN": CONFIG.AWS_SESSION_TOKEN,
-        },
+        environment=environment_variables,
         labels={
             "docker_image": docker_image,
             "source": source,

abdiff/helpers/minio/docker-compose.yaml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Settings and configurations that are common for all containers
+x-minio-common: &minio-common
+  image: quay.io/minio/minio:RELEASE.2024-10-29T16-01-48Z
+  command: server --console-address ":9001" /mnt/data
+  ports:
+    - "9000:9000" # API port
+    - "9001:9001" # Console port
+  environment:
+    MINIO_ROOT_USER: ${MINIO_ROOT_USER}
+    MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
+  healthcheck:
+    test: ["CMD", "mc", "ready", "local"]
+    interval: 5s
+    timeout: 5s
+    retries: 5
+
+services:
+  minio:
+    <<: *minio-common
+    volumes:
+      - ${MINIO_S3_LOCAL_STORAGE}:/mnt/data
+
+
+
