Commit d85185a

[wip]
1 parent fd3a0c8 commit d85185a

9 files changed: +514 -323 lines changed

Makefile

Lines changed: 8 additions & 0 deletions
@@ -1,5 +1,6 @@
 SHELL=/bin/bash
 DATETIME:=$(shell date -u +%Y%m%dT%H%M%SZ)
+MINIO_COMPOSE_FILE=abdiff/helpers/minio/docker-compose.yaml
 
 help: # Preview Makefile commands
 	@awk 'BEGIN { FS = ":.*#"; print "Usage: make <target>\n\nTargets:" } \
@@ -54,3 +55,10 @@ black-apply: # Apply changes with 'black'
 
 ruff-apply: # Resolve 'fixable errors' with 'ruff'
 	pipenv run ruff check --fix .
+
+# Development commands
+start-minio-server:
+	docker compose -f $(MINIO_COMPOSE_FILE) up -d
+
+stop-minio-server:
+	docker compose -f $(MINIO_COMPOSE_FILE) stop

Pipfile

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ boto3 = "*"
 
 [dev-packages]
 black = "*"
+boto3-stubs = {version = "*", extras = ["s3"]}
 coveralls = "*"
 freezegun = "*"
 ipython = "*"

Pipfile.lock

Lines changed: 381 additions & 323 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 18 additions & 0 deletions
@@ -15,6 +15,23 @@ Compare transformed TIMDEX records from two versions (A,B) of Transmogrifier.
 - To lint the repo: `make lint`
 - To run the app: `pipenv run abdiff --help`
 
+### Storing Files in a Local MinIO Server
+
+TIMDEX extract files from S3 (i.e., input files to use in transformations) can be downloaded to a local MinIO server hosted in a Docker container. [MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features](https://min.io/docs/minio/kubernetes/upstream/). Storing extract files locally improves the runtime of a diff by reducing the number of requests sent to S3 and avoiding repeated downloads of the same extract files.
+
+1. Create an AWS profile named `minio`. When prompted for an "AWS Access Key ID" and "AWS Secret Access Key", enter the values set for the `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD` environment variables in the Docker Compose YAML file.
+   ```shell
+   aws configure --profile minio
+   ```
+
+2. Launch a local MinIO server in a Docker container: `make start-minio-server`.
+   The API is accessible at: http://127.0.0.1:9000.
+   The WebUI is accessible at: http://127.0.0.1:9001.
+
+3. In your browser, navigate to the WebUI and sign in to the local MinIO server using the credentials set in the Docker Compose YAML file.
+
+4. Through the UI, create a bucket in the local MinIO server named after the S3 bucket containing the TIMDEX extract files that will be used in the A/B Diff.
+
 ## Concepts
 
 A **Job** in `abdiff` represents the A/B test for comparing the results from two versions of Transmogrifier. When a job is first created, a working directory and a JSON file `job.json` with an initial set of configurations is created.
@@ -90,6 +107,7 @@ AWS_SESSION_TOKEN=# passed to Transmogrifier containers for use
 ### Optional
 
 ```text
+AWS_ENDPOINT_URL=# endpoint for MinIO server API; default is "http://localhost:9000/"
 WEBAPP_HOST=# host for flask webapp
 WEBAPP_PORT=# port for flask webapp
 TRANSMOGRIFIER_MAX_WORKERS=# max number of Transmogrifier containers to run in parallel; default is 6

abdiff/cli.py

Lines changed: 14 additions & 0 deletions
@@ -11,6 +11,7 @@
 from abdiff.config import Config, configure_logger
 from abdiff.core import (
     build_ab_images,
+    download_input_files,
     calc_ab_diffs,
     calc_ab_metrics,
     collate_ab_transforms,
@@ -180,6 +181,19 @@ def run_diff(job_directory: str, input_files: str, message: str) -> None:
     )
 
 
+@main.command()
+@click.option(
+    "-i",
+    "--input-files",
+    type=str,
+    required=True,
+    help="Input files to transform.",
+)
+def download_files(input_files: str) -> None:
+    input_files_list = [filepath.strip() for filepath in input_files.split(",")]
+    download_input_files(input_files_list)
+
+
 @main.command()
 @click.option(
     "-d",
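
The new `download_files` command splits the `--input-files` value on commas and strips surrounding whitespace. A minimal sketch of that parsing in isolation follows. It makes one assumption that is not in the commit: empty entries, such as the one left by a trailing comma, are dropped. `parse_input_files` is a hypothetical helper name and the URIs are placeholders.

```python
def parse_input_files(input_files: str) -> list[str]:
    """Split a comma-separated CLI value into a list of file URIs.

    Hypothetical helper: unlike the comprehension in the commit, this
    version also drops empty entries (e.g. from a trailing comma).
    """
    return [part.strip() for part in input_files.split(",") if part.strip()]


if __name__ == "__main__":
    # Placeholder URIs, not real TIMDEX extract files.
    print(parse_input_files("s3://example-bucket/a.xml, s3://example-bucket/b.xml,"))
```

Dropping empty entries keeps a stray comma from turning into an `aws s3 cp` call with an empty source path.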

abdiff/config.py

Lines changed: 5 additions & 0 deletions
@@ -11,6 +11,7 @@ class Config:
         "WORKSPACE",
     )
     OPTIONAL_ENV_VARS = (
+        "AWS_ENDPOINT_URL",
         "WEBAPP_HOST",
         "WEBAPP_PORT",
         "TRANSMOGRIFIER_MAX_WORKERS",
@@ -25,6 +26,10 @@ def __getattr__(self, name: str) -> Any: # noqa: ANN401
         message = f"'{name}' not a valid configuration variable"
         raise AttributeError(message)
 
+    @property
+    def aws_endpoint_url(self) -> str:
+        return self.AWS_ENDPOINT_URL or "http://localhost:9000/"
+
     @property
     def webapp_host(self) -> str:
         return self.WEBAPP_HOST or "localhost"
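
The `aws_endpoint_url` property returns the environment value when it is set and a local MinIO default otherwise. The sketch below isolates that fallback pattern. It assumes that `Config.__getattr__` resolves the listed names from the environment, which the hunk does not show in full. `SketchConfig` is a hypothetical stand-in, not the project's class.

```python
import os
from typing import Any


class SketchConfig:
    """Hypothetical stand-in for abdiff's Config; only the fallback matters."""

    OPTIONAL_ENV_VARS = ("AWS_ENDPOINT_URL",)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails, so properties
        # defined on the class are resolved before this method runs.
        if name in self.OPTIONAL_ENV_VARS:
            return os.getenv(name)
        message = f"'{name}' not a valid configuration variable"
        raise AttributeError(message)

    @property
    def aws_endpoint_url(self) -> str:
        # An unset or empty variable falls back to the local MinIO default.
        return self.AWS_ENDPOINT_URL or "http://localhost:9000/"
```

Because the fallback uses `or`, an empty `AWS_ENDPOINT_URL=` line in a `.env` file behaves the same as an unset variable.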

abdiff/core/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
 from abdiff.core.calc_ab_diffs import calc_ab_diffs
 from abdiff.core.calc_ab_metrics import calc_ab_metrics
 from abdiff.core.collate_ab_transforms import collate_ab_transforms
+from abdiff.core.download_input_files import download_input_files
 from abdiff.core.init_job import init_job
 from abdiff.core.init_run import init_run
 from abdiff.core.run_ab_transforms import run_ab_transforms
@@ -15,6 +16,7 @@
     "init_job",
     "init_run",
     "build_ab_images",
+    "download_input_files",
     "run_ab_transforms",
     "collate_ab_transforms",
     "calc_ab_diffs",

abdiff/core/download_input_files.py

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+import logging
+import subprocess
+from typing import TYPE_CHECKING
+
+import boto3
+from botocore.exceptions import ClientError
+
+if TYPE_CHECKING:
+    from mypy_boto3_s3.client import S3Client
+
+from abdiff.config import Config
+
+
+logger = logging.getLogger(__name__)
+
+CONFIG = Config()
+
+
+def download_input_files(input_files: list[str]) -> None:
+    s3_client = boto3.client("s3")
+
+    for input_file in input_files:
+        if check_object_exists(CONFIG.TIMDEX_BUCKET, input_file, s3_client):
+            continue
+
+        logger.info(f"Downloading input file from {CONFIG.TIMDEX_BUCKET}: {input_file}")
+        copy_command = ["aws", "s3", "cp", input_file, "-"]
+        upload_command = [
+            "aws",
+            "s3",
+            "cp",
+            "--endpoint-url",
+            CONFIG.aws_endpoint_url,
+            "--profile",
+            "minio",
+            "-",
+            input_file,
+        ]
+
+        try:
+            copy_process = subprocess.run(
+                args=copy_command, check=True, capture_output=True
+            )
+            subprocess.run(
+                args=upload_command,
+                check=True,
+                input=copy_process.stdout,
+            )
+        except subprocess.CalledProcessError:
+            logger.exception(f"Failed to download input file: {input_file}")
+
+
+def check_object_exists(bucket: str, input_file: str, s3_client: "S3Client") -> bool:
+    key = input_file.replace(f"s3://{bucket}/", "")
+    try:
+        s3_client.head_object(Bucket=bucket, Key=key)
+        return True
+    except ClientError as exception:
+        if exception.response["Error"]["Code"] in ("404", "NoSuchKey"):
+            return False
+        logger.exception(f"Cannot determine if object exists for key {key}.")
+        return False
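
`download_input_files` captures the full output of the first `aws s3 cp` in memory before handing it to the second command. For large extract files, one alternative is to connect the two processes with an OS pipe. The sketch below shows that approach under stated assumptions: `pipe_commands` is a hypothetical helper that is not part of the commit, and the demo commands use the Python interpreter instead of the AWS CLI so the example runs without credentials.

```python
import subprocess
import sys


def pipe_commands(producer: list[str], consumer: list[str]) -> bytes:
    """Stream the producer's stdout into the consumer's stdin.

    Hypothetical helper, not part of the commit. Returns the consumer's
    stdout and raises CalledProcessError if either command fails.
    """
    with subprocess.Popen(producer, stdout=subprocess.PIPE) as source:
        # The consumer reads directly from the pipe, so this process
        # never holds the whole payload in memory.
        result = subprocess.run(
            consumer, stdin=source.stdout, capture_output=True, check=True
        )
    # Leaving the "with" block waits for the producer, so returncode is set.
    if source.returncode != 0:
        raise subprocess.CalledProcessError(source.returncode, producer)
    return result.stdout


if __name__ == "__main__":
    # Stand-ins for the two "aws s3 cp" commands in the commit.
    emit = [sys.executable, "-c", "print('record')"]
    upper_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    upper = [sys.executable, "-c", upper_code]
    print(pipe_commands(emit, upper))
```

In the commit's terms, `producer` would correspond to `copy_command` and `consumer` to `upload_command`. Whether streaming is worth the extra error handling depends on how large the extract files are.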

abdiff/helpers/docker-compose.yaml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Settings and configurations that are common for all containers
+x-minio-common: &minio-common
+  image: quay.io/minio/minio:RELEASE.2024-10-29T16-01-48Z
+  command: server --console-address ":9001" /mnt/data
+  ports:
+    - "9000:9000" # API port
+    - "9001:9001" # Console port
+  environment:
+    MINIO_ROOT_USER: minioadmin
+    MINIO_ROOT_PASSWORD: minioadmin
+  healthcheck:
+    test: ["CMD", "mc", "ready", "local"]
+    interval: 5s
+    timeout: 5s
+    retries: 5
+
+services:
+  minio:
+    <<: *minio-common
+    volumes:
+      # TODO: env var for absolute path
+      - ./data:/mnt/data
+
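
The Compose healthcheck probes the server every 5 seconds, with a 5 second timeout and up to 5 retries. A caller that starts the server with `make start-minio-server` and then immediately downloads files might want a similar wait on the client side. The sketch below is one way to do that, not something the commit contains. `wait_until_ready` and `minio_is_live` are hypothetical names, and the use of MinIO's `/minio/health/live` liveness path against the default endpoint is an assumption about the local setup.

```python
import time
import urllib.request
from collections.abc import Callable


def minio_is_live(endpoint: str = "http://localhost:9000", timeout: float = 5.0) -> bool:
    """Return True if the (assumed) MinIO liveness endpoint answers 200."""
    url = f"{endpoint.rstrip('/')}/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(
    probe: Callable[[], bool], retries: int = 5, interval: float = 5.0
) -> bool:
    """Call probe up to `retries` times, sleeping `interval` seconds between tries."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    print("MinIO ready:", wait_until_ready(minio_is_live))
```

Passing the probe as a callable keeps the retry loop testable without a running container.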
