Commit 16ecf73

Add option to download input files using a local MinIO server (#49)
Why these changes are being introduced:
* Downloading extract files improves the performance of the app by reducing requests sent to AWS S3 and avoiding repeated downloads of extract files used across multiple container runs. Having extract files available on local disk also minimizes the occurrence of network issues or AWS credentials timing out during a transform. These changes introduce a locally hosted MinIO server to act as a "local S3 bucket" as part of the A/B diff workflow.

How this addresses that need:
* Add a Docker Compose YAML file to run a local MinIO server
* Add Makefile commands for starting and stopping the local MinIO server
* Add option '--download-files' to run-diff CLI command
* Add 'download_input_files' function to extras
* Update 'run_ab_transforms' to support use of the local MinIO server

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-353

1 parent cea0a91 commit 16ecf73

11 files changed: +469 -186 lines changed

Makefile

Lines changed: 11 additions & 0 deletions
@@ -1,5 +1,6 @@
 SHELL=/bin/bash
 DATETIME:=$(shell date -u +%Y%m%dT%H%M%SZ)
+MINIO_COMPOSE_FILE=abdiff/extras/minio/docker-compose.yaml

 help: # Preview Makefile commands
 	@awk 'BEGIN { FS = ":.*#"; print "Usage: make <target>\n\nTargets:" } \
@@ -54,3 +55,13 @@ black-apply: # Apply changes with 'black'

 ruff-apply: # Resolve 'fixable errors' with 'ruff'
 	pipenv run ruff check --fix .
+
+####################################
+# MinIO local S3 commands
+####################################
+
+start-minio-server:
+	docker compose --env-file .env -f $(MINIO_COMPOSE_FILE) up -d
+
+stop-minio-server:
+	docker compose --env-file .env -f $(MINIO_COMPOSE_FILE) stop

Pipfile

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ deepdiff = "*"

 [dev-packages]
 black = "*"
+boto3-stubs = {version = "*", extras = ["s3"]}
 coveralls = "*"
 freezegun = "*"
 ipython = "*"

Pipfile.lock

Lines changed: 224 additions & 178 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 42 additions & 0 deletions
@@ -15,6 +15,43 @@ Compare transformed TIMDEX records from two versions (A,B) of Transmogrifier.
 - To lint the repo: `make lint`
 - To run the app: `pipenv run abdiff --help`

+### Running a Local MinIO Server
+
+TIMDEX extract files from S3 (i.e., input files to use in transformations) can be downloaded to a local MinIO server hosted via a Docker container. [MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features](https://min.io/docs/minio/kubernetes/upstream/). The MinIO server acts as a "local S3 file system", allowing the app to access data on disk through an S3 interface. Since the MinIO server runs in a Docker container, it can be easily started when needed and stopped when not in use. Any data stored in the MinIO server will persist as long as the files exist in the directory specified for `MINIO_S3_LOCAL_STORAGE`.
+
+Downloading extract files improves the runtime of a diff by reducing the number of requests sent to S3 and avoiding AWS credential timeouts. Once an extract file is stored in the local MinIO server, the app can access the data from MinIO for all future runs that include the extract file, avoiding repeated downloads of data used across multiple runs.
+
+
+1. Configure your `.env` file. In addition to the [required environment variables](#required), the following environment variables must also be set:
+
+```text
+MINIO_S3_LOCAL_STORAGE=# full file system path to the directory where MinIO stores its object data on the local disk
+MINIO_ROOT_USER=# username for root user account for MinIO server
+MINIO_ROOT_PASSWORD=# password for root user account for MinIO server
+TIMDEX_BUCKET=# when using CLI command 'timdex-sources-csv', this is required to know what TIMDEX bucket to use
+```
+
+Note: There are additional variables required by the local MinIO server (see vars prefixed with "MINIO" in [optional environment variables](#optional)). For these variables, defaults are provided in [abdiff.config](abdiff/config.py).
+
+2. Create an AWS profile `minio`. When prompted for an "AWS Access Key ID" and "AWS Secret Access Key", pass the values set for the `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD` environment variables, respectively.
+```shell
+aws configure --profile minio
+```
+
+3. Launch a local MinIO server via Docker container by running the Makefile command:
+```shell
+make start-minio-server
+```
+
+The API is accessible at: http://127.0.0.1:9000.
+The WebUI is accessible at: http://127.0.0.1:9001.
+
+4. In your browser, navigate to the WebUI and sign in to the local MinIO server. Create a bucket in the local MinIO server named after the S3 bucket containing the TIMDEX extract files that will be used in the A/B Diff.
+
+5. Proceed with A/B Diff CLI commands as needed!
+
+Once a diff run is complete, you can stop the local MinIO server using the Makefile command: `make stop-minio-server`. If you're planning to run another diff using the same files, all you have to do is restart the local MinIO server. Your data will persist as long as the files exist in the directory you specified for `MINIO_S3_LOCAL_STORAGE`.
+
 ## Concepts

 A **Job** in `abdiff` represents the A/B test for comparing the results from two versions of Transmogrifier. When a job is first created, a working directory and a JSON file `job.json` with an initial set of configurations is created.
@@ -90,6 +127,11 @@ AWS_SESSION_TOKEN=# passed to Transmogrifier containers for use
 ### Optional

 ```text
+MINIO_S3_LOCAL_STORAGE=# full file system path to the directory where MinIO stores its object data on the local disk
+MINIO_S3_URL=# endpoint for MinIO server API; default is "http://localhost:9000/"
+MINIO_S3_CONTAINER_URL=# endpoint for the MinIO server when accessed from inside a Docker container; default is "http://host.docker.internal:9000/"
+MINIO_ROOT_USER=# username for root user account for MinIO server
+MINIO_ROOT_PASSWORD=# password for root user account for MinIO server
 WEBAPP_HOST=# host for flask webapp
 WEBAPP_PORT=# port for flask webapp
 TRANSMOGRIFIER_MAX_WORKERS=# max number of Transmogrifier containers to run in parallel; default is 6

abdiff/cli.py

Lines changed: 16 additions & 1 deletion
@@ -14,6 +14,7 @@
     calc_ab_diffs,
     calc_ab_metrics,
     collate_ab_transforms,
+    download_input_files,
     init_run,
     run_ab_transforms,
 )
@@ -148,7 +149,17 @@ def init_job(
     help="Message to describe Run.",
     default="Not provided.",
 )
-def run_diff(job_directory: str, input_files: str, message: str) -> None:
+@click.option(
+    "--download-files",
+    is_flag=True,
+    help=(
+        "Pass to download input files from AWS S3 to a local Minio S3 server "
+        "for Transmogrifier to use."
+    ),
+)
+def run_diff(
+    job_directory: str, input_files: str, message: str, *, download_files: bool
+) -> None:

     job_data = read_job_json(job_directory)
     run_directory = init_run(job_directory, message=message)
@@ -160,11 +171,15 @@ def run_diff(job_directory: str, input_files: str, message: str) -> None:
     else:
         input_files_list = [filepath.strip() for filepath in input_files.split(",")]

+    if download_files:
+        download_input_files(input_files_list)
+
     ab_transformed_file_lists = run_ab_transforms(
         run_directory=run_directory,
         image_tag_a=job_data["image_tag_a"],
         image_tag_b=job_data["image_tag_b"],
         input_files=input_files_list,
+        use_local_s3=download_files,
     )
     collated_dataset_path = collate_ab_transforms(
         run_directory=run_directory,
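
Reviewer note: the body of `download_input_files` is not shown in this part of the diff, so the following is only a guess at the kind of preprocessing it needs. The sketch assumes the input files are `s3://bucket/key` URIs (the docstring in `run_ab_transforms` says S3 URIs are accepted) and reuses the comma-splitting expression from `run_diff`. The helper names `parse_input_files` and `split_s3_uri`, and the example URIs, are hypothetical.

```python
from urllib.parse import urlparse


def parse_input_files(input_files: str) -> list[str]:
    """Split a comma-separated CLI value into a clean list of URIs."""
    return [filepath.strip() for filepath in input_files.split(",")]


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an S3 URI into (bucket, key), rejecting anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # Example values are made up for illustration.
    for uri in parse_input_files("s3://example-bucket/a.xml, s3://example-bucket/b.xml"):
        bucket, key = split_s3_uri(uri)
        print(bucket, key)
```

Because README step 4 asks for a MinIO bucket named after the source S3 bucket, the same `(bucket, key)` pair would identify the object in both places.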

abdiff/config.py

Lines changed: 23 additions & 0 deletions
@@ -11,6 +11,11 @@ class Config:
         "WORKSPACE",
     )
     OPTIONAL_ENV_VARS = (
+        "MINIO_S3_LOCAL_STORAGE",
+        "MINIO_S3_URL",
+        "MINIO_S3_CONTAINER_URL",
+        "MINIO_ROOT_USER",
+        "MINIO_ROOT_PASSWORD",
         "WEBAPP_HOST",
         "WEBAPP_PORT",
         "TRANSMOGRIFIER_MAX_WORKERS",
@@ -25,6 +30,24 @@ def __getattr__(self, name: str) -> Any: # noqa: ANN401
         message = f"'{name}' not a valid configuration variable"
         raise AttributeError(message)

+    @property
+    def minio_s3_url(self) -> str:
+        """Host for ABDiff context (host machine) to connect to MinIO."""
+        return self.MINIO_S3_URL or "http://localhost:9000/"
+
+    @property
+    def minio_s3_container_url(self) -> str:
+        """Host for Transmogrifier Docker containers to connect to MinIO."""
+        return self.MINIO_S3_CONTAINER_URL or "http://host.docker.internal:9000/"
+
+    @property
+    def minio_root_user(self) -> str:
+        return self.MINIO_ROOT_USER or "minioadmin"
+
+    @property
+    def minio_root_password(self) -> str:
+        return self.MINIO_ROOT_PASSWORD or "minioadmin"
+
     @property
     def webapp_host(self) -> str:
         return self.WEBAPP_HOST or "localhost"
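
Reviewer note: the new properties rely on `__getattr__` returning the raw environment value and on `or` supplying a default. Only the tail of `__getattr__` is visible in this diff, so the lookup logic in the sketch below is an assumption. `SketchConfig` is a hypothetical stand-in, not the real `Config` class.

```python
import os
from typing import Any


class SketchConfig:
    """Hypothetical, trimmed-down model of the env-var fallback pattern."""

    OPTIONAL_ENV_VARS = ("MINIO_S3_URL",)

    def __getattr__(self, name: str) -> Any:
        # Only invoked when normal attribute lookup fails, so properties
        # and class attributes are never routed through here.
        if name in self.OPTIONAL_ENV_VARS:
            return os.getenv(name)  # None when the variable is unset
        message = f"'{name}' not a valid configuration variable"
        raise AttributeError(message)

    @property
    def minio_s3_url(self) -> str:
        # `or` falls back for both None (unset) and "" (set but empty).
        return self.MINIO_S3_URL or "http://localhost:9000/"


if __name__ == "__main__":
    print(SketchConfig().minio_s3_url)
```

One consequence of using `or` is that an empty value in `.env` behaves the same as a missing one, which is probably the intent here.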

abdiff/core/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -10,11 +10,13 @@
 from abdiff.core.init_job import init_job
 from abdiff.core.init_run import init_run
 from abdiff.core.run_ab_transforms import run_ab_transforms
+from abdiff.extras.minio.download_input_files import download_input_files

 __all__ = [
     "init_job",
     "init_run",
     "build_ab_images",
+    "download_input_files",
     "run_ab_transforms",
     "collate_ab_transforms",
     "calc_ab_diffs",

abdiff/core/run_ab_transforms.py

Lines changed: 32 additions & 7 deletions
@@ -35,6 +35,8 @@ def run_ab_transforms(
     image_tag_b: str,
     input_files: list[str],
     docker_client: docker.client.DockerClient | None = None,
+    *,
+    use_local_s3: bool = False,
 ) -> tuple[list[str], ...]:
     """Run Docker containers with versioned images of Transmogrifier.

@@ -59,6 +61,10 @@ def run_ab_transforms(
             URIs for input files on S3 are accepted.
         docker_client (docker.client.DockerClient | None, optional): Docker client.
             Defaults to None.
+        use_local_s3 (bool): Boolean indicating whether the container should
+            access input files from a local MinIO server (i.e., "local S3 bucket")
+            or from AWS S3. This flag determines the appropriate environment variables
+            to set for the Docker containers. Default is False.

     Returns:
         tuple[list[str], ...]: A tuple containing two lists, where each list contains
@@ -95,7 +101,9 @@ def run_ab_transforms(
     ]

     # run containers and collect results
-    futures = run_all_docker_containers(docker_client, input_files, run_configs)
+    futures = run_all_docker_containers(
+        docker_client, input_files, run_configs, use_local_s3=use_local_s3
+    )
     containers, exceptions = collect_container_results(futures)
     logger.info(
         f"Successful containers: {len(containers)}, failed containers: {len(exceptions)}"
@@ -129,6 +137,8 @@ def run_all_docker_containers(
     docker_client: docker.client.DockerClient,
     input_files: list[str],
     run_configs: list[tuple],
+    *,
+    use_local_s3: bool = False,
 ) -> list[Future]:
     """Invoke Docker containers to run in parallel via threads.

@@ -152,7 +162,11 @@ def run_all_docker_containers(
                     get_transformed_filename(filename_details),
                     docker_client,
                 )
-                tasks.append(executor.submit(run_docker_container, *args))
+                tasks.append(
+                    executor.submit(
+                        run_docker_container, *args, use_local_s3=use_local_s3
+                    )
+                )

     logger.info(f"All {len(tasks)} containers have exited.")
     return tasks
@@ -166,12 +180,27 @@ def run_docker_container(
     output_file: str,
     docker_client: docker.client.DockerClient,
     timeout: int = CONFIG.transmogrifier_timeout,
+    *,
+    use_local_s3: bool = False,
 ) -> tuple[Container, Exception | None]:
     """Run Transmogrifier via Docker container to transform input file.

     The container is run in a detached state to capture a container handle for later use
     but this function waits for the container to exit before returning.
     """
+    if use_local_s3:
+        environment_variables = {
+            "AWS_ENDPOINT_URL": CONFIG.minio_s3_container_url,
+            "AWS_ACCESS_KEY_ID": CONFIG.minio_root_user,
+            "AWS_SECRET_ACCESS_KEY": CONFIG.minio_root_password,
+        }
+    else:
+        environment_variables = {
+            "AWS_ACCESS_KEY_ID": CONFIG.AWS_ACCESS_KEY_ID,
+            "AWS_SECRET_ACCESS_KEY": CONFIG.AWS_SECRET_ACCESS_KEY,
+            "AWS_SESSION_TOKEN": CONFIG.AWS_SESSION_TOKEN,
+        }
+
     container = docker_client.containers.run(
         docker_image,
         command=[
@@ -180,11 +209,7 @@ def run_docker_container(
             f"--source={source}",
         ],
         detach=True,
-        environment={
-            "AWS_ACCESS_KEY_ID": CONFIG.AWS_ACCESS_KEY_ID,
-            "AWS_SECRET_ACCESS_KEY": CONFIG.AWS_SECRET_ACCESS_KEY,
-            "AWS_SESSION_TOKEN": CONFIG.AWS_SESSION_TOKEN,
-        },
+        environment=environment_variables,
         labels={
             "docker_image": docker_image,
             "source": source,

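Reviewer note: the branch added to `run_docker_container` could be pulled out into a pure function, which would make the two credential sets easy to unit test without Docker. The sketch below is one way to do that. `SketchSettings` and `container_environment` are hypothetical names, and the credential values are placeholders. It also assumes the AWS SDK inside the Transmogrifier image honors `AWS_ENDPOINT_URL`, which this diff does not confirm.

```python
from dataclasses import dataclass


@dataclass
class SketchSettings:
    """Hypothetical stand-in for the app's CONFIG object."""

    minio_s3_container_url: str = "http://host.docker.internal:9000/"
    minio_root_user: str = "minioadmin"
    minio_root_password: str = "minioadmin"
    aws_access_key_id: str = "example-key-id"
    aws_secret_access_key: str = "example-secret"
    aws_session_token: str = "example-token"


def container_environment(
    settings: SketchSettings, *, use_local_s3: bool
) -> dict[str, str]:
    """Build the environment passed to a Transmogrifier container."""
    if use_local_s3:
        # MinIO root credentials plus an endpoint override; no session token.
        return {
            "AWS_ENDPOINT_URL": settings.minio_s3_container_url,
            "AWS_ACCESS_KEY_ID": settings.minio_root_user,
            "AWS_SECRET_ACCESS_KEY": settings.minio_root_password,
        }
    # Regular AWS credentials, including the temporary session token.
    return {
        "AWS_ACCESS_KEY_ID": settings.aws_access_key_id,
        "AWS_SECRET_ACCESS_KEY": settings.aws_secret_access_key,
        "AWS_SESSION_TOKEN": settings.aws_session_token,
    }
```

The asymmetry is worth noting in review: the local branch adds `AWS_ENDPOINT_URL` and drops `AWS_SESSION_TOKEN`, so the container never sees real AWS credentials when MinIO is in use.
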
abdiff/extras/minio/__init__.py

Whitespace-only changes.

abdiff/extras/minio/docker-compose.yaml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+services:
+  minio:
+    image: quay.io/minio/minio:latest
+    command: server --console-address ":9001" /mnt/data
+    ports:
+      - "9000:9000" # API port
+      - "9001:9001" # Console port
+    environment:
+      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
+      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
+    healthcheck:
+      test: ["CMD", "mc", "ready", "local"]
+      interval: 5s
+      timeout: 5s
+      retries: 5
+    volumes:
+      - ${MINIO_S3_LOCAL_STORAGE}:/mnt/data
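
Reviewer note: the Compose healthcheck runs inside the container, so the host process has no direct signal that the server is ready before `--download-files` is used. A host-side probe against MinIO's liveness endpoint (`/minio/health/live`) is one option, sketched below with only the standard library. The function names are hypothetical, and the default URL mirrors the `MINIO_S3_URL` default from `abdiff/config.py`.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the liveness URL from the MinIO API base URL."""
    return base_url.rstrip("/") + "/minio/health/live"


def minio_is_live(
    base_url: str = "http://localhost:9000/", timeout: float = 2.0
) -> bool:
    """Return True if the MinIO liveness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: covers connection
        # refused, timeouts, and non-2xx responses.
        return False


if __name__ == "__main__":
    print("MinIO live:", minio_is_live())
```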
