Skip to content

Commit 2c5a691

Browse files
Address comments
* Move download functionality to extras module * Update Makefile command for stopping MinIO server to use .env * Clarify instructions for running local MinIO server in README * Fix flag description for '--download-files' CLI option * Simplify Docker Compose YAML file * Simplify logging in 'download_input_files' * Log clear messages indicating download counts * Log clear messages * Raise RuntimeError if any files fail to download
1 parent baa1cef commit 2c5a691

File tree

9 files changed

+85
-61
lines changed

9 files changed

+85
-61
lines changed

Makefile

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
SHELL=/bin/bash
22
DATETIME:=$(shell date -u +%Y%m%dT%H%M%SZ)
3-
MINIO_COMPOSE_FILE=abdiff/helpers/minio/docker-compose.yaml
3+
MINIO_COMPOSE_FILE=abdiff/extras/minio/docker-compose.yaml
44

55
help: # Preview Makefile commands
66
@awk 'BEGIN { FS = ":.*#"; print "Usage: make <target>\n\nTargets:" } \
@@ -56,9 +56,12 @@ black-apply: # Apply changes with 'black'
5656
ruff-apply: # Resolve 'fixable errors' with 'ruff'
5757
pipenv run ruff check --fix .
5858

59-
# Development commands
59+
####################################
60+
# MinIO local S3 commands
61+
####################################
62+
6063
start-minio-server:
6164
docker compose --env-file .env -f $(MINIO_COMPOSE_FILE) up -d
6265

6366
stop-minio-server:
64-
docker compose -f $(MINIO_COMPOSE_FILE) stop
67+
docker compose --env-file .env -f $(MINIO_COMPOSE_FILE) stop

README.md

Lines changed: 9 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,13 +17,18 @@ Compare transformed TIMDEX records from two versions (A,B) of Transmogrifier.
1717

1818
### Running a Local MinIO Server
1919

20-
TIMDEX extract files from S3 (i.e., input files to use in transformations) can be downloaded to a local MinIO server hosted with a Docker container. [MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features](https://min.io/docs/minio/kubernetes/upstream/). Downloading extract files improves the runtime of a diff by reducing the number of requests sent to S3 and avoids repeated downloads of extract files.
20+
TIMDEX extract files from S3 (i.e., input files to use in transformations) can be downloaded to a local MinIO server hosted via a Docker container. [MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features](https://min.io/docs/minio/kubernetes/upstream/). The MinIO server acts as a "local S3 file system", allowing the app to access data on disk through an S3 interface. Since the MinIO server runs in a Docker container, it can be easily started when needed and stopped when not in use. Any data stored in the MinIO server will persist as long as the files exist in the directory specified for `MINIO_S3_LOCAL_STORAGE`.
21+
22+
Downloading extract files improves the runtime of a diff by reducing the number of requests sent to S3 and avoids AWS credentials timing out. Once an extract file is stored in the local MinIO server, the app can access the data from MinIO for all future runs that include the extract file, avoiding repeated downloads of data used across multiple runs.
23+
2124

2225
1. Configure your `.env` file. In addition to the [required environment variables](#required), the following environment variables must also be set:
2326

2427
```text
25-
MINIO_S3_LOCAL_STORAGE=<absolute-path-on-local-disk>
26-
TIMDEX_BUCKET=<timdex-extract-bucket>
28+
MINIO_S3_LOCAL_STORAGE=# full file system path to the directory where MinIO stores its object data on the local disk
29+
MINIO_ROOT_USER=# username for root user account for MinIO server
30+
MINIO_ROOT_PASSWORD=# password for root user account MinIO server
31+
TIMDEX_BUCKET=# when using CLI command 'timdex-sources-csv', this is required to know what TIMDEX bucket to use
2732
```
2833

2934
Note: There are additional variables required by the Local MinIO server (see vars prefixed with "MINIO" in [optional environment variables](#optional)). For these variables, defaults are provided in [abdiff.config](abdiff/config.py).
@@ -45,7 +50,7 @@ TIMDEX extract files from S3 (i.e., input files to use in transformations) can b
4550

4651
5. Proceed with A/B Diff CLI commands as needed!
4752

48-
Once a diff run is complete, you can stop the local MinIO server using the Makefile command: `make stop-minio-server`. If you're planning to run another diff using the same files -- good news! All you have to do is restart the local MinIO server. Your data will persist as long as the files exist in the directory you specified for `MINIO_S3_LOCAL_STORAGE`.
53+
Once a diff run is complete, you can stop the local MinIO server using the Makefile command: `make stop-minio-server`. If you're planning to run another diff using the same files, all you have to do is restart the local MinIO server. Your data will persist as long as the files exist in the directory you specified for `MINIO_S3_LOCAL_STORAGE`.
4954

5055
## Concepts
5156

abdiff/cli.py

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -150,7 +150,12 @@ def init_job(
150150
default="Not provided.",
151151
)
152152
@click.option(
153-
"--download-files", is_flag=True, help="Pass to skip download of extract files"
153+
"--download-files",
154+
is_flag=True,
155+
help=(
156+
"Pass to download input files from AWS S3 to a local Minio S3 server "
157+
"for Transmogrifier to use."
158+
),
154159
)
155160
def run_diff(
156161
job_directory: str, input_files: str, message: str, *, download_files: bool

abdiff/config.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,10 +32,12 @@ def __getattr__(self, name: str) -> Any: # noqa: ANN401
3232

3333
@property
3434
def minio_s3_url(self) -> str:
35+
"""Host for ABDiff context (host machine) to connect to MinIO."""
3536
return self.MINIO_S3_URL or "http://localhost:9000/"
3637

3738
@property
3839
def minio_s3_container_url(self) -> str:
40+
"""Host for Transmogrifier Docker containers to connect to MinIO."""
3941
return self.MINIO_S3_CONTAINER_URL or "http://host.docker.internal:9000/"
4042

4143
@property

abdiff/core/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,10 +7,10 @@
77
from abdiff.core.calc_ab_diffs import calc_ab_diffs
88
from abdiff.core.calc_ab_metrics import calc_ab_metrics
99
from abdiff.core.collate_ab_transforms import collate_ab_transforms
10-
from abdiff.core.download_input_files import download_input_files
1110
from abdiff.core.init_job import init_job
1211
from abdiff.core.init_run import init_run
1312
from abdiff.core.run_ab_transforms import run_ab_transforms
13+
from abdiff.extras.minio.download_input_files import download_input_files
1414

1515
__all__ = [
1616
"init_job",

abdiff/extras/minio/__init__.py

Whitespace-only changes.
Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
services:
2+
minio:
3+
image: quay.io/minio/minio:latest
4+
command: server --console-address ":9001" /mnt/data
5+
ports:
6+
- "9000:9000" # API port
7+
- "9001:9001" # Console port
8+
environment:
9+
MINIO_ROOT_USER: ${MINIO_ROOT_USER}
10+
MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
11+
healthcheck:
12+
test: ["CMD", "mc", "ready", "local"]
13+
interval: 5s
14+
timeout: 5s
15+
retries: 5
16+
volumes:
17+
- ${MINIO_S3_LOCAL_STORAGE}:/mnt/data

abdiff/core/download_input_files.py renamed to abdiff/extras/minio/download_input_files.py

Lines changed: 44 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -40,36 +40,53 @@ def download_input_files(input_files: list[str]) -> None:
4040
aws_secret_access_key=CONFIG.minio_root_password,
4141
)
4242

43-
for input_file in input_files:
44-
if check_object_exists(CONFIG.TIMDEX_BUCKET, input_file, s3_client):
45-
logger.info(f"File found for input: {input_file}. Skipping download.")
46-
continue
47-
48-
logger.info(f"Downloading input file from {CONFIG.TIMDEX_BUCKET}: {input_file}")
49-
copy_command = ["aws", "s3", "cp", input_file, "-"]
50-
upload_command = [
51-
"aws",
52-
"s3",
53-
"cp",
54-
"--endpoint-url",
55-
CONFIG.minio_s3_url,
56-
"--profile",
57-
"minio",
58-
"-",
59-
input_file,
60-
]
61-
43+
success_count = 0
44+
fail_count = 0
45+
for i, input_file in enumerate(input_files):
6246
try:
63-
copy_process = subprocess.run(
64-
args=copy_command, check=True, capture_output=True
65-
)
66-
subprocess.run(
67-
args=upload_command,
68-
check=True,
69-
input=copy_process.stdout,
47+
download_input_file(input_file, s3_client)
48+
success_count += 1
49+
logger.info(
50+
f"Input file: {i + 1} / {len(input_files)}: '{input_file}' "
51+
"available locally for transformation."
7052
)
7153
except subprocess.CalledProcessError:
72-
logger.exception(f"Failed to download input file: {input_file}")
54+
fail_count += 1
55+
logger.info(
56+
f"Input file: {i + 1} / {len(input_files)}: '{input_file}' "
57+
"failed to download."
58+
)
59+
logger.info(
60+
f"Available input files: {success_count}, missing input files: {fail_count}."
61+
)
62+
63+
if fail_count > 0:
64+
raise RuntimeError( # noqa: TRY003
65+
f"{fail_count} input file(s) failed to download."
66+
)
67+
68+
69+
def download_input_file(input_file: str, s3_client: S3Client) -> None:
70+
if check_object_exists(CONFIG.TIMDEX_BUCKET, input_file, s3_client):
71+
return
72+
copy_command = ["aws", "s3", "cp", input_file, "-"]
73+
upload_command = [
74+
"aws",
75+
"s3",
76+
"cp",
77+
"--endpoint-url",
78+
CONFIG.minio_s3_url,
79+
"--profile",
80+
"minio",
81+
"-",
82+
input_file,
83+
]
84+
copy_process = subprocess.run(args=copy_command, check=True, capture_output=True)
85+
subprocess.run(
86+
args=upload_command,
87+
check=True,
88+
input=copy_process.stdout,
89+
)
7390

7491

7592
def check_object_exists(bucket: str, input_file: str, s3_client: S3Client) -> bool:
@@ -79,7 +96,6 @@ def check_object_exists(bucket: str, input_file: str, s3_client: S3Client) -> bo
7996
except ClientError as exception:
8097
if exception.response["Error"]["Code"] == "404":
8198
return False
82-
logger.exception(f"Cannot determine if object exists for key {key}.")
8399
return False
84100
else:
85101
return True

abdiff/helpers/minio/docker-compose.yaml

Lines changed: 0 additions & 24 deletions
This file was deleted.

0 commit comments

Comments
 (0)