Core Functions and CLI signatures

Graham Hukill edited this page Sep 25, 2024 · 6 revisions

Core Functions

init_job

  • arguments
    • job_directory: [required]
    • message: [optional] message to include in job.json
  • actions
    • creates job directory; throws an exception if it already exists
    • creates job.json with initial job details: job working directory and message
  • returns
    • job directory: str
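
A minimal sketch of init_job as described above. The job.json key names (job_directory, message) are assumptions for illustration, not a confirmed schema.

```python
import json
from pathlib import Path
from typing import Optional


def init_job(job_directory: str, message: Optional[str] = None) -> str:
    """Create a new job directory and write an initial job.json."""
    job_path = Path(job_directory)
    # exist_ok=False raises FileExistsError if the job directory already exists
    job_path.mkdir(parents=True, exist_ok=False)

    # key names below are assumptions for this sketch
    job_data = {"job_directory": str(job_path), "message": message}
    (job_path / "job.json").write_text(json.dumps(job_data, indent=2))
    return str(job_path)
```

Calling init_job twice with the same directory fails on the second call, which matches the "throw exception if exists" behavior.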

build_ab_images

  • arguments
    • job_directory: [required]
    • commit_sha_a: [required]
    • commit_sha_b: [required]
  • actions
    • builds A/B images of Transmogrifier based on the A/B git commit SHAs provided
    • updates job.json with these newly created Docker image names
  • returns
    • Docker A/B image names: tuple[str, str]
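
A rough sketch of the bookkeeping side of build_ab_images. The actual Docker build is left to a hypothetical build_image callable (commit SHA in, image name out), and the job.json keys image_name_a / image_name_b are assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def build_ab_images(
    job_directory: str,
    commit_sha_a: str,
    commit_sha_b: str,
    build_image: Callable[[str], str],  # hypothetical: builds an image, returns its name
) -> tuple[str, str]:
    """Build A/B images and record their names in job.json."""
    job_json = Path(job_directory) / "job.json"
    job_data = json.loads(job_json.read_text())

    image_a = build_image(commit_sha_a)
    image_b = build_image(commit_sha_b)

    # key names are assumptions for this sketch
    job_data.update({"image_name_a": image_a, "image_name_b": image_b})
    job_json.write_text(json.dumps(job_data, indent=2))
    return image_a, image_b
```

Passing the builder in as an argument keeps the sketch runnable without Docker; a real implementation would build from a checkout of each commit.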

init_run (NEW)

  • arguments
    • job_directory: [required]
    • message: [optional] message to include in run.json
  • actions
    • creates run directory as a sub-directory of <JOB>/runs
      • directory name is a YYYY-MM-DD_HH-MM-SS timestamp
    • throws an exception if the Job directory doesn’t exist
    • clones job.json and creates run.json with run details: run working directory, message, timestamp, etc.
  • returns
    • run directory: str
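
A minimal sketch of init_run under the same assumptions. The run.json keys (run_directory, run_message, run_timestamp) are hypothetical names chosen for illustration.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def init_run(job_directory: str, message: Optional[str] = None) -> str:
    """Create a timestamped run directory under <JOB>/runs and write run.json."""
    job_path = Path(job_directory)
    if not job_path.is_dir():
        raise FileNotFoundError(f"Job directory does not exist: {job_directory}")

    # run directory name is a YYYY-MM-DD_HH-MM-SS timestamp
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_path = job_path / "runs" / timestamp
    run_path.mkdir(parents=True)  # raises FileExistsError on a same-second collision

    # start from a copy of job.json, then add run details (key names are assumptions)
    run_data = json.loads((job_path / "job.json").read_text())
    run_data.update(
        {
            "run_directory": str(run_path),
            "run_message": message,
            "run_timestamp": timestamp,
        }
    )
    (run_path / "run.json").write_text(json.dumps(run_data, indent=2))
    return str(run_path)
```

With second-level timestamps, two runs started within the same second would collide; the sketch lets that surface as an error rather than overwrite an existing run.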

run_ab_transforms

  • arguments
    • run_directory: [required]
      • e.g. output/super-job/runs/2024-08-13_13-01-44
      • e.g. s3://my_bucket/transmog-ab/super-job/runs/2024-08-13_13-01-44
      • e.g. /my/weird/local/path/abstuff/...
    • image_name_a: [required] Docker image of Transmogrifier version A
    • image_name_b: [required] Docker image of Transmogrifier version B
    • input_files: [required] list[str], list of S3 (or local?) files to be transformed in A/B runs
  • actions
    • reads input files, runs A/B transforms for all files
    • creates transformed/a and transformed/b sub-directories under <JOB>/runs/<RUN> directory
    • updates run.json with the input files used
  • returns
    • tuple of a/b transformed files: tuple[list, list]
      • paths are relative to run directory, e.g.
(
  ["transformed/a/alma-01.json", "transformed/a/aspace.json", …],
  ["transformed/b/alma-01.json", "transformed/b/aspace.json", …]
)
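
To illustrate the shape of the return value only, here is a small sketch that derives the run-relative output paths from the input file names. The helper name transformed_paths and the rule "input stem plus .json" are assumptions; the actual output naming is decided by the transform step itself.

```python
from pathlib import PurePosixPath


def transformed_paths(input_files: list[str]) -> tuple[list[str], list[str]]:
    """Map input files to run-relative A/B output paths (naming rule is an assumption)."""
    paths_a, paths_b = [], []
    for input_file in input_files:
        # take the file name without its extension, for S3 URIs and local paths alike
        stem = PurePosixPath(input_file).stem
        paths_a.append(f"transformed/a/{stem}.json")
        paths_b.append(f"transformed/b/{stem}.json")
    return paths_a, paths_b
```

Keeping the paths relative to the run directory means the same tuple is valid whether the run lives on local disk or in S3.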

collate_ab_transforms

  • arguments
    • run_directory: [required]
    • transformed_files: [required] tuple of A/B transformed filepaths
  • actions
    • utilizes transformed A/B files from <JOB>/runs/<RUN>/transformed/a|b directories
    • creates dataset of parquet files under <JOB>/runs/<RUN>/collated/*.parquet
  • returns
    • directory of collated parquet dataset: str

calc_ab_diffs

  • arguments
    • run_directory: [required]
    • collated_directory: [required] directory filepath with collated parquet files
  • actions
    • creates diff for each record
    • creates dataset of parquet files under <JOB>/runs/<RUN>/diffs/*.parquet
  • returns
    • directory holding parquet files with diffs: str

calc_ab_metrics

  • arguments
    • run_directory: [required]
    • diff_file: [required] parquet filepath with collated records and diff
  • actions
    • generates metrics
    • writes metrics.json to run directory
      • open question: should the metrics also be added to run.json?
  • returns
    • metrics dictionary: dict

CLI commands

init-job

  • arguments
    • job-directory / d: [required] location where Job is to be created, can be anywhere
    • message / m: [optional] message added to job.json
      • both passed to init_job
    • transmogrifier-a-sha / a: [required] git commit SHA of transmogrifier to build
    • transmogrifier-b-sha / b: [required] git commit SHA of transmogrifier to build
      • both passed to build_ab_images
  • actions
    • runs init_job, then build_ab_images
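
The init-job arguments above could be declared roughly as follows. This sketch uses argparse from the standard library purely for illustration; the real CLI may be built on a different framework, and the help strings are paraphrased from the list above.

```python
import argparse


def build_init_job_parser() -> argparse.ArgumentParser:
    """Declare the init-job options with their short and long forms."""
    parser = argparse.ArgumentParser(prog="abdiff init-job")
    parser.add_argument("-d", "--job-directory", required=True,
                        help="location where the Job is to be created")
    parser.add_argument("-m", "--message", default=None,
                        help="message added to job.json")
    parser.add_argument("-a", "--transmogrifier-a-sha", required=True,
                        help="git commit SHA of Transmogrifier version A")
    parser.add_argument("-b", "--transmogrifier-b-sha", required=True,
                        help="git commit SHA of Transmogrifier version B")
    return parser
```

After parsing, job_directory and message would go to init_job, and the two SHAs to build_ab_images.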

run-diff

  • arguments
    • job-directory / d: [required] Job directory
    • message / m: [optional] message added to run.json
      • both passed to init_run
    • input-files / i: [required] list of S3 (or local?) files to be transformed in A/B runs
      • passed to run_ab_transforms
  • actions
    • reads job.json
      • the Transmogrifier image names are read from here and passed to run_ab_transforms
    • runs, in order: init_run, run_ab_transforms, collate_ab_transforms, calc_ab_diffs, calc_ab_metrics
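
A sketch of how run-diff might chain the core functions, with each step's return value feeding the next. Function names follow the Core Functions section. The core parameter (any object exposing those functions as attributes) and the job.json keys image_name_a / image_name_b are assumptions made to keep the example self-contained.

```python
import json
from pathlib import Path
from typing import Any, Optional


def run_diff(
    job_directory: str,
    input_files: list[str],
    core: Any,  # hypothetical: object exposing the core functions as attributes
    message: Optional[str] = None,
) -> dict:
    """Run one full A/B diff pass for a job and return the metrics."""
    # image names were recorded in job.json during init-job (key names assumed)
    job_data = json.loads((Path(job_directory) / "job.json").read_text())

    run_directory = core.init_run(job_directory, message)
    transformed_files = core.run_ab_transforms(
        run_directory,
        job_data["image_name_a"],
        job_data["image_name_b"],
        input_files,
    )
    collated_directory = core.collate_ab_transforms(run_directory, transformed_files)
    # the value returned by calc_ab_diffs is passed through unchanged
    diffs = core.calc_ab_diffs(run_directory, collated_directory)
    return core.calc_ab_metrics(run_directory, diffs)
```

Because every step takes the run directory plus the previous step's output, each one can also be re-run on its own against an existing run.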

view-diff

  • arguments
    • job-directory / d: [required] provide working directory of Job
  • actions
    • runs Flask app with an env var set to Job directory (this is then used throughout Flask app)
    • reads Job directory and crawls Runs directories, provides table of runs (with messages if included)
    • rest of functionality as outlined…

Example CLI commands

pipenv run abdiff init-job \
--job-directory output/super-job \
--transmogrifier-a-sha abc123 \
--transmogrifier-b-sha def456

# Run 1
pipenv run abdiff run-diff \
--job-directory output/super-job \
--input-files s3://alma.xml,s3://aspace.xml,...

# Run 2
pipenv run abdiff run-diff \
--job-directory output/super-job \
--input-files s3://libguides.xml