Skip to content

All/task/devopsdsc 917 dvc windows#816

Draft
matthijs-deltares wants to merge 12 commits intomainfrom
all/task/DEVOPSDSC-917-dvc-windows
Draft

All/task/devopsdsc 917 dvc windows#816
matthijs-deltares wants to merge 12 commits intomainfrom
all/task/DEVOPSDSC-917-dvc-windows

Conversation

@matthijs-deltares
Copy link
Copy Markdown
Contributor

What was done

This pull request introduces significant improvements to the handling of DVC-based test cases in the Deltares testbench system. The main enhancement is the implementation of a batch download process for DVC test case data, which improves efficiency and reliability. Additionally, the code now better manages dependencies and credentials for DVC operations, and includes several refactorings for clarity and maintainability.

  • Added a new method to batch-download all .dvc files for test cases using DVC, including dependencies
  • Introduced a context manager to temporarily inject AWS S3/MinIO credentials as environment variables during DVC operations
  • Modified DVC repository initialization to use NoSCM() for cases where source control management is not required
  • Updated .dvc/config to increase the S3 read timeout, reducing the risk of timeouts during large downloads. (.dvc/config)
  • Adjusted TeamCity and .gitignore settings to support new DVC credentials and ensure .dvc files are not ignored.

Evidence of the work done

  • Video/figures
    <add video/figures if applicable>
  • Clear from the issue description
  • Not applicable

Tests

  • Tests updated
    <add testcase numbers if applicable, Issue number>
  • Not applicable

Documentation

  • Documentation updated
    <add description of changes if applicable, Issue number>
  • Not applicable

Issue link

Implement DVC batch download in test_set_runner and dvc_handler.
Add tests for DvcHandler and ComparisonRunner.
Add dependency support with version-suffix disambiguation for shared
dependencies referenced with different timestamps across XMLs.
Add dry-run mode with MinIO presence checks for cases, references,
dependencies, and doc folders.
Add testcase verification and error handling with log rotation.
Replace tuple types with frozen dataclasses (DependencyKey, AwsCredentials).
Add DVC access key and secret parameters to TeamCity settings.
Remove duplicate nearfield test case entry from testbench table.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Deltares testbench’s DVC integration and the MinIO→DVC migration tooling, focusing on more efficient DVC downloads (batch fetch/checkout), improved dependency handling, and better operational robustness (logging, timeouts, and CI parameterization).

Changes:

  • Added batch DVC download support (including temporary AWS credential injection via env vars) and updated the test runner to prefetch all DVC targets up front.
  • Enhanced the MinIO→DVC migration tool to detect missing MinIO data, migrate dependency references, and improve logging/CLI ergonomics (e.g., --dry-run, log rotation).
  • Updated CI/configuration and tests to support the new DVC behavior and expanded path pattern handling.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
test/deltares_testbench/tools/minio_dvc_migration/xml_file_with_testcase_data.py Migrates XMLs to DVC conventions; adds dependency rewrite + MinIO presence checks + dependency version mapping helpers.
test/deltares_testbench/tools/minio_dvc_migration/testcase_data.py Adds dependency fields/logic, MinIO presence checks, and DVC tracking helpers for dependencies.
test/deltares_testbench/tools/minio_dvc_migration/s3_client.py Refactors AWS credentials loading to a dataclass.
test/deltares_testbench/tools/minio_dvc_migration/migrate_xmls.py Adds --dry-run, log rotation, and refactors migration workflow into run_migration().
test/deltares_testbench/test/utils/handlers/test_dvc_handler.py New unit tests for DvcHandler batch download + env restoration + root detection.
test/deltares_testbench/test/tools/minio_dvc_migration/test_xml_file_with_testcase_data.py Adds tests for dependency migration, dependency deduping, and version suffix mapping.
test/deltares_testbench/test/tools/minio_dvc_migration/test_testcase_data.py Adds tests for dependency handling and MinIO presence helpers; refactors helper return type.
test/deltares_testbench/test/tools/minio_dvc_migration/test_s3_url_info.py Extends S3→local path mapping test coverage for deeper case paths.
test/deltares_testbench/test/tools/minio_dvc_migration/test_migrate_xmls.py Adds tests for TeeStream and _rotate_log_file().
test/deltares_testbench/test/suite/test_ComparisonRunner.py Updates expectations to validate batch DVC preparation instead of per-case DVC download logging.
test/deltares_testbench/test/helpers/xml_config_helper.py Fixes dependency XML helper to write cases_path (not version) into <dependency> element text.
test/deltares_testbench/src/utils/logging/file_logger.py Makes stderr writes robust for non-str message inputs.
test/deltares_testbench/src/utils/handlers/handler_factory.py Avoids creating empty destination directories for DVC downloads.
test/deltares_testbench/src/utils/handlers/dvc_handler.py Introduces AWS env credential context manager, NoSCM repo init, and batch fetch/checkout API.
test/deltares_testbench/src/suite/test_set_runner.py Adds DVC batch preparation, includes dependencies, and adjusts dependency download logic.
test/deltares_testbench/.gitignore Attempts to allow tracking .dvc files under data/.
ci/teamcity/Delft3D/vars/dimr_testbench_table.csv Removes a TeamCity test row entry.
ci/teamcity/Delft3D/settings.kts Adds TeamCity parameters for DVC testbench credentials.
.dvc/config Increases S3 read timeout for the DVC remote.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +528 to +545
def __collect_dependency_dvc_files(self, dvc_configs: list[TestCaseConfig]) -> list[str]:
"""Collect .dvc file paths for DVC dependencies so they are included in the batch download."""
seen: set[str] = set()
dvc_files: list[str] = []
for config in dvc_configs:
if not config.dependency or config.dependency.version != "DVC":
continue
location = next((loc for loc in config.locations if loc.type == PathType.INPUT), None)
if location is None:
continue
dep_dvc_path = os.path.abspath(
Paths().rebuildToLocalPath(
Paths().mergeFullPath(location.root, location.from_path, config.dependency.cases_path, "input.dvc")
)
)
if dep_dvc_path not in seen and os.path.isfile(dep_dvc_path):
seen.add(dep_dvc_path)
dvc_files.append(dep_dvc_path)
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DVC dependency batch-include path assumes every dependency is tracked as <cases_path>/input.dvc. However the migration tool tracks dependencies by running repo.add() on the dependency directory itself, which produces <cases_path>.dvc (next to the directory), not input.dvc inside it. This mismatch will cause DVC dependencies to be skipped from the batch download and later copy steps to fail. Build the dependency .dvc path consistent with how dependencies are added (likely ${location.root}/${config.dependency.cases_path}.dvc, without appending input.dvc).

Copilot uses AI. Check for mistakes.
Comment on lines +978 to +1002
if dependency_version == "DVC":
# DVC dependencies are already checked out at {location.root}/{cases_path}/input/.
# Place the dependency as a sibling of the test case's input directory,
# so that relative paths like ../local_dir resolve correctly from input_work.
local_path = Paths().rebuildToLocalPath(
Paths().mergeFullPath(location.root, config.path.prefix, config.dependency.local_dir)
)
if os.path.isdir(local_path) and any(os.scandir(local_path)):
logger.info("Dependency directory already exists: Skipping download")
return

source_input_path = Path(
Paths().rebuildToLocalPath(
Paths().mergeFullPath(location.root, location.from_path, config.dependency.cases_path, "input")
)
)
if source_input_path.is_dir():
logger.info(f"Copying DVC dependency from {source_input_path} to {local_path}")
shutil.copytree(str(source_input_path), local_path, dirs_exist_ok=True)
else:
logger.error(
f"DVC dependency source not found at {source_input_path}. "
"Ensure the dependency's .dvc files have been checked out."
)
return
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the dependency_version == "DVC" branch, the code copies from <cases_path>/input, but dependencies are treated elsewhere as plain directories (downloaded/tracked at <cases_path>). Copying from .../input will fail if the dependency folder doesn't have an input/ subfolder (and it also depends on location.from_path, which isn't used for DVC case paths). Align the source path with how dependencies are added to DVC (copy from the checked-out dependency directory itself, and compute it without appending /input).

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +28
objects = list(rewinder.list_objects(prefix, timestamp))
return "found" if objects else "missing"
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_check_s3_prefix() materializes the entire object listing with list(...) just to test existence. For large prefixes this can be very slow and memory-heavy. Prefer a short-circuit approach (e.g., next(rewinder.list_objects(...), None) or an iterator any(...)) so you only fetch the first key needed to decide found/missing.

Suggested change
objects = list(rewinder.list_objects(prefix, timestamp))
return "found" if objects else "missing"
first_object = next(rewinder.list_objects(prefix, timestamp), None)
return "found" if first_object is not None else "missing"

Copilot uses AI. Check for mistakes.
Comment on lines +149 to +153
"""Update dependency elements: set version to 'DVC' and rename path to DVC convention.

The dependency folder name is renamed by prepending 'c00_' if it doesn't start with 'c'.
A version suffix (e.g. '_v1') is appended when the dependency_version_map contains a mapping
for the (s3_path, version) pair, which happens when a dependency path has multiple versions.
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring claims the dependency folder is renamed by prepending 'c00_' when it doesn’t start with 'c', but the implementation calls rename_dependency_path_for_dvc() which currently only appends a version suffix and does not do any 'c00_' prefixing. Please update the docstring to match the actual behavior, or implement the documented rename so downstream code/tests aren’t misled about the resulting folder structure.

Suggested change
"""Update dependency elements: set version to 'DVC' and rename path to DVC convention.
The dependency folder name is renamed by prepending 'c00_' if it doesn't start with 'c'.
A version suffix (e.g. '_v1') is appended when the dependency_version_map contains a mapping
for the (s3_path, version) pair, which happens when a dependency path has multiple versions.
"""Update dependency elements: set version to 'DVC' and rewrite the dependency path.
The dependency path is normalized by `rename_dependency_path_for_dvc`.
A version suffix (for example, '_v1') is appended when `dependency_version_map`
contains a mapping for the `(s3_path, version)` pair, which happens when a
dependency path has multiple versions.

Copilot uses AI. Check for mistakes.
Comment on lines +306 to +326
def build_dependency_version_map(
xmls: list[XmlFileWithTestCaseData],
) -> dict[DependencyKey, str]:
"""Build a mapping from DependencyKey to a version suffix.

When a dependency S3 path is referenced with multiple distinct versions across
the provided XMLs, each version gets a suffix like '_v1', '_v2', etc. sorted
chronologically. Paths that have only one version across all XMLs are omitted
from the map (they don't need a suffix).
"""
path_versions: dict[str, set[str]] = defaultdict(set)
for xml in xmls:
for tc in xml.testcases:
if tc.has_dependency and tc.dependency_version:
path_versions[tc.dependency_s3_path].add(tc.dependency_version)

version_map: dict[DependencyKey, str] = {}
for s3_path, versions in path_versions.items():
if len(versions) > 1:
for i, version in enumerate(sorted(versions), start=1):
version_map[DependencyKey(s3_path, version)] = f"_v{i}"
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_dependency_version_map() treats any non-empty dependency_version as a sortable version string, and uses sorted(versions) while the docstring says versions are sorted chronologically. If versions can include non-timestamps (e.g. "NO VERSION") or multiple timestamp formats, lexicographic sorting and including non-parseable values can produce unstable or incorrect suffix assignments. Consider filtering to versions where rewind_timestep_2_datetime() succeeds and sorting by the parsed datetime rather than raw strings.

Copilot uses AI. Check for mistakes.
@@ -1,5 +1,6 @@
/logs/**
/data/**
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The negation !/data/**/*.dvc likely won’t take effect because /data/** also ignores the intermediate directories under /data, and Git won’t descend into ignored directories to see re-included files. To reliably track .dvc files under data/, you typically also need to unignore the directories (e.g. !/data/**/) or narrow the original ignore pattern so the directories remain visible.

Suggested change
/data/**
/data/**
!/data/**/

Copilot uses AI. Check for mistakes.
Comment on lines +100 to +152
def run_migration(xmls_to_migrate: list[XmlFileWithTestCaseData], dry_run: bool = False) -> None:
"""Execute the migration workflow.

When dry_run is True, only checks MinIO for file existence without
downloading, adding to DVC, pushing, or rewriting XMLs.
"""
rewinder = setup_minio_rewinder(BASE_URL)

# Always validate DVC repo and remote configuration
repo_root = find_dvc_root_in_parent_directories(Path(__file__).resolve())
print(f"Found DVC repo at: {repo_root}")

config_path = repo_root / ".dvc" / "config"
if not config_path.exists():
raise FileNotFoundError(f"DVC config not found: {config_path}")
config_text = config_path.read_text()
if 'remote "storage"' not in config_text:
raise ValueError("DVC remote 'storage' is not configured in .dvc/config")
print("DVC remote 'storage' is configured")

repo = Repo(str(repo_root))
print("DVC repository initialized successfully")

total_found = 0
total_missing = 0
total_errors = 0
total_doc_moves = 0
total_skipped = 0
failed_xmls: list[str] = []

for i, xml_file in enumerate(xmls_to_migrate, start=1):
print(f"\n[{i}/{len(xmls_to_migrate)}] {xml_file.xml_file.name} ({len(xml_file.testcases)} testcases)")

if dry_run:
counts = xml_file.check_minio_presence(rewinder)
total_found += counts["found"]
total_missing += counts["missing"]
total_errors += counts["errors"]
total_doc_moves += counts["doc_moves"]
else:
try:
skipped = xml_file.verify_and_filter_testcases(rewinder)
total_skipped += len(skipped)
if not xml_file.testcases:
print(f" WARNING: All testcases missing on MinIO, skipping XML entirely")
continue

xml_file.download_from_minio_in_new_folder_structure(rewinder=rewinder)
xml_file.move_testcases_doc_folder_to_parent()
dvc_files = []
dvc_files.extend(xml_file.add_to_dvc(repo=repo))
push_dvc_files_to_remote(repo, dvc_files)
xml_file.migrate_xml_to_dvc()
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_migration() never builds/applies the dependency version suffix mapping, even though XmlFileWithTestCaseData supports dependency_version_map and helpers exist. Without applying this map before downloading/adding to DVC/migrating XML, two XMLs referencing the same dependency path with different versions can be downloaded into the same local directory and then tracked under the same .dvc, causing data corruption/overwrites. Build the map once for xmls_to_migrate (e.g., via build_dependency_version_map + apply_dependency_version_map) before the per-XML loop so both filesystem paths and rewritten XML dependency paths get the correct suffixes.

Copilot uses AI. Check for mistakes.
Comment on lines +120 to +121
repo = Repo(str(repo_root))
print("DVC repository initialized successfully")
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repo(str(repo_root)) will use the default SCM integration (typically Git). This script is intended to run in environments where SCM may be absent/unneeded (and the PR description mentions NoSCM()), so initialization can fail unnecessarily. Consider constructing the repo with scm=NoSCM() (and importing it) to match DvcHandler and avoid requiring a Git checkout.

Copilot uses AI. Check for mistakes.
@matthijs-deltares matthijs-deltares marked this pull request as draft April 20, 2026 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants