Conversation
492ecc5 to
e59134e
Compare
3af7c34 to
f187874
Compare
| env["OCL_ICD_VENDORS"] = f"{OCL_ICD_VENDORS}/" | ||
|
|
||
|
|
||
| def execute_tests(env): |
There was a problem hiding this comment.
do these tests have a smoke test filter? something like this: https://github.com/ROCm/TheRock/blob/main/docs/development/test_filtering.md
There was a problem hiding this comment.
as of now, there is no smoke tests for this test.
this is already a shortened test. Maybe we can look at adding smoke tests in the future, needs more work identifying and trimming the tests
There was a problem hiding this comment.
yes let's open tix for filtering, as we probably want to do expanded testing later
| "ocltst": { | ||
| "job_name": "ocltst", | ||
| "fetch_artifact_args": "--tests", | ||
| "timeout_minutes": 120, |
There was a problem hiding this comment.
from this, it looks like tests takes a total of 8 hours.... that's a pretty insane amount of time, especially for a developer waiting on PR checks.
are there any filters we can use? we are getting teams to start using this format: https://github.com/ROCm/TheRock/blob/main/docs/development/test_filtering.md so we get fast turn around times with the machines we have
There was a problem hiding this comment.
reduced it to 20 mins. that should be enough to verify
There was a problem hiding this comment.
actually needed 30 mins of timeout
cleanup
image contains the ICD loader package
log for LD_LIBRARY_PATH and OCL_ICD_VENDORS
ScottTodd
left a comment
There was a problem hiding this comment.
Detailed review notes: https://github.com/ScottTodd/claude-rocm-workspace/blob/main/reviews/pr_TheRock_3590.md (this was generated via some back and forth review of the PR, looking at logs, etc.). I expect this level of analysis from PR authors too, so a PR description of just "Technical Details: enable ocltst execution on linux and windows" is insufficient.
I've added one inline comment for the major architectural issue. I'll wait to comment on the smaller details until that is resolved.
| # copies the dlls to local ocltst path. | ||
| # to overwrite the registry entries | ||
| def copy_dlls_exe_path(ocltst_path): | ||
| if platform.system() == "Windows": | ||
| # hip and comgr dlls need to be copied to the same folder as exectuable | ||
| dlls_pattern = ["amdocl*.dll", "amd_comgr*.dll", "OpenCL.dll"] | ||
| dlls_to_copy = [] | ||
| for pattern in dlls_pattern: | ||
| dlls_to_copy.extend(THEROCK_BIN_DIR.glob(pattern)) | ||
| for dll in dlls_to_copy: | ||
| try: | ||
| shutil.copy(dll, ocltst_path) | ||
| logging.info(f"++ Copied: {dll} to {ocltst_path}") | ||
| except Exception as e: | ||
| logging.info(f"++ Error copying {dll}: {e}") | ||
|
|
||
|
|
||
| # returns ocltst path | ||
| def setup_env(env): | ||
| ROCM_PATH = Path(THEROCK_DIR) | ||
| env["ROCM_PATH"] = str(ROCM_PATH) | ||
| if not is_windows: | ||
| ROCK_LIB_PATH = Path(THEROCK_DIR) / "lib" | ||
| OCL_LIB = Path(ROCK_LIB_PATH) / "opencl" | ||
| LLVM_LIB = Path(ROCK_LIB_PATH) / "llvm" / "lib" | ||
| ROCM_SYSDEPS_LIB = Path(ROCK_LIB_PATH) / "rocm_sysdeps" / "lib" | ||
| OCL_ICD_VENDORS = Path(THEROCK_DIR) / "etc" / "OpenCL" / "vendors" | ||
| OCLTST_PATH = Path(THEROCK_DIR) / "share" / "opencl" / "ocltst" | ||
| LD_LIBRARY_PATH = os.getenv("LD_LIBRARY_PATH") | ||
| if LD_LIBRARY_PATH is not None: | ||
| LD_LIBRARY_PATH = Path(LD_LIBRARY_PATH) | ||
| env["LD_LIBRARY_PATH"] = ( | ||
| f"{ROCK_LIB_PATH}:{OCL_LIB}:{LLVM_LIB}:{ROCM_SYSDEPS_LIB}:{LD_LIBRARY_PATH}:{OCLTST_PATH}" | ||
| ) | ||
| env["OCL_ICD_VENDORS"] = f"{OCL_ICD_VENDORS}/" | ||
| else: | ||
|
|
||
| OCLTST_PATH = Path(THEROCK_DIR) / "tests" / "ocltst" | ||
| copy_dlls_exe_path(OCLTST_PATH) | ||
| OCL_DLL_FILE = Path(OCLTST_PATH) / "amdocl64.dll" | ||
| OCL_ICD_DLL = Path(THEROCK_BIN_DIR) / "OpenCL.dll" | ||
| env["OCL_ICD_FILENAMES"] = str(OCL_DLL_FILE) | ||
| return OCLTST_PATH |
There was a problem hiding this comment.
I believe we can (and should) remove nearly all of this setup code from this test script if we fix https://github.com/ROCm/rocm-systems/blob/be8076461de7f44e21a0f273d94f5872c89c1e17/projects/clr/opencl/tests/ocltst/CMakeLists.txt#L7-L11 to install ocltst into a standard install location like bin/ instead of a new location like tests/.
See for example what hipblaslt does:
- https://github.com/ROCm/rocm-libraries/blob/670f8557bbf5062bedaea552827733f245fdbccc/projects/hipblaslt/CMakeLists.txt#L504-L518
- https://github.com/ROCm/rocm-cmake/blob/develop/share/rocmcmakebuildtools/cmake/ROCMInstallTargets.cmake
Or a more recent / simpler case like dnn-providers/hipblaslt-provider: https://github.com/ROCm/rocm-libraries/blob/94453c6cb85db167e96e3f463a12d808734674b5/dnn-providers/hipblaslt-provider/cmake/Tests.cmake#L180-L192
The guiding principles are this:
- Tests should be runnable from a build directory (e.g. via
ctest) - Component development should use the same testing workflow as CI systems
- Scripts for CI like these in
build_tools/github_actions/test_executable_scripts/should be thin wrappers around test runners / executables that possibly add filtering and sharding options applicable to CI. They should NOT be structurally modifying the artifact directory in any way. We granted a narrow exception for that to hip-tests in Enable hip-tests on TheRock #2001 but if this is becoming a pattern we need to fix it and not dig the whole deeper
See more detailed analysis of the issue and recommendations to fixing it here: https://github.com/ScottTodd/claude-rocm-workspace/blob/main/reviews/pr_TheRock_3590.md#1-install-to-bin-to-eliminate-dll-copy-and-ld_library_path-workarounds
I ran local experiments with the artifacts produced by this PR here: https://github.com/ScottTodd/claude-rocm-workspace/blob/main/reviews/pr_TheRock_3590.md#local-experiment-install-to-bin-validation
- Moving files to
bin/, tests passed if I ran without settingOCL_ICD_FILENAMES(so using my system OpenCL ICD version 3652.0) - Moving files to
bin/, tests failed withOCL_ICD_FILENAMES(using OpenCL ICD version 3628.0...?) I'm not sure why, given that tests passed on CI
There was a problem hiding this comment.
Thanks for your comment. I agree we should not have different installation paths for Linux vs windows.
However, changing the installation paths in clr could impact compute build.
So changes will be required in rocm-systems, compute could take a couple of weeks to get this merged again.
this coverage is important piece to validate Opencl and move away from legacy build systems.
Also earlier the linux paths were a requirement from packaging team to install the artifacts in a certain folder path.
We need similar requirements now with Rock which covers both linux and windows
I want to come back and clean up the paths/artifacts through a ticket/issue instead of blocking this PR.
Delaying this further risks test coverage gap in 7.13 too.
Looking at your experiments-
exp 1 - invalid, uses system32 opencl and comgr dlls for loading. Registry key takes precedence. Which is incorrect, need the dlls from the rock build to be loaded
exp 2 - partially valid, setting OCL_ICD_FILENAMES=$(pwd)/amdocl64.dll picks one dll from local folder but does not find all the dlls. So it fails
exp 3 - invalid again. same as exp1. use dlls from system32 instead of local build
To pick the right dlls to be loaded at runtime . Use Listdlls app to see the dlls loaded at runtime.
opt1 - We need to update registry key to point the amdocl64
or
opt2 - copy the required dlls to local folder of exe.
opt1 did not work, because we do not have any facility in Rock to update the registry key. Updating registry key requires admin permissions so could not set it through the test_ocltst.py.
choosing opt2 - With the interest of time, chose 2. this way local folder always gets precedence over system32 dlls
There was a problem hiding this comment.
However, changing the installation paths in clr could impact compute build.
Impacts on internal builds are not generally a blocking concern for open source code, builds, or releases. The costs for living downstream are paid by downstream projects, not upstream.
I see the points about timelines, but so far I'm seeing more tech debt being accumulated than paid down (I tried to run hip-tests the other day and was completely unable to do so, facing similar issues as on here... these "scripts that only work on CI" need to be fixed)
Also earlier the linux paths were a requirement from packaging team to install the artifacts in a certain folder path.
We need similar requirements now with Rock which covers both linux and window
As far as I can tell, only rocgdb (also added recently) writes to a top level tests/ folder. All other ROCm subprojects use other paths. So this is already in "needs a requirement from packaging" territory. Using the already established layout seems like it would be the path of least resistance here to me, especially given that the tests/ path does not work (we're not going to ask users who download our packages to copy DLLs to run tests... they should be able to run the tests as easily as developers working on the projects and CI/CD systems can).
exp 2 - partially valid, setting OCL_ICD_FILENAMES=$(pwd)/amdocl64.dll picks one dll from local folder but does not find all the dlls. So it fails
Which am I missing? I have everything that this PR tries to copy and I don't see any missing DLLs when I run the various exes and dlls through https://github.com/lucasg/dependencies
λ ls D:\scratch\claude\artifacts\3590-win\bin
AMD.ROCM.Comgr.MANIFEST clinfo.exe* hipcc.exe* oclperf.dll* oclruntime.exclude
amd_comgr0702.dll* cltrace.dll* hipconfig.bat oclperf.exclude ocltst.exe*
amdocl64.dll*
There was a problem hiding this comment.
actually looking at your exp setup
exp 1 - since you copy the test items to bin. runtime dlls picked are valid but system32 Opencl.dll is loaded
exp 2 - assuming items are copied from bin to test . setting OCL_ICD_FILENAMES forces to pick local OpenCL.dll. not sure what is causing this failure. Cant repro it locally.
exp 3 - running from tests folder without copying dlls to local folder picks the system32 dlls. So this is invalid.
There was a problem hiding this comment.
Impacts on internal builds are not generally a blocking concern for open source code, builds, or releases. The costs for living downstream are paid by downstream projects, not upstream.
I agree with this if its a simple bug fix, upstream gets priority. But currently we are still holding the legacy/downstream systems because of the test coverage gap in Rock. Once we have enabled the tests on Rock, we can look at fixing the issues, atleast then we have a gate check that verifies our change. Right now the opencl changes are not tested in Rock, this causes issues to be caught later in the QA cycle.
I see the points about timelines, but so far I'm seeing more tech debt being accumulated than paid down (I tried to run hip-tests the other day and was completely unable to do so, facing similar issues as on here... these "scripts that only work on CI" need to be fixed)
Please file an issue assign to our team, we will work on it soon. It should work outside CI as well.
As far as I can tell, only rocgdb (also added recently) writes to a top level
tests/folder. All other ROCm subprojects use other paths. So this is already in "needs a requirement from packaging" territory. Using the already established layout seems like it would be the path of least resistance here to me, especially given that thetests/path does not work (we're not going to ask users who download our packages to copy DLLs to run tests... they should be able to run the tests as easily as developers working on the projects and CI/CD systems can).
ocltst is a test binary. I am not sure if its right thing to copy this to bin folder. sorry need a proper requirement and design.
also exp2 should work, I am not able to repro it locally.
There was a problem hiding this comment.
We can provide another LIMITED, TEMPORARY exception for these sorts of DLL copying workarounds in tests as part of adding ocltst coverage to TheRock but this needs to be fixed urgently as it impacts core developer workflows and shipping package structures.
We're still recovering from issues where framework (JAX) automated tests, QA processes, and packages were completely misaligned to the point where we were silently falling back to testing on CPU on CI, QA had their own undocumented test procedures, and the end result were user-facing packages that were fundamentally broken.
This has all the same smells to it. We ship these test files as part of releases. Users/developers/etc. can and will try to run the test files in those release archives. They will not work as-is. Users/developers/etc. will not think to find a script like this or copy DLLs / set environment variables themselves.
Tracking follow-ups in #4192.
Added more information to the technical details. |
| "-m", | ||
| "oclruntime.dll", | ||
| "-A", | ||
| "oclruntime.exclude", |
There was a problem hiding this comment.
any options for parallelization? can we add a comment on what ./ocltst does? it seems to be a custom script so it's unclear to us what is under the hood. is it ctest/gtest/pytest?
There was a problem hiding this comment.
its a custom test with c/c++ calls no test framework used. Added 16 threads , think they run parallely. Need to check
There was a problem hiding this comment.
seems to have reduced the execution time to around 17 mins.
There was a problem hiding this comment.
30 mins is long. any options for parallelization? having a GPU machine run 30 mins for quick/sanity checks is really long and costly
There was a problem hiding this comment.
i have added 16 threads for execution. Lets see if the execution time comes down
There was a problem hiding this comment.
okay, 16 mins is better, but let's open an ocl test ticket to introduce filtering
There was a problem hiding this comment.
16 mins is okay now, but let's introduce an internal ticket for ocl test filtering
There was a problem hiding this comment.
#4192 - can create a internal ticket for this as well
There was a problem hiding this comment.
can we add that GH issue to the code here, so we don't forget
amd-shiraz
left a comment
There was a problem hiding this comment.
I believe we are adding tests for new components as part of this PR? Before we do that, do we have a report on the below:
- are are introducing any flaky tests / timeouts on tests as part of this that will impact our ci production pipelines/nightlies ?
- how much resources these new tests need to successfully execute all steps ?
- are we adding any extra time as part of this ? if yes, how much ?
This is a coverage gap in the Rock. We are porting it from existing build framework. this test has been from years, mostly stable. From the tests we observed around 30 mins to complete the tests. I have added parallel threads for execution. Hopefully it should take less time |
- 16 threads for ocltst - common cmd for readablity - flag for ocltst artifact
There was a problem hiding this comment.
16 mins is okay now, but let's introduce an internal ticket for ocl test filtering
| env["OCL_ICD_VENDORS"] = f"{OCL_ICD_VENDORS}/" | ||
|
|
||
|
|
||
| def execute_tests(env): |
There was a problem hiding this comment.
yes let's open tix for filtering, as we probably want to do expanded testing later
| # copies the dlls to local ocltst path. | ||
| # to overwrite the registry entries | ||
| def copy_dlls_exe_path(ocltst_path): | ||
| if platform.system() == "Windows": |
There was a problem hiding this comment.
would prefer to fix it in the clean up. We need this PR merged ASAP for ocl coverage.
There was a problem hiding this comment.
It's a 1 line change, in the code this PR is directly introducing, immediately prior in the code. Fix it please.
@geomin12 @amd-shiraz please don't approve PRs that so obviously are ignoring reviewer feedback.
geomin12
left a comment
There was a problem hiding this comment.
lgtm, let's use #4192 and get this cleaned up after this lands.
16 mins is long for "quick" and causes a waste of resources / developer time - noted in test_filtering.md.
soon, we'll start having global timeouts and if test times are not within the doc above, these tests will fail (or we may just disable the test if takes too long) - so please priotizing cleaning up!
There was a problem hiding this comment.
can we add that GH issue to the code here, so we don't forget
@amd-shiraz - if you are satisfied with this PR and the explanation. Can you please review/approve this so we can merge and close the gap please. |
| # copies the dlls to local ocltst path. | ||
| # to overwrite the registry entries | ||
| def copy_dlls_exe_path(ocltst_path): | ||
| if platform.system() == "Windows": |
There was a problem hiding this comment.
It's a 1 line change, in the code this PR is directly introducing, immediately prior in the code. Fix it please.
@geomin12 @amd-shiraz please don't approve PRs that so obviously are ignoring reviewer feedback.
| ROCM_PATH = Path(THEROCK_DIR) | ||
| env["ROCM_PATH"] = str(ROCM_PATH) |
There was a problem hiding this comment.
This intermediate ROCM_PATH variable is doing nothing. THEROCK_DIR is already of type Path
| ROCK_LIB_PATH = Path(THEROCK_DIR) / "lib" | ||
| OCL_LIB = Path(ROCK_LIB_PATH) / "opencl" | ||
| LLVM_LIB = Path(ROCK_LIB_PATH) / "llvm" / "lib" | ||
| ROCM_SYSDEPS_LIB = Path(ROCK_LIB_PATH) / "rocm_sysdeps" / "lib" | ||
| OCL_ICD_VENDORS = Path(THEROCK_DIR) / "etc" / "OpenCL" / "vendors" | ||
| OCLTST_PATH = Path(THEROCK_DIR) / "share" / "opencl" / "ocltst" |
There was a problem hiding this comment.
These variables are all already of type Path, no need to wrap in Path()
| if args.ocltst: | ||
| # ocltst binaries ship in the core-ocl test artifact (base already includes _lib/_dev). | ||
| if args.tests: | ||
| argv.append("core-ocl_test") |
There was a problem hiding this comment.
This doesn't quite follow existing patterns here.
There is already a code block below that handles _test artifacts:
# Fetch _lib (always) and _test (when --tests) for each artifact.
# Some projects have self-contained _test archives (just test
# binaries), while others may also need executables or data from
# _run. Add those explicitly above via argv.append("<name>_run").
extra_artifact_patterns = [f"{a}_lib" for a in extra_artifacts]
if args.tests:
extra_artifact_patterns.extend([f"{a}_test" for a in extra_artifacts])This can probably just do:
| if args.ocltst: | |
| # ocltst binaries ship in the core-ocl test artifact (base already includes _lib/_dev). | |
| if args.tests: | |
| argv.append("core-ocl_test") | |
| if args.ocltst: | |
| # ocltst binaries ship in the core-ocl test artifact (base already includes _lib/_dev). | |
| extra_artifacts.append("core-ocl") |
| # copies the dlls to local ocltst path. | ||
| # to overwrite the registry entries | ||
| def copy_dlls_exe_path(ocltst_path): | ||
| if platform.system() == "Windows": | ||
| # hip and comgr dlls need to be copied to the same folder as exectuable | ||
| dlls_pattern = ["amdocl*.dll", "amd_comgr*.dll", "OpenCL.dll"] | ||
| dlls_to_copy = [] | ||
| for pattern in dlls_pattern: | ||
| dlls_to_copy.extend(THEROCK_BIN_DIR.glob(pattern)) | ||
| for dll in dlls_to_copy: | ||
| try: | ||
| shutil.copy(dll, ocltst_path) | ||
| logging.info(f"++ Copied: {dll} to {ocltst_path}") | ||
| except Exception as e: | ||
| logging.info(f"++ Error copying {dll}: {e}") | ||
|
|
||
|
|
||
| # returns ocltst path | ||
| def setup_env(env): | ||
| ROCM_PATH = Path(THEROCK_DIR) | ||
| env["ROCM_PATH"] = str(ROCM_PATH) | ||
| if not is_windows: | ||
| ROCK_LIB_PATH = Path(THEROCK_DIR) / "lib" | ||
| OCL_LIB = Path(ROCK_LIB_PATH) / "opencl" | ||
| LLVM_LIB = Path(ROCK_LIB_PATH) / "llvm" / "lib" | ||
| ROCM_SYSDEPS_LIB = Path(ROCK_LIB_PATH) / "rocm_sysdeps" / "lib" | ||
| OCL_ICD_VENDORS = Path(THEROCK_DIR) / "etc" / "OpenCL" / "vendors" | ||
| OCLTST_PATH = Path(THEROCK_DIR) / "share" / "opencl" / "ocltst" | ||
| LD_LIBRARY_PATH = os.getenv("LD_LIBRARY_PATH") | ||
| if LD_LIBRARY_PATH is not None: | ||
| LD_LIBRARY_PATH = Path(LD_LIBRARY_PATH) | ||
| env["LD_LIBRARY_PATH"] = ( | ||
| f"{ROCK_LIB_PATH}:{OCL_LIB}:{LLVM_LIB}:{ROCM_SYSDEPS_LIB}:{LD_LIBRARY_PATH}:{OCLTST_PATH}" | ||
| ) | ||
| env["OCL_ICD_VENDORS"] = f"{OCL_ICD_VENDORS}/" | ||
| else: | ||
|
|
||
| OCLTST_PATH = Path(THEROCK_DIR) / "tests" / "ocltst" | ||
| copy_dlls_exe_path(OCLTST_PATH) | ||
| OCL_DLL_FILE = Path(OCLTST_PATH) / "amdocl64.dll" | ||
| OCL_ICD_DLL = Path(THEROCK_BIN_DIR) / "OpenCL.dll" | ||
| env["OCL_ICD_FILENAMES"] = str(OCL_DLL_FILE) | ||
| return OCLTST_PATH |
There was a problem hiding this comment.
We can provide another LIMITED, TEMPORARY exception for these sorts of DLL copying workarounds in tests as part of adding ocltst coverage to TheRock but this needs to be fixed urgently as it impacts core developer workflows and shipping package structures.
We're still recovering from issues where framework (JAX) automated tests, QA processes, and packages were completely misaligned to the point where we were silently falling back to testing on CPU on CI, QA had their own undocumented test procedures, and the end result were user-facing packages that were fundamentally broken.
This has all the same smells to it. We ship these test files as part of releases. Users/developers/etc. can and will try to run the test files in those release archives. They will not work as-is. Users/developers/etc. will not think to find a script like this or copy DLLs / set environment variables themselves.
Tracking follow-ups in #4192.
Motivation
Enable ocltst execution on Rock builds on both linux and windows.
Technical Details
enable ocltst execution on linux and windows.
Setup to pick the correct runtime libs
export OCL_ICD_VENDORS=/opt/rocm/etc/OpenCL/vendors/
Linux -
setting LD_LIBRARY_PATH to find the libs
Windows -
choosing to copy the runtime/icd dlls to local folder instead of regsitry key update.
Registry key update requires admin permissions so using copy mechanism to pick the right dlls
else the system32 dlls will be loaded, leading to incorrect dlls loaded at runtime
Test Plan
All opencl tests executed and working
Test Result
All opencl tests executed and working
JIRA
AIRUNTIME-267
Submission Checklist