Isolate per-test scratch for parallel xdist runs [CI 5/9]#1599
Conversation
auphelia
left a comment
There was a problem hiding this comment.
Hi @merkelmarrow ,
Thank you for the contribution!
One small thing: could you replace the use of tmp_path with the make_build_dir helper?
The main reason is consistency: we try to ensure that all build artifacts go through the standard FINN build directory setup (i.e., FINN_BUILD_DIR / FINN_HOST_BUILD_DIR). Since tests are often used as templates by users, using make_build_dir here helps avoid unintentionally introducing alternative directory patterns.
Thanks!
Many unit tests wrote a fixed scratch filename into the working directory or a single shared build dir, which is unsafe under pytest-xdist. The two batchnorm_to_affine tests shared one ONNX path, and the convert_to_hw, lookup and softmax tests each reused one fixed export path across their parametrisations or sibling tests, so concurrent workers raced on that path. The infer and sign_to_thres tests are single tests that wrote a fixed-name export into the working directory, which is left behind there on failure. Give each of these tests its own make_build_dir scratch directory so concurrent workers never share a path while still using FINN's standard FINN_BUILD_DIR setup. Remove those scratch directories with robust_rmtree teardown because these cases do not intentionally keep artefacts. The loop-rolling tests exported under FINN_BUILD_DIR and separately the LoopExtraction transform saves and reloads a fixed-name loop-body-template.onnx in the working directory. Give each case its own make_build_dir and chdir into it so both the export and the cwd are per test, and seed torch and numpy so quantised configurations stay within tolerance from run to run. Dir is removed with robust_rmtree on success and kept on failure for inspection. The npy2apintstream and npy2vectorstream util tests already used a unique make_build_dir, but these specific tests were vulnerable to NFS ENOTEMPTY teardown races, so remove the dir with robust_rmtree on success. Swap subprocess.Popen plus communicate (which never checked the return code) for subprocess.check_call, so a failed compile/run raises at the point of failure. Signed-off-by: Marco Blackwell <mblackwe@amd.com>
2622a71 to
da4b2e1
Compare
Hi @auphelia, No problem! That makes sense. I've updated all the tmp_path instances with a make_build_dir + robust_rmtree try/finally pattern. Some of the tests are pretty long so I put the body in a helper to avoid indenting everything. Let me know if you would like any further changes. |
auphelia
left a comment
There was a problem hiding this comment.
Thanks a lot for addressing the issues you observed and also changing tmp_path with make_build_dir!
Also really nice to see the numpy/pytorch seeding added in the rolling test, I think that’s a good call for CI stability. The test is mainly about validating the rolling logic itself, so avoiding occasional numerical noise makes sense.
Looks good to me 👍
This is PR 5 of 9 in a series to make CI faster and more robust.
This patch makes concurrency hygiene improvements to tests, making them safe to run under
pytest-xdistby giving each test and each parametrisation its own scratch path. Several tests wrote a fixed scratch filename and so collided on the same path when two workers ran them at once. A few others did not race but left a stray file in the repository root on failure.Changes
Grouped by type:
tmp_pathfor tests that reused one fixed export name.test_batchnorm_to_affine_bnn_pynq(two functions shared one ONNX path),test_convert_to_hw_layers_{cnv,fc,synthetic},test_fpgadataflow_lookupandtest_fpgadataflow_softmaxeach reused one fixed export path across parametrisations or sibling tests. Each now writes under its owntmp_path, so no two workers share a path and pytest owns the cleanup.tmp_pathfor cleanup hygiene.test_infer_datatypes_lfc,test_infer_data_layouts_cnvandtest_sign_to_thresare single tests with unique filenames, so they did not race, but they wrote into the working directory and left the file behind on failure (thexfailingtest_infer_data_layouts_cnvleft it on every run). Moving them totmp_pathkeeps the repo root clean and hands cleanup to pytest.test_loop_rolling. The export went to a sharedFINN_BUILD_DIRand, separately,LoopExtractionsaves and reloads a fixed-nameloop-body-template.onnxin the current working directory. Each case now gets its ownmake_build_dirandchdirs into it, so both the export and that cwd template are per test. The directory is removed withrobust_rmtreeon success and kept on failure.test_npy2apintstream/test_npy2vectorstream. These already used a uniquemake_build_dir, but these specific tests sometimes hitENOTEMPTYon NFS (see Harden build dir deletion against transient NFS failures [CI 1/9] #1595), so teardown now usesrobust_rmtree. They also swapsubprocess.Popenpluscommunicate(which discards the return code) forsubprocess.check_call, so a failed compile or run fails at the point of failure.Review notes / judgement calls
Intended pattern
Going forward, the intended pattern is as follows:
make_build_dirwithrobust_rmtreeif you intend to keep artifacts in FINN_BUILD_DIR on failure.Seeding change
This PR now seeds numpy/pytorch in the loop rolling test (while we're already changing it). A random seed would occasionally fail numerical tolerance in a small percentage of cases which is unhelpful for CI. Opinion on this is welcome.
Completeness
This PR addresses observed issues, but the problematic patterns/races addressed may silently exist elsewhere. Follow-on work needed, out of scope for this PR.