Applying computer vision to image checking #727

forsyth2 · 2025-07-08T22:51:42Z

Hackathon project: Applying computer vision to image checking

This is a prototype and is not meant to be merged. The pull request is for easy visibility/ability to comment.

Goal

The image checker can produce a great number of diffs. These are currently organized by plot type.

This poses a problem if many of the diffs are similar -- it's not a great use of a developer's time to manually look through say 1,000 diffs only to realize every single one is just "the pixels shifted slightly left" or "the colorbar changed so of course the plots look different".

What would be ideal is to use computer vision/machine learning to group those diffs how we want them. E.g., "every diff in cluster_0/ is because the pixels shifted, every diff in cluster_1/ is because its colorbar changed, and so on".

That would mean the developer only needs to look at a diff or two in each cluster to know what the issues are. There would be no need to manually review each diff, saving a lot of time.

Setting up

setup

Set up the directories to test on:

# Set up actual_images_dir
cd /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2
mkdir -p cv_prototype/actual_from_20250613/e3sm_diags/
cp -r zppy_weekly_comprehensive_v3_www/test_weekly_20250613/v3.LR.historical_0051/e3sm_diags/atm_monthly_180x360_aave cv_prototype/actual_from_20250613/e3sm_diags/
ls cv_prototype/actual_from_20250613/e3sm_diags/atm_monthly_180x360_aave/model_vs_obs_1987-1988/
# Contains the different sets, good

# Set up expected_images_dir
cd /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2
mkdir -p cv_prototype/expected_from_unified/e3sm_diags/
cp -r /lcrc/group/e3sm/public_html/zppy_test_resources_previous/expected_results_for_unified_1.11.1/expected_comprehensive_v3/e3sm_diags/atm_monthly_180x360_aave cv_prototype/expected_from_unified/e3sm_diags/
ls cv_prototype/expected_from_unified/e3sm_diags/atm_monthly_180x360_aave/model_vs_obs_1987-1988/
# Contains the different sets, good

# Get the list of files we expect, expected_images_list
cd cv_prototype/expected_from_unified/
find . -type f -name '*.png' > ../image_list_expected.txt
cd ..
ls
# actual_from_20250613  expected_from_unified  image_list_expected.txt

To run:

# Initial setup
cd /home/ac.forsyth2/ez/zppy
lcrc_conda # alias to set up conda
pre-commit run --all-files
git add -A
conda clean --all --y
conda env create -f conda/dev.yml -n zppy-hackathon-20250707-with-cv
conda activate zppy-hackathon-20250707-with-cv
pip install .

# Each run
# Update `try_num` below (the argument to `cv_prototype`)
pre-commit run --all-files
git add -A
python tests/integration/image_checker.py
# Compare missing/mismatched images with the original image checker's results
# Change the number below to the `try_num`
num=0
diff /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/cv_prototype/diff_try${num}/e3sm_diags/missing_images.txt /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/cv_prototype/diff_try1/e3sm_diags/missing_images.txt
diff /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/cv_prototype/diff_try${num}/e3sm_diags/mismatched_images.txt /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/cv_prototype/diff_try1/e3sm_diags/mismatched_images.txt

Explaining the code

The cv_prototype function effectively replaces set_up_and_run_image_checker as the caller of check_images. If this were actually at the point of merging, we'd change

test_results = check_images(parameters, task)

in set_up_and_run_image_checker to:

test_results = check_images(parameters, task, cv_dict=cv_dict)

where cv_dict is defined as something like what we have in the main block now. With that change, the integration test suite could make use of this code. This cv_dict allows us to configure the pipeline with modularity. We can easily swap out different feature extraction, preprocessing, and clustering algorithms.
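As a rough illustration of that modularity, a lookup table keyed on the cv_dict values lets one config entry select the algorithm. This is a minimal sketch, not the PR's code: FEATURE_EXTRACTORS, extract_features, and the "mean_only" option are hypothetical names invented for the example, and images are reduced to flat lists of pixel values to keep it self-contained.

```python
from typing import Any, Callable, Dict, List

CvDict = Dict[str, Any]


def simple_hist(pixels: List[int], cv_dict: CvDict) -> List[float]:
    """Normalized histogram of 0-255 pixel values (hypothetical stand-in)."""
    bins = cv_dict["simple_hist_bins"]
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [count / total for count in counts]


def mean_only(pixels: List[int], cv_dict: CvDict) -> List[float]:
    """A deliberately trivial alternative, only here to show the swap."""
    return [sum(pixels) / (len(pixels) or 1)]


# Hypothetical registry: swapping algorithms means changing one config string.
FEATURE_EXTRACTORS: Dict[str, Callable[[List[int], CvDict], List[float]]] = {
    "simple_hist": simple_hist,
    "mean_only": mean_only,
}


def extract_features(pixels: List[int], cv_dict: CvDict) -> List[float]:
    """Dispatch to whichever extractor cv_dict names."""
    name = cv_dict["feature_extraction_algorithm"]
    if name not in FEATURE_EXTRACTORS:
        raise ValueError(f"Unknown feature_extraction_algorithm: {name}")
    return FEATURE_EXTRACTORS[name](pixels, cv_dict)


example_cv_dict = {"feature_extraction_algorithm": "simple_hist", "simple_hist_bins": 4}
print(extract_features([0, 64, 128, 255], example_cv_dict))
```

The same pattern would apply to the preprocessing and clustering choices: each stage reads its own keys from cv_dict and ignores the rest.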

The check_images function is more-or-less unchanged, except for passing along cv_dict.

The _check_mismatched_images function has 2 important additions if we're in CV mode (i.e., cv_dict is non-empty):

For each file, call _cv_compare_actual_and_expected, which is substantially different from its counterpart _compare_actual_and_expected.
After going through all the files, call group_diffs.

Let's dive into _cv_compare_actual_and_expected:

get_images: we now use cv2.imread instead of Image.open
compute_diff_image: before, we filtered based on if fraction >= 0.0002. That is, the fraction of mismatched pixels needed to be less than 0.02% for an image to pass. The new check is a little different: for each pixel in the diff, we look at how high its grayscale value is. If it's > gray_diff_threshold, we add that pixel to the mask.
write_images: more-or-less the same as what we were doing in _compare_actual_and_expected, but using cv2.imwrite and no longer drawing bounding boxes on the diffs. (I assume this is possible, but the cv2 images are ndarrays not Images, so I didn't look into translating over the bounding box drawing.)
update_features: this is totally new. This is finding the important features in the diffs. I.e., it's finding what to cluster the diffs (or, optionally, the actuals + the diffs) on. This currently has 3 options we can swap out: "simple_hist" is just a normalized color histogram, "combined_features" adds in spatial features, "sector_slice" is an attempt to slice up the image into relevant pieces.
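The diff step and a position-based feature can be sketched roughly as follows. This is a NumPy-only illustration, not the PR's code: the PR uses cv2, and the names compute_diff_mask and sector_fractions, the grayscale weights, and the grid layout are all assumptions made for the example. In particular, sector_fractions is only a guess at the kind of idea behind "sector_slice".

```python
import numpy as np


def compute_diff_mask(
    actual: np.ndarray, expected: np.ndarray, gray_diff_threshold: int = 30
) -> np.ndarray:
    """Boolean mask of pixels whose grayscale difference exceeds the threshold.

    Both inputs are assumed to be uint8 arrays of shape (height, width, 3)
    in BGR channel order, which is what cv2.imread returns for color images.
    """
    # Cast up from uint8 so the subtraction cannot wrap around.
    diff = np.abs(actual.astype(np.int16) - expected.astype(np.int16))
    # Weighted grayscale conversion (assumed weights, in B, G, R order).
    gray = diff.astype(np.float64) @ np.array([0.114, 0.587, 0.299])
    return gray > gray_diff_threshold


def sector_fractions(mask: np.ndarray, rows: int = 3, cols: int = 1) -> np.ndarray:
    """Fraction of changed pixels in each cell of a rows x cols grid."""
    features = []
    for band in np.array_split(mask, rows, axis=0):
        for cell in np.array_split(band, cols, axis=1):
            features.append(float(cell.mean()) if cell.size else 0.0)
    return np.array(features)


# Toy images: only the bottom third differs strongly.
expected_image = np.zeros((6, 4, 3), dtype=np.uint8)
actual_image = expected_image.copy()
actual_image[4:6, :, :] = 200  # large change, lands in the mask
actual_image[0, 0, :] = 10  # small change, below the threshold
diff_mask = compute_diff_mask(actual_image, expected_image)
print(sector_fractions(diff_mask, rows=3))
```

Features like these describe where an image changed rather than what the plot looks like, which is closer to the grouping the Goal section asks for.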

And dive into group_diffs:

preprocess_diffs: preprocess the features, as needed. We have options to scale features, reduce dimensions, and plot the features for a visual analysis.
run_cluster_algorithm: run a clustering algorithm. There are options for DBSCAN, KMeans, and AgglomerativeClustering.
Go through the clusters and copy over actual/expected/diff images to a subdirectory for each cluster.
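The final copy step can be sketched as below, assuming the clustering stage returns one integer label per diff image. The function name copy_into_cluster_dirs and the "noise" directory name are hypothetical. The PR copies the actual, expected, and diff images; this sketch copies only one file per label to stay short.

```python
import shutil
from pathlib import Path
from typing import Dict, List, Sequence


def copy_into_cluster_dirs(
    diff_paths: Sequence[Path], labels: Sequence[int], out_dir: Path
) -> Dict[str, List[str]]:
    """Copy each diff image into out_dir/cluster_<label>/ and report the grouping."""
    grouped: Dict[str, List[str]] = {}
    for path, label in zip(diff_paths, labels):
        # DBSCAN marks outliers with the label -1; keep them in their own directory.
        name = "noise" if label == -1 else f"cluster_{label}"
        cluster_dir = out_dir / name
        cluster_dir.mkdir(parents=True, exist_ok=True)
        # Only the base name is kept here, so two images with the same file
        # name would overwrite each other; real code would need to keep
        # enough of the relative path to avoid that.
        shutil.copy2(path, cluster_dir / path.name)
        grouped.setdefault(name, []).append(path.name)
    return grouped
```

A developer would then open one or two images in each cluster directory instead of paging through every diff.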

Results

Using compute_diff_image with a gray_diff_threshold of 30 results in 12 more mismatches (57 => 69).
For feature extraction algorithm, "combined_features" seems to give the best results.
For clustering algorithm, AgglomerativeClustering seems to give the best results.

I ran quite a few iterations of the code, tracked with the try_num. The results can be seen in the diff_try# subdirectories of this page

Interesting iterations

Try 2: original image checker

Try 9: using `cv2` for diff computation. Adds 12 mismatches (57 => 69):
lat_lon/Cloud SSM/I/SSMI-TGCLDLWP_OCN-ANN-global.png
lat_lon/Cloud SSM/I/SSMI-TGCLDLWP_OCN-DJF-global.png
lat_lon/Cloud SSM/I/SSMI-TGCLDLWP_OCN-JJA-global.png
lat_lon/OMI-MLS/OMI-MLS-TCO-JJA-60S60N.png
lat_lon/SST_CL_HadISST/HadISST_CL-SST-ANN-global.png
lat_lon/SST_CL_HadISST/HadISST_CL-SST-DJF-global.png
lat_lon/SST_CL_HadISST/HadISST_CL-SST-JJA-global.png
lat_lon/SST_CL_HadISST/HadISST_CL-SST-MAM-global.png
lat_lon/SST_PD_HadISST/HadISST_PD-SST-ANN-global.png
lat_lon/SST_PD_HadISST/HadISST_PD-SST-DJF-global.png
lat_lon/SST_PD_HadISST/HadISST_PD-SST-MAM-global.png
lat_lon/SST_PI_HadISST/HadISST_PI-SST-ANN-global.png
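
A list like the one above can be produced by comparing two mismatched_images.txt files as sets. This is a small sketch under the assumption that each file holds one image path per line; read_image_list and added_mismatches are hypothetical helper names, not functions from the PR.

```python
from pathlib import Path
from typing import List, Set


def read_image_list(path: Path) -> Set[str]:
    """Read one image path per line, ignoring blank lines."""
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def added_mismatches(baseline: Path, candidate: Path) -> List[str]:
    """Images reported as mismatched by the candidate run but not by the baseline."""
    return sorted(read_image_list(candidate) - read_image_list(baseline))
```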

Tries 29-32: combinations of detecting features on <diff only, actual + diff> and clustering using <DBSCAN, KMeans>

###############################################################################
Try 29: feature detection -- actual + diffs, clustering algorithm -- DBSCAN
Total of 69 mismatched images.

These 27 diffs all involve the bottom plot AND the metrics
0: 5 polar MERRA2 > MERRA2-U-850-{season}-polar_S, diffs in bottom plot/metrics
1: 7 polar MERRA2 > MERRA2-T-850-{season}-polar_{hemisphere}, diffs in bottom plot/metrics
2: 5 polar MERRA2 > MERRA2-U-850-{season}_polar_{hemisphere}, diffs in bottom plot/metrics
6: 5 lat_lon Cloud_Calpiso > CALIPSOCOSP-CLDLOW_CAL-{season}-global, diffs in bottom plot/metrics
11: 5 lat_lon MERRA2 > MERRA2-OMEGA-850-{season}-global, diffs in bottom plot/metrics

These 7 diffs all involve all 3 plots
3: 3 tropical_subseasonal wavenumber-frequency > PRECT_{}_15N-15S, diffs in all 3 plots
4: 4 tropical_subseasonal wavenumber-frequency > PRECT_norm_{}_15N-15S, diffs in all 3 plots

These 5 diffs all involve the bottom plot only
9: 5 lat_lon MERRA2 > MERRA2-U-850-{season}-global, all barely noticeable diffs in bottom plot

These 25 diffs all involve the bottom plots and/or metrics, but with less similarity
5: 15 lat_lon SST_{}_HadISST > HadISST_{}-SST-{season}-global diff always in bottom plot, sometimes on bottom metrics; some diffs are barely visible, but some are noticeable
7: 3 lat_lon OMI-MLS > OMI-MLS-TCO-{season}-60S60N, 1 barely noticeable diff in bottom plot, 1 diff in bottom plot/metrics, 1 diff in bottom plot only
8: 3 lat_lon Cloud_SSM.I > SSMI-TGCLDLWP_OCN-{season}-global, 2 nearly invisible diffs in bottom plot, 1 nearly invisible diff in bottom metrics
10: 4 lat_lon MERRA2 > MERRA2-T-850-{season}-global, 2 diffs in bottom plot/metrics and 2 just in bottom plot

noise: 5 remaining diffs. 1 lat_lon, 3 polar, 1 qbo

CONCLUSIONS
- Diffs of plots in the same family almost always have the same things wrong with them.
- Our combination of feature detection + clustering algorithm (DBSCAN) is actually able to tell which plot types are which, so that is interesting/good. HOWEVER, we're really more interested in grouping together similar diffs. E.g., clusters 0,1,2,6,11 above all involve diffs in the bottom plot AND metrics. Is there *anything* we can do to get the feature detection/clustering algorithm to merge clusters 0,1,2,6,11 into one cluster?
- The ultimate goal here is to be able to look at just a few representative diffs rather than needing to sort through many many diffs manually (69 diffs is already a lot, but there can be even more).

###########################################################################
Try 30: feature detection -- diffs only, clustering algorithm -- DBSCAN

cluster_0 has all 69 diffs in it.

CONCLUSIONS
- Looking at only the diffs, it doesn't seem "smart" enough to distinguish that the 3-plot square diffs of tropical_subseasonal clearly belong in a different cluster than the small world map diffs of the other plots.

###########################################################################
Try 31: feature detection -- diffs only, clustering algorithm -- KMeans

Trying with n_clusters = 3. Can we get it to make the following 3 clusters: the world map plots, the tropical_subseasonal plots, and the qbo plots?

Clusters:
0: 2 tropical_subseasonal diffs
2: 2 tropical_subseasonal diffs
1: the remaining 65 diffs, including 3 tropical_subseasonal diffs

CONCLUSIONS
- Not good at all. The tropical_subseasonal plots are spread out across all 3 clusters and everything else is in one of those.

###########################################################################
Try 32: feature detection -- actual + diffs, clustering algorithm -- KMeans

Clusters:
0: 39 diffs in lat_lon, polar, tropical_subseasonal
1: 21 diffs in lat_lon, polar
2: 9 diffs in polar

###########################################################################
Now, we turn to AgglomerativeClustering

###########################################################################
Try 38: feature detection -- diffs, clustering algorithm -- AC
4 clusters -- 2 for tropical_subseasonal, 1 for qbo, and 1 for lat_lon/polar
Pretty decent!

###########################################################################
Try 40: feature detection -- sector slice on diffs, clustering algorithm -- AC
4 clusters -- 2 for tropical_subseasonal/qbo, 1 for lat_lon, 1 for lat_lon/polar/tropical_subseasonal
So, not that great

Full iteration summary:
Tries 1-2: with original image checker functions
Tries 3-9: first attempt using cv2, compute_diff_image
Tries 10-13: detect_features
Tries 14-24: reworking code to group multiple diff images together
Tries 25-28: working on clusters
Tries 29-32: 4 combinations of DBSCAN/KMeans & extract features on diff/diff+actual
Try 33: cleaned up code, DBSCAN & diff-only
Try 34: cleaned up code, DBSCAN & diff+actual
Tries 35-38: new feature detection, clustering algorithms
Tries 38-40: trying sector slice for feature detection
Tries 41-42: code cleanup

Directories were moved as follows:

cd /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/cv_prototype
mkdir important_tries
mv diff_try2 important_tries/diff_try2
mv diff_try9 important_tries/diff_try9
mv diff_try29 important_tries/diff_try29
mv diff_try30 important_tries/diff_try30
mv diff_try31 important_tries/diff_try31
mv diff_try32 important_tries/diff_try32
mv diff_try38 important_tries/diff_try38
mv diff_try40 important_tries/diff_try40

mkdir less_important_tries
mv diff_try1 less_important_tries/diff_try1
mv diff_try41 less_important_tries/diff_try41

mkdir other_tries
mv diff_try* other_tries/
du -sh other_tries/
# 2.4G	other_tries/
# These can probably be deleted

Best Result

The combination of parameters in this PR is what has given the best result so far:

cv_dict: Dict[str, Any] = {
    # Image diff
    "gray_diff_threshold": 30,  # Out of 255
    # Feature extraction
    "extract_diff_features_only": True,
    "feature_extraction_algorithm": "combined_features",
    "simple_hist_bins": 32,
    "combined_features_bins": 8,
    # Preprocessing
    "scale_features": True,
    "reduce_dimensions": True,
    "plot_features": True,
    # Clustering
    "cluster_algorithm": "AgglomerativeClustering",
    # Clustering > DBSCAN
    "eps": 0.5,
    "min_samples": 2,
    # Clustering > KMeans, AgglomerativeClustering
    "n_clusters": 4,
}

Notably:

"feature_extraction_algorithm": "combined_features"
"cluster_algorithm": "AgglomerativeClustering"
"n_clusters": 4

That produces these 4 clusters:

0: tropical_subseasonal diffs that look almost like terrain maps
1: tropical subseasonal diffs that look like waves
2: the qbo diff
3: all the lat_lon/polar diffs

These results can be seen here.
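
For readers unfamiliar with the preprocessing and clustering settings in cv_dict, the sketch below shows one way "scale_features", "reduce_dimensions", and AgglomerativeClustering with "n_clusters": 4 could fit together using scikit-learn. It is an illustration under assumptions, not the PR's implementation: the use of StandardScaler and PCA, the number of PCA components, and the function name cluster_features are all guesses, and the feature matrix is synthetic.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_features(
    features: np.ndarray, n_clusters: int = 4, n_components: int = 2
) -> np.ndarray:
    """Scale, reduce, and cluster a (n_images, n_features) matrix."""
    # "scale_features": put every feature on a comparable scale.
    scaled = StandardScaler().fit_transform(features)
    # "reduce_dimensions": keep only the leading components (count is assumed).
    reduced = PCA(n_components=n_components).fit_transform(scaled)
    # "cluster_algorithm": AgglomerativeClustering with a fixed cluster count.
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(reduced)


# Synthetic stand-in for real diff features: 4 well-separated groups of 5 rows.
rng = np.random.default_rng(0)
centers = np.array(
    [[0, 0, 0, 0], [10, 10, 0, 0], [0, 0, 10, 10], [10, 10, 10, 10]], dtype=float
)
features = np.vstack([c + rng.normal(scale=0.1, size=(5, 4)) for c in centers])
labels = cluster_features(features)
print(labels)
```

Unlike DBSCAN, this approach needs n_clusters chosen up front, so the value 4 reflects the test set used here and would likely need revisiting for a different set of diffs.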

Leveraging LivChat

LivChat was very useful in suggesting code snippets. It helped substantially in narrowing down areas to debug. For example, when asked "is this issue because of the feature extraction or the clustering algorithm?", it explained that the feature extraction was almost certainly the problem. It also suggested alternative clustering algorithms.

What would be needed to merge?

Find some way of getting the diffs to group based on what is different. It seems they're more grouped by the plot type at the moment.
- We'd need a bigger test set. E.g., get some diffs where some lat_lon plots have different titles but others have an axis label change, where some have the diffs in the top plot but others have the diffs in the bottom plot.
As mentioned above, in set_up_and_run_image_checker set:

test_results = check_images(parameters, task, cv_dict=cv_dict)

forsyth2 · 2025-08-09T00:04:57Z

Closing, as this was a hackathon project.

forsyth2 added 4 commits July 7, 2025 15:08

Initial setup for CV image checker

fc81ae6

Compute diffs

b15aa4d

Feature detection and clustering

9651258

Code cleanup

49b803a

forsyth2 self-assigned this Jul 8, 2025

forsyth2 added semver: new feature New feature (will increment minor version) Testing Files in `tests` modified labels Jul 8, 2025

forsyth2 changed the title ~~Hackathon 20250707~~ Applying computer vision to image checking Jul 8, 2025

forsyth2 closed this Aug 9, 2025
