Data race condition when using exrmetrics #2207

@vlazar-ilm


Hello,

I have been using exrmetrics as a poor man's conversion tool to try different codecs / parameters as supported by OpenEXR.
After converting a few hundred gigs of data, I realized that certain files were actually corrupted but I was unable to repro reliably.

To repro, I created a small Python script that calls exrmetrics twice:

  • Once to convert it to htj2k32 or zip
  • Once to read to make sure it's correct

The command line is:
exrmetrics --convert infile -o output_path -z htj2k32 -t 10

I attached the script in case it saves you some time reproducing:
run.py
This file triggers the corruption fairly easily: https://openexr.com/en/latest/test_images/DisplayWindow/t05.html

I repeat this loop up to 5000 times (I manage to get it to corrupt anywhere between 1 and 3000 iterations). Running with -t 1 seems to complete a few thousand iterations successfully.
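For anyone without access to the attachment, the loop described above can be sketched roughly as follows. This is not the attached run.py, only a minimal approximation of it. The convert command mirrors the one quoted in this issue; the read-back command, the function names, and the assumption that exrmetrics exits with a non-zero status on a decode failure are all guesses made for illustration.

```python
import subprocess
from typing import Callable, List, Optional


def convert_cmd(infile: str, outfile: str, codec: str = "htj2k32",
                threads: int = 10, exe: str = "exrmetrics") -> List[str]:
    """Build the conversion command quoted in the issue."""
    return [exe, "--convert", infile, "-o", outfile, "-z", codec, "-t", str(threads)]


def verify_cmd(outfile: str, threads: int = 10, exe: str = "exrmetrics") -> List[str]:
    """Build a read-back command.

    ASSUMPTION: the issue does not give the exact read-back invocation;
    here exrmetrics is simply pointed at the converted file.
    """
    return [exe, outfile, "-t", str(threads)]


def first_failure(infile: str, outfile: str, iterations: int = 5000,
                  codec: str = "htj2k32", threads: int = 10,
                  run: Callable = subprocess.run) -> Optional[int]:
    """Repeat convert + read-back; return the 1-based iteration that first fails.

    Returns None if every iteration succeeds. A failure is ASSUMED to show
    up as a non-zero exit status from either command.
    """
    for i in range(1, iterations + 1):
        for cmd in (convert_cmd(infile, outfile, codec, threads),
                    verify_cmd(outfile, threads)):
            result = run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                return i
    return None
```

Calling `first_failure("t05.exr", "/tmp/htj2k_t05.exr")` would then run the loop with 10 threads; passing `threads=1` corresponds to the -t 1 case. The `run` parameter exists only so the loop can be exercised without the exrmetrics binary.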

I recompiled OpenEXR with thread sanitizer enabled and got the following error when converting to zip:

-------------------- STDERR --------------------
==================
WARNING: ThreadSanitizer: data race (pid=86644)
  Write of size 8 at 0x00016dbde5b0 by main thread:
    #0 std::__1::vector<char, std::__1::allocator<char>>::__base_destruct_at_end[abi:ne200100](char*) vector.h:750 (exrmetrics:arm64+0x1000557b8)
    #1 std::__1::vector<char, std::__1::allocator<char>>::clear[abi:ne200100]() vector.h:531 (exrmetrics:arm64+0x100055488)
    #2 std::__1::vector<char, std::__1::allocator<char>>::__destroy_vector::operator()[abi:ne200100]() vector.h:248 (exrmetrics:arm64+0x1000552b0)
    #3 std::__1::vector<char, std::__1::allocator<char>>::~vector[abi:ne200100]() vector.h:259 (exrmetrics:arm64+0x100056ea8)
    #4 std::__1::vector<char, std::__1::allocator<char>>::~vector[abi:ne200100]() vector.h:259 (exrmetrics:arm64+0x100056ddc)
    #5 MemOStream::~MemOStream() exrmetrics.cpp:876 (exrmetrics:arm64+0x100056d4c)
    #6 MemOStream::~MemOStream() exrmetrics.cpp:876 (exrmetrics:arm64+0x100050128)
    #7 exrmetrics(char const*, char const*, int, Imf_3_4::Compression, float, int, bool, bool, PixelMode, bool) exrmetrics.cpp:1140 (exrmetrics:arm64+0x10004d378)
    #8 exrmetrics(char const*, char const*, int, Imf_3_4::Compression, float, int, bool, bool, PixelMode, bool) exrmetrics.cpp:972 (exrmetrics:arm64+0x10004b564)
    #9 <null> <null> (0x00018825ab98)

  Previous read of size 8 at 0x00016dbde5b0 by thread T6 (mutexes: write M0):
    #0 std::__1::vector<char, std::__1::allocator<char>>::size[abi:ne200100]() const vector.h:385 (exrmetrics:arm64+0x100055618)
    #1 MemIStream::readMemoryMapped(int) exrmetrics.cpp:912 (exrmetrics:arm64+0x100056320)
    #2 Imf_3_4::istream_nonparallel_read(_priv_exr_context_t const*, void*, void*, unsigned long long, unsigned long long, int (*)(_priv_exr_context_t const*, int, char const*, ...)) <null> (libOpenEXR-3_4.33.3.4.3.dylib:arm64+0x14478)

  Location is stack of main thread.

  Mutex M0 (0x000106d010e0) created at:
    #0 pthread_mutex_lock <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x31494)
    #1 std::__1::mutex::lock() <null> (libc++.1.dylib:arm64e+0x1f3d8)
    #2 exrmetrics(char const*, char const*, int, Imf_3_4::Compression, float, int, bool, bool, PixelMode, bool) exrmetrics.cpp:972 (exrmetrics:arm64+0x10004b564)
    #3 <null> <null> (0x00018825ab98)

  Thread T6 (tid=147741413, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2f708)
    #1 IlmThread_3_4::(anonymous namespace)::DefaultThreadPoolProvider::setNumThreads(int) <null> (libIlmThread-3_4.33.3.4.3.dylib:arm64+0x27ec)
    #2 <null> <null> (0x00018825ab98)

SUMMARY: ThreadSanitizer: data race vector.h:750 in std::__1::vector<char, std::__1::allocator<char>>::__base_destruct_at_end[abi:ne200100](char*)
==================
ThreadSanitizer: reported 1 warnings

Strangely enough, I don't get the same error when converting to HTJ2K, but I do get a more generic one:

ojph error 0x000300A1 at ojph_codeblock.cpp:219: Error decoding a codeblock.
ojph error 0x000300A1 at ojph_codeblock.cpp:219: Error decoding a codeblock.
ojph error 0x000300A1 at ojph_codeblock.cpp:219: Error decoding a codeblock.
ojph error 0x000300A1 at ojph_codeblock.cpp:219: Error decoding a codeblock.
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Unable to decompress w 11 image data 37355 -> 76800, got 0
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Decode pipeline unable to decompress data
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Unable to decompress w 11 image data 37361 -> 76800, got 0
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Decode pipeline unable to decompress data
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Unable to decompress w 11 image data 37362 -> 76800, got 0
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Decode pipeline unable to decompress data
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Unable to decompress w 11 image data 37548 -> 76800, got 0
/tmp/htj2k_t05.exr: (EXR_ERR_CORRUPT_CHUNK) Decode pipeline unable to decompress data
error from exrmetrics: Unable to run decoder

=============
OS: Alma 9.6 and OSX
Corruption occurs on both; the thread sanitizer output is from OSX
OpenEXR: 741ecb8 (latest RB-3.4) (have also had the corruption with 3.4.1)
OpenJPH: 0.25.3

Compiler: Clang 17.0.0 on OSX, GCC 11.5 on Alma
Breaks with both Debug and Release configs

================

Possibly connected with #2157?
However, I got it to happen (unreliably) with all kinds of builds, and on Linux as well.
Also, the issue did not go away after compiling OpenJPH with SIMD off (OJPH_DISABLE_SIMD).
