Julia crashes for multithreaded Stack for some non-Julia models #783

Open
@ablaom

Context: #767 adds support for an option acceleration=CPUThreads() in composite model types defined by "exporting" learning networks, and implements this option for Stack. I have been carrying out MLJ ecosystem integration tests of the new Stack with a large number of models as base models in the stack. If the base model is one from the non-Julia packages ScikitLearn.jl, XGBoost.jl, or LIBSVM.jl, and I include CPUThreads() in the testing, then Julia crashes. I have not been able to reliably reproduce the crashes with a "minimal example", but the following seems to do the job on my machine:

using Pkg
Pkg.activate(temp=true)
Pkg.add(
    url="https://github.com/JuliaAI/MLJBase.jl",
    rev="stack_cache_and_acceleration",
)
Pkg.add(
    url = "https://github.com/JuliaAI/MLJTestIntegration.jl",
    rev= "multi-threading",
)
Pkg.add("NearestNeighborModels")
Pkg.add("MLJLIBSVMInterface")
Pkg.add("XGBoost")
Pkg.instantiate()

julia> Pkg.status()
      Status `/private/var/folders/4n/gvbmlhdc8xj973001s6vdyw00000gq/T/jl_wRKoZO/Project.toml`          
  [a7f614a8] MLJBase v0.20.2 `https://github.com/JuliaAI/MLJBase.jl#stack_cache_and_acceleration`        
  [61c7150f] MLJLIBSVMInterface v0.2.0
  [697918b4] MLJTestIntegration v0.1.0 `https://github.com/JuliaAI/MLJTestIntegration.jl#multi-threading`
  [636a865e] NearestNeighborModels v0.2.0
  [009559a3] XGBoost v1.5.2

using MLJBase
using NearestNeighborModels
using MLJLIBSVMInterface
using MLJTestIntegration
using XGBoost

model = EpsilonSVR()

models = (knn1=KNNRegressor(K=4),
          knn2=KNNRegressor(K=6),
          model=model)

metalearner = KNNRegressor()
measure = LPLoss(2)

# mini Boston:
y, X = unpack(MLJBase.load_boston(), ==(:MedV), col->col in [:LStat, :Rm])
data = (X, y)

mystack = Stack(
    ; metalearner,
    resampling=CV(;nfolds=3),
    acceleration=CPUThreads(),
    models...)

julia> MLJTestIntegration.test_single_target_regressors(
    [(name="EpsilonSVR", package_name="LIBSVM"),],
    level=4,
    verbosity=2
)
┌ Info: 
└ Testing EpsilonSVR from LIBSVM
[ Info: [:model_type] Loading model type ✓
[ Info: [:model_instance] Instantiating default model ✓
[ Info: [:fitted_machine] Fitting machine ✓
[ Info: [:operations] Calling `predict`, `transform` and/or `inverse_transform` ✓
[ Info: [evaluation] Evaluating model performance using with 1 resources. ✓
Internal repeatability tests, 50 of 50 trials complete ✓ Repeatable.
[ Info: Testing with 5 threads. 
[ Info: [:accelerated_evaluation] Evaluating model performance using with 2 resources. ✓
[ Info: [:tuned_pipe_evaluation] Evaluating perfomance in a tuned pipeline ✓
[ Info: [:ensemble_prediction] Ensembling ✓
[ Info: [stack_evaluation] Evaluating a stack containing model with 1 resources. ✓

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43
unknown function (ip: 0x10b82aca3)
Allocations: 279946573 (Pool: 279865905; Big: 80668); GC: 248

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43
unknown function (ip: 0x10b80f59c)
Allocations: 279946573 (Pool: 279865905; Big: 80668); GC: 248
...

Interestingly, if I remove MLJXGBoostInterface from the env, and delete the "using XGBoost" line, then there are no issues and the tests pass.

I do not seem to have problems with any pure Julia models.
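
One way to test whether the crashes come from several threads entering a native (non-Julia) library at the same time would be to serialize those calls behind a lock and see whether the crashes disappear. The sketch below uses only Base and does not call LIBSVM or XGBoost: `fit_native`, `NATIVE_LOCK` and `serialized` are hypothetical names introduced for illustration, and `fit_native` is only a stand-in for a real call into a wrapped library. It is a diagnostic idea under that assumption, not a description of how Stack or MLJBase handle threading.

```julia
# Hypothetical stand-in for a call into a native (non-Julia) library
# that may not be safe to enter from several threads at once.
fit_native(x) = sum(abs2, x)

# A single global lock guarding every call into the native library.
const NATIVE_LOCK = ReentrantLock()

# Run `f(args...)` while holding the lock, so that at most one task
# is inside the guarded call at any time.
function serialized(f, args...)
    lock(NATIVE_LOCK) do
        f(args...)
    end
end

# Three dummy "folds", mimicking CV(nfolds=3) in the report above.
folds = [rand(10) for _ in 1:3]
results = Vector{Float64}(undef, length(folds))

# The loop is still multithreaded, but the guarded calls run one at a time.
Threads.@threads for i in eachindex(folds)
    results[i] = serialized(fit_native, folds[i])
end
```

If wrapping the real native calls this way made the segfaults go away, that would point at thread-safety in the wrapped libraries (or in their OpenMP runtimes) rather than at the Stack implementation itself.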

In attempts to isolate the problem, I have encountered various other errors, such as:

OMP: Error #13: Assertion failure at kmp_csupport.cpp(540).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.

signal (6): Abort trap: 6
in expression starting at REPL[2]:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 105303846 (Pool: 105260636; Big: 43210); GC: 106

julia(70986,0x70000783d000) malloc: *** error for object 0x7ff0725333e0: pointer being freed was not allocated
julia(70986,0x70000783d000) malloc: *** set a breakpoint in malloc_error_break to debug

signal (6): Abort trap: 6
in expression starting at /Users/anthony/sandbox/crash.jl:46

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:46
Allocations: 279191441 (Pool: 279111122; Big: 80319); GC: 222

julia(90542,0x7000079c6000) malloc: Incorrect checksum for freed object 0x7f8da2b121a8: probably modified after being freed.
Corrupt value: 0x7f8da2b1b4c0
julia(90542,0x7000079c6000) malloc: *** set a breakpoint in malloc_error_break to debug

signal (6): Abort trap: 6
in expression starting at /Users/anthony/MLJ/MLJTestIntegration/examples/bigtest/notebook.jl:35

signal (4): Illegal instruction: 4
in expression starting at /Users/anthony/MLJ/MLJTestIntegration/examples/bigtest/notebook.jl:35

I am running with 5 threads.

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 5
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia
