Description
Describe the bug
Hello, I'd like to ask for help resolving a multi-threading issue. It can be reproduced with the source code at https://github.com/mlemanczyk/even-perfect-numbers-scanner/tree/codex/analyze-gc-instances-in-checkdivisors-method.
You can use the CSV test file as input to EvenPerfectBitScanner. It is inside the sorted-primes.zip file in the root of the repository.
Just run it with the parameters:
.\EvenPerfectBitScanner.exe --increment=add --filter-p=./sorted-primes.csv --use-orders=false --prime=31 --max-prime=140000000 --mersenne=bydivisor --divisor-cycles-device=cpu --mersenne-device=cpu --order-device=gpu --primes-device=gpu --divisor-cycles-batch=131072 --gpu-prime-batch=1024 --threads=10240 --gpu-prime-threads=20480 --write-batch-size=1 --bydivisor-deltas-device=gpu --bydivisor-montgomery-device=cpu --block-size=6 --test
Depending on your hardware, you may need to adjust the number of rolling accelerators in PerfectNumberConstants.cs.
I'm running it with thousands of threads, e.g. 16,384 or more. I'd like to share the accelerators between threads, giving each thread its own input / output device buffers and its own stream. But whenever I use multiple streams on one accelerator, sooner or later I run into memory access violations or CL exceptions during copy-to-device or kernel launch. I've tried adding locks everywhere, especially around device memory allocations, but nothing seems to help. I'm not sure whether the cause is ILGPU, the AMD drivers, or my own code.
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:
at ILGPU.Runtime.OpenCL.CLAPI_0.clEnqueueFillBuffer_Import(IntPtr, IntPtr, Void*, IntPtr, IntPtr, IntPtr, Int32, IntPtr*, IntPtr*)
at ILGPU.Runtime.OpenCL.CLAPI_0.clEnqueueFillBuffer(IntPtr, IntPtr, Void*, IntPtr, IntPtr, IntPtr, Int32, IntPtr*, IntPtr*)
at ILGPU.Runtime.OpenCL.CLAPI.FillBuffer[[System.Byte, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](ILGPU.Runtime.AcceleratorStream, IntPtr, Byte, IntPtr, IntPtr)
at ILGPU.Runtime.OpenCL.CLMemoryBuffer.CLMemSet[[System.Byte, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](ILGPU.Runtime.OpenCL.CLStream, Byte, ILGPU.ArrayView1<Byte> ByRef)
at ILGPU.Runtime.OpenCL.CLMemoryBuffer.MemSet(ILGPU.Runtime.AcceleratorStream, Byte, ILGPU.ArrayView1 ByRef)
at ILGPU.Runtime.MemoryBuffer.MemSet(ILGPU.Runtime.AcceleratorStream, Byte, Int64, Int64)
at ILGPU.Runtime.ArrayViewExtensions.MemSet[[ILGPU.ArrayView1[[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], ILGPU, Version=1.5.3.0, Culture=neutral, PublicKeyToken=null]](ILGPU.ArrayView1, ILGPU.Runtime.AcceleratorStream, Byte)
at ILGPU.Runtime.ArrayViewExtensions.MemSetToZero[[ILGPU.ArrayView1[[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], ILGPU, Version=1.5.3.0, Culture=neutral, PublicKeyToken=null]](ILGPU.ArrayView1, ILGPU.Runtime.AcceleratorStream)
at ILGPU.Runtime.ArrayViewExtensions.MemSetToZero[[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](ILGPU.Runtime.ArrayView1D2<Int32,Dense>, ILGPU.Runtime.AcceleratorStream)
at PerfectNumbers.Core.HeuristicCombinedPrimeTester.HeuristicTrialDivisionGpuDetectsDivisor(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, UInt64, Byte)
at PerfectNumbers.Core.HeuristicCombinedPrimeTester.IsPrimeGpu(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, UInt64, Byte)
Finished processing 12611
Processing 12613
at PerfectNumbers.Core.HeuristicCombinedPrimeTester.IsPrimeGpu(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64)
at PerfectNumbers.Core.PrimeOrderCalculator.PartialFactor(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, PrimeOrderSearchConfig ByRef)
at PerfectNumbers.Core.PrimeOrderCalculator.CalculateInternal(UInt64, System.Nullable1, PerfectNumbers.Core.MontgomeryDivisorData ByRef, PrimeOrderSearchConfig ByRef)
at PerfectNumbers.Core.PrimeOrderCalculator.Calculate(UInt64, System.Nullable1<UInt64>, PerfectNumbers.Core.MontgomeryDivisorData ByRef, PrimeOrderSearchConfig ByRef, PrimeOrderHeuristicDevice)
at PerfectNumbers.Core.MersenneDivisorCycles.TryCalculateCycleLengthForExponentCpu(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, UInt64, PerfectNumbers.Core.MontgomeryDivisorData ByRef, UInt64 ByRef, Boolean ByRef)
at PerfectNumbers.Core.Cpu.MersenneNumberDivisorByDivisorCpuTester.CheckDivisors64(UInt64, UInt64, UInt64, UInt64, UInt16, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Boolean ByRef)
at PerfectNumbers.Core.Cpu.MersenneNumberDivisorByDivisorCpuTester.CheckDivisors(UInt64, UInt64, Boolean ByRef)
at PerfectNumbers.Core.Cpu.MersenneNumberDivisorByDivisorCpuTester.IsPrime(UInt64, Boolean ByRef)
at PerfectNumbers.Core.MersenneNumberDivisorByDivisorTester+<>c__DisplayClass0_0.<Run>g__ProcessPrime|0(UInt64)
at PerfectNumbers.Core.MersenneNumberDivisorByDivisorTester+<>c__DisplayClass0_2.<Run>b__1()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.Tasks.Task.ExecuteEntry()
at PerfectNumbers.Core.UnboundedTaskScheduler.ExecuteTask(System.Threading.Tasks.Task)
at PerfectNumbers.Core.TaskThreadPool.WorkerLoop()
May I ask for your help and input on this?
Environment
- ILGPU version: 1.5.3
- .NET version: .NET 8
- Operating system: Windows 11
- Hardware (if GPU-related): AMD Ryzen 7 notebook with integrated GPU, 20 GB RAM.
Steps to reproduce
- Compile the solution
- Unzip sorted-primes.zip to the bin folder of the EvenPerfectBitScanner application
- Run EvenPerfectBitScanner with the following parameters:
.\EvenPerfectBitScanner.exe --increment=add --filter-p=./sorted-primes.csv --use-orders=false --prime=31 --max-prime=140000000 --mersenne=bydivisor --divisor-cycles-device=cpu --mersenne-device=cpu --order-device=gpu --primes-device=gpu --divisor-cycles-batch=131072 --gpu-prime-batch=1024 --threads=10240 --gpu-prime-threads=20480 --write-batch-size=1 --bydivisor-deltas-device=gpu --bydivisor-montgomery-device=cpu --block-size=6 --test
Expected behavior
Multiple threads can share an accelerator, each thread using its own stream (one stream per thread, many threads per accelerator), without exceptions.
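Reduced to its essentials, the usage pattern in question might look like the sketch below. It assumes ILGPU 1.5.x and is not the repository's code: the worker count, buffer length, and iteration count are hypothetical values chosen for illustration, and it has not been confirmed to trigger the crash on its own.

```csharp
using System.Threading.Tasks;
using ILGPU;
using ILGPU.Runtime;

// Hypothetical sizes for illustration; the real run uses thousands of threads.
const int WorkerCount = 64;
const int BufferLength = 1024;
const int Iterations = 1_000;

using var context = Context.CreateDefault();

// One accelerator shared by every worker.
using var accelerator = context
    .GetPreferredDevice(preferCPU: false)
    .CreateAccelerator(context);

var workers = new Task[WorkerCount];
for (int i = 0; i < WorkerCount; i++)
{
    workers[i] = Task.Run(() =>
    {
        // Each worker owns its stream and device buffer; only the accelerator is shared.
        using var stream = accelerator.CreateStream();
        using var buffer = accelerator.Allocate1D<int>(BufferLength);

        for (int iteration = 0; iteration < Iterations; iteration++)
        {
            // The same call that sits at the top of the managed stack trace.
            buffer.View.MemSetToZero(stream);
            stream.Synchronize();
        }
    });
}

Task.WaitAll(workers);
```

No stream or buffer is ever touched by more than one thread here, so any failure would point at the shared accelerator or the driver rather than at unsynchronized access to a single stream.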
Additional context
No response