# Description

## Summary

During PyTorch development, we found that the Windows system memory allocator performs poorly and slows down overall PyTorch performance. After adding a third-party memory allocator, PyTorch's tensor allocation performance improved. For details, please see pytorch/pytorch#102534.
As a PyTorch submodule, oneDNN still uses the system memory allocator to malloc buffers for reorder/reshape operations.

The related code is here: https://github.com/oneapi-src/oneDNN/blob/11f55587a6ef7ac07bac5e81fdac72a8233bb469/src/common/utils.cpp#L146-L170

I added some debug logging to confirm this:
```diff
(build_pytorch) D:\xuhan\build_pytorch\pytorch\third_party\ideep\mkl-dnn>git diff
diff --git a/src/common/utils.cpp b/src/common/utils.cpp
index 37659a5d3e..1d1db40337 100644
--- a/src/common/utils.cpp
+++ b/src/common/utils.cpp
@@ -46,6 +46,38 @@
 #include "cpu/platform.hpp"
 #endif

+#ifdef _WIN32
+#include <debugapi.h>
+#define MAX_MESSAGE_SIZE 4096
+void D4D(LPCSTR szFormat, ...)
+{
+    const CHAR *p_ModuleName = "[pytorch] ";
+    char szMsg[MAX_MESSAGE_SIZE] = {0};
+    LPSTR lpsz = szMsg;
+    size_t nLen = 0;
+    va_list va;
+    va_start(va, szFormat);
+
+    lstrcatA(lpsz, p_ModuleName);
+
+    nLen = lstrlenA(szMsg);
+    lpsz = szMsg + nLen;
+
+    // _TRUNCATE avoids the invalid-parameter handler when the message
+    // exceeds the remaining buffer space.
+    _vsnprintf_s(lpsz, MAX_MESSAGE_SIZE - nLen, _TRUNCATE, szFormat, va);
+    va_end(va);
+
+    lstrcatA(szMsg, "\n");
+
+    OutputDebugStringA(szMsg);
+}
+#else
+void D4D(const char *szFormat, ...)
+{
+    // no-op on non-Windows builds
+}
+#endif
+
 namespace dnnl {
 namespace impl {
@@ -151,6 +183,7 @@ void *malloc(size_t size, int alignment) {
 #ifdef _WIN32
     ptr = _aligned_malloc(size, alignment);
     int rc = ptr ? 0 : -1;
+    D4D("dnnl malloc: %p - %zx", ptr, size);
 #else
     int rc = ::posix_memalign(&ptr, alignment, size);
 #endif
@@ -164,6 +197,7 @@ void free(void *p) {
 #ifdef _WIN32
     _aligned_free(p);
+    D4D("dnnl free: %p", p);
 #else
     ::free(p);
 #endif
(build_pytorch) D:\xuhan\build_pytorch\pytorch\third_party\ideep\mkl-dnn>
```
On Windows, I tested resnet18: it performs more than 360k malloc/free calls through the system malloc/free, as shown below:

(profiler screenshot)
## Problem statement

To investigate the slow memory allocation on Windows, I also wrote a malloc benchmark: https://github.com/xuhancn/bench_malloc

Third-party memory allocation libraries can improve performance. They also work well with PyTorch: pytorch/pytorch#102534 (comment)

So we need a way to let oneDNN use a third-party memory allocator for better performance.
### Option 1: Add a memory allocation library as a submodule

Actually, this is not a good option:
- An additional library brings in more license and security issues.
- It is hard to select a single memory allocation library that fits all use cases.
### Option 2: Add CPU alloc/free callbacks to support a customized memory allocator API

This is a lightweight way to change the memory allocation implementation:
- Add an optional registration API for CPU alloc/free callbacks.
- If callback functions are registered, oneDNN uses the customized memory allocator.
- If no callbacks are registered, oneDNN uses the default system memory allocator.
## Preferred solution

For option 2 above, first we can define the callback function types:

```cpp
// void* alloc_cpu(size_t size, int alignment);
typedef void* (*t_dnnl_cpu_aligned_malloc)(size_t, int);

// void free_cpu(void* data);
typedef void (*t_dnnl_cpu_free)(void*);
```

The registration API would look like this:
```cpp
static t_dnnl_cpu_aligned_malloc g_dnnl_cpu_malloc;
static t_dnnl_cpu_free g_dnnl_cpu_free;

bool register_dnnl_cpu_memory_allocation_apis(
        t_dnnl_cpu_aligned_malloc p_malloc, t_dnnl_cpu_free p_free) {
    if (!p_malloc || !p_free) return false;

    g_dnnl_cpu_malloc = p_malloc;
    g_dnnl_cpu_free = p_free;
    return true;
}
```
Reference implementation:

```cpp
void *malloc(size_t size, int alignment) {
    void *ptr;
    if (memory_debug::is_mem_debug())
        return memory_debug::malloc(size, alignment);

    // malloc callback
    if (g_dnnl_cpu_malloc) return g_dnnl_cpu_malloc(size, alignment);

#ifdef _WIN32
    ptr = _aligned_malloc(size, alignment);
    int rc = ptr ? 0 : -1;
#else
    int rc = ::posix_memalign(&ptr, alignment, size);
#endif

    return (rc == 0) ? ptr : nullptr;
}

void free(void *p) {
    if (memory_debug::is_mem_debug()) return memory_debug::free(p);

    // free callback
    if (g_dnnl_cpu_free) return g_dnnl_cpu_free(p);

#ifdef _WIN32
    _aligned_free(p);
#else
    ::free(p);
#endif
}
```
## Additional question

oneDNN has two malloc/free implementations:
- Common: https://github.com/oneapi-src/oneDNN/blob/11f55587a6ef7ac07bac5e81fdac72a8233bb469/src/common/utils.cpp#L146-L170
- Graph: https://github.com/oneapi-src/oneDNN/blob/11f55587a6ef7ac07bac5e81fdac72a8233bb469/src/graph/utils/alloc.cpp#L62-L80

Do we need to add callbacks for both of them?
CC: @jgong5, @chunyuan-w, @Guobing-Chen