[GPU] model loading latency optimization #34057
riverlijunjie wants to merge 6 commits into openvinotoolkit:master from
Conversation
```cpp
#endif

bool load_direct(std::istream& stream, void* buffer, size_t size) {
#ifdef __linux__
```
Instead of `#ifdef`, use separate files for Windows/Linux, etc.
These utils could become common utils for improved reads.
1. Unified Shared Memory (usm_host): eliminates the hidden implicit memory copies done by the GPU driver and allows the GPU to DMA directly from system host memory.
2. L3 cache-friendly block size: use small, finely tuned chunk sizes (e.g., 4 MB) instead of massive blocks.
3. Low-level system I/O: bypass the overhead and locking of C++ standard streams (std::istream) in favor of direct binary file-reading interfaces, reducing user-space buffer copies and kernel-space copying overhead.
```cpp
// Pass the cached blob file path to plugins that support it (e.g. GPU plugin)
// so they can use optimized parallel I/O to read weights directly from the blob file
if (!cacheContent.m_blob_id.empty() && util::contains(plugin.get_property(ov::supported_properties),
```
The core logic should avoid device-specific properties.
The cache entry is managed by the cache manager; there should not be any logic here that adds such a property or bypasses what the cache manager has opened. Using a hardcoded path is also not correct.
The proper solution is to open the stream (fast version) or the mmap version, which allows better parallel reads.
Early on, I also hoped to do so, but ifstream cannot meet the requirements of parallel reads, because each thread needs to seek to a different offset to read its data. Could you give a test sample of such a parallel read with streams?
```cpp
OV_CONFIG_RELEASE_OPTION(ov::internal, value_cache_quant_mode, ov::internal::CacheQuantMode::BY_TOKEN, "AUTO or BY_CHANNEL or BY_TOKEN")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, mem_pool_util_threshold, 0.5, "Minimum utilization threshold (0.0~1.0) for reusable memory in the pool")
OV_CONFIG_RELEASE_OPTION(ov, enable_weightless, false, "Enable/Disable weightless blob")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, cached_blob_path, "", "Path to the cached blob file used during cache loading for optimized parallel I/O")
```
This property should not be introduced.
Managing cache entries is a core responsibility, and it should not be bypassed.
Could you suggest a better way to pass the cached blob path to the GPU plugin for parallel reads?
@riverlijunjie
As discussed offline.
The core cache manager opens the cache in two ways, depending on the mmap flag.
With mmap disabled, the cache is opened as a stream and forwarded to the plugin. In this case, if there were a custom stream that hides the parallel read, the plugin could use it and benefit from the faster read, and it would work for all plugins.
With mmap enabled, the blob is opened as an ov::Tensor view on the mmap'ed file. In this case the plugin (GPU) should have more native support for using the tensor and reading the data as from a buffer (with a parallel option), instead of wrapping it in a stream. Reading should then be faster, and the mmap flag will not be bypassed by a custom GPU property.
```cpp
}

#ifdef _WIN32
bool ov::util::read_binary_file_parallel(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
```
Move this implementation to a dedicated file for Windows under the os folder.
```cpp
const std::wstring& wpath = path.native();

HANDLE hFile = CreateFileW(wpath.c_str(),
```
Suggested change (the `wpath` temporary is unnecessary):
```suggestion
HANDLE hFile = CreateFileW(path.c_str(),
```
```cpp
    return false;

// Safety check: File size
LARGE_INTEGER fileSize;
```
Suggested change:
```suggestion
LARGE_INTEGER file_size;
```
Use snake_case for variables.
sungeunk left a comment
LGTM. This change reduces the model loading time on PTLH.
Ran 4 executions after removing the cache files:
- Master: R1 4.657s -> R4 1.133s
- PR: R1 4.681s -> R4 0.441s
```cpp
allocation_type _allocation_type = allocation_type::unknown;
ib >> make_data(&_allocation_type, sizeof(_allocation_type));
// std::cout << "load weights: allocation_type = " << static_cast<int>(_allocation_type) << ", weights_path = " << weights_path << std::endl;
```