
[GPU] model loading latency optimization #34057

Open
riverlijunjie wants to merge 6 commits into openvinotoolkit:master from riverlijunjie:river/cache_loading_opt

Conversation


@riverlijunjie riverlijunjie commented Feb 11, 2026

Details:

  • Goal: Maximize IO throughput when loading large OpenVINO GPU model caches (Blobs) on NVMe SSDs.
  • Bottleneck: The original implementation used a single-threaded std::istream (read/sgetn). Due to standard-library double buffering and CPU memory-copy overhead, throughput was capped at 1GB/s, failing to saturate modern NVMe hardware (3.5GB/s+).
  • Solutions:

    • Linux Optimization (Zero-Copy): O_DIRECT (Direct IO). Extracted the underlying file descriptor (FD) from std::filebuf, bypassed the stream layer, and used pread to read directly from disk into the user-space buffer. (Discarded because its performance was not as good as Parallel IO.)
    • Linux/Windows Optimization (Parallel IO): Implemented a custom parallel file loader that splits the load task into 4KB-aligned chunks processed by concurrent threads.
    • Resolving Data Corruption ("Garbled Data"): Parallel loading initially produced incorrect/corrupted weight values, so an Automatic Header Detection mechanism was implemented to fix this.
  • Results:

    • Correctness: Data verification passed; physical and logical offsets are now perfectly synchronized.
    • Performance: up to a 2x improvement, effectively utilizing the parallel capabilities of modern NVMe drives.
  • Todo list:

    • Cache model support
    • To verify on Windows
    • To verify on Linux
    • To verify on dGPU
    • Weightless support (Will do in another PR)
    • Normal loading support (Will do in another PR)
  • Test result:


Tickets:

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Feb 11, 2026
@p-durandin p-durandin added this to the 2026.1 milestone Feb 11, 2026
#endif

bool load_direct(std::istream& stream, void* buffer, size_t size) {
#ifdef __linux__
Contributor

Instead of ifdef, use separate files for Windows/Linux, etc.
These utils could be common utils for improved reads.

@github-actions github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Feb 16, 2026
@riverlijunjie riverlijunjie marked this pull request as ready for review February 28, 2026 00:38
@riverlijunjie riverlijunjie requested review from a team as code owners February 28, 2026 00:38
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Feb 28, 2026
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 7ef90f1 to bc5e0dc Compare February 28, 2026 14:46
@riverlijunjie riverlijunjie requested a review from praasz March 3, 2026 01:08
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 0931e3c to cae70ee Compare March 3, 2026 06:35
   1. Unified Shared Memory (usm_host): eliminates hidden implicit memory copies done by the GPU driver and allows the GPU to DMA directly from system host memory.
   2. L3-Cache-Friendly Block Size: use small, finely tuned chunk sizes (e.g., 4MB) instead of massive blocks.
   3. Low-Level System I/O: bypass the overhead and locking of C++ standard streams (std::istream) in favor of direct binary file-reading interfaces, reducing user-space buffer copies and kernel-space copy overhead.
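The three points above can be sketched in one routine. This is an illustrative Linux-only sketch (assuming POSIX pread and a 4MB chunk size; the function name is hypothetical), not the PR's actual implementation, which also handles 4KB alignment and has a Windows counterpart:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sketch: read `size` bytes starting at `offset` into `buffer`
// using several threads, each issuing independent pread() calls on the same fd.
// pread is safe to call concurrently because it takes an explicit offset and
// never touches the shared file position, so no locking is needed.
bool read_binary_file_parallel(const char* path, void* buffer, size_t size, size_t offset) {
    constexpr size_t chunk = 4 * 1024 * 1024;  // 4MB chunks, roughly L3-cache friendly
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return false;

    std::atomic<bool> ok{true};
    std::atomic<size_t> next{0};  // work-stealing cursor over the chunk grid
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            size_t pos;
            while ((pos = next.fetch_add(chunk)) < size) {
                size_t len = std::min(chunk, size - pos);
                char* dst = static_cast<char*>(buffer) + pos;
                size_t done = 0;
                while (done < len) {  // pread may return short reads
                    ssize_t n = ::pread(fd, dst + done, len - done, offset + pos + done);
                    if (n <= 0) { ok = false; return; }
                    done += static_cast<size_t>(n);
                }
            }
        });
    }
    for (auto& w : workers)
        w.join();
    ::close(fd);
    return ok;
}
```

Destination regions of different chunks are disjoint, so the threads never write to overlapping memory and no synchronization beyond the cursor is needed.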
Contributor

@praasz praasz left a comment

The improvement in read speed is a good direction, but the integration with core must be corrected, as it introduces a kind of bypass of the main logic.


// Pass the cached blob file path to plugins that support it (e.g. GPU plugin)
// so they can use optimized parallel I/O to read weights directly from the blob file
if (!cacheContent.m_blob_id.empty() && util::contains(plugin.get_property(ov::supported_properties),
Contributor

Device-specific properties should be avoided in the core logic. The cache entry is managed by the cache manager, and there should not be any logic here that adds such a property or bypasses what the cache manager has opened. Also, using a hardcoded path is not correct.

The proper solution is to open the stream (fast version) or the mmap version, which allows better parallel reads.

Contributor Author

Early on I also hoped to do so, but ifstream cannot meet the requirement of parallel reads, because each thread needs to seek to a different offset to read its data. Could you give a test sample of such a parallel read with a stream?
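(For reference, one way a stream-based parallel read can be sketched: give each thread its own std::ifstream on the same path, since a single shared istream has only one file position. Function name and chunk size below are assumptions; whether this matches pread/ReadFile performance would need measuring.)

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Sketch: parallel read using one independent std::ifstream per thread.
// A single shared istream cannot be used concurrently because seekg/read
// mutate one shared file position; private streams avoid that entirely.
bool parallel_stream_read(const std::string& path, char* buffer, size_t size) {
    const size_t chunk = 4 * 1024 * 1024;
    const size_t n_chunks = (size + chunk - 1) / chunk;
    const unsigned n_threads = static_cast<unsigned>(
        std::min<size_t>(n_chunks, std::max(1u, std::thread::hardware_concurrency())));
    std::atomic<bool> ok{true};
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) {
        pool.emplace_back([&] {
            std::ifstream f(path, std::ios::binary);  // private stream, private position
            if (!f) { ok = false; return; }
            size_t i;
            while ((i = next.fetch_add(1)) < n_chunks) {
                size_t pos = i * chunk;
                size_t len = std::min(chunk, size - pos);
                f.seekg(static_cast<std::streamoff>(pos));
                if (!f.read(buffer + pos, static_cast<std::streamsize>(len))) {
                    ok = false;
                    return;
                }
            }
        });
    }
    for (auto& th : pool)
        th.join();
    return ok;
}
```

The trade-off is one open stream (and one stdio-level buffer) per thread, which only works when the cache entry is addressable by path rather than handed over as an already-open stream.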

OV_CONFIG_RELEASE_OPTION(ov::internal, value_cache_quant_mode, ov::internal::CacheQuantMode::BY_TOKEN, "AUTO or BY_CHANNEL or BY_TOKEN")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, mem_pool_util_threshold, 0.5, "Minimum utilization threshold (0.0~1.0) for reusable memory in the pool")
OV_CONFIG_RELEASE_OPTION(ov, enable_weightless, false, "Enable/Disable weightless blob")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, cached_blob_path, "", "Path to the cached blob file used during cache loading for optimized parallel I/O")
Contributor

The property should not be introduced.
Managing cache entries is a core responsibility and should not be bypassed.

Contributor Author

Could you suggest a better way to pass the cached blob path to the GPU plugin for parallel reads?

Contributor

@riverlijunjie
As discussed offline.
The core cache manager opens the cache in two ways, depending on the mmap flag.
With mmap disabled, the cache is opened as a stream and forwarded to the plugin. In this case, if there were a custom stream that hides the parallel read, the plugin could use it, benefit from the faster read, and it would then work for all plugins.
With mmap enabled, the blob is opened as an ov::Tensor view over the mmap'd file. In this case, the plugin (GPU) should have more native support to use the tensor and read the data as from a buffer (with a parallel option) instead of wrapping it in a stream. Reading should then be faster, and the mmap flag will not be bypassed by a custom GPU property.
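A rough Linux-only sketch of the mmap variant described above, with a raw mmap'd file standing in for the ov::Tensor view (illustrative only; the function name, chunk size, and striding scheme are assumptions, not the suggested design):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Sketch: map the blob file read-only and let several threads memcpy
// disjoint chunks into the destination buffer. The first touch of each
// mapped page faults it in, so the page faults themselves run in parallel.
bool copy_from_mapped_blob(const char* path, void* dst, size_t size, size_t offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return false;
    struct stat st {};
    if (fstat(fd, &st) != 0 || static_cast<size_t>(st.st_size) < offset + size) {
        ::close(fd);
        return false;
    }
    void* map = mmap(nullptr, offset + size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // mapping stays valid after close
    if (map == MAP_FAILED)
        return false;
    const char* src = static_cast<const char*>(map) + offset;
    const size_t chunk = 4 * 1024 * 1024;
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&, t] {
            // Thread t handles chunks t, t+n, t+2n, ... so regions are disjoint.
            for (size_t pos = t * chunk; pos < size; pos += n * chunk)
                std::memcpy(static_cast<char*>(dst) + pos, src + pos, std::min(chunk, size - pos));
        });
    }
    for (auto& th : pool)
        th.join();
    munmap(map, offset + size);
    return true;
}
```

In the suggested design the plugin would receive the ov::Tensor view directly, so the open/mmap part would already be done by the cache manager and only the parallel copy would live in the plugin.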

}

#ifdef _WIN32
bool ov::util::read_binary_file_parallel(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
Contributor

Move this implementation to a dedicated file for Windows under the os folder.

Comment on lines +276 to +278
const std::wstring& wpath = path.native();

HANDLE hFile = CreateFileW(wpath.c_str(),
Contributor

Suggested change:
- const std::wstring& wpath = path.native();
- HANDLE hFile = CreateFileW(wpath.c_str(),
+ HANDLE hFile = CreateFileW(path.c_str(),

return false;

// Safety check: File size
LARGE_INTEGER fileSize;
Contributor

Suggested change:
- LARGE_INTEGER fileSize;
+ LARGE_INTEGER file_size;

Use snake_case for variables

Contributor

@sungeunk sungeunk left a comment

LGTM. This change reduces the model loading time on PTLH.
Ran 4 executions after removing the cache files.

  • Master: R1 4.657s -> R4 1.133s
  • PR: R1 4.681s -> R4 0.441s


allocation_type _allocation_type = allocation_type::unknown;
ib >> make_data(&_allocation_type, sizeof(_allocation_type));
// std::cout << "load weights: allocation_type = " << static_cast<int>(_allocation_type) << ", weights_path = " << weights_path << std::endl;
Contributor

Commented out code.


Labels

category: build OpenVINO cmake script / infra category: Core OpenVINO Core (aka ngraph) category: GPU OpenVINO GPU plugin category: inference OpenVINO Runtime library - Inference


7 participants