
[GPU] model loading latency optimization #34057

Open
riverlijunjie wants to merge 6 commits into openvinotoolkit:master from riverlijunjie:river/cache_loading_opt

Conversation


@riverlijunjie riverlijunjie commented Feb 11, 2026

Details:

  • Goal: Maximize IO throughput when loading large OpenVINO GPU model caches (Blobs) on NVMe SSDs.
  • Bottleneck: The original implementation used a single-threaded std::istream (read/sgetn). Due to standard-library double buffering and CPU memory-copy overhead, throughput was capped at 1GB/s, failing to saturate modern NVMe hardware (3.5GB/s+).
  • Solutions:

    • Linux Optimization (Zero-Copy): O_DIRECT (Direct IO). Extracted the underlying file descriptor (FD) from std::filebuf, bypassed the stream layer, and used pread to read directly from disk into the user-space buffer. (Discarded because its performance was not as good as Parallel IO.)
    • Linux/Windows Optimization (Parallel IO): Implemented a custom parallel file loader that splits the load task into 4KB-aligned chunks processed by concurrent threads.
    • Resolving Data Corruption ("Garbled Data"): Parallel loading initially produced incorrect/corrupted weight values, so an Automatic Header Detection mechanism was implemented to fix this.
  • Results:

    • Correctness: Data verification passed; physical and logical offsets are now perfectly synchronized.
    • Performance: up to a 2x improvement, effectively utilizing the parallel capabilities of modern NVMe drives.
  • Todo list:

    • Cache model support
    • To verify on Windows
    • To verify on Linux
    • To verify on dGPU
    • Weightless support (Will do in another PR)
    • Normal loading support (Will do in another PR)
  • Test result:


Tickets:

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Feb 11, 2026
@p-durandin p-durandin added this to the 2026.1 milestone Feb 11, 2026
#endif

bool load_direct(std::istream& stream, void* buffer, size_t size) {
#ifdef __linux__
Contributor

Instead of ifdef, use separate files for Windows/Linux, etc.
These utils could be common utils for improved reads.

@github-actions github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Feb 16, 2026
@riverlijunjie riverlijunjie marked this pull request as ready for review February 28, 2026 00:38
@riverlijunjie riverlijunjie requested review from a team as code owners February 28, 2026 00:38
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Feb 28, 2026
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 7ef90f1 to bc5e0dc Compare February 28, 2026 14:46
@riverlijunjie riverlijunjie requested a review from praasz March 3, 2026 01:08
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 0931e3c to cae70ee Compare March 3, 2026 06:35
   1. Unified Shared Memory (usm_host): eliminates hidden implicit memory copies done by the GPU driver and allows the GPU to DMA directly from system host memory.
   2. L3-Cache-Friendly Block Size: use small, finely tuned chunk sizes (e.g., 4MB) instead of massive blocks.
   3. Low-Level System I/O: bypass the overhead and locking of C++ standard streams (std::istream) in favor of direct binary file-reading interfaces, reducing user-space buffer copies and kernel-space copy overhead.
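The three points above can be sketched in one routine. This is an illustrative Linux-only sketch (assuming POSIX pread and a 4MB chunk size; the function name is hypothetical), not the PR's actual implementation, which also handles 4KB alignment and has a Windows counterpart:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sketch: read `size` bytes starting at `offset` into `buffer`
// using several threads, each issuing independent pread() calls on the same fd.
// pread is safe to call concurrently because it takes an explicit offset and
// never touches the shared file position, so no locking is needed.
bool read_binary_file_parallel(const char* path, void* buffer, size_t size, size_t offset) {
    constexpr size_t chunk = 4 * 1024 * 1024;  // 4MB chunks, roughly L3-cache friendly
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return false;

    std::atomic<bool> ok{true};
    std::atomic<size_t> next{0};  // work-stealing cursor over the chunk grid
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            size_t pos;
            while ((pos = next.fetch_add(chunk)) < size) {
                size_t len = std::min(chunk, size - pos);
                char* dst = static_cast<char*>(buffer) + pos;
                size_t done = 0;
                while (done < len) {  // pread may return short reads
                    ssize_t n = ::pread(fd, dst + done, len - done, offset + pos + done);
                    if (n <= 0) { ok = false; return; }
                    done += static_cast<size_t>(n);
                }
            }
        });
    }
    for (auto& w : workers)
        w.join();
    ::close(fd);
    return ok;
}
```

Destination regions of different chunks are disjoint, so the threads never write to overlapping memory and no synchronization beyond the cursor is needed.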
Contributor

@praasz praasz left a comment

The improvement in read speed is a good direction, but the integration with core must be corrected, as it introduces a kind of bypass of the main logic.


// Pass the cached blob file path to plugins that support it (e.g. GPU plugin)
// so they can use optimized parallel I/O to read weights directly from the blob file
if (!cacheContent.m_blob_id.empty() && util::contains(plugin.get_property(ov::supported_properties),
Contributor

Device-specific properties should be avoided in the core logic. The cache entry is managed by the cache manager, and there should not be any logic here that adds such a property or bypasses what the cache manager has opened. Also, using a hardcoded path is not correct.

The proper solution is to open the stream (fast version) or the mmap version, which allows better parallel reads.

Contributor Author

Early on I also hoped to do so, but ifstream cannot meet the requirement of parallel reads, because each thread needs to seek to a different offset to read its data. Could you give a test sample of such a parallel read with a stream?
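(For reference, one way a stream-based parallel read can be sketched: give each thread its own std::ifstream on the same path, since a single shared istream has only one file position. Function name and chunk size below are assumptions; whether this matches pread/ReadFile performance would need measuring.)

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Sketch: parallel read using one independent std::ifstream per thread.
// A single shared istream cannot be used concurrently because seekg/read
// mutate one shared file position; private streams avoid that entirely.
bool parallel_stream_read(const std::string& path, char* buffer, size_t size) {
    const size_t chunk = 4 * 1024 * 1024;
    const size_t n_chunks = (size + chunk - 1) / chunk;
    const unsigned n_threads = static_cast<unsigned>(
        std::min<size_t>(n_chunks, std::max(1u, std::thread::hardware_concurrency())));
    std::atomic<bool> ok{true};
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) {
        pool.emplace_back([&] {
            std::ifstream f(path, std::ios::binary);  // private stream, private position
            if (!f) { ok = false; return; }
            size_t i;
            while ((i = next.fetch_add(1)) < n_chunks) {
                size_t pos = i * chunk;
                size_t len = std::min(chunk, size - pos);
                f.seekg(static_cast<std::streamoff>(pos));
                if (!f.read(buffer + pos, static_cast<std::streamsize>(len))) {
                    ok = false;
                    return;
                }
            }
        });
    }
    for (auto& th : pool)
        th.join();
    return ok;
}
```

The trade-off is one open stream (and one stdio-level buffer) per thread, which only works when the cache entry is addressable by path rather than handed over as an already-open stream.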

OV_CONFIG_RELEASE_OPTION(ov::internal, value_cache_quant_mode, ov::internal::CacheQuantMode::BY_TOKEN, "AUTO or BY_CHANNEL or BY_TOKEN")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, mem_pool_util_threshold, 0.5, "Minimum utilization threshold (0.0~1.0) for reusable memory in the pool")
OV_CONFIG_RELEASE_OPTION(ov, enable_weightless, false, "Enable/Disable weightless blob")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, cached_blob_path, "", "Path to the cached blob file used during cache loading for optimized parallel I/O")
Contributor

The property should not be introduced.
Managing cache entries is a core responsibility and should not be bypassed.

Contributor Author

Could you suggest a better way to pass the cached blob path to the GPU plugin for parallel reads?

Contributor

@riverlijunjie
As discussed offline.
The core cache manager opens the cache in two ways, depending on the mmap flag.
With mmap disabled, the cache is opened as a stream and forwarded to the plugin. In this case, if there were a custom stream that hides the parallel read, the plugin could use it, benefit from the faster read, and it would then work for all plugins.
With mmap enabled, the blob is opened as an ov::Tensor view over the mmap'd file. In this case, the plugin (GPU) should have more native support to use the tensor and read the data as from a buffer (with a parallel option) instead of wrapping it in a stream. Reading should then be faster, and the mmap flag will not be bypassed by a custom GPU property.
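A rough Linux-only sketch of the mmap variant described above, with a raw mmap'd file standing in for the ov::Tensor view (illustrative only; the function name, chunk size, and striding scheme are assumptions, not the suggested design):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Sketch: map the blob file read-only and let several threads memcpy
// disjoint chunks into the destination buffer. The first touch of each
// mapped page faults it in, so the page faults themselves run in parallel.
bool copy_from_mapped_blob(const char* path, void* dst, size_t size, size_t offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return false;
    struct stat st {};
    if (fstat(fd, &st) != 0 || static_cast<size_t>(st.st_size) < offset + size) {
        ::close(fd);
        return false;
    }
    void* map = mmap(nullptr, offset + size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // mapping stays valid after close
    if (map == MAP_FAILED)
        return false;
    const char* src = static_cast<const char*>(map) + offset;
    const size_t chunk = 4 * 1024 * 1024;
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&, t] {
            // Thread t handles chunks t, t+n, t+2n, ... so regions are disjoint.
            for (size_t pos = t * chunk; pos < size; pos += n * chunk)
                std::memcpy(static_cast<char*>(dst) + pos, src + pos, std::min(chunk, size - pos));
        });
    }
    for (auto& th : pool)
        th.join();
    munmap(map, offset + size);
    return true;
}
```

In the suggested design the plugin would receive the ov::Tensor view directly, so the open/mmap part would already be done by the cache manager and only the parallel copy would live in the plugin.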

}

#ifdef _WIN32
bool ov::util::read_binary_file_parallel(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
Contributor

Move this implementation to a dedicated file for Windows under the os folder.

Comment on lines +276 to +278
const std::wstring& wpath = path.native();

HANDLE hFile = CreateFileW(wpath.c_str(),
Contributor

Suggested change:
- const std::wstring& wpath = path.native();
- HANDLE hFile = CreateFileW(wpath.c_str(),
+ HANDLE hFile = CreateFileW(path.c_str(),

return false;

// Safety check: File size
LARGE_INTEGER fileSize;
Contributor

Suggested change:
- LARGE_INTEGER fileSize;
+ LARGE_INTEGER file_size;

Use snake_case for variables

Contributor

@sungeunk sungeunk left a comment

LGTM. This change reduces the model loading time on PTLH.
Ran 4 executions after removing the cache files.

  • Master: R1 4.657s -> R4 1.133s
  • PR: R1 4.681s -> R4 0.441s


allocation_type _allocation_type = allocation_type::unknown;
ib >> make_data(&_allocation_type, sizeof(_allocation_type));
// std::cout << "load weights: allocation_type = " << static_cast<int>(_allocation_type) << ", weights_path = " << weights_path << std::endl;
Contributor

Commented out code.


Labels

category: build OpenVINO cmake script / infra category: Core OpenVINO Core (aka ngraph) category: GPU OpenVINO GPU plugin category: inference OpenVINO Runtime library - Inference


7 participants