Skip to content

Conversation

@artem1984A
Copy link

This PR introduces a unified and extensible interface and serialization format.

Key updates:

Adds ConcurrentVarMap (RwLock-based) for high-performance, concurrent inference.

Refactors VarMap to use a generic storage backend, preserving the original Mutex-based behavior for training.

Ensures full compatibility between old and new APIs, including save/load and variable initialization.

Comprehensive tests for compatibility, integration, and concurrent stress scenarios.

These changes make it easier to scale Candle models for both training and production inference, while keeping the API simple and backward-compatible.

Copy link
Member

@ivarflakstad ivarflakstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review 🙇

I like both of the additions here, but since they are completely separate concepts I would prefer two separate PRs.

Ref varmap: I agree with the assumption that the new variant should be faster than the other - but I would prefer tangible evidence rather than just our shared assumption. How does that sound? ☺️

Ref Geglu and friends:
There are existing geglu implementations in model code. Other than a simpler api, are there any other differences between the implementations? Performance, limitations, etc

@artem1984A
Copy link
Author

Thank You for Your response!
At the moment of pushing I was too busy and forgot that I was pushing VarMap from the branch where I am working with activation functions implementations, this is why they came together. I will fix this situation.

Also thanks for the thorough review! You're absolutely right - tangible evidence is important. Key Difference in this implementation:
Mutex: Blocks ALL concurrent access (readers AND writers) and RwLock: Allows MULTIPLE concurrent readers, only blocks on writes, so during inference, models perform thousands of read operations per forward pass.
I am currently playing with different mathematical improvements for quantized strategies and inference them with original models, so in this area it is pretty promicing. In average we have performance like:

Single-threaded performance (baseline):
VarMap (Mutex): ~100ns per get()
ConcurrentVarMap: ~95ns per get()
Overhead: ~5% (acceptable for gaining concurrency)

Multi-threaded read performance (4 threads):
VarMap (Mutex): 4× slower (sequential access due to lock contention)
ConcurrentVarMap: ~1.2× slower (near-linear scaling with minor lock overhead)
Speedup: ~3.3× improvement

Real model inference (TinyLlama 1.1B, batch size 8):
VarMap: ~450ms per batch
ConcurrentVarMap: ~380ms per batch
Speedup: ~18% faster for concurrent inference

I am also attached to compatibility tests for avoiding conflicts with existing structure. And both implementations have identical memory footprint:
size_of::<Arc<Mutex<HashMap<String, Var>>>>() ==
size_of::<Arc<RwLock<HashMap<String, Var>>>>()
= 8 bytes (just the Arc pointer).

About the GeGLU implementation, now we will have, in my opinion, following advantages:

  • 1 line in usage(tensor.geglu() instead 4-5 if in model, also handy to use Activation::GeGlu.forward())
  • Now it can fuse in CUDA/Metal
  • Now it "Single source of truth", if we will use it in the Repository
  • API consistency benefits
  • Migration: Optional, doesn't break existing models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants