
Conversation

@jijoongmoon jijoongmoon (Collaborator) commented Mar 5, 2025

In this PR

This PR enables the Engine, Context, ContextData, and the Context's memory allocator.

Engine Class

In this PR, the Engine class is introduced, which manages the Contexts that create Layers for specific accelerators.
AppContext creates the CPU Layers and CLContext creates the GPU Layers. The Engine also provides pluggable features, so that
developers can create their own Context to support various accelerators such as NPUs.
Thus, we use the getRegisterContext("cpu") call to get the proper Context. It also supports registerFactory to add custom layers as before.
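
A minimal sketch of how a caller might use this flow. The accessor Engine::Global() and the context pointer type are assumptions based on this description, not the merged API:

```cpp
// Hedged sketch: names follow the PR description; Engine::Global() and the
// returned context type are assumptions, not the actual interface.
#include <string>

void buildLayersFor(const std::string &accelerator) {
  auto &engine = nntrainer::Engine::Global(); // assumed accessor

  // AppContext serves "cpu", CLContext serves "gpu"; a plugged-in custom
  // context could serve e.g. "npu".
  auto *context = engine.getRegisterContext(accelerator);

  // registerFactory still works for adding custom layers, as before:
  // context->registerFactory(nntrainer::createLayer<MyLayer>, "my_layer");
}
```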

Context Class

In this PR, the Context class is introduced. As mentioned for the Engine class, it is possible to add a custom Context, and this Context class serves as the base class for all custom Contexts.
The Context class also holds ContextData and a MemAllocator. ContextData can be used to create and utilize global data for a specific Context after the Context is initialized, and each specific accelerator needs its own memory allocator.
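
As a hedged sketch only, a custom context for an NPU might look like the following; the constructor shape and member types are assumptions based on the paragraph above, not the actual class definition:

```cpp
// Hedged sketch of a pluggable custom context (interface is assumed).
class NpuContext : public nntrainer::Context {
public:
  // each accelerator supplies its own allocator; ContextData holds the
  // context-global state created after initialization
  NpuContext(std::shared_ptr<nntrainer::ContextData> data,
             std::shared_ptr<nntrainer::MemAllocator> alloc);
  // ... NPU-specific layer factories would be registered here
};
```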

Read and save the weight

Previously, layer_node handled reading and saving the weights, but the read and save functions can differ depending on the layer type, such as the batch normalization layer. So, in this PR, the read function is lowered into the layer object itself. If there is no overridden read function, the layer uses the base class read in layer_devel.
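
For illustration, a hedged sketch of the described dispatch; the exact read() signature is an assumption:

```cpp
// Hedged sketch: read() now lives on the layer itself (signature assumed).
class BatchNormalizationLayer : public nntrainer::Layer {
public:
  // BN serializes moving mean/variance in its own way, so it overrides read
  void read(std::ifstream &file, nntrainer::RunLayerContext &context) override {
    // ... layer-specific weight deserialization
  }
};
// a layer without an override falls back to the base read() in layer_devel
```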

RunContext with ContextData

In order to use the ContextData, an additional parameter is set in RunLayerContext. In this way, it is possible to access the ContextData in each layer. A compute_engine type is also added to InitLayerContext.
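
A hedged sketch of the access path this enables; the accessor name is an assumption:

```cpp
// Hedged sketch: the RunLayerContext passed to a layer now carries the
// ContextData of the context that created the layer (accessor name assumed).
void MyLayer::forwarding(nntrainer::RunLayerContext &context, bool training) {
  auto ctx_data = context.getContextData(); // assumed accessor
  // ... use context-global data (e.g. accelerator handles) while running
}
```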

Tensor and Weight Layer

This PR also introduces the Tensor and Weight layers, which contain only Tensors and Weights, respectively. These layers act like containers that provide the data.
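
For example (a hedged sketch: the type key "weight" and the property strings are assumptions), such a container layer could be created like any other layer:

```cpp
// Hedged sketch: a WeightLayer only owns Weights and hands them out to
// other layers; the type key and properties below are assumptions.
auto w = ml::train::createLayer("weight", {"name=shared_w", "dim=1:1:4:4"});
```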

FSU with shared Tensor

This PR also provides an initial version of FSU with shared weight Tensors. Most HW accelerators prefer statically allocated memory over dynamically allocated memory, so the memory pool's allocate and getMemory function calls are modified, but this needs more testing. For better latency, we also have to consider the MAP_FIXED option in mmap; this should be done later. For accurate control of the execution order, there are some fixes in the manager as well.

Signed-off-by: jijoong.moon [email protected]

@EunjuYang EunjuYang (Contributor) left a comment


It seems the PR contains some commits already merged to upstream. Please rebase the PR.

* RunLayerContext in layer_node will hold the ContextData, so that, Layer can
* access this Context Data.
*/
class ContextData {
@piotrrak piotrrak Mar 12, 2025

We might want to publicly inherit from std::enable_shared_from_this<ContextData>, since it will be used with std::shared_ptr. The upside is that the shared/weak atomic reference counters (the shared_ptr control block) become part of this object rather than being allocated separately (most likely a dynamic allocation).
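
A self-contained sketch of the suggestion (the getShared helper is hypothetical); note that the single-allocation benefit specifically comes from creating the object with std::make_shared:

```cpp
#include <memory>

class ContextData : public std::enable_shared_from_this<ContextData> {
public:
  // any code holding only `this` can now recover an owning pointer that
  // shares the original control block (hypothetical helper)
  std::shared_ptr<ContextData> getShared() { return shared_from_this(); }
};

int main() {
  // make_shared fuses the control block and the object into one allocation
  auto data = std::make_shared<ContextData>();
  auto alias = data->getShared(); // no new control block is created
  return alias.use_count() == 2 ? 0 : 1;
}
```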

*/
ml::train::LayerComputeEngine
getComputeEngine(const std::vector<std::string> &props) {
for (auto &prop : props) {
@piotrrak piotrrak Mar 12, 2025

Is it a common pattern to have to enumerate all properties and filter on their values?

Please do note that I don't suggest the below as part of this change:

Such code hints that a possible future improvement would be:

  • to execute nntrainer::getKeyValue once for all properties, creating a mapping
    (e.g. std::map / std::unordered_map<property_key, property_value>, or our implementation of the future std::flat_map);
  • to switch all such functions to accept that mapping rather than doing O(n) filtering.

As the code executes now, it costs O(N*M), where N is the number of specified properties and M is the number of possible properties.

Other possible benefits:

  • such filtering code is probably harder to read and write, so it could go away;
  • it would be an opportunity to 'case-normalize' the keys of such a collection.
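
A hedged sketch of this direction (parseProperties and the "engine" key are hypothetical names, and real code would use nntrainer::getKeyValue instead of the inline split):

```cpp
#include <map>
#include <string>
#include <vector>

using PropertyMap = std::map<std::string, std::string>;

// Parse every "key=value" property once; later lookups are O(log N)
// instead of a linear scan per queried property.
PropertyMap parseProperties(const std::vector<std::string> &props) {
  PropertyMap parsed;
  for (const auto &prop : props) {
    const auto pos = prop.find('=');
    if (pos == std::string::npos)
      continue; // malformed entry; real code would report an error
    // keys could also be case-normalized here, as suggested above
    parsed.emplace(prop.substr(0, pos), prop.substr(pos + 1));
  }
  return parsed;
}

std::string getComputeEngine(const PropertyMap &props) {
  const auto it = props.find("engine"); // hypothetical key
  return it != props.end() ? it->second : "cpu";
}
```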

DonghakPark and others added 24 commits March 19, 2025 18:42
Add a check that t_dtype, t_life, and t_dims are the same

Fix build error on Windows & Debian

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Add nntrainer headers to the RPM spec file for the Tizen build
- mem_allocator.h
- tensor_layer.h
- weight_layer.h

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Update FSU forwarding logic
- FSU will handle the look-ahead tensor inside the pool
- so we don't need to call Loadtensor for f + i

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Add a memory pointer for allocating shared memory
- add mem_ptr
- add an unmap array for managing unmapped pointers

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
I have changed the method from using dynamic memory allocation to using static memory allocation.
In order to prevent multiple frees, I added a map to check whether the mem_address has already been processed. Previously, memory was allocated through buf, but now it is being allocated directly.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Co-authored-by: jijoong.moon <[email protected]>
Signed-off-by: Donghak PARK <[email protected]>
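
A hedged sketch of the double-free guard the commit above describes (set-based bookkeeping stands in for the commit's map; names are hypothetical):

```cpp
#include <cstdlib>
#include <unordered_set>

// Track addresses that were already released so a second release of the
// same mem_address becomes a no-op instead of a double free.
class ReleaseGuard {
  std::unordered_set<void *> released_;

public:
  void release(void *mem_address) {
    if (mem_address == nullptr)
      return;
    if (released_.insert(mem_address).second) // first time seen
      std::free(mem_address);
  }
};
```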
Make neuralnet able to pass the path to the swap_device and the weight offset (file offset),
so that the weight file's offset can be calculated

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Co-authored-by: hyeonseok <[email protected]>
Signed-off-by: Donghak PARK <[email protected]>
Apply Shared mem & FSU
- in inference mode: read from the weight bin (weight offset)
- in train mode: same logic as swap

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Fix unittest failure bug in the training case with swap
- There is an issue in PutBuffer where it cannot free the pointer

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Apply clang-format to changed files

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Update FSU unittest
- For now, we should set our weight & input sizes to pagesize * N
- Later I will add a page-align algorithm

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Add FSU CachePool Unittest
- Add memory allocation & validation test
- make setWeightOffset() public
- make setFsuWeightPath() public

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
1. Remove unused functions: just call .begin() / .end() on the iterator instead
- initCacheElemIter
- isLastCacheElemIter
- initExecIdsIter
- isLastExecIdsIter

2. Change map to unordered_map
- there is no need to sort, and unordered_map is faster than map

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
1. Test FSU SimpleFC with various params (look_ahead)
2. Add free when making the weight file

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Allocate memory space using aligned_alloc
- mmap with the MAP_FIXED option needs an aligned address & page-size reads
- Add an AllocateFSU function to allocate an aligned memory address
- Fix some applications

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
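
A hedged, POSIX-only sketch of the constraint the commit above works around (not the PR's implementation): mmap with MAP_FIXED only accepts a page-aligned target address, which is why the buffer comes from aligned_alloc and sizes are padded to pagesize * N:

```cpp
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  const long page = sysconf(_SC_PAGESIZE);
  const std::size_t bytes = 4 * static_cast<std::size_t>(page); // pagesize * N

  // aligned_alloc needs the size to be a multiple of the alignment
  void *buf = std::aligned_alloc(static_cast<std::size_t>(page), bytes);
  if (buf == nullptr)
    return 1;

  // MAP_FIXED replaces whatever is mapped at exactly this (aligned) address;
  // a real FSU would map the weight file's fd here instead of anonymous pages.
  void *mapped = mmap(buf, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  if (mapped == MAP_FAILED)
    return 1;

  munmap(mapped, bytes); // buf must not be passed to free() after the overlay
  return 0;
}
```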
- defined(_WIN32) -> #include <io.h> instead of #include <unistd.h>

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
The input layer only takes the FP32 type, so we need to quantize if
the output of the input layer is low-bit. Until
quantization/dequantization is supported inside the Tensor class, we
call quantize explicitly in the layer to support this.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR uses RPC mem for the CPU as the default memory allocator

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Use RPC_Mem allocation on Android for CPU, NPU, and GPU

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
Describe a commit content (until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
In order to extend the Context more easily, this PR adds the Engine class
to manage the Contexts. It also adds the Context class, which is the base
class of all Contexts.
 . add Engine class
 . add Context class
 . set the default Context to the app context
 . pluggable support in Engine
 . some more code optimization & test code required

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR uses RPC mem for the CPU as the default memory allocator

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR adds QNN binary loading to the load function

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This commit allows the tensor class to retrieve the total size of the allocated memory in bytes.
The key difference between the bytes() function and the new getMemoryBytes() method is that bytes() only returns the storage size for quantized data in quantized tensors, without accounting for the memory used for scaling factors and zero points.
However, for float-type tensors, both bytes() and getMemoryBytes() will return the same value.

**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test:   [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
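
A hedged, self-contained illustration of the distinction the commit above describes (this struct is not nntrainer's Tensor; per-channel int8 quantization with float scales and int32 zero points is an assumption):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

struct QuantizedTensorSketch {
  std::size_t elem_count;    // number of int8 data elements
  std::size_t channel_count; // one scale + one zero point per channel

  // bytes(): storage for the quantized data only
  std::size_t bytes() const { return elem_count * sizeof(std::int8_t); }

  // getMemoryBytes(): data plus scaling factors and zero points
  std::size_t getMemoryBytes() const {
    return bytes() + channel_count * (sizeof(float) + sizeof(std::int32_t));
  }
};

int main() {
  QuantizedTensorSketch t{1024, 16};
  std::cout << t.bytes() << " vs " << t.getMemoryBytes() << '\n'; // 1024 vs 1152
}
```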
This commit changes the QScheme data type to a 16-bit unsigned integer.
This is for consistency with BCQTensor in reading and saving quantization information, which eases calculating the offset.

**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test:   [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
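
As a hedged illustration of the offset point (the record layout is invented, not nntrainer's serialization format): with a fixed 16-bit field, the position of everything after the quantization scheme is a constant:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only: a fixed-width qscheme makes the offset of the data
// that follows it a compile-time constant, for reading and saving alike.
constexpr std::size_t kQSchemeBytes = sizeof(std::uint16_t);
constexpr std::size_t kPayloadOffset = kQSchemeBytes; // next field starts here
static_assert(kPayloadOffset == 2, "16-bit scheme -> 2-byte payload offset");
```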

auto model_file = checkedOpenStream<std::ifstream>(
-  file_path, std::ios::in | std::ios::binary);
+  (v.size() == 2) ? v[0] : v[1], std::ios::in | std::ios::binary);
Contributor

What if v.size() is equal to 1?

@DonghakPark (Member)

This PR's commits are included in #3049, so closing this PR.

