Project structure

This document is based on discussions held while preparing the compiler for open-source development. The (debatable) decisions on the project structure were made with the following goals in mind:

Enable/disable a specific platform by a CMake option and a corresponding define;
Make a clear, convenient process of adding a new device;
Ensure code reuse for different device generations.

Compiler overview

Dialects

Regardless of the device version, the compilation flow looks the same at the dialect level. The dialects represent different levels of detail. During compilation, the IR is lowered step by step from high-level abstractions to a more detailed representation. The compilation pipeline consists of "atomic" passes. Each pass in the compilation pipeline must represent a single transformation that reaches one specific goal (either IR adaptation or IR optimization).

It is also necessary to describe the dependence of dialects from an architectural point of view:

Only a low-level dialect may depend on a high-level one, not vice versa. Ideally, following the Dependency Inversion Principle (DIP), they should all depend on abstractions. A simple example is shown in the diagram above: AdjustLayoutsPass is written in a general manner and depends on the LayoutInfoOpInterface interface, which is implemented in the VPU dialect. Thus AdjustLayoutsPass is protected from changes in HW details and can easily be reused. This example is somewhat simplified relative to the actual implementation. More information can be found in MLIR Overview and below.
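
The dependency direction described above can be illustrated with a minimal standalone sketch. It does not use MLIR: Op, ILayoutInfo, Layouts37XX and adjustLayouts are hypothetical stand-ins for an operation, LayoutInfoOpInterface, a device-specific implementation and AdjustLayoutsPass, and the layout preferences are invented for illustration only.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an operation in the IR.
struct Op {
    std::string name;
    std::string layout;
};

// Abstraction the generic pass depends on (plays the role of LayoutInfoOpInterface).
class ILayoutInfo {
public:
    virtual ~ILayoutInfo() = default;
    virtual std::string preferredLayout(const Op& op) const = 0;
};

// Device-specific details live behind the abstraction.
// The concrete preferences below are invented for the example.
class Layouts37XX final : public ILayoutInfo {
public:
    std::string preferredLayout(const Op& op) const override {
        return op.name == "Convolution" ? "NHWC" : "NCHW";
    }
};

// Generic "pass": it knows only the interface, never the device.
// Returns the number of operations whose layout was changed.
inline int adjustLayouts(std::vector<Op>& ops, const ILayoutInfo& info) {
    int changed = 0;
    for (auto& op : ops) {
        const auto wanted = info.preferredLayout(op);
        if (op.layout != wanted) {
            op.layout = wanted;
            ++changed;
        }
    }
    return changed;
}
```

Supporting another device in this sketch means adding one more ILayoutInfo implementation; adjustLayouts itself stays untouched.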

Libraries

This high-level diagram covers the main dependencies between libraries inside the compiler. It makes sense to divide libraries into two "types": common and HW-specific. The common part consists of:

frontend: to import NGraph to IE dialect
core: data structures required by compiler
utils: helpers to work with core data structures
act_kernels: shave utilities
conversion: passes for lowering dialects
[dialect]_IR: dialect operations, attributes and types
[dialect]_transforms: passes to perform transformations over IR
[dialect]_interfaces: interfaces and base classes on which passes may depend
other utility libraries

The HW-specific part consists of implementations of interfaces, passes, operations and other device-specific details/utilities. There is one library for each device version. For convenience, the diagram shows them as separate libraries, so npu_compiler_[dialectN]37xx means the dialect folder in the NPU37XX directory.

Passes

Common passes

These are fully HW-agnostic passes: for any input IR they produce the same result regardless of the platform version. Such passes have to be placed in the common part. Please refer to primer_mlir for more information.

HW-specific passes

Hardware-specific passes are designed to work on a particular platform. From a development perspective, the only difference is that the appropriate HW folder has to be used. For example, for 37XX-specific passes of the IE dialect:

Declaration path in TableGen;
Declaration path for constructor;
Implementation folder.

You are allowed to reuse passes from an older HW version for a newer one if the required feature is a strict superset:

// 40XX
void vpux::buildDefaultHWModePipeline(mlir::OpPassManager& pm, const DefaultHWOptions40XX& options, Logger log) {
    // ...
    // Use pass from previous version here
    pm.addPass(IE::arch37xx::createHwSpecific1Pass(log));
    IE::buildName1Pipeline(pm, log);
    // ...
}

HW-specific passes must also be registered in vpux-opt for validation purposes:

// ...
vpux::IE::arch37xx::registerPasses();
// ...

"Mixed" passes

Mixed passes share a common core algorithm but utilise hardware specific information to make decisions.

Interface-based approach

Let's say we have a StrategyManager pass in the VPU dialect that can be applied to all HW generations. At the same time, the general algorithm of this pass needs information about the possible strategies, which differ between devices. So we have to store the strategies separately in the HW components, because, for example, for the newest device this is private information.

Following this approach, the development of a "mixed" pass is similar to a common pass. The difference here is that we have to create a concrete instance of the corresponding type in the common part, using, for example, the factory method:

std::unique_ptr<IStrategyGetter> vpux::VPU::createMCStrategyGetter(ArchKind arch, int64_t numClusters) {
    switch (arch) {
    case VPU::ArchKind::NPU37XX: {
        return std::make_unique<arch37xx::StrategyGetter>();
    }
    case VPU::ArchKind::NPU40XX: {
        return std::make_unique<arch40xx::StrategyGetter>(numClusters);
    }
    case ArchKind::UNKNOWN:
    default: {
        VPUX_THROW("Unsupported arch kind value: {0}", arch);
    }
    }
}

This approach does not have the disadvantages of the rejected option (see Rationale), but it has its own downsides:

A large number of factory methods need to be created. However, this problem can be mitigated by creating some sort of global registry, like a DI container in C# or Java.
Removing a module requires more effort, because the common part links against the hardware library (CMake changes) and the corresponding code must also be removed from the factories (see the previous point). The dependency between libraries could be resolved by switching to a plugin system, which would load the necessary libraries at runtime depending on the arch value.
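
As an illustration of the registry idea mentioned in the first point, here is a minimal standalone sketch. StrategyGetterRegistry and the two getter classes are hypothetical and are not part of the compiler; the sketch only shows how per-device creators could be registered by the HW libraries instead of being enumerated in a switch inside the common part.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

enum class ArchKind { UNKNOWN, NPU37XX, NPU40XX };

struct IStrategyGetter {
    virtual ~IStrategyGetter() = default;
    virtual std::string name() const = 0;
};

// Hypothetical registry: each HW library registers its own creator,
// so the common part needs no per-device switch statement.
class StrategyGetterRegistry {
public:
    using Creator = std::function<std::unique_ptr<IStrategyGetter>(int64_t numClusters)>;

    void add(ArchKind arch, Creator creator) {
        _creators[arch] = std::move(creator);
    }

    std::unique_ptr<IStrategyGetter> create(ArchKind arch, int64_t numClusters) const {
        const auto it = _creators.find(arch);
        if (it == _creators.end()) {
            throw std::runtime_error("Unsupported arch kind value");
        }
        return it->second(numClusters);
    }

private:
    std::map<ArchKind, Creator> _creators;
};

// Illustrative device-specific implementations.
struct StrategyGetter37XX final : IStrategyGetter {
    std::string name() const override {
        return "37XX";
    }
};

struct StrategyGetter40XX final : IStrategyGetter {
    explicit StrategyGetter40XX(int64_t numClusters): _numClusters(numClusters) {
    }
    std::string name() const override {
        return "40XX/" + std::to_string(_numClusters);
    }
    int64_t _numClusters;
};
```

With such a registry, removing a device would only require dropping its add() call, at the cost of a lookup failure that is detected at runtime rather than at compile time.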

Please note that although the common part (npu_compiler_dialect_passes_vpu) depends on the HW-specific one (npu_compiler_vpu37xx) at the library level, the classes by design do not. Here StrategyManagerPass depends on the IStrategyGetter interface, and arch37xx::StrategyGetter implements it, so both components depend on an abstraction and we still follow DIP.

This approach is adopted as the main one, as it reduces duplication and decreases the probability of errors in comparison with the rejected option.

Rewriter-based approach

In this example, the different behavior of the pass on different HW is achieved by adding special rewriters. To do this, we use an interface again: UnrollClusterTilingPass depends on IGreedilyPassStrategy. Here is a possible implementation of the UnrollClusterTilingPass::safeRunOnFunc method:

void UnrollClusterTilingPass::safeRunOnFunc() {
    auto& ctx = getContext();
    auto func = getOperation();

    auto strategy = createUnrollClusterTilingStrategy(func, _log);

    mlir::RewritePatternSet patterns(&ctx);
    // add necessary rewriters here
    strategy.addPatterns(patterns);

    if (mlir::failed(
                mlir::applyPatternsAndFoldGreedily(func, std::move(patterns), vpux::getDefaultGreedyRewriteConfig()))) {
        signalPassFailure();
    }
}

Here strategy is an IGreedilyPassStrategy, which can be implemented in different ways depending on the device version:

// 37XX
void UnrollClusterTilingStrategy::addPatterns(mlir::RewritePatternSet& patterns) {
    auto module = _func->getParentOfType<mlir::ModuleOp>();
    auto dmaOp = IE::getAvailableExecutor(module, VPU::ExecutorKind::DMA_NN);
    auto dmaPortCount = dmaOp.getCount();

    patterns.add<VPUIP::ClusterDMARewriter>(&_ctx, dmaPortCount, _log);
    patterns.add<VPUIP::arch37xx::ClusterSWRewriter>(&_ctx, module, _log);
    patterns.add<VPUIP::arch37xx::ClusterNCERewriter>(&_ctx, _log);
}

// 40XX
void UnrollClusterTilingStrategy::addPatterns(mlir::RewritePatternSet& patterns) {
    auto module = _func->getParentOfType<mlir::ModuleOp>();
    auto dmaOp = IE::getAvailableExecutor(module, VPU::ExecutorKind::DMA_NN);
    auto dmaPortCount = dmaOp.getCount();

    patterns.add<VPUIP::ClusterDMARewriter>(&_ctx, dmaPortCount, _log);
    patterns.add<VPUIP::arch37xx::ClusterSWRewriter>(&_ctx, module, _log);
    // Compared to the 37xx, we have specific ClusterNCERewriter here
    patterns.add<VPUIP::arch40xx::ClusterNCERewriter>(&_ctx, _log);
    // Compared to the 37xx, we have also ClusterConvertDMARewriter here
    patterns.add<VPUIP::arch40xx::ClusterConvertDMARewriter>(&_ctx, dmaPortCount, _log);
}

IConversionPassStrategy also provides a markOpLegality method, useful for setting up operation legality in passes which rely on the dialect conversion driver.

Rewriters can also depend on interfaces so that they are written in the most general form, which is a kind of combination with the interface-based approach. In this case, the necessary objects can be created directly in the addPatterns method. This approach also helps reduce code duplication, since it does not require passes to be registered for each device. We can then use the same pass name in vpux-opt and control the behavior of the pass using only vpu-arch:

// RUN: vpux-opt --split-input-file --init-compiler="vpu-arch=NPU40XX allow-custom-values=true" --unroll-cluster-tiling  %s | FileCheck %s

instead of, for example, duplicating the device version in the pass name:

// RUN: vpux-opt --split-input-file --init-compiler="vpu-arch=NPU40XX allow-custom-values=true" --unroll-cluster-tiling-VPUX40XX  %s | FileCheck %s

More detailed information about vpux-opt can be found in the how-to-test document.

Canonicalization

TODO: #-86282

Pipelines

The compiler has a different pipeline for each HW generation. These pipelines are stored in the appropriate HW folders: NPU37XX, etc. To build a pipeline, it is also necessary to implement the IPipelineStrategy interface for each device.

Then it is used in this way:

auto pipelineFactory = createPipelineStrategy(arch);
// pm is PassManager
pipelineFactory->buildPipeline(pm, config, rootTiming, log);
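
To make the shape of this interface more concrete, below is a minimal standalone sketch. It is an assumption-laden illustration, not the compiler's actual code: PassManager is a stand-in that merely records pass names, buildPipeline takes fewer parameters than the real call above, and the pass names are hypothetical.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class ArchKind { UNKNOWN, NPU37XX, NPU40XX };

// Stand-in for mlir::PassManager: it only records pass names in order.
struct PassManager {
    std::vector<std::string> passes;
    void addPass(std::string name) {
        passes.push_back(std::move(name));
    }
};

struct IPipelineStrategy {
    virtual ~IPipelineStrategy() = default;
    virtual void buildPipeline(PassManager& pm) const = 0;
};

// Each device owns its pipeline; pass names are hypothetical.
struct PipelineStrategy37XX final : IPipelineStrategy {
    void buildPipeline(PassManager& pm) const override {
        pm.addPass("common-pass");
        pm.addPass("hw-specific-pass-37xx");
    }
};

struct PipelineStrategy40XX final : IPipelineStrategy {
    void buildPipeline(PassManager& pm) const override {
        pm.addPass("common-pass");
        pm.addPass("hw-specific-pass-40xx");
    }
};

// Factory in the common part selects the strategy by architecture.
inline std::unique_ptr<IPipelineStrategy> createPipelineStrategy(ArchKind arch) {
    switch (arch) {
    case ArchKind::NPU37XX:
        return std::make_unique<PipelineStrategy37XX>();
    case ArchKind::NPU40XX:
        return std::make_unique<PipelineStrategy40XX>();
    default:
        throw std::invalid_argument("Unsupported arch kind value");
    }
}
```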

The main advantage of this approach is that we can easily hide the pipeline for a new device containing HW-specific passes. The consequence of this separation is that there is no need to add passes to the pipeline that do not work with this device. Therefore, the size of the pipeline becomes smaller, only the necessary passes are involved. And it is possible to get rid of such code:

void MyPass::safeRunOnFunc() {
    // ...
    if (arch != VPU::ArchKind::NPU37XX) {
        return;
    }
    // ...
}

This approach also has a downside: it is not clear why a given pass participates in one pipeline but not in another. Are there HW restrictions, or did the developer forget to add it? A possible solution is to introduce as many sub-pipelines as possible, so that the main pipelines take a similar form:

// Only sub-pipelines and HW-specific passes should remain in the main pipeline

// 37XX
void vpux::buildDefaultHWModePipeline(mlir::OpPassManager& pm, const DefaultHWOptions37XX& options, Logger log) {
    // ...
    IE::buildName1Pipeline(pm, log);
    IE::buildName2Pipeline(pm, log);
    pm.addPass(IE::arch37xx::createHwSpecific1Pass(log));
    IE::buildName3Pipeline(pm, log);
    // ...
}

// 40XX
void vpux::buildDefaultHWModePipeline(mlir::OpPassManager& pm, const DefaultHWOptions40XX& options, Logger log) {
    // ...
    IE::buildName1Pipeline(pm, log);
    pm.addPass(IE::arch40xx::createHwSpecific2Pass(log));
    IE::buildName2Pipeline(pm, log);
    IE::buildName3Pipeline(pm, log);
    // ...
}

Some recommendations are already written in code style.

Operation interfaces

Interfaces and external models are powerful tools that allow us to add the necessary behavior to operations at runtime. A typical example is AdjustLayoutsPass, which works with the IE::LayoutInfoOpInterface interface. For the same operation from the IE dialect, we want different results depending on the device version. For this purpose, different models can be implemented and then attached to the same operation depending on the device version:

// 37XX:
IE::SigmoidOp::attachInterface<vpux::VPU::SameAnyDimsOrderOpModelForSW>(*ctx);

Interface registration follows the same scheme as pipeline registration:

auto interfacesRegistry = createInterfacesRegistry(arch);
interfacesRegistry->registerInterfaces(registry);
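
The idea of attaching a different model to the same operation per device can be sketched without MLIR as follows. Everything here is hypothetical: ModelRegistry stands in for the dialect registry, ILayoutModel for an operation interface, and the choice of model for NPU40XX is invented purely to show that the attached behavior may differ.

```cpp
#include <map>
#include <memory>
#include <string>

enum class ArchKind { NPU37XX, NPU40XX };

// Plays the role of an operation interface such as IE::LayoutInfoOpInterface.
struct ILayoutModel {
    virtual ~ILayoutModel() = default;
    virtual bool supportsAnyLayout() const = 0;
};

struct SameAnyDimsOrderModel final : ILayoutModel {
    bool supportsAnyLayout() const override {
        return true;
    }
};

struct FixedDimsOrderModel final : ILayoutModel {
    bool supportsAnyLayout() const override {
        return false;
    }
};

// Stand-in for the registry: operation name -> attached model.
class ModelRegistry {
public:
    template <typename Model>
    void attach(const std::string& opName) {
        _models[opName] = std::make_unique<Model>();
    }

    // Returns nullptr when no model was attached to the operation.
    const ILayoutModel* lookup(const std::string& opName) const {
        const auto it = _models.find(opName);
        return it == _models.end() ? nullptr : it->second.get();
    }

private:
    std::map<std::string, std::unique_ptr<ILayoutModel>> _models;
};

// Per-device registration; the 40XX branch is an invented example.
inline void registerInterfaces(ModelRegistry& registry, ArchKind arch) {
    if (arch == ArchKind::NPU37XX) {
        registry.attach<SameAnyDimsOrderModel>("IE.Sigmoid");
    } else {
        registry.attach<FixedDimsOrderModel>("IE.Sigmoid");
    }
}
```

A generic pass would then query lookup("IE.Sigmoid") and act on the answer without knowing which device supplied the model.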

Properties

TODO: #-66795. Store properties in module; Handle properties in passes.

Operations

TODO: #-86281

There is no comprehensive solution here yet. As a first step, operations are divided between several ops.td files depending on the HW version, and the logic of transformations is again based on op-interfaces.

In future we could proceed with HW-specific dialects if necessary:

VPUIP37XX_SwKernelOp
VPUIP40XX_ConvertDMAOp
..

For example, we already have HW-specific dialects like VPUMI37XX.

Attributes

TODO: #-88494

Rationale

"Mixed" passes

Interface-based approach (rejected)

Here the common part contains a StrategyManagerImplAlgo class (it could also be a method, but that does not really matter), which holds the basic general logic. This class depends on an interface through which the HW details are supplied, in our case a specific set of strategies. This scheme requires the developer to register a pass for each platform:

// src/vpux_compiler/tblgen/vpux/compiler/NPU37XX/dialect/VPU/passes.td
// The same for 40XX
def StrategyManagerPass : PassBase<"strategy-manager", "mlir::OperationPass<mlir::func::FuncOp>"> {
    // ...
    let constructor = "vpux::VPU::arch37xx::createStrategyManagerPass()";
    // ...
}

The implementation of HW passes is also duplicated for each platform. One possible implementation:

void StrategyManagerPass::safeRunOnFunc() {
    auto func = getOperation();
    auto module = func->getParentOfType<mlir::ModuleOp>();

    // in case of 40XX we have to create arch40xx::StrategyGetter
    StrategyManagerImplAlgo algo{func, std::make_unique<arch37xx::StrategyGetter>()};
    algo.foo();
}

Then we will have the difference in compilation pipelines:

void vpux::buildDefaultHWModePipeline(mlir::OpPassManager& pm, const DefaultHWOptions37XX& options, Logger log) {
    // ....
    // Accordingly, it will be arch40xx::createStrategyManagerPass for 40XX, etc.
    pm.addPass(VPU::arch37xx::createStrategyManagerPass(log));
    // ...
}

The advantage of this approach is that the HW library itself creates the necessary dependencies for the generic algorithms, and therefore minimal changes are required to remove such a library from the repository: platform libraries depend on the generic part, and not vice versa.

At the same time there are several cons:

Code duplication for declaration and implementation of the pass;
Impossible to reuse sub-pipelines: we can't have common sub-pipeline for 37xx and 40xx with this pass;
It is easy to make a mistake when registering passes for vpux-opt. You get an error when trying to register passes for 37XX and 40XX at the same time, because two passes are registered with the same name.
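
The last point can be reproduced with a small standalone sketch. PassRegistry below is a hypothetical registry keyed by the command-line name of a pass; it is not MLIR's pass registry, and the way it reports the conflict (an exception) is an assumption made for the example.

```cpp
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical registry keyed by the command-line name of a pass.
class PassRegistry {
public:
    void registerPass(const std::string& argument) {
        // insert() reports whether the name was new
        if (!_names.insert(argument).second) {
            throw std::runtime_error("Pass '" + argument + "' is already registered");
        }
    }

    std::size_t size() const {
        return _names.size();
    }

private:
    std::set<std::string> _names;
};

// Both platforms declare the pass under the same name, as in the rejected option.
namespace arch37xx {
inline void registerPasses(PassRegistry& registry) {
    registry.registerPass("strategy-manager");
}
}  // namespace arch37xx

namespace arch40xx {
inline void registerPasses(PassRegistry& registry) {
    registry.registerPass("strategy-manager");
}
}  // namespace arch40xx
```

Registering either platform alone succeeds; registering both in the same tool hits the name clash.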

Dispatched Inlining

Motivation

MLIR's inliner will be called as part of the inliner pass. It does a lot behind the scenes, but when it comes to deciding if an operation can be inlined and how it shall be inlined, it is quite simple.

As an example, take isLegalToInline(): when it encounters an operation, it looks up the operation's dialect. Internally, the inliner keeps a mapping from mlir::Dialect* to mlir::DialectInlinerInterface. If the map contains no interface for the requested dialect, false is returned. Otherwise, the inliner dispatches to that particular inliner interface and returns interface->isLegalToInline(op). A handful of other functions work in the same fashion.

We see two important things here: The inliner can only decide on a per-operation basis which interface to choose and the inliner only supports at most one interface per dialect. This means that the inliner cannot support multiple different inlining semantics for a particular operation out of the box! This is the motivation for our dispatched inliner interface system.
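
The lookup described above can be modelled in a few lines of standalone C++. This is a simplified sketch and not MLIR code: Operation, InlinerInterface and Inliner are hypothetical stand-ins, and dialects are identified by plain strings instead of mlir::Dialect* pointers.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for an operation that knows its dialect.
struct Operation {
    std::string dialect;
    std::string name;
};

struct InlinerInterface {
    virtual ~InlinerInterface() = default;
    virtual bool isLegalToInline(const Operation& op) const = 0;
};

struct AlwaysLegal final : InlinerInterface {
    bool isLegalToInline(const Operation&) const override {
        return true;
    }
};

// Mirrors the described behavior: at most one interface per dialect,
// and the choice depends only on the dialect of the operation.
class Inliner {
public:
    void addInterface(const std::string& dialect, const InlinerInterface* iface) {
        _interfaces[dialect] = iface;
    }

    bool isLegalToInline(const Operation& op) const {
        const auto it = _interfaces.find(op.dialect);
        if (it == _interfaces.end()) {
            return false;  // no interface registered for this dialect
        }
        return it->second->isLegalToInline(op);
    }

private:
    std::map<std::string, const InlinerInterface*> _interfaces;
};
```

Because the map key is the dialect alone, two operations of the same dialect can never be routed to different interfaces in this model.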

Our solution is to implement a func inliner interface that can dynamically dispatch to other interfaces. The idea is that the user can add a special attribute, namely {inliner_dispatch = #MyDialect.MyInlinerDispatchAttr}, to func operations. UnifiedFuncInlinerInterface then dispatches to the inliner interface that is associated with #MyDialect.MyInlinerDispatchAttr.

Tutorial

Supporting operations of a custom dialect in the inliner

The common approach here is extending mlir::DialectInlinerInterface and implementing isLegalToInline(). The most trivial implementation looks like this:

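struct MyDialectInlinerInterface : public mlir::DialectInlinerInterface {
    // Inherit the base-class constructor, which takes mlir::Dialect*
    using mlir::DialectInlinerInterface::DialectInlinerInterface;

    bool isLegalToInline(mlir::Operation*, mlir::Operation*, bool) const final {
        return true;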
    }

    bool isLegalToInline(mlir::Region*, mlir::Region*, bool, mlir::IRMapping&) const final {
        return true;
    }

    bool isLegalToInline(mlir::Operation*, mlir::Region*, bool, mlir::IRMapping&) const final {
        return true;
    }
};

This then has to be registered during MyDialect::initialize():

void MyDialect::initialize() {
    // ...
    addInterface<MyDialectInlinerInterface>();
    // ...
}

This is enough to enable inlining in MyDialect if the default inlining behaviour (see mlir/lib/Dialect/Func/Extensions/InlinerExtension.cpp) is enough.

Supporting custom call (and func, return) operations

Assume we want to implement a custom MyDialect.Call operation. It extends CallOpInterface and will therefore be handled by UnifiedFuncInlinerInterface. If we don't want to have the default behaviour (see mlir/lib/Dialect/Func/Extensions/InlinerExtension.cpp) for that kind of operation, we can extend mlir::DialectInlinerInterface.

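struct MyDialectDispatchedInlinerInterface : public mlir::DialectInlinerInterface {
    // Inherit the base-class constructor, which takes mlir::Dialect*
    using mlir::DialectInlinerInterface::DialectInlinerInterface;

    bool isLegalToInline(mlir::Operation*, mlir::Operation*, bool) const final {
        return true;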
    }

    bool isLegalToInline(mlir::Region*, mlir::Region*, bool, mlir::IRMapping&) const final {
        return true;
    }

    bool isLegalToInline(mlir::Operation*, mlir::Region*, bool, mlir::IRMapping&) const final {
        return true;
    }

    void handleTerminator(mlir::Operation* op, mlir::ValueRange valuesToRepl) const final {
        // custom logic
    }

    void processInlinedCallBlocks(mlir::Operation* call,
                                  mlir::iterator_range<mlir::Region::iterator> inlinedBlocks) const final {
        // custom logic
    }

    std::tuple<mlir::Block*, mlir::Block::iterator> getInlineBlockAndPoint(mlir::Operation* call) const final {
        // custom logic
    }

    void eraseCall(mlir::Operation* call) const final {
        // custom logic
    }
};

Additionally, we have to add an attribute to MyDialect. This attribute will be added to the func-like operations in MyDialect to tell UnifiedFuncInlinerInterface which interface to dispatch to.

def MyDialectInlinerDispatchAttr : InlinerDispatchAttr<MyDialect, "MyDialectInlinerDispatch">;

MyDialect.Call {inliner_dispatch = #MyDialect.MyInlinerDispatchAttr} @someFunction() -> ()
// or if we just want to have different semantics for func ops
func.call {inliner_dispatch = #MyDialect.MyInlinerDispatchAttr} @someFunction() -> ()

We then have to register this interface in the UnifiedFuncInlinerInterface:

void MyDialect::initialize() {
    // ...

    // support inlining for "normal" ops in MyDialect
    addInterfaces<MyDialectInlinerInterface>();

    // support for func-like ops in MyDialect
    auto funcDialect = getContext()->getLoadedDialect<mlir::func::FuncDialect>();
    assert(funcDialect != nullptr);

    auto interface = funcDialect->getRegisteredInterface<Core::UnifiedFuncInlinerInterface>();
    assert(interface != nullptr);

    interface->registerDispatchedInlinerInterface<MyDialect::MyDialectInlinerDispatchAttr, MyDialect::FuncInlinerInterface>();
}

Note: If no dispatched inliner interface is provided via registerDispatchedInlinerInterface, a fallback implementation which mirrors mlir/lib/Dialect/Func/Extensions/InlinerExtension.cpp is used! For a lot of use-cases this is enough as the default inlining behaviour is the desired one.
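
The dispatch-with-fallback behavior can be summarised in a standalone sketch. It is only a model under stated assumptions: UnifiedFuncInliner, IInliner and FuncOp are hypothetical simplifications, attributes are represented as strings, and the sketch assumes that an operation without the attribute also takes the fallback path.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical func-like operation with an optional inliner_dispatch attribute.
struct FuncOp {
    std::string name;
    std::optional<std::string> inlinerDispatch;
};

struct IInliner {
    virtual ~IInliner() = default;
    virtual std::string handle(const FuncOp& op) const = 0;
};

// Plays the role of the default behavior used as a fallback.
struct DefaultFuncInliner final : IInliner {
    std::string handle(const FuncOp&) const override {
        return "default";
    }
};

struct MyDialectInliner final : IInliner {
    std::string handle(const FuncOp&) const override {
        return "my-dialect";
    }
};

class UnifiedFuncInliner {
public:
    void registerDispatched(const std::string& attr, std::unique_ptr<IInliner> impl) {
        _dispatched[attr] = std::move(impl);
    }

    std::string inlineCall(const FuncOp& op) const {
        return select(op).handle(op);
    }

private:
    // Pick the registered interface for the attribute, else the fallback.
    const IInliner& select(const FuncOp& op) const {
        if (op.inlinerDispatch.has_value()) {
            const auto it = _dispatched.find(*op.inlinerDispatch);
            if (it != _dispatched.end()) {
                return *it->second;
            }
        }
        return _fallback;
    }

    std::map<std::string, std::unique_ptr<IInliner>> _dispatched;
    DefaultFuncInliner _fallback;
};
```

The key difference from a per-dialect map is that the selection is driven by an attribute on the individual operation, so two func operations can be inlined with different semantics.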

Weights Separation

Monolithic Mode

The main motivation of the Monolithic mode is to align as closely as possible with "real" weights separation while keeping @init() and @main() in the same blob, and thus being able to use the current CI infrastructure. This eases the debugging of compilation, accuracy and inference issues (IMD).

A rough sketch of the Monolithic WS pipeline looks like this:

Up until IntroduceInitFunctionPass, we have the normal IR structure with a single @main(...) -> (...) function. This pass then creates the @init(...) -> (...) function. The pass strips away the transformations from const.Declare operations and converts them into IE-dialect operations in @init. The WSInit pipeline is then executed only on the @init function. After that, the UnpackNestedModulesPass, together with the InlinerPass, converts the multiple nested functions back into a single network function. Then, the default hardware VPUIP pipeline is executed.
