Run the example code provided. Check sections [3](#3-getting-started-inference-on-micron-dla-hardware) and [4](#4-getting-started-inference-on-micron-dla-hardware-with-c).
`sf.Compile` will parse the model from model.onnx and save the generated Micron DLA instructions. Here numfpga=2, so instructions for two FPGAs are created.
`nresults` is the output size of the model.onnx for one input image (no batching).
The expected output size of `sf.Run` is twice `nresults`, because numfpga=2 and two input images are processed. `input_img` is two images concatenated.
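Putting these steps together, a minimal sketch of the two-FPGA flow could look like the following. The `microndla.MDLA` object and the `Compile`/`Run` calls follow the SDK's Python examples, but the `'nfpgas'` flag name, the model file name, and the placeholder image shapes are assumptions here, so check the Python API reference for the exact option names.

```python
import numpy as np
import microndla

sf = microndla.MDLA()
sf.SetFlag('nfpgas', '2')         # assumed flag name for requesting numfpga=2
sf.Compile('model.onnx')          # parse the ONNX model and generate instructions for both FPGAs

# Two placeholder images concatenated into one contiguous buffer (shapes are illustrative)
img0 = np.random.rand(3, 224, 224).astype(np.float32)
img1 = np.random.rand(3, 224, 224).astype(np.float32)
input_img = np.ascontiguousarray(np.concatenate((img0, img1)))

result = sf.Run(input_img)        # returns 2 * nresults values, one set per input image
```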
The following example shows how to run different models using different clusters in parallel.
Currently, assigning clusters to each model is supported, but every model must use the same number of clusters; for example, giving 3 clusters to one model and 1 cluster to another is not allowed.
The example code is [here](./examples/python_api/twonetdemo.py).
In the code, you create one MDLA object for each model and compile each of them. For the first model, use 2 clusters together.
For the second model, assign the remaining 2 clusters to it. Use the `firstcluster` flag to tell `Compile` which cluster is the first one it should use.
In this example, the first model uses clusters 0 and 1 and the second model uses clusters 2 and 3.
In `Compile`, pass the previous MDLA object to link them together so that they get loaded into memory in one go.
In this case, you must use the `PutInput` and `GetResult` paradigm (see this [section](#6-tutorial---putinput-and-getresult)); you cannot use `Run`.
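A rough sketch of that flow follows; the authoritative version is [twonetdemo.py](./examples/python_api/twonetdemo.py). The `nclusters`/`firstcluster` flags match the description above, but the exact way the previous MDLA object is passed to `Compile`, and the precise `PutInput`/`GetResult` signatures, are assumptions here.

```python
import numpy as np
import microndla

# First model: 2 clusters, starting at cluster 0
ie1 = microndla.MDLA()
ie1.SetFlag('nclusters', '2')
ie1.Compile('model1.onnx')

# Second model: the remaining 2 clusters, starting at cluster 2,
# linked to the first MDLA object so both models are loaded into memory in one go
ie2 = microndla.MDLA()
ie2.SetFlag('nclusters', '2')
ie2.SetFlag('firstcluster', '2')
ie2.Compile('model2.onnx', ie1)            # assumed: previous MDLA object passed to Compile

# Placeholder inputs for the two networks (shapes are illustrative)
in1 = np.ascontiguousarray(np.random.rand(3, 224, 224).astype(np.float32))
in2 = np.ascontiguousarray(np.random.rand(3, 224, 224).astype(np.float32))

# Linked models require the PutInput/GetResult paradigm instead of Run
ie1.PutInput(in1, None)
ie2.PutInput(in2, None)
out1, _ = ie1.GetResult()                  # assumed to return (output, user object)
out2, _ = ie2.GetResult()
```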
Now we need to run this model using the accelerator with the SDK.
```python
sf.Compile('net_conv.onnx')
in_1 = np.ascontiguousarray(inV)
result = sf.Run(in_1)
```
A debug option won't affect the compiler; it will only print more information.
You can use `SetFlag('debug', 'b')` to enable basic debug prints. The debug code `'b'` stands for basic. Debug codes and option codes are letters (case-sensitive). For a complete list of codes, refer to [here](docs/Codes.md).
Always put `SetFlag()` after creating the Micron DLA object. It will print information about the run. First, it will list all the layers that it is going to compile from `net_conv.onnx`.
Then `Run` will rearrange the input tensor and load it into the external memory. It will print the time this took and other properties of the run, such as the number of FPGAs and clusters used.
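For example, a debug-enabled version of the run above might be set up like this; the key point is that `SetFlag('debug', 'b')` comes right after the object is created and before `Compile`. The `microndla.MDLA` import/class name and the placeholder shape of `inV` are assumptions here.

```python
import numpy as np
import microndla

# Placeholder input; in the tutorial, inV is the tensor fed to net_conv.onnx
inV = np.random.rand(1, 3, 64, 64).astype(np.float32)

sf = microndla.MDLA()
sf.SetFlag('debug', 'b')           # basic debug prints for everything that follows
sf.Compile('net_conv.onnx')        # prints the layers being compiled
in_1 = np.ascontiguousarray(inV)
result = sf.Run(in_1)              # prints input load time, number of FPGAs and clusters used
```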
`void *cmemo`: pointer to an Inference Engine object, may be 0
`const char *modelpath`: path to a model file in ONNX format
`const char* outbin`: path to a file where a model in the Inference Engine ready format will be saved. If this parameter is used, then an `Init` call is needed afterwards (see the sketch after this parameter list)
`const char *inshapes`: shapes of the inputs in the form size0xsize1xsize2...; multiple inputs are separated by semicolons; this parameter is optional, as the shapes of the inputs can be obtained from the model file
`uint64_t ***outshapes`: returns a pointer to noutputs pointers to the shapes of each output
`void *cmemp`: MDLA object to link together so that models can be loaded into memory together
***Return value:*** pointer to the Inference Engine object or 0 in case of error
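As a Python-level illustration of the `outbin` parameter: assuming the Python `Compile`/`Init` wrappers mirror this behaviour, the two modes differ roughly as in the sketch below (file names are just the ones used in the tutorial above).

```python
import microndla

# With an output file: the compiled model is saved to disk and Init must be called afterwards
sf = microndla.MDLA()
sf.Compile('net_conv.onnx', 'net_conv.bin')
sf.Init('./net_conv.bin')

# Without an output file: Compile prepares everything and no Init call is needed
sf2 = microndla.MDLA()
sf2.Compile('net_conv.onnx')
```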
******
`unsigned ninputs`: number of inputs, must be a multiple of the inputs expected by the network
`void *cmemp`: MDLA object to link together so that models can be loaded into memory together
***Return value:*** pointer to the Inference Engine object or 0 in case of error