MAIX performance and limits for AI models.



The Sipeed MAIX board is based on the K210 main chip; this thread introduces the K210's performance and limits for AI models.

Performance

The declared compute power of the KPU is 0.23 TOPS for multiplications, 1 TOPS in total.

We ran a performance test:

CPU @ 500 MHz, KPU @ 300 MHz (the CPU can turbo to 700 MHz, the KPU to 750 MHz)

Weights = 16-bit, input channels = 128, output channels = 128, conv kernel = 3×3

We measured the latency of this convolution for different image sizes:

Width  Height    W*H    MAC (k)   Latency (ms)   GMAC/s
  4      4        16       2359      0.339          6.9
 16      4        64       9437      0.339         27.8
 16     16       256      37748      0.417         90.5
 32     16       512      75497      0.692        109.1
 32     32      1024     150994      1.138        132.6
 64     32      2048     301989      2.125        142.1
 64     64      4096     603979      3.891        155.2
128     64      8192    1207959      7.709        156.7
128    128     16384    2415919     14.778        163.5
256    256     65536    9663676     57.756        167.3

We can see that performance is good when W, H ≥ 64, because the KPU processes images 64 pixels at a time.

Throughput rises to about 170 GMAC/s once the image is large enough.

That is about 3/4 × 230 GMAC/s: the declared 230 GMAC/s is measured at the standard 400 MHz KPU frequency, while our test ran at 300 MHz.

So the declared figure checks out.

Some performance figures for classic networks:

MobileNet v1 1.0 @ QVGA: about 20–30 fps.

yolov2-tiny for face detection @ QVGA: about 50–70 fps.

yolov3-tiny @ QVGA: about 20 fps.

Limits

The K210 is a simple, highly integrated chip; it has some limits in hardware and software (at the moment).

Hardware Limits

Normal CPU frequency is 400 MHz; it can turbo to 500 MHz without raising the core voltage, or to 700 MHz with the core voltage raised to 1.1 V.

Normal KPU frequency is 400 MHz; it can turbo to 500 MHz without raising the core voltage, or to 750 MHz with the core voltage raised to 1.1 V.

There is 8 MB of SRAM inside the chip: 6 MB for the CPU, 2 MB for the KPU.

So the maximum model size is close to 6 MB; in practice, after reserving memory for the image buffer, about 5.5 MB is left free for the model.

In our MaixDuino (Arduino) environment, the maximum space for a model is ~5 MB.

In our MaixPy (MicroPython) environment, the maximum space for a model is ~3.5 MB.

If your model exceeds the size limit, you have to reload the remaining layers from flash.

The K210 uses quad-SPI NOR flash with a maximum read speed of about 50 MB/s, which slows down model inference significantly.

Also, no intermediate layer result can exceed 2 MB, since KPU memory is 2 MB.

Software Limits

Here is the KPU block diagram:

You can see it is good at computing classic CNNs, but poor at other network shapes (the CPU has to step in at concat/branch points).

The K210 has a tool called "nncase" for model conversion.

nncase supports the TFLite and Caffe model formats; other formats need to be converted to TFLite or Caffe first.

(A conversion toolbox to get started: https://github.com/sipeed/Maix_Toolbox)

nncase handles the op nodes in this list (from kpu.h):


typedef enum
{
    KL_INVALID = 0,
    KL_ADD,
    KL_QUANTIZED_ADD,
    KL_GLOBAL_MAX_POOL2D,
    KL_QUANTIZED_GLOBAL_MAX_POOL2D,
    KL_GLOBAL_AVERAGE_POOL2D,
    KL_QUANTIZED_GLOBAL_AVERAGE_POOL2D,
    KL_MAX_POOL2D,
    KL_QUANTIZED_MAX_POOL2D,
    KL_AVERAGE_POOL2D,
    KL_QUANTIZED_AVERAGE_POOL2D,
    KL_QUANTIZE,
    KL_DEQUANTIZE,
    KL_REQUANTIZE,
    KL_L2_NORMALIZATION,
    KL_SOFTMAX,
    KL_CONCAT,
    KL_QUANTIZED_CONCAT,
    KL_FULLY_CONNECTED,
    KL_QUANTIZED_FULLY_CONNECTED,
    KL_TENSORFLOW_FLATTEN,
    KL_QUANTIZED_TENSORFLOW_FLATTEN,
    KL_K210_CONV = 10240,
    KL_K210_ADD_PADDING,
    KL_K210_REMOVE_PADDING,
    KL_K210_UPLOAD
} kpu_model_layer_type_t;

For operations inside the list, nncase handles everything automatically, and the drivers are done too: you just put data in and get results out.

For operations outside the list, you need to split the network, get the intermediate result from the KPU, write a CPU program to do the unsupported operations, and feed the result back to the KPU manually. That takes time to debug.

Or you can open an issue at https://github.com/kendryte/nncase

If you have any questions, feel free to leave a message below~