With the following changes to llama2.c, I am able to achieve 19.13 tok/s:
- Using both cores of the ESP32 during math-heavy operations.
- Using the dot product functions from the ESP-DSP library that are optimized for the ESP32-S3. These functions use some of the few SIMD instructions the ESP32-S3 has.
- Maxing out the CPU speed at 240 MHz and the PSRAM speed at 80 MHz, and increasing the instruction cache size.
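
To illustrate the first two points, here is a minimal, portable sketch of splitting a matrix-vector product by rows across two workers. It is not the code from this repository: `pthread` stands in for a FreeRTOS task pinned to the second core (for example one created with `xTaskCreatePinnedToCore`), and the hypothetical helper `dot_f32` stands in for an ESP-DSP routine such as `dsps_dotprod_f32`. The names `matmul_job_t`, `matmul_rows`, and `matmul_two_workers` are made up for this example; the argument order follows the `matmul` in llama2.c.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical plain-C stand-in for an optimized dot product routine. */
static float dot_f32(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Work description for one worker: rows [row_start, row_end) of w. */
typedef struct {
    float *out;
    const float *x;
    const float *w;
    int n;
    int row_start;
    int row_end;
} matmul_job_t;

/* Computes one output element per assigned row of w. */
static void *matmul_rows(void *arg) {
    matmul_job_t *job = (matmul_job_t *)arg;
    for (int i = job->row_start; i < job->row_end; i++) {
        job->out[i] = dot_f32(job->w + (size_t)i * job->n, job->x, job->n);
    }
    return NULL;
}

/*
 * out (d) = w (d x n, row-major) * x (n).
 * The calling thread handles the first half of the rows and a second
 * thread handles the rest. Returns 0 if both workers ran, 1 if the
 * second thread could not be started and the caller did all the rows.
 */
static int matmul_two_workers(float *out, const float *x, const float *w,
                              int n, int d) {
    int half = d / 2;
    matmul_job_t first = { out, x, w, n, 0, half };
    matmul_job_t second = { out, x, w, n, half, d };
    pthread_t worker;

    if (pthread_create(&worker, NULL, matmul_rows, &second) != 0) {
        first.row_end = d; /* fall back to a single worker */
        matmul_rows(&first);
        return 1;
    }
    matmul_rows(&first);
    pthread_join(worker, NULL);
    return 0;
}
```

The two workers write to disjoint ranges of `out`, so no locking is needed beyond the final join. On the actual hardware a long-lived task on the second core would likely be cheaper than creating a thread per call, but that is a design assumption rather than something stated above.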
Building this requires the ESP-IDF toolchain to be installed. Then build and flash with:
idf.py build
idf.py -p /dev/{DEVICE_PORT} flash