Description
See also #395.
TLDR
The 8-byte buffer in lfs_bd_cmp() is a bottleneck on slower systems.
In a RAM-backed configuration, performance increased by ~40% when this buffer was increased to 256 bytes. (Note: with real flash devices you will not see this level of improvement, as performance is limited more by the flash device itself; testing against RAM, however, gives a good indication of any bottlenecks in the filesystem.)
Please consider:
- Make the size of the internal dat[] array overridable/tunable, or a member of lfs_config (see the sketch after this list)
- If technically possible, leverage the existing pcache and rcache arrays to perform the compare, rather than creating a third array on the stack.
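To illustrate the first suggestion, here is a minimal sketch of the tunable-buffer idea. The loop is paraphrased from lfs.c rather than copied, and LFS_CMP_BUF_SIZE is a hypothetical name:

```c
// Hypothetical build-time override (LFS_CMP_BUF_SIZE is a made-up name);
// defaults to the current 8 bytes.
#ifndef LFS_CMP_BUF_SIZE
#define LFS_CMP_BUF_SIZE 8
#endif

// Paraphrase of lfs_bd_cmp() in lfs.c: data is read back in dat-sized
// chunks and memcmp'd against the caller's buffer, so a larger buffer
// directly divides the number of lfs_bd_read()/memcmp() calls.
static int lfs_bd_cmp(lfs_t *lfs,
        const lfs_cache_t *pcache, lfs_cache_t *rcache, lfs_size_t hint,
        lfs_block_t block, lfs_off_t off,
        const void *buffer, lfs_size_t size) {
    const uint8_t *data = buffer;
    lfs_size_t diff = 0;

    for (lfs_off_t i = 0; i < size; i += diff) {
        uint8_t dat[LFS_CMP_BUF_SIZE];

        diff = lfs_min(size-i, sizeof(dat));
        int err = lfs_bd_read(lfs, pcache, rcache, hint-i,
                block, off+i, dat, diff);
        if (err) {
            return err;
        }

        int res = memcmp(dat, data+i, diff);
        if (res) {
            return res < 0 ? LFS_CMP_LT : LFS_CMP_GT;
        }
    }

    return LFS_CMP_EQ;
}
```

Making it a member of lfs_config (e.g. alongside cache_size) would give the same control at runtime instead of build time, at the cost of a caller-provided or heap-allocated buffer.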
Full background
I have been investigating lfs performance on a Cortex-M4 processor running at 100MHz and, to rule out flash performance, implemented lfs_config->read(), lfs_config->prog() and lfs_config->erase() to read/write to a chunk of RAM instead (64kB, emulating 4x 256 pages per block).
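For reference, the RAM-backed block device amounts to something like the following (a simplified sketch rather than my exact code; the block geometry shown is illustrative):

```c
// Simplified RAM-backed block device: read/prog/erase operate on a static
// buffer instead of flash. Geometry is illustrative (64kB total).
#include <string.h>
#include "lfs.h"

#define BLOCK_SIZE  16384
#define BLOCK_COUNT 4

static uint8_t ram[BLOCK_COUNT * BLOCK_SIZE];

static int ram_read(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, void *buffer, lfs_size_t size) {
    (void)c;
    memcpy(buffer, &ram[block*BLOCK_SIZE + off], size);
    return 0;
}

static int ram_prog(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, const void *buffer, lfs_size_t size) {
    (void)c;
    memcpy(&ram[block*BLOCK_SIZE + off], buffer, size);
    return 0;
}

static int ram_erase(const struct lfs_config *c, lfs_block_t block) {
    (void)c;
    memset(&ram[block*BLOCK_SIZE], 0xff, BLOCK_SIZE);
    return 0;
}

static int ram_sync(const struct lfs_config *c) {
    (void)c; // nothing to flush for RAM
    return 0;
}
```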
I was surprised to see that even with this configuration, lfs_config->prog() performance was only ~1MB/s, indicating that there may be an inherent bottleneck in littlefs.
Part of this was due to the fact that I was using a vendor-provided precompiled newlib-nano built with either PREFER_SIZE_OVER_SPEED or __OPTIMIZE_SIZE__ defined, which results in the simplest, and slowest, implementation of memcpy() possible: raw memcpy() throughput was only 10MB/s. (This is not directly relevant to the issue found, but I've included it in case others find this ticket; it may be worth checking your implementations of standard libc functions.)
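For anyone who wants to sanity-check their own libc, raw memcpy() throughput can be measured with something like this (a sketch assuming CMSIS and the Cortex-M DWT cycle counter; memcpy_bench and CPU_HZ are names I made up):

```c
// Rough memcpy() throughput check on a Cortex-M4. Assumes the device's
// CMSIS header is included so CoreDebug/DWT are available.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CPU_HZ 100000000u  // 100MHz, per the setup above

static uint8_t src[4096], dst[4096];

void memcpy_bench(void) {
    // enable the DWT cycle counter
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

    uint32_t start = DWT->CYCCNT;
    for (int i = 0; i < 256; i++) {  // 256 x 4kB = 1MB total
        memcpy(dst, src, sizeof(src));
    }
    uint32_t cycles = DWT->CYCCNT - start;

    uint32_t mbps = (uint32_t)(((uint64_t)256 * sizeof(src) * CPU_HZ)
            / cycles / 1000000u);
    printf("raw memcpy throughput: ~%luMB/s\n", (unsigned long)mbps);
}
```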
With an optimized memcpy(), raw throughput increased to 79MB/s, but lfs_config->prog() performance stubbornly remained around ~1.6MB/s.
When I increased lfs_bd_cmp()'s dat[8] buffer to 256 bytes (thereby reducing the number of calls to lfs_bd_read() and memcmp() by a factor of 32), lfs_config->prog() performance increased to ~2.6MB/s. Still not great, but a worthwhile improvement.
Numbers
Write benchmark: repeat(open/create file, write 8kB in 256-byte chunks, close file, delete file)
Read benchmark: create 8kB file, repeat(open file, read 8kB in 256-byte chunks, close file), delete file
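In terms of the public littlefs API, the write benchmark's inner loop looks roughly like this (a sketch with timing and error handling elided; "bench.dat" is an arbitrary name):

```c
// One write-benchmark iteration: create a file, write 8kB in 256-byte
// chunks, close it, delete it. Timing and error handling elided.
#include "lfs.h"

static uint8_t chunk[256];

void write_bench(lfs_t *lfs, int iterations) {
    for (int n = 0; n < iterations; n++) {
        lfs_file_t file;
        lfs_file_open(lfs, &file, "bench.dat",
                LFS_O_WRONLY | LFS_O_CREAT);
        for (int i = 0; i < 8192/256; i++) {
            lfs_file_write(lfs, &file, chunk, sizeof(chunk));
        }
        lfs_file_close(lfs, &file);
        lfs_remove(lfs, "bench.dat");
    }
}
```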
Read | Benchmark timings (1MB) | littlefs throughput |
---|---|---|
newlib-nano (optimized for size) | 353ms | 2.8MB/s |
optimized memcpy | 100ms | 10MB/s |
Write | Benchmark timings (1MB) | littlefs throughput |
---|---|---|
newlib-nano (optimized for size) | 1041ms | 960kB/s |
optimized memcpy | 634ms | 1.58MB/s |
lfs_bd_cmp(), dat[8->256] | 417ms | 2.39MB/s |
no validation (best case performance) | 241ms | 4.15MB/s |
It is worth noting that the optimized Write benchmark (417ms) is still slower than the combination of a separate 'no-validation' Write benchmark (241ms) and a Read benchmark (100ms). This may indicate additional inefficiencies in the lfs_config->prog() path. Given that data read via lfs_config->read() already undergoes checksum validation, would it be faster, and sufficient, to validate data written via lfs_config->prog() indirectly through that checksum validation, rather than a full memcmp()?