[nrf toup][nrfconnect] Optimize OTA time by Adam-Maciuga · Pull Request #707 · nrfconnect/sdk-connectedhomeip

Adam-Maciuga · 2026-03-02T15:42:34Z

Summary

This change schedules the download of the next block before the previous block is written to flash.
This way, the OTA time drops by around 15-20%.
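
To illustrate why overlapping the next download with the flash write can shorten the transfer, here is a rough timing model. It is only a sketch: `BlockTiming`, `SequentialTotalMs` and `PipelinedTotalMs` are hypothetical names, and the per-block times are assumed to be constant, which real transfers are not.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical timing model (not measured data): every block takes
// downloadMs to arrive and writeMs to be written to flash.
struct BlockTiming
{
    uint32_t downloadMs;
    uint32_t writeMs;
};

// Old flow: the next block is requested only after the previous write finished,
// so download and write times simply add up for every block.
constexpr uint64_t SequentialTotalMs(uint32_t blocks, BlockTiming t)
{
    return static_cast<uint64_t>(blocks) * (t.downloadMs + t.writeMs);
}

// New flow: the request for block N+1 is sent before block N is written,
// so each write overlaps with the following download.
inline uint64_t PipelinedTotalMs(uint32_t blocks, BlockTiming t)
{
    if (blocks == 0)
    {
        return 0;
    }
    // In steady state the slower of the two operations sets the pace.
    const uint64_t steadyState = std::max(t.downloadMs, t.writeMs);
    // First download, then (blocks - 1) overlapped steps, then the final write.
    return t.downloadMs + (blocks - 1) * steadyState + t.writeMs;
}
```

With illustrative values of 100 ms per download and 20 ms per write over 100 blocks, the model gives 12000 ms for the sequential flow and 10020 ms for the pipelined one. These numbers are made up for the example and are not taken from the measurements behind this PR.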

Testing

Perform Matter OTA and observe the speed up.

CLAassistant · 2026-03-02T15:42:51Z

All committers have signed the CLA.

Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>

Damian-Nordic · 2026-03-03T07:45:44Z

+    if (err != 0)
+    {
+        ChipLogError(SoftwareUpdate, "OTA block write failed %d", err);
+        DeviceLayer::SystemLayer().ScheduleLambda([this] {


If WriteToFlash was already always asynchronous, this ScheduleLambda wouldn't really be needed anymore.

Damian-Nordic · 2026-03-03T07:47:46Z

+        });
+    }
+
+    if (isLastBlock)


Is this optimization or necessary for some reason? Perhaps, it would be good to always process ProcessBlock and Finalize asynchronously as in OTAImageProcessor.h these are documented as "must not block". This is something we missed originally, but maybe we could clean it as well now that we anyway want to write to flash asynchronously.

So, the main reason is that after the last block we do not need to request more data, since all the data has already been received. In the previous implementation, we always called FetchNextData, though it was unnecessary. If we can call it unconditionally for the sake of code simplicity, I can remove the separate handling. I will look into that.

As for asynchronous versus synchronous: I left the flash write synchronous, as it was. I decided not to experiment with making it all asynchronous because of this:

    case TransferSession::OutputEventType::kBlockReceived: {
        chip::ByteSpan blockData(outEvent.blockdata.Data, outEvent.blockdata.Length);
        ReturnErrorOnFailure(mImageProcessor->ProcessBlock(blockData));
        mStateDelegate->OnUpdateProgressChanged(mImageProcessor->GetPercentComplete());

        // TODO: this will cause problems if Finalize() is not guaranteed to do its work after ProcessBlock().
        if (outEvent.blockdata.IsEof)
        {
            mBdxTransfer.PrepareBlockAck();
            ReturnErrorOnFailure(mImageProcessor->Finalize());
        }
        break;
    }

This is the part of BDXDownloader that calls ProcessBlock and Finalize.

Because of the comment I decided it's safer to leave it synchronous.

Should I rewrite it to do all its work asynchronously?

I think you did the right thing by not calling FetchNextData for the last block. Otherwise, it might generate another unnecessary Block request, if I understand the code properly. Previously, this wasn't an issue because FetchNextData was done asynchronously and the caller of ProcessBlock would generate the BlockAckEof message, which would mark the transfer as complete (after which any further FetchNextData would just fail).

If you wanted to do things asynchronously you would need to also call FetchNextData and do Finalize() asynchronously to make sure things work properly.

By doing this synchronously we don't fulfill the API contract not to block, but it has been like that for years so the contract is probably never relied on in the upper layers ;). So I would be OK with leaving this as is. But then my question is why do we even need the staging buffer?

I made all the functions that are supposed to be non-blocking actually non-blocking. Tested on l15 and it worked as expected; I am going to run CI to test a wider range of platforms. This way we should be safe if a change in the upper layers actually relies on this contract.

As for the buffer, you are right. I had a misconception about it: it would not be needed in the blocking code. But now that the code is asynchronous, it is needed. During my testing the flash write always finished before the download, but just to be safe I added a semaphore to avoid data corruption. On l15 with default settings the flash writes usually finished within 20 ms but sometimes took up to 150 ms, so I chose 500 ms as the maximum time to wait for the flash write. I am not sure whether it should stay like that or some other value should be chosen.

Either my handling was wrong, or we shouldn't call the FetchNextData function for the last block.
Nevertheless, it now should work correctly.

Damian-Nordic · 2026-03-05T10:18:50Z

+    return DeviceLayer::SystemLayer().ScheduleLambda([this] {
+        PostOTAStateChangeEvent(DeviceLayer::kOtaDownloadComplete);
+        DFUSync::GetInstance().Free(mDfuSyncMutexId);
+        System::MapErrorZephyr(dfu_multi_image_done(true));


Mapping the error is pointless if you're not returning anything. Perhaps, add a log on error?

Added a log
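
A minimal sketch of the "log instead of map" idea follows. It assumes a Zephyr-style return convention (0 on success, negative errno on failure); `CheckAndLog` is a hypothetical helper, and `std::fprintf` stands in for the logging macro used in the real code.

```cpp
#include <cstdio>

// Hypothetical helper: checks a Zephyr-style return code and logs a failure,
// instead of mapping the code to an error value that nobody reads.
// Returns true on success, false on failure.
inline bool CheckAndLog(int err, const char * operation)
{
    if (err != 0)
    {
        // Stand-in for the project's error logging macro.
        std::fprintf(stderr, "%s failed: %d\n", operation, err);
        return false;
    }
    return true;
}
```

Inside the scheduled lambda, the result of the call would be passed to such a helper so that a failure is at least visible in the logs.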

Damian-Nordic · 2026-03-05T10:22:31Z

-
-    return error;
+    return DeviceLayer::SystemLayer().ScheduleLambda([this] {
+        System::MapErrorZephyr(dfu_multi_image_done(false));


Damian-Nordic · 2026-03-05T10:27:02Z

+    /* For typical operation this should never block the thread
+     * This will wait only if the flash write took longer than block download
+     */
+    if (k_sem_take(&mStagingSem, K_MSEC(500)))


I think this is unnecessary and incorrect :). The reason is that both the flash write and ProcessBlock run on the same thread, so if it were ever possible to wait here, the code would just deadlock. Normally, ProcessBlock will simply remain scheduled until WriteToFlash exits.

OK, if that is the case then you are right, the semaphore is not needed; deleted it. I thought the work scheduled with the lambdas ran on a different thread.
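
The point about the single thread can be sketched with a toy work queue. This is a simplified model, not the real event loop: `EventLoop` and `SimulateTransfer` are hypothetical names, and the model assumes that block arrival is dispatched on the same thread that runs the scheduled lambdas.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy single-threaded work queue, standing in for the thread that runs
// scheduled lambdas. Tasks run one at a time, each to completion.
class EventLoop
{
public:
    void Schedule(std::function<void()> task) { mTasks.push_back(std::move(task)); }

    void RunAll()
    {
        while (!mTasks.empty())
        {
            auto task = std::move(mTasks.front());
            mTasks.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> mTasks;
};

// Simulates a transfer of `blocks` blocks and returns the order of events.
inline std::vector<std::string> SimulateTransfer(int blocks)
{
    EventLoop loop;
    std::vector<std::string> log;
    bool bufferBusy = false; // models the staging buffer being in use

    std::function<void(int)> onBlockReceived = [&](int n) {
        // Waiting here could never help: the task that frees the buffer
        // would be queued behind this one on the same thread.
        log.push_back("block " + std::to_string(n) +
                      (bufferBusy ? ": staging buffer BUSY" : ": staging buffer free"));
        bufferBusy = true;
        loop.Schedule([&, n] {
            if (n < blocks)
            {
                // Request the next block first; its arrival is queued behind this task.
                loop.Schedule([&, n] { onBlockReceived(n + 1); });
            }
            log.push_back("write " + std::to_string(n));
            bufferBusy = false;
        });
    };

    loop.Schedule([&] { onBlockReceived(1); });
    loop.RunAll();
    return log;
}
```

In this model the next block is always handled after the previous write task has returned, so the staging buffer is never observed as busy and there is nothing for a semaphore to wait on.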

Damian-Nordic

Thanks!

Damian-Nordic · 2026-03-13T08:26:39Z

+
+    if (error != CHIP_NO_ERROR)
+    {
+        return DeviceLayer::SystemLayer().ScheduleLambda([this, error] {


nit: Perhaps, it's better to return the original error instead of CHIP_NO_ERROR in such a case, but not sure what this changes in practice :)

It makes more sense, changed it.
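
A small sketch of this suggestion: schedule the failure handling asynchronously, but still return the original error to the caller. All names here are stand-ins (`Error` for CHIP_ERROR, `kNoError` for CHIP_NO_ERROR, `Scheduler` for the system layer), and ignoring the scheduling status is a simplification made for the example.

```cpp
#include <functional>
#include <queue>
#include <utility>

using Error = int;            // stand-in for CHIP_ERROR
constexpr Error kNoError = 0; // stand-in for CHIP_NO_ERROR

// Stand-in for the system layer: queues work to run later on the same thread.
struct Scheduler
{
    std::queue<std::function<void()>> tasks;

    Error ScheduleLambda(std::function<void()> task)
    {
        tasks.push(std::move(task));
        return kNoError;
    }
};

// Schedules the failure handling, then reports the ORIGINAL error to the
// caller rather than the (usually successful) result of scheduling.
inline Error ReportFailure(Scheduler & scheduler, Error error, Error & handledError)
{
    if (error != kNoError)
    {
        // Simplification: the scheduling status is ignored in this sketch.
        (void) scheduler.ScheduleLambda([&handledError, error] { handledError = error; });
        return error;
    }
    return kNoError;
}
```

The caller sees the failure immediately, while the cleanup still runs later from the queue.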

Damian-Nordic · 2026-03-13T08:29:12Z

+    return DeviceLayer::SystemLayer().ScheduleLambda([this, blockOffset, blockSize] {
+        CHIP_ERROR err         = CHIP_NO_ERROR;
+        const bool isLastBlock = blockOffset + blockSize >= mParams.totalFileBytes;
+        if (!isLastBlock)


Don't have time to check this, but it's surprising that this is needed as TransferSession::PrepareBlockQuery() should exit early due to VerifyOrReturnError(mState == TransferState::kTransferInProgress, CHIP_ERROR_INCORRECT_STATE); after the last block has been sent. But anyway, there's probably no harm to skip it either, so it's OK :).

I did some investigation to understand why this was not needed before but is now.

Flow of the program before (last block):

1. ProcessBlock flashes the block.
2. The lambda requesting the next block is scheduled.
3. Finalize is called (synchronously).
4. The lambda is executed; there was no error handling for mDownloader->FetchNextData().

Now:

1. ProcessBlock schedules its lambda.
2. Finalize schedules its lambda.
3. The ProcessBlock lambda runs and mDownloader->FetchNextData() returns an error (expected, since the download is finished). The error is handled and mDownloader->EndDownload() is called.
4. The download process doesn't finish properly.

Therefore, the last block has to be processed a bit differently.

ok, so it worked only because FetchNextData error wasn't checked. Thanks for checking :)
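
The ordering discussed above can be reproduced with a toy model of the asynchronous flow for the last block. This is a sketch under assumptions: `LastBlockFlow` is a hypothetical function, the queue stands in for the scheduled lambdas, and the flash write itself is deliberately left out of the model.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Models only the LAST block in the asynchronous flow and returns the
// sequence of events. The fetch is assumed to fail because the transfer
// is already complete by the time the lambda runs.
inline std::vector<std::string> LastBlockFlow(bool skipFetchOnLastBlock)
{
    std::queue<std::function<void()>> scheduled; // stand-in for scheduled lambdas
    std::vector<std::string> trace;

    // ProcessBlock(): only schedules its work.
    scheduled.push([&] {
        trace.push_back("ProcessBlock lambda");
        if (!skipFetchOnLastBlock)
        {
            // The request is rejected, and the error path ends the download.
            trace.push_back("FetchNextData error");
            trace.push_back("EndDownload");
        }
    });

    // Finalize(): also only schedules its work, so it runs second.
    scheduled.push([&] { trace.push_back("Finalize lambda"); });

    while (!scheduled.empty())
    {
        auto task = std::move(scheduled.front());
        scheduled.pop();
        task();
    }
    return trace;
}
```

Calling `LastBlockFlow(false)` shows the error path running before the Finalize lambda, while `LastBlockFlow(true)` goes straight from the ProcessBlock lambda to the Finalize lambda, which is the behaviour the `isLastBlock` check is meant to give.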

Reduced OTA time by 15-20% Signed-off-by: Adam Maciuga <adam.maciuga@nordicsemi.no>

Adam-Maciuga requested a review from a team as a code owner March 2, 2026 15:42

github-actions Bot added platform nrf connect labels Mar 2, 2026

NordicBuilder mentioned this pull request Mar 2, 2026

manifest: sdk-connectedhomeip: Update revision nrfconnect/sdk-nrf#27332

Merged

Damian-Nordic reviewed Mar 3, 2026

Adam-Maciuga force-pushed the OTA_optimization branch from 513e278 to 65c4322 Compare March 5, 2026 09:43

Adam-Maciuga force-pushed the OTA_optimization branch from 65c4322 to f369f54 Compare March 5, 2026 09:59

Damian-Nordic reviewed Mar 5, 2026

Adam-Maciuga force-pushed the OTA_optimization branch from f369f54 to f4a62a7 Compare March 5, 2026 11:10

Adam-Maciuga force-pushed the OTA_optimization branch from f4a62a7 to 4fa88cd Compare March 6, 2026 11:58

Adam-Maciuga requested review from ArekBalysNordic and Damian-Nordic March 13, 2026 07:01

Damian-Nordic approved these changes Mar 13, 2026

kkasperczyk-no approved these changes Mar 13, 2026

[nrf toup][nrfconnect] Optimize OTA time

00c268f

Reduced OTA time by 15-20% Signed-off-by: Adam Maciuga <adam.maciuga@nordicsemi.no>

Adam-Maciuga force-pushed the OTA_optimization branch from 4fa88cd to 00c268f Compare March 16, 2026 12:16

kkasperczyk-no merged commit 8276cee into nrfconnect:master Mar 18, 2026
12 checks passed
