Skip to content

[nrf toup][nrfconnect] Optimize OTA time#707

Merged
kkasperczyk-no merged 1 commit into
nrfconnect:masterfrom
Adam-Maciuga:OTA_optimization
Mar 18, 2026
Merged

[nrf toup][nrfconnect] Optimize OTA time#707
kkasperczyk-no merged 1 commit into
nrfconnect:masterfrom
Adam-Maciuga:OTA_optimization

Conversation

@Adam-Maciuga
Copy link
Copy Markdown
Contributor

Summary

The following changes Schedule the next download before the previous chunk is saved to flash
This way, the OTA time drops by around 15-20%

Testing

Perform Matter OTA and observe the speed up.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 2, 2026

CLA assistant check
All committers have signed the CLA.

NordicBuilder added a commit to NordicBuilder/sdk-nrf that referenced this pull request Mar 2, 2026
Automatically created by action-manifest-pr GH action from PR:
nrfconnect/sdk-connectedhomeip#707

Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
if (err != 0)
{
ChipLogError(SoftwareUpdate, "OTA block write failed %d", err);
DeviceLayer::SystemLayer().ScheduleLambda([this] {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If WriteToFlash was already always asynchronous, this ScheduleLambda wouldn't really be needed anymore.

});
}

if (isLastBlock)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this optimization or necessary for some reason? Perhaps, it would be good to always process ProcessBlock and Finalize asynchronously as in OTAImageProcessor.h these are documented as "must not block". This is something we missed originally, but maybe we could clean it as well now that we anyway want to write to flash asynchronously.

Copy link
Copy Markdown
Contributor Author

@Adam-Maciuga Adam-Maciuga Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the main reason is that after the last block we do not need to request the next data since all the data has been taken in already. In the previous implementation, we always called the FetchNextData, thought it was unnecessary, but I suppose if we can just call it always for the sake of code simplicity, I can just remove the different handling. Will look into that.

When it comes to the asynchronous/synchronous I left the writing to flash synchronous as it was. I decided not to experiment with moving it all to asynchronous way because of this:

case TransferSession::OutputEventType::kBlockReceived: {
        chip::ByteSpan blockData(outEvent.blockdata.Data, outEvent.blockdata.Length);
        ReturnErrorOnFailure(mImageProcessor->ProcessBlock(blockData));   mStateDelegate>OnUpdateProgressChanged(mImageProcessor>GetPercentComplete());
        // TODO: this will cause problems if Finalize() is not guaranteed to do its work after ProcessBlock().
        if (outEvent.blockdata.IsEof)
        {
            mBdxTransfer.PrepareBlockAck();
            ReturnErrorOnFailure(mImageProcessor->Finalize());
        }
        break;
    }

Part of BDXDownloaded that calls the ProcessBlock and Finalize

Because of the comment I decided it's safer to leave it synchronous.

Should I rewrite it to do all its work asynchronously?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you did good by not calling FetchNextData for the last block. Otherwise, it might generate another unnecessary Block request if I understand the code properly. Previously, this wasn't an issue because FetchNextData was done asynchronously and the caller of ProcessBlock would generate BlockAckEof message, which would change the state of the transfer as complete (after which any further FetchNextData would just fail).

If you wanted to do things asynchronously you would need to also call FetchNextData and do Finalize() asynchronously to make sure things work properly.

By doing this synchronously we don't fulfill the API contract not to block, but it has been like that for years so the contract is probably never relied on in the upper layers ;). So I would be OK with leaving this as is. But then my question is why do we even need the staging buffer?

Copy link
Copy Markdown
Contributor Author

@Adam-Maciuga Adam-Maciuga Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed all the functions that were supposed to be non-blocking to non-blocking. Tested on l15, worked as expected, I am going to run CI to test a wider range of platforms. This way we should be safe if a change on the upper layers actually relies on this contract.

When it comes to the buffer, you are right, I had a misconception about it, would not be needed in the blocking code. But since now the code is async, it is needed. During my testing the flash always finished before the download, but just to be safe I added a semaphore to avoid data corruption. On l15 with default settings the flash writes usually finished within 20 ms but it sometimes took up to 150ms, so to I chose 500 ms as a maximum time to wait for the flash write. I am not really sure If it should stay like that or some other value should be chosen

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either my handling was wrong, or we shouldn't call the FetchNextData function for the last block.
Nevertheless, it now should work correctly.

NordicBuilder added a commit to NordicBuilder/sdk-nrf that referenced this pull request Mar 5, 2026
Automatically created by action-manifest-pr GH action from PR:
nrfconnect/sdk-connectedhomeip#707

Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
NordicBuilder added a commit to NordicBuilder/sdk-nrf that referenced this pull request Mar 5, 2026
Automatically created by action-manifest-pr GH action from PR:
nrfconnect/sdk-connectedhomeip#707

Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
return DeviceLayer::SystemLayer().ScheduleLambda([this] {
PostOTAStateChangeEvent(DeviceLayer::kOtaDownloadComplete);
DFUSync::GetInstance().Free(mDfuSyncMutexId);
System::MapErrorZephyr(dfu_multi_image_done(true));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping the error is pointless if you're not returning anything. Perhaps, add a log on error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a log


return error;
return DeviceLayer::SystemLayer().ScheduleLambda([this] {
System::MapErrorZephyr(dfu_multi_image_done(false));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

/* For typical operation this should never block the thread
* This will wait only if the flash write took longer than block download
*/
if (k_sem_take(&mStagingSem, K_MSEC(500)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unnecessary and incorrect :). The reason is that both the flash write and ProcessBlock run on the same thread so if it was ever possible to wait here, the code would just deadlock. Normally, ProcessBlock will just wait scheduled until WriteToFlash exits.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok if this is the case then you are right the semaphore is not needed, deleted it. I thought the work scheduled with the lambdas runs on a different thread

NordicBuilder added a commit to NordicBuilder/sdk-nrf that referenced this pull request Mar 5, 2026
Automatically created by action-manifest-pr GH action from PR:
nrfconnect/sdk-connectedhomeip#707

Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
NordicBuilder added a commit to NordicBuilder/sdk-nrf that referenced this pull request Mar 6, 2026
Automatically created by action-manifest-pr GH action from PR:
nrfconnect/sdk-connectedhomeip#707

Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
Copy link
Copy Markdown
Contributor

@Damian-Nordic Damian-Nordic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


if (error != CHIP_NO_ERROR)
{
return DeviceLayer::SystemLayer().ScheduleLambda([this, error] {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Perhaps, it's better to return the original error instead of CHIP_NO_ERROR in such a case, but not sure what this changes in practice :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense, changed it.

return DeviceLayer::SystemLayer().ScheduleLambda([this, blockOffset, blockSize] {
CHIP_ERROR err = CHIP_NO_ERROR;
const bool isLastBlock = blockOffset + blockSize >= mParams.totalFileBytes;
if (!isLastBlock)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't have time to check this, but it's surprising that this is needed as TransferSession::PrepareBlockQuery() should exit early due to VerifyOrReturnError(mState == TransferState::kTransferInProgress, CHIP_ERROR_INCORRECT_STATE); after the last block has been sent. But anyway, there's probably no harm to skip it either, so it's OK :).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some investigation to understand why this was not needed but is now:
Flow of program before (last block):

  1. Process block flashes the block
  2. Lambda requesting the next block is scheduled
  3. Finalize is called (synchronous)
  4. The lambda is executed; there was no error handling for mDownloader->FetchNextData();

Now:

  1. Process block schedules the lambda
  2. Finalize schedules its lambda
  3. Process Block lambda runs, mDownloader->FetchNextData() returns an error (expected since the download is finished), The error is handled and the mDownloader->EndDownload(); is called
  4. the download process doesn't finish properly

Therefore, the last block has to be processed a bit differently

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so it worked only because FetchNextData error wasn't checked. Thanks for checking :)

Reduced OTA time by 15-20%

Signed-off-by: Adam Maciuga <adam.maciuga@nordicsemi.no>
NordicBuilder added a commit to NordicBuilder/sdk-nrf that referenced this pull request Mar 16, 2026
Automatically created by action-manifest-pr GH action from PR:
nrfconnect/sdk-connectedhomeip#707

Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
@kkasperczyk-no kkasperczyk-no merged commit 8276cee into nrfconnect:master Mar 18, 2026
12 checks passed
ArekBalysNordic pushed a commit to nrfconnect/sdk-nrf that referenced this pull request Mar 18, 2026
Automatically created by action-manifest-pr GH action from PR:
nrfconnect/sdk-connectedhomeip#707

Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants