[nrf toup][nrfconnect] Optimize OTA time#707
Conversation
Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
| if (err != 0) | ||
| { | ||
| ChipLogError(SoftwareUpdate, "OTA block write failed %d", err); | ||
| DeviceLayer::SystemLayer().ScheduleLambda([this] { |
There was a problem hiding this comment.
If WriteToFlash was already always asynchronous, this ScheduleLambda wouldn't really be needed anymore.
| }); | ||
| } | ||
|
|
||
| if (isLastBlock) |
There was a problem hiding this comment.
Is this optimization or necessary for some reason? Perhaps, it would be good to always process ProcessBlock and Finalize asynchronously as in OTAImageProcessor.h these are documented as "must not block". This is something we missed originally, but maybe we could clean it as well now that we anyway want to write to flash asynchronously.
There was a problem hiding this comment.
So, the main reason is that after the last block we do not need to request the next data since all the data has been taken in already. In the previous implementation, we always called the FetchNextData, thought it was unnecessary, but I suppose if we can just call it always for the sake of code simplicity, I can just remove the different handling. Will look into that.
When it comes to the asynchronous/synchronous I left the writing to flash synchronous as it was. I decided not to experiment with moving it all to asynchronous way because of this:
case TransferSession::OutputEventType::kBlockReceived: {
chip::ByteSpan blockData(outEvent.blockdata.Data, outEvent.blockdata.Length);
ReturnErrorOnFailure(mImageProcessor->ProcessBlock(blockData)); mStateDelegate>OnUpdateProgressChanged(mImageProcessor>GetPercentComplete());
// TODO: this will cause problems if Finalize() is not guaranteed to do its work after ProcessBlock().
if (outEvent.blockdata.IsEof)
{
mBdxTransfer.PrepareBlockAck();
ReturnErrorOnFailure(mImageProcessor->Finalize());
}
break;
}
Part of BDXDownloaded that calls the ProcessBlock and Finalize
Because of the comment I decided it's safer to leave it synchronous.
Should I rewrite it to do all its work asynchronously?
There was a problem hiding this comment.
I think you did good by not calling FetchNextData for the last block. Otherwise, it might generate another unnecessary Block request if I understand the code properly. Previously, this wasn't an issue because FetchNextData was done asynchronously and the caller of ProcessBlock would generate BlockAckEof message, which would change the state of the transfer as complete (after which any further FetchNextData would just fail).
If you wanted to do things asynchronously you would need to also call FetchNextData and do Finalize() asynchronously to make sure things work properly.
By doing this synchronously we don't fulfill the API contract not to block, but it has been like that for years so the contract is probably never relied on in the upper layers ;). So I would be OK with leaving this as is. But then my question is why do we even need the staging buffer?
There was a problem hiding this comment.
I changed all the functions that were supposed to be non-blocking to non-blocking. Tested on l15, worked as expected, I am going to run CI to test a wider range of platforms. This way we should be safe if a change on the upper layers actually relies on this contract.
When it comes to the buffer, you are right, I had a misconception about it, would not be needed in the blocking code. But since now the code is async, it is needed. During my testing the flash always finished before the download, but just to be safe I added a semaphore to avoid data corruption. On l15 with default settings the flash writes usually finished within 20 ms but it sometimes took up to 150ms, so to I chose 500 ms as a maximum time to wait for the flash write. I am not really sure If it should stay like that or some other value should be chosen
There was a problem hiding this comment.
Either my handling was wrong, or we shouldn't call the FetchNextData function for the last block.
Nevertheless, it now should work correctly.
513e278 to
65c4322
Compare
Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
65c4322 to
f369f54
Compare
Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
| return DeviceLayer::SystemLayer().ScheduleLambda([this] { | ||
| PostOTAStateChangeEvent(DeviceLayer::kOtaDownloadComplete); | ||
| DFUSync::GetInstance().Free(mDfuSyncMutexId); | ||
| System::MapErrorZephyr(dfu_multi_image_done(true)); |
There was a problem hiding this comment.
Mapping the error is pointless if you're not returning anything. Perhaps, add a log on error?
|
|
||
| return error; | ||
| return DeviceLayer::SystemLayer().ScheduleLambda([this] { | ||
| System::MapErrorZephyr(dfu_multi_image_done(false)); |
| /* For typical operation this should never block the thread | ||
| * This will wait only if the flash write took longer than block download | ||
| */ | ||
| if (k_sem_take(&mStagingSem, K_MSEC(500))) |
There was a problem hiding this comment.
I think this is unnecessary and incorrect :). The reason is that both the flash write and ProcessBlock run on the same thread so if it was ever possible to wait here, the code would just deadlock. Normally, ProcessBlock will just wait scheduled until WriteToFlash exits.
There was a problem hiding this comment.
Ok if this is the case then you are right the semaphore is not needed, deleted it. I thought the work scheduled with the lambdas runs on a different thread
f369f54 to
f4a62a7
Compare
Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
f4a62a7 to
4fa88cd
Compare
Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
|
|
||
| if (error != CHIP_NO_ERROR) | ||
| { | ||
| return DeviceLayer::SystemLayer().ScheduleLambda([this, error] { |
There was a problem hiding this comment.
nit: Perhaps, it's better to return the original error instead of CHIP_NO_ERROR in such a case, but not sure what this changes in practice :)
There was a problem hiding this comment.
It makes more sense, changed it.
| return DeviceLayer::SystemLayer().ScheduleLambda([this, blockOffset, blockSize] { | ||
| CHIP_ERROR err = CHIP_NO_ERROR; | ||
| const bool isLastBlock = blockOffset + blockSize >= mParams.totalFileBytes; | ||
| if (!isLastBlock) |
There was a problem hiding this comment.
Don't have time to check this, but it's surprising that this is needed as TransferSession::PrepareBlockQuery() should exit early due to VerifyOrReturnError(mState == TransferState::kTransferInProgress, CHIP_ERROR_INCORRECT_STATE); after the last block has been sent. But anyway, there's probably no harm to skip it either, so it's OK :).
There was a problem hiding this comment.
Did some investigation to understand why this was not needed but is now:
Flow of program before (last block):
- Process block flashes the block
- Lambda requesting the next block is scheduled
- Finalize is called (synchronous)
- The lambda is executed; there was no error handling for
mDownloader->FetchNextData();
Now:
- Process block schedules the lambda
- Finalize schedules its lambda
- Process Block lambda runs,
mDownloader->FetchNextData()returns an error (expected since the download is finished), The error is handled and themDownloader->EndDownload();is called - the download process doesn't finish properly
Therefore, the last block has to be processed a bit differently
There was a problem hiding this comment.
ok, so it worked only because FetchNextData error wasn't checked. Thanks for checking :)
Reduced OTA time by 15-20% Signed-off-by: Adam Maciuga <adam.maciuga@nordicsemi.no>
4fa88cd to
00c268f
Compare
Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
Automatically created by action-manifest-pr GH action from PR: nrfconnect/sdk-connectedhomeip#707 Signed-off-by: Nordic Builder <pylon@nordicsemi.no>
Summary
The following changes Schedule the next download before the previous chunk is saved to flash
This way, the OTA time drops by around 15-20%
Testing
Perform Matter OTA and observe the speed up.