Skip to content

BT lockless queue#867

Merged
1technophile merged 7 commits into1technophile:developmentfrom
csiki2:BT_lockless_queue
Feb 26, 2021
Merged

BT lockless queue#867
1technophile merged 7 commits into1technophile:developmentfrom
csiki2:BT_lockless_queue

Conversation

@csiki2
Copy link
Contributor

@csiki2 csiki2 commented Feb 11, 2021

Lockless queue for ESP32 BT as discussed.
Note, in theory the dumpDevices also could crash at MQTTtoBT, didn't touched that (yet).

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 12, 2021

An important note:
OMG without this patch lost the wifi connection every 2-3 mins when the TimeBtwRead was set to 1ms (ESP32 ofc).
This behavior still exist with the current patch.
BUT
If you set the BTQueueSize to 1 (don't laugh! :) ) then the disconnection issue "mysteriously" disappears and instead the btQueueBlocked increases by 1 every 1-2 minutes.
I guess this is because we effectively block the start of a BT scan when there will be a wifi send so the chance to have a problem around the shared antenna decreases considerably.
I have found this: https://www.esp32.com/viewtopic.php?t=6707

Again, this doesn't seem to be a real issue as most people won't set TimeBtwRead to 1 and OMG instantly reconnects.
Of course the log is not nice. :)

@1technophile
Copy link
Owner

Thanks for the PR, I did not had time to get my hands into it but will do it soon!

@1technophile 1technophile added this to the v0.9.6 milestone Feb 20, 2021
@1technophile
Copy link
Owner

I'm wondering if we should not add also into the queue the message processed by haRoomPresence.

It could be the cause of the wifi disconnection below:
[E][WiFiClient.cpp:392] write(): fail on fd 55, errno: 104, "Connection reset by peer"

T: TxPower: -59
T: Ble distance 4.76
T: Pub json :{"id":"iii","manufacturerdata":"e000010dca5c960d","rssi":-73,"distance":4.756481} into custom topic: home/home_presence/OpenMQTTGateway_ESP32_IR_RF
T: Get services data number: 1
N: Received json : ["IR","RF","BT"]
T: Service data: 0000000000000000000000000000000000000000
T: Pub json :{"stat_t":"home/OpenMQTTGateway_ESP32_IR_RF/SYStoMQTT","name":"BT: Connnect every X scan(s)","uniq_id":"iiiscanbcnct","val_tpl":"{{ value_json.scanbcnct }}","device":{"name":"OpenMQTTGateway_ESP32_IR_RF","model":"[\"IR\",\"RF\",\"BT\"]","manufacturer":"OMG_community","sw_version":"version_tag","identifiers":["iii"]}} into custom topic: homeassistant/sensor/iiiscanbcnct/config        
T: Service data UUID: 0xfe9f
T: getDeviceByMacii
T: Checking BLE service data validity
T: Non valid service data, removing it
T: manufacturerdata e000010dca5c960d
T: Is it a iNode Energy Meter?
T: Remove manufacturer data
T: Removing servicedatauuid
N: Received json : ["IR","RF","BT"]
T: Pub json :{"stat_t":"home/OpenMQTTGateway_ESP32_IR_RF/LWT","name":"BT: Force scan","uniq_id":"iiiiforce_scan","pl_on":"{\"interval\":0}","pl_avail":"online","pl_not_avail":"offline","cmd_t":"home/OpenMQTTGateway_ESP32_IR_RF/commands/MQTTtoBT/config","device":{"name":"OpenMQTTGateway_ESP32_IR_RF","model":"[\"IR\",\"RF\",\"BT\"]","manufacturer":"OMG_community","sw_version":"version_tag","identifiers":["iii"]}} into custom topic: homeassistant/switch/iiiiDforce_scan/config
[E][WiFiClient.cpp:392] write(): fail on fd 55, errno: 104, "Connection reset by peer"
N: Received json : ["IR","RF","BT"]
W: client not connected can't pub
N: Received json : ["IR","RF","BT"]
W: client not connected can't pub
N: Start BLE processing
T: retrieving value of system characteristics Uptime (s):N5
: Found 3 devices, scan number 1 end deinit controller

The gateway is configured here with BT, RF and IR.

@1technophile
Copy link
Owner

@Legion2 would you be interested in reviewing this PR?

@Legion2 Legion2 self-requested a review February 21, 2021 20:50
Copy link
Contributor

@Legion2 Legion2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not make a comment on the functional aspects of this PR, but besides from that the code looks good to me.

@1technophile
Copy link
Owner

I can not make a comment on the functional aspects

No problem, I take it in charge.

but besides from that the code looks good to me.

Thanks

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 21, 2021

I'm wondering if we should not add also into the queue the message processed by haRoomPresence.

...

You are right!
I don't use subjectHomePresence (I mean at MQTT, it's defined), I didn't notice that it uses pub_custom_topic instead of the PublishDeviceData function.
This is a bit pain as the function is a called in the middle of the normal device sending process at line 529:

#    ifdef subjectHomePresence
      if (advertisedDevice->haveRSSI() && !publishOnlySensors) {
        haRoomPresence(BLEdata); // this device has an rssi and we don't want only sensors so in consequence we can use it for home assistant room presence component
      }
#    endif

and tries to use the buffer element for direct wifi sending:
pub_custom_topic((char*)topic.c_str(), HomePresence, false);
This is in itself not worse than the original queueless code (nobody can touch that JsonObject buffer), but we break the "don't send wifi message from BT core" principle here.
I am also a bit confused regarding the code iself too: it sets the distance of the object, but after that we can go to the "if (advertisedDevice->haveServiceData())" and it's else codepath with PublishDeviceData?
If both can happen, we post the nearly same message twice with different topic?

I can imagine that haRoomPresence in the BLEAdvertisedDeviceCallbacks code should just add all the necessary tags (distance etc.). After that the PublishDeviceData (-> pubBT) -> pubBTMainCore would detect that a pub_custom_topic also should be made to "home_presence".
So we would need to change PublishDeviceData (to let home presence like messages through) and pubBTMainCore (to be able to detect and send to homepresence topic) to be more clever.
I hope this makes sense.

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 21, 2021

Hmmm. Could you please try out the following changes:

void pubBTMainCore(JsonObject& data) {
  String mac_address = data["id"].as();
  mac_address.replace(":", "");
  String mactopic = subjectBTtoMQTT + String("/") + mac_address;
  pub((char*)mactopic.c_str(), data);

  if (data.containsKey("distance")) {
    data.remove("servicedatauuid");
    data.remove("servicedata");
    String topic = String(Base_Topic) + "home_presence/" + String(gateway_name);
    pub_custom_topic((char*)topic.c_str(), data, false);
  }
}

#  ifdef subjectHomePresence
void haRoomPresence(JsonObject& HomePresence) {
  int BLErssi = HomePresence["rssi"];
  Log.trace(F("BLErssi %d" CR), BLErssi);
  int txPower = HomePresence["txpower"] | 0;
  if (txPower >= 0)
    txPower = -59; //if tx power is not found we set a default calibration value
  Log.trace(F("TxPower: %d" CR), txPower);
  double ratio = BLErssi * 1.0 / txPower;
  double distance;
  if (ratio < 1.0) {
    distance = pow(ratio, 10);
  } else {
    distance = (0.89976) * pow(ratio, 7.7095) + 0.111;
  }
  HomePresence["distance"] = distance;
}
#  endif

I already poluted my code somewhat with an extra delay 5:

// should run from the main core
void emptyBTQueue() {
  for (bool first = true;;) {
    int next = atomic_load_explicit(&jsonBTBufferQueueNext, ::memory_order_seq_cst); // use namespace std -> ambiguous error...
    int last = atomic_load_explicit(&jsonBTBufferQueueLast, ::memory_order_seq_cst); // use namespace std -> ambiguous error...
    if (last == next) break;
    if (first) {
      int diff = (2 * BTQueueSize + last - next) % (2 * BTQueueSize);
      btQueueLengthSum += diff;
      btQueueLengthCount++;
      first = false;
      delay(5); // Lets not use the shared antenna yet and wait 5ms for more BT messages
    }
    pubBTMainCore(*jsonBTBufferQueue[next % BTQueueSize].object);
    atomic_store_explicit(&jsonBTBufferQueueNext, (next + 1) % (2 * BTQueueSize), ::memory_order_seq_cst); // use namespace std -> ambiguous error...
  }
}

Still got a reconnect, but after 9 mins, instead of the usual 2-5 mins.
Note, that with the above change we begin to respect the minRssi for haRoomPresence messages (through PublishDeviceData), we might not want that.

@1technophile
Copy link
Owner

1technophile commented Feb 22, 2021

Hmmm. Could you please try out the following changes:

It is working better indeed!

Note, that with the above change we begin to respect the minRssi for haRoomPresence messages (through PublishDeviceData), we might not want that.

Thanks for pointing this, I think we should add a note in the doc regarding this (minRssi impact presence detection...)

Another point, should we remove the
stopProcessing();
and
startProcessing();
L1254 and 1258 of main.ino
As the publication events are coming every time from the same core, I think this is not needed anymore. We could ask ourselves if these methods should be removed from ZgatewayBT.
But we should keep these when an OTA is initiated to prevent the BLE tasks from running and creating issues during the OTA process.

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 22, 2021

We can keep the minRssi agnostic behavior if you need; we can make the PublishDeviceData to call pubBT if "distance" present and also make the pubBTMainCore to skip the simple pub call if the rssi restriction is not met.

Regarding start and stopProcessing: yes, we shouldn't need those anymore for standard cases.

Should I add the changes to this branch?

@1technophile
Copy link
Owner

We can keep the minRssi agnostic behavior if you need; we can make the PublishDeviceData to call pubBT if "distance" present and also make the pubBTMainCore to skip the simple pub call if the rssi restriction is not met.

Yes could be nice, especially for the people that doesn't read the docs ;-)

Should I add the changes to this branch?

Yes please

@@ -676,14 +683,14 @@ void coreTask(void* pvParameters) {
Log.trace(F("BT Task running on core: %d" CR), xPortGetCoreID());
if (!ProcessLock) {
Copy link
Owner

@1technophile 1technophile Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this if not redundant with the condition added into the while loop and of the else if?

Copy link
Contributor Author

@csiki2 csiki2 Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there was a connection phase before that we may already waited seconds and the validity of ProcessLock long gone.
Also, we don't want to take too much time in anything that has delay() in it.
I even added a delay(Scan_duration < 2000 ? Scan_duration : 2000); to the stopprocessing: this gives more time to leave the BT loop gracefully. Unfortunately waiting to much will result in a timeout at platformio.
I needed these as I am more aggressive in the BT parameterwise and I am able to kill the OTA easily (ofc with the lockless queue)...

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 23, 2021

As you can see, I have updated the branch, but I have a big issue, and I don't know why is that happening.
In the development branch my LYWSDCGQ devices are working perfectly both in passive and active mode (setting in code at BLEScan directly).
For active:
[00:15:21]T: Creating BLE buffer
[00:15:21]N: Device detected: 58:2D:34:3A:78:0C
[00:15:21]T: getDeviceByMac 58:2D:34:3A:78:0C
[00:15:21]T: BLErssi -79
[00:15:21]T: TxPower: -59
[00:15:21]T: Ble distance 8.65
[00:15:21]T: Pub json :{"id":"58:2D:34:3A:78:0C","name":"MJ_HT_V1","rssi":-79,"distance":8.651812} into custom topic: home/home_presence/OpenMQTTGateway_ESP32_BLE
[00:15:21]T: Get services data number: 2
[00:15:21]T: Service data: 5020aa01910c783a342d580d1004dc00bd01
[00:15:21]T: Service data UUID: 0xfe95
[00:15:21]T: getDeviceByMac 58:2D:34:3A:78:0C
[00:15:21]T: Checking BLE service data validity
[00:15:21]T: Searching BLE device data 5020aa01910c783a342d580d1004dc00bd01 size 36
[00:15:21]T: Is it a mi flora ?
[00:15:21]T: Is it a vegtrug ?
[00:15:21]T: Is it a LYWSDCGQ?
[00:15:21]T: LYWSDCGQ data reading
[00:15:21]T: Valid data_length: 8
[00:15:21]T: value 0.00
[00:15:21]T: value 0.00
[00:15:21]T: value 0.00
[00:15:21]N: Subject: /BTtoMQTT/582D343A780C
[00:15:21]N: Received json : {"id":"58:2D:34:3A:78:0C","name":"MJ_HT_V1","rssi":-79,"distance":8.651812,"servicedata":"5020aa01910c783a342d580d1004dc00bd01","servicedatauuid":"0xfe95","model":"LYWSDCGQ","tem":22,"tempc":22,"tempf":71.6,"hum":44.5}
[00:15:21]T: jsonPublishing

(we have a bug here with the value log with %D)
In passive mode we lose the name, but that's ok:
[00:22:32]T: Creating BLE buffer
[00:22:32]N: Device detected: 58:2D:34:3A:78:0C
[00:22:32]T: getDeviceByMac 58:2D:34:3A:78:0C
[00:22:32]T: BLErssi -83
[00:22:32]T: TxPower: -59
[00:22:32]T: Ble distance 12.61
[00:22:32]T: Pub json :{"id":"58:2D:34:3A:78:0C","rssi":-83,"distance":12.61001} into custom topic: home/home_presence/OpenMQTTGateway_ESP32_BLE
[00:22:32]T: Get services data number: 1
[00:22:32]T: Service data: 5020aa01690c783a342d58061002ba01
[00:22:32]T: Service data UUID: 0xfe95
[00:22:32]T: getDeviceByMac 58:2D:34:3A:78:0C
[00:22:32]T: Checking BLE service data validity
[00:22:32]T: Searching BLE device data 5020aa01690c783a342d58061002ba01 size 32
[00:22:32]T: Is it a mi flora ?
[00:22:32]T: Is it a vegtrug ?
[00:22:32]T: Is it a LYWSDCGQ?
[00:22:32]T: LYWSDCGQ data reading
[00:22:32]T: Valid data_length: 4
[00:22:32]T: value 0.00
[00:22:32]N: Subject: /BTtoMQTT/582D343A780C
[00:22:32]N: Received json : {"id":"58:2D:34:3A:78:0C","rssi":-83,"distance":12.61001,"servicedata":"5020aa01690c783a342d58061002ba01","servicedatauuid":"0xfe95","model":"LYWSDCGQ","hum":44.2}
[00:22:32]T: jsonPublishing

But, with the new lockless codepath only the passive mode works (similarly to the development branch), the active ble tracking results with logs like:
[22:47:22]T: Creating BLE buffer
[22:47:22]N: Device detected: 58:2D:34:3A:78:2C
[22:47:22]T: getDeviceByMac 58:2D:34:3A:78:2C
[22:47:22]T: BLErssi -88
[22:47:22]T: TxPower: -59
[22:47:22]T: Ble distance 19.73
[22:47:22]T: Get services data number: 2
[22:47:22]T: Service data: 5020aa01212c783a342d580a10013a
[22:47:22]T: Service data UUID: 0xfe95
[22:47:22]T: Service data: e06f7e0baf25
[22:47:22]T: Service data UUID: 0xffff
[22:47:22]T: getDeviceByMac 58:2D:34:3A:78:2C
[22:47:22]T: Checking BLE service data validity
[22:47:22]T: Searching BLE device data e06f7e0baf25 size 12
[22:47:22]T: Is it a mi flora ?
[22:47:22]T: Is it a vegtrug ?
[22:47:22]T: Is it a LYWSDCGQ?
[22:47:22]T: Is it a JQJCY01YM?
[22:47:22]T: Is it a LYWSD02?
[22:47:22]T: Is it a MUE4094RT?
[22:47:22]T: Is it a CGP1W?
[22:47:22]T: Is it a CGG1
[22:47:22]T: Is it a CGD1?
[22:47:22]T: Is it a MHO_C401?
[22:47:22]T: Is it a LYWSD03MMC?
[22:47:22]T: Is it a custom LYWSD03MMC
[22:47:22]T: servicedatauuid 0xffff
[22:47:22]T: Is it a MiBand?
[22:47:22]T: Is it a XMTZC04HM?
[22:47:22]T: Is it a XMTZC05HM?
[22:47:22]N: Subject: /BTtoMQTT/582D343A782C
[22:47:22]N: Received json : {"id":"58:2D:34:3A:78:2C","name":"MJ_HT_V1","rssi":-88,"distance":19.7325,"servicedata":"e06f7e0baf25","servicedatauuid":"0xffff"}
[22:47:22]T: jsonPublishing
[22:47:22]T: Pub json :{"id":"58:2D:34:3A:78:2C","name":"MJ_HT_V1","rssi":-88,"distance":19.7325} into custom topic: home/home_presence/OpenMQTTGateway_ESP32_BLE

I see the data there, I need to check it what happens.
Strange as I didn't change there anything.

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 23, 2021

Actually, it's like one char less?

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 23, 2021

Got it:
In
class MyAdvertisedDeviceCallbacks : public BLEAdvertisedDeviceCallbacks {
void onResult(BLEAdvertisedDevice* advertisedDevice) {
at the block
if (advertisedDevice->haveServiceData()) {
we generated as many messages as service data we had, the new only generates one.
I have to think about how to resolve this (nicely).

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 23, 2021

I will need one more day, it's 00:51 (24h) here... :)
Sorry.

@1technophile
Copy link
Owner

No problem, have a good night

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 25, 2021

This was way harder than expected.
There is no JsonObject copy so I had to print and reparse, but then the parse destroys the string buffer, I really can't tell any good about this Json library... :/
Also, LYWSDCGQ can easily go out from the 512 buffer with servicedata on:

[02:31:07]T: Buffersize for 0 : 432 -> 501 -> 543
...
Received json : {"id":"58:2D:34:3A:78:0C","name":"MJ_HT_V1","rssi":-85,"distance":15.12852,"model":"LYWSDCGQ","tem":22.4,"tempc":22.4,"tempf":72.32,"hum":44.2,"servicedata":"5020aa01710c783a342d580d1004e000ba01","servicedatauuid":"0xfe95"}

@1technophile
Copy link
Owner

We are running an old version of arduinoJson. It is maybe time to upgrade.

Also, LYWSDCGQ can easily go out from the 512 buffer with servicedata on:

Thanks for pointing this. We may add it int the comment of config_BT.h

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 25, 2021

It seems like Arduino 6+ would need even bigger buffers than 5.13, at least from this:
https://wandbox.org/permlink/8iwHphX2C90tsQDY
For the above line it gives me 1068 bytes (unfortunately I don't know the CPU diffs, but still huge on an esp32 probably would be less, but still huge).

@1technophile
Copy link
Owner

If you are ok, I propose to keep it like that and add a comment into the config_BT.h. Do you think we are ready to merge?

@1technophile
Copy link
Owner

After some hours here is what I get:

{"uptime":43087,"version":"version_tag","freemem":176472,"rssi":-49,"SSID":"","ip":"192.168.1.2","mac":"C4","wifiprt":0,"lowpowermode":0,"btqblck":4,"btqsum":2369,"btqsnd":2366,"btqavg":1.001268,"interval":55555}

Is it coherent with your results also?

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 26, 2021

Yes, how do you have so much free mem?
For me about 30k less.

@1technophile
Copy link
Owner

It varies depending on the measure moment I think:

{"uptime":43687,"version":"version_tag","freemem":144312,"rssi":-49,"SSID":"","ip":"192.168.1.25","mac":"C4"wifiprt":0,"lowpowermode":0,"btqblck":4,"btqsum":2404,"btqsnd":2401,"btqavg":1.001249,"interval":55555}

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 26, 2021

This is new (so far hidden?):

[13:10:36]T: Pub json :{"id":"A4:C1:38:77:7C:52","name":"ATC_777C52","rssi":-70,"distance":3.472448,"model":"LYWSD03MMC_ATC","tempc":23.6,"tempf":74.48,"hum":42,"batt":77,"volt":2.894} into custom topic: home/home_presence/OpenMQTTGateway_ESP32_BLE
[13:10:57]W: MQTT connection...
[13:11:15][E][WiFiClient.cpp:258] connect(): socket error on fd 56, errno: 113, "Software caused connection abort"
[13:11:15]W: failure_number_mqtt: 1
[13:11:15]W: failed, rc=-2
[13:11:20]W: disconnection_handling, failed 1 times
[13:11:20]W: Attempt to reinit wifi: 0

And now stuck mqtt disconnected.
In mqtt:
1614341455: Client OpenMQTTGateway_ESP32_BLE has exceeded timeout, disconnecting.

[EDIT]
Sorry, nothing, I turned on the ESPWifiManualSetup and that seem to cause the issue.

@1technophile
Copy link
Owner

Sorry, nothing, I turned on the ESPWifiManualSetup and that seem to cause the issue.

Yes try to remove this line, I suspect it causes more problem than solutions:

disconnection_handling(failure_number_ntwk);

@csiki2
Copy link
Contributor Author

csiki2 commented Feb 26, 2021

Without the WifiManualSetup, it works perfectly (something to fix after this... ;) ).
I think we are safe for a bigger development branch test. :)

@1technophile
Copy link
Owner

let's go!

@1technophile 1technophile merged commit 1b7f83c into 1technophile:development Feb 26, 2021
@1technophile 1technophile linked an issue Feb 26, 2021 that may be closed by this pull request
JeroenWiersma pushed a commit to JeroenWiersma/OpenMQTTGateway that referenced this pull request Oct 18, 2025
* Lockless json queue for ESP32 BT to avoid SYStoMQTT concurrency issue

* BT haRoomPresence handle with lockless queue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Concurrency bug on esp32 at publishing ble mqtt messages

3 participants