Everyone,

TL;DR: update your local esp-idf clone from our esp-idf repository before doing the next firmware build.

While testing the Mongoose API lock I found that the Mongoose task priority would still occasionally get raised to 22, the Wifi task priority. Combined with the Mongoose task being essentially 100% busy while outside the locked poll call, this led to the Mongoose task blocking all other tasks, which caused at least one of the crash effects observed (watchdog timeout from our events task). Adding a priority fix to our netmanager eliminated quite a lot of these crashes.

I then investigated this, because I thought the priority bug originally came from the buggy POSIX mutex implementation we fixed in July 2020 (→ http://lists.openvehicles.com/pipermail/ovmsdev/2020-July/006971.html).

It turned out I was wrong: the actual culprit is a bug in the esp-idf spi_flash component. Each access to the SPI flash memory needs to run at maximum priority. The spi_flash methods did this by temporarily changing the current task priority and then reverting to the previous priority, without taking into account that the task may have had an inherited priority from an acquired mutex lock. Thus the priority inherited from e.g. the Wifi task would stick.
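To illustrate the mechanism, here is a minimal host-side sketch in C. It does not use the real FreeRTOS or esp-idf code: sim_task_t, the sim_*() helpers, flash_op_buggy(), flash_op_fixed() and run_scenario() are hypothetical names, the priorities 5, 22 and 24 are example values, and the set-priority rule is only an assumption loosely modelled on how FreeRTOS treats base vs. inherited priority. flash_op_fixed() shows one possible way to avoid the problem and is not necessarily how the upstream fix works.

```c
/* Hypothetical, simplified task model; not the real FreeRTOS TCB. */
typedef struct {
    int base_priority;     /* priority assigned to the task itself */
    int current_priority;  /* effective priority, may be raised by inheritance */
} sim_task_t;

enum { SIM_MAX_PRIORITY = 24 };  /* example value for "maximum priority" */

/* A higher priority task waits for a mutex we hold: inherit its priority. */
static void sim_inherit(sim_task_t *t, int waiter_priority)
{
    if (waiter_priority > t->current_priority)
        t->current_priority = waiter_priority;
}

/* The mutex is released: fall back to the base priority. */
static void sim_disinherit(sim_task_t *t)
{
    t->current_priority = t->base_priority;
}

/* Assumed rule: setting a priority always overwrites the base priority;
 * the effective priority follows unless a higher inherited one is active. */
static void sim_set_priority(sim_task_t *t, int prio)
{
    if (t->current_priority == t->base_priority || prio > t->current_priority)
        t->current_priority = prio;
    t->base_priority = prio;
}

/* Buggy pattern: save the effective priority and write it back afterwards.
 * If that priority was inherited, it becomes the new base priority. */
void flash_op_buggy(sim_task_t *t)
{
    int saved = t->current_priority;
    sim_set_priority(t, SIM_MAX_PRIORITY);
    /* ... flash access would happen here ... */
    sim_set_priority(t, saved);
}

/* One possible remedy: save and restore base and effective priority
 * separately, so the inheritance state survives the flash access. */
void flash_op_fixed(sim_task_t *t)
{
    sim_task_t saved = *t;
    sim_set_priority(t, SIM_MAX_PRIORITY);
    /* ... flash access would happen here ... */
    *t = saved;
}

/* Task with base priority 5 holds a mutex, inherits 22 from a waiter,
 * does a flash access, then releases the mutex. Returns the priority
 * the task ends up with. */
int run_scenario(void (*flash_op)(sim_task_t *))
{
    sim_task_t task = { 5, 5 };
    sim_inherit(&task, 22);
    flash_op(&task);
    sim_disinherit(&task);
    return task.current_priority;
}
```

In this model run_scenario(flash_op_buggy) returns 22, i.e. the task stays at the inherited priority after releasing the mutex, while run_scenario(flash_op_fixed) returns 5.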

That was especially noticeable and reproducible when opening the web UI's Config→Firmware page, as that page handler reads the OTA status, which in turn reads the current boot configuration from flash. It also affected the AutoFlash task during firmware updates, and there may be more paths, basically anything that runs an "ota" command via a network channel.

As config reads & writes also use SPI flash, they could also trigger the bug for any task holding a mutex that is also requested by a higher priority task.

This SPI flash bug has been found by other esp-idf users, and has finally been fixed, but only for esp-idf 4.3 & higher:
I have now backported the fix to our version, and haven't had a single unplanned priority change since.

Positive side effect: I see no event queue overflows & almost no effect on the overall performance during an OTA flash process now.

This is probably not the only cause of the remaining watchdog timeouts, though; crash reports will tell.

Regards,
Michael

-- 
Michael Balzer * Am Rahmen 5 * D-58313 Herdecke
Fon 02330 9104094 * Handy 0176 20698926