[Ovmsdev] 3.1.007 OTA

Mark Webb-Johnson mark at webb-johnson.net
Mon Jun 25 09:56:44 HKT 2018


Greg,

The issue is upgrading from 3.1.007. Previous versions were fine. 3.1.007 was only ever released to ‘edge’, and I caught the problem before releasing it to ‘main’.

Regards, Mark.

> On 22 Jun 2018, at 1:57 AM, Greg D. <gregd2350 at gmail.com> wrote:
> 
> My module is still on .006, and has not rebooted since late May when .006 replaced .005.  I'm trying to keep it as stock and "not messed with like a developer would" as possible.
> 
> Is .007 not linked to "main"?  Or, should I unplug it until .007 is replaced?
> 
> Greg
> 
> 
> Mark Webb-Johnson wrote:
>> I’ve raised a github issue for Espressif to look at:
>> 
>> https://github.com/espressif/esp-idf/issues/2083
>> 
>> We use 16MB ESP32 modules with 4MB OTA partitions and external SPI RAM. Currently our firmware image is approximately 2.8MB in size. The flow of our code is to make an HTTP GET request for the firmware image, read the headers (in particular the download size), call esp_ota_begin() with that size, then download chunk by chunk, calling esp_ota_write() for each chunk and esp_ota_end() when done. Our networking buffers are in INTERNAL RAM, not SPI RAM.
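>> 
>> For reference, a minimal sketch of that flow in ESP-IDF terms (http_read_chunk() is just a placeholder standing in for our HTTP client read, and the buffer size is illustrative only):
>> 
>>   #include "esp_ota_ops.h"
>>   #include "esp_partition.h"
>> 
>>   // Placeholder: stands in for our HTTP client's body read (e.g. http.BodyRead)
>>   int http_read_chunk(char *buf, size_t len);
>> 
>>   // Sketch of the OTA flow described above; error paths trimmed
>>   esp_err_t ota_flash_http_sketch(size_t download_size)
>>     {
>>     const esp_partition_t *target = esp_ota_get_next_update_partition(NULL);
>>     if (target == NULL) return ESP_FAIL;
>> 
>>     esp_ota_handle_t handle;
>>     // esp_ota_begin() erases the whole target partition up front -
>>     // this is the step that blocks other tasks for so long
>>     esp_err_t err = esp_ota_begin(target, download_size, &handle);
>>     if (err != ESP_OK) return err;
>> 
>>     char buf[1024];  // networking buffers are in INTERNAL RAM
>>     int len;
>>     while ((len = http_read_chunk(buf, sizeof(buf))) > 0)
>>       {
>>       err = esp_ota_write(handle, buf, len);
>>       if (err != ESP_OK) { esp_ota_end(handle); return err; }
>>       }
>> 
>>     return esp_ota_end(handle);
>>     }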
>> 
>> After our firmware exceeded about 2MB in size and our application started using SPI RAM more heavily, we began to see random crashes during OTA firmware updates over wifi.
>> 
>> abort() was called at PC 0x401b8e84 on core 0
>> 0x401b8e84: pm_on_beacon_rx at ??:?
>> 
>> Backtrace: 0x40091e6b:0x3ffcc4a0 0x40091fc3:0x3ffcc4c0 0x401b8e84:0x3ffcc4e0 0x401b94ef:0x3ffcc520 0x401b9bd1:0x3ffcc550 0x40089e62:0x3ffcc5a0
>> 
>> 0x40091e6b: invoke_abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
>> 0x40091fc3: abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
>> 0x401b8e84: pm_on_beacon_rx at ??:?
>> 0x401b94ef: ppRxProtoProc at ??:?
>> 0x401b9bd1: ppRxPkt at ??:?
>> 0x40089e62: ppTask at ??:?
>> We narrowed the issue down to using networking functions (reading from the TCP/IP socket) after calling esp_ota_begin() with large image sizes (over approximately 2MB). The Espressif code makes a single esp_partition_erase_range() call, which disables the SPI RAM cache and blocks any task trying to access SPI RAM. If the system networking task is blocked for too long, it seems to get confused handling wifi beacons, and then panics when it finally gets some CPU time.
>> 
>> If we change esp_ota_begin() to erase the partition in 256KB chunks in a loop (calling esp_partition_erase_range() multiple times), with a one-tick vTaskDelay between chunks, the problem goes away and OTA works again. The vTaskDelay is required; without it, the panic in pm_on_beacon_rx still happens.
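>> 
>> Roughly, the change looks like this (a sketch only, not the exact Espressif source; partition and erase_size are assumed to be the target partition and the image size rounded up to a sector boundary, as passed into esp_ota_begin(); needs freertos/FreeRTOS.h and freertos/task.h for vTaskDelay):
>> 
>>   // Erase the OTA partition in 256KB pieces, yielding one tick between
>>   // pieces so the wifi/pp task gets some CPU time
>>   #define OTA_ERASE_CHUNK (256*1024)
>> 
>>   size_t erased = 0;
>>   while (erased < erase_size)
>>     {
>>     size_t chunk = erase_size - erased;
>>     if (chunk > OTA_ERASE_CHUNK) chunk = OTA_ERASE_CHUNK;
>>     esp_err_t err = esp_partition_erase_range(partition, erased, chunk);
>>     if (err != ESP_OK) return err;
>>     erased += chunk;
>>     vTaskDelay(1);  // without this, the pm_on_beacon_rx panic still occurs
>>     }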
>> 
>> I am not sure how to address this issue. The core spi_flash_erase_range() function, which this all depends on, already loops to erase sector by sector (or block by block), and it uses spi_flash_guard_start() and spi_flash_guard_end() correctly for each sector/block. It seems that other tasks are not getting any (or enough) CPU time between the spi_flash_guard_start()/spi_flash_guard_end() calls in that erase loop.
>> 
>> Adding a delay there seems kludgy. Perhaps a freertos call to allow other blocked tasks to run? I tried adding a taskYIELD() after the spi_flash_guard_end() in spi_flash_erase_range, but that didn't solve the problem (presumably the task that called esp_ota_begin was higher priority than the networking task?). A vTaskDelay(1) in the same place works and solves the problem, but just seems horribly kludgy.
>> 
>> Overall, I'm just very uncomfortable with how invasive esp_ota_begin() is. With a 2.8MB image size, it blocks all other tasks that touch SPI RAM for about 17 seconds. That is not good. We should be able to OTA flash without starving other tasks in the system so badly.
>> 
>> There is a separate bug in pm_on_beacon_rx that is triggered by this, but fixing that seems more like treating a symptom than a solution.
>> 
>> As a temporary workaround, I’ve added a one-liner to our Espressif esp-idf clone repository, in the base function spi_flash_erase_range(), to delay 1 tick after erasing each sector. In my testing it solves our problem and doesn’t seem to have much impact on SPI flash erase performance. However, it is a horrible kludge and we need a better solution - we need to wait for Espressif to give guidance. In the meantime, please git pull our esp-idf repository so you get this fix/workaround.
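>> 
>> For reference, the one-liner sits roughly here (a simplified sketch of the sector-erase loop inside spi_flash_erase_range(); erase_one_sector() is a stand-in for the actual ROM erase call, and block erases and error handling are omitted):
>> 
>>   while (sectors_remaining > 0)
>>     {
>>     spi_flash_guard_start();   // cache disabled: tasks touching SPI RAM block here
>>     erase_one_sector(sector);  // placeholder for the actual sector erase
>>     spi_flash_guard_end();     // cache re-enabled
>>     vTaskDelay(1);             // the added one-liner: let starved tasks run
>>     sector++;
>>     sectors_remaining--;
>>     }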
>> 
>> I am concerned about how we get out of a crash loop with this. A module wakes up at 2am, checks for an update, finds 3.1.008, downloads it, crashes, reboots, and tries again, and again, and again. With the current code it looks like it will do that five times each night.
>> 
>> On my server I only have three cars (other than my own) running 3.1.007. Maybe they belong to people here? I guess I can write to those owners and ask them to downgrade to 3.1.006. I can also make a 3.1.008 and put it there but not change the ovms3.ver file, so it is not automatically downloaded. Any suggestions?
>> 
>> Regards, Mark.
>> 
>>> On 21 Jun 2018, at 9:36 AM, Mark Webb-Johnson <mark at webb-johnson.net> wrote:
>>> 
>>> 
>>> On my devices I’m seeing a problem with OTA updates when running 3.1.007:
>>> 
>>> OVMS# ota flash http
>>> Current running partition is: factory
>>> Target partition is: ota_0
>>> Download firmware from api.openvehicles.com/firmware/ota/v3.1/edge/ovms3.bin to ota_0
>>> Expected file size is 2850336
>>> Preparing flash partition...
>>> abort() was called at PC 0x401b8e84 on core 0
>>> 
>>> 0x401b8e84: pm_on_beacon_rx at ??:?
>>> 
>>> Backtrace: 0x40091e6b:0x3ffcc4a0 0x40091fc3:0x3ffcc4c0 0x401b8e84:0x3ffcc4e0 0x401b94ef:0x3ffcc520 0x401b9bd1:0x3ffcc550 0x40089e62:0x3ffcc5a0
>>> 
>>> 0x40091e6b: invoke_abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
>>> 
>>> 0x40091fc3: abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
>>> 
>>> 0x401b8e84: pm_on_beacon_rx at ??:?
>>> 
>>> 0x401b94ef: ppRxProtoProc at ??:?
>>> 
>>> 0x401b9bd1: ppRxPkt at ??:?
>>> 
>>> 0x40089e62: ppTask at ??:?
>>> 
>>> (this is a firmware rebuild so addresses may vary from stock 3.1.007, but the functions are the same)
>>> 
>>> The pm_on_beacon_rx function is in libpp (as are ppRxProtoProc, ppRxPkt, and ppTask).
>>> 
>>> If I comment out the esp_ota_* functions, it does the download and works fine. But if esp_ota_begin() is called, then the crash occurs within http.BodyRead (although I think it is within the ppTask, not in our code).
>>> 
>>> If I reduce the size parameter given to esp_ota_begin(), I can get it to work just fine. Around 1.5MB is fine, but above that it starts to get flaky. Espressif say that during flash writes the cache must be disabled, so all access to SPI RAM is halted. Sometimes I see the message 'I (20642) wifi: bcn_timout,ap_probe_send_start', which seems to confirm the wifi task is being starved of CPU time during esp_ota_begin(). I guess that is triggering a bug within the wifi stack where it doesn’t handle the starvation gracefully.
>>> 
>>> I then tried hacking around the esp_ota_begin() function. The Espressif implementation simply calls esp_partition_erase_range() once to erase the whole partition. For us, that now means blanking 2.8MB of flash, which takes about 15 seconds. I changed it to call esp_partition_erase_range() multiple times, in 256KB chunks, with a 10ms vTaskDelay between each chunk. Problem solved. OTA works again doing it that way.
>>> 
>>> I’m not sure how to get this fix applied. The issue is in Espressif’s code, related to big partition sizes. I guess they haven’t seen it because with 1MB OTA partitions the delay is only 3 seconds or so, probably not long enough to trigger the issue. I’ll raise a bug report with them, but suspect it will take a while for them to address it. As a temporary workaround for us, I think I’ll bring esp_ota_begin() into our ovms_ota code and customise it there specifically for us. We can always switch back to the normal esp_ota_begin() when/if Espressif come up with a permanent solution. I don’t really want to change that core Espressif code without direction from Espressif as to how to change it (been there, done that, and wasted a lot of time doing it).
>>>  
>>> For safety, I’ve rolled back edge to 3.1.006-15-g14b8eb6. That will stop new deployments of this while we solve the problem. For existing deployments in the field, I can only think of two solutions: (a) switch OTA to factory, reboot, and flash to fixed code, or (b) OTA from SD card, with wifi unused/off.
>>> 
>>> I should have the fix later today.
>>> 
>>> Regards, Mark.
>>> 
>> 
>> 
>> 
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
> 
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
