[Ovmsdev] std::string data corruption (issue #189)

Michael Balzer dexter at expeedo.de
Thu Jan 31 02:19:59 HKT 2019


Rolled out to "my" edgers. I'll watch crash counters during the next days and report.

Regards,
Michael


Am 30.01.19 um 12:44 schrieb Mark Webb-Johnson:
> Yes, I suspect there is something in there related to the core. I worry that (a) it is not as simple as 2MB/2MB, and (b) we run code on both cores anyway. We
> may have just fixed the test case (which runs on core #1), but not code running on core #0.
>
> Anyway, it is worth a try in the wild to see if it reduces the volume of errors, while Espressif continue to work on it.
>
> Kludgy wonderfulness is in commit 3.2.001-29-g311307cd.
>
> Regards, Mark.
>
>> On 30 Jan 2019, at 7:39 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>
>> Oh my…
>>
>> Can it be the first 2 MB are taboo for core 1 and the second 2 MB for core 0?
>>
>> Anyway, we won't need those 2 MB for now, and hopefully this will be fixed by Espressif soon.
>>
>> So I'd say yes, commit.
>>
>> Regards,
>> Michael
>>
>>
>> Am 30.01.19 um 12:26 schrieb Mark Webb-Johnson:
>>> I tried changing CONFIG_SPIRAM_SIZE in sdkconfig directly, but it just reset back to the default 4MB.
>>>
>>> So changed esp-idf itself, components/esp32/Kconfig, and added a prompt for that config option. I could then change it in menuconfig and the change persisted.
>>>
>>> I then tested:
>>>
>>>     I (0) cpu_start: App cpu up.
>>>     I (715) spiram: SPI SRAM memory test OK
>>>     I (716) heap_init: Initializing. RAM available for dynamic allocation:
>>>     I (716) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
>>>     I (723) heap_init: At 3FFBA920 len 000256E0 (149 KiB): DRAM
>>>     I (728) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>>>     I (735) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>>>     I (742) heap_init: At 400953CC len 0000AC34 (43 KiB): IRAM
>>>     I (747) cpu_start: Pro cpu start user code
>>>     I (752) spiram: Adding pool of 2048K of external SPI memory to heap allocator
>>>     I (158) cpu_start: Starting scheduler on PRO CPU.
>>>     I (0) cpu_start: Starting scheduler on APP CPU.
>>>     This is ESP32 chip with 2 CPU cores, WiFi/BT/BLE, silicon revision 1, 16MB external flash
>>>>>>     I (9566) test_task: testing mode 1 loopcnt 1000
>>>     E (9576) test_task: loop #1: corruption detected at pos=497 size=498
>>>     4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>>     2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>>     #1: size=498, data @0x3f901ca0
>>>     4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>>     2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>>     E (9806) test_task: loop #5: corruption detected at pos=497 size=498
>>>     4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>>     2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>>     E (10036) test_task: loop #9: corruption detected at pos=829 size=830
>>>     4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>>     2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI
>>>     BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019
>>>     15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>>
>>> which looks like a FAIL to me?
>>>
>>> So, then I tried something a little funky:
>>>
>>>  1. Go back to normal CONFIG_SPIRAM_SIZE 4MB
>>>
>>>  2. Add this to main/test_memory_corruption.cpp app_main():
>>>
>>>     void* big2mb = heap_caps_malloc(1024*1024*2, MALLOC_CAP_SPIRAM);
>>>     printf("Big 2MB spi memory allocation is at %p\n",big2mb);
>>>
>>>  3. Rebuild and run...
>>>
>>>
>>> Here is what I get:
>>>
>>>     I (0) cpu_start: App cpu up.
>>>     I (1151) spiram: SPI SRAM memory test OK
>>>     I (1152) heap_init: Initializing. RAM available for dynamic allocation:
>>>     I (1152) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
>>>     I (1159) heap_init: At 3FFBA920 len 000256E0 (149 KiB): DRAM
>>>     I (1164) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>>>     I (1171) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>>>     I (1179) heap_init: At 400953CC len 0000AC34 (43 KiB): IRAM
>>>     I (1184) cpu_start: Pro cpu start user code
>>>     I (1188) spiram: Adding pool of 4096K of external SPI memory to heap allocator
>>>     I (206) cpu_start: Starting scheduler on PRO CPU.
>>>     I (0) cpu_start: Starting scheduler on APP CPU.
>>>     ...
>>>     This is ESP32 chip with 2 CPU cores, WiFi/BT/BLE, silicon revision 1, 16MB external flash
>>>     Big 2MB spi memory allocation is at 0x3f800028
>>>     ...
>>>     I (291) wifi station: ESP_WIFI_MODE_APSTA
>>>     I (291) wifi: wifi driver task: 3ffc25c4, prio:23, stack:3584, core=0
>>>     ...
>>>     I (3281) wifi station: retry to connect to the AP
>>>     I (3281) wifi station: connect to the AP fail
>>>
>>>     I (6111) wifi station: retry to connect to the AP
>>>     I (6111) wifi station: connect to the AP fail
>>>
>>>     I (8461) mem_stresser: started on core 0 mem @0x3fa01b98
>>>     I (8461) mem_stresser: started on core 1 mem @0x3fa81bac
>>>     I (8951) wifi station: retry to connect to the AP
>>>     I (8951) wifi station: connect to the AP fail
>>>
>>>     I (9461) test_task: running on core 1
>>>
>>>     I (9561) test_task: testing mode 1 loopcnt 1000
>>>     #1: size=1039, data @0x3fb01ca4
>>>     4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>>     2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI
>>>     BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019
>>>     15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI
>>>     BLE BT cores=2 rev=ESP32/1
>>>
>>>     I (11781) wifi station: retry to connect to the AP
>>>     I (11781) wifi station: connect to the AP fail
>>>
>>>     I (14621) wifi station: retry to connect to the AP
>>>     I (14621) wifi station: connect to the AP fail
>>>
>>>     I (17451) wifi station: connect to the AP fail
>>>
>>>
>>> I left it running for a few minutes, but no faults shown. Modes 1 through 4 all clean.
>>>
>>> So, to take the funkiness to the next level, I added the same to ovms_main, right at the start (actually in the log level setter, as that seemed to be the
>>> first thing that initialises). Somebody else gets in for 28bytes of SPIRAM (or maybe that is a structure header), but anyway...
>>>
>>>     I (1742) spiram: SPI SRAM memory test OK
>>>     I (1743) heap_init: Initializing. RAM available for dynamic allocation:
>>>     I (1743) heap_init: At 3FFAFF10 len 000000F0 (0 KiB): DRAM
>>>     I (1750) heap_init: At 3FFCEDE8 len 00011218 (68 KiB): DRAM
>>>     I (1755) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>>>     I (1762) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>>>     I (1768) heap_init: At 4009F530 len 00000AD0 (2 KiB): IRAM
>>>     I (1775) cpu_start: Pro cpu start user code
>>>     I (1779) spiram: Adding pool of 4096K of external SPI memory to heap allocator
>>>     I (125) ovms_main: Set default logging level for * to INFO
>>>     I (126) ovms_main: Pre-allocated 2MB of SPIRAM at 0x3f800028 (workaround ESP32 bug)
>>>     ...
>>>     I (17783) webserver: Launching Web Server
>>>     I (17823) ssh: Launching SSH Server
>>>     OVMS# test string 500 2
>>>     #1: stdlen = 1027
>>>     1985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30
>>>     2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS
>>>     WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019
>>>     19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf
>>>     v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2
>>>     rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504
>>>
>>>     finished
>>>     OVMS#
>>>
>>>     OVMS# module memory
>>>     Free 8-bit 76160/230212, 32-bit 812/2720, SPIRAM 1886108/4194252
>>>     --Task--     Total DRAM D/IRAM   IRAM SPIRAM   +/- DRAM D/IRAM   IRAM SPIRAM
>>>     main*              8368      0      0   8456      +8368     +0     +0  +8456
>>>     no task*           5376      0      0      0      +5376     +0     +0     +0
>>>     esp_timer         34808      0    6442123100     +34808     +0   +644+2123100
>>>     OVMS Events       36352  41040      0  35120     +36352 +41040     +0 +35120
>>>     OVMS DukTape        188      0      0  34668       +188     +0     +0 +34668
>>>     OVMS CanRx         3124      0      0  26168      +3124     +0     +0 +26168
>>>     ipc0               7776      0      0      0      +7776     +0     +0     +0
>>>     ipc1                 12      0      0      0        +12     +0     +0     +0
>>>     Tmr Svc               0     88      0      0         +0    +88     +0     +0
>>>     tiT                 232    468      0    664       +232   +468     +0   +664
>>>     wifi                600   8896      0   3324       +600  +8896     +0  +3324
>>>     OVMS Console          0    200   1180  26332         +0   +200  +1180 +26332
>>>     mdns                  0    104      0      4         +0   +104     +0     +4
>>>     OVMS NetMan           0    216      0     84         +0   +216     +0    +84
>>>
>>>
>>> I think that wins as the most ugly kludge to date.
>>>
>>> Shall I commit?
>>>
>>> Regards, Mark.
>>>
>>>> On 30 Jan 2019, at 5:00 PM, Mark Webb-Johnson <mark at webb-johnson.net <mailto:mark at webb-johnson.net>> wrote:
>>>>
>>>>
>>>>
>>>>> On 30 Jan 2019, at 4:41 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>>>
>>>>> Mark,
>>>>>
>>>>> issue #2892 mentions using only the lower 2 MB of SPI RAM as a workaround. Where are our allocations placed by the allocator? Does it fill from the middle
>>>>> or end? If not we don't use the upper half yet.
>>>>
>>>> I saw that, and had a look. I don’t see anything in makeconfig to limit to 2MB (not 4MB).
>>>>
>>>> I do see a CONFIG_SPIRAM_SIZE=4194304 in sdkconfig, but not sure how it gets there. Perhaps we can just change there to 2MB and try?
>>>>
>>>> I did see this in the documentation:
>>>>
>>>>     During ESP-IDF startup, external RAM is mapped into the data address space starting at at address 0x3F800000 (byte-accessible). The length of this
>>>>     region is the same as the SPIRAM size (up to the limit of 4MiB).
>>>>
>>>>
>>>> So, maybe we can look at the address our allocations come from to see their offset from 0x3F800000? If it is coming from the top 2MB, then perhaps we start
>>>> with a big 2MB allocation that we never use?
>>>>  
>>>>> A possible workaround for the most apparent issues with this (string assembly) could be to use char buffers instead of std::string. I was thinking about
>>>>> doing that at least for the websocket stream.
>>>>
>>>> It seems strange that we are only seeing it with std::string. Maybe that just stresses the system more, or does a lot of reallocations?
>>>>
>>>>> But if this really is a hardware issue it can affect all objects in SPI RAM, appending to std::string then only triggers this more often.
>>>>>
>>>>> Another workaround could be to run everything on core 0.
>>>>
>>>> Urgh.
>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 30.01.19 um 03:33 schrieb Mark Webb-Johnson:
>>>>>> Michael,
>>>>>>
>>>>>> Espressif’s response (and linking to that other issue) sounds like others are seeing this.
>>>>>>
>>>>>> Lousy timing (with Chinese New Year next week), so don’t expect anything quick from Espressif. I guess we’ll just have to live with it until they can
>>>>>> find a workaround. It sounds like the issue is at the hardware level and a compiler patch will be needed.
>>>>>>
>>>>>> Regards, Mark.
>>>>>>
>>>>>>> On 30 Jan 2019, at 3:07 AM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>>>>>
>>>>>>>https://github.com/espressif/esp-idf/issues/3006
>>>>>>>
>>>>>>> Regards,
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> Am 28.01.19 um 20:46 schrieb Michael Balzer:
>>>>>>>> To clarify: the bug is most likely not restricted to the case of building a message in a buffer. It can possibly cause corruptions in any RAM section,
>>>>>>>> so can well be responsible for many/most of the unidentified crashes and stack/heap corruptions we're experiencing.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 28.01.19 um 20:43 schrieb Michael Balzer:
>>>>>>>>> For those not following the github discussion: I'm pretty sure I've nailed the bug down.
>>>>>>>>>
>>>>>>>>> I have reproduced the bug in a simple test project and intend to raise an issue with Espressif on this.
>>>>>>>>>
>>>>>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457965435
>>>>>>>>>
>>>>>>>>> If you'd like to test this on your module, configure the project to your wifi credentials, then use "make flash". As the test project is small, this
>>>>>>>>> normally will not erase your OVMS config partition, but a backup is always recommended.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 24.01.19 um 21:19 schrieb Michael Balzer:
>>>>>>>>>> Everyone please have a look at…
>>>>>>>>>>
>>>>>>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457334248
>>>>>>>>>>
>>>>>>>>>> Please try to reproduce the bug on your modules.
>>>>>>>>>>
>>>>>>>>>> I'm open for explanations.
>>>>>>>>>>
>>>>>>>>>> I thought this might be some copy-on-write bug with std::string, but the gcc 5.x libstdc++ does no longer use that implementation (wouldn't be C++11
>>>>>>>>>> compliant as well). I also tried moving all strings to temporary buffers, but modes 5 & 6 eliminated this explanation as well.
>>>>>>>>>>
>>>>>>>>>> My remaining theories:
>>>>>>>>>>
>>>>>>>>>>   * A task writing out of bounds (but only 0-bytes?)
>>>>>>>>>>   * A hardware issue only affecting some modules
>>>>>>>>>>
>>>>>>>>>> A hardware issue only affecting some percentage of ESP32 could explain this as well as the strange heap corruptions that seem to affect some modules
>>>>>>>>>> especially often.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Michael
>>>>>>>>>>
>>
>> -- 
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20190130/626e8414/attachment.htm>


More information about the OvmsDev mailing list