[Ovmsdev] std::string data corruption (issue #189)
Michael Balzer
dexter at expeedo.de
Thu Jan 31 02:19:59 HKT 2019
Rolled out to "my" edgers. I'll watch crash counters during the next days and report.
Regards,
Michael
Am 30.01.19 um 12:44 schrieb Mark Webb-Johnson:
> Yes, I suspect there is something in there related to the core. I worry that (a) it is not as simple as 2MB/2MB, and (b) we run code on both cores anyway. We
> may have just fixed the test case (which runs on core #1), but not code running on core #0.
>
> Anyway, it is worth a try in the wild to see if it reduces the volume of errors, while Espressif continue to work on it.
>
> Kludgy wonderfulness is in commit 3.2.001-29-g311307cd.
>
> Regards, Mark.
>
>> On 30 Jan 2019, at 7:39 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>
>> Oh my…
>>
>> Can it be the first 2 MB are taboo for core 1 and the second 2 MB for core 0?
>>
>> Anyway, we won't need those 2 MB for now, and hopefully this will be fixed by Espressif soon.
>>
>> So I'd say yes, commit.
>>
>> Regards,
>> Michael
>>
>>
>> Am 30.01.19 um 12:26 schrieb Mark Webb-Johnson:
>>> I tried changing CONFIG_SPIRAM_SIZE in sdkconfig directly, but it just reset back to the default 4MB.
>>>
>>> So changed esp-idf itself, components/esp32/Kconfig, and added a prompt for that config option. I could then change it in menuconfig and the change persisted.
>>>
>>> I then tested:
>>>
>>> I (0) cpu_start: App cpu up.
>>> I (715) spiram: SPI SRAM memory test OK
>>> I (716) heap_init: Initializing. RAM available for dynamic allocation:
>>> I (716) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
>>> I (723) heap_init: At 3FFBA920 len 000256E0 (149 KiB): DRAM
>>> I (728) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>>> I (735) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>>> I (742) heap_init: At 400953CC len 0000AC34 (43 KiB): IRAM
>>> I (747) cpu_start: Pro cpu start user code
>>> I (752) spiram: Adding pool of 2048K of external SPI memory to heap allocator
>>> I (158) cpu_start: Starting scheduler on PRO CPU.
>>> I (0) cpu_start: Starting scheduler on APP CPU.
>>> This is ESP32 chip with 2 CPU cores, WiFi/BT/BLE, silicon revision 1, 16MB external flash
>>> …
>>> I (9566) test_task: testing mode 1 loopcnt 1000
>>> E (9576) test_task: loop #1: corruption detected at pos=497 size=498
>>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>> 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>> rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>> #1: size=498, data @0x3f901ca0
>>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>> 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>> rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>> E (9806) test_task: loop #5: corruption detected at pos=497 size=498
>>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>> 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>> rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>> E (10036) test_task: loop #9: corruption detected at pos=829 size=830
>>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>> 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>> rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI
>>> BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019
>>> 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>>>
>>>
>>> which looks like a FAIL to me?
>>>
>>> So, then I tried something a little funky:
>>>
>>> 1. Go back to normal CONFIG_SPIRAM_SIZE 4MB
>>>
>>> 2. Add this to main/test_memory_corruption.cpp app_main():
>>>
>>> void* big2mb = heap_caps_malloc(1024*1024*2, MALLOC_CAP_SPIRAM);
>>> printf("Big 2MB spi memory allocation is at %p\n",big2mb);
>>>
>>> 3. Rebuild and run...
>>>
>>>
>>> Here is what I get:
>>>
>>> I (0) cpu_start: App cpu up.
>>> I (1151) spiram: SPI SRAM memory test OK
>>> I (1152) heap_init: Initializing. RAM available for dynamic allocation:
>>> I (1152) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
>>> I (1159) heap_init: At 3FFBA920 len 000256E0 (149 KiB): DRAM
>>> I (1164) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>>> I (1171) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>>> I (1179) heap_init: At 400953CC len 0000AC34 (43 KiB): IRAM
>>> I (1184) cpu_start: Pro cpu start user code
>>> I (1188) spiram: Adding pool of 4096K of external SPI memory to heap allocator
>>> I (206) cpu_start: Starting scheduler on PRO CPU.
>>> I (0) cpu_start: Starting scheduler on APP CPU.
>>> ...
>>> This is ESP32 chip with 2 CPU cores, WiFi/BT/BLE, silicon revision 1, 16MB external flash
>>> Big 2MB spi memory allocation is at 0x3f800028
>>> ...
>>> I (291) wifi station: ESP_WIFI_MODE_APSTA
>>> I (291) wifi: wifi driver task: 3ffc25c4, prio:23, stack:3584, core=0
>>> ...
>>> I (3281) wifi station: retry to connect to the AP
>>> I (3281) wifi station: connect to the AP fail
>>>
>>> I (6111) wifi station: retry to connect to the AP
>>> I (6111) wifi station: connect to the AP fail
>>>
>>> I (8461) mem_stresser: started on core 0 mem @0x3fa01b98
>>> I (8461) mem_stresser: started on core 1 mem @0x3fa81bac
>>> I (8951) wifi station: retry to connect to the AP
>>> I (8951) wifi station: connect to the AP fail
>>>
>>> I (9461) test_task: running on core 1
>>>
>>> I (9561) test_task: testing mode 1 loopcnt 1000
>>> #1: size=1039, data @0x3fb01ca4
>>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27
>>> 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>> rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI
>>> BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019
>>> 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2
>>> rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI
>>> BLE BT cores=2 rev=ESP32/1
>>>
>>> I (11781) wifi station: retry to connect to the AP
>>> I (11781) wifi station: connect to the AP fail
>>>
>>> I (14621) wifi station: retry to connect to the AP
>>> I (14621) wifi station: connect to the AP fail
>>>
>>> I (17451) wifi station: connect to the AP fail
>>>
>>>
>>> I left it running for a few minutes, but no faults shown. Modes 1 through 4 all clean.
>>>
>>> So, to take the funkiness to the next level, I added the same to ovms_main, right at the start (actually in the log level setter, as that seemed to be the
>>> first thing that initialises). Somebody else gets in for 28bytes of SPIRAM (or maybe that is a structure header), but anyway...
>>>
>>> I (1742) spiram: SPI SRAM memory test OK
>>> I (1743) heap_init: Initializing. RAM available for dynamic allocation:
>>> I (1743) heap_init: At 3FFAFF10 len 000000F0 (0 KiB): DRAM
>>> I (1750) heap_init: At 3FFCEDE8 len 00011218 (68 KiB): DRAM
>>> I (1755) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>>> I (1762) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>>> I (1768) heap_init: At 4009F530 len 00000AD0 (2 KiB): IRAM
>>> I (1775) cpu_start: Pro cpu start user code
>>> I (1779) spiram: Adding pool of 4096K of external SPI memory to heap allocator
>>> I (125) ovms_main: Set default logging level for * to INFO
>>> I (126) ovms_main: Pre-allocated 2MB of SPIRAM at 0x3f800028 (workaround ESP32 bug)
>>> ...
>>> I (17783) webserver: Launching Web Server
>>> I (17823) ssh: Launching SSH Server
>>> OVMS# test string 500 2
>>> #1: stdlen = 1027
>>> 1985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30
>>> 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2
>>> rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS
>>> WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019
>>> 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf
>>> v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2
>>> rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504
>>>
>>> finished
>>> OVMS#
>>>
>>> OVMS# module memory
>>> Free 8-bit 76160/230212, 32-bit 812/2720, SPIRAM 1886108/4194252
>>> --Task-- Total DRAM D/IRAM IRAM SPIRAM +/- DRAM D/IRAM IRAM SPIRAM
>>> main* 8368 0 0 8456 +8368 +0 +0 +8456
>>> no task* 5376 0 0 0 +5376 +0 +0 +0
>>> esp_timer 34808 0 6442123100 +34808 +0 +644+2123100
>>> OVMS Events 36352 41040 0 35120 +36352 +41040 +0 +35120
>>> OVMS DukTape 188 0 0 34668 +188 +0 +0 +34668
>>> OVMS CanRx 3124 0 0 26168 +3124 +0 +0 +26168
>>> ipc0 7776 0 0 0 +7776 +0 +0 +0
>>> ipc1 12 0 0 0 +12 +0 +0 +0
>>> Tmr Svc 0 88 0 0 +0 +88 +0 +0
>>> tiT 232 468 0 664 +232 +468 +0 +664
>>> wifi 600 8896 0 3324 +600 +8896 +0 +3324
>>> OVMS Console 0 200 1180 26332 +0 +200 +1180 +26332
>>> mdns 0 104 0 4 +0 +104 +0 +4
>>> OVMS NetMan 0 216 0 84 +0 +216 +0 +84
>>>
>>>
>>> I think that wins as the most ugly kludge to date.
>>>
>>> Shall I commit?
>>>
>>> Regards, Mark.
>>>
>>>> On 30 Jan 2019, at 5:00 PM, Mark Webb-Johnson <mark at webb-johnson.net <mailto:mark at webb-johnson.net>> wrote:
>>>>
>>>>
>>>>
>>>>> On 30 Jan 2019, at 4:41 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>>>
>>>>> Mark,
>>>>>
>>>>> issue #2892 mentions using only the lower 2 MB of SPI RAM as a workaround. Where are our allocations placed by the allocator? Does it fill from the middle
>>>>> or end? If not we don't use the upper half yet.
>>>>
>>>> I saw that, and had a look. I don’t see anything in makeconfig to limit to 2MB (not 4MB).
>>>>
>>>> I do see a CONFIG_SPIRAM_SIZE=4194304 in sdkconfig, but not sure how it gets there. Perhaps we can just change there to 2MB and try?
>>>>
>>>> I did see this in the documentation:
>>>>
>>>> During ESP-IDF startup, external RAM is mapped into the data address space starting at at address 0x3F800000 (byte-accessible). The length of this
>>>> region is the same as the SPIRAM size (up to the limit of 4MiB).
>>>>
>>>>
>>>> So, maybe we can look at the address our allocations come from to see their offset from 0x3F800000? If it is coming from the top 2MB, then perhaps we start
>>>> with a big 2MB allocation that we never use?
>>>>
>>>>> A possible workaround for the most apparent issues with this (string assembly) could be to use char buffers instead of std::string. I was thinking about
>>>>> doing that at least for the websocket stream.
>>>>
>>>> It seems strange that we are only seeing it with std::string. Maybe that just stresses the system more, or does a lot of reallocations?
>>>>
>>>>> But if this really is a hardware issue it can affect all objects in SPI RAM, appending to std::string then only triggers this more often.
>>>>>
>>>>> Another workaround could be to run everything on core 0.
>>>>
>>>> Urgh.
>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 30.01.19 um 03:33 schrieb Mark Webb-Johnson:
>>>>>> Michael,
>>>>>>
>>>>>> Espressif’s response (and linking to that other issue) sounds like others are seeing this.
>>>>>>
>>>>>> Lousy timing (with Chinese New Year next week), so don’t expect anything quick from Espressif. I guess we’ll just have to live with it until they can
>>>>>> find a workaround. It sounds like the issue is at the hardware level and a compiler patch will be needed.
>>>>>>
>>>>>> Regards, Mark.
>>>>>>
>>>>>>> On 30 Jan 2019, at 3:07 AM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>>>>>
>>>>>>> → https://github.com/espressif/esp-idf/issues/3006
>>>>>>>
>>>>>>> Regards,
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> Am 28.01.19 um 20:46 schrieb Michael Balzer:
>>>>>>>> To clarify: the bug is most likely not restricted to the case of building a message in a buffer. It can possibly cause corruptions in any RAM section,
>>>>>>>> so can well be responsible for many/most of the unidentified crashes and stack/heap corruptions we're experiencing.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 28.01.19 um 20:43 schrieb Michael Balzer:
>>>>>>>>> For those not following the github discussion: I'm pretty sure I've nailed the bug down.
>>>>>>>>>
>>>>>>>>> I have reproduced the bug in a simple test project and intend to raise an issue with Espressif on this.
>>>>>>>>>
>>>>>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457965435
>>>>>>>>>
>>>>>>>>> If you'd like to test this on your module, configure the project to your wifi credentials, then use "make flash". As the test project is small, this
>>>>>>>>> normally will not erase your OVMS config partition, but a backup is always recommended.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 24.01.19 um 21:19 schrieb Michael Balzer:
>>>>>>>>>> Everyone please have a look at…
>>>>>>>>>>
>>>>>>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457334248
>>>>>>>>>>
>>>>>>>>>> Please try to reproduce the bug on your modules.
>>>>>>>>>>
>>>>>>>>>> I'm open for explanations.
>>>>>>>>>>
>>>>>>>>>> I thought this might be some copy-on-write bug with std::string, but the gcc 5.x libstdc++ does no longer use that implementation (wouldn't be C++11
>>>>>>>>>> compliant as well). I also tried moving all strings to temporary buffers, but modes 5 & 6 eliminated this explanation as well.
>>>>>>>>>>
>>>>>>>>>> My remaining theories:
>>>>>>>>>>
>>>>>>>>>> * A task writing out of bounds (but only 0-bytes?)
>>>>>>>>>> * A hardware issue only affecting some modules
>>>>>>>>>>
>>>>>>>>>> A hardware issue only affecting some percentage of ESP32 could explain this as well as the strange heap corruptions that seem to affect some modules
>>>>>>>>>> especially often.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Michael
>>>>>>>>>>
>>
>> --
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20190130/626e8414/attachment.htm>
More information about the OvmsDev
mailing list