[Ovmsdev] std::string data corruption (issue #189)

Mark Webb-Johnson mark at webb-johnson.net
Wed Jan 30 19:44:59 HKT 2019


Yes, I suspect there is something in there related to the core. I worry that (a) it is not as simple as 2MB/2MB, and (b) we run code on both cores anyway. We may have just fixed the test case (which runs on core #1), but not code running on core #0.

Anyway, it is worth a try in the wild to see if it reduces the volume of errors, while Espressif continue to work on it.

Kludgy wonderfulness is in commit 3.2.001-29-g311307cd.

Regards, Mark.

> On 30 Jan 2019, at 7:39 PM, Michael Balzer <dexter at expeedo.de> wrote:
> 
> Oh my…
> 
> Can it be the first 2 MB are taboo for core 1 and the second 2 MB for core 0?
> 
> Anyway, we won't need those 2 MB for now, and hopefully this will be fixed by Espressif soon.
> 
> So I'd say yes, commit.
> 
> Regards,
> Michael
> 
> 
> Am 30.01.19 um 12:26 schrieb Mark Webb-Johnson:
>> I tried changing CONFIG_SPIRAM_SIZE in sdkconfig directly, but it just reset back to the default 4MB.
>> 
>> So changed esp-idf itself, components/esp32/Kconfig, and added a prompt for that config option. I could then change it in menuconfig and the change persisted.
>> 
>> I then tested:
>> 
>> I (0) cpu_start: App cpu up.
>> I (715) spiram: SPI SRAM memory test OK
>> I (716) heap_init: Initializing. RAM available for dynamic allocation:
>> I (716) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
>> I (723) heap_init: At 3FFBA920 len 000256E0 (149 KiB): DRAM
>> I (728) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>> I (735) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>> I (742) heap_init: At 400953CC len 0000AC34 (43 KiB): IRAM
>> I (747) cpu_start: Pro cpu start user code
>> I (752) spiram: Adding pool of 2048K of external SPI memory to heap allocator
>> I (158) cpu_start: Starting scheduler on PRO CPU.
>> I (0) cpu_start: Starting scheduler on APP CPU.
>> This is ESP32 chip with 2 CPU cores, WiFi/BT/BLE, silicon revision 1, 16MB external flash
>>>> I (9566) test_task: testing mode 1 loopcnt 1000
>> E (9576) test_task: loop #1: corruption detected at pos=497 size=498
>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>> 
>> #1: size=498, data @0x3f901ca0
>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>> 
>> E (9806) test_task: loop #5: corruption detected at pos=497 size=498
>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>> 
>> E (10036) test_task: loop #9: corruption detected at pos=829 size=830
>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23
>> 
>> which looks like a FAIL to me?
>> 
>> So, then I tried something a little funky:
>> 
>> Go back to normal CONFIG_SPIRAM_SIZE 4MB
>> 
>> Add this to main/test_memory_corruption.cpp app_main():
>> 
>> void* big2mb = heap_caps_malloc(1024*1024*2, MALLOC_CAP_SPIRAM);
>> printf("Big 2MB spi memory allocation is at %p\n",big2mb);
>> 
>> Rebuild and run...
>> 
>> Here is what I get:
>> 
>> I (0) cpu_start: App cpu up.
>> I (1151) spiram: SPI SRAM memory test OK
>> I (1152) heap_init: Initializing. RAM available for dynamic allocation:
>> I (1152) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
>> I (1159) heap_init: At 3FFBA920 len 000256E0 (149 KiB): DRAM
>> I (1164) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>> I (1171) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>> I (1179) heap_init: At 400953CC len 0000AC34 (43 KiB): IRAM
>> I (1184) cpu_start: Pro cpu start user code
>> I (1188) spiram: Adding pool of 4096K of external SPI memory to heap allocator
>> I (206) cpu_start: Starting scheduler on PRO CPU.
>> I (0) cpu_start: Starting scheduler on APP CPU.
>> ...
>> This is ESP32 chip with 2 CPU cores, WiFi/BT/BLE, silicon revision 1, 16MB external flash
>> Big 2MB spi memory allocation is at 0x3f800028
>> ...
>> I (291) wifi station: ESP_WIFI_MODE_APSTA
>> I (291) wifi: wifi driver task: 3ffc25c4, prio:23, stack:3584, core=0
>> ...
>> I (3281) wifi station: retry to connect to the AP
>> I (3281) wifi station: connect to the AP fail
>> 
>> I (6111) wifi station: retry to connect to the AP
>> I (6111) wifi station: connect to the AP fail
>> 
>> I (8461) mem_stresser: started on core 0 mem @0x3fa01b98
>> I (8461) mem_stresser: started on core 1 mem @0x3fa81bac
>> I (8951) wifi station: retry to connect to the AP
>> I (8951) wifi station: connect to the AP fail
>> 
>> I (9461) test_task: running on core 1
>> 
>> I (9561) test_task: testing mode 1 loopcnt 1000
>> #1: size=1039, data @0x3fb01ca4
>> 4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1WLAN-214677-52wifi103.2.001-22-g14d16854-dirty/factory/edge (build idf v3.2-dev-1419-g5afc1b0cb-dirty Jan 27 2019 15:08:23)4348864OVMS WIFI BLE BT cores=2 rev=ESP32/1
>> 
>> I (11781) wifi station: retry to connect to the AP
>> I (11781) wifi station: connect to the AP fail
>> 
>> I (14621) wifi station: retry to connect to the AP
>> I (14621) wifi station: connect to the AP fail
>> 
>> I (17451) wifi station: connect to the AP fail
>> 
>> I left it running for a few minutes, but no faults shown. Modes 1 through 4 all clean.
>> 
>> So, to take the funkiness to the next level, I added the same to ovms_main, right at the start (actually in the log level setter, as that seemed to be the first thing that initialises). Somebody else gets in for 28bytes of SPIRAM (or maybe that is a structure header), but anyway...
>> 
>> I (1742) spiram: SPI SRAM memory test OK
>> I (1743) heap_init: Initializing. RAM available for dynamic allocation:
>> I (1743) heap_init: At 3FFAFF10 len 000000F0 (0 KiB): DRAM
>> I (1750) heap_init: At 3FFCEDE8 len 00011218 (68 KiB): DRAM
>> I (1755) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
>> I (1762) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
>> I (1768) heap_init: At 4009F530 len 00000AD0 (2 KiB): IRAM
>> I (1775) cpu_start: Pro cpu start user code
>> I (1779) spiram: Adding pool of 4096K of external SPI memory to heap allocator
>> I (125) ovms_main: Set default logging level for * to INFO
>> I (126) ovms_main: Pre-allocated 2MB of SPIRAM at 0x3f800028 (workaround ESP32 bug)
>> ...
>> I (17783) webserver: Launching Web Server
>> I (17823) ssh: Launching SSH Server
>> OVMS# test string 500 2
>> #1: stdlen = 1027
>> 1985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504OVMS WIFI BLE BT cores=2 rev=ESP32/168HIGHWAYS-29wifi163.2.001-22-g14d16854-dirty/factory/edge (build idf v3.1-dev-2835-g151269458-dirty Jan 30 2019 19:17:20)012.61985504
>> 
>> finished
>> OVMS#
>> 
>> OVMS# module memory
>> Free 8-bit 76160/230212, 32-bit 812/2720, SPIRAM 1886108/4194252
>> --Task--     Total DRAM D/IRAM   IRAM SPIRAM   +/- DRAM D/IRAM   IRAM SPIRAM
>> main*              8368      0      0   8456      +8368     +0     +0  +8456
>> no task*           5376      0      0      0      +5376     +0     +0     +0
>> esp_timer         34808      0    6442123100     +34808     +0   +644+2123100
>> OVMS Events       36352  41040      0  35120     +36352 +41040     +0 +35120
>> OVMS DukTape        188      0      0  34668       +188     +0     +0 +34668
>> OVMS CanRx         3124      0      0  26168      +3124     +0     +0 +26168
>> ipc0               7776      0      0      0      +7776     +0     +0     +0
>> ipc1                 12      0      0      0        +12     +0     +0     +0
>> Tmr Svc               0     88      0      0         +0    +88     +0     +0
>> tiT                 232    468      0    664       +232   +468     +0   +664
>> wifi                600   8896      0   3324       +600  +8896     +0  +3324
>> OVMS Console          0    200   1180  26332         +0   +200  +1180 +26332
>> mdns                  0    104      0      4         +0   +104     +0     +4
>> OVMS NetMan           0    216      0     84         +0   +216     +0    +84
>> 
>> I think that wins as the most ugly kludge to date.
>> 
>> Shall I commit?
>> 
>> Regards, Mark.
>> 
>>> On 30 Jan 2019, at 5:00 PM, Mark Webb-Johnson <mark at webb-johnson.net <mailto:mark at webb-johnson.net>> wrote:
>>> 
>>> 
>>> 
>>>> On 30 Jan 2019, at 4:41 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>> 
>>>> Mark,
>>>> 
>>>> issue #2892 mentions using only the lower 2 MB of SPI RAM as a workaround. Where are our allocations placed by the allocator? Does it fill from the middle or end? If not we don't use the upper half yet.
>>> 
>>> I saw that, and had a look. I don’t see anything in makeconfig to limit to 2MB (not 4MB).
>>> 
>>> I do see a CONFIG_SPIRAM_SIZE=4194304 in sdkconfig, but not sure how it gets there. Perhaps we can just change there to 2MB and try?
>>> 
>>> I did see this in the documentation:
>>> 
>>> During ESP-IDF startup, external RAM is mapped into the data address space starting at at address 0x3F800000 (byte-accessible). The length of this region is the same as the SPIRAM size (up to the limit of 4MiB).
>>> 
>>> So, maybe we can look at the address our allocations come from to see their offset from 0x3F800000? If it is coming from the top 2MB, then perhaps we start with a big 2MB allocation that we never use?
>>>  
>>>> A possible workaround for the most apparent issues with this (string assembly) could be to use char buffers instead of std::string. I was thinking about doing that at least for the websocket stream.
>>> 
>>> It seems strange that we are only seeing it with std::string. Maybe that just stresses the system more, or does a lot of reallocations?
>>> 
>>>> But if this really is a hardware issue it can affect all objects in SPI RAM, appending to std::string then only triggers this more often.
>>>> 
>>>> Another workaround could be to run everything on core 0.
>>> 
>>> Urgh.
>>> 
>>>> Regards,
>>>> Michael
>>>> 
>>>> 
>>>> Am 30.01.19 um 03:33 schrieb Mark Webb-Johnson:
>>>>> Michael,
>>>>> 
>>>>> Espressif’s response (and linking to that other issue) sounds like others are seeing this.
>>>>> 
>>>>> Lousy timing (with Chinese New Year next week), so don’t expect anything quick from Espressif. I guess we’ll just have to live with it until they can find a workaround. It sounds like the issue is at the hardware level and a compiler patch will be needed.
>>>>> 
>>>>> Regards, Mark.
>>>>> 
>>>>>> On 30 Jan 2019, at 3:07 AM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>>>> 
>>>>>>https://github.com/espressif/esp-idf/issues/3006 <https://github.com/espressif/esp-idf/issues/3006>
>>>>>> 
>>>>>> Regards,
>>>>>> Michael
>>>>>> 
>>>>>> 
>>>>>> Am 28.01.19 um 20:46 schrieb Michael Balzer:
>>>>>>> To clarify: the bug is most likely not restricted to the case of building a message in a buffer. It can possibly cause corruptions in any RAM section, so can well be responsible for many/most of the unidentified crashes and stack/heap corruptions we're experiencing.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Michael
>>>>>>> 
>>>>>>> 
>>>>>>> Am 28.01.19 um 20:43 schrieb Michael Balzer:
>>>>>>>> For those not following the github discussion: I'm pretty sure I've nailed the bug down.
>>>>>>>> 
>>>>>>>> I have reproduced the bug in a simple test project and intend to raise an issue with Espressif on this.
>>>>>>>> 
>>>>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457965435 <https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457965435>
>>>>>>>> 
>>>>>>>> If you'd like to test this on your module, configure the project to your wifi credentials, then use "make flash". As the test project is small, this normally will not erase your OVMS config partition, but a backup is always recommended.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Michael
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Am 24.01.19 um 21:19 schrieb Michael Balzer:
>>>>>>>>> Everyone please have a look at…
>>>>>>>>> 
>>>>>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457334248 <https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/189#issuecomment-457334248>
>>>>>>>>> 
>>>>>>>>> Please try to reproduce the bug on your modules.
>>>>>>>>> 
>>>>>>>>> I'm open for explanations.
>>>>>>>>> 
>>>>>>>>> I thought this might be some copy-on-write bug with std::string, but the gcc 5.x libstdc++ does no longer use that implementation (wouldn't be C++11 compliant as well). I also tried moving all strings to temporary buffers, but modes 5 & 6 eliminated this explanation as well.
>>>>>>>>> 
>>>>>>>>> My remaining theories:
>>>>>>>>> A task writing out of bounds (but only 0-bytes?)
>>>>>>>>> A hardware issue only affecting some modules
>>>>>>>>> A hardware issue only affecting some percentage of ESP32 could explain this as well as the strange heap corruptions that seem to affect some modules especially often.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Michael
>>>>>>>>> 
> 
> -- 
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20190130/cde52293/attachment.htm>


More information about the OvmsDev mailing list