[Ovmsdev] Heap corruption alerts & heap tracing
Michael Balzer
dexter at expeedo.de
Thu Jan 22 02:20:24 HKT 2026
I've been running the Mongoose API level lock for three days now without
issues.
Code & PR is here:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/pull/1322
Please test & review.
I cannot tell if this makes a difference from my testing alone, as I
don't have these issues with my config & network situation.
If no issues turn up I'd like to merge this towards the weekend, so I
can see if the crash reports show a difference in the field.
Regards,
Michael
On 17.01.26 at 10:18, Michael Balzer via OvmsDev wrote:
> FTR, I was too optimistic. The LoadProhibited & IllegalInstruction
> crashes may be reduced but are still present, including the one in the
> Console that I thought was most probably caused by the unsynced
> logging console registry:
>
> 0x4014681e is in OvmsConsole::Poll(unsigned int, void*)
> (/home/balzer/esp/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_console.cpp:226).
> 221         if (event.type == ALERT_MULTI)
> 222           {
> 223           LogBuffers::iterator before = event.multi->begin(), after;
> 224           while (true)
> 225             {
> 226             buffer = *before;
> 227             len = strlen(buffer);
> 228             after = before;
> 229             ++after;
> 230             if (after == event.multi->end())
>
> On the positive side, crash frequency is still reduced.
>
> I'm investigating if we need to do more about Mongoose, as many
> crashes still seem to be related to some network reconfiguration event.
>
> The thread safety patch on the mbufs did help a lot
> (→https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/120),
> but we may actually need to move the mutex up to the API level, at
> least for every function accessing the manager or connection list or a
> connection's virtual interface.
>
> Crash example on a vtable indirection:
>
> 0x4015d82a is in mg_send
> (/home/balzer/esp/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/components/mongoose/mongoose/mongoose.c:2776).
> 2771 size_t mg_send(struct mg_connection *nc, const void *buf, int len) {
> 2772   nc->last_io_time = (time_t) mg_time();
> 2773   if (nc->flags & MG_F_UDP) {
> 2774     len = nc->iface->vtable->udp_send(nc, buf, len);
> 2775   } else {
> 2776     len = nc->iface->vtable->tcp_send(nc, buf, len);
> 2777   }
>
> A Mongoose update won't solve this. Mongoose still has no support or
> plans for thread safety; they explicitly warn about this:
>
> → https://mongoose.ws/documentation/#connections-and-event-manager
>>
>> NOTE: Since Mongoose's core is not protected against concurrent
>> accesses, make sure that all mg_* API functions are called from
>> the same thread or RTOS task
>>
>
> Our ignoring this seems to be the single most problematic concurrency
> issue remaining.
>
> My current idea to solve this is introducing a `MongooseClient` class
> that replicates the critical Mongoose API methods wrapped in a mutex
> lock, and then simply add that as a base class for any component class
> using the API.
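>
> To illustrate the idea, here's a minimal sketch (class, member and lock
> choice are placeholders, not the actual implementation; the real code
> would probably use our OVMS mutex helpers instead of std::):
>
>> // Hypothetical sketch of a mutex-wrapped Mongoose API facade.
>> #include <mutex>
>> #include "mongoose.h"
>>
>> class MongooseClient
>>   {
>>   public:
>>     // one lock shared by all components using the Mongoose API,
>>     // also to be taken by the netmanager around mg_mgr_poll()
>>     static std::recursive_mutex& ApiMutex()
>>       {
>>       static std::recursive_mutex mutex;
>>       return mutex;
>>       }
>>
>>     size_t MgSend(struct mg_connection* nc, const void* buf, int len)
>>       {
>>       std::lock_guard<std::recursive_mutex> lock(ApiMutex());
>>       return mg_send(nc, buf, len);
>>       }
>>     // … same pattern for mg_printf(), mg_connect_opt(), etc.
>>   };
>
> A component would then derive from MongooseClient and call MgSend()
> instead of mg_send() directly.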
>
> I doubt the calls actually need to be done in the same task, as
> Cesanta write. If that turns out to be necessary (why would it?), the
> mutex can be replaced by a callback implementation extending the
> netmanager's command job queue.
>
> Synchronizing the access implies each API call may need to wait for
> the netmanager's `mg_mgr_poll()` timeout, which is currently set to
> 250 ms. If that becomes an issue, we can probably lower the timeout to
> 100 or even 50 ms without hurting overall performance.
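>
> For clarity, the interaction I have in mind looks roughly like this on
> the netmanager side (sketch only; "mgr" stands for the Mongoose manager
> instance, none of this is the actual netmanager code):
>
>> // Continuing the MongooseClient sketch above: the poll loop holds the
>> // shared API lock while polling, so an MgSend() caller waits at most
>> // about one poll period before getting the lock.
>> void MongooseTask(struct mg_mgr* mgr)
>>   {
>>   while (true)
>>     {
>>     std::lock_guard<std::recursive_mutex> lock(MongooseClient::ApiMutex());
>>     mg_mgr_poll(mgr, 250);  // blocks up to 250 ms per iteration
>>     }
>>   }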
>
> Regards,
> Michael
>
>
> On 14.01.26 at 18:22, Michael Balzer via OvmsDev wrote:
>> I hope I'm not too optimistic, but from the crash records it seems
>> I've fixed about 50-60% of the crashes with the console registry
>> mutex
>> (https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/commit/069632f48ef653601b4525201bd0af09a4625208).
>>
>> This has been added in edge build 3.3.005-615-gd132bfe0, which has
>> now been running in vehicles for two days, with not a single
>> LoadProhibited or IllegalInstruction crash since, and not a single
>> crash in OVMS Console.
>>
>> Remaining crashes are all abort()s, with about 11% triggered by
>> detected heap corruption and the rest by the task watchdog. Crash
>> signatures overlap, i.e. there are watchdog triggers that also detect
>> a heap corruption, and some heap corruptions probably are just not yet
>> detected due to the light-impact corruption detection mode.
>>
>> So please keep looking for potential heap corruption sources.
>>
>> Btw, the config mutex didn't show a significant crash reduction in
>> the records, but it should still help reduce the /store corruptions
>> (time will tell).
>>
>> Regards,
>> Michael
>>
>>
>> On 29.12.25 at 12:11, Michael Balzer via OvmsDev wrote:
>>> Checking the heap immediately on events will probably not catch bugs
>>> that involve late callbacks using already freed buffers, and some
>>> events are also signaled before the actual shutdown (release of all
>>> resources) has fully taken place.
>>>
>>> To register the heap checker for delayed execution, use this scheme:
>>>
>>>> # create custom heap checking event handler:
>>>> vfs echo "module check alert" /store/events/usr.check.heap/00-checkheap
>>>>
>>>> # register delayed heap checks:
>>>> vfs echo "event raise -d5000 usr.check.heap" /store/events/server.v2.stopped/90-checkheap
>>>> vfs echo "event raise -d5000 usr.check.heap" /store/events/server.v3.stopped/90-checkheap
>>>> …
>>>
>>> Note: the 5 second delay is just a guess here; it may need to be
>>> longer, depending on the event.
>>>
>>> Maybe we should register the check to be done by default on some
>>> selected events. Although the heap check normally completes within
>>> milliseconds, I still hesitate to do that. The heap check needs to
>>> lock the system while walking through all allocated memory blocks,
>>> which may create issues for vehicles that rely on consistent timing
>>> for CAN or custom hardware communication.
>>>
>>> Regards,
>>> Michael
>>>
>>>
>>> On 29.12.25 at 11:42, Michael Balzer via OvmsDev wrote:
>>>> The heap check + one-off alert function is now exposed for
>>>> inclusion in component code, and can be called as a shell command
>>>> for use in event scripts and the like:
>>>>
>>>>> #include "ovms_module.h"
>>>>>
>>>>> /**
>>>>>  * module_check_heap_alert: check for and send one-off alert notification on heap corruption
>>>>>  *
>>>>>  * To enable the check every 5 minutes, set config "module" "debug.heap" to "yes".
>>>>>  *
>>>>>  * To add custom checks, call from your code, or register event scripts as needed.
>>>>>  * Example: perform heap integrity check when the server V2 gets stopped:
>>>>>  *   vfs echo "module check alert" /store/events/server.v2.stopped/90-checkheap
>>>>>  *
>>>>>  * @param verbosity -- optional: channel capacity (default 0)
>>>>>  * @param writer    -- optional: channel (default NULL)
>>>>>  * @return heapok   -- false = heap corrupted/full
>>>>>  */
>>>>> extern bool module_check_heap_alert(int verbosity=0, OvmsWriter* writer=NULL);
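>>>>
>>>> As a usage illustration from component code (just a sketch; the
>>>> shutdown hook shown here is hypothetical):
>>>>
>>>>> #include "esp_log.h"
>>>>> #include "ovms_module.h"
>>>>>
>>>>> void MyComponent::Shutdown()  // hypothetical place to hook a check
>>>>>   {
>>>>>   // ... release this component's resources ...
>>>>>   if (!module_check_heap_alert())
>>>>>     ESP_LOGE("mycomp", "heap corruption detected after shutdown");
>>>>>   }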
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>>
>>>> On 28.12.25 at 19:21, Michael Balzer via OvmsDev wrote:
>>>>> PS: when creating a debug build, you may also consider enabling
>>>>> the "comprehensive" corruption detection mode, as that will catch
>>>>> more cases.
>>>>>
>>>>> But be aware that this has a substantial impact on performance, so
>>>>> users will probably not tolerate it for regular daily use.
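>>>>>
>>>>> In sdkconfig terms, I believe the relevant switch is the heap
>>>>> poisoning choice in the same Heap Memory Debugging menuconfig
>>>>> section mentioned below (symbol name from memory, please verify):
>>>>>
>>>>>> CONFIG_HEAP_POISONING_COMPREHENSIVE=y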
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> On 28.12.25 at 18:26, Michael Balzer via OvmsDev wrote:
>>>>>> Everyone,
>>>>>>
>>>>>> From the crash reports
>>>>>> (https://ovms.dexters-web.de/firmware/developer/), most remaining
>>>>>> crashes seem to be caused by heap corruptions.
>>>>>>
>>>>>> Not all heap corruptions are easily detectable from the backtrace
>>>>>> analysis, and the component or action causing the corruption
>>>>>> isn't detectable at all that way. So I've added a debug option to
>>>>>> enable a regular heap integrity check every 5 minutes, with the
>>>>>> module sending an alert notification when a corruption has been
>>>>>> detected. Example:
>>>>>>
>>>>>>> Heap corruption detected; reboot advised ASAP!
>>>>>>> Please forward including task records and system log:
>>>>>>>
>>>>>>> CORRUPT HEAP: Bad tail at 0x3f8e44f0 owner 0x3ffea9bc. Expected
>>>>>>> 0xbaad5678 got 0xbaad5600
>>>>>>
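>>>>>> For reference, the periodic check is enabled with the standard
>>>>>> config command (syntax from memory, please double check):
>>>>>>
>>>>>>> OVMS# config set module debug.heap yes
>>>>>>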
>>>>>> I also added a final heap integrity check to our crash handler,
>>>>>> so the crash debug records should now show exactly which crashes
>>>>>> occurred with a corrupted heap.
>>>>>>
>>>>>> In combination with the system log and the task log, that should
>>>>>> give us some more opportunities to narrow down the cause(s).
>>>>>>
>>>>>> I've also added task ownership to the heap corruption report.
>>>>>> Note, this needs my latest additions to our esp-idf fork, so take
>>>>>> care to pull these before building.
>>>>>>
>>>>>> Be aware, task ownership of a corrupted block doesn't necessarily
>>>>>> tell you which task did the corruption. If the tail canary is
>>>>>> compromised, and no other block located before that block is
>>>>>> compromised, it *may* be that task doing the out-of-bounds write.
>>>>>> But it may also be a use-after-free by some previous owner. So
>>>>>> take task ownership with a grain of salt.
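>>>>>>
>>>>>> For a rough picture, in light poisoning mode each allocation looks
>>>>>> about like this (simplified, from my reading of the ESP-IDF heap
>>>>>> debugging docs; canary value as seen in the alert above):
>>>>>>
>>>>>>> [ heap metadata | head canary | user data ........ | tail canary 0xbaad5678 ]
>>>>>>
>>>>>> A "bad tail" therefore normally means something wrote past the end
>>>>>> of the user data -- either the current owner or someone holding a
>>>>>> stale pointer to a previous allocation at that address.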
>>>>>>
>>>>>> The corruptions are most probably caused by some unclean shutdown
>>>>>> of a component or by an undetected race condition within a
>>>>>> shutdown procedure. The heap seems to be stable on modules with
>>>>>> standard configurations and components not being started & shut
>>>>>> down on a regular basis. The heap corruptions are especially
>>>>>> present now with Smart (SQ) vehicles -- as the Smart doesn't keep
>>>>>> the 12V battery charged from the main battery, most Smart users
>>>>>> probably use the power management to shut down Wifi and/or modem
>>>>>> while parking.
>>>>>>
>>>>>> So our main focus should be on analysing what happens before the
>>>>>> corruption. Ask users reporting heap corruptions to provide their
>>>>>> system logs, and possibly also their task logs. To encourage
>>>>>> enabling these, I've added the config to the web UI
>>>>>> (Config→Notifications).
>>>>>>
>>>>>>
>>>>>> Once you can reproduce (!) the corruption, heap tracing might
>>>>>> provide some more insight as to where exactly the corruption occurs.
>>>>>>
>>>>>> Heap tracing means recording all memory allocations and frees.
>>>>>> This adds a recording layer on top of the heap functions, so it
>>>>>> comes with some cost, even when inactive. CPU overhead is low,
>>>>>> but stack overhead may be an issue, so I think we should not
>>>>>> enable heap tracing by default for now, but rather use a debug
>>>>>> build specifically in cases where we think heap tracing might help.
>>>>>>
>>>>>> To enable heap tracing on some user device, I've reworked the
>>>>>> ESP-IDF heap tracing to enable remote execution and to also
>>>>>> include the task handles performing the allocations and
>>>>>> deallocations.
>>>>>>
>>>>>> To enable heap tracing for a build:
>>>>>>
>>>>>>> * Under "make menuconfig", navigate to "Component settings" ->
>>>>>>>   "Heap Memory Debugging" and set CONFIG_HEAP_TRACING
>>>>>>>   <https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/kconfig.html#config-heap-tracing>.
>>>>>>>
>>>>>>
>>>>>> There's also an option to set the number of stack backtrace
>>>>>> frames. Tracing with two frames is mostly useless in our (C++)
>>>>>> context, as it will normally only show some inner frames of the
>>>>>> allocator. I've tried raising that to 5, and got an immediate
>>>>>> crash in the mdns component. I assume raising the depth will
>>>>>> require raising some stack sizes. If you find a good compromise, please
>>>>>> report.
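>>>>>>
>>>>>> For reference, the two sdkconfig symbols involved should be (names
>>>>>> from memory, please verify in menuconfig):
>>>>>>
>>>>>>> CONFIG_HEAP_TRACING=y
>>>>>>> # backtrace frames recorded per allocation (the depth discussed above):
>>>>>>> CONFIG_HEAP_TRACING_STACK_DEPTH=2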
>>>>>>
>>>>>> Heap tracing will work best in a reduced configuration. In normal
>>>>>> operation, my module fills a 500-record buffer within seconds.
>>>>>> The buffer is a FIFO, so it will always contain the latest n
>>>>>> allocations, and the last entry in the dump is the newest one.
>>>>>>
>>>>>> Reduced example dump:
>>>>>>
>>>>>>> OVMS# mod trace dump
>>>>>>> Heap tracing started/resumed at ccount 0x1c70b453 = logtime 19415
>>>>>>> 1000 allocations trace (1000 entry buffer)
>>>>>>> 258 bytes (@ 0x3fffb3a8) allocated CPU 1 task 0x3ffc92e8 ccount 0x89074e74 caller 0x4031d4f0:0x4031f553
>>>>>>> freed task 0x3ffc92e8 by 0x4031d514:0x4031f595
>>>>>>> 201 bytes (@ 0x3f8e92f4) allocated CPU 1 task 0x3fff2290 ccount 0x89218bbc caller 0x40131b12:0x4014633c
>>>>>>> freed task 0x3fff2290 by 0x402c4634:0x402cb14d
>>>>>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x8a25d47c caller 0x4029b455:0x4017a5dc
>>>>>>> freed task 0x3ffee918 by 0x4029b744:0x4017a5dc
>>>>>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x8db89c40 caller 0x4029b455:0x4017a5dc
>>>>>>> freed task 0x3ffee918 by 0x4029b744:0x4017a5dc
>>>>>>> 12 bytes (@ 0x3f85d0d4) allocated CPU 1 task 0x3ffdfcd0 ccount 0x8fccdfd8 caller 0x40131b12:0x4014633c
>>>>>>> freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04
>>>>>>> […]
>>>>>>> 12 bytes (@ 0x3f85d0d4) allocated CPU 1 task 0x3ffdfcd0 ccount 0x3cf429f4 caller 0x40131b12:0x4014633c
>>>>>>> freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04
>>>>>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x3fe1a0bc caller 0x4029b455:0x4017a5dc
>>>>>>> 229 bytes alive in trace (3/1000 allocations)
>>>>>>> total allocations 1207 total frees 3356
>>>>>>> (NB: Buffer has overflowed; so trace data is incomplete.)
>>>>>>
>>>>>> "ccount" is the ESP32 CCOUNT register (CPU cycles), so provides
>>>>>> an orientation on when the allocation occurred.
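>>>>>>
>>>>>> To turn that into time: seconds = (ccount - ccount at trace start) /
>>>>>> CPU clock. Assuming a 240 MHz clock, the 32 bit counter wraps every
>>>>>> 2^32 / 240e6 ≈ 17.9 s, so use it only for relative ordering over
>>>>>> short spans.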
>>>>>>
>>>>>>
>>>>>> Some more basics on heap debugging are also covered here:
>>>>>> https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/system/heap_debug.html
>>>>>>
>>>>>> Regards,
>>>>>> Michael
>>>>>>
--
Michael Balzer * Am Rahmen 5 * D-58313 Herdecke
Fon 02330 9104094 * Handy 0176 20698926