[Ovmsdev] Heap corruption alerts & heap tracing
Michael Balzer
dexter at expeedo.de
Thu Jan 22 02:20:24 HKT 2026
I've been running the Mongoose API level lock for three days now without
issues.
Code & PR is here:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/pull/1322
Please test & review.
I cannot tell if this makes a difference from my testing alone, as I
don't have these issues with my config & network situation.
If no issues turn up I'd like to merge this towards the weekend, so I
can see if the crash reports show a difference in the field.
Regards,
Michael
On 17.01.26 at 10:18, Michael Balzer via OvmsDev wrote:
> FTR, I was too optimistic. The LoadProhibited & IllegalInstruction
> crashes may be reduced but are still present, including the one in the
> Console that I thought was most probably caused by the unsynced
> logging console registry:
>
> 0x4014681e is in OvmsConsole::Poll(unsigned int, void*)
> (/home/balzer/esp/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_console.cpp:226).
> 221         if (event.type == ALERT_MULTI)
> 222           {
> 223           LogBuffers::iterator before = event.multi->begin(), after;
> 224           while (true)
> 225             {
> 226             buffer = *before;
> 227             len = strlen(buffer);
> 228             after = before;
> 229             ++after;
> 230             if (after == event.multi->end())
>
> On the positive side, crash frequency is still reduced.
>
> I'm investigating if we need to do more about Mongoose, as many
> crashes still seem to be related to some network reconfiguration event.
>
> The thread safety patch on the mbufs did help a lot
> (→https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/120),
> but we may actually need to move the mutex up to the API level, at
> least for every function accessing the manager or connection list or a
> connection's virtual interface.
>
> Crash example on a vtable indirection:
>
> 0x4015d82a is in mg_send
> (/home/balzer/esp/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/components/mongoose/mongoose/mongoose.c:2776).
> 2771 size_t mg_send(struct mg_connection *nc, const void *buf, int len) {
> 2772   nc->last_io_time = (time_t) mg_time();
> 2773   if (nc->flags & MG_F_UDP) {
> 2774     len = nc->iface->vtable->udp_send(nc, buf, len);
> 2775   } else {
> 2776     len = nc->iface->vtable->tcp_send(nc, buf, len);
> 2777   }
>
> A Mongoose update won't solve this. Mongoose still has no support or
> plans for thread safety; they explicitly warn about this:
>
> → https://mongoose.ws/documentation/#connections-and-event-manager
>>
>> NOTE: Since Mongoose's core is not protected against concurrent
>> accesses, make sure that all mg_* API functions are called from
>> the same thread or RTOS task
>>
>
> Our ignoring this seems to be the single most problematic concurrency
> issue remaining.
>
> My current idea to solve this is introducing a `MongooseClient` class
> that replicates the critical Mongoose API methods wrapped in a mutex
> lock, and then simply add that as a base class for any component class
> using the API.
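>
> To illustrate the idea, here's a minimal sketch (class, member and lock
> choice are placeholders, not the actual implementation; the real code
> would probably use our OVMS mutex helpers instead of std::):
>
>> // Hypothetical sketch of a mutex-wrapped Mongoose API facade.
>> #include <mutex>
>> #include "mongoose.h"
>>
>> class MongooseClient
>>   {
>>   public:
>>     // one lock shared by all components using the Mongoose API,
>>     // also to be taken by the netmanager around mg_mgr_poll()
>>     static std::recursive_mutex& ApiMutex()
>>       {
>>       static std::recursive_mutex mutex;
>>       return mutex;
>>       }
>>
>>     size_t MgSend(struct mg_connection* nc, const void* buf, int len)
>>       {
>>       std::lock_guard<std::recursive_mutex> lock(ApiMutex());
>>       return mg_send(nc, buf, len);
>>       }
>>     // … same pattern for mg_printf(), mg_connect_opt(), etc.
>>   };
>
> A component would then derive from MongooseClient and call MgSend()
> instead of mg_send() directly.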
>
> I doubt the calls actually need to be done in the same task, as
> Cesanta write. If that turns out to be necessary (why would it?), the
> mutex can be replaced by a callback implementation extending the
> netmanager's command job queue.
>
> Synchronizing the access implies each API call may need to wait for
> the netmanager's `mg_mgr_poll()` timeout, which is currently set to
> 250 ms. If that becomes an issue, we can probably lower the timeout to
> 100 or even 50 ms without hurting overall performance.
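>
> For clarity, the interaction I have in mind looks roughly like this on
> the netmanager side (sketch only; "mgr" stands for the Mongoose manager
> instance, none of this is the actual netmanager code):
>
>> // Continuing the MongooseClient sketch above: the poll loop holds the
>> // shared API lock while polling, so an MgSend() caller waits at most
>> // about one poll period before getting the lock.
>> void MongooseTask(struct mg_mgr* mgr)
>>   {
>>   while (true)
>>     {
>>     std::lock_guard<std::recursive_mutex> lock(MongooseClient::ApiMutex());
>>     mg_mgr_poll(mgr, 250);  // blocks up to 250 ms per iteration
>>     }
>>   }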
>
> Regards,
> Michael
>
>
> On 14.01.26 at 18:22, Michael Balzer via OvmsDev wrote:
>> I hope I'm not too optimistic, but from the crash records it seems
>> I've fixed about 50-60% of the crashes with the console registry
>> mutex
>> (https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/commit/069632f48ef653601b4525201bd0af09a4625208).
>>
>> This has been added in edge build 3.3.005-615-gd132bfe0, which has
>> now been running in vehicles for two days, with not a single
>> LoadProhibited or IllegalInstruction crash since, and not a single
>> crash in OVMS Console.
>>
>> Remaining crashes are all abort()s, with about 11% triggered by
>> detected heap corruption and the rest by the task watchdog. Crash
>> signatures overlap, i.e. there are watchdog triggers that also detect
>> a heap corruption, and some heap corruptions probably are just not yet
>> detected due to the light-impact corruption detection mode.
>>
>> So please keep looking for potential heap corruption sources.
>>
>> Btw, the config mutex didn't show a significant crash reduction in
>> the records, but it should still help reduce the /store corruptions
>> (time will tell).
>>
>> Regards,
>> Michael
>>
>>
>> On 29.12.25 at 12:11, Michael Balzer via OvmsDev wrote:
>>> Checking the heap immediately on events will probably not catch bugs
>>> that involve late callbacks using already freed buffers, and some
>>> events are also signaled before the actual shutdown (release of all
>>> resources) has fully taken place.
>>>
>>> To register the heap checker for delayed execution, use this scheme:
>>>
>>>> # create custom heap checking event handler:
>>>> vfs echo "module check alert" /store/events/usr.check.heap/00-checkheap
>>>>
>>>> # register delayed heap checks:
>>>> vfs echo "event raise -d5000 usr.check.heap" /store/events/server.v2.stopped/90-checkheap
>>>> vfs echo "event raise -d5000 usr.check.heap" /store/events/server.v3.stopped/90-checkheap
>>>> …
>>>
>>> Note: the 5 second delay is just a guess here; it may need to be
>>> longer, depending on the event.
>>>
>>> Maybe we should register the check to be done by default on some
>>> selected events. Although the heap check normally completes within
>>> milliseconds, I still hesitate to do that. The heap check needs to
>>> lock the system while walking through all allocated memory blocks,
>>> which may create issues for vehicles that rely on consistent timing
>>> for CAN or custom hardware communication.
>>>
>>> Regards,
>>> Michael
>>>
>>>
>>> On 29.12.25 at 11:42, Michael Balzer via OvmsDev wrote:
>>>> The heap check + one-off alert function is now exposed for
>>>> inclusion in component code, and can be called as a shell command
>>>> for use in event scripts and the like:
>>>>
>>>>> #include "ovms_module.h"
>>>>>
>>>>> /**
>>>>>  * module_check_heap_alert: check for and send one-off alert notification on heap corruption
>>>>>  *
>>>>>  * To enable the check every 5 minutes, set config "module" "debug.heap" to "yes".
>>>>>  *
>>>>>  * To add custom checks, call from your code, or register event scripts as needed.
>>>>>  * Example: perform heap integrity check when the server V2 gets stopped:
>>>>>  *   vfs echo "module check alert" /store/events/server.v2.stopped/90-checkheap
>>>>>  *
>>>>>  * @param verbosity -- optional: channel capacity (default 0)
>>>>>  * @param writer    -- optional: channel (default NULL)
>>>>>  * @return heapok   -- false = heap corrupted/full
>>>>>  */
>>>>> extern bool module_check_heap_alert(int verbosity=0, OvmsWriter* writer=NULL);
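>>>>
>>>> As a usage illustration from component code (just a sketch; the
>>>> shutdown hook shown here is hypothetical):
>>>>
>>>>> #include "esp_log.h"
>>>>> #include "ovms_module.h"
>>>>>
>>>>> void MyComponent::Shutdown()  // hypothetical place to hook a check
>>>>>   {
>>>>>   // ... release this component's resources ...
>>>>>   if (!module_check_heap_alert())
>>>>>     ESP_LOGE("mycomp", "heap corruption detected after shutdown");
>>>>>   }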
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>>
>>>> On 28.12.25 at 19:21, Michael Balzer via OvmsDev wrote:
>>>>> PS: when creating a debug build, you may also consider enabling
>>>>> the "comprehensive" corruption detection mode, as that will catch
>>>>> more cases.
>>>>>
>>>>> But be aware that this has a substantial impact on performance, so
>>>>> users will probably not tolerate it for regular daily use.
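>>>>>
>>>>> In sdkconfig terms, I believe the relevant switch is the heap
>>>>> poisoning choice in the same Heap Memory Debugging menuconfig
>>>>> section mentioned below (symbol name from memory, please verify):
>>>>>
>>>>>> CONFIG_HEAP_POISONING_COMPREHENSIVE=y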
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> On 28.12.25 at 18:26, Michael Balzer via OvmsDev wrote:
>>>>>> Everyone,
>>>>>>
>>>>>> From the crash reports
>>>>>> (https://ovms.dexters-web.de/firmware/developer/), most remaining
>>>>>> crashes seem to be caused by heap corruptions.
>>>>>>
>>>>>> Not all heap corruptions are easily detectable from the backtrace
>>>>>> analysis, and the component or action causing the corruption
>>>>>> isn't detectable at all that way. So I've added a debug option to
>>>>>> enable a regular heap integrity check every 5 minutes, with the
>>>>>> module sending an alert notification when a corruption has been
>>>>>> detected. Example:
>>>>>>
>>>>>>> Heap corruption detected; reboot advised ASAP!
>>>>>>> Please forward including task records and system log:
>>>>>>>
>>>>>>> CORRUPT HEAP: Bad tail at 0x3f8e44f0 owner 0x3ffea9bc. Expected
>>>>>>> 0xbaad5678 got 0xbaad5600
>>>>>>
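>>>>>> For reference, the periodic check is enabled with the standard
>>>>>> config command (syntax from memory, please double check):
>>>>>>
>>>>>>> OVMS# config set module debug.heap yes
>>>>>>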
>>>>>> I also added a final heap integrity check to our crash handler,
>>>>>> so the crash debug records should now show exactly which crashes
>>>>>> occurred with a corrupted heap.
>>>>>>
>>>>>> In combination with the system log and the task log, that should
>>>>>> give us some more opportunities to narrow down the cause(s).
>>>>>>
>>>>>> I've also added task ownership to the heap corruption report.
>>>>>> Note, this needs my latest additions to our esp-idf fork, so take
>>>>>> care to pull these before building.
>>>>>>
>>>>>> Be aware, task ownership of a corrupted block doesn't necessarily
>>>>>> tell you which task did the corruption. If the tail canary is
>>>>>> compromised, and no other block located before that block is
>>>>>> compromised, it *may* be that task doing the out-of-bounds write.
>>>>>> But it may also be a use-after-free by some previous owner. So
>>>>>> take task ownership with a grain of salt.
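>>>>>>
>>>>>> For a rough picture, in light poisoning mode each allocation looks
>>>>>> about like this (simplified, from my reading of the ESP-IDF heap
>>>>>> debugging docs; canary value as seen in the alert above):
>>>>>>
>>>>>>> [ heap metadata | head canary | user data ........ | tail canary 0xbaad5678 ]
>>>>>>
>>>>>> A "bad tail" therefore normally means something wrote past the end
>>>>>> of the user data -- either the current owner or someone holding a
>>>>>> stale pointer to a previous allocation at that address.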
>>>>>>
>>>>>> The corruptions are most probably caused by some unclean shutdown
>>>>>> of a component or by an undetected race condition within a
>>>>>> shutdown procedure. The heap seems to be stable on modules with
>>>>>> standard configurations and components not being started & shut
>>>>>> down on a regular basis. The heap corruptions are especially
>>>>>> present now with Smart (SQ) vehicles -- as the Smart doesn't keep
>>>>>> the 12V battery charged from the main battery, most Smart users
>>>>>> probably use the power management to shut down Wifi and/or modem
>>>>>> while parking.
>>>>>>
>>>>>> So our main focus should be on analysing what happens before the
>>>>>> corruption. Ask users reporting heap corruptions to provide their
>>>>>> system logs, and possibly also their task logs. To encourage
>>>>>> enabling these, I've added the config to the web UI
>>>>>> (Config→Notifications).
>>>>>>
>>>>>>
>>>>>> Once you can reproduce (!) the corruption, heap tracing might
>>>>>> provide some more insight as to where exactly the corruption occurs.
>>>>>>
>>>>>> Heap tracing means recording all memory allocations and frees.
>>>>>> This adds a recording layer on top of the heap functions, so it
>>>>>> comes with some cost, even when inactive. CPU overhead is low,
>>>>>> but stack overhead may be an issue, so I think we should not
>>>>>> enable heap tracing by default for now, but rather use a debug
>>>>>> build specifically in cases where we think heap tracing might help.
>>>>>>
>>>>>> To enable heap tracing on some user device, I've reworked the
>>>>>> ESP-IDF heap tracing to enable remote execution and to also
>>>>>> include the task handles performing the allocations and
>>>>>> deallocations.
>>>>>>
>>>>>> To enable heap tracing for a build:
>>>>>>
>>>>>>> * Under "make menuconfig", navigate to "Component settings" ->
>>>>>>>   "Heap Memory Debugging" and set CONFIG_HEAP_TRACING
>>>>>>>   <https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/kconfig.html#config-heap-tracing>.
>>>>>>>
>>>>>>
>>>>>> There's also an option to set the number of stack backtrace
>>>>>> frames. Tracing with two frames is mostly useless in our (C++)
>>>>>> context, as it will normally only show some inner frames of the
>>>>>> allocator. I've tried raising that to 5, and got an immediate
>>>>>> crash in the mdns component. I assume raising the depth will
>>>>>> require raising some stack sizes. If you find a good compromise, please
>>>>>> report.
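>>>>>>
>>>>>> For reference, the two sdkconfig symbols involved should be (names
>>>>>> from memory, please verify in menuconfig):
>>>>>>
>>>>>>> CONFIG_HEAP_TRACING=y
>>>>>>> # backtrace frames recorded per allocation (the depth discussed above):
>>>>>>> CONFIG_HEAP_TRACING_STACK_DEPTH=2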
>>>>>>
>>>>>> Heap tracing will work best in a reduced configuration. In normal
>>>>>> operation, my module fills a 500-record buffer within seconds.
>>>>>> The buffer is a FIFO, so it will always contain the latest n
>>>>>> allocations, and the last entry in the dump is the newest one.
>>>>>>
>>>>>> Reduced example dump:
>>>>>>
>>>>>>> OVMS# mod trace dump
>>>>>>> Heap tracing started/resumed at ccount 0x1c70b453 = logtime 19415
>>>>>>> 1000 allocations trace (1000 entry buffer)
>>>>>>> 258 bytes (@ 0x3fffb3a8) allocated CPU 1 task 0x3ffc92e8 ccount 0x89074e74 caller 0x4031d4f0:0x4031f553
>>>>>>> freed task 0x3ffc92e8 by 0x4031d514:0x4031f595
>>>>>>> 201 bytes (@ 0x3f8e92f4) allocated CPU 1 task 0x3fff2290 ccount 0x89218bbc caller 0x40131b12:0x4014633c
>>>>>>> freed task 0x3fff2290 by 0x402c4634:0x402cb14d
>>>>>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x8a25d47c caller 0x4029b455:0x4017a5dc
>>>>>>> freed task 0x3ffee918 by 0x4029b744:0x4017a5dc
>>>>>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x8db89c40 caller 0x4029b455:0x4017a5dc
>>>>>>> freed task 0x3ffee918 by 0x4029b744:0x4017a5dc
>>>>>>> 12 bytes (@ 0x3f85d0d4) allocated CPU 1 task 0x3ffdfcd0 ccount 0x8fccdfd8 caller 0x40131b12:0x4014633c
>>>>>>> freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04
>>>>>>> […]
>>>>>>> 12 bytes (@ 0x3f85d0d4) allocated CPU 1 task 0x3ffdfcd0 ccount 0x3cf429f4 caller 0x40131b12:0x4014633c
>>>>>>> freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04
>>>>>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x3fe1a0bc caller 0x4029b455:0x4017a5dc
>>>>>>> 229 bytes alive in trace (3/1000 allocations)
>>>>>>> total allocations 1207 total frees 3356
>>>>>>> (NB: Buffer has overflowed; so trace data is incomplete.)
>>>>>>
>>>>>> "ccount" is the ESP32 CCOUNT register (CPU cycles), so provides
>>>>>> an orientation on when the allocation occurred.
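>>>>>>
>>>>>> To turn that into time: seconds = (ccount - ccount at trace start) /
>>>>>> CPU clock. Assuming a 240 MHz clock, the 32 bit counter wraps every
>>>>>> 2^32 / 240e6 ≈ 17.9 s, so use it only for relative ordering over
>>>>>> short spans.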
>>>>>>
>>>>>>
>>>>>> Some more basics on heap debugging are also covered here:
>>>>>> https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/system/heap_debug.html
>>>>>>
>>>>>> Regards,
>>>>>> Michael
>>>>>>
--
Michael Balzer * Am Rahmen 5 * D-58313 Herdecke
Fon 02330 9104094 * Handy 0176 20698926