[Ovmsdev] Heap corruption alerts & heap tracing

Michael Balzer dexter at expeedo.de
Mon Dec 29 18:42:54 HKT 2025


The heap check + one-off alert function is now exposed for inclusion in 
component code and can be called as a shell command to enable inclusion 
in event scripts et al:

> #include "ovms_module.h"
>
> /**
>  * module_check_heap_alert: check for and send one-off alert notification on heap corruption
>  *
>  *    To enable the check every 5 minutes, set config "module" "debug.heap" to "yes".
>  *
>  *    To add custom checks, call from your code, or register event scripts as needed.
>  *    Example: perform heap integrity check when the server V2 gets stopped:
>  *      vfs echo "module check alert" /store/events/server.v2.stopped/90-checkheap
>  *
>  * @param verbosity     -- optional: channel capacity (default 0)
>  * @param OvmsWriter    -- optional: channel (default NULL)
>  * @return heapok       -- false = heap corrupted/full
>  */
> extern bool module_check_heap_alert(int verbosity=0, OvmsWriter* writer=NULL);
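
A minimal sketch of calling the check from component code, for example at the end of a shutdown routine. Only module_check_heap_alert() and its default arguments are taken from the declaration above; my_component_stop() is a hypothetical name, and the stand-in part only replaces "ovms_module.h" so the sketch compiles off-target:

```cpp
#include <cassert>
#include <cstdio>

// --- Stand-ins for off-target builds only (assumption, not OVMS code).
// --- On the module, drop this part and #include "ovms_module.h" instead.
class OvmsWriter;                         // opaque here
static bool fake_heap_ok = true;          // hypothetical switch to simulate a corruption
bool module_check_heap_alert(int verbosity = 0, OvmsWriter* writer = nullptr)
  {
  (void)verbosity; (void)writer;
  return fake_heap_ok;
  }
// --- End of stand-ins.

// Hypothetical component shutdown routine: check heap integrity right after
// releasing resources, so a corruption is noticed close to its cause.
bool my_component_stop()
  {
  // ... stop tasks, free buffers ...
  bool heapok = module_check_heap_alert();   // defaults: verbosity 0, no writer
  if (!heapok)
    std::fprintf(stderr, "my_component: heap check failed after shutdown\n");
  return heapok;
  }
```

On the module, the real function takes care of sending the one-off alert notification itself, so the caller only needs the return value if it wants to react locally.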

Regards,
Michael


On 28.12.25 at 19:21, Michael Balzer via OvmsDev wrote:
> PS: when creating a debug build, you may also consider enabling the 
> "comprehensive" corruption detection mode, as that will catch more cases.
>
> But be aware that this mode has a substantial impact on performance, so 
> users will probably not tolerate it for regular daily use.
>
> Regards,
> Michael
>
>
> On 28.12.25 at 18:26, Michael Balzer via OvmsDev wrote:
>> Everyone,
>>
>> judging from the crash reports 
>> (https://ovms.dexters-web.de/firmware/developer/), most remaining 
>> crashes seem to be caused by heap corruptions.
>>
>> Not all heap corruptions are easily detectable from the backtrace 
>> analysis, and the component or action causing the corruption isn't 
>> detectable at all that way. So I've added a debug option to enable a 
>> regular heap integrity check every 5 minutes, with the module sending 
>> an alert notification when a corruption has been detected. Example:
>>
>>> Heap corruption detected; reboot advised ASAP!
>>> Please forward including task records and system log:
>>>
>>> CORRUPT HEAP: Bad tail at 0x3f8e44f0 owner 0x3ffea9bc. Expected 0xbaad5678 got 0xbaad5600
>>
>> I also added a final heap integrity check to our crash handler, so 
>> the crash debug records should now show exactly which crashes 
>> occurred with a corrupted heap.
>>
>> In combination with the system log and the task log, that should give 
>> us some more opportunities to narrow down the cause(s).
>>
>> I've also added task ownership to the heap corruption report. Note, 
>> this needs my latest additions to our esp-idf fork, so take care to 
>> pull these before building.
>>
>> Be aware, task ownership of a corrupted block doesn't necessarily 
>> identify the task doing the corruption. If the tail canary is 
>> compromised, and no other block located before that block is 
>> compromised, it *may* be that task doing the out-of-bounds write. But 
>> it may also be a use-after-free by some previous owner. So take task 
>> ownership with a grain of salt.
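
To illustrate how a tail canary gets compromised, here is a minimal host-side sketch; it is not the esp-idf implementation. It simulates a block followed by the tail canary 0xbaad5678 in a plain byte array (stored little-endian, as on the ESP32) and shows that a write of one zero byte past the end of the block, such as a forgotten string terminator, produces exactly the pattern of the example report ("Expected 0xbaad5678 got 0xbaad5600"). All names and the block size are hypothetical, and whether the reported corruption really was a stray terminator is unknown:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static const uint32_t TAIL_CANARY = 0xbaad5678;   // expected value from the report
static const size_t   BLOCK_SIZE  = 8;            // hypothetical user block size

// Simulated heap block: user data followed by a 4 byte tail canary.
struct SimBlock
  {
  uint8_t mem[BLOCK_SIZE + 4];
  };

// Clear the block and store the canary little-endian behind the user data.
void sim_init(SimBlock& b)
  {
  std::memset(b.mem, 0, sizeof(b.mem));
  for (int i = 0; i < 4; i++)
    b.mem[BLOCK_SIZE + i] = (uint8_t)(TAIL_CANARY >> (8 * i));
  }

// Read the canary back, as an integrity check would do.
uint32_t sim_read_canary(const SimBlock& b)
  {
  uint32_t v = 0;
  for (int i = 0; i < 4; i++)
    v |= (uint32_t)b.mem[BLOCK_SIZE + i] << (8 * i);
  return v;
  }

// Buggy copy: ignores that the terminating NUL needs a byte as well.
// Only safe in this simulation for texts of up to BLOCK_SIZE + 3 characters.
void sim_buggy_copy(SimBlock& b, const char* text)
  {
  std::memcpy(b.mem, text, std::strlen(text) + 1);
  }
```

Copying the 8 character string "12345678" into the 8 byte block clears the lowest canary byte, so sim_read_canary() then returns 0xbaad5600. The check only tells that the canary was overwritten, not by whom, which is why the owner in the report is just a hint.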
>>
>> The corruptions are most probably caused by some unclean shutdown of 
>> a component or by undetected race conditions within a shutdown 
>> procedure. The heap seems to be stable on modules with standard 
>> configurations and components not being started & shut down on a 
>> regular basis. The heap corruptions are especially present now with 
>> Smart (SQ) vehicles -- as the Smart doesn't keep the 12V battery 
>> charged from the main battery, most Smart users probably use the 
>> power management to shut down Wifi and/or modem while parking.
>>
>> So our main focus should be on analysing what happens before the 
>> corruption. Ask users reporting heap corruptions to provide their 
>> system logs, and possibly also their task logs. To encourage enabling 
>> these, I've added the config to the web UI (Config→Notifications).
>>
>>
>> Once you can reproduce (!) the corruption, heap tracing might provide 
>> some more insight as to where exactly the corruption occurs.
>>
>> Heap tracing means recording all memory allocations and frees. This 
>> adds a recording layer on top of the heap functions, so comes with 
>> some cost, even when inactive. CPU overhead is low, but stack 
>> overhead may be an issue, so I think we should not enable heap 
>> tracing by default for now, but rather use a debug build specifically 
>> in cases where we think heap tracing might help.
>>
>> To enable heap tracing on some user device, I've reworked the ESP-IDF 
>> heap tracing to enable remote execution and to also include the task 
>> handles performing the allocations and deallocations.
>>
>> To enable heap tracing for a build:
>>
>>>   * Under "make menuconfig", navigate to "Component settings" ->
>>>     "Heap Memory Debugging" and set CONFIG_HEAP_TRACING
>>>     <https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/kconfig.html#config-heap-tracing>.
>>>
>>
>> There's also an option to set the number of stack backtrace frames. 
>> Tracing with two frames is mostly useless in our (C++) context, as it 
>> will normally only show some inner frames of the allocator. I've 
>> tried raising that to 5, and got an immediate crash in the mdns 
>> component. I assume raising the depth will need raising some stack 
>> sizes. If you find a good compromise, please report.
>>
>> Heap tracing will work best in a reduced configuration. In normal 
>> operation, my module fills a 500-record buffer within seconds. The 
>> buffer is a FIFO, so will always contain the latest n allocations, 
>> and the last entry in the dump is the newest one.
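
The FIFO behaviour described here can be sketched as a fixed-capacity ring buffer. This is a host-side illustration only, not the reworked esp-idf code; TraceRecord and TraceFifo are hypothetical names, and the record is reduced to two fields:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, reduced trace record (real records carry more fields).
struct TraceRecord
  {
  uint32_t address;
  size_t   size;
  };

// Fixed-capacity FIFO: once full, each new record displaces the oldest one.
// The capacity must be greater than zero.
class TraceFifo
  {
  public:
    explicit TraceFifo(size_t capacity)
      : m_buf(capacity), m_next(0), m_count(0), m_total(0) {}

    void record(const TraceRecord& rec)
      {
      m_buf[m_next] = rec;
      m_next = (m_next + 1) % m_buf.size();
      if (m_count < m_buf.size()) m_count++;
      m_total++;
      }

    // Records in dump order: oldest first, newest last.
    std::vector<TraceRecord> dump() const
      {
      std::vector<TraceRecord> out;
      size_t start = (m_next + m_buf.size() - m_count) % m_buf.size();
      for (size_t i = 0; i < m_count; i++)
        out.push_back(m_buf[(start + i) % m_buf.size()]);
      return out;
      }

    // True if records have been displaced, i.e. the trace is incomplete.
    bool overflowed() const { return m_total > m_buf.size(); }

  private:
    std::vector<TraceRecord> m_buf;
    size_t m_next;    // slot for the next record
    size_t m_count;   // valid records in the buffer
    size_t m_total;   // records written since creation
  };
```

With a capacity of 3 and five recorded allocations, dump() returns allocations 3, 4 and 5 in that order, and overflowed() reports that older data has been lost.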
>>
>> Reduced example dump:
>>
>>> OVMS# mod trace dump
>>> Heap tracing started/resumed at ccount 0x1c70b453 = logtime 19415
>>> 1000 allocations trace (1000 entry buffer)
>>> 258 bytes (@ 0x3fffb3a8) allocated CPU 1 task 0x3ffc92e8 ccount 0x89074e74 caller 0x4031d4f0:0x4031f553
>>> freed task 0x3ffc92e8 by 0x4031d514:0x4031f595
>>> 201 bytes (@ 0x3f8e92f4) allocated CPU 1 task 0x3fff2290 ccount 0x89218bbc caller 0x40131b12:0x4014633c
>>> freed task 0x3fff2290 by 0x402c4634:0x402cb14d
>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x8a25d47c caller 0x4029b455:0x4017a5dc
>>> freed task 0x3ffee918 by 0x4029b744:0x4017a5dc
>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x8db89c40 caller 0x4029b455:0x4017a5dc
>>> freed task 0x3ffee918 by 0x4029b744:0x4017a5dc
>>> 12 bytes (@ 0x3f85d0d4) allocated CPU 1 task 0x3ffdfcd0 ccount 0x8fccdfd8 caller 0x40131b12:0x4014633c
>>> freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04
>>> […]
>>> 12 bytes (@ 0x3f85d0d4) allocated CPU 1 task 0x3ffdfcd0 ccount 0x3cf429f4 caller 0x40131b12:0x4014633c
>>> freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04
>>> 112 bytes (@ 0x3ffebc80) allocated CPU 1 task 0x3ffee918 ccount 0x3fe1a0bc caller 0x4029b455:0x4017a5dc
>>> 229 bytes alive in trace (3/1000 allocations)
>>> total allocations 1207 total frees 3356
>>> (NB: Buffer has overflowed; so trace data is incomplete.)
>>
>> "ccount" is the ESP32 CCOUNT register (CPU cycles), so it gives a 
>> rough indication of when the allocation occurred.
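
As a small worked example of reading the ccount values: the difference between two samples gives the elapsed CPU cycles, and dividing by the CPU clock gives the time. The sketch below assumes a 240 MHz clock for the conversion (an assumption; use the clock your module actually runs at), and the function names are hypothetical. CCOUNT is a 32 bit counter, so unsigned subtraction stays correct across one wrap-around, but at 240 MHz the counter wraps roughly every 17.9 seconds, so differences are only meaningful between records less than one wrap apart:

```cpp
#include <cassert>
#include <cstdint>

// Cycles elapsed between two CCOUNT samples. Unsigned 32 bit subtraction
// handles a single counter wrap-around correctly.
constexpr uint32_t ccount_delta(uint32_t earlier, uint32_t later)
  {
  return later - earlier;
  }

// Convert a cycle count to microseconds for a given CPU clock in MHz.
constexpr double cycles_to_us(uint32_t cycles, uint32_t cpu_mhz)
  {
  return (double)cycles / cpu_mhz;
  }

// First two allocations of the example dump:
static_assert(ccount_delta(0x89074e74u, 0x89218bbcu) == 1719624u, "delta");
// Counter wrapped between the two samples:
static_assert(ccount_delta(0xfffffff0u, 0x00000010u) == 0x20u, "wrap");
```

Under the 240 MHz assumption, the 1719624 cycles between the first two allocations of the example dump correspond to about 7.2 ms.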
>>
>>
>> Some more basics on heap debugging are also covered here: 
>> https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/system/heap_debug.html
>>
>> Regards,
>> Michael
>>
>> -- 
>> Michael Balzer * Am Rahmen 5 * D-58313 Herdecke
>> Fon 02330 9104094 * Handy 0176 20698926
>>
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
