<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
FTR, I was too optimistic. The LoadProhibited &
IllegalInstruction crashes may have been reduced, but they are still
present, and so is the one in the Console that I thought was most
probably caused by the unsynced logging console registry:<br>
<br>
<font face="monospace">0x4014681e is in OvmsConsole::Poll(unsigned
int, void*)
(/home/balzer/esp/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_console.cpp:226).<br>
221 if (event.type == ALERT_MULTI)<br>
222 {<br>
223 LogBuffers::iterator before =
event.multi->begin(), after;<br>
224 while (true)<br>
225 {<br>
<b>226 buffer = *before;</b><br>
227 len = strlen(buffer);<br>
228 after = before;<br>
229 ++after;<br>
230 if (after == event.multi->end())<br>
</font><br>
On the positive side, the crash frequency is still reduced.<br>
<br>
I'm investigating whether we need to do more about Mongoose, as many
crashes still seem to be related to some network reconfiguration
event.<br>
<br>
The thread safety patch on the mbufs did help a lot
(→<a class="moz-txt-link-freetext" href="https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/120">https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/120</a>),
but we may actually need to move the mutex up to the API level, at
least for every function accessing the manager or connection list or
a connection's virtual interface.<br>
<br>
Crash example on a vtable indirection:<br>
<br>
<font face="monospace">0x4015d82a is in mg_send
(/home/balzer/esp/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/components/mongoose/mongoose/mongoose.c:2776).<br>
2771 size_t mg_send(struct mg_connection *nc, const void *buf,
int len) {<br>
2772 nc->last_io_time = (time_t) mg_time();<br>
2773 if (nc->flags & MG_F_UDP) {<br>
2774 len = nc->iface->vtable->udp_send(nc, buf,
len);<br>
2775 } else {<br>
<b>2776 len = nc->iface->vtable->tcp_send(nc, buf,
len);</b><br>
2777 }</font><br>
<br>
A Mongoose update won't solve this. Mongoose still has neither support
nor plans for thread safety; they explicitly warn about this:<br>
<br>
→ <a class="moz-txt-link-freetext" href="https://mongoose.ws/documentation/#connections-and-event-manager">https://mongoose.ws/documentation/#connections-and-event-manager</a><br>
<blockquote type="cite">
<blockquote
style="box-sizing: border-box; margin: 0px 0px 1rem; padding-left: 2.5em; background: rgb(200, 241, 226); border: 1px solid rgb(170, 221, 204); border-radius: 0.5em; color: rgb(85, 85, 85); font-family: Inter, Verdana, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">
<p style="box-sizing: border-box; margin: 0.8em 0px;">NOTE:
Since Mongoose's core is not protected against concurrent
accesses, make sure that all<span> </span><code
style="box-sizing: border-box; font-family: Menlo, Consolas, monospace !important; font-size: 13px !important; color: rgb(85, 85, 85); overflow-wrap: break-word; background: rgb(229, 229, 229); font-weight: bolder; line-height: 1.3 !important; border-radius: 0.2em; overflow: auto; padding: 0.1em 0.5em;">mg_*</code><span> </span>API
functions are called from the same thread or RTOS task</p>
</blockquote>
</blockquote>
<br>
Our ignoring this seems to be the single most problematic concurrency
issue remaining.<br>
<br>
My current idea for solving this is to introduce a `MongooseClient`
class that replicates the critical Mongoose API methods, each wrapped
in a mutex lock, and then simply add that as a base class to every
component class using the API.<br>
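A minimal sketch of that idea, assuming standard C++ primitives: the names `MongooseClient`, `MgLock()` and `MgCall()` are hypothetical, and `std::recursive_mutex` stands in for whatever mutex type we would actually use. The lock needs to be recursive, because Mongoose runs our event handlers from inside `mg_mgr_poll()`, and those handlers call `mg_send()` et al. while the netmanager task already holds the lock. It only helps if the netmanager takes the same lock around its `mg_mgr_poll()` call.

```cpp
#include <mutex>
#include <utility>

// Hypothetical base class (not the actual OVMS implementation): serializes
// all Mongoose API calls through one process-wide recursive mutex.
class MongooseClient
{
  public:
    // Shared lock; the netmanager task would also hold it while calling
    // mg_mgr_poll(). Recursive, because event handlers running inside
    // mg_mgr_poll() re-enter the API (e.g. mg_send()) on the same task.
    static std::recursive_mutex& MgLock()
    {
      static std::recursive_mutex lock;
      return lock;
    }

    // Calls any API function with the lock held and forwards its result,
    // e.g. MgCall(mg_send, nc, buf, len).
    template <typename Fn, typename... Args>
    static auto MgCall(Fn&& fn, Args&&... args)
      -> decltype(std::forward<Fn>(fn)(std::forward<Args>(args)...))
    {
      std::lock_guard<std::recursive_mutex> guard(MgLock());
      return std::forward<Fn>(fn)(std::forward<Args>(args)...);
    }
};
```

A component class would then derive from `MongooseClient` and write e.g. `MgCall(mg_send, nc, buf, len)` instead of calling `mg_send(nc, buf, len)` directly. A generic forwarder like this avoids replicating each API method by hand, at the cost of not restricting which functions get wrapped.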
<br>
I doubt the calls actually need to be made from the same task, as
Cesanta write. If that turns out to be necessary (why would it?),
the mutex can be replaced by a callback implementation extending the
netmanager's command job queue.<br>
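Should the same-task requirement turn out to be real, the calls could be shipped to the netmanager task instead of locking. A minimal sketch of such a job queue, with hypothetical names (`MgJobQueue`, `Post()`, `RunPending()`) and plain `std::function`/`std::mutex` in place of the netmanager's actual command job queue, which this does not reproduce:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical queue of API calls to be executed by the task owning the
// Mongoose manager. Other tasks only enqueue; they never call mg_* directly.
class MgJobQueue
{
  public:
    // Called from any task.
    void Post(std::function<void()> job)
    {
      std::lock_guard<std::mutex> guard(m_mutex);
      m_jobs.push_back(std::move(job));
    }

    // Called by the owning (netmanager) task, e.g. right after each
    // mg_mgr_poll() returns. Returns the number of jobs executed.
    size_t RunPending()
    {
      std::deque<std::function<void()>> jobs;
      {
        std::lock_guard<std::mutex> guard(m_mutex);
        jobs.swap(m_jobs);
      }
      // Run outside the lock, so jobs may post follow-up jobs.
      for (auto& job : jobs)
        job();
      return jobs.size();
    }

  private:
    std::mutex m_mutex;
    std::deque<std::function<void()>> m_jobs;
};
```

With `RunPending()` called after each `mg_mgr_poll()`, a job posted while the poll is waiting gets executed up to one poll timeout later. Callers needing a result, e.g. the byte count returned by `mg_send()`, would additionally have to block on it, for example via `std::promise`.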
<br>
Synchronizing the access implies each API call may need to wait for up
to the netmanager's `mg_mgr_poll()` timeout, which is currently set to
250 ms. If that becomes an issue, we can probably lower the timeout
to 100 or even 50 ms without hurting overall performance.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 14.01.26 at 18:22, Michael
Balzer via OvmsDev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:e3d9e411-69d5-49ff-80f7-3599556d79a7@expeedo.de">
I hope I'm not too optimistic, but from the crash records it seems
I've fixed about 50-60% of the crashes with the console registry
mutex
(<a class="moz-txt-link-freetext"
href="https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/commit/069632f48ef653601b4525201bd0af09a4625208"
moz-do-not-send="true">https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/commit/069632f48ef653601b4525201bd0af09a4625208</a>).<br>
<br>
This has been added in edge build 3.3.005-615-gd132bfe0, which
has now been running in vehicles for two days, with not a single
LoadProhibited or IllegalInstruction crash since, and not a single
crash in OVMS Console.<br>
<br>
The remaining crashes are all abort()s, about 11% of them triggered by
a detected heap corruption and the rest by the task watchdog.
Crash signatures overlap, i.e. there are watchdog triggers that
also detect a heap corruption, and some heap corruptions probably
are just not detected yet due to the light impact detection mode.<br>
<br>
So please keep looking for potential heap corruption sources.<br>
<br>
Btw, the config mutex didn't show a significant crash reduction in
the records, but it should still help reduce the /store
corruptions (time will tell).<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 29.12.25 at 12:11, Michael
Balzer via OvmsDev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:6c229035-e0a2-44d6-b992-3d43871b1351@expeedo.de">
Checking the heap immediately on events will probably not catch
bugs that involve late callbacks using already freed buffers,
and some events are also signaled before the actual shutdown
(release of all resources) has fully taken place.<br>
<br>
To register the heap checker for a delayed execution, use this
scheme:<br>
<br>
<blockquote type="cite"><font face="monospace"># create custom
heap checking event handler:<br>
vfs echo "module check alert"
/store/events/usr.check.heap/00-checkheap<br>
</font> <font face="monospace"><br>
# register delayed heap checks:<br>
vfs echo "event raise -d5000 usr.check.heap"
/store/events/server.v2.stopped/90-checkheap<br>
vfs echo "event raise -d5000 usr.check.heap"
/store/events/server.v3.stopped/90-checkheap<br>
…</font></blockquote>
<br>
Note: the 5 second delay is just a guess; some events may need
more.<br>
<br>
Maybe we should register the check to be done by default on some
selected events. Although the heap check normally completes within
milliseconds, I still hesitate to do that. The heap check needs
to lock the system while walking through all allocated memory
blocks, which may create issues for vehicles that rely on
consistent timing for CAN or custom hardware communication.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 29.12.25 at 11:42,
Michael Balzer via OvmsDev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:8d38266a-8c0d-4937-858c-5a997bc6344e@expeedo.de">
The heap check + one-off alert function is now exposed for
use in component code, and can also be called as a shell
command to enable its use in event scripts et al.:<br>
<br>
<blockquote type="cite"><font face="monospace">#include
"ovms_module.h"<br>
<br>
/**<br>
* module_check_heap_alert: check for and send one-off
alert notification on heap corruption<br>
* <br>
* To enable the check every 5 minutes, set config
"module" "debug.heap" to "yes".<br>
* <br>
* To add custom checks, call from your code, or
register event scripts as needed.<br>
* Example: perform heap integrity check when the
server V2 gets stopped:<br>
* vfs echo "module check alert"
/store/events/server.v2.stopped/90-checkheap<br>
* <br>
* @param verbosity -- optional: channel capacity
(default 0)<br>
* @param writer -- optional: channel (default
NULL)<br>
* @return heapok -- false = heap corrupted/full<br>
*/<br>
extern bool module_check_heap_alert(int verbosity=0,
OvmsWriter* writer=NULL);</font></blockquote>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 28.12.25 at 19:21,
Michael Balzer via OvmsDev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:b15deda2-263e-4e7d-b1fb-50832b3afb47@expeedo.de">
PS: when creating a debug build, you may also consider
enabling the "comprehensive" corruption detection mode, as
that will catch more cases.<br>
<br>
But be aware that this has a substantial impact on performance,
so a user will probably not tolerate it for regular daily
use.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 28.12.25 at 18:26,
Michael Balzer via OvmsDev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:4c93150c-bdf6-4bf6-a960-d62af1d317b1@expeedo.de">
Everyone,<br>
<br>
from the crash reports (<a class="moz-txt-link-freetext"
href="https://ovms.dexters-web.de/firmware/developer/"
moz-do-not-send="true">https://ovms.dexters-web.de/firmware/developer/</a>),
most remaining crashes seem to be caused by heap
corruptions.<br>
<br>
Not all heap corruptions are easily detectable from the
backtrace analysis, and the component or action causing
the corruption isn't detectable at all that way. So I've
added a debug option to enable a regular heap integrity
check every 5 minutes, with the module sending an alert
notification when a corruption has been detected. Example:<br>
<br>
<blockquote type="cite"><font face="monospace">Heap
corruption detected; reboot advised ASAP!<br>
Please forward including task records and system log:<br>
<br>
CORRUPT HEAP: Bad tail at 0x3f8e44f0 owner 0x3ffea9bc.
Expected 0xbaad5678 got 0xbaad5600</font><br>
</blockquote>
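To illustrate what a "Bad tail" report means: the allocator places a canary value right behind the requested bytes and checks it later. The following toy model is not ESP-IDF's implementation (the function names are hypothetical; only the canary value 0xbaad5678 is taken from the report above). It stores the canary in little-endian byte order, as on the ESP32, so an off-by-one write of a single zero byte, e.g. a string terminator copied one byte past the end, changes the value from 0xbaad5678 to 0xbaad5600, which is consistent with the example report.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Toy model of heap tail canaries; not ESP-IDF's implementation.
static const uint32_t kTailCanary = 0xbaad5678;  // value from the report above

// Writes/reads the canary byte by byte in little-endian order (as on the
// ESP32), independent of host byte order and alignment.
static void put_le32(uint8_t* p, uint32_t v)
{
  for (int i = 0; i < 4; i++)
    p[i] = static_cast<uint8_t>(v >> (8 * i));
}

static uint32_t get_le32(const uint8_t* p)
{
  uint32_t v = 0;
  for (int i = 0; i < 4; i++)
    v |= static_cast<uint32_t>(p[i]) << (8 * i);
  return v;
}

// Allocates `size` usable bytes followed by the 4 byte tail canary.
uint8_t* canary_alloc(size_t size)
{
  uint8_t* block = static_cast<uint8_t*>(std::malloc(size + 4));
  if (block)
    put_le32(block + size, kTailCanary);
  return block;
}

// Returns the tail value found behind `size` bytes; kTailCanary if intact.
uint32_t canary_tail(const uint8_t* block, size_t size)
{
  return get_le32(block + size);
}
```

For example, `strcpy()` of the 8 character string "12345678" into an 8 byte block writes its terminator onto the first canary byte, so `canary_tail()` then returns 0xbaad5600.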
<br>
I also added a final heap integrity check to our crash
handler, so the crash debug records should now show
exactly which crashes occurred with a corrupted heap.<br>
<br>
In combination with the system log and the task log, that
should give us some more opportunities to narrow down the
cause(s).<br>
<br>
I've also added task ownership to the heap corruption
report. Note, this needs my latest additions to our
esp-idf fork, so take care to pull these before building.<br>
<br>
Be aware that the task owning a corrupted block isn't
necessarily the task doing the corruption. If the tail
canary is compromised, and no other block located before
that block is compromised, the owner *may* be the task
doing the out of bounds write. But it may also be a use
after free by some previous owner. So take task ownership
with a grain of salt.<br>
<br>
The corruptions are most probably caused by some unclean
shutdown of a component, or by an undetected race
condition within a shutdown procedure. The heap seems to
be stable on modules with standard configurations and
components not being started & shut down on a regular
basis. The heap corruptions are especially present now with
Smart (SQ) vehicles: as the Smart doesn't keep the 12V
battery charged from the main battery, most Smart users
probably use the power management to shut down Wifi and/or
modem while parking.<br>
<br>
So our main focus should be on analysing what happens
before the corruption. Ask users reporting heap
corruptions to provide their system logs, and possibly
also their task logs. To encourage enabling these, I've
added the config to the web UI (Config→Notifications).<br>
<br>
<br>
Once you can reproduce (!) the corruption, heap tracing
might provide some more insight as to where exactly the
corruption occurs.<br>
<br>
Heap tracing means recording all memory allocations and
frees. This adds a recording layer on top of the heap
functions, so it comes with some cost, even when inactive.
CPU overhead is low, but stack overhead may be an issue,
so I think we should not enable heap tracing by default
for now, but rather use a debug build specifically in
cases where we think heap tracing might help.<br>
<br>
To enable heap tracing on some user device, I've reworked
the ESP-IDF heap tracing to enable remote execution and to
also include the task handles performing the allocations
and deallocations.<br>
<br>
To enable heap tracing for a build:<br>
<br>
<blockquote type="cite">
<ul class="simple"
style="box-sizing: border-box; margin: 0px 0px 24px; padding: 0px; list-style: disc; line-height: 24px; color: rgb(64, 64, 64); font-family: Lato, proxima-nova, "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(252, 252, 252); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">
<li
style="box-sizing: border-box; list-style: disc; margin-left: 24px;">Under<span> </span><code
class="docutils literal notranslate"
style="box-sizing: border-box; font-family: SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", Courier, monospace; font-size: 12px; white-space: nowrap; max-width: 100%; background: rgb(255, 255, 255); border: 1px solid rgb(225, 228, 229); padding: 2px 5px; color: rgb(231, 76, 60); overflow-x: auto;"><span
class="pre" style="box-sizing: border-box;">make</span><span> </span><span
class="pre" style="box-sizing: border-box;">menuconfig</span></code>,
navigate to<span> </span><code
class="docutils literal notranslate"
style="box-sizing: border-box; font-family: SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", Courier, monospace; font-size: 12px; white-space: nowrap; max-width: 100%; background: rgb(255, 255, 255); border: 1px solid rgb(225, 228, 229); padding: 2px 5px; color: rgb(231, 76, 60); overflow-x: auto;"><span
class="pre" style="box-sizing: border-box;">Component</span><span> </span><span
class="pre" style="box-sizing: border-box;">settings</span></code><span> </span>-><span> </span><code
class="docutils literal notranslate"
style="box-sizing: border-box; font-family: SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", Courier, monospace; font-size: 12px; white-space: nowrap; max-width: 100%; background: rgb(255, 255, 255); border: 1px solid rgb(225, 228, 229); padding: 2px 5px; color: rgb(231, 76, 60); overflow-x: auto;"><span
class="pre" style="box-sizing: border-box;">Heap</span><span> </span><span
class="pre" style="box-sizing: border-box;">Memory</span><span> </span><span
class="pre" style="box-sizing: border-box;">Debugging</span></code><span> </span>and
set<span> </span><a class="reference internal"
href="https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/kconfig.html#config-heap-tracing"
style="box-sizing: border-box; color: rgb(41, 128, 185); text-decoration: none; cursor: pointer;"
moz-do-not-send="true"><span class="std std-ref"
style="box-sizing: border-box;">CONFIG_HEAP_TRACING</span></a>.</li>
</ul>
</blockquote>
<br>
There's also an option to set the number of stack
backtrace frames. Tracing with two frames is mostly
useless in our (C++) context, as it will normally only
show some inner frames of the allocator. I've tried
raising that to 5, and got an immediate crash in the mdns
component. I assume raising the depth will need raising
some stack sizes. If you find a good compromise, please
report.<br>
<br>
Heap tracing will work best in a reduced configuration. In
normal operation, my module fills a 500-record buffer
within seconds. The buffer is a FIFO, so it will always
contain the latest n allocations, and the last entry in
the dump is the newest one.<br>
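The FIFO behaviour can be modelled as a ring buffer. This is a simplified sketch with hypothetical names, not the ESP-IDF recorder: real records also hold the CPU, the ccount and the caller frames. Once the buffer is full, every new allocation overwrites the oldest record, so a free of a block allocated before that point can no longer be matched, which is why the dump warns that the trace data is incomplete after an overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified trace record; field names are hypothetical.
struct TraceRecord
{
  uintptr_t address;
  size_t size;
  uint32_t alloc_task;  // task that allocated
  uint32_t free_task;   // task that freed, 0 = still alive
};

// Fixed-size FIFO keeping only the newest `capacity` allocation records.
class TraceFifo
{
  public:
    explicit TraceFifo(size_t capacity) : m_records(capacity)
    {
      assert(capacity > 0);
    }

    void RecordAlloc(uintptr_t address, size_t size, uint32_t task)
    {
      if (m_count == m_records.size())
        m_overflowed = true;  // the oldest record gets overwritten below
      else
        m_count++;
      m_records[m_next] = TraceRecord{address, size, task, 0};
      m_next = (m_next + 1) % m_records.size();
    }

    // Marks the newest live record for `address` as freed. Frees of blocks
    // whose allocation record already dropped out are silently lost.
    void RecordFree(uintptr_t address, uint32_t task)
    {
      for (size_t i = 0; i < m_count; i++)
      {
        TraceRecord& r = At(m_count - 1 - i);  // search newest first
        if (r.address == address && r.free_task == 0)
        {
          r.free_task = task;
          return;
        }
      }
    }

    // Index 0 = oldest record, Count()-1 = newest (last line of a dump).
    TraceRecord& At(size_t index)
    {
      size_t cap = m_records.size();
      size_t oldest = (m_next + cap - m_count) % cap;
      return m_records[(oldest + index) % cap];
    }

    size_t Count() const { return m_count; }
    bool Overflowed() const { return m_overflowed; }

  private:
    std::vector<TraceRecord> m_records;
    size_t m_count = 0;
    size_t m_next = 0;
    bool m_overflowed = false;
};
```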
<br>
Reduced example dump:<br>
<br>
<blockquote type="cite"><font face="monospace">OVMS# mod
trace dump<br>
Heap tracing started/resumed at ccount 0x1c70b453 =
logtime 19415<br>
1000 allocations trace (1000 entry buffer)<br>
258 bytes (@ 0x3fffb3a8) allocated CPU 1 task
0x3ffc92e8 ccount 0x89074e74 caller
0x4031d4f0:0x4031f553<br>
freed task 0x3ffc92e8 by 0x4031d514:0x4031f595<br>
201 bytes (@ 0x3f8e92f4) allocated CPU 1 task
0x3fff2290 ccount 0x89218bbc caller
0x40131b12:0x4014633c<br>
freed task 0x3fff2290 by 0x402c4634:0x402cb14d<br>
112 bytes (@ 0x3ffebc80) allocated CPU 1 task
0x3ffee918 ccount 0x8a25d47c caller
0x4029b455:0x4017a5dc<br>
freed task 0x3ffee918 by 0x4029b744:0x4017a5dc<br>
112 bytes (@ 0x3ffebc80) allocated CPU 1 task
0x3ffee918 ccount 0x8db89c40 caller
0x4029b455:0x4017a5dc<br>
freed task 0x3ffee918 by 0x4029b744:0x4017a5dc<br>
12 bytes (@ 0x3f85d0d4) allocated CPU 1 task
0x3ffdfcd0 ccount 0x8fccdfd8 caller
0x40131b12:0x4014633c<br>
freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04<br>
[…]<br>
12 bytes (@ 0x3f85d0d4) allocated CPU 1 task
0x3ffdfcd0 ccount 0x3cf429f4 caller
0x40131b12:0x4014633c<br>
freed task 0x3ffdfcd0 by 0x402c4634:0x401e0a04<br>
112 bytes (@ 0x3ffebc80) allocated CPU 1 task
0x3ffee918 ccount 0x3fe1a0bc caller
0x4029b455:0x4017a5dc<br>
229 bytes alive in trace (3/1000 allocations)<br>
total allocations 1207 total frees 3356<br>
(NB: Buffer has overflowed; so trace data is
incomplete.)</font></blockquote>
<br>
"ccount" is the ESP32 CCOUNT register (CPU cycles), so
provides an orientation on when the allocation occurred.<br>
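Two caveats when comparing ccount values: CCOUNT is a per-core register, so values from different CPUs may not be directly comparable (the records above are all from CPU 1), and being 32 bits wide it wraps around after 2^32 cycles, i.e. about 17.9 seconds assuming the usual 240 MHz CPU clock. That's why later entries in the example show smaller ccount values than earlier ones. A small helper sketch (hypothetical names) for converting the difference between two records into time, valid only if less than one wrap period lies between them:

```cpp
#include <cstdint>

// Hypothetical helpers for interpreting CCOUNT values from a trace dump.
// Assumes both values come from the same CPU, and less than one wrap period
// (2^32 cycles) lies between them; otherwise the result is ambiguous.
uint32_t ccount_delta(uint32_t earlier, uint32_t later)
{
  return later - earlier;  // unsigned arithmetic handles a single wraparound
}

double ccount_to_seconds(uint32_t cycles, double cpu_hz = 240e6)
{
  return cycles / cpu_hz;
}

// Time until the 32 bit counter wraps: 2^32 / 240 MHz = ~17.9 s
double ccount_wrap_period(double cpu_hz = 240e6)
{
  return 4294967296.0 / cpu_hz;
}
```

Applied to the two 12 byte records from the example dump (ccount 0x8fccdfd8 and 0x3cf429f4), this yields about 12.1 s, but only if no further wrap occurred in the elided "[…]" part, which may hide any number of them.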
<br>
<br>
Some more basics on heap debugging are also covered here:
<a class="moz-txt-link-freetext"
href="https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/system/heap_debug.html"
moz-do-not-send="true">https://docs.espressif.com/projects/esp-idf/en/v3.3/api-reference/system/heap_debug.html</a><br>
<br>
Regards,<br>
Michael<br>
<br>
<pre class="moz-signature" cols="72">--
Michael Balzer * Am Rahmen 5 * D-58313 Herdecke
Fon 02330 9104094 * Handy 0176 20698926</pre>
<br>
<fieldset class="moz-mime-attachment-header"></fieldset>
<pre wrap="" class="moz-quote-pre">_______________________________________________
OvmsDev mailing list
<a class="moz-txt-link-abbreviated moz-txt-link-freetext"
href="mailto:OvmsDev@lists.openvehicles.com"
moz-do-not-send="true">OvmsDev@lists.openvehicles.com</a>
<a class="moz-txt-link-freetext"
href="http://lists.openvehicles.com/mailman/listinfo/ovmsdev"
moz-do-not-send="true">http://lists.openvehicles.com/mailman/listinfo/ovmsdev</a>
</pre>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
<br>
</body>
</html>