Everyone,
I've managed to reproduce the effect on a standard
ESP32 development board with a simple test project
only involving timers, some CPU/memory load &
wifi networking.
I've tested both the standard esp-idf release 3.3
and the latest esp-idf release 4.4 (using gcc 8.4)
for this, and the effect is still present.
→ Bug report:
https://github.com/espressif/esp-idf/issues/8234
Attached are my test projects if you'd like to
reproduce this or improve the test.
I haven't tested periodic timer callbacks yet for
the effect. These are normally designed to run
periodically, but if the timing is important (e.g.
on some CAN transmission?), this could cause
erroneous behaviour as well.
Regards,
Michael
Am 30.12.21 um 19:25
schrieb Michael Balzer:
Followup to…
Am 26.09.21 um
19:28 schrieb Michael Balzer:
If so, my conclusions so far would
be:
a) we've got a real heap corruption issue that
sometimes gets triggered by the test. I've had
it twice around the same place, i.e. within
the scheduled event processing. I'll check my
code.
Am 26.09.21 um 18:46 schrieb Craig Leres:
Second crash
on the production module, CORRUPT HEAP after
~350 minutes.
OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78.
Expected 0xabba1234 got 0x3f8adcc0
abort() was called at PC 0x400844c3 on core
0
I've found the heap corruption source and a new
bug class:
The corruption was caused by a duplicate free()
(here via delete), which was basically
impossible: it was the free() call for the event
message for a delayed event delivery. There is
exactly one producer (OvmsEvents::ScheduleEvent)
and exactly one consumer
(OvmsEvents::SignalScheduledEvent), which is
called exactly once -- when the single shot
timer expires.
In theory. In reality, the timer
callback occasionally gets executed twice.
To exclude every possible race condition, I
enclosed both producer & consumer into a
semaphore lock. I then changed the code in order
to clear the timer payload as soon as it's read,
and added a test for a NULL payload -- and
voila, the timer callback now gets occasionally
called with a NULL payload, which is also
impossible as the allocation result is checked
in the producer.
I've had no luck reproducing this in a reduced
test project, even with multiple auto reload
timers and distribution over both cores, but I
still see no other explanation than a bug in the
FreeRTOS timer service (or the Espressif
FreeRTOS multi core adaption). Ah, and yes, this
occurs on the ESP32/r3 as well.
You should be able to reproduce the effect using
the same event test loop as for the Duktape
TypeError issue: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-1002687055
My workaround prevents crashes and outputs a log
entry when the NULL payload is detected.
Example log excerpt:
script eval
'testcnt=0; PubSub.subscribe("usr.testev",
function(ev) { var
ms=Number(ev.substr(11))||10; if (++testcnt %
(3*1000/ms) == 0) print(ev + ": " + testcnt);
OvmsEvents.Raise("usr.testev."+ms, ms); })'
script eval 'testcnt=0;
OvmsEvents.Raise("usr.testev.10")'
I (13493019) events: ScheduleEvent: creating
new timer
W (13495919) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13497029) ovms-duk-util: [eval:1:]
usr.testev.10: 300
I (13501109) housekeeping: 2021-12-30 18:12:35
CET (RAM: 8b=64448-67004 32b=6472)
--
I (13521779) ovms-duk-util: [eval:1:]
usr.testev.10: 2100
I (13525809) ovms-duk-util: [eval:1:]
usr.testev.10: 2400
W (13527579) events:
SignalScheduledEvent: duplicate callback
invocation detected
W (13527629) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13529839) ovms-duk-util: [eval:1:]
usr.testev.10: 2700
I (13533329) ovms-server-v2: Incoming Msg:
MP-0 AFA
--
I (13579149) ovms-duk-util: [eval:1:]
usr.testev.10: 6300
I (13583319) ovms-duk-util: [eval:1:]
usr.testev.10: 6600
W (13584679) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13587439) ovms-duk-util: [eval:1:]
usr.testev.10: 6900
I (13591589) ovms-duk-util: [eval:1:]
usr.testev.10: 7200
--
I (13714299) ovms-duk-util: [eval:1:]
usr.testev.10: 16200
I (13718339) ovms-duk-util: [eval:1:]
usr.testev.10: 16500
W (13718719) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13722459) ovms-duk-util: [eval:1:]
usr.testev.10: 16800
I (13726509) ovms-duk-util: [eval:1:]
usr.testev.10: 17100
--
I (13743149) ovms-duk-util: [eval:1:]
usr.testev.10: 18300
I (13747129) ovms-duk-util: [eval:1:]
usr.testev.10: 18600
W (13748979) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13751299) ovms-duk-util: [eval:1:]
usr.testev.10: 18900
I (13755349) ovms-duk-util: [eval:1:]
usr.testev.10: 19200
--
I (13784029) ovms-duk-util: [eval:1:]
usr.testev.10: 21300
I (13788059) ovms-duk-util: [eval:1:]
usr.testev.10: 21600
W (13791409) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13792179) ovms-duk-util: [eval:1:]
usr.testev.10: 21900
I (13796239) ovms-duk-util: [eval:1:]
usr.testev.10: 22200
…
The bug frequency differs from boot to boot, but
as you can see can be very high. I've had runs
with ~ 1 occurrence every 300.000 callbacks, and
runs like the above with ~ 1 every 3.000
callbacks.
If this is a common effect with timer callbacks,
that may cause some of the remaining issues.
It's possible this only happens with single shot
timers, haven't checked our periodic timers yet.
Any additional input on this is welcome.
Regards,
Michael
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
<test-timer-dupecall.zip>