Everyone,
I've managed to reproduce the effect on a
standard ESP32 development board with a simple
test project only involving timers, some
CPU/memory load & wifi networking.
I've tested both the standard esp-idf release
3.3 and the latest esp-idf release 4.4 (using
gcc 8.4) for this, and the effect is still
present.
→ Bug report:
https://github.com/espressif/esp-idf/issues/8234
Attached are my test projects if you'd like to
reproduce this or improve the test.
I haven't tested periodic timer callbacks yet
for the effect. These are normally designed to
run periodically, but if the timing is
important (e.g. on some CAN transmission?),
this could cause erroneous behaviour as well.
Regards,
Michael
Am 30.12.21 um
19:25 schrieb Michael Balzer:
Followup to…
Am 26.09.21 um
19:28 schrieb Michael Balzer:
If so, my conclusions so far
would be:
a) we've got a real heap corruption issue
that sometimes gets triggered by the test.
I've had it twice around the same place,
i.e. within the scheduled event
processing. I'll check my code.
Am 26.09.21 um 18:46 schrieb Craig Leres:
Second
crash on the production module, CORRUPT
HEAP after ~350 minutes.
OVMS# CORRUPT HEAP: Bad head at
0x3f8adc78. Expected 0xabba1234 got
0x3f8adcc0
abort() was called at PC 0x400844c3 on
core 0
I've found the heap corruption source and a
new bug class:
The corruption was caused by a duplicate
free() (here via delete), which was
basically impossible: it was the free() call
for the event message for a delayed event
delivery. There is exactly one producer
(OvmsEvents::ScheduleEvent) and exactly one
consumer (OvmsEvents::SignalScheduledEvent),
which is called exactly once -- when the
single shot timer expires.
In theory. In reality, the
timer callback occasionally gets executed
twice. To exclude every possible race
condition, I enclosed both producer &
consumer into a semaphore lock. I then
changed the code in order to clear the timer
payload as soon as it's read, and added a
test for a NULL payload -- and voila, the
timer callback now gets occasionally called
with a NULL payload, which is also
impossible as the allocation result is
checked in the producer.
I've had no luck reproducing this in a
reduced test project, even with multiple
auto reload timers and distribution over
both cores, but I still see no other
explanation than a bug in the FreeRTOS timer
service (or the Espressif FreeRTOS multi
core adaption). Ah, and yes, this occurs on
the ESP32/r3 as well.
You should be able to reproduce the effect
using the same event test loop as for the
Duktape TypeError issue: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-1002687055
My workaround prevents crashes and outputs a
log entry when the NULL payload is detected.
Example log excerpt:
script eval
'testcnt=0; PubSub.subscribe("usr.testev",
function(ev) { var
ms=Number(ev.substr(11))||10; if
(++testcnt % (3*1000/ms) == 0) print(ev +
": " + testcnt);
OvmsEvents.Raise("usr.testev."+ms, ms);
})'
script eval 'testcnt=0;
OvmsEvents.Raise("usr.testev.10")'
I (13493019) events: ScheduleEvent:
creating new timer
W (13495919) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13497029) ovms-duk-util: [eval:1:]
usr.testev.10: 300
I (13501109) housekeeping: 2021-12-30
18:12:35 CET (RAM: 8b=64448-67004
32b=6472)
--
I (13521779) ovms-duk-util: [eval:1:]
usr.testev.10: 2100
I (13525809) ovms-duk-util: [eval:1:]
usr.testev.10: 2400
W (13527579) events:
SignalScheduledEvent: duplicate callback
invocation detected
W (13527629) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13529839) ovms-duk-util: [eval:1:]
usr.testev.10: 2700
I (13533329) ovms-server-v2: Incoming Msg:
MP-0 AFA
--
I (13579149) ovms-duk-util: [eval:1:]
usr.testev.10: 6300
I (13583319) ovms-duk-util: [eval:1:]
usr.testev.10: 6600
W (13584679) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13587439) ovms-duk-util: [eval:1:]
usr.testev.10: 6900
I (13591589) ovms-duk-util: [eval:1:]
usr.testev.10: 7200
--
I (13714299) ovms-duk-util: [eval:1:]
usr.testev.10: 16200
I (13718339) ovms-duk-util: [eval:1:]
usr.testev.10: 16500
W (13718719) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13722459) ovms-duk-util: [eval:1:]
usr.testev.10: 16800
I (13726509) ovms-duk-util: [eval:1:]
usr.testev.10: 17100
--
I (13743149) ovms-duk-util: [eval:1:]
usr.testev.10: 18300
I (13747129) ovms-duk-util: [eval:1:]
usr.testev.10: 18600
W (13748979) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13751299) ovms-duk-util: [eval:1:]
usr.testev.10: 18900
I (13755349) ovms-duk-util: [eval:1:]
usr.testev.10: 19200
--
I (13784029) ovms-duk-util: [eval:1:]
usr.testev.10: 21300
I (13788059) ovms-duk-util: [eval:1:]
usr.testev.10: 21600
W (13791409) events:
SignalScheduledEvent: duplicate callback
invocation detected
I (13792179) ovms-duk-util: [eval:1:]
usr.testev.10: 21900
I (13796239) ovms-duk-util: [eval:1:]
usr.testev.10: 22200
…
The bug frequency differs from boot to boot,
but as you can see can be very high. I've
had runs with ~ 1 occurrence every 300.000
callbacks, and runs like the above with ~ 1
every 3.000 callbacks.
If this is a common effect with timer
callbacks, that may cause some of the
remaining issues. It's possible this only
happens with single shot timers, haven't
checked our periodic timers yet.
Any additional input on this is welcome.
Regards,
Michael
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
<test-timer-dupecall.zip>