[Ovmsdev] Duplicate timer callbacks
Michael Balzer
dexter at expeedo.de
Sat Jan 15 22:13:03 HKT 2022
I've added a tick count check to the core tickers and to the VW e-Up and
Renault Twizy timer callbacks, which are all periodic.
Code template: insert this at the top of your timer callback:
// Workaround for FreeRTOS duplicate timer callback bug
// (see https://github.com/espressif/esp-idf/issues/8234)
static TickType_t last_tick = 0;
TickType_t tick = xTaskGetTickCount();
if (tick < last_tick + xTimerGetPeriod(timer) - 3) return;
last_tick = tick;
This should also work for a single shot timer, unless it can be reused
with a shorter period (that's the case with the event timer pool).
The "-3" accomodates for delayed timer callback execution. Note that
this dupe detection therefore depends on the period being higher than 3
ticks (30 ms), which is the case for all current periodic timers
(fastest is the 100 ms = 10 ticks timer for the Twizy kickdown).
Regards,
Michael
Am 15.01.22 um 11:08 schrieb Michael Balzer:
> The effect also affects periodic timers, but with a very low frequency.
>
> Extended test & results:
> https://github.com/espressif/esp-idf/issues/8234#issuecomment-1013650548
>
> The test run had 3 periodic callback dupes while counting 5912 single
> shot callback dupes.
>
> Timer usages in the OVMS aside events are a) the housekeeping per
> second timer, from which the monotictime and all standard tickers are
> derived, and b) the update ticker for websocket connections.
>
> Vehicles using timers are: Nissan Leaf, Tesla Roadster, Renault Twizy
> & VW e-Up. I suggest the respective maintainers check their timer uses
> for issues that might result from a duplicate callback.
>
> The event timer already detects duplicate invocations and produces the
> log warning (see below) on this.
>
> Regarding the main ticker timer: a duplicate call would cause the
> monotonictime counter to get ouf of sync with the actual system time,
> but I don't think we would rely on that anywhere. More serious could
> be that the per second ticker can run twice on the same second in this
> case. But I also don't think we rely on a mininum execution interval
> as well -- after all, these are software timers and generally not
> guaranteed to run in exactly the interval they are defined to run. We
> use the monotonictime for some expiry checks, but none of these should
> be so low a single skipped second would matter.
>
> So I don't think the housekeeping ticker running occasionally twice
> would cause an issue. I do remember though I had an issue with the
> monotonictime behaving strangely once, but I don't recall the exact issue.
>
> The websocket timer is used to schedule data updates, metrics etc.,
> for the websocket clients. It does not rely on a minimum interval as
> well, and the occasional queue overflow events we see there are more
> probably caused by a slow network / client.
>
> The vehicle uses of timers may be more critical, as they seem to
> involve protocol timing, I recommend to check these ASAP.
>
> A simple strategy to detect duplicate calls on periodic timers can be
> to keep & compare the FreeRTOS tick count (xTaskGetTickCount()) of the
> last execution in some static variable, if it's the same, skip the run.
>
> For single shot timers, the strategy I've used in the events framework
> seems to work: use the timer ID to check the call validity, i.e.
> define some timer ID property to mark an invalid call and set that
> property in the callback.
>
> Regards,
> Michael
>
>
> Am 14.01.22 um 09:43 schrieb Michael Balzer:
>> The effect is also present with SPIRAM disabled (and 64K for
>> cpuload_task to fit in internal RAM), and it's even more frequent
>> when running in single core mode.
>>
>> I haven't reviewed all our timers yet, I'll try to test periodic
>> timers first for the effect so we know if these need to be reviewed also.
>>
>> Regards,
>> Michael
>>
>>
>> Am 14.01.22 um 06:50 schrieb Mark Webb-Johnson:
>>> Good to hear this has been confirmed in a test case.
>>>
>>> Do we need to review all our use of timers? Or have you already checked?
>>>
>>> I wonder if it is reproducible single core and without SPI RAM?
>>> Whether this is another ESP32 hardware bug affecting something else...
>>>
>>> Regards, Mark.
>>>
>>>> On 14 Jan 2022, at 12:03 AM, Michael Balzer <dexter at expeedo.de> wrote:
>>>>
>>>> Signed PGP part
>>>> Everyone,
>>>>
>>>> I've managed to reproduce the effect on a standard ESP32
>>>> development board with a simple test project only involving timers,
>>>> some CPU/memory load & wifi networking.
>>>>
>>>> I've tested both the standard esp-idf release 3.3 and the latest
>>>> esp-idf release 4.4 (using gcc 8.4) for this, and the effect is
>>>> still present.
>>>>
>>>> → Bug report: https://github.com/espressif/esp-idf/issues/8234
>>>>
>>>> Attached are my test projects if you'd like to reproduce this or
>>>> improve the test.
>>>>
>>>> I haven't tested periodic timer callbacks yet for the effect. These
>>>> are normally designed to run periodically, but if the timing is
>>>> important (e.g. on some CAN transmission?), this could cause
>>>> erroneous behaviour as well.
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>>
>>>> Am 30.12.21 um 19:25 schrieb Michael Balzer:
>>>>> Followup to…
>>>>>
>>>>> Am 26.09.21 um 19:28 schrieb Michael Balzer:
>>>>>> If so, my conclusions so far would be:
>>>>>>
>>>>>> a) we've got a real heap corruption issue that sometimes gets
>>>>>> triggered by the test. I've had it twice around the same place,
>>>>>> i.e. within the scheduled event processing. I'll check my code.
>>>>>>
>>>>>> Am 26.09.21 um 18:46 schrieb Craig Leres:
>>>>>>> Second crash on the production module, CORRUPT HEAP after ~350
>>>>>>> minutes.
>>>>>>>
>>>>>>> OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234
>>>>>>> got 0x3f8adcc0
>>>>>>> abort() was called at PC 0x400844c3 on core 0
>>>>>
>>>>> I've found the heap corruption source and a new bug class:
>>>>>
>>>>> The corruption was caused by a duplicate free() (here via delete),
>>>>> which was basically impossible: it was the free() call for the
>>>>> event message for a delayed event delivery. There is exactly one
>>>>> producer (OvmsEvents::ScheduleEvent) and exactly one consumer
>>>>> (OvmsEvents::SignalScheduledEvent), which is called exactly once
>>>>> -- when the single shot timer expires.
>>>>>
>>>>> In theory. _In reality, the timer callback occasionally gets
>>>>> executed twice_. To exclude every possible race condition, I
>>>>> enclosed both producer & consumer into a semaphore lock. I then
>>>>> changed the code in order to clear the timer payload as soon as
>>>>> it's read, and added a test for a NULL payload -- and voila, the
>>>>> timer callback now gets occasionally called with a NULL payload,
>>>>> which is also impossible as the allocation result is checked in
>>>>> the producer.
>>>>>
>>>>> I've had no luck reproducing this in a reduced test project, even
>>>>> with multiple auto reload timers and distribution over both cores,
>>>>> but I still see no other explanation than a bug in the FreeRTOS
>>>>> timer service (or the Espressif FreeRTOS multi core adaption). Ah,
>>>>> and yes, this occurs on the ESP32/r3 as well.
>>>>>
>>>>> You should be able to reproduce the effect using the same event
>>>>> test loop as for the Duktape TypeError issue:
>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-1002687055
>>>>>
>>>>> My workaround prevents crashes and outputs a log entry when the
>>>>> NULL payload is detected.
>>>>>
>>>>> Example log excerpt:
>>>>>
>>>>> script eval 'testcnt=0; PubSub.subscribe("usr.testev",
>>>>> function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt %
>>>>> (3*1000/ms) == 0) print(ev + ": " + testcnt);
>>>>> OvmsEvents.Raise("usr.testev."+ms, ms); })'
>>>>> script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'
>>>>> I (13493019) events: ScheduleEvent: creating new timer
>>>>> *W (13495919) events: SignalScheduledEvent: duplicate callback
>>>>> invocation detected*
>>>>> I (13497029) ovms-duk-util: [eval:1:] usr.testev.10: 300
>>>>> I (13501109) housekeeping: 2021-12-30 18:12:35 CET (RAM:
>>>>> 8b=64448-67004 32b=6472)
>>>>> --
>>>>> I (13521779) ovms-duk-util: [eval:1:] usr.testev.10: 2100
>>>>> I (13525809) ovms-duk-util: [eval:1:] usr.testev.10: 2400
>>>>> *W (13527579) events: SignalScheduledEvent: duplicate callback
>>>>> invocation detected**
>>>>> **W (13527629) events: SignalScheduledEvent: duplicate callback
>>>>> invocation detected*
>>>>> I (13529839) ovms-duk-util: [eval:1:] usr.testev.10: 2700
>>>>> I (13533329) ovms-server-v2: Incoming Msg: MP-0 AFA
>>>>> --
>>>>> I (13579149) ovms-duk-util: [eval:1:] usr.testev.10: 6300
>>>>> I (13583319) ovms-duk-util: [eval:1:] usr.testev.10: 6600
>>>>> *W (13584679) events: SignalScheduledEvent: duplicate callback
>>>>> invocation detected*
>>>>> I (13587439) ovms-duk-util: [eval:1:] usr.testev.10: 6900
>>>>> I (13591589) ovms-duk-util: [eval:1:] usr.testev.10: 7200
>>>>> --
>>>>> I (13714299) ovms-duk-util: [eval:1:] usr.testev.10: 16200
>>>>> I (13718339) ovms-duk-util: [eval:1:] usr.testev.10: 16500
>>>>> *W (13718719) events: SignalScheduledEvent: duplicate callback
>>>>> invocation detected*
>>>>> I (13722459) ovms-duk-util: [eval:1:] usr.testev.10: 16800
>>>>> I (13726509) ovms-duk-util: [eval:1:] usr.testev.10: 17100
>>>>> --
>>>>> I (13743149) ovms-duk-util: [eval:1:] usr.testev.10: 18300
>>>>> I (13747129) ovms-duk-util: [eval:1:] usr.testev.10: 18600
>>>>> *W (13748979) events: SignalScheduledEvent: duplicate callback
>>>>> invocation detected*
>>>>> I (13751299) ovms-duk-util: [eval:1:] usr.testev.10: 18900
>>>>> I (13755349) ovms-duk-util: [eval:1:] usr.testev.10: 19200
>>>>> --
>>>>> I (13784029) ovms-duk-util: [eval:1:] usr.testev.10: 21300
>>>>> I (13788059) ovms-duk-util: [eval:1:] usr.testev.10: 21600
>>>>> *W (13791409) events: SignalScheduledEvent: duplicate callback
>>>>> invocation detected*
>>>>> I (13792179) ovms-duk-util: [eval:1:] usr.testev.10: 21900
>>>>> I (13796239) ovms-duk-util: [eval:1:] usr.testev.10: 22200
>>>>> …
>>>>>
>>>>>
>>>>> The bug frequency differs from boot to boot, but as you can see
>>>>> can be very high. I've had runs with ~ 1 occurrence every 300.000
>>>>> callbacks, and runs like the above with ~ 1 every 3.000 callbacks.
>>>>>
>>>>> If this is a common effect with timer callbacks, that may cause
>>>>> some of the remaining issues. It's possible this only happens with
>>>>> single shot timers, haven't checked our periodic timers yet.
>>>>>
>>>>> Any additional input on this is welcome.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>> --
>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>
>>>>> _______________________________________________
>>>>> OvmsDev mailing list
>>>>> OvmsDev at lists.openvehicles.com
>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>
>>>> --
>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>> <test-timer-dupecall.zip>
>>>>
>>>
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.openvehicles.com
>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>
>> --
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
> --
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20220115/a2ecf015/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 203 bytes
Desc: OpenPGP digital signature
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20220115/a2ecf015/attachment-0001.sig>
More information about the OvmsDev
mailing list