[Ovmsdev] Duplicate timer callbacks

Michael Balzer dexter at expeedo.de
Sat Jan 15 22:13:03 HKT 2022


I've added a tick count check to the core tickers and to the VW e-Up and 
Renault Twizy timer callbacks, which are all periodic.

Code template: insert this at the top of your timer callback:

   // Workaround for FreeRTOS duplicate timer callback bug
   // (see https://github.com/espressif/esp-idf/issues/8234)
   static TickType_t last_tick = 0;
   TickType_t tick = xTaskGetTickCount();
   if (tick < last_tick + xTimerGetPeriod(timer) - 3) return;
   last_tick = tick;

This should also work for a single shot timer, unless it can be reused 
with a shorter period (that's the case with the event timer pool).

The "-3" accomodates for delayed timer callback execution. Note that 
this dupe detection therefore depends on the period being higher than 3 
ticks (30 ms), which is the case for all current periodic timers 
(fastest is the 100 ms = 10 ticks timer for the Twizy kickdown).

Regards,
Michael


Am 15.01.22 um 11:08 schrieb Michael Balzer:
> The effect also affects periodic timers, but with a very low frequency.
>
> Extended test & results:
> https://github.com/espressif/esp-idf/issues/8234#issuecomment-1013650548
>
> The test run had 3 periodic callback dupes while counting 5912 single 
> shot callback dupes.
>
> Timer usages in the OVMS aside events are a) the housekeeping per 
> second timer, from which the monotictime and all standard tickers are 
> derived, and b) the update ticker for websocket connections.
>
> Vehicles using timers are: Nissan Leaf, Tesla Roadster, Renault Twizy 
> & VW e-Up. I suggest the respective maintainers check their timer uses 
> for issues that might result from a duplicate callback.
>
> The event timer already detects duplicate invocations and produces the 
> log warning (see below) on this.
>
> Regarding the main ticker timer: a duplicate call would cause the 
> monotonictime counter to get ouf of sync with the actual system time, 
> but I don't think we would rely on that anywhere. More serious could 
> be that the per second ticker can run twice on the same second in this 
> case. But I also don't think we rely on a mininum execution interval 
> as well -- after all, these are software timers and generally not 
> guaranteed to run in exactly the interval they are defined to run. We 
> use the monotonictime for some expiry checks, but none of these should 
> be so low a single skipped second would matter.
>
> So I don't think the housekeeping ticker running occasionally twice 
> would cause an issue. I do remember though I had an issue with the 
> monotonictime behaving strangely once, but I don't recall the exact issue.
>
> The websocket timer is used to schedule data updates, metrics etc., 
> for the websocket clients. It does not rely on a minimum interval as 
> well, and the occasional queue overflow events we see there are more 
> probably caused by a slow network / client.
>
> The vehicle uses of timers may be more critical, as they seem to 
> involve protocol timing, I recommend to check these ASAP.
>
> A simple strategy to detect duplicate calls on periodic timers can be 
> to keep & compare the FreeRTOS tick count (xTaskGetTickCount()) of the 
> last execution in some static variable, if it's the same, skip the run.
>
> For single shot timers, the strategy I've used in the events framework 
> seems to work: use the timer ID to check the call validity, i.e. 
> define some timer ID property to mark an invalid call and set that 
> property in the callback.
>
> Regards,
> Michael
>
>
> Am 14.01.22 um 09:43 schrieb Michael Balzer:
>> The effect is also present with SPIRAM disabled (and 64K for 
>> cpuload_task to fit in internal RAM), and it's even more frequent 
>> when running in single core mode.
>>
>> I haven't reviewed all our timers yet, I'll try to test periodic 
>> timers first for the effect so we know if these need to be reviewed also.
>>
>> Regards,
>> Michael
>>
>>
>> Am 14.01.22 um 06:50 schrieb Mark Webb-Johnson:
>>> Good to hear this has been confirmed in a test case.
>>>
>>> Do we need to review all our use of timers? Or have you already checked?
>>>
>>> I wonder if it is reproducible single core and without SPI RAM? 
>>> Whether this is another ESP32 hardware bug affecting something else...
>>>
>>> Regards, Mark.
>>>
>>>> On 14 Jan 2022, at 12:03 AM, Michael Balzer <dexter at expeedo.de> wrote:
>>>>
>>>> Signed PGP part
>>>> Everyone,
>>>>
>>>> I've managed to reproduce the effect on a standard ESP32 
>>>> development board with a simple test project only involving timers, 
>>>> some CPU/memory load & wifi networking.
>>>>
>>>> I've tested both the standard esp-idf release 3.3 and the latest 
>>>> esp-idf release 4.4 (using gcc 8.4) for this, and the effect is 
>>>> still present.
>>>>
>>>> → Bug report: https://github.com/espressif/esp-idf/issues/8234
>>>>
>>>> Attached are my test projects if you'd like to reproduce this or 
>>>> improve the test.
>>>>
>>>> I haven't tested periodic timer callbacks yet for the effect. These 
>>>> are normally designed to run periodically, but if the timing is 
>>>> important (e.g. on some CAN transmission?), this could cause 
>>>> erroneous behaviour as well.
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>>
>>>> Am 30.12.21 um 19:25 schrieb Michael Balzer:
>>>>> Followup to…
>>>>>
>>>>> Am 26.09.21 um 19:28 schrieb Michael Balzer:
>>>>>> If so, my conclusions so far would be:
>>>>>>
>>>>>> a) we've got a real heap corruption issue that sometimes gets 
>>>>>> triggered by the test. I've had it twice around the same place, 
>>>>>> i.e. within the scheduled event processing. I'll check my code.
>>>>>>
>>>>>> Am 26.09.21 um 18:46 schrieb Craig Leres:
>>>>>>> Second crash on the production module, CORRUPT HEAP after ~350 
>>>>>>> minutes.
>>>>>>>
>>>>>>> OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 
>>>>>>> got 0x3f8adcc0
>>>>>>> abort() was called at PC 0x400844c3 on core 0
>>>>>
>>>>> I've found the heap corruption source and a new bug class:
>>>>>
>>>>> The corruption was caused by a duplicate free() (here via delete), 
>>>>> which was basically impossible: it was the free() call for the 
>>>>> event message for a delayed event delivery. There is exactly one 
>>>>> producer (OvmsEvents::ScheduleEvent) and exactly one consumer 
>>>>> (OvmsEvents::SignalScheduledEvent), which is called exactly once 
>>>>> -- when the single shot timer expires.
>>>>>
>>>>> In theory. _In reality, the timer callback occasionally gets 
>>>>> executed twice_. To exclude every possible race condition, I 
>>>>> enclosed both producer & consumer into a semaphore lock. I then 
>>>>> changed the code in order to clear the timer payload as soon as 
>>>>> it's read, and added a test for a NULL payload -- and voila, the 
>>>>> timer callback now gets occasionally called with a NULL payload, 
>>>>> which is also impossible as the allocation result is checked in 
>>>>> the producer.
>>>>>
>>>>> I've had no luck reproducing this in a reduced test project, even 
>>>>> with multiple auto reload timers and distribution over both cores, 
>>>>> but I still see no other explanation than a bug in the FreeRTOS 
>>>>> timer service (or the Espressif FreeRTOS multi core adaption). Ah, 
>>>>> and yes, this occurs on the ESP32/r3 as well.
>>>>>
>>>>> You should be able to reproduce the effect using the same event 
>>>>> test loop as for the Duktape TypeError issue: 
>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-1002687055
>>>>>
>>>>> My workaround prevents crashes and outputs a log entry when the 
>>>>> NULL payload is detected.
>>>>>
>>>>> Example log excerpt:
>>>>>
>>>>> script eval 'testcnt=0; PubSub.subscribe("usr.testev", 
>>>>> function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % 
>>>>> (3*1000/ms) == 0) print(ev + ": " + testcnt); 
>>>>> OvmsEvents.Raise("usr.testev."+ms, ms); })'
>>>>> script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'
>>>>> I (13493019) events: ScheduleEvent: creating new timer
>>>>> *W (13495919) events: SignalScheduledEvent: duplicate callback 
>>>>> invocation detected*
>>>>> I (13497029) ovms-duk-util: [eval:1:] usr.testev.10: 300
>>>>> I (13501109) housekeeping: 2021-12-30 18:12:35 CET (RAM: 
>>>>> 8b=64448-67004 32b=6472)
>>>>> --
>>>>> I (13521779) ovms-duk-util: [eval:1:] usr.testev.10: 2100
>>>>> I (13525809) ovms-duk-util: [eval:1:] usr.testev.10: 2400
>>>>> *W (13527579) events: SignalScheduledEvent: duplicate callback 
>>>>> invocation detected**
>>>>> **W (13527629) events: SignalScheduledEvent: duplicate callback 
>>>>> invocation detected*
>>>>> I (13529839) ovms-duk-util: [eval:1:] usr.testev.10: 2700
>>>>> I (13533329) ovms-server-v2: Incoming Msg: MP-0 AFA
>>>>> --
>>>>> I (13579149) ovms-duk-util: [eval:1:] usr.testev.10: 6300
>>>>> I (13583319) ovms-duk-util: [eval:1:] usr.testev.10: 6600
>>>>> *W (13584679) events: SignalScheduledEvent: duplicate callback 
>>>>> invocation detected*
>>>>> I (13587439) ovms-duk-util: [eval:1:] usr.testev.10: 6900
>>>>> I (13591589) ovms-duk-util: [eval:1:] usr.testev.10: 7200
>>>>> --
>>>>> I (13714299) ovms-duk-util: [eval:1:] usr.testev.10: 16200
>>>>> I (13718339) ovms-duk-util: [eval:1:] usr.testev.10: 16500
>>>>> *W (13718719) events: SignalScheduledEvent: duplicate callback 
>>>>> invocation detected*
>>>>> I (13722459) ovms-duk-util: [eval:1:] usr.testev.10: 16800
>>>>> I (13726509) ovms-duk-util: [eval:1:] usr.testev.10: 17100
>>>>> --
>>>>> I (13743149) ovms-duk-util: [eval:1:] usr.testev.10: 18300
>>>>> I (13747129) ovms-duk-util: [eval:1:] usr.testev.10: 18600
>>>>> *W (13748979) events: SignalScheduledEvent: duplicate callback 
>>>>> invocation detected*
>>>>> I (13751299) ovms-duk-util: [eval:1:] usr.testev.10: 18900
>>>>> I (13755349) ovms-duk-util: [eval:1:] usr.testev.10: 19200
>>>>> --
>>>>> I (13784029) ovms-duk-util: [eval:1:] usr.testev.10: 21300
>>>>> I (13788059) ovms-duk-util: [eval:1:] usr.testev.10: 21600
>>>>> *W (13791409) events: SignalScheduledEvent: duplicate callback 
>>>>> invocation detected*
>>>>> I (13792179) ovms-duk-util: [eval:1:] usr.testev.10: 21900
>>>>> I (13796239) ovms-duk-util: [eval:1:] usr.testev.10: 22200
>>>>>>>>>>
>>>>>
>>>>> The bug frequency differs from boot to boot, but as you can see 
>>>>> can be very high. I've had runs with ~ 1 occurrence every 300.000 
>>>>> callbacks, and runs like the above with ~ 1 every 3.000 callbacks.
>>>>>
>>>>> If this is a common effect with timer callbacks, that may cause 
>>>>> some of the remaining issues. It's possible this only happens with 
>>>>> single shot timers, haven't checked our periodic timers yet.
>>>>>
>>>>> Any additional input on this is welcome.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>> -- 
>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>
>>>>> _______________________________________________
>>>>> OvmsDev mailing list
>>>>> OvmsDev at lists.openvehicles.com
>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>
>>>> -- 
>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>> <test-timer-dupecall.zip>
>>>>
>>>
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.openvehicles.com
>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>
>> -- 
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
> -- 
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20220115/a2ecf015/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 203 bytes
Desc: OpenPGP digital signature
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20220115/a2ecf015/attachment-0001.sig>


More information about the OvmsDev mailing list