[Ovmsdev] Duplicate timer callbacks

Michael Balzer dexter at expeedo.de
Sat Jan 29 18:31:29 HKT 2022


The duplicate callbacks on the single shot timers actually were my fault.

I missed reading the documentation thoroughly: xTimerChangePeriod() 
actually not only changes the period, it also starts the timer.

So assigning the timer payload after changing it's period introduced a 
race condition. I've fixed that and no longer see duplicate callbacks in 
the test log.

The case of additional callback executions for cyclic timers is still open.

Regards,
Michael


Am 15.01.22 um 15:13 schrieb Michael Balzer:
> I've added a tick count check to the core tickers and to the VW e-Up 
> and Renault Twizy timer callbacks, which are all periodic.
>
> Code template: insert this at the top of your timer callback:
>
>   // Workaround for FreeRTOS duplicate timer callback bug
>   // (see https://github.com/espressif/esp-idf/issues/8234)
>   static TickType_t last_tick = 0;
>   TickType_t tick = xTaskGetTickCount();
>   if (tick < last_tick + xTimerGetPeriod(timer) - 3) return;
>   last_tick = tick;
>
> This should also work for a single shot timer, unless it can be reused 
> with a shorter period (that's the case with the event timer pool).
>
> The "-3" accomodates for delayed timer callback execution. Note that 
> this dupe detection therefore depends on the period being higher than 
> 3 ticks (30 ms), which is the case for all current periodic timers 
> (fastest is the 100 ms = 10 ticks timer for the Twizy kickdown).
>
> Regards,
> Michael
>
>
> Am 15.01.22 um 11:08 schrieb Michael Balzer:
>> The effect also affects periodic timers, but with a very low frequency.
>>
>> Extended test & results:
>> https://github.com/espressif/esp-idf/issues/8234#issuecomment-1013650548
>>
>> The test run had 3 periodic callback dupes while counting 5912 single 
>> shot callback dupes.
>>
>> Timer usages in the OVMS aside events are a) the housekeeping per 
>> second timer, from which the monotictime and all standard tickers are 
>> derived, and b) the update ticker for websocket connections.
>>
>> Vehicles using timers are: Nissan Leaf, Tesla Roadster, Renault Twizy 
>> & VW e-Up. I suggest the respective maintainers check their timer 
>> uses for issues that might result from a duplicate callback.
>>
>> The event timer already detects duplicate invocations and produces 
>> the log warning (see below) on this.
>>
>> Regarding the main ticker timer: a duplicate call would cause the 
>> monotonictime counter to get ouf of sync with the actual system time, 
>> but I don't think we would rely on that anywhere. More serious could 
>> be that the per second ticker can run twice on the same second in 
>> this case. But I also don't think we rely on a mininum execution 
>> interval as well -- after all, these are software timers and 
>> generally not guaranteed to run in exactly the interval they are 
>> defined to run. We use the monotonictime for some expiry checks, but 
>> none of these should be so low a single skipped second would matter.
>>
>> So I don't think the housekeeping ticker running occasionally twice 
>> would cause an issue. I do remember though I had an issue with the 
>> monotonictime behaving strangely once, but I don't recall the exact 
>> issue.
>>
>> The websocket timer is used to schedule data updates, metrics etc., 
>> for the websocket clients. It does not rely on a minimum interval as 
>> well, and the occasional queue overflow events we see there are more 
>> probably caused by a slow network / client.
>>
>> The vehicle uses of timers may be more critical, as they seem to 
>> involve protocol timing, I recommend to check these ASAP.
>>
>> A simple strategy to detect duplicate calls on periodic timers can be 
>> to keep & compare the FreeRTOS tick count (xTaskGetTickCount()) of 
>> the last execution in some static variable, if it's the same, skip 
>> the run.
>>
>> For single shot timers, the strategy I've used in the events 
>> framework seems to work: use the timer ID to check the call validity, 
>> i.e. define some timer ID property to mark an invalid call and set 
>> that property in the callback.
>>
>> Regards,
>> Michael
>>
>>
>> Am 14.01.22 um 09:43 schrieb Michael Balzer:
>>> The effect is also present with SPIRAM disabled (and 64K for 
>>> cpuload_task to fit in internal RAM), and it's even more frequent 
>>> when running in single core mode.
>>>
>>> I haven't reviewed all our timers yet, I'll try to test periodic 
>>> timers first for the effect so we know if these need to be reviewed 
>>> also.
>>>
>>> Regards,
>>> Michael
>>>
>>>
>>> Am 14.01.22 um 06:50 schrieb Mark Webb-Johnson:
>>>> Good to hear this has been confirmed in a test case.
>>>>
>>>> Do we need to review all our use of timers? Or have you already 
>>>> checked?
>>>>
>>>> I wonder if it is reproducible single core and without SPI RAM? 
>>>> Whether this is another ESP32 hardware bug affecting something else...
>>>>
>>>> Regards, Mark.
>>>>
>>>>> On 14 Jan 2022, at 12:03 AM, Michael Balzer <dexter at expeedo.de> wrote:
>>>>>
>>>>> Signed PGP part
>>>>> Everyone,
>>>>>
>>>>> I've managed to reproduce the effect on a standard ESP32 
>>>>> development board with a simple test project only involving 
>>>>> timers, some CPU/memory load & wifi networking.
>>>>>
>>>>> I've tested both the standard esp-idf release 3.3 and the latest 
>>>>> esp-idf release 4.4 (using gcc 8.4) for this, and the effect is 
>>>>> still present.
>>>>>
>>>>> → Bug report: https://github.com/espressif/esp-idf/issues/8234
>>>>>
>>>>> Attached are my test projects if you'd like to reproduce this or 
>>>>> improve the test.
>>>>>
>>>>> I haven't tested periodic timer callbacks yet for the effect. 
>>>>> These are normally designed to run periodically, but if the timing 
>>>>> is important (e.g. on some CAN transmission?), this could cause 
>>>>> erroneous behaviour as well.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 30.12.21 um 19:25 schrieb Michael Balzer:
>>>>>> Followup to…
>>>>>>
>>>>>> Am 26.09.21 um 19:28 schrieb Michael Balzer:
>>>>>>> If so, my conclusions so far would be:
>>>>>>>
>>>>>>> a) we've got a real heap corruption issue that sometimes gets 
>>>>>>> triggered by the test. I've had it twice around the same place, 
>>>>>>> i.e. within the scheduled event processing. I'll check my code.
>>>>>>>
>>>>>>> Am 26.09.21 um 18:46 schrieb Craig Leres:
>>>>>>>> Second crash on the production module, CORRUPT HEAP after ~350 
>>>>>>>> minutes.
>>>>>>>>
>>>>>>>> OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 
>>>>>>>> got 0x3f8adcc0
>>>>>>>> abort() was called at PC 0x400844c3 on core 0
>>>>>>
>>>>>> I've found the heap corruption source and a new bug class:
>>>>>>
>>>>>> The corruption was caused by a duplicate free() (here via 
>>>>>> delete), which was basically impossible: it was the free() call 
>>>>>> for the event message for a delayed event delivery. There is 
>>>>>> exactly one producer (OvmsEvents::ScheduleEvent) and exactly one 
>>>>>> consumer (OvmsEvents::SignalScheduledEvent), which is called 
>>>>>> exactly once -- when the single shot timer expires.
>>>>>>
>>>>>> In theory. _In reality, the timer callback occasionally gets 
>>>>>> executed twice_. To exclude every possible race condition, I 
>>>>>> enclosed both producer & consumer into a semaphore lock. I then 
>>>>>> changed the code in order to clear the timer payload as soon as 
>>>>>> it's read, and added a test for a NULL payload -- and voila, the 
>>>>>> timer callback now gets occasionally called with a NULL payload, 
>>>>>> which is also impossible as the allocation result is checked in 
>>>>>> the producer.
>>>>>>
>>>>>> I've had no luck reproducing this in a reduced test project, even 
>>>>>> with multiple auto reload timers and distribution over both 
>>>>>> cores, but I still see no other explanation than a bug in the 
>>>>>> FreeRTOS timer service (or the Espressif FreeRTOS multi core 
>>>>>> adaption). Ah, and yes, this occurs on the ESP32/r3 as well.
>>>>>>
>>>>>> You should be able to reproduce the effect using the same event 
>>>>>> test loop as for the Duktape TypeError issue: 
>>>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-1002687055
>>>>>>
>>>>>> My workaround prevents crashes and outputs a log entry when the 
>>>>>> NULL payload is detected.
>>>>>>
>>>>>> Example log excerpt:
>>>>>>
>>>>>> script eval 'testcnt=0; PubSub.subscribe("usr.testev", 
>>>>>> function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % 
>>>>>> (3*1000/ms) == 0) print(ev + ": " + testcnt); 
>>>>>> OvmsEvents.Raise("usr.testev."+ms, ms); })'
>>>>>> script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'
>>>>>> I (13493019) events: ScheduleEvent: creating new timer
>>>>>> *W (13495919) events: SignalScheduledEvent: duplicate callback 
>>>>>> invocation detected*
>>>>>> I (13497029) ovms-duk-util: [eval:1:] usr.testev.10: 300
>>>>>> I (13501109) housekeeping: 2021-12-30 18:12:35 CET (RAM: 
>>>>>> 8b=64448-67004 32b=6472)
>>>>>> --
>>>>>> I (13521779) ovms-duk-util: [eval:1:] usr.testev.10: 2100
>>>>>> I (13525809) ovms-duk-util: [eval:1:] usr.testev.10: 2400
>>>>>> *W (13527579) events: SignalScheduledEvent: duplicate callback 
>>>>>> invocation detected**
>>>>>> **W (13527629) events: SignalScheduledEvent: duplicate callback 
>>>>>> invocation detected*
>>>>>> I (13529839) ovms-duk-util: [eval:1:] usr.testev.10: 2700
>>>>>> I (13533329) ovms-server-v2: Incoming Msg: MP-0 AFA
>>>>>> --
>>>>>> I (13579149) ovms-duk-util: [eval:1:] usr.testev.10: 6300
>>>>>> I (13583319) ovms-duk-util: [eval:1:] usr.testev.10: 6600
>>>>>> *W (13584679) events: SignalScheduledEvent: duplicate callback 
>>>>>> invocation detected*
>>>>>> I (13587439) ovms-duk-util: [eval:1:] usr.testev.10: 6900
>>>>>> I (13591589) ovms-duk-util: [eval:1:] usr.testev.10: 7200
>>>>>> --
>>>>>> I (13714299) ovms-duk-util: [eval:1:] usr.testev.10: 16200
>>>>>> I (13718339) ovms-duk-util: [eval:1:] usr.testev.10: 16500
>>>>>> *W (13718719) events: SignalScheduledEvent: duplicate callback 
>>>>>> invocation detected*
>>>>>> I (13722459) ovms-duk-util: [eval:1:] usr.testev.10: 16800
>>>>>> I (13726509) ovms-duk-util: [eval:1:] usr.testev.10: 17100
>>>>>> --
>>>>>> I (13743149) ovms-duk-util: [eval:1:] usr.testev.10: 18300
>>>>>> I (13747129) ovms-duk-util: [eval:1:] usr.testev.10: 18600
>>>>>> *W (13748979) events: SignalScheduledEvent: duplicate callback 
>>>>>> invocation detected*
>>>>>> I (13751299) ovms-duk-util: [eval:1:] usr.testev.10: 18900
>>>>>> I (13755349) ovms-duk-util: [eval:1:] usr.testev.10: 19200
>>>>>> --
>>>>>> I (13784029) ovms-duk-util: [eval:1:] usr.testev.10: 21300
>>>>>> I (13788059) ovms-duk-util: [eval:1:] usr.testev.10: 21600
>>>>>> *W (13791409) events: SignalScheduledEvent: duplicate callback 
>>>>>> invocation detected*
>>>>>> I (13792179) ovms-duk-util: [eval:1:] usr.testev.10: 21900
>>>>>> I (13796239) ovms-duk-util: [eval:1:] usr.testev.10: 22200
>>>>>>>>>>>>
>>>>>>
>>>>>> The bug frequency differs from boot to boot, but as you can see 
>>>>>> can be very high. I've had runs with ~ 1 occurrence every 300.000 
>>>>>> callbacks, and runs like the above with ~ 1 every 3.000 callbacks.
>>>>>>
>>>>>> If this is a common effect with timer callbacks, that may cause 
>>>>>> some of the remaining issues. It's possible this only happens 
>>>>>> with single shot timers, haven't checked our periodic timers yet.
>>>>>>
>>>>>> Any additional input on this is welcome.
>>>>>>
>>>>>> Regards,
>>>>>> Michael
>>>>>>
>>>>>> -- 
>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>>
>>>>>> _______________________________________________
>>>>>> OvmsDev mailing list
>>>>>> OvmsDev at lists.openvehicles.com
>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>>
>>>>> -- 
>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>> <test-timer-dupecall.zip>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> OvmsDev mailing list
>>>> OvmsDev at lists.openvehicles.com
>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>
>>> -- 
>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.openvehicles.com
>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>
>> -- 
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
> -- 
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20220129/aa8ece94/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 203 bytes
Desc: OpenPGP digital signature
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20220129/aa8ece94/attachment-0001.sig>


More information about the OvmsDev mailing list