[Ovmsdev] OVMS going comatose

Mark Webb-Johnson mark at webb-johnson.net
Sun Apr 25 10:16:15 HKT 2021


Not sure if it is related, but I am hitting this with my work on canlog. I’m seeing:

0x400f7af1: CheckQueueOverflow(char const*, char*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:381


ELF file SHA256: 57f584b0e3fe542a

Backtrace: 0x4008e6f3:0x3ffcce00 0x4008e95d:0x3ffcce20 0x400f7af1:0x3ffcce40 0x400f8051:0x3ffcce80 0x400f91d0:0x3ffcceb0 0x40092387:0x3ffccf00
0x4008e6f3: invoke_abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:736
0x4008e95d: abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:736
0x400f7af1: CheckQueueOverflow(char const*, char*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:381
0x400f8051: OvmsEvents::SignalEvent(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, void (*)(char const*, void*), unsigned int) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:459
0x400f91d0: HousekeepingTicker1(void*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_housekeeping.cpp:122
0x40092387: prvProcessExpiredTimer at /Users/mark/esp/esp-idf/components/freertos/timers.c:485
 (inlined by) prvProcessTimerOrBlockTask at /Users/mark/esp/esp-idf/components/freertos/timers.c:571
 (inlined by) prvTimerTask at /Users/mark/esp/esp-idf/components/freertos/timers.c:544

So that is the CheckQueueOverflow() check in the events system, which handles the case where an event cannot be queued. In this case it is aborting the system (to force a restart). Not a watchdog timer.
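
For context, here is a minimal sketch of that failure path (illustrative only - the names and structure are not the actual ovms_events.cpp code): the housekeeping ticker, running in the FreeRTOS timer task, signals an event; the send onto the events queue fails because the queue is full; and the overflow check gives up and aborts so the module restarts cleanly:

// Illustrative sketch only - not the actual ovms_events.cpp code. The
// real implementation may tolerate transient overflows before aborting.
#include <cstdlib>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "esp_log.h"

static const char* TAG = "events-sketch";

void QueueOrAbort(QueueHandle_t event_queue, void* msg)
  {
  // Don't block the caller (here: the FreeRTOS timer task running
  // HousekeepingTicker1); either the queue has room right now or it doesn't.
  if (xQueueSend(event_queue, &msg, 0) != pdPASS)
    {
    ESP_LOGE(TAG, "Event queue overflow: the events task is not draining its queue");
    abort(); // forces the restart seen in the decoded backtrace above
    }
  }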

In my case, using the GDB stubs gets me into the debugger nicely. But I get:

info threads gives:

…
  3    Thread 2 (OVMS Events CPU1) 0x40081a28 in esp_crosscore_int_send_yield (core_id=1)
    at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
  2    Thread 1 (OVMS NetMan CPU1) 0x7f400992 in ?? ()
* 1    Remote target     0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156

(gdb) thread 1

and:

[Switching to thread 1 (Remote target)]
#0  0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156
156             *((int *) 0) = 0;

(gdb) bt
#0  0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156
#1  0x4008e960 in abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:171
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

[Switching to thread 2 (Thread 1)]						MWJ: This is NETMAN
#0  0x7f400992 in ?? ()

(gdb) bt
#0  0x7f400992 in ?? ()
#1  0x7ffe6583 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) thread 3										MWJ: This is EVENTS
[Switching to thread 3 (Thread 2)]
#0  0x40081a28 in esp_crosscore_int_send_yield (core_id=1) at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
112	    esp_crosscore_int_send(core_id, REASON_YIELD);
(gdb) bt
#0  0x40081a28 in esp_crosscore_int_send_yield (core_id=1) at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
#1  0x4009030c in xQueueGenericReceive (xQueue=0x3ffc01cc, pvBuffer=0x3ffc23f0, xTicksToWait=500, xJustPeeking=0)
    at /Users/mark/esp/esp-idf/components/freertos/queue.c:1592
#2  0x400f85c2 in OvmsEvents::EventTask (this=0x3ffb5dec <MyEvents>)
    at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:213
#3  0x400f8660 in EventLaunchTask (pvParameters=0x3ffb5dec <MyEvents>)
    at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:80
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Seems the stack is corrupted somewhere. I'm still looking, but the GDB stub does seem helpful for this.
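
As a quick sanity check while hunting for the corruption, the stock FreeRTOS high-water-mark API can show whether a task has simply run out of stack (just a sketch with an illustrative function name, not anything already in the OVMS tree):

// Sketch only: log a task's minimum remaining stack using the standard
// FreeRTOS API (nothing OVMS-specific). A value near zero means the task
// has come close to - or past - the end of its stack, which corrupts
// whatever memory sits next to it.
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"

static void LogStackHeadroom(TaskHandle_t task, const char* name)
  {
  // Passing NULL queries the calling task. The result is in units of
  // StackType_t, which is uint8_t (i.e. bytes) on the ESP-IDF port.
  UBaseType_t headroom = uxTaskGetStackHighWaterMark(task);
  ESP_LOGI("stack", "%s minimum free stack: %u", name, (unsigned) headroom);
  }

Calling that with the handles of the NetMan and Events tasks (or with NULL from inside them) should show whether one of them is overflowing its own stack rather than being stomped on from outside.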

Mark

> On 25 Apr 2021, at 12:47 AM, Stephen Casner <casner at acm.org> wrote:
> 
> Michael,
> 
> I confirm the same lack of crashes, and I agree that the problem is
> not strictly with WolfSSL.
> 
> It seems that a task WDT ought to be able to report where the task was
> executing at the time.  That would certainly be helpful in narrowing
> down the type of problem.  I'm not sure if the trap to gdb mode would
> provide that info since I observed no response on the USB console when
> the module went comatose.
> 
>                                                        -- Steve
> 
> On Sat, 24 Apr 2021, Michael Balzer wrote:
> 
>> Steve,
>> 
>> In the two weeks since disabling TLS on the server V2 connection I haven't had
>> a single crash. While that's not yet proof that the watchdog issue is TLS
>> related, it's at least a strong indicator.
>> 
>> The watchdog triggers if the idle task on a core doesn't get a CPU share for
>> 120 seconds. If the TLS functions block a CPU for more than a few seconds,
>> that's already pretty bad, as that means TLS will cause delays in CAN
>> processing (disrupting protocol transfers) and can possibly cause frame drops
>> and queue overflows. Blocking the whole system for more than 120 seconds is
>> totally unacceptable.
>> 
>> This doesn't feel like a calculation / math performance issue; it rather feels
>> like a bug - and that may imply a security issue as well.
>> 
>> But I don't think this is caused by WolfSSL, as the issue has been present
>> with mbedTLS as well, it just didn't occur as frequently. Maybe some race
>> condition with the LwIP task?
>> 
>> Regards,
>> Michael
>> 
>> 
>> Am 11.04.21 um 09:44 schrieb Michael Balzer:
>>> Steve,
>>> 
>>> I can confirm an increase in these events since we changed to WolfSSL, about
>>> once every three days currently for me. The frequency was much lower before,
>>> more like once or twice per month.
>>> 
>>> I've disabled TLS on my module now and will report if that helps.
>>> 
>>> Regards,
>>> Michael
>>> 
>>> 
>>> Am 10.04.21 um 21:20 schrieb Stephen Casner:
>>>> Michael,
>>>> 
>>>> As you saw from my earlier emails, I was getting these crashes
>>>> typically after less than 24 hours of operation.  I changed my config
>>>> to disable TLS on server v2 and rebooted 2021-04-05 23:36:04.648 PDT
>>>> and there has not been a crash since.  So it definitely appears to be
>>>> correlated with the additional processing to support TLS.
>>>> 
>>>>                                                         -- Steve
>>>> 
>>>> On Sun, 4 Apr 2021, Michael Balzer wrote:
>>>> 
>>>>> Steve,
>>>>> 
>>>>> that's the problem with this issue: it's totally unclear what causes it.
>>>>> 
>>>>> The signal dropping begins when the queue is full, which happens after the
>>>>> task has been blocked for roughly as many seconds as the queue has entries.
>>>>> So there is no logged activity that could cause this; your module basically
>>>>> went into this from idling.
>>>>> 
>>>>> Regards,
>>>>> Michael
>>>> 
>>> 
>>> 
>> 
>> --
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
