[Ovmsdev] OVMS going comatose
Mark Webb-Johnson
mark at webb-johnson.net
Sun Apr 25 10:16:15 HKT 2021
Not sure if it is related, but I am hitting this with my work on canlog. I’m seeing:
0x400f7af1: CheckQueueOverflow(char const*, char*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:381
ELF file SHA256: 57f584b0e3fe542a
Backtrace: 0x4008e6f3:0x3ffcce00 0x4008e95d:0x3ffcce20 0x400f7af1:0x3ffcce40 0x400f8051:0x3ffcce80 0x400f91d0:0x3ffcceb0 0x40092387:0x3ffccf00
0x4008e6f3: invoke_abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:736
0x4008e95d: abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:736
0x400f7af1: CheckQueueOverflow(char const*, char*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:381
0x400f8051: OvmsEvents::SignalEvent(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, void (*)(char const*, void*), unsigned int) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:459
0x400f91d0: HousekeepingTicker1(void*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_housekeeping.cpp:122
0x40092387: prvProcessExpiredTimer at /Users/mark/esp/esp-idf/components/freertos/timers.c:485
(inlined by) prvProcessTimerOrBlockTask at /Users/mark/esp/esp-idf/components/freertos/timers.c:571
(inlined by) prvTimerTask at /Users/mark/esp/esp-idf/components/freertos/timers.c:544
So that is the CheckQueueOverflow() check in the events system, which handles the case where an event cannot be queued. In this case it aborts the system (to force a restart). Not a watchdog timer.
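(For context, a minimal sketch of that pattern - the function name, arguments and the 100 ms grace period are illustrative placeholders, not the actual ovms_events.cpp code: the event is posted to the queue with a short timeout, and if the queue stays full because the consuming task is stuck, the handler logs the event and aborts.)

// Illustrative sketch only - not the actual ovms_events.cpp implementation.
// General pattern: if the event queue stays full, abort() so the module
// restarts cleanly instead of silently dropping events forever.
#include <cstdlib>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "esp_log.h"

static const char *TAG = "events";

static void QueueOrAbort(QueueHandle_t queue, const void *item,
                         const char *signal, const char *event)
{
  // Wait briefly for space; hypothetical 100 ms grace period.
  if (xQueueSend(queue, item, pdMS_TO_TICKS(100)) != pdTRUE)
  {
    // Consumer task is blocked and the queue has overflowed.
    ESP_LOGE(TAG, "%s: queue overflow on event '%s' - aborting", signal, event);
    abort();  // panic handler takes over (reboot, or GDB stub if configured)
  }
}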
In my case, using the GDB stub gets me into the debugger nicely. But here is what I get.
'info threads' gives:
…
3 Thread 2 (OVMS Events CPU1) 0x40081a28 in esp_crosscore_int_send_yield (core_id=1)
at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
2 Thread 1 (OVMS NetMan CPU1) 0x7f400992 in ?? ()
* 1 Remote target 0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156
(gdb) thread 1
and:
[Switching to thread 1 (Remote target)]
#0 0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156
156 *((int *) 0) = 0;
(gdb) bt
#0 0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156
#1 0x4008e960 in abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:171
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
[Switching to thread 2 (Thread 1)] MWJ: This is NETMAN
#0 0x7f400992 in ?? ()
(gdb) bt
#0 0x7f400992 in ?? ()
#1 0x7ffe6583 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 3 MWJ: This is EVENTS
[Switching to thread 3 (Thread 2)]
#0 0x40081a28 in esp_crosscore_int_send_yield (core_id=1) at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
112 esp_crosscore_int_send(core_id, REASON_YIELD);
(gdb) bt
#0 0x40081a28 in esp_crosscore_int_send_yield (core_id=1) at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
#1 0x4009030c in xQueueGenericReceive (xQueue=0x3ffc01cc, pvBuffer=0x3ffc23f0, xTicksToWait=500, xJustPeeking=0)
at /Users/mark/esp/esp-idf/components/freertos/queue.c:1592
#2 0x400f85c2 in OvmsEvents::EventTask (this=0x3ffb5dec <MyEvents>)
at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:213
#3 0x400f8660 in EventLaunchTask (pvParameters=0x3ffb5dec <MyEvents>)
at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:80
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Seems the stack is corrupted somewhere. I'm still looking. But the GDB stub does seem helpful for this.
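For anyone wanting to reproduce such a session: the panic handler has to be switched to the GDB stub, and gdb attached over the console port after the panic. Roughly like this on the IDF 3.x tree - the menuconfig path should be right, but the ELF name, serial port and baud rate below are just examples, so adjust them to your own build and setup:

# sdkconfig (menuconfig: Component config -> ESP32-specific ->
#            Panic handler behaviour -> Invoke GDBStub)
CONFIG_ESP32_PANIC_GDBSTUB=y

# After the panic, attach with the cross-gdb (ELF name, port and baud are examples):
xtensa-esp32-elf-gdb build/ovms3.elf -b 115200 -ex 'target remote /dev/ttyUSB0'
(gdb) info threads
(gdb) thread 3
(gdb) bt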
Mark
> On 25 Apr 2021, at 12:47 AM, Stephen Casner <casner at acm.org> wrote:
>
> Michael,
>
> I confirm the same lack of crashes, and I agree that the problem is
> not strictly with WolfSSL.
>
> It seems that a task WDT ought to be able to report where the task was
> executing at the time. That would certainly be helpful in narrowing
> down the type of problem. I'm not sure if the trap to gdb mode would
> provide that info since I observed no response on the USB console when
> the module went comatose.
>
> -- Steve
>
> On Sat, 24 Apr 2021, Michael Balzer wrote:
>
>> Steve,
>>
>> In the two weeks since disabling TLS on the server V2 connection I haven't had
>> a single crash. While that's not yet proof that the watchdog issue is TLS
>> related, it's at least a strong indicator.
>>
>> The watchdog triggers if the idle task on a core doesn't get a CPU share for
>> 120 seconds. If the TLS functions block a CPU for more than a few seconds,
>> that's already pretty bad, as that means TLS will cause delays in CAN
>> processing (disrupting protocol transfers) and can possibly cause frame drops
>> and queue overflows. Blocking the whole system for more than 120 seconds is
>> totally unacceptable.
>>
>> This doesn't feel like a calculation / math performance issue; it rather feels
>> like a bug - and that may imply a security issue as well.
>>
>> But I don't think this is caused by WolfSSL, as the issue has been present
>> with mbedTLS as well; it just didn't occur as frequently. Maybe some race
>> condition with the LwIP task?
>>
>> Regards,
>> Michael
>>
>>
>> On 11.04.21 at 09:44, Michael Balzer wrote:
>>> Steve,
>>>
>>> I can confirm an increase of these events since we changed to WolfSSL, about
>>> once every three days currently for me. The frequency was much lower before,
>>> more like once or twice per month.
>>>
>>> I've disabled TLS on my module now and will report if that helps.
>>>
>>> Regards,
>>> Michael
>>>
>>>
>>> On 10.04.21 at 21:20, Stephen Casner wrote:
>>>> Michael,
>>>>
>>>> As you saw from my earlier emails, I was getting these crashes
>>>> typically after less than 24 hours of operation. I changed my config
>>>> to disable TLS on server v2 and rebooted 2021-04-05 23:36:04.648 PDT
>>>> and there has not been a crash since. So it definitely appears to be
>>>> correlated with the additional processing to support TLS.
>>>>
>>>> -- Steve
>>>>
>>>> On Sun, 4 Apr 2021, Michael Balzer wrote:
>>>>
>>>>> Steve,
>>>>>
>>>>> that's the problem with this issue, it's totally unclear what causes this.
>>>>>
>>>>> The signal dropping begins when the queue is full, which happens after the
>>>>> task has been blocked for ~as many seconds as the queue is big. So there is
>>>>> no logged activity that could cause this, your module basically went into
>>>>> this from idling.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>
>>>
>>>
>>
>> --
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26