Not sure if it is related, but I am hitting this with my work on canlog. I’m seeing:

0x400f7af1: CheckQueueOverflow(char const*, char*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:381


ELF file SHA256: 57f584b0e3fe542a

Backtrace: 0x4008e6f3:0x3ffcce00 0x4008e95d:0x3ffcce20 0x400f7af1:0x3ffcce40 0x400f8051:0x3ffcce80 0x400f91d0:0x3ffcceb0 0x40092387:0x3ffccf00
0x4008e6f3: invoke_abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:736

0x4008e95d: abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:736

0x400f7af1: CheckQueueOverflow(char const*, char*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:381

0x400f8051: OvmsEvents::SignalEvent(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, void (*)(char const*, void*), unsigned int) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:459

0x400f91d0: HousekeepingTicker1(void*) at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_housekeeping.cpp:122

0x40092387: prvProcessExpiredTimer at /Users/mark/esp/esp-idf/components/freertos/timers.c:485
 (inlined by) prvProcessTimerOrBlockTask at /Users/mark/esp/esp-idf/components/freertos/timers.c:571
 (inlined by) prvTimerTask at /Users/mark/esp/esp-idf/components/freertos/timers.c:544

So that is the CheckQueueOverflow() mechanism in the events subsystem, which handles the case where an event cannot be queued. In this case it is aborting the system (to cause a restart). Not a watchdog timer.
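
For illustration, the shape of that handling is roughly the following - a minimal sketch only, with made-up names and logging, not the actual ovms_events.cpp code: when the queue has no space left, the handler logs and aborts so the module restarts instead of silently dropping events.

// Hypothetical sketch, not the actual ovms_events.cpp implementation.
#include <stdlib.h>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "esp_log.h"

static const char* TAG = "events";

void CheckQueueOverflowSketch(QueueHandle_t queue, const char* event)
{
  // Still room in the queue? Then nothing to do.
  if (uxQueueSpacesAvailable(queue) > 0)
    return;

  // Queue is full: the consumer task has not drained it for a long time.
  // Abort to force a clean restart - this is what ends up in invoke_abort()
  // in the backtrace above.
  ESP_LOGE(TAG, "Event queue overflow while queuing '%s', aborting", event);
  abort();
}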

In my case, using gdb stubs gets into the debugger nicely. But I get:

info threads gives:

  3    Thread 2 (OVMS Events CPU1) 0x40081a28 in esp_crosscore_int_send_yield (core_id=1)
    at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
  2    Thread 1 (OVMS NetMan CPU1) 0x7f400992 in ?? ()
* 1    Remote target     0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156

(gdb) thread 1

and:

[Switching to thread 1 (Remote target)]
#0  0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156
156             *((int *) 0) = 0;

(gdb) bt
#0  0x4008e6f3 in invoke_abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:156
#1  0x4008e960 in abort () at /Users/mark/esp/esp-idf/components/esp32/panic.c:171
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

[Switching to thread 2 (Thread 1)] MWJ: This is NETMAN
#0  0x7f400992 in ?? ()

(gdb) bt
#0  0x7f400992 in ?? ()
#1  0x7ffe6583 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) thread 3 MWJ: This is EVENTS
[Switching to thread 3 (Thread 2)]
#0  0x40081a28 in esp_crosscore_int_send_yield (core_id=1) at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
112    esp_crosscore_int_send(core_id, REASON_YIELD);
(gdb) bt
#0  0x40081a28 in esp_crosscore_int_send_yield (core_id=1) at /Users/mark/esp/esp-idf/components/esp32/crosscore_int.c:112
#1  0x4009030c in xQueueGenericReceive (xQueue=0x3ffc01cc, pvBuffer=0x3ffc23f0, xTicksToWait=500, xJustPeeking=0)
    at /Users/mark/esp/esp-idf/components/freertos/queue.c:1592
#2  0x400f85c2 in OvmsEvents::EventTask (this=0x3ffb5dec <MyEvents>)
    at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:213
#3  0x400f8660 in EventLaunchTask (pvParameters=0x3ffb5dec <MyEvents>)
    at /Users/mark/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_events.cpp:80
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

It seems the stack is corrupted somewhere. I'm still looking, but the GDB stub does seem helpful for this.
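
For reference, attaching to the gdb stub after a panic is just a matter of pointing gdb at the console port; the serial device, baud rate and ELF path below are examples, adjust for your setup:

xtensa-esp32-elf-gdb ./build/ovms3.elf -b 115200 -ex "target remote /dev/ttyUSB0"

then the usual "info threads", "thread N" and "bt" as shown above.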

Mark

On 25 Apr 2021, at 12:47 AM, Stephen Casner <casner@acm.org> wrote:

Michael,

I confirm the same lack of crashes, and I agree that the problem is
not strictly with WolfSSL.

It seems that a task WDT ought to be able to report where the task was
executing at the time.  That would certainly be helpful in narrowing
down the type of problem.  I'm not sure if the trap to gdb mode would
provide that info since I observed no response on the USB console when
the module went comatose.

                                                       -- Steve

On Sat, 24 Apr 2021, Michael Balzer wrote:

Steve,

in the two weeks since disabling TLS on the server V2 connection I haven't had
a single crash. While that's not yet proof that the watchdog issue is TLS
related, it's at least a strong indicator.

The watchdog triggers if the idle task on a core doesn't get a CPU share for
120 seconds. If the TLS functions block a CPU for more than a few seconds,
that's already pretty bad, as that means TLS will cause delays in CAN
processing (disrupting protocol transfers) and can possibly cause frame drops
and queue overflows. Blocking the whole system for more than 120 seconds is
totally unacceptable.
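
To illustrate the mechanism (only a sketch of the general mitigation, not code from OVMS or WolfSSL): the task WDT is fed by the idle tasks, so any long CPU-bound loop has to give the idle task a chance to run now and then, otherwise the timeout expires.

// Illustrative only: a long-running CPU-bound worker periodically yielding
// so the lowest-priority idle task (which feeds the task watchdog) can run.
// ProcessOneBlock() and NUM_BLOCKS are made-up placeholders.
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

#define NUM_BLOCKS 1000                  // made-up amount of work

static void ProcessOneBlock(int block)   // stand-in for one chunk of TLS work
{
  (void)block;
}

void LongRunningWork(void)
{
  for (int block = 0; block < NUM_BLOCKS; block++)
  {
    ProcessOneBlock(block);
    // taskYIELD() would not be enough here: it only yields to tasks of equal
    // or higher priority, and the idle task has the lowest priority, so an
    // actual delay is needed.
    vTaskDelay(1);
  }
}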

This doesn't feel like a calculation / math performance issue; it feels more
like a bug - and that may imply a security issue as well.

But I don't think this is caused by WolfSSL, as the issue has been present
with mbedTLS as well, just less frequently. Maybe some race condition with
the LwIP task?

Regards,
Michael


On 11.04.21 at 09:44, Michael Balzer wrote:
Steve,

I can confirm an increase of these events since we changed to WolfSSL, about
once every three days currently for me. The frequency was much lower before,
more like once or twice per month.

I've disabled TLS on my module now and will report if that helps.

Regards,
Michael


On 10.04.21 at 21:20, Stephen Casner wrote:
Michael,

As you saw from my earlier emails, I was getting these crashes
typically after less than 24 hours of operation.  I changed my config
to disable TLS on server v2 and rebooted 2021-04-05 23:36:04.648 PDT
and there has not been a crash since.  So it definitely appears to be
correlated with the additional processing to support TLS.

                                                        -- Steve

On Sun, 4 Apr 2021, Michael Balzer wrote:

Steve,

that's the problem with this issue: it's totally unclear what causes it.

The signal dropping begins when the queue is full, which happens after the
task has been blocked for roughly as many seconds as the queue is big. So
there is no logged activity that could cause this; your module basically
went into this from idling.
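
As a rough sketch of the producer side (made-up names and queue size, not the real OVMS dimensions): with about one event per second coming in, a non-blocking send only starts failing once all slots are used up, which is why the first dropped signal shows up roughly queue-size seconds after whatever actually blocked the task.

// Made-up example, not the actual OVMS event queue.
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"

#define QUEUE_SLOTS 100   // example only: ~100 seconds of blockage until drops start

typedef struct { const char* name; } event_t;

QueueHandle_t CreateEventQueue(void)
{
  // ~QUEUE_SLOTS seconds of producer output fit before sends start failing
  return xQueueCreate(QUEUE_SLOTS, sizeof(event_t));
}

void SignalEventSketch(QueueHandle_t queue, event_t* ev, unsigned* dropped)
{
  if (xQueueSend(queue, ev, 0) != pdTRUE)
    (*dropped)++;   // only reached after ~QUEUE_SLOTS seconds of consumer blockage
}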

Regards,
Michael




--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev