[Ovmsdev] A promiscous wifi client
Stephen Casner
casner at acm.org
Thu Nov 16 10:00:27 HKT 2017
On Mon, 13 Nov 2017, Tom Parker wrote:
> I do see with the terrible network manager and also with your work, a fair
> amount of instability. I'm seeing reboots with socket related frames on the
> stacktrace, and quite often it gets hung up and never reconnects. I haven't
> been able to look at what might be causing the hang because I've never had it
> happen while the laptop is plugged in and monitoring the console. It seems
> very reliable if you reset the OVMS with the button while in range of a
> network and never disconnect. However, maybe 50% of the time, after driving
> out of range it never reconnects when returning to the original network or
> seeing a new network it knows about.
>
> I'll see if I can reproduce this by walking out of range with my cell phone
> while the OVMS is plugged into a laptop.
>
> The following trace is from November 6th when I was using the terrible network
> manager. Sorry I don't have a recent trace from the the new network manager
> but the stack trace look the same as far as I remember. It seems reasonably
> easy to reproduce, my startup script has the following:
I know this one. Here is my sample crash with the same backtrace as
yours, plus the GDB stack trace and some exploration and an
explanation below:
Guru Meditation Error of type LoadProhibited occurred on core 1. Exception was unhandled.
Register dump:
PC : 0x401647a4 PS : 0x00060230 A0 : 0x80177bd8 A1 : 0x3ffdd710
0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
A2 : 0xdeadbeef A3 : 0x000000d2 A4 : 0x00000001 A5 : 0x3ffc7908
A6 : 0x00000002 A7 : 0x00060023 A8 : 0x000000d2 A9 : 0x3ffdd6f0
A10 : 0x00000001 A11 : 0x00000001 A12 : 0x80085d3d A13 : 0x3ffebd30
A14 : 0x00000003 A15 : 0x00060023 SAR : 0x00000010 EXCCAUSE: 0x0000001c
EXCVADDR: 0xdeadbf03 LBEG : 0x4000c2e0 LEND : 0x4000c2f6 LCOUNT : 0xffffffff
Backtrace: 0x401647a4:0x3ffdd710 0x40177bd5:0x3ffdd730 0x4016ad86:0x3ffdd750 0x4016b651:0x3ffdd770 0x4016f286:0x3ffdd790 0x4017650a:0x3ffdd7b0 0x40166465:0x3ffdd7d0
0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
0x40177bd5: accept_function at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:1813 (discriminator 6)
0x4016ad86: tcp_process at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790 (discriminator 1)
0x4016b651: tcp_input at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
0x4016f286: ip4_input at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
0x4017650a: ethernet_input at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
0x40166465: tcpip_thread at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:474
Entering gdb stub now.
[snip]
event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
1897 if (scb->sem_signalled == 0) {
(gdb) bt
#0 event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
#1 0x40177bd8 in accept_function (arg=0x3fff21c4, newpcb=<optimized out>, err=0 '\000') at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:511
#2 0x4016ad89 in tcp_process (pcb=0x3fff26a0) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790
#3 0x4016b654 in tcp_input (p=<optimized out>, inp=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
#4 0x4016f289 in ip4_input (p=0x3fff2434, inp=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
#5 0x4017650d in ethernet_input (p=0x3fff2434, netif=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
#6 0x40166468 in tcpip_thread (arg=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:143
(gdb) list
1892 /* At this point, SYS_ARCH is still protected! */
1893 again:
1894 for (scb = select_cb_list; scb != NULL; scb = scb->next) {
1895 /* remember the state of select_cb_list to detect changes */
1896 last_select_cb_ctr = select_cb_ctr;
1897 if (scb->sem_signalled == 0) {
1898 /* semaphore not signalled yet */
1899 int do_signal = 0;
1900 /* Test this select call for our socket */
1901 if (sock->rcvevent > 0) {
(gdb) print scb
$1 = (struct lwip_select_cb *) 0xdeadbeef
(gdb) print select_cb_list
$2 = (struct lwip_select_cb *) 0x3fff1c08
(gdb) print *select_cb_list
$3 = {next = 0x3fff6ed8, prev = 0x0, readset = 0x3fff1c68, writeset = 0x3fff1c70, exceptset = 0x3fff1c78, sem_signalled = 0, sem = 0x3fff2068}
(gdb) print (struct lwip_select_cb *) 0x3fff6ed8
$4 = (struct lwip_select_cb *) 0x3fff6ed8
(gdb) print *(struct lwip_select_cb *) 0x3fff6ed8
$5 = {next = 0xdeadbeef, prev = 0x3fff1c08, readset = 0xdeadbeef, writeset = 0xdeadbeef, exceptset = 0xdeadbeef, sem_signalled = -559038737, sem = 0xdeadbeef}
(gdb) quit
The problem here is a weakness in LWIP+FreeRTOS. If a task is waiting
on a select() when the task is deleted then that task's entry is not
removed from the select_cb_list even though the entry is freed. (The
several 0xdeadbeef values shown above are there because in my local
working copy I've added code in heap_regions_debug.c to fill heap
blocks with that value when they are freed.) This happens because the
struct lwip_select_cb allocated on the stack to hold information about
the task callback waiting on the sockets. The comment says:
We don't actually need any dynamic memory. Our entry on the
list is only valid while we are in this function, so it's ok
to use local variables. */
The problem is that if the task waiting on the select is deleted then
the stack is freed and this struct becomes invalid, resulting in a
crash.
The select() would work fine here if there was some way to do the
equivalent of issuing a unix signal to the task which would cause the
select() to return. But select() is implemented in LWIP so RTOS knows
nothing about it. The select() is waiting on a semaphore, so waking
it up would be possible if a hook were provided to get to the
semaphore. Doing that might allow switching to the select()-loop
pattern with a single task, simplifying task management and also
saving stack space.
The reason I'm having the TelnetServer wait on the select() is that
when I originally wrote it to be just blocked on an accept() call then
there was no way to unblock it when wifi went down and the task was
deleted. In FreeRTOS killing a task does not clean up its resources
so the listening socket wouldn't close all the way, leaving a 76-byte
block leaked.
Then I found that if I have the TelnetServer task wait on a select()
instead then if I close the server socket from the task that is
powering down the wifi (in this case, ConsoleAsync by command) while
the TelnetServer task is blocked on the select() than the close causes
the select() to return. I can detect the closed socket in the
TelnetServer task and exit cleanly.
Originally the TelnetServer task was a direct child of the wifi task
(using the Parent and TaskBase classes I created) and the wifi task
would do the deletion. Subsequently we've changed to use events and
the cleanup is not working properly. In fact, now when I command wifi
down the system always restarts immediately because the heap debug
code is detecting that a block is freed twice. I need to investigate
further. Using multiple tasks is tricky because these libraries are
not all thread-safe so we have to be very careful to execute code in
the right task.
The new SSH console implementation has the SSHServer task waiting on
select() in the same manner but it also has the SSHConsole task
configuring the socket to be non-blocking and waiting on select()
before calling into the wolfssh library. Thus the risk may be higher
here so these shutdown cases need to be carefully sequenced.
-- Steve
More information about the OvmsDev
mailing list