[Ovmsdev] A promiscous wifi client

Thu Nov 16 14:32:57 HKT 2017

We don’t seem to have this issue with the webserver listener; that uses mongoose (in netmanager). That doesn’t have long-lived client side connections, but the socket listen/accept functionality should be similar. Not sure how mongoose handles it internally, but the library seems quite robust.

I’m wondering if we can migrate to use that? It would probably save a bunch of ram (as it doesn’t need a task for each thread). We are mostly single-user anyway.

Perhaps a generic extension to ovms_netmanager where you give it a tcp port and an object type. Then, it listens on that port and launches objects of that type (pre-configured to receive mongoose events as appropriate) for each new connection? Have a base virtual connection class, that gets inherited by each implementation (ssh, telnet, etc).

I’m happy to try to re-work telnet that way, as an example, if it makes sense to you.

Regards, Mark.

> On 16 Nov 2017, at 10:00 AM, Stephen Casner <casner at acm.org> wrote:
> 
> On Mon, 13 Nov 2017, Tom Parker wrote:
> 
>> I do see with the terrible network manager and also with your work, a fair
>> amount of instability. I'm seeing reboots with socket related frames on the
>> stacktrace, and quite often it gets hung up and never reconnects. I haven't
>> been able to look at what might be causing the hang because I've never had it
>> happen while the laptop is plugged in and monitoring the console. It seems
>> very reliable if you reset the OVMS with the button while in range of a
>> network and never disconnect. However, maybe 50% of the time, after driving
>> out of range it never reconnects when returning to the original network or
>> seeing a new network it knows about.
>> 
>> I'll see if I can reproduce this by walking out of range with my cell phone
>> while the OVMS is plugged into a laptop.
>> 
>> The following trace is from November 6th when I was using the terrible network
>> manager. Sorry I don't have a recent trace from the the new network manager
>> but the stack trace look the same as far as I remember. It seems reasonably
>> easy to reproduce, my startup script has the following:
> 
> I know this one.  Here is my sample crash with the same backtrace as
> yours, plus the GDB stack trace and some exploration and an
> explanation below:
> 
> Guru Meditation Error of type LoadProhibited occurred on core  1. Exception was unhandled.
> Register dump:
> PC      : 0x401647a4  PS      : 0x00060230  A0      : 0x80177bd8  A1      : 0x3ffdd710
> 0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
> 
> A2      : 0xdeadbeef  A3      : 0x000000d2  A4      : 0x00000001  A5      : 0x3ffc7908
> A6      : 0x00000002  A7      : 0x00060023  A8      : 0x000000d2  A9      : 0x3ffdd6f0
> A10     : 0x00000001  A11     : 0x00000001  A12     : 0x80085d3d  A13     : 0x3ffebd30
> A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x00000010  EXCCAUSE: 0x0000001c
> EXCVADDR: 0xdeadbf03  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff
> 
> Backtrace: 0x401647a4:0x3ffdd710 0x40177bd5:0x3ffdd730 0x4016ad86:0x3ffdd750 0x4016b651:0x3ffdd770 0x4016f286:0x3ffdd790 0x4017650a:0x3ffdd7b0 0x40166465:0x3ffdd7d0
> 0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
> 
> 0x40177bd5: accept_function at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:1813 (discriminator 6)
> 
> 0x4016ad86: tcp_process at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790 (discriminator 1)
> 
> 0x4016b651: tcp_input at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
> 
> 0x4016f286: ip4_input at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
> 
> 0x4017650a: ethernet_input at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
> 
> 0x40166465: tcpip_thread at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:474
> 
> 
> Entering gdb stub now.
> [snip]
> event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
> 1897	    if (scb->sem_signalled == 0) {
> (gdb) bt
> #0  event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
> #1  0x40177bd8 in accept_function (arg=0x3fff21c4, newpcb=<optimized out>, err=0 '\000') at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:511
> #2  0x4016ad89 in tcp_process (pcb=0x3fff26a0) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790
> #3  0x4016b654 in tcp_input (p=<optimized out>, inp=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
> #4  0x4016f289 in ip4_input (p=0x3fff2434, inp=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
> #5  0x4017650d in ethernet_input (p=0x3fff2434, netif=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
> #6  0x40166468 in tcpip_thread (arg=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:143
> (gdb) list
> 1892	  /* At this point, SYS_ARCH is still protected! */
> 1893	again:
> 1894	  for (scb = select_cb_list; scb != NULL; scb = scb->next) {
> 1895	    /* remember the state of select_cb_list to detect changes */
> 1896	    last_select_cb_ctr = select_cb_ctr;
> 1897	    if (scb->sem_signalled == 0) {
> 1898	      /* semaphore not signalled yet */
> 1899	      int do_signal = 0;
> 1900	      /* Test this select call for our socket */
> 1901	      if (sock->rcvevent > 0) {
> (gdb) print scb
> $1 = (struct lwip_select_cb *) 0xdeadbeef
> (gdb) print select_cb_list
> $2 = (struct lwip_select_cb *) 0x3fff1c08
> (gdb) print *select_cb_list
> $3 = {next = 0x3fff6ed8, prev = 0x0, readset = 0x3fff1c68, writeset = 0x3fff1c70, exceptset = 0x3fff1c78, sem_signalled = 0, sem = 0x3fff2068}
> (gdb) print (struct lwip_select_cb *) 0x3fff6ed8
> $4 = (struct lwip_select_cb *) 0x3fff6ed8
> (gdb) print *(struct lwip_select_cb *) 0x3fff6ed8
> $5 = {next = 0xdeadbeef, prev = 0x3fff1c08, readset = 0xdeadbeef, writeset = 0xdeadbeef, exceptset = 0xdeadbeef, sem_signalled = -559038737, sem = 0xdeadbeef}
> (gdb) quit
> 
> The problem here is a weakness in LWIP+FreeRTOS.  If a task is waiting
> on a select() when the task is deleted then that task's entry is not
> removed from the select_cb_list even though the entry is freed.  (The
> several 0xdeadbeef values shown above are there because in my local
> working copy I've added code in heap_regions_debug.c to fill heap
> blocks with that value when they are freed.)  This happens because the
> struct lwip_select_cb allocated on the stack to hold information about
> the task callback waiting on the sockets.  The comment says:
> 
>       We don't actually need any dynamic memory. Our entry on the
>       list is only valid while we are in this function, so it's ok
>       to use local variables. */
> 
> The problem is that if the task waiting on the select is deleted then
> the stack is freed and this struct becomes invalid, resulting in a
> crash.
> 
> The select() would work fine here if there was some way to do the
> equivalent of issuing a unix signal to the task which would cause the
> select() to return.  But select() is implemented in LWIP so RTOS knows
> nothing about it.  The select() is waiting on a semaphore, so waking
> it up would be possible if a hook were provided to get to the
> semaphore.  Doing that might allow switching to the select()-loop
> pattern with a single task, simplifying task management and also
> saving stack space.
> 
> The reason I'm having the TelnetServer wait on the select() is that
> when I originally wrote it to be just blocked on an accept() call then
> there was no way to unblock it when wifi went down and the task was
> deleted.  In FreeRTOS killing a task does not clean up its resources
> so the listening socket wouldn't close all the way, leaving a 76-byte
> block leaked.
> 
> Then I found that if I have the TelnetServer task wait on a select()
> instead then if I close the server socket from the task that is
> powering down the wifi (in this case, ConsoleAsync by command) while
> the TelnetServer task is blocked on the select() than the close causes
> the select() to return.  I can detect the closed socket in the
> TelnetServer task and exit cleanly.
> 
> Originally the TelnetServer task was a direct child of the wifi task
> (using the Parent and TaskBase classes I created) and the wifi task
> would do the deletion.  Subsequently we've changed to use events and
> the cleanup is not working properly.  In fact, now when I command wifi
> down the system always restarts immediately because the heap debug
> code is detecting that a block is freed twice.  I need to investigate
> further.  Using multiple tasks is tricky because these libraries are
> not all thread-safe so we have to be very careful to execute code in
> the right task.
> 
> The new SSH console implementation has the SSHServer task waiting on
> select() in the same manner but it also has the SSHConsole task
> configuring the socket to be non-blocking and waiting on select()
> before calling into the wolfssh library.  Thus the risk may be higher
> here so these shutdown cases need to be carefully sequenced.
> 
>                                                        -- Steve
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.teslaclub.hk
> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev