I’ve committed a change to the esp32wifi component, to try to support a promiscuous mode for the wifi client. This should make it work the way it does on cellphones, computers, and pretty much all equipment: rather than specifying a specific wifi AP to connect to, it just connects to whichever it can (based on where it has connected in the past).

Up until now, we’ve supported the wifi client mode as follows:

    config set wifi.ssid <SSID> <password>
    wifi mode client <SSID>

That seems to work well, and reconnection now seems pretty robust. What I’ve committed today supports this variant:

    config set wifi.ssid <SSID1> <password>
    config set wifi.ssid <SSID2> <password>
    … etc
    wifi mode client

The module should then connect to either SSID1 or SSID2, whichever comes into range.

I’ve got no idea how phones, laptops, etc, actually implement it, and can’t find much information in the ESP32 resources or forums. There is certainly no magic ‘here is a list of SSIDs - connect to whichever comes close’ function in the wifi libraries. So, I ended up implementing the following logic:

1. Bring up the wifi as a client, in station mode.
2. Every 30 seconds or so, do a wifi scan of available SSIDs.
3. If we find an SSID within range that has a matching name in config wifi.ssid, we try to actively connect to it. For this, we pick the first SSID we find that matches.
4. If we don’t manage to connect (get disconnected), or there is no SSID in range that matches anything in config wifi.ssid, wait for 30 seconds, then go back and scan again.

It’s not elegant, but seems to work. The change is isolated in esp32wifi.{h,cpp}, which is pretty small and manageable at the moment. Please give it a go, and let me know how it works out for you.

Regards, Mark.

P.S. Also found (and cleaned up) a big stonking memory leak in ‘wifi scan’.
On 09/11/17 18:38, Mark Webb-Johnson wrote:
I’ve committed a change to the esp32wifi component, to try to support a promiscuous mode for the wifi client. This should make it work the way it does on cellphones, computers, and pretty much all equipment. Rather than specifying a specific wifi AP to connect to, it just connects to whichever it can (based on where it has connected to in the past).
Nice! This is much better than the terrible network manager I wrote.

If you trigger it from a startup script then it never connects, because monotonictime is 0, which doesn't pass the test in EventTimer10(). I'm running with the following patch, but you might want to start monotonictime at 1, or not overload a boolean into m_nextscan.

    diff --git a/vehicle/OVMS.V3/components/esp32wifi/esp32wifi.cpp b/vehicle/OVMS.V3/components/esp32wifi/esp32wifi.cpp
    index 236f4c9..c6ebcd4 100644
    --- a/vehicle/OVMS.V3/components/esp32wifi/esp32wifi.cpp
    +++ b/vehicle/OVMS.V3/components/esp32wifi/esp32wifi.cpp
    @@ -262,7 +262,9 @@ void esp32wifi::StartScanningClientMode()
       memset(&m_wifi_apsta_cfg,0,sizeof(m_wifi_apsta_cfg));
       ESP_ERROR_CHECK(esp_wifi_start());
    -  m_nextscan = monotonictime;
    +  // if we are triggered by a startup script, monotonictime will be zero which
    +  // won't pass the test in EventTimer10()
    +  m_nextscan = monotonictime == 0 ? 1 : monotonictime;
      }

     void esp32wifi::StartAccessPointMode(std::string ssid, std::string password)

If there is a network in range that has the same name as a network that is saved but has a different password, the client never tries the next network. Instead it tries to connect to the "fake" network over and over and over. It appears to use the network that is alphabetically first in the list, which isn't good if it knows about an AndroidAP and there happens to be someone else's AndroidAP switched on while you're trying to connect to a network with a name that doesn't start with A.
If you trigger it from a startup script then it never connects because monotonictime is 0 which doesn't pass the test in EventTimer10(). I'm running with the following patch, but you might want to start monotonictime at 1, or not overload a boolean into m_nextscan.

    -  m_nextscan = monotonictime;
    +  // if we are triggered by a startup script, monotonictime will be zero which
    +  // won't pass the test in EventTimer10()
    +  m_nextscan = monotonictime == 0 ? 1 : monotonictime;
Nice find. I think it is sufficient to just set m_nextscan = monotonictime+1 (to schedule it for the next second). I’ve done that, committed, and pushed.
If there is a network in range that has the same name as a network that is saved but has a different password, the client never tries the next network. Instead it tries to connect to the "fake" network over and over and over. It appears to use the network that is alphabetically first in the list, which isn't good if it knows about an AndroidAP and there happens to be someone else's AndroidAP switched on while you're trying to connect to a network with a name that doesn't start with A.
Yeah. I am not sure how this is handled in cellphones/laptops. We can pick up the MAC address of the station and use that as the key, but even that can be maliciously spoofed. It is also a pain for networks with multiple APs on the same SSID, etc. I’m reasonably sure that phones/laptops use the SSID as the key. On my iPhone, I can set a “don’t automatically connect to this SSID” flag on the SSID entry itself, but I can’t prioritise them in any way.

Is this going to be an issue? If so, any suggestions for how to address it?

Regards, Mark.
On 9 Nov 2017, at 5:52 PM, Tom Parker <tom@carrott.org> wrote:
_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
On 10/11/17 15:45, Mark Webb-Johnson wrote:
If there is a network in range that has the same name as a network that is saved but has a different password, the client never tries the next network. Instead it tries to connect to the "fake" network over and over and over. It appears to use the network that is alphabetically first in the list, which isn't good if it knows about an AndroidAP and there happens to be someone else's AndroidAP switched on while you're trying to connect to a network with a name that doesn't start with A.

Yeah. I am not sure how this is handled in cellphones/laptops. We can pick up the MAC address of the station, and use that as the key, but even that can be maliciously spoofed. Also a pain for networks with multiple APs on the same SSID, etc. I’m reasonably sure that phones/laptops use the SSID as the key.
We shouldn't bind to the MAC address -- wifi networks with more than one Access Point will have multiple MAC addresses for the same SSID.

From observing my laptop, it appears to try each network it has credentials for, one by one, until it gives up altogether. So I guess the thing to do is to do a scan, then iterate over the list of networks detected, and for each one we have credentials for, try to associate. Since we won't often be in range of networks that we have credentials for but can't associate with, trying a few and failing won't happen often and anyway won't take long.

I do see, with the terrible network manager and also with your work, a fair amount of instability. I'm seeing reboots with socket-related frames on the stack trace, and quite often it gets hung up and never reconnects. I haven't been able to look at what might be causing the hang, because I've never had it happen while the laptop is plugged in and monitoring the console. It seems very reliable if you reset the OVMS with the button while in range of a network and never disconnect. However, maybe 50% of the time, after driving out of range it never reconnects when returning to the original network or seeing a new network it knows about. I'll see if I can reproduce this by walking out of range with my cell phone while the OVMS is plugged into a laptop.

The following trace is from November 6th, when I was using the terrible network manager. Sorry I don't have a recent trace from the new network manager, but the stack traces look the same as far as I remember. It seems reasonably easy to reproduce; my startup script has the following:

    vehicle module NL
    server v2 start
    wifi mode client

1. Reboot with the button.
2. It connects to a v2 server via my cell phone hotspot.
3. Turn off the hotspot.
4. It connects to my home wifi.
5. It reconnects to the v2 server and crashes.
6. After the crash and reboot it connects to my home wifi and operates normally.

I'm not seeing the recovery and normal operation very often when I take it for a drive and then return to the house.

Trace follows; I've only included the last part. Prior to this it shows connecting to the v2 server, then disconnecting from the wifi and the v2 server, and reconnecting to a new wifi network.

    I (125740) ovms-server-v2: Sending server login: MP-C 0 redacted
    OVMS > Guru Meditation Error of type LoadProhibited occurred on core 1. Exception was unhandled.
    Register dump:
    PC      : 0x401630c8  PS      : 0x00060d30  A0      : 0x80161a9a  A1      : 0x3ffdaaf0
    0x401630c8: event_callback at /home/ubuntu/esp/esp-idf/components/lwip/api/sockets.c:3195
    A2      : 0x13780368  A3      : 0x000001b0  A4      : 0x00000001  A5      : 0x3ffc750c
    A6      : 0x00000005  A7      : 0x00000000  A8      : 0x000001b0  A9      : 0x3ffdaad0
    A10     : 0x00000001  A11     : 0xffffffff  A12     : 0x00000001  A13     : 0x3ffedf92
    A14     : 0x00000014  A15     : 0x00060023  SAR     : 0x00000010  EXCCAUSE: 0x0000001c
    EXCVADDR: 0x1378037c  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff

    Backtrace: 0x401630c8:0x3ffdaaf0 0x40161a97:0x3ffdab10 0x401697c2:0x3ffdab30 0x4016dcee:0x3ffdab50 0x40172ee6:0x3ffdab70 0x401610b5:0x3ffdab90
    0x401630c8: event_callback at /home/ubuntu/esp/esp-idf/components/lwip/api/sockets.c:3195
    0x40161a97: sent_tcp at /home/ubuntu/esp/esp-idf/components/lwip/api/api_msg.c:1813 (discriminator 6)
    0x401697c2: tcp_input at /home/ubuntu/esp/esp-idf/components/lwip/core/tcp_in.c:387 (discriminator 1)
    0x4016dcee: ip4_input at /home/ubuntu/esp/esp-idf/components/lwip/core/ipv4/ip4.c:712
    0x40172ee6: ethernet_input at /home/ubuntu/esp/esp-idf/components/lwip/netif/ethernet.c:176
    0x401610b5: tcpip_thread at /home/ubuntu/esp/esp-idf/components/lwip/api/tcpip.c:474
On Mon, 13 Nov 2017, Tom Parker wrote:
I do see with the terrible network manager and also with your work, a fair amount of instability. I'm seeing reboots with socket related frames on the stacktrace, and quite often it gets hung up and never reconnects. I haven't been able to look at what might be causing the hang because I've never had it happen while the laptop is plugged in and monitoring the console. It seems very reliable if you reset the OVMS with the button while in range of a network and never disconnect. However, maybe 50% of the time, after driving out of range it never reconnects when returning to the original network or seeing a new network it knows about.
I'll see if I can reproduce this by walking out of range with my cell phone while the OVMS is plugged into a laptop.
The following trace is from November 6th, when I was using the terrible network manager. Sorry I don't have a recent trace from the new network manager, but the stack traces look the same as far as I remember. It seems reasonably easy to reproduce; my startup script has the following:
I know this one. Here is my sample crash with the same backtrace as yours, plus the GDB stack trace and some exploration and an explanation below:

    Guru Meditation Error of type LoadProhibited occurred on core 1. Exception was unhandled.
    Register dump:
    PC      : 0x401647a4  PS      : 0x00060230  A0      : 0x80177bd8  A1      : 0x3ffdd710
    0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
    A2      : 0xdeadbeef  A3      : 0x000000d2  A4      : 0x00000001  A5      : 0x3ffc7908
    A6      : 0x00000002  A7      : 0x00060023  A8      : 0x000000d2  A9      : 0x3ffdd6f0
    A10     : 0x00000001  A11     : 0x00000001  A12     : 0x80085d3d  A13     : 0x3ffebd30
    A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x00000010  EXCCAUSE: 0x0000001c
    EXCVADDR: 0xdeadbf03  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff

    Backtrace: 0x401647a4:0x3ffdd710 0x40177bd5:0x3ffdd730 0x4016ad86:0x3ffdd750 0x4016b651:0x3ffdd770 0x4016f286:0x3ffdd790 0x4017650a:0x3ffdd7b0 0x40166465:0x3ffdd7d0
    0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
    0x40177bd5: accept_function at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:1813 (discriminator 6)
    0x4016ad86: tcp_process at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790 (discriminator 1)
    0x4016b651: tcp_input at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
    0x4016f286: ip4_input at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
    0x4017650a: ethernet_input at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
    0x40166465: tcpip_thread at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:474

    Entering gdb stub now.
[snip]

    event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>)
        at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
    1897          if (scb->sem_signalled == 0) {
    (gdb) bt
    #0  event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>)
        at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
    #1  0x40177bd8 in accept_function (arg=0x3fff21c4, newpcb=<optimized out>, err=0 '\000')
        at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:511
    #2  0x4016ad89 in tcp_process (pcb=0x3fff26a0) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790
    #3  0x4016b654 in tcp_input (p=<optimized out>, inp=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
    #4  0x4016f289 in ip4_input (p=0x3fff2434, inp=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
    #5  0x4017650d in ethernet_input (p=0x3fff2434, netif=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
    #6  0x40166468 in tcpip_thread (arg=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:143
    (gdb) list
    1892          /* At this point, SYS_ARCH is still protected! */
    1893        again:
    1894          for (scb = select_cb_list; scb != NULL; scb = scb->next) {
    1895            /* remember the state of select_cb_list to detect changes */
    1896            last_select_cb_ctr = select_cb_ctr;
    1897            if (scb->sem_signalled == 0) {
    1898              /* semaphore not signalled yet */
    1899              int do_signal = 0;
    1900              /* Test this select call for our socket */
    1901              if (sock->rcvevent > 0) {
    (gdb) print scb
    $1 = (struct lwip_select_cb *) 0xdeadbeef
    (gdb) print select_cb_list
    $2 = (struct lwip_select_cb *) 0x3fff1c08
    (gdb) print *select_cb_list
    $3 = {next = 0x3fff6ed8, prev = 0x0, readset = 0x3fff1c68, writeset = 0x3fff1c70, exceptset = 0x3fff1c78, sem_signalled = 0, sem = 0x3fff2068}
    (gdb) print (struct lwip_select_cb *) 0x3fff6ed8
    $4 = (struct lwip_select_cb *) 0x3fff6ed8
    (gdb) print *(struct lwip_select_cb *) 0x3fff6ed8
    $5 = {next = 0xdeadbeef, prev = 0x3fff1c08, readset = 0xdeadbeef, writeset = 0xdeadbeef, exceptset = 0xdeadbeef, sem_signalled = -559038737, sem = 0xdeadbeef}
    (gdb) quit

The problem here is a weakness in LWIP+FreeRTOS. If a task is waiting on a select() when the task is deleted, then that task's entry is not removed from the select_cb_list, even though the entry is freed. (The several 0xdeadbeef values shown above are there because in my local working copy I've added code in heap_regions_debug.c to fill heap blocks with that value when they are freed.)

This happens because the struct lwip_select_cb is allocated on the stack to hold information about the task waiting on the sockets. The comment in the LWIP source says:

    We don't actually need any dynamic memory. Our entry on the
    list is only valid while we are in this function, so it's ok
    to use local variables.

The problem is that if the task waiting on the select is deleted, then the stack is freed and this struct becomes invalid, resulting in a crash.

The select() would work fine here if there were some way to do the equivalent of issuing a unix signal to the task, which would cause the select() to return. But select() is implemented in LWIP, so the RTOS knows nothing about it. The select() is waiting on a semaphore, so waking it up would be possible if a hook were provided to get to the semaphore. Doing that might allow switching to the select()-loop pattern with a single task, simplifying task management and also saving stack space.

The reason I'm having the TelnetServer wait on select() is that when I originally wrote it to just block on an accept() call, there was no way to unblock it when wifi went down and the task was deleted. In FreeRTOS, killing a task does not clean up its resources, so the listening socket wouldn't close all the way, leaving a 76-byte block leaked. Then I found that if I have the TelnetServer task wait on a select() instead, and I close the server socket from the task that is powering down the wifi (in this case, ConsoleAsync by command) while the TelnetServer task is blocked on the select(), then the close causes the select() to return. I can detect the closed socket in the TelnetServer task and exit cleanly.

Originally the TelnetServer task was a direct child of the wifi task (using the Parent and TaskBase classes I created), and the wifi task would do the deletion. Subsequently we've changed to use events, and the cleanup is not working properly. In fact, now when I command wifi down, the system always restarts immediately because the heap debug code is detecting that a block is freed twice. I need to investigate further.

Using multiple tasks is tricky because these libraries are not all thread-safe, so we have to be very careful to execute code in the right task. The new SSH console implementation has the SSHServer task waiting on select() in the same manner, but it also has the SSHConsole task configuring the socket to be non-blocking and waiting on select() before calling into the wolfssh library. Thus the risk may be higher here, so these shutdown cases need to be carefully sequenced.

                                                        -- Steve
On Wed, 15 Nov 2017, Stephen Casner wrote:
Originally the TelnetServer task was a direct child of the wifi task (using the Parent and TaskBase classes I created) and the wifi task would do the deletion. Subsequently we've changed to use events and the cleanup is not working properly. In fact, now when I command wifi down the system always restarts immediately because the heap debug code is detecting that a block is freed twice. I need to investigate further. Using multiple tasks is tricky because these libraries are not all thread-safe so we have to be very careful to execute code in the right task.
This double-free is happening in the SSH server, not Telnet, so if you don't have SSH configured you won't hit it. I'll be changing Telnet to also be a configurable component shortly, as Mark requested a few days ago.

The reason the system was restarting immediately, rather than crashing on an abort to give a stack trace, is that the code was trying to do a mem_malloc_show() to list all the blocks. There are too many blocks, so the watchdog timed out and restarted the system. Commenting out the mem_malloc_show() allows me to get the GDB stack backtrace for this double-free, as shown below, so I can debug this.

That is one deep stack! I wish FreeRTOS had a nice, clean message-passing signal/wait mechanism for IPC like the real-time OS I helped to write for the PDP-11 in 1975...

                                                        -- Steve

    #0  0x40088d54 in invoke_abort () at /Users/casner/src/github/esp-idf/components/esp32/./panic.c:125
    #1  0x40088e22 in abort () at /Users/casner/src/github/esp-idf/components/esp32/./panic.c:134
    #2  0x40085b6f in mem_free_block (data=0x3fff270c) at /Users/casner/src/github/esp-idf/components/freertos/./heap_regions_debug.c:335
    #3  0x400854bb in vPortFreeTagged (pv=0x3fff2714) at /Users/casner/src/github/esp-idf/components/freertos/./heap_regions.c:362
    #4  0x400893c8 in vPortFree (pv=0x3fff2714) at /Users/casner/src/github/esp-idf/components/esp32/./heap_alloc_caps.c:323
    #5  0x40081fbc in _free_r (r=0x3ffe6ec8, ptr=0x3fff2714) at /Users/casner/src/github/esp-idf/components/newlib/./syscalls.c:32
    #6  0x4000beca in ?? ()
    #7  0x400d2ec8 in PwMapListDelete (list=<optimized out>) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp:614
    #8  0x400d2ed9 in SSHServer::Cleanup (this=0x3ffe7fc8) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp:293
    #9  0x400d2f42 in SSHServer::~SSHServer (this=0x3ffe7fc8, __in_chrg=<optimized out>) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp:197
    #10 0x400d2f59 in SSHServer::~SSHServer (this=0x3ffe7fc8, __in_chrg=<optimized out>) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp:198
    #11 0x40130a51 in TaskBase::DeleteTask (this=0x3ffe7fc8) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./task_base.cpp:162
    #12 0x40130b14 in Parent::DeleteChildren (this=0x3ffc2d58 <MySSH>) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./task_base.cpp:100
    #13 0x400d2c40 in OvmsSSH::WifiDown (this=0x3ffc2d58 <MySSH>, event=..., data=0x0) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp:171
    #14 0x400d3997 in std::_Mem_fn_base<void (OvmsSSH::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), true>::operator()<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, void>(OvmsSSH*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) const (__object=<optimized out>, this=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600
    #15 std::_Bind<std::_Mem_fn<void (OvmsSSH::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsSSH*, std::_Placeholder<1>, std::_Placeholder<2>)>::__call<void, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&, 0u, 1u, 2u>(std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&>&&, std::_Index_tuple<0u, 1u, 2u>) (__args=<unknown type in /Users/casner/src/github/ovms/vehicle/OVMS.V3/build/ovms3.elf, CU 0x6c4b, DIE 0x13599>, this=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:1074
    #16 std::_Bind<std::_Mem_fn<void (OvmsSSH::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsSSH*, std::_Placeholder<1>, std::_Placeholder<2>)>::operator()<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (this=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:1133
    #17 std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsSSH::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsSSH*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:1871
    #18 0x4012936c in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (this=0x3ffcf01c, __args#0=..., __args#1=0x0) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271
    #19 0x401297f8 in OvmsEvents::SignalEvent (this=<optimized out>, event=..., data=0x0) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_events.cpp:112
    #20 0x4012ebc3 in OvmsNetManager::WifiDown (this=0x3ffc4570 <MyNetManager>, event=..., data=0x0) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_netmanager.cpp:91
    #21 0x4012f22b in std::_Mem_fn_base<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), true>::operator()<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, void>(OvmsNetManager*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) const (__object=<optimized out>, this=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600
    #22 std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)>::__call<void, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&, 0u, 1u, 2u>(std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&>&&, std::_Index_tuple<0u, 1u, 2u>) (__args=<unknown type in /Users/casner/src/github/ovms/vehicle/OVMS.V3/build/ovms3.elf, CU 0x324256, DIE 0x331cb6>, this=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:1074
    #23 std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)>::operator()<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (this=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:1133
    #24 std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:1871
    #25 0x4012936c in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (this=0x3ffcfce0, __args#0=..., __args#1=0x0) at /Users/casner/src/github/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271
    #26 0x401297f8 in OvmsEvents::SignalEvent (this=<optimized out>, event=..., data=0x0) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_events.cpp:112
    #27 0x400d6216 in esp32wifi::StopStation (this=0x3ffdca50) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/esp32wifi/./esp32wifi.cpp:308
    #28 0x400d629e in wifi_mode_off (verbosity=<optimized out>, writer=0x3ffe4360, cmd=0x3ffce8fc, argc=0, argv=0x0) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/esp32wifi/./esp32wifi.cpp:130
    #29 0x40127115 in OvmsCommand::Execute (this=0x3ffce8fc, verbosity=65535, writer=0x3ffe4360, argc=0, argv=0x0) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_command.cpp:257
    #30 0x40127209 in OvmsCommand::Execute (this=0x3ffce740, verbosity=65535, writer=0x3ffe4360, argc=1, argv=0x3ffe6c98) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_command.cpp:302
    #31 0x401271fc in OvmsCommand::Execute (this=0x3ffce670, verbosity=65535, writer=0x3ffe4360, argc=2, argv=0x3ffe6c98) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_command.cpp:298
    #32 0x401271fc in OvmsCommand::Execute (this=0x3ffc4428 <MyCommandApp+4>, verbosity=65535, writer=0x3ffe4360, argc=3, argv=0x3ffe6c94) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_command.cpp:298
    #33 0x40127230 in OvmsCommandApp::Execute (this=0x3ffc4424 <MyCommandApp>, verbosity=65535, writer=0x3ffe4360, argc=3, argv=0x3ffe6c90) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_command.cpp:507
    #34 0x401300d4 in Execute (rl=0x3ffe4368, argc=3, argv=0x3ffe6c90) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_shell.cpp:47
    #35 0x400f1dc9 in new_line_handler (pThis=0x3ffe4368) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/microrl/./microrl.c:626
    #36 0x400f1e2f in microrl_insert_char (pThis=0x3ffe4368, ch=10) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/components/microrl/./microrl.c:674
    #37 0x4013013d in OvmsShell::ProcessChars (this=0x3ffe4360, buf=0x3ffe47be "", len=2) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_shell.cpp:69
    #38 0x40125125 in ConsoleAsync::HandleDeviceEvent (this=0x3ffe4360, pEvent=<optimized out>) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./console_async.cpp:158
    #39 0x401290ad in OvmsConsole::Service (this=0x3ffe4360) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./ovms_console.cpp:142
    #40 0x401291b3 in non-virtual thunk to OvmsConsole::Service() ()
    #41 0x40130bbf in TaskBase::Task (object=0x3ffe4490) at /Users/casner/src/github/ovms/vehicle/OVMS.V3/main/./task_base.cpp:131
We don’t seem to have this issue with the webserver listener; that uses mongoose (in netmanager). That doesn’t have long-lived client-side connections, but the socket listen/accept functionality should be similar. Not sure how mongoose handles it internally, but the library seems quite robust.

I’m wondering if we can migrate to use that? It would probably save a bunch of RAM (as it doesn’t need a task for each connection). We are mostly single-user anyway.

Perhaps a generic extension to ovms_netmanager where you give it a tcp port and an object type. Then, it listens on that port and launches objects of that type (pre-configured to receive mongoose events as appropriate) for each new connection. Have a base virtual connection class that gets inherited by each implementation (ssh, telnet, etc).

I’m happy to try to re-work telnet that way, as an example, if it makes sense to you.

Regards, Mark.
On 16 Nov 2017, at 10:00 AM, Stephen Casner <casner@acm.org> wrote:
On Mon, 13 Nov 2017, Tom Parker wrote:
I do see, with the terrible network manager and also with your work, a fair amount of instability. I'm seeing reboots with socket-related frames in the stack trace, and quite often it gets hung up and never reconnects. I haven't been able to look at what might be causing the hang, because I've never had it happen while the laptop is plugged in and monitoring the console. It seems very reliable if you reset the OVMS with the button while in range of a network and never disconnect. However, maybe 50% of the time, after driving out of range it never reconnects when returning to the original network or seeing a new network it knows about.
I'll see if I can reproduce this by walking out of range with my cell phone while the OVMS is plugged into a laptop.
The following trace is from November 6th when I was using the terrible network manager. Sorry I don't have a recent trace from the new network manager, but the stack trace looked the same as far as I remember. It seems reasonably easy to reproduce; my startup script has the following:
I know this one. Here is my sample crash with the same backtrace as yours, plus the GDB stack trace and some exploration and an explanation below:
Guru Meditation Error of type LoadProhibited occurred on core 1. Exception was unhandled.
Register dump:
PC      : 0x401647a4  PS      : 0x00060230  A0      : 0x80177bd8  A1      : 0x3ffdd710
0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
A2      : 0xdeadbeef  A3      : 0x000000d2  A4      : 0x00000001  A5      : 0x3ffc7908
A6      : 0x00000002  A7      : 0x00060023  A8      : 0x000000d2  A9      : 0x3ffdd6f0
A10     : 0x00000001  A11     : 0x00000001  A12     : 0x80085d3d  A13     : 0x3ffebd30
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x00000010  EXCCAUSE: 0x0000001c
EXCVADDR: 0xdeadbf03  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff
Backtrace: 0x401647a4:0x3ffdd710 0x40177bd5:0x3ffdd730 0x4016ad86:0x3ffdd750 0x4016b651:0x3ffdd770 0x4016f286:0x3ffdd790 0x4017650a:0x3ffdd7b0 0x40166465:0x3ffdd7d0 0x401647a4: event_callback at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:3195
0x40177bd5: accept_function at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:1813 (discriminator 6)
0x4016ad86: tcp_process at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790 (discriminator 1)
0x4016b651: tcp_input at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
0x4016f286: ip4_input at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
0x4017650a: ethernet_input at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
0x40166465: tcpip_thread at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:474
Entering gdb stub now. [snip]

event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
1897      if (scb->sem_signalled == 0) {
(gdb) bt
#0 event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/sockets.c:1897
#1 0x40177bd8 in accept_function (arg=0x3fff21c4, newpcb=<optimized out>, err=0 '\000') at /Users/casner/src/github/esp-idf/components/lwip/api/api_msg.c:511
#2 0x4016ad89 in tcp_process (pcb=0x3fff26a0) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:790
#3 0x4016b654 in tcp_input (p=<optimized out>, inp=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/core/tcp_in.c:357
#4 0x4016f289 in ip4_input (p=0x3fff2434, inp=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/core/ipv4/ip4.c:712
#5 0x4017650d in ethernet_input (p=0x3fff2434, netif=0x3ffed5c4) at /Users/casner/src/github/esp-idf/components/lwip/netif/ethernet.c:176
#6 0x40166468 in tcpip_thread (arg=<optimized out>) at /Users/casner/src/github/esp-idf/components/lwip/api/tcpip.c:143
(gdb) list
1892    /* At this point, SYS_ARCH is still protected! */
1893    again:
1894      for (scb = select_cb_list; scb != NULL; scb = scb->next) {
1895        /* remember the state of select_cb_list to detect changes */
1896        last_select_cb_ctr = select_cb_ctr;
1897        if (scb->sem_signalled == 0) {
1898          /* semaphore not signalled yet */
1899          int do_signal = 0;
1900          /* Test this select call for our socket */
1901          if (sock->rcvevent > 0) {
(gdb) print scb
$1 = (struct lwip_select_cb *) 0xdeadbeef
(gdb) print select_cb_list
$2 = (struct lwip_select_cb *) 0x3fff1c08
(gdb) print *select_cb_list
$3 = {next = 0x3fff6ed8, prev = 0x0, readset = 0x3fff1c68, writeset = 0x3fff1c70, exceptset = 0x3fff1c78, sem_signalled = 0, sem = 0x3fff2068}
(gdb) print (struct lwip_select_cb *) 0x3fff6ed8
$4 = (struct lwip_select_cb *) 0x3fff6ed8
(gdb) print *(struct lwip_select_cb *) 0x3fff6ed8
$5 = {next = 0xdeadbeef, prev = 0x3fff1c08, readset = 0xdeadbeef, writeset = 0xdeadbeef, exceptset = 0xdeadbeef, sem_signalled = -559038737, sem = 0xdeadbeef}
(gdb) quit
The problem here is a weakness in LWIP+FreeRTOS. If a task is waiting on a select() when the task is deleted, then that task's entry is not removed from the select_cb_list even though the entry is freed. (The several 0xdeadbeef values shown above are there because in my local working copy I've added code in heap_regions_debug.c to fill heap blocks with that value when they are freed.) This happens because the struct lwip_select_cb is allocated on the stack to hold information about the task's pending select() on the sockets. The comment says:
/* We don't actually need any dynamic memory. Our entry on the list is only valid while we are in this function, so it's ok to use local variables. */
The problem is that if the task waiting on the select is deleted then the stack is freed and this struct becomes invalid, resulting in a crash.
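The lifetime hazard Steve describes can be illustrated with a stripped-down model of lwip's select_cb_list (this is illustrative code, not lwip's actual implementation): each caller links a node that lives on its own stack into a global list, and only a clean return unlinks it, so if the task (and hence its stack) is deleted mid-wait, the list is left pointing into freed memory.

```cpp
#include <cstddef>

/* Simplified model of lwip's select_cb_list: callers register a node that
 * lives on their own stack for the duration of one select() call. */
struct select_cb { struct select_cb *next; int sem_signalled; };

static struct select_cb *select_cb_list = NULL;

static void cb_register(struct select_cb *cb) {
  cb->next = select_cb_list;
  select_cb_list = cb;
}

static void cb_unregister(struct select_cb *cb) {
  struct select_cb **pp = &select_cb_list;
  while (*pp && *pp != cb) pp = &(*pp)->next;
  if (*pp) *pp = cb->next;
}

/* Well-behaved caller: the node is on the list only while this frame is live. */
static void waiter(void) {
  struct select_cb cb = { NULL, 0 };
  cb_register(&cb);
  /* ...block on a semaphore here. If the task is deleted at this point, its
   * stack (including cb) is freed, but select_cb_list still points at it,
   * and the cleanup below never runs -- the crash Steve diagnosed. */
  cb_unregister(&cb);
}
```

The normal path leaves the list empty; the failure mode is exactly the unlink that never happens when FreeRTOS deletes the task out from under the function.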
The select() would work fine here if there were some way to do the equivalent of issuing a unix signal to the task, which would cause the select() to return. But select() is implemented in LWIP, so FreeRTOS knows nothing about it. The select() is waiting on a semaphore, so waking it up would be possible if a hook were provided to get to the semaphore. Doing that might allow switching to the select()-loop pattern with a single task, simplifying task management and also saving stack space.
The reason I'm having the TelnetServer wait on the select() is that when I originally wrote it to just block on an accept() call, there was no way to unblock it when wifi went down and the task was deleted. In FreeRTOS, killing a task does not clean up its resources, so the listening socket wouldn't close all the way, leaving a 76-byte block leaked.
Then I found that if I have the TelnetServer task wait on a select() instead, then if I close the server socket from the task that is powering down the wifi (in this case, ConsoleAsync by command) while the TelnetServer task is blocked on the select(), then the close causes the select() to return. I can detect the closed socket in the TelnetServer task and exit cleanly.
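The wakeup trick above can be modeled with a plain socketpair (a library-agnostic sketch; on the module the socket is a listening TCP socket and the shutdown comes from a different task): shutting the socket down makes select() report it readable, and a zero-byte recv() then identifies it as closed so the waiter can exit cleanly.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns 0 if the shutdown woke the select() and was detected as a close.
int wakeup_demo(void) {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

  shutdown(sv[0], SHUT_RDWR);     // what the powering-down task would do

  fd_set rfds;
  FD_ZERO(&rfds);
  FD_SET(sv[0], &rfds);
  struct timeval tv = { 1, 0 };   // bounded wait instead of blocking forever
  int n = select(sv[0] + 1, &rfds, NULL, NULL, &tv);

  char c;
  // A 0-byte recv() on a readable socket means it was closed/shut down.
  ssize_t got = (n > 0) ? recv(sv[0], &c, 1, 0) : -1;
  close(sv[0]);
  close(sv[1]);
  return (n > 0 && got == 0) ? 0 : -1;
}
```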
Originally the TelnetServer task was a direct child of the wifi task (using the Parent and TaskBase classes I created) and the wifi task would do the deletion. Subsequently we've changed to use events and the cleanup is not working properly. In fact, now when I command wifi down the system always restarts immediately because the heap debug code is detecting that a block is freed twice. I need to investigate further. Using multiple tasks is tricky because these libraries are not all thread-safe so we have to be very careful to execute code in the right task.
The new SSH console implementation has the SSHServer task waiting on select() in the same manner but it also has the SSHConsole task configuring the socket to be non-blocking and waiting on select() before calling into the wolfssh library. Thus the risk may be higher here so these shutdown cases need to be carefully sequenced.
-- Steve
_______________________________________________
OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
Mark,

The problem with select() becoming invalid when the task is killed can be avoided by never killing the task. I assume that is the case for NetManTask with mongoose.

You're right that we could fold the listen/accept functionality from the TelnetServer task and SSHServer task into NetManTask. When I started with Telnet that option hadn't been created yet, and I assumed that the TelnetServer task should not live all the time. I replicated the Telnet design for SSH, but these could be changed. Let me take a look at how to do this.

I believe we still need dedicated tasks for the consoles to handle synchronization of output. And we can't just wait on select() for these, because we also need to wait for Queue events that are the signals for logging output from other tasks.

-- Steve
On Thu, 16 Nov 2017, Mark Webb-Johnson wrote:
We don't seem to have this issue with the webserver listener; that uses mongoose (in netmanager). That doesn't have long-lived client side connections, but the socket listen/accept functionality should be similar. Not sure how mongoose handles it internally, but the library seems quite robust.
Mongoose wakes up from the select() periodically to do polling, so the deletion of the task would not happen while it is in select().
I'm wondering if we can migrate to use that? It would probably save a bunch of ram (as it doesn't need a task for each thread). We are mostly single-user anyway.
I have converted console_telnet to use mongoose and netmanager in the same manner as ovms_webserver. This eliminates the TelnetServer and TelnetReceiver tasks and saves about 3KB. I have kept TelnetConsole as a separate task from NetManTask. I'm not convinced that combining the console functionality into NetManTask is a good idea; I'm afraid it would introduce some problems with interference between various operations and may get into trouble with the Log() mechanism. Even if not, much more surgery would be required.
Perhaps a generic extension to ovms_netmanager where you give it a tcp port and an object type. Then, it listens on that port and launches objects of that type (pre-configured to receive mongoose events as appropriate) for each new connection? Have a base virtual connection class, that gets inherited by each implementation (ssh, telnet, etc).
I didn't do it that way for this telnet conversion, but we could evolve to that if appropriate.

Converting ssh may be possible but would be significantly harder. The libtelnet design calls for all the actual I/O (recv/send) to be done outside of the library. That fits with the design of mongoose, where it owns that I/O. However, the wolfssh library is designed for the library itself to be in control of the I/O.

-- Steve
There is more to do on this conversion of the Telnet Console to use mongoose:

- When wifi service is shut down by command, my old implementation would close the connections to the clients cleanly. The new implementation with mongoose leaves them hanging. The way to improve this may lie in the shutdown steps in ovms_netmanager.
- It looks like there may be a memory block leaked at each client disconnection. I need to find that.
- I think I can improve how I handle the receive buffer and save some more RAM.

I've also thought more about how to convert the SSH Console and will take a stab at that as well.

-- Steve
Strange. When I tested mongoose, I got a close event (MG_EV_CLOSE) for each open connection, when the interface went down. Regards, Mark.
On Sun, 19 Nov 2017, Mark Webb-Johnson wrote:
Strange. When I tested mongoose, I got a close event (MG_EV_CLOSE) for each open connection, when the interface went down.
It does send a close event, but not until after the wifi is already shut down so closing the socket at that point does not send any packet to the client. I've decided to punt on that issue, though, because manually shutting down wifi is not an important use case. The more likely case is that wifi connectivity is lost due to motion or other causes and in that case no close packet can be delivered anyway. My other points were:
- It looks like there may be a memory block leaked at each client disconnection. I need to find that.
This is not a leak. It is the policy of lwip to allocate a semaphore to each socket and to not reuse socket structures in its pool until all have been used. So that means up to 10 semaphore blocks of 92 bytes each (104 with debug overhead) will be allocated as new client connections are made. I actually figured this out once before in August when I first implemented telnet.
- I think I can improve how I handle the receive buffer and save some more RAM.
I've also thought more about how to convert the SSH Console and will take a stab at that as well.
In my local repository I have done both of these improvements but not yet committed them.

Furthermore, I have determined that the "surgery" in OvmsConsole required to accommodate ConsoleAsync as a task while ConsoleTelnet and ConsoleSSH run within the NetManTask was not as bad as I thought. So, as you requested, in my local copy I have eliminated all the dedicated tasks for the Telnet and SSH implementations and have that working. However, there are a couple of downsides to using mongoose:

1) Output from other tasks through the OvmsCommandApp::Log() facility will be delayed on average by half the mongoose polling interval, which is currently one second. In my original implementation I avoided polling to avoid delays, but that had other costs.

2) More serious: Mongoose has a policy that callback functions will not block, which is reasonable since it acts as a hub for multiple services. However, one consequence is that the mg_send() function always just buffers the data you give it into heap RAM. So, in the SSH console, for example, when you enter a command an MG_EV_RECV event will pass control into code in the console and command classes. That code will generate its output as multiple puts or printf calls, each of which results in an mg_send() call. For a command like "?" that produces many lines of output, mongoose will be doing a bunch of realloc() operations to keep appending all of the lines into one big buffer in RAM. It is possible for a command to use up all the available RAM. Fortunately mg_send() fails gracefully when that happens, but it means command output is lost, and in the meantime other functions that depend upon RAM may fail less gracefully. It was issues like this that led me to be concerned about eliminating the console tasks.

There is no mechanism in Mongoose for the application to flush the send buffer. The data will only be sent when the callback for the MG_EV_RECV event returns to Mongoose. We don't have a way to return to mongoose part way through the command output and then get control back again to finish the command; the state of being in the middle of the command is lost without a separate task to hold that state. It might work to recurse into mongoose by calling mg_mgr_poll() again, but the code in mongoose may not be reentrant, and this would incur significant risk of stack overflow.

Should we consider modifying Mongoose to change its policy so that mg_send() actually sends its data? This could be a flag in the mg_connection object, or there could just be an mg_flush() call added. Digging through the code I found mg_mgr_handle_conn(), which does the actual sending, but that function is not exposed in the header file and I don't know whether it would be valid to call it from user code while in the middle of the callback.

You also said:
Perhaps a generic extension to ovms_netmanager where you give it a tcp port and an object type. Then, it listens on that port and launches objects of that type (pre-configured to receive mongoose events as appropriate) for each new connection? Have a base virtual connection class, that gets inherited by each implementation (ssh, telnet, etc).
Right now I just have some similar code in OvmsTelnet and OvmsSSH classes. When this settles down we could consider factoring that into a base virtual connection class.

-- Steve
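A minimal sketch of what such a port-to-connection-factory registry might look like (all class and method names here are hypothetical, not existing OVMS code):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Base virtual connection class, per Mark's proposal; real implementations
// would also receive mongoose events through methods declared here.
class OvmsNetConnection {
 public:
  virtual ~OvmsNetConnection() = default;
  virtual std::string Protocol() const = 0;   // e.g. for logging
};

class TelnetConnection : public OvmsNetConnection {
 public:
  std::string Protocol() const override { return "telnet"; }
};

// Hypothetical netmanager extension: map a TCP port to a factory that
// creates a connection object for each accepted client.
class NetListenerRegistry {
 public:
  using Factory = std::function<std::unique_ptr<OvmsNetConnection>()>;
  void RegisterListener(int port, Factory f) { m_factories[port] = std::move(f); }
  // Would be called on MG_EV_ACCEPT for the listening socket on 'port'.
  std::unique_ptr<OvmsNetConnection> Accept(int port) {
    auto it = m_factories.find(port);
    return (it != m_factories.end()) ? it->second() : nullptr;
  }
 private:
  std::map<int, Factory> m_factories;
};
```

Usage would be one registration per service, e.g. `RegisterListener(23, ...)` for telnet and `RegisterListener(22, ...)` for ssh, with the event-dispatch plumbing shared.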
Steve,

The general approach of async net libraries like mongoose seems to be that they don’t allow blocking, and that is why the send doesn’t send immediately. That said, I’ve seen implementations that at least _try_ to send immediately (and only queue it if that fails); however, that can complicate the transmitter (which has to deal with the two possible cases).

The more general approach seems to be that if we want to send something big, we send a chunk, then wait for the MG_EV_SEND message to indicate that chunk has been sent, then queue the next chunk to go. That seems to work well in Mongoose. My issue is that the higher-level protocols in Mongoose don’t support that well (for example, HTTP client and server). Kind of disappointing. But the basic low-level networking API seems ok. I can see how something like ‘vfs cat my-big-file’ would cause us concern.

I tried the telnet server component, with wifi up, and an ESP_LOGI on to log the events. Dropping the event #0 (poll) notifications, this is what I saw:

I (22689) telnet: Launching Telnet Server

(Launch a normal telnet connection)

I (47049) telnet: Event 1 conn 0x3fff1004, data 0x3fff101c (MG_EV_ACCEPT)
I (47049) telnet: Event 3 conn 0x3fff1004, data 0x3ffeffb0 (MG_EV_RECV)
I (47099) telnet: Event 4 conn 0x3fff1004, data 0x3ffeffc0 (MG_EV_SEND)
I (47109) telnet: Event 3 conn 0x3fff1004, data 0x3ffeffb0 (MG_EV_RECV)
I (51749) telnet: Event 3 conn 0x3fff1004, data 0x3ffeffb0 (MG_EV_RECV)
I (51759) telnet: Event 4 conn 0x3fff1004, data 0x3ffeffc0 (MG_EV_SEND)
I (51889) telnet: Event 3 conn 0x3fff1004, data 0x3ffeffb0 (MG_EV_RECV)
I (51899) telnet: Event 4 conn 0x3fff1004, data 0x3ffeffc0 (MG_EV_SEND)
I (51939) telnet: Event 3 conn 0x3fff1004, data 0x3ffeffb0 (MG_EV_RECV)
I (51949) telnet: Event 4 conn 0x3fff1004, data 0x3ffeffc0 (MG_EV_SEND)
I (52049) telnet: Event 3 conn 0x3fff1004, data 0x3ffeffb0 (MG_EV_RECV)
I (52059) telnet: Event 4 conn 0x3fff1004, data 0x3ffeffc0 (MG_EV_SEND)
I (52209) telnet: Event 3 conn 0x3fff1004, data 0x3ffeffb0 (MG_EV_RECV)
I (52209) telnet: Event 4 conn 0x3fff1004, data 0x3ffeffc0 (MG_EV_SEND)

(Disconnect the client side)

I (54309) telnet: Event 5 conn 0x3fff1004, data 0x0 (MG_EV_CLOSE)

(Launch a normal telnet connection)

I (58319) telnet: Event 1 conn 0x3fff0f08, data 0x3fff0f20 (MG_EV_ACCEPT)
I (58319) telnet: Event 3 conn 0x3fff0f08, data 0x3ffeffb0 (MG_EV_RECV)
I (58369) telnet: Event 4 conn 0x3fff0f08, data 0x3ffeffc0 (MG_EV_SEND)
I (58379) telnet: Event 3 conn 0x3fff0f08, data 0x3ffeffb0 (MG_EV_RECV)

OVMS > wifi mode off
Stopping wifi station...
I (62569) events: Signal(system.wifi.down)
I (62569) events: Signal(network.wifi.down)
I (62569) ssh: Stopping SSH Server
I (62579) events: Signal(network.down)
I (62579) wifi: state: run -> init (0)
I (62579) wifi: pm stop, total sleep time: 0/33441266
I (62579) wifi: n:11 0, o:11 0, ap:255 255, sta:11 0, prof:1
I (62589) events: Signal(system.wifi.sta.disconnected)
I (62589) telnet: Event 5 conn 0x3fff0f08, data 0x0 (MG_EV_CLOSE)
I (62589) events: Signal(network.mgr.stop)
I (62589) telnet: Stopping Telnet Server
I (62589) events: Signal(system.wifi.sta.stop)
I (62589) telnet: Event 5 conn 0x3fff0ccc, data 0x0 (MG_EV_CLOSE)

As you say, not elegant, but at least our side gets closed.

Regards, Mark.
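The chunked-send pattern Mark describes can be modeled independently of mongoose as a small state machine (ChunkSender is illustrative, not OVMS code); each call to NextChunk() stands in for queuing one chunk after the previous chunk's MG_EV_SEND event:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Holds a large payload and hands it out one bounded chunk at a time, so
// only one chunk is ever buffered in the network library at once.
class ChunkSender {
 public:
  ChunkSender(std::string data, size_t chunk)
    : m_data(std::move(data)), m_chunk(chunk) {}
  // Returns the next chunk to hand to the library's send call, or "" when done.
  std::string NextChunk() {
    size_t n = std::min(m_chunk, m_data.size() - m_pos);
    std::string out = m_data.substr(m_pos, n);
    m_pos += n;
    return out;
  }
  bool Done() const { return m_pos >= m_data.size(); }
 private:
  std::string m_data;
  size_t m_chunk;
  size_t m_pos = 0;
};

// Event-loop side: each loop iteration simulates one "previous chunk sent"
// notification (MG_EV_SEND in mongoose) triggering the next chunk.
std::string DrainAll(ChunkSender& s) {
  std::string wire;
  while (!s.Done()) wire += s.NextChunk();
  return wire;
}
```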
On Mon, 20 Nov 2017, Mark Webb-Johnson wrote:
The general approach of async net libraries like mongoose seems to be that they don’t allow blocking, and that is why the send doesn’t send immediately. That said, I’ve seen implementations that at least _try_ to send immediately (and only queue it if that fails); however, that can complicate the transmitter (which has to deal with the two possible cases).
It seems that it would not be that difficult for mg_send() to use a no-wait select() to check whether the socket is available for writing, and if so, to proceed with actually sending it. This could block if the buffer space in the network stack is less than what mg_send() has accumulated, but in most cases it would not block.
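The proposed check can be sketched as follows (try_send_now() is a hypothetical helper, not actual mongoose code): a zero-timeout select() tests writability, and the data is written immediately only when the socket is ready, leaving the caller to fall back to buffering otherwise.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Try to send immediately; returns bytes sent, or 0 if the socket was not
// writable (caller should buffer the data instead, as mg_send() does today).
ssize_t try_send_now(int fd, const char* buf, size_t len) {
  fd_set wfds;
  FD_ZERO(&wfds);
  FD_SET(fd, &wfds);
  struct timeval tv = { 0, 0 };          // no-wait: just a readiness check
  int n = select(fd + 1, NULL, &wfds, NULL, &tv);
  if (n <= 0) return 0;                  // not writable (or error): buffer it
  return send(fd, buf, len, 0);          // writable: send right away
}
```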
The more general approach seems to be that if we want to send something big, we send a chunk, than wait for the MG_EV_SEND message to indicate that chunk has been sent, then queue the next chunk to go. That seems to work well in Mongoose. My issue is that the higher-level protocols in Mongoose don’t support that well (for example, HTTP client and server). Kind of disappointing. But, the basic low-level networking API seems ok.
The approach of sending in chunks does not fit well with our command and console software architecture. In particular for SSH, mg_send() is called from the callback function in the ConsoleSSH class that is called when the wolfssh code wants to do sendto(). It is possible to return EWOULDBLOCK to wolfssh, in which case the EWOULDBLOCK gets returned from wolfSSH_stream_send() that is called from printf(). But with everything in the NetManTask it is not possible to wait in the printf() function, because we have to return from the MG_EV_RECV event callback to let NetManTask actually do the sending.

I wrote a new test function that is the RFC 864 character generator, like some systems implemented on port 19 in the early days of the Internet. It outputs a configurable number of 72-character lines, defaulting to 1000. Through SSH on my test unit only the first 134 lines survived; the rest were discarded.

Now, we can simply declare that in the interest of saving RAM we won't be concerned about the output limits if we don't expect large outputs in this application. Or I could make changes to Mongoose along the lines of what I suggested above.

-- Steve
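An RFC 864-style generator of the kind Steve describes might look something like this (a sketch; the actual OVMS test command may differ): line i is 72 characters drawn from the 95 printable ASCII characters, with each successive line starting one character later.

```cpp
#include <string>

// Classic chargen pattern: 72 printable characters per line, rotating
// through the 95-character printable range (' ' .. '~'), shifted by one
// position on each successive line.
std::string ChargenLine(int i) {
  std::string line;
  for (int j = 0; j < 72; j++)
    line += static_cast<char>(' ' + (i + j) % 95);
  return line;
}
```

Emitting 1000 such lines in a tight loop is a good stress test precisely because every line becomes an mg_send() into the heap-backed buffer.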
On Sun, 19 Nov 2017, Stephen Casner wrote:
It seems that it would not be that difficult for mg_send() to use a no-wait select() to check whether the socket is available for writing, and if so, to proceed with actually sending it. This could block if the buffer space in the network stack is less than what mg_send() has accumulated, but in most cases it would not block.
I've gone ahead and implemented this change to Mongoose, because without it the SSH connection can suffer a badly formed packet and hang due to running out of dynamic memory.

I have completed and committed the changes for the Telnet and SSH consoles to no longer create any tasks of their own but instead to run as part of the NetManTask. When there is no connection, this only saves about 1KB, because I needed to bump up the stack size for the NetManTask to avoid overflow with the additional complexity of executing the console command code. However, when an SSH connection is established, the reduction in RAM usage is about 7KB.

-- Steve
participants (3):
- Mark Webb-Johnson
- Stephen Casner
- Tom Parker