[Ovmsdev] LWIP lock-up
Mark Webb-Johnson
mark at webb-johnson.net
Sun Mar 25 15:54:43 HKT 2018
My car managed to repeat the issue with the task lockup I have seen occasionally. The symptom is that network connectivity fails and the async console becomes unresponsive when you issue a command to bring up/down wifi, simcom, etc.
The new housekeeping 10 minute ticker worked well for this, and stopped in the middle of the night:
…
I (134401020) housekeeping: Local time: Sat Mar 24 20:15:45 2018
I (135001020) housekeeping: Local time: Sat Mar 24 20:25:45 2018
I (135601020) housekeeping: Local time: Sat Mar 24 20:35:45 2018
I (136201020) housekeeping: Local time: Sat Mar 24 20:45:45 2018
Looking at the logs after Sat Mar 24 20:45:45, I see:
D (136543060) simcom: rx line ch=4 len=19 : +PPPD: DISCONNECTED
I (136543070) simcom: PPP Connection disconnected
I (136544020) simcom: Lost network connection (+PPP disconnect in NetMode)
I (136544020) simcom: State: Enter NetLoss state
...
I (136544030) gsm-ppp: Shutting down (hard)...
I (136544040) events: Signal(system.modem.down)
I (136544040) netmanager: Interface priority is st1 (x.y.z.212/255.255.248.0 gateway x.y.z.64)
I (136544050) events: Signal(network.modem.down)
I (136544050) netmanager: MODEM down (with WIFI client up): staying with WIFI client priority
I (136544060) events: Signal(network.interface.change)
OK, so ppp went down (lost network connectivity), and WIFI client was up so we kept with that.
The GSM signal is still up (CREG), so we try to reconnect PPP:
I (136558160) simcom: PPP Connection is ready to start
I (136559020) simcom: State: Enter NetMode state
I (136559020) gsm-ppp: Initialising..
...
D (136589180) simcom: rx line ch=4 len=8 : +CREG: 2
I (136589190) simcom: CREG Network Registration: Searching
...
D (136589250) simcom: rx line ch=4 len=8 : +CREG: 5^[[0m
^[[0;32mI (136589250) simcom: CREG Network Registration: RegisteredRoaming
...
D (136589320) simcom: rx line ch=4 len=8 : +CREG: 2^[[0m
^[[0;32mI (136589320) simcom: CREG Network Registration: Searching
...
D (136589380) simcom: rx line ch=4 len=8 : +CREG: 5^[[0m
^[[0;32mI (136589390) simcom: CREG Network Registration: RegisteredRoaming
...
D (136601690) simcom: rx line ch=3 len=8 : +CREG: 2^[[0m
^[[0;32mI (136601690) simcom: CREG Network Registration: Searching
...
D (136601780) simcom: rx line ch=3 len=8 : +CREG: 5^[[0m
^[[0;32mI (136601780) simcom: CREG Network Registration: RegisteredRoaming
...
V (136617440) gsm-ppp: tx: 7e ff 7d 23 c0 21 7d 25 7d 2c 7d 20 20 4e 6f 20 | ~.}#.!}%},} No
V (136617440) gsm-ppp: tx: 6e 65 74 77 6f 72 6b 20 70 72 6f 74 6f 63 6f 6c | network protocol
V (136617450) gsm-ppp: tx: 73 20 72 75 6e 6e 69 6e 67 4b 8f 7e | s runningK.~
V (136617450) simcom: tx: f9 09 ff 59 7e ff 7d 23 c0 21 7d 25 7d 2c 7d 20 | ...Y~.}#.!}%},}
V (136617450) simcom: tx: 20 4e 6f 20 6e 65 74 77 6f 72 6b 20 70 72 6f 74 | No network prot
V (136617460) simcom: tx: 6f 63 6f 6c 73 20 72 75 6e 6e 69 6e 67 4b 8f 7e | ocols runningK.~
V (136617460) simcom: tx: a6 f9 | ..
V (136617490) simcom: rx: f9 09 ff 25 7e ff 7d 23 c0 21 7d 26 7d 2c 7d 20 | ...%~.}#.!}&},}
V (136617500) simcom: rx: 7d 24 8f 7d 3d 7e fb f9 | }$.}=~..
V (136617510) gsm-mux: ProcessFrame(CHAN=2, ADDR=09, CTRL=ff, FCS=fb, LEN=24)
V (136617520) gsm-mux: ChanProcessFrame(CHAN=2, ADDR=09, CTRL=ff, LEN=21, IFP=3)
V (136617520) gsm-ppp: rx: 7e ff 7d 23 c0 21 7d 26 7d 2c 7d 20 7d 24 8f 7d | ~.}#.!}&},} }$.}
V (136617530) gsm-ppp: rx: 3d 7e | =~
I (136617540) gsm-ppp: StatusCallBack: Connection Lost
E (136617540) gsm-ppp: status_cb: Connection lost
I (136617550) gsm-ppp: Shutdown (via status callback)
I (136617550) events: Signal(system.modem.down)
I (136617560) netmanager: Interface priority is st1 (x.y.z.212/255.255.248.0 gateway z.y.z.64)
I (136617560) gsm-ppp: Attempting PPP reconnecting in 30 seconds...
So, while initialising PPP we’ve been told that the cellular network is down (CREG: 2). Then, it flaps up and down like a yoyo, until the PPP connect finally fails.
At this point, things go screwy. The simcom tells us the PPP is down, and then we lose the ability to transmit to the modem (or at least log those transmits):
V (136618650) simcom: rx: 50 50 50 44 3a 20 44 49 53 43 4f 4e 4e 45 43 54 | PPPD: DISCONNECT
V (136618660) simcom: rx: 45 44 0d 0a d4 f9 f9 11 ff 2f 0d 0a 2b 50 50 50 | ED......./..+PPP
V (136618660) gsm-mux: ProcessFrame(CHAN=3, ADDR=0d, CTRL=ff, FCS=d4, LEN=29)
V (136618670) gsm-mux: ChanProcessFrame(CHAN=3, ADDR=0d, CTRL=ff, LEN=26, IFP=3)
D (136618680) simcom: rx line ch=3 len=19 : +PPPD: DISCONNECTED
I (136618690) simcom: PPP Connection disconnected
V (136618690) simcom: rx: 44 3a 20 44 49 53 43 4f 4e 4e 45 43 54 45 44 0d | D: DISCONNECTED.
V (136618700) simcom: rx: 0a d9 f9 | ...
V (136618710) gsm-mux: ProcessFrame(CHAN=4, ADDR=11, CTRL=ff, FCS=d9, LEN=29)
V (136618710) gsm-mux: ChanProcessFrame(CHAN=4, ADDR=11, CTRL=ff, LEN=26, IFP=3)
D (136618720) simcom: rx line ch=4 len=19 : +PPPD: DISCONNECTED
I (136618730) simcom: PPP Connection disconnected
I (136619020) simcom: Lost network connection (+PPP disconnect in NetMode)
I (136619020) simcom: State: Enter NetLoss state
V (136619030) simcom: tx: f9 0d ff 19 41 54 2b 43 47 41 54 54 3d 30 0d 0a | ....AT+CGATT=0..
V (136619030) simcom: tx: 14 f9 | ..
I (136619030) gsm-ppp: Shutting down (hard)...^[[0m
I (136619040) events: Signal(system.modem.down)^[[0m
I (136619040) netmanager: Interface priority is st1 (x.y.z.212/255.255.248.0 gateway x.y.z.64)
V (136632920) simcom: rx: f9 0d ff 0d 0d 0a 4f 4b 0d 0a 0f f9 | ......OK....
V (153076210) simcom: rx: f9 09 ff 1d 0d 0a 2b 43 53 51 3a 20 31 2c 39 39 | ......+CSQ: 1,99
V (153076220) simcom: rx: 0d 0a d1 f9 f9 0d ff 1d 0d 0a 2b 43 53 51 3a 20 | ..........+CSQ:
V (153076220) simcom: rx: 31 2c 39 39 0d 0a 13 f9 f9 11 ff 1d 0d 0a 2b 43 | 1,99..........+C
V (153076230) simcom: rx: 53 51 3a 20 31 2c 39 39 0d 0a 1e f9 | SQ: 1,99....
V (153202470) simcom: rx: f9 09 ff 1d 0d 0a 2b 43 53 51 3a 20 34 2c 39 39 | ......+CSQ: 4,99
V (153202480) simcom: rx: 0d 0a d1 f9 f9 0d ff 1d 0d 0a 2b 43 53 51 3a 20 | ..........+CSQ:
V (153202490) simcom: rx: 34 2c 39 39 0d 0a 13 f9 f9 11 ff 1d 0d 0a 2b 43 | 4,99..........+C
V (153202500) simcom: rx: 53 51 3a 20 34 2c 39 39 0d 0a 1e f9 | SQ: 4,99....
V (179212720) simcom: rx: f9 09 ff 1d 0d 0a 2b 43 53 51 3a 20 30 2c 39 39 | ......+CSQ: 0,99
V (179212730) simcom: rx: 0d 0a d1 f9 f9 0d ff 1d 0d 0a 2b 43 53 51 3a 20 | ..........+CSQ:
V (179212730) simcom: rx: 30 2c 39 39 0d 0a 13 f9 f9 11 ff 1d 0d 0a 2b 43 | 0,99..........+C
V (179212740) simcom: rx: 53 51 3a 20 30 2c 39 39 0d 0a 1e f9 | SQ: 0,99....
V (179242750) simcom: rx: f9 09 ff 1d 0d 0a 2b 43 53 51 3a 20 30 2c 39 39 | ......+CSQ: 0,99
V (179242760) simcom: rx: 0d 0a d1 f9 f9 0d ff 1d 0d 0a 2b 43 53 51 3a 20 | ..........+CSQ:
V (179242770) simcom: rx: 30 2c 39 39 0d 0a 13 f9 f9 11 ff 1d 0d 0a 2b 43 | 0,99..........+C
V (179242770) simcom: rx: 53 51 3a 20 30 2c 39 39 0d 0a 1e f9 | SQ: 0,99....
V (179246100) simcom: rx: f9 09 ff 19 0d 0a 2b 43 52 45 47 3a 20 32 0d 0a | ......+CREG: 2..
V (179246110) simcom: rx: d6 f9 f9 0d ff 19 0d 0a 2b 43 52 45 47 3a 20 32 | ........+CREG: 2
V (179246110) simcom: rx: 0d 0a 14 f9 f9 11 ff 19 0d 0a 2b 43 52 45 47 3a | ..........+CREG:
V (179246120) simcom: rx: 20 32 0d 0a 19 f9 | 2....
V (179251580) simcom: rx: f9 09 ff 1d 0d 0a 2b 43 53 51 3a 20 34 2c 39 39 | ......+CSQ: 4,99
V (179251590) simcom: rx: 0d 0a d1 f9 f9 0d ff 1d 0d 0a 2b 43 53 51 3a 20 | ..........+CSQ:
V (179251600) simcom: rx: 34 2c 39 39 0d 0a 13 f9 f9 11 ff 1d 0d 0a 2b 43 | 4,99..........+C
V (179251600) simcom: rx: 53 51 3a 20 34 2c 39 39 0d 0a 1e f9 | SQ: 4,99....
I (179290100) wifi: bcn_timout,ap_probe_send_start
I (179292610) wifi: ap_probe_send over, resett wifi status to disassoc
I (179292610) wifi: state: run -> init (1)
I (179292620) wifi: pm stop, total sleep time: 0/138543477
I (179292620) wifi: n:13 0, o:13 0, ap:255 255, sta:13 0, prof:1
The simcom itself seems to be up, and is giving us unsolicited notifications. Finally, the wifi stack falls over (bcn_timeout.ap_probe_sent).
This is very nasty, and I suspect will take some time to resolve. We can probably make it more stable by introducing a check for network loss, and requiring the network to be up and stable for N seconds before even attempting a reconnection; but I don’t think that addresses the core issue (which is most likely in the lwip ppp stack).
Code-wise, the simcom driver entering ‘NetLoss’ state calls m_ppp.Shutdown(true) (which issues the shutting down (hard) message). Our pppos driver issues the system.modem.down event (which causes netmanager to do it’s thing), and then calls ppapi_close(m_ppp, nocarrier). I suspect that never returns (but don’t know for sure).
The other workaround is to enable watchdog timers. I’m 99% certain this sort of lockup would mess up the idle tasks and those would pickup and fire a watchdog timer. So, I’ve gone ahead and committed code to enable the watchdog timers (2 minutes), and make that available as an OTA update:
OVMS# ota status
Firmware: 3.1.001/ota_0/main (build idf v3.1-dev-453-g0f978bcb Mar 25 2018 14:45:59)
Server Available: 3.1.001
The module in my car is now running OTA, so I can verify the OTA update process is smooth (it seems so).
Regards, Mark.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20180325/ff4cb0ee/attachment.htm>
More information about the OvmsDev
mailing list