[Ovmsdev] Time for 3.2.003? / Issue #241

Marko Juhanne marko.juhanne at gmail.com
Sat Sep 7 17:16:05 HKT 2019


Hi,

not sure if this is relevant, but my modifications have caused stack
overflow in few modules, and required increasing the stack size. Might
happen sporadically in some execution branches in your fork too?

See if increasing stack size in these modules help:
CONFIG_OVMS_HW_CAN_RX_QUEUE_SIZE -> 4096
CONFIG_OVMS_VEHICLE_RXTASK_STACK -> 8192
CONFIG_OVMS_SYS_COMMAND_STACK_SIZE *2

Br, Marko




On Sat, Sep 7, 2019 at 11:56 AM Michael Balzer <dexter at expeedo.de> wrote:

> I think the RTOS timer service task starves. It's running on core 0 with
> priority 1.
>
> Taks on core 0 sorted by priority:
>
> Number of Tasks = 20      Stack:  Now   Max Total    Heap 32-bit SPIRAM C#
> PRI
> 3FFC84A8  6 Blk ipc0              388   500  1024    7788      0      0
> 0  24
> 3FFC77F0  5 Blk OVMS CanRx        428   428  2048    3052      0  31844
> 0  23
> 3FFAFBF4  1 Blk esp_timer         400   656  4096   35928    644  25804
> 0  22
> 3FFD3240 19 Blk wifi              460  2716  3584   43720      0     20
> 0  22
> 3FFC03C4  2 Blk eventTask         448  1984  4608     104      0      0
> 0  20
> 3FFC8F14 17 Blk tiT               500  2308  3072    6552      0      0
> *  18
> 3FFE14F0 26 Blk OVMS COrx         456   456  4096       0      0      0
> 0   7
> 3FFE19D4 27 Blk OVMS COwrk        476   476  3072       0      0      0
> 0   7
> 3FFCBC34 12 Blk Tmr Svc           352   928  3072      88      0      0
> 0   1
> 3FFE7708 23 Blk mdns              468  1396  4096     108      0      0
> 0   1
>
> I don't think it's our CanRx, as that only fetches and queues CAN frames,
> the actual work is done by the listeners. The CO tasks only run for CANopen
> jobs, which are few for normal operation.
>
> That leaves the system tasks, with main suspect -once again- the wifi blob.
>
> We need to know how much CPU time the tasks actually use now. I think I
> saw some option for this in the FreeRTOS config.
>
> Regards,
> Michael
>
>
> Am 06.09.19 um 23:15 schrieb Michael Balzer:
>
> The workaround is based on the monotonictime being updated per second, as
> do the history record offsets.
>
> Apparently, that mechanism doesn't work reliably. That may be an indicator
> for some bigger underlying issue.
>
> Example log excerpt:
>
> 2019-09-06 22:07:48.126919 +0200 info  main: #173 C MITPROHB rx msg h
> 964,0,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:03.089031 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-10,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:05.041574 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-20,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:05.052644 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-30,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:05.063617 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-49,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:05.077527 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-59,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:05.193775 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-70,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:13.190645 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-80,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:22.077994 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-90,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:09:54.590300 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-109,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:11:10.127054 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-119,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:11:16.794200 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-130,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:11:22.455652 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-140,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:12:49.423412 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-150,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:12:49.442096 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-169,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:12:49.461941 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-179,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:14:39.828133 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-190,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:14:39.858144 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-200,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:14:52.020319 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-210,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:14:54.452637 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-229,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:15:12.613935 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-239,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:15:35.223845 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-250,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:16:09.255059 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-260,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:17:31.919754 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-270,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:19:23.366267 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-289,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:21:57.344609 +0200 info  main: #173 C MITPROHB rx msg h
> 964,-299,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:23:40.082406 +0200 info  main: #31 C MITPROHB rx msg h
> 964,-1027,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
> 2019-09-06 22:25:58.061883 +0200 info  main: #31 C MITPROHB rx msg h
> 964,-1040,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>
>
> This shows the ticker was only run 299 times from 22:07:48 to 22:21:57.
>
> After 22:21:57 the workaround was triggered and did a reconnect.
> Apparently during that network reinitialization of 103 seconds, the per
> second ticker was run 628 times.
>
> That can't be catching up on the event queue, as that queue has only 20
> slots. So something strange is going on here.
>
> Any ideas?
>
> Regards,
> Michael
>
>
> Am 06.09.19 um 08:04 schrieb Michael Balzer:
>
> Mark & anyone else running a V2 server,
>
> as most cars don't send history records, this also needs the change to the
> server I just pushed, i.e. server version 2.4.2.
>
>
> https://github.com/openvehicles/Open-Vehicle-Monitoring-System/commits/master
>
> Regards,
> Michael
>
>
> Am 05.09.19 um 19:55 schrieb Michael Balzer:
>
> I've pushed the nasty workaround: the v2 server checks for no RX over 15
> minutes, then restarts the network (wifi & modem) as configured for
> autostart.
>
> Rolled out on my server in edge as 3.2.002-237-ge075f655.
>
> Please test.
>
> Regards,
> Michael
>
>
> Am 05.09.19 um 01:58 schrieb Mark Webb-Johnson:
>
> Mark, you can check your server logs for history messages with ridiculous
> time offsets:
>
> [sddexter at ns27 server]$ cat log-20190903 | egrep "rx msg h
> [0-9]+,-[0-9]{4}" | wc -l
> 455283
>
>
> I checked my logs and see 12 vehicles showing this. But, 2 only show this
> for a debugcrash log (which is expected, I guess, if the time is not synced
> at report time). I’ve got 4 cars with the offset > 10,000.
>
> Regards, Mark.
>
> On 4 Sep 2019, at 4:45 AM, Michael Balzer <dexter at expeedo.de> wrote:
>
> Everyone,
>
> I've pushed a change that needs some testing.
>
> I had the issue myself now parking at a certain distance from my garage
> wifi AP, i.e. on the edge of "in", after wifi had been disconnected for
> some hours, and with the module still connected via modem. The wifi blob
> had been trying to connect to the AP for about two hours.
>
> As seen before, the module saw no error, just the server responses and
> commands stopped coming in. I noticed the default interface was still "st1"
> despite wifi having been disconnected and modem connected. The DNS was also
> still configured for my wifi network, and the interface seemed to have an
> IP address -- but wasn't pingable from the wifi network.
>
> A power cycle of the modem solved the issue without reboot. So the cause
> may be in the modem/ppp subsystem, or it may be related (in some weird way)
> to the default interface / DNS setup.
>
> More tests showed the default interface again/still got set by the wifi
> blob itself at some point, overriding our modem prioritization. The events
> we didn't handle up to now were "sta.connected" and "sta.lostip", so I
> added these, and the bug didn't show up again since then. That doesn't mean
> anything, so we need to test this.
>
> The default interface really shouldn't affect inbound packet routing of an
> established connection, but there always may be strange bugs lurking in
> those libs.
>
> The change also reimplements the wifi signal strength reading, as the
> tests also showed that still wasn't working well using the CSI callback. It
> now seems to be much more reliable.
>
> Please test & report. The single module will be hard to test, as the bug
> isn't reproducable easily, but you can still try if wifi / modem
> transitions work well.
>
> Mark, you can check your server logs for history messages with ridiculous
> time offsets:
>
> [sddexter at ns27 server]$ cat log-20190903 | egrep "rx msg h
> [0-9]+,-[0-9]{4}" | wc -l
> 455283
>
> The bug now severely affects the V2 server performance, as the server is
> single threaded and doesn't scale very well to this kind of bulk data
> bursts, especially when coming from multiple modules in parallel. So we
> really need to solve this now. Slow reactions or connection drops from my
> server lately have been due to this bug. If this change doesn't solve it,
> we'll need to add some reboot trigger on "too many server v2 notification
> retransmissions" -- or maybe a modem power cycle will do, that wouldn't
> discard the data.
>
> Thanks,
> Michael
>
>
> Am 03.09.19 um 07:46 schrieb Mark Webb-Johnson:
>
> No problem. We can hold. I won’t commit anything for the next few days (and agree to hold-off on Markos’s pull). Let me know when you are ready.
>
> Regards, Mark.
>
>
> On 3 Sep 2019, at 1:58 AM, Michael Balzer <dexter at expeedo.de> <dexter at expeedo.de> wrote:
>
> Mark, please wait.
>
> I may just have found the cause for issue #241, or at least something I need to investigate before releasing.
>
> I need to dig into my logs first, and try something.
>
> Regards,
> Michael
>
>
> Am 02.09.19 um 12:23 schrieb Michael Balzer:
>
> Nothing open from my side at the moment.
>
> I haven't had the time to look in to Markos pull request, but from a first check also think that's going too deep to be included in this release.
>
> Regards,
> Michael
>
>
> Am 02.09.19 um 04:15 schrieb Mark Webb-Johnson:
>
> I think it is well past time for a 3.2.003 release. Things seems table in edge (although some things only partially implemented).
>
> Anything people want to include at the last minute, or can we go ahead and build?
>
> Regards, Mark.
>
> _______________________________________________
> OvmsDev mailing listOvmsDev at lists.openvehicles.comhttp://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
> --
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing listOvmsDev at lists.openvehicles.comhttp://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
> _______________________________________________
> OvmsDev mailing listOvmsDev at lists.openvehicles.comhttp://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> --
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
>
> _______________________________________________
> OvmsDev mailing listOvmsDev at lists.openvehicles.comhttp://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> --
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
>
> _______________________________________________
> OvmsDev mailing listOvmsDev at lists.openvehicles.comhttp://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> --
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
>
> _______________________________________________
> OvmsDev mailing listOvmsDev at lists.openvehicles.comhttp://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> --
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
>
> _______________________________________________
> OvmsDev mailing listOvmsDev at lists.openvehicles.comhttp://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> --
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20190907/0d95cd70/attachment.htm>


More information about the OvmsDev mailing list