The percentages are calculated from the time slice counter differences between two command invocations (or boot), but their 32 bit limit shouldn't have an effect that soon.
Also strange: your esp_timer SPIRAM allocation…
Is that a build with toolkit -98?
Regards,
Michael
Am 20.09.19 um 03:29 schrieb Mark Webb-Johnson:
Strange. Connected to my car, and from ovms_shell asked for ‘module tasks’:
ovms> module tasks...Vehicle Response:Number of Tasks = 17 Stack: Now Max Total Heap 32-bit SPIRAM C# PRI CPU%3FFAFBF4 1 Blk esp_timer 400 656 4096 35944 6442122956 0 22 2%3FFBDBBC 2 Blk eventTask 440 1928 4608 140 0 0 0 20 0%3FFBFF00 3 Blk OVMS Events 444 3404 8192 70816 0 21092 1 5 9%3FFC3300 4 Blk OVMS DukTape 460 9548 12288 188 0 31020 1 3 12%3FFC4FF4 5 Blk OVMS CanRx 432 832 2048 3144 0 32072 0 23 120%3FFC5D3C 6 Blk ipc0 392 504 1024 7804 0 0 0 24 2%3FFC6340 7 Blk ipc1 396 444 1024 12 0 0 1 24 0%3FFC8194 10 Rdy IDLE0 416 512 1024 0 0 0 0 0 362%3FFC872C 11 Rdy IDLE1 408 504 1024 0 0 0 1 0 387%3FFC94C4 12 Blk Tmr Svc 352 912 3072 88 0 0 0 20 1%3FFC6794 17 Blk tiT 504 2536 3072 7668 0 0 * 18 6%3FFCF3E0 18 Blk OVMS SIMCOM 460 2412 4096 40 0 3892 1 5 1%3FFD0C10 19 Blk wifi 460 2716 3584 36356 0 2912 0 22 23%3FFDBD70 20 Blk OVMS Vehicle 460 1308 6144 0 0 2336 1 10 95%3FFDF0FC 21 Rdy OVMS Console 488 1736 6144 0 0 20 1 5 0%3FFDF64C 22 Blk mdns 472 1576 4096 108 0 0 0 1 1%3FFE38F0 35 Rdy OVMS NetMan 1384 2984 8192 144 25180 5004 1 5 3%
ovms> module tasks...Vehicle Response:Number of Tasks = 17 Stack: Now Max Total Heap 32-bit SPIRAM C# PRI CPU%3FFAFBF4 1 Blk esp_timer 400 656 4096 35944 6442122956 0 22 0%3FFBDBBC 2 Blk eventTask 440 1928 4608 140 0 0 0 20 0%3FFBFF00 3 Blk OVMS Events 444 3404 8192 70816 0 21092 1 5 0%3FFC3300 4 Blk OVMS DukTape 460 9548 12288 188 0 23712 1 3 1%3FFC4FF4 5 Blk OVMS CanRx 432 832 2048 3144 0 32072 0 23 4%3FFC5D3C 6 Blk ipc0 392 504 1024 7804 0 0 0 24 0%3FFC6340 7 Blk ipc1 396 444 1024 12 0 0 1 24 0%3FFC8194 10 Rdy IDLE0 416 512 1024 0 0 0 0 0 96%3FFC872C 11 Rdy IDLE1 408 504 1024 0 0 0 1 0 95%3FFC94C4 12 Blk Tmr Svc 352 912 3072 88 0 0 0 20 0%3FFC6794 17 Blk tiT 504 2536 3072 7668 0 0 * 18 0%3FFCF3E0 18 Blk OVMS SIMCOM 668 2412 4096 40 0 3892 1 5 0%3FFD0C10 19 Blk wifi 460 2716 3584 36356 0 2912 0 22 0%3FFDBD70 20 Blk OVMS Vehicle 460 1308 6144 0 0 2336 1 10 3%3FFDF0FC 21 Blk OVMS Console 552 1736 6144 0 0 20 1 5 0%3FFDF64C 22 Blk mdns 472 1576 4096 108 0 0 0 1 0%3FFE38F0 35 Rdy OVMS NetMan 1896 3240 8192 400 25180 5392 1 5 0%
The second one is what I normally see, but the first is bizarre. Any idea what the time frame for the CPU% is?
Regards, Mark
On 8 Sep 2019, at 5:43 PM, Michael Balzer <dexter@expeedo.de> wrote:
I've pushed some modifications and improvements to (hopefully) fix the timer issue or at least be able to debug it.
Some sdkconfig changes are necessary.
The build including these updates is on my edge release as 3.2.002-258-g20ae554b.
Btw: the network restart strategy seems to mitigate issue #241; I've seen a major drop on record repetitions on my server since the rollout.
commit 99e4e48bdd40b7004c0976f51aba9e3da4ecab53
Module: add per task CPU usage statistics, add task stats history records
To enable CPU usage statistics, apply the changes to sdkconfig
included. The CPU usage shown by the commands is calculated against
the last task status retrieved (or system boot).
Command changes:
- "module tasks" -- added CPU (core) usage in percent per task
New command:
- "module tasks data" -- output task stats in history record form
New config:
- [module] debug.tasks -- yes (default) = send task stats every 5 minutes
New history record:
- "*-OVM-DebugTasks" v1: <taskcnt,totaltime> + per task:
<tasknum,name,state,stack_now,stack_max,stack_total,
heap_total,heap_32bit,heap_spi,runtime>
Note: CPU core use percentage = runtime / totaltime
commit 950172c216a72beb4da0bc7a40a46995a6105955
Build config: default timer service task priority raised to 20
Background: the FreeRTOS timer service shall only be used for very
short and non-blocking jobs. We delegate event processing to our
events task, anything else in timers needs to run with high
priority.
commit 31ac19d187480046c16356b80668de45cacbb83d
DukTape: add build config for task priority, default lowered to 3
Background: the DukTape garbage collector shall run on lower
priority than tasks like SIMCOM & events
commit e0a44791fbcfb5a4e4cad24c9d1163b76e637b4f
Server V2: use esp_log_timestamp for timeout detection,
add timeout config, limit data records & size per second
New config:
- [server.v2] timeout.rx -- timeout in seconds, default 960
commit 684a4ce9525175a910040f0d1ca82ac212fbf5de
Notify: use esp_log_timestamp for creation time instead of monotonictime
to harden against timer service starvation / ticker event drops
Regards,
Michael
Am 07.09.19 um 10:55 schrieb Michael Balzer:
I think the RTOS timer service task starves. It's running on core 0 with priority 1.
Taks on core 0 sorted by priority:
Number of Tasks = 20 Stack: Now Max Total Heap 32-bit SPIRAM C# PRI
3FFC84A8 6 Blk ipc0 388 500 1024 7788 0 0 0 24
3FFC77F0 5 Blk OVMS CanRx 428 428 2048 3052 0 31844 0 23
3FFAFBF4 1 Blk esp_timer 400 656 4096 35928 644 25804 0 22
3FFD3240 19 Blk wifi 460 2716 3584 43720 0 20 0 22
3FFC03C4 2 Blk eventTask 448 1984 4608 104 0 0 0 20
3FFC8F14 17 Blk tiT 500 2308 3072 6552 0 0 * 18
3FFE14F0 26 Blk OVMS COrx 456 456 4096 0 0 0 0 7
3FFE19D4 27 Blk OVMS COwrk 476 476 3072 0 0 0 0 7
3FFCBC34 12 Blk Tmr Svc 352 928 3072 88 0 0 0 1
3FFE7708 23 Blk mdns 468 1396 4096 108 0 0 0 1
I don't think it's our CanRx, as that only fetches and queues CAN frames, the actual work is done by the listeners. The CO tasks only run for CANopen jobs, which are few for normal operation.
That leaves the system tasks, with main suspect -once again- the wifi blob.
We need to know how much CPU time the tasks actually use now. I think I saw some option for this in the FreeRTOS config.
Regards,
Michael
Am 06.09.19 um 23:15 schrieb Michael Balzer:
The workaround is based on the monotonictime being updated per second, as do the history record offsets.
Apparently, that mechanism doesn't work reliably. That may be an indicator for some bigger underlying issue.
Example log excerpt:
2019-09-06 22:07:48.126919 +0200 info main: #173 C MITPROHB rx msg h 964,0,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:03.089031 +0200 info main: #173 C MITPROHB rx msg h 964,-10,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.041574 +0200 info main: #173 C MITPROHB rx msg h 964,-20,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.052644 +0200 info main: #173 C MITPROHB rx msg h 964,-30,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.063617 +0200 info main: #173 C MITPROHB rx msg h 964,-49,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.077527 +0200 info main: #173 C MITPROHB rx msg h 964,-59,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.193775 +0200 info main: #173 C MITPROHB rx msg h 964,-70,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:13.190645 +0200 info main: #173 C MITPROHB rx msg h 964,-80,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:22.077994 +0200 info main: #173 C MITPROHB rx msg h 964,-90,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:54.590300 +0200 info main: #173 C MITPROHB rx msg h 964,-109,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:11:10.127054 +0200 info main: #173 C MITPROHB rx msg h 964,-119,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:11:16.794200 +0200 info main: #173 C MITPROHB rx msg h 964,-130,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:11:22.455652 +0200 info main: #173 C MITPROHB rx msg h 964,-140,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:12:49.423412 +0200 info main: #173 C MITPROHB rx msg h 964,-150,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:12:49.442096 +0200 info main: #173 C MITPROHB rx msg h 964,-169,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:12:49.461941 +0200 info main: #173 C MITPROHB rx msg h 964,-179,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:39.828133 +0200 info main: #173 C MITPROHB rx msg h 964,-190,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:39.858144 +0200 info main: #173 C MITPROHB rx msg h 964,-200,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:52.020319 +0200 info main: #173 C MITPROHB rx msg h 964,-210,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:54.452637 +0200 info main: #173 C MITPROHB rx msg h 964,-229,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:15:12.613935 +0200 info main: #173 C MITPROHB rx msg h 964,-239,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:15:35.223845 +0200 info main: #173 C MITPROHB rx msg h 964,-250,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:16:09.255059 +0200 info main: #173 C MITPROHB rx msg h 964,-260,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:17:31.919754 +0200 info main: #173 C MITPROHB rx msg h 964,-270,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:19:23.366267 +0200 info main: #173 C MITPROHB rx msg h 964,-289,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:21:57.344609 +0200 info main: #173 C MITPROHB rx msg h 964,-299,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:23:40.082406 +0200 info main: #31 C MITPROHB rx msg h 964,-1027,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:25:58.061883 +0200 info main: #31 C MITPROHB rx msg h 964,-1040,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
This shows the ticker was only run 299 times from 22:07:48 to 22:21:57.
After 22:21:57 the workaround was triggered and did a reconnect. Apparently during that network reinitialization of 103 seconds, the per second ticker was run 628 times.
That can't be catching up on the event queue, as that queue has only 20 slots. So something strange is going on here.
Any ideas?
Regards,
Michael
Am 06.09.19 um 08:04 schrieb Michael Balzer:
Mark & anyone else running a V2 server,
as most cars don't send history records, this also needs the change to the server I just pushed, i.e. server version 2.4.2.
https://github.com/openvehicles/Open-Vehicle-Monitoring-System/commits/master <https://github.com/openvehicles/Open-Vehicle-Monitoring-System/commits/master>
Regards,
Michael
Am 05.09.19 um 19:55 schrieb Michael Balzer:
I've pushed the nasty workaround: the v2 server checks for no RX over 15 minutes, then restarts the network (wifi & modem) as configured for autostart.
Rolled out on my server in edge as 3.2.002-237-ge075f655.
Please test.
Regards,
Michael
Am 05.09.19 um 01:58 schrieb Mark Webb-Johnson:
Mark, you can check your server logs for history messages with ridiculous time offsets:
[sddexter@ns27 server]$ cat log-20190903 | egrep "rx msg h [0-9]+,-[0-9]{4}" | wc -l
455283
I checked my logs and see 12 vehicles showing this. But, 2 only show this for a debugcrash log (which is expected, I guess, if the time is not synced at report time). I’ve got 4 cars with the offset > 10,000.
Regards, Mark.
On 4 Sep 2019, at 4:45 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Everyone,
I've pushed a change that needs some testing.
I had the issue myself now parking at a certain distance from my garage wifi AP, i.e. on the edge of "in", after wifi had been disconnected for some hours, and with the module still connected via modem. The wifi blob had been trying to connect to the AP for about two hours.
As seen before, the module saw no error, just the server responses and commands stopped coming in. I noticed the default interface was still "st1" despite wifi having been disconnected and modem connected. The DNS was also still configured for my wifi network, and the interface seemed to have an IP address -- but wasn't pingable from the wifi network.
A power cycle of the modem solved the issue without reboot. So the cause may be in the modem/ppp subsystem, or it may be related (in some weird way) to the default interface / DNS setup.
More tests showed the default interface again/still got set by the wifi blob itself at some point, overriding our modem prioritization. The events we didn't handle up to now were "sta.connected" and "sta.lostip", so I added these, and the bug didn't show up again since then. That doesn't mean anything, so we need to test this.
The default interface really shouldn't affect inbound packet routing of an established connection, but there always may be strange bugs lurking in those libs.
The change also reimplements the wifi signal strength reading, as the tests also showed that still wasn't working well using the CSI callback. It now seems to be much more reliable.
Please test & report. The single module will be hard to test, as the bug isn't reproducable easily, but you can still try if wifi / modem transitions work well.
Mark, you can check your server logs for history messages with ridiculous time offsets:
[sddexter@ns27 server]$ cat log-20190903 | egrep "rx msg h [0-9]+,-[0-9]{4}" | wc -l
455283
The bug now severely affects the V2 server performance, as the server is single threaded and doesn't scale very well to this kind of bulk data bursts, especially when coming from multiple modules in parallel. So we really need to solve this now. Slow reactions or connection drops from my server lately have been due to this bug. If this change doesn't solve it, we'll need to add some reboot trigger on "too many server v2 notification retransmissions" -- or maybe a modem power cycle will do, that wouldn't discard the data.
Thanks,
Michael
Am 03.09.19 um 07:46 schrieb Mark Webb-Johnson:
No problem. We can hold. I won’t commit anything for the next few days (and agree to hold-off on Markos’s pull). Let me know when you are ready.
Regards, Mark.
On 3 Sep 2019, at 1:58 AM, Michael Balzer <dexter@expeedo.de> <mailto:dexter@expeedo.de> wrote:
Mark, please wait.
I may just have found the cause for issue #241, or at least something I need to investigate before releasing.
I need to dig into my logs first, and try something.
Regards,
Michael
Am 02.09.19 um 12:23 schrieb Michael Balzer:
Nothing open from my side at the moment.
I haven't had the time to look in to Markos pull request, but from a first check also think that's going too deep to be included in this release.
Regards,
Michael
Am 02.09.19 um 04:15 schrieb Mark Webb-Johnson:
I think it is well past time for a 3.2.003 release. Things seems table in edge (although some things only partially implemented).
Anything people want to include at the last minute, or can we go ahead and build?
Regards, Mark.
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26