No, the 32 bit limit indeed has an effect pretty soon. I had assumed they are "ticks", but they really are at the timer resolution (see the DebugTasks records). The timer resolution is currently 1 MHz, so the counters wrap around every 71.6 minutes. So if you don't enable the debug.tasks report, the first manual report is likely to look strange.

We can get a checkpoint every five minutes regardless of the report activation. Or we can add a measurement delay to the command.

Regards,
Michael

Am 20.09.19 um 16:16 schrieb Michael Balzer:
The percentages are calculated from the time slice counter differences between two command invocations (or boot), but their 32 bit limit shouldn't have an effect that soon.
Also strange: your esp_timer SPIRAM allocation…
Is that a build with toolkit -98?
Regards, Michael
Am 20.09.19 um 03:29 schrieb Mark Webb-Johnson:
Strange. Connected to my car, and from ovms_shell asked for ‘module tasks’:
ovms> module tasks
...
Vehicle Response: Number of Tasks = 17
         Stack:           Now  Max  Total  Heap  32-bit SPIRAM C# PRI CPU%
3FFAFBF4  1 Blk esp_timer     400  656  4096 35944 6442122956 0 22 2%
3FFBDBBC  2 Blk eventTask     440 1928  4608   140      0     0 0 20 0%
3FFBFF00  3 Blk OVMS Events   444 3404  8192 70816      0 21092 1  5 9%
3FFC3300  4 Blk OVMS DukTape  460 9548 12288   188      0 31020 1  3 12%
3FFC4FF4  5 Blk OVMS CanRx    432  832  2048  3144      0 32072 0 23 120%
3FFC5D3C  6 Blk ipc0          392  504  1024  7804      0     0 0 24 2%
3FFC6340  7 Blk ipc1          396  444  1024    12      0     0 1 24 0%
3FFC8194 10 Rdy IDLE0         416  512  1024     0      0     0 0  0 362%
3FFC872C 11 Rdy IDLE1         408  504  1024     0      0     0 1  0 387%
3FFC94C4 12 Blk Tmr Svc       352  912  3072    88      0     0 0 20 1%
3FFC6794 17 Blk tiT           504 2536  3072  7668      0     0 * 18 6%
3FFCF3E0 18 Blk OVMS SIMCOM   460 2412  4096    40      0  3892 1  5 1%
3FFD0C10 19 Blk wifi          460 2716  3584 36356      0  2912 0 22 23%
3FFDBD70 20 Blk OVMS Vehicle  460 1308  6144     0      0  2336 1 10 95%
3FFDF0FC 21 Rdy OVMS Console  488 1736  6144     0      0    20 1  5 0%
3FFDF64C 22 Blk mdns          472 1576  4096   108      0     0 0  1 1%
3FFE38F0 35 Rdy OVMS NetMan  1384 2984  8192   144  25180  5004 1  5 3%
ovms> module tasks
...
Vehicle Response: Number of Tasks = 17
         Stack:           Now  Max  Total  Heap  32-bit SPIRAM C# PRI CPU%
3FFAFBF4  1 Blk esp_timer     400  656  4096 35944 6442122956 0 22 0%
3FFBDBBC  2 Blk eventTask     440 1928  4608   140      0     0 0 20 0%
3FFBFF00  3 Blk OVMS Events   444 3404  8192 70816      0 21092 1  5 0%
3FFC3300  4 Blk OVMS DukTape  460 9548 12288   188      0 23712 1  3 1%
3FFC4FF4  5 Blk OVMS CanRx    432  832  2048  3144      0 32072 0 23 4%
3FFC5D3C  6 Blk ipc0          392  504  1024  7804      0     0 0 24 0%
3FFC6340  7 Blk ipc1          396  444  1024    12      0     0 1 24 0%
3FFC8194 10 Rdy IDLE0         416  512  1024     0      0     0 0  0 96%
3FFC872C 11 Rdy IDLE1         408  504  1024     0      0     0 1  0 95%
3FFC94C4 12 Blk Tmr Svc       352  912  3072    88      0     0 0 20 0%
3FFC6794 17 Blk tiT           504 2536  3072  7668      0     0 * 18 0%
3FFCF3E0 18 Blk OVMS SIMCOM   668 2412  4096    40      0  3892 1  5 0%
3FFD0C10 19 Blk wifi          460 2716  3584 36356      0  2912 0 22 0%
3FFDBD70 20 Blk OVMS Vehicle  460 1308  6144     0      0  2336 1 10 3%
3FFDF0FC 21 Blk OVMS Console  552 1736  6144     0      0    20 1  5 0%
3FFDF64C 22 Blk mdns          472 1576  4096   108      0     0 0  1 0%
3FFE38F0 35 Rdy OVMS NetMan  1896 3240  8192   400  25180  5392 1  5 0%
The second one is what I normally see, but the first is bizarre. Any idea what the time frame for the CPU% is?
Regards, Mark
> On 8 Sep 2019, at 5:43 PM, Michael Balzer <dexter@expeedo.de> wrote:
>
> I've pushed some modifications and improvements to (hopefully) fix the timer issue or at least be able to debug it.
>
> Some sdkconfig changes are necessary.
>
> The build including these updates is on my edge release as 3.2.002-258-g20ae554b.
>
> Btw: the network restart strategy seems to mitigate issue #241; I've seen a major drop in record repetitions on my server since the rollout.
>
> commit 99e4e48bdd40b7004c0976f51aba9e3da4ecab53
>
>     Module: add per task CPU usage statistics, add task stats history records
>
>     To enable CPU usage statistics, apply the changes to sdkconfig
>     included. The CPU usage shown by the commands is calculated against
>     the last task status retrieved (or system boot).
>
>     Command changes:
>     - "module tasks" -- added CPU (core) usage in percent per task
>
>     New command:
>     - "module tasks data" -- output task stats in history record form
>
>     New config:
>     - [module] debug.tasks -- yes (default) = send task stats every 5 minutes
>
>     New history record:
>     - "*-OVM-DebugTasks" v1: <taskcnt,totaltime> + per task:
>       <tasknum,name,state,stack_now,stack_max,stack_total,
>        heap_total,heap_32bit,heap_spi,runtime>
>       Note: CPU core use percentage = runtime / totaltime
>
> commit 950172c216a72beb4da0bc7a40a46995a6105955
>
>     Build config: default timer service task priority raised to 20
>
>     Background: the FreeRTOS timer service shall only be used for very
>     short and non-blocking jobs. We delegate event processing to our
>     events task; anything else in timers needs to run with high
>     priority.
> commit 31ac19d187480046c16356b80668de45cacbb83d
>
>     DukTape: add build config for task priority, default lowered to 3
>
>     Background: the DukTape garbage collector shall run on lower
>     priority than tasks like SIMCOM & events
>
> commit e0a44791fbcfb5a4e4cad24c9d1163b76e637b4f
>
>     Server V2: use esp_log_timestamp for timeout detection,
>     add timeout config, limit data records & size per second
>
>     New config:
>     - [server.v2] timeout.rx -- timeout in seconds, default 960
>
> commit 684a4ce9525175a910040f0d1ca82ac212fbf5de
>
>     Notify: use esp_log_timestamp for creation time instead of monotonictime
>     to harden against timer service starvation / ticker event drops
>
> Regards,
> Michael
>
> Am 07.09.19 um 10:55 schrieb Michael Balzer:
>> I think the RTOS timer service task starves. It's running on core 0 with priority 1.
>>
>> Tasks on core 0, sorted by priority:
>>
>> Number of Tasks = 20
>>          Stack:          Now  Max Total  Heap 32-bit SPIRAM C# PRI
>> 3FFC84A8  6 Blk ipc0         388  500 1024  7788     0     0 0 24
>> 3FFC77F0  5 Blk OVMS CanRx   428  428 2048  3052     0 31844 0 23
>> 3FFAFBF4  1 Blk esp_timer    400  656 4096 35928   644 25804 0 22
>> 3FFD3240 19 Blk wifi         460 2716 3584 43720     0    20 0 22
>> 3FFC03C4  2 Blk eventTask    448 1984 4608   104     0     0 0 20
>> 3FFC8F14 17 Blk tiT          500 2308 3072  6552     0     0 * 18
>> 3FFE14F0 26 Blk OVMS COrx    456  456 4096     0     0     0 0  7
>> 3FFE19D4 27 Blk OVMS COwrk   476  476 3072     0     0     0 0  7
>> 3FFCBC34 12 Blk Tmr Svc      352  928 3072    88     0     0 0  1
>> 3FFE7708 23 Blk mdns         468 1396 4096   108     0     0 0  1
>>
>> I don't think it's our CanRx, as that only fetches and queues CAN frames; the actual work is done by the listeners. The CO tasks only run for CANopen jobs, which are few in normal operation.
>>
>> That leaves the system tasks, with the main suspect -- once again -- the wifi blob.
>>
>> We need to know how much CPU time the tasks actually use now. I think I saw an option for this in the FreeRTOS config.
>>
>> Regards,
>> Michael
>>
>> Am 06.09.19 um 23:15 schrieb Michael Balzer:
>>> The workaround is based on the monotonictime being updated per second, as are the history record offsets.
>>>
>>> Apparently, that mechanism doesn't work reliably. That may be an indicator of some bigger underlying issue.
>>>
>>> Example log excerpt:
>>>
>>> 2019-09-06 22:07:48.126919 +0200 info main: #173 C MITPROHB rx msg h 964,0,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:03.089031 +0200 info main: #173 C MITPROHB rx msg h 964,-10,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:05.041574 +0200 info main: #173 C MITPROHB rx msg h 964,-20,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:05.052644 +0200 info main: #173 C MITPROHB rx msg h 964,-30,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:05.063617 +0200 info main: #173 C MITPROHB rx msg h 964,-49,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:05.077527 +0200 info main: #173 C MITPROHB rx msg h 964,-59,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:05.193775 +0200 info main: #173 C MITPROHB rx msg h 964,-70,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:13.190645 +0200 info main: #173 C MITPROHB rx msg h 964,-80,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:22.077994 +0200 info main: #173 C MITPROHB rx msg h 964,-90,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:09:54.590300 +0200 info main: #173 C MITPROHB rx msg h 964,-109,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:11:10.127054 +0200 info main: #173 C MITPROHB rx msg h 964,-119,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:11:16.794200 +0200 info main: #173 C MITPROHB rx msg h 964,-130,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:11:22.455652 +0200 info main: #173 C MITPROHB rx msg h 964,-140,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:12:49.423412 +0200 info main: #173 C MITPROHB rx msg h 964,-150,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:12:49.442096 +0200 info main: #173 C MITPROHB rx msg h 964,-169,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:12:49.461941 +0200 info main: #173 C MITPROHB rx msg h 964,-179,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:14:39.828133 +0200 info main: #173 C MITPROHB rx msg h 964,-190,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:14:39.858144 +0200 info main: #173 C MITPROHB rx msg h 964,-200,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:14:52.020319 +0200 info main: #173 C MITPROHB rx msg h 964,-210,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:14:54.452637 +0200 info main: #173 C MITPROHB rx msg h 964,-229,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:15:12.613935 +0200 info main: #173 C MITPROHB rx msg h 964,-239,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:15:35.223845 +0200 info main: #173 C MITPROHB rx msg h 964,-250,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:16:09.255059 +0200 info main: #173 C MITPROHB rx msg h 964,-260,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:17:31.919754 +0200 info main: #173 C MITPROHB rx msg h 964,-270,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:19:23.366267 +0200 info main: #173 C MITPROHB rx msg h 964,-289,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:21:57.344609 +0200 info main: #173 C MITPROHB rx msg h 964,-299,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:23:40.082406 +0200 info main: #31 C MITPROHB rx msg h 964,-1027,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>> 2019-09-06 22:25:58.061883 +0200 info main: #31 C MITPROHB rx msg h 964,-1040,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
>>>
>>> This shows the ticker was only run 299 times from 22:07:48 to 22:21:57.
>>>
>>> After 22:21:57 the workaround was triggered and did a reconnect. Apparently during that network reinitialization of 103 seconds, the per second ticker was run 628 times.
>>>
>>> That can't be catching up on the event queue, as that queue has only 20 slots. So something strange is going on here.
>>>
>>> Any ideas?
>>>
>>> Regards,
>>> Michael
>>>
>>> Am 06.09.19 um 08:04 schrieb Michael Balzer:
>>>> Mark & anyone else running a V2 server,
>>>>
>>>> as most cars don't send history records, this also needs the change to the server I just pushed, i.e. server version 2.4.2.
>>>>
>>>> https://github.com/openvehicles/Open-Vehicle-Monitoring-System/commits/master
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>> Am 05.09.19 um 19:55 schrieb Michael Balzer:
>>>>> I've pushed the nasty workaround: the v2 server checks for no RX over 15 minutes, then restarts the network (wifi & modem) as configured for autostart.
>>>>>
>>>>> Rolled out on my server in edge as 3.2.002-237-ge075f655.
>>>>>
>>>>> Please test.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>> Am 05.09.19 um 01:58 schrieb Mark Webb-Johnson:
>>>>>>> Mark, you can check your server logs for history messages with ridiculous time offsets:
>>>>>>> [sddexter@ns27 server]$ cat log-20190903 | egrep "rx msg h [0-9]+,-[0-9]{4}" | wc -l
>>>>>>> 455283
>>>>>>
>>>>>> I checked my logs and see 12 vehicles showing this. But 2 only show this for a debugcrash log (which is expected, I guess, if the time is not synced at report time). I've got 4 cars with the offset > 10,000.
>>>>>>
>>>>>> Regards, Mark.
>>>>>>> On 4 Sep 2019, at 4:45 AM, Michael Balzer <dexter@expeedo.de> wrote:
>>>>>>>
>>>>>>> Everyone,
>>>>>>>
>>>>>>> I've pushed a change that needs some testing.
>>>>>>>
>>>>>>> I had the issue myself now, parking at a certain distance from my garage wifi AP, i.e. on the edge of "in", after wifi had been disconnected for some hours, and with the module still connected via modem. The wifi blob had been trying to connect to the AP for about two hours.
>>>>>>>
>>>>>>> As seen before, the module saw no error; just the server responses and commands stopped coming in. I noticed the default interface was still "st1" despite wifi having been disconnected and modem connected. The DNS was also still configured for my wifi network, and the interface seemed to have an IP address -- but wasn't pingable from the wifi network.
>>>>>>>
>>>>>>> A power cycle of the modem solved the issue without reboot. So the cause may be in the modem/ppp subsystem, or it may be related (in some weird way) to the default interface / DNS setup.
>>>>>>>
>>>>>>> More tests showed the default interface again/still got set by the wifi blob itself at some point, overriding our modem prioritization. The events we didn't handle up to now were "sta.connected" and "sta.lostip", so I added these, and the bug hasn't shown up again since then. That doesn't mean anything, so we need to test this.
>>>>>>>
>>>>>>> The default interface really shouldn't affect inbound packet routing of an established connection, but there may always be strange bugs lurking in those libs.
>>>>>>>
>>>>>>> The change also reimplements the wifi signal strength reading, as the tests showed that it still wasn't working well using the CSI callback. It now seems to be much more reliable.
>>>>>>>
>>>>>>> Please test & report. A single module will be hard to test with, as the bug isn't easily reproducible, but you can still try whether wifi / modem transitions work well.
>>>>>>>
>>>>>>> Mark, you can check your server logs for history messages with ridiculous time offsets:
>>>>>>> [sddexter@ns27 server]$ cat log-20190903 | egrep "rx msg h [0-9]+,-[0-9]{4}" | wc -l
>>>>>>> 455283
>>>>>>>
>>>>>>> The bug now severely affects the V2 server performance, as the server is single threaded and doesn't scale well to this kind of bulk data burst, especially when coming from multiple modules in parallel. So we really need to solve this now. Slow reactions or connection drops from my server lately have been due to this bug. If this change doesn't solve it, we'll need to add a reboot trigger on "too many server v2 notification retransmissions" -- or maybe a modem power cycle will do; that wouldn't discard the data.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
>>>>>>>
>>>>>>> Am 03.09.19 um 07:46 schrieb Mark Webb-Johnson:
>>>>>>>> No problem. We can hold. I won't commit anything for the next few days (and agree to hold off on Markos's pull). Let me know when you are ready.
>>>>>>>>
>>>>>>>> Regards, Mark.
>>>>>>>>
>>>>>>>>> On 3 Sep 2019, at 1:58 AM, Michael Balzer <dexter@expeedo.de> wrote:
>>>>>>>>>
>>>>>>>>> Mark, please wait.
>>>>>>>>>
>>>>>>>>> I may just have found the cause of issue #241, or at least something I need to investigate before releasing.
>>>>>>>>>
>>>>>>>>> I need to dig into my logs first, and try something.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>> Am 02.09.19 um 12:23 schrieb Michael Balzer:
>>>>>>>>>> Nothing open from my side at the moment.
>>>>>>>>>>
>>>>>>>>>> I haven't had the time to look into Markos's pull request, but from a first check I also think it's going too deep to be included in this release.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>> Am 02.09.19 um 04:15 schrieb Mark Webb-Johnson:
>>>>>>>>>>> I think it is well past time for a 3.2.003 release. Things seem stable in edge (although some things are only partially implemented).
>>>>>>>>>>>
>>>>>>>>>>> Anything people want to include at the last minute, or can we go ahead and build?
>>>>>>>>>>>
>>>>>>>>>>> Regards, Mark.
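For anyone wanting to reproduce the per-task CPU statistics from the quoted commits: enabling FreeRTOS run time stats in ESP-IDF is typically done via sdkconfig options along these lines. This is a sketch under assumptions -- the exact option set shipped with the commit may differ, and option names can vary between IDF versions:

```
# Assumed typical ESP-IDF options for FreeRTOS run time statistics;
# verify against the sdkconfig changes included with the commit:
CONFIG_FREERTOS_USE_TRACE_FACILITY=y
CONFIG_FREERTOS_USE_STATS_FORMATTING_FUNCTIONS=y
CONFIG_FREERTOS_GENERATE_RUN_TIME_STATS=y
```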
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev