That's 6.55MB/month, unless you have
unusually short months! :-)
In what space is that data stored? A log written to
SD? That's not
likely to fill up the SD card too fast, but what
happens if no SD card
is installed?
-- Steve
On Thu, 19 Sep 2019, Mark Webb-Johnson wrote:
To enable CPU
usage statistics, apply the changes to sdkconfig
included.
New history record:
- "*-OVM-DebugTasks" v1:
<taskcnt,totaltime> + per task:
<tasknum,name,state,stack_now,stack_max,stack_total,
heap_total,heap_32bit,heap_spi,runtime>
Note: CPU core use percentage = runtime /
totaltime
I’ve just noticed that this is enabled by default
now (my production build has the sdkconfig updated,
as per defaults).
I am seeing 758 bytes of history record, every 5
minutes. About 218KB/day, or 654KB/month.
Should this be opt-in?
Regards, Mark.
On 8 Sep 2019, at
5:43 PM, Michael Balzer <dexter@expeedo.de>
wrote:
I've pushed some modifications and improvements to
(hopefully) fix the timer issue or at least be
able to debug it.
Some sdkconfig changes are necessary.
The build including these updates is on my edge
release as 3.2.002-258-g20ae554b.
Btw: the network restart strategy seems to
mitigate issue #241; I've seen a major drop on
record repetitions on my server since the rollout.
commit 99e4e48bdd40b7004c0976f51aba9e3da4ecab53
Module: add per task CPU usage statistics, add
task stats history records
To enable CPU usage statistics, apply the
changes to sdkconfig
included. The CPU usage shown by the commands
is calculated against
the last task status retrieved (or system
boot).
Command changes:
- "module tasks" -- added CPU (core) usage in
percent per task
New command:
- "module tasks data" -- output task stats in
history record form
New config:
- [module] debug.tasks -- yes (default) = send
task stats every 5 minutes
New history record:
- "*-OVM-DebugTasks" v1:
<taskcnt,totaltime> + per task:
<tasknum,name,state,stack_now,stack_max,stack_total,
heap_total,heap_32bit,heap_spi,runtime>
Note: CPU core use percentage = runtime /
totaltime
commit 950172c216a72beb4da0bc7a40a46995a6105955
Build config: default timer service task
priority raised to 20
Background: the FreeRTOS timer service shall
only be used for very
short and non-blocking jobs. We delegate
event processing to our
events task, anything else in timers needs to
run with high
priority.
commit 31ac19d187480046c16356b80668de45cacbb83d
DukTape: add build config for task priority,
default lowered to 3
Background: the DukTape garbage collector shall
run on lower
priority than tasks like SIMCOM & events
commit e0a44791fbcfb5a4e4cad24c9d1163b76e637b4f
Server V2: use esp_log_timestamp for timeout
detection,
add timeout config, limit data records &
size per second
New config:
- [server.v2] timeout.rx -- timeout in seconds,
default 960
commit 684a4ce9525175a910040f0d1ca82ac212fbf5de
Notify: use esp_log_timestamp for creation time
instead of monotonictime
to harden against timer service starvation /
ticker event drops
Regards,
Michael
Am 07.09.19 um 10:55 schrieb Michael Balzer:
I think the RTOS
timer service task starves. It's running on core
0 with priority 1.
Taks on core 0 sorted by priority:
Number of Tasks = 20 Stack: Now Max
Total Heap 32-bit SPIRAM C# PRI
3FFC84A8 6 Blk ipc0 388 500
1024 7788 0 0 0 24
3FFC77F0 5 Blk OVMS CanRx 428 428
2048 3052 0 31844 0 23
3FFAFBF4 1 Blk esp_timer 400 656
4096 35928 644 25804 0 22
3FFD3240 19 Blk wifi 460 2716
3584 43720 0 20 0 22
3FFC03C4 2 Blk eventTask 448 1984
4608 104 0 0 0 20
3FFC8F14 17 Blk tiT 500 2308
3072 6552 0 0 * 18
3FFE14F0 26 Blk OVMS COrx 456 456
4096 0 0 0 0 7
3FFE19D4 27 Blk OVMS COwrk 476 476
3072 0 0 0 0 7
3FFCBC34 12 Blk Tmr Svc 352 928
3072 88 0 0 0 1
3FFE7708 23 Blk mdns 468 1396
4096 108 0 0 0 1
I don't think it's our CanRx, as that only
fetches and queues CAN frames, the actual work
is done by the listeners. The CO tasks only run
for CANopen jobs, which are few for normal
operation.
That leaves the system tasks, with main suspect
-once again- the wifi blob.
We need to know how much CPU time the tasks
actually use now. I think I saw some option for
this in the FreeRTOS config.
Regards,
Michael
Am 06.09.19 um 23:15 schrieb Michael Balzer:
The workaround
is based on the monotonictime being updated
per second, as do the history record offsets.
Apparently, that mechanism doesn't work
reliably. That may be an indicator for some
bigger underlying issue.
Example log excerpt:
2019-09-06 22:07:48.126919 +0200 info main:
#173 C MITPROHB rx msg h
964,0,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:03.089031 +0200 info main:
#173 C MITPROHB rx msg h
964,-10,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.041574 +0200 info main:
#173 C MITPROHB rx msg h
964,-20,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.052644 +0200 info main:
#173 C MITPROHB rx msg h
964,-30,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.063617 +0200 info main:
#173 C MITPROHB rx msg h
964,-49,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.077527 +0200 info main:
#173 C MITPROHB rx msg h
964,-59,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:05.193775 +0200 info main:
#173 C MITPROHB rx msg h
964,-70,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:13.190645 +0200 info main:
#173 C MITPROHB rx msg h
964,-80,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:22.077994 +0200 info main:
#173 C MITPROHB rx msg h
964,-90,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:09:54.590300 +0200 info main:
#173 C MITPROHB rx msg h
964,-109,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:11:10.127054 +0200 info main:
#173 C MITPROHB rx msg h
964,-119,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:11:16.794200 +0200 info main:
#173 C MITPROHB rx msg h
964,-130,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:11:22.455652 +0200 info main:
#173 C MITPROHB rx msg h
964,-140,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:12:49.423412 +0200 info main:
#173 C MITPROHB rx msg h
964,-150,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:12:49.442096 +0200 info main:
#173 C MITPROHB rx msg h
964,-169,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:12:49.461941 +0200 info main:
#173 C MITPROHB rx msg h
964,-179,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:39.828133 +0200 info main:
#173 C MITPROHB rx msg h
964,-190,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:39.858144 +0200 info main:
#173 C MITPROHB rx msg h
964,-200,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:52.020319 +0200 info main:
#173 C MITPROHB rx msg h
964,-210,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:14:54.452637 +0200 info main:
#173 C MITPROHB rx msg h
964,-229,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:15:12.613935 +0200 info main:
#173 C MITPROHB rx msg h
964,-239,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:15:35.223845 +0200 info main:
#173 C MITPROHB rx msg h
964,-250,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:16:09.255059 +0200 info main:
#173 C MITPROHB rx msg h
964,-260,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:17:31.919754 +0200 info main:
#173 C MITPROHB rx msg h
964,-270,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:19:23.366267 +0200 info main:
#173 C MITPROHB rx msg h
964,-289,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:21:57.344609 +0200 info main:
#173 C MITPROHB rx msg h
964,-299,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:23:40.082406 +0200 info main:
#31 C MITPROHB rx msg h
964,-1027,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
2019-09-06 22:25:58.061883 +0200 info main:
#31 C MITPROHB rx msg h
964,-1040,RT-BAT-C,5,86400,2,1,3830,3795,3830,-10,25,25,25,0
This shows the ticker was only run 299 times
from 22:07:48 to 22:21:57.
After 22:21:57 the workaround was triggered
and did a reconnect. Apparently during that
network reinitialization of 103 seconds, the
per second ticker was run 628 times.
That can't be catching up on the event queue,
as that queue has only 20 slots. So something
strange is going on here.
Any ideas?
Regards,
Michael
Am 06.09.19 um 08:04 schrieb Michael Balzer:
Mark &
anyone else running a V2 server,
as most cars don't send history records,
this also needs the change to the server I
just pushed, i.e. server version 2.4.2.
https://github.com/openvehicles/Open-Vehicle-Monitoring-System/commits/master
<https://github.com/openvehicles/Open-Vehicle-Monitoring-System/commits/master>
Regards,
Michael
Am 05.09.19 um 19:55 schrieb Michael Balzer:
I've pushed
the nasty workaround: the v2 server checks
for no RX over 15 minutes, then restarts
the network (wifi & modem) as
configured for autostart.
Rolled out on my server in edge as
3.2.002-237-ge075f655.
Please test.
Regards,
Michael
Am 05.09.19 um 01:58 schrieb Mark
Webb-Johnson:
Mark,
you can check your server logs for
history messages with ridiculous time
offsets:
[sddexter@ns27 server]$ cat
log-20190903 | egrep "rx msg h
[0-9]+,-[0-9]{4}" | wc -l
455283
I checked my logs and see 12 vehicles
showing this. But, 2 only show this for
a debugcrash log (which is expected, I
guess, if the time is not synced at
report time). I’ve got 4 cars with the
offset > 10,000.
Regards, Mark.
On 4
Sep 2019, at 4:45 AM, Michael Balzer
<dexter@expeedo.de
<mailto:dexter@expeedo.de>>
wrote:
Everyone,
I've pushed a change that needs some
testing.
I had the issue myself now parking at
a certain distance from my garage wifi
AP, i.e. on the edge of "in", after
wifi had been disconnected for some
hours, and with the module still
connected via modem. The wifi blob had
been trying to connect to the AP for
about two hours.
As seen before, the module saw no
error, just the server responses and
commands stopped coming in. I noticed
the default interface was still "st1"
despite wifi having been disconnected
and modem connected. The DNS was also
still configured for my wifi network,
and the interface seemed to have an IP
address -- but wasn't pingable from
the wifi network.
A power cycle of the modem solved the
issue without reboot. So the cause may
be in the modem/ppp subsystem, or it
may be related (in some weird way) to
the default interface / DNS setup.
More tests showed the default
interface again/still got set by the
wifi blob itself at some point,
overriding our modem prioritization.
The events we didn't handle up to now
were "sta.connected" and "sta.lostip",
so I added these, and the bug didn't
show up again since then. That doesn't
mean anything, so we need to test
this.
The default interface really shouldn't
affect inbound packet routing of an
established connection, but there
always may be strange bugs lurking in
those libs.
The change also reimplements the wifi
signal strength reading, as the tests
also showed that still wasn't working
well using the CSI callback. It now
seems to be much more reliable.
Please test & report. The single
module will be hard to test, as the
bug isn't reproducable easily, but you
can still try if wifi / modem
transitions work well.
Mark, you can check your server logs
for history messages with ridiculous
time offsets:
[sddexter@ns27 server]$ cat
log-20190903 | egrep "rx msg h
[0-9]+,-[0-9]{4}" | wc -l
455283
The bug now severely affects the V2
server performance, as the server is
single threaded and doesn't scale very
well to this kind of bulk data bursts,
especially when coming from multiple
modules in parallel. So we really need
to solve this now. Slow reactions or
connection drops from my server lately
have been due to this bug. If this
change doesn't solve it, we'll need to
add some reboot trigger on "too many
server v2 notification
retransmissions" -- or maybe a modem
power cycle will do, that wouldn't
discard the data.
Thanks,
Michael
Am 03.09.19 um 07:46 schrieb Mark
Webb-Johnson:
No
problem. We can hold. I won’t commit
anything for the next few days (and
agree to hold-off on Markos’s pull).
Let me know when you are ready.
Regards, Mark.
On
3 Sep 2019, at 1:58 AM, Michael
Balzer <dexter@expeedo.de>
<mailto:dexter@expeedo.de>
wrote:
Mark, please wait.
I may just have found the cause
for issue #241, or at least
something I need to investigate
before releasing.
I need to dig into my logs first,
and try something.
Regards,
Michael
Am 02.09.19 um 12:23 schrieb
Michael Balzer:
Nothing
open from my side at the moment.
I haven't had the time to look
in to Markos pull request, but
from a first check also think
that's going too deep to be
included in this release.
Regards,
Michael
Am 02.09.19 um 04:15 schrieb
Mark Webb-Johnson:
I
think it is well past time for
a 3.2.003 release. Things
seems table in edge (although
some things only partially
implemented).
Anything people want to
include at the last minute, or
can we go ahead and build?
Regards, Mark.