[Ovmsdev] CAN-3 broken again?

Greg D. gregd2350 at gmail.com
Tue Jan 9 04:03:49 HKT 2018


Hi Michael,

The overflow only occurs when something else happens on the system that
consumes enough CPU time for us to lose a frame or two.  The Wifi client
connecting seems to do that pretty reliably, but other things can do it
too (e.g. playing with the modem).  Restarting the obd2ecu process
resets the bus, and everything runs fine until that perturbing (e.g.
wifi connect) event occurs again.  I do not recall this happening prior
to the optimizations that were done last month, but so much else has
changed that it's hard to point to any one of them as the cause.
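
To make the "lose a frame or two" mechanism concrete, here is a minimal
sketch of a receiver with a fixed number of hardware slots and a host that
is sometimes starved of CPU time.  It is a deliberately simplified model,
not the mcp2515 driver: RxModel, Simulate, and the slots parameter are
hypothetical names made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Simplified model of a CAN controller's hardware receive buffers.
// Hypothetical illustration only; this is not the OVMS driver code.
struct RxModel {
  std::size_t slots;          // frames the chip can hold before dropping
  std::size_t pending = 0;    // frames waiting to be read by the host
  std::size_t received = 0;   // frames the host has read successfully
  std::size_t overflows = 0;  // frames lost because every slot was full

  explicit RxModel(std::size_t s) : slots(s) {}

  // A frame arrives from the bus.
  void FrameArrives() {
    if (pending < slots) {
      ++pending;
    } else {
      ++overflows;  // a real chip would raise its overflow flag here
    }
  }

  // The host gets CPU time and drains everything that is pending.
  void HostServices() {
    received += pending;
    pending = 0;
  }
};

// Each entry of hostRuns says whether the host gets to run after that
// frame arrives (true) or is starved by some other task (false).
inline RxModel Simulate(std::size_t slots, const std::vector<bool>& hostRuns) {
  RxModel m(slots);
  for (bool runs : hostRuns) {
    m.FrameArrives();
    if (runs) m.HostServices();
  }
  return m;
}
```

With two slots, a host that runs after every frame never overflows, while a
host that misses two service opportunities in a row loses the next frame.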

Thinking about my experiments last night, I recall that there were times
where the bus was running even though Rx Overflow was non-zero.  I also
recall that sometimes restarting obd2ecu didn't clear the 0x2040 error
status, and sometimes it did, and that seemed to be separate from
whether the bus was hung or not.  I don't recall now what exactly the
conditions were (it was late), but there may be more to the mcp2515
handling than just the overflow.  But I'd start there.
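
For reference when reading the status value, the sketch below decodes it
into flag names.  The low byte is decoded as the MCP2515 EFLG register,
whose bit names come from the datasheet.  Treating the high byte as a copy
of the CANINTF register is an assumption about how the driver packs the
value, not something confirmed in this thread; the function names are made
up for the example.

```cpp
#include <cstdint>
#include <string>

// Join the names of all set bits, highest bit first.
inline std::string JoinBits(std::uint8_t value, const char* const names[8]) {
  std::string out;
  for (int bit = 7; bit >= 0; --bit) {
    if (value & (1u << bit)) {
      if (!out.empty()) out += ' ';
      out += names[bit];
    }
  }
  if (out.empty()) return "none";
  return out;
}

// MCP2515 EFLG register, bit 0 first (names per the datasheet).
inline std::string DecodeEflg(std::uint8_t eflg) {
  static const char* const names[8] = {
    "EWARN", "RXWAR", "TXWAR", "RXEP", "TXEP", "TXBO", "RX0OVR", "RX1OVR"
  };
  return JoinBits(eflg, names);
}

// MCP2515 CANINTF register, bit 0 first (names per the datasheet).
inline std::string DecodeCanintf(std::uint8_t intf) {
  static const char* const names[8] = {
    "RX0IF", "RX1IF", "TX0IF", "TX1IF", "TX2IF", "ERRIF", "WAKIF", "MERRF"
  };
  return JoinBits(intf, names);
}

// ASSUMPTION: high byte = CANINTF snapshot, low byte = EFLG.
inline std::string DecodeStatus(std::uint16_t flags) {
  const std::uint8_t intf = static_cast<std::uint8_t>(flags >> 8);
  const std::uint8_t eflg = static_cast<std::uint8_t>(flags & 0xFF);
  return "CANINTF[" + DecodeCanintf(intf) + "] EFLG[" + DecodeEflg(eflg) + "]";
}
```

Under that assumption, DecodeStatus(0x2040) reports the error interrupt flag
together with a receive buffer 0 overflow, which matches Michael's reading
of the value.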

Thanks for reminding me about the trace facility.  I was relying on the
can status command to count the frames, but confirmed with trace that
none were received.  The obd2ecu process also has a lot of logging
built-in (set log level to debug), and that confirmed no frames are
received after the hangs.

Are there any debugging hooks in the MCP driver that we can use to
understand its state?
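
As a rough idea of what such a hook could look like, the sketch below reads
a handful of MCP2515 registers and formats them on one line.  The register
addresses are taken from the MCP2515 datasheet; DumpMcpState and RegReader
are hypothetical names, and the register read is passed in as a callback
because I am not assuming anything about how the existing driver does its
SPI transfers.

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// MCP2515 register addresses (from the datasheet register map).
constexpr std::uint8_t REG_CANSTAT = 0x0E;  // operating mode in bits 7..5
constexpr std::uint8_t REG_TEC     = 0x1C;  // transmit error counter
constexpr std::uint8_t REG_REC     = 0x1D;  // receive error counter
constexpr std::uint8_t REG_CANINTE = 0x2B;  // interrupt enable
constexpr std::uint8_t REG_CANINTF = 0x2C;  // interrupt flags
constexpr std::uint8_t REG_EFLG    = 0x2D;  // error flags

// Hypothetical callback: reads one register over SPI, however the
// driver chooses to do that.
using RegReader = std::function<std::uint8_t(std::uint8_t addr)>;

// Format a one-line snapshot of the controller state for logging.
inline std::string DumpMcpState(const RegReader& readReg) {
  const unsigned canstat = readReg(REG_CANSTAT);
  const unsigned inte = readReg(REG_CANINTE);
  const unsigned intf = readReg(REG_CANINTF);
  const unsigned eflg = readReg(REG_EFLG);
  const unsigned tec = readReg(REG_TEC);
  const unsigned rec = readReg(REG_REC);

  char buf[128];
  std::snprintf(buf, sizeof(buf),
                "opmode=%u inte=0x%02x intf=0x%02x eflg=0x%02x tec=%u rec=%u",
                canstat >> 5, inte, intf, eflg, tec, rec);
  return std::string(buf);
}
```

Logging this line before and after the perturbing event would show whether
the chip is still in normal mode (opmode 0), whether the receive interrupts
are still enabled, and whether an interrupt flag is stuck.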

Greg


Michael Balzer wrote:
> Greg,
>
> Error flags 0x2040 on the mcp2515 mean receive buffer 0 overflow. The error flags are reset by a bus start command,
> so I'd guess the overflow occurs again right after restarting the obdii process.
>
> As the mcp has two receive buffers, this means the CAN driver was not able to clear buffer 0 fast enough.
>
> My next guess would then be we need to clear the RX buffer in the mcp ISR code instead of the RxCallback.
>
> But… a receive buffer overflow condition should get reset by the driver automatically (line 306). So that should not be the reason the bus stops working.
>
> Are you sure it's the bus that stopped working, and not the obdii process? Did you verify this using the "can trace" command?
>
> Regards,
> Michael
>
>
> On 08.01.2018 at 07:31, Greg D. wrote:
>> Quick update...  I can reliably get the CAN 3 bus to hang with an Rx
>> overflow by having the modem running, then telling WiFi to connect, but
>> only with the OBDWiz dongle.  Connecting an actual HUD display doesn't
>> seem to trigger the effect.  Surprisingly, I can do a full module reset
>> while the OBDWiz is running, without it disconnecting.  (OBDII ECU is
>> started in the system.start script, along with the v2 server and vehicle
>> module setting, though all this testing is done on the bench without the
>> car.)
>>
>> Also, once the bus is hung, restarting the OBD2ECU process sometimes
>> only lets the OBDWiz dongle get part-way through its connect process
>> before it hangs again.  Consistently 7 frames received, 10 sent (due to
>> some of the responses taking multiple frames).  It may be significant
>> that the HUD does NOT use those multi-frame PIDs (ECU Name and VIN)...
>>
>> That said, another development is that the Rx overflow may not be fatal
>> after all, if I start things with the HUD, then swap dongles to the
>> OBDWiz.  Seems that having an external 12v power source keeps things
>> running even with the overflow status.  Since earlier (few months ago)
>> testing didn't have the modem running, and the OBDWiz dongle doesn't
>> need the 12v power (it's a USB device, on a different PC), this test
>> combination is new.  Error flags on the can status are 0x2040, by the
>> way, when it hangs.
>>
>> But still, even with the 12v, I can reliably cause the bus to hang by
>> starting with the OBDWiz dongle running, get the modem connected, then
>> connect wifi.  The partial connect and hang seems to be solved with the
>> 12v power; the full hang is not.  But the full hang (with 12v attached)
>> can be reset by stopping and restarting the obdii ecu.  Interestingly,
>> the 0x2040 error status is not cleared when restarting the obdii
>> process, but the frame and frame error counters do get set back to zero.
>>
>> Still looking for more clues...  Any ideas on how to narrow this down?
>>
>> Greg
>>
>>
>> Greg D. wrote:
>>> I've turned off Canopen, SSH, and Telnet earlier, and that seemed to
>>> stop the crashes.  Just now added Bluetooth to that list, for good measure.
>>>
>>> Let's see how that holds...
>>>
>>> As for the issues with CAN-3, I seem to be able to hang it by simply
>>> starting WiFi while the OBDII ECU is running with an OBDII device
>>> attached (OBDWiz, in this case).  Trying to reconnect the OBDII device
>>> fails - no frames are received.  Stopping and restarting the OBDII ECU
>>> task lets me reconnect.  If I look at the can status when hung, I see
>>> that Rx Ovrflw is 1, and the Rx counter doesn't increment.  I'm guessing
>>> that starting WiFi is taking enough CPU time that the OBDII ECU task is
>>> falling behind, causing the overflow.  Apparently, that overflow is not
>>> being handled, leading to the hang.
>>>
>>> On an earlier run (before removing Bluetooth), I was able to get the
>>> OBDWiz dongle to connect for a few frames, after which it hung.  That
>>> behavior didn't repeat just now, but I'm not sure what else was running
>>> at the time (e.g. the modem / ppp).  The connect sequence from OBDWiz
>>> does a few frames rapidly (an initial PID 0, followed by requests for
>>> ECU Name and VIN), before a more relaxed polling starts.  So, if there's
>>> another task taking up CPU time, I can see where an Rx overflow could
>>> occur during that initial connect sequence.
>>>
>>> Driving a HUD is not a critical task, so I would be against a general
>>> raising of task priority.  Rather, we need to figure out how to handle
>>> the Rx Overflow, and keep the frames coming in.  OBDII devices generally
>>> are somewhat forgiving about lost frames, but apparently the OBDWiz has
>>> a short attention span and lets you know that something is wrong.
>>>
>>> I'll take a look at the 2515 code, but I'm not much of an expert on the
>>> chip's care and feeding under such circumstances.  If someone more in
>>> the know about it could take a look, that would be great.
>>>
>>> Thanks,
>>>
>>> Greg
>>>
>>>
>>> Stephen Casner wrote:
>>>> Greg,
>>>>
>>>> Yes, definitely running out of free RAM, but I don't know the meaning
>>>> of the WindowOverflow messages.
>>>>
>>>> The first time I built with release/v3.0 of esp-idf I was not able to
>>>> open an ssh connection; the error displayed was about a crypto
>>>> failure.  After quite a bit of digging to narrow down to where the
>>>> error was occurring, I finally found that the problem was running out
>>>> of free RAM.  My solution was to disable bluetooth entirely, which
>>>> made a big difference in the amount of free RAM.
>>>>
>>>>                                                         -- Steve
>>>>
>>>> On Sun, 7 Jan 2018, Greg D. wrote:
>>>>
>>>>> Hi Michael, Steve, Mark,
>>>>>
>>>>> Steve, the crash was an abort in new_op.cc, so perhaps being out of space is the
>>>>> issue.  Crash and reboot log attached (crash.txt).  One thing I've been wondering
>>>>> about are the several lines "_WindowOverflow4 at ??:?" during the boot process.  Is
>>>>> that indicative of a problem, later to manifest in the crash?
>>>>>
>>>>> My builds include pretty much everything, except for the Leaf, Twizy, and Soul.
>>>>>
>>>>> The update included some 20 lines changed to mcp2515.cpp, as well as a bunch of other
>>>>> stuff, including a lot of stuff updated in Canopen and Kia.  I have a script that does
>>>>> the git fetch master, merge, and push back to my github fork, the output of which is
>>>>> attached (update.txt).  As a test, I removed Canopen from the build config, and the
>>>>> crash has disappeared.  CAN-3 also appears to have come back to life (!), at least
>>>>> initially.  I can still get CAN-3 to fail if I turn on/off the modem and/or wifi in
>>>>> some sequence (still trying to pin that down), but that also leads to another crash
>>>>> (crash2.txt, attached).
>>>>>
>>>>> Mark:  Note also the issue with DNS failures getting to the v2 server.  I enabled the
>>>>> modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2
>>>>> server.  Disabling Wifi didn't bring it back, and powering off the modem (in
>>>>> preparation for turning it back on) caused the crash.
>>>>>
>>>>> So, two questions...  First, why the apparent conflict between Canopen or wifi/modem
>>>>> and obd2ecu over access to the 3rd CAN bus?  Why would the modem or wifi have any
>>>>> effect on a CAN bus?
>>>>>
>>>>> Second, overall memory usage seems to be at the limit.  What sort of budget do we have
>>>>> for what remains to be done, and how are we going to be packaging the build options
>>>>> for when non-developers want to get their hands on the product?  Will we be able to
>>>>> turn everything on, minus the developer / debug stuff, or will we have a separate SKU
>>>>> for each model car?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Greg
>>>>>
>>>>>
>>>>> Michael Balzer wrote:
>>>>>
>>>>> Greg,
>>>>>
>>>>> which commits / changes do you mean? The CAN drivers have not been changed since the
>>>>> TX performance fix, which Geir reported having solved his last issues.
>>>>>
>>>>> The current version is stable over here, but without the SSH component -- I can't use
>>>>> that due to memory getting too low together with the Twizy component.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> On 07.01.2018 at 08:04, Greg D. wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> I just resync'd with the main repository, and am not receiving frames on
>>>>> CAN-3 anymore.  I see there were changes to the chip driver...
>>>>>
>>>>> I'm also seeing crashes right after getting connected to WiFi,
>>>>> immediately after the system tries to start SSH.
>>>>>
>>>>> Seems like we just took a big step backward.  What happened?
>>>>>
>>>>> Greg
>>>>>
>>>>> _______________________________________________
>>>>> OvmsDev mailing list
>>>>> OvmsDev at lists.teslaclub.hk
>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev


