[Ovmsdev] Can buses stop after some time

Mark Webb-Johnson mark at webb-johnson.net
Fri Jul 6 09:05:16 HKT 2018


Greg,

You and I are seeing the same problem. 128 errors (due to HUD trying different CAN bus baud rates) and the bus locks up. You can confirm this with ‘can can3 status’, and fix it with ‘power can3 off’ + ‘can can3 start active 500000’.

We know we can reset the entire mcp2515 chip (which is what ‘power can3 off’ does), but we are not sure of a lighter way of clearing those receive error counters. The data sheet says they are cleared automatically by a sequence of good data, but with just the HUD and OVMS on the bus I don’t think we’ll ever see that.
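
For reference, the "cleared by good data" behaviour comes from the fault confinement rules of the CAN specification: each successfully received frame lowers the receive error counter by 1, and a counter above 127 is set back to a value between 119 and 127. A minimal Python sketch of that rule follows. The function names are made up for illustration, and the choice of 127 (the worst case the spec allows) is an assumption. It models the protocol rule only, not the mcp2515 driver.

```python
# Sketch of the CAN receive error counter (REC) rule for good frames.
# Assumption: when REC is above 127 the spec permits any value from
# 119 to 127 after a good frame; this sketch picks 127 (worst case).

def rec_after_good_frame(rec):
    """Return the REC value after one successfully received frame."""
    if rec == 0:
        return 0
    if rec > 127:
        return 127
    return rec - 1


def good_frames_until_below(rec, limit):
    """Count good frames needed before REC drops below the given limit."""
    frames = 0
    while rec >= limit:
        rec = rec_after_good_frame(rec)
        frames += 1
    return frames


if __name__ == "__main__":
    print(good_frames_until_below(128, 128))  # frames to leave error-passive
    print(good_frames_until_below(128, 96))   # frames to clear the warning level
```

Under these rules a single good frame would be enough to leave error-passive, so the difficulty is getting any good frame at all while the HUD is still trying the wrong baud rates.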

I am just wondering if that is the same fault as Tom and Stein are seeing? Only way to know for sure is ‘can can2 status’, and seeing if ‘power can2 off’ + ‘can can2 start active 500000’ resolves the issue.

Regards, Mark.

> On 6 Jul 2018, at 4:47 AM, Greg D. <gregd2350 at gmail.com> wrote:
> 
> Hi Mark, Tom, et al,
> 
> See my earlier posts on the progress here, or the lack thereof...
> 
> I can reliably reproduce the issue by having a HUD connected to CAN3, and then (after the HUD has started trying to connect), starting the obdii ecu task.  This fails 100%.  If I start the HUD or use an OBDII dongle, let it make a mess of the bus through whatever it's doing, and then stop it before starting the obdii task, it never fails.  So, there seems to be a race condition somewhere in the receive side of the world, during the process of opening the CAN device while traffic is actively being received.  The obdii ecu task, however, is very reliable once it starts, and I've not had any sort of lockups once going.  But, note that the usage of the bus is almost entirely a request / reply sort of thing, so is self limiting.
> 
> I've done a bunch of tracing and debug-printf'ing around this issue, and have not yet found how to get the receiver to go again, once hung.  I do not believe, for example, that the SPI bus is hung, because I can continue to get various status interrupts while the errors mount.  Just not any receive frames, in fact, no frames at all if I start the HUD first.  I do get the status interrupts Mark has flagged (0 -> 3 -> b), and when received, I tried clearing them explicitly (clearing the interrupt status, that is).  No change in behavior, I suspect because I'm just clearing the status, not the underlying cause.  Unfortunately, I don't see any way to reset just the receiver, and resetting the chip would likely just drop into the same state again (assuming CAN traffic continues to be received).
> 
> Where I think I left things a month ago (before getting side-tracked on other projects) was to put in a delay in the obdii task so that stuff builds up without being received, trying to force a lockup due to the overflow.  No luck.  If I put in a long enough delay, the HUD thinks the car has been turned off, and goes to sleep.  Less than that, and things recover.  This was starting to make my head hurt, so I let it rest for a bit, and got side-tracked, sorry.
> 
> Tom, getting a status from you on what the chip thinks is going on when you see the lockup will be interesting.  I'm assuming that you are receiving stuff for a while, but there's a race condition in the receive processing somehow that you can hit, that the obdii request/response sequencing will never hit.  Do you ever transmit on your CAN bus?  I wonder if transmitting a "NOP" frame of some sort would help...
> 
> I've got commitments here until next week, but may be able to get back to poking at this after that.
> 
> Greg
> 
> 
> Mark Webb-Johnson wrote:
>> I’ve spent some time on this, and finally managed to reliably repeat it (at least in one case) by:
>> 
>> 1. Connect an external HUD and ‘obdii ecu start can3’.
>> 2. Once the HUD is connected and working, manually change baud rate to incorrect ‘can can3 start active 250000’.
>> 3. Watch errors start streaming in.
>> 4. If I quickly switch back with ‘can can3 start active 500000’, it recovers and everything is fine.
>> 5. If I leave it running, it seems to count up to 128 errors, and then lock up. At this point even a ‘can can3 start active 500000’ doesn’t solve it.
>> 6. A ‘power can3 off’ then ‘can can3 start active 500000’ recovers it.
>> 
>> Here is what it looks like in the failed state:
>> 
>> OVMS# can can3 status
>> CAN:       can3
>> Mode:      Active
>> Speed:     250000
>> Interrupts:               35901
>> Rx pkt:                       0
>> Rx err:                     128
>> Rx ovrflw:                    0
>> Tx pkt:                       0
>> Tx delays:                    0
>> Tx err:                       0
>> Tx ovrflw:                    0
>> Err flags: 0x800b
>> D (697321) canlog: Status can3 intr=35900 rxpkt=0 txpkt=0 errflags=0x800b rxerr=128 txerr=0 rxovr=0 txovr=0 txdelay=0
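
To compare status reports from different modules, the canlog line above can be split into fields. This is a minimal Python sketch that assumes only the line layout visible in this message ("canlog: <kind> <bus> key=value ..."). The function name is hypothetical and the parser is not part of OVMS.

```python
import re

def parse_canlog(line):
    """Parse one canlog line into a dict of its kind, bus and counters."""
    match = re.search(r"canlog: (\w+) (\w+) (.*)$", line)
    if match is None:
        raise ValueError("not a canlog line")
    kind, bus, rest = match.groups()
    fields = {"kind": kind, "bus": bus}
    for key, value in re.findall(r"(\w+)=(\S+)", rest):
        # Base 0 accepts both decimal (128) and hex (0x800b) values.
        fields[key] = int(value, 0)
    return fields


if __name__ == "__main__":
    sample = ("D (697321) canlog: Status can3 intr=35900 rxpkt=0 txpkt=0 "
              "errflags=0x800b rxerr=128 txerr=0 rxovr=0 txovr=0 txdelay=0")
    print(parse_canlog(sample))
```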
>> 
>> Can you check to see what yours looks like next time it fails?
>> 
>> Looking at the MCP2515 data sheet (page #45), it has this to say:
>> 
>> 6.6 Error States
>> 
>> Detected errors are made known to all other nodes via error frames. The transmission of the erroneous message is aborted and the frame is repeated as soon as possible. Furthermore, each CAN node is in one of the three error states according to the value of the internal error counters:
>> 
>> 1. Error-active.
>> 2. Error-passive.
>> 3. Bus-off (transmitter only).
>> 
>> The error-active state is the usual state where the node can transmit messages and active error frames (made of dominant bits) without any restrictions.
>> 
>> In the error-passive state, messages and passive error frames (made of recessive bits) may be transmitted.
>> 
>> The bus-off state makes it temporarily impossible for the station to participate in the bus communication. During this state, messages can neither be received or transmitted. Only transmitters can go bus-off.
>> 
>> 6.7 Error Modes and Error Counters
>> 
>> The MCP2515 contains two error counters: the Receive Error Counter (REC) (see Register 6-2) and the Transmit Error Counter (TEC) (see Register 6-1). The values of both counters can be read by the MCU. These counters are incremented/decremented in accordance with the CAN bus specification.
>> 
>> The MCP2515 is error-active if both error counters are below the error-passive limit of 128.
>> 
>> It is error-passive if at least one of the error counters equals or exceeds 128.
>> 
>> It goes to bus-off if the TEC exceeds the bus-off limit of 255. The device remains in this state until the bus-off recovery sequence is received. The bus-off recovery sequence consists of 128 occurrences of 11 consecutive recessive bits (see Figure 6-1).
>> 
>> The Current Error mode of the MCP2515 can be read by the MCU via the EFLG register (see Register 6-3).
>> 
>> Additionally, there is an error state warning flag bit (EFLG:EWARN) which is set if at least one of the error counters equals or exceeds the error warning limit of 96. EWARN is reset if both error counters are less than the error warning limit.
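
The thresholds quoted above can be restated as a small classifier. This is a minimal Python sketch of the data sheet wording only. The function names are made up, and the counter values in the usage lines are illustrative rather than read from a chip.

```python
ERROR_WARNING_LIMIT = 96
ERROR_PASSIVE_LIMIT = 128
BUS_OFF_LIMIT = 255

def error_state(tec, rec):
    """Classify the node state from the transmit and receive error counters."""
    if tec > BUS_OFF_LIMIT:
        return "bus-off"  # only the transmit counter can cause bus-off
    if tec >= ERROR_PASSIVE_LIMIT or rec >= ERROR_PASSIVE_LIMIT:
        return "error-passive"
    return "error-active"


def error_warning(tec, rec):
    """True when EWARN would be set: either counter at or above 96."""
    return tec >= ERROR_WARNING_LIMIT or rec >= ERROR_WARNING_LIMIT


if __name__ == "__main__":
    print(error_state(0, 113), error_warning(0, 113))
    print(error_state(0, 128), error_warning(0, 128))
```

Note that a receive counter alone can push the node to error-passive but never to bus-off, which matches a bus that only listens.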
>> 
>> I don’t think we access these TEC and REC registers, but the 128 number cannot be a coincidence.
>> 
>> We do access the EFLG register, in our ISR, and here is what I see:
>> 
>> E (685091) canlog: Error can3 intr=30 rxpkt=0 txpkt=0 errflags=0x8000 rxerr=56 txerr=0 rxovr=0 txovr=0 txdelay=0
>> E (685091) canlog: Error can3 intr=31 rxpkt=0 txpkt=0 errflags=0x8000 rxerr=58 txerr=0 rxovr=0 txovr=0 txdelay=0
>> E (685091) canlog: Error can3 intr=32 rxpkt=0 txpkt=0 errflags=0x8000 rxerr=60 txerr=0 rxovr=0 txovr=0 txdelay=0
>> E (685091) canlog: Error can3 intr=43 rxpkt=0 txpkt=0 errflags=0x8000 rxerr=81 txerr=0 rxovr=0 txovr=0 txdelay=0
>> E (685101) canlog: Error can3 intr=60 rxpkt=0 txpkt=0 errflags=0x8003 rxerr=113 txerr=0 rxovr=0 txovr=0 txdelay=0
>> 
>> The lower 8 bits of that are the EFLG, so 0x00 is normal, 0x03 is when the error is hit, and 0x0b is what we see later. Documentation for this flag is:
>> 
>> bit 7 bit 6 bit 5 bit 4 bit 3 bit 2 bit 1 bit 0
>> 
>> R/W-0 R-0 R-0 R-0 R-0 R-0
>> 
>> bit#7: RX1OVR: Receive Buffer 1 Overflow Flag bit
>> - Set when a valid message is received for RXB1 and CANINTF.RX1IF = 1
>> - Must be reset by MCU
>> 
>> bit#6: RX0OVR: Receive Buffer 0 Overflow Flag bit
>> - Set when a valid message is received for RXB0 and CANINTF.RX0IF = 1
>> - Must be reset by MCU
>> 
>> bit#5: TXBO: Bus-Off Error Flag bit
>> - Bit set when TEC reaches 255
>> - Reset after a successful bus recovery sequence
>> 
>> bit#4: TXEP: Transmit Error-Passive Flag bit
>> - Set when TEC is equal to or greater than 128
>> - Reset when TEC is less than 128
>> 
>> bit#3: RXEP: Receive Error-Passive Flag bit
>> - Set when REC is equal to or greater than 128
>> - Reset when REC is less than 128
>> 
>> bit#2: TXWAR: Transmit Error Warning Flag bit
>> - Set when TEC is equal to or greater than 96
>> - Reset when TEC is less than 96
>> 
>> bit#1: RXWAR: Receive Error Warning Flag bit
>> - Set when REC is equal to or greater than 96
>> - Reset when REC is less than 96
>> 
>> bit#0: EWARN: Error Warning Flag bit
>> - Set when TEC or REC is equal to or greater than 96 (TXWAR or RXWAR = 1)
>> - Reset when both REC and TEC are less than 96
>> 
>> So that is EWARN+RXWAR when the 128 error issue occurs, and EWARN+RXWAR+RXEP when everything is locked up. We have code to clear the error condition (in the interrupt flags register), but that doesn’t seem to get out of this 128 error lock-up.
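
The decoding described above can be checked mechanically. This is a minimal Python sketch that assumes, as stated in this message, that the lower 8 bits of the logged errflags value are the EFLG register. The helper name is hypothetical.

```python
# EFLG bit positions and names as listed in the data sheet excerpt.
EFLG_BITS = [
    (7, "RX1OVR"), (6, "RX0OVR"), (5, "TXBO"), (4, "TXEP"),
    (3, "RXEP"), (2, "TXWAR"), (1, "RXWAR"), (0, "EWARN"),
]

def decode_eflg(errflags):
    """Return the names of the EFLG bits set in the low byte of errflags."""
    eflg = errflags & 0xFF
    return [name for bit, name in EFLG_BITS if eflg & (1 << bit)]


if __name__ == "__main__":
    for value in (0x8000, 0x8003, 0x800b):
        print(hex(value), decode_eflg(value))
```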
>> 
>> I am not sure of the best approach for this. Perhaps pick up the condition, and reset the SPI bus, in a timer every 10 seconds or so?
>> 
>> I am not sure if this is your problem (a ‘can can2 status’ would tell us). In any case, the fix for this is to pick up this error condition in the ISR and fix it (or perhaps a separate periodic timer).
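
One way the periodic-timer idea could be structured is sketched below in Python. Everything here is an assumption for illustration: the class name, the two-poll threshold, and the rule "RXEP set and no new received frames since the last poll" are not taken from the OVMS code. The sketch only decides when a reset is due. The reset itself (the equivalent of ‘power can3 off’ followed by ‘can can3 start active 500000’) is left to the caller.

```python
RXEP = 0x08  # EFLG bit 3: receive error-passive

class StuckBusDetector:
    """Decide when a bus looks locked up, based on periodic status polls."""

    def __init__(self, polls_before_reset=2):
        self.polls_before_reset = polls_before_reset
        self.last_rxpkt = None
        self.stuck_polls = 0

    def poll(self, errflags, rxpkt):
        """Return True when the caller should reset the controller."""
        no_progress = self.last_rxpkt is not None and rxpkt == self.last_rxpkt
        self.last_rxpkt = rxpkt
        if (errflags & RXEP) and no_progress:
            self.stuck_polls += 1
        else:
            self.stuck_polls = 0
        if self.stuck_polls >= self.polls_before_reset:
            self.stuck_polls = 0  # start counting again after a reset
            return True
        return False


if __name__ == "__main__":
    detector = StuckBusDetector()
    for errflags, rxpkt in [(0x800b, 0), (0x800b, 0), (0x800b, 0)]:
        print(detector.poll(errflags, rxpkt))
```

Requiring more than one stalled poll keeps a bus that is merely idle for a moment from being reset on the first sample.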
>> 
>> Regards, Mark.
>> 
>>> On 5 Jul 2018, at 3:55 PM, Tom Parker <tom at carrott.org> wrote:
>>> 
>>> I haven't had a chance to try to work out what is going on.
>>> 
>>> I can say that the second CAN interface doesn't work for very long before stopping. This manifests most obviously on my Leaf as a stopped odometer in the OVMS app. If you look at the metrics in the console then everything that comes from the Car CAN bus (i.e. the second CAN bus) has frozen.
>>> 
>>> The first CAN interface seems much more reliable, with SOC information from the EV bus being fairly reliably reported.
>>> 
>>> I haven't done the modification to make my 3.0 unit's GPS work so I haven't experienced the stolen detection.
>>> 
>>> 
>>> On 05/07/18 18:34, Stein Arne Sordal wrote:
>>>> Did anyone figure out what happens here?
>>>> Now the OVMS thinks my car is stolen since it's moving (GPS) and CAN2 is dead.
>>>> Reboot of module brings CAN2 back to life for a period of time.
>>>> 
>>>> -Stein Arne Sordal-
>>>> 
>>>> 
>>>> 
>>>>> On 11 May 2018, at 12:29, Stein Arne Sordal <ovms at topphemmelig.no> wrote:
>>>>> 
>>>>> Hi Tom
>>>>> 
>>>>> I have seen this with my Leaf.
>>>>> I've been on vacation, so I haven't had time to test a lot, but it looks like one of the CAN buses stops. Started testing again today.
>>>>> 
>>>>> -Stein Arne Sordal-
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 11 May 2018, at 12:22, Tom Parker <tom at carrott.org> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> I synced up with master about a week ago and since then I've seen both CAN buses stop working. I still see the 12v battery metric changing, but everything that comes from the car stops. Rebooting the module with "module reset" does not seem to fix it, while make app-flash monitor does fix it. I haven't tried make monitor on its own.
>>>>>> 
>>>>>> Is anyone else seeing behavior like this?
>>>>>> 
>>>>>> Sorry for the vague bug report. I'll spend some time later this weekend to try to gather more information.
>>>>>> _______________________________________________
>>>>>> OvmsDev mailing list
>>>>>> OvmsDev at lists.openvehicles.com
>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>> 
>> 
>> 
>> 
> 
