[Ovmsdev] esp32can issue

Tue Aug 14 18:57:37 HKT 2018

Mark,

Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was:

CAN:       can1
Mode:      Active
Speed:     500000
Interrupts:            22425486
Rx pkt:                21080181
Rx err:                       5
Rx ovrflw:                 2054
Tx pkt:                 1340517
Tx delays:                  335
Tx err:                       0
Tx ovrflw:                    0
Err flags: 0x00800c5b

That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set?

ECC 5b = form error on TX in acknowledge delimiter (?)

Can we use some of this info as an indicator that restarting the interface is necessary?

Regards,
Michael

Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson:
> I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance.
> CAN2/CAN3 tx is dog slow compared to CAN1.
>
>     OVMS# test cantx can2 100
>     Testing 100 frames on can2
>     Transmitted 100 frames in 0.299612s = 2996us/frame
>
>     OVMS# test cantx can1 100
>     Testing 100 frames on can1
>     Transmitted 100 frames in 0.013691s = 136us/frame
>
>
> I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it
> resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues.
>
> I’ll continue looking into this tomorrow.
>
> Regards, Mark.
>
>> On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark at webb-johnson.net <mailto:mark at webb-johnson.net>> wrote:
>>
>>> But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in
>>> PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
>>
>> I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages
>> are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set.
>>
>> I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing
>> is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to
>> RX FIFO wrapping).
>>
>> Looking at page #28 of the datasheet, we have:
>>
>>     /4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to
>>     logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive
>>     interrupt bit is reset./
>>
>>
>> Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would
>> presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a:
>>
>>     while (MODULE_ESP32CAN->SR.B.RBS)
>>       {
>>       // Read and process a message
>>       }
>>
>>
>> in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages
>> available).
>>
>> Regards, Mark.
>>
>>> On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>
>>> Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson:
>>>> I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking.
>>>> The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing
>>>> it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of
>>>> this issue?
>>>
>>> You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in.
>>>
>>> But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in
>>> PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
>>>
>>> Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled.
>>>
>>> IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI
>>> would be 0, so rxframe() wouldn't be called.
>>>
>>> …unless IR.RI was set due to a hardware bug…
>>>
>>> …or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt
>>> coming in with IR.RI, but I've seen nothing in the documentation about something like this.
>>>
>>>> I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out
>>>> something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but
>>>> don’t see this problem. Maybe it is rare, or we are just not noticing it.
>>>
>>> I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context
>>> switches in some situations, which causes the CAN rx task execution pauses.
>>>
>>> Just a wild guess though, I would check the source for this pattern if it was open.
>>>
>>> Regards,
>>> Michael
>>>
>>>
>>>> Regards, Mark
>>>>
>>>>> On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>>>
>>>>> Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident.
>>>>>
>>>>> The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO
>>>>> will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue.
>>>>>
>>>>> Unfortunately, something goes completely wrong:
>>>>>
>>>>> a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times)
>>>>>
>>>>> b) after resolving the overrun, a garbled message is received.
>>>>>
>>>>> The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read
>>>>> from the bus, so something is going very wrong here.
>>>>>
>>>>> I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong.
>>>>>
>>>>> Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also
>>>>> reflects the garbled message, which causes the charger to stop.)
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------------------------------
>>>>> …
>>>>> 274.341 1R11 155 07 97 E4 54 4B A8 00 6C
>>>>> 274.341 1T11 155 01 97 E4 54 4B A8 00 6C
>>>>> 274.351 1R11 155 07 97 E5 54 4B A8 00 6C
>>>>> 274.351 1T11 155 01 97 E5 54 4B A8 00 6C
>>>>> 274.361 1R11 155 07 97 E5 54 4B A8 00 6C
>>>>> 274.361 1T11 155 01 97 E5 54 4B A8 00 6C
>>>>> 274.371 1R11 423 03 22 FF FF 00 E0 00 DB
>>>>> 274.371 1R11 426 00 38 01 00 4D 7E 00
>>>>> 274.371 1R11 597 20 A4 03 B1 2F 00 01 53
>>>>> 274.371 1R11 627 00 00 00
>>>>> 274.371 1R11 155 07 97 E5 54 4B A8 00 6C
>>>>> 274.371 1T11 155 01 97 E5 54 4B A8 00 6C
>>>>> 274.381 1R11 155 07 97 E5 54 4B A8 00 6C
>>>>> 274.381 1T11 155 01 97 E5 54 4B A8 00 6C
>>>>> 274.391 1R11 5D7 FF FF 01 E4 53 80 00
>>>>>
>>>>>  - normal up to here
>>>>>
>>>>>  !!! CAN RX is blocked here for ~70 ms !!!
>>>>>    … interrupts disabled? by whom?
>>>>>
>>>>> 274.461 1R11 599 00 00 4D 7E 1E 26 FF FF
>>>>> 274.461 1R11 436 00 0E 3C B2 00 00
>>>>> 274.461 1R11 155 07 97 E5 54 4B A8 00 6C
>>>>> 274.461 1R11 424 11 40 15 26 47 64 00 48
>>>>> 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F
>>>>> 274.461 1R11 556 30 63 07 30 73 07 30 7A
>>>>> 274.461 1T11 155 01 97 E5 54 4B A8 00 6C      ← reflection for 155 still works here
>>>>>
>>>>>  !!! next RX pause, ~90 ms this time !!!
>>>>>
>>>>>  !!! ID 599 has a period of 100 ms, but we now get it 5 times:
>>>>>
>>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
>>>>>
>>>>> 274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17
>>>>>  crtd fprintf bug compensation:
>>>>>  274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17
>>>>>    errflags=528128 = 08 0F 00
>>>>>    0x08 = IR: Data overrun
>>>>>    0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full
>>>>>    0x00 = ECC: -
>>>>>
>>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
>>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
>>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
>>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
>>>>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C                    RX "A"
>>>>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C                    RX "B"
>>>>> 274.551 1R11 155 07 97 E7 54 4B A8 00 6C                    RX "C"
>>>>> 274.551 1R11 423 03 22 FF FF 00 E0 00 DB
>>>>> 274.551 1R11 426 00 38 01 00 4D 7E 00
>>>>> 274.551 1R11 597 20 A4 03 B1 2F 00 01 53
>>>>> 274.551 1R11 627 00 00 00
>>>>> 274.551 1T11 155 01 97 E6 54 4B A8 00 6C                    reflection "A"
>>>>> 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C       reflection "B"
>>>>> 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C       reflection "C"
>>>>> 274.551 1T11 155 01 97 E8 54 4B A8 00 6C                    reflection "D"
>>>>>
>>>>>  !!! 120 ms pause !!!
>>>>>
>>>>> 274.671 1R11 155 07 97 E8 54 4B A8 00 6C                    RX "D"
>>>>>
>>>>>  → "D" was sent to the listeners & reflected by vehicle::IncomingFrame()
>>>>>    then, before the LogFrame() call was processed, the CAN task was paused for 120 ms
>>>>>
>>>>>  and now it's getting really weird:
>>>>>
>>>>> 274.671 1R11 155 07 08 2A A0 07 97 E7 54                    RX "E": invalid frame / garbled message!
>>>>>
>>>>>  … this frame certainly was not on the bus!
>>>>>    bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus
>>>>>    bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155
>>>>>    - copy / memory error?      … very unlikely
>>>>>    - CAN RX handling error?    … doesn't seem so
>>>>>    - RXFIFO message window bug?
>>>>>
>>>>> 274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19
>>>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C
>>>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C
>>>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C
>>>>> 274.671 1T11 155 01 97 E6 54 4B A8 00 6C
>>>>> 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C
>>>>> 274.671 1R11 155 07 97 F2 54 4B A8 00 6C
>>>>> 274.671 1R11 155 07 97 F3 54 4B A8 00 6C
>>>>> 274.671 1R11 423 03 22 FF FF 00 E0 00 DB
>>>>> 274.671 1R11 426 00 38 01 00 4D 7E 00
>>>>> 274.671 1R11 597 20 A4 03 B1 2F 00 01 53
>>>>> 274.671 1R11 627 00 00 00
>>>>> 274.671 1R11 155 07 97 F5 54 4B A8 00 6C
>>>>> 274.671 1R11 155 07 97 F6 54 4B A8 00 6C
>>>>> 274.671 1R11 5D7 FF FF 01 E4 53 80 00
>>>>> 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF
>>>>> 274.671 1R11 155 07 97 F7 54 4B A8 00 6C
>>>>> 274.671 1R11 436 00 0E 3C B2 00 00
>>>>> 274.671 1R11 041 A0 07 97 F5 54 4B A8 00
>>>>> 274.671 1T11 155 01 97 E7 54 4B A8 00 6C
>>>>> 274.671 1T11 155 01 08 2A A0 07 97 E7 54                    reflection of invalid msg "E" → charge terminates due to error
>>>>>
>>>>>
>>>>> -- 
>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> OvmsDev mailing list
>>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>> _______________________________________________
>>>> OvmsDev mailing list
>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>
>>> -- 
>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>
>
>
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20180814/e90590d9/attachment-0001.html>