Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident. The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue. Unfortunately, something goes completely wrong: a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times) b) after resolving the overrun, a garbled message is received. The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read from the bus, so something is going very wrong here. I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong. Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also reflects the garbled message, which causes the charger to stop.) Regards, Michael ---------------------------------------------------------------------------------------- … 274.341 1R11 155 07 97 E4 54 4B A8 00 6C 274.341 1T11 155 01 97 E4 54 4B A8 00 6C 274.351 1R11 155 07 97 E5 54 4B A8 00 6C 274.351 1T11 155 01 97 E5 54 4B A8 00 6C 274.361 1R11 155 07 97 E5 54 4B A8 00 6C 274.361 1T11 155 01 97 E5 54 4B A8 00 6C 274.371 1R11 423 03 22 FF FF 00 E0 00 DB 274.371 1R11 426 00 38 01 00 4D 7E 00 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 274.371 1R11 627 00 00 00 274.371 1R11 155 07 97 E5 54 4B A8 00 6C 274.371 1T11 155 01 97 E5 54 4B A8 00 6C 274.381 1R11 155 07 97 E5 54 4B A8 00 6C 274.381 1T11 155 01 97 E5 54 4B A8 00 6C 274.391 1R11 5D7 FF FF 01 E4 53 80 00 - normal up to here !!! CAN RX is blocked here for ~70 ms !!! … interrupts disabled? by whom? 274.461 1R11 599 00 00 4D 7E 1E 26 FF FF 274.461 1R11 436 00 0E 3C B2 00 00 274.461 1R11 155 07 97 E5 54 4B A8 00 6C 274.461 1R11 424 11 40 15 26 47 64 00 48 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F 274.461 1R11 556 30 63 07 30 73 07 30 7A 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here !!! next RX pause, ~90 ms this time !!! !!! ID 599 has a period of 100 ms, but we now get it 5 times: 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 crtd fprintf bug compensation: 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 errflags=528128 = 08 0F 00 0x08 = IR: Data overrun 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full 0x00 = ECC: - 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" 274.551 1R11 423 03 22 FF FF 00 E0 00 DB 274.551 1R11 426 00 38 01 00 4D 7E 00 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 274.551 1R11 627 00 00 00 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D" !!! 120 ms pause !!! 274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D" → "D" was sent to the listeners & reflected by vehicle::IncomingFrame() then, before the LogFrame() call was processed, the CAN task was paused for 120 ms and now it's getting really weird: 274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message! … this frame certainly was not on the bus! bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 - copy / memory error? … very unlikely - CAN RX handling error? … doesn't seem so - RXFIFO message window bug? 274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 97 E6 54 4B A8 00 6C 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 F2 54 4B A8 00 6C 274.671 1R11 155 07 97 F3 54 4B A8 00 6C 274.671 1R11 423 03 22 FF FF 00 E0 00 DB 274.671 1R11 426 00 38 01 00 4D 7E 00 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 274.671 1R11 627 00 00 00 274.671 1R11 155 07 97 F5 54 4B A8 00 6C 274.671 1R11 155 07 97 F6 54 4B A8 00 6C 274.671 1R11 5D7 FF FF 01 E4 53 80 00 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF 274.671 1R11 155 07 97 F7 54 4B A8 00 6C 274.671 1R11 436 00 0E 3C B2 00 00 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 274.671 1T11 155 01 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error -- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Nasty. I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. I reviewed the ESP32.COM thread, and master driver code from the original developers, but couldn’t find anything obvious. There was one late commit (early this year) commented as 'avoid dominant CAN bus level while initialisation’ and just affecting initialisation. I don’t think that is related, but I just applied it to our code base anyway. The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue? I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it. Regards, Mark
On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de> wrote:
Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident.
The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue.
Unfortunately, something goes completely wrong:
a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times)
b) after resolving the overrun, a garbled message is received.
The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read from the bus, so something is going very wrong here.
I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong.
Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also reflects the garbled message, which causes the charger to stop.)
Regards, Michael
---------------------------------------------------------------------------------------- … 274.341 1R11 155 07 97 E4 54 4B A8 00 6C 274.341 1T11 155 01 97 E4 54 4B A8 00 6C 274.351 1R11 155 07 97 E5 54 4B A8 00 6C 274.351 1T11 155 01 97 E5 54 4B A8 00 6C 274.361 1R11 155 07 97 E5 54 4B A8 00 6C 274.361 1T11 155 01 97 E5 54 4B A8 00 6C 274.371 1R11 423 03 22 FF FF 00 E0 00 DB 274.371 1R11 426 00 38 01 00 4D 7E 00 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 274.371 1R11 627 00 00 00 274.371 1R11 155 07 97 E5 54 4B A8 00 6C 274.371 1T11 155 01 97 E5 54 4B A8 00 6C 274.381 1R11 155 07 97 E5 54 4B A8 00 6C 274.381 1T11 155 01 97 E5 54 4B A8 00 6C 274.391 1R11 5D7 FF FF 01 E4 53 80 00
- normal up to here
!!! CAN RX is blocked here for ~70 ms !!! … interrupts disabled? by whom?
274.461 1R11 599 00 00 4D 7E 1E 26 FF FF 274.461 1R11 436 00 0E 3C B2 00 00 274.461 1R11 155 07 97 E5 54 4B A8 00 6C 274.461 1R11 424 11 40 15 26 47 64 00 48 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F 274.461 1R11 556 30 63 07 30 73 07 30 7A 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here
!!! next RX pause, ~90 ms this time !!!
!!! ID 599 has a period of 100 ms, but we now get it 5 times:
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 crtd fprintf bug compensation: 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 errflags=528128 = 08 0F 00 0x08 = IR: Data overrun 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full 0x00 = ECC: -
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" 274.551 1R11 423 03 22 FF FF 00 E0 00 DB 274.551 1R11 426 00 38 01 00 4D 7E 00 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 274.551 1R11 627 00 00 00 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D"
!!! 120 ms pause !!!
274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D"
→ "D" was sent to the listeners & reflected by vehicle::IncomingFrame() then, before the LogFrame() call was processed, the CAN task was paused for 120 ms
and now it's getting really weird:
274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message!
… this frame certainly was not on the bus! bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 - copy / memory error? … very unlikely - CAN RX handling error? … doesn't seem so - RXFIFO message window bug?
274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 97 E6 54 4B A8 00 6C 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 F2 54 4B A8 00 6C 274.671 1R11 155 07 97 F3 54 4B A8 00 6C 274.671 1R11 423 03 22 FF FF 00 E0 00 DB 274.671 1R11 426 00 38 01 00 4D 7E 00 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 274.671 1R11 627 00 00 00 274.671 1R11 155 07 97 F5 54 4B A8 00 6C 274.671 1R11 155 07 97 F6 54 4B A8 00 6C 274.671 1R11 5D7 FF FF 01 E4 53 80 00 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF 274.671 1R11 155 07 97 F7 54 4B A8 00 6C 274.671 1R11 436 00 0E 3C B2 00 00 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 274.671 1T11 155 01 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson:
I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue?
You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in. But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that. Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled. IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI would be 0, so rxframe() wouldn't be called. …unless IR.RI was set due to a hardware bug… …or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt coming in with IR.RI, but I've seen nothing in the documentation about something like this.
I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it.
I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context switches in some situations, which causes the CAN rx task execution pauses. Just a wild guess though, I would check the source for this pattern if it was open. Regards, Michael
Regards, Mark
On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de> wrote:
Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident.
The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue.
Unfortunately, something goes completely wrong:
a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times)
b) after resolving the overrun, a garbled message is received.
The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read from the bus, so something is going very wrong here.
I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong.
Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also reflects the garbled message, which causes the charger to stop.)
Regards, Michael
---------------------------------------------------------------------------------------- … 274.341 1R11 155 07 97 E4 54 4B A8 00 6C 274.341 1T11 155 01 97 E4 54 4B A8 00 6C 274.351 1R11 155 07 97 E5 54 4B A8 00 6C 274.351 1T11 155 01 97 E5 54 4B A8 00 6C 274.361 1R11 155 07 97 E5 54 4B A8 00 6C 274.361 1T11 155 01 97 E5 54 4B A8 00 6C 274.371 1R11 423 03 22 FF FF 00 E0 00 DB 274.371 1R11 426 00 38 01 00 4D 7E 00 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 274.371 1R11 627 00 00 00 274.371 1R11 155 07 97 E5 54 4B A8 00 6C 274.371 1T11 155 01 97 E5 54 4B A8 00 6C 274.381 1R11 155 07 97 E5 54 4B A8 00 6C 274.381 1T11 155 01 97 E5 54 4B A8 00 6C 274.391 1R11 5D7 FF FF 01 E4 53 80 00
- normal up to here
!!! CAN RX is blocked here for ~70 ms !!! … interrupts disabled? by whom?
274.461 1R11 599 00 00 4D 7E 1E 26 FF FF 274.461 1R11 436 00 0E 3C B2 00 00 274.461 1R11 155 07 97 E5 54 4B A8 00 6C 274.461 1R11 424 11 40 15 26 47 64 00 48 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F 274.461 1R11 556 30 63 07 30 73 07 30 7A 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here
!!! next RX pause, ~90 ms this time !!!
!!! ID 599 has a period of 100 ms, but we now get it 5 times:
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 crtd fprintf bug compensation: 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 errflags=528128 = 08 0F 00 0x08 = IR: Data overrun 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full 0x00 = ECC: -
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" 274.551 1R11 423 03 22 FF FF 00 E0 00 DB 274.551 1R11 426 00 38 01 00 4D 7E 00 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 274.551 1R11 627 00 00 00 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D"
!!! 120 ms pause !!!
274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D"
→ "D" was sent to the listeners & reflected by vehicle::IncomingFrame() then, before the LogFrame() call was processed, the CAN task was paused for 120 ms
and now it's getting really weird:
274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message!
… this frame certainly was not on the bus! bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 - copy / memory error? … very unlikely - CAN RX handling error? … doesn't seem so - RXFIFO message window bug?
274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 97 E6 54 4B A8 00 6C 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 F2 54 4B A8 00 6C 274.671 1R11 155 07 97 F3 54 4B A8 00 6C 274.671 1R11 423 03 22 FF FF 00 E0 00 DB 274.671 1R11 426 00 38 01 00 4D 7E 00 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 274.671 1R11 627 00 00 00 274.671 1R11 155 07 97 F5 54 4B A8 00 6C 274.671 1R11 155 07 97 F6 54 4B A8 00 6C 274.671 1R11 5D7 FF FF 01 E4 53 80 00 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF 274.671 1R11 155 07 97 F7 54 4B A8 00 6C 274.671 1R11 436 00 0E 3C B2 00 00 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 274.671 1T11 155 01 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set. I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping). Looking at page #28 of the datasheet, we have: 4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset. Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a: while (MODULE_ESP32CAN->SR.B.RBS) { // Read and process a message } in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available). Regards, Mark.
On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de> wrote:
Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson:
I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue?
You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in.
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled.
IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI would be 0, so rxframe() wouldn't be called.
…unless IR.RI was set due to a hardware bug…
…or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt coming in with IR.RI, but I've seen nothing in the documentation about something like this.
I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it.
I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context switches in some situations, which causes the CAN rx task execution pauses.
Just a wild guess though, I would check the source for this pattern if it was open.
Regards, Michael
Regards, Mark
On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de> wrote:
Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident.
The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue.
Unfortunately, something goes completely wrong:
a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times)
b) after resolving the overrun, a garbled message is received.
The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read from the bus, so something is going very wrong here.
I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong.
Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also reflects the garbled message, which causes the charger to stop.)
Regards, Michael
---------------------------------------------------------------------------------------- … 274.341 1R11 155 07 97 E4 54 4B A8 00 6C 274.341 1T11 155 01 97 E4 54 4B A8 00 6C 274.351 1R11 155 07 97 E5 54 4B A8 00 6C 274.351 1T11 155 01 97 E5 54 4B A8 00 6C 274.361 1R11 155 07 97 E5 54 4B A8 00 6C 274.361 1T11 155 01 97 E5 54 4B A8 00 6C 274.371 1R11 423 03 22 FF FF 00 E0 00 DB 274.371 1R11 426 00 38 01 00 4D 7E 00 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 274.371 1R11 627 00 00 00 274.371 1R11 155 07 97 E5 54 4B A8 00 6C 274.371 1T11 155 01 97 E5 54 4B A8 00 6C 274.381 1R11 155 07 97 E5 54 4B A8 00 6C 274.381 1T11 155 01 97 E5 54 4B A8 00 6C 274.391 1R11 5D7 FF FF 01 E4 53 80 00
- normal up to here
!!! CAN RX is blocked here for ~70 ms !!! … interrupts disabled? by whom?
274.461 1R11 599 00 00 4D 7E 1E 26 FF FF 274.461 1R11 436 00 0E 3C B2 00 00 274.461 1R11 155 07 97 E5 54 4B A8 00 6C 274.461 1R11 424 11 40 15 26 47 64 00 48 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F 274.461 1R11 556 30 63 07 30 73 07 30 7A 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here
!!! next RX pause, ~90 ms this time !!!
!!! ID 599 has a period of 100 ms, but we now get it 5 times:
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 crtd fprintf bug compensation: 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 errflags=528128 = 08 0F 00 0x08 = IR: Data overrun 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full 0x00 = ECC: -
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" 274.551 1R11 423 03 22 FF FF 00 E0 00 DB 274.551 1R11 426 00 38 01 00 4D 7E 00 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 274.551 1R11 627 00 00 00 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D"
!!! 120 ms pause !!!
274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D"
→ "D" was sent to the listeners & reflected by vehicle::IncomingFrame() then, before the LogFrame() call was processed, the CAN task was paused for 120 ms
and now it's getting really weird:
274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message!
… this frame certainly was not on the bus! bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 - copy / memory error? … very unlikely - CAN RX handling error? … doesn't seem so - RXFIFO message window bug?
274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 97 E6 54 4B A8 00 6C 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 F2 54 4B A8 00 6C 274.671 1R11 155 07 97 F3 54 4B A8 00 6C 274.671 1R11 423 03 22 FF FF 00 E0 00 DB 274.671 1R11 426 00 38 01 00 4D 7E 00 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 274.671 1R11 627 00 00 00 274.671 1R11 155 07 97 F5 54 4B A8 00 6C 274.671 1R11 155 07 97 F6 54 4B A8 00 6C 274.671 1R11 5D7 FF FF 01 E4 53 80 00 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF 274.671 1R11 155 07 97 F7 54 4B A8 00 6C 274.671 1R11 436 00 0E 3C B2 00 00 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 274.671 1T11 155 01 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1. OVMS# test cantx can2 100 Testing 100 frames on can2 Transmitted 100 frames in 0.299612s = 2996us/frame OVMS# test cantx can1 100 Testing 100 frames on can1 Transmitted 100 frames in 0.013691s = 136us/frame I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues. I’ll continue looking into this tomorrow. Regards, Mark.
On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set.
I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping).
Looking at page #28 of the datasheet, we have:
4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset.
Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a:
while (MODULE_ESP32CAN->SR.B.RBS) { // Read and process a message }
in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available).
Regards, Mark.
On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson:
I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue?
You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in.
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled.
IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI would be 0, so rxframe() wouldn't be called.
…unless IR.RI was set due to a hardware bug…
…or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt coming in with IR.RI, but I've seen nothing in the documentation about something like this.
I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it.
I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context switches in some situations, which causes the CAN rx task execution pauses.
Just a wild guess though, I would check the source for this pattern if it was open.
Regards, Michael
Regards, Mark
On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident.
The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue.
Unfortunately, something goes completely wrong:
a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times)
b) after resolving the overrun, a garbled message is received.
The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read from the bus, so something is going very wrong here.
I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong.
Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also reflects the garbled message, which causes the charger to stop.)
Regards, Michael
---------------------------------------------------------------------------------------- … 274.341 1R11 155 07 97 E4 54 4B A8 00 6C 274.341 1T11 155 01 97 E4 54 4B A8 00 6C 274.351 1R11 155 07 97 E5 54 4B A8 00 6C 274.351 1T11 155 01 97 E5 54 4B A8 00 6C 274.361 1R11 155 07 97 E5 54 4B A8 00 6C 274.361 1T11 155 01 97 E5 54 4B A8 00 6C 274.371 1R11 423 03 22 FF FF 00 E0 00 DB 274.371 1R11 426 00 38 01 00 4D 7E 00 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 274.371 1R11 627 00 00 00 274.371 1R11 155 07 97 E5 54 4B A8 00 6C 274.371 1T11 155 01 97 E5 54 4B A8 00 6C 274.381 1R11 155 07 97 E5 54 4B A8 00 6C 274.381 1T11 155 01 97 E5 54 4B A8 00 6C 274.391 1R11 5D7 FF FF 01 E4 53 80 00
- normal up to here
!!! CAN RX is blocked here for ~70 ms !!! … interrupts disabled? by whom?
274.461 1R11 599 00 00 4D 7E 1E 26 FF FF 274.461 1R11 436 00 0E 3C B2 00 00 274.461 1R11 155 07 97 E5 54 4B A8 00 6C 274.461 1R11 424 11 40 15 26 47 64 00 48 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F 274.461 1R11 556 30 63 07 30 73 07 30 7A 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here
!!! next RX pause, ~90 ms this time !!!
!!! ID 599 has a period of 100 ms, but we now get it 5 times:
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 crtd fprintf bug compensation: 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 errflags=528128 = 08 0F 00 0x08 = IR: Data overrun 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full 0x00 = ECC: -
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" 274.551 1R11 423 03 22 FF FF 00 E0 00 DB 274.551 1R11 426 00 38 01 00 4D 7E 00 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 274.551 1R11 627 00 00 00 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D"
!!! 120 ms pause !!!
274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D"
→ "D" was sent to the listeners & reflected by vehicle::IncomingFrame() then, before the LogFrame() call was processed, the CAN task was paused for 120 ms
and now it's getting really weird:
274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message!
… this frame certainly was not on the bus! bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 - copy / memory error? … very unlikely - CAN RX handling error? … doesn't seem so - RXFIFO message window bug?
274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 97 E6 54 4B A8 00 6C 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 F2 54 4B A8 00 6C 274.671 1R11 155 07 97 F3 54 4B A8 00 6C 274.671 1R11 423 03 22 FF FF 00 E0 00 DB 274.671 1R11 426 00 38 01 00 4D 7E 00 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 274.671 1R11 627 00 00 00 274.671 1R11 155 07 97 F5 54 4B A8 00 6C 274.671 1R11 155 07 97 F6 54 4B A8 00 6C 274.671 1R11 5D7 FF FF 01 E4 53 80 00 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF 274.671 1R11 155 07 97 F7 54 4B A8 00 6C 274.671 1R11 436 00 0E 3C B2 00 00 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 274.671 1T11 155 01 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Mark, Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was: CAN: can1 Mode: Active Speed: 500000 Interrupts: 22425486 Rx pkt: 21080181 Rx err: 5 Rx ovrflw: 2054 Tx pkt: 1340517 Tx delays: 335 Tx err: 0 Tx ovrflw: 0 Err flags: 0x00800c5b That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set? ECC 5b = form error on TX in acknowledge delimiter (?) Can we use some of this info as an indicator that restarting the interface is necessary? Regards, Michael Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson:
I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1.
OVMS# test cantx can2 100 Testing 100 frames on can2 Transmitted 100 frames in 0.299612s = 2996us/frame
OVMS# test cantx can1 100 Testing 100 frames on can1 Transmitted 100 frames in 0.013691s = 136us/frame
I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues.
I’ll continue looking into this tomorrow.
Regards, Mark.
On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set.
I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping).
Looking at page #28 of the datasheet, we have:
/4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset./
Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a:
while (MODULE_ESP32CAN->SR.B.RBS) { // Read and process a message }
in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available).
Regards, Mark.
On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson:
I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue?
You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in.
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled.
IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI would be 0, so rxframe() wouldn't be called.
…unless IR.RI was set due to a hardware bug…
…or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt coming in with IR.RI, but I've seen nothing in the documentation about something like this.
I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it.
I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context switches in some situations, which causes the CAN rx task execution pauses.
Just a wild guess though, I would check the source for this pattern if it was open.
Regards, Michael
Regards, Mark
On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident.
The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue.
Unfortunately, something goes completely wrong:
a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times)
b) after resolving the overrun, a garbled message is received.
The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read from the bus, so something is going very wrong here.
I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong.
Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also reflects the garbled message, which causes the charger to stop.)
Regards, Michael
---------------------------------------------------------------------------------------- … 274.341 1R11 155 07 97 E4 54 4B A8 00 6C 274.341 1T11 155 01 97 E4 54 4B A8 00 6C 274.351 1R11 155 07 97 E5 54 4B A8 00 6C 274.351 1T11 155 01 97 E5 54 4B A8 00 6C 274.361 1R11 155 07 97 E5 54 4B A8 00 6C 274.361 1T11 155 01 97 E5 54 4B A8 00 6C 274.371 1R11 423 03 22 FF FF 00 E0 00 DB 274.371 1R11 426 00 38 01 00 4D 7E 00 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 274.371 1R11 627 00 00 00 274.371 1R11 155 07 97 E5 54 4B A8 00 6C 274.371 1T11 155 01 97 E5 54 4B A8 00 6C 274.381 1R11 155 07 97 E5 54 4B A8 00 6C 274.381 1T11 155 01 97 E5 54 4B A8 00 6C 274.391 1R11 5D7 FF FF 01 E4 53 80 00
- normal up to here
!!! CAN RX is blocked here for ~70 ms !!! … interrupts disabled? by whom?
274.461 1R11 599 00 00 4D 7E 1E 26 FF FF 274.461 1R11 436 00 0E 3C B2 00 00 274.461 1R11 155 07 97 E5 54 4B A8 00 6C 274.461 1R11 424 11 40 15 26 47 64 00 48 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F 274.461 1R11 556 30 63 07 30 73 07 30 7A 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here
!!! next RX pause, ~90 ms this time !!!
!!! ID 599 has a period of 100 ms, but we now get it 5 times:
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 crtd fprintf bug compensation: 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 errflags=528128 = 08 0F 00 0x08 = IR: Data overrun 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full 0x00 = ECC: -
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" 274.551 1R11 423 03 22 FF FF 00 E0 00 DB 274.551 1R11 426 00 38 01 00 4D 7E 00 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 274.551 1R11 627 00 00 00 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D"
!!! 120 ms pause !!!
274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D"
→ "D" was sent to the listeners & reflected by vehicle::IncomingFrame() then, before the LogFrame() call was processed, the CAN task was paused for 120 ms
and now it's getting really weird:
274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message!
… this frame certainly was not on the bus! bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 - copy / memory error? … very unlikely - CAN RX handling error? … doesn't seem so - RXFIFO message window bug?
274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 97 E6 54 4B A8 00 6C 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 F2 54 4B A8 00 6C 274.671 1R11 155 07 97 F3 54 4B A8 00 6C 274.671 1R11 423 03 22 FF FF 00 E0 00 DB 274.671 1R11 426 00 38 01 00 4D 7E 00 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 274.671 1R11 627 00 00 00 274.671 1R11 155 07 97 F5 54 4B A8 00 6C 274.671 1R11 155 07 97 F6 54 4B A8 00 6C 274.671 1R11 5D7 FF FF 01 E4 53 80 00 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF 274.671 1R11 155 07 97 F7 54 4B A8 00 6C 274.671 1R11 436 00 0E 3C B2 00 00 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 274.671 1T11 155 01 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover? When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient? Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy? Regards, Mark.
On 14 Aug 2018, at 6:57 PM, Michael Balzer <dexter@expeedo.de> wrote:
Mark,
Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was:
CAN: can1 Mode: Active Speed: 500000 Interrupts: 22425486 Rx pkt: 21080181 Rx err: 5 Rx ovrflw: 2054 Tx pkt: 1340517 Tx delays: 335 Tx err: 0 Tx ovrflw: 0 Err flags: 0x00800c5b
That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set?
ECC 5b = form error on TX in acknowledge delimiter (?)
Can we use some of this info as an indicator that restarting the interface is necessary?
Regards, Michael
Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson:
I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1.
OVMS# test cantx can2 100 Testing 100 frames on can2 Transmitted 100 frames in 0.299612s = 2996us/frame
OVMS# test cantx can1 100 Testing 100 frames on can1 Transmitted 100 frames in 0.013691s = 136us/frame
I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues.
I’ll continue looking into this tomorrow.
Regards, Mark.
On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set.
I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping).
Looking at page #28 of the datasheet, we have:
4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset.
Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a:
while (MODULE_ESP32CAN->SR.B.RBS) { // Read and process a message }
in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available).
Regards, Mark.
On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson:
I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue?
You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in.
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled.
IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI would be 0, so rxframe() wouldn't be called.
…unless IR.RI was set due to a hardware bug…
…or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt coming in with IR.RI, but I've seen nothing in the documentation about something like this.
I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it.
I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context switches in some situations, which causes the CAN rx task execution pauses.
Just a wild guess though, I would check the source for this pattern if it was open.
Regards, Michael
Regards, Mark
On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident.
The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue.
Unfortunately, something goes completely wrong:
a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times)
b) after resolving the overrun, a garbled message is received.
The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read from the bus, so something is going very wrong here.
I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong.
Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also reflects the garbled message, which causes the charger to stop.)
Regards, Michael
---------------------------------------------------------------------------------------- … 274.341 1R11 155 07 97 E4 54 4B A8 00 6C 274.341 1T11 155 01 97 E4 54 4B A8 00 6C 274.351 1R11 155 07 97 E5 54 4B A8 00 6C 274.351 1T11 155 01 97 E5 54 4B A8 00 6C 274.361 1R11 155 07 97 E5 54 4B A8 00 6C 274.361 1T11 155 01 97 E5 54 4B A8 00 6C 274.371 1R11 423 03 22 FF FF 00 E0 00 DB 274.371 1R11 426 00 38 01 00 4D 7E 00 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 274.371 1R11 627 00 00 00 274.371 1R11 155 07 97 E5 54 4B A8 00 6C 274.371 1T11 155 01 97 E5 54 4B A8 00 6C 274.381 1R11 155 07 97 E5 54 4B A8 00 6C 274.381 1T11 155 01 97 E5 54 4B A8 00 6C 274.391 1R11 5D7 FF FF 01 E4 53 80 00
- normal up to here
!!! CAN RX is blocked here for ~70 ms !!! … interrupts disabled? by whom?
274.461 1R11 599 00 00 4D 7E 1E 26 FF FF 274.461 1R11 436 00 0E 3C B2 00 00 274.461 1R11 155 07 97 E5 54 4B A8 00 6C 274.461 1R11 424 11 40 15 26 47 64 00 48 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F 274.461 1R11 556 30 63 07 30 73 07 30 7A 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here
!!! next RX pause, ~90 ms this time !!!
!!! ID 599 has a period of 100 ms, but we now get it 5 times:
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF
274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 crtd fprintf bug compensation: 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 errflags=528128 = 08 0F 00 0x08 = IR: Data overrun 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full 0x00 = ECC: -
274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" 274.551 1R11 423 03 22 FF FF 00 E0 00 DB 274.551 1R11 426 00 38 01 00 4D 7E 00 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 274.551 1R11 627 00 00 00 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D"
!!! 120 ms pause !!!
274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D"
→ "D" was sent to the listeners & reflected by vehicle::IncomingFrame() then, before the LogFrame() call was processed, the CAN task was paused for 120 ms
and now it's getting really weird:
274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message!
… this frame certainly was not on the bus! bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 - copy / memory error? … very unlikely - CAN RX handling error? … doesn't seem so - RXFIFO message window bug?
274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 97 E6 54 4B A8 00 6C 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C 274.671 1R11 155 07 97 F2 54 4B A8 00 6C 274.671 1R11 155 07 97 F3 54 4B A8 00 6C 274.671 1R11 423 03 22 FF FF 00 E0 00 DB 274.671 1R11 426 00 38 01 00 4D 7E 00 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 274.671 1R11 627 00 00 00 274.671 1R11 155 07 97 F5 54 4B A8 00 6C 274.671 1R11 155 07 97 F6 54 4B A8 00 6C 274.671 1R11 5D7 FF FF 01 E4 53 80 00 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF 274.671 1R11 155 07 97 F7 54 4B A8 00 6C 274.671 1R11 436 00 0E 3C B2 00 00 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 274.671 1T11 155 01 97 E7 54 4B A8 00 6C 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
On 14 Aug 2018, at 6:57 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was:
CAN: can1 Mode: Active Speed: 500000 Interrupts: 22425486 Rx pkt: 21080181 Rx err: 5 Rx ovrflw: 2054 Tx pkt: 1340517 Tx delays: 335 Tx err: 0 Tx ovrflw: 0 Err flags: 0x00800c5b
That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set?
ECC 5b = form error on TX in acknowledge delimiter (?)
Can we use some of this info as an indicator that restarting the interface is necessary?
Regards, Michael
Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson:
I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1.
OVMS# test cantx can2 100 Testing 100 frames on can2 Transmitted 100 frames in 0.299612s = 2996us/frame
OVMS# test cantx can1 100 Testing 100 frames on can1 Transmitted 100 frames in 0.013691s = 136us/frame
I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues.
I’ll continue looking into this tomorrow.
Regards, Mark.
On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set.
I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping).
Looking at page #28 of the datasheet, we have:
4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset.
Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a:
while (MODULE_ESP32CAN->SR.B.RBS) { // Read and process a message }
in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available).
Regards, Mark.
On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson:
I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue?
You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in.
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled.
IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI would be 0, so rxframe() wouldn't be called.
…unless IR.RI was set due to a hardware bug…
…or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt coming in with IR.RI, but I've seen nothing in the documentation about something like this.
I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it.
I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context switches in some situations, which causes the CAN rx task execution pauses.
Just a wild guess though, I would check the source for this pattern if it was open.
Regards, Michael
Regards, Mark
> On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: > > Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident. > > The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO > will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue. > > Unfortunately, something goes completely wrong: > > a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times) > > b) after resolving the overrun, a garbled message is received. > > The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read > from the bus, so something is going very wrong here. > > I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong. > > Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also > reflects the garbled message, which causes the charger to stop.) > > Regards, > Michael > > > ---------------------------------------------------------------------------------------- > … > 274.341 1R11 155 07 97 E4 54 4B A8 00 6C > 274.341 1T11 155 01 97 E4 54 4B A8 00 6C > 274.351 1R11 155 07 97 E5 54 4B A8 00 6C > 274.351 1T11 155 01 97 E5 54 4B A8 00 6C > 274.361 1R11 155 07 97 E5 54 4B A8 00 6C > 274.361 1T11 155 01 97 E5 54 4B A8 00 6C > 274.371 1R11 423 03 22 FF FF 00 E0 00 DB > 274.371 1R11 426 00 38 01 00 4D 7E 00 > 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 > 274.371 1R11 627 00 00 00 > 274.371 1R11 155 07 97 E5 54 4B A8 00 6C > 274.371 1T11 155 01 97 E5 54 4B A8 00 6C > 274.381 1R11 155 07 97 E5 54 4B A8 00 6C > 274.381 1T11 155 01 97 E5 54 4B A8 00 6C > 274.391 1R11 5D7 FF FF 01 E4 53 80 00 > > - normal up to here > > !!! CAN RX is blocked here for ~70 ms !!! > … interrupts disabled? by whom? > > 274.461 1R11 599 00 00 4D 7E 1E 26 FF FF > 274.461 1R11 436 00 0E 3C B2 00 00 > 274.461 1R11 155 07 97 E5 54 4B A8 00 6C > 274.461 1R11 424 11 40 15 26 47 64 00 48 > 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F > 274.461 1R11 556 30 63 07 30 73 07 30 7A > 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here > > !!! next RX pause, ~90 ms this time !!! > > !!! ID 599 has a period of 100 ms, but we now get it 5 times: > > 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF > > 274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 > crtd fprintf bug compensation: > 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 > errflags=528128 = 08 0F 00 > 0x08 = IR: Data overrun > 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full > 0x00 = ECC: - > > 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF > 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF > 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF > 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF > 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" > 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" > 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" > 274.551 1R11 423 03 22 FF FF 00 E0 00 DB > 274.551 1R11 426 00 38 01 00 4D 7E 00 > 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 > 274.551 1R11 627 00 00 00 > 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" > 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" > 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" > 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D" > > !!! 120 ms pause !!! > > 274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D" > > → "D" was sent to the listeners & reflected by vehicle::IncomingFrame() > then, before the LogFrame() call was processed, the CAN task was paused for 120 ms > > and now it's getting really weird: > > 274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message! > > … this frame certainly was not on the bus! > bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus > bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 > - copy / memory error? … very unlikely > - CAN RX handling error? … doesn't seem so > - RXFIFO message window bug? > > 274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 > 274.671 1R11 155 07 97 E7 54 4B A8 00 6C > 274.671 1R11 155 07 97 E7 54 4B A8 00 6C > 274.671 1R11 155 07 97 E7 54 4B A8 00 6C > 274.671 1T11 155 01 97 E6 54 4B A8 00 6C > 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C > 274.671 1R11 155 07 97 F2 54 4B A8 00 6C > 274.671 1R11 155 07 97 F3 54 4B A8 00 6C > 274.671 1R11 423 03 22 FF FF 00 E0 00 DB > 274.671 1R11 426 00 38 01 00 4D 7E 00 > 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 > 274.671 1R11 627 00 00 00 > 274.671 1R11 155 07 97 F5 54 4B A8 00 6C > 274.671 1R11 155 07 97 F6 54 4B A8 00 6C > 274.671 1R11 5D7 FF FF 01 E4 53 80 00 > 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF > 274.671 1R11 155 07 97 F7 54 4B A8 00 6C > 274.671 1R11 436 00 0E 3C B2 00 00 > 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 > 274.671 1T11 155 01 97 E7 54 4B A8 00 6C > 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error > > > -- > Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal > Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 > > > _______________________________________________ > OvmsDev mailing list > OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> > http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
On my Leaf the bus seems to stop quite often, so the reset would have to operate quite often too. Perhaps increment a metric when the reset fires so we can monitor how often it is happening? On 16/08/18 01:46, Mark Webb-Johnson wrote:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
The mcp2515 issues are probably different than esp32can, but similar logic can probably be used to address them both. After a week in my car with three buses running, can2 locked up today. Here is the status: OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 40069 Rx pkt: 40429 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001 OVMS# can can3 status CAN: can3 Mode: Active Speed: 125000 Interrupts: 5003600 Rx pkt: 5003610 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001 CAN3 was operating normally. Flags identical. I fixed this with: OVMS# can can2 start active 500000 Can bus can2 started in mode active at speed 500000bps OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 831 Rx pkt: 831 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001 No need to power on/off. For mcp2515, I’ll try to add a ‘kick’ function able to try to read the status registers and restart as appropriate. That should give us more information. Regards, Mark.
On 16 Aug 2018, at 4:58 PM, Tom Parker <tom@carrott.org> wrote:
On my Leaf the bus seems to stop quite often, so the reset would have to operate quite often too. Perhaps increment a metric when the reset fires so we can monitor how often it is happening?
On 16/08/18 01:46, Mark Webb-Johnson wrote:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Hi I tried the "can can2 start active 500000” command. The first time it stopped again almost instant. The second time it worked for a minute. I got two different error flags: CAN: can2 Mode: Active Speed: 500000 Interrupts: 16480 Rx pkt: 16643 Rx err: 0 Rx ovrflw: 37 Tx pkt: 231 Tx delays: 0 Tx err: 128 Tx ovrflw: 0 Err flags: 0x8015 CAN: can2 Mode: Active Speed: 500000 Interrupts: 9164 Rx pkt: 9230 Rx err: 0 Rx ovrflw: 17 Tx pkt: 25 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x2040 Regards, Stein Arne Sordal
On 16 Aug 2018, at 13:05, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
The mcp2515 issues are probably different than esp32can, but similar logic can probably be used to address them both.
After a week in my car with three buses running, can2 locked up today. Here is the status:
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 40069 Rx pkt: 40429 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
OVMS# can can3 status CAN: can3 Mode: Active Speed: 125000 Interrupts: 5003600 Rx pkt: 5003610 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
CAN3 was operating normally. Flags identical.
I fixed this with:
OVMS# can can2 start active 500000 Can bus can2 started in mode active at speed 500000bps
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 831 Rx pkt: 831 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
No need to power on/off.
For mcp2515, I’ll try to add a ‘kick’ function able to try to read the status registers and restart as appropriate. That should give us more information.
Regards, Mark.
On 16 Aug 2018, at 4:58 PM, Tom Parker <tom@carrott.org <mailto:tom@carrott.org>> wrote:
On my Leaf the bus seems to stop quite often, so the reset would have to operate quite often too. Perhaps increment a metric when the reset fires so we can monitor how often it is happening?
On 16/08/18 01:46, Mark Webb-Johnson wrote:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
What version firmware are you running? Those err flags seem too short.
On 16 Aug 2018, at 9:41 PM, Stein Arne Sordal <ovms@topphemmelig.no> wrote:
Hi
I tried the "can can2 start active 500000” command. The first time it stopped again almost instant. The second time it worked for a minute.
I got two different error flags:
CAN: can2 Mode: Active Speed: 500000 Interrupts: 16480 Rx pkt: 16643 Rx err: 0 Rx ovrflw: 37 Tx pkt: 231 Tx delays: 0 Tx err: 128 Tx ovrflw: 0 Err flags: 0x8015
CAN: can2 Mode: Active Speed: 500000 Interrupts: 9164 Rx pkt: 9230 Rx err: 0 Rx ovrflw: 17 Tx pkt: 25 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x2040
Regards, Stein Arne Sordal
On 16 Aug 2018, at 13:05, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
The mcp2515 issues are probably different than esp32can, but similar logic can probably be used to address them both.
After a week in my car with three buses running, can2 locked up today. Here is the status:
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 40069 Rx pkt: 40429 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
OVMS# can can3 status CAN: can3 Mode: Active Speed: 125000 Interrupts: 5003600 Rx pkt: 5003610 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
CAN3 was operating normally. Flags identical.
I fixed this with:
OVMS# can can2 start active 500000 Can bus can2 started in mode active at speed 500000bps
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 831 Rx pkt: 831 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
No need to power on/off.
For mcp2515, I’ll try to add a ‘kick’ function able to try to read the status registers and restart as appropriate. That should give us more information.
Regards, Mark.
On 16 Aug 2018, at 4:58 PM, Tom Parker <tom@carrott.org> wrote:
On my Leaf the bus seems to stop quite often, so the reset would have to operate quite often too. Perhaps increment a metric when the reset fires so we can monitor how often it is happening?
On 16/08/18 01:46, Mark Webb-Johnson wrote: Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
3.1.008 with some small custom mods. I will update later today. -Stein Arne Sordal-
On 16 Aug 2018, at 15:52, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
What version firmware are you running? Those err flags seem too short.
On 16 Aug 2018, at 9:41 PM, Stein Arne Sordal <ovms@topphemmelig.no <mailto:ovms@topphemmelig.no>> wrote:
Hi
I tried the "can can2 start active 500000” command. The first time it stopped again almost instant. The second time it worked for a minute.
I got two different error flags:
CAN: can2 Mode: Active Speed: 500000 Interrupts: 16480 Rx pkt: 16643 Rx err: 0 Rx ovrflw: 37 Tx pkt: 231 Tx delays: 0 Tx err: 128 Tx ovrflw: 0 Err flags: 0x8015
CAN: can2 Mode: Active Speed: 500000 Interrupts: 9164 Rx pkt: 9230 Rx err: 0 Rx ovrflw: 17 Tx pkt: 25 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x2040
Regards, Stein Arne Sordal
On 16 Aug 2018, at 13:05, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
The mcp2515 issues are probably different than esp32can, but similar logic can probably be used to address them both.
After a week in my car with three buses running, can2 locked up today. Here is the status:
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 40069 Rx pkt: 40429 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
OVMS# can can3 status CAN: can3 Mode: Active Speed: 125000 Interrupts: 5003600 Rx pkt: 5003610 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
CAN3 was operating normally. Flags identical.
I fixed this with:
OVMS# can can2 start active 500000 Can bus can2 started in mode active at speed 500000bps
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 831 Rx pkt: 831 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
No need to power on/off.
For mcp2515, I’ll try to add a ‘kick’ function able to try to read the status registers and restart as appropriate. That should give us more information.
Regards, Mark.
On 16 Aug 2018, at 4:58 PM, Tom Parker <tom@carrott.org <mailto:tom@carrott.org>> wrote:
On my Leaf the bus seems to stop quite often, so the reset would have to operate quite often too. Perhaps increment a metric when the reset fires so we can monitor how often it is happening?
On 16/08/18 01:46, Mark Webb-Johnson wrote:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Switched from main to edge and now I got the long error flags: CAN: can2 Mode: Active Speed: 500000 Interrupts: 30701 Rx pkt: 30711 Rx err: 0 Rx ovrflw: 0 Tx pkt: 51 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001 -Stein Arne-
On 16 Aug 2018, at 15:52, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
What version firmware are you running? Those err flags seem too short.
On 16 Aug 2018, at 9:41 PM, Stein Arne Sordal <ovms@topphemmelig.no <mailto:ovms@topphemmelig.no>> wrote:
Hi
I tried the "can can2 start active 500000” command. The first time it stopped again almost instant. The second time it worked for a minute.
I got two different error flags:
CAN: can2 Mode: Active Speed: 500000 Interrupts: 16480 Rx pkt: 16643 Rx err: 0 Rx ovrflw: 37 Tx pkt: 231 Tx delays: 0 Tx err: 128 Tx ovrflw: 0 Err flags: 0x8015
CAN: can2 Mode: Active Speed: 500000 Interrupts: 9164 Rx pkt: 9230 Rx err: 0 Rx ovrflw: 17 Tx pkt: 25 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x2040
Regards, Stein Arne Sordal
On 16 Aug 2018, at 13:05, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
The mcp2515 issues are probably different than esp32can, but similar logic can probably be used to address them both.
After a week in my car with three buses running, can2 locked up today. Here is the status:
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 40069 Rx pkt: 40429 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
OVMS# can can3 status CAN: can3 Mode: Active Speed: 125000 Interrupts: 5003600 Rx pkt: 5003610 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
CAN3 was operating normally. Flags identical.
I fixed this with:
OVMS# can can2 start active 500000 Can bus can2 started in mode active at speed 500000bps
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 831 Rx pkt: 831 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
No need to power on/off.
For mcp2515, I’ll try to add a ‘kick’ function able to try to read the status registers and restart as appropriate. That should give us more information.
Regards, Mark.
On 16 Aug 2018, at 4:58 PM, Tom Parker <tom@carrott.org <mailto:tom@carrott.org>> wrote:
On my Leaf the bus seems to stop quite often, so the reset would have to operate quite often too. Perhaps increment a metric when the reset fires so we can monitor how often it is happening?
On 16/08/18 01:46, Mark Webb-Johnson wrote:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Despite working on this on-and-off for a month, I can’t find the root cause. The MCP2515 just seems to randomly get overloaded, and stop receiving, but the flags don’t Indicate it in any reliable way. I have just committed a change to try to workaround this: Introduce a watchdog reset count in the status for each can bus. Introduce a watchdog timer that tracks the time the last message was received. If the car is OFF, the watchdog timer is simply reset once every ten seconds. If the car is ON, and the watchdog timer indicates no messages received for 60 seconds (or longer), the bus is reset (turned off and on). An error is logged to record this, and the status counter incremented. That should go out to EDGE release tonight. It is already in my car, but that only fails once every week or so. Regards, Mark
On 16 Aug 2018, at 7:05 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
The mcp2515 issues are probably different than esp32can, but similar logic can probably be used to address them both.
After a week in my car with three buses running, can2 locked up today. Here is the status:
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 40069 Rx pkt: 40429 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
OVMS# can can3 status CAN: can3 Mode: Active Speed: 125000 Interrupts: 5003600 Rx pkt: 5003610 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
CAN3 was operating normally. Flags identical.
I fixed this with:
OVMS# can can2 start active 500000 Can bus can2 started in mode active at speed 500000bps
OVMS# can can2 status CAN: can2 Mode: Active Speed: 500000 Interrupts: 831 Rx pkt: 831 Rx err: 0 Rx ovrflw: 0 Tx pkt: 0 Tx delays: 0 Tx err: 0 Tx ovrflw: 0 Err flags: 0x01000001
No need to power on/off.
For mcp2515, I’ll try to add a ‘kick’ function able to try to read the status registers and restart as appropriate. That should give us more information.
Regards, Mark.
On 16 Aug 2018, at 4:58 PM, Tom Parker <tom@carrott.org <mailto:tom@carrott.org>> wrote:
On my Leaf the bus seems to stop quite often, so the reset would have to operate quite often too. Perhaps increment a metric when the reset fires so we can monitor how often it is happening?
On 16/08/18 01:46, Mark Webb-Johnson wrote:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Our current implementation lacks a recovery attempt from bus errors and arbitration losses, we only record the error status. The new Espressif CAN driver may contain some useful info on this. Regards, Michael Am 15.08.2018 um 15:46 schrieb Mark Webb-Johnson:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
On 14 Aug 2018, at 6:57 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was:
CAN: can1 Mode: Active Speed: 500000 Interrupts: 22425486 Rx pkt: 21080181 Rx err: 5 Rx ovrflw: 2054 Tx pkt: 1340517 Tx delays: 335 Tx err: 0 Tx ovrflw: 0 Err flags: 0x00800c5b
That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set?
ECC 5b = form error on TX in acknowledge delimiter (?)
Can we use some of this info as an indicator that restarting the interface is necessary?
Regards, Michael
Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson:
I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1.
OVMS# test cantx can2 100 Testing 100 frames on can2 Transmitted 100 frames in 0.299612s = 2996us/frame
OVMS# test cantx can1 100 Testing 100 frames on can1 Transmitted 100 frames in 0.013691s = 136us/frame
I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues.
I’ll continue looking into this tomorrow.
Regards, Mark.
On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set.
I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping).
Looking at page #28 of the datasheet, we have:
/4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset./
Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a:
while (MODULE_ESP32CAN->SR.B.RBS) { // Read and process a message }
in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available).
Regards, Mark.
On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson: > I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. > The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing > it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause > of this issue?
You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in.
But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled.
IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI would be 0, so rxframe() wouldn't be called.
…unless IR.RI was set due to a hardware bug…
…or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt coming in with IR.RI, but I've seen nothing in the documentation about something like this.
> I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out > something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but > don’t see this problem. Maybe it is rare, or we are just not noticing it.
I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context switches in some situations, which causes the CAN rx task execution pauses.
Just a wild guess though, I would check the source for this pattern if it was open.
Regards, Michael
> Regards, Mark > >> On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >> >> Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident. >> >> The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the >> RXFIFO >> will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue. >> >> Unfortunately, something goes completely wrong: >> >> a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times) >> >> b) after resolving the overrun, a garbled message is received. >> >> The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been >> read >> from the bus, so something is going very wrong here. >> >> I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong. >> >> Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also >> reflects the garbled message, which causes the charger to stop.) >> >> Regards, >> Michael >> >> >> ---------------------------------------------------------------------------------------- >> … >> 274.341 1R11 155 07 97 E4 54 4B A8 00 6C >> 274.341 1T11 155 01 97 E4 54 4B A8 00 6C >> 274.351 1R11 155 07 97 E5 54 4B A8 00 6C >> 274.351 1T11 155 01 97 E5 54 4B A8 00 6C >> 274.361 1R11 155 07 97 E5 54 4B A8 00 6C >> 274.361 1T11 155 01 97 E5 54 4B A8 00 6C >> 274.371 1R11 423 03 22 FF FF 00 E0 00 DB >> 274.371 1R11 426 00 38 01 00 4D 7E 00 >> 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 >> 274.371 1R11 627 00 00 00 >> 274.371 1R11 155 07 97 E5 54 4B A8 00 6C >> 274.371 1T11 155 01 97 E5 54 4B A8 00 6C >> 274.381 1R11 155 07 97 E5 54 4B A8 00 6C >> 274.381 1T11 155 01 97 E5 54 4B A8 00 6C >> 274.391 1R11 5D7 FF FF 01 E4 53 80 00 >> >> - normal up to here >> >> !!! CAN RX is blocked here for ~70 ms !!! >> … interrupts disabled? by whom? >> >> 274.461 1R11 599 00 00 4D 7E 1E 26 FF FF >> 274.461 1R11 436 00 0E 3C B2 00 00 >> 274.461 1R11 155 07 97 E5 54 4B A8 00 6C >> 274.461 1R11 424 11 40 15 26 47 64 00 48 >> 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F >> 274.461 1R11 556 30 63 07 30 73 07 30 7A >> 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here >> >> !!! next RX pause, ~90 ms this time !!! >> >> !!! ID 599 has a period of 100 ms, but we now get it 5 times: >> >> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >> >> 274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >> crtd fprintf bug compensation: >> 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >> errflags=528128 = 08 0F 00 >> 0x08 = IR: Data overrun >> 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full >> 0x00 = ECC: - >> >> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" >> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" >> 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" >> 274.551 1R11 423 03 22 FF FF 00 E0 00 DB >> 274.551 1R11 426 00 38 01 00 4D 7E 00 >> 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 >> 274.551 1R11 627 00 00 00 >> 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" >> 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" >> 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" >> 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D" >> >> !!! 120 ms pause !!! >> >> 274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D" >> >> → "D" was sent to the listeners & reflected by vehicle::IncomingFrame() >> then, before the LogFrame() call was processed, the CAN task was paused for 120 ms >> >> and now it's getting really weird: >> >> 274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message! >> >> … this frame certainly was not on the bus! >> bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus >> bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 >> - copy / memory error? … very unlikely >> - CAN RX handling error? … doesn't seem so >> - RXFIFO message window bug? >> >> 274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 >> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >> 274.671 1T11 155 01 97 E6 54 4B A8 00 6C >> 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C >> 274.671 1R11 155 07 97 F2 54 4B A8 00 6C >> 274.671 1R11 155 07 97 F3 54 4B A8 00 6C >> 274.671 1R11 423 03 22 FF FF 00 E0 00 DB >> 274.671 1R11 426 00 38 01 00 4D 7E 00 >> 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 >> 274.671 1R11 627 00 00 00 >> 274.671 1R11 155 07 97 F5 54 4B A8 00 6C >> 274.671 1R11 155 07 97 F6 54 4B A8 00 6C >> 274.671 1R11 5D7 FF FF 01 E4 53 80 00 >> 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF >> 274.671 1R11 155 07 97 F7 54 4B A8 00 6C >> 274.671 1R11 436 00 0E 3C B2 00 00 >> 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 >> 274.671 1T11 155 01 97 E7 54 4B A8 00 6C >> 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error >> >> >> -- >> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >> >> >> _______________________________________________ >> OvmsDev mailing list >> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >> http://lists.openvehicles.com/mailman/listinfo/ovmsdev > _______________________________________________ > OvmsDev mailing list > OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> > http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
I Updated to the latest firmware and CAN2 has not yet stopped with this firmware. But if I compile the latest firmware and adding the lock/unlock doors functionality, CAN2 stops within one hour…. Anyone have some clues? Regards, Stein Arne Sordal
On 19 Aug 2018, at 13:59, Michael Balzer <dexter@expeedo.de> wrote:
Our current implementation lacks a recovery attempt from bus errors and arbitration losses, we only record the error status.
The new Espressif CAN driver may contain some useful info on this.
Regards, Michael
Am 15.08.2018 um 15:46 schrieb Mark Webb-Johnson:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
On 14 Aug 2018, at 6:57 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was:
CAN: can1 Mode: Active Speed: 500000 Interrupts: 22425486 Rx pkt: 21080181 Rx err: 5 Rx ovrflw: 2054 Tx pkt: 1340517 Tx delays: 335 Tx err: 0 Tx ovrflw: 0 Err flags: 0x00800c5b
That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set?
ECC 5b = form error on TX in acknowledge delimiter (?)
Can we use some of this info as an indicator that restarting the interface is necessary?
Regards, Michael
Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson:
I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1.
OVMS# test cantx can2 100 Testing 100 frames on can2 Transmitted 100 frames in 0.299612s = 2996us/frame
OVMS# test cantx can1 100 Testing 100 frames on can1 Transmitted 100 frames in 0.013691s = 136us/frame
I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues.
I’ll continue looking into this tomorrow.
Regards, Mark.
On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
> But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in > PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that.
I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set.
I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping).
Looking at page #28 of the datasheet, we have:
4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset.
Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a:
while (MODULE_ESP32CAN->SR.B.RBS) { // Read and process a message }
in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available).
Regards, Mark.
> On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: > > Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson: >> I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. >> The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue? > > You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in. > > But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in > PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that. > > Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled. > > IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI > would be 0, so rxframe() wouldn't be called. > > …unless IR.RI was set due to a hardware bug… > > …or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt > coming in with IR.RI, but I've seen nothing in the documentation about something like this. > >> I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it. > > I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context > switches in some situations, which causes the CAN rx task execution pauses. > > Just a wild guess though, I would check the source for this pattern if it was open. > > Regards, > Michael > > >> Regards, Mark >> >>> On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>> >>> Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident. >>> >>> The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO >>> will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue. >>> >>> Unfortunately, something goes completely wrong: >>> >>> a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times) >>> >>> b) after resolving the overrun, a garbled message is received. >>> >>> The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read >>> from the bus, so something is going very wrong here. >>> >>> I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong. >>> >>> Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also >>> reflects the garbled message, which causes the charger to stop.) >>> >>> Regards, >>> Michael >>> >>> >>> ---------------------------------------------------------------------------------------- >>> … >>> 274.341 1R11 155 07 97 E4 54 4B A8 00 6C >>> 274.341 1T11 155 01 97 E4 54 4B A8 00 6C >>> 274.351 1R11 155 07 97 E5 54 4B A8 00 6C >>> 274.351 1T11 155 01 97 E5 54 4B A8 00 6C >>> 274.361 1R11 155 07 97 E5 54 4B A8 00 6C >>> 274.361 1T11 155 01 97 E5 54 4B A8 00 6C >>> 274.371 1R11 423 03 22 FF FF 00 E0 00 DB >>> 274.371 1R11 426 00 38 01 00 4D 7E 00 >>> 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 >>> 274.371 1R11 627 00 00 00 >>> 274.371 1R11 155 07 97 E5 54 4B A8 00 6C >>> 274.371 1T11 155 01 97 E5 54 4B A8 00 6C >>> 274.381 1R11 155 07 97 E5 54 4B A8 00 6C >>> 274.381 1T11 155 01 97 E5 54 4B A8 00 6C >>> 274.391 1R11 5D7 FF FF 01 E4 53 80 00 >>> >>> - normal up to here >>> >>> !!! CAN RX is blocked here for ~70 ms !!! >>> … interrupts disabled? by whom? >>> >>> 274.461 1R11 599 00 00 4D 7E 1E 26 FF FF >>> 274.461 1R11 436 00 0E 3C B2 00 00 >>> 274.461 1R11 155 07 97 E5 54 4B A8 00 6C >>> 274.461 1R11 424 11 40 15 26 47 64 00 48 >>> 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F >>> 274.461 1R11 556 30 63 07 30 73 07 30 7A >>> 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here >>> >>> !!! next RX pause, ~90 ms this time !!! >>> >>> !!! ID 599 has a period of 100 ms, but we now get it 5 times: >>> >>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>> >>> 274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >>> crtd fprintf bug compensation: >>> 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >>> errflags=528128 = 08 0F 00 >>> 0x08 = IR: Data overrun >>> 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full >>> 0x00 = ECC: - >>> >>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" >>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" >>> 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" >>> 274.551 1R11 423 03 22 FF FF 00 E0 00 DB >>> 274.551 1R11 426 00 38 01 00 4D 7E 00 >>> 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 >>> 274.551 1R11 627 00 00 00 >>> 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" >>> 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" >>> 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" >>> 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D" >>> >>> !!! 120 ms pause !!! >>> >>> 274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D" >>> >>> → "D" was sent to the listeners & reflected by vehicle::IncomingFrame() >>> then, before the LogFrame() call was processed, the CAN task was paused for 120 ms >>> >>> and now it's getting really weird: >>> >>> 274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message! >>> >>> … this frame certainly was not on the bus! >>> bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus >>> bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 >>> - copy / memory error? … very unlikely >>> - CAN RX handling error? … doesn't seem so >>> - RXFIFO message window bug? >>> >>> 274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 >>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>> 274.671 1T11 155 01 97 E6 54 4B A8 00 6C >>> 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C >>> 274.671 1R11 155 07 97 F2 54 4B A8 00 6C >>> 274.671 1R11 155 07 97 F3 54 4B A8 00 6C >>> 274.671 1R11 423 03 22 FF FF 00 E0 00 DB >>> 274.671 1R11 426 00 38 01 00 4D 7E 00 >>> 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 >>> 274.671 1R11 627 00 00 00 >>> 274.671 1R11 155 07 97 F5 54 4B A8 00 6C >>> 274.671 1R11 155 07 97 F6 54 4B A8 00 6C >>> 274.671 1R11 5D7 FF FF 01 E4 53 80 00 >>> 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF >>> 274.671 1R11 155 07 97 F7 54 4B A8 00 6C >>> 274.671 1R11 436 00 0E 3C B2 00 00 >>> 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 >>> 274.671 1T11 155 01 97 E7 54 4B A8 00 6C >>> 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error >>> >>> >>> -- >>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>> >>> >>> _______________________________________________ >>> OvmsDev mailing list >>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >> _______________________________________________ >> OvmsDev mailing list >> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> > > -- > Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal > Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 > > > _______________________________________________ > OvmsDev mailing list > OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> > http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Can you send us the sdkconfig you use to build with? Regards, Mark.
On 20 Aug 2018, at 2:38 PM, Stein Arne Sordal <ovms@topphemmelig.no> wrote:
I Updated to the latest firmware and CAN2 has not yet stopped with this firmware. But if I compile the latest firmware and adding the lock/unlock doors functionality, CAN2 stops within one hour…. Anyone have some clues?
Regards, Stein Arne Sordal
On 19 Aug 2018, at 13:59, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Our current implementation lacks a recovery attempt from bus errors and arbitration losses, we only record the error status.
The new Espressif CAN driver may contain some useful info on this.
Regards, Michael
Am 15.08.2018 um 15:46 schrieb Mark Webb-Johnson:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
On 14 Aug 2018, at 6:57 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was:
CAN: can1 Mode: Active Speed: 500000 Interrupts: 22425486 Rx pkt: 21080181 Rx err: 5 Rx ovrflw: 2054 Tx pkt: 1340517 Tx delays: 335 Tx err: 0 Tx ovrflw: 0 Err flags: 0x00800c5b
That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set?
ECC 5b = form error on TX in acknowledge delimiter (?)
Can we use some of this info as an indicator that restarting the interface is necessary?
Regards, Michael
Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson:
I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1.
OVMS# test cantx can2 100 Testing 100 frames on can2 Transmitted 100 frames in 0.299612s = 2996us/frame
OVMS# test cantx can1 100 Testing 100 frames on can1 Transmitted 100 frames in 0.013691s = 136us/frame
I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues.
I’ll continue looking into this tomorrow.
Regards, Mark.
> On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote: > >> But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in >> PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that. > > I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set. > > I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping). > > Looking at page #28 of the datasheet, we have: > > 4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset. > > Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a: > > while (MODULE_ESP32CAN->SR.B.RBS) > { > // Read and process a message > } > > in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available). > > Regards, Mark. > >> On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >> >> Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson: >>> I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. >>> The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue? >> >> You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in. >> >> But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in >> PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that. >> >> Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled. >> >> IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI >> would be 0, so rxframe() wouldn't be called. >> >> …unless IR.RI was set due to a hardware bug… >> >> …or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt >> coming in with IR.RI, but I've seen nothing in the documentation about something like this. >> >>> I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it. >> >> I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context >> switches in some situations, which causes the CAN rx task execution pauses. >> >> Just a wild guess though, I would check the source for this pattern if it was open. >> >> Regards, >> Michael >> >> >>> Regards, Mark >>> >>>> On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>>> >>>> Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident. >>>> >>>> The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO >>>> will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue. >>>> >>>> Unfortunately, something goes completely wrong: >>>> >>>> a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times) >>>> >>>> b) after resolving the overrun, a garbled message is received. >>>> >>>> The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read >>>> from the bus, so something is going very wrong here. >>>> >>>> I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong. >>>> >>>> Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also >>>> reflects the garbled message, which causes the charger to stop.) >>>> >>>> Regards, >>>> Michael >>>> >>>> >>>> ---------------------------------------------------------------------------------------- >>>> … >>>> 274.341 1R11 155 07 97 E4 54 4B A8 00 6C >>>> 274.341 1T11 155 01 97 E4 54 4B A8 00 6C >>>> 274.351 1R11 155 07 97 E5 54 4B A8 00 6C >>>> 274.351 1T11 155 01 97 E5 54 4B A8 00 6C >>>> 274.361 1R11 155 07 97 E5 54 4B A8 00 6C >>>> 274.361 1T11 155 01 97 E5 54 4B A8 00 6C >>>> 274.371 1R11 423 03 22 FF FF 00 E0 00 DB >>>> 274.371 1R11 426 00 38 01 00 4D 7E 00 >>>> 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 >>>> 274.371 1R11 627 00 00 00 >>>> 274.371 1R11 155 07 97 E5 54 4B A8 00 6C >>>> 274.371 1T11 155 01 97 E5 54 4B A8 00 6C >>>> 274.381 1R11 155 07 97 E5 54 4B A8 00 6C >>>> 274.381 1T11 155 01 97 E5 54 4B A8 00 6C >>>> 274.391 1R11 5D7 FF FF 01 E4 53 80 00 >>>> >>>> - normal up to here >>>> >>>> !!! CAN RX is blocked here for ~70 ms !!! >>>> … interrupts disabled? by whom? >>>> >>>> 274.461 1R11 599 00 00 4D 7E 1E 26 FF FF >>>> 274.461 1R11 436 00 0E 3C B2 00 00 >>>> 274.461 1R11 155 07 97 E5 54 4B A8 00 6C >>>> 274.461 1R11 424 11 40 15 26 47 64 00 48 >>>> 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F >>>> 274.461 1R11 556 30 63 07 30 73 07 30 7A >>>> 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here >>>> >>>> !!! next RX pause, ~90 ms this time !!! >>>> >>>> !!! ID 599 has a period of 100 ms, but we now get it 5 times: >>>> >>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>> >>>> 274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >>>> crtd fprintf bug compensation: >>>> 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >>>> errflags=528128 = 08 0F 00 >>>> 0x08 = IR: Data overrun >>>> 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full >>>> 0x00 = ECC: - >>>> >>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" >>>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" >>>> 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" >>>> 274.551 1R11 423 03 22 FF FF 00 E0 00 DB >>>> 274.551 1R11 426 00 38 01 00 4D 7E 00 >>>> 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 >>>> 274.551 1R11 627 00 00 00 >>>> 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" >>>> 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" >>>> 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" >>>> 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D" >>>> >>>> !!! 120 ms pause !!! >>>> >>>> 274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D" >>>> >>>> → "D" was sent to the listeners & reflected by vehicle::IncomingFrame() >>>> then, before the LogFrame() call was processed, the CAN task was paused for 120 ms >>>> >>>> and now it's getting really weird: >>>> >>>> 274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message! >>>> >>>> … this frame certainly was not on the bus! >>>> bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus >>>> bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 >>>> - copy / memory error? … very unlikely >>>> - CAN RX handling error? … doesn't seem so >>>> - RXFIFO message window bug? >>>> >>>> 274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 >>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>>> 274.671 1T11 155 01 97 E6 54 4B A8 00 6C >>>> 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C >>>> 274.671 1R11 155 07 97 F2 54 4B A8 00 6C >>>> 274.671 1R11 155 07 97 F3 54 4B A8 00 6C >>>> 274.671 1R11 423 03 22 FF FF 00 E0 00 DB >>>> 274.671 1R11 426 00 38 01 00 4D 7E 00 >>>> 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 >>>> 274.671 1R11 627 00 00 00 >>>> 274.671 1R11 155 07 97 F5 54 4B A8 00 6C >>>> 274.671 1R11 155 07 97 F6 54 4B A8 00 6C >>>> 274.671 1R11 5D7 FF FF 01 E4 53 80 00 >>>> 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF >>>> 274.671 1R11 155 07 97 F7 54 4B A8 00 6C >>>> 274.671 1R11 436 00 0E 3C B2 00 00 >>>> 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 >>>> 274.671 1T11 155 01 97 E7 54 4B A8 00 6C >>>> 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error >>>> >>>> >>>> -- >>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>>> >>>> >>>> _______________________________________________ >>>> OvmsDev mailing list >>>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>> _______________________________________________ >>> OvmsDev mailing list >>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >> >> -- >> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >> >> >> _______________________________________________ >> OvmsDev mailing list >> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
sdkconfig attached. Regards, Stein Arne Sordal
On 21 Aug 2018, at 05:10, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
Can you send us the sdkconfig you use to build with?
Regards, Mark.
On 20 Aug 2018, at 2:38 PM, Stein Arne Sordal <ovms@topphemmelig.no <mailto:ovms@topphemmelig.no>> wrote:
I Updated to the latest firmware and CAN2 has not yet stopped with this firmware. But if I compile the latest firmware and adding the lock/unlock doors functionality, CAN2 stops within one hour…. Anyone have some clues?
Regards, Stein Arne Sordal
On 19 Aug 2018, at 13:59, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Our current implementation lacks a recovery attempt from bus errors and arbitration losses, we only record the error status.
The new Espressif CAN driver may contain some useful info on this.
Regards, Michael
Am 15.08.2018 um 15:46 schrieb Mark Webb-Johnson:
Sorry, forgot to mention: another option is to poll the bus manually if an interrupt hasn’t been received in N seconds. Check the status registers and if everything is not perfect then reset the controller.
On 15 Aug 2018, at 9:45 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
This was my worry for both esp32 can and mcp2515. We get an interrupt, but for some reason the status is not showing anything to do, so we return from the interrupt. Then, the status changes. Or, for a particular status we had to do something that we didn’t. The bus is locked up, and the interrupt never fires again, so we never get called. I wonder if we just fired the interrupt handler again manually, would it recover?
When this happens, do we need to power off then on the can bus, or is a ‘can canX stop’ ‘can canX start …’ sufficient?
Perhaps a general solution would be a watchdog. If the CAN bus does not receive anything for N seconds (perhaps 60), then restart the controller? We could simply protect it from restarting twice (a simple check of the counters), so worse case this would restart once 60 seconds after a bus normally went idle. Or is that too kludgy?
Regards, Mark.
On 14 Aug 2018, at 6:57 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
Frank has observed another lockup on can1, RX was stuck, not sure about TX. Status was:
CAN: can1 Mode: Active Speed: 500000 Interrupts: 22425486 Rx pkt: 21080181 Rx err: 5 Rx ovrflw: 2054 Tx pkt: 1340517 Tx delays: 335 Tx err: 0 Tx ovrflw: 0 Err flags: 0x00800c5b
That seems to be a bus error IRQ -- but strangely, the bus error status flag is not set?
ECC 5b = form error on TX in acknowledge delimiter (?)
Can we use some of this info as an indicator that restarting the interface is necessary?
Regards, Michael
Am 12.06.2018 um 15:44 schrieb Mark Webb-Johnson: > I wrote extensions to the ‘test canrx’ and ‘test cantx’ commands. They now provide high resolution timings, which can be used to get an idea of performance. CAN2/CAN3 tx is dog slow compared to CAN1. > > OVMS# test cantx can2 100 > Testing 100 frames on can2 > Transmitted 100 frames in 0.299612s = 2996us/frame > > OVMS# test cantx can1 100 > Testing 100 frames on can1 > Transmitted 100 frames in 0.013691s = 136us/frame > > I tried adding the "while (MODULE_ESP32CAN->SR.B.RBS)” into esp32can, and it seems to work just fine. Not sure how to replicate the problem, so not sure if it resolves anything - but it doesn’t seem to make anything worse so I committed it. Feel free to revert that commit if it causes any issues. > > I’ll continue looking into this tomorrow. > > Regards, Mark. > >> On 12 Jun 2018, at 2:37 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote: >> >>> But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in >>> PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that. >> >> I suggest to put a check for MODULE_ESP32CAN->SR.B.RBS at the start of ESP32CAN_rxframe. That status bit is documented as indicating "one or more messages are available in the RXFIFO”. Presumably we shouldn’t be reading from the RX FIFO unless that bit is set. >> >> I agree that the documentation doesn’t talk about spurious interrupts, but adding that check should not do any harm. Reading the receive buffer when nothing is in it, by comparison, would produce all sorts of weird incoming frame results (including duplicates, and the partial frame corruptions you mention, due to RX FIFO wrapping). >> >> Looking at page #28 of the datasheet, we have: >> >> 4. After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer. If there is no other message available, the receive interrupt bit is reset. >> >> Does that mean that we should be checking the receive buffer in ESP32CAN_rxframe and looping? I’m confused because if it did not then the interrupt would presumably fire again, and cause recursion (or is the new interrupt delayed until the end of the current ISR execution)? Perhaps we can simply do a: >> >> while (MODULE_ESP32CAN->SR.B.RBS) >> { >> // Read and process a message >> } >> >> in ESP32CAN_rxframe to process all incoming messages? The mcp2515 does something similar (relying on can.cpp CAN_rxstask to loop until no more messages available). >> >> Regards, Mark. >> >>> On 12 Jun 2018, at 5:22 AM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>> >>> Am 11.06.2018 um 04:07 schrieb Mark Webb-Johnson: >>>> I think the fault must be internal (controller, or our driver code), as the can bus itself has error checking. >>>> The only other thing I noticed on my review is that in ESP32CAN_rxframe there is no check that a frame is actually available before reading and queueing it. If the RX interrupt is spurious, that could cause all sorts of issues. I think it would be safer to have such a check, and maybe that is the cause of this issue? >>> >>> You mean we could be reading the RX mailbox a) without a new signal and b) before the next message is fully rolled in. >>> >>> But how would you check that a frame is available? I understand from the documentation, the RX IRQ isn't raised unless the FIFO has a valid message and (in >>> PeliCAN mode) has been mapped to the buffer address. I can't see how we could check the frame validity additionally to that. >>> >>> Setting CMR.RRB then allows the chip to fetch the next message from the FIFO, which will trigger another IR.RI as soon as the buffer is filled. >>> >>> IR.RI is a level interrupt, so if stuck will trigger again. But reading IR clears it. So even if the ISR would be called again due to an IRQ handling bug, IR.RI >>> would be 0, so rxframe() wouldn't be called. >>> >>> …unless IR.RI was set due to a hardware bug… >>> >>> …or IR.RI could also be set on a data overrun (or other error) for an invalid message. That would mean IR.RI must be discarded in case of an error interrupt >>> coming in with IR.RI, but I've seen nothing in the documentation about something like this. >>> >>>> I’m working on the MCP2515 driver at the moment, to try to find out what is causing the OBDII HUD lock-up (and others have seen). Perhaps I’ll find out something when reviewing the driver and shared CAN library. Due to the huge volume of traffic on my Tesla Model S, I get a large number of overruns but don’t see this problem. Maybe it is rare, or we are just not noticing it. >>> >>> I assume the problem can be triggered by poor Wifi connectivity -- that's the situation at Frank's home. I assume the Wifi driver disables interrupts or context >>> switches in some situations, which causes the CAN rx task execution pauses. >>> >>> Just a wild guess though, I would check the source for this pattern if it was open. >>> >>> Regards, >>> Michael >>> >>> >>>> Regards, Mark >>>> >>>>> On 10 Jun 2018, at 9:05 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>>>> >>>>> Frank Demes reported a very strange CAN behaviour. He managed to get a CRTD log of one incident. >>>>> >>>>> The effect is triggered by an unknown cause freezing the CAN rx task for short periods of time, i.e. 70 - 120 ms. This alone should be no problem, the RXFIFO >>>>> will probaly overflow, but that is handled by our driver, so except some messages lost this should not be an issue. >>>>> >>>>> Unfortunately, something goes completely wrong: >>>>> >>>>> a) on the overrun event, the last valid message is delivered multiple times (i.e. 5 times) >>>>> >>>>> b) after resolving the overrun, a garbled message is received. >>>>> >>>>> The garbled message contains a valid byte sequence in the wrong place, i.e. bytes 5-8 would be valid if they were bytes 1-4. Bytes 1-4 cannot have been read >>>>> from the bus, so something is going very wrong here. >>>>> >>>>> I have walked through the esp32can driver and checked the SJA1000 documentation but have no clue what is going wrong. >>>>> >>>>> Any ideas? CRTD excerpt below. (Frame 155 is reflected by the module with byte 1 changed from 07 to 01 to limit the charge current. The problem is, it also >>>>> reflects the garbled message, which causes the charger to stop.) >>>>> >>>>> Regards, >>>>> Michael >>>>> >>>>> >>>>> ---------------------------------------------------------------------------------------- >>>>> … >>>>> 274.341 1R11 155 07 97 E4 54 4B A8 00 6C >>>>> 274.341 1T11 155 01 97 E4 54 4B A8 00 6C >>>>> 274.351 1R11 155 07 97 E5 54 4B A8 00 6C >>>>> 274.351 1T11 155 01 97 E5 54 4B A8 00 6C >>>>> 274.361 1R11 155 07 97 E5 54 4B A8 00 6C >>>>> 274.361 1T11 155 01 97 E5 54 4B A8 00 6C >>>>> 274.371 1R11 423 03 22 FF FF 00 E0 00 DB >>>>> 274.371 1R11 426 00 38 01 00 4D 7E 00 >>>>> 274.371 1R11 597 20 A4 03 B1 2F 00 01 53 >>>>> 274.371 1R11 627 00 00 00 >>>>> 274.371 1R11 155 07 97 E5 54 4B A8 00 6C >>>>> 274.371 1T11 155 01 97 E5 54 4B A8 00 6C >>>>> 274.381 1R11 155 07 97 E5 54 4B A8 00 6C >>>>> 274.381 1T11 155 01 97 E5 54 4B A8 00 6C >>>>> 274.391 1R11 5D7 FF FF 01 E4 53 80 00 >>>>> >>>>> - normal up to here >>>>> >>>>> !!! CAN RX is blocked here for ~70 ms !!! >>>>> … interrupts disabled? by whom? >>>>> >>>>> 274.461 1R11 599 00 00 4D 7E 1E 26 FF FF >>>>> 274.461 1R11 436 00 0E 3C B2 00 00 >>>>> 274.461 1R11 155 07 97 E5 54 4B A8 00 6C >>>>> 274.461 1R11 424 11 40 15 26 47 64 00 48 >>>>> 274.461 1R11 425 0A 1D 44 FF FE 40 01 1F >>>>> 274.461 1R11 556 30 63 07 30 73 07 30 7A >>>>> 274.461 1T11 155 01 97 E5 54 4B A8 00 6C ← reflection for 155 still works here >>>>> >>>>> !!! next RX pause, ~90 ms this time !!! >>>>> >>>>> !!! ID 599 has a period of 100 ms, but we now get it 5 times: >>>>> >>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>>> >>>>> 274.551 1CEV Error intr=58445 rxpkt=27132 txpkt=528128 errflags=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >>>>> crtd fprintf bug compensation: >>>>> 274.551 1CEV Error rxpkt=58445 txpkt=27132 errflags=528128 intr=0x15433 rxerr=0 txerr=0 rxovr=2 txovr=0 txdelay=17 >>>>> errflags=528128 = 08 0F 00 >>>>> 0x08 = IR: Data overrun >>>>> 0x0F = SR: Bus OK, error limit not yet reached, RX & TX idle, RXFIFO overrun, RX buffer full >>>>> 0x00 = ECC: - >>>>> >>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>>> 274.551 1R11 599 00 00 4D 7E 1E 26 FF FF >>>>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "A" >>>>> 274.551 1R11 155 07 97 E6 54 4B A8 00 6C RX "B" >>>>> 274.551 1R11 155 07 97 E7 54 4B A8 00 6C RX "C" >>>>> 274.551 1R11 423 03 22 FF FF 00 E0 00 DB >>>>> 274.551 1R11 426 00 38 01 00 4D 7E 00 >>>>> 274.551 1R11 597 20 A4 03 B1 2F 00 01 53 >>>>> 274.551 1R11 627 00 00 00 >>>>> 274.551 1T11 155 01 97 E6 54 4B A8 00 6C reflection "A" >>>>> 274.551 1CEV TX_Queue T11 155 01 97 E6 54 4B A8 00 6C reflection "B" >>>>> 274.551 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C reflection "C" >>>>> 274.551 1T11 155 01 97 E8 54 4B A8 00 6C reflection "D" >>>>> >>>>> !!! 120 ms pause !!! >>>>> >>>>> 274.671 1R11 155 07 97 E8 54 4B A8 00 6C RX "D" >>>>> >>>>> → "D" was sent to the listeners & reflected by vehicle::IncomingFrame() >>>>> then, before the LogFrame() call was processed, the CAN task was paused for 120 ms >>>>> >>>>> and now it's getting really weird: >>>>> >>>>> 274.671 1R11 155 07 08 2A A0 07 97 E7 54 RX "E": invalid frame / garbled message! >>>>> >>>>> … this frame certainly was not on the bus! >>>>> bytes 2-4 (08 2A A0) = complete nonsense, not matching any frame on the bus >>>>> bytes 5-8 (07 97 E7 54) = normally bytes 1-4 (!) of ID 155 >>>>> - copy / memory error? … very unlikely >>>>> - CAN RX handling error? … doesn't seem so >>>>> - RXFIFO message window bug? >>>>> >>>>> 274.671 1CEV Error intr=58458 rxpkt=27134 txpkt=528128 errflags=0x15456 rxerr=0 txerr=0 rxovr=3 txovr=0 txdelay=19 >>>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>>>> 274.671 1R11 155 07 97 E7 54 4B A8 00 6C >>>>> 274.671 1T11 155 01 97 E6 54 4B A8 00 6C >>>>> 274.671 1CEV TX_Queue T11 155 01 97 E7 54 4B A8 00 6C >>>>> 274.671 1R11 155 07 97 F2 54 4B A8 00 6C >>>>> 274.671 1R11 155 07 97 F3 54 4B A8 00 6C >>>>> 274.671 1R11 423 03 22 FF FF 00 E0 00 DB >>>>> 274.671 1R11 426 00 38 01 00 4D 7E 00 >>>>> 274.671 1R11 597 20 A4 03 B1 2F 00 01 53 >>>>> 274.671 1R11 627 00 00 00 >>>>> 274.671 1R11 155 07 97 F5 54 4B A8 00 6C >>>>> 274.671 1R11 155 07 97 F6 54 4B A8 00 6C >>>>> 274.671 1R11 5D7 FF FF 01 E4 53 80 00 >>>>> 274.671 1R11 599 00 00 4D 7E 1E 26 FF FF >>>>> 274.671 1R11 155 07 97 F7 54 4B A8 00 6C >>>>> 274.671 1R11 436 00 0E 3C B2 00 00 >>>>> 274.671 1R11 041 A0 07 97 F5 54 4B A8 00 >>>>> 274.671 1T11 155 01 97 E7 54 4B A8 00 6C >>>>> 274.671 1T11 155 01 08 2A A0 07 97 E7 54 reflection of invalid msg "E" → charge terminates due to error >>>>> >>>>> >>>>> -- >>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>>>> >>>>> >>>>> _______________________________________________ >>>>> OvmsDev mailing list >>>>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>>> _______________________________________________ >>>> OvmsDev mailing list >>>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>> >>> -- >>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>> >>> >>> _______________________________________________ >>> OvmsDev mailing list >>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >> > > > > _______________________________________________ > OvmsDev mailing list > OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> > http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
participants (5)
-
Greg D. -
Mark Webb-Johnson -
Michael Balzer -
Stein Arne Sordal -
Tom Parker