[Ovmsdev] CAN-3 broken again?

Michael Balzer dexter at expeedo.de
Mon Jan 15 00:52:18 HKT 2018


Greg,

Am 13.01.2018 um 21:07 schrieb Greg D.:
> I tried it 3 times, and on the second attempt the dongle connected and ran without further interaction.  Timing is on the hairy edge, so debug logging can
> affect the results.
>
> If I understand this test, it seems that the receive side is hung, but not the transmit side.  Interesting!  Need to noodle on this a bit...

Yes, it's apparently just the MCP RX that is hung, the MCP TX, CAN framework and OBD2ECU system still work.

Also, frame #15 should have triggered an RX overflow, but it didn't. So either the MCP has completely shut off the receiver (how, why?), or the error interrupts
also get cut off (more likely).

Maybe the interleaved RX frame (#9, flow control) triggers the problem. But Wireshark logs the RX occurring 277 us after the TX, so the IRQs can hardly overlap
and there's plenty of time for the TX cleanup…


…or maybe the mcp2515 has a bug: http://linux-can.vger.kernel.narkive.com/mPiIYqDN/patch-mcp251x-mcp2515-stops-receiving

> The mcp2515 sometimes seems to trigger an interrupt with the corresponding register not being set yet.
> This makes the driver exit the interrupt because there is obviously nothing to do, but the interrupt line is kept low.
> Therefore the driver does not see any more interrupts until the chip is reset (via interface down/up).
[…]
> I've had this problem before but with the MCP2515 connected to a Microchip
> Microcontroller. I've been working on other stuff and just got back to looking
> into this problem. Whilst it's not Linux based I was getting this exact
> problem and was forced to poll the MCP2515 instead of managing it via
> interrupts. Earlier today I got the best of both worlds by doing nothing in
> the ISR other then setting a FLAG and clearing the interrupt. The Flag is then
> picked up in the main processing loop of the code. It's a race condition and a
> flaw in the MCP2515, in my opinion.

This seems to match our issue.

There's another comment:

> Hmmmm, I think level triggered interrupt would help here.

We currently let the interrupt trigger on the negative edge (line 77). Maybe GPIO_INTR_LOW_LEVEL can help?

Polling the IRQ flags for 10 ms is certainly a bad solution…


Regards,
Michael


> Greg
>
>
> Michael Balzer wrote:
>> Greg,
>>
>> please do the same test including the OVMS log output at log level verbose with can trace on.
>>
>> Additionally, when it hangs, please issue
>>
>> can can3 rx standard 7df 02 01 00 00 00 00 00 ff
>> can can3 status
>> can can3 tx standard 7e8 06 41 00 18 19 00 01 ff
>> can can3 status
>>
>> …still with wireshark capturing and without any restart of the obd2ecu process.
>>
>> Thanks,
>> Michael
>>
>>
>> Am 13.01.2018 um 19:44 schrieb Greg D.:
>>> Hi Michael,
>>>
>>> Much better.  Crash is solved, but unfortunately when I remove the delays in the obd2ecu application (marked with "temporary" in the comments), the bus
>>> still hangs as before.  Actually, slightly worse, because I used to be able to get by if I turned off the VIN reporting; now that hangs too.  But it was
>>> working by the slimmest of margin before, so it could also be just by luck.
>>>
>>> Wireshark trace of the interaction, attached.  Turning off privacy (so I should reply to the VIN request) results in the same trace, so the bus is hanging
>>> right around that point.  Notably, the receive side is hung too (i.e. I never got the VIN request), and counting frames (5 Rx, 7 Tx = 12), we can see the
>>> hang occurred right after the ECU Name was sent (frame 12), and before the VIN request (frame 13), which I never received.
>>>
>>> Frame 13 in the trace is where the OBDWiz dongle is requesting the VIN. The bus is apparently hung at this point, so there is no reply.  The next frame is
>>> the dongle re-connecting with the OVMS module after a timeout.  The OBDWiz dongle retries the connect a few more times, then gives up. 
>>>
>>> 'can can3 status' at this point is:
>>>
>>> OVMS > can can3 status
>>> CAN:       can3
>>> Mode:      Active
>>> Speed:     500000
>>> Rx pkt:                       5
>>> Rx err:                       0
>>> Rx ovrflw:                    0
>>> Tx pkt:                       7
>>> Tx delays:                    0
>>> Tx err:                       0
>>> Tx ovrflw:                    0
>>> Err flags: 0
>>> OVMS >
>>>
>>> If I stop and restart the obd2ecu task, I can re-create this same sequence, so a close/open properly resets the chip/driver.
>>>
>>> Hope this helps.  Let me know what else I can do.
>>>
>>> Greg
>>>
>>>
>>>
>>> Michael Balzer wrote:
>>>> Just added an additional fix for this.
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>>
>>>> Am 13.01.2018 um 19:03 schrieb Michael Balzer:
>>>>> Please check again.
>>>>>
>>>>> Thanks,
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 13.01.2018 um 18:42 schrieb Greg D.:
>>>>>> Gave it a quick try, and got a crash...  I'll see if I can isolate it a bit, but here's something to start with.  Tombstone, attached.  
>>>>>>
>>>>>> Greg
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Geir Øyvind Vælidalo wrote:
>>>>>>> I currently don’t send anything on can2. I could try to send something, but the car is away this weekend :-(
>>>>>>>
>>>>>>> Geir
>>>>>>>
>>>>>>>
>>>>>>>> 13. jan. 2018 kl. 17:24 skrev Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>>:
>>>>>>>>
>>>>>>>> Part one (TX queue) done & pushed.
>>>>>>>>
>>>>>>>> OVMS > can can1 status
>>>>>>>> CAN:       can1
>>>>>>>> Mode:      Active
>>>>>>>> Speed:     500000
>>>>>>>> Rx pkt:                  236657
>>>>>>>> Rx err:                       1
>>>>>>>> Rx ovrflw:                    0
>>>>>>>> Tx pkt:                  106378
>>>>>>>> Tx delays:                    4
>>>>>>>> Tx err:                       0
>>>>>>>> Tx ovrflw:                    0
>>>>>>>> Err flags: 0x800caa
>>>>>>>>
>>>>>>>> TX performance is rock steady on can1 -- the delays occurred when sending the stop charge request (as expected). I can't test can2/3, Greg & Geir,
>>>>>>>> could you…?
>>>>>>>>
>>>>>>>> The TxCallback can't be used on the mcp2515. The ISR can't query the IRQ register, so the TX IRQs are now also handled by the RxCallback(). As the TX
>>>>>>>> IRQs need to be cleared before loading the next frame, this needs another SPI call. I hope that doesn't introduce new problems.
>>>>>>>>
>>>>>>>>
>>>>>>>> No changes are necessary to the application code (well, except you can remove any hard coded delays now). The TX queue has a length of 20 frames and
>>>>>>>> will automatically be used by the drivers when no TX buffers are free.
>>>>>>>>
>>>>>>>> If an application wants to know whether a frame was sent immediately or gets delayed it can check the return code of the Write() method. Write() now
>>>>>>>> also can take a second parameter for the maximum wait time for space in the TX queue to become available if it's full (default 0 = fail immediately if
>>>>>>>> queue is full).
>>>>>>>>
>>>>>>>>
>>>>>>>> I also added logging of CAN errors. It's currently activated by "can … trace on", I don't think this needs to be active by default, just for CAN issue
>>>>>>>> debugging.
>>>>>>>>
>>>>>>>> E (45718) can: Error can1 rxpkt=3 txpkt=0 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0
>>>>>>>> E (83528) can: Error can1 rxpkt=7483 txpkt=226 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0
>>>>>>>>
>>>>>>>> …that's also a first part of the logging extension (part two).
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 12.01.2018 um 19:01 schrieb Michael Balzer:
>>>>>>>>> Yes, I had something like that in mind. On TX IRQ, the drivers send CAN_txcallbacks to the CAN_rxtask. The CAN_rxtask then fetches frames from the TX
>>>>>>>>> queue and calls the TxCallback until all TX buffers of the driver are full. From the already existing TxCallback() stubs I suppose you had planned a
>>>>>>>>> scheme like that already? ;)
>>>>>>>>>
>>>>>>>>> Greg, can you create a pull request for your MCP2515 change? I'd like to merge that before beginning on the drivers.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 12.01.2018 um 01:19 schrieb Mark Webb-Johnson:
>>>>>>>>>> Option B sounds like a good approach.
>>>>>>>>>>
>>>>>>>>>> Presumably we are just polling the tx queue in the existing CAN_rxtask based on TxCallback?
>>>>>>>>>>
>>>>>>>>>> Regards, Mark.
>>>>>>>>>>
>>>>>>>>>>> On 11 Jan 2018, at 8:42 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Greg, Mark,
>>>>>>>>>>>
>>>>>>>>>>> I can check your new code after work.
>>>>>>>>>>>
>>>>>>>>>>> For the TX performance/overflow issue, there are basically two options:
>>>>>>>>>>>
>>>>>>>>>>>   * A: make all application TX be aware of overflows, i.e. check the return value of the CAN Write() call as necessary and/or introduce sufficient
>>>>>>>>>>>     delays (very ugly)
>>>>>>>>>>>   * B: add a TX queue to the CAN framework, so the application can just push some frames as fast as it likes, with an option to wait/block/fail if
>>>>>>>>>>>     the queue is full
>>>>>>>>>>>       o → the framework checks for TX buffers becoming available *(i.e. driver issuing a TxCallback request)* and delivers queued frames only as
>>>>>>>>>>>         fast as the driver can handle them
>>>>>>>>>>>
>>>>>>>>>>> Option B has been on my todo list since removing the delay from the MCP driver and introducing the TX buffer check in the esp32can driver, as I
>>>>>>>>>>> don't think applications should need to handle TX overflows.
>>>>>>>>>>>
>>>>>>>>>>> I can try to implement that this weekend if it's urgent now.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Am 11.01.2018 um 05:55 schrieb Greg D.:
>>>>>>>>>>>> Hi Mark, Micheal,
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, good news and bad news.
>>>>>>>>>>>>
>>>>>>>>>>>> Good news:  Rx problem I believe is fixed.  Return is true only if we received something, else false.  And the other interrupt conditions are
>>>>>>>>>>>> handled at the same time, so no hangs are seen when restarting wifi.  Rx overflow counter does increment properly.  Yea!  Code has been pushed to
>>>>>>>>>>>> my clone on Github.
>>>>>>>>>>>>
>>>>>>>>>>>> Bad news:  I am still able to hang the bus, but I think it's on the transmit side.  The obd2ecu process can send up to 3 frames back to back to
>>>>>>>>>>>> report the ECU Name, followed soon after by several more with to grab the VIN.  Without any flow control on the transmit side, and with a
>>>>>>>>>>>> half-duplex CAN bus, that's just too much.  Turning off the VIN reporting (config set obd2ecu private yes) seems to let everything run because I
>>>>>>>>>>>> don't respond to the VIN request (which lets everything drain as OBDWiz times out).  Also verified by putting temporary delays in the obd2ecu code
>>>>>>>>>>>> to let things drain a bit between frames.  So, the transmit side is still a bit fragile, depending on timing.  Not sure quite what to do here, as
>>>>>>>>>>>> there is no easy place to queue things...  Do we need to go back to the old way with a delay in the obd2ecu code (perhaps better than in the
>>>>>>>>>>>> driver, no?).  Architecturally it's ugly, but this only occurs at startup, and I don't mind the kludge.  Do any other uses of the MCP busses do a
>>>>>>>>>>>> burst of transmitting?  If not, I'll put the delays in the obd2ecu code and call it close enough.  Lemme know.
>>>>>>>>>>>>
>>>>>>>>>>>> For receive, I'd go with what I have for now, if Michael would be so kind as to review what I have done first. 
>>>>>>>>>>>> https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/components/mcp2515/mcp2515.cpp  Hopefully he'll be back
>>>>>>>>>>>> on line before I get up in the morning.  Wonderful how the Earth's spin helps with the teamwork.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll keep poking at things tonight, and take it out for a spin in the car tomorrow, just to see everything working together.  But as it is now,
>>>>>>>>>>>> it's much better than it was before.  Really, this time.  :)
>>>>>>>>>>>>
>>>>>>>>>>>> Greg
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Greg D. wrote:
>>>>>>>>>>>>> Hi Mark,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I believe you are right about the multiple flags, and the code only processing Rx and "error" separately.  Fundamentally, a roll-over from buffer
>>>>>>>>>>>>> 0 to buffer 1 isn't really an error, just a statement of fact on what happened.  So, we should have buffer 1 and the rollover flag at the same
>>>>>>>>>>>>> time, which in fact is what I saw.  I need to handle the Rx overflow at the same time as the buffer 1 receive, I think...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I need to grab some dinner, but have a fix in the works.  Will report back in a few hours, hopefully with good news...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Greg
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mark Webb-Johnson wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The design of the system is as follows:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   * The can object CAN_rxtask listens on the rx queue to receive instructional messages from canbus drivers. These can be:
>>>>>>>>>>>>>>       o CAN_frame: simply passes an entire incoming can frame to the IncomingFrame handler.
>>>>>>>>>>>>>>       o CAN_rxcallback: an instruction for the CAN_rxtask to call the RxCallback task repeatedly.
>>>>>>>>>>>>>>       o CAN_txcallback: an instruction for the CAN_rxtask to call the TxCallback once.
>>>>>>>>>>>>>>   * In the case of CAN_rxcallback, the canbus object RxCallback function is expected to return FALSE to indicate nothing should be done and
>>>>>>>>>>>>>>     RxCallback should not be called again, or TRUE to indicate an incoming frame has been received and should be passed to IncomingFrame.
>>>>>>>>>>>>>>   * The system is arranged so that individual bus driver interrupt implementations can be fast and efficient.
>>>>>>>>>>>>>>       o The driver can choose to receive the frame in the interrupt handler itself, and pass it with CAN_frame to CAN_rxtask. The esp32 can
>>>>>>>>>>>>>>         driver uses this option.
>>>>>>>>>>>>>>       o Or the driver can choose to delay the reception of the frame to the RxCallback stage, and merely pass an indication with CAN_rxcallback.
>>>>>>>>>>>>>>         The mcp2515 driver uses this option.
>>>>>>>>>>>>>>   * The true/false response from RxCallback is designed to allow the callback to signal it received a frame or not. If it received a frame, then
>>>>>>>>>>>>>>     it is called again.
>>>>>>>>>>>>>>   * This approach is used in order to be able to centralise the reception of CAN frames to one single task (avoiding having to run individual
>>>>>>>>>>>>>>     tasks for each canbus, hence saving stack RAM).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The RxCallback should definitely ONLY return true if an actual can message has been received, and is being passed back in the frame pointer
>>>>>>>>>>>>>> parameter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suspect the issue is that the mcp2515 RxCallback is being faced with multiple error flags. Changing that to a return true (as Greg has done)
>>>>>>>>>>>>>> has the undesired side-effect of issuing a spurious IncomingFrame (with garbage/blank frame), but also causes the RxCallback to be called again
>>>>>>>>>>>>>> (clearing the error flag). Perhaps the solution is to put a loop in RxCallback so that if an error condition is found, it should be cleared, but
>>>>>>>>>>>>>> then loop again and keep clearing errors until no more are found, then return false? I think that in the mcp2515 case, this error clearing loop
>>>>>>>>>>>>>> can be simply handled in the RxCallback itself.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The alternative is to change the RxCallback logic so that the return bool value means simply ‘loop’ (call me again, please), and have the
>>>>>>>>>>>>>> RxCallback itself call IncomingFrame(), rather than passing a frame as a parameter. If Michael/Greg think this is a better approach, I am happy
>>>>>>>>>>>>>> to make that change - it is pretty trivial.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards, Mark.
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>>>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> OvmsDev mailing list
>>>>>>>>>>> OvmsDev at lists.teslaclub.hk <mailto:OvmsDev at lists.teslaclub.hk>
>>>>>>>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> OvmsDev mailing list
>>>>>>>>>> OvmsDev at lists.teslaclub.hk
>>>>>>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> OvmsDev mailing list
>>>>>>>>> OvmsDev at lists.teslaclub.hk
>>>>>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>>>> _______________________________________________
>>>>>>>> OvmsDev mailing list
>>>>>>>> OvmsDev at lists.teslaclub.hk <mailto:OvmsDev at lists.teslaclub.hk>
>>>>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> OvmsDev mailing list
>>>>>>> OvmsDev at lists.teslaclub.hk
>>>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> OvmsDev mailing list
>>>>>> OvmsDev at lists.teslaclub.hk
>>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>>>
>>>>> -- 
>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> OvmsDev mailing list
>>>>> OvmsDev at lists.teslaclub.hk
>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>>
>>>> -- 
>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>
>>>>
>>>> _______________________________________________
>>>> OvmsDev mailing list
>>>> OvmsDev at lists.teslaclub.hk
>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>>
>>>
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.teslaclub.hk
>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>>
>> -- 
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>
>>
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.teslaclub.hk
>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
>
>
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.teslaclub.hk
> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.teslaclub.hk/pipermail/ovmsdev/attachments/20180114/551efda7/attachment-0001.html>


More information about the OvmsDev mailing list