Hi Mark,

Sorry for the radio silence on this end.  I have been poking at the code, but have yet to find out what is going on.  But I have some data.

When the lockup occurs, it's not for lack of getting any interrupts.  Following my recipe for recreating the hang, the first callback gives us a status (from register 0x2c) of 0x80, which is a receive error, but none of the error bits (register 0x2d) are set.  As time goes on, the status sometimes changes to 0xA0 for an interrupt or two, then back to 0x80.  Meanwhile the error bits go from 0 to 3 to 0xB, and stick there, over a period of maybe half a dozen to a dozen interrupts.  This is all within a blink - a big burst of interrupts (hundreds) that get processed, then all goes quiet with no further interrupt activity.  I suspect at that point the chip has gone into "passive mode", and basically curled up in a little ball, whimpering softly.  The length of the burst seems to depend on where in the HUD's startup cycle I enable the obd2ecu task, but I don't have a recipe for how to control it.
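
As a reading aid for the status and error values above, here is a minimal sketch that turns the two register values into flag names.  It assumes the bit layouts given in the MCP2515 data sheet for CANINTF (0x2C) and EFLG (0x2D); the helper names (DecodeBits, DecodeCanintf, DecodeEflg) are made up for illustration and are not taken from the OVMS driver.

```cpp
#include <cstdint>
#include <string>

// Bit names per the MCP2515 data sheet, most significant bit (bit 7) first.
static const char* const kCanintfBits[8] = {
    "MERRF", "WAKIF", "ERRIF", "TX2IF", "TX1IF", "TX0IF", "RX1IF", "RX0IF"};
static const char* const kEflgBits[8] = {
    "RX1OVR", "RX0OVR", "TXBO", "TXEP", "RXEP", "TXWAR", "RXWAR", "EWARN"};

// Hypothetical helper: list the names of the set bits in an 8-bit register
// value, highest bit first, separated by spaces.
static std::string DecodeBits(uint8_t value, const char* const names[8]) {
    std::string out;
    for (int bit = 7; bit >= 0; --bit) {
        if (value & (1u << bit)) {
            if (!out.empty()) out += ' ';
            out += names[7 - bit];
        }
    }
    return out.empty() ? "(none)" : out;
}

// Interrupt flag register (0x2C).
std::string DecodeCanintf(uint8_t value) { return DecodeBits(value, kCanintfBits); }

// Error flag register (0x2D).
std::string DecodeEflg(uint8_t value) { return DecodeBits(value, kEflgBits); }
```

Read this way, status 0x80 is MERRF alone and 0xA0 is MERRF plus ERRIF, while the error bits go from nothing, to RXWAR plus EWARN (0x03), to RXEP plus RXWAR plus EWARN (0x0B).  RXEP is the receive error-passive flag, which fits the "passive mode" suspicion.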

Significantly, through all this, there are no (zero) received messages.  Also, the fact that the interrupts are essentially back-to-back means that we are not effectively dealing with the underlying cause of the interrupt.

What seems to be happening is that there are messages being received at the CAN interface at the same time as the chip is being configured.  That the issue is 100% reproducible when traffic is received, and 0% if I stop the HUD briefly while enabling the obd2ecu task, tells me that there is a window in the configuration code which is wider than the time between HUD frames (about 100ms).  There are a pair of 50ms delays in the CAN Start routine (neither of which appears to be documented in the chip data sheet), but removing them simply broke the CAN system.
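
One possible alternative to fixed delays, sketched here under stated assumptions: the MCP2515 data sheet describes requesting an operating mode through the REQOP bits of CANCTRL (0x0F) and confirming it by reading the OPMOD bits of CANSTAT (0x0E).  The RegisterIo struct and RequestMode function below are hypothetical names standing in for the driver's SPI access, and I have not checked whether the existing driver already does something like this, nor whether it would close the window.

```cpp
#include <cstdint>
#include <functional>

// Register addresses and field masks per the MCP2515 data sheet.
constexpr uint8_t kRegCanstat = 0x0E;
constexpr uint8_t kRegCanctrl = 0x0F;
constexpr uint8_t kModeMask   = 0xE0;  // REQOP (CANCTRL) / OPMOD (CANSTAT), bits 7..5
constexpr uint8_t kModeNormal = 0x00;  // 000 = normal operation
constexpr uint8_t kModeConfig = 0x80;  // 100 = configuration mode

// Hypothetical register accessors; a real driver would do SPI transfers here.
struct RegisterIo {
    std::function<uint8_t(uint8_t reg)> read;
    std::function<void(uint8_t reg, uint8_t mask, uint8_t value)> bit_modify;
};

// Request an operating mode, then poll CANSTAT until the chip reports it.
// Returns false if the mode is not reported within max_polls reads, so the
// caller can log the failure instead of carrying on with a half-configured chip.
bool RequestMode(const RegisterIo& io, uint8_t mode, int max_polls) {
    io.bit_modify(kRegCanctrl, kModeMask, mode);
    for (int i = 0; i < max_polls; ++i) {
        if ((io.read(kRegCanstat) & kModeMask) == mode) return true;
    }
    return false;
}
```

The idea would be to call RequestMode(io, kModeConfig, n) before touching the bit timing and filter registers, and RequestMode(io, kModeNormal, n) once everything is set up, checking the return value each time.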

Simply turning on the HUD by toggling ext12v (so it's clean, with no switch bounce at the HUD) causes a pair of status 0x80 / error 00 interrupts, some short time apart, before normal poll / response operation begins.  I presume it's just a bit of noise on the CAN bus as the HUD turns on its chips, but conclude that the 0x80 / 00 condition (which is also seen at the start of the hang scenario) is not specifically fatal.  What seems to be at issue with the hang is that we are lacking a way to clear the receiver after getting it set up.  This chip is odd, in that one clears the error flags in the interrupt status register, but not anything more internal to address what caused the interrupt in the first place.

At least, I couldn't find such a command.  Still looking...  I don't suppose there is a receive buffer between the CAN bus and the MCP2515 that I could temporarily turn off to silence the bus while configuring the chip, is there?

Greg


Mark Webb-Johnson wrote:
The fault must be in mcp2515::RxCallback.

Good news is that the way we marshal these interrupts, that callback is in a normal task and can be logged / debugged using normal tools. It is not running in interrupt context.

The usual cause of these sorts of things is the interrupt not being raised, so mcp2515::RxCallback is not called, and locks forever. Perhaps you can try adding a command to inject a spoofed interrupt message (see MCP2515_isr and just use xQueueSend not xQueueSendFromISR), and see if that command can ‘free’ a locked up CAN bus. If so, that is the cause.

Regards, Mark.

On 24 May 2018, at 1:26 PM, Greg D. <gregd2350@gmail.com> wrote:

Hi Mark,

Ok, did some more poking around, being careful to not wiggle too many things at once.  I can get a reliable lockup by doing the following:

1.  power ext12v off
2.  obdii ecu stop
3.  power ext12v on
...wait a few seconds
4.  obdii ecu start can3

If I restart obdii too soon, all works.  Otherwise, I can repeatedly disable and re-enable 12v to cycle the HUD, and it will never connect.  The ordering of steps 1 and 2 doesn't seem to matter.  Unfortunately, I don't see anything in the can status that's uniquely different between scenarios where it's working and not.  Will need some additional diagnostic logging...

Now, for fun, I hook the OBDII Dongle to the module, and try the same steps, but instead of turning on the HUD, I try connecting a few times while the OBDII ECU task is not running (simulating the HUD's attempts to connect), then start the ecu, then try to connect.  It connects!  And this is with the Dongle doing its multi-speed scan each time.  So simply having frames come in while we're not watching, or frames coming in at the wrong speed does not cause the hang.  Rather, it might be that we've got a window in the code where incoming traffic colliding with the opening of the CAN driver is nailing the chip in some critical region.  If I hit it just right, I can sometimes cause this collision with the Dongle by stopping the ecu, starting the connect, then restarting the ecu during the connect sequence.  Not always, but sometimes.

Just a guess...  I need to dust off the chip document and see if there are any interesting bits to look at.

Greg


Mark Webb-Johnson wrote:
When the client (HUD, whatever) is trying to connect to the ECU, it can try 500K, 250K. Or it can try 250K, 500K. I suspect yours tries the first descending sequence, and hence doesn’t have any issues as it finds the match at 500K first.

Anyway, this MCP2515 CAN bus lockup is something we have to fix. The fact it is reproducible on obd2ecu is good and helpful for that.


_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev


