[Ovmsdev] CAN-3 broken again?

Thu Jan 11 06:27:29 HKT 2018

Hi Michael,

Returning true was done on a late-night debugging whim, as an
experiment.  I haven't looked "upstream" to see what the false return
would do, but clearly it's having some negative effect on the ability
for the CAN bus to operate.  I did a bit more poking around, and I now
believe that returning true is totally correct in this circumstance.

The buffer overflow handling, I believe, is working as it should.
I see that most of the time, frames come in on buffer 0.  When I cause
the overflow by starting wifi, I see a single frame received in buffer
1, along with the status of a buffer overflow from buffer 0, but the
interrupt status only shows buffer 1 as being full: status from register
2C is 0x22, not 0x23.  The error status was 0x40, indicating the single
overflow, as expected.  My guess is that the timing is such that buffer
0 was being read at the time the next frame arrived, so it went into
buffer 1, and that buffer 0 had emptied by the time buffer 1's interrupt
was seen.  I have not seen a buffer 1 overflow (which would indicate
that a frame was actually lost), so the buffer 0 overflow is totally not
an issue.  At most, it's a warning that the system is under load.  No
surprise there; it was.
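
As a minimal sketch of the decoding described above: the bit masks below
are taken from the MCP2515 datasheet (CANINTF at register 0x2C, EFLG at
0x2D).  The struct and function names are hypothetical, chosen for
illustration only, and are not the OVMS driver's own.

```cpp
#include <cstdint>

// Bit masks from the MCP2515 datasheet.
constexpr std::uint8_t kRx0If  = 0x01;  // CANINTF: RXB0 holds a frame
constexpr std::uint8_t kRx1If  = 0x02;  // CANINTF: RXB1 holds a frame
constexpr std::uint8_t kErrIf  = 0x20;  // CANINTF: an EFLG condition is pending
constexpr std::uint8_t kRx0Ovr = 0x40;  // EFLG: RXB0 overflow
constexpr std::uint8_t kRx1Ovr = 0x80;  // EFLG: RXB1 overflow

// Hypothetical holder for the decoded bits (not an OVMS type).
struct RxStatus {
  bool rx0_full;
  bool rx1_full;
  bool error_pending;
  bool rx0_overflow;
  bool rx1_overflow;
};

// Split the two raw status bytes into named flags.
constexpr RxStatus DecodeRxStatus(std::uint8_t canintf, std::uint8_t eflg) {
  return RxStatus{(canintf & kRx0If) != 0,
                  (canintf & kRx1If) != 0,
                  (canintf & kErrIf) != 0,
                  (eflg & kRx0Ovr) != 0,
                  (eflg & kRx1Ovr) != 0};
}

// The values reported in this message: CANINTF = 0x22, EFLG = 0x40.
static_assert(!DecodeRxStatus(0x22, 0x40).rx0_full, "RXB0 already emptied");
static_assert(DecodeRxStatus(0x22, 0x40).rx1_full, "frame waiting in RXB1");
static_assert(DecodeRxStatus(0x22, 0x40).rx0_overflow, "RXB0 overflow flagged");
static_assert(!DecodeRxStatus(0x22, 0x40).rx1_overflow, "no RXB1 overflow");
```

Under the reading given in this message, only the rx1_overflow flag would
point at a frame that was actually lost.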

But we still need to assume that Rx overflows can occur in real life,
and not have that be a fatal error requiring a reboot or process
restart.  The system has limited CPU power, and not all CAN bus
consumers can have top priority in their processing.  To prevent the
overflows is probably going to be expensive (redesign, perhaps), and is
unnecessary.  That the OBDII devices do retries proves that the obd2ecu
system can withstand a lost frame here and there and still operate
correctly.  I can even do a full reboot of the module without OBDWiz
complaining, which kind of surprised me when it first occurred.  And
since the overflows I have seen don't indicate an actual frame loss, we
really should be returning true.

Now, if the load on the processor increases such that we do get an
overflow from buffer 1, what then?  I suggest nothing different.  Again,
the design of OBDII is such that it can withstand occasional frame
losses, and even so, what would you do?  There is nothing the average
user would do, nor is there a protocol to engage for error handling. 
The same, I believe, is true for other uses of the MCP CAN busses, which
fortunately appear to be much lower in data rate, and are less likely to
overflow.  In an embedded system, we should note the event (counters
increment, as now), and move on. 

If there are any CAN messages that would have an effect on the system if
lost (are there any one-time occurrences?), that would probably be worth
noting...  An Info loglevel message should probably be displayed for the
developers if we do get a buffer 1 overflow, indicating a lost frame. 
I'll add that and push.
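
A rough sketch of the "count it, log it, move on" policy proposed above,
assuming the EFLG bit positions from the MCP2515 datasheet.  Everything
else here (the stats struct, the handler name, the string vector standing
in for an Info-level log) is hypothetical and is not the actual OVMS
driver code or logging API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

constexpr std::uint8_t kEflgRx0Ovr = 0x40;  // EFLG: RXB0 overflow
constexpr std::uint8_t kEflgRx1Ovr = 0x80;  // EFLG: RXB1 overflow

// Hypothetical counters; the real driver keeps its own statistics.
struct RxErrorStats {
  std::uint32_t rx0_overflows = 0;
  std::uint32_t rx1_overflows = 0;
};

// Hypothetical handler: record the overflow, emit an Info-style message
// only when RXB1 overflowed (taken here to mean a frame was lost), and
// always return true so the caller keeps servicing the bus.
bool HandleRxOverflow(std::uint8_t eflg, RxErrorStats& stats,
                      std::vector<std::string>& info_log) {
  if (eflg & kEflgRx0Ovr) {
    ++stats.rx0_overflows;  // load warning only, nothing logged
  }
  if (eflg & kEflgRx1Ovr) {
    ++stats.rx1_overflows;
    info_log.push_back("RXB1 overflow: a CAN frame was probably lost");
  }
  return true;
}
```

Whether returning true is safe for the listeners is a separate question
(see Michael's note below about a frame being passed on after the error
handler); this sketch only shows the counting and logging side.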

Thanks for looking over my shoulder!

Greg

Michael Balzer wrote:
> Greg,
>
> the RxCallback() is designed to process the errors after the RX buffers, so the return false is basically correct, as there should be no more to do for this rx
> loop.
>
> Also, if the error handler returns true to the framework, a random frame will be sent to the listeners. If this is going to be the solution, we need another
> return code for the loop.
>
> I still think this is a performance issue of the queue/callback scheme. If you return true from the error handler, RxCallback() will be called again after the ,
> so may clear some RX buffer being filled directly on clearing the error flags... maybe that's why this change makes a difference?
>
> I don't understand how the flags 0x2040 = RXB0 overflow can happen. Have a look at the receive flow chart on page 26: if RXB0 is full, it will roll over into
> RXB1, if that's also full it will generate an RXB_1_ overflow. With rollover enabled, RXB0 overflow should never happen. Strange.
>
> Regards,
> Michael
>
>
> Am 10.01.2018 um 08:24 schrieb Greg D.:
>> Ok, I think I found the issue with receive buffer overflows killing the
>> interface.  Errors get handled by the Receive callback, but the function
>> was returning false.  Apparently that signaled somebody to stop. 
>> Changed it to return true, now things keep running.  Since the error was
>> handled, why hang?  Seems an overly severe punishment.
>>
>> Mark:  Made the change to mcp2515.cpp and pushed to my fork.  Unless
>> there's an objection to hiding the error, please pull.
>>
>> Todo: see if I can replicate the issue with the transmit side and
>> Canopen.  For some reason, that stopped failing earlier today (not that
>> not failing is a bad thing...)  It could be a similar issue with a
>> transmit error, but I need to replicate it first.
>>
>> Greg
>>
>>
>> Michael Balzer wrote:
>>> Greg,
>>>
>>> error flags 0x2040 on mcp2515 = receive buffer 0 overflow. The error flags are reset by a bus start command, so I'd guess the overflow occurs again right after
>>> restarting the obdii process.
>>>
>>> As the mcp has two receive buffers, this means the CAN driver was not able to clear buffer 0 fast enough.
>>>
>>> My next guess would then be we need to clear the RX buffer in the mcp ISR code instead of the RxCallback.
>>>
>>> But… a receive buffer overflow condition should get reset by the driver automatically (line 306). So that should not be the reason the bus stops working.
>>>
>>> Are you sure it's the bus that stopped working, and not the obdii process? Did you verify this using the "can trace" command?
>>>
>>> Regards,
>>> Michael
>>>
>>>
>>> Am 08.01.2018 um 07:31 schrieb Greg D.:
>>>> Quick update...  I can reliably get the CAN 3 bus to hang with an Rx
>>>> overflow by having the modem running, then telling WiFi to connect, but
>>>> only with the OBDWiz dongle.  Connecting an actual HUD display doesn't
>>>> seem to trigger the effect.  Surprisingly, I can do a full module reset
>>>> while the OBDWiz is running, without it disconnecting.  (OBDII ECU is
>>>> started in the system.start script, along with the v2 server and vehicle
>>>> module setting, though all this testing is done on the bench without the
>>>> car.)
>>>>
>>>> Also, once the bus is hung, restarting the OBD2ECU process sometimes
>>>> only lets the OBDWiz dongle get part-way through its connect process
>>>> before it hangs again.  Consistently 7 frames received, 10 sent (due to
>>>> some of the responses taking multiple frames).  It may be significant
>>>> that the HUD does NOT use those multi-frame PIDs (ECU Name and VIN)...
>>>>
>>>> That said, another development is that the Rx overflow may not be fatal
>>>> after all, if I start things with the HUD, then swap dongles to the
>>>> OBDWiz.  Seems that having an external 12v power source keeps things
>>>> running even with the overflow status.  Since earlier (few months ago)
>>>> testing didn't have the modem running, and the OBDWiz dongle doesn't
>>>> need the 12v power (it's a USB device, on a different PC), this test
>>>> combination is new.  Error flags on the can status are 0x2040, by the
>>>> way, when it hangs.
>>>>
>>>> But still, even with the 12v, I can reliably cause the bus to hang by
>>>> starting with the OBDWiz dongle running, get the modem connected, then
>>>> connect wifi.  The partial connect and hang seems to be solved with the
>>>> 12v power; the full hang is not.  But the full hang (with 12v attached)
>>>> can be reset by stopping and restarting the obdii ecu.  Interestingly,
>>>> the 0x2040 error status is not cleared when restarting the obdii
>>>> process, but the frame and frame error counters do get set back to zero.
>>>>
>>>> Still looking for more clues...  Any ideas on how to narrow this down?
>>>>
>>>> Greg
>>>>
>>>>
>>>> Greg D. wrote:
>>>>> I've turned off Canopen, SSH, and Telnet earlier, and that seemed to
>>>>> stop the crashes.  Just now added Bluetooth to that list, for good measure.
>>>>>
>>>>> Let's see how that holds...
>>>>>
>>>>> As for the issues with CAN-3, I seem to be able to hang it by simply
>>>>> starting WiFi while the OBDII ECU is running with an OBDII device
>>>>> attached (OBDWiz, in this case).  Trying to reconnect the OBDII device
>>>>> fails - no frames are received.  Stopping and restarting the OBDII ECU
>>>>> task lets me reconnect.  If I look at the can status when hung, I see
>>>>> that Rx Ovrflw is 1, and the Rx counter doesn't increment.  I'm guessing
>>>>> that starting WiFi is taking enough CPU time that the OBDII ECU task is
>>>>> falling behind, causing the overflow.  Apparently, that overflow is not
>>>>> being handled, leading to the hang.
>>>>>
>>>>> On an earlier run (before removing Bluetooth), I was able to get the
>>>>> OBDWiz dongle to connect for a few frames, after which it hung.  That
>>>>> behavior didn't repeat just now, but I'm not sure what else was running
>>>>> at the time (e.g. the modem / ppp).  The connect sequence from OBDWiz
>>>>> does a few frames rapidly (an initial PID 0, followed by requests for
>>>>> ECU Name and VIN), before a more relaxed polling starts.  So, if there's
>>>>> another task taking up CPU time, I can see where an Rx overflow could
>>>>> occur during that initial connect sequence.
>>>>>
>>>>> Driving a HUD is not a critical task, so I would be against a general
>>>>> raising of task priority.  Rather, we need to figure out how to handle
>>>>> the Rx Overflow, and keep the frames coming in.  OBDII devices generally
>>>>> are somewhat forgiving about lost frames, but apparently the OBDWiz has
>>>>> a short attention span and lets you know that something is wrong.
>>>>>
>>>>> I'll take a look at the 2515 code, but I'm not much of an expert on the
>>>>> chip's care and feeding under such circumstances.  If someone more in
>>>>> the know about it could take a look, that would be great.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Greg
>>>>>
>>>>>
>>>>> Stephen Casner wrote:
>>>>>> Greg,
>>>>>>
>>>>>> Yes, definitely running out of free RAM, but I don't know the meaning
>>>>>> of the WindowOverflow messages.
>>>>>>
>>>>>> The first time I built with release/v3.0 of esp-idf I was not able to
>>>>>> open an ssh connection; the error displayed was about a crypto
>>>>>> failure.  After quite a bit of digging to narrow down to where the
>>>>>> error was occurring, I finally found that the problem was running out
>>>>>> of free RAM.  My solution was to disable bluetooth entirely, which
>>>>>> made a big difference in the amount of free RAM.
>>>>>>
>>>>>>                                                         -- Steve
>>>>>>
>>>>>> On Sun, 7 Jan 2018, Greg D. wrote:
>>>>>>
>>>>>>> Hi Michael, Steve, Mark,
>>>>>>>
>>>>>>> Steve, the crash was an abort in new_op.cc, so perhaps being out of space is the
>>>>>>> issue.  Crash and reboot log attached (crash.txt).  One thing I've been wondering
>>>>>>> about are the several lines "_WindowOverflow4 at ??:?" during the boot process.  Is
>>>>>>> that indicative of a problem, later to manifest in the crash?
>>>>>>>
>>>>>>> My builds include pretty much everything, except for the Leaf, Twizy, and Soul.
>>>>>>>
>>>>>>> The update included some 20 lines changed to mcp2515.cpp, as well as a bunch of other
>>>>>>> stuff, including a lot of stuff updated in Canopen and Kia.  I have a script that does
>>>>>>> the git fetch master, merge, and push back to my github fork, the output of which is
>>>>>>> attached (update.txt).  As a test, I removed Canopen from the build config, and the
>>>>>>> crash has disappeared.  CAN-3 also appears to have come back to life (!), at least
>>>>>>> initially.  I can still get CAN-3 to fail if I turn on/off the modem and/or wifi in
>>>>>>> some sequence (still trying to pin that down), but that also leads to another crash
>>>>>>> (crash2.txt, attached).
>>>>>>>
>>>>>>> Mark:  Note also the issue with DNS failures getting to the v2 server.  I enabled the
>>>>>>> modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2
>>>>>>> server.  Disabling Wifi didn't bring it back, and powering off the modem (in
>>>>>>> preparation for turning it back on) caused the crash.
>>>>>>>
>>>>>>> So, two questions...  First, why the apparent conflict between Canopen or wifi/modem
>>>>>>> and obd2ecu over access to the 3rd CAN bus?  Why would the modem or wifi have any
>>>>>>> effect on a CAN bus?
>>>>>>>
>>>>>>> Second, overall memory usage seems to be at the limit.  What sort of budget do we have
>>>>>>> for what remains to be done, and how are we going to be packaging the build options
>>>>>>> for when non-developers want to get their hands on the product?  Will we be able to
>>>>>>> turn everything on, minus the developer / debug stuff, or will we have a separate SKU
>>>>>>> for each model car?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Greg
>>>>>>>
>>>>>>>
>>>>>>> Michael Balzer wrote:
>>>>>>>
>>>>>>> Greg,
>>>>>>>
>>>>>>> which commits / changes do you mean? The CAN drivers have not been changed since the
>>>>>>> TX performance fix, which Geir reported having solved his last issues.
>>>>>>>
>>>>>>> The current version is stable over here, but without the SSH component -- I can't use
>>>>>>> that due to memory getting too low together with the Twizy component.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> Am 07.01.2018 um 08:04 schrieb Greg D.:
>>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I just resync'd with the main repository, and am not receiving frames on
>>>>>>> CAN-3 anymore.  I see there were changes to the chip driver...
>>>>>>>
>>>>>>> I'm also seeing crashes right after getting connected to WiFi,
>>>>>>> immediately after the system tries to start SSH.
>>>>>>>
>>>>>>> Seems like we just took a big step backward.  What happened?
>>>>>>>
>>>>>>> Greg
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> OvmsDev mailing list
>>>>>>> OvmsDev at lists.teslaclub.hk
>>>>>>> http://lists.teslaclub.hk/mailman/listinfo/ovmsdev