[Ovmsdev] CAN drivers: fix & harmonize frame transmission failure handling

Sun Jan 10 02:11:57 HKT 2021

Michael,

Do you know if anyone is tracking your changes for the Tesla Roadster
to make any changes that are necessary or appropriate?  I am not at
all familiar with that part of the code.

                                                        -- Steve

On Sat, 9 Jan 2021, Michael Balzer wrote:

> I've just pushed the two poller extensions described before:
>
> 1. If a TxCallback has been registered, either globally or on the
>    frame, the CAN framework now won't add an extra error status log
>    entry on TX failure. If you need to see/log these, activate a CAN
>    logger capable of catching TX failures, e.g. "can log start monitor
>    crtd", and look for "TX_Fail" entries. With a registered TxCallback,
>    during normal operation you will only see error-level log entries
>    from TX failures when the CAN framework encounters a bus error
>    condition. "Ping" frames/requests sent while the error state is
>    active won't produce standard log entries.
>
> 2. The vehicle poller now registers a TxCallback for all requests sent,
>    and so automatically fulfills the above condition. You can hook into
>    that callback simply by overriding the following method:
>
>    /**
>      * IncomingPollTxCallback: poller TX callback (stub, override with vehicle implementation)
>      *  This is called by PollerTxCallback() on TX success/failure for a poller request.
>      *  You can use this to detect CAN bus issues, e.g. if the car switches off the OBD port.
>      *
>      *  ATT: this is executed in the main CAN task context. Keep it simple.
>      *    Complex processing here will affect overall CAN performance.
>      *
>      *  @param bus
>      *    CAN bus the current poll is done on
>      *  @param txid
>      *    The module TX ID of the current poll
>      *  @param type
>      *    OBD2 mode / UDS polling type, e.g. VEHICLE_POLL_TYPE_READDTC
>      *  @param pid
>      *    PID addressed (depending on the request type, may be none / 8 bit / 16 bit)
>      *  @param success
>      *    Frame transmission success
>      */
>    void OvmsVehicle::IncomingPollTxCallback(canbus* bus, uint32_t txid,
>        uint16_t type, uint16_t pid, bool success)
>       {
>       }
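A minimal sketch of such an override, assuming a vehicle that treats a few consecutive TX failures as a switched-off OBD port. The `canbus` and `OvmsVehicle` definitions are stand-ins that only mirror the callback signature quoted above, so the example compiles on its own; `OvmsVehicleExample`, `kMaxTxFails` and `ObdOnline()` are hypothetical names, and the threshold of 3 is arbitrary. In line with the ATT note, the override does nothing but update a counter and a flag.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-ins so the sketch compiles on its own. Only the callback
// signature mirrors the OVMS method quoted above.
struct canbus {};

class OvmsVehicle
  {
  public:
    virtual ~OvmsVehicle() = default;
    // Framework stub: does nothing unless a vehicle overrides it.
    virtual void IncomingPollTxCallback(canbus* bus, uint32_t txid,
        uint16_t type, uint16_t pid, bool success) {}
  };

// Hypothetical vehicle: assumes the OBD port is switched off after
// kMaxTxFails consecutive failed poller transmissions.
class OvmsVehicleExample : public OvmsVehicle
  {
  public:
    static constexpr int kMaxTxFails = 3;   // arbitrary example threshold

    void IncomingPollTxCallback(canbus* bus, uint32_t txid,
        uint16_t type, uint16_t pid, bool success) override
      {
      // Runs in the CAN task context: only update a counter and a flag,
      // no logging, no blocking calls.
      if (success)
        {
        m_txfail_count = 0;
        m_obd_online = true;
        }
      else
        {
        if (m_txfail_count < kMaxTxFails)
          m_txfail_count++;                 // saturating failure counter
        if (m_txfail_count >= kMaxTxFails)
          m_obd_online = false;
        }
      }

    bool ObdOnline() const { return m_obd_online; }

  private:
    int m_txfail_count = 0;
    bool m_obd_online = true;
  };
```

Counting consecutive failures rather than reacting to a single one keeps a transient bus error from being mistaken for the car switching the port off; any successful transmission resets the count.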
>
>
> Regards,
> Michael
>
>
> On 09.01.21 at 14:18, Michael Balzer wrote:
> > As all tests were positive and showed no issues, I've merged the rework
> > into master.
> >
> > I'm now considering extending the poller to allow hooking into
> > transmission failures.
> >
> > Also, if a TX callback is present in a frame, I don't think we need the
> > error log entry from the CAN framework. That would eliminate most CAN error
> > log messages from regular poller "pings".
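The proposed logging rule, as announced for the pushed version (point 1 above), can be modeled as a small decision function. This is a simplified sketch, not the CAN framework's code: `TxFailInfo` and `ShouldLogTxFailure()` are hypothetical, and `bus_error_state` stands for whatever bus error condition the framework detects itself.

```cpp
#include <cassert>

// Hypothetical, simplified model of the rule; not the CAN framework's code.
struct TxFailInfo
  {
  bool frame_callback;    // a TX callback is set on the failed frame
  bool global_callback;   // a TxCallback is registered globally
  bool bus_error_state;   // the framework has detected a bus error condition
  };

// Returns true if the framework itself should add an error log entry
// for this TX failure.
bool ShouldLogTxFailure(const TxFailInfo& info)
  {
  // No callback: the framework's log entry is the only notification.
  if (!info.frame_callback && !info.global_callback)
    return true;
  // A callback will report the failure; only bus error conditions
  // are still logged.
  return info.bus_error_state;
  }
```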
> >
> > Regards,
> > Michael
> >
> >
> > On 08.01.21 at 18:52, Michael Balzer wrote:
> > > Steve,
> > >
> > > thanks, that's perfect. The failure handling works as designed in your
> > > case.
> > >
> > > Regarding your question:
> > > > It does prompt me to ask a question that I had: on the i3, if you do
> > > > something like send a lock from the key or the Connected Drive app,
> > > > then the OBD-II comes alive but goes back to sleep in less than a minute.
> > > >
> > > > If I have a PID that I poll infrequently, say every 120 seconds, what
> > > > happens in this case?  Would it be seen as "overdue" when the bus
> > > > comes alive and polled immediately, or is it a matter of luck whether
> > > > the 120th tick arrives at a time when the bus is alive?
> > > >
> > > > If the latter, I need to poll even things like the VIN every 10 seconds
> > > > to make sure I get it before the bus goes to sleep again.
> > >
> > > With the old handling, the queued frames would have been sent as soon as
> > > the bus woke up again. That's nasty, as the frames may have been meant for
> > > a specific task (e.g. part of some protocol) and should not be sent to a
> > > just-woken car. That could cause all sorts of problems, up to queued OBD
> > > writes corrupting the car's memory. It was also nasty that the driver
> > > would then send the whole TX queue at once, flooding the bus. A vehicle
> > > could see that as malicious activity and block access.
> > >
> > > The new handling will abort the transmission as soon as the CAN controller
> > > reaches the retransmission limit (128 tries; formally, the CAN
> > > error-passive state).
> > >
> > > So you now need to "ping" the car with some simple read or session state
> > > request and check whether a response comes in to determine if the bus is
> > > online. If you use the poller, you'll get the corresponding Incoming...()
> > > callback. If you don't use the poller, you can set the TX callback pointer
> > > on the frame you send. The TX callback is called with a success indicator,
> > > so you can tell a frame has been sent even if you don't get a response
> > > from the device.
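The "ping" approach without the poller can be sketched as follows. Everything here is a hypothetical stand-in rather than the OVMS API: `Frame` carries an optional pointer to a TX callback, `FakeBus` simulates a bus that only acknowledges frames while the car is awake, and `BusMonitor` derives an online flag from the success indicator alone. The ping payload is assumed to be a UDS TesterPresent request (single ISO-TP frame `02 3E 00`); any cheap read or session request would do. The `on_wakeup` hook is one possible answer to the i3 question above: trigger infrequent polls such as the VIN as soon as a ping succeeds, instead of waiting for their next scheduled tick.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-ins, not the OVMS API.
struct Frame;
using TxCallback = std::function<void(const Frame&, bool success)>;

struct Frame
  {
  uint32_t id;
  uint8_t data[8];
  const TxCallback* callback;   // optional: invoked once with the TX result
  };

// Simulated bus: frames are only acknowledged while the car is awake.
// Either way, the frame's TX callback is invoked exactly once.
struct FakeBus
  {
  bool awake = false;
  void Write(const Frame& frame) const
    {
    if (frame.callback && *frame.callback)
      (*frame.callback)(frame, awake);
    }
  };

// Derives a bus online flag purely from TX success indicators.
class BusMonitor
  {
  public:
    explicit BusMonitor(std::function<void()> on_wakeup)
      : m_on_wakeup(std::move(on_wakeup)),
        m_callback([this](const Frame&, bool success) { OnTxResult(success); })
      {}
    // m_callback captures `this`, so the monitor must not be copied.
    BusMonitor(const BusMonitor&) = delete;
    BusMonitor& operator=(const BusMonitor&) = delete;

    // Ping: UDS TesterPresent (SID 0x3E, sub-function 0x00) as a single
    // ISO-TP frame, with our TX callback attached.
    Frame MakePing(uint32_t txid) const
      {
      return Frame{ txid, { 0x02, 0x3E, 0x00, 0, 0, 0, 0, 0 }, &m_callback };
      }

    bool Online() const { return m_online; }

  private:
    void OnTxResult(bool success)
      {
      bool was_online = m_online;
      m_online = success;
      // Offline -> online: the car just woke up, poll what we need now.
      if (success && !was_online && m_on_wakeup)
        m_on_wakeup();
      }

    std::function<void()> m_on_wakeup;
    TxCallback m_callback;
    bool m_online = false;
  };
```

A successful TX only proves that some node acknowledged the frame, not that the addressed ECU answered, so checking for the response is still needed as described above.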
> > >
> > > Regards,
> > > Michael
> > >
> > >
> > > On 08.01.21 at 09:33, Steve Davies wrote:
> > > > Hi Michael,
> > > >
> > > > Here's the log from a test on my car with your branch:
> > > >
> > > > I started the car, left it for a while, then shut it down and waited
> > > > until the OBD-II first went to "not getting replies to my requests" and
> > > > then to "not sending anything at all".
> > > >
> > > > Hope it's helpful.
> > > >
> > > > https://drive.google.com/file/d/1AavD41HCykYrn-BxQXNufu2dT_UVCXYU/view?usp=sharing
> > > >
> > > > Steve
> > > >
> > > >
> > > > On Fri, 8 Jan 2021 at 08:22, Steve Davies <steve at telviva.co.za> wrote:
> > > >
> > > >     Hi Michael,
> > > >
> > > >     The change looks helpful, thanks.  I'll try it during the course
> > > >     of the day.
> > > >
> > > >     It does prompt me to ask a question that I had: on the i3, if
> > > >     you do something like send a lock from the key or the Connected
> > > >     Drive app, then the OBD-II comes alive but goes back to sleep in
> > > >     less than a minute.
> > > >
> > > >     If I have a PID that I poll infrequently, say every 120
> > > >     seconds, what happens in this case?  Would it be seen as
> > > >     "overdue" when the bus comes alive and polled immediately, or is
> > > >     it a matter of luck whether the 120th tick arrives at a time when
> > > >     the bus is alive?
> > > >
> > > >     If the latter, I need to poll even things like the VIN every 10
> > > >     seconds to make sure I get it before the bus goes to sleep again.
> > > >
> > > >     Thanks,
> > > >     Steve
> > > >
> > > >
> > > >     On Thu, 7 Jan 2021 at 18:22, Michael Balzer <dexter at expeedo.de> wrote:
> > > >
> > > >         Everyone,
> > > >
> > > >         please pull & test the new "can-txfail-fix" branch. It's up
> > > >         to date and includes the BMW i3 code already.
> > > >
> > > >         I need to get feedback from users of both can1 (esp32can) &
> > > >         can2/3/4 (mcp2515), as changes had to be made to both drivers.
> > > >
> > > >         I'll quote from my commit:
> > > >         https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/commit/c94592a11ad2c989e65313d23a8876cf38787d70
> > > >
> > > >         Design goals:
> > > >         - any TX can either fail or succeed; the result state is terminal
> > > >         - the respective TX callback is called exactly once
> > > >         - transmissions fail on reaching the error-passive bus state
> > > >           and on message/bus errors while in error-passive state
> > > >         - a failed TX will be aborted (no retries after bus recovery),
> > > >           i.e. it will be retried at most 128 times (in the error-active phase)
> > > >         - reduce excessive CAN error logging
> > > >         - reduce excessive interrupt load with switched-off buses
> > > >
> > > >         This allows the application to reliably detect a switched-off
> > > >         vehicle bus via the TX callback's success indicator. It also
> > > >         means frames are no longer held in the TX buffer or added to
> > > >         the TX queue when the bus is switched off. The application can
> > > >         now rely on getting a clean bus state on every reconnect,
> > > >         without any old queued frames being sent automatically.
> > > >
> > > >         A secondary benefit of aborting the transmission is that the
> > > >         module no longer needs to handle the load of CAN error
> > > >         interrupts continuously triggered by retransmission attempts
> > > >         in error-passive state.
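The first design goals listed in the commit can be sketched as a small state holder for one pending transmission. This is an illustrative model under stated assumptions, not the esp32can or mcp2515 driver code: `PendingTx`, `BusState` and `TxResult` are hypothetical names, and the controller's error-state transitions are passed in as events instead of being derived from its error counters. Retries while error-active are left to the controller hardware, so they don't appear here.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical model of one pending transmission; not the driver code.
enum class BusState { ErrorActive, ErrorPassive };
enum class TxResult { Pending, Success, Failed };

class PendingTx
  {
  public:
    using Callback = std::function<void(bool success)>;
    explicit PendingTx(Callback cb) : m_cb(std::move(cb)) {}

    // Controller reports the frame as acknowledged on the bus.
    void OnTxComplete() { Finish(TxResult::Success); }

    // Controller reports a message/bus error and its resulting state.
    // While error-active, the controller retries on its own; reaching
    // error-passive, or any error while in it, aborts the transmission.
    void OnTxError(BusState state)
      {
      if (state == BusState::ErrorPassive)
        Finish(TxResult::Failed);
      }

    TxResult Result() const { return m_result; }

  private:
    void Finish(TxResult result)
      {
      if (m_result != TxResult::Pending)
        return;                              // result is terminal: ignore late events
      m_result = result;
      if (m_cb)
        m_cb(result == TxResult::Success);   // reached at most once per TX
      }

    Callback m_cb;
    TxResult m_result = TxResult::Pending;
  };
```

Keeping the terminal check inside `Finish()` is what provides the "called exactly once" property in this model: whichever terminal event arrives first decides the result, and nothing later, such as a bus recovery, can flip it.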
> > > >
> > > >
> > > >         The reasons for this were a) Steve's question on aborting
> > > >         transmissions / flushing the queue and b) my new car now
> > > >         also switching off the bus, with the annoying effect of
> > > >         can1 freezing every 2-3 days and requiring a module reboot.
> > > >         I'm not sure yet whether the freeze issue is solved, but I
> > > >         haven't had it since running these changes on my module.
> > > >
> > > >         The other problem, transceivers resending frames queued
> > > >         long ago, may have caused all sorts of strange and
> > > >         hard-to-reproduce issues. I remember the VW crew having
> > > >         problems that fell into this category.
> > > >
> > > >         I've verified the new MCP2515 implementation only on my
> > > >         workbench (with an Arduino as the CAN tester), so real-life
> > > >         tests are necessary.
> > > >
> > > >         Thanks,
> > > >         Michael
> > > >
> > > >         --
> > > >         Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> > > >         Phone 02333 / 833 5735 * Mobile 0176 / 206 989 26