I don’t know the cause of the HUD bus hang, so really not sure if it is resolved or not. The SPI driver is the standard ESP IDF one now, so should be stable. But what was the state of the bus when it hung? Perhaps some error condition set in the MCP2515?
Trying this again... (got a 554 blocked reply the first time). apologize if it's a duplicate.
Will this fix the bus hangs we have with the HUD devices? Ref: "Can
buses stop after some time" email thread from 5/28/2019... As I
wrote at the time, a 100% way to reproduce the hang was to have a
HUD device running, then start the OBD2ECU task. I haven't
used my HUD in some time, so I don't know if this was fixed since
then. If not, that was a very quick and easy way to reproduce the
issue.
Working with another developer, we have made some changes in a ’spimaster’ branch:
Stop using spi_nodma fork of ESP’s standard spi code, and switch back to use the standard ESP IDF spi master.
To support >3 devices (which ESP IDF spi master doesn’t due to hardware limitations of CS line and 3x DMA channels), change to use software CS line for the MAX7317 driver (the MCP2515 continue to use hardware CS).
Confirm the changes to our MCP2515 driver related to keeping track of the last buffer read, to solve the out-of-order issue.
Confirm the fix for another related issue where we don’t block (delay) if the can tx queue is full.
These seem better now, and I am able to establish a CAN IP connection over MCP2515. Frames come in order, and we are seeing performance around 700 frames/second - which should be adequate for our needs.
I’ll do some more testing over the next few days, and if no issues found merge back to master.
Regards, Mark.
On 7 Jun 2021, at 12:16 AM, Michael Balzer <dexter@expeedo.de> wrote:
Tom suggests implementing a state machine to reproduce the receive
order. His analysis & solution looks sound to me.
Regards,
Michael
Am 06.06.21 um 14:50 schrieb Mark
Webb-Johnson:
I spent quite a bit of time on this. With my standard test packet
of 11 CAN frames expected, and the standard driver, I get perhaps
4 or 5 of them (about half are lost, and some are out of order).
I made the suggested change to move
the MyCan.IncomingFrame() call out of the ‘can’ object (when
frameReceived is true) to within the
mcp2515 AsynchronousInterruptHandler itself. That allows the
handler to receive more than one frame per call and is a very
simple change. Once that is done, we can at least now try to
tune it.
So I then modified the code of
mcp2515 AsynchronousInterruptHandler to loop so long as the
interrupt flag says either buffer #0 or #1 has a frame. The
result looks something like this:
D
(63192) mcp2515: AsynchronousInterruptHandler instat=01
errflag=00 txb0ctrl=00
D
(63192) mcp2515: AsynchronousInterruptHandler rx frame
from buffer #0 (ID 0x110 B1=54)
D
(63192) mcp2515: AsynchronousInterruptHandler instat=23
errflag=40 txb0ctrl=00
D
(63192) mcp2515: AsynchronousInterruptHandler rx frame
from buffer #0 (ID 0x110 B1=40)
D
(63192) mcp2515: AsynchronousInterruptHandler rx frame
from buffer #1 (ID 0x110 B1=45)
D
(63192) mcp2515: AsynchronousInterruptHandler Clear RX
buffer #0 overflow flag
D
(63192) mcp2515: AsynchronousInterruptHandler instat=23
errflag=00 txb0ctrl=00
D
(63192) mcp2515: AsynchronousInterruptHandler rx frame
from buffer #0 (ID 0x110 B1=24)
D
(63192) mcp2515: AsynchronousInterruptHandler rx frame
from buffer #1 (ID 0x110 B1=34)
D
(63192) mcp2515: AsynchronousInterruptHandler instat=20
errflag=00 txb0ctrl=00
The actual frames on the bus are (B1 values) 54, 45, 40,
0a, 7a, d5, 0c, 14, 1c, 24, 2c, and 34. Looking at the above
debug output, we get:
Interrupt flags show buffer #0 has a frame. It
is B1=54. Good.
Interrupt flags show buffers #0 and #1 both
have frames. Buffer #0 has B1=40 and buffer #1 has B1=45.
Etc etc
That is not good. What must have happened is that the first
B1=54 frame arrived, got put in buffer #0, and interrupt was
raised. We checked the interrupt flags, found buffer #0 had
something, and read the frame ok. All is good. But what is
happening now is that between the time we checked the
interrupt flags and the time we finished reading the 13 bytes
from buffer #0, a second frame arrived and was put in buffer
#1. Then a third frame arrives and is put in buffer #0. We
loop back to check the interrupt flags and find both buffers
have frames ready. So we ready buffer #0 to get the third
frame, then buffer #1 to get the second frame. We are out of
sequence.
By removing the ESP_LOGD statements, I can improve
performance enough to get 10 out of the 11 frames, but still
sometimes frames are swapped in order.
By over-clocking the MCP2515 SPI bus (supposed to be 10MHz,
but I push it to 15MHz), I can get all 11 frames, but two are
out of order.
I suppose I can minimise the chance of the out-of-order
issue by repeating the call to read interrupt flags after
processing buffer #0 but before checking for buffer #1. That
would at least reduce the time window to as small as possible,
but would be another SPI call and is too slow. Doing that
brings us back to losing frames.
Another approach may relate to our current use of the READ
command to read 5 status registers (interrupt flags, error
flags, two skipped, then transmit buffer #0 flags). There are
two specific commands ‘read status’ (which gets
the rx and tx
buffer status flags in one byte), and ‘rx status’
(which gets
just the receive buffer status and some info on the frames
received, again in one byte). I think those are more designed
for what we are trying to do. I can try to optimise the read
loop at the start of the AsynchronousInterruptHandler to use
one of those - they are 2 SPI bytes vs 7 for what we are doing
at the moment (so more than three times as fast).
I think it will also be worthwhile having a look at some
other open source mcp2515 drivers to see how other people are
doing it.
The handler can only return one frame. As it is,
if both buffers #0 and #1 have a frame, it returns
#0. I am not sure if it gets called again (seems
to depend on the interrupt gpio status).
// Read the interrupt pin status
and if it's still active (low), require
another interrupt handling iteration
return
!gpio_get_level((gpio_num_t)m_intpin);
Maybe a quick solution is to just
return true, immediately
after *frameReceived=true, if intflag=0x01
and (intstat & CANINTF_RX1IF)? That would
dispatch the incoming frame, then call back for
more (from the loop in the can object).
I am not sure in general
why AsynchronousInterruptHandler uses a bool
frameReceived flag, and doesn’t just simply
dispatch the frame immediately to the can
object? That would simplify things and allow
the AsynchronousInterruptHandler to handle
receiving both frames in the same call. Given
that MCP2515 is the only driver
using AsynchronousInterruptHandler, that would
be an easy fix.
Regards, Mark.
On 4 Jun 2021, at 2:29 PM,
Michael Balzer <dexter@expeedo.de>
wrote:
Signed
PGP part
Mark,
the handler is meant to read both
buffers sequentially, and on a
quick glance I don't see why it
wouldn't. But it can't hurt if you
do an audit of the code.
I remember having had that
out-of-order discussion when
handling both RX buffers before
here, but don't remember the
outcome. Too bad the list archives
cannot be searched.
I think it was the MCP not doing
overflows from RX buffer 1 to 0.
I.e. if buffer 1 still has a frame
on arrival, the new frame will be
lost. That means losing a frame if
the handler cannot react fast
enough, but receiving out of order
would be worse.
Regards,
Michael
Am
04.06.21 um 04:16 schrieb Mark
Webb-Johnson:
Michael,
Good suggestion on
the timing. I think it best to
use the same timings as the
Arduino library, and have
committed that change. No
vehicle modules currently use
1Mbps on MCP2515 anyway.
Unfortunately, it didn’t
resolve my problem.
Looking at the
error flags I see:
Error
flag: 0x23401c01
intstat
0x23
ERRIF Error
Interrupt pending
RX0IF Rx buffer 0
full interrupt
RX1IF Rx buffer 1
full interrupt
errflag
0x40
RX0OVR Rx buffer
0
overflow
intflag
0x1c01
0x01 Implied from Rx
buffer 0 full
0x1c = 0001 1100
Means RXB0 overflow.
No data lost in this
case (it went into
RXB1)
Means (errflag &
EFLG_RX01OVR), clear
RX buffer overflow
flags
So we have CAN
frames in BOTH rx buffers #0
and #1. Looking at our driver
code
(mcp2515::AsynchronousInterruptHandler),
it seems in that case we only
read from buffer #0. From the
flow I can see, we are going
to lose that second frame.
We’re not really handling the
issue of two frames being in
the buffers when the interrupt
handler is called.
As the
architecture of mcp2515::AsynchronousInterruptHandler
can only receive one frame,
it is not so simple to fix.
We could simply read and
return the frame in buffer
#0, requesting to be called
again (return true), but
another frame may arrive
(into buffer #0) before we
get called again, and that
is going to result in
out-of-order frames.
I’ll work
on
improving the handling of this
case.
Regards, Mark.
On 3 Jun
2021, at 3:07 PM,
Michael Balzer <dexter@expeedo.de>
wrote:
Signed
PGP part
Mark,
I'd give the bit
timing a try
first, the
MCP2515 seems to
be very
sensitive for
this. I've even
had some trouble
finding a
working
configuration
for the 50 kbit
timing I've
added a couple
weeks ago.
We currently use
00 / D0 / 82
which is also
the result of
the old Intrepid
timing
calculator.
That's a
propagation
segment of 1 Tq
and 3 Tq per
phase, resulting
in samling
between 50% -
62.5%.
The Arduino MCP
CAN lib by Cory
Fowler also had
this previously,
but then changed
in…
…to 00 / CA /
81, which is a
propagation
segment of 3 Tq
and 2 Tq per
phase, shifting
the sampling
window to 62.5 -
75%.
Our current
configuration
scheme for the
internal SJA1000
compatible CAN
seems to sample
at 62.5 - 75% as
well, so that
would also
match.
Regards,
Michael
Am
03.06.21 um
07:36 schrieb
Mark
Webb-Johnson:
I’m
working on an
implementation
of IP stack
over CAN for
the Tesla
Roadster. IP
frames are
encoded as a
length
followed by a
sequence of
CAN frames,
all on the
same ID. This
runs over a
1MHz bus, so
presumably the
traffic volume
could be high
at times.
I
was having
problems with
this running
on CAN2, so
tried CAN1 and
it worked
perfectly.
Here are some
simple dumps
of a single
PING packet
(and single
PING response
packet):
ID
#111 is used
to transmit an
IP packet, and
ID #110 is
used to
receive an IP
packet. The
special empty
data frame is
an
acknowledgment.
Using
latest master
branch code
(3.2.016-196-g0aad1a9f/ota_1/edge
(build idf
v3.3.4-848-g1ff5e24b1
Jun 2 2021
09:28:58)).
So,
first let’s
test with
traffic on
CAN1 (active,
1Mbps), and
listening on
CAN2 (listen,
1Mbps):
TCPDUMP:
05:57:55.980291
IP (tos 0x0,
ttl 64, id
43101, offset
0, flags [DF],
proto ICMP
(1), length
84)
Conclusion
is that the
CAN1 traffic
looks fine,
and the PING
packet gets a
good reply.
All
successful.
But the CAN2
listen is
missing a few
packets and
the last
packet is out
of order.
Now,
let’s test
with traffic
on CAN2
(active,
1Mbps), and
listening on
CAN1 (listen,
1Mbps):
TCPDUMP:
06:00:33.004060
IP (tos 0x0,
ttl 64, id
58240, offset
0, flags [DF],
proto ICMP
(1), length
84)
It is
1Mbps, with
30us or so
between each
packet. This
is the only
traffic on the
bus.
Everything
else is turned
off. Roughly
12 packets
each way.
Surely even if
we were
hitting a
performance
limit, our
buffers can
handle 12
packets?
The
good news is
that I have a
good
environment to
replicate this
issue now, so
any fix should
be easy to
test.
I
haven’t
worked
on the MCP2515
driver in our
code in a long
time, but it
certainly
seems
something is
messed up and
that could be
badly
affecting
vehicle
modules using
anything other
than CAN1.
I
will start to
look at this
over the
weekend, but
has anyone got
any
ideas/suggestions?
Perhaps the
bit timing
registers are
off by a small
amount (so it
works on CAN1
but not on
CAN2)? Or
something more
serious in our
driver?