<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
Second finding & commit:<br>
<br>
TL;DR:<br>
<ul>
<li>ESP32 CAN controller is broken, delivers trashed &
duplicated RX frames</li>
<li>we need to optimize interrupt latency</li>
</ul>
<br>
Users have been reporting occasional false BMS readings on the Twizy
for more than a year.<br>
The error always trashed the reading of voltage #10 and/or
temperature #8, so I initially<br>
thought that must be a variant of the SPIRAM/cache bug.<br>
<br>
But the SPIRAM fix didn't help, and the bug still occured with all
BMS data in internal<br>
RAM. The issue correlated with high load situations and the SIMCOM
UART issue. On testing the<br>
latest changes by Marko, my car module began showing that issue
regularly, so I could take a<br>
closer look. It turned out the wrong data came from the CAN bus.
Checking with a CAN logger<br>
didn't show any wrong data on the bus.<br>
<br>
So I scanned our CAN driver for potential issues, found & solved
some (even a serious<br>
race condition) but that didn't fix it. I then added debug dumps of
all frames read, dug<br>
into the SJA1000 spec sheet and found this:<br>
<br>
The SJA1000 RX FIFO has a capacity of 64 bytes. It's supposed to
work as a ring buffer,<br>
with frames wrapping around as necessary. If no space is left in the
FIFO for a new frame<br>
coming in, that frame should be discarded completely, not counted,
just the overflow<br>
indicator should be set & the according interrupt be generated.<br>
<br>
Reality differs. The frame causing the overflow is added to the FIFO
partially. It's also<br>
counted both in the RMC register and indicated by RBS as a valid
frame when retrieving<br>
the FIFO contents. The read buffer delivers the partial frame + some
trashed bytes up<br>
to the nominal frame length.<br>
<br>
In the BMS frame sequence it would normally trash bytes 7 & 8 on
the sixth frame of length<br>
8 in a sequence. A standard frame needs a 3 byte header + the data
bytes in the FIFO, so<br>
6 frames would add up to 66 bytes => 64 + 2 trash. The first
trash byte normally is "08",<br>
the second "84" or "2a" or sometimes "ab", possibly some internal
SJA1000 data.<br>
<br>
Better yet, if you rely on the indicators of the SJA1000 when
reading the FIFO, the<br>
SJA1000 will repeat the first frame within the FIFO after the
invalid frame. Mostly<br>
once or twice, and sometimes 7 times or more:<br>
<br>
<tt>inv_msg: framecnt=13, invindex=6</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><tt><br>
</tt><tt>inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 |
4..?........$....@."7U.7</tt><tt><br>
</tt><tt>inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 |
4..?........%.....D..N.&</tt><tt><br>
</tt><tt>inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 |
4..?........T...7777777.</tt><tt><br>
</tt><tt>inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a |
4..?........V...1c.1S.1J</tt><tt><br>
</tt><tt>inv_msg: 57 05 00 00 31 43 14 31 53 15 <font
color="#ff0000"><b>08 2a</b></font> | 4..?........W...1C.1S..*</tt><tt><br>
</tt><tt> ^^^^^ <font
color="#ff0000">trashed bytes</font></tt><tt><br>
</tt><tt>… followed by <font color="#ff0000">7 repetitions of the
first frame</font>:</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><tt><br>
</tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n</tt><br>
<br>
If you had occasional strange false readings leading to strange
effects, this<br>
may be the explanation. Note: the SJA1000 FIFO bug may need specific
circumstance<br>
to occur in addition to the overflow, maybe the overflow happening
on a specific<br>
byte position in the FIFO -- I haven't tried to determine that.<br>
<br>
I've tried clearing the overflow ASAP, looping on the IR indicator
and looping<br>
on the RMC, the fault is always the same. The workaround I've now
done is adding<br>
up the message lengths read during a FIFO run and discard all frames
exceeding<br>
the 64 byte border. Not pretty, but functional.<br>
<br>
That keeps the trashed frames from being forwarded to the OVMS
application level.<br>
<br>
As with the SIMCOM UART, these overflows wouldn't happen if the
interrupt handler<br>
would react fast enough. It seems we've got a general interrupt
latency issue.<br>
<br>
And just as with the SIMCOM I saw a reduction of overflows when
allocating the<br>
CAN IRQ on core #0, another one with raising the IRQ level to 3, so
that's<br>
also part of my commit.<br>
<br>
The esp-idf CAN driver apparently has the same issue even in the
latest release,<br>
I'll contact the author about this or post another issue to
Espressif.<br>
<br>
So our interrupt allocation is now:<br>
<br>
<tt>Task Run# Lvl ISR# Usage Source</tt><tt><br>
</tt><tt>----------------------------------------------------------------------------------------</tt><tt><br>
</tt><tt>- 0 1 0 RTC Core
ETS_RTC_CORE_INTR_SOURCE</tt><tt><br>
</tt><tt>esp_timer 0 1 0 RTOS esp_timer
ETS_TIMER2_INTR_SOURCE</tt><tt><br>
</tt><tt>ipc0 0 1 0 RTOS Scheduler Core0
ETS_FROM_CPU_INTR0_SOURCE</tt><tt><br>
</tt><tt>ipc1 1 1 1 RTOS Scheduler Core1
ETS_FROM_CPU_INTR1_SOURCE</tt><tt><br>
</tt><tt>main 0 1 0 RTOS Watchdog
ETS_TG0_WDT_LEVEL_INTR_SOURCE</tt><tt><br>
</tt><tt>OVMS Events 1 1 1 Peripherals (GPIO ISR)
ETS_GPIO_INTR_SOURCE</tt><tt><br>
</tt><tt>OVMS Events 1 1 1 MAX7317 (spi_nodma)
ETS_SPI3_INTR_SOURCE</tt><tt><br>
</tt><b><tt>esp32can init 0 3 0 ESP32 CAN
ETS_CAN_INTR_SOURCE</tt></b><b><tt><br>
</tt></b><b><tt>OVMS SIMCOM 0 2 0 SIMCOM
UART ETS_UART1_INTR_SOURCE</tt></b><tt><br>
</tt><tt>OVMS Events 1 1 1 USB console (UART)
ETS_UART0_INTR_SOURCE</tt><tt><br>
</tt><tt>OVMS Events 1 1 1 SD card
ETS_SDIO_HOST_INTR_SOURCE</tt><tt><br>
</tt><tt>----------------------------------------------------------------------------------------</tt><br>
<br>
I think we need to find the source of the interrupt latency and/or
find some<br>
better scheme to distribute the tasks & interrupts over the
cores.<br>
<br>
I still don't know why core #1 already got an issue with just our 6
interrupts.<br>
This may be related to the SPIRAM fix adding overhead, but I've seen
the FIFO<br>
overflows before applying the fix, and the CAN issue also has been
present for<br>
over a year.<br>
<br>
My working hypothesis is some task/driver disabling interrupts for
too often<br>
/ too long, or non-IRAM ISRs getting delayed for too often / too
long. And it<br>
seems this is an issue especially on core #1.<br>
<br>
Please verify & comment.<br>
<br>
Regards,<br>
Michael<br>
<br>
PS: a little side mystery: the netmanager task starts with priority
5, then<br>
gets raised to 22 (probably a mutex priority inheritance), but never
gets<br>
lowered back to 5 afterwards. A FreeRTOS bug?<br>
<br>
<pre class="moz-signature" cols="160">--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
</pre>
</body>
</html>