I've submitted an issue to Espressif on this:
https://github.com/espressif/esp-idf/issues/4276
It would be nice if I could reproduce the effect using some
standard hardware.
I've ordered some of those combined ESP32 + OLED units, but don't
have any CAN transceivers I could hook up to them.
Maybe the info I've provided is sufficient for Espressif to
investigate.
Regards,
Michael
Am 31.10.19 um 22:24 schrieb Michael
Balzer:
Second finding & commit:
TL;DR:
- ESP32 CAN controller is broken, delivers trashed &
duplicated RX frames
- we need to optimize interrupt latency
Users have been reporting occasional false BMS readings on the
Twizy for more than a year.
The error always trashed the reading of voltage #10 and/or
temperature #8, so I initially
thought that must be a variant of the SPIRAM/cache bug.
But the SPIRAM fix didn't help, and the bug still occured with
all BMS data in internal
RAM. The issue correlated with high load situations and the
SIMCOM UART issue. On testing the
latest changes by Marko, my car module began showing that issue
regularly, so I could take a
closer look. It turned out the wrong data came from the CAN bus.
Checking with a CAN logger
didn't show any wrong data on the bus.
So I scanned our CAN driver for potential issues, found &
solved some (even a serious
race condition) but that didn't fix it. I then added debug dumps
of all frames read, dug
into the SJA1000 spec sheet and found this:
The SJA1000 RX FIFO has a capacity of 64 bytes. It's supposed to
work as a ring buffer,
with frames wrapping around as necessary. If no space is left in
the FIFO for a new frame
coming in, that frame should be discarded completely, not
counted, just the overflow
indicator should be set & the according interrupt be
generated.
Reality differs. The frame causing the overflow is added to the
FIFO partially. It's also
counted both in the RMC register and indicated by RBS as a valid
frame when retrieving
the FIFO contents. The read buffer delivers the partial frame +
some trashed bytes up
to the nominal frame length.
In the BMS frame sequence it would normally trash bytes 7 &
8 on the sixth frame of length
8 in a sequence. A standard frame needs a 3 byte header + the
data bytes in the FIFO, so
6 frames would add up to 66 bytes => 64 + 2 trash. The first
trash byte normally is "08",
the second "84" or "2a" or sometimes "ab", possibly some
internal SJA1000 data.
Better yet, if you rely on the indicators of the SJA1000 when
reading the FIFO, the
SJA1000 will repeat the first frame within the FIFO after the
invalid frame. Mostly
once or twice, and sometimes 7 times or more:
inv_msg: framecnt=13, invindex=6
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 |
4..?........$....@."7U.7
inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 |
4..?........%.....D..N.&
inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 |
4..?........T...7777777.
inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a |
4..?........V...1c.1S.1J
inv_msg: 57 05 00 00 31 43 14 31 53 15 08 2a |
4..?........W...1C.1S..*
^^^^^ trashed bytes
… followed by 7 repetitions of
the first frame:
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |
4..?........U.....PTT .n
If you had occasional strange false readings leading to strange
effects, this
may be the explanation. Note: the SJA1000 FIFO bug may need
specific circumstance
to occur in addition to the overflow, maybe the overflow
happening on a specific
byte position in the FIFO -- I haven't tried to determine that.
I've tried clearing the overflow ASAP, looping on the IR
indicator and looping
on the RMC, the fault is always the same. The workaround I've
now done is adding
up the message lengths read during a FIFO run and discard all
frames exceeding
the 64 byte border. Not pretty, but functional.
That keeps the trashed frames from being forwarded to the OVMS
application level.
As with the SIMCOM UART, these overflows wouldn't happen if the
interrupt handler
would react fast enough. It seems we've got a general interrupt
latency issue.
And just as with the SIMCOM I saw a reduction of overflows when
allocating the
CAN IRQ on core #0, another one with raising the IRQ level to 3,
so that's
also part of my commit.
The esp-idf CAN driver apparently has the same issue even in the
latest release,
I'll contact the author about this or post another issue to
Espressif.
So our interrupt allocation is now:
Task Run# Lvl ISR# Usage
Source
----------------------------------------------------------------------------------------
- 0 1 0 RTC
Core ETS_RTC_CORE_INTR_SOURCE
esp_timer 0 1 0 RTOS
esp_timer ETS_TIMER2_INTR_SOURCE
ipc0 0 1 0 RTOS Scheduler
Core0 ETS_FROM_CPU_INTR0_SOURCE
ipc1 1 1 1 RTOS Scheduler
Core1 ETS_FROM_CPU_INTR1_SOURCE
main 0 1 0 RTOS
Watchdog ETS_TG0_WDT_LEVEL_INTR_SOURCE
OVMS Events 1 1 1 Peripherals (GPIO
ISR) ETS_GPIO_INTR_SOURCE
OVMS Events 1 1 1 MAX7317
(spi_nodma) ETS_SPI3_INTR_SOURCE
esp32can init 0 3 0 ESP32
CAN ETS_CAN_INTR_SOURCE
OVMS SIMCOM 0 2 0 SIMCOM
UART ETS_UART1_INTR_SOURCE
OVMS Events 1 1 1 USB console
(UART) ETS_UART0_INTR_SOURCE
OVMS Events 1 1 1 SD
card ETS_SDIO_HOST_INTR_SOURCE
----------------------------------------------------------------------------------------
I think we need to find the source of the interrupt latency
and/or find some
better scheme to distribute the tasks & interrupts over the
cores.
I still don't know why core #1 already got an issue with just
our 6 interrupts.
This may be related to the SPIRAM fix adding overhead, but I've
seen the FIFO
overflows before applying the fix, and the CAN issue also has
been present for
over a year.
My working hypothesis is some task/driver disabling interrupts
for too often
/ too long, or non-IRAM ISRs getting delayed for too often / too
long. And it
seems this is an issue especially on core #1.
Please verify & comment.
Regards,
Michael
PS: a little side mystery: the netmanager task starts with
priority 5, then
gets raised to 22 (probably a mutex priority inheritance), but
never gets
lowered back to 5 afterwards. A FreeRTOS bug?
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev
--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev