[Ovmsdev] ESP32 CAN controller hardware bug / interrupt issue

Fri Apr 2 14:51:02 HKT 2021

It's 17 months after my report, and – lo & behold – Espressif just fixed 
the issue in their TWAI driver:

https://github.com/espressif/esp-idf/commit/2f5806092135e3d991057bc06225bdcf536e93a5

The commit includes three more workarounds for hardware bugs (see 
options added to Kconfig) probably worth checking for our driver:

 1. "Add SW workaround for REC change during bus-off" – apparently the
    receive error counter can get stuck on a bus-off event preventing
    recovery
 2. "Add SW workaround for TX interrupt lost errata" – apparently
    there's a race condition when reading the transmit interrupt
 3. "Add SW workaround for invalid RX frame errata" – this already has
    been spotted by Craig, and he's added a counter for this, but it
    seems the fix also does a controller reset when this occurs → we
    probably should do so as well?

Issue #1 could be the cause for the occasional CAN1 freezes we see, with 
the CAN watchdog being our current workaround. #2 could cause a similar 
effect, especially on a polling-only bus.

So it could be worth porting those fixes to our driver. Craig, do you 
want to take this on?

Regards,
Michael

Am 01.11.19 um 12:35 schrieb Michael Balzer:
> An unusual fast reply from Espressif. They basically declared it to be 
> a feature…
>
> https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085 
> <https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085>
>
> Regards,
> Michael
>
>
> Am 01.11.19 um 10:34 schrieb Michael Balzer:
>> I've submitted an issue to Espressif on this:
>>
>> https://github.com/espressif/esp-idf/issues/4276 
>> <https://github.com/espressif/esp-idf/issues/4276>
>>
>> It would be nice if I could reproduce the effect using some standard 
>> hardware.
>> I've ordered some of those combined ESP32 + OLED units, but don't 
>> have any CAN transceivers I could hook up to them.
>> Maybe the info I've provided is sufficient for Espressif to investigate.
>>
>> Regards,
>> Michael
>>
>>
>> Am 31.10.19 um 22:24 schrieb Michael Balzer:
>>> Second finding & commit:
>>>
>>> TL;DR:
>>>
>>>   * ESP32 CAN controller is broken, delivers trashed & duplicated RX
>>>     frames
>>>   * we need to optimize interrupt latency
>>>
>>>
>>> Users have been reporting occasional false BMS readings on the Twizy 
>>> for more than a year.
>>> The error always trashed the reading of voltage #10 and/or 
>>> temperature #8, so I initially
>>> thought that must be a variant of the SPIRAM/cache bug.
>>>
>>> But the SPIRAM fix didn't help, and the bug still occured with all 
>>> BMS data in internal
>>> RAM. The issue correlated with high load situations and the SIMCOM 
>>> UART issue. On testing the
>>> latest changes by Marko, my car module began showing that issue 
>>> regularly, so I could take a
>>> closer look. It turned out the wrong data came from the CAN bus. 
>>> Checking with a CAN logger
>>> didn't show any wrong data on the bus.
>>>
>>> So I scanned our CAN driver for potential issues, found & solved 
>>> some (even a serious
>>> race condition) but that didn't fix it. I then added debug dumps of 
>>> all frames read, dug
>>> into the SJA1000 spec sheet and found this:
>>>
>>> The SJA1000 RX FIFO has a capacity of 64 bytes. It's supposed to 
>>> work as a ring buffer,
>>> with frames wrapping around as necessary. If no space is left in the 
>>> FIFO for a new frame
>>> coming in, that frame should be discarded completely, not counted, 
>>> just the overflow
>>> indicator should be set & the according interrupt be generated.
>>>
>>> Reality differs. The frame causing the overflow is added to the FIFO 
>>> partially. It's also
>>> counted both in the RMC register and indicated by RBS as a valid 
>>> frame when retrieving
>>> the FIFO contents. The read buffer delivers the partial frame + some 
>>> trashed bytes up
>>> to the nominal frame length.
>>>
>>> In the BMS frame sequence it would normally trash bytes 7 & 8 on the 
>>> sixth frame of length
>>> 8 in a sequence. A standard frame needs a 3 byte header + the data 
>>> bytes in the FIFO, so
>>> 6 frames would add up to 66 bytes => 64 + 2 trash. The first trash 
>>> byte normally is "08",
>>> the second "84" or "2a" or sometimes "ab", possibly some internal 
>>> SJA1000 data.
>>>
>>> Better yet, if you rely on the indicators of the SJA1000 when 
>>> reading the FIFO, the
>>> SJA1000 will repeat the first frame within the FIFO after the 
>>> invalid frame. Mostly
>>> once or twice, and sometimes 7 times or more:
>>>
>>> inv_msg: framecnt=13, invindex=6
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>> inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 | 4..?........$.... at ."7U.7
>>> inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 | 4..?........%.....D..N.&
>>> inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 | 4..?........T...7777777.
>>> inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a | 4..?........V...1c.1S.1J
>>> inv_msg: 57 05 00 00 31 43 14 31 53 15 *08 2a* | 
>>> 4..?........W...1C.1S..*
>>>                                        ^^^^^ trashed bytes
>>> … followed by 7 repetitions of the first frame:
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>
>>> If you had occasional strange false readings leading to strange 
>>> effects, this
>>> may be the explanation. Note: the SJA1000 FIFO bug may need specific 
>>> circumstance
>>> to occur in addition to the overflow, maybe the overflow happening 
>>> on a specific
>>> byte position in the FIFO -- I haven't tried to determine that.
>>>
>>> I've tried clearing the overflow ASAP, looping on the IR indicator 
>>> and looping
>>> on the RMC, the fault is always the same. The workaround I've now 
>>> done is adding
>>> up the message lengths read during a FIFO run and discard all frames 
>>> exceeding
>>> the 64 byte border. Not pretty, but functional.
>>>
>>> That keeps the trashed frames from being forwarded to the OVMS 
>>> application level.
>>>
>>> As with the SIMCOM UART, these overflows wouldn't happen if the 
>>> interrupt handler
>>> would react fast enough. It seems we've got a general interrupt 
>>> latency issue.
>>>
>>> And just as with the SIMCOM I saw a reduction of overflows when 
>>> allocating the
>>> CAN IRQ on core #0, another one with raising the IRQ level to 3, so 
>>> that's
>>> also part of my commit.
>>>
>>> The esp-idf CAN driver apparently has the same issue even in the 
>>> latest release,
>>> I'll contact the author about this or post another issue to Espressif.
>>>
>>> So our interrupt allocation is now:
>>>
>>> Task           Run#  Lvl  ISR#  Usage Source
>>> ----------------------------------------------------------------------------------------
>>> -              0     1    0     RTC Core                 
>>> ETS_RTC_CORE_INTR_SOURCE
>>> esp_timer      0     1    0     RTOS esp_timer           
>>> ETS_TIMER2_INTR_SOURCE
>>> ipc0           0     1    0     RTOS Scheduler Core0     
>>> ETS_FROM_CPU_INTR0_SOURCE
>>> ipc1           1     1    1     RTOS Scheduler Core1     
>>> ETS_FROM_CPU_INTR1_SOURCE
>>> main           0     1    0     RTOS Watchdog            
>>> ETS_TG0_WDT_LEVEL_INTR_SOURCE
>>> OVMS Events    1     1    1     Peripherals (GPIO ISR)   
>>> ETS_GPIO_INTR_SOURCE
>>> OVMS Events    1     1    1     MAX7317 (spi_nodma)      
>>> ETS_SPI3_INTR_SOURCE
>>> *esp32can init  0     3    0     ESP32 CAN                
>>> ETS_CAN_INTR_SOURCE**
>>> **OVMS SIMCOM    0     2    0     SIMCOM UART              
>>> ETS_UART1_INTR_SOURCE*
>>> OVMS Events    1     1    1     USB console (UART)       
>>> ETS_UART0_INTR_SOURCE
>>> OVMS Events    1     1    1     SD card                  
>>> ETS_SDIO_HOST_INTR_SOURCE
>>> ----------------------------------------------------------------------------------------
>>>
>>> I think we need to find the source of the interrupt latency and/or 
>>> find some
>>> better scheme to distribute the tasks & interrupts over the cores.
>>>
>>> I still don't know why core #1 already got an issue with just our 6 
>>> interrupts.
>>> This may be related to the SPIRAM fix adding overhead, but I've seen 
>>> the FIFO
>>> overflows before applying the fix, and the CAN issue also has been 
>>> present for
>>> over a year.
>>>
>>> My working hypothesis is some task/driver disabling interrupts for 
>>> too often
>>> / too long, or non-IRAM ISRs getting delayed for too often / too 
>>> long. And it
>>> seems this is an issue especially on core #1.
>>>
>>> Please verify & comment.
>>>
>>> Regards,
>>> Michael
>>>
>>> PS: a little side mystery: the netmanager task starts with priority 
>>> 5, then
>>> gets raised to 22 (probably a mutex priority inheritance), but never 
>>> gets
>>> lowered back to 5 afterwards. A FreeRTOS bug?
>>>
>>> -- 
>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.openvehicles.com
>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>
>> -- 
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
> -- 
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20210402/b081aca8/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 203 bytes
Desc: OpenPGP digital signature
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20210402/b081aca8/attachment-0001.sig>