[Ovmsdev] ESP32 CAN controller hardware bug / interrupt issue

Fri Nov 8 01:19:57 HKT 2019

No luck with all variations. I've switched back to the byte counting workaround.

Regards,
Michael

Am 07.11.19 um 10:31 schrieb Michael Balzer:
> The new implementation doesn't seem to work very well. It seems this approach loses a lot of CAN frames. I don't have the means to test how many, but I see
> the effect in my vehicle events & status.
>
> The Espressif info given by Darian in the issue was either inaccurate or misses some important detail. There's another (non Espressif) developer on the issue
> trying to get the idf driver to correctly handle the issue, he hasn't had any luck as well up to now.
>
> I'll try another interpretation of the scheme outlined by Darian. If that doesn't work I'll revert to my first workaround until we get answers or template
> code from Espressif.
>
> Regards,
> Michael
>
>
> Am 03.11.19 um 19:46 schrieb Michael Balzer:
>> I've pushed the new implementation based on the Espressif info and some more experimentation.
>>
>> I've also enabled the IRAM flag for the GPIO and SPI_noDMA handlers, as all handlers here already were in IRAM.
>>
>> I've been testing this over the weekend, had no issues and slightly reduced can1 frame drops.
>>
>> Mark, maybe you can do some additional high load CAN tests in your lab.
>>
>> Regards,
>> Michael
>>
>>
>> Am 01.11.19 um 12:35 schrieb Michael Balzer:
>>> An unusual fast reply from Espressif. They basically declared it to be a feature…
>>>
>>> https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085
>>>
>>> Regards,
>>> Michael
>>>
>>>
>>> Am 01.11.19 um 10:34 schrieb Michael Balzer:
>>>> I've submitted an issue to Espressif on this:
>>>>
>>>> https://github.com/espressif/esp-idf/issues/4276
>>>>
>>>> It would be nice if I could reproduce the effect using some standard hardware.
>>>> I've ordered some of those combined ESP32 + OLED units, but don't have any CAN transceivers I could hook up to them.
>>>> Maybe the info I've provided is sufficient for Espressif to investigate.
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>>
>>>> Am 31.10.19 um 22:24 schrieb Michael Balzer:
>>>>> Second finding & commit:
>>>>>
>>>>> TL;DR:
>>>>>
>>>>>   * ESP32 CAN controller is broken, delivers trashed & duplicated RX frames
>>>>>   * we need to optimize interrupt latency
>>>>>
>>>>>
>>>>> Users have been reporting occasional false BMS readings on the Twizy for more than a year.
>>>>> The error always trashed the reading of voltage #10 and/or temperature #8, so I initially
>>>>> thought that must be a variant of the SPIRAM/cache bug.
>>>>>
>>>>> But the SPIRAM fix didn't help, and the bug still occured with all BMS data in internal
>>>>> RAM. The issue correlated with high load situations and the SIMCOM UART issue. On testing the
>>>>> latest changes by Marko, my car module began showing that issue regularly, so I could take a
>>>>> closer look. It turned out the wrong data came from the CAN bus. Checking with a CAN logger
>>>>> didn't show any wrong data on the bus.
>>>>>
>>>>> So I scanned our CAN driver for potential issues, found & solved some (even a serious
>>>>> race condition) but that didn't fix it. I then added debug dumps of all frames read, dug
>>>>> into the SJA1000 spec sheet and found this:
>>>>>
>>>>> The SJA1000 RX FIFO has a capacity of 64 bytes. It's supposed to work as a ring buffer,
>>>>> with frames wrapping around as necessary. If no space is left in the FIFO for a new frame
>>>>> coming in, that frame should be discarded completely, not counted, just the overflow
>>>>> indicator should be set & the according interrupt be generated.
>>>>>
>>>>> Reality differs. The frame causing the overflow is added to the FIFO partially. It's also
>>>>> counted both in the RMC register and indicated by RBS as a valid frame when retrieving
>>>>> the FIFO contents. The read buffer delivers the partial frame + some trashed bytes up
>>>>> to the nominal frame length.
>>>>>
>>>>> In the BMS frame sequence it would normally trash bytes 7 & 8 on the sixth frame of length
>>>>> 8 in a sequence. A standard frame needs a 3 byte header + the data bytes in the FIFO, so
>>>>> 6 frames would add up to 66 bytes => 64 + 2 trash. The first trash byte normally is "08",
>>>>> the second "84" or "2a" or sometimes "ab", possibly some internal SJA1000 data.
>>>>>
>>>>> Better yet, if you rely on the indicators of the SJA1000 when reading the FIFO, the
>>>>> SJA1000 will repeat the first frame within the FIFO after the invalid frame. Mostly
>>>>> once or twice, and sometimes 7 times or more:
>>>>>
>>>>> inv_msg: framecnt=13, invindex=6
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>> inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 | 4..?........$.... at ."7U.7
>>>>> inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 | 4..?........%.....D..N.&
>>>>> inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 | 4..?........T...7777777.
>>>>> inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a | 4..?........V...1c.1S.1J
>>>>> inv_msg: 57 05 00 00 31 43 14 31 53 15 *08 2a* | 4..?........W...1C.1S..*
>>>>>                                        ^^^^^ trashed bytes
>>>>> … followed by 7 repetitions of the first frame:
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>> inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
>>>>>
>>>>> If you had occasional strange false readings leading to strange effects, this
>>>>> may be the explanation. Note: the SJA1000 FIFO bug may need specific circumstance
>>>>> to occur in addition to the overflow, maybe the overflow happening on a specific
>>>>> byte position in the FIFO -- I haven't tried to determine that.
>>>>>
>>>>> I've tried clearing the overflow ASAP, looping on the IR indicator and looping
>>>>> on the RMC, the fault is always the same. The workaround I've now done is adding
>>>>> up the message lengths read during a FIFO run and discard all frames exceeding
>>>>> the 64 byte border. Not pretty, but functional.
>>>>>
>>>>> That keeps the trashed frames from being forwarded to the OVMS application level.
>>>>>
>>>>> As with the SIMCOM UART, these overflows wouldn't happen if the interrupt handler
>>>>> would react fast enough. It seems we've got a general interrupt latency issue.
>>>>>
>>>>> And just as with the SIMCOM I saw a reduction of overflows when allocating the
>>>>> CAN IRQ on core #0, another one with raising the IRQ level to 3, so that's
>>>>> also part of my commit.
>>>>>
>>>>> The esp-idf CAN driver apparently has the same issue even in the latest release,
>>>>> I'll contact the author about this or post another issue to Espressif.
>>>>>
>>>>> So our interrupt allocation is now:
>>>>>
>>>>> Task           Run#  Lvl  ISR#  Usage                    Source
>>>>> ----------------------------------------------------------------------------------------
>>>>> -              0     1    0     RTC Core                 ETS_RTC_CORE_INTR_SOURCE
>>>>> esp_timer      0     1    0     RTOS esp_timer           ETS_TIMER2_INTR_SOURCE
>>>>> ipc0           0     1    0     RTOS Scheduler Core0     ETS_FROM_CPU_INTR0_SOURCE
>>>>> ipc1           1     1    1     RTOS Scheduler Core1     ETS_FROM_CPU_INTR1_SOURCE
>>>>> main           0     1    0     RTOS Watchdog            ETS_TG0_WDT_LEVEL_INTR_SOURCE
>>>>> OVMS Events    1     1    1     Peripherals (GPIO ISR)   ETS_GPIO_INTR_SOURCE
>>>>> OVMS Events    1     1    1     MAX7317 (spi_nodma)      ETS_SPI3_INTR_SOURCE
>>>>> *esp32can init  0     3    0     ESP32 CAN                ETS_CAN_INTR_SOURCE**
>>>>> **OVMS SIMCOM    0     2    0     SIMCOM UART              ETS_UART1_INTR_SOURCE*
>>>>> OVMS Events    1     1    1     USB console (UART)       ETS_UART0_INTR_SOURCE
>>>>> OVMS Events    1     1    1     SD card                  ETS_SDIO_HOST_INTR_SOURCE
>>>>> ----------------------------------------------------------------------------------------
>>>>>
>>>>> I think we need to find the source of the interrupt latency and/or find some
>>>>> better scheme to distribute the tasks & interrupts over the cores.
>>>>>
>>>>> I still don't know why core #1 already got an issue with just our 6 interrupts.
>>>>> This may be related to the SPIRAM fix adding overhead, but I've seen the FIFO
>>>>> overflows before applying the fix, and the CAN issue also has been present for
>>>>> over a year.
>>>>>
>>>>> My working hypothesis is some task/driver disabling interrupts for too often
>>>>> / too long, or non-IRAM ISRs getting delayed for too often / too long. And it
>>>>> seems this is an issue especially on core #1.
>>>>>
>>>>> Please verify & comment.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>> PS: a little side mystery: the netmanager task starts with priority 5, then
>>>>> gets raised to 22 (probably a mutex priority inheritance), but never gets
>>>>> lowered back to 5 afterwards. A FreeRTOS bug?
>>>>>
>>>>> -- 
>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>
>>>>> _______________________________________________
>>>>> OvmsDev mailing list
>>>>> OvmsDev at lists.openvehicles.com
>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>
>>>> -- 
>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>
>>>> _______________________________________________
>>>> OvmsDev mailing list
>>>> OvmsDev at lists.openvehicles.com
>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>
>>> -- 
>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.openvehicles.com
>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>
>> -- 
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
> -- 
> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20191107/8a8b1052/attachment.htm>