ESP32 CAN controller hardware bug / interrupt issue
I've submitted an issue to Espressif on this:

https://github.com/espressif/esp-idf/issues/4276

It would be nice if I could reproduce the effect using some standard hardware. I've ordered some of those combined ESP32 + OLED units, but I don't have any CAN transceivers I could hook up to them. Maybe the information I've provided is sufficient for Espressif to investigate.

Regards, Michael

On 31.10.19 at 22:24, Michael Balzer wrote:
Second finding & commit:
TL;DR:
* ESP32 CAN controller is broken, delivers trashed & duplicated RX frames
* we need to optimize interrupt latency
Users have been reporting occasional false BMS readings on the Twizy for more than a year. The error always trashed the reading of voltage #10 and/or temperature #8, so I initially thought that must be a variant of the SPIRAM/cache bug.
But the SPIRAM fix didn't help, and the bug still occurred with all BMS data in internal RAM. The issue correlated with high load situations and with the SIMCOM UART issue. While testing the latest changes by Marko, my car module began showing the issue regularly, so I could take a closer look. It turned out the wrong data came in via the CAN bus interface. Checking with a CAN logger, however, didn't show any wrong data on the bus itself.
So I scanned our CAN driver for potential issues and found & solved some (including a serious race condition), but that didn't fix it. I then added debug dumps of all frames read, dug into the SJA1000 data sheet and found this:
The SJA1000 RX FIFO has a capacity of 64 bytes. It's supposed to work as a ring buffer, with frames wrapping around as necessary. If no space is left in the FIFO for a new frame coming in, that frame should be discarded completely and not counted; only the overflow indicator should be set and the corresponding interrupt generated.
Reality differs. The frame causing the overflow is partially added to the FIFO. It's also both counted in the RMC register and indicated by RBS as a valid frame when retrieving the FIFO contents. The read buffer delivers the partial frame plus some trashed bytes up to the nominal frame length.
In the BMS frame sequence this would normally trash bytes 7 & 8 of the sixth frame of length 8 in a sequence. A standard frame needs a 3 byte header plus the data bytes in the FIFO, so 6 frames add up to 66 bytes => 64 + 2 trash bytes. The first trash byte normally is "08", the second "84" or "2a" or sometimes "ab", possibly some internal SJA1000 data.
Better yet, if you rely on the indicators of the SJA1000 when reading the FIFO, the SJA1000 will repeat the first frame within the FIFO after the invalid frame. Mostly once or twice, and sometimes 7 times or more:
inv_msg: framecnt=13, invindex=6
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 | 4..?........$....@."7U.7
inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 | 4..?........%.....D..N.&
inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 | 4..?........T...7777777.
inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a | 4..?........V...1c.1S.1J
inv_msg: 57 05 00 00 31 43 14 31 53 15 *08 2a* | 4..?........W...1C.1S..*
                                       ^^^^^ trashed bytes

… followed by 7 repetitions of the first frame:

inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
If you have had occasional strange false readings leading to strange effects, this may be the explanation. Note: the SJA1000 FIFO bug may need a specific circumstance to occur in addition to the overflow, maybe the overflow happening at a specific byte position in the FIFO -- I haven't tried to determine that.
I've tried clearing the overflow ASAP, looping on the IR indicator and looping on the RMC -- the fault is always the same. The workaround I've now done is adding up the message lengths read during a FIFO run and discarding all frames exceeding the 64 byte border (see the sketch below). Not pretty, but functional.
That keeps the trashed frames from being forwarded to the OVMS application level.
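A minimal sketch of that byte-counting guard, with hypothetical helper functions rather than the actual esp32can driver code (standard frames occupy 3 header bytes + DLC data bytes in the FIFO, extended frames 5 + DLC):

#include <stdint.h>
#include <stdbool.h>

#define SJA1000_FIFO_SIZE 64

/* Hypothetical helpers standing in for the real register accesses: */
extern unsigned read_rx_message_count(void);    /* RMC register */
extern uint8_t  read_rx_byte(unsigned offset);  /* byte in the current RX buffer window */
extern void     release_rx_buffer(void);        /* CMR.RRB: advance to the next frame */
extern void     forward_frame(void);            /* hand the current frame up to OVMS */

void guarded_rx_drain(void)
{
    unsigned fifo_bytes = 0;
    unsigned pending = read_rx_message_count();

    while (pending--)
    {
        uint8_t frame_info = read_rx_byte(0);
        uint8_t dlc = frame_info & 0x0f;
        if (dlc > 8) dlc = 8;
        bool extended = (frame_info & 0x80) != 0;

        fifo_bytes += (extended ? 5 : 3) + dlc;
        if (fifo_bytes <= SJA1000_FIFO_SIZE)
            forward_frame();   /* still within the 64 byte FIFO: frame is trustworthy */
        /* else: frame crossed the 64 byte border -> partial/duplicated, drop it */

        release_rx_buffer();
    }
}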
As with the SIMCOM UART, these overflows wouldn't happen if the interrupt handler reacted fast enough. It seems we've got a general interrupt latency issue.
And just as with the SIMCOM, I saw a reduction of overflows when allocating the CAN IRQ on core #0, and another one when raising the IRQ level to 3, so that's also part of my commit.
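For reference, a rough sketch of what that allocation amounts to in ESP-IDF terms -- not the actual OVMS code; the task name, stack size and priority are illustrative. esp_intr_alloc() attaches the interrupt to the core it is called on, so the allocation is done from a short-lived task pinned to core 0:

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_intr_alloc.h"
#include "soc/soc.h"                 /* ETS_CAN_INTR_SOURCE */

static intr_handle_t s_can_isr_handle;

static void can_isr(void *arg)
{
    /* ... drain the RX FIFO and queue frames for the CAN rx task ... */
}

static void can_isr_alloc_task(void *arg)
{
    /* Level 3 interrupt, allocated while running on core 0: */
    esp_intr_alloc(ETS_CAN_INTR_SOURCE, ESP_INTR_FLAG_LEVEL3,
                   can_isr, NULL, &s_can_isr_handle);
    vTaskDelete(NULL);
}

void can_isr_install_on_core0(void)
{
    xTaskCreatePinnedToCore(can_isr_alloc_task, "esp32can init",
                            2048, NULL, configMAX_PRIORITIES - 1, NULL, 0);
}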
The esp-idf CAN driver apparently has the same issue even in the latest release; I'll contact the author about this or post another issue to Espressif.
So our interrupt allocation is now:
Task           Run#  Lvl  ISR#  Usage                   Source
----------------------------------------------------------------------------------------
-               0     1    0    RTC Core                ETS_RTC_CORE_INTR_SOURCE
esp_timer       0     1    0    RTOS esp_timer          ETS_TIMER2_INTR_SOURCE
ipc0            0     1    0    RTOS Scheduler Core0    ETS_FROM_CPU_INTR0_SOURCE
ipc1            1     1    1    RTOS Scheduler Core1    ETS_FROM_CPU_INTR1_SOURCE
main            0     1    0    RTOS Watchdog           ETS_TG0_WDT_LEVEL_INTR_SOURCE
OVMS Events     1     1    1    Peripherals (GPIO ISR)  ETS_GPIO_INTR_SOURCE
OVMS Events     1     1    1    MAX7317 (spi_nodma)     ETS_SPI3_INTR_SOURCE
*esp32can init  0     3    0    ESP32 CAN               ETS_CAN_INTR_SOURCE*
*OVMS SIMCOM    0     2    0    SIMCOM UART             ETS_UART1_INTR_SOURCE*
OVMS Events     1     1    1    USB console (UART)      ETS_UART0_INTR_SOURCE
OVMS Events     1     1    1    SD card                 ETS_SDIO_HOST_INTR_SOURCE
----------------------------------------------------------------------------------------
I think we need to find the source of the interrupt latency and/or a better scheme to distribute the tasks & interrupts over the cores.
I still don't know why core #1 already has an issue with just our 6 interrupts. This may be related to the SPIRAM fix adding overhead, but I've seen the FIFO overflows before applying the fix, and the CAN issue has also been present for over a year.
My working hypothesis is some task/driver disabling interrupts too often / for too long, or non-IRAM ISRs getting delayed too often / for too long. And it seems this is an issue especially on core #1.
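To illustrate the first half of that hypothesis: on ESP-IDF, a critical section masks all C-level interrupts (level 3 and below) on the calling core for its whole duration. So a pattern like the following -- purely illustrative, not code from our tree, process_byte() is hypothetical -- would delay every ISR on that core, including the CAN one:

#include <stddef.h>
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

extern uint8_t process_byte(uint8_t b);   /* hypothetical per-byte work */

static portMUX_TYPE s_mux = portMUX_INITIALIZER_UNLOCKED;

/* Holding a critical section across a long loop keeps interrupts masked on the
 * calling core for the whole loop, so e.g. a CAN ISR allocated on the same core
 * only runs after portEXIT_CRITICAL(). */
void bad_driver_work(uint8_t *buf, size_t len)
{
    portENTER_CRITICAL(&s_mux);
    for (size_t i = 0; i < len; i++)
        buf[i] = process_byte(buf[i]);
    portEXIT_CRITICAL(&s_mux);
}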
Please verify & comment.
Regards, Michael
PS: a little side mystery: the netmanager task starts with priority 5, then gets raised to 22 (probably mutex priority inheritance), but never gets lowered back to 5 afterwards. A FreeRTOS bug?
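For context, this is the expected behaviour (illustrative FreeRTOS snippet, not the netmanager code): a task holding a mutex is boosted while a higher-priority task blocks on it, and should drop back to its base priority once it gives the mutex up:

#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"

static SemaphoreHandle_t s_mutex;   /* created elsewhere with xSemaphoreCreateMutex() */

/* Runs at base priority 5. While a higher-priority task blocks on s_mutex,
 * this task is temporarily raised to that task's priority (inheritance). */
void low_prio_worker(void *arg)
{
    for (;;)
    {
        xSemaphoreTake(s_mutex, portMAX_DELAY);
        /* ... work; priority may be boosted here ... */
        xSemaphoreGive(s_mutex);
        /* Expected after the give: back to the base priority of 5. */
        printf("priority now: %u\n", (unsigned) uxTaskPriorityGet(NULL));
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}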
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
An unusually fast reply from Espressif. They basically declared it to be a feature…

https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085

Regards, Michael
I've pushed the new implementation based on the Espressif info and some more experimentation.

I've also enabled the IRAM flag for the GPIO and SPI_noDMA handlers, as all handlers here were already in IRAM.

I've been testing this over the weekend, had no issues and saw slightly reduced can1 frame drops.

Mark, maybe you can do some additional high-load CAN tests in your lab.

Regards, Michael
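On the GPIO side this boils down to the allocation flag passed when installing the ISR service; a sketch of the standard ESP-IDF call (the spi_nodma part is not shown here):

#include "esp_err.h"
#include "esp_intr_alloc.h"
#include "driver/gpio.h"

/* Install the GPIO ISR service with its dispatcher allocated in IRAM, so the
 * registered per-pin handlers (which must themselves be IRAM-resident) keep
 * running while the flash cache is disabled. */
void install_gpio_isr_service_iram(void)
{
    ESP_ERROR_CHECK(gpio_install_isr_service(ESP_INTR_FLAG_IRAM));
}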
The new implementation doesn't seem to work very well. It seems this approach loses a lot of CAN frames. I don't have the means to test how many, but I see the effect in my vehicle events & status.

The Espressif info given by Darian in the issue was either inaccurate or missing some important detail. There's another (non-Espressif) developer on the issue trying to get the idf driver to handle the condition correctly; he hasn't had any luck so far either.

I'll try another interpretation of the scheme outlined by Darian. If that doesn't work, I'll revert to my first workaround until we get answers or template code from Espressif.

Regards, Michael
No luck with any of the variations. I've switched back to the byte-counting workaround.

Regards, Michael
For those who didn't follow the issue: I did some single-step experimentation to determine the actual behaviour of the controller.

I found out how to recover after an overrun, and I found a hardware bug that causes the controller FIFO to get into an unrecoverable state when the message counter reaches 64. My results have just been confirmed by Espressif.

The new implementation has been in edge for 9 days now, and no frame corruptions have occurred since.

The overflow rate has still been pretty high (~3-4 per mille). I now build with the TCP/IP task pinned to core 1, which has lowered the rate to ~0.5 per mille, but also leads to an unbalanced load distribution. Not sure yet if that's the best solution. I've checked whether moving all CAN processing back to core 1 would help, but that turned out to be worse. I suspect either SPI or SD I/O requires long or frequent critical-section locks on core 1, but I haven't checked their code yet.

Regards, Michael
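Pinning the lwIP TCP/IP task to a core is a build-time option; as a sketch, the sdkconfig fragment looks roughly like this in recent ESP-IDF releases (the exact symbol names may differ in the IDF version OVMS builds against):

CONFIG_LWIP_TCPIP_TASK_AFFINITY=0x1
CONFIG_LWIP_TCPIP_TASK_AFFINITY_CPU1=y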
A quick update: On my car module, CAN overflows are now down to 0.004 per mille, and SIMCOM UART overflows are down to effectively 0. The new delayed-fsync scheme for file logging had a major impact (as expected), and interestingly the regular Duktape garbage collection caused another drop. So it's OK now, but we need to be careful about both SD file I/O and high / granular memory usage.
Regards, Michael
On 21.11.19 14:31, Michael Balzer wrote:
For those who didn't follow the issue: I did some single-step experimentation to determine the actual behaviour of the controller.
I found out how to recover after an overrun, and I found a hardware bug that causes the controller FIFO to get into an unrecoverable state when the message counter reaches 64. My results have just been confirmed by Espressif.
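For context, a rough sketch of what a recovery along those lines could look like, derived only from the SJA1000 command/mode register bits (RRB = release receive buffer, CDO = clear data overrun, RM = reset mode). The register accessors are hypothetical placeholders, and the exact sequence in the new implementation may well differ:

#include <stdint.h>

extern uint8_t can_read_rmc(void);           /* RX message counter */
extern void    can_write_cmd(uint8_t bits);  /* command register */
extern void    can_enter_reset_mode(void);   /* set MOD.RM */
extern void    can_leave_reset_mode(void);   /* clear MOD.RM, resume operation */

#define CMD_RRB  0x04   /* release receive buffer */
#define CMD_CDO  0x08   /* clear data overrun */

void can_recover_from_overrun(void)
{
    if (can_read_rmc() >= 64) {
        /* hardware bug: with the message counter at 64 the FIFO doesn't
         * recover by buffer releases alone; assume a full controller reset
         * (reset mode and back) is the way out */
        can_enter_reset_mode();
        can_leave_reset_mode();
        return;
    }
    /* release all pending frames (validity checking omitted here),
     * then clear the overrun condition */
    while (can_read_rmc() > 0)
        can_write_cmd(CMD_RRB);
    can_write_cmd(CMD_CDO);
}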
The new implementation has been in edge for 9 days now, and no frame corruptions have occurred since.
The overflow rate has still been pretty high (~3-4 per mille). I now build with the TCP/IP task pinned to core 1, which has lowered the rate to ~0.5 per mille but also leads to an unbalanced load distribution. I'm not sure yet if that's the best solution. I've checked whether moving all CAN processing back to core 1 would help, but that turned out to be worse. I suspect either SPI or SD I/O requires long/frequent critical section locks on core 1, but I haven't checked their code yet.
Regards, Michael
On 07.11.19 18:19, Michael Balzer wrote:
No luck with all variations. I've switched back to the byte counting workaround.
Regards, Michael
On 07.11.19 10:31, Michael Balzer wrote:
The new implementation doesn't seem to work very well; this approach apparently loses a lot of CAN frames. I don't have the means to test how many, but I see the effect in my vehicle events & status.
The Espressif info given by Darian in the issue was either inaccurate or missed some important detail. There's another (non-Espressif) developer on the issue trying to get the idf driver to handle this correctly; he hasn't had any luck so far either.
I'll try another interpretation of the scheme outlined by Darian. If that doesn't work I'll revert to my first workaround until we get answers or template code from Espressif.
Regards, Michael
On 03.11.19 19:46, Michael Balzer wrote:
I've pushed the new implementation based on the Espressif info and some more experimentation.
I've also enabled the IRAM flag for the GPIO and SPI_noDMA handlers, as all handlers here already were in IRAM.
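For reference, this is roughly what enabling the IRAM flag looks like for the GPIO ISR service (standard ESP-IDF calls; the pin and handler below are just placeholders). With ESP_INTR_FLAG_IRAM the dispatcher may run while the flash cache is disabled, so every registered handler has to live in IRAM:

#include "driver/gpio.h"
#include "esp_attr.h"
#include "esp_intr_alloc.h"

static void IRAM_ATTR my_gpio_isr(void *arg)
{
    /* handle the pin event; defer any real work to a task */
}

void gpio_isr_setup(void)
{
    gpio_install_isr_service(ESP_INTR_FLAG_IRAM);         /* ISR dispatcher in IRAM */
    gpio_isr_handler_add(GPIO_NUM_34, my_gpio_isr, NULL); /* handler must be IRAM_ATTR */
}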
I've been testing this over the weekend, had no issues and slightly reduced can1 frame drops.
Mark, maybe you can do some additional high load CAN tests in your lab.
Regards, Michael
On 01.11.19 12:35, Michael Balzer wrote:
An unusually fast reply from Espressif. They basically declared it to be a feature…
https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085
Regards, Michael
On 01.11.19 10:34, Michael Balzer wrote:
I've submitted an issue to Espressif on this:
https://github.com/espressif/esp-idf/issues/4276
It would be nice if I could reproduce the effect using some standard hardware. I've ordered some of those combined ESP32 + OLED units, but don't have any CAN transceivers I could hook up to them. Maybe the info I've provided is sufficient for Espressif to investigate.
Regards, Michael
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
It's 17 months after my report, and – lo & behold – Espressif just fixed the issue in their TWAI driver:
https://github.com/espressif/esp-idf/commit/2f5806092135e3d991057bc06225bdcf...
The commit includes three more workarounds for hardware bugs (see the options added to Kconfig) that are probably worth checking for our driver:
1. "Add SW workaround for REC change during bus-off" – apparently the receive error counter can get stuck on a bus-off event, preventing recovery
2. "Add SW workaround for TX interrupt lost errata" – apparently there's a race condition when reading the transmit interrupt
3. "Add SW workaround for invalid RX frame errata" – this has already been spotted by Craig, and he's added a counter for it, but it seems the fix also does a controller reset when this occurs → we probably should do so as well? (see the sketch below)
Issue #1 could be the cause of the occasional CAN1 freezes we see, with the CAN watchdog being our current workaround. #2 could cause a similar effect, especially on a polling-only bus.
So it could be worth porting those fixes to our driver. Craig, do you want to take this on?
Regards, Michael
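For discussion, a very rough sketch of that reset-on-invalid-RX-frame idea – this is not a port of Espressif's workaround; the detection criterion and all helpers are hypothetical placeholders, and our driver would use its own register access and statistics:

#include <stdbool.h>
#include <stdint.h>

extern bool can_rx_frame_invalid(void);   /* e.g. header / DLC plausibility check */
extern void can_save_config(void);        /* remember mode, bit timing, filters */
extern void can_enter_reset_mode(void);
extern void can_restore_config(void);
extern void can_leave_reset_mode(void);

static uint32_t invalid_rx_frames = 0;    /* counter as already added by Craig */

void can_handle_rx_frame_check(void)
{
    if (!can_rx_frame_invalid())
        return;
    invalid_rx_frames++;
    /* reset the controller, restore its configuration, then resume */
    can_save_config();
    can_enter_reset_mode();
    can_restore_config();
    can_leave_reset_mode();
}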
On 4/1/21 11:51 PM, Michael Balzer wrote:
So it could be worth porting those fixes to our driver. Craig, do you want to take this on?
I'll see what I can come up with this weekend. Craig
On 4/1/21 11:51 PM, Michael Balzer wrote:
So it could be worth porting those fixes to our driver. Craig, do you want to take this on?
Thinking about this a bit, what is the target? Is it to backport the changes into master or for-v3.3? I'm guessing master, so most users will benefit in the short term, and I assume that in the long term we'll move to a version of esp-idf that includes the changes. Craig
On 03.04.21 02:22, Craig Leres wrote:
On 4/1/21 11:51 PM, Michael Balzer wrote:
So it could be worth porting those fixes to our driver. Craig, do you want to take this on?
Thinking about this a bit, what is the target? Is it to backport the changes into master or for-v3.3? I'm guessing master, so most users will benefit in the short term, and I assume that in the long term we'll move to a version of esp-idf that includes the changes.
As there has been no change to the driver in for-v3.3, master is correct. That way we can simply merge master into for-v3.3 afterwards.
Regards, Michael
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26