<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body>

    Second finding & commit:<br>

    <br>

    TL;DR:<br>

    <ul>

      <li>ESP32 CAN controller is broken, delivers trashed &

        duplicated RX frames</li>

      <li>we need to optimize interrupt latency</li>

    </ul>

    <br>

    Users have been reporting occasional false BMS readings on the Twizy

    for more than a year.<br>

    The error always trashed the reading of voltage #10 and/or

    temperature #8, so I initially<br>

    thought that must be a variant of the SPIRAM/cache bug.<br>

    <br>

    But the SPIRAM fix didn't help, and the bug still occured with all

    BMS data in internal<br>

    RAM. The issue correlated with high load situations and the SIMCOM

    UART issue. On testing the<br>

    latest changes by Marko, my car module began showing that issue

    regularly, so I could take a<br>

    closer look. It turned out the wrong data came from the CAN bus.

    Checking with a CAN logger<br>

    didn't show any wrong data on the bus.<br>

    <br>

    So I scanned our CAN driver for potential issues, found & solved

    some (even a serious<br>

    race condition) but that didn't fix it. I then added debug dumps of

    all frames read, dug<br>

    into the SJA1000 spec sheet and found this:<br>

    <br>

    The SJA1000 RX FIFO has a capacity of 64 bytes. It's supposed to

    work as a ring buffer,<br>

    with frames wrapping around as necessary. If no space is left in the

    FIFO for a new frame<br>

    coming in, that frame should be discarded completely, not counted,

    just the overflow<br>

    indicator should be set & the according interrupt be generated.<br>

    <br>

    Reality differs. The frame causing the overflow is added to the FIFO

    partially. It's also<br>

    counted both in the RMC register and indicated by RBS as a valid

    frame when retrieving<br>

    the FIFO contents. The read buffer delivers the partial frame + some

    trashed bytes up<br>

    to the nominal frame length.<br>

    <br>

    In the BMS frame sequence it would normally trash bytes 7 & 8 on

    the sixth frame of length<br>

    8 in a sequence. A standard frame needs a 3 byte header + the data

    bytes in the FIFO, so<br>

    6 frames would add up to 66 bytes => 64 + 2 trash. The first

    trash byte normally is "08",<br>

    the second "84" or "2a" or sometimes "ab", possibly some internal

    SJA1000 data.<br>

    <br>

    Better yet, if you rely on the indicators of the SJA1000 when

    reading the FIFO, the<br>

    SJA1000 will repeat the first frame within the FIFO after the

    invalid frame. Mostly<br>

    once or twice, and sometimes 7 times or more:<br>

    <br>

    <tt>inv_msg: framecnt=13, invindex=6</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><tt><br>

    </tt><tt>inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 |

      4..?........$....@."7U.7</tt><tt><br>

    </tt><tt>inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 |

      4..?........%.....D..N.&</tt><tt><br>

    </tt><tt>inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 |

      4..?........T...7777777.</tt><tt><br>

    </tt><tt>inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a |

      4..?........V...1c.1S.1J</tt><tt><br>

    </tt><tt>inv_msg: 57 05 00 00 31 43 14 31 53 15 <font

        color="#ff0000"><b>08 2a</b></font> | 4..?........W...1C.1S..*</tt><tt><br>

    </tt><tt>                                       ^^^^^ <font

        color="#ff0000">trashed bytes</font></tt><tt><br>

    </tt><tt>… followed by <font color="#ff0000">7 repetitions of the

        first frame</font>:</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><tt><br>

    </tt><tt>inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e |

      4..?........U.....PTT .n</tt><br>

    <br>

    If you had occasional strange false readings leading to strange

    effects, this<br>

    may be the explanation. Note: the SJA1000 FIFO bug may need specific

    circumstance<br>

    to occur in addition to the overflow, maybe the overflow happening

    on a specific<br>

    byte position in the FIFO -- I haven't tried to determine that.<br>

    <br>

    I've tried clearing the overflow ASAP, looping on the IR indicator

    and looping<br>

    on the RMC, the fault is always the same. The workaround I've now

    done is adding<br>

    up the message lengths read during a FIFO run and discard all frames

    exceeding<br>

    the 64 byte border. Not pretty, but functional.<br>

    <br>

    That keeps the trashed frames from being forwarded to the OVMS

    application level.<br>

    <br>

    As with the SIMCOM UART, these overflows wouldn't happen if the

    interrupt handler<br>

    would react fast enough. It seems we've got a general interrupt

    latency issue.<br>

    <br>

    And just as with the SIMCOM I saw a reduction of overflows when

    allocating the<br>

    CAN IRQ on core #0, another one with raising the IRQ level to 3, so

    that's<br>

    also part of my commit.<br>

    <br>

    The esp-idf CAN driver apparently has the same issue even in the

    latest release,<br>

    I'll contact the author about this or post another issue to

    Espressif.<br>

    <br>

    So our interrupt allocation is now:<br>

    <br>

    <tt>Task           Run#  Lvl  ISR#  Usage                    Source</tt><tt><br>

    </tt><tt>----------------------------------------------------------------------------------------</tt><tt><br>

    </tt><tt>-              0     1    0     RTC Core                

      ETS_RTC_CORE_INTR_SOURCE</tt><tt><br>

    </tt><tt>esp_timer      0     1    0     RTOS esp_timer          

      ETS_TIMER2_INTR_SOURCE</tt><tt><br>

    </tt><tt>ipc0           0     1    0     RTOS Scheduler Core0    

      ETS_FROM_CPU_INTR0_SOURCE</tt><tt><br>

    </tt><tt>ipc1           1     1    1     RTOS Scheduler Core1    

      ETS_FROM_CPU_INTR1_SOURCE</tt><tt><br>

    </tt><tt>main           0     1    0     RTOS Watchdog           

      ETS_TG0_WDT_LEVEL_INTR_SOURCE</tt><tt><br>

    </tt><tt>OVMS Events    1     1    1     Peripherals (GPIO ISR)  

      ETS_GPIO_INTR_SOURCE</tt><tt><br>

    </tt><tt>OVMS Events    1     1    1     MAX7317 (spi_nodma)     

      ETS_SPI3_INTR_SOURCE</tt><tt><br>

    </tt><b><tt>esp32can init  0     3    0     ESP32 CAN               

        ETS_CAN_INTR_SOURCE</tt></b><b><tt><br>

      </tt></b><b><tt>OVMS SIMCOM    0     2    0     SIMCOM

        UART              ETS_UART1_INTR_SOURCE</tt></b><tt><br>

    </tt><tt>OVMS Events    1     1    1     USB console (UART)      

      ETS_UART0_INTR_SOURCE</tt><tt><br>

    </tt><tt>OVMS Events    1     1    1     SD card                 

      ETS_SDIO_HOST_INTR_SOURCE</tt><tt><br>

    </tt><tt>----------------------------------------------------------------------------------------</tt><br>

    <br>

    I think we need to find the source of the interrupt latency and/or

    find some<br>

    better scheme to distribute the tasks & interrupts over the

    cores.<br>

    <br>

    I still don't know why core #1 already got an issue with just our 6

    interrupts.<br>

    This may be related to the SPIRAM fix adding overhead, but I've seen

    the FIFO<br>

    overflows before applying the fix, and the CAN issue also has been

    present for<br>

    over a year.<br>

    <br>

    My working hypothesis is some task/driver disabling interrupts for

    too often<br>

    / too long, or non-IRAM ISRs getting delayed for too often / too

    long. And it<br>

    seems this is an issue especially on core #1.<br>

    <br>

    Please verify & comment.<br>

    <br>

    Regards,<br>

    Michael<br>

    <br>

    PS: a little side mystery: the netmanager task starts with priority

    5, then<br>

    gets raised to 22 (probably a mutex priority inheritance), but never

    gets<br>

    lowered back to 5 afterwards. A FreeRTOS bug?<br>

    <br>

    <pre class="moz-signature" cols="160">-- 

Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal

Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

</pre>

  </body>

</html>