<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">A quick update:<br>
<br>
On my car module, CAN overflows are now down to 0.004 per mille,
SIMCOM UART overflows down to effectively 0.<br>
<br>
The new delayed-fsync scheme for file logging had a major impact (as
expected), and interestingly the regular Duktape garbage
collection caused another drop.<br>
<br>
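For illustration, a minimal sketch of such a housekeeping scheme (the file
handle, Duktape context and 10 second interval are placeholders, not the
actual OVMS code; in the firmware the GC pass has to run in the context
that owns the Duktape heap):<br>
<pre>/* Sketch only: periodic log flush with delayed fsync + a regular Duktape
 * GC pass. s_logfile, s_duktape_ctx and the interval are placeholders. */
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "duktape.h"

static FILE *s_logfile;             /* log file on the SD card, opened elsewhere */
static duk_context *s_duktape_ctx;  /* Duktape heap, created elsewhere */

static void housekeeping_task(void *arg)
{
  for (;;)
  {
    vTaskDelay(pdMS_TO_TICKS(10000));   /* sync only every 10 seconds… */
    if (s_logfile)
    {
      fflush(s_logfile);                /* …instead of after every write, */
      fsync(fileno(s_logfile));         /* so SD I/O happens in rare bursts */
    }
    if (s_duktape_ctx)
      duk_gc(s_duktape_ctx, 0);         /* regular mark-and-sweep pass */
  }
}
</pre>
<br>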
So it's OK now, but we need to be careful about both SD file I/O
and high / granular memory usage.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
On 21.11.19 at 14:31, Michael Balzer wrote:<br>
</div>
<blockquote type="cite"
cite="mid:638ce3ab-f705-dcaf-2e93-26fda74e601e@expeedo.de">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
For those who didn’t follow the issue: I did some single-step
experimentation to determine the actual behaviour of the
controller.<br>
<br>
I found out how to recover after an overrun, and I found a
hardware bug that causes the controller FIFO to get into an
unrecoverable state when the message counter reaches 64. My
results have just been confirmed by Espressif.<br>
<br>
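For reference, a rough sketch of the recovery logic under SJA1000-style
register semantics -- the register accessors and constants below are
placeholders, not the actual esp32can driver symbols:<br>
<pre>/* Sketch only. reg_read/reg_write, can_controller_reinit and the register
 * constants are placeholders, not the actual esp32can driver symbols. */
#include &lt;stdint.h&gt;

extern uint8_t reg_read(uint8_t reg);
extern void    reg_write(uint8_t reg, uint8_t val);
extern void    can_controller_reinit(void);

enum {
  REG_RMC = 29,                 /* SJA1000 RX message counter        */
  REG_CMR = 1,                  /* SJA1000 command register          */
  CMD_RELEASE_RXBUF = 0x04,     /* RRB: release receive buffer       */
  CMD_CLEAR_OVERRUN = 0x08,     /* CDO: clear data overrun condition */
};

static void rx_overrun_recover(void)
{
  if (reg_read(REG_RMC) >= 64)
  {
    /* hardware bug: with the message counter at 64 the FIFO is stuck
     * for good -- only a full controller reset &amp; reinit recovers it */
    can_controller_reinit();
    return;
  }
  /* normal case: drop the partial frame, release the RX buffer and
   * clear the overrun condition so reception can continue */
  reg_write(REG_CMR, CMD_RELEASE_RXBUF | CMD_CLEAR_OVERRUN);
}
</pre>
<br>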
The new implementation has been in edge for 9 days now, and no
frame corruptions have occurred since.<br>
<br>
The overflow rate has still been pretty high (~3-4 per mille). I now
build with the TCP/IP task pinned to core 1, which has lowered the
rate to ~0.5 per mille, but also leads to an unbalanced load
distribution. Not sure yet if that’s the best solution. I’ve
checked whether moving all CAN processing back to core 1 would help,
but that turned out to be worse. I suspect either SPI or SD I/O
requires long or frequent critical section locks on core 1, but I
haven’t checked their code yet.<br>
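<br>
For reference, the underlying mechanism is FreeRTOS core pinning; the
TCP/IP task itself is pinned as part of the build, but the same affinity
can be shown in code (illustrative names only, not our tree):<br>
<pre>/* Illustration of the mechanism only -- the TCP/IP task itself is pinned
 * at build time, not from application code like this. */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

static void worker_task(void *arg)
{
  for (;;)
    vTaskDelay(pdMS_TO_TICKS(1000));   /* placeholder workload */
}

void start_worker_on_core1(void)
{
  /* last argument = core affinity: 0, 1 or tskNO_AFFINITY */
  xTaskCreatePinnedToCore(worker_task, "worker", 4096, NULL, 5, NULL, 1);
}
</pre>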
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 07.11.19 at 18:19, Michael Balzer wrote:<br>
</div>
<blockquote type="cite"
cite="mid:b9805669-4d4b-5049-46ea-7ea8efcd02fa@expeedo.de">
<meta http-equiv="Content-Type" content="text/html;
charset=UTF-8">
No luck with any of the variations. I’ve switched back to the byte
counting workaround.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 07.11.19 at 10:31, Michael Balzer wrote:<br>
</div>
<blockquote type="cite"
cite="mid:95588de0-ab7c-113c-274d-4b634a2199b1@expeedo.de">
<meta http-equiv="Content-Type" content="text/html;
charset=UTF-8">
The new implementation doesn't seem to work very well. It
seems this approach loses a lot of CAN frames. I don't have
the means to test how many, but I see the effect in my vehicle
events & status.<br>
<br>
The Espressif info given by Darian in the issue was either
inaccurate or missing some important detail. There’s another
(non-Espressif) developer on the issue trying to get the IDF
driver to handle it correctly; he hasn’t had any luck
so far either.<br>
<br>
I'll try another interpretation of the scheme outlined by
Darian. If that doesn't work I'll revert to my first
workaround until we get answers or template code from
Espressif.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 03.11.19 at 19:46, Michael Balzer wrote:<br>
</div>
<blockquote type="cite"
cite="mid:76d7e96d-0108-35c4-9c0d-3d7b7dadf14c@expeedo.de">
<meta http-equiv="Content-Type" content="text/html;
charset=UTF-8">
I've pushed the new implementation based on the Espressif
info and some more experimentation.<br>
<br>
I’ve also enabled the IRAM flag for the GPIO and SPI_noDMA
interrupts, as all of our handlers there were already placed in IRAM.<br>
<br>
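For reference, the flag in question is ESP_INTR_FLAG_IRAM; a minimal
illustration with the standard esp-idf GPIO ISR service (the OVMS drivers
pass the flag at their own allocation sites):<br>
<pre>/* Illustration with the standard esp-idf GPIO ISR service; handlers added
 * via gpio_isr_handler_add() then need to be IRAM_ATTR as well. */
#include "driver/gpio.h"
#include "esp_intr_alloc.h"

void install_gpio_isr_service_in_iram(void)
{
  /* dispatcher is placed in IRAM, so it keeps running even while the
   * flash cache is disabled (e.g. during SPI flash writes) */
  gpio_install_isr_service(ESP_INTR_FLAG_IRAM);
}
</pre>
<br>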
I’ve been testing this over the weekend: no issues, and
slightly reduced can1 frame drops.<br>
<br>
Mark, maybe you can do some additional high load CAN tests
in your lab.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 01.11.19 at 12:35, Michael Balzer wrote:<br>
</div>
<blockquote type="cite"
cite="mid:fd9d5897-5237-7cb8-1e40-a40aee6ba740@expeedo.de">
<meta http-equiv="Content-Type" content="text/html;
charset=UTF-8">
An unusually fast reply from Espressif. They basically
declared it to be a feature…<br>
<br>
<a
href="https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085"
moz-do-not-send="true">https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085</a><br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 01.11.19 at 10:34, Michael Balzer wrote:<br>
</div>
<blockquote type="cite"
cite="mid:3fccacb5-bf45-a329-24eb-8d2cd62b6cda@expeedo.de">
<meta http-equiv="Content-Type" content="text/html;
charset=UTF-8">
I've submitted an issue to Espressif on this:<br>
<br>
<a
href="https://github.com/espressif/esp-idf/issues/4276"
moz-do-not-send="true">https://github.com/espressif/esp-idf/issues/4276</a><br>
<br>
It would be nice if I could reproduce the effect using
some standard hardware.<br>
I've ordered some of those combined ESP32 + OLED units,
but don't have any CAN transceivers I could hook up to
them.<br>
Maybe the info I've provided is sufficient for Espressif
to investigate.<br>
<br>
Regards,<br>
Michael<br>
<br>
<br>
<div class="moz-cite-prefix">On 31.10.19 at 22:24, Michael Balzer wrote:<br>
</div>
<blockquote type="cite"
cite="mid:fe537da7-12c1-e0ca-0ac5-7374415628b2@expeedo.de">
<meta http-equiv="content-type" content="text/html;
charset=UTF-8">
Second finding & commit:<br>
<br>
TL;DR:<br>
<ul>
<li>ESP32 CAN controller is broken, delivers trashed
& duplicated RX frames</li>
<li>we need to optimize interrupt latency</li>
</ul>
<br>
Users have been reporting occasional false BMS
readings on the Twizy for more than a year.<br>
The error always trashed the reading of voltage #10
and/or temperature #8, so I initially<br>
thought that must be a variant of the SPIRAM/cache
bug.<br>
<br>
But the SPIRAM fix didn't help, and the bug still
occurred with all BMS data in internal<br>
RAM. The issue correlated with high load situations
and the SIMCOM UART issue. On testing the<br>
latest changes by Marko, my car module began showing
that issue regularly, so I could take a<br>
closer look. It turned out the wrong data came in via
the CAN interface, while checking with a CAN logger<br>
didn't show any wrong data on the bus itself.<br>
<br>
So I scanned our CAN driver for potential issues,
found & solved some (even a serious<br>
race condition) but that didn't fix it. I then added
debug dumps of all frames read, dug<br>
into the SJA1000 spec sheet and found this:<br>
<br>
The SJA1000 RX FIFO has a capacity of 64 bytes. It's
supposed to work as a ring buffer,<br>
with frames wrapping around as necessary. If no space
is left in the FIFO for a new frame<br>
coming in, that frame should be discarded completely,
not counted, just the overflow<br>
indicator should be set & the corresponding interrupt
generated.<br>
<br>
Reality differs. The frame causing the overflow is
added to the FIFO partially. It's also<br>
counted both in the RMC register and indicated by RBS
as a valid frame when retrieving<br>
the FIFO contents. The read buffer delivers the
partial frame + some trashed bytes up<br>
to the nominal frame length.<br>
<br>
In the BMS frame sequence it would normally trash
bytes 7 & 8 of the sixth consecutive frame of length<br>
8. A standard frame needs a 3 byte
header + the data bytes in the FIFO, so<br>
6 frames would add up to 66 bytes => 64 + 2 trash.
The first trash byte normally is "08",<br>
the second "84" or "2a" or sometimes "ab", possibly
some internal SJA1000 data.<br>
<br>
Better yet, if you rely on the indicators of the
SJA1000 when reading the FIFO, the<br>
SJA1000 will repeat the first frame within the FIFO
after the invalid frame. Mostly<br>
once or twice, and sometimes 7 times or more:<br>
<br>
<pre>inv_msg: framecnt=13, invindex=6
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 | 4..?........$....@."7U.7
inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 | 4..?........%.....D..N.&
inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 | 4..?........T...7777777.
inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a | 4..?........V...1c.1S.1J
inv_msg: 57 05 00 00 31 43 14 31 53 15 <font color="#ff0000"><b>08 2a</b></font> | 4..?........W...1C.1S..*
                                       ^^^^^ <font color="#ff0000">trashed bytes</font>
… followed by <font color="#ff0000">7 repetitions of the first frame</font>:
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n</pre>
<br>
If you had occasional strange false readings leading
to strange effects, this<br>
may be the explanation. Note: the SJA1000 FIFO bug may
need specific circumstances<br>
to occur in addition to the overflow, maybe the
overflow happening at a specific<br>
byte position in the FIFO -- I haven't tried to
determine that.<br>
<br>
I’ve tried clearing the overflow ASAP, looping on the
IR indicator and looping<br>
on the RMC; the fault is always the same. The
workaround I’ve now implemented is adding<br>
up the message lengths read during a FIFO run and
discarding all frames exceeding<br>
the 64 byte border. Not pretty, but functional.<br>
<br>
That keeps the trashed frames from being forwarded to
the OVMS application level.<br>
<br>
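A rough sketch of the byte counting idea (read_next_frame / dispatch_frame
are placeholder hooks, not the actual esp32can code):<br>
<pre>/* Sketch only: track how many FIFO bytes the frames read so far have
 * consumed; anything beyond the 64 byte FIFO capacity is the partial
 * frame and its repetitions, and gets dropped. */
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

#define SJA1000_FIFO_SIZE 64

typedef struct {
  bool    extended;    /* 29-bit identifier? */
  uint8_t dlc;         /* data length code (0..8) */
  uint8_t data[8];
} rx_frame_t;

extern bool read_next_frame(rx_frame_t *frame);      /* false when FIFO empty */
extern void dispatch_frame(const rx_frame_t *frame);

void drain_rx_fifo(void)
{
  rx_frame_t frame;
  unsigned fifo_bytes = 0;

  while (read_next_frame(&frame))
  {
    /* standard frame: 3 byte header, extended frame: 5 byte header */
    fifo_bytes += (frame.extended ? 5 : 3) + frame.dlc;
    if (fifo_bytes <= SJA1000_FIFO_SIZE)
      dispatch_frame(&frame);   /* fits within the 64 byte FIFO: forward */
    /* else: the partially stored frame and its bogus repetitions lie
     * beyond the 64 byte border -- read them to drain, but drop them */
  }
}
</pre>
<br>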
As with the SIMCOM UART, these overflows wouldn't
happen if the interrupt handler<br>
reacted fast enough. It seems we've got a general
interrupt latency issue.<br>
<br>
And just as with the SIMCOM, I saw a reduction in
overflows when allocating the<br>
CAN IRQ on core #0, and another when raising the IRQ
level to 3, so that's<br>
also part of my commit.<br>
<br>
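For reference, a minimal sketch of such an allocation with the standard
esp-idf API -- an interrupt is serviced on the core that calls
esp_intr_alloc(), so the allocation is done from a task pinned to core #0
(handler and task names here are placeholders):<br>
<pre>/* Sketch only: can_isr and the init task are placeholders. */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_attr.h"
#include "esp_intr_alloc.h"
#include "soc/soc.h"                      /* ETS_CAN_INTR_SOURCE */

static void IRAM_ATTR can_isr(void *arg)
{
  /* read frames from the controller, queue them to the RX task … */
}

static void can_init_task(void *arg)
{
  intr_handle_t handle;
  /* IRQ level 3, handler kept in IRAM */
  esp_intr_alloc(ETS_CAN_INTR_SOURCE,
                 ESP_INTR_FLAG_LEVEL3 | ESP_INTR_FLAG_IRAM,
                 can_isr, NULL, &handle);
  vTaskDelete(NULL);                      /* allocation done, task exits */
}

void esp32can_start(void)
{
  xTaskCreatePinnedToCore(can_init_task, "esp32can init", 2048, NULL, 10, NULL, 0);
}
</pre>
<br>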
The esp-idf CAN driver apparently has the same issue
even in the latest release;<br>
I'll contact the author about this or post another
issue to Espressif.<br>
<br>
So our interrupt allocation is now:<br>
<br>
<pre>Task           Run#  Lvl  ISR#  Usage                    Source
----------------------------------------------------------------------------------------
-              0     1    0     RTC Core                 ETS_RTC_CORE_INTR_SOURCE
esp_timer      0     1    0     RTOS esp_timer           ETS_TIMER2_INTR_SOURCE
ipc0           0     1    0     RTOS Scheduler Core0     ETS_FROM_CPU_INTR0_SOURCE
ipc1           1     1    1     RTOS Scheduler Core1     ETS_FROM_CPU_INTR1_SOURCE
main           0     1    0     RTOS Watchdog            ETS_TG0_WDT_LEVEL_INTR_SOURCE
OVMS Events    1     1    1     Peripherals (GPIO ISR)   ETS_GPIO_INTR_SOURCE
OVMS Events    1     1    1     MAX7317 (spi_nodma)      ETS_SPI3_INTR_SOURCE
<b>esp32can init  0     3    0     ESP32 CAN                ETS_CAN_INTR_SOURCE</b>
<b>OVMS SIMCOM    0     2    0     SIMCOM UART              ETS_UART1_INTR_SOURCE</b>
OVMS Events    1     1    1     USB console (UART)       ETS_UART0_INTR_SOURCE
OVMS Events    1     1    1     SD card                  ETS_SDIO_HOST_INTR_SOURCE
----------------------------------------------------------------------------------------</pre>
<br>
I think we need to find the source of the interrupt
latency and/or find some<br>
better scheme to distribute the tasks & interrupts
over the cores.<br>
<br>
I still don't know why core #1 already got an issue
with just our 6 interrupts.<br>
This may be related to the SPIRAM fix adding overhead,
but I've seen the FIFO<br>
overflows before applying the fix, and the CAN issue
also has been present for<br>
over a year.<br>
<br>
My working hypothesis is some task/driver disabling
interrupts too often or<br>
for too long, or non-IRAM ISRs getting delayed too
often or for too long. And it<br>
seems this is an issue especially on core #1.<br>
<br>
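For reference, the pattern meant here (illustration only, not code from our
tree): on the ESP32 FreeRTOS port a critical section masks interrupts on
the calling core until it is exited, so long or frequent sections directly
add to interrupt latency on that core.<br>
<pre>#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

static portMUX_TYPE s_mux = portMUX_INITIALIZER_UNLOCKED;

void update_shared_state(void)
{
  portENTER_CRITICAL(&s_mux);   /* interrupts masked on this core from here… */
  /* …so whatever happens in here must be short */
  portEXIT_CRITICAL(&s_mux);
}
</pre>
<br>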
Please verify & comment.<br>
<br>
Regards,<br>
Michael<br>
<br>
PS: a little side mystery: the netmanager task starts
with priority 5, then<br>
gets raised to 22 (probably mutex priority
inheritance), but never gets<br>
lowered back to 5 afterwards. A FreeRTOS bug?<br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
<pre class="moz-signature" cols="144">--
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
</pre>
</body>
</html>