CAN-3 broken again?


Greg D.

6 Jan 2018

11:04 p.m.

Hi folks,

I just resync'd with the main repository, and am not receiving frames on CAN-3 anymore. I see there were changes to the chip driver...

I'm also seeing crashes right after getting connected to WiFi, immediately after the system tries to start SSH.

Seems like we just took a big step backward. What happened?

Greg


Stephen Casner

6 Jan 2018

11:09 p.m.

On Sat, 6 Jan 2018, Greg D. wrote:

...

I just resync'd with the main repository, and am not receiving frames on CAN-3 anymore. I see there were changes to the chip driver...

I don't know about that part.

...

I'm also seeing crashes right after getting connected to WiFi, immediately after the system tries to start SSH.

Are you running out of free heap memory?

-- Steve

Michael Balzer

7 Jan 2018

2:29 a.m.

Greg,

which commits / changes do you mean? The CAN drivers have not been changed since the TX performance fix, which Geir reported having solved his last issues.

The current version is stable over here, but without the SSH component -- I can't use that due to memory getting too low together with the Twizy component.

Regards, Michael

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

Greg D.

11:05 a.m.

Stephen Casner

1:46 p.m.

Greg,

Yes, definitely running out of free RAM, but I don't know the meaning of the WindowOverflow messages.

The first time I built with release/v3.0 of esp-idf I was not able to open an ssh connection; the error displayed was about a crypto failure. After quite a bit of digging to narrow down to where the error was occurring, I finally found that the problem was running out of free RAM. My solution was to disable bluetooth entirely, which made a big difference in the amount of free RAM.

-- Steve

On Sun, 7 Jan 2018, Greg D. wrote:

...

Hi Michael, Steve, Mark,

Steve, the crash was an abort in new_op.cc, so perhaps being out of space is the issue. Crash and reboot log attached (crash.txt). One thing I've been wondering about is the several "_WindowOverflow4 at ??:?" lines during the boot process. Are they indicative of a problem that later manifests in the crash?

My builds include pretty much everything, except for the Leaf, Twizy, and Soul.

The update included some 20 lines changed in mcp2515.cpp, as well as a bunch of other stuff, including a lot of stuff updated in Canopen and Kia. I have a script that does the git fetch master, merge, and push back to my github fork, the output of which is attached (update.txt). As a test, I removed Canopen from the build config, and the crash has disappeared. CAN-3 also appears to have come back to life (!), at least initially. I can still get CAN-3 to fail if I turn on/off the modem and/or WiFi in some sequence (still trying to pin that down), but that also leads to another crash (crash2.txt, attached).

Mark: Note also the issue with DNS failures getting to the v2 server. I enabled the modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2 server. Disabling Wifi didn't bring it back, and powering off the modem (in preparation for turning it back on) caused the crash.

So, two questions... First, why the apparent conflict between Canopen or wifi/modem and obd2ecu over access to the 3rd CAN bus? Why would the modem or wifi have any effect on a CAN bus?

Second, overall memory usage seems to be at the limit. What sort of budget do we have for what remains to be done, and how are we going to be packaging the build options for when non-developers want to get their hands on the product? Will we be able to turn everything on, minus the developer / debug stuff, or will we have a separate SKU for each model car?

Thanks,

Greg
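An abort inside new_op.cc is consistent with a plain `new` failing in a build where C++ exceptions are disabled, since there is then no way to report `std::bad_alloc` to the caller. As a minimal sketch of the alternative, not taken from the OVMS sources, a component can allocate with `std::nothrow` and degrade gracefully when memory is short. The helper name `try_alloc` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical helper: allocate a byte buffer without aborting on failure.
// Returns an empty unique_ptr when the heap cannot satisfy the request,
// so the caller can skip the feature (for example, refuse a new session)
// instead of crashing the whole module.
std::unique_ptr<uint8_t[]> try_alloc(std::size_t bytes) {
    return std::unique_ptr<uint8_t[]>(new (std::nothrow) uint8_t[bytes]);
}
```

A caller would test the result (`if (!buf) { /* log and back off */ }`) before using the buffer. On ESP-IDF, logging `esp_get_free_heap_size()` around the WiFi and SSH start-up is another way to confirm how close to the limit the build is running.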


Greg D.

5:14 p.m.

I've turned off Canopen, SSH, and Telnet earlier, and that seemed to stop the crashes. Just now added Bluetooth to that list, for good measure.

Let's see how that holds...

As for the issues with CAN-3, I seem to be able to hang it by simply starting WiFi while the OBDII ECU is running with an OBDII device attached (OBDWiz, in this case). Trying to reconnect the OBDII device fails - no frames are received. Stopping and restarting the OBDII ECU task lets me reconnect. If I look at the can status when hung, I see that Rx Ovrflw is 1, and the Rx counter doesn't increment. I'm guessing that starting WiFi is taking enough CPU time that the OBDII ECU task is falling behind, causing the overflow. Apparently, that overflow is not being handled, leading to the hang.

On an earlier run (before removing Bluetooth), I was able to get the OBDWiz dongle to connect for a few frames, after which it hung. That behavior didn't repeat just now, but I'm not sure what else was running at the time (e.g. the modem / ppp). The connect sequence from OBDWiz does a few frames rapidly (an initial PID 0, followed by requests for ECU Name and VIN), before a more relaxed polling starts. So, if there's another task taking up CPU time, I can see where an Rx overflow could occur during that initial connect sequence.

Driving a HUD is not a critical task, so I would be against a general raising of task priority. Rather, we need to figure out how to handle the Rx Overflow, and keep the frames coming in. OBDII devices generally are somewhat forgiving about lost frames, but apparently the OBDWiz has a short attention span and lets you know that something is wrong.

I'll take a look at the 2515 code, but I'm not much of an expert on the chip's care and feeding under such circumstances. If someone more in the know about it could take a look, that would be great.

Thanks,

Greg
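The behaviour asked for here, counting an overflow but continuing to receive, can be illustrated with a small software queue. This is only a sketch under assumptions: the `Frame` and `RxQueue` types are hypothetical and are not the OVMS driver's data structures, and the code is single-threaded, so real firmware would need an interrupt-safe queue instead.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical frame type for illustration only.
struct Frame {
    uint32_t id;
    uint8_t len;
    uint8_t data[8];
};

// Fixed-capacity FIFO that counts dropped frames instead of stalling.
template <std::size_t N>
class RxQueue {
public:
    // Store a frame. When the queue is full the frame is dropped,
    // the overflow counter is incremented, and false is returned.
    bool push(const Frame& f) {
        if (count_ == N) {
            ++overflows_;
            return false;
        }
        buf_[(head_ + count_) % N] = f;
        ++count_;
        return true;
    }

    // Fetch the oldest frame. Returns false when the queue is empty.
    bool pop(Frame& out) {
        if (count_ == 0) return false;
        out = buf_[head_];
        head_ = (head_ + 1) % N;
        --count_;
        return true;
    }

    std::size_t size() const { return count_; }
    unsigned overflows() const { return overflows_; }

private:
    std::array<Frame, N> buf_{};
    std::size_t head_ = 0;
    std::size_t count_ = 0;
    unsigned overflows_ = 0;
};
```

The point of the design is that an overflow is recorded as a statistic and reception resumes as soon as the consumer drains a slot, which matches the "lose a frame, keep going" behaviour that OBDII devices are said to tolerate.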

Greg D.

10:31 p.m.

Quick update... I can reliably get the CAN 3 bus to hang with an Rx overflow by having the modem running, then telling WiFi to connect, but only with the OBDWiz dongle. Connecting an actual HUD display doesn't seem to trigger the effect. Surprisingly, I can do a full module reset while the OBDWiz is running, without it disconnecting. (OBDII ECU is started in the system.start script, along with the v2 server and vehicle module setting, though all this testing is done on the bench without the car.)

Also, once the bus is hung, restarting the OBD2ECU process sometimes only lets the OBDWiz dongle get part-way through its connect process before it hangs again. Consistently 7 frames received, 10 sent (due to some of the responses taking multiple frames). It may be significant that the HUD does NOT use those multi-frame PIDs (ECU Name and VIN)...

That said, another development is that the Rx overflow may not be fatal after all, if I start things with the HUD, then swap dongles to the OBDWiz. Seems that having an external 12v power source keeps things running even with the overflow status. Since earlier testing (a few months ago) didn't have the modem running, and the OBDWiz dongle doesn't need the 12v power (it's a USB device, on a different PC), this test combination is new. Error flags on the can status are 0x2040, by the way, when it hangs.

But still, even with the 12v, I can reliably cause the bus to hang by starting with the OBDWiz dongle running, getting the modem connected, then connecting WiFi. The partial connect and hang seems to be solved with the 12v power; the full hang is not. But the full hang (with 12v attached) can be reset by stopping and restarting the obdii ecu. Interestingly, the 0x2040 error status is not cleared when restarting the obdii process, but the frame and frame error counters do get set back to zero.

Still looking for more clues... Any ideas on how to narrow this down?

Greg
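For context on why some responses take multiple frames: a VIN reply (mode 09, PID 02) carries 3 header bytes plus 17 VIN characters, 20 bytes in total, which does not fit in one 8-byte CAN frame and is therefore segmented per ISO 15765-2 (ISO-TP). The sketch below shows that segmentation in isolation. It is not the obd2ecu code; the function name is hypothetical, and flow-control handling and payloads above 4095 bytes are deliberately left out.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using CanData = std::array<uint8_t, 8>;

// Split a payload into ISO-TP frames: one single frame for up to 7 bytes,
// otherwise a first frame (6 data bytes) followed by consecutive frames
// (7 data bytes each). Unused trailing bytes are left as zero padding.
std::vector<CanData> segment_isotp(const std::vector<uint8_t>& payload) {
    std::vector<CanData> frames;
    const std::size_t len = payload.size();
    if (len > 4095) return frames;  // out of scope for this sketch

    if (len <= 7) {
        CanData sf{};
        sf[0] = static_cast<uint8_t>(len);  // PCI: single frame + length
        std::copy(payload.begin(), payload.end(), sf.begin() + 1);
        frames.push_back(sf);
        return frames;
    }

    CanData ff{};
    ff[0] = static_cast<uint8_t>(0x10 | (len >> 8));  // PCI: first frame
    ff[1] = static_cast<uint8_t>(len & 0xFF);
    std::copy(payload.begin(), payload.begin() + 6, ff.begin() + 2);
    frames.push_back(ff);

    std::size_t pos = 6;
    uint8_t seq = 1;
    while (pos < len) {
        CanData cf{};
        cf[0] = static_cast<uint8_t>(0x20 | seq);  // PCI: consecutive frame
        const std::size_t n = std::min<std::size_t>(7, len - pos);
        std::copy(payload.begin() + static_cast<std::ptrdiff_t>(pos),
                  payload.begin() + static_cast<std::ptrdiff_t>(pos + n),
                  cf.begin() + 1);
        frames.push_back(cf);
        pos += n;
        seq = static_cast<uint8_t>((seq + 1) & 0x0F);  // wraps 15 -> 0
    }
    return frames;
}
```

A 20-byte VIN reply comes out as three frames, and in a real exchange the tester also sends a flow-control frame after the first frame, so multi-frame PIDs put more traffic in both directions on the bus than the single-frame PIDs a HUD typically polls.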

Michael Balzer

8 Jan 2018

11:34 a.m.

Greg,

error flags 0x2040 on mcp2515 = receive buffer 0 overflow. The error flags are reset by a bus start command, so I'd guess the overflow occurs again right after restarting the obdii process.

As the mcp has two receive buffers, this means the CAN driver was not able to clear buffer 0 fast enough.

My next guess would then be we need to clear the RX buffer in the mcp ISR code instead of the RxCallback.

But… a receive buffer overflow condition should get reset by the driver automatically (line 306). So that should not be the reason the bus stops working.

Are you sure it's the bus that stopped working, and not the obdii process? Did you verify this using the "can trace" command?

Regards, Michael

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
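The reading of 0x2040 can be reproduced from the MCP2515 datasheet's bit definitions. The sketch below assumes the 16-bit status packs the CANINTF register in the high byte and the EFLG register in the low byte; that packing is an assumption made for illustration and is not stated in this thread. The function name is hypothetical.

```cpp
#include <cstdint>
#include <string>

// MCP2515 EFLG register bit names, bit 0 first (per the datasheet).
static const char* const kEflgBits[8] = {
    "EWARN", "RXWAR", "TXWAR", "RXEP", "TXEP", "TXBO", "RX0OVR", "RX1OVR"};

// MCP2515 CANINTF register bit names, bit 0 first (per the datasheet).
static const char* const kCanintfBits[8] = {
    "RX0IF", "RX1IF", "TX0IF", "TX1IF", "TX2IF", "ERRIF", "WAKIF", "MERRF"};

// Assumption: high byte = CANINTF, low byte = EFLG.
// Returns the names of all set bits, separated by spaces.
std::string decode_error_flags(uint16_t flags) {
    std::string out;
    auto append = [&out](const char* name) {
        if (!out.empty()) out += ' ';
        out += name;
    };
    for (int bit = 0; bit < 8; ++bit)
        if (flags & (0x0100u << bit)) append(kCanintfBits[bit]);
    for (int bit = 0; bit < 8; ++bit)
        if (flags & (0x0001u << bit)) append(kEflgBits[bit]);
    return out;
}
```

Under that assumption, `decode_error_flags(0x2040)` yields `ERRIF RX0OVR`: the error interrupt flag is pending and receive buffer 0 has overflowed, which agrees with the explanation above.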

Greg D.

12:03 p.m.

Hi Michael, The overflow needs to have something else happen to the system, consuming enough CPU time that we lose a frame or two. The Wifi client connecting seems to do that pretty reliably, but other things can do it too (e.g. playing with the modem). Restarting the obd2ecu process resets the bus, and everything runs fine until that perturbing (e.g. wifi connect) event occurs again. I do not recall this happening prior to the optimizations that were done last month, but so much else has changed that it's hard to point to any one of them as the cause. Thinking about my experiments last night, I recall that there were times where the bus was running even though Rx Overflow was non-zero. I also recall that sometimes restarting obd2ecu didn't clear the 0x2040 error status, and sometimes it did, and that seemed to be separate from whether the bus was hung or not. I don't recall now what exactly the conditions were (it was late), but there may be more to the mcp2515 handling than just the overflow. But I'd start there. Thanks for reminding me about the trace facility. I was relying on the can status command to count the frames, but confirmed with trace that none were received. The obd2ecu process also has a lot of logging built-in (set log level to debug), and that confirmed no frames are received after the hangs. Are there any debugging hooks in the MCP driver that we can use to understand its state? Greg Michael Balzer wrote:

...

Greg,

error flags 0x2040 on mcp2515 = receive buffer 0 overflow. The error flags are reset by a bus start command, so I'd guess the overflow occurs again right after restarting the obdii process.

As the mcp has two receive buffers, this means the CAN driver was not able to clear buffer 0 fast enough.

My next guess would then be we need to clear the RX buffer in the mcp ISR code instead of the RxCallback.

But… a receive buffer overflow condition should get reset by the driver automatically (line 306). So that should not be the reason the bus stops working.

Are you sure it's the bus that stopped working, and not the obdii process? Did you verify this using the "can trace" command?

Regards, Michael

Am 08.01.2018 um 07:31 schrieb Greg D.:

...
Quick update... I can reliably get the CAN 3 bus to hang with an Rx overflow by having the modem running, then telling WiFi to connect, but only with the OBDWiz dongle. Connecting an actual HUD display doesn't seem to trigger the effect. Surprisingly, I can do a full module reset while the OBDWiz is running, without it disconnecting. (OBDII ECU is started in the system.start script, along with the v2 server and vehicle module setting, though all this testing is done on the bench without the car.)

Also, once the bus is hung, restarting the OBD2ECU process sometimes only lets the OBDWiz dongle get part-way through its connect process before it hangs again. Consistently 7 frames received, 10 sent (due to some of the responses taking multiple frames). It may be significant that the HUD does NOT use those multi-frame PIDS (ECU Name and VIN)...

That said, another development is that the Rx overflow may not be fatal after all, if I start things with the HUD, then swap dongles to the OBDWiz. Seems that having an external 12v power source keeps things running even with the overflow status. Since earlier (few months ago) testing didn't have the modem running, and the OBDWiz dongle doesn't need the 12v power (it's a USB device, on a different PC), this test combination is new. Error flags on the can status are 0x2040, by the way, when it hangs.

But still, even with the 12v, I can reliably cause the bus to hang by starting with the OBDWiz dongle running, get the modem connected, then connect wifi. The partial connect and hang seems to be solved with the 12v power; the full hang is not. But the full hang (with 12v attached) can be reset by stopping and restarting the obdii ecu. Interestingly, the 0x2040 error status is not cleared when restarting the obdii process, but the frame and frame error counters do get set back to zero.

Still looking for more clues... Any ideas on how to narrow this down?

Greg

Greg D. wrote:

...
I've turned off Canopen, SSH, and Telnet earlier, and that seemed to stop the crashes. Just now added Bluetooth to that list, for good measure.

Let's see how that holds...

As for the issues with CAN-3, I seem to be able to hang it by simply starting WiFi while the OBDII ECU is running with an OBDII device attached (OBDWiz, in this case). Trying to reconnect the OBDII device fails - no frames are received. Stopping and restarting the OBDII ECU task lets me reconnect. If I look at the can status when hung, I see that Rx Ovrflw is 1, and the Rx counter doesn't increment. I'm guessing that starting WiFi is taking enough CPU time that the OBDII ECU task is falling behind, causing the overflow. Apparently, that overflow is not being handled, leading to the hang.

On an earlier run (before removing Bluetooth), I was able to get the OBDWiz dongle to connect for a few frames, after which it hung. That behavior didn't repeat just now, but I'm not sure what else was running at the time (e.g. the modem / ppp). The connect sequence from OBDWiz does a few frames rapidly (an initial PID 0, followed by requests for ECU Name and VIN), before a more relaxed polling starts. So, if there's another task taking up CPU time, I can see where an Rx overflow could occur during that initial connect sequence.

Driving a HUD is not a critical task, so I would be against a general raising of task priority. Rather, we need to figure out how to handle the Rx Overflow, and keep the frames coming in. OBDII devices generally are somewhat forgiving about lost frames, but apparently the OBDWiz has a short attention span and lets you know that something is wrong.

I'll take a look at the 2515 code, but I'm not much of an expert on the chip's care and feeding under such circumstances. If someone more in the know about it could take a look, that would be great.

Thanks,

Greg

Stephen Casner wrote:

...
Greg,

Yes, definitely running out of free RAM, but I don't know the meaning of the WindowOverflow messages.

The first time I built with release/v3.0 of esp-idf I was not able to open an ssh connection; the error displayed was about a crypto failure. After quite a bit of digging to narrow down to where the error was occurring, I finally found that the problem was running out of free RAM. My solution was to disable bluetooth entirely, which made a big difference in the amount of free RAM.

-- Steve

On Sun, 7 Jan 2018, Greg D. wrote:

...
Hi Michael, Steve, Mark,

Steve, the crash was an abort in new_op.cc, so perhaps being out of space is the issue. Crash and reboot log attached (crash.txt). One thing I've been wondering about is the several "_WindowOverflow4 at ??:?" lines during the boot process. Is that indicative of a problem, later to manifest in the crash?

My builds include pretty much everything, except for the Leaf, Twizy, and Soul.

The update included some 20 lines changed in mcp2515.cpp, as well as a bunch of other stuff, including a lot of stuff updated in Canopen and Kia. I have a script that does the git fetch master, merge, and push back to my github fork, the output of which is attached (update.txt). As a test, I removed Canopen from the build config, and the crash has disappeared. CAN-3 also appears to have come back to life (!), at least initially. I can still get CAN-3 to fail if I turn on/off the modem and/or wifi in some sequence (still trying to pin that down), but that also leads to another crash (crash2.txt, attached).

Mark: Note also the issue with DNS failures getting to the v2 server. I enabled the modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2 server. Disabling Wifi didn't bring it back, and powering off the modem (in preparation for turning it back on) caused the crash.

So, two questions... First, why the apparent conflict between Canopen or wifi/modem and obd2ecu over access to the 3rd CAN bus? Why would the modem or wifi have any effect on a CAN bus?

Second, overall memory usage seems to be at the limit. What sort of budget do we have for what remains to be done, and how are we going to be packaging the build options for when non-developers want to get their hands on the product? Will we be able to turn everything on, minus the developer / debug stuff, or will we have a separate SKU for each model car?

Thanks,

Greg

Michael Balzer wrote:

Greg,

which commits / changes do you mean? The CAN drivers have not been changed since the TX performance fix, which Geir reported having solved his last issues.

The current version is stable over here, but without the SSH component -- I can't use that due to memory getting too low together with the Twizy component.

Regards, Michael

On 07.01.2018 at 08:04, Greg D. wrote:

Hi folks,

I just resync'd with the main repository, and am not receiving frames on CAN-3 anymore. I see there were changes to the chip driver...

I'm also seeing crashes right after getting connected to WiFi, immediately after the system tries to start SSH.

Seems like we just took a big step backward. What happened?

Greg

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev


Greg D.

9 Jan

11:24 p.m.

Ok, I think I found the issue with receive buffer overflows killing the interface. Errors get handled by the Receive callback, but the function was returning false. Apparently that signaled somebody to stop. Changed it to return true, now things keep running. Since the error was handled, why hang? Seems an overly severe punishment.

Mark: Made the change to mcp2515.cpp and pushed to my fork. Unless there's an objection to hiding the error, please pull.

Todo: see if I can replicate the issue with the transmit side and Canopen. For some reason, that stopped failing earlier today (not that not failing is a bad thing...) It could be a similar issue with a transmit error, but I need to replicate it first.

Greg

Michael Balzer wrote:

...

Greg,

error flags 0x2040 on mcp2515 = receive buffer 0 overflow. The error flags are reset by a bus start command, so I'd guess the overflow occurs again right after restarting the obdii process.

As the mcp has two receive buffers, this means the CAN driver was not able to clear buffer 0 fast enough.
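For readers following the flag values, here is a minimal C++ sketch that decodes a value such as 0x2040. The low-byte bit positions are the MCP2515 EFLG register layout from the datasheet. Treating the high byte as interrupt flag bits copied from CANINTF is an assumption about how the driver packs the value, and the names `DecodedError` and `decode_error_flags` are hypothetical, not taken from the OVMS source.

```cpp
#include <cstdint>

// MCP2515 EFLG register bits (datasheet names).
constexpr std::uint8_t EFLG_RX1OVR = 0x80;  // receive buffer 1 overflow
constexpr std::uint8_t EFLG_RX0OVR = 0x40;  // receive buffer 0 overflow
constexpr std::uint8_t EFLG_TXBO   = 0x20;  // transmitter is bus-off

// MCP2515 CANINTF bit that signals "an EFLG condition is pending".
constexpr std::uint8_t CANINTF_ERRIF = 0x20;

// Hypothetical decoded view of the 16-bit value shown by "can status".
struct DecodedError {
  bool error_interrupt;  // high byte: ERRIF was set (assumed packing)
  bool rx0_overflow;     // low byte: EFLG.RX0OVR
  bool rx1_overflow;     // low byte: EFLG.RX1OVR
  bool bus_off;          // low byte: EFLG.TXBO
};

// Assumption: high byte = CANINTF bits, low byte = EFLG.
constexpr DecodedError decode_error_flags(std::uint16_t flags) {
  return DecodedError{
      (static_cast<std::uint8_t>(flags >> 8) & CANINTF_ERRIF) != 0,
      (static_cast<std::uint8_t>(flags & 0xFF) & EFLG_RX0OVR) != 0,
      (static_cast<std::uint8_t>(flags & 0xFF) & EFLG_RX1OVR) != 0,
      (static_cast<std::uint8_t>(flags & 0xFF) & EFLG_TXBO) != 0};
}
```

Under that assumed packing, 0x2040 reads as "error interrupt pending, RXB0 overflow, no RXB1 overflow, not bus-off".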

My next guess would then be we need to clear the RX buffer in the mcp ISR code instead of the RxCallback.

But… a receive buffer overflow condition should get reset by the driver automatically (line 306). So that should not be the reason the bus stops working.

Are you sure it's the bus that stopped working, and not the obdii process? Did you verify this using the "can trace" command?

Regards, Michael


Michael Balzer

10 Jan

12:59 p.m.

Greg,

the RxCallback() is designed to process the errors after the RX buffers, so the return false is basically correct, as there should be no more to do for this rx loop.

Also, if the error handler returns true to the framework, a random frame will be sent to the listeners. If this is going to be the solution, we need another return code for the loop.

I still think this is a performance issue of the queue/callback scheme. If you return true from the error handler, RxCallback() will be called again after the , so may clear some RX buffer being filled directly on clearing the error flags... maybe that's why this change makes a difference?

I don't understand how the flags 0x2040 = RXB0 overflow can happen. Have a look at the receive flow chart on page 26: if RXB0 is full, it will roll over into RXB1, if that's also full it will generate an RXB1 overflow. With rollover enabled, RXB0 overflow should never happen. Strange.

Regards, Michael

On 10.01.2018 at 08:24, Greg D. wrote:

...

Ok, I think I found the issue with receive buffer overflows killing the interface. Errors get handled by the Receive callback, but the function was returning false. Apparently that signaled somebody to stop. Changed it to return true, now things keep running. Since the error was handled, why hang? Seems an overly severe punishment.

Mark: Made the change to mcp2515.cpp and pushed to my fork. Unless there's an objection to hiding the error, please pull.

Todo: see if I can replicate the issue with the transmit side and Canopen. For some reason, that stopped failing earlier today (not that not failing is a bad thing...) It could be a similar issue with a transmit error, but I need to replicate it first.

Greg


Greg D.

2:27 p.m.

Hi Michael,

Returning true was done on a late-night debugging whim, as an experiment. I haven't looked "upstream" to see what the false return would do, but clearly it's having some negative effect on the ability for the CAN bus to operate. I did a bit more poking around, and I now believe that returning True is totally correct in this circumstance.

The functioning of buffer overflow, I believe, is working as it should. I see that most of the time, frames come in on buffer 0. When I cause the overflow by starting wifi, I see a single frame received in buffer 1, along with the status of a buffer overflow from buffer 0, but the interrupt status only shows buffer 1 as being full: status from register 2C is 0x22, not 0x23. The error status was 0x40, indicating the single overflow, as expected. My guess is that the timing is such that buffer 0 was being read at the time the next frame arrived, so it went into buffer 1, and that buffer 0 had emptied by the time buffer 1's interrupt was seen. I have not seen a buffer 1 overflow (which would indicate that a frame was actually lost), so the buffer 0 overflow is totally not an issue. At most, it's a warning that the system is under load. No surprise there; it was.

But we still need to assume that Rx overflows can occur in real life, and not have that be a fatal error requiring a reboot or process restart. The system has limited CPU power, and not all CAN bus consumers can have top priority in their processing. To prevent the overflows is probably going to be expensive (redesign, perhaps), and is unnecessary. That the OBDII devices do retries proves that the obd2ecu system can withstand a lost frame here and there and still operate correctly. I can even do a full reboot of the module without OBDWiz complaining, which kind of surprised me when it first occurred. And since the overflows I have seen don't indicate an actual frame loss, we really should be returning True.

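The register values quoted above can be checked with a short C++ sketch. The bit positions are the MCP2515 datasheet layout for CANINTF (address 0x2C) and EFLG. The severity rule in `classify_overflow` only encodes the reasoning in this message (a buffer 0 overflow alone is a load warning, a buffer 1 overflow means a frame was dropped) and is an assumption, as are all the function and enum names.

```cpp
#include <cstdint>

// MCP2515 CANINTF (register 0x2C) bits used in this thread.
constexpr std::uint8_t CANINTF_RX0IF = 0x01;  // RXB0 holds a frame
constexpr std::uint8_t CANINTF_RX1IF = 0x02;  // RXB1 holds a frame
constexpr std::uint8_t CANINTF_ERRIF = 0x20;  // an EFLG condition is pending

// MCP2515 EFLG overflow bits.
constexpr std::uint8_t EFLG_RX0OVR = 0x40;
constexpr std::uint8_t EFLG_RX1OVR = 0x80;

constexpr bool rx0_full(std::uint8_t canintf) { return (canintf & CANINTF_RX0IF) != 0; }
constexpr bool rx1_full(std::uint8_t canintf) { return (canintf & CANINTF_RX1IF) != 0; }
constexpr bool error_pending(std::uint8_t canintf) { return (canintf & CANINTF_ERRIF) != 0; }

// Hypothetical severity scale following the argument in this message.
enum class RxOverflow { None, Warning, FrameLost };

constexpr RxOverflow classify_overflow(std::uint8_t eflg) {
  return (eflg & EFLG_RX1OVR) ? RxOverflow::FrameLost   // both buffers were full
       : (eflg & EFLG_RX0OVR) ? RxOverflow::Warning     // system under load
                              : RxOverflow::None;
}
```

With these definitions, 0x22 decodes as "RXB1 full, error pending, RXB0 empty", 0x23 would additionally show RXB0 full, and an EFLG of 0x40 classifies as a warning rather than a lost frame.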
Now, if the load on the processor increases such that we do get an overflow from buffer 1, what then? I suggest nothing different. Again, the design of OBDII is such that it can withstand occasional frame losses, and even so, what would you do? There is nothing the average user would do, nor is there a protocol to engage for error handling. The same, I believe, is true for other uses of the MCP CAN busses, which fortunately appear to be much lower in data rate, and are less likely to overflow.

In an embedded system, we should note the event (counters increment, as now), and move on. If there are any CAN messages that would have an effect on the system if lost (are there any one-time occurrences?), that would probably be worth noting...

An Info loglevel message should probably be displayed for the developers if we do get a buffer 1 overflow, indicating a lost frame. I'll add that and push.

Thanks for looking over my shoulder!

Greg

Michael Balzer wrote:

...

Greg,

the RxCallback() is designed to process the errors after the RX buffers, so the return false is basically correct, as there should be no more to do for this rx loop.

Also, if the error handler returns true to the framework, a random frame will be sent to the listeners. If this is going to be the solution, we need another return code for the loop.
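To make the "another return code" idea concrete, here is a rough C++ sketch of a three-way result. Everything in it is hypothetical: the enum `RxResult`, the struct `LoopStats`, and `run_rx_loop` are illustrative names, and the loop runs over a scripted list of results instead of calling the real driver. It only shows how a framework loop could keep polling after a handled error without delivering a frame to the listeners.

```cpp
#include <cstddef>

// Hypothetical three-way result of one driver callback invocation.
enum class RxResult {
  Done,        // nothing more to do for this rx loop
  FrameReady,  // a frame was read: deliver it, then call again
  CallAgain    // an error was handled: no frame to deliver, but call again
};

struct LoopStats {
  int callbacks;
  int frames_delivered;
};

// Walks a scripted sequence of callback results the way a framework
// loop might, counting calls and the frames handed to listeners.
constexpr LoopStats run_rx_loop(const RxResult* script, std::size_t n) {
  LoopStats stats{0, 0};
  for (std::size_t i = 0; i < n; ++i) {
    ++stats.callbacks;
    if (script[i] == RxResult::Done) break;
    if (script[i] == RxResult::FrameReady) ++stats.frames_delivered;
    // CallAgain: poll again without delivering anything.
  }
  return stats;
}

// Example script: frame, handled error, frame, then idle.
constexpr RxResult kExample[] = {RxResult::FrameReady, RxResult::CallAgain,
                                 RxResult::FrameReady, RxResult::Done};
```

In this sketch the handled error in the middle of `kExample` neither stops the loop nor causes a stray frame to be delivered, which is the behaviour both sides of the discussion seem to want.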

I still think this is a performance issue of the queue/callback scheme. If you return true from the error handler, RxCallback() will be called again after the , so may clear some RX buffer being filled directly on clearing the error flags... maybe that's why this change makes a difference?

I don't understand how the flags 0x2040 = RXB0 overflow can happen. Have a look at the receive flow chart on page 26: if RXB0 is full, it will roll over into RXB1, if that's also full it will generate an RXB1 overflow. With rollover enabled, RXB0 overflow should never happen. Strange.
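The rollover behaviour described in the paragraph above can be written as a small state model. This is a simplified C++ sketch of the flow as stated in this message, for a frame that is accepted by RXB0's filters. It is not a substitute for the datasheet flow chart, and `RxState` and `on_frame` are hypothetical names.

```cpp
#include <cstdint>

constexpr std::uint8_t EFLG_RX0OVR = 0x40;  // MCP2515 EFLG bit 6
constexpr std::uint8_t EFLG_RX1OVR = 0x80;  // MCP2515 EFLG bit 7

// Hypothetical snapshot of the two receive buffers plus the error flags.
struct RxState {
  bool rx0_full;
  bool rx1_full;
  std::uint8_t eflg;
};

// One frame arrives for RXB0. Mirrors the flow described above:
// free RXB0 takes it; otherwise, with rollover, RXB1 takes it;
// if neither is possible the matching overflow flag is set.
constexpr RxState on_frame(RxState s, bool rollover_enabled) {
  return !s.rx0_full
             ? RxState{true, s.rx1_full, s.eflg}
         : !rollover_enabled
             ? RxState{true, s.rx1_full,
                       static_cast<std::uint8_t>(s.eflg | EFLG_RX0OVR)}
         : !s.rx1_full
             ? RxState{true, true, s.eflg}
             : RxState{true, true,
                       static_cast<std::uint8_t>(s.eflg | EFLG_RX1OVR)};
}
```

In this model an RXB0 overflow flag can only appear when rollover is disabled, which is exactly why the observed 0x40 together with a frame sitting in RXB1 looks strange.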

Regards, Michael

Am 10.01.2018 um 08:24 schrieb Greg D.:

...
Ok, I think I found the issue with receive buffer overflows killing the interface. Errors get handled by the Receive callback, but the function was returning false. Apparently that signaled somebody to stop. Changed it to return true, now things keep running. Since the error was handled, why hang? Seems an overly severe punishment.

Mark: Made the change to mcp2525.cpp and pushed to my fork. Unless there's an objection to hiding the error, please pull.

Todo: see if I can replicate the issue with the transmit side and Canopen. For some reason, that stopped failing earlier today (not that not failing is a bad thing...) It could be a similar issue with a transmit error, but I need to replicate it first.

Greg

Michael Balzer wrote:

...
Greg,

error flags 0x2040 on mcp2515 = receive buffer 0 overflow. The error flags are reset by a bus start command, so I'd guess the overflow occurs again right after restarting the obdii process.

As the mcp has two receive buffers, this means the CAN driver was not able to clear buffer 0 fast enough.

My next guess would then be we need to clear the RX buffer in the mcp ISR code instead of the RxCallback.

But… a receive buffer overflow condition should get reset by the driver automatically (line 306). So that should not be the reason the bus stops working.

Are you sure it's the bus that stopped working, and not the obdii process? Did you verify this using the "can trace" command?

Regards, Michael

Am 08.01.2018 um 07:31 schrieb Greg D.:

...
Quick update... I can reliably get the CAN 3 bus to hang with an Rx overflow by having the modem running, then telling WiFi to connect, but only with the OBDWiz dongle. Connecting an actual HUD display doesn't seem to trigger the effect. Surprisingly, I can do a full module reset while the OBDWiz is running, without it disconnecting. (OBDII ECU is started in the system.start script, along with the v2 server and vehicle module setting, though all this testing is done on the bench without the car.)

Also, once the bus is hung, restarting the OBD2ECU process sometimes only lets the OBDWiz dongle get part-way through its connect process before it hangs again. Consistently 7 frames received, 10 sent (due to some of the responses taking multiple frames). It may be significant that the HUD does NOT use those multi-frame PIDS (ECU Name and VIN)...

That said, another development is that the Rx overflow may not be fatal after all, if I start things with the HUD, then swap dongles to the OBDWiz. Seems that having an external 12v power source keeps things running even with the overflow status. Since earlier (few months ago) testing didn't have the modem running, and the OBDWiz dongle doesn't need the 12v power (it's a USB device, on a different PC), this test combination is new. Error flags on the can status are 0x2040, by the way, when it hangs.

But still, even with the 12v, I can reliably cause the bus to hang by starting with the OBDWiz dongle running, get the modem connected, then connect wifi. The partial connect and hang seems to be solved with the 12v power; the full hang is not. But the full hang (with 12v attached) can be reset by stopping and restarting the obdii ecu. Interestingly, the 0x2040 error status is not cleared when restarting the obdii process, but the frame and frame error counters do get set back to zero.

Still looking for more clues... Any ideas on how to narrow this down?

Greg

Greg D. wrote:

...
I've turned off Canopen, SSH, and Telnet earlier, and that seemed to stop the crashes. Just now added Bluetooth to that list, for good measure.

Let's see how that holds...

As for the issues with CAN-3, I seem to be able to hang it by simply starting WiFi while the OBDII ECU is running with an OBDII device attached (OBDWiz, in this case). Trying to reconnect the OBDII device fails - no frames are received. Stopping and restarting the OBDII ECU task lets me reconnect. If I look at the can status when hung, I see that Rx Ovrflw is 1, and the Rx counter doesn't increment. I'm guessing that starting WiFi is taking enough CPU time that the OBDII ECU task is falling behind, causing the overflow. Apparently, that overflow is not being handled, leading to the hang.

On an earlier run (before removing Bluetooth), I was able to get the OBDWiz dongle to connect for a few frames, after which it hung. That behavior didn't repeat just now, but I'm not sure what else was running at the time (e.g. the modem / ppp). The connect sequence from OBDWiz does a few frames rapidly (an initial PID 0, followed by requests for ECU Name and VIN), before a more relaxed polling starts. So, if there's another task taking up CPU time, I can see where an Rx overflow could occur during that initial connect sequence.

Driving a HUD is not a critical task, so I would be against a general raising of task priority. Rather, we need to figure out how to handle the Rx Overflow, and keep the frames coming in. OBDII devices generally are somewhat forgiving about lost frames, but apparently the OBDWiz has a short attention span and lets you know that something is wrong.

I'll take a look at the 2515 code, but I'm not much of an expert on the chip's care and feeding under such circumstances. If someone more in the know about it could take a look, that would be great.

Thanks,

Greg

Stephen Casner wrote:

...
Greg,

Yes, definitely running out of free RAM, but I don't know the meaning of the WindowOverflow messages.

The first time I built with release/v3.0 of esp-idf I was not able to open an ssh connection; the error displayed was about a crypto failure. After quite a bit of digging to narrow down to where the error was occurring, I finally found that the problem was running out of free RAM. My solution was to disable bluetooth entirely, which made a big difference in the amount of free RAM.

-- Steve

On Sun, 7 Jan 2018, Greg D. wrote:

> Hi Michael, Steve, Mark,
>
> Steve, the crash was an abort in new_op.cc, so perhaps being out of space is the issue. Crash and reboot log attached (crash.txt). One thing I've been wondering about is the several "_WindowOverflow4 at ??:?" lines during the boot process. Is that indicative of a problem, later to manifest in the crash?
>
> My builds include pretty much everything, except for the Leaf, Twizy, and Soul.
>
> The update included some 20 lines changed to mcp2515.cpp, as well as a bunch of other stuff, including a lot of stuff updated in Canopen and Kia. I have a script that does the git fetch master, merge, and push back to my github fork, the output of which is attached (update.txt). As a test, I removed Canopen from the build config, and the crash has disappeared. CAN-3 also appears to have come back to life (!), at least initially. I can still get CAN-3 to fail if I turn on/off the modem and/or wifi in some sequence (still trying to pin that down), but that also leads to another crash (crash2.txt, attached).
>
> Mark: Note also the issue with DNS failures getting to the v2 server. I enabled the modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2 server. Disabling Wifi didn't bring it back, and powering off the modem (in preparation for turning it back on) caused the crash.
>
> So, two questions... First, why the apparent conflict between Canopen or wifi/modem and obd2ecu over access to the 3rd CAN bus? Why would the modem or wifi have any effect on a CAN bus?
>
> Second, overall memory usage seems to be at the limit. What sort of budget do we have for what remains to be done, and how are we going to be packaging the build options for when non-developers want to get their hands on the product? Will we be able to turn everything on, minus the developer / debug stuff, or will we have a separate SKU for each model car?
>
> Thanks,
>
> Greg


Greg D.

2:52 p.m.

Done. Mark, if you could do the "pull" honors from github / bitsofgreg?

Greg

Greg D. wrote:

...

An Info loglevel message should probably be displayed for the developers if we do get a buffer 1 overflow, indicating a lost frame. I'll add that and push.

Michael Balzer

3:30 p.m.

Greg,

please check the receive flow chart; that's not the way the MCP2515 is supposed to work with RXB0CTRL.BUKT=1 and no filters -- if the documentation is correct.

Your change will still produce wrong IncomingFrame() calls, caused by the return true from the error handler. You need to change the RxCallback() return type (or use the frame buffer for an auxiliary result tag) and the call loop to add the "don't send but keep calling" case.

Regards,
Michael

On 10.01.2018 at 23:27, Greg D. wrote:

...

The functioning of buffer overflow, I believe, is working as it should. I see that most of the time, frames come in on buffer 0. When I cause the overflow by starting wifi, I see a single frame received in buffer 1, along with the status of a buffer overflow from buffer 0, but the interrupt status only shows buffer 1 as being full: status from register 2C is 0x22, not 0x23. The error status was 0x40, indicating the single overflow, as expected. My guess is that the timing is such that buffer 0 was being read at the time the next frame arrived, so it went into buffer 1, and that buffer 0 had emptied by the time buffer 1's interrupt was seen. I have not seen a buffer 1 overflow (which would indicate that a frame was actually lost), so the buffer 0 overflow is totally not an issue. At most, it's a warning that the system is under load. No surprise there; it was.

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
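For readers following the register values quoted above, here is a small sketch that decodes them. The bit masks are the ones given in the MCP2515 datasheet for CANINTF (address 0x2C) and EFLG (address 0x2D); the helper name DescribeRxStatus is made up for illustration and is not part of the OVMS driver.

```cpp
#include <cstdint>
#include <string>

// Bit masks per the MCP2515 datasheet.
constexpr uint8_t CANINTF_RX0IF = 0x01;  // receive buffer 0 full interrupt flag
constexpr uint8_t CANINTF_RX1IF = 0x02;  // receive buffer 1 full interrupt flag
constexpr uint8_t CANINTF_ERRIF = 0x20;  // error interrupt flag (see EFLG for the cause)
constexpr uint8_t EFLG_RX0OVR   = 0x40;  // receive buffer 0 overflow flag
constexpr uint8_t EFLG_RX1OVR   = 0x80;  // receive buffer 1 overflow flag

// Hypothetical helper: list the receive-related flags set in the two bytes.
std::string DescribeRxStatus(uint8_t canintf, uint8_t eflg)
{
  std::string out;
  auto add = [&out](const char* name) {
    if (!out.empty()) out += ' ';
    out += name;
  };
  if (canintf & CANINTF_RX0IF) add("RX0IF");
  if (canintf & CANINTF_RX1IF) add("RX1IF");
  if (canintf & CANINTF_ERRIF) add("ERRIF");
  if (eflg & EFLG_RX0OVR) add("RX0OVR");
  if (eflg & EFLG_RX1OVR) add("RX1OVR");
  return out;
}
```

With the values from the message, DescribeRxStatus(0x22, 0x40) lists RX1IF, ERRIF and RX0OVR: buffer 1 holds a frame, the error interrupt is raised, and the buffer 0 overflow flag is set, while RX0IF is clear. A status of 0x23 would additionally have RX0IF set.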

Mark Webb-Johnson

4:31 p.m.

The design of the system is as follows:

* The can object CAN_rxtask listens on the rx queue to receive instructional messages from canbus drivers. These can be:
  o CAN_frame: simply passes an entire incoming can frame to the IncomingFrame handler.
  o CAN_rxcallback: an instruction for the CAN_rxtask to call the RxCallback task repeatedly.
  o CAN_txcallback: an instruction for the CAN_rxtask to call the TxCallback once.
* In the case of CAN_rxcallback, the canbus object RxCallback function is expected to return FALSE to indicate nothing should be done and RxCallback should not be called again, or TRUE to indicate an incoming frame has been received and should be passed to IncomingFrame.
* The system is arranged so that individual bus driver interrupt implementations can be fast and efficient.
  o The driver can choose to receive the frame in the interrupt handler itself, and pass it with CAN_frame to CAN_rxtask. The esp32 can driver uses this option.
  o Or the driver can choose to delay the reception of the frame to the RxCallback stage, and merely pass an indication with CAN_rxcallback. The mcp2515 driver uses this option.
* The true/false response from RxCallback is designed to allow the callback to signal it received a frame or not. If it received a frame, then it is called again.
* This approach is used in order to be able to centralise the reception of CAN frames to one single task (avoiding having to run individual tasks for each canbus, hence saving stack RAM).

The RxCallback should definitely ONLY return true if an actual can message has been received, and is being passed back in the frame pointer parameter.

I suspect the issue is that the mcp2515 RxCallback is being faced with multiple error flags. Changing that to a return true (as Greg has done) has the undesired side-effect of issuing a spurious IncomingFrame (with garbage/blank frame), but also causes the RxCallback to be called again (clearing the error flag). Perhaps the solution is to put a loop in RxCallback so that if an error condition is found, it should be cleared, but then loop again and keep clearing errors until no more are found, then return false? I think that in the mcp2515 case, this error clearing loop can be simply handled in the RxCallback itself.

The alternative is to change the RxCallback logic so that the return bool value means simply ‘loop’ (call me again, please), and have the RxCallback itself call IncomingFrame(), rather than passing a frame as a parameter. If Michael/Greg think this is a better approach, I am happy to make that change - it is pretty trivial.

Regards, Mark.
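To make the options in this message concrete, here is a minimal sketch of the "don't send but keep calling" variant Michael asked for, using a three-way result instead of a bool. Everything in it is a stand-in: Frame, RxResult, FakeDriver and ServiceRxCallback are hypothetical names for illustration, not the OVMS types or the code that was eventually merged.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Simplified stand-in for a CAN frame; not the real OVMS frame type.
struct Frame { uint32_t id; };

// Hypothetical three-way result replacing the bool return value.
enum class RxResult {
  Done,   // nothing pending: stop calling
  Frame,  // a frame was stored in *out: deliver it, then call again
  Again   // an error flag was cleared, no frame: call again, deliver nothing
};

// Toy driver: each pending item is either a received frame or an error
// condition that only needs clearing.
struct FakeDriver {
  struct Pending { bool is_error; Frame frame; };
  std::deque<Pending> pending;
  int errors_cleared = 0;

  RxResult RxCallback(Frame* out) {
    if (pending.empty()) return RxResult::Done;
    Pending p = pending.front();
    pending.pop_front();
    if (p.is_error) { ++errors_cleared; return RxResult::Again; }
    *out = p.frame;
    return RxResult::Frame;
  }
};

// What the central receive task might do for one CAN_rxcallback message.
// The returned vector stands in for the IncomingFrame() calls.
std::vector<Frame> ServiceRxCallback(FakeDriver& drv) {
  std::vector<Frame> delivered;
  Frame f{};
  for (;;) {
    RxResult r = drv.RxCallback(&f);
    if (r == RxResult::Done) break;
    if (r == RxResult::Frame) delivered.push_back(f);
    // RxResult::Again: keep looping without delivering anything
  }
  return delivered;
}
```

With this shape an error condition never produces a spurious frame, and frames that arrive alongside an error flag are still delivered in order. Clearing errors in a loop inside RxCallback itself, as suggested above, would give the same outcome while keeping the bool return type.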


Mark Webb-Johnson

4:35 p.m.

Another issue, that seems strange, is:

    can::can()
    ...
    xTaskCreatePinnedToCore(CAN_rxtask, "CanRxTask", 2048, (void*)this, 10, &m_rxtask, 0);

That is creating the CAN_rxtask on core #0.

    $ fgrep -r xTaskCreatePinnedToCore *
    components/can/can.cpp: xTaskCreatePinnedToCore(CAN_rxtask, "CanRxTask", 2048, (void*)this, 10, &m_rxtask, 0);
    components/canopen/src/canopen.cpp: xTaskCreatePinnedToCore(CANopenRxTask, "COrx Task", 4096, (void*)this, 5, &m_rxtask, 1);
    components/canopen/src/canopen_worker.cpp: xTaskCreatePinnedToCore(CANopenWorkerJobTask, m_taskname, 4096, (void*)this, 5, &m_jobtask, 1);
    components/obd2ecu/src/obd2ecu.cpp: xTaskCreatePinnedToCore(OBD2ECU_task, "OBDII ECU Task", 6144, (void*)this, 5, &m_task, 1);
    components/ovms_server/ovms_server.cpp: xTaskCreatePinnedToCore(OvmsServer_task, "OVMS Server", 6144, (void*)this, 5, &m_task, 1);
    components/retools/src/retools.cpp: xTaskCreatePinnedToCore(RE_task, "RE Task", 4096, (void*)this, 5, &m_task, 1);
    components/simcom/src/simcom.cpp: xTaskCreatePinnedToCore(SIMCOM_task, "SIMCOMTask", 4096, (void*)this, 5, &m_task, 1);
    components/vehicle/vehicle.cpp: xTaskCreatePinnedToCore(OvmsVehicleRxTask, "Vrx Task", 4096, (void*)this, 10, &m_rxtask, 1);
    main/ovms_housekeeping.cpp: xTaskCreatePinnedToCore(HousekeepingTask, "Housekeeping", 4096, (void*)this, 5, &m_taskid, 1);
    main/ovms_netmanager.cpp: xTaskCreatePinnedToCore(MongooseRawTask, "NetManTask", 7*1024, (void*)this, 5, &m_mongoose_task, 1);
    main/task_base.cpp: BaseType_t task = xTaskCreatePinnedToCore(Task, name, stack, (void*)this, priority, &m_taskid, core);

Every other task is created on core #1. Supposedly our App runs on core #1, and wifi/bluetooth run on core #0. Why is can::can CAN_rxtask running on core #0?

Regards, Mark.


Greg D.

10:20 p.m.

WiFi running on core #0 would explain the impact of wifi operations on the receipt of CAN frames. But, with the car interfaces running on core #1, core #0 might actually have more bandwidth under normal circumstances, as I'm assuming the modem is pretty light loading due to the limited data rate.

{shrug} Need more real-world experience for optimizing this kind of stuff. Are we going to have a semi-formalized "beta" testing period for the first sets of production units per car? What will the feedback channel be?

Greg

Mark Webb-Johnson wrote:

...

Every other task is created on core #1. Supposedly our App runs on core #1, and wifi/bluetooth run on core #0. Why is can::can CAN_rxtask running on core #0?

Mark Webb-Johnson

11:35 p.m.

The core #0 vs #1 issue was just following Espressif’s original recommendations. I thought that was how they liked it to be done?

We’re close now. Very close. RAM is the biggest issue, but I think with a bunch of stuff moved to SPIRAM, that will give us a lot of headroom. I’m working on Steve to optimise that over the next few days. I’ve got it running on my desk at least.

For beta test / initial release, I think we can just do it like we did with OVMS v1. Clearly say this is pre-beta, intended for developers / technical people only, and make sure people know how to update firmware. Once the hardware and SDK are stable, I think things will get easier. We can just concentrate on the last few reliability issues.

Regards, Mark.


Michael Balzer

11 Jan 11 Jan

4:21 a.m.

Mark,

On 11.01.2018 at 01:35, Mark Webb-Johnson wrote:

...

Every other task is created on core #1. Supposedly our App runs on core #1, and wifi/bluetooth run on core #0. Why is can::can CAN_rxtask running on core #0?

See my post "can1 TX performance solved" (Dec 31):

...

Turned out this was no hardware issue but a scheduler problem: the CAN rx and vehicle rx tasks were not scheduled often & fast enough to consistently keep up with the 10 ms period.

After assigning the CAN RX task to core 0 and raising the vehicle task priority to 10, there are no more TX overflows, and my charge control override works perfectly.

It's just a first attempt to optimize core & priority assignment to better utilize the dual core system. We sure can do some more optimization on this.

Core #0 is called the "protocol core" by Espressif, from my understanding it's the preferred core for communication processes & driver tasks. The cores are identical, we can also use another assignment scheme:

https://dl.espressif.com/doc/esp-idf/latest/api-guides/freertos-smp.html

Regards, Michael

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

Mark Webb-Johnson

5:43 p.m.

I think the naming PRO (protocols - wifi & bluetooth) and APP (application code) reflect the original intention. As you say, it seems that Espressif have ironed out the bugs and don’t care so much now. That’s good to know.

Regards, Mark


Greg D.

10 Jan 10 Jan

5:24 p.m.

Greg D.

8:55 p.m.

Michael Balzer

11 Jan 11 Jan

4:42 a.m.

Greg, Mark,

I can check your new code after work.

For the TX performance/overflow issue, there are basically two options:

* A: make all application TX be aware of overflows, i.e. check the return value of the CAN Write() call as necessary and/or introduce sufficient delays (very ugly)
* B: add a TX queue to the CAN framework, so the application can just push some frames as fast as it likes, with an option to wait/block/fail if the queue is full
  o → the framework checks for TX buffers becoming available (i.e. driver issuing a TxCallback request) and delivers queued frames only as fast as the driver can handle them

Option B has been on my todo list since removing the delay from the MCP driver and introducing the TX buffer check in the esp32can driver, as I don't think applications should need to handle TX overflows.

I can try to implement that this weekend if it's urgent now.

Regards, Michael

On 11.01.2018 at 05:55, Greg D. wrote:

...

Hi Mark, Michael,

Ok, good news and bad news.

Good news: Rx problem I believe is fixed. Return is true only if we received something, else false. And the other interrupt conditions are handled at the same time, so no hangs are seen when restarting wifi. Rx overflow counter does increment properly. Yea! Code has been pushed to my clone on Github.

Bad news: I am still able to hang the bus, but I think it's on the transmit side. The obd2ecu process can send up to 3 frames back to back to report the ECU Name, followed soon after by several more carrying the VIN. Without any flow control on the transmit side, and with a half-duplex CAN bus, that's just too much. Turning off the VIN reporting (config set obd2ecu private yes) seems to let everything run because I don't respond to the VIN request (which lets everything drain as OBDWiz times out). Also verified by putting temporary delays in the obd2ecu code to let things drain a bit between frames. So, the transmit side is still a bit fragile, depending on timing. Not sure quite what to do here, as there is no easy place to queue things... Do we need to go back to the old way with a delay in the obd2ecu code (perhaps better than in the driver, no?). Architecturally it's ugly, but this only occurs at startup, and I don't mind the kludge. Do any other uses of the MCP busses do a burst of transmitting? If not, I'll put the delays in the obd2ecu code and call it close enough. Lemme know.

For receive, I'd go with what I have for now, if Michael would be so kind as to review what I have done first. https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/v... Hopefully he'll be back on line before I get up in the morning. Wonderful how the Earth's spin helps with the teamwork.

I'll keep poking at things tonight, and take it out for a spin in the car tomorrow, just to see everything working together. But as it is now, it's much better than it was before. Really, this time. :)

Greg

Greg D. wrote:

...
Hi Mark,

I believe you are right about the multiple flags, and the code only processing Rx and "error" separately. Fundamentally, a roll-over from buffer 0 to buffer 1 isn't really an error, just a statement of fact on what happened. So, we should have buffer 1 and the rollover flag at the same time, which in fact is what I saw. I need to handle the Rx overflow at the same time as the buffer 1 receive, I think...

I need to grab some dinner, but have a fix in the works. Will report back in a few hours, hopefully with good news...

Greg


-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

Mark Webb-Johnson

4:19 p.m.

Option B sounds like a good approach.

Presumably we are just polling the tx queue in the existing CAN_rxtask based on TxCallback?

Regards, Mark.

...

On 11 Jan 2018, at 8:42 PM, Michael Balzer <dexter@expeedo.de> wrote:

Greg, Mark,

I can check your new code after work.

For the TX performance/overflow issue, there are basically two options:

* A: make all application TX be aware of overflows, i.e. check the return value of the CAN Write() call as necessary and/or introduce sufficient delays (very ugly)
* B: add a TX queue to the CAN framework, so the application can just push some frames as fast as it likes, with an option to wait/block/fail if the queue is full
  o → the framework checks for TX buffers becoming available (i.e. driver issuing a TxCallback request) and delivers queued frames only as fast as the driver can handle them

Option B has been on my todo list since removing the delay from the MCP driver and introducing the TX buffer check in the esp32can driver, as I don't think applications should need to handle TX overflows.

I can try to implement that this weekend if it's urgent now.

Regards, Michael

On 11.01.2018 at 05:55, Greg D. wrote:

...
Hi Mark, Michael,

Ok, good news and bad news.

Good news: Rx problem I believe is fixed. Return is true only if we received something, else false. And the other interrupt conditions are handled at the same time, so no hangs are seen when restarting wifi. Rx overflow counter does increment properly. Yea! Code has been pushed to my clone on Github.

Bad news: I am still able to hang the bus, but I think it's on the transmit side. The obd2ecu process can send up to 3 frames back to back to report the ECU Name, followed soon after by several more carrying the VIN. Without any flow control on the transmit side, and with a half-duplex CAN bus, that's just too much. Turning off the VIN reporting (config set obd2ecu private yes) seems to let everything run because I don't respond to the VIN request (which lets everything drain as OBDWiz times out). Also verified by putting temporary delays in the obd2ecu code to let things drain a bit between frames. So, the transmit side is still a bit fragile, depending on timing. Not sure quite what to do here, as there is no easy place to queue things... Do we need to go back to the old way with a delay in the obd2ecu code (perhaps better than in the driver, no?). Architecturally it's ugly, but this only occurs at startup, and I don't mind the kludge. Do any other uses of the MCP busses do a burst of transmitting? If not, I'll put the delays in the obd2ecu code and call it close enough. Lemme know.

For receive, I'd go with what I have for now, if Michael would be so kind as to review what I have done first. https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/v... <https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/components/mcp2515/mcp2515.cpp> Hopefully he'll be back on line before I get up in the morning. Wonderful how the Earth's spin helps with the teamwork.

I'll keep poking at things tonight, and take it out for a spin in the car tomorrow, just to see everything working together. But as it is now, it's much better than it was before. Really, this time. :)

Greg

Greg D. wrote:

...
Hi Mark,

I believe you are right about the multiple flags, and the code only processing Rx and "error" separately. Fundamentally, a roll-over from buffer 0 to buffer 1 isn't really an error, just a statement of fact on what happened. So, we should have buffer 1 and the rollover flag at the same time, which in fact is what I saw. I need to handle the Rx overflow at the same time as the buffer 1 receive, I think...

I need to grab some dinner, but have a fix in the works. Will report back in a few hours, hopefully with good news...

Greg

Mark Webb-Johnson wrote:

...
The design of the system is as follows:

The can object CAN_rxtask listens on the rx queue to receive instructional messages from canbus drivers. These can be: CAN_frame: simply passes an entire incoming can frame to the IncomingFrame handler. CAN_rxcallback: an instruction for the CAN_rxtask to call the RxCallback task repeatedly. CAN_txcallback: an instruction for the CAN_rxtask to call the TxCallback once. In the case of CAN_rxcallback, the canbus object RxCallback function is expected to return FALSE to indicate nothing should be done and RxCallback should not be called again, or TRUE to indicate an incoming frame has been received and should be passed to IncomingFrame. The system is arranged so that individual bus driver interrupt implementations can be fast and efficient. The driver can choose to receive the frame in the interrupt handler itself, and pass it with CAN_frame to CAN_rxtask. The esp32 can driver uses this option. Or the driver can choose to delay the reception of the frame to the RxCallback stage, and merely pass an indication with CAN_rxcallback. The mcp2515 driver uses this option. The true/false response from RxCallback is designed to allow the callback to signal it received a frame or not. If it received a frame, then it is called again. This approach is used in order to be able to centralise the reception of CAN frames to one single task (avoiding having to run individual tasks for each canbus, hence saving stack RAM).

The RxCallback should definitely ONLY return true if an actual can message has been received, and is being passed back in the frame pointer parameter.

I suspect the issue is that the mcp2515 RxCallback is being faced with multiple error flags. Changing that to a return true (as Greg has done) has the undesired side-effect of issuing a spurious IncomingFrame (with garbage/blank frame), but also causes the RxCallback to be called again (clearing the error flag). Perhaps the solution is to put a loop in RxCallback so that if an error condition is found, it should be cleared, but then loop again and keep clearing errors until no more are found, then return false? I think that in the mcp2515 case, this error clearing loop can be simply handled in the RxCallback itself.

The alternative is to change the RxCallback logic so that the return bool value means simply ‘loop’ (call me again, please), and have the RxCallback itself call IncomingFrame(), rather than passing a frame as a parameter. If Michael/Greg think this is a better approach, I am happy to make that change - it is pretty trivial.

Regards, Mark.
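
To make the contract described above concrete, here is a minimal C++ sketch of the receive loop: the central task keeps calling RxCallback while it returns true and hands each frame to IncomingFrame, and the callback clears pending error conditions itself before deciding whether it has a frame. Frame, FakeBus, HandleRxCallback and the error counters are hypothetical stand-ins for illustration only; this is not the actual OVMS or mcp2515 driver code.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for a CAN frame; not the real OVMS type.
struct Frame { uint32_t id; };

// Hypothetical bus object that models a controller with pending
// frames and pending error conditions.
struct FakeBus {
  std::deque<Frame> pending;        // frames waiting in the controller
  int error_flags = 0;              // error conditions not yet cleared
  int errors_cleared = 0;           // how many we have cleared so far
  std::vector<uint32_t> delivered;  // ids passed to IncomingFrame

  // Contract: return true ONLY when a frame was stored in *out.
  // Errors are cleared here, in a loop, and are never reported
  // as a received frame.
  bool RxCallback(Frame* out) {
    assert(out != nullptr);
    while (error_flags > 0) {       // keep clearing until none remain
      --error_flags;
      ++errors_cleared;
    }
    if (pending.empty()) return false;
    *out = pending.front();
    pending.pop_front();
    return true;
  }

  void IncomingFrame(const Frame& f) { delivered.push_back(f.id); }
};

// What the central receive task would do for one CAN_rxcallback
// message: call RxCallback until it reports "nothing more".
// Returns the number of frames delivered.
inline int HandleRxCallback(FakeBus& bus) {
  int count = 0;
  Frame f{};
  while (bus.RxCallback(&f)) {
    bus.IncomingFrame(f);
    ++count;
  }
  return count;
}
```

With this shape an error-only interrupt produces no spurious IncomingFrame call, because the callback returns false once the errors are cleared and no frame is waiting.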


Michael Balzer

12 Jan

10:01 a.m.

Yes, I had something like that in mind. On TX IRQ, the drivers send CAN_txcallbacks to the CAN_rxtask. The CAN_rxtask then fetches frames from the TX queue and calls the TxCallback until all TX buffers of the driver are full. From the already existing TxCallback() stubs I suppose you had planned a scheme like that already? ;)

Greg, can you create a pull request for your MCP2515 change? I'd like to merge that before beginning on the drivers.

Thanks, Michael

On 12.01.2018 at 01:19, Mark Webb-Johnson wrote:

...

Option B sounds like a good approach.

Presumably we are just polling the tx queue in the existing CAN_rxtask based on TxCallback?

Regards, Mark.

...
On 11 Jan 2018, at 8:42 PM, Michael Balzer <dexter@expeedo.de> wrote:

Greg, Mark,

I can check your new code after work.

For the TX performance/overflow issue, there are basically two options:

* A: make all application TX be aware of overflows, i.e. check the return value of the CAN Write() call as necessary and/or introduce sufficient delays (very ugly)
* B: add a TX queue to the CAN framework, so the application can just push some frames as fast as it likes, with an option to wait/block/fail if the queue is full
    o → the framework checks for TX buffers becoming available *(i.e. driver issuing a TxCallback request)* and delivers queued frames only as fast as the driver can handle them

Option B has been on my todo list since removing the delay from the MCP driver and introducing the TX buffer check in the esp32can driver, as I don't think applications should need to handle TX overflows.
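
Option B could look roughly like the following C++ sketch: Write() uses a free hardware TX buffer when one is available and nothing is queued ahead, otherwise it queues the frame, and it fails only when the queue is full. TxCallback() models the driver reporting a freed TX buffer and drains the queue. QueuedBus, WriteResult and the counters are hypothetical names for illustration and not the OVMS framework API; the hardware buffer count and queue length are simply constructor arguments here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a CAN frame; not the real OVMS type.
struct Frame { uint32_t id; };

enum class WriteResult { Sent, Queued, Failed };

// Hypothetical framework-side bus with a bounded TX queue.
class QueuedBus {
 public:
  QueuedBus(int hw_buffers, std::size_t queue_len)
      : free_hw_(hw_buffers), queue_len_(queue_len) {
    assert(hw_buffers >= 0);
  }

  // Application entry point. Never blocks in this sketch.
  WriteResult Write(const Frame& f) {
    // Only bypass the queue when nothing is waiting, to keep order.
    if (queue_.empty() && free_hw_ > 0) {
      --free_hw_;
      ++sent_;
      return WriteResult::Sent;
    }
    if (queue_.size() < queue_len_) {
      queue_.push_back(f);
      return WriteResult::Queued;
    }
    ++overflows_;
    return WriteResult::Failed;
  }

  // Called when the driver signals that one TX buffer became free.
  void TxCallback() {
    ++free_hw_;
    while (free_hw_ > 0 && !queue_.empty()) {
      queue_.pop_front();  // hand the oldest queued frame to the driver
      --free_hw_;
      ++sent_;
    }
  }

  int sent() const { return sent_; }
  int overflows() const { return overflows_; }
  std::size_t queued() const { return queue_.size(); }

 private:
  int free_hw_;
  std::size_t queue_len_;
  std::deque<Frame> queue_;
  int sent_ = 0;
  int overflows_ = 0;
};
```

The application can then send a burst without inserting delays; frames leave only as fast as TX buffers are reported free.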

I can try to implement that this weekend if it's urgent now.

Regards, Michael

On 11.01.2018 at 05:55, Greg D. wrote:

...
Hi Mark, Michael,

Ok, good news and bad news.

Good news: Rx problem I believe is fixed. Return is true only if we received something, else false. And the other interrupt conditions are handled at the same time, so no hangs are seen when restarting wifi. Rx overflow counter does increment properly. Yea! Code has been pushed to my clone on Github.
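
The receive behaviour described in the paragraph above (handle the buffer receive and the overflow condition in the same pass, and report "received" only when there really is a frame) can be sketched as a small decision function in C++. The bit masks are assumed here to follow the CANINTF and EFLG register layout in the MCP2515 datasheet and should be checked against it; DecideRx and RxDecision are hypothetical names and are not taken from Greg's actual change.

```cpp
#include <cassert>
#include <cstdint>

// CANINTF (interrupt flag register) bits, assumed per the datasheet.
constexpr uint8_t RX0IF = 0x01;  // frame waiting in RX buffer 0
constexpr uint8_t RX1IF = 0x02;  // frame waiting in RX buffer 1
constexpr uint8_t ERRIF = 0x20;  // error interrupt (see EFLG)
// EFLG (error flag register) receive overflow bits, assumed likewise.
constexpr uint8_t RX0OVR = 0x40;
constexpr uint8_t RX1OVR = 0x80;

// Hypothetical result of examining both registers in one pass.
struct RxDecision {
  int buffer;          // -1: no frame, otherwise RX buffer to read
  uint8_t clear_intf;  // CANINTF bits to clear afterwards
  uint8_t clear_eflg;  // EFLG bits to clear afterwards
  bool overflow;       // true if an RX overflow should be counted
};

inline RxDecision DecideRx(uint8_t intf, uint8_t eflg) {
  RxDecision d{-1, 0, 0, false};

  // Read at most one buffer per call; the caller invokes us again
  // while we keep reporting a frame.
  if (intf & RX0IF) {
    d.buffer = 0;
    d.clear_intf = static_cast<uint8_t>(d.clear_intf | RX0IF);
  } else if (intf & RX1IF) {
    d.buffer = 1;
    d.clear_intf = static_cast<uint8_t>(d.clear_intf | RX1IF);
  }

  // Handle overflow in the same pass, whether or not a frame was read.
  const uint8_t ovr = static_cast<uint8_t>(eflg & (RX0OVR | RX1OVR));
  if (ovr != 0) {
    d.overflow = true;
    d.clear_eflg = ovr;
  }
  if (intf & ERRIF) {
    d.clear_intf = static_cast<uint8_t>(d.clear_intf | ERRIF);
  }
  return d;
}
```

A callback built on this would return true exactly when buffer is not -1, so a pure error interrupt clears its flags and ends the loop without producing a frame.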

Bad news: I am still able to hang the bus, but I think it's on the transmit side. The obd2ecu process can send up to 3 frames back to back to report the ECU Name, followed soon after by several more to report the VIN. Without any flow control on the transmit side, and with a half-duplex CAN bus, that's just too much. Turning off the VIN reporting (config set obd2ecu private yes) seems to let everything run because I don't respond to the VIN request (which lets everything drain as OBDWiz times out). Also verified by putting temporary delays in the obd2ecu code to let things drain a bit between frames. So, the transmit side is still a bit fragile, depending on timing. Not sure quite what to do here, as there is no easy place to queue things... Do we need to go back to the old way with a delay in the obd2ecu code (perhaps better than in the driver, no)? Architecturally it's ugly, but this only occurs at startup, and I don't mind the kludge. Do any other uses of the MCP busses do a burst of transmitting? If not, I'll put the delays in the obd2ecu code and call it close enough. Lemme know.

For receive, I'd go with what I have for now, if Michael would be so kind as to review what I have done first. https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/components/mcp2515/mcp2515.cpp Hopefully he'll be back on line before I get up in the morning. Wonderful how the Earth's spin helps with the teamwork.

I'll keep poking at things tonight, and take it out for a spin in the car tomorrow, just to see everything working together. But as it is now, it's much better than it was before. Really, this time. :)

Greg


Greg D.

12:37 p.m.

Mark's just fetched the stuff in the past, but I'll give it a try (not too familiar with Github's process here...)

Greg

Michael Balzer wrote:

...

Greg, can you create a pull request for your MCP2515 change? I'd like to merge that before beginning on the drivers.

Michael Balzer

1:23 p.m.

OK, never mind, I'll fetch your changes.

Regards, Michael

On 12.01.2018 at 21:37, Greg D. wrote:

...

Mark's just fetched the stuff in the past, but I'll give it a try (not too familiar with Github's process here...)

Greg


Greg D.

1:52 p.m.

Thanks, looks like it worked. The changes for CR/LF and the vfs editor also work. Yea!

Greg

Michael Balzer wrote:

...

OK, never mind, I'll fetch your changes.

Regards, Michael


Michael Balzer

13 Jan

8:24 a.m.

Part one (TX queue) done & pushed.

OVMS > can can1 status
CAN: can1
Mode: Active
Speed: 500000
Rx pkt: 236657
Rx err: 1
Rx ovrflw: 0
Tx pkt: 106378
Tx delays: 4
Tx err: 0
Tx ovrflw: 0
Err flags: 0x800caa

TX performance is rock steady on can1 -- the delays occurred when sending the stop charge request (as expected). I can't test can2/3, Greg & Geir, could you…?

The TxCallback can't be used on the mcp2515. The ISR can't query the IRQ register, so the TX IRQs are now also handled by the RxCallback(). As the TX IRQs need to be cleared before loading the next frame, this needs another SPI call. I hope that doesn't introduce new problems.

No changes are necessary to the application code (well, except you can remove any hard coded delays now). The TX queue has a length of 20 frames and will automatically be used by the drivers when no TX buffers are free.

If an application wants to know whether a frame was sent immediately or gets delayed, it can check the return code of the Write() method. Write() can now also take a second parameter for the maximum wait time for space in the TX queue to become available if it's full (default 0 = fail immediately if queue is full).

I also added logging of CAN errors. It's currently activated by "can … trace on"; I don't think this needs to be active by default, just for CAN issue debugging.

E (45718) can: Error can1 rxpkt=3 txpkt=0 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0
E (83528) can: Error can1 rxpkt=7483 txpkt=226 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0

…that's also a first part of the logging extension (part two).

Regards, Michael

On 12.01.2018 at 19:01, Michael Balzer wrote:

...

Yes, I had something like that in mind. On TX IRQ, the drivers send CAN_txcallbacks to the CAN_rxtask. The CAN_rxtask then fetches frames from the TX queue and calls the TxCallback until all TX buffers of the driver are full. From the already existing TxCallback() stubs I suppose you had planned a scheme like that already? ;)

Greg, can you create a pull request for your MCP2515 change? I'd like to merge that before beginning on the drivers.

Thanks, Michael


Geir Øyvind Vælidalo

9:37 a.m.

I currently don’t send anything on can2. I could try to send something, but the car is away this weekend :-(

Geir

...

On 13 Jan 2018, at 17:24, Michael Balzer <dexter@expeedo.de> wrote:

Part one (TX queue) done & pushed.

OVMS > can can1 status
CAN: can1
Mode: Active
Speed: 500000
Rx pkt: 236657
Rx err: 1
Rx ovrflw: 0
Tx pkt: 106378
Tx delays: 4
Tx err: 0
Tx ovrflw: 0
Err flags: 0x800caa

TX performance is rock steady on can1 -- the delays occurred when sending the stop charge request (as expected). I can't test can2/3, Greg & Geir, could you…?

The TxCallback can't be used on the mcp2515. The ISR can't query the IRQ register, so the TX IRQs are now also handled by the RxCallback(). As the TX IRQs need to be cleared before loading the next frame, this needs another SPI call. I hope that doesn't introduce new problems.

No changes are necessary to the application code (well, except you can remove any hard coded delays now). The TX queue has a length of 20 frames and will automatically be used by the drivers when no TX buffers are free.

If an application wants to know whether a frame was sent immediately or gets delayed, it can check the return code of the Write() method. Write() can now also take a second parameter for the maximum wait time for space in the TX queue to become available if it's full (default 0 = fail immediately if queue is full).

I also added logging of CAN errors. It's currently activated by "can … trace on"; I don't think this needs to be active by default, just for CAN issue debugging.

E (45718) can: Error can1 rxpkt=3 txpkt=0 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0
E (83528) can: Error can1 rxpkt=7483 txpkt=226 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0

…that's also a first part of the logging extension (part two).

Regards, Michael
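
When comparing successive CAN error log lines like the two quoted above, it can help to pull the counters out programmatically. The following C++ sketch assumes only the key=value layout visible in those lines; ParseCanErrorLine is a hypothetical helper and not part of OVMS.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// Extract every "key=value" token from a CAN error log line.
// Tokens without '=' (timestamp, tag, bus name) are skipped.
inline std::map<std::string, unsigned long> ParseCanErrorLine(
    const std::string& line) {
  std::map<std::string, unsigned long> out;
  std::istringstream in(line);
  std::string token;
  while (in >> token) {
    const std::string::size_type eq = token.find('=');
    if (eq == std::string::npos || eq == 0) continue;
    // Base 0 lets strtoul accept decimal and 0x-prefixed hex values.
    out[token.substr(0, eq)] =
        std::strtoul(token.c_str() + eq + 1, nullptr, 0);
  }
  return out;
}
```

Subtracting the rxpkt and txpkt values of two such lines gives the traffic between two error events, which can help tell a one-off error from a recurring one.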


Greg D.

9:42 a.m.

Michael Balzer

10:03 a.m.

Please check again.

Thanks, Michael

On 13.01.2018 at 18:42, Greg D. wrote:

...

Gave it a quick try, and got a crash... I'll see if I can isolate it a bit, but here's something to start with. Tombstone, attached.

Greg

Geir Øyvind Vælidalo wrote:

...
I currently don’t send anything on can2. I could try to send something, but the car is away this weekend :-(

Geir

...
13. jan. 2018 kl. 17:24 skrev Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>>:

Part one (TX queue) done & pushed.

OVMS > can can1 status CAN: can1 Mode: Active Speed: 500000 Rx pkt: 236657 Rx err: 1 Rx ovrflw: 0 Tx pkt: 106378 Tx delays: 4 Tx err: 0 Tx ovrflw: 0 Err flags: 0x800caa

TX performance is rock steady on can1 -- the delays occurred when sending the stop charge request (as expected). I can't test can2/3, Greg & Geir, could you…?

The TxCallback can't be used on the mcp2515. The ISR can't query the IRQ register, so the TX IRQs are now also handled by the RxCallback(). As the TX IRQs need to be cleared before loading the next frame, this needs another SPI call. I hope that doesn't introduce new problems.

No changes are necessary to the application code (well, except you can remove any hard coded delays now). The TX queue has a length of 20 frames and will automatically be used by the drivers when no TX buffers are free.

If an application wants to know whether a frame was sent immediately or gets delayed it can check the return code of the Write() method. Write() now also can take a second parameter for the maximum wait time for space in the TX queue to become available if it's full (default 0 = fail immediately if queue is full).

I also added logging of CAN errors. It's currently activated by "can … trace on", I don't think this needs to be active by default, just for CAN issue debugging.

E (45718) can: Error can1 rxpkt=3 txpkt=0 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0 E (83528) can: Error can1 rxpkt=7483 txpkt=226 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0

…that's also a first part of the logging extension (part two).

Regards, Michael

On 12.01.2018 at 19:01, Michael Balzer wrote:

...
Yes, I had something like that in mind. On TX IRQ, the drivers send CAN_txcallbacks to the CAN_rxtask. The CAN_rxtask then fetches frames from the TX queue and calls the TxCallback until all TX buffers of the driver are full. From the already existing TxCallback() stubs I suppose you had planned a scheme like that already? ;)
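The scheme described above (on a TX interrupt, move queued frames to the driver until its TX buffers are full) might look roughly like this. It is a sketch under assumptions: `FakeDriver`, `LoadFrame` and `HandleTxCallback` are hypothetical names, frames are reduced to their IDs, and a `std::deque` stands in for the framework's TX queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for a CAN driver with a fixed number of TX buffers.
struct FakeDriver {
  std::size_t free_tx_buffers;
  std::vector<uint32_t> loaded;  // frame IDs handed to the controller, in order

  // Stand-in for the driver's TxCallback: load one frame if a buffer is free.
  bool LoadFrame(uint32_t id) {
    if (free_tx_buffers == 0) return false;
    --free_tx_buffers;
    loaded.push_back(id);
    return true;
  }
};

// What the receive task could do for one CAN_txcallback message: deliver
// queued frames in order until the queue is empty or the driver is full.
// Returns the number of frames moved.
std::size_t HandleTxCallback(FakeDriver& drv, std::deque<uint32_t>& tx_queue) {
  std::size_t moved = 0;
  while (!tx_queue.empty() && drv.LoadFrame(tx_queue.front())) {
    tx_queue.pop_front();  // remove only after the driver accepted the frame
    ++moved;
  }
  return moved;
}
```

Popping a frame only after the driver has accepted it keeps the send order intact and loses nothing when the driver is full.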

Greg, can you create a pull request for your MCP2515 change? I'd like to merge that before beginning on the drivers.

Thanks, Michael

On 12.01.2018 at 01:19, Mark Webb-Johnson wrote:

...
Option B sounds like a good approach.

Presumably we are just polling the tx queue in the existing CAN_rxtask based on TxCallback?

Regards, Mark.

...
On 11 Jan 2018, at 8:42 PM, Michael Balzer <dexter@expeedo.de> wrote:

Greg, Mark,

I can check your new code after work.

For the TX performance/overflow issue, there are basically two options:

* A: make all application TX be aware of overflows, i.e. check the return value of the CAN Write() call as necessary and/or introduce sufficient delays (very ugly)
* B: add a TX queue to the CAN framework, so the application can just push some frames as fast as it likes, with an option to wait/block/fail if the queue is full
  o → the framework checks for TX buffers becoming available *(i.e. driver issuing a TxCallback request)* and delivers queued frames only as fast as the driver can handle them

Option B has been on my todo list since removing the delay from the MCP driver and introducing the TX buffer check in the esp32can driver, as I don't think applications should need to handle TX overflows.

I can try to implement that this weekend if it's urgent now.

Regards, Michael

On 11.01.2018 at 05:55, Greg D. wrote:

Hi Mark, Michael,

Ok, good news and bad news.

Good news: Rx problem I believe is fixed. Return is true only if we received something, else false. And the other interrupt conditions are handled at the same time, so no hangs are seen when restarting wifi. Rx overflow counter does increment properly. Yea! Code has been pushed to my clone on Github.

Bad news: I am still able to hang the bus, but I think it's on the transmit side. The obd2ecu process can send up to 3 frames back to back to report the ECU Name, followed soon after by several more with the VIN. Without any flow control on the transmit side, and with a half-duplex CAN bus, that's just too much. Turning off the VIN reporting (config set obd2ecu private yes) seems to let everything run because I don't respond to the VIN request (which lets everything drain as OBDWiz times out). Also verified by putting temporary delays in the obd2ecu code to let things drain a bit between frames. So, the transmit side is still a bit fragile, depending on timing. Not sure quite what to do here, as there is no easy place to queue things... Do we need to go back to the old way with a delay in the obd2ecu code (perhaps better than in the driver, no?). Architecturally it's ugly, but this only occurs at startup, and I don't mind the kludge. Do any other uses of the MCP busses do a burst of transmitting? If not, I'll put the delays in the obd2ecu code and call it close enough. Lemme know.

For receive, I'd go with what I have for now, if Michael would be so kind as to review what I have done first. https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/v... Hopefully he'll be back online before I get up in the morning. Wonderful how the Earth's spin helps with the teamwork.

I'll keep poking at things tonight, and take it out for a spin in the car tomorrow, just to see everything working together. But as it is now, it's much better than it was before. Really, this time. :)

Greg

Greg D. wrote:

Hi Mark,

I believe you are right about the multiple flags, and the code only processing Rx and "error" separately. Fundamentally, a roll-over from buffer 0 to buffer 1 isn't really an error, just a statement of fact on what happened. So, we should have buffer 1 and the rollover flag at the same time, which in fact is what I saw. I need to handle the Rx overflow at the same time as the buffer 1 receive, I think...

I need to grab some dinner, but have a fix in the works. Will report back in a few hours, hopefully with good news...

Greg

Mark Webb-Johnson wrote:

The design of the system is as follows:

* The can object CAN_rxtask listens on the rx queue to receive instructional messages from canbus drivers. These can be:
  o CAN_frame: simply passes an entire incoming can frame to the IncomingFrame handler.
  o CAN_rxcallback: an instruction for the CAN_rxtask to call the RxCallback task repeatedly.
  o CAN_txcallback: an instruction for the CAN_rxtask to call the TxCallback once.
* In the case of CAN_rxcallback, the canbus object RxCallback function is expected to return FALSE to indicate nothing should be done and RxCallback should not be called again, or TRUE to indicate an incoming frame has been received and should be passed to IncomingFrame.
* The system is arranged so that individual bus driver interrupt implementations can be fast and efficient.
  o The driver can choose to receive the frame in the interrupt handler itself, and pass it with CAN_frame to CAN_rxtask. The esp32 can driver uses this option.
  o Or the driver can choose to delay the reception of the frame to the RxCallback stage, and merely pass an indication with CAN_rxcallback. The mcp2515 driver uses this option.
* The true/false response from RxCallback is designed to allow the callback to signal it received a frame or not. If it received a frame, then it is called again.
* This approach is used in order to be able to centralise the reception of CAN frames to one single task (avoiding having to run individual tasks for each canbus, hence saving stack RAM).

The RxCallback should definitely ONLY return true if an actual can message has been received, and is being passed back in the frame pointer parameter.

I suspect the issue is that the mcp2515 RxCallback is being faced with multiple error flags. Changing that to a return true (as Greg has done) has the undesired side-effect of issuing a spurious IncomingFrame (with garbage/blank frame), but also causes the RxCallback to be called again (clearing the error flag). Perhaps the solution is to put a loop in RxCallback so that if an error condition is found, it should be cleared, but then loop again and keep clearing errors until no more are found, then return false? I think that in the mcp2515 case, this error clearing loop can be simply handled in the RxCallback itself.

The alternative is to change the RxCallback logic so that the return bool value means simply ‘loop’ (call me again, please), and have the RxCallback itself call IncomingFrame(), rather than passing a frame as a parameter. If Michael/Greg think this is a better approach, I am happy to make that change - it is pretty trivial.

Regards, Mark.
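The RxCallback contract and the suggested error-clearing loop can be sketched as below. This is a simulation under assumptions, not the mcp2515 driver: `FakeChip`, `ClearFlag` and `DrainRx` are hypothetical, the flag constants follow the MCP2515 CANINTF bit layout, a real driver would read and clear them over SPI, and the frame IDs are placeholders.

```cpp
#include <cassert>
#include <cstdint>

// Interrupt flag bits as laid out in the MCP2515 CANINTF register.
constexpr uint8_t RX0IF = 0x01;  // frame waiting in receive buffer 0
constexpr uint8_t RX1IF = 0x02;  // frame waiting in receive buffer 1
constexpr uint8_t ERRIF = 0x20;  // error condition (e.g. RX overflow)
constexpr uint8_t MERRF = 0x80;  // message error

// Simulated controller state (hypothetical, for illustration only).
struct FakeChip {
  uint8_t intf = 0;        // pending interrupt flags
  int errors_cleared = 0;  // how many error flags were acknowledged
};

struct Frame {
  uint32_t id = 0;
};

static void ClearFlag(FakeChip& chip, uint8_t flag) {
  chip.intf = static_cast<uint8_t>(chip.intf & ~flag);
}

// Returns true ONLY if a frame was stored in *frame. Error flags are
// cleared in a loop first, so they never cause a spurious frame.
bool RxCallback(FakeChip& chip, Frame* frame) {
  while (chip.intf & (ERRIF | MERRF)) {
    ClearFlag(chip, (chip.intf & ERRIF) ? ERRIF : MERRF);
    ++chip.errors_cleared;
  }
  if (chip.intf & RX0IF) {
    ClearFlag(chip, RX0IF);
    frame->id = 0x100;  // placeholder for the frame read from buffer 0
    return true;
  }
  if (chip.intf & RX1IF) {
    ClearFlag(chip, RX1IF);
    frame->id = 0x101;  // placeholder for the frame read from buffer 1
    return true;
  }
  return false;  // nothing received: the caller stops calling
}

// What the receive task does for one CAN_rxcallback message: call
// RxCallback until it returns false. Returns the number of frames seen.
int DrainRx(FakeChip& chip) {
  int frames = 0;
  Frame frame;
  while (RxCallback(chip, &frame)) {
    ++frames;  // a real task would pass the frame to IncomingFrame() here
  }
  return frames;
}
```

With a receive flag and an error flag set together, which is the case Greg describes, the error is acknowledged in the same pass and exactly one frame is delivered per pending buffer.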


Michael Balzer

10:14 a.m.

Just added an additional fix for this.

Regards, Michael

On 13.01.2018 at 19:03, Michael Balzer wrote:

...


Greg D.

10:44 a.m.

Michael Balzer

11:30 a.m.

Greg,

please do the same test including the OVMS log output at log level verbose with can trace on.

Additionally, when it hangs, please issue

can can3 rx standard 7df 02 01 00 00 00 00 00 ff
can can3 status
can can3 tx standard 7e8 06 41 00 18 19 00 01 ff
can can3 status

…still with wireshark capturing and without any restart of the obd2ecu process.

Thanks, Michael

On 13.01.2018 at 19:44, Greg D. wrote:

...

Hi Michael,

Much better. Crash is solved, but unfortunately when I remove the delays in the obd2ecu application (marked with "temporary" in the comments), the bus still hangs as before. Actually, slightly worse, because I used to be able to get by if I turned off the VIN reporting; now that hangs too. But it was working by the slimmest of margins before, so it could also be just by luck.

Wireshark trace of the interaction, attached. Turning off privacy (so I should reply to the VIN request) results in the same trace, so the bus is hanging right around that point. Notably, the receive side is hung too (i.e. I never got the VIN request), and counting frames (5 Rx, 7 Tx = 12), we can see the hang occurred right after the ECU Name was sent (frame 12), and before the VIN request (frame 13), which I never received.

Frame 13 in the trace is where the OBDWiz dongle is requesting the VIN. The bus is apparently hung at this point, so there is no reply. The next frame is the dongle re-connecting with the OVMS module after a timeout. The OBDWiz dongle retries the connect a few more times, then gives up.

'can can3 status' at this point is:

OVMS > can can3 status
CAN: can3
Mode: Active
Speed: 500000
Rx pkt: 5
Rx err: 0
Rx ovrflw: 0
Tx pkt: 7
Tx delays: 0
Tx err: 0
Tx ovrflw: 0
Err flags: 0
OVMS >

If I stop and restart the obd2ecu task, I can re-create this same sequence, so a close/open properly resets the chip/driver.

Hope this helps. Let me know what else I can do.

Greg

Michael Balzer wrote:

...

Greg D.

12:07 p.m.

Michael Balzer

14 Jan 14 Jan

8:52 a.m.

Greg,

On 13.01.2018 at 21:07, Greg D. wrote:

...

I tried it 3 times, and on the second attempt the dongle connected and ran without further interaction. Timing is on the hairy edge, so debug logging can affect the results.

If I understand this test, it seems that the receive side is hung, but not the transmit side. Interesting! Need to noodle on this a bit...

Yes, it's apparently just the MCP RX that is hung, the MCP TX, CAN framework and OBD2ECU system still work. Also, frame #15 should have triggered an RX overflow, but it didn't. So either the MCP has completely shut off the receiver (how, why?), or the error interrupts also get cut off (more likely). Maybe the interleaved RX frame (#9, flow control) triggers the problem. But Wireshark logs the RX occurring 277 us after the TX, so the IRQs can hardly overlap and there's plenty of time for the TX cleanup… …or maybe the mcp2515 has a bug: http://linux-can.vger.kernel.narkive.com/mPiIYqDN/patch-mcp251x-mcp2515-stop...

...

The mcp2515 sometimes seems to trigger an interrupt with the corresponding register not being set yet. This makes the driver exit the interrupt because there is obviously nothing to do, but the interrupt line is kept low. Therefore the driver does not see any more interrupts until the chip is reset (via interface down/up). […] I've had this problem before but with the MCP2515 connected to a Microchip Microcontroller. I've been working on other stuff and just got back to looking into this problem. Whilst it's not Linux based I was getting this exact problem and was forced to poll the MCP2515 instead of managing it via interrupts. Earlier today I got the best of both worlds by doing nothing in the ISR other than setting a FLAG and clearing the interrupt. The Flag is then picked up in the main processing loop of the code. It's a race condition and a flaw in the MCP2515, in my opinion.

This seems to match our issue. There's another comment:

...

Hmmmm, I think level triggered interrupt would help here.

We currently let the interrupt trigger on the negative edge (line 77). Maybe GPIO_INTR_LOW_LEVEL can help? Polling the IRQ flags for 10 ms is certainly a bad solution… Regards, Michael
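To make the suspected failure mode concrete, here is a minimal sketch of the race described in the quoted report, written as a plain C++ toy model rather than driver code. Everything in it (ChipModel, EdgeTriggered, LevelTriggered, servicedEvents) is hypothetical and only illustrates the assumption that the INT line can go low slightly before the flag register shows the cause.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of an interrupt source with an active-low INT line.
// Assumption (taken from the quoted report): the line may drop
// before the flag register shows why.
struct ChipModel {
    bool    intLow = false;  // state of the INT line
    uint8_t flags  = 0;      // flag register as read by the host

    void clear(uint8_t mask) {
        flags = static_cast<uint8_t>(flags & ~mask);
        if (flags == 0) intLow = false;  // line is released once nothing is pending
    }
};

// Fires only on a high-to-low transition of the line.
struct EdgeTriggered {
    bool lastLow = false;
    bool fires(const ChipModel& c) {
        bool edge = c.intLow && !lastLow;
        lastLow = c.intLow;
        return edge;
    }
};

// Fires whenever the line is low.
struct LevelTriggered {
    bool fires(const ChipModel& c) { return c.intLow; }
};

// Runs the race once and returns how many events the handler serviced.
template <typename Trigger>
int servicedEvents() {
    ChipModel chip;
    Trigger trigger;
    int serviced = 0;

    auto handler = [&]() {
        assert(chip.intLow);       // the handler only runs while the line is low
        uint8_t seen = chip.flags;
        if (seen == 0) return;     // "nothing to do": exit, line stays low
        chip.clear(seen);
        ++serviced;
    };

    chip.intLow = true;            // the line drops first...
    if (trigger.fires(chip)) handler();
    chip.flags = 0x01;             // ...the flag becomes visible afterwards
    for (int i = 0; i < 3; ++i)
        if (trigger.fires(chip)) handler();
    return serviced;
}
```

In this model the edge-triggered variant never services the event, because the line never returns high to produce a second edge, while the level-triggered variant picks it up on the next pass. Whether the real chip behaves this way in our setup is still an open question in this thread.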

...

Greg

Michael Balzer wrote:

...
Greg,

please do the same test including the OVMS log output at log level verbose with can trace on.

Additionally, when it hangs, please issue

can can3 rx standard 7df 02 01 00 00 00 00 00 ff
can can3 status
can can3 tx standard 7e8 06 41 00 18 19 00 01 ff
can can3 status

…still with wireshark capturing and without any restart of the obd2ecu process.

Thanks, Michael

On 13.01.2018 at 19:44, Greg D. wrote:

...
Hi Michael,

Much better. The crash is solved, but unfortunately when I remove the delays in the obd2ecu application (marked with "temporary" in the comments), the bus still hangs as before. Actually, slightly worse, because I used to be able to get by if I turned off the VIN reporting; now that hangs too. But it was working by the slimmest of margins before, so it could also be just by luck.

Wireshark trace of the interaction, attached. Turning off privacy (so I should reply to the VIN request) results in the same trace, so the bus is hanging right around that point. Notably, the receive side is hung too (i.e. I never got the VIN request), and counting frames (5 Rx, 7 Tx = 12), we can see the hang occurred right after the ECU Name was sent (frame 12), and before the VIN request (frame 13), which I never received.

Frame 13 in the trace is where the OBDWiz dongle is requesting the VIN. The bus is apparently hung at this point, so there is no reply. The next frame is the dongle re-connecting with the OVMS module after a timeout. The OBDWiz dongle retries the connect a few more times, then gives up.

'can can3 status' at this point is:

OVMS > can can3 status
CAN: can3
Mode: Active
Speed: 500000
Rx pkt: 5
Rx err: 0
Rx ovrflw: 0
Tx pkt: 7
Tx delays: 0
Tx err: 0
Tx ovrflw: 0
Err flags: 0
OVMS >

If I stop and restart the obd2ecu task, I can re-create this same sequence, so a close/open properly resets the chip/driver.

Hope this helps. Let me know what else I can do.

Greg

Michael Balzer wrote:

...
Just added an additional fix for this.

Regards, Michael

On 13.01.2018 at 19:03, Michael Balzer wrote:

...
Please check again.

Thanks, Michael

On 13.01.2018 at 18:42, Greg D. wrote:

...
Gave it a quick try, and got a crash... I'll see if I can isolate it a bit, but here's something to start with. Tombstone, attached.

Greg

Geir Øyvind Vælidalo wrote:
> I currently don’t send anything on can2. I could try to send something, but the car is away this weekend :-(
>
> Geir
>
>> On 13 Jan 2018 at 17:24, Michael Balzer <dexter@expeedo.de> wrote:
>>
>> Part one (TX queue) done & pushed.
>>
>> OVMS > can can1 status
>> CAN: can1
>> Mode: Active
>> Speed: 500000
>> Rx pkt: 236657
>> Rx err: 1
>> Rx ovrflw: 0
>> Tx pkt: 106378
>> Tx delays: 4
>> Tx err: 0
>> Tx ovrflw: 0
>> Err flags: 0x800caa
>>
>> TX performance is rock steady on can1 -- the delays occurred when sending the stop charge request (as expected). I can't test can2/3, Greg & Geir, could you…?
>>
>> The TxCallback can't be used on the mcp2515. The ISR can't query the IRQ register, so the TX IRQs are now also handled by the RxCallback(). As the TX IRQs need to be cleared before loading the next frame, this needs another SPI call. I hope that doesn't introduce new problems.
>>
>> No changes are necessary to the application code (well, except you can remove any hard coded delays now). The TX queue has a length of 20 frames and will automatically be used by the drivers when no TX buffers are free.
>>
>> If an application wants to know whether a frame was sent immediately or gets delayed it can check the return code of the Write() method. Write() now also can take a second parameter for the maximum wait time for space in the TX queue to become available if it's full (default 0 = fail immediately if queue is full).
>>
>> I also added logging of CAN errors. It's currently activated by "can … trace on", I don't think this needs to be active by default, just for CAN issue debugging.
>>
>> E (45718) can: Error can1 rxpkt=3 txpkt=0 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0
>> E (83528) can: Error can1 rxpkt=7483 txpkt=226 errflags=0x800caa rxerr=1 txerr=0 rxovr=0 txovr=0 txdelay=0
>>
>> …that's also a first part of the logging extension (part two).
>>
>> Regards,
>> Michael
>>
>> On 12.01.2018 at 19:01, Michael Balzer wrote:
>>> Yes, I had something like that in mind. On TX IRQ, the drivers send CAN_txcallbacks to the CAN_rxtask. The CAN_rxtask then fetches frames from the TX queue and calls the TxCallback until all TX buffers of the driver are full. From the already existing TxCallback() stubs I suppose you had planned a scheme like that already? ;)
>>>
>>> Greg, can you create a pull request for your MCP2515 change? I'd like to merge that before beginning on the drivers.
>>>
>>> Thanks,
>>> Michael
>>>
>>> On 12.01.2018 at 01:19, Mark Webb-Johnson wrote:
>>>> Option B sounds like a good approach.
>>>>
>>>> Presumably we are just polling the tx queue in the existing CAN_rxtask based on TxCallback?
>>>>
>>>> Regards, Mark.
>>>>
>>>>> On 11 Jan 2018, at 8:42 PM, Michael Balzer <dexter@expeedo.de> wrote:
>>>>>
>>>>> Greg, Mark,
>>>>>
>>>>> I can check your new code after work.
>>>>>
>>>>> For the TX performance/overflow issue, there are basically two options:
>>>>>
>>>>> * A: make all application TX be aware of overflows, i.e. check the return value of the CAN Write() call as necessary and/or introduce sufficient delays (very ugly)
>>>>> * B: add a TX queue to the CAN framework, so the application can just push some frames as fast as it likes, with an option to wait/block/fail if the queue is full
>>>>>   o → the framework checks for TX buffers becoming available *(i.e. driver issuing a TxCallback request)* and delivers queued frames only as fast as the driver can handle them
>>>>>
>>>>> Option B has been on my todo list since removing the delay from the MCP driver and introducing the TX buffer check in the esp32can driver, as I don't think applications should need to handle TX overflows.
>>>>>
>>>>> I can try to implement that this weekend if it's urgent now.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>> On 11.01.2018 at 05:55, Greg D. wrote:
>>>>>> Hi Mark, Michael,
>>>>>>
>>>>>> Ok, good news and bad news.
>>>>>>
>>>>>> Good news: Rx problem I believe is fixed. Return is true only if we received something, else false. And the other interrupt conditions are handled at the same time, so no hangs are seen when restarting wifi. Rx overflow counter does increment properly. Yea! Code has been pushed to my clone on Github.
>>>>>>
>>>>>> Bad news: I am still able to hang the bus, but I think it's on the transmit side. The obd2ecu process can send up to 3 frames back to back to report the ECU Name, followed soon after by several more to grab the VIN. Without any flow control on the transmit side, and with a half-duplex CAN bus, that's just too much. Turning off the VIN reporting (config set obd2ecu private yes) seems to let everything run because I don't respond to the VIN request (which lets everything drain as OBDWiz times out). Also verified by putting temporary delays in the obd2ecu code to let things drain a bit between frames. So, the transmit side is still a bit fragile, depending on timing. Not sure quite what to do here, as there is no easy place to queue things... Do we need to go back to the old way with a delay in the obd2ecu code (perhaps better than in the driver, no?). Architecturally it's ugly, but this only occurs at startup, and I don't mind the kludge. Do any other uses of the MCP busses do a burst of transmitting? If not, I'll put the delays in the obd2ecu code and call it close enough. Lemme know.
>>>>>>
>>>>>> For receive, I'd go with what I have for now, if Michael would be so kind as to review what I have done first. https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/v... Hopefully he'll be back on line before I get up in the morning. Wonderful how the Earth's spin helps with the teamwork.
>>>>>>
>>>>>> I'll keep poking at things tonight, and take it out for a spin in the car tomorrow, just to see everything working together. But as it is now, it's much better than it was before. Really, this time. :)
>>>>>>
>>>>>> Greg
>>>>>>
>>>>>> Greg D. wrote:
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> I believe you are right about the multiple flags, and the code only processing Rx and "error" separately. Fundamentally, a roll-over from buffer 0 to buffer 1 isn't really an error, just a statement of fact on what happened. So, we should have buffer 1 and the rollover flag at the same time, which in fact is what I saw. I need to handle the Rx overflow at the same time as the buffer 1 receive, I think...
>>>>>>>
>>>>>>> I need to grab some dinner, but have a fix in the works. Will report back in a few hours, hopefully with good news...
>>>>>>>
>>>>>>> Greg
>>>>>>>
>>>>>>> Mark Webb-Johnson wrote:
>>>>>>>> The design of the system is as follows:
>>>>>>>>
>>>>>>>> * The can object CAN_rxtask listens on the rx queue to receive instructional messages from canbus drivers. These can be:
>>>>>>>>   o CAN_frame: simply passes an entire incoming can frame to the IncomingFrame handler.
>>>>>>>>   o CAN_rxcallback: an instruction for the CAN_rxtask to call the RxCallback task repeatedly.
>>>>>>>>   o CAN_txcallback: an instruction for the CAN_rxtask to call the TxCallback once.
>>>>>>>> * In the case of CAN_rxcallback, the canbus object RxCallback function is expected to return FALSE to indicate nothing should be done and RxCallback should not be called again, or TRUE to indicate an incoming frame has been received and should be passed to IncomingFrame.
>>>>>>>> * The system is arranged so that individual bus driver interrupt implementations can be fast and efficient.
>>>>>>>>   o The driver can choose to receive the frame in the interrupt handler itself, and pass it with CAN_frame to CAN_rxtask. The esp32 can driver uses this option.
>>>>>>>>   o Or the driver can choose to delay the reception of the frame to the RxCallback stage, and merely pass an indication with CAN_rxcallback. The mcp2515 driver uses this option.
>>>>>>>> * The true/false response from RxCallback is designed to allow the callback to signal it received a frame or not. If it received a frame, then it is called again.
>>>>>>>> * This approach is used in order to be able to centralise the reception of CAN frames to one single task (avoiding having to run individual tasks for each canbus, hence saving stack RAM).
>>>>>>>>
>>>>>>>> The RxCallback should definitely ONLY return true if an actual can message has been received, and is being passed back in the frame pointer parameter.
>>>>>>>>
>>>>>>>> I suspect the issue is that the mcp2515 RxCallback is being faced with multiple error flags. Changing that to a return true (as Greg has done) has the undesired side-effect of issuing a spurious IncomingFrame (with garbage/blank frame), but also causes the RxCallback to be called again (clearing the error flag). Perhaps the solution is to put a loop in RxCallback so that if an error condition is found, it should be cleared, but then loop again and keep clearing errors until no more are found, then return false? I think that in the mcp2515 case, this error clearing loop can be simply handled in the RxCallback itself.
>>>>>>>>
>>>>>>>> The alternative is to change the RxCallback logic so that the return bool value means simply ‘loop’ (call me again, please), and have the RxCallback itself call IncomingFrame(), rather than passing a frame as a parameter. If Michael/Greg think this is a better approach, I am happy to make that change - it is pretty trivial.
>>>>>>>>
>>>>>>>> Regards, Mark.
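The TX queue behaviour described in the quoted 13 Jan message (option B) can be sketched roughly as follows. This is a single-threaded toy model under stated assumptions, not the actual OVMS framework code: QueuedBus, Frame, WriteResult and OnTxDone are hypothetical names, the wait-time parameter of Write() is left out, and the counts of 3 hardware buffers and 20 queue slots are only passed in as examples.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical frame type, for illustration only.
struct Frame {
    uint32_t id;
    uint8_t  len;
    uint8_t  data[8];
};

// What the caller learns from a write attempt.
enum class WriteResult { Sent, Queued, Failed };

// Toy model of a bus with a few hardware TX buffers and a software queue behind them.
class QueuedBus {
public:
    QueuedBus(std::size_t hwBuffers, std::size_t queueLen)
        : m_free(hwBuffers), m_queueLen(queueLen) {
        assert(hwBuffers > 0);
    }

    // Load the frame into a free hardware buffer if possible, otherwise queue it,
    // and fail immediately when the queue is full (the "wait time 0" case).
    WriteResult Write(const Frame& f) {
        if (m_free > 0 && m_queue.empty()) {
            --m_free;
            ++m_loaded;
            return WriteResult::Sent;
        }
        if (m_queue.size() < m_queueLen) {
            m_queue.push_back(f);
            return WriteResult::Queued;
        }
        return WriteResult::Failed;
    }

    // Called when the hardware reports one TX buffer as free again:
    // deliver queued frames only as fast as buffers become available.
    void OnTxDone() {
        ++m_free;
        while (m_free > 0 && !m_queue.empty()) {
            m_queue.pop_front();
            --m_free;
            ++m_loaded;
        }
    }

    std::size_t Loaded() const  { return m_loaded; }
    std::size_t Pending() const { return m_queue.size(); }

private:
    std::size_t m_free;
    std::size_t m_queueLen;
    std::size_t m_loaded = 0;
    std::deque<Frame> m_queue;
};
```

With `QueuedBus bus(3, 20)`, a burst of 25 writes would see 3 frames loaded directly, 20 queued and 2 rejected; each later `OnTxDone()` moves one queued frame into the freed buffer. The application only needs to look at the return value if it cares whether a frame was delayed.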

Michael Balzer

9:50 a.m.

Greg,

here's another report of level triggering solving the issue: https://community.nxp.com/thread/456907

Can you please give that a try? I.e. change line 77 to:

gpio_set_intr_type((gpio_num_t)m_intpin, GPIO_INTR_LOW_LEVEL);

I'll also get a DB9 plug to implement Mark's test solution.

Regards, Michael

On 14.01.2018 at 17:52, Michael Balzer wrote:

...

...
Hmmmm, I think level triggered interrupt would help here.

We currently let the interrupt trigger on the negative edge (line 77). Maybe GPIO_INTR_LOW_LEVEL can help?


Greg D.

11:36 a.m.

Michael Balzer

12:11 p.m.

Greg,

ah yes, the esp-idf currently doesn't implement oneshot interrupts on levels, we need to do that ourselves.

Something along the lines of this: https://github.com/espressif/esp-idf/issues/1234#issuecomment-342320583

…just with reversed logic, as we get triggered on low.

Regards, Michael

On 14.01.2018 at 20:36, Greg D. wrote:

...

hi Michael,

Good try, but the whole system hangs, so I'm guessing that we're stuck in an infinite interrupt loop. But that suggests that the issue is related to unbalanced interrupt processing somehow.

I'm going to try for some diagnostic logging once it gets stuck, and see if I can identify its state. The absolute reproducibility of the base issue leads me to think the reported empty interrupt issue (your prior email) is probably not it, but rather the driver is not handling some confluence of events properly. I have no trouble running the HUD display, for example, but it doesn't produce the rapid transmit frames of the OBDWiz dongle. Something about the transmit side is messing up the receive... I wonder if we're getting an interrupt for both the last Tx frame being sent and the VIN request PID being received at the same time? Transmit interrupts don't check for receive, and vice-versa.

Will report back later today...

Greg
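The two ideas in this exchange, a one-shot level interrupt and servicing RX and TX flags in the same pass, can be sketched together in a small toy model. This is not the driver code: ChipModel and OneShotLevelIrq are hypothetical, the flag values are the CANINTF bits from the MCP2515 datasheet, and on the ESP32 the masking and re-arming steps would presumably map to gpio_intr_disable() and gpio_intr_enable().

```cpp
#include <cassert>
#include <cstdint>

// MCP2515 CANINTF flag bits (values per the MCP2515 datasheet).
constexpr uint8_t RX0IF = 0x01;  // frame received in RX buffer 0
constexpr uint8_t TX0IF = 0x04;  // TX buffer 0 finished sending

// Toy chip: INT is held low while any CANINTF flag is set.
struct ChipModel {
    uint8_t canintf = 0;
    bool intLow() const { return canintf != 0; }
};

// Hypothetical one-shot wrapper around a level-triggered interrupt.
struct OneShotLevelIrq {
    bool enabled     = true;
    bool workPending = false;
    int  isrCalls    = 0;

    // Stands in for the hardware: a level interrupt fires for as long as
    // the line is low and the interrupt is enabled.
    void poll(const ChipModel& chip) {
        if (enabled && chip.intLow()) {
            ++isrCalls;
            enabled = false;      // one-shot: mask until the task has run
            workPending = true;   // hand the real work to task context
        }
    }

    // Task context: service every pending flag (RX and TX alike), then re-arm.
    // Returns the number of flags handled.
    int serviceTask(ChipModel& chip) {
        if (!workPending) return 0;
        assert(!enabled);         // still masked while work is pending
        int handled = 0;
        while (chip.canintf != 0) {
            uint8_t seen = chip.canintf;
            for (int i = 0; i < 8; ++i)
                if (seen & (1u << i)) ++handled;
            chip.canintf = static_cast<uint8_t>(chip.canintf & ~seen);
        }
        workPending = false;
        enabled = true;           // re-arm only after the line has been released
        return handled;
    }
};
```

Without the masking step the model would count an interrupt on every poll while the line is low, which is one plausible reading of the whole-system hang reported above. Servicing all flags before re-arming also covers the case of a TX-complete and an RX flag arriving together.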


Greg D.

15 Jan 15 Jan

7:48 p.m.

Mark Webb-Johnson

8:07 p.m.

In vino veritas.

...

On 16 Jan 2018, at 11:48 AM, Greg D. <gregd2350@gmail.com> wrote:

Hi all, (but mostly Michael...),

I'm sitting here at my workbench, with a glass of Merlot in one hand, and a red velvet and white chocolate cookie in the other, staring at a yet-again hung CAN3 bus on the v3 module. A sequence of multiple Transmits, closely spaced, can cause the chip to stop processing (issuing) interrupts, hanging the receive side, and making the transmit side operate strangely. Nothing I do seems to help, and Google searches are not encouraging.

Looking at the chip's programming manual, I start to see a lot of "feature" (complication) in the transmit buffering. Multiple priorities, etc., none of which we are (should be) using. So, since Michael has implemented a very efficient excess-frame queuing mechanism, how about we just use a single transmit buffer at a time, and queue the rest? Not quite as good as double buffering the transmit, but still a lot faster than the original fixed delay. Will it fix the hang?

Bingo!

Fix is simple and implemented, and seems to work. I will do some more testing before committing later tonight. Perhaps another glass of Merlot is in order. Or, maybe an old-vine Zin...

Greg


Greg D.

9:59 p.m.

Mark Webb-Johnson wrote:

...

In vino veritas.

Indeed!

Fixes look good (can't seem to kill it). Pushed to my fork on GitHub, along with removal of the delays in the obd2ecu.cpp code.

This was done on the older tool chain; wanted to be sure the fix was real before upgrading and potentially hiding the issue with a slight change in timing. Will upgrade the toolchain tomorrow and verify.

Greg

Mark Webb-Johnson

16 Jan 16 Jan

12:26 a.m.

I think best for @michael to review. A single tx buffer is certainly easier to manage. Regards, Mark.

...

On 16 Jan 2018, at 1:59 PM, Greg D. <gregd2350@gmail.com> wrote:

Mark Webb-Johnson wrote:

...
In vino veritas.

Indeed!

Fixes look good (can't seem to kill it). Pushed to my fork on Git Hub, along with removal of the delays in the obd2ecu.cpp code.

This was done on the older tool chain; wanted to be sure the fix was real before upgrading and potentially hiding the issue with a slight change in timing. Will upgrade the toolchain tomorrow and verify.

Greg


Greg D.

12:47 a.m.

Hi Mark,

Yes, absolutely. In fact, I just thought of a corner case while nodding off to sleep (hate it when that happens!). Fixed, verified, and pushed to my branch. Anxiously awaiting review from Michael...

Greg

Mark Webb-Johnson wrote:

...

I think best for @michael to review. A single tx buffer is certainly easier to manage.

Regards, Mark.


Michael Balzer

9:59 a.m.

Greg, Mark,

I'm having two basic issues with this:

a) This will limit the TX speed achievable on the mcp interfaces. I generally hate not fully utilizing available hardware capabilities. And this is SPI, filling a TX buffer is expensive.

b) It again feels like fixing without understanding. I really think we should first understand how the reported race condition of the mcp interrupt signal affects our driver design, and we should check the effect of level interrupts before resorting to limiting the speed.

The mcp SPI interface is running at 1 MHz, so a TX will … wait a moment, why is it running at 1 MHz? According to the data sheet, it can do 10 MHz?

Mark, is there some scaling done in the spinodma module, or is there a reason (other than typo / copy-paste) for this?

On 16.01.2018 at 09:47, Greg D. wrote:

...

Hi Mark,

Yes, absolutely. In fact, I just thought of a corner case while nodding off to sleep (hate it when that happens!). Fixed, verified, and pushed to my branch. Anxiously awaiting review from Michael...

Greg


Michael Balzer

10:29 a.m.

Greg,

regarding your changes to reduce to TX buffer 1, I think they will do so. I don't think it's necessary to check buffers 2 & 3 busy flags at all, but it shouldn't hurt either.

On the TX limits this introduces: each TX needs…

* 4 bytes for the interrupt check
* 4 bytes for the TX clear
* 2 bytes for the status read
* 14 bytes for the buffer
* 1 byte for the RTS

…total 25 bytes = 200 bits. So at 1 MHz the throughput will be limited to 5,000 frames per second, and the latency will be at least 200 us.

That should still be sufficient for most situations. If you need more speed → can1. Plus maybe the 1 MHz is a typo.

So, while my basic issues still apply, we can use that solution for now.

Regards, Michael

On 16.01.2018 at 18:59, Michael Balzer wrote:

...

Greg, Mark,

I'm having two basic issues with this:

a) This will limit the TX speed achievable on the mcp interfaces. I generally hate not fully utilizing available hardware capabilities. And this is SPI, filling a TX buffer is expensive.

b) It again feels like fixing without understanding. I really think we should first understand how the reported race condition of the mcp interrupt signal affects our driver design, and we should check the effect of level interrupts before resorting to limiting the speed.

The mcp SPI interface is running at 1 MHz, so a TX will … wait a moment, why is it running at 1 MHz? According to the data sheet, it can do 10 MHz?

Mark, is there some scaling done in the spinodma module, or is there a reason (other than typo / copy-paste) for this?
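The per-frame SPI cost from the 10:29 message can be worked through in a few lines. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it ignores chip-select handling, driver overhead and the time the frame spends on the CAN bus itself, so the real figures would be somewhat worse.

```cpp
#include <cassert>
#include <cstdint>

// Byte counts per transmitted frame, as listed in the 10:29 message:
// interrupt check + TX clear + status read + buffer load + RTS.
constexpr uint32_t kBytesPerTx = 4 + 4 + 2 + 14 + 1;  // 25 bytes
constexpr uint32_t kBitsPerTx  = kBytesPerTx * 8;     // 200 bits

// Minimum SPI time per frame in microseconds at the given SPI clock.
inline uint64_t spiMicrosPerFrame(uint32_t spiHz) {
    assert(spiHz > 0);
    return static_cast<uint64_t>(kBitsPerTx) * 1000000u / spiHz;
}

// Upper bound on frames per second imposed by the SPI traffic alone.
inline uint32_t maxFramesPerSecond(uint32_t spiHz) {
    return spiHz / kBitsPerTx;
}
```

At 1 MHz this gives 200 us per frame and 5,000 frames per second, matching the figures in the message. At the 10 MHz the data sheet allows, the same arithmetic gives 20 us and 50,000 frames per second.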

Am 16.01.2018 um 09:47 schrieb Greg D.:

...
Hi Mark,

Yes, absolutely. In fact, I just thought of a corner case while nodding off to sleep (hate it when that happens!). Fixed, verified, and pushed to my branch. Anxiously awaiting review from Michael...

Greg

Mark Webb-Johnson wrote:

...
I think best for @michael to review. A single tx buffer is certainly easier to manage.

Regards, Mark.

...
On 16 Jan 2018, at 1:59 PM, Greg D. <gregd2350@gmail.com> wrote:

Mark Webb-Johnson wrote:

...
In vino veritas.

Indeed!

Fixes look good (can't seem to kill it). Pushed to my fork on GitHub, along with removal of the delays in the obd2ecu.cpp code.

This was done on the older tool chain; wanted to be sure the fix was real before upgrading and potentially hiding the issue with a slight change in timing. Will upgrade the toolchain tomorrow and verify.

Greg

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
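
As a quick check of the TX budget in the message above, the arithmetic can be sketched in a few lines of Python. The byte counts are the ones listed in the message; the function name `tx_budget` is only illustrative, and the calculation assumes the SPI transfer time is the sole cost (no chip-select gaps or driver overhead), so the results are best-case bounds. The 10 MHz row uses the data sheet figure raised in the quoted message.

```python
# Per-frame SPI cost, using the byte counts listed in the message above.
TX_BYTES = {
    "interrupt check": 4,
    "TX clear": 4,
    "status read": 2,
    "buffer load": 14,
    "RTS": 1,
}


def tx_budget(spi_clock_hz):
    """Return (bits per frame, minimum latency in us, maximum frames per second)."""
    bits = sum(TX_BYTES.values()) * 8
    latency_us = bits * 1e6 / spi_clock_hz
    frames_per_s = spi_clock_hz / bits
    return bits, latency_us, frames_per_s


for clock_hz in (1_000_000, 10_000_000):
    bits, latency_us, fps = tx_budget(clock_hz)
    print(f"{clock_hz / 1e6:.0f} MHz: {bits} bits, >= {latency_us:.0f} us, <= {fps:.0f} frames/s")
```

At 1 MHz this reproduces the 200 us and 5,000 frames per second figures; at 10 MHz the same 200 bits would take 20 us on the wire.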

Greg D.

11:21 a.m.

Mark Webb-Johnson

4:45 p.m.

No specific reason for the clock speed to be 1MHz. I know that I have changed it multiple times during development of the initial module, and often run it very low (as it is easier to get complete logic analyser captures, and store more capture data, at lower clock speeds). I suspect the 1MHz choice was a compromise between performance and being able to store more than a couple of seconds of logic analyser capture.

I think our circuit should be able to cope with 10MHz. MAX7317 is good for up to 26MHz. MCP2515 should be ok up to 10MHz. Espressif documentation says:

While in general, speeds up to 80MHz on the dedicated SPI pins and 40MHz on GPIO-matrix-routed pins are supported, full-duplex transfers routed over the GPIO matrix only support speeds up to 26MHz.

Clock should be fine at 10MHz. Can you try it (simple fix in ovms_peripherals.cpp lines 118 and 120)? MAX7317 is at line 93 and probably worth setting to the same speed, although I don’t think it matters much if they are different as control is on the CS lines anyway. That said, we have to drive the MISO line on MAX7317 via a tri-state on the CS pin, so that will be speed limited via the tri-state.

If it works ok for you at that speed, I’ll put it on an oscilloscope later this week and just verify the signals still look clean. I did that on the original board design, and they seemed ok.

Regards, Mark.

...

On 17 Jan 2018, at 1:59 AM, Michael Balzer <dexter@expeedo.de> wrote:

Greg, Mark,

I'm having two basic issues with this:

a) This will limit the TX speed achievable on the mcp interfaces. I generally hate not fully utilizing available hardware capabilities. And this is SPI, filling a TX buffer is expensive.

b) It again feels like fixing without understanding. I really think we should first understand how the reported race condition of the mcp interrupt signal affects our driver design, and we should check the effect of level interrupts before resorting to limiting the speed.

The mcp SPI interface is running at 1 MHz, so a TX will … wait a moment, why is it running at 1 MHz? According to the data sheet, it can do 10 MHz?

Mark, is there some scaling done in the spinodma module, or is there a reason (other than typo / copy-paste) for this?

On 16.01.2018 at 09:47, Greg D. wrote:

...
Hi Mark,

Yes, absolutely. In fact, I just thought of a corner case while nodding off to sleep (hate it when that happens!). Fixed, verified, and pushed to my branch. Anxiously awaiting review from Michael...

Greg

Mark Webb-Johnson wrote:

...
I think best for @michael to review. A single tx buffer is certainly easier to manage.

Regards, Mark.

...
On 16 Jan 2018, at 1:59 PM, Greg D. <gregd2350@gmail.com> wrote:

Mark Webb-Johnson wrote:

...
In vino veritas.

Indeed!

Fixes look good (can't seem to kill it). Pushed to my fork on GitHub, along with removal of the delays in the obd2ecu.cpp code.

This was done on the older tool chain; wanted to be sure the fix was real before upgrading and potentially hiding the issue with a slight change in timing. Will upgrade the toolchain tomorrow and verify.

Greg

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

Greg D.

6:38 p.m.

Greg D.

17 Jan

3:34 p.m.

Still running! No issues that I can detect, though it's a fairly focused test: CAN-3 mostly, verified CAN-1 still connects to car.

Greg

Greg D. wrote:

...

Hi Mark,

Changed all three to 10 MHz. I only have stuff hung off CAN3 right now, but the change to 10 MHz seems to work. Wireshark shows that the inter-frame time for the back-to-back messages reduced from somewhere around 800 us to about 550 us. Certainly helps!

Surprisingly, the variability between runs is a lot less than before. No idea why.

I'll let it run this way overnight, with obd2ecu talking to OBDWiz, and report if any issues become apparent.

Greg

Mark Webb-Johnson

3:54 p.m.

I’ve committed a change to make both MAX7317 and MCP2515 10 MHz.

Regards, Mark.

...

On 18 Jan 2018, at 7:34 AM, Greg D. <gregd2350@gmail.com> wrote:

Still running! No issues that I can detect, though it's a fairly focused test: CAN-3 mostly, verified CAN-1 still connects to car.

Greg

Greg D. wrote:

...
Hi Mark,

Changed all three to 10 MHz. I only have stuff hung off CAN3 right now, but the change to 10 MHz seems to work. Wireshark shows that the inter-frame time for the back-to-back messages reduced from somewhere around 800 us to about 550 us. Certainly helps!

Surprisingly, the variability between runs is a lot less than before. No idea why.

I'll let it run this way overnight, with obd2ecu talking to OBDWiz, and report if any issues become apparent.

Greg

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

Michael Balzer

11 Jan

9:28 a.m.

Greg,

looks correct, just some general notes:

* Line 252: "so why are we even here?" -- because the RxCallback() will be called again after fetching an RX buffer to check for the other RX buffer. If no other RX and no error occurred, the callback loop can be terminated. It may also be that the RxCallback() loop fetches an RX that occurred after the trigger of the current loop, so the framework will call again after polling the intermediate RxCallback request, but the buffer will already be empty.
* Line 294: this should be a warning level log, also the "\n" at the end isn't necessary for a log.
* Code style: for readability and maintainability, please try to adopt the general coding style (at least indentation & line break / space conventions) of the module you're changing, or run a formatter on your changes before committing.

Regards, Michael

On 11.01.2018 at 05:55, Greg D. wrote:

...

For receive, I'd go with what I have for now, if Michael would be so kind as to review what I have done first. https://github.com/bitsofgreg/Open-Vehicle-Monitoring-System-3/blob/master/v... Hopefully he'll be back on line before I get up in the morning. Wonderful how the Earth's spin helps with the teamwork.

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
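
The Line 252 note can be illustrated with a small simulation. This is only a sketch in Python, not the driver code: `FakeController`, `rx_callback` and `drain` are hypothetical names, and the real RxCallback() lives in the C++ MCP2515 driver. It shows why the callback ends up being called once with nothing to fetch: each successful fetch requests another call to check the other RX buffer, and the loop ends on the first call that finds no frame and no error.

```python
from collections import deque


class FakeController:
    """Hypothetical stand-in for a CAN controller with two RX buffers."""

    def __init__(self, frames, error_pending=False):
        self.rx_buffers = deque(frames)  # at most two frames in this model
        self.error_pending = error_pending

    def rx_callback(self):
        """Fetch at most one frame. Returns (frame, call_again)."""
        if self.rx_buffers:
            # A frame was fetched: request another call, because the other
            # RX buffer may hold a frame as well.
            return self.rx_buffers.popleft(), True
        if self.error_pending:
            # An error was handled: request another call to re-check.
            self.error_pending = False
            return None, True
        # No RX and no error: the callback loop can terminate.
        return None, False


def drain(controller):
    """Call rx_callback() until it reports nothing left; count the calls."""
    received, calls = [], 0
    while True:
        frame, call_again = controller.rx_callback()
        calls += 1
        if frame is not None:
            received.append(frame)
        if not call_again:
            return received, calls


print(drain(FakeController(["frame A", "frame B"])))
```

With two pending frames the loop makes three calls: two fetches and a final empty one, which is the "why are we even here?" case.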

Greg D.

9:36 a.m.

Greg D.

9:59 a.m.

Greg D.

10:05 a.m.

Stephen Casner

10:07 a.m.

On Thu, 11 Jan 2018, Michael Balzer wrote:

...

* Line 294: this should be a warning level log, also the "\n" at the end isn't necessary for a log.

This should be stated more strongly: Not only is it not necessary, it should just not be there. It is also not optional, it won't be ignored. It used to cause a blank line in the output, but now it causes a '|' character to be appended after a change that Michael made a while back.

Michael, when I first saw that change you made to translate all \n or \r except the last to be '|', thereby unfolding a multi-line log into one line, I did not really like that idea but I didn't comment at the time. It seems to me that if the format of a multi-line log message causes a problem then it would be best to go back to the original log message and change it. Perhaps you were bothered by some log message in code that is not under our control? Can you show some examples?

I bring this up now because that bit of code needs to move as I implement the facility to log on all consoles, and I would like to consider removing it.

-- Steve
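
To make the behaviour under discussion concrete, here is a minimal Python sketch of the folding as described in this message: every CR or LF except the final line terminator becomes '|'. It is an illustration only, not the OVMS logging code; `fold_log_line` and the `demo` log tag are made up, and it assumes the logging layer appends its own newline to each entry.

```python
def fold_log_line(formatted):
    """Replace every CR/LF except the final line terminator with '|'."""
    if not formatted:
        return formatted
    body, last = formatted[:-1], formatted[-1]
    body = body.replace("\r", "|").replace("\n", "|")
    return body + last


# The message itself ends in "\n", and the logging layer adds another one.
entry = "W (123) demo: buffer overflow\n" + "\n"
print(repr(fold_log_line(entry)))

# A multi-line notification is unfolded into a single line.
print(repr(fold_log_line("SOC: 70.0%\nSOH: 100%\n")))
```

A message that already ends in "\n" therefore gets a trailing '|' instead of producing a blank line, which is the effect described for line 294.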

Michael Balzer

10:28 a.m.

Steve,

LFs will be injected into the log if a notification or command result is sent:

I (544365) ovms-server-v2: Send MP-0 PITopping off|CHG: 0 (~0) Wh|Full: 124 min.|Range: 50 - 48Km|SOC: 70.0% (70.0..70.0%)|ODO: 47026.7Km|CAP: 89.7% 96.9Ah|SOH: 100%|
D (550265) ovms-server-v2: TransmitNotifyData: msg=RT-BAT-P,1,86400,1,1,6998,6998,6998,574,574,574,29,29,29,5,1,190,20|

These were the reason for my change. Will probably also occur if a user command received contains CR/LF.

At the moment, the simcom module also injects CR/LF on tx logs:

D (28345) simcom: tx scmd ch=0 len=4 : AT||
D (44345) simcom: tx scmd ch=0 len=103 : AT+CPIN?;+CREG=1;+CTZU=1;+CTZR=1;+CLIP=1;+CMGF=1;+CNMI=1,2,0,0,0;+CSDH=1;+CMEE=2;+CSQ;+AUTOCSQ=1,1;E0||

Don't know if that covers all cases.

Basically this also applies everywhere a string of external / user supplied source gets logged (little Bobby Tables). I can for example imagine a broken Wifi network sending an SSID containing CR/LF.

Regards, Michael

On 11.01.2018 at 19:07, Stephen Casner wrote:

...

On Thu, 11 Jan 2018, Michael Balzer wrote:

...
* Line 294: this should be a warning level log, also the "\n" at the end isn't necessary for a log.

This should be stated more strongly: Not only is it not necessary, it should just not be there. It is also not optional, it won't be ignored. It used to cause a blank line in the output, but now it causes a '|' character to be appended after a change that Michael made a while back.

Michael, when I first saw that change you made to translate all \n or \r except the last to be '|', thereby unfolding a multi-line log into one line, I did not really like that idea but I didn't comment at the time. It seems to me that if the format of a multi-line log message causes a problem then it would be best to go back to the original log message and change it. Perhaps you were bothered by some log message in code that is not under our control? Can you show some examples?

I bring this up now because that bit of code needs to move as I implement the facility to log on all consoles, and I would like to consider removing it.

-- Steve

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

Stephen Casner

4:01 p.m.

Michael,

I've implemented the multi-console logging (on the "for-master" branch). I kept your CR/LF replacement, but added another loop to remove any vertical bars at the end of the aggregated line. I also changed the writing to the log file to be after the CR/LF replacement rather than before.

-- Steve

On Thu, 11 Jan 2018, Michael Balzer wrote:

...

Steve,

LFs will be injected into the log if a notification or command result is sent:

I (544365) ovms-server-v2: Send MP-0 PITopping off|CHG: 0 (~0) Wh|Full: 124 min.|Range: 50 - 48Km|SOC: 70.0% (70.0..70.0%)|ODO: 47026.7Km|CAP: 89.7% 96.9Ah|SOH: 100%|
D (550265) ovms-server-v2: TransmitNotifyData: msg=RT-BAT-P,1,86400,1,1,6998,6998,6998,574,574,574,29,29,29,5,1,190,20|

These were the reason for my change. Will probably also occur if a user command received contains CR/LF.

At the moment, the simcom module also injects CR/LF on tx logs:

D (28345) simcom: tx scmd ch=0 len=4 : AT||
D (44345) simcom: tx scmd ch=0 len=103 : AT+CPIN?;+CREG=1;+CTZU=1;+CTZR=1;+CLIP=1;+CMGF=1;+CNMI=1,2,0,0,0;+CSDH=1;+CMEE=2;+CSQ;+AUTOCSQ=1,1;E0||

Don't know if that covers all cases.

Basically this also applies everywhere a string of external / user supplied source gets logged (little Bobby Tables). I can for example imagine a broken Wifi network sending an SSID containing CR/LF.

Regards, Michael

On 11.01.2018 at 19:07, Stephen Casner wrote:

...
On Thu, 11 Jan 2018, Michael Balzer wrote:

...
* Line 294: this should be a warning level log, also the "\n" at the end isn't necessary for a log.

This should be stated more strongly: Not only is it not necessary, it should just not be there. It is also not optional, it won't be ignored. It used to cause a blank line in the output, but now it causes a '|' character to be appended after a change that Michael made a while back.

Michael, when I first saw that change you made to translate all \n or \r except the last to be '|', thereby unfolding a multi-line log into one line, I did not really like that idea but I didn't comment at the time. It seems to me that if the format of a multi-line log message causes a problem then it would be best to go back to the original log message and change it. Perhaps you were bothered by some log message in code that is not under our control? Can you show some examples?

I bring this up now because that bit of code needs to move as I implement the facility to log on all consoles, and I would like to consider removing it.

-- Steve

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev
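
The revised behaviour Steve describes (keep the CR/LF replacement, then remove vertical bars at the end of the aggregated line) might look roughly like this in Python. Again a sketch under assumptions: `fold_and_trim` is a hypothetical name, the real code is C++ on the "for-master" branch, and whether it treats the final terminator exactly this way is not stated in the thread.

```python
def fold_and_trim(formatted):
    """Fold CR/LF into '|', then drop any '|' left at the end of the line."""
    if not formatted:
        return formatted
    body, last = formatted[:-1], formatted[-1]
    body = body.replace("\r", "|").replace("\n", "|")
    # Second pass: remove vertical bars at the end of the aggregated line.
    return body.rstrip("|") + last


# simcom-style entry: the command ends in CR/LF and the logger adds a newline.
print(repr(fold_and_trim("D (28345) simcom: tx scmd ch=0 len=4 : AT\r\n" + "\n")))
```

One side effect of this approach is that a message legitimately ending in '|' would lose that character as well.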

Mark Webb-Johnson

7 Jan

4:07 p.m.

...

Mark: Note also the issue with DNS failures getting to the v2 server. I enabled the modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2 server. Disabling Wifi didn't bring it back, and powering off the modem (in preparation for turning it back on) caused the crash.

Yep, I’m aware of that. Discussed on the list a few days ago - seems to be the DNS server configuration when both WiFi and PPP running at the same time.

...

Second, overall memory usage seems to be at the limit. What sort of budget do we have for what remains to be done, and how are we going to be packaging the build options for when non-developers want to get their hands on the product? Will we be able to turn everything on, minus the developer / debug stuff, or will we have a separate SKU for each model car?

The OTA system supports both. I’d prefer everything in one build, but it really depends on the flash footprint we end up with.

Regards, Mark.

...

On 8 Jan 2018, at 3:05 AM, Greg D. <gregd2350@gmail.com> wrote:

Hi Michael, Steve, Mark,

Steve, the crash was an abort in new_op.cc, so perhaps being out of space is the issue. Crash and reboot log attached (crash.txt). One thing I've been wondering about are the several lines "_WindowOverflow4 at ??:?" during the boot process. Is that indicative of a problem, later to manifest in the crash?

My builds include pretty much everything, except for the Leaf, Twizy, and Soul.

The update included some 20 lines changed to mcp2515.cpp, as well as a bunch of other stuff, including a lot of stuff updated in Canopen and Kia. I have a script that does the git fetch master, merge, and push back to my github fork, the output of which is attached (update.txt). As a test, I removed Canopen from the build config, and the crash has disappeared. CAN-3 also appears to have come back to life (!), at least initially. I can still get CAN-3 to fail if I turn on/off the modem and/or wifi in some sequence (still trying to pin that down), but that also leads to another crash (crash2.txt, attached).

Mark: Note also the issue with DNS failures getting to the v2 server. I enabled the modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2 server. Disabling Wifi didn't bring it back, and powering off the modem (in preparation for turning it back on) caused the crash.

So, two questions... First, why the apparent conflict between Canopen or wifi/modem and obd2ecu over access to the 3rd CAN bus? Why would the modem or wifi have any effect on a CAN bus?

Second, overall memory usage seems to be at the limit. What sort of budget do we have for what remains to be done, and how are we going to be packaging the build options for when non-developers want to get their hands on the product? Will we be able to turn everything on, minus the developer / debug stuff, or will we have a separate SKU for each model car?

Thanks,

Greg

Michael Balzer wrote:

...
Greg,

which commits / changes do you mean? The CAN drivers have not been changed since the TX performance fix, which Geir reported having solved his last issues.

The current version is stable over here, but without the SSH component -- I can't use that due to memory getting too low together with the Twizy component.

Regards, Michael

On 07.01.2018 at 08:04, Greg D. wrote:

...
Hi folks,

I just resync'd with the main repository, and am not receiving frames on CAN-3 anymore. I see there were changes to the chip driver...

I'm also seeing crashes right after getting connected to WiFi, immediately after the system tries to start SSH.

Seems like we just took a big step backward. What happened?

Greg

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

<update.txt><crash.txt><crash2.txt>

Greg D.

4:42 p.m.

Hi Mark,

The DNS thing was just a data point for you, perhaps showing a way to reproduce (or another way to do it...). It wasn't clear to me from the earlier email thread if this had been tracked down yet, or if more information was needed.

Greg

Mark Webb-Johnson wrote:

...

...
Mark: Note also the issue with DNS failures getting to the v2 server. I enabled the modem, got connected, then enabled WiFi (simulating arriving at home), and lost the V2 server. Disabling Wifi didn't bring it back, and powering off the modem (in preparation for turning it back on) caused the crash.

Yep, I’m aware of that. Discussed on the list a few days ago - seems to be the DNS server configuration when both WiFi and PPP running at the same time.

Mark Webb-Johnson

4:04 p.m.

With renault twizy and canopen enabled:

OVMS > module memory
============================
Free 8-bit 78744/243064, 32-bit 29116/55900, blocks dumped = 0

With renault twizy and canopen disabled:

OVMS > module memory
============================
Free 8-bit 84928/243096, 32-bit 29720/56504, blocks dumped = 0

I can’t compile with canopen disabled and renault twizy enabled.

Not sure where the difference is, but it would be preferable if these optional components didn’t consume any ram unless explicitly loaded. Using the class object model, and member variables, should make that relatively simple.

Regards, Mark.
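
For comparing the two outputs, a small Python sketch helps separate "free" from "used". It assumes each pair in the `module memory` output is free/total in bytes (the thread does not spell this out); `parse` and `used_delta` are illustrative names.

```python
def parse(pair):
    """Split a 'free/total' pair and derive the amount in use."""
    free, total = (int(value) for value in pair.split("/"))
    return {"free": free, "total": total, "used": total - free}


# Figures copied from the two "module memory" outputs above.
enabled = {"8-bit": parse("78744/243064"), "32-bit": parse("29116/55900")}
disabled = {"8-bit": parse("84928/243096"), "32-bit": parse("29720/56504")}


def used_delta(kind):
    """Extra bytes in use with twizy and canopen enabled."""
    return enabled[kind]["used"] - disabled[kind]["used"]


for kind in ("8-bit", "32-bit"):
    print(f"{kind}: {used_delta(kind)} bytes more in use when enabled")
```

Under that assumption the 8-bit heap in use is 6152 bytes higher with both components enabled, while 32-bit usage is identical; the 604-byte difference in 32-bit free memory matches the difference in the reported total.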

...

On 7 Jan 2018, at 6:29 PM, Michael Balzer <dexter@expeedo.de> wrote:

Greg,

which commits / changes do you mean? The CAN drivers have not been changed since the TX performance fix, which Geir reported having solved his last issues.

The current version is stable over here, but without the SSH component -- I can't use that due to memory getting too low together with the Twizy component.

Regards, Michael

On 07.01.2018 at 08:04, Greg D. wrote:

...
Hi folks,

I just resync'd with the main repository, and am not receiving frames on CAN-3 anymore. I see there were changes to the chip driver...

I'm also seeing crashes right after getting connected to WiFi, immediately after the system tries to start SSH.

Seems like we just took a big step backward. What happened?

Greg

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

Michael Balzer

8 Jan

6:27 a.m.

Mark,

the Twizy implementation currently depends largely on the CANopen framework. It may be possible to build a -very- restricted version without, but the result would not be used by any Twizy driver.

The CANopen framework is also a general toolkit to discover and talk to CANopen devices, see my intro at:

https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master...

The RAM usage of the manager module, while not having started a bus instance, is 24 bytes for the module state plus command registry. Command registry follows the approach of the CAN framework to do the interface selection as a command level. So the CANopen command registry entries consist of 2 + 3 * 13 = 41 commands.

The same command registry overhead applies to all optional components, i.e. OBD2ECU and REtools. Maybe a better solution is to make all these components loadable like the vehicles. I saw you subclassed RE from pcp, was this meant to support the dynamic loading/init by means of the power control command, or is there another plan on this?

Regards, Michael

On 08.01.2018 at 01:04, Mark Webb-Johnson wrote:

...

With renault twizy and canopen enabled:

OVMS > module memory
============================
Free 8-bit 78744/243064, 32-bit 29116/55900, blocks dumped = 0

With renault twizy and canopen disabled:

OVMS > module memory
============================
Free 8-bit 84928/243096, 32-bit 29720/56504, blocks dumped = 0

I can’t compile with canopen disabled and renault twizy enabled.

Not sure where the difference is, but it would be preferable if these optional components didn’t consume any ram unless explicitly loaded. Using the class object model, and member variables, should make that relatively simple.

Regards, Mark.

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

Greg D.

11:49 a.m.

Michael Balzer

12:06 p.m.

Greg,

CANopen is a general framework for CANopen devices, not specifically for the Twizy. The module provides tools to analyze and talk to CANopen nodes of any kind. It's generally useful for reverse engineering and getting access to vehicle components.

Including the CANopen module makes no difference to the running system (except a bit of memory usage for the shell commands, see my earlier response to Mark), unless you explicitly start a CANopen process. Without a "co … start" command, there is no CANopen task or listener present. There is also no filtering code in CANopen that applies to anything else but the CANopen listeners.

So your issue is certainly not related to the CANopen module, it's just coincidence it seemed to make a difference in your test.

Regarding the RX performance, see my previous message. Please verify it's the bus, not the obd2ecu.

Regarding the Wifi connection triggering the problem: this seems to match the assumed performance issue of the MCP driver.

Regards, Michael

On 08.01.2018 at 20:49, Greg D. wrote:

...

Hi Michael,

Linking the several parts to this thread... With the clarity of a new day, I'm being a bit more methodical, changing one thing at a time. TL;DR question: What is Canopen actively doing in a system without the Twizy even present?

Continuing on my issue isolation work from last night, it appears the whole 12v / modem thing might be irrelevant. I've reproduced the "partial hang", where the OBDWiz dongle gets part way through the connect sequence before CAN3 hangs. Metrics show 5 frames received, 7 transmitted. One frame Rx overflow. Error status = 0x2040. If I stop and restart obdii ecu, I can repeat the hang by restarting the connect in OBDWiz.

The previous build was able to start OBDWiz and it ran all night with no issues. The only difference between this build and the previous one is that I included Canopen in it. *So, first issue: Why would including Canopen make a difference?* I don't have any reference to Canopen in my config, and the Twizy vehicle module is not included in the build (I've got a Tesla Roadster). The OBDWiz is on CAN3; CAN2 is unconnected. The module is on my desk, so nothing on CAN1, either.

The system.start script is as follows:

enable ******
obdii ecu start can3
power ext12v on
wifi mode client
# power simcom on
vehicle module TR
server v2 start

Note the modem enable is commented out, so the need for 12v power is shown to be irrelevant. Wifi connectivity is effectively disabled because the only configured network is hidden, so it doesn't auto-connect.

*Second issue: If I remove Canopen and change the wifi hotspot to be not hidden, CAN3 hangs at the point where Wifi auto-connects.* As before, there appears to be an Rx overflow, but this time if I stop and restart obdii, everything connects and runs properly. Wifi disconnecting doesn't affect the CAN bus, but when it reconnects, the hang repeats.

I've verified with can trace that the frames truly aren't being received, so it's not the obd2ecu process that's hung. Obd2ecu continues to process other commands normally (e.g. obdii ecu list).

So, summarizing, I believe there are two issues here. First, the CAN2/3 driver needs some work in handling receive overflows. In theory, the CAN devices should be resilient enough to withstand the loss of a frame or two, and indeed, I see OBDWiz retrying the connect (as monitored by Wireshark via a CAN splitter cable). Second, Canopen is getting in the way, even when it is not being used. If I understand the ESP architecture, all CAN recipients get all CAN frames from all (or both?) CAN buses, and ignore the ones they don't need. I'm guessing this filtering code in Canopen is taking too long to be transparent.

Greg

Michael Balzer wrote:

...
Mark,

the Twizy implementation currently depends largely on the CANopen framework. It may be possible to build a -very- restricted version without, but the result would not be used by any Twizy driver.

The CANopen framework is also a general toolkit to discover and talk to CANopen devices, see my intro at:

https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master...

The RAM usage of the manager module, while not having started a bus instance, is 24 bytes for the module state plus command registry. Command registry follows the approach of the CAN framework to do the interface selection as a command level. So the CANopen command registry entries consist of 2 + 3 * 13 = 41 commands.

The same command registry overhead applies to all optional components, i.e. OBD2ECU and REtools. Maybe a better solution is to make all these components loadable like the vehicles. I saw you subclassed RE from pcp, was this meant to support the dynamic loading/init by means of the power control command, or is there another plan on this?

Regards, Michael

On 08.01.2018 at 01:04, Mark Webb-Johnson wrote:

...
With renault twizy and canopen enabled:

OVMS > module memory
============================
Free 8-bit 78744/243064, 32-bit 29116/55900, blocks dumped = 0

With renault twizy and canopen disabled:

OVMS > module memory
============================
Free 8-bit 84928/243096, 32-bit 29720/56504, blocks dumped = 0

I can’t compile with canopen disabled and renault twizy enabled.

Not sure where the difference is, but it would be preferable if these optional components didn’t consume any ram unless explicitly loaded. Using the class object model, and member variables, should make that relatively simple.

Regards, Mark.

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

_______________________________________________ OvmsDev mailing list OvmsDev@lists.teslaclub.hk http://lists.teslaclub.hk/mailman/listinfo/ovmsdev

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

Greg D.

12:55 p.m.

65 comments

5 participants

participants (5)

Geir Øyvind Vælidalo
Greg D.
Mark Webb-Johnson
Michael Balzer
Stephen Casner