I find that I need to enable the following option in my testing of the possible replacement of mbedTLS with wolfSSL, otherwise I get an "ASN no signer to confirm" error:

    WOLFSSL_ALT_CERT_CHAIN allows CAs to be presented by the peer without being part of a valid chain. Default wolfSSL behavior is to require validation of all presented peer certificates. This also allows loading intermediate CAs as trusted and ignoring "no signer" failures for CAs up the chain to the root. The alternate certificate chain mode only requires that the peer certificate validate to a trusted CA.

Is that expected for the trust arrangements we are using?

A possibly related question: do we expect the server to validate clients, or only the clients to validate the server?

-- Steve
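For context, WOLFSSL_ALT_CERT_CHAIN is a compile-time option rather than a runtime setting, so a test build would enable it with something like the following. The user_settings.h / CFLAGS mechanism is an assumption about the build setup; only the define name itself comes from the wolfSSL documentation quoted above.

```c
/* user_settings.h (or equivalently -DWOLFSSL_ALT_CERT_CHAIN in CFLAGS).
 * Relaxes chain validation: the peer certificate need only validate
 * to some trusted CA; "no signer" failures above that are ignored. */
#define WOLFSSL_ALT_CERT_CHAIN
```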
Steve,

A thorny issue. Servers are _supposed_ to provide intermediate certificates, up to a trusted root. When you are issued a certificate, it includes a bundle of these intermediate certificates to be installed at the same time. In practice, servers are often misconfigured so they do not. This is made worse by browsers silently detecting this and then downloading the missing intermediate certificates (the child certificate contains a URL to its parent's cert).

For Open Vehicles, I don't think we need to deal with this, and we certainly don't need the complexity of automatically downloading intermediate certificates. I think if the user wants to access a server misconfigured in that way, he can simply import and trust the intermediate certificate directly.

I don't think we should set WOLFSSL_ALT_CERT_CHAIN.

Regarding your question: in normal operation OVMS as a client must validate the server certificates it connects to. I don't think OVMS currently supports client certificates, although if it did we would have to correctly provide those to the server on connection.

Regards, Mark.
_______________________________________________
OvmsDev mailing list
OvmsDev@lists.openvehicles.com
http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Mark,

Thanks for that reply. As I mentioned, if I don't configure WOLFSSL_ALT_CERT_CHAIN then I get an "ASN no signer to confirm" error. Do you have any idea why that might be? That is, am I likely to be missing access to some key? Or some key not being present in a cert when it should be there?

-- Steve

On Thu, 4 Mar 2021, Mark Webb-Johnson wrote:
What site (and port) are you trying to access?
Your server api.openvehicles.com on port 6870.

On Thu, 4 Mar 2021, Mark Webb-Johnson wrote:
Steve,

Mea culpa. Here is the main cert:

    X509v3 Subject Alternative Name:
        DNS:*.openvehicles.com, DNS:openvehicles.com
    Issuer: C=LV, L=Riga, O=GoGetSSL, CN=GoGetSSL RSA DV CA
    Validity
        Not Before: Feb 11 00:00:00 2020 GMT
        Not After : Feb 10 23:59:59 2022 GMT

Then the intermediate certs:

    Subject: C=LV, L=Riga, O=GoGetSSL, CN=GoGetSSL RSA DV CA
    Issuer: C=US, ST=New Jersey, L=Jersey City, O=The USERTRUST Network, CN=USERTrust RSA Certification Authority
    Validity
        Not Before: Sep  6 00:00:00 2018 GMT
        Not After : Sep  5 23:59:59 2028 GMT

    Subject: C=US, ST=New Jersey, L=Jersey City, O=The USERTRUST Network, CN=USERTrust RSA Certification Authority
    Issuer: C=SE, O=AddTrust AB, OU=AddTrust External TTP Network, CN=AddTrust External CA Root
    Validity
        Not Before: May 30 10:48:38 2000 GMT
        Not After : May 30 10:48:38 2020 GMT

It seems the second intermediate certificate has expired, and that is perhaps the problem you are seeing?

Very annoying. I really hate it when certificate authorities issue certificates signed by intermediate CA certificates, or trusted roots, that expire before the certificate itself.

AddTrust has already issued a new intermediate cert which expires in 2038, and have managed to get that in as a trusted CA in most modern browsers. It is in my server's bundle for older browsers, and is also in the OVMS firmware as a trusted CA. I guess wolfSSL and mbedTLS behave differently in this situation (where the same certificate is provided as a trusted CA as well as in the bundle, but with different expiration dates). Either that, or mbedTLS is not verifying expired intermediate certs 😱

I replaced that certificate on my server, and now see:

    Subject: C=US, ST=New Jersey, L=Jersey City, O=The USERTRUST Network, CN=USERTrust RSA Certification Authority
    Issuer: C=US, ST=New Jersey, L=Jersey City, O=The USERTRUST Network, CN=USERTrust RSA Certification Authority
    Validity
        Not Before: Feb  1 00:00:00 2010 GMT
        Not After : Jan 18 23:59:59 2038 GMT

A test now looks ok:

    $ openssl s_client -connect api.openvehicles.com:6870
    CONNECTED(00000005)
    depth=2 C = US, ST = New Jersey, L = Jersey City, O = The USERTRUST Network, CN = USERTrust RSA Certification Authority
    verify return:1
    depth=1 C = LV, L = Riga, O = GoGetSSL, CN = GoGetSSL RSA DV CA
    verify return:1
    depth=0 CN = *.openvehicles.com
    verify return:1
    ---
    Certificate chain
     0 s:/CN=*.openvehicles.com
       i:/C=LV/L=Riga/O=GoGetSSL/CN=GoGetSSL RSA DV CA
     1 s:/C=LV/L=Riga/O=GoGetSSL/CN=GoGetSSL RSA DV CA
       i:/C=US/ST=New Jersey/L=Jersey City/O=The USERTRUST Network/CN=USERTrust RSA Certification Authority
     2 s:/C=US/ST=New Jersey/L=Jersey City/O=The USERTRUST Network/CN=USERTrust RSA Certification Authority
       i:/C=US/ST=New Jersey/L=Jersey City/O=The USERTRUST Network/CN=USERTrust RSA Certification Authority

Can you try again? See if you still get an error?

Regards, Mark.
Mark,

Thanks. That fix avoids the signature error. I still have the problem that the TLS handshake gets only part way through and then the network task gets locked up for 90 seconds. It's not always in the same place in the log, but since ~100 log messages get lost when this occurs, I can't be sure where it happens. It could be a timing dependency between a couple of events such that in some circumstances a blocking operation is executed.

-- Steve
Michael and anyone else who's game:

I now have an updated mongoose-wolfssl branch ready to be tested. The reason for the 90-second lockup mentioned in the previous post is a whole lot of math for a prime-number validation that's part of the Diffie-Hellman step. It was actually 87 seconds for Mark's server and 28 seconds for Michael's, due to differences in certificates. That prime-number validation is required for FIPS compliance, which wolfSSL supports, but we don't need it. I spent quite a while digging into this to find where the process was getting stuck. Finally I got help from wolfSSL support suggesting a configuration option that avoids this extra check.

So now I have an implementation using mongoose with wolfSSL that connects successfully to both servers with a 3-4 second delay. (I don't recall what the delay was for the mbedTLS-based implementation.) I think the memory usage looks OK. I still have not taken any steps to reduce any resources used by the mbedTLS code as accessed for other purposes.

Included in the debugging was another version update on the Wolf code, to wolfSSH 1.4.6 and wolfSSL 4.7.0.

-- Steve
I tried building/booting this on my dev module (3.2.016-66-g93e0cf3e), but for some time now the for-v3.3 branch has been broken for me. When the module first boots the web gui works long enough for me to log in, and then it times out. From that point on I can't get the web gui or ssh to respond. It will return pings. The serial console is fine (and that's how I switch back to a build based on master).

I just did a fresh reboot and captured the serial console output and noticed this:

    W (4484) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code

I think it happened around the time I lost wifi connectivity.

My sdkconfig is close to support/sdkconfig.default.hw31; I have CONFIG_SPIRAM_CACHE_WORKAROUND turned off along with a lot of vehicles.

Craig
Craig,

I get the same (with for-v3.3):

    W (2940) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code

I guess it is just a warning. Probably some debugging config setting.

But wifi, web and others work ok for me. The only problems I have with the for-v3.3 branch are (a) the web dashboard modem status, and (b) the TLS certificate verification against api.openvehicles.com. I am working on both.

Regards, Mark.
P.S. Error code -174 seems to be 'NOT_COMPILED_IN'.

Regards, Mark.
Craig and Mark,

I do have the OvmsSSH::NetManInit() function calling wolfSSL_Debugging_ON(), which would be expected to return -174, meaning NOT_COMPILED_IN as Mark correctly found. At one point I had code to print that error message because I was having trouble getting the wolfSSL debugging to work, but I took out that error message in commit 9607979e91da7a53da1cd0bd8325ab390abe18bb, so now the return value is ignored. I'm baffled. I'll have to look deeper after dinner.

My mongoose-wolfssl branch is off of master on 2/17. I should probably have rebased to the current master, or perhaps merged it to for-v3.3 as Mark recently requested. Have you guys done that merge for what you are testing now?

-- Steve
P.S. Error code -174 seems to be 'NOT_COMPILED_IN'.
Regards, Mark.
On 12 Mar 2021, at 9:56 AM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
Craig,
I get the same (with for-v3.3):
W (2940) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code
I guess it is just a warning. Probably some debugging config setting.
But wifi, web and others work ok for me. The only problems I have with the for-v3.3 branch are (a) the web dashboard modem status, and (b) the TLS certificate verification against api.openvehicles.com. I am working on both.
Regards, Mark.
On 12 Mar 2021, at 9:46 AM, Craig Leres <leres@xse.com> wrote:
On 3/10/21 11:23 PM, Stephen Casner wrote:
Michael and anyone else who's game: I now have an updated mongoose-wolfssl branch ready to be tested. The reason for the 90-second lockup mentioned in the previous post is a whole lot of math for a prime-number validation that's part of the Diffie-Hellman step. It was actually 87 seconds for Mark's server and 28 seconds for Michael's due to differences in certificates. That prime-number validation is required for FIPS compliance, which WolfSSL supports, but we don't need it. I spent quite a while digging into this to find where the process was getting stuck. Finally I got help from WolfSSL support suggesting a configuration option that avoids this extra check. So now I have an implementation using mongoose with wolfssl that connects successfully to both servers with a 3-4 second delay. (I don't recall what the delay was for the MBEDTLS-based implementation.) I think the memory usage looks OK. I still have not taken any steps to reduce any resources used by the MBEDTLS code as accessed for other purposes. Included in the debugging was another version update on the Wolf code to wolfssh 1.4.6 and wolfssl 4.7.0.
I tried building/booting this on my dev module (3.2.016-66-g93e0cf3e), but for some time now the for-v3.3 branch has been broken for me. When the module first boots the web gui works long enough for me to login and then it times out. From that point on I can't get the web gui or ssh to respond. It will return pings. The serial console is fine (and that's how I switch back to a build based on master).
I just did a fresh reboot and captured the serial console output and noticed this:
W (4484) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code
I think it happened around the time I lost wifi connectivity.
My sdkconfig is close to support/sdkconfig.default.hw31, I have CONFIG_SPIRAM_CACHE_WORKAROUND turned off along with a lot of vehicles.
Craig
The for-v3.3 branch should be up-to-date and merged from master. It should have everything that master has.

I see it has this:

commit 9607979e91da7a53da1cd0bd8325ab390abe18bb
Author: Stephen Casner <casner@acm.org>
Date:   Wed Feb 24 23:53:28 2021 -0800

    SSH: Don't emit error message if wolfssl debugging is unconfigured

diff --git a/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp b/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
index 21549bad..6fa0e5e5 100644
--- a/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
+++ b/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
@@ -177,9 +177,8 @@ void OvmsSSH::NetManInit(std::string event, void* data)
   ESP_LOGI(tag, "Launching SSH Server");
   wolfSSH_SetLoggingCb(&wolfssh_logger);
   wolfSSH_Debugging_ON();
-  if ((ret=wolfSSL_SetLoggingCb(&wolfssl_logger)) || (ret=wolfSSL_Debugging_ON()))
-    ESP_LOGW(tag, "Couldn't initialize wolfSSL debugging, error %d: %s", ret,
-             GetErrorString(ret));
+  wolfSSL_SetLoggingCb(&wolfssl_logger);
+  wolfSSL_Debugging_ON();
   ret = wolfSSH_Init();
   if (ret != WS_SUCCESS)
     {

But the current code in both the master and for-v3.3 branches is:

  ESP_LOGI(tag, "Launching SSH Server");
  wolfSSH_SetLoggingCb(&wolfssh_logger);
  wolfSSH_Debugging_ON();
  if ((ret=wolfSSL_SetLoggingCb(&wolfssl_logger)) || (ret=wolfSSL_Debugging_ON()))
    ESP_LOGW(tag, "Couldn't initialize wolfSSL debugging, error %d: %s", ret,
             GetErrorString(ret));

It seems commit c6911c91432cada337bef46f6a541af46304b5cf brought back the old code?

Mark
On 12 Mar 2021, at 11:34 AM, Stephen Casner <casner@acm.org> wrote:
Craig and Mark,
I do have the OvmsSSH::NetManInit() function calling wolfSSL_Debugging_ON(), which would be expected to return -174, meaning NOT_COMPILED_IN as Mark correctly found. At one point I had code to print that error message because I was having trouble getting the wolfSSL debugging to work, but I took out that error message in commit 9607979e91da7a53da1cd0bd8325ab390abe18bb, so now the return value is ignored. I'm baffled. I'll have to look deeper after dinner.
My mongoose-wolfssl branch is off of master on 2/17. I should probably have rebased to the current master or perhaps merged it to for-v3.3 as Mark recently requested. Have you guys done that merge for what you are testing now?
-- Steve
On Fri, 12 Mar 2021, Mark Webb-Johnson wrote:
P.S. Error code -174 seems to be 'NOT_COMPILED_IN'.
Regards, Mark.
On 12 Mar 2021, at 9:56 AM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
Craig,
I get the same (with for-v3.3):
W (2940) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code
I guess it is just a warning. Probably some debugging config setting.
But wifi, web and others work ok for me. The only problems I have with the for-v3.3 branch are (a) the web dashboard modem status, and (b) the TLS certificate verification against api.openvehicles.com. I am working on both.
Regards, Mark.
On 12 Mar 2021, at 9:46 AM, Craig Leres <leres@xse.com> wrote:
On 3/10/21 11:23 PM, Stephen Casner wrote:
Michael and anyone else who's game: I now have an updated mongoose-wolfssl branch ready to be tested. The reason for the 90-second lockup mentioned in the previous post is a whole lot of math for a prime-number validation that's part of the Diffie-Hellman step. It was actually 87 seconds for Mark's server and 28 seconds for Michael's due to differences in certificates. That prime-number validation is required for FIPS compliance, which WolfSSL supports, but we don't need it. I spent quite a while digging into this to find where the process was getting stuck. Finally I got help from WolfSSL support suggesting a configuration option that avoids this extra check. So now I have an implementation using mongoose with wolfssl that connects successfully to both servers with a 3-4 second delay. (I don't recall what the delay was for the MBEDTLS-based implementation.) I think the memory usage looks OK. I still have not taken any steps to reduce any resources used by the MBEDTLS code as accessed for other purposes. Included in the debugging was another version update on the Wolf code to wolfssh 1.4.6 and wolfssl 4.7.0.
I tried building/booting this on my dev module (3.2.016-66-g93e0cf3e), but for some time now the for-v3.3 branch has been broken for me. When the module first boots the web gui works long enough for me to login and then it times out. From that point on I can't get the web gui or ssh to respond. It will return pings. The serial console is fine (and that's how I switch back to a build based on master).
I just did a fresh reboot and captured the serial console output and noticed this:
W (4484) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code
I think it happened around the time I lost wifi connectivity.
My sdkconfig is close to support/sdkconfig.default.hw31, I have CONFIG_SPIRAM_CACHE_WORKAROUND turned off along with a lot of vehicles.
Craig
I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui. But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?

Craig
IDF should be the same; I use the same one for compiling both master and for-v3.3. But if you are switching branches, it's perhaps safest to do a 'make clean' between each build.

Regards, Mark
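To Craig's question: the make-based ESP-IDF build picks up whichever checkout $IDF_PATH points at, so if separate IDF versions ever did become necessary, a per-branch wrapper could export it before building. A sketch under assumptions (the second checkout's directory name is hypothetical):

```shell
# Sketch: choose an ESP-IDF checkout per git branch.
# The "esp-idf-for-v3.3" path is an assumed second checkout, not a real convention.
pick_idf_path() {
  case "$1" in
    for-v3.3) printf '%s\n' "$HOME/esp/esp-idf-for-v3.3" ;;
    *)        printf '%s\n' "$HOME/esp/esp-idf" ;;
  esac
}
# Usage before building:
#   export IDF_PATH="$(pick_idf_path "$(git rev-parse --abbrev-ref HEAD)")"
#   make clean && make
```

As Mark notes, though, both branches build against the same IDF here, so a 'make clean' when switching branches is the simpler fix.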
On 12 Mar 2021, at 12:47 PM, Craig Leres <leres@xse.com> wrote:
I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui.
But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?
Craig
I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf.

The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes.

The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed:

2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network
2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart
2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station
2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0)
2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us
2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq
2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq
2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq
2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver
2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16
2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver
2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log
2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0
2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 30f9e79
2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled
2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled
2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16
2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init management short buffer num: 32
2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212
2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16
2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed)
2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8
2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart
2021-03-12 14:53:02.662 CET I (982512) cellular: State: Enter PowerOffOn state
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)...
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)...
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown
2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct)
2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle
2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down)
2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop
2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority
2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event)
2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected)
2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE
2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event)
2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop)
2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event)
2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop)
2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.712 CET I (984562) netmanager: WIFI access point is down
2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped
2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad)
2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event)
2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.wifi.sta.start)
2021-03-12 14:53:04.732 CET D (984582) events: Signal(system.modem.down)
2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost
2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down)
2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event)
2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start)
2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up
2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped
2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped
2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped

…and so on until

2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting

I need my car now, so will switch back to master for now.

Mark, if you've got specific debug logs I should fetch on the next try, tell me.

Regards, Michael

Am 12.03.21 um 05:47 schrieb Craig Leres:
I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui.
But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?
Craig
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Tried to repeat this, but not having much success. Here is my car module, with network still up:

OVMS# boot status
Last boot was 262355 second(s) ago

I did manage to catch one network related crash after repeatedly disconnecting and reconnecting the cellular antenna. This was:

I (3717989) cellular: PPP Connection disconnected
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.

0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707).
707 if ((pri->name[0]==search[0])&&
0x400ed360 is in OvmsMetricString::SetValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358).
1357 void OvmsMetricString::SetValue(std::string value)
0x4008bdad is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586.
0x4008bdd1 is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604.
0x400fe886 is in OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522).
522 PrioritiseAndIndicate();
0x400fd752 is in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600).
600 { return (__object->*_M_pmf)(std::forward<_Args>(__args)...); }
0x400f512e is in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271).
2271 return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
0x400f52f1 is in OvmsEvents::HandleQueueSignalEvent(event_queue_t*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283).
283 m_current_callback->m_callback(m_current_event, msg->body.signal.data);
0x400f53d8 is in OvmsEvents::EventTask() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237).
237 HandleQueueSignalEvent(&msg);
0x400f545d is in EventLaunchTask(void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80).
80 me->EventTask();

My for_v3.3 branch does include the preliminary changes to support the wifi at 20MHz bandwidth, and perhaps those could be affecting things. I do notice that if I ‘power wifi off’, then ‘wifi mode client’, it can connect to the station, but not get an IP address. I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that.

Regards, Mark.
On 12 Mar 2021, at 10:32 PM, Michael Balzer <dexter@expeedo.de> wrote:
Signed PGP part I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf.
The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes.
The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed:
2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station 2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0) 2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us
2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq 2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq 2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq 2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver 2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16 2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver 2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log 2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE 2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 30f9e79 2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled 2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled 2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16 2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16 2021-03-12 14:53:01.922 CET I (981772) wifi:Init management short buffer num: 32 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16 2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed) 2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8 2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart 2021-03-12 14:53:02.662 
CET I (982512) cellular: State: Enter PowerOffOn state 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown 2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct) 2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle 2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down) 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop 2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event) 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected) 2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event) 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop) 2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event) 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop) 2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 
14:53:04.712 CET I (984562) netmanager: WIFI access point is down 2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped 2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad) 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event) 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.wifi.sta.start) 2021-03-12 14:53:04.732 CET D (984582) events: Signal(system.modem.down) 2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost 2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down) 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event) 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start) 2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up 2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped 2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped 2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped …and so on until 2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting
I need my car now, so will switch back to master for now.
Mark, if you've got specific debug logs I should fetch on the next try, tell me.
Regards, Michael
Am 12.03.21 um 05:47 schrieb Craig Leres:
I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui.
But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?
Craig
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
I've found that opening the web UI firmware page or calling "ota status" via ssh consistently deadlocks the network on my module:

I (130531) webserver: HTTP GET /cfg/firmware
D (130531) http: OvmsSyncHttpClient: Connect to ovms.dexters-web.de:80
D (130541) http: OvmsSyncHttpClient: waiting for completion

After that log message, the network is dead, and the netmanager also doesn't respond:

OVMS# network list
ERROR: job failed
D (183241) netmanager: send cmd 1 from 0x3ffe7054
W (193241) netmanager: ExecuteJob: cmd 1: timeout

The interfaces seem to be registered and online, but nothing gets in or out:

OVMS# network status
Interface#3: pp3 (ifup=1 linkup=1)
IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64
Interface#2: ap2 (ifup=1 linkup=1)
IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1
Interface#1: st1 (ifup=1 linkup=1)
IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1
DNS: 192.168.2.1
Default Interface: st1 (192.168.2.106/255.255.255.0 gateway 192.168.2.1)

A couple of minutes later, server-v2 recognizes the stale connection and issues a network restart, which fails, resulting in the same behaviour as quoted below, finally ending in a forced reboot from the loss of an important event.

Doing "ota status" from USB works normally, so this looks like OvmsSyncHttpClient not being able to run from within a mongoose client.

Regards, Michael

Am 18.03.21 um 08:14 schrieb Mark Webb-Johnson:
Tried to repeat this, but not having much success. Here is my car module, with network still up:
OVMS# boot status Last boot was 262355 second(s) ago
I did manage to catch one network related crash after repeatedly disconnecting and reconnecting the cellular antenna. This was:
I (3717989) cellular: PPP Connection disconnected Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707).
707 if ((pri->name[0]==search[0])&&
0x400ed360 is in OvmsMetricString::SetValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358).
1357 void OvmsMetricString::SetValue(std::string value)
0x4008bdad is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586.
0x4008bdd1 is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604.
0x400fe886 is in OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522).
522 PrioritiseAndIndicate();
0x400fd752 is in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600).
600 { return (__object->*_M_pmf)(std::forward<_Args>(__args)...); }
0x400f512e is in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271).
2271 return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
0x400f52f1 is in OvmsEvents::HandleQueueSignalEvent(event_queue_t*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283).
283 m_current_callback->m_callback(m_current_event, msg->body.signal.data);
0x400f53d8 is in OvmsEvents::EventTask() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237).
237 HandleQueueSignalEvent(&msg);
0x400f545d is in EventLaunchTask(void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80).
80 me->EventTask();
My for_v3.3 branch does include the preliminary changes to support the wifi at 20MHz bandwidth, and perhaps those could be affecting things. I do notice that if I ‘power wifi off’, then ‘wifi mode client’, it can connect to the station, but not get an IP address. I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that.
Regards, Mark.
On 12 Mar 2021, at 10:32 PM, Michael Balzer <dexter@expeedo.de> wrote:
Signed PGP part I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf.
The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes.
The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed:
2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network
2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart
2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station
2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0)
2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us
2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq
2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq
2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq
2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver
2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16
2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver
2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log
2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0
2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 30f9e79
2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled
2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled
2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16
2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init management short buffer num: 32
2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212
2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16
2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed)
2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8
2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart
2021-03-12 14:53:02.662 CET I (982512) cellular: State: Enter PowerOffOn state
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)...
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)...
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown
2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct)
2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle
2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down)
2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop
2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority
2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event)
2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected)
2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE
2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event)
2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop)
2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event)
2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop)
2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.712 CET I (984562) netmanager: WIFI access point is down
2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped
2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad)
2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event)
2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.wifi.sta.start)
2021-03-12 14:53:04.732 CET D (984582) events: Signal(system.modem.down)
2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost
2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down)
2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event)
2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start)
2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up
2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped
2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped
2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped
…and so on until
2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting
I need my car now, so will switch back to master for now.
Mark, if you've got specific debug logs I should fetch on the next try, tell me.
Regards, Michael
Am 12.03.21 um 05:47 schrieb Craig Leres:
I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui.
But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?
Craig
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Not sure how to resolve this.

OvmsSyncHttpClient is currently used in commands from ovms_plugins and ovms_ota.

I could bring back the OvmsHttpClient blocking (non-mongoose) implementation, but I don’t think that would address the core problem here: Inside a mongoose callback (inside the mongoose networking task), we are making blocking calls (and in particular calls that could block for several tens of seconds).

But fundamentally is it ok to block the mongoose networking task for extended periods during a mongoose event callback?

Mark
On 21 Mar 2021, at 9:57 PM, Michael Balzer <dexter@expeedo.de> wrote:
I've found opening the web UI firmware page or calling "ota status" via ssh to consistently deadlock the network on my module.
I (130531) webserver: HTTP GET /cfg/firmware
D (130531) http: OvmsSyncHttpClient: Connect to ovms.dexters-web.de:80
D (130541) http: OvmsSyncHttpClient: waiting for completion
After that log message, the network is dead, and the netmanager also doesn't respond:
OVMS# network list
ERROR: job failed
D (183241) netmanager: send cmd 1 from 0x3ffe7054
W (193241) netmanager: ExecuteJob: cmd 1: timeout
The interfaces seem to be registered and online, but nothing gets in or out:
OVMS# network status
Interface#3: pp3 (ifup=1 linkup=1)
  IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64
Interface#2: ap2 (ifup=1 linkup=1)
  IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1
Interface#1: st1 (ifup=1 linkup=1)
  IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1
DNS: 192.168.2.1
Default Interface: st1 (192.168.2.106/255.255.255.0 gateway 192.168.2.1)
A couple of minutes later, server-v2 recognizes the stale connection and issues a network restart, which fails, resulting in the same behaviour as shown below and finally a forced reboot through the loss of an important event.
Doing "ota status" from USB works normally, so this looks like OvmsSyncHttpClient not being able to run from within a mongoose client.
Regards, Michael
Am 18.03.21 um 08:14 schrieb Mark Webb-Johnson:
Tried to repeat this, but not having much success. Here is my car module, with network still up:
OVMS# boot status
Last boot was 262355 second(s) ago
I did manage to catch one network related crash after repeatedly disconnecting and reconnecting the cellular antenna. This was:
I (3717989) cellular: PPP Connection disconnected
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707).
707 if ((pri->name[0]==search[0])&&
0x400ed360 is in OvmsMetricString::SetValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358).
1357 void OvmsMetricString::SetValue(std::string value)
0x4008bdad is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586.
0x4008bdd1 is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604.
0x400fe886 is in OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522).
522 PrioritiseAndIndicate();
0x400fd752 is in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600).
600 { return (__object->*_M_pmf)(std::forward<_Args>(__args)...); }
0x400f512e is in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271).
2271 return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
0x400f52f1 is in OvmsEvents::HandleQueueSignalEvent(event_queue_t*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283).
283 m_current_callback->m_callback(m_current_event, msg->body.signal.data);
0x400f53d8 is in OvmsEvents::EventTask() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237).
237 HandleQueueSignalEvent(&msg);
0x400f545d is in EventLaunchTask(void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80).
80 me->EventTask();
My for_v3.3 branch does include the preliminary changes to support the wifi at 20MHz bandwidth, and perhaps those could be affecting things. I do notice that if I ‘power wifi off’, then ‘wifi mode client’, it can connect to the station, but not get an IP address. I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that.
Regards, Mark.
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
I think we must avoid blocking the Mongoose task, as that's the central network dispatcher.

Chris had implemented a workaround in one of his PRs that could allow that to be done temporarily, by running a local Mongoose main loop during a synchronous operation, but I still see potential issues from that, as it wasn't the standard handling done by the task, and it may need to recurse.

Maybe the old OvmsHttpClient using socket I/O is the right way for synchronous network operations?

Regards, Michael

Am 22.03.21 um 07:15 schrieb Mark Webb-Johnson:
Not sure how to resolve this.
OvmsSyncHttpClient is currently used in commands from ovms_plugins and ovms_ota.
I could bring back the OvmsHttpClient blocking (non-mongoose) implementation, but I don’t think that would address the core problem here:
Inside a mongoose callback (inside the mongoose networking task), we are making blocking calls (and in particular calls that could block for several tens of seconds).
But fundamentally is it ok to block the mongoose networking task for extended periods during a mongoose event callback?
Mark
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
In master branch, at the moment, if a command is run from the web shell (or server v2), surely the mongoose task will block as the web server / server v2 blocks waiting for the command to run to completion?

Doesn’t necessarily need to be a networking command. Something long running like the string speed tests.

In v3.3 I can easily detect the task wait being requested in the http library (by seeing if current task id == mongoose task), and fail (which I should do anyway). But I am more concerned with the general case now (which I think may be wrong in both master and for-v3.3).

Regards, Mark
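The check Mark describes (detecting that the current task is the mongoose task and failing instead of waiting) can be sketched in portable C++. The names below are invented for illustration; this is not the actual OVMS code:

```cpp
#include <cassert>
#include <thread>

// Hypothetical sketch: remember which thread runs the network event loop,
// and refuse to perform a blocking wait when called from that same thread.
// The wait could never be satisfied, because the loop that would satisfy
// it is the caller itself.
static std::thread::id g_net_loop_thread;

// Called once from inside the network event loop at startup.
void RecordNetLoopThread() { g_net_loop_thread = std::this_thread::get_id(); }

// Returns false ("fail fast") when invoked from the network loop thread;
// true means it would be safe to block here.
bool SafeToBlock() { return std::this_thread::get_id() != g_net_loop_thread; }
```

A guard like this at the top of a synchronous client's wait path turns a silent deadlock into an immediate, loggable failure.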
On 22 Mar 2021, at 5:22 PM, Michael Balzer <dexter@expeedo.de> wrote:
I think we must avoid blocking the Mongoose task, as that's the central network dispatcher.
Chris had implemented a workaround in one of his PRs that could allow that to be done temporarily by running a local Mongoose main loop during a synchronous operation, but I still see potential issues from that, as it wasn't the standard handling as done by the task, and as it may need to recurse.
Maybe the old OvmsHttpClient using socket I/O is the right way for synchronous network operations?
Regards, Michael
On 22.03.21 at 07:15, Mark Webb-Johnson wrote:
Not sure how to resolve this.
OvmsSyncHttpClient is currently used in commands from ovms_plugins and ovms_ota.
I could bring back the OvmsHttpClient blocking (non-mongoose) implementation, but I don’t think that would address the core problem here:
Inside a mongoose callback (inside the mongoose networking task), we are making blocking calls (and in particular calls that could block for several tens of seconds).
But fundamentally is it ok to block the mongoose networking task for extended periods during a mongoose event callback?
Mark
On 21 Mar 2021, at 9:57 PM, Michael Balzer <dexter@expeedo.de> wrote:
I've found opening the web UI firmware page or calling "ota status" via ssh to consistently deadlock the network on my module.
I (130531) webserver: HTTP GET /cfg/firmware
D (130531) http: OvmsSyncHttpClient: Connect to ovms.dexters-web.de:80
D (130541) http: OvmsSyncHttpClient: waiting for completion
After that log message, the network is dead, and the netmanager also doesn't respond:
OVMS# network list
ERROR: job failed
D (183241) netmanager: send cmd 1 from 0x3ffe7054
W (193241) netmanager: ExecuteJob: cmd 1: timeout
The interfaces seem to be registered and online, but nothing gets in or out:
OVMS# network status
Interface#3: pp3 (ifup=1 linkup=1) IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64
Interface#2: ap2 (ifup=1 linkup=1) IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1
Interface#1: st1 (ifup=1 linkup=1) IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1
DNS: 192.168.2.1
Default Interface: st1 (192.168.2.106/255.255.255.0 gateway 192.168.2.1)
A couple of minutes later, server-v2 recognizes the stale connection and issues a network restart, which fails, resulting in the same behaviour as shown below and finally ending in a forced reboot when an important event is lost.
Doing "ota status" from USB works normally, so this looks like OvmsSyncHttpClient not being able to run from within a mongoose client.
Regards, Michael
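The event-queue overflow seen when a handler blocks the dispatcher comes down to a bounded queue that is no longer being drained. A minimal sketch of that mechanism, with invented names (not the actual OVMS event API):

```cpp
#include <cassert>
#include <deque>
#include <string>

// Illustration only: a fixed-capacity event queue like the one behind a
// signal-dispatch task. While a handler blocks, the queue is not drained;
// once it is full, further events are dropped -- the same pattern as the
// "queue overflow ... event 'ticker.1' dropped" log messages.
class BoundedEventQueue {
public:
  explicit BoundedEventQueue(size_t cap) : m_cap(cap) {}

  // Returns false when the event had to be dropped due to overflow.
  bool Signal(const std::string& event) {
    if (m_q.size() >= m_cap) return false;  // overflow: drop the event
    m_q.push_back(event);
    return true;
  }

  size_t Pending() const { return m_q.size(); }

private:
  size_t m_cap;
  std::deque<std::string> m_q;
};
```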
On 18.03.21 at 08:14, Mark Webb-Johnson wrote:
Tried to repeat this, but not having much success. Here is my car module, with network still up:
OVMS# boot status
Last boot was 262355 second(s) ago
I did manage to catch one network related crash after repeatedly disconnecting and reconnecting the cellular antenna. This was:
I (3717989) cellular: PPP Connection disconnected
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707).
707   if ((pri->name[0]==search[0])&&
0x400ed360 is in OvmsMetricString::SetValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358).
1357  void OvmsMetricString::SetValue(std::string value)
0x4008bdad is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586.
0x4008bdd1 is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604.
0x400fe886 is in OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522).
522   PrioritiseAndIndicate();
0x400fd752 is in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600).
600   { return (__object->*_M_pmf)(std::forward<_Args>(__args)...); }
0x400f512e is in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271).
2271  return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
0x400f52f1 is in OvmsEvents::HandleQueueSignalEvent(event_queue_t*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283).
283   m_current_callback->m_callback(m_current_event, msg->body.signal.data);
0x400f53d8 is in OvmsEvents::EventTask() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237).
237   HandleQueueSignalEvent(&msg);
0x400f545d is in EventLaunchTask(void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80).
80    me->EventTask();
My for_v3.3 branch does include the preliminary changes to support the wifi at 20MHz bandwidth, and perhaps those could be affecting things. I do notice that if I ‘power wifi off’, then ‘wifi mode client’, it can connect to the station, but not get an IP address. I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that.
Regards, Mark.
On 12 Mar 2021, at 10:32 PM, Michael Balzer <dexter@expeedo.de> wrote:
I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf.
The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes.
The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed:
2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station 2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0) 2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us
2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq 2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq 2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq 2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver 2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16 2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver 2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log 2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE 2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 30f9e79 2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled 2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled 2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16 2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16 2021-03-12 14:53:01.922 CET I (981772) wifi:Init management short buffer num: 32 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16 2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed) 2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8 2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart 2021-03-12 14:53:02.662 
CET I (982512) cellular: State: Enter PowerOffOn state 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown 2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct) 2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle 2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down) 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop 2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event) 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected) 2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event) 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop) 2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event) 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop) 2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found 2021-03-12 
14:53:04.712 CET I (984562) netmanager: WIFI access point is down 2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped 2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad) 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event) 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.wifi.sta.start) 2021-03-12 14:53:04.732 CET D (984582) events: Signal(system.modem.down) 2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost 2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down) 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event) 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start) 2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up 2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped 2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped 2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped …and so on until 2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting
I need my car now, so will switch back to master for now.
Mark, if you've got specific debug logs I should fetch on the next try, tell me.
Regards, Michael
On 12.03.21 at 05:47, Craig Leres wrote:
I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui.
But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?
Craig _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
In master, commands run via ssh or server-v2 block, because they are running synchronously in the mongoose context.

Running commands via web doesn't block, as the webcommand class starts a separate task for each execution.

The firmware config page does a synchronous call to MyOTA.GetStatus(), so that call is executed in the mongoose context. It still works in master, it just needs a second or two to fetch the version file.

Regards, Michael

On 22.03.21 at 10:38, Mark Webb-Johnson wrote:
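The per-execution task approach Michael describes for web commands can be illustrated with a portable stand-in (hypothetical names, not the actual webcommand class): the command runs on its own worker, and the network thread only collects the result later instead of waiting synchronously.

```cpp
#include <future>
#include <string>

// Hypothetical sketch of the non-blocking web command pattern: launch a
// worker for each command and return a future immediately, so the caller
// (e.g. the network event loop) is never blocked by command execution.
std::future<std::string> RunCommandAsync(std::string cmd) {
  return std::async(std::launch::async, [cmd] {
    // ... execute the command here; this stand-in just echoes it ...
    return "ok: " + cmd;
  });
}
```

The caller can poll or collect the future from a later event, keeping the dispatcher free while the command runs.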
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
OK, some progress…

I’ve added a check in the OvmsSyncHttpClient code to refuse to block while running as the netman (mongoose) task. This will now simply fail the http connection, and log an error. Not perfect, and not a solution to the core problem, but at least it avoids a known problem.

I’m not sure of the best permanent solution to this. It seems that we need a callback interface to run commands asynchronously, and use that in mongoose event handlers. Adding another mongoose event loop, or using a separate networking socket with select(), just minimise the problem - they don’t solve it. The core issue here is blocking during a mongoose event delivery. That is going to pause all high level networking.

I found a race condition in ovms_netmanager that seems nasty. The new cellular code could raise duplicate modem.down signals, picked up and handled in ovms_netmanager. As part of that it calls a PrioritiseAndIndicate() function that iterates over the network interface list (maintained by LWIP). If that network interface list is modified (eg; removing an interface) while it is being traversed, nasty crashes can happen. The ‘fix’ I’ve done is again just a workaround to try to reduce the duplicate signals and hence reduce the likelihood of the problem happening, but it won’t fix the core problem (that is in both master and for-v3.3).

There is a netif_find function in LWIP, but (a) that requires an interface number that we don’t have, and (b) it doesn’t seem to lock the list either. Can’t think of an elegant solution to this, other than modifications to lwip. We could add our own mutex and use that whenever we talk to lwip, but even that would miss out on some modifications to the network interface list, I guess.

These two changes are in ‘pre’ now, and I am trying them in my car.

Regards, Mark.
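The mutex idea Mark mentions can be sketched in portable C++ (hypothetical names; lwip's real netif list is a C linked list of struct netif, not shown here): serialize traversal and removal of a shared interface list so one thread cannot delete a node while another is iterating over it.

```cpp
#include <list>
#include <mutex>
#include <string>

// Illustration of guarding a shared interface list with a single mutex.
// All names are invented for the sketch.
static std::mutex g_netif_mutex;
static std::list<std::string> g_netifs;

void AddInterface(const std::string& name) {
  std::lock_guard<std::mutex> lock(g_netif_mutex);
  g_netifs.push_back(name);
}

void RemoveInterface(const std::string& name) {
  std::lock_guard<std::mutex> lock(g_netif_mutex);
  g_netifs.remove(name);  // safe: no traversal can be in progress
}

// A traversal like PrioritiseAndIndicate()'s walk, done under the lock so
// a concurrent RemoveInterface() cannot invalidate the iterator mid-walk.
size_t CountInterfaces() {
  std::lock_guard<std::mutex> lock(g_netif_mutex);
  return g_netifs.size();
}
```

As Mark notes, this only covers list accesses made through the wrapper; modifications lwip makes internally would still bypass the mutex.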
On 22 Mar 2021, at 6:06 PM, Michael Balzer <dexter@expeedo.de> wrote:
In master, commands run via ssh or server-v2 block, because they are executed synchronously in the mongoose context.
Running commands via web doesn't block, as the webcommand class starts a separate task for each execution.
The firmware config page does a synchronous call to MyOTA.GetStatus(), so that call is executed in the mongoose context. It still works in master, just needs a second or two to fetch the version file.
Regards, Michael
On 22.03.21 at 10:38, Mark Webb-Johnson wrote:
In master branch, at the moment, if a command is run from the web shell (or server v2), surely the mongoose task will block as the web server / server v2 blocks waiting for the command to run to completion?
Doesn’t necessarily need to be a networking command. Something long running like the string speed tests.
In v3.3 I can easily detect the task wait being requested in the http library (by seeing if current task id == mongoose task), and fail (which I should do anyway). But I am more concerned with the general case now (which I think may be wrong in both master and for-v3.3).
Regards, Mark
On 22 Mar 2021, at 5:22 PM, Michael Balzer <dexter@expeedo.de> wrote:
I think we must avoid blocking the Mongoose task, as that's the central network dispatcher.
Chris had implemented a workaround in one of his PRs that allowed blocking temporarily by running a local Mongoose main loop during a synchronous operation, but I still see potential issues with that: it isn't the standard handling done by the task, and it may need to recurse.
Maybe the old OvmsHttpClient using socket I/O is the right way for synchronous network operations?
Regards, Michael
On 22.03.21 at 07:15, Mark Webb-Johnson wrote:
Not sure how to resolve this.
OvmsSyncHttpClient is currently used in commands from ovms_plugins and ovms_ota.
I could bring back the OvmsHttpClient blocking (non-mongoose) implementation, but I don’t think that would address the core problem here:
Inside a mongoose callback (inside the mongoose networking task), we are making blocking calls (and in particular calls that could block for several tens of seconds).
But fundamentally is it ok to block the mongoose networking task for extended periods during a mongoose event callback?
Mark
On 21 Mar 2021, at 9:57 PM, Michael Balzer <dexter@expeedo.de> wrote:
I've found that opening the web UI firmware page, or calling "ota status" via ssh, consistently deadlocks the network on my module.
I (130531) webserver: HTTP GET /cfg/firmware
D (130531) http: OvmsSyncHttpClient: Connect to ovms.dexters-web.de:80
D (130541) http: OvmsSyncHttpClient: waiting for completion
After that log message, the network is dead, and the netmanager also doesn't respond:
OVMS# network list
ERROR: job failed
D (183241) netmanager: send cmd 1 from 0x3ffe7054
W (193241) netmanager: ExecuteJob: cmd 1: timeout
The interfaces seem to be registered and online, but nothing gets in or out:
OVMS# network status
Interface#3: pp3 (ifup=1 linkup=1) IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64
Interface#2: ap2 (ifup=1 linkup=1) IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1
Interface#1: st1 (ifup=1 linkup=1) IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1
DNS: 192.168.2.1
Default Interface: st1 (192.168.2.106/255.255.255.0 gateway 192.168.2.1)
A couple of minutes later, server-v2 recognises the stale connection and issues a network restart, which fails with the same behaviour shown below, finally forcing a reboot through loss of an important event.
Doing "ota status" from USB works normally, so this looks like OvmsSyncHttpClient not being able to run from within a mongoose callback.
Regards, Michael
On 18.03.21 at 08:14, Mark Webb-Johnson wrote:
Tried to repeat this, but not having much success. Here is my car module, with network still up:
OVMS# boot status
Last boot was 262355 second(s) ago
I did manage to catch one network related crash after repeatedly disconnecting and reconnecting the cellular antenna. This was:
I (3717989) cellular: PPP Connection disconnected
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707). 707 if ((pri->name[0]==search[0])&&
0x400ed360 is in OvmsMetricString::SetValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358). 1357 void OvmsMetricString::SetValue(std::string value)
0x4008bdad is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586.
0x4008bdd1 is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604.
0x400fe886 is in OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522). 522 PrioritiseAndIndicate();
0x400fd752 is in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600). 600 { return (__object->*_M_pmf)(std::forward<_Args>(__args)...); }
0x400f512e is in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271). 2271 return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
0x400f52f1 is in OvmsEvents::HandleQueueSignalEvent(event_queue_t*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283). 283 m_current_callback->m_callback(m_current_event, msg->body.signal.data);
0x400f53d8 is in OvmsEvents::EventTask() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237). 237 HandleQueueSignalEvent(&msg);
0x400f545d is in EventLaunchTask(void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80). 80 me->EventTask();
My for_v3.3 branch does include the preliminary changes to support the wifi at 20MHz bandwidth, and perhaps those could be affecting things. I do notice that if I ‘power wifi off’, then ‘wifi mode client’, it can connect to the station, but not get an IP address. I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that.
Regards, Mark.
> On 12 Mar 2021, at 10:32 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: > > Signed PGP part > I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf. > > The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes. > > The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed: > > 2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network > 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart > 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station > 2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0) > 2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us > > 2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 > 2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq > 2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq > 2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq > 2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver > 2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16 > 2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver > 2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log > 2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0 > 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE > 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE > 2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 
30f9e79 > 2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled > 2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled > 2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16 > 2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16 > 2021-03-12 14:53:01.922 CET I (981772) wifi:Init management short buffer num: 32 > 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16 > 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212 > 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16 > 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16 > 2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed) > 2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8 > 2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart > 2021-03-12 14:53:02.662 CET I (982512) cellular: State: Enter PowerOffOn state > 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... > 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt > 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed > 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown > 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... 
> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt > 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed > 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown > 2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct) > 2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle > 2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down) > 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop > 2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found > 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority > 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event) > 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected) > 2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found > 2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE > 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event) > 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop) > 2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found > 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event) > 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop) > 2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found > 2021-03-12 14:53:04.712 CET I (984562) netmanager: WIFI access point is down > 2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped > 2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad) > 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event) > 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.wifi.sta.start) > 2021-03-12 14:53:04.732 
CET D (984582) events: Signal(system.modem.down) > 2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost > 2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down) > 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event) > 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start) > 2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up > 2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped > 2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped > 2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped > …and so on until > 2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting > > > I need my car now, so will switch back to master for now. > > Mark, if you've got specific debug logs I should fetch on the next try, tell me. > > Regards, > Michael > > > Am 12.03.21 um 05:47 schrieb Craig Leres: >> I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui. >> >> But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?) >> >> Craig >> _______________________________________________ >> OvmsDev mailing list >> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> > > -- > Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal > Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 > > > >
Mark,

regarding point 2: I've had the same issue with jobs that need to iterate over the mongoose connection list; I introduced the netmanager job queue to delegate these to the mongoose context.

I remember seeing LwIP has a similar API while browsing the source… yes, found it: the "tcpip_callback…" functions, e.g.:

/**
 * Call a specific function in the thread context of
 * tcpip_thread for easy access synchronization.
 * A function called in that way may access lwIP core code
 * without fearing concurrent access.
 *
 * @param function the function to call
 * @param ctx parameter passed to f
 * @param block 1 to block until the request is posted, 0 to non-blocking mode
 * @return ERR_OK if the function was called, another err_t if not
 */
err_t tcpip_callback_with_block(tcpip_callback_fn function, void *ctx, u8_t block)

So we probably can use this to execute PrioritiseAndIndicate() in the LwIP context.

Regards, Michael

On 23.03.21 at 06:47, Mark Webb-Johnson wrote:
My attempt didn’t work (still crashes), so I’m now trying your suggestion of wrapping PrioritiseAndIndicate() in a tcpip_callback_with_block. 🤞🏻 Regards, Mark.
On 23 Mar 2021, at 3:02 PM, Michael Balzer <dexter@expeedo.de> wrote:
Mark,
regarding point 2: I've had the same issue with jobs that need to iterate over the mongoose connection list; I introduced the netmanager job queue to delegate these to the mongoose context.
I remember seeing LwIP has a similar API while browsing the source… yes, found it: the "tcpip_callback…" functions, e.g.:
/**
 * Call a specific function in the thread context of
 * tcpip_thread for easy access synchronization.
 * A function called in that way may access lwIP core code
 * without fearing concurrent access.
 *
 * @param function the function to call
 * @param ctx parameter passed to f
 * @param block 1 to block until the request is posted, 0 to non-blocking mode
 * @return ERR_OK if the function was called, another err_t if not
 */
err_t tcpip_callback_with_block(tcpip_callback_fn function, void *ctx, u8_t block)
So we probably can use this to execute PrioritiseAndIndicate() in the LwIP context.
Regards, Michael
Am 23.03.21 um 06:47 schrieb Mark Webb-Johnson:
OK, some progress…
I’ve added a check in the OvmsSyncHttpClient code to refuse to block while running as the netman (mongoose) task. This will now simply fail the http connection, and log an error. Not perfect, and not a solution to the core problem, but at least it avoids a known problem.
I’m not sure of the best permanent solution to this. It seems that we need a callback interface to run commands asynchronously, and to use that in mongoose event handlers. Adding another mongoose event loop, or using a separate networking socket with select(), just minimises the problem; it doesn’t solve it. The core issue here is blocking during a mongoose event delivery: that is going to pause all high-level networking.
I found a race condition in ovms_netmanager that seems nasty. The new cellular code could raise duplicate modem.down signals, picked up and handled in ovms_netmanager. As part of that it calls a PrioritiseAndIndicate() function that iterates over the network interface list (maintained by LwIP). If that network interface list is modified (e.g. by removing an interface) while it is being traversed, nasty crashes can happen. The ‘fix’ I’ve done is again just a workaround to try to reduce the duplicate signals and hence reduce the likelihood of the problem happening, but it won’t fix the core problem (which is in both master and for-v3.3).
There is a netif_find function in LwIP, but (a) it requires an interface number that we don’t have, and (b) it doesn’t seem to lock the list either.
Can’t think of an elegant solution to this, other than modifications to lwip. We could add our own mutex and use that whenever we talk to lwip, but even that would miss out on some modifications to the network interface list, I guess.
These two changes are in ‘pre’ now, and I am trying them in my car.
Regards, Mark.
On 22 Mar 2021, at 6:06 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
In master, running commands via ssh or server-v2 blocks, because these run synchronously in the mongoose context.
Running commands via web doesn't block, as the webcommand class starts a separate task for each execution.
The firmware config page does a synchronous call to MyOTA.GetStatus(), so that call is executed in the mongoose context. It still works in master, just needs a second or two to fetch the version file.
Regards, Michael
Am 22.03.21 um 10:38 schrieb Mark Webb-Johnson:
In master branch, at the moment, if a command is run from the web shell (or server v2), surely the mongoose task will block as the web server / server v2 blocks waiting for the command to run to completion?
Doesn’t necessarily need to be a networking command. Something long running like the string speed tests.
In v3.3 I can easily detect the task wait being requested in the http library (by seeing if current task id == mongoose task), and fail (which I should do anyway). But I am more concerned with the general case now (which I think may be wrong in both master and for-v3.3).
Regards, Mark
On 22 Mar 2021, at 5:22 PM, Michael Balzer <dexter@expeedo.de> <mailto:dexter@expeedo.de> wrote:
I think we must avoid blocking the Mongoose task, as that's the central network dispatcher.
Chris had implemented a workaround in one of his PRs that could allow that to be done temporarily, by running a local Mongoose main loop during a synchronous operation, but I still see potential issues with that: it wasn't the standard handling as done by the task, and it may need to recurse.
Maybe the old OvmsHttpClient using socket I/O is the right way for synchronous network operations?
Regards, Michael
Am 22.03.21 um 07:15 schrieb Mark Webb-Johnson:
Not sure how to resolve this.
OvmsSyncHttpClient is currently used in commands from ovms_plugins and ovms_ota.
I could bring back the OvmsHttpClient blocking (non-mongoose) implementation, but I don’t think that would address the core problem here:
Inside a mongoose callback (inside the mongoose networking task), we are making blocking calls (and in particular calls that could block for several tens of seconds).
But fundamentally is it ok to block the mongoose networking task for extended periods during a mongoose event callback?
Mark
> On 21 Mar 2021, at 9:57 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: > > Signed PGP part > I've found opening the web UI firmware page or calling "ota status" via ssh to consistently deadlock the network on my module. > > I (130531) webserver: HTTP GET /cfg/firmware > D (130531) http: OvmsSyncHttpClient: Connect to ovms.dexters-web.de:80 <http://ovms.dexters-web.de/> > D (130541) http: OvmsSyncHttpClient: waiting for completion > > After that log message, the network is dead, and the netmanager also doesn't respond: > > OVMS# network list > ERROR: job failed > D (183241) netmanager: send cmd 1 from 0x3ffe7054 > W (193241) netmanager: ExecuteJob: cmd 1: timeout > > The interfaces seem to be registered and online, but nothing gets in or out: > > OVMS# network status > Interface#3: pp3 (ifup=1 linkup=1) > IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64 > > Interface#2: ap2 (ifup=1 linkup=1) > IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1 > > Interface#1: st1 (ifup=1 linkup=1) > IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1 > > DNS: 192.168.2.1 > > Default Interface: st1 (192.168.2.106/255.255.255.0 gateway 192.168.2.1) > > > A couple of minutes later, server-v2 recognizes the stale connection and issues a network restart, which fails resulting in the same behaviour as shown below with finally forced reboot by loss of an important event. > > Doing "ota status" from USB works normally, so this looks like OvmsSyncHttpClient not being able to run from within a mongoose client. > > Regards, > Michael > > > Am 18.03.21 um 08:14 schrieb Mark Webb-Johnson: >> Tried to repeat this, but not having much success. Here is my car module, with network still up: >> >> OVMS# boot status >> Last boot was 262355 second(s) ago >> >> I did manage to catch one network related crash after repeatedly disconnecting and reconnecting the cellular antenna. 
This was: >> >> I (3717989) cellular: PPP Connection disconnected >> Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled. >> >> 0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707). >> 707 if ((pri->name[0]==search[0])&& >> >> 0x400ed360 is in OvmsMetricString::SetValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358). >> 1357 void OvmsMetricString::SetValue(std::string value) >> >> 0x4008bdad is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586. >> >> 0x4008bdd1 is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604. >> >> 0x400fe886 is in OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522). >> 522 PrioritiseAndIndicate(); >> >> 0x400fd752 is in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600). 
>> 600 { return (__object->*_M_pmf)(std::forward<_Args>(__args)...); } >> >> 0x400f512e is in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271). >> 2271 return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...); >> >> 0x400f52f1 is in OvmsEvents::HandleQueueSignalEvent(event_queue_t*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283). >> 283 m_current_callback->m_callback(m_current_event, msg->body.signal.data); >> >> 0x400f53d8 is in OvmsEvents::EventTask() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237). >> 237 HandleQueueSignalEvent(&msg); >> >> 0x400f545d is in EventLaunchTask(void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80). >> 80 me->EventTask(); >> >> My for_v3.3 branch does include the preliminary changes to support the wifi at 20MHz bandwidth, and perhaps those could be affecting things. I do notice that if I ‘power wifi off’, then ‘wifi mode client’, it can connect to the station, but not get an IP address. I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that. >> >> Regards, Mark. >> >>> On 12 Mar 2021, at 10:32 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>> >>> Signed PGP part >>> I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf. >>> >>> The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes. 
>>> >>> The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed: >>> >>> 2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network >>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart >>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station >>> 2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0) >>> 2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us >>> >>> 2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 >>> 2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq >>> 2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq >>> 2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq >>> 2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver >>> 2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16 >>> 2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver >>> 2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log >>> 2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0 >>> 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE >>> 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE >>> 2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 30f9e79 >>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled >>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled >>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16 >>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16 >>> 2021-03-12 14:53:01.922 CET I 
(981772) wifi:Init management short buffer num: 32 >>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16 >>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212 >>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16 >>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16 >>> 2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed) >>> 2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8 >>> 2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart >>> 2021-03-12 14:53:02.662 CET I (982512) cellular: State: Enter PowerOffOn state >>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... >>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt >>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed >>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown >>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... 
>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt >>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed >>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown >>> 2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct) >>> 2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle >>> 2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down) >>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop >>> 2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found >>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority >>> 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event) >>> 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected) >>> 2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found >>> 2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE >>> 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event) >>> 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop) >>> 2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found >>> 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event) >>> 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop) >>> 2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found >>> 2021-03-12 14:53:04.712 CET I (984562) netmanager: WIFI access point is down >>> 2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped >>> 2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad) >>> 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event) >>> 2021-03-12 14:53:04.722 CET D (984572) events: 
Signal(system.wifi.sta.start) >>> 2021-03-12 14:53:04.732 CET D (984582) events: Signal(system.modem.down) >>> 2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost >>> 2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down) >>> 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event) >>> 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start) >>> 2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up >>> 2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped >>> 2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped >>> 2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped >>> …and so on until >>> 2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting >>> >>> >>> I need my car now, so will switch back to master for now. >>> >>> Mark, if you've got specific debug logs I should fetch on the next try, tell me. >>> >>> Regards, >>> Michael >>> >>> >>> Am 12.03.21 um 05:47 schrieb Craig Leres: >>>> I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui. >>>> >>>> But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?) 
>>>> >>>> Craig >>>> _______________________________________________ >>>> OvmsDev mailing list >>>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>> >>> -- >>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>> >>> >>> >>> >> >> >> >> _______________________________________________ >> OvmsDev mailing list >> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> > > -- > Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal > Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 > >
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
Good grief, this is not so easy. Now we have:

Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x40008044  PS      : 0x00060f30  A0      : 0x800fe2cc  A1      : 0x3ffcaa90
A2      : 0x3f413acc  A3      : 0x00000046  A4      : 0x00e6807e  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x00000010  A9      : 0x00e6807e
A10     : 0x00000078  A11     : 0x00000009  A12     : 0x3ffcaa3f  A13     : 0x00000032
A14     : 0x00000000  A15     : 0x3ffcaa48  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00e6807e  LBEG    : 0x4008bdad  LEND    : 0x4008bdd1  LCOUNT  : 0x800f93f4

ELF file SHA256: 74bb0a75eeb4578b

Backtrace: 0x40008044:0x3ffcaa90 0x400fe2c9:0x3ffcab20 0x400fe412:0x3ffcabb0 0x402937b5:0x3ffcabd0

0x400fe2c9 is in OvmsNetManager::DoSafePrioritiseAndIndicate() (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:723).
718       }
719
720     for (struct netif *pri = netif_list; pri != NULL; pri=pri->next)
721       {
722       ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: interface %p",pri);
723       ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: name %s",pri->name);
724       if ((pri->name[0]==search[0])&&
725           (pri->name[1]==search[1]))
726         {
727         if (search[0] != m_previous_name[0] || search[1] != m_previous_name[1])

0x400fe412 is in SafePrioritiseAndIndicate(void*) (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:676).
671       }
672     }
673
674   void SafePrioritiseAndIndicate(void* ctx)
675     {
676     MyNetManager.DoSafePrioritiseAndIndicate();
677     }
678
679   void OvmsNetManager::PrioritiseAndIndicate()
680     {

0x402937b5 is in tcpip_thread (/Users/hq.mark.johnson/esp/esp-idf/components/lwip/lwip/src/api/tcpip.c:158).
153         break;
154   #endif /* LWIP_TCPIP_TIMEOUT && LWIP_TIMERS */
155
156       case TCPIP_MSG_CALLBACK:
157         LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: CALLBACK %p\n", (void *)msg));
158         msg->msg.cb.function(msg->msg.cb.ctx);
159         memp_free(MEMP_TCPIP_MSG_API, msg);
160         break;
161
162       case TCPIP_MSG_CALLBACK_STATIC:

So the issue is most likely corruption of the network interface structure, not thread-safe traversal.

I had added some ESP_EARLY_LOGI statements, so I can see a little more of what is going on:

I (103202) gsm-ppp: Initialising...
I (103212) events: Signal(system.modem.netmode)
I (105902) netmanager: DoSafePrioritiseAndIndicate: start
I (105902) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (105912) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (105912) netmanager: DoSafePrioritiseAndIndicate: name pp
I (105922) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde854
I (105932) netmanager: DoSafePrioritiseAndIndicate: name ap
I (105932) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde640
I (105942) netmanager: DoSafePrioritiseAndIndicate: name st
I (105952) netmanager: DoSafePrioritiseAndIndicate: end
I (105902) gsm-ppp: StatusCallBack: None
I (105902) gsm-ppp: status_cb: Connected
I (105902) gsm-ppp: our_ipaddr = 10.52.40.80
…
I (3708442) cellular: PPP Connection disconnected
I (3708442) cellular: PPP Connection disconnected
I (3709212) netmanager: DoSafePrioritiseAndIndicate: start
I (3709212) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (3709212) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (3709222) netmanager: DoSafePrioritiseAndIndicate: name pp
I (3709222) netmanager: DoSafePrioritiseAndIndicate: interface 0x30323930
I (3709232) netmanager: DoSafePrioritiseAndIndicate: name f
I (3709242) netmanager: DoSafePrioritiseAndIndicate: interface 0x667fc000
I (3709252) netmanager: DoSafePrioritiseAndIndicate: name
Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1)

Doesn’t help much, apart from confirming the corruption. Took about an hour to recreate the problem. I’ll keep looking.

Regards, Mark.
On 23 Mar 2021, at 4:05 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
My attempt didn’t work (still crashes), so I’m now trying your suggestion of wrapping PrioritiseAndIndicate() in a tcpip_callback_with_block.
🤞🏻
Regards, Mark.
I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that. >>> >>> Regards, Mark. >>> >>>> On 12 Mar 2021, at 10:32 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>>> >>>> Signed PGP part >>>> I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf. >>>> >>>> The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes. >>>> >>>> The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed: >>>> >>>> 2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network >>>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart >>>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station >>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0) >>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us >>>> >>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 >>>> 2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq >>>> 2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq >>>> 2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq >>>> 2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver >>>> 2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16 >>>> 2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver >>>> 2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log >>>> 2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0 >>>> 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from 
BLK0 of EFUSE >>>> 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE >>>> 2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 30f9e79 >>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled >>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled >>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16 >>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16 >>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init management short buffer num: 32 >>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16 >>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212 >>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16 >>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16 >>>> 2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed) >>>> 2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8 >>>> 2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart >>>> 2021-03-12 14:53:02.662 CET I (982512) cellular: State: Enter PowerOffOn state >>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... >>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt >>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed >>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown >>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)... 
>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt >>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed >>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown >>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct) >>>> 2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle >>>> 2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down) >>>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop >>>> 2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found >>>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority >>>> 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event) >>>> 2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected) >>>> 2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found >>>> 2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE >>>> 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event) >>>> 2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop) >>>> 2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found >>>> 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event) >>>> 2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop) >>>> 2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found >>>> 2021-03-12 14:53:04.712 CET I (984562) netmanager: WIFI access point is down >>>> 2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped >>>> 2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad) >>>> 2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event) >>>> 2021-03-12 14:53:04.722 CET D 
(984572) events: Signal(system.wifi.sta.start) >>>> 2021-03-12 14:53:04.732 CET D (984582) events: Signal(system.modem.down) >>>> 2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost >>>> 2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down) >>>> 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event) >>>> 2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start) >>>> 2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up >>>> 2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped >>>> 2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped >>>> 2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped >>>> …and so on until >>>> 2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting >>>> >>>> >>>> I need my car now, so will switch back to master for now. >>>> >>>> Mark, if you've got specific debug logs I should fetch on the next try, tell me. >>>> >>>> Regards, >>>> Michael >>>> >>>> >>>> Am 12.03.21 um 05:47 schrieb Craig Leres: >>>>> I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui. >>>>> >>>>> But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?) 
>>>>> >>>>> Craig >>>>> _______________________________________________ >>>>> OvmsDev mailing list >>>>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>>> >>>> -- >>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>>> >>>> >>>> >>>> >>> >>> >>> >>> _______________________________________________ >>> OvmsDev mailing list >>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >> >> -- >> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >> >> > > > > _______________________________________________ > OvmsDev mailing list > OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> > http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 _______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
Reviving this old topic still impacting us…

Comparing the master vs for-v3.3 branches, the only ppp-related change is that in 3.3 the ppp object is dynamically created and destroyed, while in master it is statically allocated at boot. The ppp code is so simple that I can’t see how that could be a problem. Perhaps it relates to position in memory, and some other memory corruption?

Anyway, I changed it to not destroy the ppp object when the gsm connection is lost, but merely shut the ppp down (which is what the master branch does). While it is still dynamically allocated, it is no longer as dynamic (being created just once at startup of the cellular system).

I’ve never managed to reliably reproduce this problem in my environment, but I think this change should help. It has been running on my desktop test unit for the past four days without issue.

That code is committed now. I would appreciate it if others who saw this problem could try again with the latest build of the for-v3.3 branch.

Regards, Mark.
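To illustrate the lifecycle change described above, here is a minimal sketch of the pattern: allocate the ppp object once, on first use, and only shut it down (never destroy it) when the connection drops. The class and method names here are illustrative only, not the actual OVMS classes.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the gsm-ppp session object.
class GsmPPP
  {
  public:
    void Connect() { m_connected = true; }
    void Shutdown() { m_connected = false; }   // close the link, keep the object
    bool IsConnected() const { return m_connected; }
  private:
    bool m_connected = false;
  };

class CellularModem
  {
  public:
    GsmPPP* GetPPP()
      {
      // Allocate once at cellular startup; reuse for the system's lifetime.
      if (!m_ppp) m_ppp = std::make_unique<GsmPPP>();
      return m_ppp.get();
      }
    void OnDisconnect()
      {
      // Previous for-v3.3 behaviour: m_ppp.reset() (destroy, recreate later).
      // New behaviour: just shut the session down; the allocation stays stable,
      // so no other task can be left holding a dangling pointer.
      if (m_ppp) m_ppp->Shutdown();
      }
  private:
    std::unique_ptr<GsmPPP> m_ppp;
  };
```

The point of the change is that the object's address never becomes invalid mid-flight, which sidesteps any use-after-free if another task still references it during a disconnect.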
On 24 Mar 2021, at 3:53 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
Good grief, this is not so easy. Now we have:
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x40008044  PS      : 0x00060f30  A0      : 0x800fe2cc  A1      : 0x3ffcaa90
A2      : 0x3f413acc  A3      : 0x00000046  A4      : 0x00e6807e  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x00000010  A9      : 0x00e6807e
A10     : 0x00000078  A11     : 0x00000009  A12     : 0x3ffcaa3f  A13     : 0x00000032
A14     : 0x00000000  A15     : 0x3ffcaa48  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00e6807e  LBEG    : 0x4008bdad  LEND    : 0x4008bdd1  LCOUNT  : 0x800f93f4
ELF file SHA256: 74bb0a75eeb4578b
Backtrace: 0x40008044:0x3ffcaa90 0x400fe2c9:0x3ffcab20 0x400fe412:0x3ffcabb0 0x402937b5:0x3ffcabd0
0x400fe2c9 is in OvmsNetManager::DoSafePrioritiseAndIndicate() (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:723).
718	  }
719
720	  for (struct netif *pri = netif_list; pri != NULL; pri=pri->next)
721	    {
722	    ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: interface %p",pri);
723	    ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: name %s",pri->name);
724	    if ((pri->name[0]==search[0])&&
725	        (pri->name[1]==search[1]))
726	      {
727	      if (search[0] != m_previous_name[0] || search[1] != m_previous_name[1])

0x400fe412 is in SafePrioritiseAndIndicate(void*) (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:676).
671	    }
672	  }
673
674	void SafePrioritiseAndIndicate(void* ctx)
675	  {
676	  MyNetManager.DoSafePrioritiseAndIndicate();
677	  }
678
679	void OvmsNetManager::PrioritiseAndIndicate()
680	  {

0x402937b5 is in tcpip_thread (/Users/hq.mark.johnson/esp/esp-idf/components/lwip/lwip/src/api/tcpip.c:158).
153	      break;
154	#endif /* LWIP_TCPIP_TIMEOUT && LWIP_TIMERS */
155
156	    case TCPIP_MSG_CALLBACK:
157	      LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: CALLBACK %p\n", (void *)msg));
158	      msg->msg.cb.function(msg->msg.cb.ctx);
159	      memp_free(MEMP_TCPIP_MSG_API, msg);
160	      break;
161
162	    case TCPIP_MSG_CALLBACK_STATIC:
So the issue is most likely corruption of the network interface structures themselves, not a lack of thread-safe traversal.
I had added some ESP_EARLY_LOGI statements, so can see a little more of what is going on:
I (103202) gsm-ppp: Initialising...
I (103212) events: Signal(system.modem.netmode)
I (105902) netmanager: DoSafePrioritiseAndIndicate: start
I (105902) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (105912) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (105912) netmanager: DoSafePrioritiseAndIndicate: name pp
I (105922) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde854
I (105932) netmanager: DoSafePrioritiseAndIndicate: name ap
I (105932) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde640
I (105942) netmanager: DoSafePrioritiseAndIndicate: name st
I (105952) netmanager: DoSafePrioritiseAndIndicate: end
I (105902) gsm-ppp: StatusCallBack: None
I (105902) gsm-ppp: status_cb: Connected
I (105902) gsm-ppp: our_ipaddr = 10.52.40.80
…
I (3708442) cellular: PPP Connection disconnected
I (3708442) cellular: PPP Connection disconnected
I (3709212) netmanager: DoSafePrioritiseAndIndicate: start
I (3709212) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (3709212) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (3709222) netmanager: DoSafePrioritiseAndIndicate: name pp
I (3709222) netmanager: DoSafePrioritiseAndIndicate: interface 0x30323930
I (3709232) netmanager: DoSafePrioritiseAndIndicate: name f
I (3709242) netmanager: DoSafePrioritiseAndIndicate: interface 0x667fc000
I (3709252) netmanager: DoSafePrioritiseAndIndicate: name
Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1)
This doesn’t help much, apart from confirming the corruption. It took about an hour to recreate the problem.
I’ll keep looking.
Regards, Mark.
On 23 Mar 2021, at 4:05 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
My attempt didn’t work (still crashes), so I’m now trying your suggestion of wrapping PrioritiseAndIndicate() in a tcpip_callback_with_block.
🤞🏻
Regards, Mark.
On 23 Mar 2021, at 3:02 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
regarding point 2: I've had the same issue with jobs that need to iterate over the mongoose connection list; I introduced the netmanager job queue to delegate these to the mongoose context.
I remember seeing LwIP has a similar API while browsing the source… yes, found it: the "tcpip_callback…" functions, e.g.:
/**
 * Call a specific function in the thread context of
 * tcpip_thread for easy access synchronization.
 * A function called in that way may access lwIP core code
 * without fearing concurrent access.
 *
 * @param function the function to call
 * @param ctx parameter passed to f
 * @param block 1 to block until the request is posted, 0 to non-blocking mode
 * @return ERR_OK if the function was called, another err_t if not
 */
err_t tcpip_callback_with_block(tcpip_callback_fn function, void *ctx, u8_t block)
So we probably can use this to execute PrioritiseAndIndicate() in the LwIP context.
Regards, Michael
Am 23.03.21 um 06:47 schrieb Mark Webb-Johnson:
OK, some progress…
I’ve added a check in the OvmsSyncHttpClient code to refuse to block while running as the netman (mongoose) task. This will now simply fail the http connection, and log an error. Not perfect, and not a solution to the core problem, but at least it avoids a known problem.
I’m not sure of the best permanent solution to this. It seems that we need a callback interface to run commands asynchronously, and use that in mongoose event handlers. Adding another mongoose event loop, or using a separate networking socket with select(), only minimises the problem; it doesn’t solve it. The core issue here is blocking during mongoose event delivery, which pauses all high-level networking.
I found a race condition in ovms_netmanager that seems nasty. The new cellular code could raise duplicate modem.down signals, picked up and handled in ovms_netmanager. As part of that, it calls a PrioritiseAndIndicate() function that iterates over the network interface list (maintained by LWIP). If that network interface list is modified (e.g. by removing an interface) while it is being traversed, nasty crashes can happen. The ‘fix’ I’ve done is again just a workaround to reduce the duplicate signals and hence the likelihood of the problem occurring, but it won’t fix the core problem (which is in both master and for-v3.3).
There is a netif_find function in LWIP, but (a) it requires an interface name/number that we don’t have, and (b) it doesn’t seem to lock the list either.
Can’t think of an elegant solution to this, other than modifications to lwip. We could add our own mutex and use that whenever we talk to lwip, but even that would miss out on some modifications to the network interface list, I guess.
These two changes are in ‘pre’ now, and I am trying them in my car.
Regards, Mark.
On 22 Mar 2021, at 6:06 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
In master, running commands via ssh or server-v2 blocks, because these run synchronously in the mongoose context.
Running commands via web doesn't block, as the webcommand class starts a separate task for each execution.
The firmware config page does a synchronous call to MyOTA.GetStatus(), so that call is executed in the mongoose context. It still works in master, just needs a second or two to fetch the version file.
Regards, Michael
Am 22.03.21 um 10:38 schrieb Mark Webb-Johnson:
In master branch, at the moment, if a command is run from the web shell (or server v2), surely the mongoose task will block as the web server / server v2 blocks waiting for the command to run to completion?
Doesn’t necessarily need to be a networking command. Something long running like the string speed tests.
In v3.3 I can easily detect the task wait being requested in the http library (by seeing if current task id == mongoose task), and fail (which I should do anyway). But I am more concerned with the general case now (which I think may be wrong in both master and for-v3.3).
Regards, Mark
> On 22 Mar 2021, at 5:22 PM, Michael Balzer <dexter@expeedo.de> wrote:
>
> I think we must avoid blocking the Mongoose task, as that's the central network dispatcher.
>
> Chris had implemented a workaround in one of his PRs that could allow that to be done temporarily by running a local Mongoose main loop during a synchronous operation, but I still see potential issues from that, as it wasn't the standard handling as done by the task, and as it may need to recurse.
>
> Maybe the old OvmsHttpClient using socket I/O is the right way for synchronous network operations?
>
> Regards,
> Michael
>>>>>> >>>>>> Craig >>>>>> _______________________________________________ >>>>>> OvmsDev mailing list >>>>>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>>>> >>>>> -- >>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> OvmsDev mailing list >>>> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>> >>> -- >>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>> >>> >> >> >> >> _______________________________________________ >> OvmsDev mailing list >> OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> >> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> > > -- > Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal > Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 > _______________________________________________ > OvmsDev mailing list > OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> > http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Mark,

I've been running the new for-v3.3 version this week on both of my modules without ppp issues.

Duktape still occasionally runs into the null/undefined issue with for…in:

https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-744005044

for…in normally doesn't throw an error even if you run over null or undefined.

I think both could still be the SPIRAM bug, now probably only occurring with very specific conditions. We build with LWIP using SPIRAM as well, so the PPP instance is allocated from SPIRAM also. Reallocating the instance on each new connect implies a higher chance of triggering the problem if it's address specific. The Duktape stack object addresses vary continuously with the running event handlers and user interactions, so that also has a high chance of occasionally triggering an address-specific bug.

We need to test the revision 3 ESP32 on this.

Regards,
Michael

On 9 Sep 2021, at 2:31 AM, Mark Webb-Johnson wrote:
Reviving this old topic, which is still impacting us…
Comparing the master and for-v3.3 branches, the only ppp-related change is that in 3.3 the ppp object is dynamically created and destroyed, while in master it is statically allocated at boot. The ppp code is so simple that I can’t see how that could be a problem. Perhaps it is related to position in memory, and some other memory corruption?
Anyway, I changed it to not destroy the ppp object when the gsm connection is lost, but merely shut down the ppp (which is what the master branch does). While it is still dynamically allocated, it is no longer as dynamic (being created just once, at startup of the cellular system). I’ve never managed to reliably reproduce this problem in my environment, but I think this should help. It has been running on my desktop test unit for the past four days without issue.
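The lifecycle change described above can be sketched as follows. All names here (PPPSession, CellularManager) are illustrative stand-ins, not the actual OVMS classes: the point is that the object is allocated once and only shut down across carrier losses, so its address no longer varies per reconnect.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the real ppp object.
struct PPPSession {
  bool connected = false;
  void Connect()  { connected = true;  }
  void Shutdown() { connected = false; } // shut down the link, keep the object
};

// Old for-v3.3 behaviour destroyed and recreated the ppp object on every
// carrier loss, landing it at a fresh (possibly SPIRAM) address each time.
// The change sketched here allocates it once, lazily, then only ever
// shuts it down and reconnects.
class CellularManager {
 public:
  PPPSession* GetSession() {
    if (!m_ppp) m_ppp.reset(new PPPSession()); // created just once
    return m_ppp.get();
  }
  void OnCarrierLost() { GetSession()->Shutdown(); } // no delete
  void OnCarrierUp()   { GetSession()->Connect(); }
 private:
  std::unique_ptr<PPPSession> m_ppp; // lives for the process lifetime
};
```

The object's address is now stable across any number of disconnect/reconnect cycles, which is what makes an address-specific SPIRAM bug less likely to be retriggered.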
That code is committed now. I would appreciate it if others who saw this problem could try again with this latest build of the for-v3.3 branch.
Regards, Mark.
On 24 Mar 2021, at 3:53 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
Good grief, this is not so easy. Now we have:
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x40008044 PS : 0x00060f30 A0 : 0x800fe2cc A1 : 0x3ffcaa90
A2 : 0x3f413acc A3 : 0x00000046 A4 : 0x00e6807e A5 : 0x00000000
A6 : 0x00000000 A7 : 0x00000000 A8 : 0x00000010 A9 : 0x00e6807e
A10 : 0x00000078 A11 : 0x00000009 A12 : 0x3ffcaa3f A13 : 0x00000032
A14 : 0x00000000 A15 : 0x3ffcaa48 SAR : 0x00000004 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00e6807e LBEG : 0x4008bdad LEND : 0x4008bdd1 LCOUNT : 0x800f93f4
ELF file SHA256: 74bb0a75eeb4578b
Backtrace: 0x40008044:0x3ffcaa90 0x400fe2c9:0x3ffcab20 0x400fe412:0x3ffcabb0 0x402937b5:0x3ffcabd0
0x400fe2c9 is in OvmsNetManager::DoSafePrioritiseAndIndicate() (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:723).
718       }
719
720     for (struct netif *pri = netif_list; pri != NULL; pri=pri->next)
721       {
722       ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: interface %p",pri);
723       ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: name %s",pri->name);
724       if ((pri->name[0]==search[0])&&
725           (pri->name[1]==search[1]))
726         {
727         if (search[0] != m_previous_name[0] || search[1] != m_previous_name[1])

0x400fe412 is in SafePrioritiseAndIndicate(void*) (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:676).
671       }
672     }
673
674   void SafePrioritiseAndIndicate(void* ctx)
675     {
676     MyNetManager.DoSafePrioritiseAndIndicate();
677     }
678
679   void OvmsNetManager::PrioritiseAndIndicate()
680     {

0x402937b5 is in tcpip_thread (/Users/hq.mark.johnson/esp/esp-idf/components/lwip/lwip/src/api/tcpip.c:158).
153         break;
154 #endif /* LWIP_TCPIP_TIMEOUT && LWIP_TIMERS */
155
156       case TCPIP_MSG_CALLBACK:
157         LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: CALLBACK %p\n", (void *)msg));
158         msg->msg.cb.function(msg->msg.cb.ctx);
159         memp_free(MEMP_TCPIP_MSG_API, msg);
160         break;
161
162       case TCPIP_MSG_CALLBACK_STATIC:
So the issue is most likely corruption of the network interface structure itself, rather than unsafe traversal of the list.
I had added some ESP_EARLY_LOGI statements, so I can see a little more of what is going on:
I (103202) gsm-ppp: Initialising...
I (103212) events: Signal(system.modem.netmode)
I (105902) netmanager: DoSafePrioritiseAndIndicate: start
I (105902) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (105912) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (105912) netmanager: DoSafePrioritiseAndIndicate: name pp
I (105922) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde854
I (105932) netmanager: DoSafePrioritiseAndIndicate: name ap
I (105932) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde640
I (105942) netmanager: DoSafePrioritiseAndIndicate: name st
I (105952) netmanager: DoSafePrioritiseAndIndicate: end
I (105902) gsm-ppp: StatusCallBack: None
I (105902) gsm-ppp: status_cb: Connected
I (105902) gsm-ppp: our_ipaddr = 10.52.40.80
…
I (3708442) cellular: PPP Connection disconnected
I (3708442) cellular: PPP Connection disconnected
I (3709212) netmanager: DoSafePrioritiseAndIndicate: start
I (3709212) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (3709212) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (3709222) netmanager: DoSafePrioritiseAndIndicate: name pp
I (3709222) netmanager: DoSafePrioritiseAndIndicate: interface 0x30323930
I (3709232) netmanager: DoSafePrioritiseAndIndicate: name f
I (3709242) netmanager: DoSafePrioritiseAndIndicate: interface 0x667fc000
I (3709252) netmanager: DoSafePrioritiseAndIndicate: name
Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1)
That doesn’t help much, apart from confirming the corruption. It took about an hour to recreate the problem.
I’ll keep looking.
Regards, Mark.
On 23 Mar 2021, at 4:05 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
My attempt didn’t work (still crashes), so I’m now trying your suggestion of wrapping PrioritiseAndIndicate() in a tcpip_callback_with_block.
🤞🏻
Regards, Mark.
On 23 Mar 2021, at 3:02 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
regarding point 2: I've had the same issue with jobs that need to iterate over the mongoose connection list; I introduced the netmanager job queue to delegate these to the mongoose context.
I remember seeing LwIP has a similar API while browsing the source… yes, found it: the "tcpip_callback…" functions, e.g.:
/**
 * Call a specific function in the thread context of
 * tcpip_thread for easy access synchronization.
 * A function called in that way may access lwIP core code
 * without fearing concurrent access.
 *
 * @param function the function to call
 * @param ctx parameter passed to f
 * @param block 1 to block until the request is posted, 0 to non-blocking mode
 * @return ERR_OK if the function was called, another err_t if not
 */
err_t tcpip_callback_with_block(tcpip_callback_fn function, void *ctx, u8_t block)
So we probably can use this to execute PrioritiseAndIndicate() in the LwIP context.
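The mechanism can be sketched on the host without lwIP: a single network thread drains a queue of (function, context) messages, which is essentially what tcpip_callback_with_block() does inside tcpip_thread — the posted function runs only in that one thread, so it can walk netif_list without racing interface add/remove. TcpipThreadSketch, NetState and DoPrioritiseAndIndicate below are illustrative stand-ins, not lwIP or OVMS code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Same shape as lwIP's tcpip_callback_fn.
using callback_fn = void (*)(void*);

class TcpipThreadSketch {
 public:
  TcpipThreadSketch() : m_worker([this] { Run(); }) {}
  ~TcpipThreadSketch() {
    Post(nullptr, nullptr); // sentinel to stop the loop
    m_worker.join();
  }
  // Equivalent of tcpip_callback_with_block(fn, ctx, 0): queue and return.
  void Post(callback_fn fn, void* ctx) {
    std::lock_guard<std::mutex> lk(m_mutex);
    m_queue.push({fn, ctx});
    m_cond.notify_one();
  }
 private:
  void Run() {
    for (;;) {
      std::unique_lock<std::mutex> lk(m_mutex);
      m_cond.wait(lk, [this] { return !m_queue.empty(); });
      auto msg = m_queue.front(); m_queue.pop();
      lk.unlock();
      if (!msg.first) return;   // sentinel reached
      msg.first(msg.second);    // runs in this (network) thread only
    }
  }
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::queue<std::pair<callback_fn, void*>> m_queue;
  std::thread m_worker; // declared last: started after queue is ready
};

// What a PrioritiseAndIndicate-style job looks like when always dispatched
// this way: it records which thread it actually ran on.
struct NetState { std::thread::id ran_on; int runs = 0; };
void DoPrioritiseAndIndicate(void* ctx) {
  auto* st = static_cast<NetState*>(ctx);
  st->ran_on = std::this_thread::get_id();
  st->runs++;
}
```

Because every traversal is serialized onto the one thread that also performs the list mutations, no extra locking is needed — the same property the real tcpip_callback API gives us for netif_list.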
Regards, Michael
On 23 Mar 2021, at 6:47 AM, Mark Webb-Johnson wrote:
OK, some progress…
1. I’ve added a check in the OvmsSyncHttpClient code to refuse to block while running as the netman (mongoose) task. This will now simply fail the http connection, and log an error. Not perfect, and not a solution to the core problem, but at least it avoids a known problem.
I’m not sure of the best permanent solution to this. It seems that we need a callback interface to run commands asynchronously, and use that in mongoose event handlers. Adding another mongoose event loop, or using a separate networking socket with select(), just minimise the problem - they don’t solve it. The core issue here is blocking during a mongoose event delivery. That is going to pause all high level networking.
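The task-id check from point 1 can be sketched like this; RegisterNetworkTask and SafeToBlock are hypothetical names, not the actual guard in OvmsSyncHttpClient, but the idea is the same: remember the network (mongoose) task's id at startup, and refuse to enter a blocking wait when called from that task, failing the request instead of deadlocking the dispatcher.

```cpp
#include <cassert>
#include <thread>

// Illustrative sketch using std::thread ids; on the ESP32 this would compare
// FreeRTOS task handles instead.
static std::thread::id s_network_task_id;

// Called once, from inside the network (mongoose) task.
void RegisterNetworkTask() { s_network_task_id = std::this_thread::get_id(); }

bool IsNetworkTask() { return std::this_thread::get_id() == s_network_task_id; }

// A synchronous client calls this before blocking on completion:
// false means "fail the request now" rather than stall the dispatcher.
bool SafeToBlock() { return !IsNetworkTask(); }
```

This only avoids the known deadlock; as noted above, it does not solve the underlying problem of long-running work inside a mongoose event callback.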
2. I found a race condition in ovms_netmanager that seems nasty. The new cellular code could raise duplicate modem.down signals, picked up and handled in ovms_netmanager. As part of that it calls a PrioritiseAndIndicate() function that iterates over the network interface list (maintained by LWIP). If that network interface list is modified (eg; removing an interface) while it is being traversed, nasty crashes can happen. The ‘fix’ I’ve done is again just a workaround to try to reduce the duplicate signals and hence reduce the likelihood of the problem happening, but it won’t fix the core problem (which is in both master and for-v3.3).
There is a netif_find function in LWIP, but (a) that requires an interface number that we don’t have, and (b) doesn’t seem to lock the list either.
Can’t think of an elegant solution to this, other than modifications to lwip. We could add our own mutex and use that whenever we talk to lwip, but even that would miss out on some modifications to the network interface list, I guess.
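For reference, the "own mutex" idea might look like the sketch below, with a minimal stand-in netif struct (not the real lwIP one). As noted, lwIP's own internal list mutations would not take this lock, so it would only protect the code paths we control — which is exactly the gap described above.

```cpp
#include <cassert>
#include <mutex>

// Minimal stand-in for lwIP's netif; illustrative only.
struct netif { const char* name; netif* next; };

static netif* netif_list = nullptr;
static std::mutex netif_mutex; // would have to guard *all* list mutations too

void netif_add_locked(netif* n) {
  std::lock_guard<std::mutex> lk(netif_mutex);
  n->next = netif_list;
  netif_list = n;
}

void netif_remove_locked(netif* n) {
  std::lock_guard<std::mutex> lk(netif_mutex);
  for (netif** p = &netif_list; *p; p = &(*p)->next)
    if (*p == n) { *p = n->next; return; }
}

// Traversal holds the same lock, so an interface can never be unlinked
// mid-iteration by any code path that also uses the lock.
int count_interfaces_locked() {
  std::lock_guard<std::mutex> lk(netif_mutex);
  int count = 0;
  for (netif* p = netif_list; p; p = p->next) count++;
  return count;
}
```

Dispatching the traversal into the tcpip thread (the tcpip_callback approach) avoids this incompleteness, because lwIP's own mutations already happen in that thread.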
These two changes are in ‘pre’ now, and I am trying them in my car.
Regards, Mark.
On 22 Mar 2021, at 6:06 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
In master, running commands via ssh or server-v2 blocks, because these run synchronously in the mongoose context.
Running commands via web doesn't block, as the webcommand class starts a separate task for each execution.
The firmware config page does a synchronous call to MyOTA.GetStatus(), so that call is executed in the mongoose context. It still works in master, just needs a second or two to fetch the version file.
Regards, Michael
On 22 Mar 2021, at 10:38 AM, Mark Webb-Johnson wrote:
> In master branch, at the moment, if a command is run from the web shell (or server v2), surely the mongoose task will block as the web server / server v2 blocks waiting for the command to run to completion?
>
> Doesn’t necessarily need to be a networking command. Something long running like the string speed tests.
>
> In v3.3 I can easily detect the task wait being requested in the http library (by seeing if current task id == mongoose task), and fail (which I should do anyway). But I am more concerned with the general case now (which I think may be wrong in both master and for-v3.3).
>
> Regards, Mark
>
>> On 22 Mar 2021, at 5:22 PM, Michael Balzer <dexter@expeedo.de> wrote:
>>
>> I think we must avoid blocking the Mongoose task, as that's the central network dispatcher.
>>
>> Chris had implemented a workaround in one of his PRs that could allow that to be done temporarily by running a local Mongoose main loop during a synchronous operation, but I still see potential issues from that, as it wasn't the standard handling as done by the task, and as it may need to recurse.
>>
>> Maybe the old OvmsHttpClient using socket I/O is the right way for synchronous network operations?
>>
>> Regards,
>> Michael
>>
>> On 22 Mar 2021, at 7:15 AM, Mark Webb-Johnson wrote:
>>> Not sure how to resolve this.
>>>
>>> OvmsSyncHttpClient is currently used in commands from ovms_plugins and ovms_ota.
>>>
>>> I could bring back the OvmsHttpClient blocking (non-mongoose) implementation, but I don’t think that would address the core problem here:
>>>
>>> Inside a mongoose callback (inside the mongoose networking task), we are making blocking calls (and in particular calls that could block for several tens of seconds).
>>>
>>> But fundamentally is it ok to block the mongoose networking task for extended periods during a mongoose event callback?
>>> >>> Mark >>> >>>> On 21 Mar 2021, at 9:57 PM, Michael Balzer <dexter@expeedo.de >>>> <mailto:dexter@expeedo.de>> wrote: >>>> >>>> Signed PGP part >>>> I've found opening the web UI firmware page or calling "ota >>>> status" via ssh to consistently deadlock the network on my >>>> module. >>>> >>>> I (130531) webserver: HTTP GET /cfg/firmware >>>> D (130531) http: OvmsSyncHttpClient: Connect to >>>> ovms.dexters-web.de:80 <http://ovms.dexters-web.de/> >>>> D (130541) http: OvmsSyncHttpClient: waiting for completion >>>> >>>> After that log message, the network is dead, and the >>>> netmanager also doesn't respond: >>>> >>>> OVMS# network list >>>> ERROR: job failed >>>> D (183241) netmanager: send cmd 1 from 0x3ffe7054 >>>> W (193241) netmanager: ExecuteJob: cmd 1: timeout >>>> >>>> The interfaces seem to be registered and online, but nothing >>>> gets in or out: >>>> >>>> OVMS# network status >>>> Interface#3: pp3 (ifup=1 linkup=1) >>>> IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64 >>>> >>>> Interface#2: ap2 (ifup=1 linkup=1) >>>> IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1 >>>> >>>> Interface#1: st1 (ifup=1 linkup=1) >>>> IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1 >>>> >>>> DNS: 192.168.2.1 >>>> >>>> Default Interface: st1 (192.168.2.106/255.255.255.0 gateway >>>> 192.168.2.1) >>>> >>>> >>>> A couple of minutes later, server-v2 recognizes the stale >>>> connection and issues a network restart, which fails >>>> resulting in the same behaviour as shown below with finally >>>> forced reboot by loss of an important event. >>>> >>>> Doing "ota status" from USB works normally, so this looks >>>> like OvmsSyncHttpClient not being able to run from within a >>>> mongoose client. >>>> >>>> Regards, >>>> Michael >>>> >>>> >>>> Am 18.03.21 um 08:14 schrieb Mark Webb-Johnson: >>>>> Tried to repeat this, but not having much success. 
Here is >>>>> my car module, with network still up: >>>>> >>>>> OVMS# boot status >>>>> Last boot was 262355 second(s) ago >>>>> >>>>> >>>>> I did manage to catch one network related crash after >>>>> repeatedly disconnecting and reconnecting the cellular >>>>> antenna. This was: >>>>> >>>>> I (3717989) cellular: PPP Connection disconnected >>>>> Guru Meditation Error: Core 1 panic'ed >>>>> (LoadProhibited). Exception was unhandled. >>>>> >>>>> 0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() >>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707). >>>>> >>>>> 707 if ((pri->name[0]==search[0])&& >>>>> >>>>> >>>>> 0x400ed360 is in >>>>> OvmsMetricString::SetValue(std::__cxx11::basic_string<char, >>>>> std::char_traits<char>, std::allocator<char> >) >>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358). >>>>> >>>>> 1357 void OvmsMetricString::SetValue(std::string >>>>> value) >>>>> >>>>> >>>>> 0x4008bdad is at >>>>> ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586. >>>>> >>>>> >>>>> 0x4008bdd1 is at >>>>> ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604. >>>>> >>>>> >>>>> 0x400fe886 is in >>>>> OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, >>>>> std::char_traits<char>, std::allocator<char> >, void*) >>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522). 
>>>>> >>>>> 522 PrioritiseAndIndicate(); >>>>> >>>>> >>>>> 0x400fd752 is in std::_Function_handler<void >>>>> (std::__cxx11::basic_string<char, >>>>> std::char_traits<char>, std::allocator<char> >, void*), >>>>> std::_Bind<std::_Mem_fn<void >>>>> (OvmsNetManager::*)(std::__cxx11::basic_string<char, >>>>> std::char_traits<char>, std::allocator<char> >, void*)> >>>>> (OvmsNetManager*, std::_Placeholder<1>, >>>>> std::_Placeholder<2>)> >::_M_invoke(std::_Any_data >>>>> const&, std::__cxx11::basic_string<char, >>>>> std::char_traits<char>, std::allocator<char> >&&, >>>>> void*&&) >>>>> (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600). >>>>> >>>>> 600 { return >>>>> (__object->*_M_pmf)(std::forward<_Args>(__args)...); } >>>>> >>>>> >>>>> 0x400f512e is in std::function<void >>>>> (std::__cxx11::basic_string<char, >>>>> std::char_traits<char>, std::allocator<char> >, >>>>> void*)>::operator()(std::__cxx11::basic_string<char, >>>>> std::char_traits<char>, std::allocator<char> >, void*) >>>>> const >>>>> (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271). >>>>> >>>>> 2271 return _M_invoker(_M_functor, >>>>> std::forward<_ArgTypes>(__args)...); >>>>> >>>>> >>>>> 0x400f52f1 is in >>>>> OvmsEvents::HandleQueueSignalEvent(event_queue_t*) >>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283). >>>>> >>>>> 283 m_current_callback->m_callback(m_current_event, >>>>> msg->body.signal.data); >>>>> >>>>> >>>>> 0x400f53d8 is in OvmsEvents::EventTask() >>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237). >>>>> >>>>> 237 HandleQueueSignalEvent(&msg); >>>>> >>>>> >>>>> 0x400f545d is in EventLaunchTask(void*) >>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80). 
>>>>> >>>>> 80 me->EventTask(); >>>>> >>>>> >>>>> My for_v3.3 branch does include the preliminary changes to >>>>> support the wifi at 20MHz bandwidth, and perhaps those could >>>>> be affecting things. I do notice that if I ‘power wifi off’, >>>>> then ‘wifi mode client’, it can connect to the station, but >>>>> not get an IP address. I’ve just tried to merge in the >>>>> latest fixes to that, and rebuilt a release. I will continue >>>>> to test with that. >>>>> >>>>> Regards, Mark. >>>>> >>>>>> On 12 Mar 2021, at 10:32 PM, Michael Balzer >>>>>> <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>>>>> >>>>>> Signed PGP part >>>>>> I just tried switching to for-v3.3 in my car module after >>>>>> tests on my desk module were OK, and I've run into the very >>>>>> same problem with for-v3.3. So the issue isn't related to >>>>>> esp-idf. >>>>>> >>>>>> The network only occasionally starts normally, but even >>>>>> then all connectivity is lost after a couple of minutes. >>>>>> >>>>>> The stale connection watchdog in server-v2 triggers a >>>>>> network restart, but that also doesn't seem to succeed: >>>>>> >>>>>> 2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: >>>>>> Detected stale connection (issue #241), restarting network >>>>>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart >>>>>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping >>>>>> WIFI station >>>>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> >>>>>> init (0) >>>>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total >>>>>> sleep time: 831205045 us / 975329961 us >>>>>> >>>>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, >>>>>> old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 >>>>>> 2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq >>>>>> 2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq >>>>>> 2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq >>>>>> 2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering >>>>>> down WIFI driver >>>>>> 
2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc >>>>>> rx mblock:16 >>>>>> 2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering >>>>>> up WIFI driver >>>>>> 2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, >>>>>> erase log key successfully, reinit nvs log >>>>>> 2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver >>>>>> task: 3ffd4d84, prio:23, stack:3584, core=0 >>>>>> 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC >>>>>> address is not set, read default base MAC address from BLK0 >>>>>> of EFUSE >>>>>> 2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC >>>>>> address is not set, read default base MAC address from BLK0 >>>>>> of EFUSE >>>>>> 2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware >>>>>> version: 30f9e79 >>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS >>>>>> flash: enabled >>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config nano >>>>>> formating: disabled >>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame >>>>>> dynamic rx buffer num: 16 >>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init management >>>>>> frame dynamic rx buffer num: 16 >>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init management >>>>>> short buffer num: 32 >>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx >>>>>> buffer num: 16 >>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx >>>>>> buffer size: 2212 >>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx >>>>>> buffer num: 16 >>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx >>>>>> buffer num: 16 >>>>>> 2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta >>>>>> (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed) >>>>>> 2021-03-12 14:53:02.652 CET I (982502) wifi:Total power >>>>>> save buffer number: 8 >>>>>> 2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: >>>>>> Restart >>>>>> 2021-03-12 14:53:02.662 CET I (982512) cellular: State: >>>>>> Enter PowerOffOn 
state >>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting >>>>>> down (hard)... >>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: >>>>>> StatusCallBack: User Interrupt >>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP >>>>>> connection has been closed >>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown >>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting >>>>>> down (hard)... >>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: >>>>>> StatusCallBack: User Interrupt >>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP >>>>>> connection has been closed >>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown >>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown >>>>>> (direct) >>>>>> 2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: >>>>>> Power Cycle >>>>>> 2021-03-12 14:53:04.682 CET D (984532) events: >>>>>> Signal(system.wifi.down) >>>>>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI >>>>>> client stop >>>>>> 2021-03-12 14:53:04.682 CET E (984532) netmanager: >>>>>> Inconsistent state: no interface of type 'pp' found >>>>>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI >>>>>> client down (with MODEM up): reconfigured for MODEM priority >>>>>> 2021-03-12 14:53:04.692 CET D (984542) events: >>>>>> Signal(system.event) >>>>>> 2021-03-12 14:53:04.692 CET D (984542) events: >>>>>> Signal(system.wifi.sta.disconnected) >>>>>> 2021-03-12 14:53:04.692 CET E (984542) netmanager: >>>>>> Inconsistent state: no interface of type 'pp' found >>>>>> 2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA >>>>>> disconnected with reason 8 = ASSOC_LEAVE >>>>>> 2021-03-12 14:53:04.702 CET D (984552) events: >>>>>> Signal(system.event) >>>>>> 2021-03-12 14:53:04.702 CET D (984552) events: >>>>>> Signal(system.wifi.sta.stop) >>>>>> 2021-03-12 14:53:04.702 CET E (984552) netmanager: >>>>>> Inconsistent state: no interface of type 'pp' found >>>>>> 2021-03-12 
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
Thanks for the feedback. I should have a final sample of v3.3 hardware with rev3 esp32 in my hands towards the end of this month (this is the same sample set that goes to be destroyed by the certification labs). Is there any way of triggering the bug earlier, for replication? Like a stress test or something? Or just have to wait. Regards, Mark.
On 17 Sep 2021, at 3:24 PM, Michael Balzer <dexter@expeedo.de> wrote:
Mark,
I've been running the new for-v3.3 version this week on both of my modules without ppp issues.
Duktape still occasionally runs into the null/undefined issue with for…in:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-744005044
for…in normally doesn't throw an error even if you run over null or undefined.
I think both could still be the SPIRAM bug, now probably only occurring under very specific conditions. We build with LWIP using SPIRAM as well, so the PPP instance is allocated from SPIRAM also. Reallocating the instance on each new connect implies a higher chance of triggering the problem if it's address-specific. The Duktape stack object addresses vary continuously with the running event handlers and user interactions, so that also has a high chance of occasionally triggering an address-specific bug.
We need to test the revision 3 ESP32 on this.
Regards, Michael
On 09.09.21 02:31, Mark Webb-Johnson wrote:
Reviving this old topic, which is still impacting us…
Comparing the master vs for-v3.3 branches, the only thing that changed related to ppp is that in 3.3 the ppp object is dynamically created and destroyed, while in master it is statically allocated at boot. The ppp code is so simple, and I can’t see how that could be a problem. Perhaps related to position in memory, and some other memory corruption?
Anyway, I changed it to not destroy the ppp object when the gsm connection is lost, but merely shut down the ppp (which is what the master branch does). While it is still dynamically allocated, it is no longer as dynamic (being created just once at startup of the cellular system). I’ve never managed to reliably repeat this problem in my environment, but I think this should help. It has been running on my desktop test unit for the past four days without issue.
That code is committed now. I would appreciate it if others who saw this problem could try again with this latest build of the for-v3.3 branch.
Regards, Mark.
On 24 Mar 2021, at 3:53 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
Good grief, this is not so easy. Now we have:
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x40008044  PS      : 0x00060f30  A0      : 0x800fe2cc  A1      : 0x3ffcaa90
A2      : 0x3f413acc  A3      : 0x00000046  A4      : 0x00e6807e  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x00000010  A9      : 0x00e6807e
A10     : 0x00000078  A11     : 0x00000009  A12     : 0x3ffcaa3f  A13     : 0x00000032
A14     : 0x00000000  A15     : 0x3ffcaa48  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00e6807e  LBEG    : 0x4008bdad  LEND    : 0x4008bdd1  LCOUNT  : 0x800f93f4
ELF file SHA256: 74bb0a75eeb4578b
Backtrace: 0x40008044:0x3ffcaa90 0x400fe2c9:0x3ffcab20 0x400fe412:0x3ffcabb0 0x402937b5:0x3ffcabd0
0x400fe2c9 is in OvmsNetManager::DoSafePrioritiseAndIndicate() (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:723).
718	  }
719
720	  for (struct netif *pri = netif_list; pri != NULL; pri=pri->next)
721	    {
722	    ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: interface %p",pri);
723	    ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: name %s",pri->name);
724	    if ((pri->name[0]==search[0])&&
725	        (pri->name[1]==search[1]))
726	      {
727	      if (search[0] != m_previous_name[0] || search[1] != m_previous_name[1])

0x400fe412 is in SafePrioritiseAndIndicate(void*) (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:676).
671	    }
672	  }
673
674	void SafePrioritiseAndIndicate(void* ctx)
675	  {
676	  MyNetManager.DoSafePrioritiseAndIndicate();
677	  }
678
679	void OvmsNetManager::PrioritiseAndIndicate()
680	  {

0x402937b5 is in tcpip_thread (/Users/hq.mark.johnson/esp/esp-idf/components/lwip/lwip/src/api/tcpip.c:158).
153	      break;
154	#endif /* LWIP_TCPIP_TIMEOUT && LWIP_TIMERS */
155
156	    case TCPIP_MSG_CALLBACK:
157	      LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: CALLBACK %p\n", (void *)msg));
158	      msg->msg.cb.function(msg->msg.cb.ctx);
159	      memp_free(MEMP_TCPIP_MSG_API, msg);
160	      break;
161
162	    case TCPIP_MSG_CALLBACK_STATIC:
So the issue is most likely corruption of the network interface structure itself, rather than unsafe concurrent traversal of the list.
I had added some ESP_EARLY_LOGI statements, so I can see a little more of what is going on:
I (103202) gsm-ppp: Initialising...
I (103212) events: Signal(system.modem.netmode)
I (105902) netmanager: DoSafePrioritiseAndIndicate: start
I (105902) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (105912) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (105912) netmanager: DoSafePrioritiseAndIndicate: name pp
I (105922) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde854
I (105932) netmanager: DoSafePrioritiseAndIndicate: name ap
I (105932) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde640
I (105942) netmanager: DoSafePrioritiseAndIndicate: name st
I (105952) netmanager: DoSafePrioritiseAndIndicate: end
I (105902) gsm-ppp: StatusCallBack: None
I (105902) gsm-ppp: status_cb: Connected
I (105902) gsm-ppp: our_ipaddr = 10.52.40.80
…
I (3708442) cellular: PPP Connection disconnected
I (3708442) cellular: PPP Connection disconnected
I (3709212) netmanager: DoSafePrioritiseAndIndicate: start
I (3709212) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (3709212) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (3709222) netmanager: DoSafePrioritiseAndIndicate: name pp
I (3709222) netmanager: DoSafePrioritiseAndIndicate: interface 0x30323930
I (3709232) netmanager: DoSafePrioritiseAndIndicate: name f
I (3709242) netmanager: DoSafePrioritiseAndIndicate: interface 0x667fc000
I (3709252) netmanager: DoSafePrioritiseAndIndicate: name
Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1)
Doesn’t help much, apart from confirming the corruption. It took about an hour to recreate the problem.
I’ll keep looking.
Regards, Mark.
On 23 Mar 2021, at 4:05 PM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
My attempt didn’t work (still crashes), so I’m now trying your suggestion of wrapping PrioritiseAndIndicate() in a tcpip_callback_with_block.
🤞🏻
Regards, Mark.
On 23 Mar 2021, at 3:02 PM, Michael Balzer <dexter@expeedo.de> wrote:
Mark,
regarding point 2: I've had the same issue with jobs that need to iterate over the mongoose connection list; I introduced the netmanager job queue to delegate those jobs to the mongoose context.
I remember seeing LwIP has a similar API while browsing the source… yes, found it: the "tcpip_callback…" functions, e.g.:
/**
 * Call a specific function in the thread context of
 * tcpip_thread for easy access synchronization.
 * A function called in that way may access lwIP core code
 * without fearing concurrent access.
 *
 * @param function the function to call
 * @param ctx parameter passed to f
 * @param block 1 to block until the request is posted, 0 to non-blocking mode
 * @return ERR_OK if the function was called, another err_t if not
 */
err_t tcpip_callback_with_block(tcpip_callback_fn function, void *ctx, u8_t block)
So we probably can use this to execute PrioritiseAndIndicate() in the LwIP context.
Regards, Michael
On 23.03.21 06:47, Mark Webb-Johnson wrote:
OK, some progress…
I’ve added a check in the OvmsSyncHttpClient code to refuse to block while running as the netman (mongoose) task. This will now simply fail the http connection, and log an error. Not perfect, and not a solution to the core problem, but at least it avoids a known problem.
I’m not sure of the best permanent solution to this. It seems that we need a callback interface to run commands asynchronously, and use that in mongoose event handlers. Adding another mongoose event loop, or using a separate networking socket with select(), just minimise the problem - they don’t solve it. The core issue here is blocking during a mongoose event delivery. That is going to pause all high level networking.
I found a race condition in ovms_netmanager that seems nasty. The new cellular code could raise duplicate modem.down signals, picked up and handled in ovms_netmanager. As part of that it calls a PrioritiseAndIndicate() function that iterates over the network interface list (maintained by LWIP). If that network interface list is modified (e.g. by removing an interface) while it is being traversed, nasty crashes can happen. The ‘fix’ I’ve done is again just a workaround, to reduce the duplicate signals and hence the likelihood of the problem happening, but it won’t fix the core problem (which is present in both master and for-v3.3).
There is a netif_find function in LWIP, but (a) that requires an interface number that we don’t have, and (b) doesn’t seem to lock the list either.
Can’t think of an elegant solution to this, other than modifications to lwip. We could add our own mutex and use that whenever we talk to lwip, but even that would miss out on some modifications to the network interface list, I guess.
These two changes are in ‘pre’ now, and I am trying them in my car.
Regards, Mark.
On 22 Mar 2021, at 6:06 PM, Michael Balzer <dexter@expeedo.de> wrote:

In master, running commands via ssh or server-v2 blocks, because these are running synchronously in the mongoose context.

Running commands via web doesn't block, as the webcommand class starts a separate task for each execution.

The firmware config page does a synchronous call to MyOTA.GetStatus(), so that call is executed in the mongoose context. It still works in master, just needs a second or two to fetch the version file.

Regards, Michael

On 22.03.21 10:38, Mark Webb-Johnson wrote:

In master branch, at the moment, if a command is run from the web shell (or server v2), surely the mongoose task will block as the web server / server v2 blocks waiting for the command to run to completion?

Doesn’t necessarily need to be a networking command. Something long running like the string speed tests.

In v3.3 I can easily detect the task wait being requested in the http library (by seeing if current task id == mongoose task), and fail (which I should do anyway). But I am more concerned with the general case now (which I think may be wrong in both master and for-v3.3).

Regards, Mark

On 22 Mar 2021, at 5:22 PM, Michael Balzer <dexter@expeedo.de> wrote:

I think we must avoid blocking the Mongoose task, as that's the central network dispatcher.

Chris had implemented a workaround in one of his PRs that could allow that to be done temporarily by running a local Mongoose main loop during a synchronous operation, but I still see potential issues from that, as it wasn't the standard handling as done by the task, and as it may need to recurse.

Maybe the old OvmsHttpClient using socket I/O is the right way for synchronous network operations?

Regards, Michael

On 22.03.21 07:15, Mark Webb-Johnson wrote:

Not sure how to resolve this.

OvmsSyncHttpClient is currently used in commands from ovms_plugins and ovms_ota.

I could bring back the OvmsHttpClient blocking (non-mongoose) implementation, but I don’t think that would address the core problem here:

Inside a mongoose callback (inside the mongoose networking task), we are making blocking calls (and in particular calls that could block for several tens of seconds).

But fundamentally is it ok to block the mongoose networking task for extended periods during a mongoose event callback?

Mark

On 21 Mar 2021, at 9:57 PM, Michael Balzer <dexter@expeedo.de> wrote:

I've found opening the web UI firmware page or calling "ota status" via ssh to consistently deadlock the network on my module.

I (130531) webserver: HTTP GET /cfg/firmware
D (130531) http: OvmsSyncHttpClient: Connect to ovms.dexters-web.de:80
D (130541) http: OvmsSyncHttpClient: waiting for completion

After that log message, the network is dead, and the netmanager also doesn't respond:

OVMS# network list
ERROR: job failed
D (183241) netmanager: send cmd 1 from 0x3ffe7054
W (193241) netmanager: ExecuteJob: cmd 1: timeout

The interfaces seem to be registered and online, but nothing gets in or out:

OVMS# network status
Interface#3: pp3 (ifup=1 linkup=1)
  IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64

Interface#2: ap2 (ifup=1 linkup=1)
  IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1

Interface#1: st1 (ifup=1 linkup=1)
  IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1

DNS: 192.168.2.1

Default Interface: st1 (192.168.2.106/255.255.255.0 gateway 192.168.2.1)

A couple of minutes later, server-v2 recognizes the stale connection and issues a network restart, which fails, resulting in the same behaviour as shown below, finally with a forced reboot by loss of an important event.

Doing "ota status" from USB works normally, so this looks like OvmsSyncHttpClient not being able to run from within a mongoose client.

Regards, Michael

On 18.03.21 08:14, Mark Webb-Johnson wrote:

Tried to repeat this, but not having much success. Here is my car module, with network still up:

OVMS# boot status
Last boot was 262355 second(s) ago

I did manage to catch one network related crash after repeatedly disconnecting and reconnecting the cellular antenna. This was:

I (3717989) cellular: PPP Connection disconnected
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.

0x400fe082 is in OvmsNetManager::PrioritiseAndIndicate() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707).
707	if ((pri->name[0]==search[0])&&

0x400ed360 is in OvmsMetricString::SetValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358).
1357	void OvmsMetricString::SetValue(std::string value)

0x4008bdad is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586.

0x4008bdd1 is at ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604.

0x400fe886 is in OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522).
522	PrioritiseAndIndicate();

0x400fd752 is in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*), std::_Bind<std::_Mem_fn<void (OvmsNetManager::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)> (OvmsNetManager*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, void*&&) (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600).
600	{ return (__object->*_M_pmf)(std::forward<_Args>(__args)...); }

0x400f512e is in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*) const (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271).
2271	return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);

0x400f52f1 is in OvmsEvents::HandleQueueSignalEvent(event_queue_t*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283).
283	m_current_callback->m_callback(m_current_event, msg->body.signal.data);

0x400f53d8 is in OvmsEvents::EventTask() (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237).
237	HandleQueueSignalEvent(&msg);

0x400f545d is in EventLaunchTask(void*) (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80).
80	me->EventTask();

My for_v3.3 branch does include the preliminary changes to support the wifi at 20MHz bandwidth, and perhaps those could be affecting things. I do notice that if I ‘power wifi off’, then ‘wifi mode client’, it can connect to the station, but not get an IP address. I’ve just tried to merge in the latest fixes to that, and rebuilt a release. I will continue to test with that.

Regards, Mark.

On 12 Mar 2021, at 10:32 PM, Michael Balzer <dexter@expeedo.de> wrote:

I just tried switching to for-v3.3 in my car module after tests on my desk module were OK, and I've run into the very same problem with for-v3.3. So the issue isn't related to esp-idf.

The network only occasionally starts normally, but even then all connectivity is lost after a couple of minutes.

The stale connection watchdog in server-v2 triggers a network restart, but that also doesn't seem to succeed:

2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: Detected stale connection (issue #241), restarting network
2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart
2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Stopping WIFI station
2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> init (0)
2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, total sleep time: 831205045 us / 975329961 us
2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq
2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq
2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq
2021-03-12 14:53:01.852 CET I (981702) esp32wifi: Powering down WIFI driver
2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc rx mblock:16
2021-03-12 14:53:01.862 CET I (981712) esp32wifi: Powering up WIFI driver
2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, erase log key successfully, reinit nvs log
2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver task: 3ffd4d84, prio:23, stack:3584, core=0
2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
2021-03-12 14:53:01.882 CET I (981732) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware version: 30f9e79
2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS flash: enabled
2021-03-12 14:53:01.912 CET I (981762) wifi:config nano formating: disabled
2021-03-12 14:53:01.912 CET I (981762) wifi:Init data frame dynamic rx buffer num: 16
2021-03-12 14:53:01.912 CET I (981762) wifi:Init management frame dynamic rx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init management short buffer num: 32
2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic tx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer size: 2212
2021-03-12 14:53:01.922 CET I (981772) wifi:Init static rx buffer num: 16
2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic rx buffer num: 16
2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed)
2021-03-12 14:53:02.652 CET I (982502) wifi:Total power save buffer number: 8
2021-03-12 14:53:02.652 CET I (982502) cellular-modem-auto: Restart
2021-03-12 14:53:02.662 CET I (982512) cellular: State: Enter PowerOffOn state
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)...
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: StatusCallBack: User Interrupt
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP connection has been closed
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is shutdown
2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting down (hard)...
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: StatusCallBack: User Interrupt
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP connection has been closed
2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is shutdown
2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown (direct)
2021-03-12 14:53:02.672 CET I (982522) cellular-modem-auto: Power Cycle
2021-03-12 14:53:04.682 CET D (984532) events: Signal(system.wifi.down)
2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client stop
2021-03-12 14:53:04.682 CET E (984532) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI client down (with MODEM up): reconfigured for MODEM priority
2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.event)
2021-03-12 14:53:04.692 CET D (984542) events: Signal(system.wifi.sta.disconnected)
2021-03-12 14:53:04.692 CET E (984542) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA disconnected with reason 8 = ASSOC_LEAVE
2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.event)
2021-03-12 14:53:04.702 CET D (984552) events: Signal(system.wifi.sta.stop)
2021-03-12 14:53:04.702 CET E (984552) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.event)
2021-03-12 14:53:04.712 CET D (984562) events: Signal(system.wifi.ap.stop)
2021-03-12 14:53:04.712 CET E (984562) netmanager: Inconsistent state: no interface of type 'pp' found
2021-03-12 14:53:04.712 CET I (984562) netmanager: WIFI access point is down
2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped
2021-03-12 14:53:04.722 CET D (984572) events: Signal(network.wifi.sta.bad)
2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.event)
2021-03-12 14:53:04.722 CET D (984572) events: Signal(system.wifi.sta.start)
2021-03-12 14:53:04.732 CET D (984582) events: Signal(system.modem.down)
2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM down (with WIFI client down): network connectivity has been lost
2021-03-12 14:53:04.742 CET D (984592) events: Signal(system.modem.down)
2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.event)
2021-03-12 14:53:04.752 CET D (984602) events: Signal(system.wifi.ap.start)
2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI access point is up
2021-03-12 14:53:26.802 CET E (1006652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 23 sec), event 'ticker.1' dropped
2021-03-12 14:53:27.802 CET E (1007652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 24 sec), event 'ticker.1' dropped
2021-03-12 14:53:28.802 CET E (1008652) events: SignalEvent: queue overflow (running system.wifi.ap.start->netmanager for 25 sec), event 'ticker.1' dropped
…and so on until
2021-03-12 14:54:01.802 CET E (1041652) events: SignalEvent: lost important event => aborting

I need my car now, so will switch back to master for now.

Mark, if you've got specific debug logs I should fetch on the next try, tell me.

Regards, Michael

On 12.03.21 05:47, Craig Leres wrote:

I just updated to 3.2.016-68-g8e10c6b7 and still get the network hang immediately after booting and logging into the web gui.

But I see now my problem is likely that I'm not using the right esp-idf (duh). Is there a way I can have master build using ~/esp/esp-idf and have for-v3.3 use a different path?

Craig
I just remembered…

https://github.com/espressif/esp-idf/issues/2892#issuecomment-459120605

…and as Duktape runs on core #1, I am now testing shifting the Duktape heap into the upper 2 MB of SPIRAM. The TypeError in PubSub occurred about 3 times per day in my car module, so I should see a difference over the weekend.

Regards, Michael

On 17.09.21 10:02, Mark Webb-Johnson wrote:
Thanks for the feedback.
I should have a final sample of v3.3 hardware with rev3 esp32 in my hands towards the end of this month (this is the same sample set that goes to be destroyed by the certification labs). Is there any way of triggering the bug earlier, for replication? Like a stress test or something? Or just have to wait.
Regards, Mark.
On 17 Sep 2021, at 3:24 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Signed PGP part Mark,
I've been running the new for-v3.3 version this week on both of my modules without ppp issues.
Duktape still occasionally runs into the null/undefined issue with for…in:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
for…in normally doesn't throw an error even if you run over null or undefined.
I think both could still be the SPIRAM bug, now probably only occurring with very specific conditions. We build with LWIP using SPIRAM as well, so the PPP instance is allocated from SPIRAM also. Reallocating the instance on each new connect implies a higher chance of triggering the problem if it's address specific. The Duktape stack object addresses vary continuously with the running event handlers and user interactions, so that also has a high chance of occasionally triggering an address specific bug.
We need to test the revision 3 ESP32 on this.
Regards, Michael
Am 09.09.21 um 02:31 schrieb Mark Webb-Johnson:
Reviving this old topic still impacting us…
Comparing the master vs for-v3.3 branches, the only thing that changed related to ppp is that in 3.3 the ppp object is dynamically created and destroyed, while in master it is statically allocated at boot. The ppp code is so simple, and I can’t see how that could be a problem. Perhaps related to position in memory, and some other memory corruption?
Anyway, I changed it to not destroy the ppp object when the gsm connection is lost, but merely shutdown the ppp (which is what master branch does). While it is still dynamically allocated, it is no longer as dynamic (being created just once at startup of the cellular system). I’ve never managed to reliably repeat this problem in my environment, but I think this should help.It has been running on my desktop test unit for the past four days without issue.
That code is committed now. I would appreciate it if others who saw this problem could try again with this latest build of the for-v3.3 branch.
Regards, Mark.
On 24 Mar 2021, at 3:53 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
Signed PGP part Good grief, this is not so easy. Now we have:
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x40008044 PS : 0x00060f30 A0 : 0x800fe2cc A1 : 0x3ffcaa90
A2 : 0x3f413acc A3 : 0x00000046 A4 : 0x00e6807e A5 : 0x00000000
A6 : 0x00000000 A7 : 0x00000000 A8 : 0x00000010 A9 : 0x00e6807e
A10 : 0x00000078 A11 : 0x00000009 A12 : 0x3ffcaa3f A13 : 0x00000032
A14 : 0x00000000 A15 : 0x3ffcaa48 SAR : 0x00000004 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00e6807e LBEG : 0x4008bdad LEND : 0x4008bdd1 LCOUNT : 0x800f93f4
ELF file SHA256: 74bb0a75eeb4578b
Backtrace: 0x40008044:0x3ffcaa90 0x400fe2c9:0x3ffcab20 0x400fe412:0x3ffcabb0 0x402937b5:0x3ffcabd0
0x400fe2c9 is in OvmsNetManager::DoSafePrioritiseAndIndicate() (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:723).

718     }
719
720   for (struct netif *pri = netif_list; pri != NULL; pri=pri->next)
721     {
722     ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: interface %p",pri);
723     ESP_EARLY_LOGI(TAG,"DoSafePrioritiseAndIndicate: name %s",pri->name);
724     if ((pri->name[0]==search[0])&&
725         (pri->name[1]==search[1]))
726       {
727       if (search[0] != m_previous_name[0] || search[1] != m_previous_name[1])

0x400fe412 is in SafePrioritiseAndIndicate(void*) (/Users/hq.mark.johnson/Documents/ovms/Open-Vehicle-Monitoring-System-3/vehicle/OVMS.V3/main/ovms_netmanager.cpp:676).

671     }
672   }
673
674 void SafePrioritiseAndIndicate(void* ctx)
675   {
676   MyNetManager.DoSafePrioritiseAndIndicate();
677   }
678
679 void OvmsNetManager::PrioritiseAndIndicate()
680   {

0x402937b5 is in tcpip_thread (/Users/hq.mark.johnson/esp/esp-idf/components/lwip/lwip/src/api/tcpip.c:158).

153       break;
154 #endif /* LWIP_TCPIP_TIMEOUT && LWIP_TIMERS */
155
156     case TCPIP_MSG_CALLBACK:
157       LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: CALLBACK %p\n", (void *)msg));
158       msg->msg.cb.function(msg->msg.cb.ctx);
159       memp_free(MEMP_TCPIP_MSG_API, msg);
160       break;
161
162     case TCPIP_MSG_CALLBACK_STATIC:
So the issue is most likely corruption of the network interface structure, not thread safe traversal.
I had added some ESP_EARLY_LOGI statements, so can see a little more of what is going on:
I (103202) gsm-ppp: Initialising...
I (103212) events: Signal(system.modem.netmode)
I (105902) netmanager: DoSafePrioritiseAndIndicate: start
I (105902) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (105912) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (105912) netmanager: DoSafePrioritiseAndIndicate: name pp
I (105922) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde854
I (105932) netmanager: DoSafePrioritiseAndIndicate: name ap
I (105932) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffde640
I (105942) netmanager: DoSafePrioritiseAndIndicate: name st
I (105952) netmanager: DoSafePrioritiseAndIndicate: end
I (105902) gsm-ppp: StatusCallBack: None
I (105902) gsm-ppp: status_cb: Connected
I (105902) gsm-ppp: our_ipaddr = 10.52.40.80
…
I (3708442) cellular: PPP Connection disconnected
I (3708442) cellular: PPP Connection disconnected
I (3709212) netmanager: DoSafePrioritiseAndIndicate: start
I (3709212) netmanager: DoSafePrioritiseAndIndicate: connected wifi
I (3709212) netmanager: DoSafePrioritiseAndIndicate: interface 0x3ffed6a0
I (3709222) netmanager: DoSafePrioritiseAndIndicate: name pp
I (3709222) netmanager: DoSafePrioritiseAndIndicate: interface 0x30323930
I (3709232) netmanager: DoSafePrioritiseAndIndicate: name f
I (3709242) netmanager: DoSafePrioritiseAndIndicate: interface 0x667fc000
I (3709252) netmanager: DoSafePrioritiseAndIndicate: name
Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1)
Doesn’t help much, apart from confirming the corruption. It took about an hour to recreate the problem.
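One cheap diagnostic for this kind of fault: the corrupted "interface" values in the log above (0x30323930 is the ASCII digits "0929", i.e. string data where a pointer should be) suggest a plausibility check before each dereference could catch the corruption one step earlier. A standalone C sketch of such a check; the bounds here are approximate stand-ins for the ESP32 data address space and should be verified against the actual memory map before use:

```c
#include <stddef.h>
#include <stdint.h>

/* Approximate ESP32 data address range (flash cache / SPIRAM / internal
 * DRAM all map below 0x40000000); placeholder values, verify before use. */
#define DATA_LO 0x3F400000u
#define DATA_HI 0x40000000u

/* Returns 1 if the pointer is non-NULL, aligned, and inside a plausible
 * data region; 0 if it looks corrupted (e.g. overwritten by string data). */
static int plausible_ptr(const void *p)
  {
  uintptr_t a = (uintptr_t)p;
  if (p == NULL) return 0;
  if (a % sizeof(void *) != 0) return 0;  /* misaligned => suspect */
  return (a >= DATA_LO) && (a < DATA_HI);
  }
```

Applied to the values in the log: 0x3ffed6a0 passes, while 0x30323930 and 0x667fc000 both fail, so a check like this in the traversal loop would have flagged the corrupted node instead of crashing on it.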
I’ll keep looking.
Regards, Mark.
On 23 Mar 2021, at 4:05 PM, Mark Webb-Johnson <mark@webb-johnson.net <mailto:mark@webb-johnson.net>> wrote:
My attempt didn’t work (still crashes), so I’m now trying your suggestion of wrapping PrioritiseAndIndicate() in a tcpip_callback_with_block.
🤞🏻
Regards, Mark.
On 23 Mar 2021, at 3:02 PM, Michael Balzer <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote:
Mark,
regarding point 2: I've had the same issue with jobs that need to iterate over the mongoose connection list, I introduced the netmanager job queue for this to delegate these to the mongoose context.
I remember seeing LwIP has a similar API while browsing the source… yes, found it: the "tcpip_callback…" functions, e.g.:
/**
 * Call a specific function in the thread context of
 * tcpip_thread for easy access synchronization.
 * A function called in that way may access lwIP core code
 * without fearing concurrent access.
 *
 * @param function the function to call
 * @param ctx parameter passed to f
 * @param block 1 to block until the request is posted, 0 to non-blocking mode
 * @return ERR_OK if the function was called, another err_t if not
 */
err_t tcpip_callback_with_block(tcpip_callback_fn function, void *ctx, u8_t block)
So we probably can use this to execute PrioritiseAndIndicate() in the LwIP context.
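The underlying pattern here is to post work into the owning thread's message queue instead of touching its data structures from another task. A standalone C++ sketch of that pattern; this uses std::thread as a stand-in for the LwIP tcpip thread and is not the real LwIP API:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Single-threaded executor: all posted work runs on one owner thread,
// mirroring how tcpip_callback_with_block() defers work to tcpip_thread
// so callers never touch its data structures concurrently.
class OwnerThread
  {
  public:
    OwnerThread() : m_stop(false), m_thread([this]{ Run(); }) {}
    ~OwnerThread()
      {
      { std::lock_guard<std::mutex> lk(m_mutex); m_stop = true; }
      m_cond.notify_one();
      m_thread.join();              // drains remaining work before exit
      }
    // Analogue of tcpip_callback_with_block(fn, ctx, 1):
    // queue the callback for execution on the owner thread.
    void Post(std::function<void()> fn)
      {
      { std::lock_guard<std::mutex> lk(m_mutex); m_queue.push(std::move(fn)); }
      m_cond.notify_one();
      }
  private:
    void Run()
      {
      std::unique_lock<std::mutex> lk(m_mutex);
      while (true)
        {
        m_cond.wait(lk, [this]{ return m_stop || !m_queue.empty(); });
        if (m_stop && m_queue.empty()) return;
        auto fn = std::move(m_queue.front());
        m_queue.pop();
        lk.unlock();
        fn();                       // runs on the owner thread only
        lk.lock();
        }
      }
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<std::function<void()>> m_queue;
    bool m_stop;
    std::thread m_thread;
  };
```

Since only the owner thread ever executes the callbacks, anything they touch (in LwIP's case the netif list) needs no further locking.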
Regards, Michael
Am 23.03.21 um 06:47 schrieb Mark Webb-Johnson:
> OK, some progress…
>
> 1. I’ve added a check in the OvmsSyncHttpClient code to refuse to block while running as the netman (mongoose) task. This will now simply fail the http connection, and log an error. Not perfect, and not a solution to the core problem, but at least it avoids a known problem.
>
> I’m not sure of the best permanent solution to this. It seems that we need a callback interface to run commands asynchronously, and use that in mongoose event handlers. Adding another mongoose event loop, or using a separate networking socket with select(), just minimises the problem - they don’t solve it. The core issue here is blocking during a mongoose event delivery. That is going to pause all high level networking.
>
> 2. I found a race condition in ovms_netmanager that seems nasty. The new cellular code could raise duplicate modem.down signals, picked up and handled in ovms_netmanager. As part of that it calls a PrioritiseAndIndicate() function that iterates over the network interface list (maintained by LWIP). If that network interface list is modified (eg; removing an interface) while it is being traversed, nasty crashes can happen. The ‘fix’ I’ve done is again just a workaround to try to reduce the duplicate signals and hence reduce the likelihood of the problem happening, but it won’t fix the core problem (that is in both master and for-v3.3).
>
> There is a netif_find function in LWIP, but (a) that requires an interface number that we don’t have, and (b) doesn’t seem to lock the list either.
>
> Can’t think of an elegant solution to this, other than modifications to lwip. We could add our own mutex and use that whenever we talk to lwip, but even that would miss out on some modifications to the network interface list, I guess.
>
> These two changes are in ‘pre’ now, and I am trying them in my car.
>
> Regards, Mark.
> >> On 22 Mar 2021, at 6:06 PM, Michael Balzer <dexter@expeedo.de >> <mailto:dexter@expeedo.de>> wrote: >> >> Signed PGP part >> In master, running commands via ssh or server-v2 block, because >> these are running synchronously in the mongoose context. >> >> Running commands via web doesn't block, as the webcommand class >> starts a separate task for each execution. >> >> The firmware config page does a synchronous call to >> MyOTA.GetStatus(), so that call is executed in the mongoose >> context. It still works in master, just needs a second or two >> to fetch the version file. >> >> Regards, >> Michael >> >> >> Am 22.03.21 um 10:38 schrieb Mark Webb-Johnson: >>> In master branch, at the moment, if a command is run from the >>> web shell (or server v2), surely the mongoose task will block >>> as the web server / server v2 blocks waiting for the command >>> to run to completion? >>> >>> Doesn’t necessarily need to be a networking command. Something >>> long running like the string speed tests. >>> >>> In v3.3 I can easily detect the task wait being requested in >>> the http library (by seeing if current task id == mongoose >>> task), and fail (which I should do anyway). But I am more >>> concerned with the general case now (which I think may be >>> wrong in both master and for-v3.3). >>> >>> Regards, Mark >>> >>>> On 22 Mar 2021, at 5:22 PM, Michael Balzer >>>> <dexter@expeedo.de> wrote: >>>> >>>> I think we must avoid blocking the Mongoose task, as that's >>>> the central network dispatcher. >>>> >>>> Chris had implemented a workaround in one of his PRs that >>>> could allow that to be done temporarily by running a local >>>> Mongoose main loop during a synchronous operation, but I >>>> still see potential issues from that, as it wasn't the >>>> standard handling as done by the task, and as it may need to >>>> recurse. >>>> >>>> Maybe the old OvmsHttpClient using socket I/O is the right >>>> way for synchronous network operations? 
>>>> >>>> Regards, >>>> Michael >>>> >>>> >>>> Am 22.03.21 um 07:15 schrieb Mark Webb-Johnson: >>>>> Not sure how to resolve this. >>>>> >>>>> OvmsSyncHttpClient is currently used in commands from >>>>> ovms_plugins and ovms_ota. >>>>> >>>>> I could bring back the OvmsHttpClient blocking >>>>> (non-mongoose) implementation, but I don’t think that would >>>>> address the core problem here: >>>>> >>>>> Inside a mongoose callback (inside the mongoose >>>>> networking task), we are making blocking calls (and in >>>>> particular calls that could block for several tens of >>>>> seconds). >>>>> >>>>> >>>>> But fundamentally is it ok to block the mongoose networking >>>>> task for extended periods during a mongoose event callback? >>>>> >>>>> Mark >>>>> >>>>>> On 21 Mar 2021, at 9:57 PM, Michael Balzer >>>>>> <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>>>>> >>>>>> Signed PGP part >>>>>> I've found opening the web UI firmware page or calling "ota >>>>>> status" via ssh to consistently deadlock the network on my >>>>>> module. 
>>>>>> >>>>>> I (130531) webserver: HTTP GET /cfg/firmware >>>>>> D (130531) http: OvmsSyncHttpClient: Connect to >>>>>> ovms.dexters-web.de:80 <http://ovms.dexters-web.de/> >>>>>> D (130541) http: OvmsSyncHttpClient: waiting for completion >>>>>> >>>>>> After that log message, the network is dead, and the >>>>>> netmanager also doesn't respond: >>>>>> >>>>>> OVMS# network list >>>>>> ERROR: job failed >>>>>> D (183241) netmanager: send cmd 1 from 0x3ffe7054 >>>>>> W (193241) netmanager: ExecuteJob: cmd 1: timeout >>>>>> >>>>>> The interfaces seem to be registered and online, but >>>>>> nothing gets in or out: >>>>>> >>>>>> OVMS# network status >>>>>> Interface#3: pp3 (ifup=1 linkup=1) >>>>>> IPv4: 10.170.195.13/255.255.255.255 gateway 10.64.64.64 >>>>>> >>>>>> Interface#2: ap2 (ifup=1 linkup=1) >>>>>> IPv4: 192.168.4.1/255.255.255.0 gateway 192.168.4.1 >>>>>> >>>>>> Interface#1: st1 (ifup=1 linkup=1) >>>>>> IPv4: 192.168.2.106/255.255.255.0 gateway 192.168.2.1 >>>>>> >>>>>> DNS: 192.168.2.1 >>>>>> >>>>>> Default Interface: st1 (192.168.2.106/255.255.255.0 gateway >>>>>> 192.168.2.1) >>>>>> >>>>>> >>>>>> A couple of minutes later, server-v2 recognizes the stale >>>>>> connection and issues a network restart, which fails >>>>>> resulting in the same behaviour as shown below with finally >>>>>> forced reboot by loss of an important event. >>>>>> >>>>>> Doing "ota status" from USB works normally, so this looks >>>>>> like OvmsSyncHttpClient not being able to run from within a >>>>>> mongoose client. >>>>>> >>>>>> Regards, >>>>>> Michael >>>>>> >>>>>> >>>>>> Am 18.03.21 um 08:14 schrieb Mark Webb-Johnson: >>>>>>> Tried to repeat this, but not having much success. Here is >>>>>>> my car module, with network still up: >>>>>>> >>>>>>> OVMS# boot status >>>>>>> Last boot was 262355 second(s) ago >>>>>>> >>>>>>> >>>>>>> I did manage to catch one network related crash after >>>>>>> repeatedly disconnecting and reconnecting the cellular >>>>>>> antenna. 
This was: >>>>>>> >>>>>>> I (3717989) cellular: PPP Connection disconnected >>>>>>> Guru Meditation Error: Core 1 panic'ed >>>>>>> (LoadProhibited). Exception was unhandled. >>>>>>> >>>>>>> 0x400fe082 is in >>>>>>> OvmsNetManager::PrioritiseAndIndicate() >>>>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:707). >>>>>>> >>>>>>> 707 if ((pri->name[0]==search[0])&& >>>>>>> >>>>>>> >>>>>>> 0x400ed360 is in >>>>>>> OvmsMetricString::SetValue(std::__cxx11::basic_string<char, >>>>>>> std::char_traits<char>, std::allocator<char> >) >>>>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_metrics.cpp:1358). >>>>>>> >>>>>>> 1357 void >>>>>>> OvmsMetricString::SetValue(std::string value) >>>>>>> >>>>>>> >>>>>>> 0x4008bdad is at >>>>>>> ../../../../.././newlib/libc/machine/xtensa/strcmp.S:586. >>>>>>> >>>>>>> >>>>>>> 0x4008bdd1 is at >>>>>>> ../../../../.././newlib/libc/machine/xtensa/strcmp.S:604. >>>>>>> >>>>>>> >>>>>>> 0x400fe886 is in >>>>>>> OvmsNetManager::ModemDown(std::__cxx11::basic_string<char, >>>>>>> std::char_traits<char>, std::allocator<char> >, void*) >>>>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_netmanager.cpp:522). >>>>>>> >>>>>>> 522 PrioritiseAndIndicate(); >>>>>>> >>>>>>> >>>>>>> 0x400fd752 is in std::_Function_handler<void >>>>>>> (std::__cxx11::basic_string<char, >>>>>>> std::char_traits<char>, std::allocator<char> >, >>>>>>> void*), std::_Bind<std::_Mem_fn<void >>>>>>> (OvmsNetManager::*)(std::__cxx11::basic_string<char, >>>>>>> std::char_traits<char>, std::allocator<char> >, >>>>>>> void*)> (OvmsNetManager*, std::_Placeholder<1>, >>>>>>> std::_Placeholder<2>)> >::_M_invoke(std::_Any_data >>>>>>> const&, std::__cxx11::basic_string<char, >>>>>>> std::char_traits<char>, std::allocator<char> >&&, >>>>>>> void*&&) >>>>>>> (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:600). 
>>>>>>> >>>>>>> 600 { return >>>>>>> (__object->*_M_pmf)(std::forward<_Args>(__args)...); } >>>>>>> >>>>>>> >>>>>>> 0x400f512e is in std::function<void >>>>>>> (std::__cxx11::basic_string<char, >>>>>>> std::char_traits<char>, std::allocator<char> >, >>>>>>> void*)>::operator()(std::__cxx11::basic_string<char, >>>>>>> std::char_traits<char>, std::allocator<char> >, void*) >>>>>>> const >>>>>>> (/home/openvehicles/build/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271). >>>>>>> >>>>>>> 2271 return _M_invoker(_M_functor, >>>>>>> std::forward<_ArgTypes>(__args)...); >>>>>>> >>>>>>> >>>>>>> 0x400f52f1 is in >>>>>>> OvmsEvents::HandleQueueSignalEvent(event_queue_t*) >>>>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:283). >>>>>>> >>>>>>> 283 >>>>>>> m_current_callback->m_callback(m_current_event, >>>>>>> msg->body.signal.data); >>>>>>> >>>>>>> >>>>>>> 0x400f53d8 is in OvmsEvents::EventTask() >>>>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:237). >>>>>>> >>>>>>> 237 HandleQueueSignalEvent(&msg); >>>>>>> >>>>>>> >>>>>>> 0x400f545d is in EventLaunchTask(void*) >>>>>>> (/home/openvehicles/build/Open-Vehicle-Monitoring-System-pre/vehicle/OVMS.V3/main/ovms_events.cpp:80). >>>>>>> >>>>>>> 80 me->EventTask(); >>>>>>> >>>>>>> >>>>>>> My for_v3.3 branch does include the preliminary changes to >>>>>>> support the wifi at 20MHz bandwidth, and perhaps those >>>>>>> could be affecting things. I do notice that if I ‘power >>>>>>> wifi off’, then ‘wifi mode client’, it can connect to the >>>>>>> station, but not get an IP address. I’ve just tried to >>>>>>> merge in the latest fixes to that, and rebuilt a release. >>>>>>> I will continue to test with that. >>>>>>> >>>>>>> Regards, Mark. 
>>>>>>> >>>>>>>> On 12 Mar 2021, at 10:32 PM, Michael Balzer >>>>>>>> <dexter@expeedo.de <mailto:dexter@expeedo.de>> wrote: >>>>>>>> >>>>>>>> Signed PGP part >>>>>>>> I just tried switching to for-v3.3 in my car module after >>>>>>>> tests on my desk module were OK, and I've run into the >>>>>>>> very same problem with for-v3.3. So the issue isn't >>>>>>>> related to esp-idf. >>>>>>>> >>>>>>>> The network only occasionally starts normally, but even >>>>>>>> then all connectivity is lost after a couple of minutes. >>>>>>>> >>>>>>>> The stale connection watchdog in server-v2 triggers a >>>>>>>> network restart, but that also doesn't seem to succeed: >>>>>>>> >>>>>>>> 2021-03-12 14:53:01.802 CET W (981652) ovms-server-v2: >>>>>>>> Detected stale connection (issue #241), restarting network >>>>>>>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: Restart >>>>>>>> 2021-03-12 14:53:01.802 CET I (981652) esp32wifi: >>>>>>>> Stopping WIFI station >>>>>>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:state: run -> >>>>>>>> init (0) >>>>>>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:pm stop, >>>>>>>> total sleep time: 831205045 us / 975329961 us >>>>>>>> >>>>>>>> 2021-03-12 14:53:01.812 CET I (981662) wifi:new:<1,0>, >>>>>>>> old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 >>>>>>>> 2021-03-12 14:53:01.832 CET I (981682) wifi:flush txq >>>>>>>> 2021-03-12 14:53:01.842 CET I (981692) wifi:stop sw txq >>>>>>>> 2021-03-12 14:53:01.842 CET I (981692) wifi:lmac stop hw txq >>>>>>>> 2021-03-12 14:53:01.852 CET I (981702) esp32wifi: >>>>>>>> Powering down WIFI driver >>>>>>>> 2021-03-12 14:53:01.852 CET I (981702) wifi:Deinit lldesc >>>>>>>> rx mblock:16 >>>>>>>> 2021-03-12 14:53:01.862 CET I (981712) esp32wifi: >>>>>>>> Powering up WIFI driver >>>>>>>> 2021-03-12 14:53:01.862 CET I (981712) wifi:nvs_log_init, >>>>>>>> erase log key successfully, reinit nvs log >>>>>>>> 2021-03-12 14:53:01.882 CET I (981732) wifi:wifi driver >>>>>>>> task: 3ffd4d84, prio:23, stack:3584, core=0 >>>>>>>> 
2021-03-12 14:53:01.882 CET I (981732) system_api: Base >>>>>>>> MAC address is not set, read default base MAC address >>>>>>>> from BLK0 of EFUSE >>>>>>>> 2021-03-12 14:53:01.882 CET I (981732) system_api: Base >>>>>>>> MAC address is not set, read default base MAC address >>>>>>>> from BLK0 of EFUSE >>>>>>>> 2021-03-12 14:53:01.902 CET I (981752) wifi:wifi firmware >>>>>>>> version: 30f9e79 >>>>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config NVS >>>>>>>> flash: enabled >>>>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:config nano >>>>>>>> formating: disabled >>>>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init data >>>>>>>> frame dynamic rx buffer num: 16 >>>>>>>> 2021-03-12 14:53:01.912 CET I (981762) wifi:Init >>>>>>>> management frame dynamic rx buffer num: 16 >>>>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init >>>>>>>> management short buffer num: 32 >>>>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic >>>>>>>> tx buffer num: 16 >>>>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static >>>>>>>> rx buffer size: 2212 >>>>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init static >>>>>>>> rx buffer num: 16 >>>>>>>> 2021-03-12 14:53:01.922 CET I (981772) wifi:Init dynamic >>>>>>>> rx buffer num: 16 >>>>>>>> 2021-03-12 14:53:02.642 CET I (982492) wifi:mode : sta >>>>>>>> (30:ae:a4:5f:e7:ec) + softAP (30:ae:a4:5f:e7:ed) >>>>>>>> 2021-03-12 14:53:02.652 CET I (982502) wifi:Total power >>>>>>>> save buffer number: 8 >>>>>>>> 2021-03-12 14:53:02.652 CET I (982502) >>>>>>>> cellular-modem-auto: Restart >>>>>>>> 2021-03-12 14:53:02.662 CET I (982512) cellular: State: >>>>>>>> Enter PowerOffOn state >>>>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting >>>>>>>> down (hard)... 
>>>>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: >>>>>>>> StatusCallBack: User Interrupt >>>>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP >>>>>>>> connection has been closed >>>>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: PPP is >>>>>>>> shutdown >>>>>>>> 2021-03-12 14:53:02.662 CET I (982512) gsm-ppp: Shutting >>>>>>>> down (hard)... >>>>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: >>>>>>>> StatusCallBack: User Interrupt >>>>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP >>>>>>>> connection has been closed >>>>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-ppp: PPP is >>>>>>>> shutdown >>>>>>>> 2021-03-12 14:53:02.672 CET I (982522) gsm-nmea: Shutdown >>>>>>>> (direct) >>>>>>>> 2021-03-12 14:53:02.672 CET I (982522) >>>>>>>> cellular-modem-auto: Power Cycle >>>>>>>> 2021-03-12 14:53:04.682 CET D (984532) events: >>>>>>>> Signal(system.wifi.down) >>>>>>>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI >>>>>>>> client stop >>>>>>>> 2021-03-12 14:53:04.682 CET E (984532) netmanager: >>>>>>>> Inconsistent state: no interface of type 'pp' found >>>>>>>> 2021-03-12 14:53:04.682 CET I (984532) netmanager: WIFI >>>>>>>> client down (with MODEM up): reconfigured for MODEM priority >>>>>>>> 2021-03-12 14:53:04.692 CET D (984542) events: >>>>>>>> Signal(system.event) >>>>>>>> 2021-03-12 14:53:04.692 CET D (984542) events: >>>>>>>> Signal(system.wifi.sta.disconnected) >>>>>>>> 2021-03-12 14:53:04.692 CET E (984542) netmanager: >>>>>>>> Inconsistent state: no interface of type 'pp' found >>>>>>>> 2021-03-12 14:53:04.692 CET I (984542) esp32wifi: STA >>>>>>>> disconnected with reason 8 = ASSOC_LEAVE >>>>>>>> 2021-03-12 14:53:04.702 CET D (984552) events: >>>>>>>> Signal(system.event) >>>>>>>> 2021-03-12 14:53:04.702 CET D (984552) events: >>>>>>>> Signal(system.wifi.sta.stop) >>>>>>>> 2021-03-12 14:53:04.702 CET E (984552) netmanager: >>>>>>>> Inconsistent state: no interface of type 'pp' found >>>>>>>> 2021-03-12 
14:53:04.712 CET D (984562) events: >>>>>>>> Signal(system.event) >>>>>>>> 2021-03-12 14:53:04.712 CET D (984562) events: >>>>>>>> Signal(system.wifi.ap.stop) >>>>>>>> 2021-03-12 14:53:04.712 CET E (984562) netmanager: >>>>>>>> Inconsistent state: no interface of type 'pp' found >>>>>>>> 2021-03-12 14:53:04.712 CET I (984562) netmanager: WIFI >>>>>>>> access point is down >>>>>>>> 2021-03-12 14:53:04.712 CET I (984562) esp32wifi: AP stopped >>>>>>>> 2021-03-12 14:53:04.722 CET D (984572) events: >>>>>>>> Signal(network.wifi.sta.bad) >>>>>>>> 2021-03-12 14:53:04.722 CET D (984572) events: >>>>>>>> Signal(system.event) >>>>>>>> 2021-03-12 14:53:04.722 CET D (984572) events: >>>>>>>> Signal(system.wifi.sta.start) >>>>>>>> 2021-03-12 14:53:04.732 CET D (984582) events: >>>>>>>> Signal(system.modem.down) >>>>>>>> 2021-03-12 14:53:04.742 CET I (984592) netmanager: MODEM >>>>>>>> down (with WIFI client down): network connectivity has >>>>>>>> been lost >>>>>>>> 2021-03-12 14:53:04.742 CET D (984592) events: >>>>>>>> Signal(system.modem.down) >>>>>>>> 2021-03-12 14:53:04.752 CET D (984602) events: >>>>>>>> Signal(system.event) >>>>>>>> 2021-03-12 14:53:04.752 CET D (984602) events: >>>>>>>> Signal(system.wifi.ap.start) >>>>>>>> 2021-03-12 14:53:04.752 CET I (984602) netmanager: WIFI >>>>>>>> access point is up >>>>>>>> 2021-03-12 14:53:26.802 CET E (1006652) events: >>>>>>>> SignalEvent: queue overflow (running >>>>>>>> system.wifi.ap.start->netmanager for 23 sec), event >>>>>>>> 'ticker.1' dropped >>>>>>>> 2021-03-12 14:53:27.802 CET E (1007652) events: >>>>>>>> SignalEvent: queue overflow (running >>>>>>>> system.wifi.ap.start->netmanager for 24 sec), event >>>>>>>> 'ticker.1' dropped >>>>>>>> 2021-03-12 14:53:28.802 CET E (1008652) events: >>>>>>>> SignalEvent: queue overflow (running >>>>>>>> system.wifi.ap.start->netmanager for 25 sec), event >>>>>>>> 'ticker.1' dropped >>>>>>>> …and so on until >>>>>>>> 2021-03-12 14:54:01.802 CET E (1041652) events: >>>>>>>> 
SignalEvent: lost important event => aborting >>>>>>>> >>>>>>>> >>>>>>>> I need my car now, so will switch back to master for now. >>>>>>>> >>>>>>>> Mark, if you've got specific debug logs I should fetch on >>>>>>>> the next try, tell me. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Michael >>>>>>>> >>>>>>>> >>>>>>>> Am 12.03.21 um 05:47 schrieb Craig Leres: >>>>>>>>> I just updated to 3.2.016-68-g8e10c6b7 and still get the >>>>>>>>> network hang immediately after booting and logging into >>>>>>>>> the web gui. >>>>>>>>> >>>>>>>>> But I see now my problem is likely that I'm not using >>>>>>>>> the right esp-idf (duh). Is there a way I can have >>>>>>>>> master build using ~/esp/esp-idf and have for-v3.3 use a >>>>>>>>> different path?) >>>>>>>>> >>>>>>>>> Craig >>>>>>>>> _______________________________________________ >>>>>>>>> OvmsDev mailing list >>>>>>>>> OvmsDev@lists.openvehicles.com >>>>>>>>> <mailto:OvmsDev@lists.openvehicles.com> >>>>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev >>>>>>>>> <http://lists.openvehicles.com/mailman/listinfo/ovmsdev> >>>>>>>> >>>>>>>> -- >>>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> OvmsDev mailing list >>>>>>> OvmsDev@lists.openvehicles.com >>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev >>>>>> >>>>>> -- >>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> OvmsDev mailing list >>>>> OvmsDev@lists.openvehicles.com >>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev >>>> >>>> -- >>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >>>> _______________________________________________ >>>> OvmsDev mailing list >>>> 
OvmsDev@lists.openvehicles.com >>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev >>> >>> _______________________________________________ >>> OvmsDev mailing list >>> OvmsDev@lists.openvehicles.com >>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev >> >> -- >> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal >> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 >> > > > _______________________________________________ > OvmsDev mailing list > OvmsDev@lists.openvehicles.com > http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
I've now also got results from testing different heap placements in the SPIRAM address space. I've done these tests on branch "master" with duktape running on core 1 (= no heap corruption crashes).

To shift the duktape heap location into the upper SPIRAM regions, I added this modification to DukTapeInit():

void *shiftmem = ExternalRamMalloc(2*1024*1024);
umm_memory = ExternalRamMalloc(memsize);
ESP_LOGI(TAG, "Duktape: shiftmem=%p umm_memory=%p", shiftmem, umm_memory);
if (shiftmem) free(shiftmem);

I've tried shifting into the upper 2 MB as shown here, and then shifting by just 1 MB to place the heap at the upper end of the lower 2 MB (so memory corruption by other processes is unlikely). I've then run the event test script 15 times until aborted by the error and collected the total loop counts.

Result: there is a strong correlation between the heap memory half used and the error frequency: with the heap in the upper half (shifted by 2 MB), the loop will run >10x longer on average than with the heap in the lower half. In absolute numbers: the average loop count with the heap in the lower 2 MB SPIRAM was ~ 3,000; with the heap shifted into the upper 2 MB it was ~ 40,000.

Within the lower half, there is no difference between shifting by 1 MB and no shifting, so it's most probably not caused by some other process corrupting the duktape heap region.

My conclusion would now be: this is still the SPIRAM bug, or a previously undetected variant thereof.

Can some others please try to reproduce my results?

Regards, Michael

Am 17.09.21 um 10:56 schrieb Michael Balzer:
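For anyone wanting to replicate the displacement trick off-target, the same mechanism can be sketched against plain malloc. Note that ExternalRamMalloc and the fixed 4 MB SPIRAM layout are OVMS/ESP32 specifics; a general-purpose allocator gives no placement guarantee, so this only illustrates the technique:

```c
#include <stdlib.h>

/* Allocate 'heapsize' bytes displaced past a temporary 'shift'-byte
 * reservation, then release the reservation so the low region is free
 * again for other users. Same shape as the DukTapeInit() modification. */
static void *alloc_shifted(size_t shift, size_t heapsize)
  {
  void *shiftmem = malloc(shift);   /* occupy the low region first */
  void *heap = malloc(heapsize);    /* intended to land above it */
  free(shiftmem);                   /* free(NULL) is a harmless no-op */
  return heap;
  }
```

With a simple bump-style region allocator (as on the SPIRAM heap), the second allocation reliably lands above the reservation; with glibc malloc the placement may differ, which is why the experiment has to run on the module itself to be conclusive.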
I just remembered…
https://github.com/espressif/esp-idf/issues/2892#issuecomment-459120605
…and as Duktape runs on core #1, I now test shifting the Duktape heap into the upper 2 MB of SPIRAM.
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
On 9/17/21 12:24 AM, Michael Balzer wrote:
I've been running the new for-v3.3 version this week on both of my modules without ppp issues.
Duktape still occasionally runs into the null/undefined issue with for…in:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
for…in normally doesn't throw an error even if you run over null or undefined.
I think both could still be the SPIRAM bug, now probably only occurring with very specific conditions. We build with LWIP using SPIRAM as well, so the PPP instance is allocated from SPIRAM also. Reallocating the instance on each new connect implies a higher chance of triggering the problem if it's address specific. The Duktape stack object addresses vary continuously with the running event handlers and user interactions, so that also has a high chance of occasionally triggering an address specific bug.
We need to test the revision 3 ESP32 on this.
What can I do to test this on my frankenstein v3 module? Craig
Craig,

Am 18.09.21 um 00:14 schrieb Craig Leres:
On 9/17/21 12:24 AM, Michael Balzer wrote:
Duktape still occasionally runs into the null/undefined issue with for…in: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
We need to test the revision 3 ESP32 on this.
What can I do to test this on my frankenstein v3 module?
Here's a simple test that reliably reproduces the effect on my modules:

1. Setup handler loop:
script eval 'PubSub.subscribe("usr.testev", function(ev) { var ms=Number(ev.substr(11))||10; OvmsEvents.Raise("usr.testev."+ms, ms); })'

2. Start event loop with 10 ms interval:
event raise usr.testev.10

The handler will keep raising the same event with the given interval. You can monitor the events by setting log level debug on events or using the event tracing (event trace on/off); be aware 10 ms will flood your terminal/shell. Check the Duktape CPU usage to see if the loop is still running without enabling the event log.

The loop _should_ normally run indefinitely. To stop it manually, you would do:
script eval 'PubSub.unsubscribe("usr.testev")'

In reality, it will abort after some random run time with a log entry like this:

E (1803740) ovms-duktape: [int/PubSub.js:1] TypeError: not object coercible| at [anon] (duk_api_stack.c:3661) internal| at hasKeys (int/PubSub.js:1) strict| at messageHasSubscribers (int/PubSub.js:1) strict| at publish (int/PubSub.js:1) strict| at [anon] (int/PubSub.js:1) strict preventsyield

…or (newly discovered today) with a heap corruption crash, which may have another cause though (investigating).

If it stops with the "TypeError", the loop should be restartable simply by repeating step 2. That proves the PubSub handler setup is still intact.

The time it needs to run into the TypeError varies widely and has some dependence on how many other handlers are registered. With no other javascript handlers registered, it normally takes 15-30 minutes on my modules to run into the error. With the AuxBatMon, PwrMon & Edimax plugins enabled, it will normally happen within 5 minutes, sometimes within a few seconds.

The interval is basically irrelevant, the effect just comes faster with a short interval.

Regards, Michael

PS: btw, the effect occurs with Duktape running in the upper 2 MB as well. That doesn't prove it's not the SPIRAM bug though, it just may have other trigger conditions.

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
On 9/18/21 8:28 AM, Michael Balzer wrote:
Craig,
Am 18.09.21 um 00:14 schrieb Craig Leres:
On 9/17/21 12:24 AM, Michael Balzer wrote:
Duktape still occasionally runs into the null/undefined issue with for…in: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
We need to test the revision 3 ESP32 on this.
What can I do to test this on my frankenstein v3 module?
Here's a simple test that reliably reproduces the effect on my modules:
Running 3.2.016-394-ge7d9e1c1/ota_0/main, I ran your suggested script/event commands. Within ~30 seconds I tried "event trace on". But since I was on the serial console I next decided to type "event trace off" and try that in an ssh session, but the module crashed with your heap crash before I could stop the output. Is it a problem if I cause the module to generate way more serial output than 115200 baud can handle?

(Before I rebooted I saw that the module was again unhappy with the cell modem, but I guess one problem at a time...)

I ran your test commands a second time. After 3-4 minutes I tried ssh'ing in, which seemed to cause another CORRUPT HEAP crash. (I didn't collect any info from that crash.)

The third time I let it run for 3 hours. Then I turned trace on and back off to verify the test was still running. (Cellular registration stayed up for 3 hours as well.) I will check again in the morning.

Craig

========================================
[first crash]
I (60262) events: Signal(usr.testev.10)
I (60272) events: Signal(usr.testev.10)
I (60282) events: SignCORRUPT HEAP: Bad head at 0x3f82d0c4.
Expected 0xabba1234 got 0x3f82d148 abort() was called at PC 0x400844c3 on core 0 ELF file SHA256: 7c10178c2874419f Backtrace: 0x40089a2f:0x3ffcbc20 0x40089cc9:0x3ffcbc40 0x400844c3:0x3ffcbc60 0x400845dd:0x3ffcbca0 0x4011a4a3:0x3ffcbcc0 0x4010f439:0x3ffcbf80 0x4010ef89:0x3ffcbfd0 0x4008e903:0x3ffcc000 0x400840b1:0x3ffcc020 0x40084671:0x3ffcc040 0x4000bec7:0x3ffcc060 0x401b15ad:0x3ffcc080 0x400f7081:0x3ffcc0a0 0x4008d3af:0x3ffcc0c0 ======================================== ice 114 % ./backtrace.sh 0x40089a2f:0x3ffcbc20 0x40089cc9:0x3ffcbc40 0x400844c3:0x3ffcbc60 0x400845dd:0x3ffcbca0 0x4011a4a3:0x3ffcbcc0 0x4010f439:0x3ffcbf80 0x4010ef89:0x3ffcbfd0 0x4008e903:0x3ffcc000 0x400840b1:0x3ffcc020 0x40084671:0x3ffcc040 0x4000bec7:0x3ffcc060 0x401b15ad:0x3ffcc080 0x400f7081:0x3ffcc0a0 0x4008d3af:0x3ffcc0c0 + xtensa-esp32-elf-addr2line -e build/ovms3.elf 0x40089a2f:0x3ffcbc20 0x40089cc9:0x3ffcbc40 0x400844c3:0x3ffcbc60 0x400845dd:0x3ffcbca0 0x4011a4a3:0x3ffcbcc0 0x4010f439:0x3ffcbf80 0x4010ef89:0x3ffcbfd0 0x4008e903:0x3ffcc000 0x400840b1:0x3ffcc020 0x40084671:0x3ffcc040 0x4000bec7:0x3ffcc060 0x401b15ad:0x3ffcc080 0x400f7081:0x3ffcc0a0 0x4008d3af:0x3ffcc0c0 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/esp32/panic.c:736 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/esp32/panic.c:736 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/locks.c:143 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/locks.c:171 /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/vfprintf.c:1699 (discriminator 8) /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/strtod.c:428 /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/string/../../../.././newlib/libc/string/strerror.c:591 
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/heap/multi_heap_poisoning.c:350 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/heap/heap_caps.c:403 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/syscalls.c:42 ??:0 /wrkdirs/usr/ports/devel/xtensa-esp32-elf/work/crosstool-NG-1.22.0-97-gc752ad5/.build/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/libstdc++-v3/include/bits/locale_facets_nonio.tcc:1284 main/ovms_http.cpp:84 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/freertos/timers.c:485 ======================================== ice 117 % diff sdkconfig support/sdkconfig.default.hw31 | fgrep -v CONFIG_OVMS_VEHICLE 209c209 < CONFIG_SPIRAM_CACHE_WORKAROUND= ---
CONFIG_SPIRAM_CACHE_WORKAROUND=y 665,680c665,681
683,686c684,687 ---
Craig, Am 19.09.21 um 06:45 schrieb Craig Leres:
Running 3.2.016-394-ge7d9e1c1/ota_0/main, I ran your suggested script/event commands. Within ~30 seconds I tried "event trace on". But since I was on the serial console I next decided to type "event trace off" and try that on a ssh session but the module crashed with your heap crash before I could stop the output. Is it a problem if I cause the module to generate way more serial output than 115200 baud can handle?
(Before I rebooted I saw that the module was again unhappy with the cell modem but I guess one problem at a time...)
I ran your test commands a second time. After 3-4 minutes I tried ssh'ing in, which seemed to cause another CORRUPT HEAP crash. (I didn't collect any info from that crash.)
The corrupt heap crashes might be something else; up to now I only saw them with the for-v3.3 branch. I'm currently trying to reproduce one of these on "master". Unfortunately gdb cannot handle them, the call stack also seems to be corrupted.

Here's a new version of the script that logs progress so you don't need to enable event logging:

// Setup:
script eval 'testcnt=0; PubSub.subscribe("usr.testev", function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % (3*1000/ms) == 0) print(ev + ": " + testcnt); OvmsEvents.Raise("usr.testev."+ms, ms); })'

// Start with 10 ms interval:
script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'

// Check status:
script eval 'print("testcnt: " + testcnt + "\n")'

// Stop:
script eval 'PubSub.unsubscribe("usr.testev")'

This version will log a loop counter every 3 seconds, and you can query the loop counter with the check command. Example:

…
I (2369414) script: [eval:1:] usr.testev.10: 37500
I (2373244) script: [eval:1:] usr.testev.10: 37800
I (2376884) script: [eval:1:] usr.testev.10: 38100
E (2377694) script: [int/PubSub.js:1] TypeError: not object coercible| at [anon] (duk_api_stack.c:3661) internal| at hasKeys (int/PubSub.js:1) strict| at messageHasSubscribers (int/PubSub.js:1) strict| at publish (int/PubSub.js:1) strict| at [anon] (int/PubSub.js:1) strict preventsyield
OVMS# script eval 'print("testcnt: " + testcnt + "\n")'
testcnt: 38155

Please try this with "master" as well to see if you also only get the heap corruption with for-v3.3.

Regards, Michael

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Everyone,

I can now reproduce the corrupted heap crashes on branch "master" as well, and would like to know if anyone else can reproduce them.

To replicate, change the script component to launch the duktape task on core 0, then start the below test script (from the "Branch for-v3.3 network issues" thread). The heap corruption will occur within a few seconds. It will happen at different places, wherever a free() call is made.

With duktape (and the script) running on core 1, no heap corruption occurs with branch "master". It's then only present with branch "for-v3.3", and it needs much longer script runtime to occur.

I have no idea yet how this can depend on the core we're running on. I'm open to suggestions on what to try to narrow this down. I've disabled all auto start components and file logging, i.e. the module runs without a vehicle and without networking. Next would be to exclude components from the build.

Regards, Michael

Am 19.09.21 um 09:53 schrieb Michael Balzer:
The corrupt heap crashes might be something else, up to now I only saw them with the for-v3.3 branch. I'm currently trying to reproduce one of these on "master". Unfortunately gdb cannot handle them, the call stack also seems to be corrupted.
Here's a new version of the script that logs progress so you don't need to enable event logging:
// Setup: script eval 'testcnt=0; PubSub.subscribe("usr.testev", function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % (3*1000/ms) == 0) print(ev + ": " + testcnt); OvmsEvents.Raise("usr.testev."+ms, ms); })'
// Start with 10 ms interval: script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'
// Check status: script eval 'print("testcnt: " + testcnt + "\n")'
// Stop: script eval 'PubSub.unsubscribe("usr.testev")'
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
On 9/19/21 12:53 AM, Michael Balzer wrote:
The corrupt heap crashes might be something else, up to now I only saw them with the for-v3.3 branch. I'm currently trying to reproduce one of these on "master". Unfortunately gdb cannot handle them, the call stack also seems to be corrupted.
My overnight run with the for-v3.3 branch ran for about 6 hours and then hit another CORRUPT HEAP.
Here's a new version of the script that logs progress so you don't need to enable event logging:
// Setup: script eval 'testcnt=0; PubSub.subscribe("usr.testev", function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % (3*1000/ms) == 0) print(ev + ": " + testcnt); OvmsEvents.Raise("usr.testev."+ms, ms); })'
// Start with 10 ms interval: script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'
// Check status: script eval 'print("testcnt: " + testcnt + "\n")'
// Stop: script eval 'PubSub.unsubscribe("usr.testev")'
This version will log a loop counter every 3 seconds, and you can query the loop counter by the check command.
Example:
… I (2369414) script: [eval:1:] usr.testev.10: 37500 I (2373244) script: [eval:1:] usr.testev.10: 37800 I (2376884) script: [eval:1:] usr.testev.10: 38100 E (2377694) script: [int/PubSub.js:1] TypeError: not object coercible| at [anon] (duk_api_stack.c:3661) internal| at hasKeys (int/PubSub.js:1) strict| at messageHasSubscribers (int/PubSub.js:1) strict| at publish (int/PubSub.js:1) strict| at [anon] (int/PubSub.js:1) strict preventsyield OVMS# script eval 'print("testcnt: " + testcnt + "\n")' testcnt: 38155
Please try this with "master" as well to see if you also only get the heap corruption with for-v3.3.
I've been running your new test script on for-v3.3 and main for 1 hour (both v3 boxes), no issues so far. Craig
3.2.016-292-g61cde63a/ota_0/main ran for almost 22 hours and then locked up. When I plugged a cable in it was completely dead and I had to cycle power to get it back. I suspect the only way I'll get anything interesting is to leave a laptop plugged into the serial console. But I'll bet it was the heap issue... (I don't see any point in my running tests on the for-v3.3 branch since I don't understand the heap issue enough to be helpful.) Craig
Craig, Am 20.09.21 um 22:08 schrieb Craig Leres:
3.2.016-292-g61cde63a/ota_0/main ran for almost 22 hours and then locked up. When I plugged a cable in it was completely dead and I had to cycle power to get it back. I suspect the only way I'll get anything interesting is to leave a laptop plugged into the serial console. But I'll bet it was the heap issue...
To clarify: the script produced the checkpoint log outputs until just before the module stopped working, and you don't see any "TypeError" in your log files? Regards, Michael -- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
On 9/20/21 1:47 PM, Michael Balzer wrote:
Craig,
Am 20.09.21 um 22:08 schrieb Craig Leres:
3.2.016-292-g61cde63a/ota_0/main ran for almost 22 hours and then locked up. When I plugged a cable in it was completely dead and I had to cycle power to get it back. I suspect the only way I'll get anything interesting is to leave a laptop plugged into the serial console. But I'll bet it was the heap issue...
To clarify: the script produced the checkpoint log outputs until just before the module stopped working, and you don't see any "TypeError" in your log files?
I think I'm missing config to capture script logs? I used a script to download https://api.openvehicles.com:6869/api/historical/????/*-OVM-ServerLogs and don't see it there. I also do some logging to the sd card, but that log seems to mostly have can bus watchdogs and other errors.

I'm not sure what logging to turn on (or how to do it when I'm at work and can't easily get to the web gui on the box).

Craig

OVMS# log status
Log listeners      : 4
File logging status: active
Log file path      : /sd/logs/ovms-z.log
Current size       : 182.2 kB
Cycle size         : 1024 kB
Cycle count        : 0
Dropped messages   : 0
Messages logged    : 45
Total fsync time   : 0.1 s
Craig, Am 20.09.21 um 23:00 schrieb Craig Leres:
On 9/20/21 1:47 PM, Michael Balzer wrote:
Craig,
Am 20.09.21 um 22:08 schrieb Craig Leres:
3.2.016-292-g61cde63a/ota_0/main ran for almost 22 hours and then locked up. When I plugged a cable in it was completely dead and I had to cycle power to get it back. I suspect the only way I'll get anything interesting is to leave a laptop plugged into the serial console. But I'll bet it was the heap issue...
To clarify: the script produced the checkpoint log outputs until just before the module stopped working, and you don't see any "TypeError" in your log files?
I think I'm missing config to capture script logs? I used a script to download https://api.openvehicles.com:6869/api/historical/????/*-OVM-ServerLogs and don't see it there. I also do some logging to the sd card but that log seems to mostly have can bus watchdogs and other errors.
I'm not sure what logging to turn on (or how to do it when I'm at work and can't easily get to the web gui on the box).
Craig
OVMS# log status Log listeners : 4 File logging status: active Log file path : /sd/logs/ovms-z.log Current size : 182.2 kB Cycle size : 1024 kB Cycle count : 0 Dropped messages : 0 Messages logged : 45 Total fsync time : 0.1 s
That's file logging enabled, with logs stored on your SD card in the "logs" directory. If your config is unchanged, you should find some archived log files there. Configuration is btw explained here: https://docs.openvehicles.com/en/latest/userguide/logging.html#logging-to-sd...

You cannot download these files via the server. The "*-OVM-ServerLog" is just the protocol of the server communication as seen by the server.

To download the module log files from /sd/logs, you can either use the web UI, or you can use ssh/scp.

Regards, Michael

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
On 9/20/21 2:12 PM, Michael Balzer wrote:
OVMS# log status Log listeners : 4 File logging status: active Log file path : /sd/logs/ovms-z.log Current size : 182.2 kB Cycle size : 1024 kB Cycle count : 0 Dropped messages : 0 Messages logged : 45 Total fsync time : 0.1 s
That's file logging enabled, and logs stored on your SD card in the "logs" directory. If your config is unchanged, you should find some archived log files there. Configuration is btw explained here: https://docs.openvehicles.com/en/latest/userguide/logging.html#logging-to-sd...
You cannot download these files via the server. The "*-OVM-ServerLog" is just the protocol of the server communication as seen by the server.
Got it.
To download the module log files from /sd/logs, you can either use the web UI, or you can use ssh/scp.
I used scp to get the currently active log (/sd/logs/ovms-z.log) but it doesn't contain any testev logs. I believe it covers the test I did overnight (the timestamps are UTC but it starts on 2021-09-18 20:16 and goes until 2021-09-20 13:14). I guess I'm expecting to have to add some config to enable logging the test events to the sd log. Craig
Am 20.09.21 um 23:18 schrieb Craig Leres:
I used scp to get the currently active log (/sd/logs/ovms-z.log) but it doesn't contain any testev logs. I believe it covers the test I did overnight (the timestamps are UTC but it starts on 2021-09-18 20:16 and goes until 2021-09-20 13:14).
I guess I'm expecting to have to add some config to enable logging the test events to the sd log.
Script outputs are logged at "info" level, which is the default. Unless you've set the level to "warn" or "error" for the "script" component, you should have them. You should see a line like "script: [eval:1:] usr.testev.10: <loopcnt>" every three seconds. When starting the event loop from ssh, you can see these by doing "log monitor yes". Regards, Michael -- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Craig, Am 20.09.21 um 22:47 schrieb Michael Balzer:
To clarify: the script produced the checkpoint log outputs until just before the module stopped working, and you don't see any "TypeError" in your log files?
any update on this yet? Thanks, Michael -- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
On 9/24/21 6:01 AM, Michael Balzer wrote:
Am 20.09.21 um 22:47 schrieb Michael Balzer:
To clarify: the script produced the checkpoint log outputs until just before the module stopped working, and you don't see any "TypeError" in your log files?
any update on this yet?
Sorry, I was distracted during the week with work. I just started your test on my sim7600 unit (3.2.016-394-ge7d9e1c1), capturing the serial console, and on one of my cars (3.2.016-292-g61cde63a), capturing the serial console with a laptop sitting in the back seat.
Script outputs are logged at "info" level, which is the default. Unless you've set the level to "warn" or "error" for the "script" component, you should have them.
You should see a line like "script: [eval:1:] usr.testev.10: <loopcnt>" every three seconds.
I confirmed that the dev module has default logging set to info. The production/car module is set to warning so I added a component level to set script to info. And I now see your 3 second test logs on both. I'll report back when I notice something has happened. Craig
First crash on the production module; looks like it only ran for two minutes? It does appear to be event related. Restarted.

Meanwhile, the sim7600 unit has run for more than 30 minutes.

Craig

===========================================
I (705010916) script: [eval:1:] usr.testev.10: 13200
I (705015276) script: [eval:1:] usr.testev.10: 13500
I (705020026) script: [eval:1:] usr.testev.10: 13800
W (705020226) websocket: WebSocketHandler[0x3f8b5c18]: job queue overflow detected
[...]
W (705020896) websocket: WebSocketHandler[0x3f8b5c18]: job queue overflow resolved, 2 dropsGuru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x4010aee8 PS : 0x00060630 A0 : 0x8010c550 A1 : 0x3ffc1220 A2 : 0x00000000 A3 : 0x00000080 A4 : 0x00000082 A5 : 0x00000047 A6 : 0x00ff0000 A7 : 0xff000000 A8 : 0x5678002b A9 : 0x3ffc11c0 A10 : 0x00000000 A11 : 0x00000000 A12 : 0x3ffafe24 A13 : 0x00000000 A14 : 0x00000001 A15 : 0x00000000 SAR : 0x00000008 EXCCAUSE: 0x0000001c EXCVADDR: 0x00000060 LBEG : 0x400014fd LEND : 0x4000150d LCOUNT : 0xfffffff7
ELF file SHA256: 9fe3cd9b4b911bd8
Backtrace: 0x4010aee8:0x3ffc1220 0x4010c54d:0x3ffc1250 0x401389f3:0x3ffc1290 0x40136cfa:0x3ffc12c0 0x400f5f31:0x3ffc1300 0x400f60e5:0x3ffc1330 0x400f61bc:0x3ffc1390 0x400f6239:0x3ffc13d0
Rebooting...
ets Jul 29 2019 12:21:46 =========================================== ice 167 % ./backtrace.sh 0x4010aee8:0x3ffc1220 0x4010c54d:0x3ffc1250 0x401389f3:0x3ffc1290 0x40136cfa:0x3ffc12c0 0x400f5f31:0x3ffc1300 0x400f60e5:0x3ffc1330 0x400f61bc:0x3ffc1390 0x400f6239:0x3ffc13d0 + xtensa-esp32-elf-addr2line -e build/ovms3.elf 0x4010aee8:0x3ffc1220 0x4010c54d:0x3ffc1250 0x401389f3:0x3ffc1290 0x40136cfa:0x3ffc12c0 0x400f5f31:0x3ffc1300 0x400f60e5:0x3ffc1330 0x400f61bc:0x3ffc1390 0x400f6239:0x3ffc13d0 components/mongoose/mongoose/mongoose.c:1701 components/mongoose/mongoose/mongoose.c:1701 components/ovms_server_v3/src/ovms_server_v3.cpp:900 /usr/local/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/bits/basic_string.tcc:405 /usr/local/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271 main/ovms_events.cpp:268 (discriminator 2) main/ovms_events.cpp:222 main/ovms_events.cpp:80
Craig, Am 25.09.21 um 23:23 schrieb Craig Leres:
First crash on the production module, looks like it only ran for two minutes? Does appear to be event related.
Restarted. Meanwhile, the sim7600 guy has run for me than 30 minutes.
components/mongoose/mongoose/mongoose.c:1701 components/mongoose/mongoose/mongoose.c:1701 components/ovms_server_v3/src/ovms_server_v3.cpp:900 /usr/local/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/bits/basic_string.tcc:405
/usr/local/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/5.2.0/functional:2271
main/ovms_events.cpp:268 (discriminator 2) main/ovms_events.cpp:222 main/ovms_events.cpp:80
That's the V3 MQTT code subscribing to the topics after the connection has been established. The crash seems to have been in the mongoose strdup() implementation, which again may point to some memory issue, but not of the type I observed (i.e. no heap corruption involved directly). If that repeats, I suggest trying with the v3 server disabled.

Regards, Michael

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Second crash on the production module, CORRUPT HEAP after ~350 minutes.

The sim7600 module is still running after 19 hours. Looks like the cell connection stayed up for 11 hours before going into "User Interrupt" mode.

Craig

===========================================
I (25449286) script: [eval:1:] usr.testev.10: 2094300
I (25452446) script: [eval:1:] usr.testev.10: 2094600
I (25455606) script: [eval:1:] usr.testev.10: 2094900
OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 got 0x3f8adcc0
abort() was called at PC 0x400844c3 on core 0
ELF file SHA256: 9fe3cd9b4b911bd8
Backtrace: 0x40089a2f:0x3ffcbb10 0x40089cc9:0x3ffcbb30 0x400844c3:0x3ffcbb50 0x400845dd:0x3ffcbb90 0x401185ab:0x3ffcbbb0 0x4010d979:0x3ffcbe70 0x4010d4c9:0x3ffcbec0 0x4008e903:0x3ffcbef0 0x400840b1:0x3ffcbf10 0x40084671:0x3ffcbf30 0x4000bec7:0x3ffcbf50 0x401aa715:0x3ffcbf70 0x400f5a51:0x3ffcbf90 0x4008d3af:0x3ffcbfb0
Rebooting... ets Jul 29 2019 12:21:46
===========================================
ice 182 % ./backtrace.sh 0x40089a2f:0x3ffcbb10 0x40089cc9:0x3ffcbb30 0x400844c3:0x3ffcbb50 0x400845dd:0x3ffcbb90 0x401185ab:0x3ffcbbb0 0x4010d979:0x3ffcbe70 0x4010d4c9:0x3ffcbec0 0x4008e903:0x3ffcbef0 0x400840b1:0x3ffcbf10 0x40084671:0x3ffcbf30 0x4000bec7:0x3ffcbf50 0x401aa715:0x3ffcbf70 0x400f5a51:0x3ffcbf90 0x4008d3af:0x3ffcbfb0
+ xtensa-esp32-elf-addr2line -e build/ovms3.elf 0x40089a2f:0x3ffcbb10 0x40089cc9:0x3ffcbb30 0x400844c3:0x3ffcbb50 0x400845dd:0x3ffcbb90 0x401185ab:0x3ffcbbb0 0x4010d979:0x3ffcbe70 0x4010d4c9:0x3ffcbec0 0x4008e903:0x3ffcbef0 0x400840b1:0x3ffcbf10 0x40084671:0x3ffcbf30 0x4000bec7:0x3ffcbf50 0x401aa715:0x3ffcbf70 0x400f5a51:0x3ffcbf90 0x4008d3af:0x3ffcbfb0
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/esp32/panic.c:736
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/esp32/panic.c:736
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/locks.c:143
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/locks.c:171
/Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/vfprintf.c:860 (discriminator 2) /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/fiprintf.c:50 /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c:59 (discriminator 8) /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/heap/multi_heap_poisoning.c:350 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/heap/heap_caps.c:403 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/syscalls.c:42 ??:0 /wrkdirs/usr/ports/devel/xtensa-esp32-elf/work/crosstool-NG-1.22.0-97-gc752ad5/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/del_op.cc:46 main/ovms_events.cpp:393 /home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/freertos/timers.c:485
Craig,

just to be sure: you refitted both your modules with an ESP32 rev 3, and the only difference is that the production module is running "master" and the sim7600 module "for-v3.3"?

If so, my conclusions so far would be:

a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.

b) more important: only the ESP32 rev 3 really solves the SPIRAM bug. It's not completely solved by the workaround; the workaround just reduces the frequency.

Thanks, Michael

Am 26.09.21 um 18:46 schrieb Craig Leres:
Second crash on the production module, CORRUPT HEAP after ~350 minutes.
sim7600 module still running after 19 hours. Looks like the cell connect stayed up for 11 hours before going into "User Interrupt" mode.
Craig
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
On 9/26/21 10:28 AM, Michael Balzer wrote:
just to be sure: you refitted both your modules with an ESP32 rev 3, only difference is, the production module is running "master", sim7600 module "for-v3.3"?
That's right:

OVMS# mod sum
OVMS MODULE SUMMARY
Module Version: 3.2.016-292-g61cde63a/ota_1/main (build idf v3.3.4-848-g1ff5e24b1 Sep 17 2021 09:44:13)
Hardware: OVMS WIFI BLE BT cores=2 rev=ESP32/3
If so, my conclusions so far would be:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
That's what it looks like to me. I think my sim7600 module hits the bug less often because it's not doing anything (especially when the modem goes offline). The production module is posting v2/v3 data giving it more chances to hit the bug.
b) more important: only ESP32 rev 3 really solves the SPIRAM bug. It's not completely solved by the workaround, the workaround just reduces the frequency.
Is it possible the spiram workarounds are more complete in the newest version of the idf? Does the newer compiler toolchain it uses factor in at all? Craig
Am 26.09.21 um 19:28 schrieb Michael Balzer:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
I think I've found a multicore race condition in FreeRTOS.

For the scheduled events, I keep a list of timers. Walking through the list, I call xTimerIsTimerActive() to check if a timer is available. When a timer is due, prvProcessExpiredTimer() takes care of calling the callback.

Here's the issue: before it does that, it removes the timer from the active list. xTimerIsTimerActive() checks if the timer is in the "active" list of the timer service. So xTimerIsTimerActive() will return false on a timer before the callback has actually begun.

The current FreeRTOS version solves this by introducing a timer status flag instead of checking the list association, but esp-idf 3.3 includes the outdated version 8.2.0.

I'm currently testing a workaround for this in the events system (introducing my own status map), which seems to fix the issue -- 180,000 events & counting. I've used xTimerIsTimerActive() in the websocket handler and in the OvmsTimer class as well, so I need to check if these need similar workarounds.

Regards, Michael

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Followup to… Am 26.09.21 um 19:28 schrieb Michael Balzer:
If so, my conclusions so far would be:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
Am 26.09.21 um 18:46 schrieb Craig Leres:
Second crash on the production module, CORRUPT HEAP after ~350 minutes.
OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 got 0x3f8adcc0 abort() was called at PC 0x400844c3 on core 0
I've found the heap corruption source and a new bug class:

The corruption was caused by a duplicate free() (here via delete), which should have been impossible: it was the free() call for the event message of a delayed event delivery. There is exactly one producer (OvmsEvents::ScheduleEvent) and exactly one consumer (OvmsEvents::SignalScheduledEvent), which is called exactly once -- when the single shot timer expires.

In theory. _In reality, the timer callback occasionally gets executed twice._ To exclude every possible race condition, I enclosed both producer & consumer in a semaphore lock. I then changed the code to clear the timer payload as soon as it's read, and added a test for a NULL payload -- and voila, the timer callback now occasionally gets called with a NULL payload, which is also impossible, as the allocation result is checked in the producer.

I've had no luck reproducing this in a reduced test project, even with multiple auto reload timers and distribution over both cores, but I still see no other explanation than a bug in the FreeRTOS timer service (or the Espressif FreeRTOS multicore adaptation). Ah, and yes, this occurs on the ESP32/r3 as well.

You should be able to reproduce the effect using the same event test loop as for the Duktape TypeError issue: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...

My workaround prevents crashes and outputs a log entry when the NULL payload is detected.
Example log excerpt:

  script eval 'testcnt=0; PubSub.subscribe("usr.testev", function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % (3*1000/ms) == 0) print(ev + ": " + testcnt); OvmsEvents.Raise("usr.testev."+ms, ms); })'
  script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'
  I (13493019) events: ScheduleEvent: creating new timer
  W (13495919) events: SignalScheduledEvent: duplicate callback invocation detected
  I (13497029) ovms-duk-util: [eval:1:] usr.testev.10: 300
  I (13501109) housekeeping: 2021-12-30 18:12:35 CET (RAM: 8b=64448-67004 32b=6472)
  --
  I (13521779) ovms-duk-util: [eval:1:] usr.testev.10: 2100
  I (13525809) ovms-duk-util: [eval:1:] usr.testev.10: 2400
  W (13527579) events: SignalScheduledEvent: duplicate callback invocation detected
  W (13527629) events: SignalScheduledEvent: duplicate callback invocation detected
  I (13529839) ovms-duk-util: [eval:1:] usr.testev.10: 2700
  I (13533329) ovms-server-v2: Incoming Msg: MP-0 AFA
  --
  I (13579149) ovms-duk-util: [eval:1:] usr.testev.10: 6300
  I (13583319) ovms-duk-util: [eval:1:] usr.testev.10: 6600
  W (13584679) events: SignalScheduledEvent: duplicate callback invocation detected
  I (13587439) ovms-duk-util: [eval:1:] usr.testev.10: 6900
  I (13591589) ovms-duk-util: [eval:1:] usr.testev.10: 7200
  --
  I (13714299) ovms-duk-util: [eval:1:] usr.testev.10: 16200
  I (13718339) ovms-duk-util: [eval:1:] usr.testev.10: 16500
  W (13718719) events: SignalScheduledEvent: duplicate callback invocation detected
  I (13722459) ovms-duk-util: [eval:1:] usr.testev.10: 16800
  I (13726509) ovms-duk-util: [eval:1:] usr.testev.10: 17100
  --
  I (13743149) ovms-duk-util: [eval:1:] usr.testev.10: 18300
  I (13747129) ovms-duk-util: [eval:1:] usr.testev.10: 18600
  W (13748979) events: SignalScheduledEvent: duplicate callback invocation detected
  I (13751299) ovms-duk-util: [eval:1:] usr.testev.10: 18900
  I (13755349) ovms-duk-util: [eval:1:] usr.testev.10: 19200
  --
  I (13784029) ovms-duk-util: [eval:1:] usr.testev.10: 21300
  I (13788059) ovms-duk-util: [eval:1:] usr.testev.10: 21600
  W (13791409) events: SignalScheduledEvent: duplicate callback invocation detected
  I (13792179) ovms-duk-util: [eval:1:] usr.testev.10: 21900
  I (13796239) ovms-duk-util: [eval:1:] usr.testev.10: 22200
  …

The bug frequency differs from boot to boot, but as you can see can be very high. I've had runs with ~ 1 occurrence every 300.000 callbacks, and runs like the above with ~ 1 every 3.000 callbacks.

If this is a common effect with timer callbacks, that may cause some of the remaining issues. It's possible this only happens with single shot timers, haven't checked our periodic timers yet.

Any additional input on this is welcome.

Regards, Michael

-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Everyone,

I've managed to reproduce the effect on a standard ESP32 development board with a simple test project only involving timers, some CPU/memory load & wifi networking. I've tested both the standard esp-idf release 3.3 and the latest esp-idf release 4.4 (using gcc 8.4) for this, and the effect is still present.

→ Bug report: https://github.com/espressif/esp-idf/issues/8234

Attached are my test projects if you'd like to reproduce this or improve the test.

I haven't tested periodic timer callbacks yet for the effect. These are normally designed to run periodically, but if the timing is important (e.g. on some CAN transmission?), this could cause erroneous behaviour as well.

Regards, Michael

Am 30.12.21 um 19:25 schrieb Michael Balzer:
Followup to…
Am 26.09.21 um 19:28 schrieb Michael Balzer:
If so, my conclusions so far would be:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
Am 26.09.21 um 18:46 schrieb Craig Leres:
Second crash on the production module, CORRUPT HEAP after ~350 minutes.
OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 got 0x3f8adcc0 abort() was called at PC 0x400844c3 on core 0
I've found the heap corruption source and a new bug class:
The corruption was caused by a duplicate free() (here via delete), which was basically impossible: it was the free() call for the event message for a delayed event delivery. There is exactly one producer (OvmsEvents::ScheduleEvent) and exactly one consumer (OvmsEvents::SignalScheduledEvent), which is called exactly once -- when the single shot timer expires.
In theory. _In reality, the timer callback occasionally gets executed twice_. To exclude every possible race condition, I enclosed both producer & consumer into a semaphore lock. I then changed the code in order to clear the timer payload as soon as it's read, and added a test for a NULL payload -- and voila, the timer callback now gets occasionally called with a NULL payload, which is also impossible as the allocation result is checked in the producer.
I've had no luck reproducing this in a reduced test project, even with multiple auto reload timers and distribution over both cores, but I still see no other explanation than a bug in the FreeRTOS timer service (or the Espressif FreeRTOS multi core adaption). Ah, and yes, this occurs on the ESP32/r3 as well.
You should be able to reproduce the effect using the same event test loop as for the Duktape TypeError issue: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
My workaround prevents crashes and outputs a log entry when the NULL payload is detected.
The bug frequency differs from boot to boot, but as you can see can be very high. I've had runs with ~ 1 occurrence every 300.000 callbacks, and runs like the above with ~ 1 every 3.000 callbacks.
If this is a common effect with timer callbacks, that may cause some of the remaining issues. It's possible this only happens with single shot timers, haven't checked our periodic timers yet.
Any additional input on this is welcome.
Regards, Michael
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
Good to hear this has been confirmed in a test case.

Do we need to review all our use of timers? Or have you already checked?

I wonder if it is reproducible single core and without SPI RAM, and whether this is another ESP32 hardware bug affecting something else...

Regards, Mark.
On 14 Jan 2022, at 12:03 AM, Michael Balzer <dexter@expeedo.de> wrote:
Signed PGP part Everyone,
I've managed to reproduce the effect on a standard ESP32 development board with a simple test project only involving timers, some CPU/memory load & wifi networking.
I've tested both the standard esp-idf release 3.3 and the latest esp-idf release 4.4 (using gcc 8.4) for this, and the effect is still present.
→ Bug report: https://github.com/espressif/esp-idf/issues/8234 <https://github.com/espressif/esp-idf/issues/8234>
Attached are my test projects if you'd like to reproduce this or improve the test.
I haven't tested periodic timer callbacks yet for the effect. These are normally designed to run periodically, but if the timing is important (e.g. on some CAN transmission?), this could cause erroneous behaviour as well.
Regards, Michael
Am 30.12.21 um 19:25 schrieb Michael Balzer:
Followup to…
Am 26.09.21 um 19:28 schrieb Michael Balzer:
If so, my conclusions so far would be:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
Am 26.09.21 um 18:46 schrieb Craig Leres:
Second crash on the production module, CORRUPT HEAP after ~350 minutes.
OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 got 0x3f8adcc0 abort() was called at PC 0x400844c3 on core 0
I've found the heap corruption source and a new bug class:
The corruption was caused by a duplicate free() (here via delete), which was basically impossible: it was the free() call for the event message for a delayed event delivery. There is exactly one producer (OvmsEvents::ScheduleEvent) and exactly one consumer (OvmsEvents::SignalScheduledEvent), which is called exactly once -- when the single shot timer expires.
In theory. In reality, the timer callback occasionally gets executed twice. To exclude every possible race condition, I enclosed both producer & consumer into a semaphore lock. I then changed the code in order to clear the timer payload as soon as it's read, and added a test for a NULL payload -- and voila, the timer callback now gets occasionally called with a NULL payload, which is also impossible as the allocation result is checked in the producer.
I've had no luck reproducing this in a reduced test project, even with multiple auto reload timers and distribution over both cores, but I still see no other explanation than a bug in the FreeRTOS timer service (or the Espressif FreeRTOS multi core adaption). Ah, and yes, this occurs on the ESP32/r3 as well.
You should be able to reproduce the effect using the same event test loop as for the Duktape TypeError issue:https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#... <https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#issuecomment-1002687055>
My workaround prevents crashes and outputs a log entry when the NULL payload is detected.
The bug frequency differs from boot to boot, but as you can see can be very high. I've had runs with ~ 1 occurrence every 300.000 callbacks, and runs like the above with ~ 1 every 3.000 callbacks.
If this is a common effect with timer callbacks, that may cause some of the remaining issues. It's possible this only happens with single shot timers, haven't checked our periodic timers yet.
Any additional input on this is welcome.
Regards, Michael
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com <mailto:OvmsDev@lists.openvehicles.com> http://lists.openvehicles.com/mailman/listinfo/ovmsdev <http://lists.openvehicles.com/mailman/listinfo/ovmsdev>
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 <test-timer-dupecall.zip>
The effect is also present with SPIRAM disabled (and 64K for cpuload_task to fit in internal RAM), and it's even more frequent when running in single core mode.

I haven't reviewed all our timers yet; I'll try to test periodic timers first for the effect, so we know whether these need to be reviewed as well.

Regards, Michael

Am 14.01.22 um 06:50 schrieb Mark Webb-Johnson:
Good to hear this has been confirmed in a test case.
Do we need to review all our use of timers? Or have you already checked?
I wonder if it is reproducible single core and without SPI RAM? Whether this is another ESP32 hardware bug affecting something else...
Regards, Mark.
On 14 Jan 2022, at 12:03 AM, Michael Balzer <dexter@expeedo.de> wrote:
Signed PGP part Everyone,
I've managed to reproduce the effect on a standard ESP32 development board with a simple test project only involving timers, some CPU/memory load & wifi networking.
I've tested both the standard esp-idf release 3.3 and the latest esp-idf release 4.4 (using gcc 8.4) for this, and the effect is still present.
→ Bug report: https://github.com/espressif/esp-idf/issues/8234
Attached are my test projects if you'd like to reproduce this or improve the test.
I haven't tested periodic timer callbacks yet for the effect. These are normally designed to run periodically, but if the timing is important (e.g. on some CAN transmission?), this could cause erroneous behaviour as well.
Regards, Michael
Am 30.12.21 um 19:25 schrieb Michael Balzer:
Followup to…
Am 26.09.21 um 19:28 schrieb Michael Balzer:
If so, my conclusions so far would be:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
Am 26.09.21 um 18:46 schrieb Craig Leres:
Second crash on the production module, CORRUPT HEAP after ~350 minutes.
OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 got 0x3f8adcc0 abort() was called at PC 0x400844c3 on core 0
I've found the heap corruption source and a new bug class:
The corruption was caused by a duplicate free() (here via delete), which was basically impossible: it was the free() call for the event message for a delayed event delivery. There is exactly one producer (OvmsEvents::ScheduleEvent) and exactly one consumer (OvmsEvents::SignalScheduledEvent), which is called exactly once -- when the single shot timer expires.
In theory. _In reality, the timer callback occasionally gets executed twice_. To exclude every possible race condition, I enclosed both producer & consumer into a semaphore lock. I then changed the code in order to clear the timer payload as soon as it's read, and added a test for a NULL payload -- and voila, the timer callback now gets occasionally called with a NULL payload, which is also impossible as the allocation result is checked in the producer.
I've had no luck reproducing this in a reduced test project, even with multiple auto reload timers and distribution over both cores, but I still see no other explanation than a bug in the FreeRTOS timer service (or the Espressif FreeRTOS multi core adaption). Ah, and yes, this occurs on the ESP32/r3 as well.
You should be able to reproduce the effect using the same event test loop as for the Duktape TypeError issue: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
My workaround prevents crashes and outputs a log entry when the NULL payload is detected.
The bug frequency differs from boot to boot, but as you can see can be very high. I've had runs with ~ 1 occurrence every 300.000 callbacks, and runs like the above with ~ 1 every 3.000 callbacks.
If this is a common effect with timer callbacks, that may cause some of the remaining issues. It's possible this only happens with single shot timers, haven't checked our periodic timers yet.
Any additional input on this is welcome.
Regards, Michael
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 <test-timer-dupecall.zip>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
The effect also affects periodic timers, but with a very low frequency. Extended test & results: https://github.com/espressif/esp-idf/issues/8234#issuecomment-1013650548 -- the test run had 3 periodic callback dupes while counting 5912 single shot callback dupes.

Timer usages in OVMS aside from the events system are a) the housekeeping per-second timer, from which monotonictime and all standard tickers are derived, and b) the update ticker for websocket connections. Vehicles using timers are: Nissan Leaf, Tesla Roadster, Renault Twizy & VW e-Up. I suggest the respective maintainers check their timer uses for issues that might result from a duplicate callback. The event timer already detects duplicate invocations and produces the log warning (see below) on this.

Regarding the main ticker timer: a duplicate call would cause the monotonictime counter to get out of sync with the actual system time, but I don't think we rely on that anywhere. More serious could be that the per-second ticker can run twice within the same second in this case. But I also don't think we rely on a minimum execution interval -- after all, these are software timers and generally not guaranteed to run at exactly the interval they are defined for. We use monotonictime for some expiry checks, but none of these should be so short that a single skipped second would matter. So I don't think the housekeeping ticker occasionally running twice would cause an issue. I do remember, though, that I once had an issue with monotonictime behaving strangely, but I don't recall the details.

The websocket timer is used to schedule data updates, metrics etc. for the websocket clients. It doesn't rely on a minimum interval either, and the occasional queue overflow events we see there are more probably caused by a slow network / client.

The vehicle uses of timers may be more critical, as they seem to involve protocol timing, so I recommend checking these ASAP.
A simple strategy to detect duplicate calls on periodic timers is to keep the FreeRTOS tick count (xTaskGetTickCount()) of the last execution in a static variable and compare against it: if it's unchanged, skip the run.

For single shot timers, the strategy I've used in the events framework seems to work: use the timer ID to check the call's validity, i.e. define some timer ID value that marks an invalid call, and set that value in the callback.

Regards, Michael

Am 14.01.22 um 09:43 schrieb Michael Balzer:
The effect is also present with SPIRAM disabled (and 64K for cpuload_task to fit in internal RAM), and it's even more frequent when running in single core mode.
I haven't reviewed all our timers yet, I'll try to test periodic timers first for the effect so we know if these need to be reviewed also.
Regards, Michael
Am 14.01.22 um 06:50 schrieb Mark Webb-Johnson:
Good to hear this has been confirmed in a test case.
Do we need to review all our use of timers? Or have you already checked?
I wonder if it is reproducible single core and without SPI RAM? Whether this is another ESP32 hardware bug affecting something else...
Regards, Mark.
On 14 Jan 2022, at 12:03 AM, Michael Balzer <dexter@expeedo.de> wrote:
Signed PGP part Everyone,
I've managed to reproduce the effect on a standard ESP32 development board with a simple test project only involving timers, some CPU/memory load & wifi networking.
I've tested both the standard esp-idf release 3.3 and the latest esp-idf release 4.4 (using gcc 8.4) for this, and the effect is still present.
→ Bug report: https://github.com/espressif/esp-idf/issues/8234
Attached are my test projects if you'd like to reproduce this or improve the test.
I haven't tested periodic timer callbacks yet for the effect. These are normally designed to run periodically, but if the timing is important (e.g. on some CAN transmission?), this could cause erroneous behaviour as well.
Regards, Michael
Am 30.12.21 um 19:25 schrieb Michael Balzer:
Followup to…
Am 26.09.21 um 19:28 schrieb Michael Balzer:
If so, my conclusions so far would be:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
Am 26.09.21 um 18:46 schrieb Craig Leres:
Second crash on the production module, CORRUPT HEAP after ~350 minutes.
OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 got 0x3f8adcc0 abort() was called at PC 0x400844c3 on core 0
I've found the heap corruption source and a new bug class:
The corruption was caused by a duplicate free() (here via delete), which was basically impossible: it was the free() call for the event message for a delayed event delivery. There is exactly one producer (OvmsEvents::ScheduleEvent) and exactly one consumer (OvmsEvents::SignalScheduledEvent), which is called exactly once -- when the single shot timer expires.
In theory. _In reality, the timer callback occasionally gets executed twice_. To exclude every possible race condition, I enclosed both producer & consumer into a semaphore lock. I then changed the code in order to clear the timer payload as soon as it's read, and added a test for a NULL payload -- and voila, the timer callback now gets occasionally called with a NULL payload, which is also impossible as the allocation result is checked in the producer.
I've had no luck reproducing this in a reduced test project, even with multiple auto reload timers and distribution over both cores, but I still see no other explanation than a bug in the FreeRTOS timer service (or the Espressif FreeRTOS multi core adaption). Ah, and yes, this occurs on the ESP32/r3 as well.
You should be able to reproduce the effect using the same event test loop as for the Duktape TypeError issue: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
My workaround prevents crashes and outputs a log entry when the NULL payload is detected.
The bug frequency differs from boot to boot, but as you can see can be very high. I've had runs with ~ 1 occurrence every 300.000 callbacks, and runs like the above with ~ 1 every 3.000 callbacks.
If this is a common effect with timer callbacks, that may cause some of the remaining issues. It's possible this only happens with single shot timers, haven't checked our periodic timers yet.
Any additional input on this is welcome.
Regards, Michael
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26 <test-timer-dupecall.zip>
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
I've added a tick count check to the core tickers and to the VW e-Up and Renault Twizy timer callbacks, which are all periodic.

Code template: insert this at the top of your timer callback:

  // Workaround for FreeRTOS duplicate timer callback bug
  // (see https://github.com/espressif/esp-idf/issues/8234)
  static TickType_t last_tick = 0;
  TickType_t tick = xTaskGetTickCount();
  if (tick < last_tick + xTimerGetPeriod(timer) - 3)
    return;
  last_tick = tick;

This should also work for a single shot timer, unless it can be reused with a shorter period (as is the case with the event timer pool).

The "-3" accommodates delayed timer callback execution. Note that this dupe detection therefore depends on the period being longer than 3 ticks (30 ms), which is the case for all current periodic timers (the fastest is the 100 ms = 10 tick timer for the Twizy kickdown).

Regards, Michael

On 15.01.22 at 11:08, Michael Balzer wrote:
The bug also affects periodic timers, but with a very low frequency.
Extended test & results: https://github.com/espressif/esp-idf/issues/8234#issuecomment-1013650548
The test run had 3 periodic callback dupes while counting 5912 single shot callback dupes.
Timer usages in OVMS, aside from events, are a) the housekeeping per-second timer, from which monotonictime and all standard tickers are derived, and b) the update ticker for websocket connections.
Vehicles using timers are: Nissan Leaf, Tesla Roadster, Renault Twizy & VW e-Up. I suggest the respective maintainers check their timer uses for issues that might result from a duplicate callback.
The event timer already detects duplicate invocations and produces the log warning (see below) on this.
Regarding the main ticker timer: a duplicate call would cause the monotonictime counter to get out of sync with the actual system time, but I don't think we rely on that anywhere. More serious could be that the per-second ticker can run twice within the same second in this case. But I don't think we rely on a minimum execution interval either -- after all, these are software timers and generally not guaranteed to run at exactly the interval they are defined for. We use monotonictime for some expiry checks, but none of these should be so short that a single skipped second would matter.
So I don't think the housekeeping ticker occasionally running twice would cause an issue. I do remember, though, that I once had an issue with monotonictime behaving strangely, but I don't recall the details.
The websocket timer is used to schedule data updates, metrics etc. for the websocket clients. It does not rely on a minimum interval either, and the occasional queue overflow events we see there are more likely caused by a slow network / client.
The vehicle uses of timers may be more critical, as they seem to involve protocol timing; I recommend checking these ASAP.
A simple strategy to detect duplicate calls on periodic timers is to keep the FreeRTOS tick count (xTaskGetTickCount()) of the last execution in a static variable and compare: if it's unchanged, skip the run.
For single shot timers, the strategy I've used in the events framework seems to work: use the timer ID to check the call validity, i.e. define some timer ID value that marks an invalid call, and set that value in the callback.
Regards, Michael
On 14.01.22 at 09:43, Michael Balzer wrote:
The effect is also present with SPIRAM disabled (and 64K for cpuload_task to fit in internal RAM), and it's even more frequent when running in single core mode.
I haven't reviewed all our timers yet, I'll try to test periodic timers first for the effect so we know if these need to be reviewed also.
Regards, Michael
On 14.01.22 at 06:50, Mark Webb-Johnson wrote:
Good to hear this has been confirmed in a test case.
Do we need to review all our use of timers? Or have you already checked?
I wonder if it is reproducible single core and without SPI RAM? Whether this is another ESP32 hardware bug affecting something else...
Regards, Mark.
On 14 Jan 2022, at 12:03 AM, Michael Balzer <dexter@expeedo.de> wrote:
Signed PGP part Everyone,
I've managed to reproduce the effect on a standard ESP32 development board with a simple test project only involving timers, some CPU/memory load & wifi networking.
I've tested both the standard esp-idf release 3.3 and the latest esp-idf release 4.4 (using gcc 8.4) for this, and the effect is still present.
→ Bug report: https://github.com/espressif/esp-idf/issues/8234
Attached are my test projects if you'd like to reproduce this or improve the test.
I haven't tested periodic timer callbacks yet for the effect. These are normally designed to run periodically, but if the timing is important (e.g. on some CAN transmission?), this could cause erroneous behaviour as well.
Regards, Michael
On 30.12.21 at 19:25, Michael Balzer wrote:
Followup to…
On 26.09.21 at 19:28, Michael Balzer wrote:
If so, my conclusions so far would be:
a) we've got a real heap corruption issue that sometimes gets triggered by the test. I've had it twice around the same place, i.e. within the scheduled event processing. I'll check my code.
On 26.09.21 at 18:46, Craig Leres wrote:
> Second crash on the production module, CORRUPT HEAP after ~350 minutes.
>
> OVMS# CORRUPT HEAP: Bad head at 0x3f8adc78. Expected 0xabba1234 got 0x3f8adcc0
> abort() was called at PC 0x400844c3 on core 0
I've found the heap corruption source and a new bug class:
The corruption was caused by a duplicate free() (here via delete), which was basically impossible: it was the free() call for the event message for a delayed event delivery. There is exactly one producer (OvmsEvents::ScheduleEvent) and exactly one consumer (OvmsEvents::SignalScheduledEvent), which is called exactly once -- when the single shot timer expires.
In theory. _In reality, the timer callback occasionally gets executed twice_. To exclude every possible race condition, I enclosed both producer & consumer in a semaphore lock. I then changed the code to clear the timer payload as soon as it's read, and added a test for a NULL payload -- and voila, the timer callback now occasionally gets called with a NULL payload, which is likewise impossible, as the allocation result is checked in the producer.
I've had no luck reproducing this in a reduced test project, even with multiple auto-reload timers and distribution over both cores, but I still see no other explanation than a bug in the FreeRTOS timer service (or the Espressif FreeRTOS multi-core adaptation). Ah, and yes, this occurs on the ESP32/r3 as well.
You should be able to reproduce the effect using the same event test loop as for the Duktape TypeError issue: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/issues/474#...
My workaround prevents crashes and outputs a log entry when the NULL payload is detected.
Example log excerpt:
script eval 'testcnt=0; PubSub.subscribe("usr.testev", function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % (3*1000/ms) == 0) print(ev + ": " + testcnt); OvmsEvents.Raise("usr.testev."+ms, ms); })'
script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'

I (13493019) events: ScheduleEvent: creating new timer
W (13495919) events: SignalScheduledEvent: duplicate callback invocation detected
I (13497029) ovms-duk-util: [eval:1:] usr.testev.10: 300
I (13501109) housekeeping: 2021-12-30 18:12:35 CET (RAM: 8b=64448-67004 32b=6472)
--
I (13521779) ovms-duk-util: [eval:1:] usr.testev.10: 2100
I (13525809) ovms-duk-util: [eval:1:] usr.testev.10: 2400
W (13527579) events: SignalScheduledEvent: duplicate callback invocation detected
W (13527629) events: SignalScheduledEvent: duplicate callback invocation detected
I (13529839) ovms-duk-util: [eval:1:] usr.testev.10: 2700
I (13533329) ovms-server-v2: Incoming Msg: MP-0 AFA
--
I (13579149) ovms-duk-util: [eval:1:] usr.testev.10: 6300
I (13583319) ovms-duk-util: [eval:1:] usr.testev.10: 6600
W (13584679) events: SignalScheduledEvent: duplicate callback invocation detected
I (13587439) ovms-duk-util: [eval:1:] usr.testev.10: 6900
I (13591589) ovms-duk-util: [eval:1:] usr.testev.10: 7200
--
I (13714299) ovms-duk-util: [eval:1:] usr.testev.10: 16200
I (13718339) ovms-duk-util: [eval:1:] usr.testev.10: 16500
W (13718719) events: SignalScheduledEvent: duplicate callback invocation detected
I (13722459) ovms-duk-util: [eval:1:] usr.testev.10: 16800
I (13726509) ovms-duk-util: [eval:1:] usr.testev.10: 17100
--
I (13743149) ovms-duk-util: [eval:1:] usr.testev.10: 18300
I (13747129) ovms-duk-util: [eval:1:] usr.testev.10: 18600
W (13748979) events: SignalScheduledEvent: duplicate callback invocation detected
I (13751299) ovms-duk-util: [eval:1:] usr.testev.10: 18900
I (13755349) ovms-duk-util: [eval:1:] usr.testev.10: 19200
--
I (13784029) ovms-duk-util: [eval:1:] usr.testev.10: 21300
I (13788059) ovms-duk-util: [eval:1:] usr.testev.10: 21600
W (13791409) events: SignalScheduledEvent: duplicate callback invocation detected
I (13792179) ovms-duk-util: [eval:1:] usr.testev.10: 21900
I (13796239) ovms-duk-util: [eval:1:] usr.testev.10: 22200
…
The bug frequency differs from boot to boot but, as you can see, can be very high. I've had runs with ~1 occurrence every 300,000 callbacks, and runs like the above with ~1 every 3,000 callbacks.
If this is a common effect with timer callbacks, it may cause some of the remaining issues. It's possible this only happens with single shot timers; I haven't checked our periodic timers yet.
Any additional input on this is welcome.
Regards, Michael
-- Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
_______________________________________________ OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
The duplicate callbacks on the single shot timers actually were my fault. I missed reading the documentation thoroughly: xTimerChangePeriod() not only changes the period, it also starts the timer. So assigning the timer payload after changing its period introduced a race condition. I've fixed that and no longer see duplicate callbacks in the test log.

The case of additional callback executions for cyclic timers is still open.

Regards, Michael

On 15.01.22 at 15:13, Michael Balzer wrote:
I'm aborting the "production" test after 37 hours with no new crashes. The sim7600 test hit a CORRUPT HEAP after 5 hours. Let me know if you want me to do any more tests.

Craig

===========================================
I (194380023) netmanager: Set DNS#1 0.0.0.0
I (194380023) netmanager: Set DNS#2 0.0.0.0
I (194380673) ovms-duk-util: [eval:1:] usr.testev.10: 18837900
I (194383703) ovms-duk-util: [eval:1:] usr.testev.10: 18838200
I (194386843) ovms-duk-util: [eval:1:] usr.testev.10: 18838500
I (194389853) ovms-duk-util: [eval:1:] usr.testev.10: 18838800
I (194390023) cellular: State: Enter NetWait state
I (194392023) gsm-ppp: StatusCallBack: User Interrupt
I (194392883) ovms-duk-util: [eval:1:] usr.testev.10: 18839100
I (194394023) cellular: State: Enter NetStart state
I (194395063) cellular: PPP Connection is ready to start
I (194396023) cellular: State: Enter NetMode state
I (194396023) gsm-ppp: Initialising...
I (194396033) ovms-duk-util: [eval:1:] usr.testev.10: 18839400
CORRUPT HEAP: Bad head at 0x3f82dde0. Expected 0xabba1234 got 0x3f82df64
abort() was called at PC 0x400844c3 on core 1

ELF file SHA256: 7c10178c2874419f

Backtrace: 0x40089a2f:0x3ffc4520 0x40089cc9:0x3ffc4540 0x400844c3:0x3ffc4560 0x400845dd:0x3ffc45a0 0x4011a4a3:0x3ffc45c0 0x4010f439:0x3ffc4880 0x4010ef89:0x3ffc48d0 0x4008e903:0x3ffc4900 0x400840b1:0x3ffc4920 0x40084671:0x3ffc4940 0x4000bec7:0x3ffc4960 0x400f7047:0x3ffc4980 0x400f7801:0x3ffc49a0 0x400f783c:0x3ffc4a00 0x400f78b9:0x3ffc4a40

Rebooting...
ets Jul 29 2019 12:21:46
===========================================
ice 205 % ./backtrace.sh 0x40089a2f:0x3ffc4520 0x40089cc9:0x3ffc4540 0x400844c3:0x3ffc4560 0x400845dd:0x3ffc45a0 0x4011a4a3:0x3ffc45c0 0x4010f439:0x3ffc4880 0x4010ef89:0x3ffc48d0 0x4008e903:0x3ffc4900 0x400840b1:0x3ffc4920 0x40084671:0x3ffc4940 0x4000bec7:0x3ffc4960 0x400f7047:0x3ffc4980 0x400f7801:0x3ffc49a0 0x400f783c:0x3ffc4a00 0x400f78b9:0x3ffc4a40
+ xtensa-esp32-elf-addr2line -e build/ovms3.elf 0x40089a2f:0x3ffc4520 0x40089cc9:0x3ffc4540 0x400844c3:0x3ffc4560 0x400845dd:0x3ffc45a0 0x4011a4a3:0x3ffc45c0 0x4010f439:0x3ffc4880 0x4010ef89:0x3ffc48d0 0x4008e903:0x3ffc4900 0x400840b1:0x3ffc4920 0x40084671:0x3ffc4940 0x4000bec7:0x3ffc4960 0x400f7047:0x3ffc4980 0x400f7801:0x3ffc49a0 0x400f783c:0x3ffc4a00 0x400f78b9:0x3ffc4a40
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/esp32/panic.c:736
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/esp32/panic.c:736
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/locks.c:143
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/locks.c:171
/Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/vfprintf.c:1699 (discriminator 8)
/Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/strtod.c:428
/Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/string/../../../.././newlib/libc/string/strerror.c:591
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/heap/multi_heap_poisoning.c:350
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/heap/heap_caps.c:403
/home/ice/u0/leres/esp/openvehicles-xtensa-esp32-elf/components/newlib/syscalls.c:42
??:0
main/ovms_http.cpp:81
main/ovms_metrics.cpp:597
main/ovms_metrics.cpp:597
main/ovms_metrics.cpp:597
Oh, you guys are not testing my mongoose-wolfssl branch, you are testing master and/or the for-v3.3 branch. I see that commit c6911c91432cada337bef46f6a541af46304b5cf went into master on 2/11, but commit 9607979e91da7a53da1cd0bd8325ab390abe18bb is only on my mongoose-wolfssl branch.

On master or for-v3.3 the SSH/SSL code is updated to 1.4.5/4.6.0, but mongoose is still using MBEDTLS for TLS.

So, Craig, you're not actually "building/booting this" where this was referring to my wolfssl-based TLS. If you would like to do so, what rebasing and/or merging is needed? Should my new code be moved over to a new branch from for-v3.3? Actually, I guess all of the changes to the wolfssh and wolfssl components could be merged. Whether TLS goes through MBEDTLS or wolfssl is controlled by changes in mongoose.

-- Steve

On Fri, 12 Mar 2021, Mark Webb-Johnson wrote:
The for-v3.3 branch should be up-to-date and merged from master. It should have everything that master has.
I see it has this:
commit 9607979e91da7a53da1cd0bd8325ab390abe18bb
Author: Stephen Casner <casner@acm.org>
Date:   Wed Feb 24 23:53:28 2021 -0800
SSH: Don't emit error message if wolfssl debugging is unconfigured
diff --git a/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp b/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
index 21549bad..6fa0e5e5 100644
--- a/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
+++ b/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
@@ -177,9 +177,8 @@ void OvmsSSH::NetManInit(std::string event, void* data)
   ESP_LOGI(tag, "Launching SSH Server");
   wolfSSH_SetLoggingCb(&wolfssh_logger);
   wolfSSH_Debugging_ON();
-  if ((ret=wolfSSL_SetLoggingCb(&wolfssl_logger)) || (ret=wolfSSL_Debugging_ON()))
-    ESP_LOGW(tag, "Couldn't initialize wolfSSL debugging, error %d: %s", ret,
-      GetErrorString(ret));
+  wolfSSL_SetLoggingCb(&wolfssl_logger);
+  wolfSSL_Debugging_ON();
   ret = wolfSSH_Init();
   if (ret != WS_SUCCESS) {
But current code in both master and for-v3.3 branches is:
  ESP_LOGI(tag, "Launching SSH Server");
  wolfSSH_SetLoggingCb(&wolfssh_logger);
  wolfSSH_Debugging_ON();
  if ((ret=wolfSSL_SetLoggingCb(&wolfssl_logger)) || (ret=wolfSSL_Debugging_ON()))
    ESP_LOGW(tag, "Couldn't initialize wolfSSL debugging, error %d: %s", ret,
      GetErrorString(ret));
Seems commit c6911c91432cada337bef46f6a541af46304b5cf has brought back the old code?
Mark
On 12 Mar 2021, at 11:34 AM, Stephen Casner <casner@acm.org> wrote:
Craig and Mark,
I do have the OvmsSSH::NetManInit() function calling wolfSSL_Debugging_ON(), which would be expected to return -174, meaning NOT_COMPILED_IN, as Mark correctly found. At one point I had code to print that error because I was having trouble getting the wolfSSL debugging to work, but I took out that error message in commit 9607979e91da7a53da1cd0bd8325ab390abe18bb, so now the return value is ignored. I'm baffled. I'll have to look deeper after dinner.
My mongoose-wolfssl branch is off of master as of 2/17. I should probably have rebased to the current master, or perhaps merged it to for-v3.3 as Mark recently requested. Have you guys done that merge for what you are testing now?
-- Steve
On Fri, 12 Mar 2021, Mark Webb-Johnson wrote:
P.S. Error code -174 seems to be 'NOT_COMPILED_IN'.
Regards, Mark.
On 12 Mar 2021, at 9:56 AM, Mark Webb-Johnson <mark@webb-johnson.net> wrote:
Craig,
I get the same (with for-v3.3):
W (2940) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code
I guess it is just a warning. Probably some debugging config setting.
But wifi, web and others work ok for me. Only problems I have with for-v3.3 branch are (a) the web dashboard modem status, and (b) the TLS certificate verification against api.openvehicles.com. I am working on both.
Regards, Mark.
On 12 Mar 2021, at 9:46 AM, Craig Leres <leres@xse.com> wrote:
On 3/10/21 11:23 PM, Stephen Casner wrote:
Michael and anyone else who's game: I now have an updated mongoose-wolfssl branch ready to be tested.

The reason for the 90-second lockup mentioned in the previous post is a whole lot of math for a prime-number validation that's part of the Diffie-Hellman step. It was actually 87 seconds for Mark's server and 28 seconds for Michael's, due to differences in certificates. That prime-number validation is required for FIPS compliance, which WolfSSL supports, but we don't need it. I spent quite a while digging into this to find where the process was getting stuck. Finally I got help from WolfSSL support suggesting a configuration option that avoids this extra check.

So now I have an implementation using mongoose with wolfssl that connects successfully to both servers with a 3-4 second delay. (I don't recall what the delay was for the MBEDTLS-based implementation.) I think the memory usage looks OK. I still have not taken any steps to reduce any resources used by the MBEDTLS code as accessed for other purposes.

Included in the debugging was another version update on the Wolf code, to wolfssh 1.4.6 and wolfssl 4.7.0.
I tried building/booting this on my dev module (3.2.016-66-g93e0cf3e), but for some time now the for-v3.3 branch has been broken for me. When the module first boots, the web GUI works long enough for me to log in, and then it times out. From that point on I can't get the web GUI or ssh to respond. It will return pings. The serial console is fine (and that's how I switch back to a build based on master).
I just did a fresh reboot and captured the serial console output and noticed this:
W (4484) ssh: Couldn't initialize wolfSSL debugging, error -174: Unknown error code
I think it happened around the time I lost wifi connectivity.
My sdkconfig is close to support/sdkconfig.default.hw31, I have CONFIG_SPIRAM_CACHE_WORKAROUND turned off along with a lot of vehicles.
Craig
_______________________________________________
OvmsDev mailing list OvmsDev@lists.openvehicles.com http://lists.openvehicles.com/mailman/listinfo/ovmsdev
I suggest merging into for-v3.3 and just including it in the testing for that. I have a tag 'pre' set up on the api.openvehicles.com server to allow that 'preview' from the for-v3.3 branch to be OTA updated automatically (I ran out of good names after edge, eap, and main). I build it when worthwhile. This allows me to easily run it in my cars for live testing.
Regards, Mark.
On 12 Mar 2021, at 2:08 PM, Stephen Casner <casner@acm.org> wrote:
Oh, you guys are not testing my mongoose-wolfssl branch, you are testing master and/or for-v3.3 branch. I see that commit c6911c91432cada337bef46f6a541af46304b5cf went into master on 2/11 but commit 9607979e91da7a53da1cd0bd8325ab390abe18bb is only on my mongoose-wolfssl branch.
On master or for-v3.3 the SSH/SSL code is updated to 1.4.5/4.6.0 but mongoose is still using MBEDTLS for TLS.
So, Craig, you're not actually "building/booting this", where "this" referred to my wolfssl-based TLS. If you would like to do so, what rebasing and/or merging is needed? Should my new code be moved over to a new branch from for-v3.3?
Actually, I guess all of the changes to the wolfssh and wolfssl components could be merged. Whether TLS goes through MBEDTLS or wolfssl is controlled by changes in mongoose.
-- Steve
On Fri, 12 Mar 2021, Mark Webb-Johnson wrote:
The for-v3.3 branch should be up-to-date and merged from master. It should have everything that master has.
I see it has this:
commit 9607979e91da7a53da1cd0bd8325ab390abe18bb
Author: Stephen Casner <casner@acm.org>
Date: Wed Feb 24 23:53:28 2021 -0800
SSH: Don't emit error message if wolfssl debugging is unconfigured
diff --git a/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp b/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
index 21549bad..6fa0e5e5 100644
--- a/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
+++ b/vehicle/OVMS.V3/components/console_ssh/src/console_ssh.cpp
@@ -177,9 +177,8 @@ void OvmsSSH::NetManInit(std::string event, void* data)
   ESP_LOGI(tag, "Launching SSH Server");
   wolfSSH_SetLoggingCb(&wolfssh_logger);
   wolfSSH_Debugging_ON();
-  if ((ret=wolfSSL_SetLoggingCb(&wolfssl_logger)) || (ret=wolfSSL_Debugging_ON()))
-    ESP_LOGW(tag, "Couldn't initialize wolfSSL debugging, error %d: %s", ret,
-      GetErrorString(ret));
+  wolfSSL_SetLoggingCb(&wolfssl_logger);
+  wolfSSL_Debugging_ON();
   ret = wolfSSH_Init();
   if (ret != WS_SUCCESS) {
But current code in both master and for-v3.3 branches is:
ESP_LOGI(tag, "Launching SSH Server");
wolfSSH_SetLoggingCb(&wolfssh_logger);
wolfSSH_Debugging_ON();
if ((ret=wolfSSL_SetLoggingCb(&wolfssl_logger)) || (ret=wolfSSL_Debugging_ON()))
  ESP_LOGW(tag, "Couldn't initialize wolfSSL debugging, error %d: %s", ret,
    GetErrorString(ret));
It seems commit c6911c91432cada337bef46f6a541af46304b5cf brought back the old code?
Mark
On 12 Mar 2021, at 11:34 AM, Stephen Casner <casner@acm.org> wrote:
Craig and Mark,
I do have the OvmsSSH::NetManInit() function calling wolfSSL_Debugging_ON(), which would be expected to return -174, meaning NOT_COMPILED_IN, as Mark correctly found. At one point I had code to print that error message because I was having trouble getting the wolfSSL debugging to work, but I took out that error message in commit 9607979e91da7a53da1cd0bd8325ab390abe18bb, so now the return value is ignored. I'm baffled. I'll have to look deeper after dinner.
My mongoose-wolfssl branch is off of master on 2/17. I should probably have rebased to the current master or perhaps merged it to for-v3.3 as Mark recently requested. Have you guys done that merge for what you are testing now?
-- Steve
participants (4)
- Craig Leres
- Mark Webb-Johnson
- Michael Balzer
- Stephen Casner