[Ovmsdev] Steven is ...
Stephen Casner
casner at acm.org
Thu Nov 16 16:34:56 HKT 2017
Mark,
Thanks for the compliments. This integration of WolfSSH was a
significant struggle for various reasons that I will describe here for
their possible entertainment value :-) as well as noting some
additional debugging approaches.
Integration of the WolfSSL and WolfSSH libraries themselves into the
OVMS build did not take too long. The build process for those
libraries is implemented with autoconf and automake and the resulting
configure script, which doesn't fit with the ESP-IDF build structure.
My first step was to just build those two libraries and the example
SSH echo server program on my Mac OSX laptop using the configure
script as designed. Then I was able to run configure again with the
option --enable-cryptonly to limit the build to just that portion
needed for SSH plus the --disable-silent-rules option (or equivalently
run "make V=1") to show the full command lines during the make. That
let me see which object files were needed out of the big WolfSSL tree
and to see what -D defines were used during the make. I also looked
at the defines in the config.h file that was produced to see which
were relevant. At first I added those defines to CFLAGS in the
component.mk files, but later moved them to the user_settings.h file
that the Wolf sources are set up to include if WOLFSSL_USER_SETTINGS
is defined (in the component.mk) so they could be included when the
library headers were included in the console_ssh.cpp compilation.
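For illustration only, a user_settings.h along these lines could hold such defines. The three macros shown are real WolfSSL build options, but this particular selection is an assumption and is not the actual OVMS file.

```c
/* user_settings.h: read by the Wolf headers when WOLFSSL_USER_SETTINGS
 * is defined on the compiler command line (here, via component.mk).
 * Illustrative selection only, not the actual OVMS settings. */
#ifndef USER_SETTINGS_H
#define USER_SETTINGS_H

#define WOLFCRYPT_ONLY   /* what configure --enable-cryptonly defines */
#define FREERTOS         /* assumed: select the FreeRTOS port */
#define NO_FILESYSTEM    /* assumed: no file-based key loading */

#endif /* USER_SETTINGS_H */
```

Keeping the defines in a header rather than in CFLAGS means any file that includes the library headers, such as console_ssh.cpp, sees the same configuration.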
The way the WolfSSH code is written there are many instances where the
compiler detects potential references to uninitialized variables even
though the code structure ensures that won't happen. At first I tried
to add initializations, but then gave up and added CFLAGS option
-Wno-maybe-uninitialized to the component.mk file. I also needed to
define the COMPONENT_EXTRA_INCLUDES variable in component.mk to
reference the directory where the FreeRTOS includes are located in
ESP-IDF because, although WolfSSL does support FreeRTOS, their
#include statements expect the files one level up. I also needed to
list the object files explicitly to avoid a warning about direct
compilation of misc.c because it is intended to be compiled only as
part of other .c files that #include it. With those additions to the
component.mk files I was able to compile wolfssl and wolfssh without
any changes to the files in the distributions.
I created console_ssh from console_telnet plus code snippets from the
WolfSSH example echoserver program, then I realized that I needed to
add a define NO_DEV_RANDOM since there is no /dev/random facility.
I'm temporarily using the non-random test generator, but need to
implement access to the ESP32 random facility. I worked through
several variations in the transition from the design for telnet where
the library does not do any of the I/O itself to ssh where it does.
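As a rough sketch of what access to the ESP32 random facility might look like, the helper below fills a byte buffer from any source of 32-bit random words. The names fill_random, rand32_fn and counting_source are hypothetical. On the target the source could be esp_random() from ESP-IDF, which has a matching signature. How such a helper would be wired into WolfSSL is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: any function that returns 32 random bits.
 * On the ESP32 this could be esp_random() from ESP-IDF. */
typedef uint32_t (*rand32_fn)(void);

/* Fill out[0..len) using successive 32-bit words from next32. */
void fill_random(rand32_fn next32, uint8_t *out, size_t len)
{
    assert(next32 != NULL);
    assert(out != NULL || len == 0);
    while (len > 0) {
        uint32_t word = next32();
        size_t n = len < sizeof word ? len : sizeof word;
        memcpy(out, &word, n);   /* copy only as many bytes as remain */
        out += n;
        len -= n;
    }
}

/* Host-side stand-in, NOT random: returns 0x01010101, 0x02020202, ...
 * so that the fill logic can be checked off-target. */
uint32_t counting_source(void)
{
    static uint32_t calls = 0;
    calls++;
    return calls * 0x01010101u;
}
```

Passing the source as a function pointer keeps the fill loop testable on a desktop build, while the device build can pass the hardware generator instead.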
At that point the project became more difficult. When connecting from
the client on the Mac it would sometimes progress to the point of
getting a command prompt, but often there were complaints like
"incorrect signature" or "Bad packet length 2263058189". There were
also some crashes suggesting corrupted memory. I turned on the debug
logging in wolfssh and in the ssh client, but that did not give enough
details about the raw packets. I installed PuTTY (which also did not
go smoothly) since it does provide logging of the raw packets.
I came to the conclusion that the complaints about packets were the
result of memory corruption, so to find out where in the wolfssh code
this was occurring I added calls in several places to a heap debug
routine that checks the integrity of all free and allocated blocks and
records a tag number from the call into a ring buffer to track which
instance of the call it was and abort when a memory error was found.
That way I could see which was the last call before the problem and
the first one after.
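The tagging approach described above can be sketched roughly as follows. This is not the routine actually used. heap_trace, heap_trace_recent and the fake_heap_check stand-in are hypothetical names, and the real integrity check (on ESP-IDF, perhaps heap_caps_check_integrity_all()) is passed in as a callback so the sketch can run on a host. Here the caller decides whether to abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRACE_SLOTS 8            /* ring size, chosen arbitrarily */

/* Hypothetical hook: returns true when the heap looks intact. */
typedef bool (*heap_check_fn)(void);

static int    trace_ring[TRACE_SLOTS];
static size_t trace_count;       /* total number of tags recorded */

/* Record the tag, then verify the heap. Returns false on corruption
 * so the caller can abort while the ring buffer is still readable. */
bool heap_trace(int tag, heap_check_fn check)
{
    assert(check != NULL);
    trace_ring[trace_count % TRACE_SLOTS] = tag;
    trace_count++;
    return check();
}

/* Tag recorded `back` calls ago (0 = most recent), or -1 if that
 * entry was never written or has already been overwritten. */
int heap_trace_recent(size_t back)
{
    if (back >= trace_count || back >= TRACE_SLOTS)
        return -1;
    return trace_ring[(trace_count - 1 - back) % TRACE_SLOTS];
}

/* Host-side stand-in for a real heap integrity check. */
bool fake_heap_ok = true;
bool fake_heap_check(void) { return fake_heap_ok; }
```

When a check fails, heap_trace_recent(0) is the first call after the corruption and heap_trace_recent(1) is the last call before it, which brackets the offending code.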
Over a bunch of runs the results varied. Then I began to see some
cases where the sequence of tags should not be possible according to
the code structure. I expanded the wolfssh debug logging to add a
task name to the log line. That exposed the root problem: I had put
some of the initialization code in the constructor and initialization
parts of the SSHConsole task that are executed by the parent process
before the SSHConsole task is created. Then the SSHConsole task could
be created and start sending the command line introduction before the
initialization was finished, causing two tasks to be accessing the
same thread context in the library. That's not safe. This sounds
straightforward to deduce now, but it took a while to run test cases
and dig through the details before I realized the problem.
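A minimal sketch of the idea, with hypothetical names throughout (format_log_line, ctx_owner, ctx_owner_check): prefix each log line with the task name, and optionally remember which task first touched a library context so that a second task using it is flagged. On FreeRTOS the name could come from pcTaskGetTaskName() (pcTaskGetName() in newer releases). Here it is passed in explicitly so the example runs on a host.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "[task] message" into buf; returns what snprintf returns. */
int format_log_line(char *buf, size_t size, const char *task, const char *msg)
{
    assert(buf != NULL && task != NULL && msg != NULL);
    return snprintf(buf, size, "[%s] %s", task, msg);
}

/* Remembers which task first used a library context. */
typedef struct {
    char owner[16];     /* name of the first task seen, "" if none yet */
} ctx_owner;

/* Returns 1 if `task` is (or now becomes) the owner, 0 if a different
 * task is touching the same context. */
int ctx_owner_check(ctx_owner *o, const char *task)
{
    assert(o != NULL && task != NULL);
    if (o->owner[0] == '\0') {
        strncpy(o->owner, task, sizeof o->owner - 1);
        o->owner[sizeof o->owner - 1] = '\0';
        return 1;
    }
    return strncmp(o->owner, task, sizeof o->owner - 1) == 0;
}
```

With a check like this at the library entry points, initialization run by the parent followed by output from the SSHConsole task would show up as a return of 0 instead of as corrupted packets later on.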
Progress was also impeded by a bug in gdb that causes it to get stuck
in an infinite loop. I kept hitting this when trying to look at the
memory problems, so I had to digress into debugging gdb using the
Mac's lldb. Naturally the gdb is built without debugging symbols, so
I had to debug using disassembly. It turns out that the bug is one
that was fixed in 2009 in the libiconv library. I'm not sure if this
is true on all platforms, but the xtensa-esp32-elf-gdb binary supplied
for the Mac references /usr/lib/libiconv.2.dylib which is version
7.0.0. MacPorts supplied a newer one, 9.0.0, but the /usr/lib
directory is locked down on the Mac so you can't change it even with
sudo and I could not find a tool to edit the gdb binary. So I tried
following the ESP32 documentation for building the toolchain from
sources. After several minutes that got most of the way through but
failed on the gdb step with a complaint from the configure script that
python was not available or not usable.
I knew that should not be true, so I digressed into debugging the
configure script (if you've tried this, you know these scripts are
almost impossible to read). Again by adding V=1 to the make command
that invokes configure I figured out that the problem was that the
configure script was including a -u option to gcc when compiling the
test program for that configure step, but on the Mac gcc is symlinked
to clang which does not understand the -u option. I figured out how
to hack the configure script to omit that (unnecessary) option and
then manually ran configure with an option to get libiconv from where
MacPorts puts it in /opt/local/lib, followed by make to build gdb.
Now I have a gdb that does not get stuck in the infinite loop. If any
of you want to use gdb and hit this problem, I can assist with fixing
it.
-- Steve
On Wed, 15 Nov 2017, Mark Webb-Johnson wrote:
> Wow. Just wow.
>
> $ ssh jack@1.2.3.217
> The authenticity of host '1.2.3.217 (1.2.3.217)' can't be established.
> RSA key fingerprint is SHA256:Fv1cgvPKmGoojR2nGa+/rnCvu7N0Wv/4pr8NOwHpgn0.
> Are you sure you want to continue connecting (yes/no)? yes
> Warning: Permanently added '1.2.3.217' (RSA) to the list of known hosts.
> WolfSSH Server
> jack@1.2.3.217's password:
>
> Welcome to the Open Vehicle Monitoring System (OVMS) - SSH Console
> OVMS >
>
> Stephen Casner, you are without a doubt ‘da man.
>
> Very very cool.