[Ovmsdev] Attention: esp-idf upgrade still breaks configuration

Michael Balzer dexter at expeedo.de
Wed Nov 28 05:35:24 HKT 2018


@all: we need some help on this.

Dmitry has asked us to verify an assumption about the source of the problem. Unfortunately, my module has decided to not show the bug anymore.

Can anyone a) reproduce the bug and b) check Dmitry's code change? The code change is very simple:

    https://github.com/espressif/esp-idf/issues/2730#issuecomment-441928779

…that's in file esp-idf/components/wear_levelling/WL_Flash.cpp line 183.

Chances are the bug is more frequently present on V3.1 first batch modules.

_To test a build including the bug, do_:

    OVMS# config backup /sd/cfgbackup.zip

    ~/…/esp-idf> git checkout 5afc1b0cbbcd5903744cd2948ee235b7016eeede
    ~/…/esp-idf> git merge --no-commit 9d609af54c63e7f949a4fbc43d4f1c13b57f49d8
    ~/…/esp-idf> git submodule update --recursive

    ~/…/OVMS.V3> git checkout c3c908c00a09c4a841d1fec115be785c8f486270
    ~/…/OVMS.V3> make app-flash

Then do some reboots of the module.

If the bug shows up, usually the second reboot already will result in a lost configuration. You'll see some warnings in the boot process (mount failure), and
the prompt will directly be "#" as no module password is in place.

If that happens, rebuild after applying the change as described by Dmitry, restore your config (see below), check if the bug is gone. If it isn't, the change
_definitely doesn't work_.

If it's gone, as the bug can stop showing up, you need to revert Dmitry's change and check once again if the bug reappears. If it does, the change _definitely
works_.

If the bug doesn't reappear, you need to switch off the module for some hours, then redo the above steps.


_Note on restoring the config backup_: as my later restore fix isn't included in that version, you may need to create the events directory manually:

    OVMS# vfs mkdir /store/events
    OVMS# config restore /sd/cfgbackup.zip <password>

To return to normal build setup:

    ~/…/esp-idf> git checkout -f master
    ~/…/esp-idf> git submodule update --recursive

    ~/…/OVMS.V3> git checkout master


Please report definitive results directly to Dmitry in the issue #2730.

Thanks,
Michael


Am 27.11.18 um 02:02 schrieb Mark Webb-Johnson:
> I think unmount should be on the system.shutdown event. That would be after system.shuttingdown and the MyBoot.RestartPending/RestartReady control steps.
>
> That means if a component really wants to write to config (or elsewhere on /store), it should do it during system.shuttingdown event (and optionally do
> the MyBoot.RestartPending(TAG) and MyBoot.RestartReady(TAG) thing, like simcom).
>
> As you say, that doesn’t address the crash-during-shuttingdown issue, but can be done in a very component manner. No changes to MyBoot - all done in
> ovms_config. Or just hard-code the /store unmount before esp_restart().
>
> Regards, Mark.
>
>> On 27 Nov 2018, at 12:47 AM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de>> wrote:
>>
>> Dimitris question about unclean shutdowns let me check our reboot code, and we actually do not take care of unmounting the config store.
>>
>> While my crash observations do not look like crash/clean makes a difference to this bug, I still think we should do that.
>>
>> The simple solution would be to do this last, in Boot::boot_shutdown_done() just before the esp_restart(), but that does not take care of crashes during
>> shutdown, which still are too frequent.
>>
>> How about unmounting /store ASAP after the shuttingdown event (i.e. normal shutdown handling)?
>> Writing to the config would then have no effect during shutdown, but do we actually need (want) that to be possible?
>>
>> Regards,
>> Michael
>>
>>
>> Am 25.11.18 um 03:12 schrieb Mark Webb-Johnson:
>>> I second this. A fantastic effort, Michael.
>>>
>>> The log you provided on the GitHub issue looks really helpful, and I see Dimitri has replied:
>>>
>>>
>>>           dmitry1945 <https://github.com/dmitry1945> commented 6 hours ago <https://github.com/espressif/esp-idf#issuecomment-441391665>
>>>
>>>     Thank you @dexterbg <https://github.com/dexterbg>, the place is clear.
>>>
>>>
>>> Hopefully Dimitri can find the issue from here. So many little bugs in wifi and bluetooth stacks causing us random issues - it would be helpful to be able
>>> to update to the latest IDF.
>>>
>>> Thanks, and Regards,
>>> Mark.
>>>
>>>> On 25 Nov 2018, at 3:22 AM, Stephen Casner <casner at acm.org <mailto:casner at acm.org>> wrote:
>>>>
>>>> Michael -- Kudos for this Herculean effort -- Steve
>>>>
>>>> On Sat, 24 Nov 2018, Michael Balzer wrote:
>>>>
>>>>> Narrowing this down is a real PITA. The effect sometimes stops to occur, I then need to leave the module powered down for a while to get the effect back.
>>>>> Sometimes 1 hour is sufficient, currently it's been off for 2 hours and still works. I had a test window yesterday evening, one this morning and one in the
>>>>> afternoon. Temperature is most probably irrelevant, as the last power downs were outside in ~4-5 °C to rule this out.
>>>>>
>>>>> So as a "passed" can always be a false positive I need to validate every passed step by switching back to a failing version and see if that still fails.
>>>>>
>>>>> It seems I now have to wait until tomorrow for the next test window, but I have bisected down to this range of just 8 commits:
>>>>>
>>>>> balzer at leela:~/esp/esp-idf> git rev-list 9d609af54c63e7f949a4fbc43d4f1c13b57f49d8 ^9d2f7c60d9aef9860c61c2756318ada68c80fddf
>>>>> 9d609af54c63e7f949a4fbc43d4f1c13b57f49d8
>>>>> f392727abf7d56490c2f33127a59bfac42c937e0
>>>>> e834d6fffc23a6fcfc0d2e871c9235417a7fb48f
>>>>> 35842d02abb5f574aaab466d46081a232fdd20a6
>>>>> f05f3fbde87a9ce45c6818f71b49cd13888fd457
>>>>> a6d6c58ecadb9759a0bacf35cd7332ac641e598d
>>>>> 321b1e02052de95db60ddce87eecce5f9e04e9b8
>>>>> 40486c872345584d34949b3ce83f9e956a7eea13
>>>>>
>>>>> ...with 9d609af54c63e7f949a4fbc43d4f1c13b57f49d8 being the last identified bad commit, and 9d2f7c60d9aef9860c61c2756318ada68c80fddf being the last good.
>>>>>
>>>>> If I should guess now, it's probably one of Dmitry's commits on the wear leveling code.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 23.11.18 um 17:20 schrieb Michael Balzer:
>>>>>> It's not a timing issue, I've let it reboot about 30 times without any successful mount after the first failure.
>>>>>>
>>>>>> Going into bisecting now...
>>>>>>
>>>>>>
>>>>>> Am 23.11.18 um 15:54 schrieb Mark Webb-Johnson:
>>>>>>>> It may actually not be a corruption of the filesystem but some timing issue on the mount procedure. To test that we could disable the auto formatting on
>>>>>>>> mount failures.
>>>>>>>
>>>>>>> True.
>>>>>>>
>>>>>>> A couple of Espressif guys have jumped on the issue, and I have provided some more information for them. I think key will be reproducing it.
>>>>>>>
>>>>>>>> The issue may also be dependant on the hardware version, i.e. it could be caused by the bug that caused the SD speed issue on the first 3.1 batch.
>>>>>>>
>>>>>>> That was definitely a hardware issue with the CP2102 chip. I don't think related to ESP in any way.
>>>>>>>
>>>>>>> Regards, Mark.
>>>>>>>
>>>>>>>> On 23 Nov 2018, at 10:34 PM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de> <mailto:dexter at expeedo.de>> wrote:
>>>>>>>>
>>>>>>>> It may actually not be a corruption of the filesystem but some timing issue on the mount procedure. To test that we could disable the auto formatting on
>>>>>>>> mount failures.
>>>>>>>>
>>>>>>>> The issue may also be dependant on the hardware version, i.e. it could be caused by the bug that caused the SD speed issue on the first 3.1 batch.
>>>>>>>>
>>>>>>>> I only have tried the idf update on my batch 1 module (my bench / development module). I think most of our edge testers also have that version.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 23.11.18 um 02:32 schrieb Mark Webb-Johnson:
>>>>>>>>> I have raised the following github issue to Espressif:
>>>>>>>>>
>>>>>>>>>    https://github.com/espressif/esp-idf/issues/2730
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>        Environment
>>>>>>>>>
>>>>>>>>>      * Development Kit: none
>>>>>>>>>      * Kit version (for WroverKit/PicoKit/DevKitC): none
>>>>>>>>>      * Module or chip used: ESP32-WROVER 16MB
>>>>>>>>>      * IDF version (run |git describe --tags| to find it): v3.2-beta1-208-g0d7f2d77c
>>>>>>>>>      * Build System: make
>>>>>>>>>      * Compiler version (run |xtensa-esp32-elf-gcc --version| to find it): (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 5.2.0
>>>>>>>>>      * Operating System: macOS
>>>>>>>>>      * Power Supply: USB
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>        Problem Description
>>>>>>>>>
>>>>>>>>>    TLDR: Between May and July 2018 a change was made to esp idf master that is causing corruption on FAT filesystems mounted on SPI flash.
>>>>>>>>>
>>>>>>>>>    Our project uses a partitions.csv as follows:
>>>>>>>>>
>>>>>>>>>    |# Name, Type, SubType, Offset, Size nvs, data, nvs, 0x9000, 0x4000 otadata, data, ota, 0xd000, 0x2000 phy_init, data, phy, 0xf000, 0x1000 factory,
>>>>>>>>>    app, factory, 0x10000, 4M ota_0, app, ota_0, , 4M ota_1, app, ota_1, , 4M store, data, fat, , 1M |
>>>>>>>>>
>>>>>>>>>    The 'store' partition is formatted as FAT, as follows:
>>>>>>>>>
>>>>>>>>>    esp_vfs_fat_mount_config_t m_store_fat;
>>>>>>>>>    wl_handle_t m_store_wlh;
>>>>>>>>>    memset(&m_store_fat,0,sizeof(esp_vfs_fat_sdmmc_mount_config_t));
>>>>>>>>>    m_store_fat.format_if_mount_failed = true;
>>>>>>>>>    m_store_fat.max_files = 5;
>>>>>>>>>    esp_vfs_fat_spiflash_mount("/store", "store", &m_store_fat, &m_store_wlh);
>>>>>>>>>
>>>>>>>>>    We have previously used a clone of esp idf master, dated around May 22 2018, without issues. The partition is very reliable.
>>>>>>>>>
>>>>>>>>>    However, on Jul 6 2018, we updated our clone to use the latest esp idf master at that time. Shortly afterwards, users started to report that their
>>>>>>>>>    'store' filesystem contents were corrupted. We rolled back.
>>>>>>>>>
>>>>>>>>>    We have now tried again (updating on Oct 20 2018 to v3.2-beta1-208-g0d7f2d77c) and immediately had the same issue. Random corruption of FAT filesystem
>>>>>>>>>    in SPI flash.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Expected Behavior
>>>>>>>>>
>>>>>>>>>    No corruption of FAT filesystem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Actual Behavior
>>>>>>>>>
>>>>>>>>>    Corruption of FAT filesystem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Steps to reproduce
>>>>>>>>>
>>>>>>>>>     1. Create a partition in SPI flash, and mount FAT filesystem
>>>>>>>>>     2. Read and write to files on FAT filesystem
>>>>>>>>>     3. Reboot
>>>>>>>>>     4. Observe random corruption and unmountable filesystem
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Code to reproduce this issue
>>>>>>>>>
>>>>>>>>>    esp_vfs_fat_mount_config_t m_store_fat;
>>>>>>>>>    wl_handle_t m_store_wlh;
>>>>>>>>>    memset(&m_store_fat,0,sizeof(esp_vfs_fat_sdmmc_mount_config_t));
>>>>>>>>>    m_store_fat.format_if_mount_failed = true;
>>>>>>>>>    m_store_fat.max_files = 5;
>>>>>>>>>    esp_vfs_fat_spiflash_mount("/store", "store", &m_store_fat, &m_store_wlh);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>        Debug Logs
>>>>>>>>>
>>>>>>>>>    n/a
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>        Other items if possible
>>>>>>>>>
>>>>>>>>>    Please advise if you need anything further.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think the timeline is correct (the issue is in esp idf master some time between May and July 2018), but please let me know if you know differently (or
>>>>>>>>> update the github issue with your comments).
>>>>>>>>>
>>>>>>>>> Regards, Mark
>>>>>>>>>
>>>>>>>>>> On 23 Nov 2018, at 6:19 AM, Michael Balzer <dexter at expeedo.de <mailto:dexter at expeedo.de> <mailto:dexter at expeedo.de>> wrote:
>>>>>>>>>>
>>>>>>>>>> esp-idf and OVMS branches are back to the working version.
>>>>>>>>>>
>>>>>>>>>> In case you also lost your config:  I also just fixed a bug on restoring into an empty /store partition.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Am 22.11.18 um 22:34 schrieb Michael Balzer:
>>>>>>>>>>> See https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/pull/165
>>>>>>>>>>>
>>>>>>>>>>> I'll reset both master branches now.
>>>>>>>>>>>
>>>>>>>>>>> If you're about to pull, please wait until I've reverted the branches.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> OvmsDev mailing list
>>>>>>>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com> <mailto:OvmsDev at lists.openvehicles.com>
>>>>>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> OvmsDev mailing list
>>>>>>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>>>>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>>>>>
>>>>>>>> --
>>>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>>>> _______________________________________________
>>>>>>>> OvmsDev mailing list
>>>>>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com> <mailto:OvmsDev at lists.openvehicles.com>
>>>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> OvmsDev mailing list
>>>>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>>>
>>>>>> --
>>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>>
>>>>>> _______________________________________________
>>>>>> OvmsDev mailing list
>>>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>>>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>>>
>>>>> --
>>>>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>>>>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>>>>>
>>>>>
>>>>
>>>>                                                        -- Steve_______________________________________________
>>>> OvmsDev mailing list
>>>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>>
>>>
>>> _______________________________________________
>>> OvmsDev mailing list
>>> OvmsDev at lists.openvehicles.com
>>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>>
>> -- 
>> Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
>> Fon 02333 / 833 5735 * Handy 0176 / 206 989 26
>> _______________________________________________
>> OvmsDev mailing list
>> OvmsDev at lists.openvehicles.com <mailto:OvmsDev at lists.openvehicles.com>
>> http://lists.openvehicles.com/mailman/listinfo/ovmsdev
>
>
> _______________________________________________
> OvmsDev mailing list
> OvmsDev at lists.openvehicles.com
> http://lists.openvehicles.com/mailman/listinfo/ovmsdev

-- 
Michael Balzer * Helkenberger Weg 9 * D-58256 Ennepetal
Fon 02333 / 833 5735 * Handy 0176 / 206 989 26

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvehicles.com/pipermail/ovmsdev/attachments/20181127/8d836d7a/attachment-0001.html>


More information about the OvmsDev mailing list