Bug #17278

Additional Software is often broken after initial persistence setup and when recovering old APT lists

Added by intrigeri 2019-12-01 07:17:23 . Updated 2020-05-15 10:02:40 .

Status:
Needs Validation
Priority:
High
Assignee:
Category:
Persistence
Target version:
Start date:
Due date:
% Done:

0%

Feature Branch:
bugfix/17278-fix-additional-software-intrigeri
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Additional Software Packages
Deliverable for:

Description

The test failure is:

12:03:29   Scenario: I set up Additional Software when installing a package without persistent partition and the package is installed next time I start Tails # features/additional_software_packages.feature:25
12:07:42     Given I start Tails from a freshly installed USB drive with an administration password and the network is plugged and I login                    # features/step_definitions/common_steps.rb:163
12:07:42     And I update APT using apt                                                                                                                       # features/step_definitions/apt.rb:46
12:07:42     And I install "sslh" using apt                                                                                                                   # features/step_definitions/apt.rb:65
12:07:42     Then I am proposed to add the "sslh" package to my Additional Software                                                                           # features/step_definitions/additional_software_packages.rb:33
12:07:42     And I create a persistent storage and activate the Additional Software feature                                                                   # features/step_definitions/additional_software_packages.rb:38
12:07:42     And I shutdown Tails and wait for the computer to power off                                                                                      # features/step_definitions/common_steps.rb:542
12:20:21     And I start Tails from USB drive "__internal" with network unplugged and I login with persistence enabled                                        # features/step_definitions/common_steps.rb:139
12:20:21       try_for() timeout expired (Timeout::Error)
12:20:21       ./features/support/helpers/misc_helpers.rb:89:in `rescue in try_for'
12:20:21       ./features/support/helpers/misc_helpers.rb:36:in `try_for'
12:20:21       ./features/step_definitions/additional_software_packages.rb:14:in `/^the Additional Software (upgrade|installation) service has started$/'
12:20:21       ./features/step_definitions/common_steps.rb:152:in `/^I start Tails from (.+?) drive "(.+?)"( with network unplugged)?( and I login( with persistence enabled)?( (?:and|with) an administration password)?)?$/'
12:20:21       features/additional_software_packages.feature:32:in `And I start Tails from USB drive "__internal" with network unplugged and I login with persistence enabled'
12:21:50     And Additional Software is correctly configured for package "sslh"                                                                               # features/step_definitions/additional_software_packages.rb:50
12:21:50     And the package "sslh" is installed after Additional Software has been started                                                                   # features/step_definitions/common_steps.rb:571
12:21:50       Scenario failed at time 02:09:22

That is, step "the Additional Software installation service has started" fails when called from Given /^I start Tails from (.+?) drive "(.+?)"( with network unplugged)?( and I login( with persistence enabled)?( (?:and|with) an administration password)?)?$/.

The Journal looks like this:

Nov 30 12:10:03 amnesia systemd[5150]: Starting Trigger installation of Additional Software Packages...
Nov 30 12:10:03 amnesia systemd[5150]: Starting Wait for Tor to Have Bootstrapped...
Nov 30 12:10:03 amnesia systemd[5150]: Started Tracker metadata database store and lookup manager.
Nov 30 12:10:03 amnesia sudo[8393]:  amnesia : TTY=unknown ; PWD=/home/amnesia ; USER=root ; COMMAND=/bin/systemctl start tails-additional-software-install.service
Nov 30 12:10:03 amnesia sudo[8393]: pam_unix(sudo:session): session opened for user root by (uid=0)
[…]
Nov 30 12:10:04 amnesia systemd[1]: Starting Install Additional Software Packages...
[…]
Nov 30 12:10:09 amnesia tails-additional-software[8409]: [INFO] Starting to install additional software...
Nov 30 12:10:09 amnesia tails-additional-software[8409]: [INFO] Found additional packages list.
[…]
Nov 30 12:10:09 amnesia sudo[8461]:     root : TTY=unknown ; PWD=/ ; USER=amnesia ; COMMAND=/usr/local/lib/tails-additional-software-notify Installing your additional software from persistent storage... This can take several minutes.
[…]
Nov 30 12:10:11 amnesia tails-additional-software[8409]: [INFO] Will install the following packages: sslh
[…]
Nov 30 12:10:24 amnesia tails-additional-software[8409]: [INFO] Reading package lists...
Nov 30 12:10:24 amnesia tails-additional-software[8409]: [INFO] Building dependency tree...
Nov 30 12:10:24 amnesia tails-additional-software[8409]: [INFO] Reading state information...
Nov 30 12:10:25 amnesia tails-additional-software[8409]: [INFO] Package sslh is not available, but is referred to by another package.
Nov 30 12:10:25 amnesia tails-additional-software[8409]: [INFO] This may mean that the package is missing, has been obsoleted, or
Nov 30 12:10:25 amnesia tails-additional-software[8409]: [INFO] is only available from another source
Nov 30 12:10:25 amnesia tails-additional-software[8409]: [INFO]
Nov 30 12:10:25 amnesia tails-additional-software[8409]: [INFO] E: Package 'sslh' has no installation candidate
Nov 30 12:10:25 amnesia tails-additional-software[8409]: [WARNING] apt-get exited with returncode 100
Nov 30 12:10:25 amnesia tails-additional-software[8409]: [WARNING] Warning: installation of sslh failed

It looks like this started around Nov 13. This matches the time when the fix for Bug #17203 was merged, but I don’t understand why changing the dependencies of the ASP upgrade service, which we don’t even try to start in the failing scenarios, can possibly affect whether installation of ASP works or not.

For now I’m not classifying this as a test suite -only issue: there’s a chance this affects real users.


Files


Subtasks


Related issues

Related to Tails - Bug #17203: Automatic install of additional software fails because Tails has not yet connected to Tor Resolved
Related to Tails - Bug #17392: apt sometimes fails to fetch additional software packages after upgrading Tails Confirmed
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

History

#1 Updated by intrigeri 2019-12-01 07:18:17

#2 Updated by intrigeri 2019-12-01 07:18:26

  • related to Bug #17203: Automatic install of additional software fails because Tails has not yet connected to Tor added

#3 Updated by segfault 2020-04-12 19:37:57

I see this scenario fail often on stable and devel:

Scenario My Additional Software list is configurable through a GUI or through notifications when I install or remove packages with APT or Synaptic


Given a computer
And I start Tails from USB drive "__internal" and I login with persistence enabled and an administration password
When I uninstall "sslh" using apt
And I accept removing "sslh" from Additional Software
Then "sslh" is not in the list of Additional Software
When I start Synaptic
And I update APT using Synaptic

Tor operation failed (despite retrying 10 times) with
Test::Unit::AssertionFailedError: <Dogtail::Failure> exception expected but none was thrown. (MaxRetriesFailure)
./features/support/helpers/misc_helpers.rb:173:in `rescue in block in retry_action'
./features/support/helpers/misc_helpers.rb:156:in `block in retry_action'
./features/support/helpers/misc_helpers.rb:155:in `loop'
./features/support/helpers/misc_helpers.rb:155:in `retry_action'
./features/support/helpers/misc_helpers.rb:144:in `retry_tor'
./features/step_definitions/apt.rb:148:in `/^I update APT using Synaptic$/'
features/additional_software_packages.feature:53:in `And I update APT using Synaptic'

… and these two which depend on the first one:

Scenario Recovering in offline mode after Additional Software previously failed to upgrade and then succeed to upgrade when online 
Scenario I am notified when Additional Software fails to install a package

The first scenario fails because Synaptic fails to update the APT package lists, see the attached screenshot.

In the journal I see that the apt-get update processes started by tails-additional-software during boot print the same error:

Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO] Err:1 tor+http://deb.tails.boum.org devel InRelease
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO]     to 127.0.0.1:9050 (127.0.0.1). - connect (111: Connection refused)
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO] Err:2 tor+http://deb.torproject.org/torproject.org buster InRelease
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO]   Could not connect to 127.0.0.1:9050 (127.0.0.1). - connect (111: Connection refused)
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO] Err:3 tor+http://ftp.us.debian.org/debian buster InRelease
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO]   Could not connect to 127.0.0.1:9050 (127.0.0.1). - connect (111: Connection refused)
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO] Err:4 tor+http://security.debian.org buster/updates InRelease
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO]   Could not connect to 127.0.0.1:9050 (127.0.0.1). - connect (111: Connection refused)
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO] Err:5 tor+http://ftp.us.debian.org/debian bullseye InRelease
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO]   Unable to connect to 127.0.0.1:9050:
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO] Err:6 tor+http://ftp.us.debian.org/debian sid InRelease
Apr 08 12:36:26 amnesia tails-additional-software[31110]: [INFO]   Unable to connect to 127.0.0.1:9050:

… but the apt-get update calls still seem to exit with status code 0 (because else tails-additional-software would print a warning), which I didn’t expect.

The journal also includes logs of dropped packets to the Tor SOCKS port every time APT or Synaptic fails to connect:

Apr 08 12:36:25 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=33681 DF PROTO=TCP SPT=43576 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 
Apr 08 12:36:25 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=22410 DF PROTO=TCP SPT=43578 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 
Apr 08 12:36:25 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=57950 DF PROTO=TCP SPT=43580 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 
Apr 08 12:36:25 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=34578 DF PROTO=TCP SPT=43582 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 
Apr 08 12:36:26 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=57951 DF PROTO=TCP SPT=43580 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 
Apr 08 12:36:26 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=34579 DF PROTO=TCP SPT=43582 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 
Apr 08 12:36:26 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=22411 DF PROTO=TCP SPT=43578 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 
Apr 08 12:36:26 amnesia kernel: Dropped outbound packet: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=33682 DF PROTO=TCP SPT=43576 DPT=9050 WINDOW=65495 RES=0x00 SYN URGP=0 UID=0 GID=0 

I don’t understand why the APT packets would be dropped. config/chroot_local-includes/etc/ferm/ferm.conf white-lists access to 9050 for the _apt user, which is the user that downloads the APT lists both when running apt-get update and when starting Synaptic.

#4 Updated by segfault 2020-04-12 19:45:13

Ah I think this line explains why the packets are being dropped:

W: Download is performed unsandboxed as root as file '/var/lib/apt/lists/partial/ftp.us.debian.org_debian_dists_buster_InRelease' couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied)

The download is performed as root and not as _apt, and root is not on the white-list of users allowed to connect to 9050. Now we have to figure out why the file /var/lib/apt/lists/partial/ftp.us.debian.org_debian_dists_buster_InRelease exists and is owned by root.

#5 Updated by segfault 2020-04-12 19:56:57

> Now we have to figure out why the file /var/lib/apt/lists/partial/ftp.us.debian.org_debian_dists_buster_InRelease exists and is owned by root.

I suspect that during one of the previous scenarios which use persistence an apt update was not completed and left these partial lists there. I don’t know why they are owned by root and not by _apt when apt apparently expects them to be owned by _apt.

One way to solve this issue would be to allow root to connect to the Tor SOCKS port.

#6 Updated by segfault 2020-04-15 14:48:02

  • Status changed from Confirmed to In Progress

Applied in changeset commit:tails|0146402d5268063ea8f125433ef37fc6c00fc383.

#7 Updated by intrigeri 2020-05-12 09:50:08

  • related to Bug #17392: apt sometimes fails to fetch additional software packages after upgrading Tails added

#8 Updated by intrigeri 2020-05-12 09:52:25

  • Assignee set to intrigeri
  • Priority changed from Elevated to High

Bumping priority as this has been a major PITA for everyone using our CI, and in particular for RMs who receive failure notification email, since months.

On segfault’s branch, “Scenario: My Additional Software list is configurable through a GUI or through notifications when I install or remove packages with APT or Synaptic” still often fails:

  1. The “I update APT using Synaptic” step succeeds. Still, the video shows a warning dialog that says W: Download is performed unsandboxed as root as file '/var/lib/apt/lists/partial/ftp.us.debian.org_debian_dists_buster_InRelease' couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied).
  2. The next step (installing cowsay) fails because every time we start Synaptic, that same warning dialog is displayed, and we don’t click through it.

But interestingly, that’s the only Additional Software scenario that I saw fail in the last 10 runs on Jenkins, so it looks like this branch improves things already.

I suspect the problem comes from a previous scenario or step, that occasionally sets up for failure the one we’ve been focusing on so far, depending on how a race condition gets resolved. More specifically, I think a previous scenario or step sometimes leaves the persistent /var/lib/apt/lists/ data in an inconsistent/corrupted/temporary state, due to a bug in tails-additional-software-upgrade.service, which messes with /var/cache/apt/lists.

This hypothesis looks consistent the fact the problem started when Bug #17203 (commit:3982cb1396ff453df9b6aa98fe32558e462853b5) was merged: fixing the dependencies/ordering of tails-additional-software-upgrade.{path,service} there likely impacts how race conditions in this area get resolved, and may have unmasked a bug we did not notice earlier.

I now believe it’s not merely a test suite problem, but one that can affect our users (see also Bug #17392), because what happens before the failing step can probably happen to a user too: I’ve found several WhisperBack bug reports with the same ownership warning segfault and I have seen in the test suite.

My hunch is that the problem comes from save_old_apt_lists, which uses https://docs.python.org/3.7/library/shutil.html#shutil.copytree, which does not preserve ownership, so the copy of /var/cache/apt/lists/partial will be owned by root:root, instead of _apt:root like the original version. If there’s no way to do this simply with Python, IMO we should delegate the work to cp or rsync.

Similarly, depending on how https://docs.python.org/3.7/library/shutil.html#shutil.move detects if “the destination is on the current filesystem” (can be tough given our bind-mounts), restore_old_apt_lists might be affected by a similar problem.

Next step: in save_old_apt_lists, I’ll try replacing shutil.copytree with cp --archive.

If my hunch is correct and once we fix it by preserving _apt:root ownership, I guess we won’t need segfault’s commit:21d708eb027a1bdeb5be4b5365d029625a67f9bf, but what do I know :)

#9 Updated by intrigeri 2020-05-13 06:40:39

> Next step: in save_old_apt_lists, I’ll try replacing shutil.copytree with cp --archive.

I’ve done this on the bugfix/17278-fix-additional-software-intrigeri branch.

Here’s an analysis of test suite runs 3-16 (13 runs):

  • 10 runs out of 13 passed, which in itself is already better than what I see on other branches.
  • run 16
    • “Scenario I set up Additional Software when installing a package without persistent partition and the package is installed next time I start Tails” fails on first reboot after setting up Additional Software. apt-get install fails with “E: Unable to parse package file /var/lib/apt/lists/ftp.us.debian.org_debian_dists_bullseye_main_i18n_Translation-en (1)”. The Journal does not include “Found a copy of old APT lists, restoring it”, which means we did not go through the code path I’ve modified during this boot. My hypotheses are:
      • Something else than our Additional Software code messed up with the lists during the previous or during this boot.
      • tails-synchronize-data-to-new-persistent-volume-on-shutdown.service did not do its job correctly before we rebooted last time
    • ACTION: set APT::Periodic::Enable "0"; to ensure apt-daily-upgrade.service is not messing with us (we’ve disabled apt-daily.service but apparently we missed apt-daily-upgrade.service when it was introduced)
    • ACTION: To debug this, on test failure I’d like to see:
      • the output of ls -lAR /var/lib/apt (I’m interested both in the lists and the periodic sub-directories)
      • the output of mount, to make sure the 2 persistent APT directories are mounted
      • the content of /var/log/live-persist, to give us more info in case one persistent APT directory was not mounted
      • Have tails-synchronize-data-to-new-persistent-volume-on-shutdown.service create a flag file in the persistent volume after completion, so during next boot we can tell whether it did its job.
    • The next scenarios fail in the exact same way, for the exact same reason.
  • run 13
    • “Scenario I am warned I can not use Additional Software when I start Tails from a DVD and install a package” fails to sync the clock, before it even reached steps related to Additional Software. The htpdate log suggests that lizard had networking issues at the time, which happens regularly since a few weeks.
    • The following scenarios fail in the same way as run 16, modulo the error message is “E: Unable to parse package file /var/lib/apt/lists/ftp.us.debian.org_debian_dists_sid_main_binary-amd64_Packages (1)”.
  • run 12
    • “Scenario The Additional Software dpkg hook notices when persistence is locked down while installing a package” fails at step “the Additional Software dpkg hook has been run for package ”makepp" and notices the persistence is locked".
      • The debug log confirms that makepp was successfully installed. It took 35 seconds to install it though, which is very long for a small package that won’t pull in any new dependency.
      • /run/live-additional-software/log is not empty, but 2 minutes later the expected “New packages manually installed” log line has not appeared in that file. In the Journal I see no trace that apt_hook_pre or apt_hook_post have run, so I’m a little confused.
      • ACTION: on test failure, save the content of /run/live-additional-software/log

To sum up:

  • The most common failure mode, that I’ve investigated yesterday, did not happen in 13 runs, and I could find no other failure that looks related to the tentative fix I’ve applied. This suggests I’ve fixed the worse failure mode.
  • The corrupted APT packages list failure mode happened 2 times in 13 runs. That’s not great. It needs further investigation. See “ACTION” above for what I’d like to try next.
  • The dpkg hook failure mode happened 1 time in 13 runs. I’m pretty sure this is an old problem, that happened before we filed this issue. I’m inclined to handle it separately, on a dedicated issue, if the single idea I have so far to investigate it does not hint us in the right direction.

#10 Updated by intrigeri 2020-05-13 09:03:24

> To sum up:
>
> * The most common failure mode, that I’ve investigated yesterday, did not happen in 13 runs, and I could find no other failure that looks related to the tentative fix I’ve applied. This suggests I’ve fixed the worse failure mode.

Before I ask a review’n’merge, I’ll wait for a few more test runs, which I’ll want to do anyway to debug the other 2 failure modes I’ve spotted:

> * The corrupted APT packages list failure mode happened 2 times in 13 runs. That’s not great. It needs further investigation. See “ACTION” above for what I’d like to try next.

I’ve implemented all these action items on my bugfix/17278-fix-additional-software-intrigeri branch.

> * The dpkg hook failure mode happened 1 time in 13 runs. I’m pretty sure this is an old problem, that happened before we filed this issue. I’m inclined to handle it separately, on a dedicated issue, if the single idea I have so far to investigate it does not hint us in the right direction.

My updated branch saves the content of /run/live-additional-software/log on failure, which might help investigating this one.

#11 Updated by intrigeri 2020-05-14 08:21:06

>> * The most common failure mode, that I’ve investigated yesterday, did not happen in 13 runs, and I could find no other failure that looks related to the tentative fix I’ve applied. This suggests I’ve fixed the worse failure mode.
>
> Before I ask a review’n’merge, I’ll wait for a few more test runs, which I’ll want to do anyway to debug the other 2 failure modes I’ve spotted:

This problem (“Download is performed unsandboxed as root”) happened again in test suite run 19.

I still think that commit:1b94f25ce5778d18b2370caa1161a771e44f305b fixes a real bug that can cause these symptoms, but thanks to the extra debugging info saved by the test suite on failure on my branch, I’ve found at least another bug that causes these symptoms. I now have evidence that tails-synchronize-data-to-new-persistent-volume-on-shutdown.service did not complete successfully:

>> * The corrupted APT packages list failure mode happened 2 times in 13 runs. That’s not great. It needs further investigation. See “ACTION” above for what I’d like to try next.

I’m now convinced that these problems (“Unable to parse package file”) happen because indeed, we feed APT incomplete files. Sometimes these incomplete files still have a valid format and we see errors like “Package sslh is not available”, sometimes they’re simply invalid and we instead see errors like “Unable to parse package file”.

⇒ I’ll now focus on making tails-synchronize-data-to-new-persistent-volume-on-shutdown.service run in a robust manner.
I think I already know what’s happening.

>> * The dpkg hook failure mode happened 1 time in 13 runs. I’m pretty sure this is an old problem, that happened before we filed this issue. I’m inclined to handle it separately, on a dedicated issue, if the single idea I have so far to investigate it does not hint us in the right direction.
>
> My updated branch saves the content of /run/live-additional-software/log on failure, which might help investigating this one.

I’ll need more test runs to gather data about this rare problem.

#12 Updated by intrigeri 2020-05-14 09:46:17

> ⇒ I’ll now focus on making tails-synchronize-data-to-new-persistent-volume-on-shutdown.service run in a robust manner.
> I think I already know what’s happening.

I tried something (commit:d17e4c6d2f43e7e0fc50b7e88e4286dab238c048), but I’m wondering if my analysis is correct: even without this commit, I see that shutting down Tails after setting up persistence + Additional Software, i.e. when tails-synchronize-data-to-new-persistent-volume-on-shutdown.service actually does something, takes 30+ seconds, and I did not manage to make it fail on my laptop by rate limiting the virtual USB drive to 5 MB/s with this patch:

--- a/features/domains/disk.xml
+++ b/features/domains/disk.xml
@@ -2,4 +2,8 @@
<driver name='qemu' type=''/>
<source file=''/>
<target dev='' bus=''/>
+  <iotune>
+    <write_bytes_sec>5000000</write_bytes_sec>
+    <write_bytes_sec_max>5000000</write_bytes_sec_max>
+  </iotune>
</disk>

… while I expected this I/O rate limiting to be sufficient to trigger the bug.
So at the face of it, it does not look like DefaultTimeoutStopSec=5s is fully honored for this service, and perhaps something else than this timeout makes its ExecStop= abort early, I don’t know.
I’m a little bit confused. Let’s see how this commit impacts test results.

#13 Updated by intrigeri 2020-05-14 12:44:53

> I tried something (commit:d17e4c6d2f43e7e0fc50b7e88e4286dab238c048), but I’m wondering if my analysis is correct: even without this commit, I see that shutting down Tails after setting up persistence + Additional Software, i.e. when tails-synchronize-data-to-new-persistent-volume-on-shutdown.service actually does something, takes 30+ seconds, and I did not manage to make it fail on my laptop by rate limiting the virtual USB drive to 5 MB/s with this patch:
>
> […]
>
> … while I expected this I/O rate limiting to be sufficient to trigger the bug.
> So at the face of it, it does not look like DefaultTimeoutStopSec=5s is fully honored for this service, and perhaps something else than this timeout makes its ExecStop= abort early, I don’t know.
> I’m a little bit confused.

After running tests locally with and without this commit, passing noplash, with limited I/O write rate, and inspecting what I see on screen during system shutdown, I have a theory that can explain this:

1. Without this patch

  • I think tails-synchronize-data-to-new-persistent-volume-on-shutdown.service starts doing stuff and is quickly killed: it does not even show up on screen when I do virsh send-key TailsToaster KEY_LEFTSHIFT KEY_PAGEUP. Depending on how fast the system is, during these 5s cp can have sent part, or all, of the data to the USB stick (or rather, to the kernel disk cache, really).
  • Then, the mount unit for /media/tails-persistence-setup/TailsData (which has a much larger timeout) takes a while to stop: it effectively sync’s data to the USB stick, but only data that cp could already send to the kernel disk cache before it got killed. I see this stop job run for several minutes.
  • The I/O rate limiting only impacts how much time the mount unit needs to sync’ the data cp had time to send. It does not impact (much?) how much of the data was sent.
  • Success/failure rate is a function of the performance of the CPU. This explains why I never can reproduce this problem locally, while lizard regularly exposes it. To confirm this I would need to ask QEMU to make TailsToaster’s CPU slower, but at this point I’m a bit lazy. I may come back to this if actual test results on lizard still fail even with commit:d17e4c6d2f43e7e0fc50b7e88e4286dab238c048.

2. With this patch, I see A stop job is running for Synchronize data to newly created persistent volume on shutdown (Xs / 5min) run for several minutes — yeah!

> Let’s see how this commit impacts test results.

I’m now quite optimistic, but we’ll see.

#14 Updated by intrigeri 2020-05-14 19:01:52

  • Status changed from In Progress to Needs Validation
  • Target version set to Tails_4.7
  • Feature Branch set to bugfix/17278-fix-additional-software-intrigeri

So the last 10 consecutive runs of features/additional_software_packages.feature on my branch passed (FTR, that’s runs 23-32). I can’t recall features/additional_software_packages.feature passing 10 times in a row on any branch recently, so I think I got it right this time!

This issue is very long so here’s a summary for the reviewer. I’ve documented rather verbosely above the thought process that lead here, so if you want details, you may either read the rest of this issue, or ask me :)

The main part of this branch is 2 bugfixes: commit:d17e4c6d2f43e7e0fc50b7e88e4286dab238c048 and commit:1b94f25ce5778d18b2370caa1161a771e44f305b. They fix 2 different root causes, that triggered similar failure modes, that at the end of the day boil down to “APT can’t use its packages lists and the Additional Software feature does not behave as expected”, but for distinct technical reasons.

One thing that’s missing on my branch is recovering functionality of an existing persistent volume that already went through one of the failing scenarios, and where the persistent /var/lib/apt/lists has incorrect ownership and/or incorrect permissions. I’d like to work on this next. I believe it’s a matter of adding a few simple lines in the live-persist script. But given the problematic impact of the bugs I’ve fixing here on our users’ and developers’ experience, I’d rather not delay this first iteration any longer ⇒ if you agree, please let me know and I’ll follow up on this in a new issue.

There are a few small other changes on the branch, that were not enough to fix the problems this issue is about, but that are good ideas anyway IMO (e.g. commit:111160fdeb3aec8d466d1200c000edcdd3e08b93, commit:575ae9ebe9635fc44007c0c687fec06252d04e4f).

I think we could revert commit:c8429eecf23570274b0bb2134a87ae1fcf72ce07 and commit:7347b73638cfdd61086e3cb129d098efbf66a9b6 at some point, once we’re confident enough that these failure modes are fixed. Still, we have several other open bugs about Additional Software problems, that I don’t think my branch fixes (and that I suspect our test suite does not exercise). They’ve been generally very hard to debug. So for now I’d rather leave the debugging tools in place, so that we can get more useful info from our advanced users when they’re affected (e.g. Cody has been super helpful for debugging Additional Software problems).

Finally, the last 3 commits are totally unrelated cleanups, on a script that somehow my work on this issue lead me to read. I figured including them here would save some paperwork. If they don’t look obviously right to you, please merge without these 3 commits and I’ll either give up or transplant them elsewhere.

#15 Updated by intrigeri 2020-05-14 19:02:31

  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_4.7)

#16 Updated by intrigeri 2020-05-14 19:04:37

  • Target version set to Tails_4.7

#17 Updated by intrigeri 2020-05-15 07:35:37

  • Subject changed from Additional Software tests often fail to Additional Software is often broken after initial persistence setup and when recovering old APT lists

(In the spirit of moving towards generating our changelog from PRs instead of from individual commits.)

Dear reviewer, please start by reading Bug #17278#note-14 :)

#18 Updated by intrigeri 2020-05-15 10:02:40

Hi segfault, I’ve renamed your branch to wip/bugfix/17278-fix-additional-software, so it does not eat cycles on Jenkins: I believe we don’t need this workaround now that I’ve identified, and apparently fixed, the root cause of the problem.