Bug #16314

IO errors on buster when restoring snapshot from previous run with --keep-snapshots

Added by CyrilBrulebois 2019-01-07 10:45:23. Updated 2019-09-01 18:35:52.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Test suite
Target version:
Start date:
2019-01-07
Due date:
% Done:

10%

Feature Branch:
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

persistence.feature doesn’t pass at the moment because we lose the remote shell quite easily.

@product
Feature: Tails persistence
  As a Tails user
  I want to use Tails persistence feature

    Checkpoint: I have started Tails from DVD without network and stopped at Tails Greeter's login screen
      Given the network is unplugged
00:00:22.890929091: [log] CLICK on L(1023,384)@S(0)[0,0 1024x768]
      And I start the computer
00:01:11.097168628: calling as root: echo 'hello?'
00:01:11.097257235: [log]  TYPE " autotest_never_use_this_option blacklist=psmouse #ENTER."
00:01:11.246477905: call returned: [0, "hello?\n", ""]
00:01:11.249658224: calling as root: service tor status
00:01:11.334480810: call returned: [3, "● tor.service - Anonymizing overlay network for TCP (multi-instance-master)\n   Loaded: loaded (/lib/systemd/system/tor.service; disabled; vendor preset: enabled)\n   Active: inactive (dead)\n", ""]
00:01:11.334993101: opening file /etc/tor/torrc in 'append' mode
00:01:11.402447995: append complete
      And the computer boots Tails
00:01:28.964382109: calling as root: echo 'hello?'
00:01:29.024071217: call returned: [0, "hello?\n", ""]
00:01:29.747661330: calling as root: nmcli device show eth0
00:01:29.747755758: [log] CLICK on L(1023,384)@S(0)[0,0 1024x768]
  Scenario: Booting Tails from a USB drive with a disabled persistent partition                                                         # features/persistence.feature:6
    Given I have started Tails without network from a USB drive with a persistent partition and stopped at Tails Greeter's login screen # features/step_definitions/snapshots.rb:186
      execution expired (RemoteShell::Timeout)
      ./features/support/helpers/remote_shell.rb:34:in `readline'
      ./features/support/helpers/remote_shell.rb:34:in `block (2 levels) in communicate'
      ./features/support/helpers/remote_shell.rb:33:in `loop'
      ./features/support/helpers/remote_shell.rb:33:in `block in communicate'
      ./features/support/helpers/remote_shell.rb:30:in `communicate'
      ./features/support/helpers/remote_shell.rb:72:in `execute'
      ./features/support/helpers/remote_shell.rb:81:in `initialize'
      ./features/support/helpers/vm_helper.rb:433:in `new'
      ./features/support/helpers/vm_helper.rb:433:in `execute'
      ./features/support/helpers/vm_helper.rb:472:in `has_network?'
      ./features/step_definitions/common_steps.rb:33:in `post_snapshot_restore_hook'
      ./features/step_definitions/snapshots.rb:155:in `reach_checkpoint'
      ./features/step_definitions/snapshots.rb:153:in `reach_checkpoint'
      ./features/step_definitions/snapshots.rb:153:in `reach_checkpoint'
      ./features/step_definitions/snapshots.rb:153:in `reach_checkpoint'
      ./features/step_definitions/snapshots.rb:187:in `/^I\ have\ started\ Tails\ without\ network\ from\ a\ USB\ drive\ with\ a\ persistent\ partition\ and\ stopped\ at\ Tails\ Greeter's\ login\ screen$/'
      features/persistence.feature:7:in `Given I have started Tails without network from a USB drive with a persistent partition and stopped at Tails Greeter's login screen'
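
For context, the timeout above is raised from the line-reading loop in features/support/helpers/remote_shell.rb. Here is a minimal sketch of that pattern, purely illustrative (the socket handling and the end-of-response marker are assumptions, not the actual helper code):

require 'timeout'

module RemoteShell
  Timeout = Class.new(StandardError)

  # Send a command to the guest and read response lines under a deadline.
  # If the guest freezes, readline blocks until the deadline expires and
  # this raises RemoteShell::Timeout, matching the backtrace above.
  def self.communicate(socket, command, timeout: 30)
    socket.puts(command)
    lines = []
    ::Timeout.timeout(timeout, Timeout) do
      loop do
        line = socket.readline.chomp
        break if line == 'DONE' # hypothetical end-of-response marker
        lines << line
      end
    end
    lines
  end
end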

This will be debugged once we have results for other features in the test suite.


Subtasks


History

#1 Updated by intrigeri 2019-01-08 10:29:34

  • Category set to Test suite
  • Priority changed from Normal to Elevated
  • Parent task set to Bug #16281

#2 Updated by hefee 2019-04-05 09:35:07

  • Priority changed from Elevated to Low

We have seen this issue occasionally over the last few days on various tests, but couldn’t pinpoint and fix it yet.

#3 Updated by anonym 2019-04-05 11:36:56

  • Subject changed from Losing remote shell access with snapshots vs. persistence to IO errors on buster when restoring snapshot from previous run with --keep-snapshots
  • % Done changed from 0 to 10

In my experience this only happens with --keep-snapshots and when there are snapshots from a previous run to restore. When such a snapshot is restored there’s a “crash” (more about this below) a few seconds in, just like in kibi’s case.

Here is another example where it is not the remote shell that is interrupted, but Sikuli waiting for an image to appear:

Feature: Additional software
  As a Tails user
  I may want to install softwares not shipped in Tails
  And have them installed automatically when I enable persistence in the Greeter

00:00:13.702098075: Remote shell: calling as root: echo 'hello?'
00:00:13.784917022: call returned: [0, "hello?\n", ""]
00:00:13.785650525: Sikuli: calling click(L(1023,384)@S(0)[0,0 1024x768])...
00:00:14.516432523: [log] CLICK on L(1023,384)@S(0)[0,0 1024x768]
00:00:14.521425680: Sikuli: calling click("TailsGreeterAddMoreOptions.png")...
00:00:15.477901594: [log] CLICK on L(244,615)@S(0)[0,0 1024x768]
00:00:15.478457359: Sikuli: calling wait("TailsGreeterAdditionalSettingsDialog.png")...
00:00:15.698135102: Sikuli: calling wait("TailsGreeterAdminPassword.png")...
00:00:51.693826588: Failed to find fuzzy candidate picture for TailsGreeterAdminPassword.png                                                                                    
00:00:51.693987950: FindFailedHookFailure was raised, re-running the failing Sikuli method.                                                                                     
  # Here we install the sslh package to test if debconf does not prevent
  # Additional Software from automatically installing packages.
  # This scenario also sets up the "__internal" drive that the following
  # scenarios will reuse.
  # Note: the "__internal" drive will keep its state across scenarios
  # and features until one of its snapshots is restored.
  Scenario: I set up Additional Software when installing a package without persistent partition and the package is installed next time I start Tails # features/additional_software_packages.feature:25                                                                 
    Given I start Tails from a freshly installed USB drive with an administration password and the network is plugged and I login                    # features/step_definitions/common_steps.rb:168                                                                    
      FindFailed: can not find TailsGreeterAdminPassword.png in S(0)[0,0 1024x768]
        Line 2171, in file Region.java
       (RuntimeError)
      ./features/step_definitions/common_steps.rb:170:in `/^I start Tails from a freshly installed USB drive with an administration password and the network is plugged and I login$/'
      features/additional_software_packages.feature:26:in `Given I start Tails from a freshly installed USB drive with an administration password and the network is plugged and I login'

I.e. in the Greeter, Sikuli presses the button for more options, but then the “crash” happens: the button remains “pressed in”, and the more-options dialog doesn’t appear. If I take over the VM, the Greeter GUI is frozen (but the mouse pointer works), and a few minutes later the Greeter window just disappears and we’re stuck in GDM. But the remote shell hasn’t crashed, and I managed to recover the journal and saw this (time 136 is when the button was clicked):

[   44.220329] xhci_hcd 0000:02:00.0: WARN Event TRB for slot 1 ep 0 with no TDs queued?
[   44.220359] xhci_hcd 0000:02:00.0: WARN Event TRB for slot 1 ep 2 with no TDs queued?
[   44.220360] xhci_hcd 0000:02:00.0: WARN Event TRB for slot 1 ep 3 with no TDs queued?
[   44.220361] xhci_hcd 0000:02:00.0: WARN Event TRB for slot 2 ep 0 with no TDs queued?
[   75.486727] usb 2-2: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
[   78.133619] serial8250: too much work for irq4
[   78.653618] serial8250: too much work for irq4
[   79.201611] serial8250: too much work for irq4
[   79.757615] serial8250: too much work for irq4
[   80.313609] serial8250: too much work for irq4
[   86.409612] serial8250: too much work for irq4
[   86.969611] serial8250: too much work for irq4
[   87.529616] serial8250: too much work for irq4
[   88.089612] serial8250: too much work for irq4
[   88.653610] serial8250: too much work for irq4
[   99.733609] serial8250: too much work for irq4
[  100.301576] serial8250: too much work for irq4
[  100.873640] serial8250: too much work for irq4
[  101.433612] serial8250: too much work for irq4
[  101.989647] serial8250: too much work for irq4
[  106.203182] usb 2-2: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
[  136.918867] usb 2-2: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
[  136.965892] print_req_error: I/O error, dev sda, sector 2525336
[  136.966119] print_req_error: I/O error, dev sda, sector 2525336
[  136.966261] print_req_error: I/O error, dev loop0, sector 2509030
[  136.966294] SQUASHFS error: squashfs_read_data failed to read block 0x4c91cc46
[  136.966301] SQUASHFS error: Unable to read metadata cache entry [4c91cc46]
[  136.966303] SQUASHFS error: Unable to read inode 0x5280613dc
[  136.968255] print_req_error: I/O error, dev sda, sector 2525464
[  136.968277] print_req_error: I/O error, dev sda, sector 2525464
[  136.968416] print_req_error: I/O error, dev loop0, sector 2509152
[  136.969710] SQUASHFS error: squashfs_read_data failed to read block 0x4c92c062
[  136.969716] SQUASHFS error: Unable to read metadata cache entry [4c92c062]
[  136.969719] SQUASHFS error: Unable to read inode 0x61c220ae4

Yay, IO errors. In this instance the remote shell works, but I am sure the above could also result in the remote shell crashing, which would explain the times when the remote shell hangs, like in kibi’s example; both hefee and I have seen that variant as well.
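
A hypothetical diagnostic (not something the suite currently does) that would make this failure mode easier to spot: right after restoring a snapshot, grep the guest kernel log for these errors via the remote shell. This sketch assumes a vm_helper-style execute API whose return value exposes stdout:

# Hypothetical hook, not actual suite code: fail fast with a clear
# message if the guest hit IO errors after a snapshot restore.
def check_for_guest_io_errors(vm)
  dmesg = vm.execute('dmesg').stdout # assumption: execute() exposes stdout
  if dmesg =~ %r{print_req_error: I/O error|SQUASHFS error}
    raise "Guest hit IO errors after snapshot restore:\n#{dmesg}"
  end
end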

#4 Updated by anonym 2019-04-05 13:50:33

Derp, so I had misunderstood the “we need to patch qemu on stretch” as “you need to patch your host’s qemu if your host runs stretch”, and I didn’t bother with it. Now I’ve seen this known issue in the setup instructions and installed the patched qemu on my tester.

From the relevant Debian bug I see it’s about IO errors and the nec-xhci USB controller as well, so I’m hopeful that with this patch my problems with the test suite will be fixed. My initial tests seem to confirm this.

#5 Updated by CyrilBrulebois 2019-04-05 14:17:29

FWIW: there’s Bug #16355, which I hope documents my findings from a quarter ago; it hasn’t been turned into a branch yet due to ENOTIME because of other topics (RMing, TB, and sprints).

#6 Updated by intrigeri 2019-08-18 19:34:41

  • Priority changed from Low to Normal
  • Target version deleted (Tails_4.0)

FWIW, I’ve “restor[ed] snapshot from previous run with --keep-snapshots” a lot today while working on Bug #16004, and did not notice this problem. But anyway, it seems that @anonym retitled this bug due to a misunderstanding (the error he was seeing is well known and it seems it was fixed once he followed our setup instructions), so perhaps this should be reverted?

I don’t know if the errors @hefee was seeing had the same explanation.

And wrt. the original topic of this ticket (“Losing remote shell access with snapshots vs. persistence”): did anyone notice this with our current code? Did this cause enough trouble recently to be worth being a 4.0 release blocker? (I’m optimistically assuming it’s not.)

#7 Updated by hefee 2019-08-18 22:37:34

intrigeri wrote:
> I don’t know if the errors hefee was seeing had the same explanation.

As @anonym and I were sitting next to each other, we discussed the issue, and anonym started to dig into it when they spotted it too. So I can’t give more information here.

#8 Updated by intrigeri 2019-09-01 18:35:52

  • Status changed from Confirmed to Rejected

Okay, 2 weeks later, I’ve run persistence.feature a few times in a row with --keep-snapshots enabled. I did not see this problem. AFAICT:

  • Nobody has reproduced what this ticket was originally about in 8 months, and I’m afraid it’s not actionable without more info from someone who has seen it again.
  • The problem anonym & hefee experienced had another, well-known root cause.

So I’m tentatively going to close this ticket as non-actionable: keeping it open as-is does not seem to help us improve things.

But if you see this problem again, please reopen and provide:

  • the output of virsh snapshot-list TailsToaster
  • the full debug output
  • the output of ls -lR /tmp/TailsToaster
  • the version of QEMU installed on the system where you run ./run_test_suite.
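
To make gathering that easier, here is a small collection sketch (the TailsToaster domain name and paths come from the list above; the QEMU version command is an assumption and may need adjusting for your system):

#!/usr/bin/env ruby
# Collect the debugging information listed above in one go,
# on the host where ./run_test_suite is executed.
[
  'virsh snapshot-list TailsToaster',
  'ls -lR /tmp/TailsToaster',
  'qemu-system-x86_64 --version', # assumption: pick your actual QEMU binary
].each do |cmd|
  puts "### #{cmd}"
  system(cmd) || puts('(command failed or is not installed)')
end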

Also, in case that’s relevant: this bug originally happened when restoring the “I have started Tails from DVD without network” checkpoint, which we don’t use in persistence.feature anymore, so the problem might show up elsewhere nowadays.