Bug #11588

Sometimes fails to boot from USB on Jenkins with I/O errors

Added by intrigeri 2016-07-22 02:07:16. Updated 2016-09-20 16:50:08.

Status: Resolved
Priority: Normal
Assignee:
Category: Test suite
Target version:
Start date: 2016-07-22
Due date:
% Done: 100%
Feature Branch: test/11588-usb-on-jenkins+10733
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for: 270

Description

While working on Bug #10720 I noticed a few I/O errors that blocked the boot. Let’s start compiling them here and we’ll see what can be done about this. So far I’ve seen such issues only when booting from USB. I’m curious if the same root cause can trigger more subtle issues, i.e. not blocking the boot but causing false positives later on (I’m thinking e.g. of all the scenarios in which the system under test seems frozen in Tails Greeter after clicking “log in”).

  • The second (and last) “I start Tails from USB drive "isohybrid" with network unplugged and I login” step in “Cat:ing a Tails isohybrid to a USB drive and booting it, then trying to upgrading it but ending up having to do a fresh installation, which boots” fails: the test suite options are added to the kernel command line, and then, while the syslinux menu is still displayed and there’s no trace of Linux booting:
    • 7min30 later: CHS: Error 0c00 reading sector 2247939 (140/14/6) and EDD: Error 0c00 reading sector 2249987
    • another minute later: CHS: Error 0c00 reading sector 2251621 (140/72/34) and EDD: Error 0c00 reading sector 2253669
    • the test suite times out before anything else happens
  • (seen 2 times) “I start Tails from USB drive "old" with network unplugged and I login” fails with CHS/EDD errors very similar to the above, but at some point Linux starts spitting output and there’s a kernel panic (“Failed to execute /init”)
  • I’ve seen at least two Tails cat:ed from ISO fail to boot with SquashFS errors.
  • “I start Tails from USB drive "__internal" with network unplugged and I login with persistence enabled” in “Watching MP4 videos stored on the persistent volume should work as expected given our AppArmor confinement” fails with similar CHS/EDD errors as above; at some point Linux starts spitting output and there’s a kernel panic (“Failed to execute /init”)
  • “I start Tails from USB drive "_internal" with network unplugged and I login with read-only persistence enabled” in “I start Tails from USB drive "_internal" with network unplugged and I login with read-only persistence enabled” fails with similar CHS/EDD errors as above
  • “I start Tails from USB drive "old" with network unplugged and I login” in “Creating a persistent partition with the old Tails USB installation”: kernel panic
  • “I start Tails from USB drive "old" with network unplugged and I login with persistence enabled” in “Writing files to a read/write-enabled persistent partition with the old Tails USB installation”: CHS/EDD errors
  • “I start Tails from USB drive "to_upgrade" with network unplugged and I login with persistence enabled” in “Booting a USB drive upgraded from ISO with persistence enabled” is stuck at “syslinux 6.03 EDD” and never displays the bootloader menu (see 02_39_57_Booting_a_USB_drive_upgraded_from_ISO_with_persistence_enabled.mkv attached)
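A side note on reading the CHS/EDD messages quoted in the list above: the sketch below checks whether the CHS and EDD lines of each pair describe the same disk sector. It assumes the conventional translated BIOS geometry of 255 heads and 63 sectors per track, which is an assumption and not something recorded in this ticket. Under that assumption, the CHS tuples map exactly to the sector numbers in the EDD messages, and the sector number printed in each CHS message is 2048 lower, which would be consistent with a partition starting at sector 2048 (again a guess, not a confirmed fact about these disk images).

```python
# Assumed BIOS geometry (not stated in the ticket): 255 heads, 63 sectors/track.
HEADS = 255
SECTORS_PER_TRACK = 63


def chs_to_lba(cylinder, head, sector, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Standard CHS -> LBA conversion; CHS sector numbers are 1-based."""
    return (cylinder * heads + head) * spt + (sector - 1)


# Values copied from the error messages quoted in the description.
errors = [
    {"chs_sector": 2247939, "chs": (140, 14, 6), "edd_sector": 2249987},
    {"chs_sector": 2251621, "chs": (140, 72, 34), "edd_sector": 2253669},
]

for err in errors:
    lba = chs_to_lba(*err["chs"])
    offset = err["edd_sector"] - err["chs_sector"]
    print(
        f"CHS {err['chs']} -> LBA {lba}; "
        f"matches EDD sector: {lba == err['edd_sector']}; "
        f"EDD minus CHS-message sector: {offset}"
    )
```

If this reading is right, each CHS/EDD pair reports two failed attempts to read a single sector, first via the legacy CHS interface and then via EDD, rather than two independent bad sectors.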

I’ve never seen that outside of Jenkins, so I suspect a problem with the platform.

Random debugging ideas:

  • upgrade isotesters’ kernel to Linux 4.6: done between 2016-07-23 10:31 UTC and 11:00 UTC
  • upgrade isotesters’ QEMU to 2.5 from jessie-backports: done on 2016-07-27 around 08:00 UTC
  • check if the isotesters’ Journal has anything interesting around the time of the failure: nothing special in there
  • check if isotesters I/O load is as we expect it to be while running the test suite (including USB scenarios), i.e. most of our temporary data should stay in memory cache, and should never be flushed out to disk; the most recent work we’ve done in this area can serve as reference: Feature #11175: I/O load is as expected (most action happens on tmpfs so isotesters don’t do much disk I/O)
  • check if there’s anything interesting on Munin around the time of the failures: WIP; nothing I could notice; only a potential correlation with check-mirrors runs might be worth looking closer into
  • give the system under test a USB3 (nec-xhci) controller: WIP (commit:499c630, https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins/)
  • upgrade the host system’s QEMU to 2.5 from jessie-backports
  • check virtual USB disk settings, e.g. the “cache” attribute
  • check how we’re managing snapshots vs. disks in the scenarios that sometimes fail
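For the ideas about the USB controller model and the “cache” attribute, one possible way to inspect what a guest is actually configured with is to parse its libvirt domain XML. The following is only a sketch: the sample XML, the file path in it, and the `usb_settings` helper are made up for illustration and are not taken from the test suite; in practice the XML would come from something like `virsh dumpxml` for the system under test.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain definition, for illustration only.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <controller type='usb' index='0' model='nec-xhci'/>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/tmp/example-usb.qcow2'/>
      <target dev='sda' bus='usb'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
    </disk>
  </devices>
</domain>
"""


def usb_settings(domain_xml):
    """Return USB controller models and the cache mode of each USB-attached disk."""
    root = ET.fromstring(domain_xml)
    controllers = [
        ctrl.get("model", "default")
        for ctrl in root.findall("./devices/controller[@type='usb']")
    ]
    disks = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is None or target.get("bus") != "usb":
            continue  # only disks plugged into a USB bus are of interest here
        driver = disk.find("driver")
        cache = driver.get("cache") if driver is not None else None
        disks.append({"dev": target.get("dev"), "cache": cache or "hypervisor default"})
    return {"usb_controller_models": controllers, "usb_disks": disks}


print(usb_settings(SAMPLE_DOMAIN_XML))
```

Comparing this output between a failing and a passing run would show whether the two differ in controller model or cache mode.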

Let’s keep in mind that we have other options, such as finally giving up on nested KVM for running our test suite on Jenkins, and instead getting a dedicated machine. Infrastructure-wise, IMO we are now ready to handle more machines (we have the VPN & Puppet setup in place for that). The additional engineering effort (support running multiple instances of our test suite concurrently on the same system) is certainly non-trivial, but it may still be cheaper than fixing this very ticket and all other bugs we only see on Jenkins. So let’s not spend too much time on this here.


Files


Subtasks


Related issues

Related to Tails - Bug #12142: The nec-xhci virtual USB controller + tails-persistence-setup causes a VM freeze on Debian Stretch or newer hosts Rejected 2017-01-13
Blocks Tails - Bug #11583: UEFI boot tests fail on Jenkins Resolved 2016-07-21
Blocked by Tails - Bug #11590: Improve Tails Installer robustness for 2.6 Resolved 2016-07-22
Blocked by Tails - Bug #10733: Run our initramfs memory erasure hook earlier Resolved 2015-12-09

History

#1 Updated by intrigeri 2016-07-22 02:07:51

  • Description updated

#2 Updated by intrigeri 2016-07-22 02:17:50

  • Description updated

#3 Updated by intrigeri 2016-07-22 06:00:25

  • Description updated

#4 Updated by intrigeri 2016-07-23 03:04:17

  • Description updated

#5 Updated by intrigeri 2016-07-23 04:05:44

  • Description updated

#6 Updated by intrigeri 2016-07-23 12:13:29

  • Description updated

#7 Updated by intrigeri 2016-07-26 11:48:40

  • Description updated

#8 Updated by intrigeri 2016-07-26 11:51:11

FWIW it seems that failures occur much more frequently when the system is under heavy load (e.g. running multiple instances of the test suite at the same time): since I’ve stopped working on such things and triggering builds+tests on branches that have the USB tests enabled, https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-10720-installer-freezes-on-jenkins/ is quite robust again.

#9 Updated by intrigeri 2016-07-27 01:01:07

  • Description updated

#10 Updated by intrigeri 2016-07-27 08:20:28

  • Description updated

#11 Updated by intrigeri 2016-07-27 08:25:52

  • Description updated

#12 Updated by intrigeri 2016-07-27 13:50:15

  • Description updated

#13 Updated by intrigeri 2016-07-27 14:01:36

  • Description updated

#15 Updated by intrigeri 2016-07-28 05:18:31

  • Description updated

#16 Updated by intrigeri 2016-07-28 06:43:06

  • Description updated

#17 Updated by intrigeri 2016-07-28 06:52:30

  • Description updated

#18 Updated by intrigeri 2016-07-28 07:49:20

  • Description updated

#19 Updated by intrigeri 2016-07-28 08:37:34

  • Description updated

#20 Updated by intrigeri 2016-07-28 08:45:22

  • blocks Bug #11583: UEFI boot tests fail on Jenkins added

#21 Updated by intrigeri 2016-07-28 08:45:46

  • blocked by Bug #10720: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device added

#22 Updated by intrigeri 2016-07-29 06:51:07

  • Feature Branch set to test/11588-usb-on-jenkins

#23 Updated by intrigeri 2016-07-29 06:51:45

  • blocks deleted (Bug #10720: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device)

#24 Updated by intrigeri 2016-07-29 06:57:34

  • blocked by Bug #11590: Improve Tails Installer robustness for 2.6 added

#25 Updated by intrigeri 2016-07-29 14:20:36

  • Description updated

#26 Updated by intrigeri 2016-07-30 03:16:43

  • Description updated

#27 Updated by intrigeri 2016-07-30 11:24:44

  • Status changed from Confirmed to In Progress
  • Assignee set to intrigeri
  • Target version set to Tails_2.6
  • % Done changed from 0 to 10
  • Feature Branch changed from test/11588-usb-on-jenkins to test/11588-usb-on-jenkins+10733

Status update: looks like I’ve got something robust enough, see https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/ and https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-10971-more-cpus-for-tailstoaster-11588-10733/. I’ll let it run a couple more weeks on Jenkins and we’ll see. I hope we’ll be able to merge this (and re-enable most USB tests) during the 2.6 cycle, fingers crossed.

#28 Updated by intrigeri 2016-07-30 11:25:00

  • blocked by Bug #10733: Run our initramfs memory erasure hook earlier added

#29 Updated by intrigeri 2016-07-30 15:12:30

  • Deliverable for set to 270

#31 Updated by intrigeri 2016-08-01 07:19:15

  • Assignee changed from intrigeri to anonym
  • % Done changed from 10 to 20
  • QA Check set to Ready for QA

This seems to be rock solid on Jenkins.

#32 Updated by intrigeri 2016-08-01 07:37:14

I’d like to ease reviewing for the 2.6 RM, and to get automated tests running on the combination of all these changes ASAP in the 2.6 dev cycle. So, I’ve merged this work, along with the other major branches I’m proposing for 2.6, into the feature/from-intrigeri-for-2.6 integration branch (Jenkins builds and tests).

#33 Updated by anonym 2016-08-23 08:28:01

  • Status changed from In Progress to Fix committed
  • Assignee deleted (anonym)
  • % Done changed from 20 to 100
  • QA Check changed from Ready for QA to Pass

#34 Updated by anonym 2016-09-20 16:50:08

  • Status changed from Fix committed to Resolved

#35 Updated by anonym 2017-01-16 14:47:16

  • related to Bug #12142: The nec-xhci virtual USB controller + tails-persistence-setup causes a VM freeze on Debian Stretch or newer hosts added