Concerning amount of test suite runs aborted on Jenkins due to timeout
As reported on
> the concerning amount of jobs that are aborted due to inactivity seems to be the top reason of non-successful test suite runs (out of the last 10 runs of
test_Tails_ISO_devel, 6 aborted that way, 2 were successful, and 2 failed; on stable it’s “only” 3 aborted jobs out of 10).
It may block Feature #10382, depending on whether we send email notifications on timeouts (which IMO we should do, but that’s off-topic here).
#1 Updated by intrigeri 2015-12-06 15:53:21
- Status changed from Confirmed to In Progress
- Assignee set to intrigeri
- % Done changed from 0 to 10
After feeling a bit guilty that I wasn’t being nice on Feature #10382#note-10, I wanted to help a bit, so I tracked https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-5926-freezable-apt-repository/27/console while it was quite clearly going to time out. I’m telling the entire story of this debugging session, in case it is useful and inspires others on our team to debug the next similar problems themselves.
All times below are UTC.
At 22:40, all previous scenarios in the current feature have passed and the last visible lines are:
    22:12:48   # Depends on scenario: Writing files to a read/write-enabled persistent partition with the old Tails USB installation
    22:12:48   Scenario: Upgrading an old Tails USB installation from an ISO image, running on the old version # features/usb_upgrade.feature:96
    22:12:52     Given a computer # features/step_definitions/common_steps.rb:116
    22:12:52     And I clone USB drive "old" to a new USB drive "to_upgrade" # features/step_definitions/usb.rb:51
    22:12:57     And I setup a filesystem share containing the Tails ISO # features/step_definitions/usb.rb:157
=> either the currently running step is I start Tails from USB drive “old” with network unplugged and I login, or some messages from Cucumber were not received by Jenkins yet. I’m unsure; see below.
On the slave where this test runs (isotester2):
- Cucumber is eating a CPU core
- QEMU is almost idling
- all filesystems have lots of free space
- memory situation sounds OK:
                 total       used       free     shared    buffers     cached
    Mem:         20126      16040       4085         14         86      11326
    -/+ buffers/cache:       4627      15498
    Swap:            0          0          0
In /var/log/libvirt/qemu/TailsToaster.log, the last startup is:

    2015-12-06 22:13:00.329+0000: starting up

… which matches the last line I see in the test suite log.
    $ /bin/ls -lhtr /tmp/TailsToaster/TailsToasterStorage/
    total 6.0G
    -rwxrwxr-- 1 libvirt-qemu libvirt-qemu 1.1G Dec  6 21:56 old
    -rwxrwxr-- 1 root         root         3.0G Dec  6 22:04 current
    -rwxrwxr-- 1 libvirt-qemu libvirt-qemu 2.1G Dec  6 22:18 to_upgrade
… which indicates that something has modified the to_upgrade disk since it was cloned, the last modification being 5 minutes after the VM startup. Add to this the fact that on the last successful run on devel (https://jenkins.tails.boum.org/view/Raw/job/test_Tails_ISO_devel/60/console) it took about 6.5 minutes between I setup a filesystem share containing the Tails ISO and the end of the upgrade of the to_upgrade disk. So I seriously doubt that the currently running step really is I start Tails from USB drive “old” with network unplugged and I login; I bet the last messages from Cucumber simply have not reached Jenkins yet.
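For the record, the timestamp arithmetic above can be double-checked mechanically (times copied from this run; the mtime’s seconds are assumed to be :00 since ls only shows minutes):

```python
from datetime import datetime

# Timestamps taken from the run above (all UTC).
fmt = "%H:%M:%S"
vm_started = datetime.strptime("22:13:00", fmt)  # TailsToaster.log: "starting up"
last_write = datetime.strptime("22:18:00", fmt)  # mtime of the to_upgrade image

# The image was last written ~5 minutes after the VM booted, consistent
# with an upgrade that was still in progress when the run stalled.
minutes = (last_write - vm_started).seconds // 60
assert minutes == 5
```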
I see that video is still being recorded:
    /tmp/TailsToaster/run-2015-12-06_16:41:34_+0000-git_2d18e9f-GzFqfL $ ls -thr
    total 9.7M
    -rw-r--r-- 1 root root  65K Dec  6 19:26 02:44:44_Clock_is_one_day_in_the_future_in_bridge_mode.png
    -rw-r--r-- 1 root root 1.5M Dec  6 19:26 02:44:44_Clock_is_one_day_in_the_future_in_bridge_mode.mkv
    -rw-r--r-- 1 root root  42K Dec  6 20:08 03:26:26_Fetching_OpenPGP_keys_using_Seahorse_should_work_and_be_done_over_Tor..png
    -rw-r--r-- 1 root root 530K Dec  6 20:08 03:26:26_Fetching_OpenPGP_keys_using_Seahorse_should_work_and_be_done_over_Tor..mkv
    -rw-r--r-- 1 root root 7.7M Dec  6 23:01 Upgrading_an_old_Tails_USB_installation_from_an_ISO_image,_running_on_the_old_version.mkv
The current video (Upgrading_an_old_Tails_USB_installation_from_an_ISO_image,_running_on_the_old_version.mkv) shows Tails booting, then Tails Installer upgrading /dev/sdb1 and freezing at 99% from 04:07 to 04:30 while “Updating properties of system partition /dev/sdb1”. Finally, an error message appears in Tails Installer’s graphical log: “Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken”. (Then nothing happens for the next dozens of minutes, and I bet this test suite run will time out.)
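That error message is the classic D-Bus symptom of a caller giving up on a reply while the callee is still busy: the bus stops waiting after a fixed deadline even though the remote method (here, udisks updating partition properties) may still be running. A minimal, self-contained model of this failure mode, with names and the timeout mechanism invented for illustration (this is not D-Bus or tails-installer code):

```python
import threading

class ReplyTimeout(Exception):
    """Raised when the callee does not answer within the deadline."""

def call_with_reply_timeout(func, timeout, *args, **kwargs):
    """Run func in a thread; raise ReplyTimeout if it has not returned in time.

    This mimics the D-Bus behaviour seen above: the caller gives up
    waiting for a reply after a deadline, even though the remote side
    may still be working (as udisks apparently was, stuck at 99% while
    "Updating properties of system partition /dev/sdb1").
    """
    result = {}

    def worker():
        result["value"] = func(*args, **kwargs)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        raise ReplyTimeout("Did not receive a reply within %ss" % timeout)
    return result["value"]
```

In this model, a call that finishes quickly returns normally, while a call that outlives the deadline raises ReplyTimeout, which is roughly what Tails Installer experienced when the reply timeout expired.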
I’ve checked a couple more runs that timed out and the root cause seems to be the same:
- https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_devel/64/artifact/build-artifacts/Upgrading_an_old_Tails_USB_installation_from_a_Tails_DVD.mkv → exactly the same symptoms
Regarding the symptom we see in Tails Installer: according to Debian Code Search, this error message comes from https://codesearch.debian.net/show?file=dbus_1.10.6-1%2Fdbus%2Fdbus-pending-call.c&line=362. But the tails-installer 4.x branch that we use on feature/jessie uses udisks2 and communicates with it in very different ways, so the very next thing we should do on this ticket is check whether the bug happens on Jessie ISOs. Sadly, something very similar happens there: https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/77/, https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/76/ and https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/72/ all time out on an installation failure at “Partitioning device /dev/sda” with “GDBus.Error:org.freedesktop.DBus.Error.UnknownMethod: No such interface ‘org.freedesktop.UDisks2.Partition’ on object at path /org/freedesktop/UDisks2/block_devices/sda1”. This seems to happen in def partition_device, which already has a pile of workarounds for UDisks bugs, so I’m not surprised :/ I’ll file a ticket about this later, so assigning to me.
The root cause might live in UDisks, in QEMU, or in the Linux kernel (on the level-0 host, or on the level-1 guests, aka isotesterN). All this is painfully hard to debug, so if a workaround can be added to Tails Installer, that will be good enough IMO. Note that I won’t bother doing that in Tails Installer 3.x.
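Such a workaround would presumably look like the existing UDisks workarounds in partition_device: retry the flaky operation a few times before giving up. A sketch of that pattern, with hypothetical names (retry_udisks_call, attempts, delay are illustrative, not actual tails-installer code):

```python
import time

def retry_udisks_call(operation, attempts=3, delay=5,
                      transient=(Exception,)):
    """Call operation(); on a transient D-Bus/UDisks error, wait and retry.

    All parameters here are illustrative. A real workaround would live
    next to the other UDisks workarounds in tails-installer's
    partition_device(), and `transient` would be narrowed to the
    specific D-Bus error classes seen in the failing runs.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == attempts:
                # Out of retries: let the original error propagate.
                raise
            time.sleep(delay)
```

The design trade-off is the usual one for retry loops: it papers over a race or a hang somewhere below (UDisks, QEMU, or the kernel) without fixing it, but given how hard that stack is to debug, it may be the pragmatic choice.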
#3 Updated by bertagaz 2015-12-07 04:24:27
> bertagaz: if email notifications go out for aborted jobs, please mark this very ticket as blocking
For some of the root causes, going forward with the plan you posted on Bug #9157 may help, at least if the bug is in QEMU/KVM.
#8 Updated by intrigeri 2016-01-24 19:26:59
- Status changed from In Progress to Rejected
- Assignee deleted (
> Now that the USB tests have been tagged fragile, I don’t think this blocks anything immediately. I’ll check this again during next cycle and close if no similar problem comes up.
There are similar problems but let’s track them on