Bug #10717
Concerning amount of test suite runs aborted on Jenkins due to timeout
% Done: 10%
Description
As reported on Feature #10382#note-10:
> the concerning amount of jobs that are aborted due to inactivity seems to be the top reason of non-successful test suite runs (out of the last 10 runs of test_Tails_ISO_devel, 6 aborted that way, 2 were successful, and 2 failed; on stable it’s “only” 3 aborted jobs out of 10).
It may block Feature #10382, depending on whether we send email notification on timeouts (which IMO we should do but that’s off-topic here).
Related issues
- Related to Tails - Bug #9157: ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel (Resolved, 2015-04-04)
- Related to Tails - Bug #10720: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device (Resolved, 2015-12-07)
History
#1 Updated by intrigeri 2015-12-06 15:53:21
- Status changed from Confirmed to In Progress
- Assignee set to intrigeri
- % Done changed from 0 to 10
After feeling a bit guilty that I wasn’t being nice on Feature #10382#note-10, I wanted to help a bit, so I tracked https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-5926-freezable-apt-repository/27/console while it was quite clearly going to time out. I’m telling the entire story of this debugging session, in case it is useful and inspires others on our team to debug the next similar problems themselves.
All times below are UTC.
At 22:40, all previous scenarios in the current feature have passed and the last visible lines are:
22:12:48 # Depends on scenario: Writing files to a read/write-enabled persistent partition with the old Tails USB installation
22:12:48 Scenario: Upgrading an old Tails USB installation from an ISO image, running on the old version # features/usb_upgrade.feature:96
22:12:52 Given a computer # features/step_definitions/common_steps.rb:116
22:12:52 And I clone USB drive "old" to a new USB drive "to_upgrade" # features/step_definitions/usb.rb:51
22:12:57 And I setup a filesystem share containing the Tails ISO # features/step_definitions/usb.rb:157
=> either the currently running step is I start Tails from USB drive “old” with network unplugged and I login, or some messages from Cucumber were not received by Jenkins yet. I’m unsure, see below.
On the slave where this test runs (isotester2):
- Cucumber is eating a CPU core
- QEMU is almost idling
- all filesystems have lots of free space
- memory situation sounds OK:
total used free shared buffers cached
Mem: 20126 16040 4085 14 86 11326
-/+ buffers/cache: 4627 15498
Swap: 0 0 0
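For reference, this snapshot could be scripted; here is a minimal sketch, assuming psutil is installed on the isotester, with the process names and paths being guesses for our setup:

# Health snapshot of an isotester: CPU usage of the test suite processes,
# memory, and free space where the test suite writes.
# Process names and paths are assumptions for our setup.
import psutil

for proc in psutil.process_iter(['name']):
    if proc.info['name'] in ('cucumber', 'ruby', 'qemu-system-x86_64'):
        print(proc.info['name'], proc.cpu_percent(interval=1.0))

mem = psutil.virtual_memory()
print('used: %d MiB, available: %d MiB' % (mem.used / 2**20, mem.available / 2**20))

for path in ('/', '/tmp/TailsToaster'):
    usage = psutil.disk_usage(path)
    print('%s: %.1f GiB free' % (path, usage.free / 2**30))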
In /var/log/libvirt/qemu/TailsToaster.log, the last startup is: 2015-12-06 22:13:00.329+0000: starting up, which matches the last line I see in the test suite log.
Storage-wise:
$ /bin/ls -lhtr /tmp/TailsToaster/TailsToasterStorage/
total 6.0G
-rwxrwxr-- 1 libvirt-qemu libvirt-qemu 1.1G Dec 6 21:56 old
-rwxrwxr-- 1 root root 3.0G Dec 6 22:04 current
-rwxrwxr-- 1 libvirt-qemu libvirt-qemu 2.1G Dec 6 22:18 to_upgrade
… which indicates that something has modified the to_upgrade disk since it was cloned, the last modification being 5 minutes after the VM startup. Add to this the fact that on the last successful run on devel (https://jenkins.tails.boum.org/view/Raw/job/test_Tails_ISO_devel/60/console) it took about 6.5 minutes between I setup a filesystem share containing the Tails ISO and the end of the upgrade of the to_upgrade disk. So I seriously wonder whether the currently running step really is I start Tails from USB drive “old” with network unplugged and I login: I bet the last messages from Cucumber simply did not reach Jenkins yet.
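To double-check this kind of timestamp reasoning on future runs, here is a small sketch that compares the disk image’s mtime with the VM startup time; the image path and the libvirt timestamp are taken from this particular run and would need adjusting:

# Compare the mtime of the to_upgrade disk image with the VM startup time
# seen in /var/log/libvirt/qemu/TailsToaster.log.
# Path and timestamp below are from this particular run.
import os
from datetime import datetime, timezone

disk = '/tmp/TailsToaster/TailsToasterStorage/to_upgrade'
last_write = datetime.fromtimestamp(os.stat(disk).st_mtime, tz=timezone.utc)
startup = datetime.strptime('2015-12-06 22:13:00.329+0000',
                            '%Y-%m-%d %H:%M:%S.%f%z')
print('last write to %s: %s after VM startup' % (disk, last_write - startup))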
I see that the video is still being recorded:
/tmp/TailsToaster/run-2015-12-06_16:41:34_+0000-git_2d18e9f-GzFqfL $ ls -thr
total 9.7M
-rw-r--r-- 1 root root 65K Dec 6 19:26 02:44:44_Clock_is_one_day_in_the_future_in_bridge_mode.png
-rw-r--r-- 1 root root 1.5M Dec 6 19:26 02:44:44_Clock_is_one_day_in_the_future_in_bridge_mode.mkv
-rw-r--r-- 1 root root 42K Dec 6 20:08 03:26:26_Fetching_OpenPGP_keys_using_Seahorse_should_work_and_be_done_over_Tor..png
-rw-r--r-- 1 root root 530K Dec 6 20:08 03:26:26_Fetching_OpenPGP_keys_using_Seahorse_should_work_and_be_done_over_Tor..mkv
-rw-r--r-- 1 root root 7.7M Dec 6 23:01 Upgrading_an_old_Tails_USB_installation_from_an_ISO_image,_running_on_the_old_version.mkv
The current video (Upgrading_an_old_Tails_USB_installation_from_an_ISO_image,_running_on_the_old_version.mkv) shows Tails booting, Tails Installer upgrading /dev/sdb1 and freezing at 99% from 04:07 to 04:30 while “Updating properties of system partition /dev/sdb1”, until an error message finally appears in Tails Installer’s graphical log: “Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.” (And then nothing happens for dozens of minutes, so I bet this test suite run will eventually time out.)
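For context: that error text is the generic client-side libdbus message you get when a method call’s reply never arrives before the timeout. A minimal illustration with python-dbus follows; the service name, object path and method are placeholders, not the actual Tails Installer code:

# Illustration of how "Did not receive a reply" surfaces on the caller's side
# of a D-Bus method call, and how the reply timeout could be widened.
# The bus name, object path, interface and method are placeholders.
import dbus

bus = dbus.SystemBus()
obj = bus.get_object('org.example.SlowService', '/org/example/SlowService')
iface = dbus.Interface(obj, 'org.example.SlowService')
try:
    # libdbus' default reply timeout is 25 seconds; a long "update properties"
    # style call can easily exceed it. timeout= is expressed in seconds here.
    iface.SlowMethod(timeout=300)
except dbus.exceptions.DBusException as e:
    if e.get_dbus_name() == 'org.freedesktop.DBus.Error.NoReply':
        print('reply timeout expired:', e)
    else:
        raise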
I’ve checked a couple more runs that timed out and the root cause seems to be the same:
- https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_devel/64/artifact/build-artifacts/Upgrading_an_old_Tails_USB_installation_from_a_Tails_DVD.mkv → exactly the same symptoms
- https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_devel/63/artifact/build-artifacts/Upgrading_an_old_Tails_USB_installation_from_an_ISO_image,_running_on_the_new_version.mkv
Regarding the symptom we see in Tails Installer: according to Debian Code Search, this error message comes from https://codesearch.debian.net/show?file=dbus_1.10.6-1%2Fdbus%2Fdbus-pending-call.c&line=362. But the tails-installer 4.x branch that we use on feature/jessie uses udisks2 and communicates with it in very different ways, so the very next thing we should do on this ticket is to check whether the bug happens on Jessie ISOs. Sadly, something very similar happens there: https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/77/, https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/76/ and https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/72/ all time out on an installation failure at “Partitioning device /dev/sda” with “GDBus.Error:org.freedesktop.DBus.Error.UnknownMethod: No such interface ‘org.freedesktop.UDisks2.Partition’ on object at path /org/freedesktop/UDisks2/block_devices/sda1”. This seems to happen in def partition_device, which already has a pile of workarounds for UDisks bugs, so I’m not surprised :/ I’ll file a ticket about this later, so I’m assigning this to myself.
The root cause might live in UDisks, in QEMU, or in the Linux kernel (on the level-0 host, or on the level-1 guests, a.k.a. isotesterN). All of this is painfully hard to debug, so if a workaround can be added to Tails Installer, that will be good enough IMO. Note that I won’t bother doing that in Tails Installer 3.x.
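To make the kind of workaround I have in mind concrete, here is a sketch that polls udisks2’s ObjectManager until the Partition interface shows up on the freshly created partition before calling its methods; the helper name and device path are hypothetical, and this is not the actual partition_device code:

# Sketch of a defensive wait that could be added before calling methods on
# org.freedesktop.UDisks2.Partition: poll udisks2's ObjectManager until the
# interface is exported on the new block device. Helper name and device path
# are hypothetical; this is not the actual partition_device() code.
import time
from gi.repository import Gio

def wait_for_partition_interface(object_path, timeout=30):
    bus = Gio.bus_get_sync(Gio.BusType.SYSTEM, None)
    deadline = time.time() + timeout
    while time.time() < deadline:
        reply = bus.call_sync(
            'org.freedesktop.UDisks2', '/org/freedesktop/UDisks2',
            'org.freedesktop.DBus.ObjectManager', 'GetManagedObjects',
            None, None, Gio.DBusCallFlags.NONE, -1, None)
        objects = reply.unpack()[0]
        if 'org.freedesktop.UDisks2.Partition' in objects.get(object_path, {}):
            return True
        time.sleep(1)
    return False

if wait_for_partition_interface('/org/freedesktop/UDisks2/block_devices/sda1'):
    print('Partition interface is now available')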
#2 Updated by intrigeri 2015-12-06 15:57:16
- Priority changed from Normal to Elevated
bertagaz: if email notifications go out for aborted jobs, please mark this very ticket as blocking Feature #10382.
#3 Updated by bertagaz 2015-12-07 04:24:27
intrigeri wrote:
> bertagaz: if email notifications go out for aborted jobs, please mark this very ticket as blocking Feature #10382.
They’re not.
Nice debugging!
For some of the root causes, maybe moving forward with the plan you posted on Bug #9157 would help, at least if the bug is in QEMU/KVM.
#4 Updated by bertagaz 2015-12-07 04:25:21
- Related to Bug #9157: ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel added
#5 Updated by intrigeri 2015-12-07 08:32:52
- Related to Bug #10720: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device added
#6 Updated by intrigeri 2015-12-07 08:49:48
- Priority changed from Elevated to Normal
Now that the USB tests have been tagged fragile, I don’t think this blocks anything immediately. I’ll check this again during next cycle and close if no similar problem comes up.
#7 Updated by intrigeri 2015-12-07 17:57:02
- Type of work changed from Research to Wait
#8 Updated by intrigeri 2016-01-24 19:26:59
- Status changed from In Progress to Rejected
- Assignee deleted (intrigeri)
intrigeri wrote:
> Now that the USB tests have been tagged fragile, I don’t think this blocks anything immediately. I’ll check this again during next cycle and close if no similar problem comes up.
There are similar problems but let’s track them on Bug #10988.