Bug #10717: Concerning amount of test suite runs aborted on Jenkins due to timeout

Bug #10717

Concerning amount of test suite runs aborted on Jenkins due to timeout

Added by intrigeri 2015-12-06 15:15:19 . Updated 2016-01-24 19:26:59 .

Status:

Rejected

Priority:

Normal

Assignee:

Category:

Continuous Integration

Target version:

Tails_2.0

Start date:

2015-12-06

Due date:

% Done:

10%

Feature Branch:

Type of work:

Wait

Blueprint:

Starter:

Affected tool:

Deliverable for:

267

Description

As reported on ~~Feature #10382#note-10~~:

> the concerning amount of jobs that are aborted due to inactivity seems to be the top reason of non-successful test suite runs (out of the last 10 runs of test_Tails_ISO_devel, 6 aborted that way, 2 were succesful, and 2 failed; on stable it’s “only” 3 aborted jobs out of 10).

It may block ~~Feature #10382~~, depending on whether we send email notification on timeouts (which IMO we should do but that’s off-topic here).

Subtasks

Related issues

Related to Tails - ~~Bug #9157~~: ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel	Resolved	2015-04-04
Related to Tails - ~~Bug #10720~~: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device	Resolved	2015-12-07

History

#1 Updated by intrigeri 2015-12-06 15:53:21

Status changed from Confirmed to In Progress
Assignee set to intrigeri
% Done changed from 0 to 10

After feeling a bit guilty that I wasn’t being nice on ~~Feature #10382#note-10~~, I wanted to help a bit and I’ve tracked https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-5926-freezable-apt-repository/27/console while it was quite clearly going to timeout. I’m telling the entire story of this debugging session, in case it might be useful and inspire others on our team for debugging the next similar problems themselves.

All times below are UTC.

At 22:40, all previous scenarios in the current feature have passed and the last visible lines are:

22:12:48   # Depends on scenario: Writing files to a read/write-enabled persistent partition with the old Tails USB installation
22:12:48   Scenario: Upgrading an old Tails USB installation from an ISO image, running on the old version # features/usb_upgrade.feature:96
22:12:52     Given a computer                                                                              # features/step_definitions/common_steps.rb:116
22:12:52     And I clone USB drive "old" to a new USB drive "to_upgrade"                                   # features/step_definitions/usb.rb:51
22:12:57     And I setup a filesystem share containing the Tails ISO                                       # features/step_definitions/usb.rb:157

=> either the currently running step is I start Tails from USB drive “old” with network unplugged and I login, or some messages from Cucumber where not received by Jenkins yet. I’m unsure, see below.

On the slave where this test runs (isotester2):

Cucumber is eating a CPU core
QEMU is almost idling
all filesystems have lots of free space
memory situation sounds OK:

             total       used       free     shared    buffers     cached
Mem:         20126      16040       4085         14         86      11326
-/+ buffers/cache:       4627      15498
Swap:            0          0          0

In /var/log/libvirt/qemu/TailsToaster.log, last startup is: 2015-12-06 22:13:00.329+0000: starting up, which matches the last line I see in the test suite log

Storage-wise:

$ /bin/ls -lhtr /tmp/TailsToaster/TailsToasterStorage/
total 6.0G
-rwxrwxr-- 1 libvirt-qemu libvirt-qemu 1.1G Dec  6 21:56 old
-rwxrwxr-- 1 root         root         3.0G Dec  6 22:04 current
-rwxrwxr-- 1 libvirt-qemu libvirt-qemu 2.1G Dec  6 22:18 to_upgrade

… which indicates that something has modified the to_upgrade disk since it was cloned, last modification being 5 minutes after the VM startup. Add to this the fact that on last successful run on devel (https://jenkins.tails.boum.org/view/Raw/job/test_Tails_ISO_devel/60/console) it took about 6.5 minutes between I setup a filesystem share containing the Tails ISO and the end of the upgrade of the to_upgrade disk. And then I seriously wonder if the currently running step really is I start Tails from USB drive “old” with network unplugged and I login — I bet the last messages from Cucumber simply did not reach Jenkins yet.

I see that video is still being recorded:

/tmp/TailsToaster/run-2015-12-06_16:41:34_+0000-git_2d18e9f-GzFqfL $ ls -thr
total 9.7M
-rw-r--r-- 1 root root  65K Dec  6 19:26 02:44:44_Clock_is_one_day_in_the_future_in_bridge_mode.png
-rw-r--r-- 1 root root 1.5M Dec  6 19:26 02:44:44_Clock_is_one_day_in_the_future_in_bridge_mode.mkv
-rw-r--r-- 1 root root  42K Dec  6 20:08 03:26:26_Fetching_OpenPGP_keys_using_Seahorse_should_work_and_be_done_over_Tor..png
-rw-r--r-- 1 root root 530K Dec  6 20:08 03:26:26_Fetching_OpenPGP_keys_using_Seahorse_should_work_and_be_done_over_Tor..mkv
-rw-r--r-- 1 root root 7.7M Dec  6 23:01 Upgrading_an_old_Tails_USB_installation_from_an_ISO_image,_running_on_the_old_version.mkv

The current video (Upgrading_an_old_Tails_USB_installation_from_an_ISO_image,_running_on_the_old_version.mkv) shows Tails booting, Tails Installer upgrading /dev/sdb1 and freezing at 99% from 04:07 to 04:30 while “Updating properties of system partition /dev/sdb1”, and finally an error message appears in Tails Installer’s graphical log that says: “Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken” (And then nothing happens for the next dozens of minutes, and I bet this test suite run will then time out.)

I’ve checked a couple more runs that timed out and the root cause seems to be the same:

Regarding the symptom we see in Tails Installer: according to Debian Code Search, this error message comes from https://codesearch.debian.net/show?file=dbus_1.10.6-1%2Fdbus%2Fdbus-pending-call.c&line=362 — but the tails-installer 4.x branch that we use on feature/jessie uses udisks2 and communicates with it in very different ways so the very next thing we should do on this ticket is see if the bug happens on Jessie ISOs. Sadly, something very similar happens there: https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/77/, https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/76/ and https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/72/ all time out on an installation failure at “Partitioning device /dev/sda” with “GDBus.Error:org.freedesktop.DBus.Error.UnknownMethod: No such interface ‘org.freedesktop.UDisks2.Partition’ on object at path /org/freedesktop/UDisks2/block_devices/sda1”; this seems to happen in def partition_device, which already has a pile of workaround for UDisks bugs, so I’m not surprised :/ — I’ll file a ticket about this later, so assigning to me.

The root cause might live in UDisks, in QEMU, or in the Linux kernel (on the level-0 host, or on the level-1 aka. isotesterN). All this is painfully hard to debug, so if a workaround can be added to Tails Installer it’ll be good enough IMO. Note that I won’t bother doing that in Tails Installer 3.x.

#2 Updated by intrigeri 2015-12-06 15:57:16

Priority changed from Normal to Elevated

bertagaz: if email notifications go out for aborted jobs, please mark this very ticket as blocking ~~Feature #10382~~.

#3 Updated by bertagaz 2015-12-07 04:24:27

intrigeri wrote:
> bertagaz: if email notifications go out for aborted jobs, please mark this very ticket as blocking ~~Feature #10382~~.

They’re not.

Nice debugging!

For some of the root causes, maybe going forward on the plan you posted on ~~Bug #9157~~ may help. At least if the bug is in the QEMU/Kvm.

#4 Updated by bertagaz 2015-12-07 04:25:21

related to ~~Bug #9157~~: ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel added

#5 Updated by intrigeri 2015-12-07 08:32:52

related to ~~Bug #10720~~: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device added

#6 Updated by intrigeri 2015-12-07 08:49:48

Priority changed from Elevated to Normal

Now that the USB tests have been tagged fragile, I don’t think this blocks anything immediately. I’ll check this again during next cycle and close if no similar problem comes up.

#7 Updated by intrigeri 2015-12-07 17:57:02

Type of work changed from Research to Wait

#8 Updated by intrigeri 2016-01-24 19:26:59

Status changed from In Progress to Rejected
Assignee deleted (~~intrigeri~~)

intrigeri wrote:
> Now that the USB tests have been tagged fragile, I don’t think this blocks anything immediately. I’ll check this again during next cycle and close if no similar problem comes up.

There are similar problems but let’s track them on ~~Bug #10988~~.