Bug #10502
The test suite sometimes cannot connect to the remote shell
100%
Description
We’ve had failures lately in Jenkins. It made some steps fail, and it’s not surprising given the place it has in the test suite. We should watch out the rate of this bug as well it where condition it happens to get a better plan.
Here are some excerpts:
Step: the computer reboots Tails
Defined at: features/step_definitions/common_steps.rb:244
Used in: Scenario: Anti-test: Changes to the hardware clock are kept when rebooting
features/time_syncing.feature:2
Error trace:
Remote shell seems to be down
Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
./features/support/helpers/misc_helpers.rb:33:in `try_for'
./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
features/time_syncing.feature:101:in `And the computer reboots Tails'
Step: I start Tails from DVD with network unplugged and I login
Defined at: features/step_definitions/common_steps.rb:178
Used in: Scenario: Upgrading an old Tails USB installation from an ISO image, running on the new version
features/usb_upgrade.feature:2
Error trace:
Remote shell seems to be down
Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
./features/support/helpers/misc_helpers.rb:33:in `try_for'
./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
./features/step_definitions/common_steps.rb:186:in `/^I start Tails( from DVD)?( with network unplugged)?( and I login)?$/'
features/usb_upgrade.feature:111:in `And I start Tails from DVD with network unplugged and I login'
Step: I start Tails from DVD with network unplugged and I login
Defined at: features/step_definitions/common_steps.rb:178
Used in: Scenario: Upgrading an old Tails USB installation from an ISO image, running on the new version
features/usb_upgrade.feature:2
Error trace:
Remote shell seems to be down
Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
./features/support/helpers/misc_helpers.rb:33:in `try_for'
./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
./features/step_definitions/common_steps.rb:186:in `/^I start Tails( from DVD)?( with network unplugged)?( and I login)?$/'
features/usb_upgrade.feature:111:in `And I start Tails from DVD with network unplugged and I login'
Step: I start Tails from USB drive "old" with network unplugged and I login
Defined at: features/step_definitions/common_steps.rb:200
Used in: Scenario: Upgrading an old Tails USB installation from an ISO image, running on the old version
features/usb_upgrade.feature:2
Error trace:
Remote shell seems to be down
Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
./features/support/helpers/misc_helpers.rb:33:in `try_for'
./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
./features/step_definitions/common_steps.rb:208:in `/^I start Tails from (.+?) drive "(.+?)"(| with network unplugged)( and I login(| with(| read-only) persistence enabled))?$/'
features/usb_upgrade.feature:100:in `When I start Tails from USB drive "old" with network unplugged and I login'
Subtasks
Related issues
Related to Tails - |
Resolved | 2016-09-27 | |
Blocked by Tails - |
Resolved | 2016-10-07 | |
Blocked by Tails - |
Resolved | 2015-12-18 |
History
#1 Updated by bertagaz 2015-11-06 11:40:27
- blocks #8668 added
#2 Updated by bertagaz 2015-11-06 11:40:54
- Description updated
#3 Updated by bertagaz 2015-11-06 11:42:20
- Parent task set to Bug #10288
#4 Updated by anonym 2015-11-06 11:47:47
- Assignee set to anonym
#5 Updated by anonym 2015-11-06 11:54:49
- Target version changed from Tails_1.8 to Tails_2.3
#6 Updated by bertagaz 2015-11-16 04:04:06
- Target version changed from Tails_2.3 to Tails_1.8
Since the last @fragile tagging, this step has become the most important failure in the test suite (20 times since Nov 6, 2015). Some runs would have passed without this bug, I think this should be worked on more quickly, so I’m changing its target version.
#7 Updated by intrigeri 2015-11-16 04:44:37
On feature/jessie, the remote shell should absolutely not be racy at all anymore (Type=notify
). Can you please confirm? If it’s indeed fixed there, perhaps the urgency of fixing it on Wheezy can be rethought a bit.
#8 Updated by bertagaz 2015-11-16 05:52:33
intrigeri wrote:
> On feature/jessie, the remote shell should absolutely not be racy at all anymore (Type=notify
). Can you please confirm? If it’s indeed fixed there, perhaps the urgency of fixing it on Wheezy can be rethought a bit.
According to the test jobs logs in the Jenkins history (48 runs), it happened 16 times in this feature/jessie branch test job, but in 6 different runs only: 10, 19, 21, 22, 26, 36.
So it seems for sure more robust, but I’m not sure I’d say completely fixed. Might need a bit more investigation to get what happened.
Now if we decide we want to postpone the resolution of this ticket because feature/jessie will greatly enhance it, it will still mean we won’t release the email notification meanwhile, given the number of test jobs that fails because of this (20 times out of 52 test jobs).
But I agree this bug is quite difficult, so we may ask ourselves if it’s worth it.
#9 Updated by intrigeri 2015-11-17 03:31:26
> According to the test jobs logs in the Jenkins history (48 runs), it happened 16 times in this feature/jessie branch test job, but in 6 different runs only: 10, 19, 21, 22, 26, 36.
Thanks for investigating! It’s very interesting that we see these failures concentrated in a few runs.
I’ve looked at https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/36/consoleFull and see no obvious indication that the remote shell is at fault (plenty of other reasons can cause what I see there), let alone that it’s racy (racing against what, exactly?). What I can say, if it may help, is that on feature/jessie, when tails-autotest-remote-shell.service
says it has started, it means it has open its communication channels and is waiting for clients. I’ve reviewed the code again, and verified on the output of systemd-analyze plot
that gdm.service
starts after this has occurred.
So I’m very confident that on feature/jessie, the race condition we used to have (the Greeter shows up before the remote shell has started) is gone, and these errors are caused by something else.
I think we’ll need more debugging info to find out what the actual cause is (QEMU failing to set up communication between the host and the guest? the remote shell code not being 100% robust? some temporary performance issue causing the 30 seconds timeout to be too low? the post_snapshot_restore_hook
not doing things in the right order?). E.g. do we often see that when starting from DVD, or mostly when booting from emulated USB? Do we ever see that when creating a snapshot initially (as opposed to restoring one)?
In the meantime, I suggest this ticket should be given a title that reflects more clearly what we know, instead of a guess wrt. what a reason for what we see could be.
#10 Updated by intrigeri 2015-12-05 16:02:09
bertagaz, or anyone else who knows how to extract such info: has this been happening a lot recently when testing ISO images built from feature/jessie? (I’m trying to evaluate how much of a blocker this is for Feature #10382.)
#11 Updated by intrigeri 2015-12-05 16:02:57
- Subject changed from Remote-shell is still racy to The test suite sometimes cannot connect to the remote shell
#12 Updated by intrigeri 2015-12-05 16:03:22
- Priority changed from Normal to Elevated
(It’s apparently blocking Feature #10382.)
#13 Updated by bertagaz 2015-12-07 01:49:46
intrigeri wrote:
> bertagaz, or anyone else who knows how to extract such info: has this been happening a lot recently when testing ISO images built from feature/jessie? (I’m trying to evaluate how much of a blocker this is for Feature #10382.)
Since the last report in Bug #10502#note-8, it happened in two other builds only: 4 times in build #75, and one time in build #76 (which was aborted).
#14 Updated by intrigeri 2015-12-07 05:57:11
- Assignee changed from anonym to bertagaz
> Since the last report in Bug #10502#note-8,
(That is: in 3 weeks.)
> it happened in two other builds only: 4 times in build #75, and one time in build #76 (which was aborted).
OK, so:
- for build 75: I see several bugs in the test suite (that you should file as subtask of Bug #10288 since it’s blocking your work) during that run, that make tons of scenarios fail anyway
- for build 76: with my changes for
Bug #10718in, the developer would have (rightfully) got a notification anyway
So in the worst case, on feature/jessie this problem would have triggered one undeserved notification in 3 weeks. I think it’s OK. bertagaz, can you please come up with a plan that makes us not block on resolving this on Wheezy?
#15 Updated by bertagaz 2015-12-15 03:36:09
- Target version changed from Tails_1.8 to Tails_2.0
Postponing
#16 Updated by intrigeri 2015-12-18 08:38:57
intrigeri wrote:
> So in the worst case, on feature/jessie this problem would have triggered one undeserved notification in 3 weeks. I think it’s OK. bertagaz, can you please come up with a plan that makes us not block on resolving this on Wheezy?
Like: enabling notifications (Feature #10382) for all branches except stable, since devel and experimental are now Jessie-based. I doubt we’ll prepare many branches based on stable (Wheezy) until the release of Tails 2.0 (Jessie), so even if they are affected, the amount of false positive notifications should be acceptable IMO.
Assuming we agree on this, next step on this ticket becomes essentially: keep an eye on this on test suite runs of branches based on stable, quickly react if it is indeed a problem (turn off notifications), and if not reevaluate the whole thing in 6 weeks (sorry for the hardcoded magic number).
Unless someone objects soonish I think we should just go ahead on Tuesday.
Thoughts?
#17 Updated by intrigeri 2015-12-20 02:28:30
It happened today on devel: https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/cucumberTestReport/various-checks/apparmor-is-enabled-and-has-enforced-profiles/ — and last time I’ve see it on Tails/Jessie the same scenario was the one failing. I suspect it’s not pure chance that this happens for the first scenario that actually boots a Tails, while following scenarios are just fine.
On https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/artifact/build-artifacts/00%3A02%3A35_AppArmor_is_enabled_and_has_enforced_profiles.mkv I see that the syslinux cmdline editor is never entered, and autotest_never_use_this_option
is not typed, which explains why the remote shell can’t be reached later. anonym suggested the other day to increase the delay in screen.waitVanish(bootsplash_tab_msg, 1)
IIRC.
anonym, can you please take a look?
#18 Updated by intrigeri 2015-12-20 02:43:12
- Status changed from Confirmed to In Progress
- Assignee changed from bertagaz to anonym
#19 Updated by anonym 2016-01-27 14:36:49
- Target version changed from Tails_2.0 to Tails_2.2
#20 Updated by intrigeri 2016-02-05 14:53:36
- related to
Bug #10777: The test suite machinery sometimes misses the boot splash added
#21 Updated by intrigeri 2016-02-05 14:54:52
intrigeri wrote:
> On https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/artifact/build-artifacts/00%3A02%3A35_AppArmor_is_enabled_and_has_enforced_profiles.mkv I see that the syslinux cmdline editor is never entered, and autotest_never_use_this_option
is not typed, which explains why the remote shell can’t be reached later. anonym suggested the other day to increase the delay in screen.waitVanish(bootsplash_tab_msg, 1)
IIRC.
It might be that in the end, Bug #10777 is the root cause of this problem. I say let’s get Bug #10777 fixed and come back here a month later to see if we still see it happen.
#22 Updated by anonym 2016-02-20 13:23:28
- Target version changed from Tails_2.2 to Tails_2.3
I think we’ll have Bug #10777 fixed for 2.2, so let’s see how it works out.
#23 Updated by anonym 2016-04-20 10:56:33
- Target version changed from Tails_2.3 to Tails_2.4
#24 Updated by anonym 2016-05-10 07:36:20
- Deliverable for changed from 270 to SponsorS_Internal
#25 Updated by anonym 2016-06-08 01:35:00
- Target version changed from Tails_2.4 to Tails_2.5
#26 Updated by anonym 2016-06-15 13:13:44
intrigeri wrote:
> intrigeri wrote:
> > On https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/artifact/build-artifacts/00%3A02%3A35_AppArmor_is_enabled_and_has_enforced_profiles.mkv I see that the syslinux cmdline editor is never entered, and autotest_never_use_this_option
is not typed, which explains why the remote shell can’t be reached later. anonym suggested the other day to increase the delay in screen.waitVanish(bootsplash_tab_msg, 1)
IIRC.
>
> It might be that in the end, Bug #10777 is the root cause of this problem. I say let’s get Bug #10777 fixed and come back here a month later to see if we still see it happen.
Indeed, given your quote, it sounds like a very reasonable explanation (without “autotest_never_use_this_option
” the remote shell won’t be started) (but my suggestion in it sucks! sorry I cannot always be great :)). Given my recent attempt at solving Bug #10777, I am hopeful!
#27 Updated by intrigeri 2016-07-18 06:28:33
- Target version changed from Tails_2.5 to Tails_2.7
#28 Updated by anonym 2016-10-07 12:50:25
- blocked by
Bug #11866: Cannot download some artifacts from Jenkins added
#29 Updated by anonym 2016-10-07 15:00:30
- Assignee deleted (
anonym) - % Done changed from 0 to 50
- Feature Branch set to test/10777-robust-boot-menu
anonym wrote:
> intrigeri wrote:
> > It might be that in the end, Bug #10777 is the root cause of this problem. I say let’s get Bug #10777 fixed and come back here a month later to see if we still see it happen.
>
> Indeed, given your quote, it sounds like a very reasonable explanation (without “autotest_never_use_this_option
” the remote shell won’t be started)
When doing the September @fragile
tagging I noticed four instances of Remote shell seems to be down
, with these videos:
- https://jenkins.tails.boum.org/job/test_Tails_ISO_stable/443/artifact/build-artifacts/02%3A43%3A23_VirtualBox_guest_modules_are_available.mkv
- https://jenkins.tails.boum.org/job/test_Tails_ISO_stable/456/artifact/build-artifacts/00%3A14%3A02_Memory_erasure_on_an_old_computer.mkv
- https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/530/artifact/build-artifacts/02%3A47%3A38_VirtualBox_guest_modules_are_available.mkv
- https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/526/artifact/build-artifacts/00%3A11%3A32_Anti-test%3A_no_memory_erasure_on_an_old_computer.mkv
In all of them, “autotest_never_use_this_option
” is properly added, but then the system reboots suddenly during early init, and since this is in the stable
and devel
branches there’s no fix for Bug #10777, so we won’t try to add it a second time.
First of all: WTF, why the sudden reboot? Second: I don’t care, the fix for Bug #10777 will clearly fix this, and for now I will assume that Bug #10777 indeed is the only cause for the remote shell being down.
#30 Updated by anonym 2016-10-07 15:00:44
- related to deleted (
)Bug #10777: The test suite machinery sometimes misses the boot splash
#31 Updated by anonym 2016-10-07 15:00:53
- blocked by
Bug #10777: The test suite machinery sometimes misses the boot splash added
#32 Updated by intrigeri 2016-10-09 20:00:36
This happened again during almost every test suite run on the devel branch since October 5. The branch for Bug #11846 was merged on October 4 so I guess it’s realted. autotest_never_use_this_option
was properly added, and there’s no “sudden reboot”, so this does not look like Bug #10777 but really a problem with the communication with the remote shell.
By the way: anonym, you’ve removed yourself from the assignee field, but target version is still 2.7, so I wonder what’s the plan.
#33 Updated by intrigeri 2016-10-09 20:03:45
- related to
Bug #11846: The remote shell can mix up responses after an execution has been aborted added
#34 Updated by anonym 2016-10-10 08:37:05
- Assignee set to anonym
intrigeri wrote:
> This happened again during almost every test suite run on the devel branch since October 5. The branch for Bug #11846 was merged on October 4 so I guess it’s realted. autotest_never_use_this_option
was properly added, and there’s no “sudden reboot”, so this does not look like Bug #10777 but really a problem with the communication with the remote shell.
False alarm, most likely. See: Bug #11846#note-12
> By the way: anonym, you’ve removed yourself from the assignee field, but target version is still 2.7, so I wonder what’s the plan.
Thanks! That was a mistake.
#35 Updated by anonym 2016-10-10 09:27:21
- Assignee changed from anonym to bertagaz
- QA Check set to Ready for QA
anonym wrote:
> intrigeri wrote:
> > This happened again during almost every test suite run on the devel branch since October 5. The branch for Bug #11846 was merged on October 4 so I guess it’s realted. autotest_never_use_this_option
was properly added, and there’s no “sudden reboot”, so this does not look like Bug #10777 but really a problem with the communication with the remote shell.
>
> False alarm, most likely. See: Bug #11846#note-12
Do we have an agreement here? If so, I want to merge this branch.
bertagaz, was it you that switched the --old-iso
?
> > By the way: anonym, you’ve removed yourself from the assignee field, but target version is still 2.7, so I wonder what’s the plan.
>
> Thanks! That was a mistake.
#36 Updated by bertagaz 2016-10-10 10:29:24
anonym wrote:
> anonym wrote:
> > False alarm, most likely. See: Bug #11846#note-12
>
> Do we have an agreement here? If so, I want to merge this branch.
>
> bertagaz, was it you that switched the --old-iso
?
Yes, I did. We discussed that on the tails-rm@ list.
#38 Updated by bertagaz 2016-10-30 09:34:55
So I’ll wait a few base branch test runs to see if this really goes away.
#39 Updated by bertagaz 2016-11-06 09:13:25
- Status changed from In Progress to Fix committed
- Assignee deleted (
bertagaz) - % Done changed from 50 to 100
- QA Check changed from Ready for QA to Pass
#40 Updated by bertagaz 2016-11-15 18:38:00
- Status changed from Fix committed to Resolved