Bug #10502

The test suite sometimes cannot connect to the remote shell

Added by bertagaz 2015-11-06 11:39:47 . Updated 2016-11-15 18:38:00 .

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Test suite
Target version:
Start date:
2015-11-06
Due date:
% Done:

100%

Feature Branch:
test/10777-robust-boot-menu
Type of work:
Research
Blueprint:

Starter:
0
Affected tool:
Deliverable for:
280

Description

We’ve had failures lately in Jenkins. It made some steps fail, and it’s not surprising given the place it has in the test suite. We should watch out the rate of this bug as well it where condition it happens to get a better plan.

Here are some excerpts:

Step: the computer reboots Tails
  Defined at: features/step_definitions/common_steps.rb:244
  Used in:    Scenario: Anti-test: Changes to the hardware clock are kept when rebooting
              features/time_syncing.feature:2
  Error trace:
    Remote shell seems to be down
    Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
    ./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
    ./features/support/helpers/misc_helpers.rb:33:in `try_for'
    ./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
    ./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
    ./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
    features/time_syncing.feature:101:in `And the computer reboots Tails'
Step: I start Tails from DVD with network unplugged and I login
  Defined at: features/step_definitions/common_steps.rb:178
  Used in:    Scenario: Upgrading an old Tails USB installation from an ISO image, running on the new version
              features/usb_upgrade.feature:2
  Error trace:
    Remote shell seems to be down
    Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
    ./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
    ./features/support/helpers/misc_helpers.rb:33:in `try_for'
    ./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
    ./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
    ./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
    ./features/step_definitions/common_steps.rb:186:in `/^I start Tails( from DVD)?( with network unplugged)?( and I login)?$/'
    features/usb_upgrade.feature:111:in `And I start Tails from DVD with network unplugged and I login'
Step: I start Tails from DVD with network unplugged and I login
  Defined at: features/step_definitions/common_steps.rb:178
  Used in:    Scenario: Upgrading an old Tails USB installation from an ISO image, running on the new version
              features/usb_upgrade.feature:2
  Error trace:
    Remote shell seems to be down
    Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
    ./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
    ./features/support/helpers/misc_helpers.rb:33:in `try_for'
    ./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
    ./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
    ./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
    ./features/step_definitions/common_steps.rb:186:in `/^I start Tails( from DVD)?( with network unplugged)?( and I login)?$/'
    features/usb_upgrade.feature:111:in `And I start Tails from DVD with network unplugged and I login'
Step: I start Tails from USB drive "old" with network unplugged and I login
  Defined at: features/step_definitions/common_steps.rb:200
  Used in:    Scenario: Upgrading an old Tails USB installation from an ISO image, running on the old version
              features/usb_upgrade.feature:2
  Error trace:
    Remote shell seems to be down
    Last ignored exception was: Timeout::Error: execution expired (Timeout::Error)
    ./features/support/helpers/misc_helpers.rb:83:in `rescue in try_for'
    ./features/support/helpers/misc_helpers.rb:33:in `try_for'
    ./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
    ./features/support/helpers/vm_helper.rb:452:in `wait_until_remote_shell_is_up'
    ./features/step_definitions/common_steps.rb:267:in `/^the computer (re)?boots Tails$/'
    ./features/step_definitions/common_steps.rb:208:in `/^I start Tails from (.+?) drive "(.+?)"(| with network unplugged)( and I login(| with(| read-only) persistence enabled))?$/'
    features/usb_upgrade.feature:100:in `When I start Tails from USB drive "old" with network unplugged and I login'

Subtasks


Related issues

Related to Tails - Bug #11846: The remote shell can mix up responses after an execution has been aborted Resolved 2016-09-27
Blocked by Tails - Bug #11866: Cannot download some artifacts from Jenkins Resolved 2016-10-07
Blocked by Tails - Bug #10777: The test suite machinery sometimes misses the boot splash Resolved 2015-12-18

History

#1 Updated by bertagaz 2015-11-06 11:40:27

  • blocks #8668 added

#2 Updated by bertagaz 2015-11-06 11:40:54

  • Description updated

#3 Updated by bertagaz 2015-11-06 11:42:20

#4 Updated by anonym 2015-11-06 11:47:47

  • Assignee set to anonym

#5 Updated by anonym 2015-11-06 11:54:49

  • Target version changed from Tails_1.8 to Tails_2.3

#6 Updated by bertagaz 2015-11-16 04:04:06

  • Target version changed from Tails_2.3 to Tails_1.8

Since the last @fragile tagging, this step has become the most important failure in the test suite (20 times since Nov 6, 2015). Some runs would have passed without this bug, I think this should be worked on more quickly, so I’m changing its target version.

#7 Updated by intrigeri 2015-11-16 04:44:37

On feature/jessie, the remote shell should absolutely not be racy at all anymore (Type=notify). Can you please confirm? If it’s indeed fixed there, perhaps the urgency of fixing it on Wheezy can be rethought a bit.

#8 Updated by bertagaz 2015-11-16 05:52:33

intrigeri wrote:
> On feature/jessie, the remote shell should absolutely not be racy at all anymore (Type=notify). Can you please confirm? If it’s indeed fixed there, perhaps the urgency of fixing it on Wheezy can be rethought a bit.

According to the test jobs logs in the Jenkins history (48 runs), it happened 16 times in this feature/jessie branch test job, but in 6 different runs only: 10, 19, 21, 22, 26, 36.

So it seems for sure more robust, but I’m not sure I’d say completely fixed. Might need a bit more investigation to get what happened.

Now if we decide we want to postpone the resolution of this ticket because feature/jessie will greatly enhance it, it will still mean we won’t release the email notification meanwhile, given the number of test jobs that fails because of this (20 times out of 52 test jobs).

But I agree this bug is quite difficult, so we may ask ourselves if it’s worth it.

#9 Updated by intrigeri 2015-11-17 03:31:26

> According to the test jobs logs in the Jenkins history (48 runs), it happened 16 times in this feature/jessie branch test job, but in 6 different runs only: 10, 19, 21, 22, 26, 36.

Thanks for investigating! It’s very interesting that we see these failures concentrated in a few runs.

I’ve looked at https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_feature-jessie/36/consoleFull and see no obvious indication that the remote shell is at fault (plenty of other reasons can cause what I see there), let alone that it’s racy (racing against what, exactly?). What I can say, if it may help, is that on feature/jessie, when tails-autotest-remote-shell.service says it has started, it means it has open its communication channels and is waiting for clients. I’ve reviewed the code again, and verified on the output of systemd-analyze plot that gdm.service starts after this has occurred.

So I’m very confident that on feature/jessie, the race condition we used to have (the Greeter shows up before the remote shell has started) is gone, and these errors are caused by something else.

I think we’ll need more debugging info to find out what the actual cause is (QEMU failing to set up communication between the host and the guest? the remote shell code not being 100% robust? some temporary performance issue causing the 30 seconds timeout to be too low? the post_snapshot_restore_hook not doing things in the right order?). E.g. do we often see that when starting from DVD, or mostly when booting from emulated USB? Do we ever see that when creating a snapshot initially (as opposed to restoring one)?

In the meantime, I suggest this ticket should be given a title that reflects more clearly what we know, instead of a guess wrt. what a reason for what we see could be.

#10 Updated by intrigeri 2015-12-05 16:02:09

bertagaz, or anyone else who knows how to extract such info: has this been happening a lot recently when testing ISO images built from feature/jessie? (I’m trying to evaluate how much of a blocker this is for Feature #10382.)

#11 Updated by intrigeri 2015-12-05 16:02:57

  • Subject changed from Remote-shell is still racy to The test suite sometimes cannot connect to the remote shell

#12 Updated by intrigeri 2015-12-05 16:03:22

  • Priority changed from Normal to Elevated

(It’s apparently blocking Feature #10382.)

#13 Updated by bertagaz 2015-12-07 01:49:46

intrigeri wrote:
> bertagaz, or anyone else who knows how to extract such info: has this been happening a lot recently when testing ISO images built from feature/jessie? (I’m trying to evaluate how much of a blocker this is for Feature #10382.)

Since the last report in Bug #10502#note-8, it happened in two other builds only: 4 times in build #75, and one time in build #76 (which was aborted).

#14 Updated by intrigeri 2015-12-07 05:57:11

  • Assignee changed from anonym to bertagaz

> Since the last report in Bug #10502#note-8,

(That is: in 3 weeks.)

> it happened in two other builds only: 4 times in build #75, and one time in build #76 (which was aborted).

OK, so:

  • for build 75: I see several bugs in the test suite (that you should file as subtask of Bug #10288 since it’s blocking your work) during that run, that make tons of scenarios fail anyway
  • for build 76: with my changes for Bug #10718 in, the developer would have (rightfully) got a notification anyway

So in the worst case, on feature/jessie this problem would have triggered one undeserved notification in 3 weeks. I think it’s OK. bertagaz, can you please come up with a plan that makes us not block on resolving this on Wheezy?

#15 Updated by bertagaz 2015-12-15 03:36:09

  • Target version changed from Tails_1.8 to Tails_2.0

Postponing

#16 Updated by intrigeri 2015-12-18 08:38:57

intrigeri wrote:
> So in the worst case, on feature/jessie this problem would have triggered one undeserved notification in 3 weeks. I think it’s OK. bertagaz, can you please come up with a plan that makes us not block on resolving this on Wheezy?

Like: enabling notifications (Feature #10382) for all branches except stable, since devel and experimental are now Jessie-based. I doubt we’ll prepare many branches based on stable (Wheezy) until the release of Tails 2.0 (Jessie), so even if they are affected, the amount of false positive notifications should be acceptable IMO.

Assuming we agree on this, next step on this ticket becomes essentially: keep an eye on this on test suite runs of branches based on stable, quickly react if it is indeed a problem (turn off notifications), and if not reevaluate the whole thing in 6 weeks (sorry for the hardcoded magic number).

Unless someone objects soonish I think we should just go ahead on Tuesday.

Thoughts?

#17 Updated by intrigeri 2015-12-20 02:28:30

It happened today on devel: https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/cucumberTestReport/various-checks/apparmor-is-enabled-and-has-enforced-profiles/ — and last time I’ve see it on Tails/Jessie the same scenario was the one failing. I suspect it’s not pure chance that this happens for the first scenario that actually boots a Tails, while following scenarios are just fine.

On https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/artifact/build-artifacts/00%3A02%3A35_AppArmor_is_enabled_and_has_enforced_profiles.mkv I see that the syslinux cmdline editor is never entered, and autotest_never_use_this_option is not typed, which explains why the remote shell can’t be reached later. anonym suggested the other day to increase the delay in screen.waitVanish(bootsplash_tab_msg, 1) IIRC.

anonym, can you please take a look?

#18 Updated by intrigeri 2015-12-20 02:43:12

  • Status changed from Confirmed to In Progress
  • Assignee changed from bertagaz to anonym

#19 Updated by anonym 2016-01-27 14:36:49

  • Target version changed from Tails_2.0 to Tails_2.2

#20 Updated by intrigeri 2016-02-05 14:53:36

  • related to Bug #10777: The test suite machinery sometimes misses the boot splash added

#21 Updated by intrigeri 2016-02-05 14:54:52

intrigeri wrote:
> On https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/artifact/build-artifacts/00%3A02%3A35_AppArmor_is_enabled_and_has_enforced_profiles.mkv I see that the syslinux cmdline editor is never entered, and autotest_never_use_this_option is not typed, which explains why the remote shell can’t be reached later. anonym suggested the other day to increase the delay in screen.waitVanish(bootsplash_tab_msg, 1) IIRC.

It might be that in the end, Bug #10777 is the root cause of this problem. I say let’s get Bug #10777 fixed and come back here a month later to see if we still see it happen.

#22 Updated by anonym 2016-02-20 13:23:28

  • Target version changed from Tails_2.2 to Tails_2.3

I think we’ll have Bug #10777 fixed for 2.2, so let’s see how it works out.

#23 Updated by anonym 2016-04-20 10:56:33

  • Target version changed from Tails_2.3 to Tails_2.4

#24 Updated by anonym 2016-05-10 07:36:20

  • Deliverable for changed from 270 to SponsorS_Internal

#25 Updated by anonym 2016-06-08 01:35:00

  • Target version changed from Tails_2.4 to Tails_2.5

#26 Updated by anonym 2016-06-15 13:13:44

intrigeri wrote:
> intrigeri wrote:
> > On https://jenkins.tails.boum.org/job/test_Tails_ISO_devel/89/artifact/build-artifacts/00%3A02%3A35_AppArmor_is_enabled_and_has_enforced_profiles.mkv I see that the syslinux cmdline editor is never entered, and autotest_never_use_this_option is not typed, which explains why the remote shell can’t be reached later. anonym suggested the other day to increase the delay in screen.waitVanish(bootsplash_tab_msg, 1) IIRC.
>
> It might be that in the end, Bug #10777 is the root cause of this problem. I say let’s get Bug #10777 fixed and come back here a month later to see if we still see it happen.

Indeed, given your quote, it sounds like a very reasonable explanation (without “autotest_never_use_this_option” the remote shell won’t be started) (but my suggestion in it sucks! sorry I cannot always be great :)). Given my recent attempt at solving Bug #10777, I am hopeful!

#27 Updated by intrigeri 2016-07-18 06:28:33

  • Target version changed from Tails_2.5 to Tails_2.7

#28 Updated by anonym 2016-10-07 12:50:25

  • blocked by Bug #11866: Cannot download some artifacts from Jenkins added

#29 Updated by anonym 2016-10-07 15:00:30

  • Assignee deleted (anonym)
  • % Done changed from 0 to 50
  • Feature Branch set to test/10777-robust-boot-menu

anonym wrote:
> intrigeri wrote:
> > It might be that in the end, Bug #10777 is the root cause of this problem. I say let’s get Bug #10777 fixed and come back here a month later to see if we still see it happen.
>
> Indeed, given your quote, it sounds like a very reasonable explanation (without “autotest_never_use_this_option” the remote shell won’t be started)

When doing the September @fragile tagging I noticed four instances of Remote shell seems to be down, with these videos:

In all of them, “autotest_never_use_this_option” is properly added, but then the system reboots suddenly during early init, and since this is in the stable and devel branches there’s no fix for Bug #10777, so we won’t try to add it a second time.

First of all: WTF, why the sudden reboot? Second: I don’t care, the fix for Bug #10777 will clearly fix this, and for now I will assume that Bug #10777 indeed is the only cause for the remote shell being down.

#30 Updated by anonym 2016-10-07 15:00:44

  • related to deleted (Bug #10777: The test suite machinery sometimes misses the boot splash)

#31 Updated by anonym 2016-10-07 15:00:53

  • blocked by Bug #10777: The test suite machinery sometimes misses the boot splash added

#32 Updated by intrigeri 2016-10-09 20:00:36

This happened again during almost every test suite run on the devel branch since October 5. The branch for Bug #11846 was merged on October 4 so I guess it’s realted. autotest_never_use_this_option was properly added, and there’s no “sudden reboot”, so this does not look like Bug #10777 but really a problem with the communication with the remote shell.

By the way: anonym, you’ve removed yourself from the assignee field, but target version is still 2.7, so I wonder what’s the plan.

#33 Updated by intrigeri 2016-10-09 20:03:45

  • related to Bug #11846: The remote shell can mix up responses after an execution has been aborted added

#34 Updated by anonym 2016-10-10 08:37:05

  • Assignee set to anonym

intrigeri wrote:
> This happened again during almost every test suite run on the devel branch since October 5. The branch for Bug #11846 was merged on October 4 so I guess it’s realted. autotest_never_use_this_option was properly added, and there’s no “sudden reboot”, so this does not look like Bug #10777 but really a problem with the communication with the remote shell.

False alarm, most likely. See: Bug #11846#note-12

> By the way: anonym, you’ve removed yourself from the assignee field, but target version is still 2.7, so I wonder what’s the plan.

Thanks! That was a mistake.

#35 Updated by anonym 2016-10-10 09:27:21

  • Assignee changed from anonym to bertagaz
  • QA Check set to Ready for QA

anonym wrote:
> intrigeri wrote:
> > This happened again during almost every test suite run on the devel branch since October 5. The branch for Bug #11846 was merged on October 4 so I guess it’s realted. autotest_never_use_this_option was properly added, and there’s no “sudden reboot”, so this does not look like Bug #10777 but really a problem with the communication with the remote shell.
>
> False alarm, most likely. See: Bug #11846#note-12

Do we have an agreement here? If so, I want to merge this branch.

bertagaz, was it you that switched the --old-iso?

> > By the way: anonym, you’ve removed yourself from the assignee field, but target version is still 2.7, so I wonder what’s the plan.
>
> Thanks! That was a mistake.

#36 Updated by bertagaz 2016-10-10 10:29:24

anonym wrote:
> anonym wrote:
> > False alarm, most likely. See: Bug #11846#note-12
>
> Do we have an agreement here? If so, I want to merge this branch.
>
> bertagaz, was it you that switched the --old-iso?

Yes, I did. We discussed that on the tails-rm@ list.

#37 Updated by intrigeri 2016-10-10 18:28:28

> Do we have an agreement here?

Yes!

#38 Updated by bertagaz 2016-10-30 09:34:55

So I’ll wait a few base branch test runs to see if this really goes away.

#39 Updated by bertagaz 2016-11-06 09:13:25

  • Status changed from In Progress to Fix committed
  • Assignee deleted (bertagaz)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

Bug #10777 has been merges, so I guess it means this ticket is fixed too nos, as they were related. There’s been new type of failures with the remote shell (ECONNRESET) but I’ll open a ticket for that.

#40 Updated by bertagaz 2016-11-15 18:38:00

  • Status changed from Fix committed to Resolved