Bug #11846

The remote shell can mix up responses after an execution has been aborted

Added by anonym 2016-09-27 03:47:06. Updated 2016-11-15 18:23:34.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Test suite
Target version:
Start date:
2016-09-27
Due date:
% Done:

100%

Feature Branch:
test/11846-remote-shell-request-ids
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

From Bug #11832#note-3:

[...]
calling as amnesia: gpg --batch --list-keys '10CC5BC7'
Tor operation failed (Try 1 of 10) with:
Timeout::Error: try_for() timeout expired
Last ignored exception was: ExecutionFailedInVM: Command failed: gpg --batch --list-keys '10CC5BC7'
error code: 2
stderr: gpg: error reading key: public key not found
.
<false> is not true.
Forcing new Tor circuit...
calling as root: . /usr/local/lib/tails-shell-library/tor.sh && tor_control_send "signal NEWNYM" 
call returned: [2, "", "gpg: error reading key: public key not found\n"]
    When I fetch the "10CC5BC7" OpenPGP key using Seahorse                         # features/step_definitions/torified_gnupg.rb:154
      Command failed: . /usr/local/lib/tails-shell-library/tor.sh && tor_control_send "signal NEWNYM" 
      error code: 2
      stderr: gpg: error reading key: public key not found
      .
      <false> is not true. (ExecutionFailedInVM)
      ./features/support/helpers/vm_helper.rb:442:in `rescue in execute_successfully'
      ./features/support/helpers/vm_helper.rb:439:in `execute_successfully'
      ./features/support/helpers/misc_helpers.rb:94:in `force_new_tor_circuit'
      ./features/support/helpers/misc_helpers.rb:104:in `block in retry_tor'
      ./features/support/helpers/misc_helpers.rb:133:in `call'
      ./features/support/helpers/misc_helpers.rb:133:in `rescue in block in retry_action'
      ./features/support/helpers/misc_helpers.rb:125:in `block in retry_action'
      ./features/support/helpers/misc_helpers.rb:124:in `loop'
      ./features/support/helpers/misc_helpers.rb:124:in `retry_action'
      ./features/support/helpers/misc_helpers.rb:108:in `retry_tor'
      ./features/step_definitions/torified_gnupg.rb:172:in `/^I fetch the "([^"]+)" OpenPGP key using Seahorse( via the OpenPGP Applet)?$/'
      features/torified_gnupg.feature:22:in `When I fetch the "10CC5BC7" OpenPGP key using Seahorse'


and look closer at:

calling as root: . /usr/local/lib/tails-shell-library/tor.sh && tor_control_send "signal NEWNYM" 
call returned: [2, "", "gpg: error reading key: public key not found\n"]


I.e. the remote shell has mixed up the response with the one from the earlier gpg call. That gpg call was aborted (by a try_for timeout) while it was executing on the guest, before the response was written back. So our remote shell is too simplistic here: we need an ID that ties a response to a specific call.

Simple solution: when the client sends a request to the remote shell, in addition to the command line it also sends an ID. The remote shell puts that ID on the response. Whenever the client reads a response, it drops any with a non-matching ID (since those must belong to aborted calls).
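A minimal sketch of that client-side idea, reusing the [id, type, user, cmd] request format from the feature branch; the class, method and variable names here are illustrative, not the actual exec_helper.rb code:

require 'json'

# Illustrative sketch only: each request carries a fresh ID, the server
# echoes it back on the response, and any response whose ID does not
# match the current request is assumed to come from an earlier, aborted
# call and is silently dropped.
class RemoteShellSketch
  def initialize(socket)
    @socket = socket
    @last_id = 0
  end

  def execute(cmd, user: "root")
    id = (@last_id += 1)
    @socket.puts(JSON.dump([id, "execute", user, cmd]))
    loop do
      response_id, *rest = JSON.load(@socket.readline("\0").chomp("\0"))
      return rest if response_id == id
      # Mismatching ID: stale response from an aborted call, keep reading.
    end
  end
end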


Subtasks


Related issues

Related to Tails - Bug #10502: The test suite sometimes cannot connect to the remote shell Resolved 2015-11-06
Related to Tails - Bug #12040: Test suite cannot sometimes connect to the remote shell: "Dropped out-of-order remote shell response: got id but expected id NNNN" Confirmed 2016-12-19

History

#1 Updated by anonym 2016-09-27 04:16:44

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 30
  • QA Check set to Dev Needed
  • Feature Branch set to test/11846-remote-shell-request-ids

I’ve pushed an untested branch. Let’s hope it doesn’t break anything on Jenkins; I think that is enough of a criterion for merging, since this error is very rare.

#2 Updated by anonym 2016-09-28 03:20:32

  • Assignee changed from anonym to bertagaz
  • Target version changed from Tails_2.9.1 to Tails_2.7
  • % Done changed from 30 to 50
  • QA Check changed from Dev Needed to Ready for QA

Note: I force-pushed a fix for a syntax error.

This branch seems to work fine: two runs on Jenkins showed only unrelated problems (e.g. that one of our keys will expire in <3 months). I guess we can even try to get this bugfix into Tails 2.7.

#3 Updated by anonym 2016-09-28 10:16:18

  • Assignee changed from bertagaz to anonym
  • QA Check changed from Ready for QA to Dev Needed

While we’re working on remote shell issues, I think we should consider trying this:

--- a/features/support/helpers/exec_helper.rb
+++ b/features/support/helpers/exec_helper.rb
@@ -37,16 +37,7 @@ class VMCommand
     socket.puts(JSON.dump([id, type, options[:user], cmd]))
     loop do
       s = socket.readline(sep = "\0").chomp("\0")
-      begin
-        response_id, *rest = JSON.load(s)
-      rescue JSON::ParserError
-        # The server often returns something unparsable for the very
-        # first execute() command issued after a VM start/restore
-        # (generally from wait_until_remote_shell_is_up()) presumably
-        # because the TCP -> serial link isn't properly setup yet. All
-        # will be well after that initial hickup, so we just retry.
-        return VMCommand.execute(vm, cmd, options)
-      end
+      response_id, *rest = JSON.load(s)
       if response_id == id
         debug_log("#{type} returned: #{s}") if not(options[:spawn])
         return rest


That part seems like a hack to me now. I’ve never seen the retrying happen (it would manifest itself as two or more debug_log() calls) for any real use of the remote shell, only for the first one during VMCommand.wait_until_remote_shell_is_up(), and that method does its own retrying. Combined with the response ID guaranteeing that the client will only get a response for the actual command it issued, I think it’s safe to remove this code and any problem it may cause. So I went ahead and pushed a commit doing that. Let’s see what Jenkins thinks.
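For reference, a rough reconstruction of what the read loop looks like after this change, based on the diff above and the debug output quoted in note 8 below; it is not copied verbatim from the branch:

socket.puts(JSON.dump([id, type, options[:user], cmd]))
loop do
  s = socket.readline(sep = "\0").chomp("\0")
  # Unparsable data now raises JSON::ParserError instead of silently
  # retrying the whole command.
  response_id, *rest = JSON.load(s)
  if response_id == id
    debug_log("#{type} returned: #{s}") if not(options[:spawn])
    return rest
  else
    debug_log("Dropped out-of-order remote shell response: got id #{response_id} but expected id #{id}")
  end
end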

#4 Updated by anonym 2016-09-29 00:36:28

  • Assignee changed from anonym to bertagaz
  • QA Check changed from Dev Needed to Ready for QA

Still passes on Jenkins so let’s go for it!

#5 Updated by bertagaz 2016-10-04 13:41:23

  • Status changed from In Progress to Fix committed
  • Assignee deleted (bertagaz)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

Yay, love it. Saw it rescuing precisely what it was meant to, so merged it.

I wonder if we shouldn’t track when these unrelated previous remote shell responses happen, as that’s a sign something may not be quite right there.

#6 Updated by anonym 2016-10-05 08:41:26

  • Assignee set to bertagaz
  • QA Check changed from Pass to Info Needed

bertagaz wrote:
> Yay, love it. Saw it rescuing precisely what it was meant to, so merged it.

Wow, if you can link me to the debug.log where this happened, I would be delighted!

> I wonder if we shouldn’t track when these unrelated previous remote shell responses happen, as that’s a sign something may not be quite right there.

It would be interesting, indeed. I sometimes wish I had access to the system where the artifacts are stored, and that I could do something like grep $SOMETHING */debug.log over all artifacts…

Otherwise, what did you have in mind with “tracking”, specifically?

#7 Updated by intrigeri 2016-10-05 09:50:09

> I sometimes wish I had access to the system where the artifacts are stored, and that I could do something like grep $SOMETHING */debug.log over all artifacts…

Please file a ticket about it and we’ll discuss the implications.

#8 Updated by bertagaz 2016-10-05 13:44:42

  • Assignee changed from bertagaz to anonym

anonym wrote:
> bertagaz wrote:
> Wow, if you can link me to the debug.log where this happened, I would be delighted!

calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [238, 3, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
calling as root: journalctl SYSLOG_IDENTIFIER=restart-tor
Dropped out-of-order remote shell response: got id 239 but expected id 240
call returned: [240, 0, "-- Logs begin at Mon 2016-10-03 14:08:40 UTC, end at Mon 2016-10-03 14:14:42 UTC. --\n \
Oct 03 14:10:46 amnesia restart-tor[4265]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \ 
Oct 03 14:11:08 amnesia restart-tor[5280]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:11:31 amnesia restart-tor[6276]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:11:54 amnesia restart-tor[7325]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:12:16 amnesia restart-tor[8351]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:12:39 amnesia restart-tor[9354]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:13:02 amnesia restart-tor[10384]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:13:24 amnesia restart-tor[11416]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:13:47 amnesia restart-tor[12443]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:14:10 amnesia restart-tor[13470]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:14:32 amnesia restart-tor[14502]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n", ""]
From the journal:
  -- Logs begin at Mon 2016-10-03 14:08:40 UTC, end at Mon 2016-10-03 14:14:42 UTC. --
Oct 03 14:10:46 amnesia restart-tor[4265]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:11:08 amnesia restart-tor[5280]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:11:31 amnesia restart-tor[6276]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:11:54 amnesia restart-tor[7325]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:12:16 amnesia restart-tor[8351]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:12:39 amnesia restart-tor[9354]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:13:02 amnesia restart-tor[10384]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:13:24 amnesia restart-tor[11416]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:13:47 amnesia restart-tor[12443]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:14:10 amnesia restart-tor[13470]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:14:32 amnesia restart-tor[14502]: Tor seems to have stalled while bootstrapping. Restarting Tor.
    Step failed while creating checkpoint: Tor is ready

> Otherwise, what did you have in mind with “tracking”, specifically?

I don’t know exactly; in any case, a way to collect which steps trigger this (maybe one of the missions of the “Deal with XX 2016 false positive” task?), and probably a ticket to follow up on it?

#9 Updated by anonym 2016-10-07 13:24:30

  • Assignee deleted (anonym)
  • QA Check changed from Info Needed to Pass

bertagaz wrote:
> anonym wrote:
> > bertagaz wrote:
> > Wow, if you can link me to the debug.log where this happened, I would be delighted!
>
> […]

Cool! So this happened in a similar instance (try_for timing out in the middle of a call to the remote shell).

> > Otherwise, what did you have in mind with “tracking”, specifically?
>
> I don’t know exactly; in any case, a way to collect which steps trigger this (maybe one of the missions of the “Deal with XX 2016 false positive” task?), and probably a ticket to follow up on it?

On second thought, I don’t think tracking this will be very interesting. I’m very sure that all instances will be the same, i.e. caused by try_for timing out in the middle of a call to the remote shell. So I say, let’s not bother. Please let me know if you think I am wrong! :)

#10 Updated by intrigeri 2016-10-09 20:03:45

  • related to Bug #10502: The test suite sometimes cannot connect to the remote shell added

#11 Updated by intrigeri 2016-10-09 20:05:18

As reported on Bug #10502#note-32, this seems to break some tests on the devel branch quite often. Is this tracked elsewhere, or shall we reopen this ticket?

#12 Updated by anonym 2016-10-10 08:36:06

intrigeri wrote:
> As reported on Bug #10502#note-32, this seems to break some tests on the devel branch quite often. Is this tracked elsewhere, or shall we reopen this ticket?

The reason is that those tests were run with an --old-iso that doesn’t have the necessary changes to the remote shell; if you look at the actual errors, you can see that all of them are results of this. I think someone has realized this, because devel’s current test jobs have changed from using the Tails 2.6 ISO to a recent devel build with the required changes. Anyway, I think all is good.

#13 Updated by intrigeri 2016-10-10 18:27:12

> Anyway, I think all is good.

Cool!

#14 Updated by bertagaz 2016-11-15 18:23:34

  • Status changed from Fix committed to Resolved

#15 Updated by intrigeri 2016-12-20 10:58:27

  • related to Bug #12040: Test suite cannot sometimes connect to the remote shell: "Dropped out-of-order remote shell response: got id but expected id NNNN" added