Feature #9516

Restart Tor if bootstrapping stalls for too long

Added by anonym 2015-06-02 14:02:59 . Updated 2015-11-03 11:31:04 .

Status:
Resolved
Priority:
High
Assignee:
Category:
Test suite
Target version:
Start date:
2015-06-02
Due date:
% Done:

100%

Feature Branch:
kytv/feature/9516-restart-tor-on-bootstrap-failure
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

This is both w.r.t. the initial Tor bootstrap, and after that, when we reconnect to the Tor network again after restoring a snapshot. We may have to retry multiple times in a row.


Files


Subtasks


Related issues

Related to Tails - Bug #9478: How to deal with transient network errors in the test suite? Resolved 2015-05-27
Related to Tails - Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping Rejected 2015-09-23
Related to Tails - Bug #10537: Fix restart-tor and tor-has-bootstrapped semantics vs. test suite on Jessie Resolved 2015-11-12

History

#1 Updated by anonym 2015-06-02 14:07:52

Since we always have to run wait_until_tor_is_working() when Tor is connecting to the network, it may be a good place to fix this. The following might be a good starting point (tl;dr: restart Tor when the bootstrap progress has stalled for more than 15 seconds):

--- a/features/support/helpers/misc_helpers.rb
+++ b/features/support/helpers/misc_helpers.rb
@@ -75,8 +75,23 @@ rescue unique_timeout_exception => e
 end

 def wait_until_tor_is_working
-  try_for(270) { @vm.execute(
-    '. /usr/local/lib/tails-shell-library/tor.sh; tor_is_working').success? }
+  bootstrap_progress = 0
+  last_bootstrap_change = Time.now
+  try_for(270) do
+    new_bootstrap_progress = @vm.execute_successfully('. /usr/local/lib/tails-shell-library/tor.sh; tor_bootstrap_progress').stdout.to_i
+    if new_bootstrap_progress == 100
+      return
+    elsif new_bootstrap_progress > bootstrap_progress
+      bootstrap_progress = new_bootstrap_progress
+      last_bootstrap_change = Time.now
+    elsif Time.now - last_bootstrap_change > 15
+      STDERR.puts "Tor bootstrap stalled at #{bootstrap_progress}% for 15 seconds, restarting Tor..." if $config["DEBUG"]
+      @vm.execute_successfully('service tor restart')
+      bootstrap_progress = 0
+      last_bootstrap_change = Time.now
+    end
+    false
+  end
 end

 def convert_bytes_mod(unit)

I’ve had to use this myself when running the test suite in some places.

#2 Updated by anonym 2015-06-02 14:19:25

  • blocks #8538 added

#3 Updated by anonym 2015-06-02 14:36:28

  • related to Bug #9478: How to deal with transient network errors in the test suite? added

#4 Updated by intrigeri 2015-06-04 21:25:28

  • related to deleted (Bug #9478: How to deal with transient network errors in the test suite?)

#5 Updated by intrigeri 2015-06-04 21:25:32

  • blocked by deleted (#8538)

#6 Updated by intrigeri 2015-06-04 21:29:59

  • related to Bug #9478: How to deal with transient network errors in the test suite? added

#7 Updated by intrigeri 2015-06-04 21:30:05

  • blocks #8538 added

#8 Updated by kytv 2015-06-13 14:46:31

  • Status changed from Confirmed to In Progress
  • Assignee changed from kytv to anonym
  • QA Check set to Info Needed

anonym wrote:
> Since we always have to run wait_until_tor_is_working() when Tor is connecting to the network, it may be a good place to fix this. The following might be a good starting point (tl;dr: restart Tor when the bootstrap progress has stalled for more than 15 seconds):
> […]

What if we do this outside of the test suite, e.g. for all users in /usr/local/sbin/restart-tor or the nm hooks?

I converted/translated your ruby code into shell and am building an ISO for testing.

#9 Updated by anonym 2015-06-16 05:16:16

  • Assignee changed from anonym to kytv
  • QA Check changed from Info Needed to Dev Needed

This may be a good idea since it will help users that experience some types of bootstrap issues.

Also, I noted that my hack in wait_until_tor_is_working() can cause a race condition; if it just decided to restart Tor, but Tor bootstraps right after and our NetworkManager hook noticed that Tor is working and proceeds, then htpdate will fail since the SOCKSPort isn’t there. Quite unlikely, but yeah.

If we do this from restart-tor (another option would be the Tor init script) we need to always use it when (re)starting Tor, so we have some places where it needs to be introduced: git grep "service tor \(re\)\?start" -- config/ features/

#10 Updated by kytv 2015-06-19 13:00:00

All right. If this ends up being deemed good, excellent. :) I’m quite pleased with how it’s working thus far, but I’m still testing and not quite ready to submit it for review.

#11 Updated by kytv 2015-06-22 13:09:15

  • Assignee changed from kytv to intrigeri
  • % Done changed from 0 to 30
  • QA Check changed from Dev Needed to Ready for QA
  • Feature Branch set to kytv:test/9516-tor-bootstrap

I’ve been using an ISO built from this branch for more than a week while working on the other tickets in this group (Feature #9517, Feature #9518). I’ve also run the entire test suite a couple of times with an ISO built from this branch as well. Tor has not failed to bootstrap and while running with --debug it’s been obvious that the changes in this branch have helped to ensure that Tor is ready more quickly than without it.

I’m confident that it will improve things for both testers and users.

#12 Updated by kytv 2015-06-25 18:07:04

  • Assignee changed from intrigeri to kytv
  • Target version changed from Tails_1.4.1 to Tails_1.5
  • QA Check changed from Ready for QA to Dev Needed

I’ll take this one back to do a bit more testing.

#13 Updated by kytv 2015-06-26 09:10:19

  • % Done changed from 30 to 0
  • Feature Branch deleted (kytv:test/9516-tor-bootstrap)

Upon further testing I see that there are problems which are more frequently evident in the time_syncing.feature when using this branch.

#14 Updated by anonym 2015-06-27 06:15:34

kytv wrote:
> Upon further testing I see that there are problems which are more frequently evident in the time_syncing.feature when using this branch.

Can you re-push kytv:test/9516-tor-bootstrap? I can probably provide some help with any tordate-related issues.

#15 Updated by intrigeri 2015-07-01 10:25:44

So, is the current theory to do that in the test suite only, on directly in Tails itself? If the latter, please update the ticket’s description and category.

#16 Updated by kytv 2015-08-04 04:01:41

  • Target version changed from Tails_1.5 to Tails_1.6

#17 Updated by anonym 2015-08-13 12:17:21

anonym wrote:
> kytv wrote:
> > Upon further testing I see that there are problems which are more frequently evident in the time_syncing.feature when using this branch.
>
> Can you re-push kytv:test/9516-tor-bootstrap? I can probably provide some help with any tordate-related issues.

Ping?

#18 Updated by anonym 2015-08-13 12:30:15

anonym wrote:
> Also, I noted that my hack in wait_until_tor_is_working() can cause a race condition; if it just decided to restart Tor, but Tor bootstraps right after and our NetworkManager hook noticed that Tor is working and proceeds, then htpdate will fail since the SOCKSPort isn’t there. Quite unlikely, but yeah.

I believe we’d have similar race if implemented in Tails proper. We’d need to ensure that we block other things from starting to use Tor until the new bootstrap check has succeeded and we won’t restart Tor. For the htpdate case above it could be implemented by making wait_for_working_tor (which is used to “queue” its start in 20-time.sh) also wait until this new bootstrap checker is done. However, there may be other (potentially future) places where this will cause issues. Not sure if there’s any good general solution. Starting Tor without a SocksPort and only adding it once the bootstrap check is done? Sounds crazy.

#19 Updated by kytv 2015-08-31 09:58:10

  • Feature Branch set to kytv:test/9516-tor-bootstrap

Here’s my WIP branch.

#20 Updated by kytv 2015-08-31 10:01:31

  • Assignee changed from kytv to anonym
  • QA Check changed from Dev Needed to Ready for QA

…and assigning to anonym for his perusal / comments / suggestions.

#21 Updated by kytv 2015-08-31 12:54:54

Applied in changeset commit:7f798ea57ad586e4b4d554d9d5c7ff403b434033.

#22 Updated by kytv 2015-08-31 12:54:55

Applied in changeset commit:400a4323f53489d2537977db8644f8483aad091b.

#23 Updated by anonym 2015-08-31 13:09:04

  • Assignee changed from anonym to kytv
  • QA Check changed from Ready for QA to Dev Needed

kytv wrote:
> …and assigning to anonym for his perusal / comments / suggestions.

From commit 7f798ea:
> +try_for 270 maybe_restart_tor

Maybe we should check this return status and show a helpful notification on failure? Previously we haven’t done anything, we’ve just continue and “optimistically” hoped stuff would work (logging stuff only to syslog, which isn’t very user friendly), but while we’re thinking about these things, perhaps we can improve that? We could say, please try reconnecting to the network (á Window-style). :)

From commit 0101de3;
> don’t use restart-tor when using a bridge

We discussed this obscure commit over IRC. I pushed something that I think explains things better, and improves a few other things, in commit: e0c7c1fc in the branch named the same as yours (test/9516-tor-bootstrap) in Tails Git repo.

Does it make sense?

#24 Updated by anonym 2015-08-31 13:09:45

  • % Done changed from 0 to 60

#25 Updated by kytv 2015-08-31 15:43:56

  • % Done changed from 60 to 0

anonym wrote:

> Does it make sense?

Absolutely.

#26 Updated by kytv 2015-08-31 15:44:12

  • % Done changed from 0 to 60

#27 Updated by anonym 2015-09-01 06:34:10

I still think there is a minimal risk for some crazy races with this
vs tails-upgrade-checker-wrapper and tails-security-check-wrapper
imagine this crazy scenario:
restart-tor just times out so it wants to restart Tor
but right after that Tor bootstraps, but before restart-tor restarts tor
one of these wrappers notices this, via /usr/local/sbin/tor-has-bootstrapped
and they start fetching something using Tor's SocksPort
and then restart-tor restarts Tor
=> the wrapper fails
all that must happen in a split second, so the "race window" is very tiny and unlikely
those wrappers are started via xdg
it would be better if they were started via NM hooks too
so blocking is ensured until Tor actually is working
I think we can bring much more sense to this using systemd
once we're based on Jessie

#28 Updated by intrigeri 2015-09-01 08:48:20

> those wrappers are started via xdg
> it would be better if they were started via NM hooks too
> so blocking is ensured until Tor actually is working
> I think we can bring much more sense to this using systemd
> once we’re based on Jessie

IIRC I’ve already ported those services to systemd on feature/jessie.

#29 Updated by kytv 2015-09-22 03:49:32

  • Target version changed from Tails_1.6 to Tails_1.7

#30 Updated by intrigeri 2015-09-22 06:07:27

  • Priority changed from Normal to High

The deployment this is blocking is happening.

#31 Updated by anonym 2015-09-23 09:35:08

  • related to Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping added

#32 Updated by anonym 2015-09-27 05:13:24

I suggest adding the following debug info to help us debug bootstrapping issues if they continue happening with this branch:

--- a/features/support/helpers/misc_helpers.rb
+++ b/features/support/helpers/misc_helpers.rb
@@ -120,6 +120,14 @@ end
 def wait_until_tor_is_working
   try_for(270) { $vm.execute(
     '. /usr/local/lib/tails-shell-library/tor.sh; tor_is_working').success? }
+rescue Timeout::Error => e
+  c = $vm.execute("grep restart-tor /var/log/syslog")
+  if c.success?
+    debug_log("From syslog:\n" + c.stdout.sub(/^/, "  ")
+  else
+    debug_log("Nothing was syslog:ed about 'restart-tor'")
+  end
+  raise e
 end

 def convert_bytes_mod(unit)

WARNING! The above code has not been tested, but I think the intention is pretty clear.

#33 Updated by kytv 2015-09-28 02:51:29

  • Assignee changed from kytv to anonym
  • % Done changed from 60 to 70
  • QA Check changed from Dev Needed to Ready for QA
  • Feature Branch changed from kytv:test/9516-tor-bootstrap to kytv:feature/9516-restart-tor-on-bootstrap-failure

#34 Updated by kytv 2015-09-28 03:23:49

anonym wrote:
> I suggest adding the following debug info

Added, cheers. :)

#35 Updated by anonym 2015-09-30 16:12:37

  • Assignee changed from anonym to kytv
  • QA Check changed from Ready for QA to Dev Needed

I haven’t reviewed your latest branch state, but I just realized the following while doing something unrelated: in post_snapshot_restore_hook we do not have to do $vm.spawn("restart-vidalia") any more; now when we run restart-tor, it will start vidalia for us. In fact, us running it as well probably introduces some funny races. Could you have a look at this until I return to this branch in a few days?

#36 Updated by kytv 2015-10-01 09:20:21

  • Assignee changed from kytv to anonym
  • QA Check changed from Dev Needed to Ready for QA

restart-tor will call restart-vidalia if vidalia is already running. That’s actually been the case for a while now.

a3b52392 (Tails developers 2012-11-15 16:56:26 +0100 57) if pgrep "\<vidalia\>" >/dev/null; then
05676de6 (Tails developers 2014-03-03 12:48:29 +0100 58)     killall -SIGKILL vidalia
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 59)     # Since Tor just restarted we wait for a while until the
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 60)     # ControlPort hopefully is up.
11faf45f (anonym           2015-05-03 16:29:07 +0200 61)     counter=0
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 62)     until [ "${counter}" -ge 10 ] || nc -z localhost 9051 2>/dev/null; do
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 63)         sleep 1
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 64)         counter=$((${counter}+1))
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 65)     done
a3b52392 (Tails developers 2012-11-15 16:56:26 +0100 66)     restart-vidalia
a3b52392 (Tails developers 2012-11-15 16:56:26 +0100 67) fi

But indeed it seems to me that we don’t need to do it ourselves in the test suite. Changes pushed.

#37 Updated by kytv 2015-10-06 01:38:40

Applied in changeset commit:1ed0a6a29e410aba7c1f5e5859e83a0ae761ded8.

#38 Updated by kytv 2015-10-06 01:38:40

Applied in changeset commit:14ae9bda69f46fc02522ef81d44b9f5fd216c144.

#39 Updated by anonym 2015-10-06 01:38:40

Applied in changeset commit:d30225ff7f51eca91610c08485161609b871a821.

#40 Updated by anonym 2015-10-06 01:38:41

  • Status changed from In Progress to Fix committed
  • % Done changed from 70 to 100

Applied in changeset commit:430651c8f37d1dd63bf4db51f3d3acd1afe02473.

#41 Updated by anonym 2015-10-06 01:42:25

  • Assignee deleted (anonym)
  • QA Check changed from Ready for QA to Pass

This branch definitely improves the situation for me.

It doesn’t help in bridge mode, but in my experience there’s less bootstrapping issues with bridges. Any way, I opened Bug #10338 for that.

#42 Updated by intrigeri 2015-10-12 14:05:15

(Moved to Feature #6094 since I don’t see this bug with current devel. Sorry for the noise.)

#43 Updated by anonym 2015-10-13 08:27:02

  • File clock_gettime_monotonic.patch added
  • Status changed from Fix committed to In Progress
  • Assignee set to kytv
  • % Done changed from 100 to 70
  • QA Check changed from Pass to Dev Needed

intrigeri wrote:
> (Moved to Feature #6094 since I don’t see this bug with current devel. Sorry for the noise.)

I actually think you were on to something. References: Feature #6094#note-60 and Feature #6094#note-62

I think restart-tor is buggy when the clock is incorrect. Specifically, it uses the system clock (via date) both explicitly in maybe_restart_tor(), and implicitly in try_for(). Hence, if “tordate” (i.e. config/chroot_local-includes/etc/NetworkManager/dispatcher.d/20-time.sh) changes the time while we still run try_for 270 maybe_restart_tor in restart-tor, all sorts of craziness can happen.

We need a monotonic clock to implement these things in a robust manner vs time changes. A drop-in monotonic replacement for date +%s would be perl -w -MTime::HiRes=clock_gettime,CLOCK_MONOTONIC -E 'say clock_gettime(CLOCK_MONOTONIC)' | sed 's/\..*$//'. See the attached patch for a complete (but untested) fix! Thoughts?

(That said, try_for is still ugly since it will not timeout if something inside never returns, and it has the issue with “overtime”. Perhaps we can implement it in perl too, using some timer/timeout facility + shell evaluation? But that would just be a bonus.)

#44 Updated by kytv 2015-10-21 10:29:45

anonym wrote:

> We need a monotonic clock to implement these things in a robust manner vs time changes. A drop-in monotonic replacement for date +%s would be perl -w -MTime::HiRes=clock_gettime,CLOCK_MONOTONIC -E 'say clock_gettime(CLOCK_MONOTONIC)' | sed 's/\..*$//'. See the attached patch for a complete (but untested) fix! Thoughts?

Testing

#45 Updated by kytv 2015-10-26 03:12:07

  • Assignee changed from kytv to anonym
  • QA Check changed from Dev Needed to Ready for QA
  • Feature Branch changed from kytv:feature/9516-restart-tor-on-bootstrap-failure to kytv/feature/9516-restart-tor-on-bootstrap-failure

After a fair amount of testing I think the supplied patch is OK.

#46 Updated by anonym 2015-10-26 13:01:12

  • Status changed from In Progress to Fix committed
  • % Done changed from 70 to 100

Applied in changeset commit:7a19e96e282204659fa576a8ed0fb45916d90293.

#47 Updated by anonym 2015-10-26 13:03:30

  • Assignee deleted (anonym)
  • QA Check changed from Ready for QA to Pass

#48 Updated by anonym 2015-11-03 11:31:04

  • Status changed from Fix committed to Resolved

#49 Updated by anonym 2015-11-06 10:23:51

  • related to Bug #10494: Retry htpdate when it fails added

#50 Updated by intrigeri 2015-11-17 09:59:38

  • related to Bug #10537: Fix restart-tor and tor-has-bootstrapped semantics vs. test suite on Jessie added

#51 Updated by intrigeri 2016-07-13 14:00:57

  • related to deleted (Bug #10494: Retry htpdate when it fails)