Bug #13541

Tor still sometimes fails to bootstrap in the test suite

Added by bertagaz 2017-07-31 10:39:43 . Updated 2018-08-18 11:07:38 .

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Test suite
Target version:
Start date:
2017-07-31
Due date:
% Done:

20%

Feature Branch:
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

It’s been noticed while reviewing Jenkins test suite runs failure in June (Feature #12289) that the TorBootstrapFailure exception still happens sometimes.

There are 3 possible reasons to it, I think: a) chutney/tor sometimes feels stupid; or b) our test suite sometimes fails to manage chutney properly; or c) some unrelated event on the system prevents them from doing their job.

Next steps to investigate this would be to:

1. look at the test suite debug log and
* double-check there’s no explanation for the tor bootstrap failure (that’s about (a) and (b))
* record the exact times when things went wrong
2. ensure we save the tor log from the Tails system when this happens, so we can see if that tor is stupid (a)
3. make the Journal persistent on isotesters so we can try to correlate such failures with system events ©.


Files


Subtasks


Related issues

Related to Tails - Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping Rejected 2015-09-23
Related to Tails - Feature #12411: Stop restarting tor if bootstrapping stalls Resolved 2017-03-29
Related to Tails - Feature #12289: Deal with June 2017 false positive scenarios Resolved 2017-06-05 2017-07-05
Related to Tails - Bug #15799: "Additional software packages are installed even without network" test always fail in my environment Resolved 2018-08-17
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

History

#1 Updated by bertagaz 2017-07-31 10:40:30

#2 Updated by bertagaz 2017-07-31 10:40:42

  • related to Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping added

#3 Updated by bertagaz 2017-07-31 10:41:05

  • related to Feature #12411: Stop restarting tor if bootstrapping stalls added

#4 Updated by bertagaz 2017-07-31 10:41:15

  • related to Feature #12289: Deal with June 2017 false positive scenarios added

#5 Updated by bertagaz 2017-07-31 16:04:09

I’ll paste debug logs and all later when I’ll have finished Feature #12290.

#6 Updated by bertagaz 2017-08-03 15:36:12

intrigeri wrote:
>
> 2. ensure we save the tor log from the Tails system when this happens, so we can see if that tor is stupid (a)

While doing Bug #13472, I’ve pushed the feature/13541-save-more-data-on-htpdate-or-tor-failures branch which contains a rough implementation of that. It also saves htpdate.log on ‘time has synced failures’, so that’s why I wanted it pushed and live in Jenkins.

#7 Updated by bertagaz 2017-08-18 12:58:55

  • Status changed from Confirmed to In Progress
  • Assignee set to anonym
  • Target version set to Tails_3.2
  • % Done changed from 0 to 10
  • QA Check set to Ready for QA

intrigeri wrote:
> 3. make the Journal persistent on isotesters so we can try to correlate such failures with system events ©.

I’ve pushed another commit in the branch mentioned in my previous note that does that. I’ve made it so that it does save the journal no matter what the failure. I think that’s an interesting information we may want for a lot of reasons/cases. For example, I’ve done it because I wanted to have the systemd journal for Bug #13461.

I’ve also pushed another commit in this branch that clean up the Tor and Htpdate logs retrieval.

I think that’s something that’d be useful to get into stable and devel so that we get more useful informations from the Jenkins runs, so I’m puthing this branch RfQA. Merging it does not mean this ticket is over though so please set it back to Dev needed and no assignee if/when it’s merged.

#8 Updated by bertagaz 2017-08-22 09:36:35

I’m attaching the systemd journal from https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-13541-save-more-data-on-htpdate-or-tor-failures/39/ which exposed this failure. The related saved Tor log is empty, it seems it didn’t even started. That’s the only builds from this branch that had this problem, so it’s not sure it really shows the exact bug we’re talking about.

#9 Updated by anonym 2017-09-25 16:46:00

  • Assignee changed from anonym to bertagaz
  • Target version changed from Tails_3.2 to Tails_3.3
  • % Done changed from 10 to 20

I merged the feature/13541-save-more-data-on-htpdate-or-tor-failures branch, but had to push some fixups straight into testing:

892150ba5e Be more careful with remote shell usage when a scenario has failed.
0f5b0ea305 Prefer do..end over {} for multi-line blocks.
a44a711aa7 Fix indentation.
dcb30cc995 Simplify.
9f74565b80 Fix typos.
b85dbcb642 Simplify.

Please review (post-merge)!

#10 Updated by bertagaz 2017-10-03 10:53:10

  • Target version changed from Tails_3.3 to Tails_3.5

#11 Updated by intrigeri 2017-10-04 08:03:04

#12 Updated by bertagaz 2017-10-23 10:48:04

  • Assignee deleted (bertagaz)
  • QA Check changed from Ready for QA to Dev Needed

anonym wrote:
> I merged the feature/13541-save-more-data-on-htpdate-or-tor-failures branch, but had to push some fixups straight into testing:
>
> Please review (post-merge)!

yay, thanks for the improvements / fixes. It’s live, works and changes make sense.

#13 Updated by intrigeri 2017-12-04 18:18:17

  • Target version deleted (Tails_3.5)

#14 Updated by intrigeri 2017-12-09 07:31:24

While analyzing test failures on Bug #15019 I’ve just seen 2 problems that I was going to naively classify as “Tor bootstrap failure”, thinking “oh crap, the new Tor fails to bootstrap more often!”. But then I looked at the Journal and noticed that nm-dispatcher did not react to NM connectivity change, did not start its scripts, and therefore the Tor service was not even started => no chance it ever bootstraps. I’m attaching two example Journals that display this bug.

I don’t know if this is caused by the hacks we do around time sync’ing etc. in our test suite (e.g. I see a 2 hours time bump after starting NM + its dispatcher, but before it got a DHCP lease), or if it’s a bug that can happen during Tails regular usage.

I suggest we check that the Tor service is actually started next time we’re tempted to classify issues as “Tor fails to bootstrap”: see if there are entries in the *.tor artifact and possibly look for “Starting Anonymizing overlay network for TCP” in the Journal.

#15 Updated by intrigeri 2018-01-06 21:53:07

bertagaz wrote:
> anonym wrote:
> > I merged the feature/13541-save-more-data-on-htpdate-or-tor-failures branch, but had to push some fixups straight into testing:
> >
> > Please review (post-merge)!
>
> yay, thanks for the improvements / fixes. It’s live, works and changes make sense.

For some reason the Journal is not saved consistently, e.g. it’s not saved when the “Browsing the web using the Tor Browser ǂ The persistent Tor Browser directory is usable” scenario fails this way:

try_for() timeout expired
Last ignored exception was: RuntimeError: The Dogtail script raised: SearchError: descendent of [application | Firefox]: child with name="Tails - News - Tor Browser" roleName='frame' (Timeout::Error)
./features/support/helpers/misc_helpers.rb:90:in `rescue in try_for'
./features/support/helpers/misc_helpers.rb:36:in `/^"([^"]+)" has loaded in the Tor Browser$/'
features/torified_browsing.feature:139:in `And the Tor Browser loads the startup page'

#16 Updated by intrigeri 2018-03-29 16:17:06

On Bug #15456#note-11 bertagaz wrote:

> Hint: I see more of this kind of failure with the usual Chutney network used in the test suite. Lately, when setting V3AuthVotingInterval to something like 14400 in submodules/chutney/torrc_templates/authority.i I get close to none of this bootstrap errors. That makes sense: the consensus is not renewed very 10 seconds anymore which gives a much more stable tor network. The Tails VM Tor process gets a valid consensus at the first download attempt and bootstrap very much faster. I wanted to add a note on that on Bug #13541 and/or Bug #15160. It could even help for Feature #15211 a bit (reduce crypto operations).

#17 Updated by intrigeri 2018-08-17 07:12:38

intrigeri wrote:
> While analyzing test failures on Bug #15019 I’ve just seen 2 problems that I was going to naively classify as “Tor bootstrap failure”, thinking “oh crap, the new Tor fails to bootstrap more often!”. But then I looked at the Journal and noticed that nm-dispatcher did not react to NM connectivity change, did not start its scripts, and therefore the Tor service was not even started => no chance it ever bootstraps.

Similarly, I’ve just seen one case where NM will get a DHCP lease but did not start its dispatcher scripts, so tor is not started.

#18 Updated by intrigeri 2018-08-17 14:05:42

  • related to Bug #15799: "Additional software packages are installed even without network" test always fail in my environment added

#19 Updated by intrigeri 2018-08-17 17:39:54

  • Assignee set to intrigeri
  • Target version set to Tails_3.9
  • Feature Branch set to test/13541-more-robust-tor-bootstrap+force-all-tests

intrigeri wrote:
> On Bug #15456#note-11 bertagaz wrote:
>
> > Hint: I see more of this kind of failure with the usual Chutney network used in the test suite. Lately, when setting V3AuthVotingInterval to something like 14400 in submodules/chutney/torrc_templates/authority.i I get close to none of this bootstrap errors. That makes sense: the consensus is not renewed very 10 seconds anymore which gives a much more stable tor network. The Tails VM Tor process gets a valid consensus at the first download attempt and bootstrap very much faster.

It does seem to help here as well so I’ve prepared a branch.

#20 Updated by intrigeri 2018-08-18 11:07:38

  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_3.9)
  • Feature Branch deleted (test/13541-more-robust-tor-bootstrap+force-all-tests)

Actually my branch only solves part of the problem, and not the cases when Tor is not even started, so I’ll submit it on Bug #15799 instead because that’s the immediate problem it fixes.

#21 Updated by intrigeri 2019-03-08 15:24:59

#22 Updated by intrigeri 2019-03-20 14:49:26

#23 Updated by intrigeri 2019-03-20 14:49:34

  • blocked by deleted (Feature #13241: Core work: Test suite maintenance)