Bug #16792

Upgrade our Chutney fork

Added by intrigeri 2019-06-09 15:25:24. Updated 2019-12-18 14:20:16.

Status:
Resolved
Priority:
Normal
Assignee:
hefee
Category:
Test suite
Target version:
Start date:
Due date:
% Done:

100%

Feature Branch:
feature/16792-only-update-chutney+force-all-tests
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

We haven’t updated our Chutney fork since 2016. Current diffstat is: 139 files changed, 3401 insertions(+), 712 deletions(-).
Using configuration and code that were developed for a somewhat antique version of tor will likely cause trouble at some point.

I’ve noticed this because our test suite sets deprecated options in the torrc it uses with Chutney:

Jun 10 19:38:59 amnesia tor[8575]: Jun 10 19:38:59.126 [warn] The configuration option 'TestingBridgeDownloadSchedule' is deprecated; use 'TestingBridgeDownloadInitialDelay' instead.
Jun 10 19:38:59 amnesia tor[8575]: Jun 10 19:38:59.126 [warn] The configuration option 'TestingClientConsensusDownloadSchedule' is deprecated; use 'TestingClientConsensusDownloadInitialDelay' instead.
Jun 10 19:38:59 amnesia tor[8575]: Jun 10 19:38:59.126 [warn] The configuration option 'TestingClientDownloadSchedule' is deprecated; use 'TestingClientDownloadInitialDelay' instead.

Upstream Chutney has removed all DownloadSchedule torrc options from its templates, so in this case it’s no big deal.
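
For reference, the deprecations above are plain renames. Below is a minimal, hypothetical Python sketch that flags such deprecated options in a torrc; the old-to-new mapping is taken verbatim from the tor warnings quoted above, while the helper itself and the example path are not part of our test suite.

# Hypothetical helper: report deprecated torrc options and their replacements.
# The mapping comes straight from the tor warnings quoted above.
DEPRECATED_OPTIONS = {
    "TestingBridgeDownloadSchedule": "TestingBridgeDownloadInitialDelay",
    "TestingClientConsensusDownloadSchedule": "TestingClientConsensusDownloadInitialDelay",
    "TestingClientDownloadSchedule": "TestingClientDownloadInitialDelay",
}

def report_deprecated(torrc_path):
    """Print every deprecated option found in the given torrc."""
    with open(torrc_path) as torrc:
        for line in torrc:
            parts = line.split()
            if parts and parts[0] in DEPRECATED_OPTIONS:
                print("%s is deprecated; use %s instead"
                      % (parts[0], DEPRECATED_OPTIONS[parts[0]]))

report_deprecated("/etc/tor/torrc")  # example path only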


Subtasks


Related issues

Related to Tails - Bug #17163: "SSH is using the default SocksPort" test suite scenario is fragile Resolved
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed
Blocks Tails - Feature #15211: Reduce our Chutney network In Progress 2018-01-22
Blocks Tails - Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete In Progress 2019-02-17
Blocks Tails - Bug #11589: Time syncing over bridge is fragile Confirmed 2016-07-22

History

#1 Updated by intrigeri 2019-06-09 15:26:30

#2 Updated by intrigeri 2019-06-14 12:31:30

  • blocks Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete added

#3 Updated by anonym 2019-06-14 12:32:04

  • Assignee set to anonym

#4 Updated by anonym 2019-06-14 12:35:08

  • Subject changed from Upgrade our Chutney fork to Upgrade our Chutney fork and make configuratoin more similar to the real Tor network

#5 Updated by hefee 2019-06-14 12:37:27

Here is another patch for Chutney to make it behave more like the current Tor network, so that the tests that skew the clock can pass.
https://salsa.debian.org/hefee/chutney/merge_requests/1

#6 Updated by anonym 2019-06-14 12:41:22

  • Subject changed from Upgrade our Chutney fork and make configuratoin more similar to the real Tor network to Upgrade our Chutney fork and make configuration more similar to the real Tor network

I’ll use hefee’s work on https://salsa.debian.org/hefee/chutney/merge_requests/1

#7 Updated by anonym 2019-06-14 14:32:21

  • Status changed from Confirmed to In Progress

Applied in changeset commit:tails|3ef189655a20a1b7673858fe54546027942ed190.

#8 Updated by anonym 2019-06-14 15:00:01

  • % Done changed from 0 to 10
  • Feature Branch set to feature/16792-update-chutney+force-all-tests

The current branch updates Chutney, and I’ve at least verified that tor bootstraps. I saw some initial failures with htpdate (restarting the services fixed it); I’m not sure why that is, or whether it was just a transient thing.

Anyway, let’s see what a full run looks like.

#9 Updated by anonym 2019-06-26 09:56:20

  • % Done changed from 10 to 20

The full test runs look like any other +force-all-tests run, so I see no apparent regressions, yay!

Next step, make the configuration more similar to the real Tor network.

#10 Updated by anonym 2019-06-26 11:52:10

We used to get consensuses looking like this:

valid-after 2019-06-26 10:12:00
fresh-until 2019-06-26 10:15:00
valid-until 2019-06-26 10:18:00
voting-delay 4 4


i.e. really short intervals/delays. In the real Tor network it looks like this:

valid-after 2019-06-26 09:00:00
fresh-until 2019-06-26 10:00:00
valid-until 2019-06-26 12:00:00
voting-delay 300 300


I’ve adjusted the *V3Auth* settings so they now have the same interval lengths etc. as the real Tor network:

 
valid-after 2019-06-26 11:31:00
fresh-until 2019-06-26 12:31:00
valid-until 2019-06-26 14:31:00
voting-delay 300 300


That should be all we need for Bug #16471, but there are other discrepancies too: e.g. the real Tor network uses consensus-method 28 while we get consensus-method 25 for some reason. That is just an example, though; I don’t think we care about this one.
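
For context, on the live Tor network a consensus is fresh for one voting interval and valid for three (testing networks can use a different multiple, as the first excerpt above shows). A purely illustrative Python sketch of that arithmetic, using the 1-hour interval and 300 s delays from the adjusted consensus:

from datetime import datetime, timedelta

def consensus_lifetime(valid_after, interval_s, vote_delay_s, dist_delay_s):
    """Compute fresh-until and valid-until for a consensus, assuming the
    live-network rule: fresh for 1 voting interval, valid for 3."""
    interval = timedelta(seconds=interval_s)
    fresh_until = valid_after + interval
    valid_until = valid_after + 3 * interval
    return fresh_until, valid_until, (vote_delay_s, dist_delay_s)

# Values matching the adjusted consensus above: 1 h interval, 300 s delays.
fresh, valid, delays = consensus_lifetime(datetime(2019, 6, 26, 11, 31), 3600, 300, 300)
print(fresh)   # 2019-06-26 12:31:00
print(valid)   # 2019-06-26 14:31:00
print(delays)  # (300, 300)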

#11 Updated by intrigeri 2019-08-10 16:56:01

  • Target version set to Tails_3.16

anonym, could finishing this be your next focus point once you’re done with 4.0~beta1? Or maybe you prefer someone else to finish this work?

In any case, I’ve refreshed the branch so that it builds on Jenkins again.

Getting this done would allow us to make progress on Bug #16471 :)

Setting target version to the same as Bug #16471, for consistency. But postponing both would be fine.

#12 Updated by intrigeri 2019-08-10 17:29:22

#13 Updated by anonym 2019-08-16 12:40:57

intrigeri wrote:
> anonym, could finishing this be your next focus point once you’re done with 4.0~beta1?

Yes!

> In any case, I’ve refreshed the branch so that it builds on Jenkins again.

Thanks! Sadly we lost all results except last week’s, so now we have much less data to rely on. I guess there is nothing we can do about this? Or could they still exist on some backup that isn’t too awkward (say at most 15 minutes of work) to restore from?

After looking briefly at Jenkins runs 1-5 and comparing them to devel’s results from the same time frame, I see mostly the same network-using branches failing, but not similar enough that it’s a no-brainer to tell if things are worse/same/better. So I feel tempted to spam test this branch a bit on Jenkins. I started 3 runs now, and hope to start a few more during the weekend (feel free to help me if you happen to see some Jenkins slaves idling!), and hopefully have more convincing data on Monday. Then there are two weeks to get this and Bug #16471 ready for Tails 3.16, which seems doable.

#14 Updated by intrigeri 2019-08-16 13:23:13

Yo,

anonym wrote:
> intrigeri wrote:
>> anonym, could finishing this be your next focus point once you’re done with 4.0~beta1?

> Yes!

Excellent news :)

> Thanks! Sadly we lost all results except last week’s, so now we have much less data to rely on. I guess there is nothing we can do about this? Or could they still exist on some backup that isn’t too awkward (say at most 15 minutes of work) to restore from?

Nope: we don’t back up the output of Jenkins jobs (that would take huge amounts of disk, bandwidth and CPU, and degrade the performance of all our services).

> After looking briefly at Jenkins runs 1-5 and comparing them to devel’s results from the same time frame, I see mostly the same network-using branches failing, but not similar enough that it’s a no-brainer to tell if things are worse/same/better. So I feel tempted to spam test this branch a bit on Jenkins. I started 3 runs now, and hope to start a few more during the weekend (feel free to help me if you happen to see some Jenkins slaves idling!), and hopefully have more convincing data on Monday. Then there are two weeks to get this and Bug #16471 ready for Tails 3.16, which seems doable.

Sounds like a great plan! Let’s take it easy wrt. spamming Jenkins: these days it happens regularly that it has a large backlog (I’ve seen the feedback loop take 20+ hours a few times recently).

I’ll run a few jobs on my local Jenkins too, if its workers and I both have some spare time for it.

#15 Updated by intrigeri 2019-08-16 13:35:23

> comparing them devel’s results from the same time frame

Note that comparing with the base branch of this one (stable) instead would probably be a tiny bit more accurate :)

#16 Updated by intrigeri 2019-08-18 16:35:38

> I’ll run a few jobs on my local Jenkins too, if its workers and I both have some spare time for it.

We both had! So I should be able to report about 12+ full local runs tomorrow :)

#17 Updated by intrigeri 2019-08-19 07:33:37

> We both had! So I should be able to report about 12+ full local runs tomorrow :)

Here we go. I got 12 full runs on my local Jenkins.

Apart from the usual suspects (mostly Bug #11592 these days, see my comment there), I see:

  • lots of TimeSyncingError, in the time sync and bridges scenarios, including in “Clock with host’s time”
  • some TorBootstrapFailure, in the time sync scenarios, including in “Clock with host’s time”
  • lots of “available upgrades have been checked” failures after “Tor is ready” succeeded, in the bridges scenarios
  • one “Tails clock is less than 5 minutes incorrect” failure in “Clock with host’s time”: clock was off by 435 seconds; interestingly, the clock was correct before htpdate “fixed” it, so I dunno what the heck is going on here
  • lots of GnuPG key fetching failures; that might be caused by instability of my Internet connection but I have a doubt since I’ve not seen this fail as much while running tests for other branches recently; or it might be that pool.sks-keyservers.net was having trouble at that point, which is entirely possible
  • some weird failures for SSH and Git tests

All in all, that’s many more failures than I usually see in the same environment (which tends to expose fewer robustness issues than our shared Jenkins).

Once you’ve analysed the runs on our shared Jenkins, if there’s important problems I’m seeing that you did not see, and if you think it would help, I could share with you a tarball of the debug logs, tor logs, and Journal dumps.

#18 Updated by anonym 2019-08-20 12:22:17

intrigeri wrote:
> * lots of TimeSyncingError, in the time sync and bridges scenarios, including in “Clock with host’s time”

Same.

> * some TorBootstrapFailure, in the time sync scenarios, including in “Clock with host’s time”

I saw this once, for “Clock with host’s time”.

> * lots of “available upgrades have been checked” failures after “Tor is ready” succeeded, in the bridges scenarios

None of these. I find it odd that you see “lots” of these while I see none, but I suppose it could be your internet connection? IIRC these tests are especially sensitive to that?

> * one “Tails clock is less than 5 minutes incorrect” failure in “Clock with host’s time”: clock was off by 435 seconds; interestingly, the clock was correct before htpdate “fixed” it, so I dunno what the heck is going on here

Huh. Will keep an eye open for htpdate strangeness.

> * lots of GnuPG key fetching failures; that might be caused by instability of my Internet connection but I have a doubt since I’ve not seen this fail as much while running tests for other branches recently; or it might be that pool.sks-keyservers.net was having trouble at that point, which is entirely possible
> * some weird failures for SSH and Git tests

Same for these.

> All in all, that’s many more failures than I usually see in the same environment (which tends to expose fewer robustness issues than our shared Jenkins).

On Jenkins there are also noticeably more failures on the feature branch compared to stable. However, it seems that most failures are introduced in a few runs only, with the other runs looking pretty much like stable. Hmm. One could almost suspect that Chutney generated a “buggy” Tor network for those bad runs.

Let’s note that this branch added two things:

  • Chutney: use normal consensus cycle interval
  • Chutney: update to current upstream master

We should test these separately and see which one causes all the trouble, so I created these poorly named branches:

  • feature/16792-only-update-chutney+force-all-tests
  • feature/16792-normal-consensus+force-all-tests

It looks increasingly unlikely we’ll have this done for Tails 3.16. Or what do you think, @intrigeri?

#19 Updated by intrigeri 2019-08-21 08:49:17

Hi,

>> * lots of “available upgrades have been checked” failures after “Tor is ready” succeeded, in the bridges scenarios

> None of these. I find it odd that you see “lots” of these while I see none, but I suppose it could be your internet connection? IIRC these tests are especially sensitive to that?

I’m not used to seeing lots of those here, but it could indeed be my Internet connection. If you feel it would be useful, I could trigger a few more runs to see if this happens often again.

> Let’s note that this branch added two things: […]
> We should test these separately and see which one causes all the trouble

Excellent idea :)

> It looks increasingly unlikely we’ll have this done for Tails 3.16.

I agree it looks less easy than we had been hoping, but there are still ~10 days left, which is plenty. We’ll see! If it makes it into 3.16, great; if it gets merged a bit later, fine.

#20 Updated by anonym 2019-08-21 09:07:01

intrigeri wrote:
> >> * lots of “available upgrades have been checked” failures after “Tor is ready” succeeded, in the bridges scenarios
>
> > None of these. I find it odd that you see “lots” of these while I see none, but I suppose it could be your internet connection? IIRC these tests are especially sensitive to that?
>
> I’m not used to seeing lots of those here, but it could indeed be my Internet connection. If you feel it would be useful, I could trigger a few more runs to see if this happens often again.

Yes, it would be very useful …

> > Let’s note that this branch added two things: […]
> > We should test these separately and see which one causes all the trouble
>
> Excellent idea :)

… if you could do a few runs of the two new branches!

#21 Updated by intrigeri 2019-08-21 09:31:23

> Yes, it would be very useful …
> […]
> … if you could do a few runs of the two new branches!

OK, I shall do this once I’m done with Bug #12092, which currently keeps my local Jenkins busy.

#22 Updated by intrigeri 2019-08-30 16:10:18

  • Target version changed from Tails_3.16 to Tails_3.17

intrigeri wrote:
> > Yes, it would be very useful …
> > […]
> > … if you could do a few runs of the two new branches!
>
> OK, I shall do this once I’m done with Bug #12092, which currently keeps my local Jenkins busy.

Unfortunately, this has been the case almost continuously since I wrote that, so I could not run anything on the new branches yet.

#23 Updated by intrigeri 2019-09-01 16:38:41

I’ve started 3 runs of each of these 2 new branches locally. And then I noticed that the 2 branches are based on stable. This of course made sense back when they were prepared: we were targeting 3.16. But since then, lots of test suite robustness improvements went to devel and not to stable, so it’ll be more painful than it could be to analyze the results. So if you don’t mind, I’ll merge devel into these branches.

#24 Updated by intrigeri 2019-09-05 10:53:13

  • Assignee deleted (anonym)

#25 Updated by intrigeri 2019-09-05 14:39:22

  • Target version changed from Tails_3.17 to Tails_4.0

intrigeri wrote:
> So if you don’t mind, I’ll merge devel into these branches.

I’ve rebased the 3 branches on top of current devel.

#26 Updated by intrigeri 2019-09-07 07:02:24

  • Assignee set to intrigeri

I’ll take care of analyzing the test results, in the hope that we can at least merge the “upgrade Chutney” part soonish :)

#27 Updated by intrigeri 2019-09-07 19:08:46

I feel I need much more data to draw any conclusion. I’ll update the info in this comment in a week or so.

#28 Updated by intrigeri 2019-09-15 07:24:49

One week later, I’ve updated my analysis of test results (see below) and I now feel I have enough data to draw preliminary conclusions:

  • In the current state of things, neither of the 2 branches is ready for merging. Both have issues that need further investigation.
  • Regarding the “only update Chutney” branch: I’ll come back to it in a week or so to see if the tweaks that I’ve just pushed help.
  • I’m starting to suspect that we use the simulated tor network before it’s fully ready: in Chutney nodes’ logs I quite often see a bootstrap status that’s below 100%. It seems we should use chutney wait_for_bootstrap in our ensure_chutney_is_running function: see the “Waiting for the network” section in Chutney’s README for details, and the sketch below.
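
A rough sketch of what that could look like, written in Python for brevity (the real ensure_chutney_is_running helper lives in our Ruby test suite); the chutney wait_for_bootstrap subcommand is the one documented in Chutney’s README, while everything else here, including the paths and the 240 s timeout, is made up for illustration:

import subprocess

def wait_for_chutney_bootstrap(chutney_dir, network_file, timeout=240):
    """Block until every Chutney node reports a 100% bootstrap, or fail.

    Wraps `./chutney wait_for_bootstrap <network>` as described in the
    "Waiting for the network" section of Chutney's README. The timeout
    is enforced on our side; 240 s is just a placeholder value.
    """
    subprocess.run(
        ["./chutney", "wait_for_bootstrap", network_file],
        cwd=chutney_dir,
        check=True,       # raise if Chutney reports a bootstrap failure
        timeout=timeout,  # raise if the network takes too long
    )

# Example call; both arguments are hypothetical.
# wait_for_chutney_bootstrap("/srv/chutney", "networks/basic")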

Test results:

  • only update Chutney
    • runs 18-30 of https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-16792-only-update-chutney-force-all-tests/:
      • “Cloning a Git repository ǂ Cloning git repository over SSH” failed 8/13 times, compared to 0/21 times on devel. git clone immediately fails with “nc: connection failed, SOCKSv5 error: General SOCKS server failure”.
      • “Tor stream isolation is effective ǂ SSH is using the default SocksPort” failed 10/13 times, compared to 3/21 times on devel. This looks like Bug #17013 but the increased failure rate is concerning. Just like in the Git over SSH scenario, SSH immediately fails with “nc: connection failed, SOCKSv5 error: General SOCKS server failure”.
      • Apart from these 2 concerning issues, it’s business as usual; no regression spotted there.
    • runs 4-7 on my local Jenkins:
      • 3/4 fully passed
      • “Cloning a Git repository ǂ Cloning git repository over SSH” failed 1/4 times, in exactly the same way as on lizard
      • “Tor stream isolation is effective ǂ SSH is using the default SocksPort” failed 1/4 times, in exactly the same way as on lizard
    • Wrt. “nc: connection failed, SOCKSv5 error: General SOCKS server failure”, since then I’ve applied a few tweaks that might help (unlikely) and added commit:79f86a310a3568a5ef2f4646b216b1be1fab7144 to get more debugging info.
  • normal consensus
    • runs 13-23 of https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-16792-normal-consensus-force-all-tests/:
      • 1 “Tor failed to bootstrap (TorBootstrapFailure)” in “Clock with host’s time”
      • 1 “Time syncing failed (TimeSyncingError)” in “Using bridges”: only 2/10 HTTPS connections initiated by htpdate succeeded.
      • other than that, business as usual
    • runs 4-7 on my local Jenkins:
      • 1/7 fully passed
      • 1 “Tor failed to bootstrap (TorBootstrapFailure)” in “Clock with host’s time” (see logs below)
      • other than that, business as usual

Wrt. the aforementioned Tor bootstrap failures, tor got a valid consensus:

Sep 07 10:25:51 amnesia time[8106]: Waiting for a Tor consensus file to contain a valid time interval
Sep 07 10:25:51 amnesia time[8109]: A Tor consensus file now contains a valid time interval.
Sep 07 10:25:51 amnesia time[8110]: Waiting for the chosen Tor consensus file to contain a valid time interval...
Sep 07 10:25:51 amnesia time[8112]: The chosen Tor consensus now contains a valid time interval, let's use it.
Sep 07 10:25:51 amnesia time[8116]: Tor: valid-after=2019-09-07 10:13:00 | valid-until=2019-09-07 13:13:00
Sep 07 10:25:51 amnesia time[8119]: Current time is 2019-09-07 10:25:51
Sep 07 10:25:51 amnesia time[8124]: Current time is in valid Tor range
Sep 07 10:25:51 amnesia time[8125]: Waiting for Tor to be working...

… it can’t build a circuit:

Sep 07 10:25:50.000 [notice] Bootstrapped 0% (starting): Starting
Sep 07 10:25:50.000 [notice] Starting with guard context "default"
Sep 07 10:25:50.000 [notice] Signaled readiness to systemd
Sep 07 10:25:50.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:51.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:51.000 [notice] Opening Control listener on /run/tor/control
Sep 07 10:25:51.000 [notice] Opened Control listener on /run/tor/control
Sep 07 10:25:51.000 [notice] Bootstrapped 5% (conn): Connecting to a relay
Sep 07 10:25:51.000 [notice] Bootstrapped 10% (conn_done): Connected to a relay
Sep 07 10:25:51.000 [notice] Bootstrapped 14% (handshake): Handshaking with a relay
Sep 07 10:25:51.000 [notice] Bootstrapped 15% (handshake_done): Handshake with a relay done
Sep 07 10:25:51.000 [notice] Bootstrapped 20% (onehop_create): Establishing an encrypted directory connection
Sep 07 10:25:51.000 [notice] Bootstrapped 25% (requesting_status): Asking for networkstatus consensus
Sep 07 10:25:51.000 [notice] New control connection opened.
Sep 07 10:25:51.000 [notice] Bootstrapped 30% (loading_status): Loading networkstatus consensus
Sep 07 10:25:51.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Sep 07 10:25:51.000 [notice] Bootstrapped 40% (loading_keys): Loading authority key certs
Sep 07 10:25:51.000 [notice] The current consensus has no exit nodes. Tor can only build internal paths, such as paths to onion services.
Sep 07 10:25:51.000 [notice] Bootstrapped 45% (requesting_descriptors): Asking for relay descriptors
Sep 07 10:25:51.000 [notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 0/2, and can only build 0% of likely paths. (We have 0% of guards bw, 0% of midpoint bw, and 0% of end bw (no exits in consensus, using mid) = 0% of path bw.)
Sep 07 10:25:51.000 [notice] Bootstrapped 50% (loading_descriptors): Loading relay descriptors
Sep 07 10:25:51.000 [notice] The current consensus contains exit nodes. Tor can build exit and internal paths.
Sep 07 10:25:51.000 [notice] Bootstrapped 75% (enough_dirinfo): Loaded enough directory info to build circuits
Sep 07 10:25:51.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:51.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:52.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:52.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:52.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:52.000 [warn] Failed to find node for hop #2 of our path. Discarding this circuit.
Sep 07 10:25:52.000 [notice] Our circuit 0 (id: 3) died due to an invalid selected path, purpose General-purpose client. This may be a torrc configuration issue, or a bug.
Sep 07 10:25:52.000 [notice] New control connection opened.
Sep 07 10:25:53.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:53.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:53.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:53.000 [warn] Failed to find node for hop #2 of our path. Discarding this circuit.
Sep 07 10:25:53.000 [notice] New control connection opened.
Sep 07 10:25:53.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:54.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:54.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:54.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:54.000 [warn] Failed to find node for hop #2 of our path. Discarding this circuit.

… and then the last 3 lines repeat endlessly.

Note that we currently use 20 non-exit relays.

#29 Updated by intrigeri 2019-09-16 15:47:54

  • blocked by deleted (Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete)

#30 Updated by intrigeri 2019-09-18 17:35:14

  • Subject changed from Upgrade our Chutney fork and make configuration more similar to the real Tor network to Upgrade our Chutney fork
  • Feature Branch changed from feature/16792-update-chutney+force-all-tests to feature/16792-only-update-chutney+force-all-tests

Narrowing scope to something I feel can be completed soon.

I’ve renamed the branches that “make configuration more similar to the real Tor network” to wip/: wip/feature/16792-update-chutney+force-all-tests and wip/feature/16792-normal-consensus+force-all-tests. That’ll be for another ticket whenever we need it.

#31 Updated by intrigeri 2019-09-19 11:20:21

> * I’m starting to suspect that we use the simulated tor network before it’s fully ready: in Chutney nodes’ logs I quite often see a bootstrap status that’s below 100%. It seems we should use chutney wait_for_bootstrap in our ensure_chutney_is_running function: see the “Waiting for the network” section in Chutney’s README for details.

I’ve implemented this on feature/16792-only-update-chutney+force-all-tests. I have good hopes it will eliminate some kinds of test suite robustness issues but unfortunately, it did not magically solve the “nc: connection failed, SOCKSv5 error: General SOCKS server failure” problem, which is thus left to be investigated.

#32 Updated by intrigeri 2019-09-30 17:55:09

intrigeri wrote:
> it did not magically solve the “nc: connection failed, SOCKSv5 error: General SOCKS server failure” problem, which is thus left to be investigated.

I’ve seen this problem happen a lot on devel-based branches in the last few days. I suspect this is caused by memory starvation on the isotesters (Bug #17088). So it might be that this problem is actually caused by limitations in our CI environment and has nothing to do with Chutney itself. I’ve just given the isotesters more RAM (for Bug #17088); let’s see if it helps here too!

#33 Updated by intrigeri 2019-10-09 16:43:01

  • Target version changed from Tails_4.0 to Tails_4.1

intrigeri:
>> * I’m starting to suspect that we use the simulated tor network before it’s fully ready: in Chutney nodes’ logs I quite often see a bootstrap status that’s below 100%. It seems we should use chutney wait_for_bootstrap in our ensure_chutney_is_running function: see the “Waiting for the network” section in Chutney’s README for details.

> I’ve implemented this on feature/16792-only-update-chutney+force-all-tests. I have good hopes it will eliminate some kinds of test suite robustness issues but unfortunately, it did not magically solve the “nc: connection failed, SOCKSv5 error: General SOCKS server failure” problem, which is thus left to be investigated.

Since I gave the isotesters more RAM (for unrelated reasons), and apart from the scenarios that are super fragile on devel too these days, runs 54-60 on lizard yielded this:

  • chutney wait_for_bootstrap failed 2/7 times. Note that this aborts the test suite run before running anything, which is why the following stats are about the 5 other test suite runs. FWIW, the nodes that had not bootstrapped yet were:
    • test035bridge: (80, None, 'Connecting to the Tor network internally')
    • test039obfs4: (85, None, 'Finishing handshake with first hop of internal circuit')
    • test004ba: (85, None, 'Finishing handshake with first hop of internal circuit')
    • test035bridge: (80, None, 'Connecting to the Tor network internally')
    • test036bridge: (80, None, 'Connecting to the Tor network internally')
  • “SSH is using the default SocksPort” failed with the now common “nc: connection failed […]” error 5/5 times (vs. 3 times in the last 10 test suite runs on the devel branch)
  • “Chatting with some friend over XMPP” failed 1/5 times vs. “Chatting with some friend over XMPP and with OTR” failing 1/10 times on devel with similar symptoms ⇒ I won’t lose any sleep over this one.

Apart from this, everything looks just as good/bad as on the devel branch.

So IMO, the only blockers here are now:

  • Increase the chutney wait_for_bootstrap timeout until it’s robust enough. If this requires an unreasonably large timeout, or if we notice that Chutney will sometimes simply never bootstrap at all, come back to the drawing board: it could be a bug in Chutney. I’ve just bumped the timeout from 180s to 240s.
  • Find out what’s going on with “SSH is using the default SocksPort”. Idea: try with a different host that’s outside of the *.tails.boum.org network (which our isotesters are part of).

I’ve also merged upstream Chutney changes again on this branch, because 4 months have passed since anonym did it, and I’d hate to be blocked here by issues that upstream has already fixed.

In passing, we’re still running Chutney under Python 2, while it now supports Python 3, and it seems that Tor folks now run it that way. We should change this at some point.

I probably won’t have time to come back to this ticket in time for 4.0 and anyway, I’ll need to let a week or two pass until there’s enough new data to analyze ⇒ postponing.

#34 Updated by intrigeri 2019-10-18 16:49:11

  • related to Bug #17163: "SSH is using the default SocksPort" test suite scenario is fragile added

#35 Updated by intrigeri 2019-10-27 08:22:30

I’ve just looked at runs 61-76.

> So IMO, the only blockers here are now:

> * Increase the chutney wait_for_bootstrap timeout until it’s robust enough. If this requires an unreasonably large timeout, or if we notice that Chutney will sometimes simply never bootstrap at all, come back to the drawing board: it could be a bug in Chutney. I’ve just bumped the timeout from 180s to 240s.

This problem happened in 3 out of the 16 last runs. I’ve bumped the timeout again, this time to 600 seconds.

If that’s still not enough, I see two ways to approach this, whenever I or someone else has time for it:

  • This might be related to our V3AuthVotingInterval 180 custom setting. Next steps:
    • Research why we did that
    • Try reverting to Chutney’s default (V3AuthVotingInterval 20)
  • Worst case, retry (i.e. restart Chutney) when chutney wait_for_bootstrap fails; a rough sketch follows below.
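
A rough sketch of what that worst-case fallback could look like, again in Python for brevity; start, stop and wait_for_bootstrap are standard Chutney subcommands, while the helper names, retry count, paths and timeout are made up for illustration:

import subprocess

def chutney(subcommand, network_file, chutney_dir, **kwargs):
    """Run one Chutney subcommand for the given network file."""
    return subprocess.run(["./chutney", subcommand, network_file],
                          cwd=chutney_dir, **kwargs)

def bootstrap_with_retries(chutney_dir, network_file, tries=3, timeout=600):
    """Restart the Chutney network until all nodes fully bootstrap."""
    for attempt in range(tries):
        try:
            chutney("wait_for_bootstrap", network_file, chutney_dir,
                    check=True, timeout=timeout)
            return  # every node reached 100% bootstrap
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if attempt == tries - 1:
                raise  # give up after the last attempt
            # Bootstrap failed or timed out: tear the network down and retry.
            chutney("stop", network_file, chutney_dir)
            chutney("start", network_file, chutney_dir, check=True)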

> * Find out what’s going on with “SSH is using the default SocksPort”. Idea: try with a different host that’s outside of the *.tails.boum.org network (which our isotesters are part of).

This was solved via Bug #17163 :)

Apart from that, the test suite on this branch is as robust as on devel.

#36 Updated by intrigeri 2019-11-08 18:29:02

>> * Increase the chutney wait_for_bootstrap timeout until it’s robust enough. If this requires an unreasonably large timeout, or if we notice that Chutney will sometimes simply never bootstrap at all, come back to the drawing board: it could be a bug in Chutney. I’ve just bumped the timeout from 180s to 240s.

> This problem happened in 3 out of the 16 last runs. I’ve bumped the timeout again, this time to 600 seconds.

This problem did not happen a single time during the last 15 runs (77-91), so I’m confident this did the trick.

I’ve looked closer at the last 5 runs and the failures are all well-known robustness problems.

#37 Updated by intrigeri 2019-11-08 18:30:44

  • Status changed from In Progress to Needs Validation
  • Assignee deleted (intrigeri)

#38 Updated by intrigeri 2019-11-08 18:34:07

  • blocked by Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete added

#39 Updated by intrigeri 2019-11-08 18:34:12

  • blocks deleted (Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete)

#40 Updated by intrigeri 2019-11-08 18:34:19

  • blocks Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete added

#41 Updated by intrigeri 2019-11-08 19:11:48

  • blocks Bug #11589: Time syncing over bridge is fragile added

#42 Updated by CyrilBrulebois 2019-12-04 11:31:20

  • Target version changed from Tails_4.1 to Tails_4.2

#43 Updated by hefee 2019-12-13 15:12:49

  • Assignee set to hefee

#44 Updated by Anonymous 2019-12-18 14:20:16

  • Status changed from Needs Validation to Resolved
  • % Done changed from 20 to 100

Applied in changeset commit:tails|e15781db1fba00d61df2957ba466a703b430dfa8.