Bug #10495

The 'the time has synced' step is fragile

Added by anonym 2015-11-06 10:22:03. Updated 2017-08-14 18:15:07.

Status:
In Progress
Priority:
Elevated
Assignee:
Category:
Test suite
Target version:
Start date:
2015-11-06
Due date:
% Done:
0%
Feature Branch:
Type of work:
Research
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

See Bug #10494, which will fix this in Tails rather than in the test suite. This ticket is really just to acknowledge this robustness issue, even though nothing will be done in the test suite.


Subtasks


Related issues

Related to Tails - Bug #13472: Replace www.centos.org in htpdate pools Resolved 2017-07-15
Blocked by Tails - Bug #10494: Retry htpdate when it fails Rejected 2016-07-17
Blocked by Tails - Feature #9521: Use the chutney Tor network simulator in our test suite Resolved 2016-04-15
Blocked by Tails - Bug #11562: Monitor servers from the htpdate pools Confirmed 2016-07-14
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

History

#1 Updated by anonym 2015-11-06 10:22:25

  • blocked by Bug #10494: Retry htpdate when it fails added

#2 Updated by anonym 2015-11-06 10:22:41

  • blocks #8668 added

#3 Updated by anonym 2015-11-06 10:23:35

  • Assignee set to kytv
  • Target version set to Tails_1.8
  • Parent task set to Bug #10288
  • Deliverable for set to 270

#4 Updated by bertagaz 2015-11-16 04:01:23

Since Nov 6, 2015 (test_Tails_ISO_experimental #42), this step has been one of the most common failures: it broke a test job 19 times, and some of those jobs would probably have passed otherwise, since it was their only problem.

This should probably be worked on ASAP to make the test suite robust.

#5 Updated by intrigeri 2015-12-05 08:17:15

This might be a duplicate of Bug #10440.

#6 Updated by intrigeri 2015-12-05 13:24:32

  • Target version changed from Tails_1.8 to Tails_2.0

(For the moment, we’re going to mark as fragile all tests that depend on Tor having bootstrapped, so this is not so urgent.)

#7 Updated by intrigeri 2015-12-07 07:38:29

bertagaz wrote:
> Since Nov 6, 2015 (test_Tails_ISO_experimental #42), this step has been one of the most common failures: it broke a test job 19 times, and some of those jobs would probably have passed otherwise, since it was their only problem.

I don’t see it at all in the latest summary I generated and posted on Bug #10288. bertagaz, can you please check?

#8 Updated by intrigeri 2015-12-19 10:29:18

It would be fine to postpone this. In any case, please prioritize your SponsorsM4 stuff (Icedove) higher.

#9 Updated by intrigeri 2016-01-06 13:58:15

  • Target version changed from Tails_2.0 to Tails_2.2

#10 Updated by intrigeri 2016-02-05 13:52:50

  • Target version changed from Tails_2.2 to Tails_2.3

intrigeri wrote:
> It would be fine to postpone this. In any case, please prioritize your SponsorsM4 stuff (Icedove) higher.

Still the case.

#11 Updated by anonym 2016-03-03 16:08:49

  • Assignee changed from kytv to anonym
  • Target version changed from Tails_2.3 to Tails_2.4
  • Type of work changed from Wait to Research

Hopefully Chutney (Feature #9521) will fix the tordate parts.

#12 Updated by anonym 2016-03-03 16:18:16

Bug #10238 is probably related, but Redmine forbids adding a relationship due to circularity.

#13 Updated by anonym 2016-03-03 16:18:41

  • blocked by Feature #9521: Use the chutney Tor network simulator in our test suite added

#14 Updated by intrigeri 2016-05-14 13:44:13

  • blocked by deleted (#8668)

#15 Updated by intrigeri 2016-05-18 15:17:21

Again: is it a duplicate of Bug #10440?

#16 Updated by intrigeri 2016-05-26 09:10:13

intrigeri wrote:
> Again: is it a duplicate of Bug #10440?

Actually, let’s say no: Bug #10440 is about the scenarios that specifically test time sync, while this one is about the ‘the time has synced’ step, which all online scenarios rely on (via “Tor is ready”). So even though Bug #10440 is “fixed” (by disabling some tests) in test/10497-tor-bootstrap-is-fragile, we still have a problem here, and unsurprisingly I’ve seen it break tests again.

#17 Updated by anonym 2016-06-08 01:34:59

  • Target version changed from Tails_2.4 to Tails_2.5

#18 Updated by bertagaz 2016-07-15 04:16:14

  • QA Check set to Ready for QA

So, as noted on Bug #10494, I’ll report my test suite run results here.

I’ve reported briefly on my first runs in Bug #10494#note-25. As promised, I’ve tried the --connect-timeout option, but it didn’t bring much improvement: I had 3 failures of this step in 120 runs (slightly more than without it). Still, I think this option makes sense: waiting 2 minutes for a single request sounds too long to me, even over Tor.

After that, I found that 3 URLs in the pools were faulty, and fixed them (as stated in Bug #10494#note-30). Since then, I’ve run the scenario mentioned in Bug #10494#note-25 something like 150 times and seen no failure!

So it seems to me that the few errors in the previous runs were probably due to these faulty URLs. 2 of them were in the HTP_POOL_PAL pool, which may explain things if htpdate tries a pool 5 times before erroring out. I still see it being restarted sometimes, though that seems to happen a bit less often than before.

So in the end, I think the enhancement brought by Bug #10494 fixes this step. Actually, it may very well have been a bug in Tails. I believe this ticket can also be considered Ready for QA now, so I’m setting it accordingly.

#19 Updated by bertagaz 2016-07-15 08:58:30

  • Assignee changed from anonym to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

bertagaz wrote:
> Since then, I’ve run the scenario mentioned in Bug #10494#note-25 something like 150 times and seen no failure!

And it seems I needed to post this note to see one. :/

So this step is not entirely fixed, and I noticed the failure too late to find out why. When I inspected the htpdate logs, htpdate claimed to have succeeded, so maybe this is due to Tor bootstrapping problems.

I’ll run more tests, raising the try_for timeout to see if I still get failures.

#20 Updated by intrigeri 2016-07-15 09:35:26

Hold on, see my latest comments on Bug #10494. IMO we should do something simpler and less risky first before we invest even more time here.

#21 Updated by intrigeri 2016-07-18 07:22:26

  • Target version changed from Tails_2.5 to Tails_2.6
  • Deliverable for changed from 270 to SponsorS_Internal

#23 Updated by intrigeri 2016-08-18 07:30:58

  • Target version deleted (Tails_2.6)

#24 Updated by intrigeri 2016-08-18 07:35:53

  • Assignee deleted (bertagaz)
  • Deliverable for deleted (SponsorS_Internal)

#25 Updated by bertagaz 2017-07-12 11:19:41

  • Priority changed from Normal to Elevated

This happened 60 times across all currently known branches in June, and 104 times in total in the 2017 logs we have.

That’s a lot, so I’m raising the priority. The first step would probably be to check whether the HTTP servers used by htpdate are OK, then tackle Bug #10494.

#26 Updated by intrigeri 2017-07-13 18:36:35

  • blocked by Bug #11562: Monitor servers from the htpdate pools added

#27 Updated by intrigeri 2017-07-13 18:38:17

> This happened 60 times across all currently known branches in June, and 104 times in total in the 2017 logs we have.

Ouch.

> That’s a lot, so I’m raising the priority. The first step would probably be to check whether the HTTP servers used by htpdate are OK, then tackle Bug #10494.

IMO next step is Bug #10495 (on your plate): there might be issues in our current HTP pool, and there’s some hope that fixing them will avoid having to do Bug #10494 at all.

#28 Updated by bertagaz 2017-07-15 14:41:16

intrigeri wrote:
> > This happened 60 times across all currently known branches in June, and 104 times in total in the 2017 logs we have.
>
> Ouch.
>
> > That’s a lot, so I’m raising the priority. The first step would probably be to check whether the HTTP servers used by htpdate are OK, then tackle Bug #10494.
>
> IMO next step is Bug #10495 (on your plate): there might be issues in our current HTP pool, and there’s some hope that fixing them will avoid having to do Bug #10494 at all.

To get this fixed sooner rather than waiting until we have monitoring in place, I quickly checked the URLs in the different HTP pools and found that www.centos.org consistently fails to reply to curl (using the same command line as htpdate).

So I’ll open a ticket and prepare a branch replacing www.centos.org with something like https://getfedora.org/, which seems to be reliable.
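For illustration only, here is a minimal sketch of the kind of pool check described above, written in Python as an assumption (it is not the script that was actually used). HTP derives the time from the HTTP Date header, so a pool server is only useful if it answers promptly and returns a parseable Date header. The sketch uses Python’s urllib instead of curl, so it does not reproduce htpdate’s exact command line, and the helper names `check_url` and `date_header_offset` are hypothetical.

```python
import http.client
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def date_header_offset(date_header, local_now):
    """Seconds by which the server's Date header is ahead of local_now, or None if unparseable."""
    try:
        server_time = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError on garbage input, newer ones ValueError.
        return None
    if server_time.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        server_time = server_time.replace(tzinfo=timezone.utc)
    return (server_time - local_now).total_seconds()


def check_url(url, timeout=10.0):
    """HEAD the URL; return the clock offset in seconds, or None if unusable as an HTP source."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            date_header = response.headers.get("Date")
    except (OSError, ValueError, http.client.HTTPException):
        # Malformed URLs, connection failures, timeouts, TLS errors,
        # HTTP error statuses and broken responses all end up here.
        return None
    if date_header is None:
        return None
    return date_header_offset(date_header, datetime.now(timezone.utc))
```

Running `check_url` over every URL in each pool (the real pool lists are not reproduced here) would flag servers like the failing one as `None`, while healthy servers should report an offset of a few seconds at most.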

#29 Updated by bertagaz 2017-07-15 14:44:07

  • related to Bug #13472: Replace www.centos.org in htpdate pools added

#30 Updated by intrigeri 2017-07-16 11:58:32

> To get this fixed sooner rather than waiting until we have monitoring in place, I quickly checked the URLs in the different HTP pools and found that www.centos.org consistently fails to reply to curl (using the same command line as htpdate).

Amazing!

#31 Updated by bertagaz 2017-08-14 18:15:07

  • Status changed from Confirmed to In Progress

Applied in changeset commit:aeb903a6fe8422c3beb677110c56f04b28c6108b.

#32 Updated by intrigeri 2019-03-08 15:36:50

#33 Updated by intrigeri 2019-03-20 14:48:22

#34 Updated by intrigeri 2019-03-20 14:48:37

  • blocked by deleted (Feature #13241: Core work: Test suite maintenance)