Bug #11562

Monitor servers from the htpdate pools

Added by bertagaz 2016-07-14 03:41:44 . Updated 2019-09-18 12:15:35 .

Status: Confirmed
Priority: Normal
Assignee: zen
Category: Time synchronization
Target version:
Start date: 2016-07-14
Due date:
% Done: 0%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for:

Description

While tackling Bug #10494, it came up that some of the HTTP servers in the htpdate pools were buggy. This affects whether Tails boots correctly and whether our test suite runs reliably. We should monitor whether these servers are up and answering correctly to the curl requests made by htpdate, to ensure this service is reliable.


Subtasks


Related issues

Related to Tails - Bug #13472: Replace www.centos.org in htpdate pools Resolved 2017-07-15
Related to Tails - Bug #10494: Retry htpdate when it fails Rejected 2016-07-17
Related to Tails - Bug #17233: Ensure only one of our HTP pool has hosts handled by Cloudflare Confirmed
Related to Tails - Bug #12023: htpdate: stop sending User-Agent that fakes Tor Browser Resolved 2016-12-08
Blocks Tails - Bug #10495: The 'the time has synced' step is fragile In Progress 2015-11-06
Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) Confirmed 2017-06-29

History

#1 Updated by intrigeri 2016-07-16 05:19:42

Excellent idea!

The consequences of a failing check will likely need to be different from what we do for our own services: we can’t fix the web servers that are in the HTP pools; all we can do is drop them from the pool in the next Tails release. So, what matters here is aggregated availability stats, rather than real-time up/down status info.

Email notifications would be useless noise, and as a sysadmin I’d rather not see info about such failures on our dashboard’s “Current Incidents” page, if possible: sysadmins’ duty does not include maintaining the HTP pools we use, and I don’t want to train myself to ignore incidents.

But the RM (or the Foundations team?) needs to regularly check, e.g. at the beginning of each release cycle, if some servers in the pool are too unreliable, so that they can be replaced. How can they be given access to the aggregated availability stats they need to do this job? The easier their task, the greater the chances that it’ll actually be done regularly.
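To illustrate what "aggregated availability stats" could look like, here is a minimal Python sketch. It assumes probe results are already available as (host, success) pairs; the function names and the 0.95 threshold are hypothetical, not something decided on this ticket.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def availability(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful probes for each host."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for host, success in results:
        total[host] += 1
        if success:
            ok[host] += 1
    return {host: ok[host] / total[host] for host in total}


def unreliable_hosts(results: Iterable[Tuple[str, bool]],
                     threshold: float = 0.95) -> List[str]:
    """List hosts whose availability is below a (hypothetical) threshold."""
    stats = availability(results)
    return sorted(host for host, ratio in stats.items() if ratio < threshold)
```

Someone reviewing the pools at the start of a release cycle would then only need to read the output of `unreliable_hosts()` for the period since the previous release.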

#2 Updated by anonym 2016-09-20 16:54:12

  • Target version changed from Tails_2.6 to Tails_2.7

#3 Updated by bertagaz 2016-09-22 05:48:45

  • Target version changed from Tails_2.7 to Tails_2.9.1

#4 Updated by anonym 2016-12-14 20:11:25

  • Target version changed from Tails_2.9.1 to Tails 2.10

#5 Updated by intrigeri 2016-12-18 09:57:40

  • Target version changed from Tails 2.10 to Tails_2.11

#6 Updated by bertagaz 2017-03-08 10:38:05

  • Target version changed from Tails_2.11 to Tails_2.12

#7 Updated by bertagaz 2017-03-08 11:09:21

  • Target version changed from Tails_2.12 to Tails_3.0

#8 Updated by intrigeri 2017-03-17 08:59:05

  • Type of work changed from Code to Sysadmin

#9 Updated by bertagaz 2017-05-21 16:03:32

  • Target version changed from Tails_3.0 to Tails_3.1

#10 Updated by bertagaz 2017-05-27 10:17:33

  • Target version changed from Tails_3.1 to Tails_3.2

#11 Updated by intrigeri 2017-06-29 10:17:08

  • blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added

#12 Updated by intrigeri 2017-07-13 18:36:35

  • blocks Bug #10495: The 'the time has synced' step is fragile added

#13 Updated by bertagaz 2017-07-15 14:44:23

  • related to Bug #13472: Replace www.centos.org in htpdate pools added

#14 Updated by bertagaz 2017-09-07 13:02:35

  • Target version changed from Tails_3.2 to Tails_3.3

#15 Updated by bertagaz 2017-09-07 13:34:16

  • blocked by deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))

#16 Updated by bertagaz 2017-09-07 13:34:25

  • blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added

#17 Updated by bertagaz 2017-10-03 10:53:32

  • Target version changed from Tails_3.3 to Tails_3.5

#18 Updated by bertagaz 2017-10-23 10:35:23

One idea about this: with Bug #13541 and the feature/13541-save-more-data-on-htpdate-or-tor-failures branch merge, we’re now collecting htpdate logs each time such a failure happens on our isotesters. We could gather these files and use them as a source to produce statistics about server failures. That would give an overview closer to server failures in an almost real Tails context, rather than using basic URL fetching or coding some simulation of htpdate’s behavior (depending on how we want to test these servers).

intrigeri wrote:
> But the RM (or the Foundations team?) needs to regularly check, e.g. at the beginning of each release cycle, if some servers in the pool are too unreliable, so that they can be replaced. How can they be given access to the aggregated availability stats they need to do this job? The easier their task, the greater the chances that it’ll actually be done regularly.

Then maybe there are different options:

  • It could be accessible through a web page, which could be hosted on www.lizard. That could even be the start of some status.t.b.o page, where we could output such information and also publicly output Jenkins build statuses. Or maybe it could be combined with other types of stats on a metrics.t.b.o page?
  • Given the people we’re talking about, and the impact it has on our test suite in Jenkins, maybe the tails-ci list is a good recipient. We could send email notifications there.

#19 Updated by bertagaz 2017-11-15 14:39:52

  • Target version changed from Tails_3.5 to Tails_3.6

#20 Updated by Anonymous 2018-01-19 16:27:48

  • related to Bug #10494: Retry htpdate when it fails added

#21 Updated by bertagaz 2018-03-14 11:32:11

  • Target version changed from Tails_3.6 to Tails_3.7

#22 Updated by intrigeri 2018-03-19 12:42:21

FWIW I was told that some servers in our pool don’t send a Date header anymore, which could explain issues we’ve seen. I’ve not verified it myself but to identify such issues, here also: “what matters here is aggregated availability stats, rather than real-time up/down status info”.
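A check for this kind of failure could look roughly like the Python sketch below. It is not how htpdate itself works (htpdate uses curl, and in Tails goes through Tor); the function names are hypothetical, and the probe uses a plain HEAD request only to show the idea of recording whether a usable Date header came back.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional
import urllib.request


def date_offset_seconds(headers: Mapping[str, str],
                        now: datetime) -> Optional[float]:
    """Return server time minus `now` in seconds, or None if the
    response carries no usable Date header."""
    raw = headers.get("Date")
    if raw is None:
        return None
    try:
        server_time = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Unparseable header: count it the same as a missing one.
        return None
    if server_time.tzinfo is None:
        # Assumption: a date without a usable zone is treated as UTC.
        server_time = server_time.replace(tzinfo=timezone.utc)
    return (server_time - now).total_seconds()


def probe(url: str, timeout: float = 10.0) -> Optional[float]:
    """Hypothetical probe: send a HEAD request and inspect the Date header.
    Network and HTTP errors are left to the caller to record as failures."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return date_offset_seconds(response.headers,
                                   datetime.now(timezone.utc))
```

A `None` result, like a raised exception, would be logged as one failed probe for that host and fed into the aggregated stats rather than triggering a notification.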

#23 Updated by bertagaz 2018-05-10 11:09:16

  • Target version changed from Tails_3.7 to Tails_3.8

#24 Updated by intrigeri 2018-06-26 16:27:54

  • Target version changed from Tails_3.8 to Tails_3.9

#25 Updated by intrigeri 2018-09-05 16:26:53

  • Target version changed from Tails_3.9 to Tails_3.10.1

#26 Updated by intrigeri 2018-10-24 17:03:38

  • Target version changed from Tails_3.10.1 to Tails_3.11

#27 Updated by CyrilBrulebois 2018-12-16 14:11:13

  • Target version changed from Tails_3.11 to Tails_3.12

#28 Updated by anonym 2019-01-30 11:59:15

  • Target version changed from Tails_3.12 to Tails_3.13

#29 Updated by CyrilBrulebois 2019-03-20 14:34:05

  • Target version changed from Tails_3.13 to Tails_3.14

#30 Updated by CyrilBrulebois 2019-05-23 21:23:21

  • Target version changed from Tails_3.14 to Tails_3.15

#31 Updated by CyrilBrulebois 2019-07-10 10:33:58

  • Target version changed from Tails_3.15 to Tails_3.16

#32 Updated by intrigeri 2019-08-09 14:55:59

  • Assignee changed from bertagaz to Sysadmins
  • Target version deleted (Tails_3.16)

(As per today’s sysadmin team meeting.)

#33 Updated by zen 2019-09-18 12:15:35

  • Assignee changed from Sysadmins to zen

#34 Updated by intrigeri 2019-11-14 16:16:55

  • related to Bug #17233: Ensure only one of our HTP pool has hosts handled by Cloudflare added

#35 Updated by intrigeri 2019-11-15 09:34:10

  • related to Bug #12023: htpdate: stop sending User-Agent that fakes Tor Browser added