Feature #8650

Configure monitoring for the most critical services

Added by intrigeri 2015-01-09 17:15:52. Updated 2016-04-20 06:57:07.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 2015-01-09
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for: 268

Description

That is, those with “CRITICAL” priority on the blueprint.


Subtasks


Related issues

Blocked by Tails - Feature #8648: Initial set up of the monitoring software Resolved 2016-03-07
Blocks Tails - Feature #8652: Evaluate how the initial monitoring setup behaves and adjust things accordingly Resolved 2015-01-09

History

#1 Updated by intrigeri 2015-01-09 17:16:07

  • blocked by Feature #8649: Specify our monitoring needs and build an inventory of the services that need monitoring added

#2 Updated by intrigeri 2015-01-09 17:16:18

  • blocked by Feature #8648: Initial set up of the monitoring software added

#3 Updated by intrigeri 2015-05-28 14:39:58

  • blocks deleted (Feature #8649: Specify our monitoring needs and build an inventory of the services that need monitoring)

#4 Updated by intrigeri 2015-05-28 14:40:46

  • blocks deleted (Feature #8648: Initial set up of the monitoring software)

#5 Updated by intrigeri 2015-05-28 14:41:10

#6 Updated by intrigeri 2015-05-28 14:41:32

  • blocked by Feature #8648: Initial set up of the monitoring software added

#7 Updated by intrigeri 2015-05-28 14:41:55

  • blocks #8668 added

#8 Updated by intrigeri 2015-05-28 14:44:55

  • blocks Feature #8652: Evaluate how the initial monitoring setup behaves and adjust things accordingly added

#9 Updated by intrigeri 2015-05-28 14:50:21

  • Target version changed from Tails_1.5 to Tails_1.6

#10 Updated by Dr_Whax 2015-07-07 12:28:38

  • Target version changed from Tails_1.6 to Tails_1.5

#11 Updated by intrigeri 2015-08-19 11:42:11

  • Target version changed from Tails_1.5 to Tails_1.6

#12 Updated by bertagaz 2015-09-23 01:26:03

  • Target version changed from Tails_1.6 to Tails_1.7

#13 Updated by intrigeri 2015-09-26 07:10:27

  • Description updated

#14 Updated by intrigeri 2015-09-26 07:14:36

  • Due date set to 2015-10-26

#15 Updated by intrigeri 2015-12-05 16:08:30

  • Due date deleted (2015-10-26)
  • Assignee changed from Dr_Whax to bertagaz
  • Target version changed from Tails_1.7 to Tails_2.0

#16 Updated by bertagaz 2016-01-27 10:49:45

  • Target version changed from Tails_2.0 to Tails_2.2

#17 Updated by bertagaz 2016-01-31 15:06:26

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

I’ve done the configuration for all these checks. They can all be done with the plugins shipped in the monitoring-plugins-* Debian packages, apart from the rsync one, which requires a plugin that can be found on the Nagios Exchange website, with minor adaptations.
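As an illustration of what an rsync check of this kind has to decide, here is a minimal Python sketch. It is not the Nagios Exchange plugin mentioned above: the function names and the module names in the usage note are hypothetical. It relies on `rsync rsync://HOST/` printing the daemon's module list, and it reports the result with the usual monitoring plugin exit codes (0 for OK, 2 for CRITICAL).

```python
import subprocess


def parse_modules(listing: str) -> set[str]:
    """Return the first token of each non-empty line of a module listing."""
    return {line.split()[0] for line in listing.splitlines() if line.strip()}


def list_modules(host: str, timeout: int = 10) -> str:
    """Ask the rsync daemon on `host` for its module list (needs network)."""
    result = subprocess.run(
        ["rsync", f"rsync://{host}/"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


def check(listing: str, expected: set[str]) -> tuple[int, str]:
    """Map a module listing to a (plugin exit code, message) pair."""
    missing = sorted(expected - parse_modules(listing))
    if missing:
        return 2, "CRITICAL: missing modules: " + ", ".join(missing)
    return 0, "OK: all expected modules are served"
```

A caller would run something like `check(list_modules("rsync.example.org"), {"some-module"})`, where both the host and the module name are placeholders, and exit with the returned code.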

#18 Updated by bertagaz 2016-03-10 18:51:03

  • Target version changed from Tails_2.2 to Tails_2.3

#19 Updated by bertagaz 2016-03-18 16:13:28

  • % Done changed from 20 to 30

I’ve deployed what is the skeleton for service checks. That’s tails::monitoring::service, and a first APT check has been deployed.

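Now as a reminder, the services marked as CRITICAL in the blueprint are:

  • our APT repo serves a `stable` suite.
  • https://jenkins.t.b.o asks for auth.
  • there are ISO images for `devel` and `stable` on http://nightly.t.b.o.
  • rsync is up and serving the right directories.
  • https://tails.boum.org/ is up.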

#20 Updated by intrigeri 2016-03-22 11:01:40

Hi! Commit 695ac501582cc771440368d837574bf76d950942 in puppet-tails introduces a buggy regexp that clearly doesn’t do what you believe it does (in practice it makes the validation much more lax than intended). I think you rather mean something like '^\w+(?:\w|\.)+$’@ (untested). Please do test such regexps when unsure, both with strings that are supposed to match, and with strings that are not supposed to match, especially when it’s about validating stuff :)

#21 Updated by bertagaz 2016-03-23 14:13:58

intrigeri wrote:
> Hi! Commit 695ac501582cc771440368d837574bf76d950942 in puppet-tails introduces a buggy regexp that clearly doesn’t do what you believe it does (in practice it makes the validation much more lax than intended). I think you rather mean something like '^\w+(?:\w|\.)+$’@ (untested). Please do test such regexps when unsure, both with strings that are supposed to match, and with strings that are not supposed to match, especially when it’s about validating stuff :)

I did test it, and it did seem to work, even with invalid input. Your proposal doesn’t seem to work to me otoh: it seems you misplaced/forgot the ‘@’. I’ve pushed another version, which is maybe a bit more strict.

#22 Updated by intrigeri 2016-03-25 18:09:02

> intrigeri wrote:
>> Hi! Commit 695ac501582cc771440368d837574bf76d950942 in puppet-tails introduces a buggy regexp that clearly doesn’t do what you believe it does (in practice it makes the validation much more lax than intended). I think you rather mean something like '^\w+(?:\w|\.)+$’@ (untested). Please do test such regexps when unsure, both with strings that are supposed to match, and with strings that are not supposed to match, especially when it’s about validating stuff :)

> I did test it, and it did seem to work, even with invalid input.

OK, so you were unlucky and did not hit any of the false negatives (cases when it would validate buggy input). FYI it would have (erroneously) validated for example that string:

abc@a%_»$PATH

… which, I believe, was not intended :)

> Your proposal doesn’t seem to work to me otoh: it seems you misplaced/forgot the ‘@’.

Redmine mangled my comment (or rather, I got it wrong how to include the “@” char in inline code that’s formatted with “@”). Sorry about the confusion! In such cases, when what you see is obviously wrong, you may want to pretend you’re going to edit my comment (the pen icon next to it), so you can see what exactly I have typed :) And then of course you can cancel the edit.

> I’ve pushed another version, which is maybe a bit more strict.

I think you misunderstand a little bit how a character set (in square brackets) works. E.g. those strings are valid according to the current validation regexp:

a|@b
a@b|

… which feels quite wrong.
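To make the character set point concrete, here is a small Python sketch. The pattern named `LAX` is hypothetical and is not the regexp from the commit (which is not quoted in this ticket); it only shows one way to end up accepting the two strings above. Inside square brackets, `|` is a literal character rather than an alternation operator. `re.ASCII` is used so that `\w` means `[A-Za-z0-9_]`, which is an assumption about the Puppet side.

```python
import re

# Hypothetical lax pattern: "|" inside [...] is just another allowed character.
LAX = re.compile(r"^[\w|-]+@[\w|.-]+$", re.ASCII)

# Alternation only works outside a character set, e.g. in a non-capturing group.
GROUPED = re.compile(r"^(?:\w|-)+@(?:\w|\.|-)+$", re.ASCII)


def accepts(pattern: re.Pattern, candidate: str) -> bool:
    """True if the whole candidate string matches the pattern."""
    return pattern.fullmatch(candidate) is not None


for candidate in ("a|@b", "a@b|", "a-b@c.d"):
    print(candidate, accepts(LAX, candidate), accepts(GROUPED, candidate))
```

The first two candidates are accepted by `LAX` only, while the last one is accepted by both patterns.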

The regexp I’ve proposed has none of those problems. But apparently you now want (or need) to allow dashes in both the left-hand and right-hand side of the “@”, so my proposal is outdated. Here’s an updated (and simpler) one:

^[\w-]+@[\w.-]+$

I suggest you either take that one as-is, or learn some basics about regexps, before submitting another proposal you don’t fully understand the meaning of. Fair enough?
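The advice to test a validation regexp with both strings that should match and strings that should not can be sketched as follows in Python. The regexp is the one proposed above; the "should match" strings are made-up examples, and `re.ASCII` is an assumption meant to approximate how `\w` behaves on the Puppet (Ruby) side. `fullmatch` is used so that a trailing newline is rejected too.

```python
import re

# Regexp proposed in this comment, compiled with ASCII-only \w (assumption).
PATTERN = re.compile(r"^[\w-]+@[\w.-]+$", re.ASCII)


def is_valid(name: str) -> bool:
    """True if the whole string is accepted by the proposed regexp."""
    return PATTERN.fullmatch(name) is not None


# Made-up examples of input that is supposed to validate.
SHOULD_MATCH = ["a@b", "foo-bar@host.example.org", "check_1@some-host"]

# Strings from this thread, plus a few obvious edge cases.
SHOULD_NOT_MATCH = ["a|@b", "a@b|", "abc@a%_»$PATH", "@b", "a@", "no-at-sign", "a@b\n"]

for name in SHOULD_MATCH:
    assert is_valid(name), f"expected a match: {name!r}"
for name in SHOULD_NOT_MATCH:
    assert not is_valid(name), f"expected no match: {name!r}"
```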

#23 Updated by bertagaz 2016-03-31 13:42:13

intrigeri wrote:
>
> I suggest you either take that one as-is, or learn some basics about regexps, before submitting another proposal you don’t fully understand the meaning of. Fair enough?

Yep, thx for the lengthy explanation. I’ve fixed that with your own regexp in commit puppet-tails:9007871

#24 Updated by bertagaz 2016-03-31 13:49:20

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 30 to 70
  • QA Check set to Ready for QA

bertagaz wrote:
>
> * our APT repo serves a `stable` suite.
> * https://jenkins.t.b.o asks for auth.
> * there are ISO images for `devel` and `stable` on http://nightly.t.b.o.
> * rsync is up and serving the right directories.
> * https://tails.boum.org/ is up.

ok, I’ve deployed all these checks. Most of them are HTTP checks and use the same tails::monitoring::service::http manifest.
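For readers unfamiliar with this kind of check, here is a minimal Python sketch of the logic behind an "asks for auth" HTTP check. It is not the tails::monitoring::service::http manifest nor the plugin it calls. The function names are hypothetical, and the expected status code is left as a parameter because which code the server returns when it asks for authentication is an assumption. The exit codes follow the usual monitoring plugin convention (0 for OK, 2 for CRITICAL).

```python
import urllib.error
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for `url` (needs network)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code


def evaluate(status: int, expected: int) -> tuple[int, str]:
    """Map an observed status code to a (plugin exit code, message) pair."""
    if status == expected:
        return OK, f"OK: got HTTP {status}"
    return CRITICAL, f"CRITICAL: expected HTTP {expected}, got {status}"
```

A check that expects an authentication prompt could then call `evaluate(fetch_status(url), expected=401)`, with 401 being a guess to adjust to what the server really answers.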

#25 Updated by intrigeri 2016-04-15 05:29:44

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> ok, I’ve deployed all these checks. Most of them are HTTP checks and use the same tails::monitoring::service::http manifest.

Thanks. I’ve had a look from the web interface PoV (I didn’t look at the code this time, let’s move on and I’ll review it all at the same time when you’re done with the other checks).

https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=nightly_stable and https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=nightly_devel result in 404 errors and socket timeouts too often; I realize that this very ticket is not about fixing problems identified by monitoring, and the majority of these problems are probably on the monitored systems’ side (and not problems with the monitoring system); but let’s not close another QA work chapter by leaving it in a shape in which we ignore false positives that are too numerous ⇒ please file a subtask of Feature #9484 to investigate and fix these flaky checks (and probably their root cause, most of the time; I can help with some of those, we can share them as part of sysadmin shifts).

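Same for:

  • https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=jenkins.tails.boum.org
  • https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=deb.tails.boum.org
  • https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=tails.boum.org
  • https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=rsync.tails.boum.org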

Once these robustness issues are well tracked in a way that explicitly blocks Feature #9484, please close this ticket as resolved. Congrats!

#26 Updated by bertagaz 2016-04-20 06:57:08

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 70 to 100

intrigeri wrote:
> https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=nightly_stable and https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=nightly_devel result in 404 errors and socket timeouts too often; I realize that this very ticket is not about fixing problems identified by monitoring, and the majority of these problems are probably on the monitored systems’ side (and not problems with the monitoring system); but let’s not close another QA work chapter by leaving it in a shape in which we ignore false positives that are too numerous ⇒ please file a subtask of Feature #9484 to investigate and fix these flaky checks (and probably their root cause, most of the time; I can help with some of those, we can share them as part of sysadmin shifts).
>
> Same for:
>
> * https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=jenkins.tails.boum.org
> * https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=deb.tails.boum.org
> * https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=tails.boum.org
> * https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=rsync.tails.boum.org
>
> Once these robustness issues are well tracked in a way that explicitly blocks Feature #9484, please close this ticket as resolved. Congrats!

I’ve made Feature #11358 block Feature #9484 as a way to track this. It might be that our different HTTP checks, running concurrently and often (at least every minute for each, every 30s if the check is flapping), were a bit too intense, and that being less aggressive there will help. If not, then it will deserve another ticket.