Feature #8653

Configure monitoring for other high-priority services

Added by intrigeri 2015-01-09 17:18:32. Updated 2016-04-30 06:51:48.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
2015-01-09
Due date:
2015-11-09
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:
268

Description

That is, those with “HIGH” priority on the blueprint.


Subtasks


History

#1 Updated by intrigeri 2015-01-09 17:18:43

  • blocked by Feature #8652: Evaluate how the initial monitoring setup behaves and adjust things accordingly added

#2 Updated by intrigeri 2015-05-28 14:39:35

  • blocks deleted (Feature #8652: Evaluate how the initial monitoring setup behaves and adjust things accordingly)

#3 Updated by intrigeri 2015-05-28 14:43:14

#4 Updated by intrigeri 2015-05-28 14:45:11

  • blocks #8668 added

#5 Updated by intrigeri 2015-05-28 14:45:22

  • blocked by Feature #8652: Evaluate how the initial monitoring setup behaves and adjust things accordingly added

#6 Updated by intrigeri 2015-05-28 14:50:07

  • Target version changed from Tails_1.5 to Tails_1.6

#7 Updated by bertagaz 2015-09-23 01:25:45

  • Target version changed from Tails_1.6 to Tails_1.7

#8 Updated by intrigeri 2015-09-26 07:10:48

  • Description updated

#9 Updated by intrigeri 2015-09-26 07:19:23

  • Due date set to 2015-11-09

#10 Updated by Dr_Whax 2015-09-26 07:24:34

  • Target version changed from Tails_1.7 to Tails_1.8

#11 Updated by intrigeri 2015-12-05 16:14:20

  • Assignee changed from Dr_Whax to bertagaz
  • Target version changed from Tails_1.8 to Tails_2.0

#12 Updated by bertagaz 2016-01-27 10:49:45

  • Target version changed from Tails_2.0 to Tails_2.2

#13 Updated by bertagaz 2016-02-05 21:04:33

  • Target version changed from Tails_2.2 to Tails_2.3

#14 Updated by bertagaz 2016-03-18 16:18:07

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

As with Feature #8650, the playground for the service checks to be deployed is in place.

The services that are ranked as HIGH in the blueprint are:

  • all systems are up and running.
  • all systems disk usage is fine (bytes and inodes).
  • all systems memory usage is fine.
  • http://torbrowser-archive.t.b.o must serve TorBrowser tarballs.
  • SSH and SFTP test accounts must be up
  • WhisperBack SMTP relay is up
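
For context: each of these items maps to an Icinga2 service check backed by a plugin that follows the standard Nagios plugin protocol, i.e. it prints one status line and exits with 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. Here is a minimal sketch of that convention (the names and thresholds are made-up examples, not code from our repositories):

    #!/usr/bin/env python
    # Minimal sketch of the Nagios/Icinga2 plugin protocol that the
    # checks discussed on this ticket follow. All values are made up.
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def report(code, label, message):
        # One status line on stdout, then the matching exit code.
        print("%s - %s" % (label, message))
        sys.exit(code)

    def check_value(value, warn, crit):
        # Generic threshold logic: higher is worse.
        if value >= crit:
            report(CRITICAL, "CRITICAL", "value is %d (>= %d)" % (value, crit))
        if value >= warn:
            report(WARNING, "WARNING", "value is %d (>= %d)" % (value, warn))
        report(OK, "OK", "value is %d" % value)

    if __name__ == "__main__":
        check_value(value=42, warn=80, crit=90)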

#15 Updated by bertagaz 2016-03-18 18:19:18

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

bertagaz wrote:
> * all systems disk usage is fine (bytes and inodes).

I’ve added the disk check for all nodes. It displays bytes and inodes. Is that enough, or do we need to add a specific check for inodes?

#16 Updated by bertagaz 2016-03-21 09:59:39

bertagaz wrote:
> * all systems memory usage is fine.

There’s no memory plugin shipped with Icinga2 or the Nagios plugins. There’s a check_load one though. I found one on the web that can check for system memory: https://github.com/justintime/nagios-plugins. Do you think we should use this one? Or shall we stick to the check_load one, even if that’s not really what we drafted?
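
A hedged illustration of what such a memory check boils down to (the plugins discussed here are Perl; this Python sketch is not any of them, and the thresholds are made up):

    #!/usr/bin/env python
    # Illustrative sketch only, not the justintime plugin: parse
    # /proc/meminfo and apply warning/critical thresholds on the
    # percentage of memory in use.
    import sys

    def meminfo():
        # Parse /proc/meminfo into a dict of kB values.
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.strip().split()[0])
        return info

    def main(warn=80, crit=90):
        info = meminfo()
        # Count buffers/cache as reclaimable, as most memory checks do.
        free = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
        used_pct = 100.0 * (info["MemTotal"] - free) / info["MemTotal"]
        if used_pct >= crit:
            print("CRITICAL - %.1f%% memory used" % used_pct)
            sys.exit(2)
        if used_pct >= warn:
            print("WARNING - %.1f%% memory used" % used_pct)
            sys.exit(1)
        print("OK - %.1f%% memory used" % used_pct)
        sys.exit(0)

    if __name__ == "__main__":
        main()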

#17 Updated by intrigeri 2016-03-22 11:55:58

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

>> * all systems disk usage is fine (bytes and inodes).

> I’ve added the disk check for all nodes. It displays bytes and inodes. Is that enough, or do we need to add a specific check for inodes?

I’m not sure I understand the question; all this feels pretty blurry to me, e.g. I’ve no idea what practical difference “add a specific check” would make compared to what we have (“display”). I suggest you communicate with me as if I knew nothing about Icinga2 (which is mostly true), and instead refer to the kinds of outcomes I can understand well, that is, what we wrote together in the design doc :)

We need to monitor disk usage (bytes) and disk usage (inodes). We need to be notified if one of those goes wrong. If what you implemented already satisfies these requirements, then fine. Or do we need to clarify these requirements?

> There’s no memory plugin shipped with Icinga2 or the Nagios plugins. There’s a check_load one though. I found one on the web that can check for system memory: https://github.com/justintime/nagios-plugins. Do you think we should use this one?

I see a crapload of memory plugins so I suggest you ask our many friends who use Nagios/Icinga which one they prefer and use.

> Or shall we stick to the check_load one, even if that’s not really what we drafted?

If you think that we actually don’t need to monitor memory usage, please argue about it (monitoring system load has very little to do with that AFAICT; sorry if I didn’t guess right what a plugin called check_load does).

#18 Updated by bertagaz 2016-03-23 14:10:02

intrigeri wrote:
> We need to monitor disk usage (bytes) and disk usage (inodes). We need to be notified if one of those goes wrong. If what you implemented already satisfies these requirements, then fine. Or do we need to clarify these requirements?

Nope, forget about it: it was unclear to me how the check_disk plugin works, hence my not-so-clear question. But in the end, the plugin does what we want: it checks disk bytes and inodes and reports if one of these values is out of bounds (that last point was the one that was blurry to me).
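
To make concrete why one check can cover both: a filesystem exposes its byte and inode counters side by side, e.g. through statvfs(). A rough Python sketch of the idea (not check_disk’s actual implementation, which is C, and the thresholds are made up):

    #!/usr/bin/env python
    # Sketch: statvfs() returns both byte and inode counters, so a
    # single check can alert when either runs out.
    import os
    import sys

    def free_percentages(path):
        st = os.statvfs(path)
        return (100.0 * st.f_bavail / st.f_blocks,   # bytes free
                100.0 * st.f_favail / st.f_files)    # inodes free

    def main(path="/", warn=10, crit=5):
        bytes_free, inodes_free = free_percentages(path)
        msg = "%s: %.1f%% bytes free, %.1f%% inodes free" % (
            path, bytes_free, inodes_free)
        worst = min(bytes_free, inodes_free)
        if worst < crit:
            print("CRITICAL - " + msg)
            sys.exit(2)
        if worst < warn:
            print("WARNING - " + msg)
            sys.exit(1)
        print("OK - " + msg)
        sys.exit(0)

    if __name__ == "__main__":
        main()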

> I see a crapload of memory plugins so I suggest you ask our many friends who use Nagios/Icinga which one they prefer and use.

TBH, only 3 or 4 of them are interesting in our case. I’ve sent an email to get input from our friends; we’ll see. I suspect from their Puppet manifests that they don’t check memory, or that they use a check_mk feature to do it.

> > Or shall we stick to the check_load one, even if that’s not really what we drafted?
>
> If you think that we actually don’t need to monitor memory usage, please argue about it (monitoring system load has very little to do with that AFAICT; sorry if I didn’t guess right what a plugin called check_load does).

That was just to be sure how flexible the blueprint was on these kinds of items.

#19 Updated by bertagaz 2016-03-23 14:11:31

bertagaz wrote:
> * all systems disk usage is fine (bytes and inodes).

Ok, added a check for the basic partitions of every system (/boot and /), as well as checks for custom partitions dedicated to services.

#20 Updated by bertagaz 2016-03-23 14:17:51

Also, in commit puppet-tails:2b0e864 I’ve added a bunch of regexps which I hope do not stink too much.

#21 Updated by intrigeri 2016-03-25 18:33:03

> Also, in commit puppet-tails:2b0e864 I’ve added a bunch of regexps which I hope do not stink too much.

OK, so I had a quick look:

  • $partitions: it seems overly complicated, using capturing parentheses in places where I doubt we need even grouping parentheses at all, and escaping characters that I doubt need to be escaped; generally, it’s not clear to me what you are trying to do here, so I can’t review this. I could decide that you’re probably right and look elsewhere, but given what I know about your regexp skills, I’m not sure that the benefit potentially brought by this regexp is worth its current complexity; if we’re going to keep a regexp here, it might be better if I look closer. So I see two options:
    • document what the regexp should, and should not, match, so I can review it (and I’ll be happy to improve it if you want)
    • drop the regexp and just validate_string
  • $wfree and friends, i.e. ^(\d{1,3}%|\d+)$: I think you can drop the parentheses, which would be less confusing (parentheses actually do something, so a reader like me wastes time trying to understand why you added them, because if they were useless you would not have added them, presumably; see? :) (a quick regexp test follows after this list)
  • $name: this seems to be duplicated from a similar regexp found in tails::monitoring::service, and at first glance it shares the same bugs (that we’ve been discussing on another ticket)
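
For the curious, here is a quick test of how the grouping in the $wfree pattern interacts with the anchors, using Python’s re as a stand-in for the regexp engine Puppet uses (the sample values are made up):

    import re

    grouped = r'^(\d{1,3}%|\d+)$'   # the pattern quoted above
    ungrouped = r'^\d{1,3}%|\d+$'   # the same pattern, parentheses dropped

    for value in ['95%', '1024', '95%x', 'x1024']:
        print("%-6s grouped=%-5s ungrouped=%s" % (
            value,
            bool(re.search(grouped, value)),
            bool(re.search(ungrouped, value))))

    # grouped accepts only '95%' and '1024'; ungrouped also accepts the
    # last two values, because without the group, ^ binds to the first
    # alternative only and $ to the second. If the parentheses go away,
    # the anchors need to be repeated on both sides of the '|'.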

#22 Updated by intrigeri 2016-03-25 18:35:37

> That was just to be sure how flexible the blueprint was on these kinds of items.

IMO the blueprint is flexible in the sense that if it has useless or incorrect stuff, we can drop or correct it. Replacing checks we initially thought were needed (and still seem pretty important to me), with mostly unrelated ones, needs to be justified. I hope we’re now on the same page :)

#23 Updated by bertagaz 2016-03-31 14:20:54

intrigeri wrote:
>
> * $partitions: it seems overly complicated, using capturing parentheses in places where I doubt we need even grouping parentheses at all, and escaping characters that I doubt need to be escaped; generally, it’s not clear to me what you are trying to do here, so I can’t review this. I could decide that you’re probably right and look elsewhere, but given what I know about your regexp skills, I’m not sure that the benefit potentially brought by this regexp is worth its current complexity; if we’re going to keep a regexp here, it might be better if I look closer. So I see two options:
> document what the regexp should, and should not, match, so I can review it (and I’ll be happy to improve it if you want)
> drop the regexp and just validate_string

In the end I dropped this regexp, as this parameter is already documented. You could have guessed what I was trying to capture by reading it, btw.

> * $wfree and friends, i.e. ^(\d{1,3}%|\d+)$: I think you can drop the parentheses, which would be less confusing (parentheses actually do something, so a reader like me wastes time trying to understand why you added them, because if they were useless you would not have added them, presumably; see? :)

Done, commit puppet-tails:9007871

> * $name: this seems to be duplicated from a similar regexp found in tails::monitoring::service, and at first glance it shares the same bugs (that we’ve been discussing on another ticket)

Yep, fixed too.

#24 Updated by bertagaz 2016-03-31 14:23:01

intrigeri wrote:
> IMO the blueprint is flexible in the sense that if it has useless or incorrect stuff, we can drop or correct it. Replacing checks we initially thought were needed (and still seem pretty important to me), with mostly unrelated ones, needs to be justified. I hope we’re now on the same page :)

We are! :)

#25 Updated by bertagaz 2016-03-31 14:25:47

  • % Done changed from 10 to 40

bertagaz wrote:
> The services that are ranked as HIGH in the blueprint are:
>
> * all systems are up and running.
> * all systems disk usage is fine (bytes and inodes).
> * WhisperBack SMTP relay is up

These have already been deployed.

#26 Updated by bertagaz 2016-04-01 08:47:41

  • % Done changed from 40 to 50

bertagaz wrote:
> * all systems memory usage is fine.

I’ve deployed that too, and set lower limits for lizard, given that with reserved hugepages it was raising alarms.

#27 Updated by bertagaz 2016-04-05 09:15:23

  • % Done changed from 50 to 60

bertagaz wrote:
>
> * http://torbrowser-archive.t.b.o must serve TorBrowser tarballs.

This one is deployed. It will probably require more review, as the plugin had to be written from scratch given the particular needs. Only one left!
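
The plugin itself isn’t shown on this ticket; a hypothetical sketch of the kind of check described here would fetch the archive index and require at least one tarball to be listed (the URL and filename pattern below are illustrative assumptions, and this is Python 2, as on Jessie):

    #!/usr/bin/env python
    # Hypothetical sketch, not the actual plugin: fetch the archive
    # index and require at least one Tor Browser tarball to be listed.
    import re
    import sys
    import urllib2

    URL = "http://torbrowser-archive.tails.boum.org/"

    def main():
        try:
            index = urllib2.urlopen(URL, timeout=30).read()
        except Exception as e:
            print("UNKNOWN - could not fetch %s: %s" % (URL, e))
            sys.exit(3)
        if re.search(r'tor-browser.*\.tar\.(gz|xz)', index):
            print("OK - %s lists Tor Browser tarballs" % URL)
            sys.exit(0)
        print("CRITICAL - no tarball found at %s" % URL)
        sys.exit(2)

    if __name__ == "__main__":
        main()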

#29 Updated by intrigeri 2016-04-13 04:21:41

FWIW https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=4mvq3pnvid3awjln.onion seems to reliably fail once out of three attempts. I don’t know how this will translate into feedback to sysadmins (dashboard/notifications), in a way that allows us to distinguish “service is unstable” from “oh it’s normal that the check fails once out of three times”; time will tell (I don’t think the feedback part was submitted for review yet so I’ll wait). In passing, it seems a bit intense to query that server every 30 seconds, no?

#30 Updated by bertagaz 2016-04-20 03:20:34

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 60 to 80
  • QA Check changed from Dev Needed to Ready for QA

bertagaz wrote:
> * SSH and SFTP test accounts must be up

This last one has been deployed too!

intrigeri wrote:
> FWIW https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=4mvq3pnvid3awjln.onion seems to reliably fail once out of three attempts. I don’t know how this will translate into feedback to sysadmins (dashboard/notifications), in a way that allows us to distinguish “service is unstable” from “oh it’s normal that the check fails once out of three times”; time will tell (I don’t think the feedback part was submitted for review yet so I’ll wait).

No, it wasn’t. I’ve seen that this service check was regularly failing, so I’ve enabled flapping detection for it, which should help regarding notifications. This feature was introduced in Icinga2 precisely for this kind of not-so-reliable service check.

> In passing, it seems a bit intense to query that server every 30 seconds, no?

Yes, and I think it’s part of the problem. I haven’t yet taken care of setting up the check intervals; that work is tracked by Feature #11358.

#31 Updated by intrigeri 2016-04-25 04:13:51

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> bertagaz wrote:
>> * SSH and SFTP test accounts must be up

> This last one has been deployed too!

So I guess it’s time I do a code review of all these checks, right?

> intrigeri wrote:
>> FWIW https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=4mvq3pnvid3awjln.onion seems to reliably fail once out of three attempts.

> No, it wasn’t. I’ve seen that this service check was regularly failing, so I’ve enabled flapping detection for it, which should help regarding notifications. This feature was introduced in Icinga2 precisely for this kind of not-so-reliable service check.

Thanks! However, https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=4mvq3pnvid3awjln.onion seems to tell us that 10+ notifications about it were sent over the last 24 hours. Did I misunderstand? Or what’s the deal?

Other than that, I’m reassigning to you as “dev needed” since Feature #8653#note-28 was not addressed as far as I can tell, and thus the dashboard still has a warning that I already taught myself to ignore, which feels wrong.

#32 Updated by bertagaz 2016-04-26 04:59:32

  • Target version changed from Tails_2.3 to Tails_2.4

#33 Updated by bertagaz 2016-04-26 05:22:17

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:
> So I guess it’s time I do a code review of all these checks, right?

Yep, it is!

> Thanks! However, https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=4mvq3pnvid3awjln.onion seems to tell us that 10+ notifications about it were sent over the last 24 hours. Did I misunderstand? Or what’s the deal?

Nope, you got it right. It fails frequently due to bad circuits failing to reach the hidden service.

Since then I’ve changed this check’s configuration a bit and added the --isolate option to torsocks, as you advised (commits puppet-tails:25364de and puppet-tails:2860537). I tried the flapping threshold setting too, but it didn’t seem to help much, given the high level of flapping this check had. The --isolate option seems to help a bit though; we’ll see in a few days.

> Other than that, I’m reassigning to you as “dev needed” since Feature #8653#note-28 was not addressed as far as I can tell, and thus the dashboard still has a warning that I already taught myself to ignore, which feels wrong.

Regarding the apt-cacher-ng-disk check, it seems to be in good shape right now, so I’m not sure we need to change anything here.

For the apt-snapshots-disk one, I lowered the warning threshold to < 50G free, and the critical one to < 25G. I’m not sure whether these levels sound correct; I may need your input here, given you are more knowledgeable about this feature.

#34 Updated by intrigeri 2016-04-26 05:53:37

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

https://icingaweb2.tails.boum.org/monitoring/list/hosts doesn’t list all our VMs (at least puppet-git.lizard is missing), so it looks like the checks that are about “all systems” are not completely set up yet. Please double-check that everything required has been implemented, and then I’ll happily review it :)

Regarding the WhisperBack check (and any other fragile one, really): either consider it’s part of this ticket, and then make sure it satisfies our needs before your next request for review; or, file a dedicated subtask of Feature #8652 about it. Your call.

Regarding the apt-cacher-ng-disk check: the fact that we’re back to a usage level that doesn’t show up on icinga2 right now doesn’t resolve the problem I’ve raised. I doubt we can afford ensuring that a fifth of each of our large data filesystems is constantly free, so I still am in favour of monitoring settings that don’t force us to waste this much disk space.

Regarding the apt-cacher-ng-disk check: sounds good, let’s try it out this way!

#35 Updated by bertagaz 2016-04-26 09:28:43

intrigeri wrote:
> https://icingaweb2.tails.boum.org/monitoring/list/hosts doesn’t list all our VMs (at least puppet-git.lizard is missing), so it looks like the checks that are about “all systems” are not completely set up yet. Please double-check that everything required has been implemented, and then I’ll happily review it :)

Right, I didn’t activate the agents on the other VMs yet. Will do soon.

> Regarding the WhisperBack check (and any other fragile one, really): either consider it’s part of this ticket, and then make sure it satisfies our needs before your next request for review; or, file a dedicated subtask of Feature #8652 about it. Your call.

Ack, will report about it on this ticket then. Note that I previously reported some bits on Feature #8652 already.

> Regarding the apt-cacher-ng-disk check: the fact that we’re back to a usage level that doesn’t show up on icinga2 right now doesn’t resolve the problem I’ve raised. I doubt we can afford ensuring that a fifth of each of our large data filesystems is constantly free, so I still am in favour of monitoring settings that don’t force us to waste this much disk space.

Ok, updated it in commit b8c273f in our main repo.

> Regarding the apt-cacher-ng-disk check: sounds good, let’s try it out this way!

I guessed that’s about the apt-snapshots-disk check. If so, fine with me.

#36 Updated by bertagaz 2016-04-26 13:26:01

bertagaz wrote:
> Right, I didn’t activate the agents on the other VMs yet. Will do soon.

Added the agent on every remaining VM; their checks are slowly getting reported. Only the isobuilders ones are broken, because their Puppet manifest doesn’t apply cleanly for unrelated reasons (ikiwiki apt_preferences-related problems). I’ll investigate tomorrow.

#37 Updated by bertagaz 2016-04-27 02:54:03

bertagaz wrote:
> Only the isobuilders ones are broken, because their Puppet manifest doesn’t apply cleanly for unrelated reasons (ikiwiki apt_preferences-related problems). I’ll investigate tomorrow.

Fixed. The isobuilders checks are slowly being added.

#38 Updated by bertagaz 2016-04-27 03:17:33

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:
> Regarding the WhisperBack check (and any other fragile one, really): either consider it’s part of this ticket, and then make sure it satisfies our needs before your next request for review; or, file a dedicated subtask of Feature #8652 about it. Your call.

The --isolate option to torsocks seems to help: I’ve had far fewer false-positive emails. But there still are some on this check, so I’ve raised max_check_attempts a bit, from 5 to 8, to see if that helps get rid of them. Do you think it makes sense?

In the end I think that’s probably not so much a monitoring problem as a service one. We’re experiencing here what our users sometimes do (not being able to send reports). I wonder if another way to fix that would be to set up other WhisperBack nodes for redundancy, but I’m not sure it would fix this issue for real.

On another topic, the system_disks check on puppet-git raised the problem that the root partition has less than 500M free. It seems that the etckeeper Git repo in /etc/ is taking almost 400M by itself, as it’s also tracking all the files under /etc/puppet/environments/production. Shall I grow this partition by adding something like 1G?

#39 Updated by intrigeri 2016-04-28 12:51:01

>> Regarding the apt-cacher-ng-disk check: […]

> Ok, updated it in commit b8c273f in our main repo.

Looks good, thank you.

>> Regarding the apt-cacher-ng-disk check: sounds good, let’s try it out this way!

> I guessed that’s about the apt-snapshots-disk check. If so, fine with me.

Yes.

#40 Updated by intrigeri 2016-04-28 12:57:45

> Fixed. The isobuilders checks are slowly being added.

Good. The only problems I see:

  • https://icingaweb2.tails.boum.org/monitoring/service/show?host=puppet-git.lizard&service=iso-history-gitolite-disk seems to still be broken
  • same for https://icingaweb2.tails.boum.org/monitoring/service/show?host=puppet-git.lizard&service=torbrowser-gitolite-disk — I think it’s fine to set more relaxed permissions on the root directory of this FS, as long as the sensitive subdirs (e.g. the .git one, if any) are protected

#41 Updated by intrigeri 2016-04-28 13:05:48

  • Assignee changed from intrigeri to bertagaz

> The --isolate option to torsocks seems to help: I’ve had far fewer false-positive emails. But there still are some on this check, so I’ve raised max_check_attempts a bit, from 5 to 8, to see if that helps get rid of them. Do you think it makes sense?

Yes, assuming this impacts .

> In the end I think that’s probably not so much a monitoring problem as a service one. We’re experiencing here what our users sometimes do (not being able to send reports).

I must say I’m a little bit surprised, as I’ve never seen any hidden service I’m regularly using perform this poorly. So I’m curious: where does “what our users sometimes do” come from, exactly?

> I wonder if another way to fix that would be to set up other WhisperBack nodes for redundancy, but I’m not sure it would fix this issue for real.

I don’t see why this would change anything, given how HS load-balancing worked last time I checked. Did I miss anything?

> On another topic, the system_disks check on puppet-git raised the problem that the root partition has less than 500M free. It seems that the etckeeper Git repo in /etc/ is taking almost 400M by itself, as it’s also tracking all the files under /etc/puppet/environments/production.

According to /etc/.gitignore, these files are not tracked by Git, so the reason for disk usage must lie elsewhere. Anyway, no big deal.

> Shall I grow this partition by adding something like 1G?

Yes, please.

I’ll now do the code review for these checks, yeah :)

#42 Updated by intrigeri 2016-04-28 14:07:56

>> So I guess it’s time I do a code review of all these checks, right?

> Yep, it is!

Here we go! I have to first say this is high quality work, on which I don’t have much to say except some (mostly style) nitpicking. Congrats :)

  • globals.return_servicegroup_host is buggy if service.vars.real_host is true but service.host_name != service.vars.real_host is false: it won’t return anything. The if (service.host_name != service.vars.real_host) condition seems to be useless, and introduces this bug (a control-flow sketch follows after this list).
  • files/monitoring/icinga2/plugins/check_ssh_or_sftp_account: I’m surprised you had to write a custom plugin for this. But the code is good, and simple enough, so well :)
  • Please read the “Autorequires” section on https://docs.puppet.com/puppet/latest/reference/type.html#file, it’ll help you improve the signal/noise ratio of your Puppet code (you’re sometimes explicitly adding requires on the owner user, or on the parent directory, which as you’ll see is not needed; and you’re not doing it consistently, which is a bit troubling for crazy folks like me).
  • “$real_host must resolve to the host where the service is running” ← do you really mean “resolve to”, as in DNS-speak, or just “be”?
  • files/monitoring/icinga2/plugins/check_rsync doesn’t exactly respect current Perl best practices, but well, let’s say it’s good enough.
  • files/monitoring/icinga2/plugins/check_mem.pl could use libraries that are kept up-to-date instead of reinventing wheels, but well, let’s say it’s good enough.
  • in files/monitoring/icinga2/plugins/check_torbrowser_archive, in the get_url function: I’m aware that generally bare exceptions are Bad™, so on the one hand I’m glad you handle only the exceptions you expect, but in this specific case I wonder what happens if we ever hit a type of exception that’s not handled. Perhaps we should catch all exceptions here? Otherwise, good job, you’re making progress at programming these days, it’s a pleasure for me :)
  • Nitpicking: “Url” (typo for “URL”) is now called “URI” these days, I think. Ignore that if you want for already written code, but please take it into account in the future ;)
  • s/unix/Unix/
  • s/Host name/Hostname/
  • s/IP Address/IP address/
  • Grammar mistake in “that will be checks” (checked)
  • “Take care to use a uniq $name” ← please clarify within what set it must be unique (and s/uniq/unique/).
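
Regarding the first item, a Python analogue of the control flow as described there (the real function is Icinga2 DSL, not shown on this ticket) makes the missing-return path visible:

    # Python analogue of the globals.return_servicegroup_host logic as
    # described above. When real_host is set but equal to host_name,
    # no branch returns, so the caller gets None.
    def return_servicegroup_host(host_name, real_host=None):
        if real_host:
            if host_name != real_host:
                return real_host
            # bug: nothing is returned on this path
        else:
            return host_name

    assert return_servicegroup_host("web1") == "web1"
    assert return_servicegroup_host("web1", "db1") == "db1"
    assert return_servicegroup_host("web1", "web1") is None  # the bug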

#43 Updated by bertagaz 2016-04-29 02:34:48

intrigeri wrote:
> Here we go! I have to first say this is high quality work, on which I don’t have much to say except some (mostly style) nitpicking. Congrats :)

Glad you like it.

> * globals.return_servicegroup_host is buggy if service.vars.real_host is true but service.host_name != service.vars.real_host is false: it won’t return anything. The if (service.host_name != service.vars.real_host) condition seems to be useless, and introduces this bug.

Hmmm, in the case you describe, I believe it will in fact return service.host_name. But you’re right, this condition is useless; it’s there only because I initially thought I’d add vars.real_host to every service check configuration, but chose a different path in the end. Fixed in puppet-tails:35664d4

> * files/monitoring/icinga2/plugins/check_ssh_or_sftp_account: I’m surprised you had to write a custom plugin for this. But the code is good, and simple enough, so well :)

There’s a plugin to check whether an SSH server is running, but it’s not able to go further and try to connect to a specific user account.
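
To illustrate the difference (this is a hypothetical sketch, not the code of check_ssh_or_sftp_account; host, user, and key path are made-up examples):

    #!/usr/bin/env python
    # Hypothetical sketch of what such a plugin must do beyond a plain
    # "is sshd up" probe: actually authenticate as a given test account.
    import subprocess
    import sys

    def check_account(host, user, key):
        cmd = [
            "ssh",
            "-i", key,
            "-o", "BatchMode=yes",      # never prompt for a password
            "-o", "ConnectTimeout=10",
            "%s@%s" % (user, host),
            "true",                     # run a no-op command and exit
        ]
        return subprocess.call(cmd) == 0

    if __name__ == "__main__":
        if check_account("test.example.org", "test", "/etc/icinga2/test_id_rsa"):
            print("OK - test account reachable over SSH")
            sys.exit(0)
        print("CRITICAL - could not log in as the test account")
        sys.exit(2)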

> * Please read the “Autorequires” section on https://docs.puppet.com/puppet/latest/reference/type.html#file, it’ll help you improve the signal/noise ratio of your Puppet code (you’re sometimes explicitly adding requires on the owner user, or on the parent directory, which as you’ll see is not needed; and you’re not doing it consistently, which is a bit troubling for crazy folks like me).

Right, it’s still a bit blurry to me what this autorequire feature actually requires. :)

Commit puppet-tails:e2f57ee

> * “$real_host must resolve to the host where the service is running” ← do you really mean “resolve to”, as in DNS-speak, or just “be”?

Be, commit puppet-tails:b34e35e

> * files/monitoring/icinga2/plugins/check_rsync doesn’t exactly respect current Perl best practices, but well, let’s say it’s good enough.

That’s the most relevant plugin I found, but it’s a bit outdated.

> * files/monitoring/icinga2/plugins/check_mem.pl could use libraries that are kept up-to-date instead of reinventing wheels, but well, let’s say it’s good enough.

Same; plugins are often of quite bad quality in terms of coding practices.

> * in files/monitoring/icinga2/plugins/check_torbrowser_archive, in the get_url function: I’m aware that generally bare exceptions are Bad™, so on the one hand I’m glad you handle only the exceptions you expect, but in this specific case I wonder what happens if we ever hit a type of exception that’s not handled. Perhaps we should catch all exceptions here? Otherwise, good job, you’re making progress at programming these days, it’s a pleasure for me :)

True, I did it in the other custom plugin; I should have done the same in this one. Fixed in commit puppet-tails:1e656d9
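
The commit isn’t reproduced on this ticket, but the pattern under discussion looks roughly like this (illustrative Python 2, not the actual plugin code):

    # Sketch of the pattern discussed above: handle the expected
    # network errors specifically, and turn any unexpected exception
    # into UNKNOWN instead of a traceback, which Icinga2 would report
    # as a confusing plugin failure.
    import sys
    import urllib2

    def get_url(url):
        try:
            return urllib2.urlopen(url, timeout=30).read()
        except urllib2.URLError as e:
            # The errors we expect: DNS failures, refused connections,
            # HTTP error statuses (HTTPError is a subclass of URLError).
            print("CRITICAL - %s: %s" % (url, e))
            sys.exit(2)
        except Exception as e:
            # Anything we did not anticipate.
            print("UNKNOWN - unexpected error fetching %s: %s" % (url, e))
            sys.exit(3)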

> * Nitpicking: “Url” (typo for “URL”) is now called “URI” these days, I think. Ignore that if you want for already written code, but please take it into account in the future ;)

Right. Old school here. :)

Note that the urllib Python module still talks about URLs.

> * s/unix/Unix/
> * s/Host name/Hostname/
> * s/IP Address/IP address/

Thanks for this good catch. commit puppet-tails:327a0e0

> * Grammar mistake in “that will be checks” (checked)

Oops, commit puppet-tails:0da5d7c

> * “Take care to use a uniq $name” ← please clarify within what set it must be unique (and s/uniq/unique/).

commit puppet-tails:3741247

#44 Updated by bertagaz 2016-04-29 03:05:02

intrigeri wrote:
> * https://icingaweb2.tails.boum.org/monitoring/service/show?host=puppet-git.lizard&service=iso-history-gitolite-disk seems to still be broken
> * same for https://icingaweb2.tails.boum.org/monitoring/service/show?host=puppet-git.lizard&service=torbrowser-gitolite-disk — I think it’s fine to set more relaxed permissions on the root directory of this FS, as long as the sensitive subdirs (e.g. the .git one, if any) are protected

Yeah, I think my workaround didn’t change anything; the nagios user probably doesn’t have enough access to the block device either. I’ll try the other permission-change workaround.

#45 Updated by bertagaz 2016-04-29 08:22:59

bertagaz wrote:
> intrigeri wrote:
> > * https://icingaweb2.tails.boum.org/monitoring/service/show?host=puppet-git.lizard&service=iso-history-gitolite-disk seems to still be broken
> > * same for https://icingaweb2.tails.boum.org/monitoring/service/show?host=puppet-git.lizard&service=torbrowser-gitolite-disk — I think it’s fine to set more relaxed permissions on the root directory of this FS, as long as the sensitive subdirs (e.g. the .git one, if any) are protected
>
> Yeah, I think my workaround didn’t change anything; the nagios user probably doesn’t have enough access to the block device either. I’ll try the other permission-change workaround.

I made commits puppet-tails:35667c9 and puppet-tails:b579632, and they seem to solve the problem, as the torbrowser-gitolite-disk check’s state change shows.

#46 Updated by bertagaz 2016-04-29 08:27:37

intrigeri wrote:
> I must say I’m a little bit surprised, as I’ve never seen any hidden service I’m regularly using perform this poorly. So I’m curious: where does “what our users sometimes do” come from, exactly?

From my own experience when I tried to send a bug report, and from Bug #11136, which seems to indicate it sometimes happens.

But you were right: as discussed on IRC, it seems the Tor running on whisperback.li was the Wheezy version, while our system is Jessie, so it was crashing every few minutes because it was compiled against older openssl and libevent versions than the ones installed on the system. We’ll see how it behaves now that I’ve upgraded it to the Jessie version and it doesn’t restart so often.

#47 Updated by bertagaz 2016-04-29 09:37:41

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Info Needed to Ready for QA

intrigeri wrote:
> > Shall I grow this partition by adding something like 1G?
>
> Yes, please.

Done. The check should come back to normal soon.

Putting this ticket back to RfQA. I think I’ve done everything needed, and we just have to wait and see whether the checks are doing better now. I’ll report here whether they do (or not). Meanwhile, there are some code changes from after your review that need validating. :)

#48 Updated by intrigeri 2016-04-29 12:39:56

> Note that the urllib python module still talk about URLs.

Yes, and I do too in many other places :)

Wrt. all other changes you mention in the comment I’m replying to: great!

#49 Updated by intrigeri 2016-04-29 12:44:28

> I made commits puppet-tails:35667c9 and puppet-tails:b579632, and they seem to solve the problem, as the torbrowser-gitolite-disk check’s state change shows.

Assuming you’re talking about these commits in our manifests repo (and not in the puppet-tails one): that’s a pretty big hammer you used there, and I’m worried it will grant the nagios user more permissions than we would like, now or later. I would really prefer making the root directory of each of those filesystems o+r (and the parent directories o+x). What do you think?

#50 Updated by intrigeri 2016-04-29 12:48:25

> But you were right: as discussed on IRC, it seems the Tor running on whisperback.li was the Wheezy version, while our system is Jessie, so it was crashing every few minutes because it was compiled against older openssl and libevent versions than the ones installed on the system. We’ll see how it behaves now that I’ve upgraded it to the Jessie version and it doesn’t restart so often.

Indeed. Please reassign to me once you’ve confirmed it improves things to a point when it seems production-ready (or delegate to a subtask of Feature #8652).

#51 Updated by intrigeri 2016-04-29 12:51:03

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

#52 Updated by bertagaz 2016-04-30 04:59:42

intrigeri wrote:
> > But you were right: as discussed on IRC, it seems the Tor running on whisperback.li was the Wheezy version, while our system is Jessie, so it was crashing every few minutes because it was compiled against older openssl and libevent versions than the ones installed on the system. We’ll see how it behaves now that I’ve upgraded it to the Jessie version and it doesn’t restart so often.
>
> Indeed. Please reassign to me once you’ve confirmed it improves things to a point when it seems production-ready (or delegate to a subtask of Feature #8652).

This check hasn’t failed since the Tor upgrade on whisperback.li. Given its failure rate before, it should already have failed by now. So it really was a bug in the service, not in the check, and it seems to be fixed now. \o/

#53 Updated by bertagaz 2016-04-30 06:03:12

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:
> > I made commits puppet-tails:35667c9 and puppet-tails:b579632, and they seem to solve the problem, as the torbrowser-gitolite-disk check’s state change shows.
>
> Assuming you’re talking about these commits in our manifests repo (and not in the puppet-tails one): that’s a pretty big hammer you used there, and I’m worried it will grant the nagios user more permissions than we would like, now or later. I would really prefer making the root directory of each of those filesystems o+r (and the parent directories o+x). What do you think?

I was wondering too. So before implementing that change, I had a look at the permissions under /var/lib/gitolite. They seem to be quite strict: e.g. every repository is 0700 (and we didn’t need to change that in my first implementation), and most of the files directly under /var/lib/gitolite are 0600 (except .gitolite.rc). So in fact, adding the nagios user to the gitolite group doesn’t give it that much extra access.

I gave a try to just changing the permissions as you described (commit puppet-tails:91c354c), but it didn’t work in Puppet, and to make it work we’d have to make serious changes in the manifests. So I’m unsure whether it’s worth that much work. What do you think?

#54 Updated by intrigeri 2016-04-30 06:43:49

  • % Done changed from 80 to 100
  • QA Check changed from Info Needed to Pass

> I was wondering too. So before implementing that change, I had a look at the permissions under /var/lib/gitolite. They seem to be quite strict: e.g. every repository is 0700 (and we didn’t need to change that in my first implementation), and most of the files directly under /var/lib/gitolite are 0600 (except .gitolite.rc). So in fact, adding the nagios user to the gitolite group doesn’t give it that much extra access.

I still find it a bit scary, because we have no guarantee that these things won’t change some day, but well, given:

> I gave a try to just changing the permissions as you described (commit puppet-tails:91c354c), but it didn’t work in Puppet, and to make it work we’d have to make serious changes in the manifests. So I’m unsure whether it’s worth that much work. What do you think?

OK, fair enough. Thanks for looking into it!

#55 Updated by intrigeri 2016-04-30 06:49:33

  • blocked by deleted (#8668)

#56 Updated by intrigeri 2016-04-30 06:49:39

  • blocks deleted (Feature #8652: Evaluate how the initial monitoring setup behaves and adjust things accordingly)

#57 Updated by intrigeri 2016-04-30 06:51:48

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)