Feature #8651
Configure the receiving side of the monitoring notifications
100%
Description
This implies discussing with the rest of the sysadmin team how/when we want to get notifications.
Subtasks
Related issues
Blocks Tails - |
Resolved | 2015-01-09 |
History
#1 Updated by intrigeri 2015-05-28 14:46:45
- blocks #8668 added
#2 Updated by intrigeri 2015-05-28 14:47:07
- blocks Feature #9484: Deploy the monitoring setup to production added
#3 Updated by bertagaz 2015-12-15 03:31:09
- Target version changed from Tails_1.8 to Tails_2.0
Postponing
#4 Updated by bertagaz 2016-01-06 15:07:18
- Target version changed from Tails_2.0 to Tails_2.2
- Deliverable for changed from 268 to 269
Postponing this part of the monitoring setup, as it is unlikely to be done by the previously planned deadline.
#5 Updated by bertagaz 2016-01-06 15:20:50
- Deliverable for changed from 269 to 268
#6 Updated by bertagaz 2016-02-05 20:47:41
- Target version changed from Tails_2.2 to Tails_2.3
#7 Updated by bertagaz 2016-04-20 07:36:49
- Status changed from Confirmed to In Progress
- % Done changed from 0 to 10
I’ve enabled notifications for me with commits referencing this ticket on puppet-tails.
They’re sent only to me for the moment, and are enabled for all hosts, the HTTP checks, the whisperback check, and the disk checks.
They are sent when the event type is Problem or Recovery.
#8 Updated by bertagaz 2016-04-21 06:10:26
- Assignee changed from bertagaz to intrigeri
- QA Check set to Info Needed
I’ve made some tests to get how Icinga2 is notifying.
At the moment, the notifications are configured to be sent when a service is “OK”, “WARNING” or “CRITICAL”, and when the event type is “Problem”, “Recovery”, “Acknowledgement”, “DowntimeStart” or “DowntimeEnd”.
It means:
We’ll get notified if a Downtime starts or ends, or if someone acknowledges a problem.
When a service check starts to fail, Icinga2 will retry x times (configured with max_attempts, which is 5 at the moment), waiting retry_interval between each attempt, and if the check is still failing, it will send a “Problem” notification.
It will then recheck the service every check_interval, and send a notification every interval (30 minutes at the moment). If an Acknowledgement has been made, it will stop sending “Problem” notifications, but will send the “Recovery” one once the check succeeds again.
That sounds quite fair to me. Only problem I see is that we’ll get spam bombing every 30 minutes if a service fails continuously and no one acknowledged the problem. I think we should set interval to 1 day, so that we get notified less often in this case.
What’s your opinion on this?
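To put rough numbers on this proposal, here is a back-of-the-envelope count of “Problem” emails for a single service that keeps failing and is never acknowledged. It is plain arithmetic, not Icinga2 behaviour verified against a running instance; the helper name `problem_emails` and the 3-day outage are hypothetical.

```python
import math
from datetime import timedelta


def problem_emails(outage, interval):
    """Count "Problem" emails during `outage`, assuming one email when
    the problem is first reported and one more every `interval` after
    that, for a single unacknowledged service."""
    return math.ceil(outage / interval)


outage = timedelta(days=3)  # hypothetical outage length

every_30_min = problem_emails(outage, timedelta(minutes=30))
daily = problem_emails(outage, timedelta(days=1))
print(every_30_min, daily)
```

Under these assumptions, the 30-minute interval produces 144 emails over the three days, against 3 with a 1-day interval.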
#9 Updated by bertagaz 2016-04-21 06:10:39
- % Done changed from 10 to 40
#10 Updated by intrigeri 2016-04-25 04:34:21
- Assignee changed from intrigeri to bertagaz
- QA Check changed from Info Needed to Dev Needed
> That sounds quite fair to me. Only problem I see is that we’ll get spam bombing every 30 minutes if a service fails continuously and no one acknowledged the problem.
That sounds too much, considering the kind of availability we can reasonably expect from the on-duty sysadmin.
> I think we should set interval to 1 day, so that we get notified less often in this case.
Yes.
I’m curious to know how many emails you’ve received over N days, but perhaps fix some actual problems and check robustness issues before computing these stats.
#11 Updated by bertagaz 2016-04-26 04:55:17
- Assignee changed from bertagaz to intrigeri
- % Done changed from 40 to 60
- QA Check changed from Dev Needed to Info Needed
intrigeri wrote:
> > That sounds quite fair to me. Only problem I see is that we’ll get spam bombing every 30 minutes if a service fails continuously and no one acknowledged the problem.
>
> That sounds too much, considering the kind of availability we can reasonably expect from the on-duty sysadmin.
>
> > I think we should set interval to 1 day, so that we get notified less often in this case.
>
> Yes.
Done in commits puppet-tails:b7c4915 and puppet-tails:cd501eb.
> I’m curious to know how many emails you’ve received over N days, but perhaps fix some actual problems and check robustness issues before computing these stats.
Since April 20, I had 153 emails, of which approx. 50 were due to the apt-snapshots-disk check spamming me every 30 minutes. I had it resolved, but it came back again. All the rest are due to the whisperback check, which is flapping a lot.
We’ll see now that I’ve set the notification interval to 1 day, and worked a bit on the checks. Shall we use this ticket to track this evaluation, or use Feature #8652 for that?
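For reference, the figures above break down as follows. This is a rough calculation only: it assumes the 6 days between April 20 and this comment (April 26), and that every apt-snapshots-disk email was a 30-minute re-notification, neither of which is stated exactly in the ticket.

```python
total_emails = 153   # emails received since April 20 (from this comment)
disk_emails = 50     # approx., caused by the apt-snapshots-disk check
days = 6             # assumption: April 20 to April 26

# Everything else is attributed to the whisperback check.
whisperback_emails = total_emails - disk_emails

# Average daily volume over the period.
emails_per_day = total_emails / days

# If each disk email was a 30-minute re-notification, this is roughly
# how long the disk check spent in a failing, unacknowledged state.
disk_failing_hours = disk_emails * 30 / 60

print(whisperback_emails, emails_per_day, disk_failing_hours)
```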
#12 Updated by bertagaz 2016-04-26 04:55:27
- Target version changed from Tails_2.3 to Tails_2.4
#13 Updated by intrigeri 2016-04-26 06:13:26
- blocked by deleted (#8668)
#14 Updated by intrigeri 2016-04-26 06:15:54
- Assignee changed from intrigeri to bertagaz
- QA Check deleted (Info Needed)
bertagaz wrote:
> Since April 20, I had 153 emails, of which approx. 50 were due to the apt-snapshots-disk check spamming me every 30 minutes. I had it resolved, but it came back again. All the rest are due to the whisperback check, which is flapping a lot.
>
> We’ll see now that I’ve set the notification interval to 1 day, and worked a bit on the checks. Shall we use this ticket to track this evaluation, or use Feature #8652 for that?
IMO making sure that these notifications are useful is part of this ticket (it’s actually the hardest part of it I bet, since just enabling email notifications was not too hard I guess).
So, please reassign to me for QA once you’re happy with the current notifications (as in: you actually manage to stay on top of them in practice, plus they give you information you were not aware of, and that’s actionable), and you deem it ready to be directed to all sysadmins (instead of you only). I guess we’re not far from it :)
#15 Updated by bertagaz 2016-04-30 06:57:13
- Assignee changed from bertagaz to intrigeri
- QA Check set to Info Needed
Ok, so here are some new stats since the 1d remailing interval change on April 26:
I received 14 emails in total (apart from the whisperback ones, but those were less spammy too), of which:
* 7 are acknowledgments of a problem
* 2 are downtimes start and end notifications.
No new emails since April 29 in the evening, as there was no real change in the situation.
All these emails were legitimate: there were (and still are) some real problems, and they were worked on.
I’m wondering if now wouldn’t be a good time to point the notifications to our sysadmins list, so that you can have a look at what it looks like. I don’t think it would be very risky with the current settings, and I think it’s working quite well at the moment. We could still roll back to emailing just me if that doesn’t work well for you.
Some URIs that may help you have a look:
https://icingaweb2.tails.boum.org/monitoring/alertsummary/index?interval=1w
https://icingaweb2.tails.boum.org/monitoring/list/notifications?limit=100
What do you think?
#16 Updated by intrigeri 2016-05-03 07:46:18
- Assignee changed from intrigeri to bertagaz
- QA Check changed from Info Needed to Dev Needed
> Ok, so here are some new stats since the 1d remailing interval change on April 26:
Thanks, good to hear!
> I’m wondering if it wouldn’t be the good time to point the notifications to our sysadmins list, so that you can have a look at what it looks like.
OK, let’s try this!
#17 Updated by bertagaz 2016-05-03 08:48:59
- Assignee changed from bertagaz to intrigeri
- QA Check changed from Dev Needed to Ready for QA
intrigeri wrote:
> OK, let’s try this!
Done. I’m assigning this ticket to you and setting it to RfQA, as a reminder to check if the notification rate meets our design.
#18 Updated by intrigeri 2016-05-03 12:51:13
- Assignee changed from intrigeri to bertagaz
> I’m assigning this ticket to you and setting it to RfQA, as a reminder to check if the notification rate meets our design.
I think we have a ticket to deal with the consequences of the initial deployment, so feel free to close this one.
#19 Updated by bertagaz 2016-05-04 01:47:00
- Status changed from In Progress to Resolved
- Assignee deleted (bertagaz)
- % Done changed from 60 to 100
- QA Check changed from Ready for QA to Pass
intrigeri wrote:
> I think we have a ticket to deal with the consequences of the initial deployment, so feel free to close this one.
\o/