Feature #8652

Evaluate how the initial monitoring setup behaves and adjust things accordingly

Added by intrigeri 2015-01-09 17:18:01. Updated 2016-07-14 02:54:48.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
2015-01-09
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:
268

Description

This includes evaluating how much disk space it costs us to save downtime information.


Subtasks


Related issues

Blocked by Tails - Feature #8650: Configure monitoring for the most critical services Resolved 2015-01-09

History

#1 Updated by intrigeri 2015-01-09 17:18:43

  • blocks Feature #8653: Configure monitoring for other high-priority services added

#2 Updated by intrigeri 2015-05-28 14:39:34

  • blocked by deleted (Feature #8653: Configure monitoring for other high-priority services)

#3 Updated by intrigeri 2015-05-28 14:43:54

  • Assignee changed from bertagaz to Dr_Whax
  • Target version changed from Tails_1.8 to Tails_1.5
  • Parent task changed from Feature #5734 to Feature #9482

#4 Updated by intrigeri 2015-05-28 14:44:31

  • blocks #8668 added

#5 Updated by intrigeri 2015-05-28 14:44:55

  • blocked by Feature #8650: Configure monitoring for the most critical services added

#6 Updated by intrigeri 2015-05-28 14:45:21

  • blocks Feature #8653: Configure monitoring for other high-priority services added

#7 Updated by intrigeri 2015-05-28 14:50:14

  • Target version changed from Tails_1.5 to Tails_1.6

#8 Updated by bertagaz 2015-09-23 01:25:47

  • Target version changed from Tails_1.6 to Tails_1.7

#9 Updated by Dr_Whax 2015-09-26 04:56:18

  • Target version changed from Tails_1.7 to Tails_1.8

Since we should test it out and I’m traveling soon, maybe we should postpone this a bit.

#10 Updated by intrigeri 2015-09-26 07:32:21

  • Description updated

#11 Updated by intrigeri 2015-12-05 16:14:38

  • Assignee changed from Dr_Whax to bertagaz
  • Target version changed from Tails_1.8 to Tails_2.0

#12 Updated by bertagaz 2016-01-06 15:09:02

  • Target version changed from Tails_2.0 to Tails_2.2
  • Deliverable for changed from 268 to 269

Postponing this part of the monitoring setup, as it is unlikely to be done by the previously planned deadline.

#13 Updated by bertagaz 2016-01-06 15:20:39

  • Deliverable for changed from 269 to 268

#14 Updated by bertagaz 2016-02-05 21:04:43

  • Target version changed from Tails_2.2 to Tails_2.3

#15 Updated by bertagaz 2016-04-21 08:24:32

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

I guess this starts with the roll-out of notifications, and thus has started with Feature #8651.

#16 Updated by bertagaz 2016-04-22 10:14:42

To me this is a subtask of Feature #9484.

#17 Updated by bertagaz 2016-04-22 10:28:43

I wanted to roll back this last parenting change. I initially wanted to reorganize a bunch of ticket parents to clarify the difference between the prototype and the production state, which the Redmine parenting hierarchy IMO hardly represents as it stands. But I get ‘Parent task is invalid’ errors while reparenting other tickets, and now I can’t reparent this ticket back…

#18 Updated by bertagaz 2016-04-22 11:12:56

Apart from the notifications, on the check side it seems the whisperback one is not reliable, probably because of Tor network trouble reaching the hidden service.

This is the main source of problems we can see in the history: apt-snapshot-disk has been resolved, and autotest-s* were “artificial” tests of the notifications.

This is actually the main cause of notification spam: false positives caused by network problems. Even the HTTP checks are quiet at the moment (no notifications), with only occasional false positives (mostly the 10-second timeout being reached).

I’m unsure how to solve the whisperback one though. The hidden service network overhead seems to guarantee false positives. I tried raising the {check,retry}_interval values (check_interval=20m and retry_interval=10m), but it didn’t lead to fewer notifications. I’ll investigate this a bit more.
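
A minimal sketch of the kind of tuning described above, assuming the check is defined as an Icinga2 apply rule (the host name, service name and values are illustrative, not our actual configuration):

  apply Service "whisperback-hidden-service" {
    import "generic-service"
    // "http" comes from the Icinga Template Library and wraps the
    // check_http plugin; actually reaching the hidden service would
    // additionally require routing the probe through Tor, left out here
    check_command = "http"
    // the intervals experimented with above: check every 20 minutes,
    // retry every 10 minutes before a soft state becomes a hard one
    check_interval = 20m
    retry_interval = 10m
    // give the probe more slack than the default 10s to absorb Tor latency
    vars.http_timeout = 30
    assign where host.name == "whisperback.example"  // hypothetical host name
  }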

We could also instead check locally that everything is running fine, with (yet to be defined) commands run from the agent on whisperback.li. We’d lose the network view, which OTOH is probably the main source of problems. But this may require a much more elaborate homebrew plugin, depending on what we would define as a correct local check…
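
Purely as an illustration of what such a local check could look like (the actual commands are yet to be defined, so the process name and threshold below are made up), one could run a standard process check through the Icinga2 agent instead of probing over Tor:

  apply Service "whisperback-local-process" {
    import "generic-service"
    // run the check on the monitored host itself via the Icinga2 agent,
    // so Tor latency is out of the picture
    command_endpoint = host.name
    // "procs" comes from the Icinga Template Library (check_procs plugin)
    check_command = "procs"
    vars.procs_command = "whisperback"  // hypothetical process name
    vars.procs_critical = "1:"          // CRITICAL if no such process is running
    assign where host.name == "whisperback.example"  // hypothetical host name
  }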

#19 Updated by bertagaz 2016-04-26 04:59:33

  • Target version changed from Tails_2.3 to Tails_2.4

#20 Updated by intrigeri 2016-04-30 06:49:37

  • blocked by deleted (Feature #8653: Configure monitoring for other high-priority services)

#21 Updated by bertagaz 2016-05-18 13:24:54

  • Target version changed from Tails_2.4 to Tails_2.5

#22 Updated by bertagaz 2016-06-08 06:03:55

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 20 to 50
  • QA Check set to Ready for QA

After a month and a half or so in production, my conclusion is that it’s working quite well. It has already pointed us to issues we fixed, is useful during sysadmin shifts to spot things that need care, and doesn’t mailbomb us. So I’d be in favor of closing this ticket.

#23 Updated by intrigeri 2016-06-08 06:17:29

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> After a month and a half or so in production, my conclusion is that it’s working quite well. It has already pointed us to issues we fixed, is useful during sysadmin shifts to spot things that need care, and doesn’t mailbomb us.

Yes!

> So I’d be in favor of closing this ticket.

Yes… once the task that this ticket’s description mentions has been done.

#24 Updated by bertagaz 2016-06-17 08:10:44

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:
> > So I’d be in favor of closing this ticket.
>
> Yes… once the task that this ticket’s description mentions has been done.

We don’t keep that much data, so I don’t think disk space is an issue:

~# du -sh /var/lib/mysql/ /var/lib/icinga2
100M    /var/lib/mysql/
1.1M    /var/lib/icinga2

#25 Updated by intrigeri 2016-07-13 09:34:09

  • Status changed from In Progress to Resolved
  • QA Check changed from Ready for QA to Pass

> We don’t keep that much data, so I don’t think disk space is an issue:

I doubt that this really includes “downtime information”, since IIRC we had actually disabled storing such data. To clarify, what I meant when this was added to this ticket’s description is: data that would allow us to compute per-service availability stats some day, i.e. when we want to identify unreliable services. But whatever, at this point I don’t think it’s a must, so let’s forget about it.

#26 Updated by intrigeri 2016-07-13 09:35:58

  • Assignee deleted (intrigeri)
  • % Done changed from 50 to 100

#27 Updated by intrigeri 2016-07-13 09:36:07

  • blocked by deleted (#8668)

#28 Updated by bertagaz 2016-07-14 02:54:48

intrigeri wrote:
> I doubt that this really includes “downtime information”, since IIRC we had actually disabled storing such data. To clarify, what I meant when this was added to this ticket’s description is: data that would allow us to compute per-service availability stats some day, i.e. when we want to identify unreliable services.

It’s stored in the MySQL database, so that data is already included in the (not so huge) figures I reported above, and we’re good on this.