Feature #11358

Set relevant check_interval and retry_interval for hosts and services

Added by bertagaz 2016-04-20 03:02:04. Updated 2016-04-27 03:22:26.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
2016-04-20
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:
268

Description

So far we’ve used import generic-service and import generic-host, which set the same check_interval and retry_interval for every service and host. These values are a bit too short and not always relevant. We should set them to something that makes more sense.
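
For context, if the templates in use match the stock Icinga2 example configuration, the situation described above looks roughly like this; the actual templates are managed in puppet-tails, so treat this as an illustrative sketch only:

  template Host "generic-host" {
    // every host importing this template inherits the same intervals
    check_interval = 1m
    retry_interval = 30s
    max_check_attempts = 3
  }

  template Service "generic-service" {
    // same story for every service importing this template
    check_interval = 1m
    retry_interval = 30s
  }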


Subtasks


History

#1 Updated by bertagaz 2016-04-20 03:35:31

  • % Done changed from 0 to 20

Set the check_interval to 5 minutes and the retry_interval to 2 minutes for every host in commit puppet-tails:44a5a30. As max_check_attempts is set to 3 by default, we should be notified after 6 minutes if a host is down (max_check_attempts * retry_interval). Previously the settings were 1 minute and 30 seconds respectively, which sounded a bit short and aggressive to me.
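
Sketched as plain Icinga2 configuration (the host name and address are placeholders, not the actual puppet-tails definitions):

  object Host "example-host" {
    import "generic-host"
    address = "192.0.2.10"
    check_command = "hostalive"
    check_interval = 5m   // was 1m
    retry_interval = 2m   // was 30s
    // max_check_attempts stays at the default of 3, hence the ~6 minutes
    // (max_check_attempts * retry_interval) mentioned above
  }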

#2 Updated by bertagaz 2016-04-20 05:00:21

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 20 to 70
  • QA Check set to Ready for QA

I’ve set up the check_interval (c below) and retry_interval (r below) for the various services:

  • disks: c=12h and r=5m
  • apt: c=6h and r=5m
  • http: c=15m and r=5m
  • memory: c=10m and r=2m
  • torbrowser_archive: c=10m and r=2m
  • rsync: c=10m and r=2m
  • ssh and sftp accounts: c=10m and r=2m
  • whisperback: c=10m and r=2m
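
Concretely, those values map to service definitions along these lines; this is a sketch in plain Icinga2 syntax, and the check commands and assign rules are placeholders (the real ones live in puppet-tails):

  apply Service "http" {
    import "generic-service"
    check_command = "http"
    check_interval = 15m
    retry_interval = 5m
    assign where host.vars.http_vhost   // placeholder assign rule
  }

  apply Service "disks" {
    import "generic-service"
    check_command = "disk"
    check_interval = 12h
    retry_interval = 5m
    assign where host.address           // placeholder assign rule
  }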

What do you think about that?

Also, I think it will probably help the HTTP checks be a bit more stable, given they will be checked less often.

#3 Updated by bertagaz 2016-04-20 06:44:50

  • blocks Feature #9484: Deploy the monitoring setup to production added

#4 Updated by bertagaz 2016-04-20 06:55:55

bertagaz wrote:
> Also, I think it will probably help the HTTP checks be a bit more stable, given they will be checked less often.

This means we’ll have to check how the HTTP checks behave with these changes (see Feature #8650#note-25).

#5 Updated by intrigeri 2016-04-25 04:25:26

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> I’ve set up the check_interval (c below) and retry_interval (r below) for the various services:

> * disks: c=12h and r=5m
> * apt: c=6h and r=5m

Sounds good.

> * http: c=15m and r=5m

So, we’ll learn after 15 (best case) to 30 (worst case) minutes if one of our HTTP services is down. A lot of our stuff (e.g. CI, image building, additional software packages feature) depends on our various HTTP services to be up, so this feels too relaxed to me. I would say c=5m and r=100s so that the notification is triggered between 5 and 10 minutes after the outage starts.

> * memory: c=10m and r=2m

I see value in checking this more often, as problematic memory usage peaks can be very short lived. Just set it back to the (somewhat crazy) defaults?

> * torbrowser_archive: c=10m and r=2m
> * rsync: c=10m and r=2m
> * ssh and sftp accounts: c=10m and r=2m
> * whisperback: c=10m and r=2m

OK.

#6 Updated by bertagaz 2016-04-26 04:59:34

  • Target version changed from Tails_2.3 to Tails_2.4

#7 Updated by bertagaz 2016-04-26 05:10:45

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:
>
> > * http: c=15m and r=5m
>
> So, we’ll learn after 15 (best case) to 30 (worst case) minutes if one of our HTTP services is down. A lot of our stuff (e.g. CI, image building, additional software packages feature) depends on our various HTTP services to be up, so this feels too relaxed to me. I would say c=5m and r=100s so that the notification is triggered between 5 and 10 minutes after the outage starts.

Ok, I’ve implemented that in commit puppet-tails:625fd30. Let’s see how it behaves.
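
The change boils down to something like this for the http service (a sketch, not the actual commit; max_check_attempts = 3 is assumed here, which is what makes the 5-to-10-minute estimate above work out):

  check_interval = 5m     // was 15m
  retry_interval = 100s   // was 5m
  // best case: outage caught by the next scheduled check, notification
  //   after ~3 * 100s = 5 minutes
  // worst case: outage starts right after a check, notification after
  //   ~5m + 3 * 100s = 10 minutes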

> > * memory: c=10m and r=2m
>
> I see value in checking this more often, as problematic memory usage peaks can be very short lived. Just set it back to the (somewhat crazy) defaults?

True, let’s try that: commit puppet-tails:6a66282.
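
That is, the explicit overrides for the memory check go away and it falls back to whatever the generic-service template sets (1m / 30s in the stock Icinga2 templates). As a sketch, with placeholder names:

  apply Service "memory" {
    import "generic-service"
    // no explicit check_interval / retry_interval: template defaults apply
    check_command = "check_memory"   // placeholder command name
    assign where host.address        // placeholder assign rule
  }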

#8 Updated by intrigeri 2016-04-26 06:04:43

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • Target version changed from Tails_2.4 to Tails_2.3
  • % Done changed from 70 to 0
  • QA Check changed from Ready for QA to Pass

OK, great. Let’s handle as subtasks of Feature #8652 any issue we identify once we start really using the thing.

#9 Updated by intrigeri 2016-04-26 06:05:00

  • blocked by deleted (Feature #9484: Deploy the monitoring setup to production)

#10 Updated by intrigeri 2016-04-26 06:05:45

#11 Updated by bertagaz 2016-04-27 03:22:26

  • % Done changed from 0 to 100