Bug #12574

isobuilders system_disks check keeps switching between OK and WARNING since the switch to Vagrant

Added by intrigeri 2017-05-22 06:43:55. Updated 2017-07-28 14:08:00.

Status: Resolved
Priority: Elevated
Assignee:
Category: Continuous Integration
Target version:
Start date: 2017-05-16
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for: 289

Description

This makes icinga2’s email and dashboard too noisy. Please check if these are false positives, and if they are then adjust checks; if they are not please fix the problem. Thanks!



History

#1 Updated by intrigeri 2017-05-22 06:44:14

  • Status changed from In Progress to Confirmed

#2 Updated by bertagaz 2017-05-22 12:43:34

I noticed that too lately, and have worked around it by removing old kernels, but that indeed does not seem to be enough. It looks like switching to Vagrant means more disk space used by the required packages, so I guess I’ll have to grow the root partitions on isobuilders, yay!

#3 Updated by intrigeri 2017-05-24 07:00:45

  • blocks Feature #12002: Estimate hardware cost of reproducible builds in Jenkins added

#4 Updated by intrigeri 2017-05-25 06:59:33

  • related to Bug #12595: Not enough space in /var/lib/jenkins on isobuilders added

#5 Updated by intrigeri 2017-06-01 06:40:41

  • related to deleted (Bug #12595: Not enough space in /var/lib/jenkins on isobuilders)

#6 Updated by intrigeri 2017-06-04 11:06:00

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

Looking at it again, I don’t think it’s worth growing these volumes: likely we’ll stick to Stretch for 2-5 years on these systems, and I don’t see why we would need substantially more space there until we upgrade them to Buster. So here’s my proposal:

  1. set APT::Periodic::AutocleanInterval and APT::Periodic::CleanInterval to 1 to save some space and tone down the noise this ticket is about (BTW, I wouldn’t mind if we did that globally on all our systems)
  2. postpone this ticket to 3.1
  3. wait a few weeks
  4. look at the monitoring checks history
  5. if the maximum space used on these rootfs seems reasonable, then adjust the monitoring checks just enough so that they would not have raised warnings during these few weeks, and close this ticket; otherwise, come back to the drawing board with some new data in hand
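Step 1 of the proposal could look roughly like the sketch below. The two option names and the value 1 come from the list above; `render_apt_conf` is a hypothetical helper, and how the rendered lines get deployed (file name under /etc/apt/apt.conf.d/, Puppet resource) is deliberately left open.

```python
def render_apt_conf(settings):
    """Render settings as apt.conf-style lines, one 'Key "value";' per entry."""
    return "".join(f'{key} "{value}";\n' for key, value in settings.items())


# Option names and values as proposed in the ticket.
periodic_settings = {
    "APT::Periodic::AutocleanInterval": "1",
    "APT::Periodic::CleanInterval": "1",
}

print(render_apt_conf(periodic_settings), end="")
```

With an interval of 1, APT's periodic job runs the autoclean and clean operations daily, which is what should shrink the package cache between upgrades.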

#7 Updated by bertagaz 2017-06-04 13:57:10

intrigeri wrote:
> Looking at it again, I don’t think it’s worth growing these volumes: likely we’ll stick to Stretch for 2-5 years on these systems, and I don’t see why we would need substantially more space there until we upgrade them to Buster. So here’s my proposal:
>
> # set APT::Periodic::AutocleanInterval and APT::Periodic::CleanInterval to 1 to save some space and tone down the noise this ticket is about (BTW, I wouldn’t mind if we did that globally on all our systems)
> # postpone this ticket to 3.1
> # wait a few weeks
> # look at the monitoring checks history
> # if the maximum space used on these rootfs seems reasonable, then adjust the monitoring checks just enough so that they would not have raised warnings during these few weeks, and close this ticket; otherwise, come back to the drawing board with some new data in hand

Ack. I was going to assume we already want to reserve space for the Buster upgrade, but I’m fine with your proposal. I’ll implement that next week.

#8 Updated by bertagaz 2017-06-06 11:09:51

  • Target version changed from Tails_3.0 to Tails_3.1

intrigeri wrote:
> # set APT::Periodic::AutocleanInterval and APT::Periodic::CleanInterval to 1 to save some space and tone down the noise this ticket is about (BTW, I wouldn’t mind if we did that globally on all our systems)

Done in puppet-tails:f4513e5

> # postpone this ticket to 3.1

Done now too.

> # wait a few weeks
> # look at the monitoring checks history
> # if the maximum space used on these rootfs seems reasonable, then adjust the monitoring checks just enough so that they would not have raised warnings during these few weeks, and close this ticket; otherwise, come back to the drawing board with some new data in hand

Regarding this: at the moment, after an APT clean, the rootfs is still 81% used, so it won’t fix the notification issue. OTOH, that’s almost 1G. We could reduce the disk check right now, which would still be big enough for the time being and speed up the resolution of this ticket.

#9 Updated by intrigeri 2017-06-07 17:18:40

>> # set APT::Periodic::AutocleanInterval and APT::Periodic::CleanInterval to 1 to save some space and tone down the noise this ticket is about (BTW, I wouldn’t mind if we did that globally on all our systems)

> Done in puppet-tails:f4513e5

Please lint it (double quotes without any variable in it ⇒ useless escaping).

> Regarding this: at the moment, after an APT clean, the rootfs is still 81% used, so it won’t fix the notification issue. OTOH, that’s almost 1G. We could reduce the disk check right now, which would still be big enough for the time being and speed up the resolution of this ticket.

Yes, clearly, please go ahead: it’s useless to suffer any longer if we already know we’ll have to change these checks later anyway.

#10 Updated by bertagaz 2017-06-08 09:37:16

intrigeri wrote:
> Please lint it (double quotes without any variable in it ⇒ useless escaping).

Nope, ‘\n’ in this string needs to be interpreted.

> Yes, clearly, please go ahead: it’s useless to suffer any longer if we already know we’ll have to change these checks later anyway.

Done. Lowered system_disk_warning to 17%.
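To see why 17% is enough for now, here is a minimal sketch of the threshold logic. It assumes the limits are minimum free-space percentages and that the check fires when free space drops strictly below a limit; `disk_state` is a hypothetical helper, not the actual icinga2 check. The 81% used figure and the 17% warning limit come from this thread; the 10% critical limit is the one quoted at the end of the ticket.

```python
def disk_state(free_pct, warn_pct, crit_pct):
    """Classify a volume from its free-space percentage.

    Assumption: a limit is crossed when free space is strictly below it.
    """
    if free_pct < crit_pct:
        return "CRITICAL"
    if free_pct < warn_pct:
        return "WARNING"
    return "OK"


free_pct = 100 - 81  # rootfs is 81% used after an APT clean
print(disk_state(free_pct, warn_pct=17, crit_pct=10))  # OK
```

The margin is only two percentage points, so a large package upgrade could still push the volume below the warning limit temporarily.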

#11 Updated by intrigeri 2017-06-08 10:46:54

>> Please lint it (double quotes without any variable in it ⇒ useless escaping).

> Nope, ‘\n’ in this string needs to be interpreted.

Indeed. BTW, using 09_autoclean as the filename for one thing that’s about clean and the other that’s about autoclean, which are two different operations, seems confusing. Please split it (and then you won’t have \n anymore ;)

#12 Updated by bertagaz 2017-06-08 11:23:02

intrigeri wrote:
> Indeed. BTW, using 09_autoclean as the filename for one thing that’s about clean and the other that’s about autoclean, which are two different operations, seems confusing. Please split it (and then you won’t have \n anymore ;)

Done.

#13 Updated by intrigeri 2017-06-28 09:11:09

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:
> # set APT::Periodic::AutocleanInterval and APT::Periodic::CleanInterval to 1 to save some space and tone down the noise this ticket is about (BTW, I wouldn’t mind if we did that globally on all our systems)
> # postpone this ticket to 3.1
> # wait a few weeks
> # look at the monitoring checks history
> # if the maximum space used on these rootfs seems reasonable, then adjust the monitoring checks just enough so that they would not have raised warnings during these few weeks, and close this ticket; otherwise, come back to the drawing board with some new data in hand

Weeks have passed and I’m now on sysadmin duty. Handling the last 2 steps myself won’t cost me more than handling the lack of a fix.

#14 Updated by intrigeri 2017-06-28 09:23:49

  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 10 to 50
  • QA Check set to Ready for QA

Done in b324697 in our manifests repo. AFAICT the new warning limit (15%) would not have made icinga raise any eyebrow in the last few weeks. Let’s see how it goes in the next few weeks before closing.
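The "would not have raised any eyebrow" reasoning can be sketched as follows. The sample values are hypothetical (the real check history is not reproduced in this ticket), `highest_quiet_threshold` is an illustrative helper, and the check is assumed to warn when free space drops strictly below the limit.

```python
import math


def highest_quiet_threshold(free_pct_samples, margin_pct=0.0):
    """Largest whole-percent warning limit that no sample would have crossed.

    Assumption: the check warns when free space is strictly below the limit.
    margin_pct leaves extra headroom below the lowest observed value.
    """
    if not free_pct_samples:
        raise ValueError("need at least one sample")
    return math.floor(min(free_pct_samples) - margin_pct)


history = [19.0, 17.8, 16.4, 18.9]  # hypothetical free-space percentages
print(highest_quiet_threshold(history))       # 16
print(highest_quiet_threshold(history, 1.0))  # 15
```

A limit chosen this way only reflects the observed period, which is why it makes sense to keep watching for a few more weeks before closing.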

#15 Updated by intrigeri 2017-07-04 09:58:21

  • blocked by deleted (Feature #12002: Estimate hardware cost of reproducible builds in Jenkins)

#16 Updated by intrigeri 2017-07-06 15:08:47

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

No issue since June 28, closing.

#17 Updated by intrigeri 2017-07-26 08:00:35

  • Status changed from Resolved to In Progress
  • Assignee set to intrigeri
  • % Done changed from 100 to 80
  • QA Check changed from Pass to Dev Needed

FTR this happened again: during the last APT upgrade of QEMU, the system partitions of all isobuilders were in WARNING state. It took 10 hours to recover (presumably until the APT cleanup cronjobs were run). I think we should set APT::Keep-Downloaded-Packages to “false” on all our systems: this would decrease the time needed to recover from a temporary disk space warning situation, and we use acng anyway so it makes little sense to store yet another copy of downloaded .deb’s after they’ve been installed. I’ll take care of that.
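For a rough idea of how much space kept packages take, one could sum the .deb files in APT's archive cache (normally /var/cache/apt/archives). This is only a sketch: `cached_deb_bytes` is a hypothetical helper, and the demo below runs on a throwaway directory with made-up file names and sizes rather than on a real isobuilder.

```python
import os
import tempfile


def cached_deb_bytes(archive_dir):
    """Sum the sizes of the .deb files directly inside archive_dir."""
    total = 0
    with os.scandir(archive_dir) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".deb"):
                total += entry.stat().st_size
    return total


# Demo with fake packages (hypothetical names and sizes).
with tempfile.TemporaryDirectory() as demo_dir:
    for name, size in [("qemu-fake_1.0_amd64.deb", 2048), ("lock", 0), ("other.deb", 1024)]:
        with open(os.path.join(demo_dir, name), "wb") as handle:
            handle.write(b"\0" * size)
    print(cached_deb_bytes(demo_dir))  # 3072
```

On a real system, calling `cached_deb_bytes("/var/cache/apt/archives")` right after an upgrade would show the space that stays occupied until the next cleanup when downloaded packages are kept.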

#18 Updated by intrigeri 2017-07-26 08:06:45

  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 80 to 90
  • QA Check changed from Dev Needed to Ready for QA

Done: puppet-tails commit 5cd30ebf7f1b19992a5e29e362e5d3b104fa83f3. Please review :)

#19 Updated by bertagaz 2017-07-28 12:07:36

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Ready for QA to Info Needed

intrigeri wrote:
> Done: puppet-tails commit 5cd30ebf7f1b19992a5e29e362e5d3b104fa83f3. Please review :)

Nice catch! Looks good. I’m wondering though: the notifications we received said the space free was 14% when it happened, when our check limit is at 15. Maybe we should lower it to 10 (and critical limit to something like 5%), as that’s still 500M free? I can take care of that and close this ticket if you agree.
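The figures in this comment can be cross-checked with a bit of arithmetic. The sketch below treats "500M" as exactly 500 MiB, which is an assumption (the comment gives a round number), and both helper names are hypothetical.

```python
def implied_total_mib(free_mib, free_pct):
    """Volume size implied by 'free_pct percent free equals free_mib MiB'."""
    return free_mib * 100 / free_pct


def free_mib_at(limit_pct, total_mib):
    """Free space, in MiB, left when a volume sits exactly at limit_pct free."""
    return total_mib * limit_pct / 100


total = implied_total_mib(500, 10)  # "lower it to 10 ... still 500M free"
print(total)                   # 5000.0
print(free_mib_at(15, total))  # 750.0
print(free_mib_at(5, total))   # 250.0
```

Under that assumption the root volume is about 5 GiB, the current 15% warning limit corresponds to roughly 750 MiB free, and a 5% critical limit would leave about 250 MiB.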

#20 Updated by intrigeri 2017-07-28 13:08:17

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

> Nice catch! Looks good. I’m wondering though: the notifications we received said the space free was 14% when it happened, when our check limit is at 15. Maybe we should lower it to 10 (and critical limit to something like 5%), as that’s still 500M free? I can take care of that and close this ticket if you agree.

Why not, assuming “that’s still 500M free” refers to critical (we need a warning earlier IMO).

#21 Updated by bertagaz 2017-07-28 14:08:00

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 90 to 100
  • QA Check changed from Dev Needed to Pass

intrigeri wrote:
> > Nice catch! Looks good. I’m wondering though: the notifications we received said the space free was 14% when it happened, when our check limit is at 15. Maybe we should lower it to 10 (and critical limit to something like 5%), as that’s still 500M free? I can take care of that and close this ticket if you agree.
>
> Why not, assuming “that’s still 500M free” refers to critical (we need a warning earlier IMO).

I was talking about the “warning” limit. The current critical limit is already 10%, so nothing more to do here I guess.