Bug #12595

Not enough space in /var/lib/jenkins on isobuilders

Added by intrigeri 2017-05-25 06:59:11. Updated 2017-10-07 12:06:04.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2017-05-25
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:
289

Description

This ticket is about two issues:

  • Short term: noise on our monitoring dashboard panel/notifications. Under normal conditions we use up to 20GB out of the 23GB available there, and our monitoring flags that level of usage as a problem.
  • Medium term: we lack disk space in /var/lib/jenkins on isobuilders. Once Feature #12576 is done we’ll be hosting multiple baseboxes in /var/lib/jenkins/.vagrant.d/, so we will need even more space there; this disk space issue is currently blocking such performance optimizations. See Bug #12531#note-23 and follow-ups, where the question of how many baseboxes we need to store there is being researched.

Subtasks


Related issues

Related to Tails - Feature #12002: Estimate hardware cost of reproducible builds in Jenkins Resolved 2016-11-28
Blocks Tails - Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all Resolved 2017-05-22
Blocked by Tails - Bug #13425: Upgrade lizard's storage (2017 edition) Resolved 2017-07-05

History

#1 Updated by intrigeri 2017-05-25 06:59:22

  • blocks Feature #12002: Estimate hardware cost of reproducible builds in Jenkins added

#2 Updated by intrigeri 2017-05-25 06:59:32

  • related to Bug #12574: isobuilders system_disks check keeps switching between OK and WARNING since the switch to Vagrant added

#3 Updated by bertagaz 2017-05-26 13:14:14

  • blocks Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added

#4 Updated by intrigeri 2017-05-26 13:23:17

> Blocks Feature Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added

I don’t understand why, but perhaps it’s not important.

Let’s just fix it and I’ll shut up :)

#5 Updated by bertagaz 2017-05-26 13:48:51

intrigeri wrote:
> > Blocks Feature Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added
>
> I don’t understand why, but perhaps it’s not important.

Because at the moment we don’t have enough space to build a new basebox AND host several baseboxes in /var/lib/jenkins/.vagrant.d/, which will probably happen if we switch to basebox:clean_old.

#6 Updated by intrigeri 2017-05-26 14:11:46

> Because at the moment we don’t have enough space to build a new basebox AND host several baseboxes in /var/lib/jenkins/.vagrant.d/, which will probably happen if we switch to basebox:clean_old.

Wow, interesting! If that means we’re going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it’s a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).

Meta: I have almost no clue how the whole thing works, so perhaps there’s a very good reason to do this; if so, forget what I said, sorry.

#7 Updated by intrigeri 2017-05-27 09:00:08

  • blocked by deleted (Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all)

#8 Updated by intrigeri 2017-05-27 09:00:53

  • blocks Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added

#9 Updated by bertagaz 2017-05-27 14:08:48

intrigeri wrote:
> Wow, interesting! If that means we’re going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it’s a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).

As I get it, we delete the baseboxes in /var/lib/libvirt/images through the use of the forcecleanup option. I think that if we end up with several of them in this directory, it’s because a build failure prevented rake from executing the related clean_up_builder_vms function. I’ll track the builds over the next few days to see if/why this situation happens.
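
For illustration, the failure mode described above comes from the cleanup only running after a successful build; a minimal Rakefile-style sketch of a pattern that would avoid it (the :build task body and the run_build helper are hypothetical, not the actual Tails Rakefile code):

# Hypothetical sketch, not the actual Tails Rakefile: run the cleanup
# whether the build succeeds or fails.
task :build do
  begin
    run_build  # hypothetical helper standing in for the actual build steps
  ensure
    # Runs on success and on failure alike, so a failed build cannot leave
    # the builder VM and its volumes behind (in the real setup this would
    # still be gated on the forcecleanup option mentioned above).
    clean_up_builder_vms
  end
end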

#10 Updated by bertagaz 2017-05-27 14:44:26

  • Assignee changed from bertagaz to anonym
  • QA Check set to Info Needed

bertagaz wrote:
> intrigeri wrote:
> > Wow, interesting! If that means we’re going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it’s a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).
>
> As I get it, we delete the baseboxes in /var/lib/libvirt/images through the use of the forcecleanup option. I think that if we end up with several of them in this directory, it’s because a build failure prevented rake from executing the related clean_up_builder_vms function. I’ll track the builds over the next few days to see if/why this situation happens.

Hmmm, assigning to anonym, because I just checked and there’s still at least one volume left in the libvirt default storage pool after a build has finished (e.g. tails-builder-amd64-jessie-20170524-8cc1ccbade_vagrant_box_image_0.img). And indeed I don’t see code in the Rakefile that takes care of removing such volumes, only the ones that have the same name as the Vagrant VM. Is that expected? I guess we could simply remove it too, if we store it in ~/.vagrant.d/?

#11 Updated by intrigeri 2017-05-28 09:01:09

  • Subject changed from isobuilders jenkins-data-disk check keeps switching between OK and WARNING since the switch to Vagrant to isobuilders jenkins-data-disk check keeps switching between OK, WARNING and CRITICAL since the switch to Vagrant

#12 Updated by intrigeri 2017-05-28 10:34:00

I’m concerned we’ll quickly become confused (and it’ll be hard to maintain Redmine ticket semantics) if we discuss the cleanup process of /var/lib/libvirt/images/ on a ticket that’s about something else entirely, i.e. /var/lib/jenkins, so please, anonym:

  • address the “are we really going to store essentially the same data twice on each isobuilder?” question here;
  • answer questions that are specific to the libvirt storage pool GC process on Bug #12599, rather than here.

Thanks :)

#13 Updated by anonym 2017-05-28 11:17:56

bertagaz wrote:
> bertagaz wrote:
> > intrigeri wrote:
> > > Wow, interesting! If that means we’re going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it’s a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).
> >
> > As I get it, we delete the baseboxes in /var/lib/libvirt/images through the use of the forcecleanup option. I think that if we end up with several of them in this directory, it’s because a build failure prevented rake from executing the related clean_up_builder_vms function. I’ll track the builds over the next few days to see if/why this situation happens.
>
> Hmmm, assigning to anonym, because I just checked and there’s still at least one volume left in the libvirt default storage pool after a build has finished (e.g. tails-builder-amd64-jessie-20170524-8cc1ccbade_vagrant_box_image_0.img).

Ah, that is right; the file

~/.vagrant.d/boxes/${BOX_NAME}/0/libvirt/box.img


is copied to

/var/lib/libvirt/images/${BOX_NAME}_vagrant_box_image_0.img


by Vagrant whenever it sets up a domain using that base box (unless it already exists). This is redundant.

> And indeed I don’t see code in the Rakefile that takes care of removing such volumes, only the ones that have the same name as the Vagrant VM.

There is code for removing such volumes in clean_up_basebox(), so we only do it when removing base boxes.

> Is that expected?

Yes. I had underestimated how much of a problem using more disk space was, so I didn’t think this would matter.

> I guess we could simply remove it too, if we store it in ~/.vagrant.d/?

Yup, Vagrant will make the copy if needed. So I guess we can do this cleanup by default, since a ~800 MiB disk copy shouldn’t increase the build time too much for non-Jenkins users. Just to be sure we’re in sync, I implemented the change I believe to be safe (but I haven’t tested it!) on the feature/12599 branch (commit:69faba0c1d5e7b517616103f8e1c14528bdb55e8). What do you think?
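
In other words, the cleanup being discussed boils down to deleting the per-box copy that Vagrant leaves in the default libvirt storage pool, since the box itself remains under ~/.vagrant.d/ and Vagrant will re-create the copy on the next build. A rough sketch of such a helper, shelling out to virsh (hypothetical code, not the actual commit on feature/12599):

# Hypothetical helper, not the feature/12599 change itself: delete the
# volume Vagrant copied into the default libvirt pool for a given base box.
# The box under ~/.vagrant.d/ is untouched, so Vagrant can re-create the
# copy on the next build if needed.
def remove_box_volume_from_default_pool(box_name)
  volume = "#{box_name}_vagrant_box_image_0.img"
  # Ignore a non-zero exit status: the volume may already be gone.
  system('virsh', '--connect', 'qemu:///system',
         'vol-delete', '--pool', 'default', volume)
end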

#14 Updated by intrigeri 2017-05-28 14:44:31

> Yes. I had underestimated how much of a problem using more disk space was, so I didn’t think this would matter.

Meta: I’m constantly advocating against spending substantial engineering time on issues that can trivially be solved with a bit more hardware. This nagging on my part probably contributed to creating the situation we’re trying to fix here. Now, there are caveats I want to clarify to try and fix the confusion I may have created:

  • Alas, our hardware doesn’t grow magically on-demand. So ideally, new requirements need to be roughly evaluated & communicated so that whatever hardware purchase & installation a change requires can happen before we deploy stuff… and break bits of our infra (Feature #12002). But granted, often we go through several iterations and design changes, and it’s simply impossible to accurately evaluate the final requirements in advance. The only realistic solution I can think of, to avoid this chicken’n’egg problem as long as we’re managing the bare metal our stuff runs on, is to do infra development in a setup that looks very much like the production one, but is not the production one.
  • Sometimes only a little bit of software engineering effort is enough to avoid raising hardware requirements, and it’s worth spending this time instead of upgrading hardware (which has a cost in terms of sysadmin work). Seeing your fix for Bug #12599, I guess that ticket falls into this category :)

#15 Updated by anonym 2017-05-29 15:04:13

  • Assignee changed from anonym to bertagaz
  • QA Check deleted (Info Needed)

[Reassigning back to bert now that the question he had for me is answered.]

#16 Updated by intrigeri 2017-06-01 06:32:54

  • Subject changed from isobuilders jenkins-data-disk check keeps switching between OK, WARNING and CRITICAL since the switch to Vagrant to Not enough space in /var/lib/jenkins on isobuilders
  • Description updated

Clarified the scope of this ticket so that it doesn’t merely track the short-term monitoring issue.

#17 Updated by intrigeri 2017-06-01 06:39:56

  • blocked by deleted (Feature #12002: Estimate hardware cost of reproducible builds in Jenkins)

#18 Updated by intrigeri 2017-06-01 06:40:41

  • related to deleted (Bug #12574: isobuilders system_disks check keeps switching between OK and WARNING since the switch to Vagrant)

#19 Updated by intrigeri 2017-06-01 06:41:58

  • related to Feature #12002: Estimate hardware cost of reproducible builds in Jenkins added

#20 Updated by intrigeri 2017-06-08 17:47:43

  • Target version changed from Tails_3.0 to Tails_3.1

I’ll build the 3.0 ISO in two days, so let’s not make potentially disruptive changes to our infra at this point.

#21 Updated by bertagaz 2017-07-05 17:52:21

  • Status changed from Confirmed to In Progress
  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

Now that 3.0 is out and Feature #12002 is over, I propose we add 7G to each isobuilder’s /var/lib/jenkins. That way they would go up to 30G, which should handle a bunch of baseboxes, probably enough for the time being. We’ll still have 100G left to allocate wherever needed while we wait for Feature #11806. If we do that we should be able to tackle Feature #12576.
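
For the record, growing such a data volume typically amounts to extending the underlying logical volume and then the filesystem; a hedged sketch of that operation, assuming an LVM-backed ext4 volume (the device path below is made up, not the actual lizard layout):

# Hypothetical sketch: grow an LVM-backed ext4 jenkins-data volume by 7G.
# The device path is illustrative only.
def grow_jenkins_data(lv, extra: '7G')
  system('lvextend', '-L', "+#{extra}", lv) or raise "lvextend failed for #{lv}"
  # ext4 supports online growth, so the filesystem can stay mounted.
  system('resize2fs', lv) or raise "resize2fs failed for #{lv}"
end

grow_jenkins_data('/dev/vg0/isobuilder1-jenkins-data')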

#22 Updated by intrigeri 2017-07-05 19:56:13

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

> Now that 3.0 is out and Feature #12002 is over, I propose we add 7G to each isobuilder’s /var/lib/jenkins. That way they would go up to 30G, which should handle a bunch of baseboxes, probably enough for the time being. We’ll still have 100G left to allocate wherever needed while we wait for Feature #11806.

We need these 100G for other matters (ever-growing data for services that existed before vagrant-libvirt), we’ve been struggling with disk space in a painful way for a couple of months “thanks” to the timing of the vagrant-libvirt deployment vs. hardware planning, and I’ve grown tired of this situation already, so: no, sorry; I really don’t want to make things even worse. Let’s deal with the storage upgrade that the vagrant-libvirt stuff requires before allocating even more space to it.

#23 Updated by bertagaz 2017-07-06 11:23:20

  • blocked by Bug #13425: Upgrade lizard's storage (2017 edition) added

#24 Updated by bertagaz 2017-07-06 11:34:25

intrigeri wrote:
> We need these 100G for other matters (ever-growing data for services that existed before vagrant-libvirt), we’ve been struggling with disk space in a painful way for a couple of months “thanks” to the timing of the vagrant-libvirt deployment vs. hardware planning, and I’ve grown tired of this situation already, so: no, sorry; I really don’t want to make things even worse. Let’s deal with the storage upgrade that the vagrant-libvirt stuff requires before allocating even more space to it.

Having a look at the spreadsheet, I’m not sure I see which services will require that much space. I thought taking 30G and leaving 100G free was affordable while we purchase more HDDs. But I get that you’re upset. Let’s wait, then.

#25 Updated by intrigeri 2017-07-06 14:52:19

  • Target version changed from Tails_3.1 to Tails_3.2

(Please focus on making builds robust again first, and postpone the performance improvements. Sorry we could not discuss this at the CI team meeting today.)

#26 Updated by intrigeri 2017-09-07 12:46:05

  • Target version changed from Tails_3.2 to Tails_3.3

#27 Updated by bertagaz 2017-10-07 12:06:04

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 0 to 100
  • QA Check deleted (Dev Needed)

I’ve grown the jenkins-data partitions to the extent we defined in the Feature #12002 blueprint, so we should be OK on this front for a while now.