Bug #12531

ISO builds on Jenkins are fragile since the migration to vagrant-libvirt

Added by intrigeri 2017-05-10 15:36:00. Updated 2017-08-03 17:59:27.

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2017-05-22
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:
289

Description

E.g. https://jenkins.tails.boum.org/view/RM/job/build_Tails_ISO_devel/lastFailedBuild/console and https://jenkins.tails.boum.org/view/RM/job/build_Tails_ISO_feature-stretch/lastFailedBuild/console.

It would be really nice if this was fixed by the end of the week, so we can rely on Jenkins during the Stretch sprint that starts on Monday (May 15th), hence High priority. In other words: please fix such issues that break our current development workflow before working on bonus improvements :)


Subtasks

Bug #12577: Aborted Jenkins ISO builds can cause next ones on the same isobuilder to fail (Resolved, 100%)

Bug #12578: Abort if Vagrant create_box fails (Resolved, 100%)

Bug #12599: /var/lib/libvirt/images gets filled on isobuilders (Resolved, 100%)

Bug #12618: Retrieving ISO build artifacts sometimes fails on Jenkins (Resolved, 100%)

Bug #13302: /var/lib/libvirt/images sometimes gets filled on isobuilders, take 2 (Resolved, 100%)


History

#1 Updated by anonym 2017-05-10 17:43:37

tl;dr: we run out of disk space, seemingly at the very late stage where we are generating the final .box file (which is < 300 MiB), so only a little space is missing. But some margin would be appreciated!

Also, note that this error is ignored, and the build process proceeds with trying to import this corrupt .box, since create_box.sh doesn’t check exit statuses or use set -e. While we’re at it, we could force set -e ourselves with:

--- a/vagrant/definitions/tails-builder/generate-tails-builder-box.sh
+++ b/vagrant/definitions/tails-builder/generate-tails-builder-box.sh
@@ -72,7 +72,7 @@ sudo ${http_proxy:+http_proxy="$http_proxy"} \
      --customize "$(pwd)/customize.sh" \
      --verbose

-/usr/share/doc/vagrant-libvirt/examples/create_box.sh \
+bash -e /usr/share/doc/vagrant-libvirt/examples/create_box.sh \
     "${TARGET_IMG}" "${TARGET_BOX}"

 rm -rf "${TARGET_IMG}"* "${DEBOOTSTRAP_GNUPG_HOMEDIR}"


(If you are wondering: yes, it is a bash script, using at least one bashism.)
Does that make sense? I admit that I didn’t test it, and perhaps that script expects some commands to fail elsewhere, in which case set -e won’t work.
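To illustrate why forcing errexit matters here, a minimal sketch with a throwaway stand-in script (not the real create_box.sh):

```shell
# Demonstration: without -e, only the last command's exit status counts,
# so an intermediate failure (e.g. tar dying on a full disk) is swallowed.
workdir=$(mktemp -d)
cat > "$workdir/create_box_demo.sh" <<'EOF'
false            # simulate a command failing mid-script
echo "imported"  # the script happily carries on
EOF

status_plain=0
bash "$workdir/create_box_demo.sh" || status_plain=$?      # prints "imported", exits 0

status_errexit=0
bash -e "$workdir/create_box_demo.sh" || status_errexit=$? # aborts at `false`

echo "plain=$status_plain errexit=$status_errexit"         # plain=0 errexit=1
rm -rf "$workdir"
```

This is exactly the failure mode described above: the corrupt .box gets imported because the script as a whole still exits 0.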

#2 Updated by bertagaz 2017-05-11 09:21:25

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • Feature Branch set to bugfix/12531-make-create-box-more-robust

anonym wrote:
> tl;dr: we run out of disk space, seemingly at the very late stage where we are generating the final .box file (which is < 300 MiB), so only a little space is missing. But some margin would be appreciated!

Ok, I found out why: in the jenkins user home directory, the .vagrant.d/ directory contains the box file. Not sure if it’s intended. I didn’t take this into account when I grew the partition to 20G to hold the basebox creation, hence the disk space filling up. I guess we should remove this directory at each build (and maybe grow the partition a bit).
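Since the root cause is running out of disk space, a fail-fast guard before the build would at least make the failure mode explicit. A minimal sketch (hypothetical, not part of the existing Jenkins jobs; build_dir and the 512 MiB margin are illustrative):

```shell
# Hypothetical pre-build guard: abort early, with a clear message, when the
# partition holding the basebox build is low on space, instead of letting
# the .box generation fail halfway through.
required_mib=512                     # illustrative margin; tune to basebox size
build_dir=${BUILD_DIR:-.}            # assumed location of the build tree
avail_mib=$(df --output=avail --block-size=1M "$build_dir" | tail -n 1 | tr -d ' ')
if [ "$avail_mib" -lt "$required_mib" ]; then
    echo "error: only ${avail_mib} MiB free in ${build_dir}," \
         "need at least ${required_mib} MiB" >&2
    exit 1
fi
echo "disk space check passed (${avail_mib} MiB free)"
```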

> Also, note that this error is ignored, and the build process proceeds with trying to import this corrupt .box, since create_box.sh doesn’t check exit statuses or use set -e. While we’re at it, we could force set -e ourselves with:
> […]
> (If you are wondering: yes, it is a bash script, using at least one bashism.)
> Does that make sense? I admit that I didn’t test it, and perhaps that script expects some commands to fail elsewhere, in which case set -e won’t work.

I pushed a branch with this change, to see if it breaks something.

#3 Updated by bertagaz 2017-05-12 10:59:51

  • % Done changed from 10 to 20

bertagaz wrote:
> Ok, I found out why: in the jenkins user home directory, the .vagrant.d/ directory contains the box file. Not sure if it’s intended. I didn’t take this into account when I grew the partition to 20G to hold the basebox creation, hence the disk space filling up. I guess we should remove this directory at each build (and maybe grow the partition a bit).

I’ve pushed a commit that removes this .vagrant.d/ directory before each build, after the call to rake basebox:clean_all. We’ll see if it works around this issue; hopefully it will.

> I pushed a branch with this change, to see if it breaks something.

Doesn’t seem so yet; let’s just run it a few more times.

#4 Updated by anonym 2017-05-14 10:51:52

bertagaz wrote:
> bertagaz wrote:
> > Ok, I found out why: in the jenkins user home directory, the .vagrant.d/ directory contains the box file. Not sure if it’s intended.

It is intended — this is how the base box is kept between builds, so it doesn’t have to be rebuilt each time.

> > I didn’t take this into account when I grew the partition to 20G to hold the basebox creation, hence the disk space filling up. I guess we should remove this directory at each build (and maybe grow the partition a bit).
>
> I’ve pushed a commit that removes this .vagrant.d/ directory before each build, after the call to rake basebox:clean_all. We’ll see if it works around this issue; hopefully it will.

Ok, if you do both of these things (which both seem wrong to me) I guess we are not on the same page on how this was gonna work on Jenkins’ side. If you search for “vagrant.d” on Feature #12409 you’ll find this exchange between you and me:

bertagaz wrote (Feature #12409#note-6):
> anonym wrote (Feature #12409#note-5):
> > Some ideas to move in this direction:
> >
> > # It would be very easy to implement a Rake task for cleaning up old baseboxes, say those older than six months. isobuilders could then run this before/after each build, or with daily cron job, or similar. Of course, ~/.vagrant.d would have to persist between builds with this approach.
> > […]
> >
> > What do you think?
>
> I like your first idea, which solves my concerns in an easy way (and could avoid implementing this proposal in Jenkins with several iterations).

So, it seems you agreed that Jenkins should:

  • keep ~/.vagrant.d between builds
  • run rake basebox:clean_old (not rake basebox:clean_all)

Have I missed something?
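For reference, the intended semantics of rake basebox:clean_old can be sketched in shell (the real implementation is a Rake task in the Tails source tree; this sketch only illustrates the behavior, against a throwaway directory, using the ~6 months threshold mentioned in Feature #12409):

```shell
# Shell sketch of clean_old semantics (the real task lives in the Rakefile):
# delete baseboxes older than ~6 months, keep newer ones and ~/.vagrant.d
# itself. Demonstrated against a throwaway directory so it is safe to run.
boxes=$(mktemp -d)                   # stand-in for ~/.vagrant.d/boxes
mkdir -p "$boxes/tails-builder-old" "$boxes/tails-builder-new"
touch -d '200 days ago' "$boxes/tails-builder-old"

# clean_old: remove top-level box directories not modified in ~6 months
find "$boxes" -mindepth 1 -maxdepth 1 -type d -mtime +180 -exec rm -rf {} +

ls "$boxes"                          # only tails-builder-new remains
```

clean_all, by contrast, would wipe every entry regardless of age, defeating the caching.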

> > I pushed a branch with this change, to see if it breaks something.
>
> Doesn’t seem so yet; let’s just run it a few more times.

Cool!

#5 Updated by bertagaz 2017-05-15 11:11:20

  • Assignee changed from bertagaz to anonym
  • QA Check set to Info Needed

anonym wrote:
> * keep ~/.vagrant.d between builds
> * run rake basebox:clean_old (not rake basebox:clean_all)
>
> Have I missed something?

Doesn’t seem so. I just missed that we actually wanted to keep this ~/.vagrant.d/ directory.

OTOH it just makes me wonder: in commit:9b43319eb0f718dc0bbd85e069703ce5e82c8e4c we implemented that a basebox used to build one branch cannot be reused to build another. This seems conflicting to me, or at least potentially problematic. Maybe we should rethink this a bit?

#6 Updated by anonym 2017-05-15 11:33:46

  • Assignee changed from anonym to bertagaz

bertagaz wrote:
> anonym wrote:
> > * keep ~/.vagrant.d between builds
> > * run rake basebox:clean_old (not rake basebox:clean_all)
> >
> > Have I missed something?
>
> Doesn’t seem so. I just missed that we actually wanted to keep this ~/.vagrant.d/ directory.

Got it! I was fearing there had been a deeper misunderstanding here and I’m glad I was wrong. Jenkins seems to be back in business, yay!

> OTOH it just makes me wonder: in commit:9b43319eb0f718dc0bbd85e069703ce5e82c8e4c we implemented that a basebox used to build one branch cannot be reused to build another.

Note that this is not about which branch a base box was built from, but which branch was checked out when the build was started.

> This seems conflicting to me, or at least may be problematic. Maybe we should rethink this a bit?

With the above clarification, is the conflict/problematic aspect still there?

#7 Updated by bertagaz 2017-05-15 12:18:51

  • Assignee changed from bertagaz to anonym
  • QA Check changed from Info Needed to Dev Needed

anonym wrote:
> Jenkins seems to be back in business, yay!

That’s because we’re still removing ~/.vagrant.d/ at the moment. I’ll grow the partition so that we can remove this wrong trick.

> Note that this is not about which branch a base box was built from, but which branch was checked out when the build was started.

Right! I’m getting more and more confused with these long-lasting changes. We need to end this! :)

> With the above clarification, is the conflict/problematic aspect still there?

I don’t think so.

#8 Updated by bertagaz 2017-05-15 12:19:17

  • Assignee changed from anonym to bertagaz

Oops. :)

#9 Updated by intrigeri 2017-05-16 07:05:11

>> Note that this is not about which branch a base box was built from, but which branch was checked out when the build was started.

> Right! I’m getting more and more confused with these long-lasting changes. We need to end this! :)

Time to update the design doc for the whole thing so it reflects our reasoning and decisions?

#10 Updated by intrigeri 2017-05-16 08:59:56

  • Type of work changed from Research to Sysadmin

#11 Updated by bertagaz 2017-05-16 14:04:35

bertagaz wrote:
> I’ll grow the partition so that we can remove this wrong trick.

Done growing the partitions. I’ve removed the deletion of ~/.vagrant.d/ and switched the jobs to use basebox:clean_old. Let’s see how it goes.

#12 Updated by intrigeri 2017-05-16 15:51:49

> switched the jobs to use basebox:clean_old.

Reverted as it breaks the ISO build. Note that we’re in a Stretch sprint and now is not a good time to deploy untested stuff.

#13 Updated by intrigeri 2017-05-16 15:53:41

intrigeri wrote:
> > switched the jobs to use basebox:clean_old.
>
> Reverted as it breaks the ISO build. Note that we’re in a Stretch sprint and now is not a good time to deploy untested stuff.

To be clear, I was referring to https://jenkins.tails.boum.org/job/build_Tails_ISO_bugfix-12364-greeter-races/1/console. No idea if it breaks all ISO builds or only in some cases, but it broke precisely the build I cared about, hence the quick revert.

#14 Updated by bertagaz 2017-05-16 16:17:25

  • Assignee changed from bertagaz to anonym

intrigeri wrote:
> > Reverted as it breaks the ISO build. Note that we’re in a Stretch sprint and now is not a good time to deploy untested stuff.

Ah, sorry, didn’t know/remember that.

> To be clear, I was referring to https://jenkins.tails.boum.org/job/build_Tails_ISO_bugfix-12364-greeter-races/1/console. No idea if it breaks all ISO builds or only in some cases, but it broke precisely the build I cared about, hence the quick revert.

I think it does break all of them; I’ve seen this failure on other builds.

So it seems the basebox:clean_old task is not working. Assigning to anonym for him to fix his code. Also, we probably need to adjust it to remove baseboxes that do not contain the right commit in their name, rather than the ones that are 6 months old. What do you think?

#15 Updated by intrigeri 2017-05-22 06:46:53

  • Target version changed from Tails_3.0~rc1 to Tails_3.0

#16 Updated by intrigeri 2017-05-22 07:43:36

bertagaz wrote:
> So it seems the basebox:clean_old task is not working. Assigning to anonym for him to fix his code.

Sounds right.

> Also we probably need to adjust it to remove baseboxes that do not contain the right commit in their name, rather than the ones that are 6 months old.

What’s “the right commit” you’re talking about?

#17 Updated by intrigeri 2017-05-22 07:46:59

I’ve noticed some other causes of fragility so I’m gonna move the basebox:clean_old one to a subtask and create other subtasks.

#18 Updated by intrigeri 2017-05-22 07:57:07

  • Assignee changed from anonym to bertagaz
  • Feature Branch deleted (bugfix/12531-make-create-box-more-robust)

(Most subtasks are on bertagaz’ plate.)

#19 Updated by bertagaz 2017-05-24 10:08:49

intrigeri wrote:
> bertagaz wrote:
> > Also we probably need to adjust it to remove baseboxes that do not contain the right commit in their name, rather than the ones that are 6 months old.
>
> What’s “the right commit” you’re talking about?

I’m talking about the one contained in the basebox name, which is the last commit made in the vagrant/ directory. With the new design, if we clean only the baseboxes that are 6 months old, but there are newer commits in this directory meanwhile, we’ll host useless baseboxes that were built with an older commit made in the vagrant directory.
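The proposal could look roughly like this (entirely hypothetical: the basebox naming scheme and the way the relevant commit is computed are assumptions, and the sketch runs against a throwaway directory):

```shell
# Hypothetical: keep only baseboxes whose name embeds the short commit of the
# last change to vagrant/, delete the rest. Run against a throwaway directory.
current=deadbee    # stand-in for e.g. $(git log -1 --format=%h -- vagrant/)
boxes=$(mktemp -d)
mkdir -p "$boxes/tails-builder-amd64-jessie-${current}" \
         "$boxes/tails-builder-amd64-jessie-0ld1dea"

for box in "$boxes"/*; do
    case "$(basename "$box")" in
        *"$current"*) ;;             # built from the current commit: keep
        *) rm -rf "$box" ;;          # stale commit in the name: delete
    esac
done

ls "$boxes"                          # only the ...-deadbee box remains
```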

#20 Updated by intrigeri 2017-05-24 11:31:02

>> > Also we probably need to adjust it to remove baseboxes that do not contain the right commit in their name, rather than the ones that are 6 months old.
>>
>> What’s “the right commit” you’re talking about?

> I’m talking about the one contained in the basebox name, which is the last commit made in the vagrant/ directory.

OK, thanks for clarifying.

> With the new design, if we clean only the baseboxes that are 6 months old, but there are newer commits in this directory meanwhile, we’ll host useless baseboxes that were built with an older commit made in the vagrant directory.

Some might be useless, but some might still be used by other branches than the one we’re currently building. If I got your proposal right, any build (on a given isobuilder) from a branch that updates some part of the build system will delete the basebox used by all other branches, which will kinda make the clean_old + caching idea and code useless in many cases, and we won’t get the performance “improvement” (rather: regression fix) that we’re seeking. So I’d rather stick to the clean_old design and tweak the N months number to make it work well enough without storing too much old stuff (your unvoiced concern seems to be about resource needs, i.e. Feature #12002, right?). E.g. 4 months should be enough to ensure we keep the last stable release’s basebox around, but delete older stuff. What do you think?

If I got your idea wrong, please state what problem it’s meant to fix, so it’s easier for me to reason about it :)

#21 Updated by bertagaz 2017-05-24 12:52:14

intrigeri wrote:
> Some might be useless, but some might still be used by other branches than the one we’re currently building. If I got your proposal right, any build (on a given isobuilder) from a branch that updates some part of the build system will delete the basebox used by all other branches, which will kinda make the clean_old + caching idea and code useless in many cases, and we won’t get the performance “improvement” (rather: regression fix) that we’re seeking. So I’d rather stick to the clean_old design and tweak the N months number to make it work well enough without storing too much old stuff (your unvoiced concern seems to be about resource needs, i.e. Feature #12002, right?). E.g. 4 months should be enough to ensure we keep the last stable release’s basebox around, but delete older stuff. What do you think?

Yes, my concern was about disk space. But I get your reasoning, so let’s say 4 months is enough. Hopefully once we’ve stabilized the vagrant build implementation, there won’t be so many changes and different baseboxes. Maybe having enough disk space to host 3 different baseboxes would be enough as a starter.

#22 Updated by intrigeri 2017-05-24 13:34:01

> Maybe having enough disk space to host 3 different baseboxes would be enough as a starter.

Please take into account that our storage upgrade plans span over 2 years, so I’d rather give ourselves some more margin.

#23 Updated by bertagaz 2017-05-26 13:39:04

intrigeri wrote:
> > Maybe having enough disk space to host 3 different baseboxes would be enough as a starter.
>
> Please take into account that our storage upgrade plans span over 2 years, so I’d rather give ourselves some more margin.

So would 5 baseboxes be more relevant? We’ll rotate baseboxes at every Tails release and will clean the ones that are older than 4 months. I’m not sure we’ll have to host a lot of baseboxes, even in 2 years.

#24 Updated by intrigeri 2017-05-26 14:57:07

> So would 5 baseboxes be more relevant?

I have no idea where this number comes from. I find it very hard, if not impossible, to review/discuss such magic numbers. So I feel the need to bring some data and reasoning to the table.

(Meta: I don’t think it’s the reviewer’s duty to do this research work. But it’s easier for me to do it myself, than to ask you to do it and then feel like I’m PITA forcing you once more to (temporarily) switch to an engineering communication mindset that I’m aware you’re not extremely comfortable with. I feel that I’m stuck between a rock and a hard place, because as I told you a few days ago I’ve already eaten my time budget on this project, and now I’m in a situation where it’s easier for me to do part of your work, than to clarify expectations and ask you to do it. I know you didn’t ask me to do this work, but you know me well enough to guess what will happen if you drop “would 5 baseboxes be more relevant?” like this ;) Whatever, no drama, I’ll survive, TBH I don’t dislike doing this kind of work per-se, and I’m aware that this situation is partly my own responsibility.)

So, I think it greatly depends on how intensively one is working on the build system:

  • At times when nobody at all is working on the vagrant/ directory, we will generally host 1 basebox for stable, and another one for testing/devel. And to be realistic, add another one for feature/$next_debian. So our lower bound for the ideal situation is 3.
  • When the build system is being worked on intensively, then the number of baseboxes can quickly explode. E.g. right now I have 5 of them on my own system, and on top of that I see two active branches on Jenkins that modify stuff in vagrant/, and that I’ve never built locally, so I don’t have the corresponding basebox. I didn’t look very closely so there might be more. None of these baseboxes are more than 4 months old so they won’t be garbage collected. When I prepare 3.0 there will be one more. Therefore, in a couple weeks, every isobuilder will need to host up to 8+ baseboxes (once it has built at least one of the branches relying on each basebox).

So, 5 won’t work, and we need to be able to host at the very least 8 baseboxes. I’m too lazy to look at past Git history and find out what’s the maximum number we should support in practice, sorry. So I have to make some guesswork and assume a few things:

  • we will never be working much more intensively on the build system than what’s happening these days;
  • in the next 3 months, we will need to fix/polish some more things in vagrant/; how many branches? 1 feels too optimistic, 3-5 more realistic.

So with the clean_old system parameterized to 4 months, we’re looking at hosting 11-13 baseboxes in August. If that’s a problem, then it can be solved either via Feature #12002 + Bug #11680, or with a more clever garbage collection system (e.g. similar to how our custom APT repo removes dists corresponding to branches merged into a base branch, and as you know Jenkins does the same for its jobs). I’d rather see you evaluate the hardware cost first (Feature #12002) as I’m pretty sure our engineering time would be better used elsewhere than on saving a few dozen GB in a datacenter.
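The “more clever garbage collection” idea could build on Git’s ancestry checks. A rough sketch (assumption: a basebox is deletable once the branch it was built for has been merged into its base branch; demonstrated on a throwaway repository, not the actual Tails one):

```shell
# Hypothetical merged-branch garbage collection, shown on a scratch repo:
# a branch whose tip is an ancestor of the base branch has been merged,
# so the basebox built for it could be deleted.
repo=$(mktemp -d)
g() { git -C "$repo" -c user.email=ci@example.org -c user.name=ci "$@"; }
g init -q
g commit -q --allow-empty -m init
base=$(g symbolic-ref --short HEAD)   # whatever the default branch is called
g branch topic-merged                 # its tip stays reachable from $base
g commit -q --allow-empty -m work
g checkout -q -b topic-unmerged
g commit -q --allow-empty -m wip      # tip NOT reachable from $base
g checkout -q "$base"

for branch in topic-merged topic-unmerged; do
    if g merge-base --is-ancestor "$branch" "$base"; then
        echo "$branch: merged, its basebox could be garbage-collected"
    else
        echo "$branch: still active, keep its basebox"
    fi
done
```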

Please check my reasoning, it’s probably flawed here and there :) Thanks in advance!

#25 Updated by intrigeri 2017-05-27 08:54:04

intrigeri wrote:
> Therefore, in a couple weeks, every isobuilder will need to host up to 8+ baseboxes (once it has built at least one of the branches relying on each basebox).

I was wrong! Technically, given we haven’t started to use clean_old yet, only the baseboxes corresponding to build system branches that haven’t been merged yet when we switch to clean_old will be stored. But this only impacts the short term scenario: we should support the case when the same development story happens in the future and clean_old is used. So the rest of the reasoning wrt. long-term requirements still holds (until we find other flaws in it ;)

Now, I’m going to move the bits about clean_old out of this ticket: the current workaround (clean_all) does address the problem at hand, and switching to clean_old is “merely” a build time optimization. So let’s continue this discussion on Feature #12576, and re-focus this ticket on ISO builds fragility.

#26 Updated by anonym 2017-05-30 13:25:03

  • Status changed from In Progress to Fix committed
  • % Done changed from 47 to 100

Applied in changeset commit:60c74058de430e126c6250ca5a2ec5223cc5adc5.

#27 Updated by intrigeri 2017-05-30 15:27:07

  • Status changed from Fix committed to In Progress

#28 Updated by intrigeri 2017-06-08 17:50:14

  • Target version changed from Tails_3.0 to Tails_3.1

(Like the remaining subtasks.)

#29 Updated by bertagaz 2017-08-03 15:31:26

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • QA Check deleted (Dev Needed)

All subtasks are over, so let’s close this ticket.

#30 Updated by intrigeri 2017-08-03 17:59:27

Woohoo!