Bug #14944

jenkins-data-disk is running out of diskspace

Added by groente 2017-11-09 17:06:54. Updated 2018-01-24 12:38:21.

Status:
Resolved
Priority:
Urgent
Assignee:
groente
Category:
Continuous Integration
Target version:
Start date:
2017-11-09
Due date:
% Done:

100%

Feature Branch:
puppet-tails:bugfix/14944-deduplicate-reproducible-jobs-ISOs
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:
301

Description

The subject says it all, really: I can’t see anything that looks dispensable (though I also find it hard to judge), so I’m tempted to simply extend the LV by 100GB.


Subtasks


Related issues

Related to Tails - Feature #12633: Lower the workload caused by reproducible builds Jenkins jobs Resolved 2017-10-22
Related to Tails - Bug #15107: Some topic branches are never built reproducibly during Jenkins daily runs => add option to specify which APT snapshot serials to use during build Resolved 2017-12-26
Related to Tails - Bug #16025: jenkins-data-disk is running out of diskspace again Resolved 2018-10-03
Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) Confirmed 2017-06-29

History

#1 Updated by intrigeri 2017-11-10 10:15:50

  • Category changed from Infrastructure to Continuous Integration

Our storage estimates don’t include room for these extra 100GB, so I’d like to look into this a bit more before we allocate space to this LV that we have already planned to use elsewhere.

At first glance I see a few potential root causes:

  • It looks like we may be keeping artifacts for reproducibly_build_Tails_ISO_* for too long: when a branch fails to build reproducibly, generally it will do so until we fix it, and a few sets of failure artifacts are enough to debug the problem. I’ll check if/how we’re garbage collecting them and will try to adjust.
  • When we did the storage estimates for the reprobuilds jobs, we didn’t take into account the fact that UX would be poor if we didn’t keep the 2 ISOs on failure, which we now do, so we’re simply storing twice as much data there as planned (Feature #12633#note-35).
  • All branches fail to build reproducibly at the moment. I do hope this is an exception and won’t happen often in the future, so I’m fine with short-term workarounds (without problematic long-term consequences) for now.
  • I see artifacts for a job that was deleted a month ago (test-sikuli-similar-0.60). Not sure what’s going on here. I’ve deleted this directory.

#2 Updated by intrigeri 2017-11-10 10:19:35

intrigeri wrote:
> * It looks like we may be keeping artifacts for reproducibly_build_Tails_ISO_* for too long: when a branch fails to build reproducibly, generally it will do so until we fix it, and a few sets of failure artifacts are enough to debug the problem. I’ll check if/how we’re garbage collecting them and will try to adjust.

Actually, as far as I can tell we simply have no GC mechanism in place for these artifacts. In jenkins-jobs.git:defaults.yaml I see:

      artifactDaysToKeep: -1
      artifactNumToKeep: -1

(overridden nowhere).

And we call clean_old_jenkins_artifacts only on build_Tails_ISO_* jobs. I’ll try to fix this right away.
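
For the record, a minimal sketch of the kind of GC I have in mind, in Python (hypothetical: this is not the actual clean_old_jenkins_artifacts implementation, and the job glob, the directory layout and the 7-day retention are assumptions):

    #!/usr/bin/python3
    # Hypothetical sketch: remove archived artifacts older than MAX_DAYS for
    # reproducibly_build_Tails_ISO_* jobs. Assumes the standard Jenkins
    # /var/lib/jenkins/jobs/<job>/builds/<build>/archive/ layout.
    import glob
    import os
    import shutil
    import time

    JOBS_DIR = "/var/lib/jenkins/jobs"  # assumption: default Jenkins home layout
    MAX_DAYS = 7                        # assumption: retention period

    cutoff = time.time() - MAX_DAYS * 24 * 3600
    pattern = os.path.join(JOBS_DIR, "reproducibly_build_Tails_ISO_*",
                           "builds", "*", "archive")
    for archive in glob.glob(pattern):
        # Use the build directory's mtime as an approximation of the build date.
        build_dir = os.path.dirname(archive)
        if os.path.getmtime(build_dir) < cutoff:
            shutil.rmtree(archive)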

#3 Updated by intrigeri 2017-11-10 10:23:34

  • related to Feature #12633: Lower the workload caused by reproducible builds Jenkins jobs added

#4 Updated by intrigeri 2017-11-10 10:26:59

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20
  • QA Check changed from Ready for QA to Dev Needed

intrigeri wrote:
> Actually, as far as I can tell we simply have no GC mechanism in place for these artifacts.

Hopefully fixed with
https://git-tails.immerda.ch/jenkins-jobs/commit/?id=5ee375f7290be2e4b6406fcb39a66a99365ec119, let’s see how it goes (I don’t know when Jenkins schedules cleanup of old artifacts).

I’ve deleted a few useless/obsolete branches, which saved some space, so this disk should now be back to the warning state. If within 24h the aforementioned fix hasn’t cleaned up enough to escape the warning state, I’ll look closer at what’s left.

#5 Updated by intrigeri 2017-11-10 10:27:17

  • blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added

#6 Updated by intrigeri 2017-11-10 15:46:01

#7 Updated by intrigeri 2017-11-11 11:33:54

Also, note that we’ve been in a rather unusual (at least in theory) situation in the last few weeks; here’s some kind of post-mortem analysis trying to understand what happened:

  • Lots of branches waiting to be reviewed: a few about updating process/internal doc post-summit, mostly blocking on sajolida & u being busy and on some affected workers being unresponsive; a few bugfix ones blocking on the RM to look at them. These are known issues, and “xyz runs out of disk space” is one of the most minor of their consequences, so let’s not bother looking into this here.
  • No branch builds reproducibly due to various problems (at least Bug #14924, Bug #14933, Bug #14946) that nobody spotted or acted on in a timely manner. I’m not sure why, but I bet it’s because it was unclear who was responsible for keeping an eye on this and then either fixing issues or, if it’s Someone Else’s Problem™, letting them know. For example, it wasn’t clear to me to what extent bertagaz felt responsible for this. I think that this can easily be solved by improving the interface between CI developers (mostly bertagaz) & CI users (mostly the Foundations Team): e.g. next time, the person who deploys a big change to our CI could say “hey, I’ve deployed $this, I’m confident it should work and I will monitor how it works in practice, but if you notice any issues I missed, please let me know” or “hey, I’ve deployed $this, I won’t evaluate how it works for you in practice apart from big breakage that’s obviously on my plate, so I expect you to evaluate the output of this system and report any problem”. And then we’ll know who should keep an eye on this :)
  • Based on my UX feedback (“how do I find the 1st ISO?”) we implemented “keep the 2 ISOs” before we had a plan to deal with the increased storage space, and actually before we had any kind of GC in place for these artifacts. This feels risky, to say the least. I have my share of responsibility for creating an atmosphere in which bertagaz might have felt pressured to address my UX feedback quickly, resulting in even worse UX in the last 24h, with Jenkins being broken due to the problem this ticket is about. I’ll try to improve how I convey feedback, in particular how urgent or not a given problem is.

#8 Updated by intrigeri 2017-11-14 08:09:08

  • Target version changed from Tails_3.3 to Tails_3.5

#9 Updated by intrigeri 2017-11-18 11:31:53

  • Status changed from In Progress to Resolved
  • % Done changed from 20 to 100
  • QA Check deleted (Dev Needed)

#10 Updated by groente 2017-12-25 20:36:08

  • Status changed from Resolved to Confirmed
  • Assignee changed from intrigeri to bertagaz
  • QA Check set to Info Needed

Jenkins-data is hitting warning levels again; do you see any way to clean things up?
If not, and if I read the spreadsheet correctly, the used data grows by roughly 10GB per year, so how about adding 20GB to the partition to keep it happy ’till 2019?

#11 Updated by bertagaz 2017-12-26 11:11:01

  • related to Bug #15107: Some topic branches are never built reproducibly during Jenkins daily runs => add option to specify which APT snapshot serials to use during build added

#12 Updated by bertagaz 2018-01-01 14:28:03

  • Status changed from Confirmed to In Progress

Applied in changeset commit:d0230678e5e6098adc3a39e594ebc2efdc40757c.

#13 Updated by intrigeri 2018-01-04 18:56:21

/var/lib/jenkins/jobs/reproducibly_build_Tails_ISO_* currently uses 23G; FTR that’s just as much as what we would save if the branches waiting to be reviewed by the 3.6 RM were merged. And for curiosity’s sake, out of these 23G, 16G is reproducibly_build_Tails_ISO_web-14997-explain-better-verification-and-failure (14 ISO images, i.e. one per day for the past week, since we keep these artifacts for a week; likely it’s scheduled at a bad time that causes Bug #15107, but I did not verify).

On Bug #15107 bertagaz seems to be convinced that reproducible build jobs are the reason for this disk space issue; I’m not convinced given the above data, and FTR build_Tails_ISO_* takes 217GB. We’re discussing what the best approach is on that other ticket. In the meantime, I’ve tweaked the reproducibly_build_Tails_ISO_* Jenkins jobs config to keep 2 fewer days of artifacts, i.e. 5 days. This should bring the impact of these reproducible build jobs down to something that starts to look negligible, until we make up our mind wrt. how to handle this.
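
(As an aside, here is a minimal Python sketch of how one could measure per-job disk usage like the above; this is an illustration assuming the standard /var/lib/jenkins/jobs layout, not the command I actually ran:)

    #!/usr/bin/python3
    # Minimal sketch: report disk usage per Jenkins job directory, largest first.
    import os

    JOBS_DIR = "/var/lib/jenkins/jobs"  # assumption: default Jenkins home layout

    def dir_size(path):
        """Total size in bytes of all files below path (symlinks not followed)."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                fp = os.path.join(root, name)
                if not os.path.islink(fp):
                    total += os.path.getsize(fp)
        return total

    sizes = {job: dir_size(os.path.join(JOBS_DIR, job))
             for job in os.listdir(JOBS_DIR)}
    for job, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print("{:>8.1f} GiB  {}".format(size / 2**30, job))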

#14 Updated by groente 2018-01-06 17:49:05

  • Priority changed from Elevated to Urgent

disk use is critical again!

#15 Updated by bertagaz 2018-01-09 14:20:53

  • Assignee changed from bertagaz to groente
  • QA Check changed from Info Needed to Ready for QA
  • Feature Branch set to bugfix/14944-deduplicate-reproducible-jobs-ISOs

I’ve temporarily fixed the situation by removing ISOs from the build_Tails_ISO_bugfix-10494-retry-curl-in-htpdate job; they are useless as this job is only there for me to test changes to the build system in Jenkins without sending false positive notifications to developers.

We also have the problem that some ISOs are duplicated: each reproducibly_build_Tails_ISO_* job downloads the ISO built by its respective build_Tails_ISO_ job so that it can compare it, for reproducibility, with the one it builds itself.

So far I’ve deduplicated these ISOs by replacing them with a symlink to the original ones archived in the build_Tails_ISO_ job artifacts. I’ve pushed a branch which adds a cronjob running a script that does that automatically.
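
To give an idea, here is a minimal Python sketch of the deduplication logic (not the actual script from the branch; the exact layout of the build_Tails_ISO_ archives and the matching by file name are assumptions):

    #!/usr/bin/python3
    # Minimal sketch: in reproducibly_build_Tails_ISO_* job archives, the ISO
    # downloaded from the corresponding build_Tails_ISO_ job (stored under
    # .../archive/build-artifacts/1/) is replaced with a relative symlink to
    # the original ISO archived by that job. Not the actual puppet-tails script.
    import glob
    import os

    JOBS_DIR = "/var/lib/jenkins/jobs"  # assumption: default Jenkins home layout

    for iso in glob.glob(os.path.join(
            JOBS_DIR, "reproducibly_build_Tails_ISO_*", "builds", "*",
            "archive", "build-artifacts", "1", "tails-amd64-*.iso")):
        if os.path.islink(iso):
            continue  # already deduplicated
        # Look for an ISO with the same file name in the build_Tails_ISO_* archives
        # (assumed layout; the real script presumably matches more carefully,
        # e.g. by checksum).
        candidates = glob.glob(os.path.join(
            JOBS_DIR, "build_Tails_ISO_*", "builds", "*", "archive",
            "build-artifacts", os.path.basename(iso)))
        if not candidates:
            continue
        original = candidates[0]
        relative_target = os.path.relpath(original, os.path.dirname(iso))
        os.remove(iso)
        os.symlink(relative_target, iso)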

It’s not clear to me who should review that. Groente, do you want to have a look? I used it the last few times I wanted to remove duplicated ISOs and it worked fine.

A way to test it, if the code looks good to you and you want to:

  • find in jenkins.lizard:/var/lib/jenkins/jobs which reproducibly_build_Tails_ISO_* job contains one of these downloaded ISOs in its archives (its path should be something like /var/lib/jenkins/jobs/reproducibly_build_Tails_ISO_*/builds/2018-*/archive/build-artifacts/1/tails-amd64-*.iso; pay attention, the ISO is in the /1/ subdirectory)
  • note the job name and go to its corresponding URL on our nightly website (e.g. https://nightly.tails.boum.org/reproducibly_build_Tails_ISO_web-14997-explain-better-verification-and-failure/builds/2018-01-07_23-26-05/archive/build-artifacts/1/)
  • run the script with /var/lib/jenkins/jobs as arguments on jenkins.lizard as the jenkins user
  • have a look at the nightly webpage you opened: the ISO should still be downloadable. Check on jenkins.lizard that the ISO has been replaced with a valid relative symlink to the original ISO in the related /var/lib/jenkins/jobs/build_Tails_ISO_ job archive (see the sketch after this list for one way to do this check).
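
For that last check, something like this minimal Python sketch should do (the specifics are mine, not part of the branch; the example path is hypothetical):

    #!/usr/bin/python3
    # Minimal sketch: check that a given ISO path is now a valid *relative*
    # symlink pointing at an existing file in a build_Tails_ISO_* job archive.
    import os
    import sys

    # e.g. /var/lib/jenkins/jobs/reproducibly_build_Tails_ISO_<branch>/builds/<build>/archive/build-artifacts/1/tails-amd64-<version>.iso
    iso = sys.argv[1]

    assert os.path.islink(iso), "not a symlink"
    target = os.readlink(iso)
    assert not os.path.isabs(target), "symlink is absolute, expected relative"
    resolved = os.path.realpath(iso)
    assert os.path.isfile(resolved), "symlink is dangling"
    assert "/jobs/build_Tails_ISO_" in resolved, \
        "does not point into a build_Tails_ISO_* archive"
    print("OK:", iso, "->", target)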

If it seems too complicated, I guess you can pass that review to intrigeri.

#16 Updated by bertagaz 2018-01-09 18:15:03

  • Feature Branch changed from bugfix/14944-deduplicate-reproducible-jobs-ISOs to puppet-tails:bugfix/14944-deduplicate-reproducible-jobs-ISOs

#17 Updated by anonym 2018-01-23 19:52:29

  • Target version changed from Tails_3.5 to Tails_3.6

#18 Updated by groente 2018-01-24 12:38:22

  • Status changed from In Progress to Resolved
  • QA Check changed from Ready for QA to Pass

seems to work nicely, merged & deployed. thanks!

#19 Updated by bertagaz 2018-10-03 09:36:56

  • related to Bug #16025: jenkins-data-disk is running out of diskspace again added