Bug #10772

Next ISO builds fail on Jenkins when a previous job was aborted uncleanly

Added by bertagaz 2015-12-17 04:35:56. Updated 2016-03-08 19:03:09.

Status:
Resolved
Priority:
High
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2015-12-17
Due date:
% Done:

100%

Feature Branch:
bugfix/10772-umount-old-builds-tmp-directories
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:
280

Description

We’ve seen Tails builds failing in Jenkins on isobuilder2, with huge stack traces containing the same kind of output as the one we had with Bug #10334:

17:58:29 FATAL: hudson.remoting.RequestAbortedException: java.io.EOFException
17:58:29 hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.EOFException

and the same symptoms: the /tmp/tmpfs.xxx and /tmp/tails-build.xxx directories were not unmounted at some point, and then all subsequent builds on this isobuilder failed, because /tmp/ was full.

One thing to try, if this happens again, is to unmount these /tmp/ directories and see if we get the same NFS file handle error.
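A minimal sketch of that manual recovery, assuming mount points follow the /tmp/tmpfs.* and /tmp/tails-build.* naming seen in this ticket (the helper name is hypothetical, not part of the build system):

```shell
# List leftover build mount points from `mount` output (read on stdin),
# last-mounted first, so stacked mounts get unmounted in reverse order.
list_stale_build_mounts() {
    awk '$3 ~ /^\/tmp\/(tmpfs|tails-build)\./ { print $3 }' | tac
}

# Manual recovery would then look roughly like:
#   mount | list_stale_build_mounts | while read -r dir; do
#       umount "$dir" || umount -l "$dir"   # fall back to a lazy unmount
#   done
```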

The first build to fail seems to have been https://jenkins.tails.boum.org/view/Raw/job/build_Tails_ISO_feature-jessie/877/consoleFull (which has been deleted since feature/jessie was merged into devel).

Then the subsequent build_Tails_ISO_devel jobs failed since
https://jenkins.tails.boum.org/view/RM/job/build_Tails_ISO_devel/1400/console

https://jenkins.tails.boum.org/computer/isobuilder2/builds can give an overview of what happened, minus the feature/jessie branch that has been removed.


Subtasks


Related issues

Related to Tails - Bug #10334: Fragile automated builds on lizard Resolved 2015-10-03

History

#1 Updated by bertagaz 2015-12-17 04:36:15

  • related to Bug #10334: Fragile automated builds on lizard added

#2 Updated by bertagaz 2015-12-17 04:40:12

#3 Updated by intrigeri 2015-12-17 05:40:19

  • Assignee set to bertagaz
  • Target version deleted (Tails_2.0)
  • Deliverable for set to SponsorS_Internal

#4 Updated by intrigeri 2015-12-18 14:11:50

https://jenkins.tails.boum.org/view/Raw/job/build_Tails_ISO_bugfix-10720-installer-freezes-on-jenkins/lastFailedBuild/console just failed and I see:

tmpfs                      tmpfs     8.8G   21M  8.8G   1% /tmp/tmpfs.iNpPvX7P
aufs                       aufs      8.8G   21M  8.8G   1% /tmp/tails-build.KM13A83U
tmpfs                      tmpfs     8.8G  1.6G  7.3G  18% /tmp/tmpfs.huwt0WIv
aufs                       aufs      8.8G  1.6G  7.3G  18% /tmp/tails-build.5AbM57Hp
tmpfs                      tmpfs     8.8G  811M  8.0G  10% /tmp/tmpfs.azY7GNkn
aufs                       aufs      8.8G  811M  8.0G  10% /tmp/tails-build.WQr9Sobp

I could unmount the first 4 of them with no problem. Another build is running on this Jenkins worker, so I’ll let it run and see.

#5 Updated by intrigeri 2016-01-07 00:57:00

Happened again, and quite a few false positive notifications were sent already.

What I see is:

tmpfs                       8.4G  4.4G  4.0G  53% /tmp/tmpfs.4XcPWADD
aufs                        8.4G  4.4G  4.0G  53% /tmp/tails-build.1bBq132u
# ls -l /tmp/tails-build.1bBq132u/
total 656
drwxr-xr-x  3 jenkins jenkins     80 Jan  6 15:41 auto
drwxr-xr-x  4 root    root        80 Jan  6 15:48 cache
drwxr-xr-x 22 root    root       520 Jan  6 15:58 chroot
drwxr-xr-x 17 jenkins jenkins    440 Jan  6 15:41 config
drwxr-xr-x  2 jenkins jenkins     60 Jan  6 15:41 debian
drwxr-xr-x  4 jenkins jenkins    620 Jan  6 15:41 features
drwxr-xr-x  2 jenkins jenkins    900 Jan  6 15:47 po
-rw-r--r--  1 root    root    664330 Jan  6 16:01 tails-i386-web_10614-temporary-terminology-1.7-20160106T1541Z-c02147e+stable@2d053cd.iso.buildlog
-rw-r--r--  1 root    root        11 Jan  6 15:47 tails-i386-web_10614-temporary-terminology-1.7-20160106T1541Z-c02147e+stable@2d053cd.iso.start.timestamp
drwxr-xr-x  3 root    root        60 Jan  6 15:47 tmp
drwxr-xr-x  3 jenkins jenkins     60 Jan  6 15:41 wiki

One thing that’s interesting is that the job for Feature #10614 is not configured anymore. I suspect that it was deleted yesterday, when I fixed the “Git branches to Jenkins job update process has been broken since Dec 19” issue I told you about. So I wonder if this bug might be happening when we’re deleting a job that is currently running (didn’t check). If that’s the case, then it’s not a problem, in itself, that this job fails… as long as this doesn’t break all subsequent job runs as a result.

I’m thinking more and more that the “reboot before running the actual job” process should be applied to ISO build jobs as well, as it would kill an entire class of robustness issues (namely: sub-optimal clean up after a build failure).

In case it’s useful:

# tail -n 20 /tmp/tails-build.1bBq132u/tails-i386-web_10614-temporary-terminology-1.7-20160106T1541Z-c02147e+stable@2d053cd.iso.buildlog 
After this operation, 58.4 kB disk space will be freed.
(Reading database ... 133952 files and directories currently installed.)
Removing hardlink ...
Processing triggers for man-db ...
Creating /etc/default/locale
Setting correct file permissions
Removing *.pyc
cp: cannot stat `config/chroot_local-hooks/99-set_mtimes': No such file or directory
P: Begin unmounting filesystems...

real    14m39.781s
user    10m59.993s
sys     2m2.888s
+ RET=1
+ '[' -e binary.iso ']'
+ fatal 'lb build failed (1).'
+ echo 'lb build failed (1).'
lb build failed (1).
+ exit 1
+ kill -9 28674

Now I’m going to reboot isobuilder2 so it stops spamming contributors. If you would prefer me to put it offline next time, so you can debug what happened more easily, let me know.

In any case, this ticket should be retitled, since at least we know quite clearly why all further builds fail: there’s not enough free memory to complete a build, since some build failed and wasn’t cleaned up properly.

#6 Updated by intrigeri 2016-01-07 00:58:28

  • Target version set to Tails_2.2

(Adding to some milestone so that this bug is on your radar.)

#7 Updated by intrigeri 2016-01-07 01:27:25

  • Status changed from Confirmed to In Progress
  • Priority changed from Normal to Elevated
  • % Done changed from 0 to 10

I have a theory that comes from: a) the fact that the output of ls above lacks some files we would normally have in a Git checkout; b) the build failed with cannot stat `config/chroot_local-hooks/99-set_mtimes': No such file or directory; and c) all this seems correlated with the time when the job that left stuff in a dirty state is likely to have been deleted.

My theory is that jjb deletes the job => Jenkins kills the current run of this job => it deletes the workspace => some of this deletion works, and breaks the build due to missing files; and anyway, the aufs mount is in a weird state since files were just removed directly from its read-only underlying branch. This would also explain the “Stale NFS file handle” we could not make any sense of on Bug #10334.

So, I would call this problem “After a ISO build job is deleted while running in Jenkins, all subsequent ISO builds fail and issue false positives”. If this theory is correct, then it’s essentially a bug in the “generate ISO build jobs from Git branches” implementation.

I see three possible broad categories of solutions:

  1. we avoid deleting jobs while they are running: seems tricky and will be brittle (prone to TOCTOU races);
  2. we ensure we don’t leave an isobuilder in a dirty state after a job is killed: if we can do the umount’ing as an additional step, that’s run regardless of whether the previous step has been killed, we’re good (not sure this is even possible with Jenkins); but I would not count on the fact that this clean up step is run if the job is being deleted… also seems brittle, and relying a bit too much on undocumented behaviour that may change some day;
  3. we ensure we don’t start a build job on an isobuilder that’s in a dirty state. This could be a) we reboot isobuilders before running build jobs, as we do on isotesters; I think this is the way to go long-term, because it’s simply the only way to be sure we run in a pristine, reproducible environment. But I can see why we won’t implement this soon, and we need a short-term solution to stop spamming contributors needlessly this much. So, we could do b) forcefully unmount all such temporary mounts created by the build process as the first step of building an ISO (incidentally, that’s exactly what my personal, custom build scripts have been doing at home for years, and I’ve never looked back).
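Option 3.b could look something like the following, run as the very first step of an ISO build (a sketch only; the function name and the exact glob are assumptions, not the actual branch):

```shell
# Clean up mounts and directories left behind by a previous, aborted build,
# so a dirty isobuilder cannot poison the build that is about to start.
cleanup_leftover_build_dirs() {
    for dir in /tmp/tails-build.* /tmp/tmpfs.*; do
        [ -d "$dir" ] || continue
        # Try a normal, then a lazy, unmount: this tolerates mounts left
        # half-dead by a killed job. Errors are ignored on purpose.
        umount "$dir" 2>/dev/null || umount -l "$dir" 2>/dev/null || true
        # Once unmounted, what remains is a regular directory eating /tmp
        # space: delete it.
        rm -rf "$dir" 2>/dev/null || true
    done
}
```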

So I think we should:

  • do 3.b soon (ideally this month, worst case next month, ask for help if you don’t think you can do it in this time span); I’m raising priority because this has been going on for 3 months now, and because hand-holding Jenkins is not what we should be spending time on;
  • and then, we can keep 3.a in mind (not on a todo list!), and if we see other robustness issues it would fix some day, you can count on me to put it back on the table.

Thoughts?

#8 Updated by intrigeri 2016-01-07 01:45:21

(Some thoughts in passing, for the sake of curiosity, but probably not worth wasting time on instead of actually fixing the bug: I still wonder why we see it more often on isobuilder2. The only obvious differences I see are that it has ~500MiB less RAM configured, for some reason I can’t explain; and it has ~200MiB less used disk space on its rootfs, probably due to the long-term install/remove/upgrade churn isobuilder1 went through. This leads me to think we should reinstall isobuilder1 at some point to reduce the delta: the fact these two VMs are already exposing slightly different behaviour is not very confidence inspiring. I say we should reinstall both from scratch when we move to Jessie for isobuilders.)

#9 Updated by intrigeri 2016-01-11 01:48:10

Happened again on isobuilder2. The build that left the slave in a dirty state was:

-rw-r--r--  1 root    root    686935 Jan 11 00:21 tails-i386-feature_10841-xserver-xorg-video-intel-backport-2.0-20160110T2353Z-3b8ccbc+devel@dbaea26.iso.buildlog

… that was removed from jenkins-jobs with commit c0b4018dcfec6ebc2c5988c0cd99cee447240b43 (Mon Jan 11 00:06:04 2016 +0000). I don’t think it’s coincidental.

#10 Updated by intrigeri 2016-01-23 12:11:10

Happened again: all ISO builds on isobuilder2 have been failing for ~40 hours. The build that was left dirty was tails-i386-test_9087-lan-ssh-tests-2.0-20160121T2034Z-61f2928+testing@8a01831.iso.buildlog, which was disabled on Thu Jan 21 20:46:42 2016 +0000.

Ping wrt. the timing resolution I proposed?

#11 Updated by bertagaz 2016-01-27 13:02:33

  • related to Bug #11006: Reinstall Lizard's isobuilders from scratch added

#12 Updated by bertagaz 2016-01-27 13:08:44

  • % Done changed from 10 to 20
  • QA Check set to Dev Needed

intrigeri wrote:
> Ping wrt. the timing resolution I proposed?

Sorry for the lag. Your proposal seems relevant to me. I just pushed a branch that implements 3.b, and filed Bug #11006 as a reminder to reinstall the isobuilders once we’ve ported our build system to Jessie. A build will soon be started on Jenkins.

#13 Updated by bertagaz 2016-01-27 13:46:52

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

It seems that the branch at least doesn’t break the build in Jenkins: https://jenkins.tails.boum.org/job/build_Tails_ISO_bugfix-10772-umount-old-builds-tmp-directories/1/consoleFull

#14 Updated by intrigeri 2016-01-31 10:36:42

  • Feature Branch set to bugfix/10772-umount-old-builds-tmp-directories

#15 Updated by intrigeri 2016-01-31 10:52:45

  • Subject changed from Builds in Jenkins sometimes fail with huge stacktrace on isobuilder2 to Next ISO builds fail on Jenkins when a previous job was aborted uncleanly
  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 20 to 30
  • QA Check changed from Ready for QA to Dev Needed

Hi! Thanks for the branch. I think it has two major problems:

  1. it introduces a 2nd, and very different, implementation of code that we already have in the cleanup function; I know that we can’t reuse cleanup as-is (as it is tied to information about the current build) but let’s please try to factorize what can be;
  2. it only umounts stuff under /tmp/tails-build.*, but leaves the actual tmpfs (TMPFS_DIR) alone, so frankly I doubt that it will fix the actual bug this ticket is about.

I would happily merge a branch that fixes the 2nd problem only (so that we finally get rid of this way too old bug) if I’m promised that the 1st problem will be addressed during the 2.2 cycle.

#16 Updated by bertagaz 2016-01-31 13:19:39

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 30 to 40
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:
> # it introduces a 2nd, and very different, implementation of code that we already have in the cleanup function; I know that we can’t reuse cleanup as-is (as it is tight to information about the current build) but let’s please try to factorize what can be;
> # it only umounts stuff under /tmp/tails-build.*, but leaves the actual tmpfs (TMPFS_DIR) alone, so frankly I doubt that it will fix the actual bug this ticket is about.

Thanks for the review and the bigger picture it brought. I’ve just pushed two commits in the branch that hopefully should fix both of your concerns.

#17 Updated by intrigeri 2016-02-05 16:11:39

  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 40 to 50
  • QA Check changed from Ready for QA to Dev Needed

> Thanks for the review and the biggest picture it brought. I’ve just pushed two commits in the branch that hopefully should fix both of your concerns.

Thanks! I’ve pushed another commit on top that improves the program’s design IMO.

These 2 commits are indeed headed in the right direction. They introduce a couple of regressions, though, that I would like to see fixed before we merge this.

First of all, it looks like commit:31a1ac642a7045100455bff440b6a2c7c5129d65 introduces a regression when not building in RAM: previously we were removing $BUILD_DIR, and now we’re not, since the new remove_build_dirs function does not support this usecase. Did I miss something?
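For illustration, a version that covers both cases might look like this (the names here are hypothetical; the actual remove_build_dirs in the branch may differ):

```shell
# Remove one build directory, whether or not the build ran in RAM.
remove_build_dir() {
    build_dir="$1"
    # When building in RAM, $build_dir sits on a tmpfs/aufs stack that
    # must be unmounted first.
    if mountpoint -q "$build_dir"; then
        umount "$build_dir"
    fi
    # When not building in RAM, $build_dir is a plain directory: it still
    # has to be deleted, which is the case the regression above misses.
    rm -rf "$build_dir"
}
```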

About commit:a7dee66a3ee83217d4b7614e914cde209c810361, that does:

old_build_dirs() {
-       mount | perl -ni -E 'say $mountpoint if (($mountpoint) = ($_ =~ m{^(?:tmpfs|devpts-live|proc-live|sysfs-live) on (/tmp/tails-build\.[/[:alnum:]]+)}))'
+       mount | egrep '(/tmp/tmpfs.|/tmp/tails-build)' | awk '{ print $3 }' | tac
}

… i.e. it replaces a quite good mount|perl pipeline with the egrep|awk one quoted above.

The previous version of the mount filtering was maybe not nice, and not using tools you feel at ease with, but it tried carefully not to select anything unrelated. The new version matches lines very broadly, which feels ugly and fragile to me, for at least two reasons:

  • it matches substrings anywhere on the line, without any anchoring nor accurate pattern matching; I would like to keep the “is mounted on” criterion, somehow; to do that, we can either use a good enough regexp, as we did previously (updating it, of course, to match the new needs you discovered along the way); or, some library that parses it for us if you feel like making it fancy;
  • /tmp/tmpfs. is pretty generic, and there’s no guarantee that other programs than our build system won’t use names that match it; I suggest switching the whole thing to /tmp/tails-build-tmpfs.XXXXXXXX, or something.

Also, you want to escape ‘.’ in the regexp if it stays in there, since it’s a RE special char. And you’ll want to add it to the /tmp/tails-build component as well, for consistency’s sake. Factorizing the common prefix would be a nice bonus.

Meta: if you prefer that I fix all this stuff myself instead of writing at length (in the hope that it’s better on first try next time) here, let me know.

#18 Updated by intrigeri 2016-02-12 12:04:19

  • Priority changed from Elevated to High

Saw it again today, which wasted time during an emergency release process; this has been going on for 4+ months => bumping priority.

#19 Updated by bertagaz 2016-02-14 12:25:46

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 50 to 60
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:
> Thanks! I’ve pushed another commit on top that improves the program’s design IMO.

Good catch.

> These 2 commits are indeed headed in the right direction. They introduce a couple of regressions, though, that I would like to see fixed before we merge this.
>
> First of all, it looks like commit:31a1ac642a7045100455bff440b6a2c7c5129d65 introduces a regression when not building in RAM: previously we were removing $BUILD_DIR, and now we’re not, since the new remove_build_dirs function does not support this usecase. Did I miss something?

Nope, you’re right, I missed this subtlety. It’s back in commit:054c266c14873ace6cddc4535ac5ee97a107ab67

> About commit:a7dee66a3ee83217d4b7614e914cde209c810361, that does:
>
> […]
>
> … and also replaces a quite good mount|grep|awk pipeline with the one quoted above.
>
> The previous version of the mount filtering was maybe not nice and not using tools you feel at ease with, but it tried to carefully not select anything unrelated. The new version matches lines very broadly, which feels ugly and fragile to me, for two reasons at least:
>
> * it matches substrings anywhere on the line, without any anchoring nor accurate pattern matching; I would like to keep the “is mounted on” criterion, somehow; to do that, we can either use a good enough regexp, as we did previously (updating it, of course, to match the new needs you discovered along the way); or, some library that parses it for us if you feel like making it fancy;

OK, so I’ve tried to bring back your perl version, and complete it to match the new logic (catch all of the build directories, including the tmpfs and aufs ones, and not only the one mounted on top of the aufs).

commit:98b071669c8252f85d3399b990de116ef01b2589

> * /tmp/tmpfs. is pretty generic, and there’s no guarantee that other programs than our build system won’t use names that match it; I suggest switching to the whole thing to /tmp/tails-build-tmpfs.XXXXXXXX, or something.

Done in commit:81c043c8d58e5efa6591d590195760139a2ee807

> Also, you want to escape ‘.’ in the regexp if it stays in there, since it’s a RE special char.
>
> And you’ll want to add it to the one about /tmp/tails-build component as well, for consistency’s sake. Factorizing the common prefix would be a nice bonus.

So I tried to come up with a better regexp; I hope it’ll fit your taste. If not, please modify it accordingly.

#20 Updated by intrigeri 2016-02-27 10:29:58

  • related to deleted (Bug #11006: Reinstall Lizard's isobuilders from scratch)

#21 Updated by intrigeri 2016-02-28 12:07:59

  • Assignee changed from intrigeri to anonym
  • % Done changed from 60 to 70
  • Type of work changed from Research to Code

OK, the last version looks OK to me (I’m not fond of the way responsibilities are split between cleanup and remove_build_dirs, but it’s good enough). Pushed a few nitpicking commits on top.

anonym, can you please:

  1. check that this doesn’t break things on Vagrant
  2. check that it still builds ISOs successfully on Jenkins, with the last commits I’ve pushed there: https://jenkins.tails.boum.org/job/build_Tails_ISO_bugfix-10772-umount-old-builds-tmp-directories/
  3. merge into master -> stable -> devel -> experimental if happy

?

Now, as discussed elsewhere, this code should probably live outside of tails.git, like our test suite wrapper scripts, but let’s fix the problem at hand and we can come back to other matters later.

#22 Updated by anonym 2016-03-06 19:12:05

  • Status changed from In Progress to Fix committed
  • % Done changed from 70 to 100

Applied in changeset commit:72569de65e65a0a03ce758067678b1ad4df89a14.

#23 Updated by anonym 2016-03-06 19:47:25

  • Assignee deleted (anonym)
  • QA Check changed from Ready for QA to Pass

Works fine in Vagrant (Virtualbox).

#24 Updated by anonym 2016-03-08 19:03:09

  • Status changed from Fix committed to Resolved