Bug #12009

Jenkins ISO builders are highly unreliable

Added by intrigeri 2016-12-01 12:37:18 . Updated 2017-01-10 14:23:23 .

Status:
Resolved
Priority:
High
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2016-12-01
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

For a couple of weeks now, lots of ISO build jobs have been failing due to Jenkins crashing (OOM killer?).

The first affected builds are:

The only relevant changes I see immediately before this date are allocating a few more huge pages on lizard, and giving each ISO builder VM a bit more RAM (8912896 KiB → 9437184 KiB, i.e. 8.5 GiB → 9 GiB). Then, on November 25, bertagaz gave them 11461632 KiB (about 10.9 GiB).

This should be fixed, or at least worked around, by Feature #11972, but depending on the timing of that work, we may need to fix the problem earlier, and in a different way.


Subtasks


Related issues

Related to Tails - Bug #11858: Monitor if isobuilders systems are running fine Resolved 2016-10-03
Related to Tails - Feature #11972: Switch our Jenkins ISO build system to vagrant-libvirt Resolved 2016-12-04

History

#1 Updated by intrigeri 2016-12-01 12:37:30

  • related to Bug #11858: Monitor if isobuilders systems are running fine added

#2 Updated by intrigeri 2016-12-01 12:37:38

  • related to Feature #11972: Switch our Jenkins ISO build system to vagrant-libvirt added

#3 Updated by intrigeri 2016-12-01 19:27:25

The process killed by the OOM killer is:

[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  713]   109   713  1281356    27548     160        0             0 java

#4 Updated by intrigeri 2016-12-06 09:57:24

  • Assignee changed from intrigeri to bertagaz
  • QA Check set to Info Needed

bertagaz: I don’t understand anymore (sorry if I forgot!) why Feature #11972 would fix this breakage. Granted, the memory used by the build system itself will be restricted to whatever we allocate for the Vagrant build VM, but jenkins-slave will still be running on the isobuilderN level-1 guest, and plausibly it would behave just the same as it currently does, and would end up being killed by the OOM killer anyway, no?

#5 Updated by intrigeri 2016-12-06 09:57:34

  • Target version set to Tails_2.9.1

#6 Updated by intrigeri 2016-12-06 10:00:33

(“Whatever the reason, this bug simply doesn’t happen on isobuilder1 with vagrant-libvirt” might be a valid answer :)

#7 Updated by anonym 2016-12-14 20:11:29

  • Target version changed from Tails_2.9.1 to Tails 2.10

#8 Updated by intrigeri 2016-12-18 09:50:21

  • Assignee changed from bertagaz to intrigeri

#9 Updated by intrigeri 2016-12-18 10:19:20

intrigeri wrote:
> The only relevant changes I see immediately before this date are allocating a few more huge pages on lizard, and giving each ISO builder VM a bit more RAM. I don’t see how giving ISO builders more memory can trigger this problem.

vagrant/provision/assets/build-tails mounts a tmpfs with size=100%, so giving ISO builders more memory can actually affect this.
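
For context, size=100% means the tmpfs may grow up to the full amount of the VM’s RAM, so a bigger builder VM lets a single build pin more memory. A minimal sketch of such a mount (the mountpoint template and variable name are assumptions, not copied from build-tails):

    # Per-build scratch directory, backed by a tmpfs that may grow up to
    # 100% of the VM's RAM: more RAM for the builder means a bigger tmpfs.
    TMPDIR=$(mktemp -d /tmp/tails-build.XXXXXXXX)
    sudo mount -t tmpfs -o size=100% tmpfs "$TMPDIR"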

#10 Updated by intrigeri 2016-12-18 10:22:32

  • Description updated
  • QA Check deleted (Info Needed)

#11 Updated by intrigeri 2016-12-19 13:14:15

I’ve noticed that the “cached” value in the output of free sometimes stays very high after a build has succeeded. During a build, this value grows with the amount of used space on the tmpfs mounted on /tmp/tails-build.*. So it looks like the way we umount this tmpfs (as part of trap cleanup EXIT, with the -l, i.e. lazy, option) sometimes does not actually work, which leaves the memory used by that build occupied and breaks the next builds.
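
Roughly, the cleanup pattern in question looks like this (a simplified sketch, not the literal build-tails code):

    cleanup() {
        # A forced, lazy unmount "succeeds" even while some process still holds
        # files open on the tmpfs; in that case its pages stay accounted as
        # cached memory until that last process exits.
        sudo umount -f -l "$TMPDIR"
    }
    trap cleanup EXIT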

Detailed analysis, on isobuilder2:

  • Dec 18, https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-stretch/101/
    • 15:10:37: build has started, tmpfs not mounted yet, 379 MB cached memory
    • 16:08:45: sudo umount -f -l /tmp/tails-build.EvA6qUle
    • 16:08:47: the tmpfs is not listed in /proc/mounts anymore, but then there’s still 7590 MB cached memory
    • 16:10:23: [WS-CLEANUP] Deleting project workspace…[WS-CLEANUP] done
    • 16:10:25: Finished: SUCCESS
    • 16:10:28: cached memory drops to 6386 MB
    • tl;dr: the tmpfs was not successfully unmounted, and the corresponding memory is never freed
  • Dec 18, https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-stretch-unfrozen/50/
    • 21:19:00: build starts, still more than 6 GB of cached memory
    • 21:46:06: 7179 MB cached memory, of which /tmp/tails-build.vhvkbSG0 uses 385 MB
    • 21:46:10: dpkg fails which fails the build
    • 21:46:10: perl -ni -E say $mountpoint if (($mountpoint) = ($_ =~ m{^(?:aufs|tmpfs|devpts-live|proc-live|sysfs-live) on (/tmp/tails-build(?:-tmpfs)?\.[/[:alnum:]]+)}))
    • 21:46:10: sudo umount -f -l /tmp/tails-build.vhvkbSG0
    • 21:46:11: the tmpfs is not listed in /proc/mounts anymore, there’s 6661 MB cached memory
    • 21:46:13: Finished: FAILURE
    • tl;dr: the tmpfs was successfully unmounted, and the corresponding memory freed
  • then a couple of builds fail before having mounted their tmpfs, and cached memory doesn’t change in any significant way
  • Dec 19, https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-5684-screen-locker/19/console
    • 04:00:05: 10147 MB cached memory, of which /tmp/tails-build.pCpXxptd uses 3.8 GB
    • 04:00:09: oom killer kills the (Jenkins) java process, FATAL: hudson.remoting.RequestAbortedException: java.io.EOFException
    • 04:00:15: 10302 MB cached memory, of which /tmp/tails-build.pCpXxptd uses 4.1 GB
    • 04:00:20: the tmpfs is not listed in /proc/mounts anymore, cached memory dropped to 6168 MB
    • tl;dr: the tmpfs was successfully unmounted, and the corresponding memory freed
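
For reference, the numbers above can be correlated with commands along these lines (a sketch; the exact invocations are not recorded in the build logs):

    grep tails-build /proc/mounts    # is the build tmpfs still mounted?
    df -m /tmp/tails-build.*         # space used on the tmpfs, in MB
    free -m                          # watch the "cached" column over time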

I see the same pattern on:

#12 Updated by intrigeri 2016-12-19 13:18:29

intrigeri wrote:
> bertagaz: I don’t understand anymore (sorry if I forgot!) why Feature #11972 would fix this breakage.

If the analysis I’ve posted in my previous comment is correct, then Feature #11972 will indeed fix this: the memory used by the tmpfs will be freed by shutting down the Vagrant level-2 VM. Still, we’ve been suffering from this breakage for more than a month, and there’s no ETA that I’m aware of for Feature #11972, so I’ll try to find a quick fix without waiting for it. I’ll try not to spend too much time on it, though.

#13 Updated by intrigeri 2016-12-19 14:20:22

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

#14 Updated by intrigeri 2016-12-20 11:22:06

  • % Done changed from 10 to 20

On Stretch builds, the umount operation is blocked by a couple of gpg-agent processes: https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-stretch/111/console and https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-stretch/110/console.
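
One possible way to deal with that is to identify and stop whatever still holds files open on the tmpfs, and only then unmount it without -l; a sketch of the general approach, not necessarily the workaround that ended up being deployed:

    sudo fuser -vm /tmp/tails-build.*          # list processes keeping the tmpfs busy
    sudo fuser -km /tmp/tails-build.* || true  # kill them, e.g. stray gpg-agent processes
    sudo umount /tmp/tails-build.*             # a regular (non-lazy) umount can then succeed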

#15 Updated by intrigeri 2016-12-20 15:40:18

  • % Done changed from 20 to 50
  • QA Check set to Ready for QA

My workaround seems to work. Let’s wait a bit longer before calling it a success, and then I’ll ask anonym to review the set of commits I made to make it work.

#16 Updated by intrigeri 2016-12-21 09:29:41

  • Assignee changed from intrigeri to anonym

Seems to be fixed! Now, some branches based on Stretch might still not have the relevant changes merged in (I handled most of them, though). If they break the isobuilders again, the solution is to merge feature/stretch into them.
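
For an affected branch, that boils down to something like the following (the topic branch name is just a placeholder):

    git checkout feature/some-stretch-based-branch
    git merge origin/feature/stretch
    git push origin feature/some-stretch-based-branch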

Please review: git diff c2f09ce2a87d8962964ce7c820be32088c814027..d70a227c2c87d6af95f770189b0eb14054dd8d2e -- vagrant

#17 Updated by anonym 2017-01-10 14:23:23

  • Status changed from In Progress to Resolved
  • Assignee deleted (anonym)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

Pass!