Bug #16177

Building ISO in RAM often fails: mksquashfs reaped by OOM killer

Added by intrigeri 2018-12-02 10:03:19. Updated 2019-03-20 14:27:50.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Build system
Target version:
Start date:
2018-12-02
Due date:
% Done:

100%

Feature Branch:
bugfix/16177-limit-mksquashfs-memory-usage
Type of work:
Code
Blueprint:

Starter:
0
Affected tool:
Deliverable for:

Description

My local builds with TAILS_BUILD_OPTIONS=ram, which use gzip SquashFS compression by default, often fail like this:

kernel: Out of memory: Kill process 12575 (mksquashfs) score 145 or sacrifice child
kernel: Killed process 12575 (mksquashfs) total-vm:3580684kB, anon-rss:1992044kB, file-rss:4kB, shmem-rss:2288kB
kernel: oom_reaper: reaped process 12575 (mksquashfs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

It seems that xz SquashFS compression is not affected because this does not happen on Jenkins.

When building in RAM we’re mounting a tmpfs with size=100%, which can trigger this kind of issue if that filesystem is almost full. Bumping the amount of RAM given to the VM (via VM_MEMORY_BASE) should fix it; this has already been done on feature/buster and our CI was adjusted accordingly. Now, what’s strange is that this problem only happens ~20-40% of the time. So perhaps a system event eats precious RAM at exactly the wrong time, and we could fix the problem by disabling it or ensuring it runs before the build starts.
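For context, here is a minimal sketch of what the RAM build setup amounts to (the exact mount options and paths used by our build scripts may differ):

# Sketch only: a tmpfs sized at 100% of RAM shares physical memory with
# mksquashfs, so filling it leaves almost nothing for the compressor.
# The mount point below is hypothetical.
sudo mount -t tmpfs -o size=100% tmpfs /tmp/tails-build.XXXXXX
df -h /tmp/tails-build.XXXXXX   # "Size" equals the VM's total RAM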


Subtasks


Related issues

Related to Tails - Feature #9788: Take benefit from squashfs-tools 4.3 for faster test builds Resolved 2015-07-20
Related to Tails - Bug #16457: Building feature/buster in RAM sometimes fails: mksquashfs reaped by OOM killer Resolved 2019-02-11
Blocks Tails - Feature #15507: Core work 2019Q1: Foundations Team Resolved 2018-04-08

History

#1 Updated by intrigeri 2018-12-02 10:03:40

#2 Updated by intrigeri 2018-12-02 10:06:44

segfault, kibi: do you build in RAM (check if the Vagrant VM has ~1GB of RAM or 13+GB)? If yes, do you see this problem too?
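FWIW, a quick way to check, as a sketch (assuming the usual rake wrapper around Vagrant):

rake vm:ssh   # log into the build VM
free -m       # total of ~1000 means the default on-disk build; ~13000+ means building in RAM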

#3 Updated by intrigeri 2018-12-02 11:11:05

Peak memory usage inside the build VM (at 0.5s granularity) during two successful RAM+gzip builds. In each pair of lines, the first is free -m output and the second is the “Avail” column of df -BM /tmp/tails*; a sketch of the sampling loop follows the data:

Mem:          13026        2076         262       10192       10687         435
2834M
Mem:          13026        1450         725       10355       10851         898
2680M
Mem:          13026        1631         475       10437       10920         635
2598M
Mem:          13026        1637         434       10472       10954         594
2564M
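The sampling was essentially this kind of loop (a sketch, not the exact command I ran):

while true; do
    free -m | grep '^Mem:'
    df -BM --output=avail /tmp/tails* | tail -n1
    sleep 0.5
done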

#4 Updated by segfault 2018-12-02 11:33:27

intrigeri wrote:
> segfault, kibi: do you build in RAM (check if the Vagrant VM has ~1GB of RAM or 13+GB)? If yes, do you see this problem too?

My Vagrant VM has ~1GB of RAM and I have never seen this problem.

#5 Updated by intrigeri 2018-12-02 11:49:40

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • Feature Branch set to bugfix/16177-ftbfs-mksquashfs-oom-killed
  • Starter set to No

intrigeri wrote:
> So perhaps a system event eats precious RAM at the exact wrong time and we could fix the problem by disabling it or ensuring it’s done before the build starts.

Before bumping the VM memory, I’d like to give this a try → done on the topic branch. Let’s merge this and if I still see the problem, let’s cherry-pick commit:09233368a326ea739c3a1b6927c6cb59c4b10c5e.

#6 Updated by intrigeri 2018-12-02 11:56:35

> My Vagrant VM has ~1GB of RAM and I have never seen this problem.

Thanks. I had never seen it either, until I switched back to building in RAM a couple of weeks ago, so it seems that my “Building ISO in RAM” classification is correct.

#7 Updated by intrigeri 2018-12-02 16:57:34

  • Assignee changed from intrigeri to segfault
  • QA Check set to Ready for QA

intrigeri wrote:
> intrigeri wrote:
> > So perhaps a system event eats precious RAM at the exact wrong time and we could fix the problem by disabling it or ensuring it’s done before the build starts.
>
> Before bumping the VM memory, I’d like to give this a try → done on the topic branch. Let’s merge this and if I still see the problem, let’s cherry-pick commit:09233368a326ea739c3a1b6927c6cb59c4b10c5e.

Please review and, if happy, merge into master → stable → devel, then reassign to me: I’m not convinced this will suffice to fix the problem I’m seeing, so I’ll keep an eye on it.

#8 Updated by intrigeri 2018-12-10 09:03:03

  • Target version changed from Tails_3.11 to Tails_3.12

Too late for 3.11.

#9 Updated by intrigeri 2019-01-04 15:03:47

#10 Updated by intrigeri 2019-01-04 15:03:55

  • blocked by deleted (Feature #15506: Core work 2018Q4: Foundations Team)

#11 Updated by intrigeri 2019-01-04 15:20:56

  • Assignee deleted (segfault)

#12 Updated by intrigeri 2019-01-05 10:37:04

  • Assignee set to hefee

#13 Updated by hefee 2019-01-05 12:05:39

Ran into this issue too, with 14336 MB of RAM attached to the VM and TAILS_BUILD_OPTIONS=“ram gzipcomp”:

path relative to the source directories.  IgnoringKilled
P: Begin unmounting filesystems...

real    19m23.041s
user    17m37.388s
sys     3m13.152s
+ kill -9 14954
+ cleanup
+ [ -n /tmp/tails-build.nAPsvka5 ]
+ cd /
+ remove_build_dirs
+ old_build_dirs
+ + mount
tac
+ perl -ni -E say $mountpoint if (($mountpoint) = ($_ =~ m{^(?:aufs|tmpfs|devpts-live|proc-live|sysfs-live) on (/tmp/tails-build(?:-tmpfs)?\.[/[:alnum:]]+)}))
-i used with no filenames on the command line, reading from STDIN.
+ tries=0
+ sudo lsof
+ grep --fixed-strings /tmp/tails-build.nAPsvka5
+ true
+ sudo umount -f --verbose /tmp/tails-build.nAPsvka5
umount: /tmp/tails-build.nAPsvka5 (tmpfs) unmounted
+ sudo rm -rf /tmp/tails-build.nAPsvka5
+ sudo rm -rf /tmp/tails-build.nAPsvka5
rake aborted!
VagrantCommandError: 'vagrant ["ssh", "-c", "MKSQUASHFS_OPTIONS='-comp gzip -Xcompression-level 1' ... command failed with exit status 137
/home/hefee/tails/foundation/tails/Rakefile:112:in `rescue in run_vagrant'
/home/hefee/tails/foundation/tails/Rakefile:109:in `run_vagrant'
/home/hefee/tails/foundation/tails/Rakefile:127:in `call'
/home/hefee/tails/foundation/tails/Rakefile:127:in `block (2 levels) in <top (required)>'
/home/hefee/tails/foundation/tails/Rakefile:413:in `block in <top (required)>'

Caused by:
CommandError: command ["vagrant", "ssh", "-c", "MKSQUASHFS_OPTIONS='-comp gzip -Xcompression-level 1' ... failed with exit status 137
/home/hefee/tails/foundation/tails/Rakefile:72:in `run_command'
/home/hefee/tails/foundation/tails/Rakefile:110:in `run_vagrant'
/home/hefee/tails/foundation/tails/Rakefile:127:in `call'
/home/hefee/tails/foundation/tails/Rakefile:127:in `block (2 levels) in <top (required)>'
/home/hefee/tails/foundation/tails/Rakefile:413:in `block in <top (required)>'
Tasks: TOP => build
(See full trace by running task with --trace)
rake build  7,60s user 2,14s system 0% cpu 29:59,58 total

Merging bugfix/16177-ftbfs-mksquashfs-oom-killed did not get rid of my issue.

#14 Updated by hefee 2019-01-05 12:21:30

  • Assignee changed from hefee to intrigeri
  • QA Check changed from Ready for QA to Pass

Branch looks fine to me. Review passed.

#15 Updated by intrigeri 2019-01-05 15:52:54

  • Assignee deleted (intrigeri)
  • Target version changed from Tails_3.12 to Tails_3.13
  • QA Check deleted (Pass)
  • Feature Branch deleted (bugfix/16177-ftbfs-mksquashfs-oom-killed)

Merged bugfix/16177-ftbfs-mksquashfs-oom-killed, which is an improvement, but that does not fix the problem in all cases.

#16 Updated by intrigeri 2019-01-19 08:25:41

See commit:50fbec1225bfb4439bd1d0f2895bcfe43da90fbb, which I assume is about this problem.

#17 Updated by intrigeri 2019-01-28 23:16:14

I wonder if this is a function of the number of cores or hyperthreads of the host system, which determines the number of vCPUs of the build VM: mksquashfs parallelizes work over all available CPUs, so it might use more memory when more cores are present. E.g. my system has 8 hyperthreads ⇒ 8 vCPUs in the build VM, and when building feature/buster I see mksquashfs use more than 3.1g (top’s RES column) of memory.
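If that hypothesis is right, limiting parallelism should reduce the memory footprint. A sketch of how one could test it, using options that squashfs-tools 4.3 provides (paths are hypothetical):

nproc                                        # vCPUs visible inside the build VM
mksquashfs chroot test.squashfs -comp gzip \
  -processors 1 -no-progress                 # compare peak RES with and without -processors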

#18 Updated by intrigeri 2019-01-29 09:52:09

I think it’s even worse: when building feature/buster with VM_MEMORY_BASE = 3.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024, and the ram defaultcomp build options, I’ve seen mksquashfs eat basically as much memory as it can (up to 4.2g, i.e. everything but what’s used by the tmpfs) until it gets killed. So it looks like mksquashfs might adapt its memory usage to the memory it sees as available, while counting space used by the tmpfs as available.

I think debian/patches/0002-fix_phys_mem_calculation.patch points to the relevant code: indeed, the amount of memory mksquashfs will use is a fraction of the physical memory.

Note that mksquashfs supports a -mem option. According to the README:

-mem <size>             Use <size> physical memory.  Currently set to 1922M
                        Optionally a suffix of K, M or G can be given to specify
                        Kbytes, Mbytes or Gbytes respectively

… but on my sid system with 4.3-8.0tails1, mksquashfs --help instead says “Currently set to 3968M” (i.e. ~25% of my total RAM).

So we should probably pass a fixed value to the -mem option that works in all configurations we care about. Otherwise, no matter how much we bump the build VM memory, mksquashfs will only use more and more of it.
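Concretely, that would mean something like this (a sketch; the actual integration point in our build scripts may differ):

# Append a fixed -mem to the options we already pass to mksquashfs:
MKSQUASHFS_OPTIONS='-comp gzip -Xcompression-level 1 -mem 512M'
export MKSQUASHFS_OPTIONS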

#19 Updated by intrigeri 2019-01-29 10:43:27

  • Assignee set to intrigeri

#20 Updated by intrigeri 2019-01-29 11:30:25

  • Feature Branch set to bugfix/16177-limit-mksquashfs-memory-usage, bugfix/16177-limit-mksquashfs-memory-usage-buster

#21 Updated by intrigeri 2019-01-29 11:59:59

  • Subject changed from Building ISO in RAM with gzip compression often fails: mksquashfs reaped by OOM killer to Building ISO in RAM often fails: mksquashfs reaped by OOM killer

#22 Updated by intrigeri 2019-01-29 14:39:40

  • related to Feature #9788: Take benefit from squashfs-tools 4.3 for faster test builds added

#23 Updated by intrigeri 2019-02-06 14:33:47

  • Priority changed from Normal to Elevated

#24 Updated by intrigeri 2019-02-11 09:55:46

  • Priority changed from Elevated to Normal

Test @anonym, are you notified?

#25 Updated by intrigeri 2019-02-11 15:47:31

Test results on Tails 3.x

Test results on bugfix/16177-limit-mksquashfs-memory-usage (based on devel, VM_MEMORY_BASE = 1024, BUILD_SPACE_REQUIREMENT = 12*1024, -mem 512M):

  • noram gzipcomp: works
  • noram defaultcomp: works
  • ram gzipcomp: works
  • ram defaultcomp: works

⇒ For Stretch we’re good.
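FTR each of these combinations was exercised with the usual entry point, along these lines (sketch):

TAILS_BUILD_OPTIONS="ram gzipcomp" rake build
TAILS_BUILD_OPTIONS="ram defaultcomp" rake build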

Next steps for Tails 3.x

I’ll propose this branch for 3.13 and I’m confident it will improve the developer experience :)

Test results on feature/buster

On feature/buster it’s a different story. Test results on bugfix/16177-limit-mksquashfs-memory-usage-buster (based on feature/buster) with smaller -mem values:

  • noram gzipcomp (still using gzip):
    • VM_MEMORY_BASE = 1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M (used at least 1.3g)
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 256M
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 64M (used at least 1.2g)
  • ram defaultcomp:
    • VM_MEMORY_BASE = 1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
    • VM_MEMORY_BASE = 2*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
  • noram defaultcomp: not tested

In all cases, very little memory is used by the rest of the system. mksquashfs gets reaped by the OOM killer because it attempts to use far more memory than we ask it to.

Conclusion: squashfs-tools 1:4.3-11, used to build feature/buster, does not always honor the value passed to -mem (at least the gzip and lzo compressors don’t; I’ve seen the xz compressor honor it with its default settings, but I need to do some more tests with our custom settings). OTOH squashfs-tools 1:4.3-3.0tails4, used to build Tails 3.x, consistently honors that argument; note that this package has a pile of non-upstream patches (collected by lynxis) applied that 4.3-11 does not include.
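FTR, a quick way to check whether a given squashfs-tools build honors -mem (a sketch with assumed paths):

mksquashfs chroot test.squashfs -comp gzip -mem 512M -no-progress &
MKSQ_PID=$!
while kill -0 "$MKSQ_PID" 2>/dev/null; do
    ps -o rss= -p "$MKSQ_PID"   # resident set size in KiB; should stay around 512M if -mem is honored
    sleep 1
done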

So here are results on an updated bugfix/16177-limit-mksquashfs-memory-usage-buster, based on feature/buster, that sets -mem 512M, VM_MEMORY_BASE = 2*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256, and whose “fastcomp” build option now uses xz with non-crazy compression settings:

  • noram fastcomp: works
  • ram fastcomp: works
  • noram defaultcomp: fails (OOM-killed)
  • ram defaultcomp: works, but we’re very lucky as the mksquashfs bug is still there:
    • mksquashfs uses up to 2.8G despite -mem 512M; very little free RAM was left near the end of the compression process; had mksquashfs needed the 25% it will use if it ignores our custom -mem option, it would have been reaped by the OOM killer
    • FTR 10G were used on the tmpfs so our BUILD_SPACE_REQUIREMENT value may be over-estimated

I’ve merged that branch into feature/buster anyway, so that feature/buster builds on Jenkins (ram defaultcomp) again, whereas previously mksquashfs would be reaped by the OOM killer. We’ll see how it goes.

Next steps for Buster

  1. Move all the Buster-specific bits to a dedicated ticket.
  2. Verify that squashfs-tools 4.3-3.0tails4 does honor -mem consistently (I think I did already but let’s be sure).
  3. Find out why squashfs-tools 4.3-11 does not honor -mem consistently (see the sketch after this list), e.g.:
    1. Try to reproduce with pristine 4.3-3 changes
    2. Try reverting the 4.3-3..4.3-11 changes
    3. Try applying the patchset collected by lynxis on top of 4.3-11
  4. Report the bug, ideally with a patch attached.
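A sketch of the bisection workflow for step 3, using standard Debian tooling (details assumed):

apt-get source squashfs-tools               # fetch the 4.3-11 source package
cd squashfs-tools-4.3
QUILT_PATCHES=debian/patches quilt series   # list the patches currently applied
QUILT_PATCHES=debian/patches quilt pop -a   # then selectively revert/apply the 4.3-3..4.3-11 changes or the lynxis patchset
dpkg-buildpackage -us -uc -b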

How to debug this sort of thing

FTR, to debug this kind of thing, build with the rescue build option, rake vm:ssh after the failure, chroot into /tmp/tails-build*/chroot and then run something like this:

mksquashfs chroot filesystem.squashfs \
  -comp xz -Xbcj x86 -b 1024K -Xdict-size 1024K \
  -no-exports -mem 512M -wildcards \
  -ef chroot/usr/share/tails/build/mksquashfs-excludes \
  -no-progress -info -sort squashfs.sort
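For completeness, the steps leading up to that command look roughly like this (assumed invocations; the build directory name is randomly generated and the exact layout may differ):

TAILS_BUILD_OPTIONS="ram defaultcomp rescue" rake build   # let the build fail; rescue keeps the environment around for debugging
rake vm:ssh                                               # log into the build VM
cd /tmp/tails-build.*                                     # enter the build directory containing the chroot/ tree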

#26 Updated by intrigeri 2019-02-11 15:56:13

  • related to Bug #16457: Building feature/buster in RAM sometimes fails: mksquashfs reaped by OOM killer added

#27 Updated by intrigeri 2019-02-11 15:57:06

> Move all the Buster-specific bits to a dedicated ticket.

Done: Bug #16457

#28 Updated by intrigeri 2019-02-11 15:57:21

  • Feature Branch changed from bugfix/16177-limit-mksquashfs-memory-usage, bugfix/16177-limit-mksquashfs-memory-usage-buster to bugfix/16177-limit-mksquashfs-memory-usage

#29 Updated by intrigeri 2019-02-12 08:38:34

  • Assignee changed from intrigeri to hefee
  • % Done changed from 10 to 50
  • QA Check set to Ready for QA

@hefee, do you want to review this? Given the trouble you’re having with our build system at the moment, it seems you’ll have to dive into related parts of our code base anyway, so this could be a good opportunity to learn and kill 2 drones with one stone :)

#30 Updated by anonym 2019-02-12 09:43:45

In order to not bloat Feature #16337 I’ll respond to your question in Feature #16337#note-6 here:

intrigeri wrote:
> @anonym, I’m working on Bug #16177 and I’m curious about what exact problem commit:aca28cec16287befd7dbadbfee7834c153936b04 solved for you. If that was about mksquashfs getting killed:

Yes, mksquashfs was OOM killed

> * gzip compression or xz?

gzip

> * Building in RAM or not?

RAM

> * Please try to revert this commit and retry with bugfix/16177-limit-mksquashfs-memory-usage merged into your branch.

Sure, I’ll report back here!

#31 Updated by anonym 2019-02-12 12:36:02

anonym wrote:
> > * Please try to revert this commit and retry with bugfix/16177-limit-mksquashfs-memory-usage merged into your branch.
>
> Sure, I’ll report back here!

The build was successful! Looks promising! :)

#32 Updated by intrigeri 2019-02-12 13:05:43

> The build was successful! Looks promising! :)

Great!

#33 Updated by hefee 2019-02-17 14:22:34

  • QA Check changed from Ready for QA to Pass

Works for me. I merged this into hefee/bugfix/16349-tor-0.3.5+force-all-tests and had no issues building that branch.

#34 Updated by hefee 2019-02-17 14:46:30

  • Assignee changed from hefee to intrigeri

#35 Updated by intrigeri 2019-02-18 15:18:11

  • Status changed from In Progress to Fix committed
  • % Done changed from 50 to 100

Applied in changeset commit:tails|58395bbb13ac71cf23b4a6ba66865e30e5f890fc.

#36 Updated by intrigeri 2019-02-18 15:18:48

  • Assignee deleted (intrigeri)

Thanks hefee for the review and anonym for testing!

#37 Updated by CyrilBrulebois 2019-03-20 14:27:50

  • Status changed from Fix committed to Resolved