Bug #17088

Test suite became unreliable on Jenkins: OOM kills QEMU, OpenJDK memory allocation failure aborts the test suite run

Added by intrigeri 2019-09-23 18:47:10. Updated 2019-10-04 06:53:51.

Status:
Resolved
Priority:
High
Assignee:
intrigeri
Category:
Continuous Integration
Target version:
Start date:
Due date:
% Done:

0%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000073d700000, 16777216, 0) failed; error='Cannot allocate memory' (errno=12)

For example:

Initial investigation:

  • No trace of this on the Jenkins master (jenkins.log, Journal), so the problem must be on the isotesters.
  • It’s hard to investigate this on the isotesters, as we reboot them the next time a test job is scheduled on them. We will need to make some logs persist across reboots.
  • Every time it happens, it seems that the last action of the test suite used Sikuli (Java).
  • The Java VM has its own internal memory management, which can be tweaked in various ways (see the sketch below). But the error message suggests that the java process asked the OS to allocate some memory and its request was denied.
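
For what it’s worth, the standard knobs here are the HotSpot heap options. A minimal, hypothetical sketch of capping the heap of the JVM that Sikuli runs in (the -Xms/-Xmx values are arbitrary examples, not what we actually use):

# Picked up by any HotSpot JVM started from this environment,
# e.g. the one spawned for Sikuli:
export _JAVA_OPTIONS="-Xms128m -Xmx512m"

Note that this only limits how much the JVM asks for; it would not help if the isotester as a whole runs out of memory.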

Ideas:

  • Check if I see this on my local Jenkins.
  • Compare RAM allocated to isotesters on my local Jenkins vs. lizard.
  • It could simply be that Cucumber 2.4.0 or Jenkins slave 2.x use more RAM → bump isotesters memory a bit and see?

Subtasks


Related issues

Related to Tails - Bug #10068: Upgrade to Jenkins 2.x, using upstream packages Resolved 2018-01-08
Related to Tails - Feature #11175: Decrease I/O load created by isotesters on lizard Resolved 2016-02-27
Related to Tails - Feature #8681: Make it easy to run the test suite with a ramdisk mounted on its temporary directory Rejected 2015-01-12
Blocks Tails - Feature #13284: Core work: Sysadmin (Adapt our infrastructure) Confirmed 2017-06-30
Blocked by Tails - Bug #17103: stable branch does not boot: aufs not available Resolved

History

#1 Updated by intrigeri 2019-09-23 18:47:24

  • related to Bug #10068: Upgrade to Jenkins 2.x, using upstream packages added

#2 Updated by intrigeri 2019-09-23 18:47:34

  • blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added

#3 Updated by intrigeri 2019-09-24 09:59:03

  • related to Feature #11175: Decrease I/O load created by isotesters on lizard added

#4 Updated by intrigeri 2019-09-24 09:59:09

  • related to Feature #8681: Make it easy to run the test suite with a ramdisk mounted on its temporary directory added

#5 Updated by intrigeri 2019-09-24 10:00:00

  • Status changed from Confirmed to In Progress

> * Check if I see this on my local Jenkins.

I have not seen it there so far.

> * Compare RAM allocated to isotesters on my local Jenkins vs. lizard.

  • my local Jenkins, where /tmp/TailsToaster is not a tmpfs:
    • isotester1.sib: 11458M RAM
    • worker1.ant01: 15301M RAM
    • worker2.ant01: 15015M RAM
  • isotesterN.lizard: 26377M RAM, but 22850M of it is allocated to the /tmp/TailsToaster tmpfs

So yeah, the testers on my local Jenkins have plenty of RAM, most of which is meant to be used as disk cache, which the kernel can evict whenever OpenJDK requires more memory. This easily explains why I’ve not seen this problem on my local Jenkins.

> * It could simply be that Cucumber 2.4.0 or Jenkins slave 2.x use more RAM → bump isotesters memory a bit and see?

I’ve bumped the memory allocation of isotesterN.lizard from the previous 27518592 KiB to 26.5×1024×1024 = 27787264 KiB (diff = 27787264-27518592 = 268672 KiB =~ 262 MiB).
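
(The isotesters are VMs on lizard and I assume this change went through our usual configuration management; purely as an illustration, the equivalent one-off libvirt commands would look like this, with the size in KiB:)

# Hypothetical manual equivalent; --config makes it take effect on next boot.
virsh setmaxmem isotester1 27787264 --config
virsh setmem isotester1 27787264 --config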

I’ll keep an eye on new occurrences of this problem like this:

$ ssh jenkins.lizard
$ grep --color=never -l OpenJDK /var/lib/jenkins/jobs/test_Tails_ISO_*/builds/[1-9]*/log \
| xargs ls -ltr

Right now, the newest entry in there is from Sep 23 20:56.

If that’s still not enough RAM, we can probably give isotesterN.lizard a bit more again. But at some point we might want to consider other options:

  • Give isotesters a few GiB of swap space. This was my initial plan 4.5 years ago (Feature #8681), before I decided that ext4 + RAM for disk cache was better, and later changed my mind again and switched to tmpfs after doing some benchmarks (Feature #11175). I forgot the swap idea along the way and since then, we bumped memory + the size of the tmpfs a few times, without looking back at the big picture.
  • Reconsider dropping “/tmp/TailsToaster in tmpfs” in favour of “/tmp/TailsToaster on ext4 + the bonus RAM is used as disk cache, which minimizes actual writes to disk”. Last time we tried was 3 years ago (Feature #11175, https://tails.boum.org/blueprint/hardware_for_automated_tests_take2/) and tmpfs was a clear winner, but since then our storage stack and software have evolved, so we might see different results nowadays.

#6 Updated by intrigeri 2019-09-25 06:23:15

> I’ll keep an eye on new occurrences of this problem like this:
> […]
> Right now, the newest entry in there is from Sep 23 20:56.

Unfortunately, the problem has happened 3 more times since then. Each time, again, the last action recorded by the test suite in debug.log was Sikuli: calling wait("$picture").

Thankfully, even though we don’t persist the Journal to disk, we do persist kern.log on the isotesters, so I could investigate one of those failures a bit:

  • In the 1st case, the oom-killer had already come into action ~20 minutes before the OpenJDK failure (it killed QEMU back then). Rate limiting prevented the kernel log about the final failure from being saved to disk.
  • In the 2nd case, the oom-killer was triggered 2s after the OpenJDK error was logged.
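
(For reference, a command along these lines is enough to spot such events in the persisted log; just a sketch, not necessarily the exact invocation used:)

grep -E 'Out of memory|oom-killer|Killed process' /var/log/kern.log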

Also, the oom-killer was triggered many more times on every isotester I inspected. In most cases it killed the QEMU process, so:

  • The problem this bug is about causes many more false positive test suite failures than just the obvious ones that interrupt the test suite when OpenJDK can’t allocate memory.
  • I believe the OpenJDK failures interrupt the test suite because OpenJDK (Sikuli) is run by Cucumber via rjb, and cucumber is the main test suite process, so when it aborts, the full test suite run aborts. In contrast, when a QEMU process is killed, only one test suite scenario fails and the test suite moves on to the next one.
  • This confirms that memory is occasionally starved on the isotesters. The process table dumped by the oom-killer does not show anything consuming significantly more memory than expected.

In passing, I’m starting to wonder if we should add an Icinga2 check for oom-killer activity: a few times already, the OOM killer has caused trouble elsewhere, but our monitoring only alerted us about the consequences, not about the root cause; IMO having a clear alert for the root cause would occasionally save us some debugging time. That’s a common thing to do, and I see a few Icinga/Nagios plugins that can do it; a rough sketch of the idea is below. @zen, any thoughts on this?
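
To make the idea concrete, here is what such a check could look like; this is not one of the existing plugins, and the one-hour window is an arbitrary example:

#!/bin/sh
# Hypothetical Nagios/Icinga2-style check: go CRITICAL if the kernel
# reported oom-killer activity recently (exit codes: 0=OK, 2=CRITICAL).
if journalctl -k --since=-1h 2>/dev/null | grep -Eq 'Out of memory|oom-killer'; then
    echo "CRITICAL: oom-killer activity in the last hour"
    exit 2
fi
echo "OK: no recent oom-killer activity"
exit 0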

So:

> If that’s still not enough RAM, we can probably give isotesterN.lizard a bit more again. But at some point we might want to consider other options:

> * Give isotesters a few GiB of swap space. This was my initial plan 4.5 years ago (Feature #8681), before I decided that ext4 + RAM for disk cache was better, and later changed my mind again and switched to tmpfs after doing some benchmarks (Feature #11175). I forgot the swap idea along the way and since then, we bumped memory + the size of the tmpfs a few times, without looking back at the big picture.

Done: I gave 4 GB of swap space to every isotesterN.lizard as a mitigation. To be clear, this is not about relying on swap to back all the data in the /tmp/TailsToaster tmpfs; it’s about giving the system more headroom to cope with the occasional memory starvation.
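
Without implying this is exactly how it was provisioned (on lizard the isotesters more likely got a dedicated volume), adding that much swap boils down to something like:

# Hypothetical swap-file variant, for illustration only:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab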

I’ll come back to it later today or tomorrow and will check:

  • whether the oom-killer got triggered again on isotesters
  • whether the swap space is used

> * Reconsider dropping “/tmp/TailsToaster in tmpfs” in favour of “/tmp/TailsToaster on ext4 + the bonus RAM is used as disk cache, which minimizes actual writes to disk”. Last time we tried was 3 years ago (Feature #11175, https://tails.boum.org/blueprint/hardware_for_automated_tests_take2/) and tmpfs was a clear winner, but since then our storage stack and software have evolved, so we might see different results nowadays.

That’s not feasible right now, because:

  • Evaluating this option requires much more time than I have this week.
  • To evaluate regressions brought by this option, one needs baseline data (i.e. “what’s the previous normal state of things”). Since we just upgraded Jenkins and are currently dealing with other regressions, we have no such reliable baseline data yet.
  • It risks making our CI less reliable for a few days; now is not a good time to do this.

#7 Updated by intrigeri 2019-09-25 06:26:34

  • Subject changed from Test suite sometimes aborts on Jenkins: Cannot allocate memory to Test suite became unreliable on Jenkins: OOM kills QEMU, OpenJDK memory allocation failure aborts the test suite run

#8 Updated by intrigeri 2019-09-25 09:37:38

> I’ll come back to it later today or tomorrow and will check:

> * whether the oom-killer got triggered again on isotesters

for i in $(seq 1 5) ; do ssh isotester$i grep 'Killed' /var/log/kern.log | tail -n 1 ; done | sort --numeric

> * whether the swap space is used

for i in $(seq 1 5) ; do ssh isotester$i free -m | grep '^Swap' ; done 

#9 Updated by intrigeri 2019-09-27 05:35:24

> I’ll come back to it later today or tomorrow and will check:

> * whether the oom-killer got triggered again on isotesters

No OOM killer between Sep 25 06:11 UTC and Sep 27 05:30 UTC (while previously it happened dozens of times a day) ⇒ I think we’re good here.

> * whether the swap space is used

Here are some data points:

  • Sep 25, 12:42 UTC: isotester1 and isotester5 use a bit more than 2000 MiB of swap.
  • Sep 26, 05:43 UTC: isotester3 (1769 MiB), isotester4 (2321 MiB), and isotester5 (877 MiB)
  • Sep 26, 05:59 UTC: isotester2 (766 MiB), isotester3 (2251 MiB), isotester4 (2341 MiB), and isotester5 (2351 MiB)
  • Sep 27, 05:28 UTC: isotester1 (410 MiB), isotester2 (2546 MiB), isotester3 (2712 MiB), isotester4 (2382 MiB), and isotester5 (1383 MiB)

⇒ yes, swap is used.

Next steps:

  • Check if the test suite is more fragile than it used to be.
  • Check how much the I/O bandwidth usage is impacted on lizard as a whole and whether it impacts performance of other services we host there (one of the reasons why we switched to tmpfs in the first place was to avoid this).
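
For the second item, the Munin graphs for lizard are the natural place to look; for a quick ad-hoc check on the host itself, something like this would also do (a sketch; the sampling interval is arbitrary):

# Extended per-device I/O statistics, in MB/s, sampled every 60 seconds:
iostat -dxm 60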

#10 Updated by intrigeri 2019-09-29 12:53:37

  • blocked by Bug #17103: stable branch does not boot: aufs not available added

#11 Updated by intrigeri 2019-09-29 13:04:05

> * Check if the test suite is more fragile than it used to be.

Not obviously so on the devel branch. I’d like to check the stable branch too once Bug #17103 is resolved.

> * Check how much the I/O bandwidth usage is impacted on lizard as a whole and whether it impacts performance of other services we host there (one of the reasons why we switched to tmpfs in the first place was to avoid this).

I did not notice any significant difference on Munin.

#12 Updated by intrigeri 2019-09-30 17:52:20

intrigeri wrote:
> > * Check if the test suite is more fragile than it used to be.
>
> Not obviously so on the devel branch. I’d like to check the stable branch too once Bug #17103 is resolved.

I’ve looked at a bunch of test suite runs on devel-based branches, and unfortunately, yes, the test suite seems to be more fragile than it was a couple of weeks ago. So I just gave each isotester 1 GiB more RAM. Let’s see how it goes. Note that this ate almost all the hugepages we had left.
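
For the record, a quick way to see how many hugepages remain available on the host:

grep ^HugePages_ /proc/meminfo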

#13 Updated by intrigeri 2019-10-04 06:53:51

  • Status changed from In Progress to Resolved

Looks like we’re back to the usual level of fragility on lizard.
FWIW, our test suite is still much more robust on my local Jenkins, so I’m betting on Bug #11680 to improve things.