Feature #11175

Decrease I/O load created by isotesters on lizard

Added by intrigeri 2016-02-27 14:45:58. Updated 2016-03-25 21:59:10.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 2016-02-27
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Starter:
Affected tool:
Deliverable for: 270

Description

Even with 23G of RAM per isotester, and contrary to what I believed earlier (Feature #8681), our isotesters are the biggest consumers of write I/O, and their /tmp/TailsToaster ext4 volumes seem to be at fault. I’m trying to mount a tmpfs there; we’ll see how it goes.
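For reference, here is a rough sketch of what backing /tmp/TailsToaster with a tmpfs can look like; the size and mount options are assumptions, not necessarily what ends up in our Puppet code:

    # Hypothetical /etc/fstab entry: back the test suite's temporary
    # directory with RAM instead of the dedicated ext4 LV (the size is a guess).
    tmpfs  /tmp/TailsToaster  tmpfs  size=15g,noatime  0  0

    # One-off equivalent on a running isotester, without editing fstab:
    mount -t tmpfs -o size=15g,noatime tmpfs /tmp/TailsToaster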


Subtasks


Related issues

Related to Tails - Feature #8681: Make it easy to run the test suite with a ramdisk mounted on its temporary directory Rejected 2015-01-12
Related to Tails - Bug #17088: Test suite became unreliable on Jenkins: OOM kills QEMU, OpenJDK memory allocation failure aborts the test suite run Resolved

History

#1 Updated by intrigeri 2016-02-27 14:46:20

  • related to Feature #8681: Make it easy to run the test suite with a ramdisk mounted on its temporary directory added

#2 Updated by intrigeri 2016-02-27 14:59:01

Note to myself: get rid of the LVs if we stick to tmpfs in the end.
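Roughly, that cleanup would look like this (the exact LV names are an assumption, following the isotesterN-tmp naming used elsewhere in this ticket):

    # On the isotester: make sure nothing uses the old ext4 volume anymore.
    umount /tmp/TailsToaster
    # On lizard: drop the now-unused LV; repeat for isotester2..8.
    lvremove /dev/lizard/isotester1-tmp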

#3 Updated by intrigeri 2016-02-27 16:04:46

Once the /tmp/TailsToaster case is closed, it will be interesting to look at I/O on:

  • the other isotesterN-* LVs, in particular the one that hosts the 2 ISOs used for testing

Both ISOs are copied to /var/lib/jenkins, which lives on the isotesterN-data LV. If we can afford giving 2.5GB of RAM to each isotester, then we can write these ISOs to /tmp/TailsToaster and avoid the corresponding I/O load. I’m not convinced it’s worth 8*2.5GB = 20GB of RAM, even though these volumes are at the top of our write I/O load. Spreading these volumes across all our (SSD-backed) RAID arrays would be nice, though: the 1TB one is kinda overloaded compared to the 500GB one, and half of our isotesterN-data LVs are on rotating drives.

  • the system that serves the ISOs used for testing (giving it enough RAM to keep the last released ISO in memory might help)

For all branches, the ISO image being tested is copied from jenkins.lizard (using the copy artifacts Jenkins plugin). This takes about 1.6 minutes.

For non-release branches, an additional ISO image is retrieved over HTTP by the isotesters from www.lizard, which itself gets it via NFS from jenkins.lizard. This takes 45-60 seconds. Go figure why it’s faster, but anyway, that’s not the point here.

In both cases, the actual data ultimately comes from /dev/lizard/jenkins-data. Indeed, that’s our biggest consumer of read I/O (https://munin.riseup.net/tails.boum.org/lizard.tails.boum.org/diskstats_throughput/index.html) once we exclude the isotesterN-tmp LVs (which are being replaced by tmpfs) and bitcoin (which should be dropped IMO, but that’s off-topic here).

So in most cases, the data that needs to be read from disk (jenkins-data) is either the latest released ISO (which we should probably always keep in disk cache on jenkins.lizard, by giving the VM a bit more RAM), or an ISO that we just copied to jenkins.lizard and retrieve soon after (which could also stay in memory if we gave jenkins.lizard more RAM). I therefore bumped jenkins.lizard from ~1.5G to ~2.7G of RAM; let’s see how it goes.
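A sketch of how such a RAM bump can be done with libvirt; the domain name is hypothetical, the values (in KiB, roughly 2.7G) are approximate, and on lizard this would presumably rather go through our configuration management:

    # Raise the memory ceiling in the persistent domain configuration...
    virsh setmaxmem jenkins 2831155 --config
    # ...then the allocated memory, in the config and (if the current
    # maximum already allows it) live via the memory balloon.
    virsh setmem jenkins 2831155 --config
    virsh setmem jenkins 2831155 --live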

#4 Updated by intrigeri 2016-02-28 11:05:52

  • % Done changed from 0 to 50
  • Blueprint set to https://tails.boum.org/blueprint/hardware_for_automated_tests_take2/

intrigeri wrote:
> Note to myself: get rid of the LVs if we stick to tmpfs in the end.

Done. I’ll post my benchmarking results to the blueprint soonish. These results + Munin data convince me that it’s a good thing to back /tmp/TailsToaster with a tmpfs on isotesters.
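One quick way to compare the old ext4 LV with a tmpfs (not necessarily the benchmark used here, which was presumably full test suite runs) is a small fio job run against each backing store in turn:

    # Hypothetical 4k random-write comparison: run once with the ext4 LV
    # mounted on /tmp/TailsToaster and once with the tmpfs, compare IOPS.
    fio --name=toaster-bench --directory=/tmp/TailsToaster \
        --rw=randwrite --bs=4k --size=512m --numjobs=4 \
        --time_based --runtime=60 --group_reporting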

#5 Updated by intrigeri 2016-02-29 01:27:22

  • Deliverable for set to 270

#7 Updated by intrigeri 2016-02-29 13:07:15

  • Assignee deleted (intrigeri)
  • QA Check set to Ready for QA

Moved isotester[1-4]-data from rotating drives to a SSD-backed PV, and left isotester[5-8]-data on the other SSD-backed PV. This should make isotester[1-4]-data faster, and will lower the load on the rotating drives, which are now basically dedicated to jenkins-data; that’s good, since jenkins-data was the other thing we wanted to optimize here.
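Such a move can be done online with LVM; a sketch, with placeholder PV device names (only the extents belonging to the named LV are moved, and this works while the LV is in use):

    # Move isotester1-data from the rotating-disk PV to a SSD-backed PV;
    # /dev/md_rotating and /dev/md_ssd are placeholders for the real arrays.
    pvmove -n isotester1-data /dev/md_rotating /dev/md_ssd
    # repeat for isotester2-data .. isotester4-data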

I’ll check Munin data in a month or so, to confirm that I’m done here.

#8 Updated by intrigeri 2016-02-29 13:07:49

  • % Done changed from 50 to 80

#9 Updated by intrigeri 2016-03-03 20:25:09

  • Assignee set to bertagaz

#10 Updated by intrigeri 2016-03-07 11:54:32

  • Assignee changed from bertagaz to intrigeri

Actually I’ll check Munin data myself first.

#11 Updated by intrigeri 2016-03-25 21:59:10

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 80 to 100
  • QA Check changed from Ready for QA to Pass

Since I made these changes ~1 month ago:

  • disk throughput is substantially smaller on md1, md2 and md3
  • isotesterN-* LVs are not among the top consumers of iops anymore
  • isotesterN VMs are no longer among the top consumers of disk read/write throughput (libvirt-blkstats Munin plugin)

So I call this a success.
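For whoever wants to double-check later: besides the Munin graphs, the per-LV numbers can be eyeballed directly on lizard. A sketch (device-mapper names and numbers vary):

    # Map LV names to dm-* devices, then watch per-device IOPS/throughput.
    dmsetup ls        # e.g. lizard-isotester1--data  (253:12)
    iostat -dmx 5     # then look at the matching dm-12 line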

#12 Updated by intrigeri 2019-09-24 09:59:03

  • related to Bug #17088: Test suite became unreliable on Jenkins: OOM kills QEMU, OpenJDK memory allocation failure aborts the test suite run added