Feature #11175
Decrease I/O load created by isotesters on lizard
% Done: 100%
Description
Even with 23G of RAM per isotester, and contrary to what I believed earlier (Feature #8681), our isotesters are the biggest consumers of I/O writes, and their /tmp/TailsToaster ext4 volume seems to be at fault. I'm trying to mount a tmpfs on there; we'll see how it goes.
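For reference, a minimal sketch of what backing /tmp/TailsToaster with a tmpfs could look like; the size and mode values below are illustrative assumptions, not taken from our actual configuration:

    # One-off test on an isotester (hypothetical size):
    mount -t tmpfs -o size=15G,mode=0775 tmpfs /tmp/TailsToaster

    # Or persistently, via /etc/fstab (same assumptions):
    # tmpfs  /tmp/TailsToaster  tmpfs  size=15G,mode=0775  0  0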
Subtasks
Related issues
- Related to Tails - Feature #8681: Make it easy to run the test suite with a ramdisk mounted on its temporary directory (Rejected, 2015-01-12)
- Related to Tails - Bug #17088: Test suite became unreliable on Jenkins: OOM kills QEMU, OpenJDK memory allocation failure aborts the test suite run (Resolved)
History
#1 Updated by intrigeri 2016-02-27 14:46:20
- Related to Feature #8681: Make it easy to run the test suite with a ramdisk mounted on its temporary directory added
#2 Updated by intrigeri 2016-02-27 14:59:01
Note to myself: get rid of the LVs if we stick to tmpfs in the end.
#3 Updated by intrigeri 2016-02-27 16:04:46
Once the /tmp/TailsToaster case is closed, it will be interesting to look at I/O on:
- the other isotesterN-* LVs, in particular the one that hosts the 2 ISOs used for testing. Both ISOs are copied to /var/lib/jenkins, which lives on the isotesterN-data LV. If we can afford giving 2.5GB of RAM to each isotester, then we can write these ISOs to /tmp/TailsToaster and avoid the corresponding I/O load. I'm not convinced it's worth 8*2.5GB = 20GB of RAM, even though these volumes are at the top of our I/O write load. Spreading these volumes over all our (SSD-backed) RAID arrays would be nice, though: the 1TB one is kinda overloaded compared to the 500GB one, and half of our isotesterN-data are on rotating drives.
- the system that serves the ISOs used for testing (it might be that giving it enough RAM to keep the last released ISO in memory could help).
For all branches, the ISO image being tested is copied from jenkins.lizard (using the copy artifacts Jenkins plugin). This takes about 1.6 minutes.
For non-release branches, an additional ISO image is retrieved over HTTP by isotesters from www.lizard, which itself gets it via NFS from jenkins.lizard. This takes 45-60 seconds. Go figure why that's faster, but anyway, this is not the point here.
In both cases, the actual data ultimately comes from /dev/lizard/jenkins-data. Indeed, that's our biggest consumer of I/O reads (https://munin.riseup.net/tails.boum.org/lizard.tails.boum.org/diskstats_throughput/index.html) once we exclude isotesterN-tmp (which are being replaced by tmpfs) and bitcoin (which should be dropped IMO, but this is off-topic here).
So in most cases, the data that needs to be read from disk (jenkins-data) is either the latest released ISO (which we should probably always keep in disk cache on jenkins.lizard, by giving the VM a bit more RAM), or an ISO that we just copied to jenkins.lizard and retrieve soon after (which could also be in memory if we gave jenkins.lizard more RAM) => I bumped jenkins.lizard from ~1.5G to ~2.7G of RAM, let's see how it goes.
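For the record, on a libvirt host like lizard such a RAM bump is typically done roughly as follows; the guest name and the exact figure below are assumptions for illustration, not copied from our setup:

    # hypothetical guest name; ~2.7G expressed in KiB
    virsh setmaxmem jenkins.lizard 2883584 --config
    virsh setmem jenkins.lizard 2883584 --config
    # --config changes take effect at the guest's next (re)boot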
#4 Updated by intrigeri 2016-02-28 11:05:52
- % Done changed from 0 to 50
- Blueprint set to https://tails.boum.org/blueprint/hardware_for_automated_tests_take2/
intrigeri wrote:
> Note to myself: get rid of the LVs if we stick to tmpfs in the end.
Done. I'll post my benchmarking results to the blueprint soonish. These results + Munin data convince me that it's a good thing to back /tmp/TailsToaster with a tmpfs on isotesters.
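For completeness, removing such a leftover LV on the host would look roughly like this; the LV name below merely follows the isotesterN-tmp / "lizard" VG naming used elsewhere in this ticket, so it's an assumption, and the LV would of course need to be detached from the guest first:

    # hypothetical LV name, after detaching it from the isotester VM
    lvremove /dev/lizard/isotester1-tmp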
#5 Updated by intrigeri 2016-02-29 01:27:22
- Deliverable for set to 270
#7 Updated by intrigeri 2016-02-29 13:07:15
- Assignee deleted (intrigeri)
- QA Check set to Ready for QA
Moved isotester[1-4]-data from rotating drives to an SSD-backed PV, and left isotester[5-8]-data on the other SSD-backed PV. This should make isotester[1-4]-data faster, and will lower the load on the rotating drives, which are now basically dedicated to jenkins-data; that's good, since jenkins-data was the other thing we wanted to optimize here.
I'll check Munin data in a month or so, to confirm that I'm done here.
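As a sketch, moving a single LV between PVs can be done with pvmove; the LV and device names below are placeholders, not our actual PV layout:

    # hypothetical source and destination PVs
    pvmove -n isotester1-data /dev/md2 /dev/md3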
#8 Updated by intrigeri 2016-02-29 13:07:49
- % Done changed from 50 to 80
#9 Updated by intrigeri 2016-03-03 20:25:09
- Assignee set to bertagaz
#10 Updated by intrigeri 2016-03-07 11:54:32
- Assignee changed from bertagaz to intrigeri
Actually I’ll check Munin data myself first.
#11 Updated by intrigeri 2016-03-25 21:59:10
- Status changed from In Progress to Resolved
- Assignee deleted (intrigeri)
- % Done changed from 80 to 100
- QA Check changed from Ready for QA to Pass
Since I made these changes ~1 month ago:
- disk throughput is substantially smaller on md1, md2 and md3
- isotesterN-* LVs are not among the top consumers of iops anymore
- isotesterN VMs are no longer at the top of the disk read/write throughput consumers (libvirt-blkstats Munin plugin)
So I call this a success.
#12 Updated by intrigeri 2019-09-24 09:59:03
- Related to Bug #17088: Test suite became unreliable on Jenkins: OOM kills QEMU, OpenJDK memory allocation failure aborts the test suite run added