Bug #10999
Parallelize our ISO building workload on more builders
100%
Description
The problem described in Feature #8072 is back: quite often, ISO builds triggered by Jenkins are queuing up, and the latency between when a developer pushes to a branch and when the resulting ISO is ready to be downloaded and automatically tested is increasing. This situation can be explained by changes that made the build substantially slower: the move to Jessie, a new language added to the website, and the Installation Assistant. We need to cope with it, somehow.
First of all, let’s note that we had initially planned to give 4 vcpus to each isobuilder, while we currently give them 8 vcpus each. IIRC we did that because we had no better use for our vcpus back then; nowadays we have other good uses for them.
In my book, these 4 bonus vcpus should only improve the part of the build that parallelizes well, i.e. the SquashFS compression, which takes around 11 minutes these days. So:
- on our current hardware, it would be wasteful to try to improve our ISO building latency by making each individual isobuilder faster; parallelizing this workload over more VMs should work much better;
- in theory, if we give only 4 vcpus to each isobuilder, and as a result mksquashfs is twice as slow, it would only make the build last about 12.5% longer (see the back-of-the-envelope sketch below), which feels acceptable if it allows us to double the number of ISO builders we run, and in turn to solve the congestion problem we have.
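To make that estimate concrete, here is a back-of-the-envelope sketch in Python, assuming only the SquashFS step slows down when vcpus are halved. The 11-minute SquashFS figure comes from above; the 45-minute total build time and the exact slowdown factor are illustrative assumptions, not measurements from this ticket:

```python
# Model: only mksquashfs slows down; the rest of the build is unaffected.
def relative_cost(total_min, squashfs_min, slowdown):
    """Fractional increase in total build time if only the SquashFS
    step becomes `slowdown` times slower."""
    extra = squashfs_min * (slowdown - 1)
    return extra / total_min

# 45-minute baseline (assumed), 11-minute SquashFS step (from above),
# perfectly linear 2x slowdown (assumed).
print(f"{relative_cost(45.0, 11.0, 2.0):.1%}")  # -> 24.4%
```

With these assumed numbers, a fully doubled mksquashfs would cost about 24%; the 12.5% figure above implies either a longer total build or, more plausibly, a less-than-2x effective slowdown, since mksquashfs does not scale linearly with vcpu count all the way through.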
At first glance, I think we should run 3 or 4 ISO builders, with 4 vcpus each. Let’s see how doable it is:
- vcpus: as explained above, we can simply reclaim some of the bonus vcpus allocated a year ago to our current isobuilders; that is, if we’re not ready to try overallocating vcpus (most of the time, not all isobuilders are in use at the same time, so overallocation would make sense);
- RAM: Feature #11010 gave us enough RAM for 1 or 2 more builders;
- disk space: for 2 additional ISO builders, we need 2*10 GiB; we have some slack in our storage plan for this year, and we can still reclaim some space here and there, so we should be good on this side.
Subtasks
Related issues
Related to Tails - Feature #8072: Set up a second Jenkins slave to build ISO images | Resolved | 2014-10-11
Related to Tails - Feature #9264: Consider buying more server hardware to run our automated test suite | Resolved | 2015-12-15
Related to Tails - Feature #10996: Try running more isotester:s on lizard | Resolved | 2016-01-25
Blocked by Tails - Feature #11010: Give lizard v2 more RAM | Resolved | 2016-01-27
History
#1 Updated by intrigeri 2016-01-26 16:26:23
- Related to Feature #8072: Set up a second Jenkins slave to build ISO images added
#2 Updated by intrigeri 2016-01-26 16:26:33
- Related to Feature #9264: Consider buying more server hardware to run our automated test suite added
#3 Updated by intrigeri 2016-01-26 17:13:01
- Related to Feature #10996: Try running more isotester:s on lizard added
#4 Updated by intrigeri 2016-01-27 17:20:03
- Blueprint set to https://tails.boum.org/blueprint/hardware_for_automated_tests_take2/
The blueprint takes this problem into account.
#5 Updated by intrigeri 2016-01-27 17:28:56
- Parent task set to Feature #11009
#6 Updated by intrigeri 2016-01-27 17:37:18
Meta: I’m setting 2.2 as the target version, as this will be very easy once we have more RAM (Feature #11010); but if setting up the additional ISO builders takes 1-2 more months, no big deal.
#7 Updated by intrigeri 2016-02-05 11:57:27
- Blocked by Feature #11010: Give lizard v2 more RAM added
#8 Updated by intrigeri 2016-02-29 11:54:15
- Description updated
#9 Updated by intrigeri 2016-02-29 18:09:30
intrigeri wrote:
> vcpus: as explained above, we can simply reclaim some of the bonus vcpus allocated a year ago to our current isobuilders;
Done, isobuilder{1,2} are now down to 4 vcpus.
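As an illustration of what such a change involves, here is a minimal sketch using the libvirt Python bindings, assuming the isobuilders are libvirt-managed guests; this is not necessarily how it was actually done (the Tails infrastructure is managed with Puppet, so the real change presumably went through its VM definitions):

```python
# Sketch: lower the vcpu count of the isobuilder guests to 4 in their
# persistent libvirt configuration. Reducing vcpus on a running guest
# needs guest cooperation, so this only touches the config; the change
# takes effect the next time each domain is (re)booted.
import libvirt

conn = libvirt.open("qemu:///system")
for name in ("isobuilder1", "isobuilder2"):
    dom = conn.lookupByName(name)
    dom.setVcpusFlags(4, libvirt.VIR_DOMAIN_AFFECT_CONFIG)
conn.close()
```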
#10 Updated by intrigeri 2016-02-29 19:39:33
- Status changed from Confirmed to In Progress
- % Done changed from 0 to 30
… and set up isobuilder{3,4}.
#11 Updated by intrigeri 2016-02-29 23:30:14
- Target version changed from Tails_2.2 to Tails_2.3
https://jenkins.tails.boum.org/plugin/cluster-stats/ says the average wait time in queue is:
- isobuilder2 2 hr 3 min
- isobuilder1 1 hr 9 min
It’s unclear if this is for the last 7 days only, or since we have been gathering stats with this plugin (1 month 5 days). I could get that more precisely from the CSV provided by this plugin, but whatever: I think I’ll just come back to it in a month and see if adding isobuilders changes something measurable in terms of wait time.
Regarding build duration: that web page gives me very low average values for our isobuilders (24 and 28 minutes), because it takes into account the builds that fail very early, as well as some other, non-ISO-build jobs. So I don’t think I can draw very useful conclusions from these values. The raw CSV data the same plugin gives me doesn’t tell me whether a job run succeeded, so I can’t use it to filter out failed and aborted builds. And it’s not doable to pick a minimal duration under which I’d assume a build failed, because in practice builds apparently fail at any point. So, for build duration stats, I’ll instead use the data I can get in XML from the Jenkins Global Build Stats plugin. I adapted https://git-tails.immerda.ch/puppet-tails/tree/files/jenkins/master/successful-ISO-builds to output the average duration of successful ISO build runs:
- 2015-11: 38.3 minutes
- 2015-12: 39.1 minutes
- 2016-01: 44.8 minutes
- 2016-02: 46.8 minutes
… and so I’ll have something to compare with in a month or so. Of course, running more ISO builds & tests in parallel is likely to raise the average duration; the question is how much, and where the good latency/throughput sweet spot is for our workload. Note that we’ve taken some action to reduce the website build time a bit (it had grown a lot recently), which will influence our numbers a little, but rest assured that in the meantime we’ll find other ways to increase build time.
Too bad we don’t have a single source of raw data that gives us both the info we need for analyzing queue congestion, and the info we need to evaluate per-build performance, but whatever.
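For reference, here is a minimal sketch of the kind of filtering described above: averaging the duration of successful ISO builds from a Global Build Stats XML export. The element and field names (jobBuildResult, jobName, result, duration) and the job name prefix are assumptions for illustration, not the plugin’s confirmed schema; adapt them to what the actual export contains:

```python
# Average the duration of successful ISO builds from a Global Build
# Stats XML export. Element/field names and the duration unit
# (assumed: milliseconds) are guesses about the export format.
import xml.etree.ElementTree as ET

def average_success_minutes(xml_path, job_prefix="build_Tails_ISO"):
    tree = ET.parse(xml_path)
    durations = []
    for build in tree.getroot().iter("jobBuildResult"):
        name = build.findtext("jobName", "")
        result = build.findtext("result", "")
        duration_ms = build.findtext("duration", "0")
        if name.startswith(job_prefix) and result == "SUCCESS":
            durations.append(int(duration_ms) / 60000.0)  # ms -> minutes
    return sum(durations) / len(durations) if durations else None

print(average_success_minutes("global-build-stats.xml"))
```

The point is simply that, unlike the cluster-stats CSV, an export that records each build’s result makes it possible to filter out failed and aborted runs before averaging.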
#12 Updated by intrigeri 2016-03-25 22:08:23
- Target version changed from Tails_2.3 to Tails_2.4
- % Done changed from 30 to 70
intrigeri wrote:
> https://jenkins.tails.boum.org/plugin/cluster-stats/ says the average wait time in queue is:
>
> * isobuilder1 1 hr 9 min
> * isobuilder2 2 hr 3 min
>
> It’s unclear if this is for the last 7 days only, or since we have been gathering stats with this plugin (1 month 5 days). I could get that more precisely from the CSV provided by this plugin, but whatever: I think I’ll just come back to it in a month and see if adding isobuilders changes something measurable in terms of wait time.
Success! I now see:
- isobuilder1 51 min
- isobuilder2 1 hr 33 min
- isobuilder3 5 min 50 sec
- isobuilder4 1 hr 30 min
> Regarding build duration: that web page gives me very low average values for our isobuilders (24 and 28 minutes), because it takes into account the builds that fail very early, as well as some other, non-ISO-build jobs. So I don’t think I can draw very useful conclusions from these values. The raw CSV data the same plugin gives me doesn’t tell me whether a job run succeeded, so I can’t use it to filter out failed and aborted builds. And it’s not doable to pick a minimal duration under which I’d assume a build failed, because in practice builds apparently fail at any point. So, for build duration stats, I’ll instead use the data I can get in XML from the Jenkins Global Build Stats plugin. I adapted https://git-tails.immerda.ch/puppet-tails/tree/files/jenkins/master/successful-ISO-builds to output the average duration of successful ISO build runs:
>
> * 2015-11: 38.3 minutes
> * 2015-12: 39.1 minutes
> * 2016-01: 44.8 minutes
> * 2016-02: 46.8 minutes
>
> … and so I’ll have something to compare with in a month or so. Of course, running more ISO builds & tests in parallel is likely to raise the average duration; the question is how much, and where the good latency/throughput sweet spot is for our workload.
Average build duration grew to 56 minutes in March (expected: caused by lowering the number of vcpus per builder, plus allowing Jenkins to run twice as many builds in parallel, which loads the system more).
Note that we already built 10% more ISOs than in February, so there was probably a bit more congestion as well.
I think that a 10-minute hit on the build time is acceptable, given the improvements we got on the waiting time in queue. So I’m tentatively calling this done, but I will come back to it in a month to double-check.
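As a side note, the latency/throughput sweet spot mentioned earlier can be explored with a crude queueing model. Here is a sketch using an M/M/c queue (the Erlang C formula), where builds arrive at some rate and any idle builder can pick up any job; the 2-builds-per-hour arrival rate is a made-up assumption, the service times are the rough per-build durations from this ticket, and real Jenkins scheduling is messier, so treat this as an intuition pump only:

```python
# Crude M/M/c ("Erlang C") model of build queueing: jobs arrive at a
# given rate and any free builder can take any job. The arrival rate
# below is an assumption, not a measured value from this ticket.

def mean_queue_wait(arrivals_per_hour, service_minutes, builders):
    """Mean time a build spends waiting in queue, in minutes."""
    lam = arrivals_per_hour / 60.0      # arrivals per minute
    mu = 1.0 / service_minutes          # completions per minute per builder
    a = lam / mu                        # offered load (in Erlangs)
    rho = a / builders                  # per-builder utilization
    if rho >= 1.0:
        return float("inf")             # overloaded: queue grows forever
    b = 1.0                             # Erlang B, computed recursively
    for i in range(1, builders + 1):
        b = a * b / (i + a * b)
    p_wait = b / (1.0 - rho + rho * b)  # Erlang C: probability of queueing
    return p_wait / (builders * mu - lam)

# Before: 2 builders at ~47 min/build. After: 4 builders at ~56 min/build.
# Assumed arrival rate: 2 ISO builds per hour.
print(mean_queue_wait(2, 47, 2))  # ~75 minutes
print(mean_queue_wait(2, 56, 4))  # ~4 minutes
```

With these assumptions, doubling the number of builders cuts the mean queueing delay from over an hour to a few minutes even though each build got about 10 minutes slower, which matches the direction of the measurements above.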
#13 Updated by intrigeri 2016-04-29 12:59:59
- Status changed from In Progress to Resolved
- Assignee deleted (intrigeri)
- % Done changed from 70 to 100
We’re down to 53.5 minutes in April, which is probably explained by the fact that we built somewhat fewer ISO images. Calling it done.