Feature #17387

Consider disabling CPU vulnerabilities mitigation features in our CI builder/tester VMs

Added by intrigeri 2019-12-30 08:18:32. Updated 2020-04-07 08:34:51.

Status:
Confirmed
Priority:
Normal
Assignee:
Category:
Continuous Integration
Target version:
Start date:
Due date:
% Done:

0%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

Recent kernels (including the 4.9 LTS we run on our CI builders/testers) offer a mitigations=off command-line option, which disables all mitigations for known CPU vulnerabilities. On my local Jenkins instance, I’ve measured that setting this option on both the l0 virtualization host and the l1 builder/tester VMs gives me a 15% test suite performance boost; this cuts the feedback loop down by about ½ hour, which makes a significant difference to my developer experience. On lizard, I don’t think we can reasonably disable these mitigations on the l0 virtualization host, so this proposal is only about the l1 builder/tester VMs. We won’t know what kind of performance improvement we would get in this context without trying.
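For context, here’s roughly what setting this option looks like on a Debian system (a hypothetical sketch; on the real VMs this would presumably be handled by our provisioning code rather than by hand):

```shell
# /etc/default/grub is sourced as shell, so appending mitigations=off
# to the kernel command line looks like this:
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX mitigations=off"
echo "extra kernel cmdline: $GRUB_CMDLINE_LINux:-}" >/dev/null 2>&1 || true
echo "extra kernel cmdline: $GRUB_CMDLINE_LINUX"

# Afterwards: update-grub && reboot. The effect can then be checked with:
#   grep . /sys/devices/system/cpu/vulnerabilities/*
```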

These l1 VMs are mostly used to run l2 VMs:

  • isobuilder:
    • The l2 VM is a Vagrant box that’s been built by the isobuilder itself; that l2 VM is trusted (otherwise we couldn’t trust the Tails images we publish)
    • Given the kind of input data the l2 VM handles (mostly Debian packages) and how it handles it (mostly APT), it seems very unlikely that vulnerabilities such as Spectre and Meltdown could be exploited in there to escape to l1 or to gain information about other processes running in the l1 VM.
    • The l1 VM runs essentially nothing else than this l2 VM, so a cross-process info leak would have no harmful consequences.
  • isotester:
    • Tails is running in the l2 VM.
    • The l2 VM mostly handles trusted input data (e.g. it loads our website in a web browser). Here as well, the way it handles untrusted input data (e.g. gpg --recv-key) does not leave much room for exploiting vulns such as Spectre or Meltdown.
    • The l1 VM runs essentially nothing else than this l2 VM, so a cross-process info leak would have no harmful consequences.

If we’re concerned that mitigations=off increases the risk of an l2 escape to l1, and thus the risk of lateral movement to more sensitive VMs hosted on the same l0 virtualization host, then we could sandbox the QEMU process that runs l2 in l1 more strictly, by enabling AppArmor for libvirtd in l1.
Given the above, I think this risk increase is too small to bother with, but I could change my mind :)
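If we went that route, the libvirt side of it essentially boils down to making sure the AppArmor security driver is active in the l1 guest, so that each QEMU process (i.e. each l2 VM) gets its own per-domain confinement profile. A sketch, assuming the Debian default config path:

```
# /etc/libvirt/qemu.conf in the l1 VM:
security_driver = "apparmor"

# Then: systemctl restart libvirtd
# and verify that running domains are confined with: aa-status | grep libvirt
```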


Subtasks


Related issues

Related to Tails - Bug #17386: Consider disabling CPU vulnerabilities mitigation features in our Vagrant build box Resolved
Blocks Tails - Feature #13284: Core work: Sysadmin (Adapt our infrastructure) Confirmed 2017-06-30
Blocked by Tails - Bug #16960: Make our CI feedback loop shorter In Progress

History

#1 Updated by intrigeri 2019-12-30 08:19:15

#2 Updated by intrigeri 2019-12-30 08:19:34

  • blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added

#3 Updated by intrigeri 2019-12-30 08:25:16

Added to the agenda for upcoming sysadmin team meeting(s).

#4 Updated by intrigeri 2020-02-04 12:15:12

  • Target version changed from Tails_4.3 to Tails_4.4

We had no time to discuss this during our last meeting, and the next one won’t happen before 4.3.

#5 Updated by intrigeri 2020-02-20 17:37:04

  • Status changed from Confirmed to In Progress
  • Assignee changed from Sysadmins to intrigeri
  • Type of work changed from Discuss to Sysadmin

groente says benchmarking makes sense ⇒ if there are no significant performance benefits, we can skip the complex and painstaking security analysis.

I’ve set mitigations=off on all lizard isobuilders & isotesters. I’ll come back to it in a few weeks to check if this improves performance measurably.

#6 Updated by intrigeri 2020-02-20 18:01:44

  • related to Bug #17386: Consider disabling CPU vulnerabilities mitigation features in our Vagrant build box added

#7 Updated by intrigeri 2020-03-01 07:15:34

There’s been a bunch of other recent changes that taint the first batch of build results. Here’s the timeline:

  • Feb 20: set mitigations=off on all lizard isobuilders & isotesters
  • Feb 21: Bug #17439 is merged
  • Feb 29: Bug #17386 is merged

So I think I should now:

  1. analyze impact of mitigations=off on test suite performance (that data is not tainted by the above)
  2. wait another few weeks, to gather enough build perf data with mitigations=off
  3. remove mitigations=off
  4. wait a few weeks, to gather baseline build perf data without mitigations=off
  5. compare build perf results from steps 2 and 4

Hopefully no other change that significantly affects build performance will be merged in the meantime.

#8 Updated by intrigeri 2020-03-01 07:35:17

intrigeri wrote:
> So I think I should now:
>
> # analyze impact of mitigations=off on test suite performance (that data is not tainted by the above)

I’ll use https://jenkins.tails.boum.org/job/test_Tails_ISO_stable/buildTimeTrend as my data source.
I’m only taking successful test suite runs into account, because failures skew the total run time.
Average test suite run time:

  • Feb 5-19 without mitigations=off, i.e. baseline: 307 minutes
  • Feb 21-29, with mitigations=off: 292 minutes

⇒ At the “one single test suite job run” zoom level, setting mitigations=off on our isotesters yields a 15-minute (4.9%) decrease in test suite run time, which is significant in itself.
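A quick sanity check of that percentage, from the two averages above:

```shell
baseline=307  # minutes, Feb 5-19, mitigations enabled
faster=292    # minutes, Feb 21-29, mitigations=off
saved=$((baseline - faster))
pct=$(awk -v a="$baseline" -v b="$faster" 'BEGIN { printf "%.1f", (a - b) / a * 100 }')
echo "saved ${saved} min (${pct}%)"
```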

And at the “our CI system” zoom level, when the system is heavily loaded and the queue constantly has job runs waiting, such improvements quickly accumulate with a snowball effect. For example:

  • The next job, which was waiting in the queue, starts 15 min earlier and lasts 15 min less, resulting in a 30 min total decrease of the CI feedback loop.
  • The 3rd job starts 30 min earlier and lasts 15 min less, resulting in a 45 min total decrease of the CI feedback loop.
  • If extra performance improvements are observed for image builds, they will add up with the test suite improvements with respect to the total CI feedback loop duration, which is what matters.
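The snowball effect above can be sketched numerically (a toy model, assuming a constantly full queue and a flat 15-minute saving per run):

```shell
# Each queued job starts earlier by the accumulated savings of the
# jobs before it, and itself runs 15 minutes less.
per_job_saving=15
cumulative=0
for job in 1 2 3; do
  cumulative=$((cumulative + per_job_saving))
  echo "job ${job}: feedback loop shortens by ${cumulative} min"
done
```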

I’ll now wait for the build benchmark results.

#9 Updated by intrigeri 2020-03-01 07:35:39

  • Target version changed from Tails_4.4 to Tails_4.5

#10 Updated by intrigeri 2020-03-15 15:38:07

intrigeri wrote:
> # analyze impact of mitigations=off on test suite performance (that data is not tainted by the above)
> # wait another few weeks, to gather enough build perf data with mitigations=off
> # remove mitigations=off

Since the beginning of the month, on the stable branch, ignoring the minority of builds that had to rebuild the website instead of using the cached version: mean build time is 42.875 minutes.

I’ve now removed mitigations=off on isobuilder*.lizard and will come back to it in a few weeks to compare results.

#11 Updated by intrigeri 2020-04-04 08:14:13

intrigeri wrote:
> Since the beginning of the month, on the stable branch, ignoring the minority of builds that had to rebuild the website instead of using the cached version: mean build time is 42.875 minutes.
>
> I’ve now removed mitigations=off on isobuilder*.lizard and will come back to it in a few weeks to compare results.

On the stable branch, since I removed mitigations=off, ignoring the minority of builds that had to rebuild the website instead of using the cached version: mean build time is 45.667 minutes.

So, to sum up, mitigations=off reduces the runtime of our jobs as follows:

  • image builds: 2.8 minutes (6.1%)
  • image tests: 15 minutes (4.9%)
  • total for one build+test feedback loop: 17.8 minutes.
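The build figures can be double-checked from the two means reported in notes 10 and 11 (42.875 min with mitigations=off, 45.667 min without):

```shell
with_off=42.875   # mean build minutes, mitigations=off (note 10)
baseline=45.667   # mean build minutes, mitigations re-enabled (note 11)
saving=$(awk -v a="$baseline" -v b="$with_off" 'BEGIN { printf "%.1f", a - b }')
pct=$(awk -v a="$baseline" -v b="$with_off" 'BEGIN { printf "%.1f", (a - b) / a * 100 }')
echo "build saving: ${saving} min (${pct}%)"
```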

I’ve described in Feature #17387#note-8 how these savings are cumulative in the situations that matter most, due to queue management and bottlenecks.

> groente says benchmarking makes sense ⇒ if no significant perf benefits, we can skip the complex & painstaking security analysis.

So, it turns out that mitigations=off on {isobuilder,isotester}*.lizard does yield significant performance improvements.

I think we should now discuss the next steps, and how they fit in our various priorities.

#12 Updated by intrigeri 2020-04-07 08:33:38

  • Status changed from In Progress to Rejected

As per yesterday’s team meeting: we’ll focus on getting new bare metal hardware (Bug #16960) and won’t bother with optimizing CI on lizard.

#13 Updated by intrigeri 2020-04-07 08:34:52

  • Status changed from Rejected to Confirmed
  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_4.5)
  • Parent task deleted (Bug #16960)

Actually, the optimization this issue is about could make sense on that new hardware: since it’ll be dedicated to CI, perhaps the security analysis would be simpler.

#14 Updated by intrigeri 2020-04-07 08:35:11

  • blocked by Bug #16960: Make our CI feedback loop shorter added