Feature #12110

Evaluate grsec's performance hit

Added by intrigeri 2017-01-03 10:40:33. Updated 2017-06-23 11:17:59.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Target version:
Start date:
2017-01-03
Due date:
% Done:
0%

Feature Branch:
Type of work:
Test
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

Something we’ll need to evaluate is the performance hit (the test suite runs I see take 35% longer than those for feature/8415-overlayfs-stretch, which is quite a bit bigger than I expected; but I’ve not done any measurements on bare metal yet). If the hit is too big, then we may need to look closer and perhaps disable some grsec features whose cost/benefit ratio is not attractive enough.


Subtasks


History

#1 Updated by cypherpunks 2017-01-06 02:20:36

The performance hit varies a lot when you’re running in a virtual machine, especially with UDEREF active, which alone can cost 10-20%. UDEREF is also one of the most important features, though.

#2 Updated by cypherpunks 2017-01-17 10:38:17

According to grsecurity’s documentation, the main performance-killing (20% impact) features are:

  • UDEREF on x64 (–10%)
  • PAX_MEMORY_STACKLEAK
  • PAX_MEMORY_SANITIZE
  • GRKERNSEC_RANDSTRUCT

#3 Updated by cypherpunks 2017-01-20 07:46:19

cypherpunks wrote:
> According to grsecurity’s documentation, the main performance-killing (20% impact) features are:
>
> * UDEREF on x64 (–10%)
> * PAX_MEMORY_STACKLEAK
> * PAX_MEMORY_SANITIZE
> * GRKERNSEC_RANDSTRUCT

I don’t think UDEREF should have too much of a performance impact on modern hardware. If the processor is modern enough to support PCID (Westmere, Sandy Bridge, etc.), grsecurity can use it to increase both the security and the performance of this feature on x86_64 platforms. If the perf hit is still unacceptable, there is a boot parameter, pax_weakuderef, which uses PCID in a way that weakens the security provided by UDEREF but greatly increases its performance. I imagine the ~10% hit is for hardware which lacks PCID altogether. It is likely lower on hardware which supports it, and can be made lower still with pax_weakuderef if necessary.
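
As a quick sanity check (plain shell, nothing Tails- or grsec-specific assumed), PCID support on a given machine can be confirmed with something like:

grep -qw pcid /proc/cpuinfo && echo "PCID available" || echo "no PCID"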

STACKLEAK and SANITIZE are too important to consider disabling, although the latter has a few settings that can reduce the performance impact at the expense of less sanitization coverage (the boot parameter pax_sanitize_slab=off can be used to disable SLAB sanitizing entirely, leaving only free-page sanitizing). The former impacts kernel compilation by 1% on a single-core system, which is negligible.

One we can do without is RANDSTRUCT, since it requires that the kernel image be unavailable to the attacker. As Tails is obviously not making its users compile their own kernels, all an attacker would have to do is check the Tails version and download the kernel image to know what the struct layouts are. That’s not to say it’s impossible to still benefit from RANDSTRUCT. Tails could keep, say, 256 kernels and use a random one on each boot (perhaps with one base kernel and the 256 variants created via compressed bindiffs of just the structs, applied in memory by the bootloader before boot). Copperhead does something vaguely similar: it has 256 precompiled kernels, each with a different RANDSTRUCT seed, and sends a random one to each phone during a kernel upgrade. But since that’s not strictly necessary, and would only be considered in the long run, RANDSTRUCT is a feature that is completely unnecessary for precompiled kernels.

#4 Updated by cypherpunks 2017-03-04 03:11:21

Also remember that these are worst-case scenarios, which assume the system spends the majority of its time in CPL0 (kernel mode). I can’t imagine any use case where Tails would heavily tax the kernel.

Does Tails have benchmarking in its test suite yet?

#5 Updated by intrigeri 2017-03-04 08:29:58

> Does Tails have benchmarking in its test suite yet?

Jenkins records how much time each test scenario takes to run, and the total test suite run time, which gives a good idea of the real-world performance impact.

#6 Updated by cypherpunks 2017-03-06 23:19:54

intrigeri wrote:
> > Does Tails have benchmarking in its test suite yet?
>
> Jenkins records how much time each test scenario takes to run, and the total test suite run time, which gives a good idea of the real-world performance impact.

Then I imagine that would be a better way to benchmark grsecurity’s performance than something like unixbench?

#7 Updated by intrigeri 2017-03-07 21:07:41

> Then I imagine that would be a better way to benchmark grsecurity’s performance than something like unixbench?

Absolutely. These results are actually what prompted me to create this ticket in the first place.

#8 Updated by cypherpunks 2017-03-08 03:39:08

intrigeri wrote:
> > Then I imagine that would be a better way to benchmark grsecurity’s performance than something like unixbench?
>
> Absolutely. These results are actually what prompted me to create this ticket in the first place.

What is the output of this on the test suite VM?

grep -m2 -e "model name" -e flags /proc/cpuinfo

Additionally, if it says pcid in there, try adding pax_weakuderef as a kernel parameter and run the test suite again.
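
After rebooting with the parameter added, whether it actually made it onto the kernel command line can be confirmed with something like (plain shell, nothing grsec-specific assumed):

grep -ow pax_weakuderef /proc/cmdline || echo "parameter not set"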

#9 Updated by intrigeri 2017-03-08 09:55:47

> What is the output of this on the test suite VM?
>
> grep -m2 -e "model name" -e flags /proc/cpuinfo

It depends on where it runs: we have <cpu mode='host-model'/> in the
libvirt domain config.

On our Jenkins CI infra I see:

model name      : Intel Xeon E312xx (Sandy Bridge)
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms xsaveopt arat

> Additionally, if it says pcid in there,

It does.

> try adding pax_weakuderef as a kernel parameter and run the test suite again.

I’ve pushed a feature/7649-grsec-weakuderef branch that does exactly this, so I’ll be able to report back comparisons.

#10 Updated by intrigeri 2017-03-08 14:45:19

pax_weakuderef doesn’t change our test suite run time significantly (I see a 2.5% slow down which is well within the variance we usually observe, so not statistically significant).

Anything else you would like me to benchmark?

#11 Updated by cypherpunks 2017-03-09 07:44:49

intrigeri wrote:
> pax_weakuderef doesn’t change our test suite run time significantly (I see a 2.5% slow down which is well within the variance we usually observe, so not statistically significant).
>
> Anything else you would like me to benchmark?

That’s quite a significant speedup compared to the claimed 35% in feature/8415-overlayfs-stretch. I’m wondering if perhaps that test was run on a system that lacked PCID support. Can you try the grsec/pax test again, but without pax_weakuderef, making sure the VM runs on the same physical host as in the previous test? My guess is that this time it won’t show that insane 35% slowdown.

If I’m correct, UDEREF performance looks a bit like this:

  • x86_32 without PCID: very high performance, very high security
  • x86_64 without PCID: low performance, low security
  • x86_64 with PCID: high performance, high security
  • x86_64 with PCID and pax_weakuderef: very high performance, medium security

#12 Updated by intrigeri 2017-03-09 08:03:25

> That’s quite a significant speedup compared to the claimed 35% in feature/8415-overlayfs-stretch.

Oops, sorry I was unclear: this “2.5% slowdown” is feature/7649-grsec-weakuderef compared to feature/7649-grsec.
I did not measure any speedup.

The 35% slowdown was feature/7649-grsec compared to feature/8415-overlayfs-stretch.

So with or without pax_weakuderef, grsec slows things down a lot.

> I’m wondering if perhaps that test was run on a system that lacked PCID support.

I’ve already answered this question yesterday. What we do is:

  • level 0 bare metal host: has pcid
  • level 1 ISO tester VM: has pcid
  • level 2 Tails VM (system under test): has pcid

Now, it might be that the benefits of pcid are somewhat lost somewhere along the nested virtualization stack.

> Can you try the grsec/pax test again, but without pax_weakuderef, making sure that the VM it’s running on has the same physical host as the previous test?

This is being run daily, on the same physical host, and it gives me the results I’m reporting above.

Now, these results are a bit skewed, as more tests fail with the grsec kernel, so:

  • some of these failures prevent subsequent steps from being run, which artificially decreases the test suite run time
  • some of these failures are only identified after a timeout is reached, which artificially increases the test suite run time

#13 Updated by cypherpunks 2017-03-09 08:49:16

intrigeri wrote:
> > That’s quite a significant speedup compared to the claimed 35% in feature/8415-overlayfs-stretch.
>
> Oops, sorry I was unclear: this “2.5% slowdown” is feature/7649-grsec-weakuderef compared to feature/7649-grsec.
> I did not measure any speedup.
>
> The 35% slowdown was feature/7649-grsec compared to feature/8415-overlayfs-stretch.
>
> So with or without pax_weakuderef, grsec slows things down a lot.

OK, I totally misunderstood. Do you know what specifically slows down the most? Do you have any profiling information to tell where the system is spending most of its cycles, context switches, blocking, etc?

So instead, add pax_nouderef to the command line, and see if that takes out all or most of the performance issues.

> > I’m wondering if perhaps that test was run on a system that lacked PCID support.
>
> I’ve already answered this question yesterday. What we do is:
>
> * level 0 bare metal host: has pcid
> * level 1 ISO tester VM: has pcid
> * level 2 Tails VM (system under test): has pcid
>
> Now, it might be that the benefits of pcid are somewhat lost somewhere along the nested virtualization stack.

I don’t see vmx in the CPU flags you posted, which means you aren’t passing VT-x on with nested virtualization, so it is indeed possible that the nested virtualization stack is the culprit. Do you see vmx in the flags for level 1? If any binary translation is involved (QEMU emulation, or VMware/VirtualBox without full nested VT-x support), UDEREF in particular will trigger a huge slowdown.
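
One way to check is to run something like this inside the level 1 (ISO tester) guest; any output means hardware virtualization (VT-x or AMD-V) is exposed to it:

egrep -wo 'vmx|svm' /proc/cpuinfo | sort -u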

#14 Updated by intrigeri 2017-03-09 08:55:46

> Do you know what specifically slows down the most? Do you have any profiling information to tell where the system is spending most of its cycles, context switches, blocking, etc?

No.

> So instead, add pax_nouderef to the command line, and see if that takes out all or most of the performance issues.

Sure. Done on feature/7649-grsec-nouderef, will have results in a few hours.

> I don’t see vmx in the CPU flags you posted, which means you aren’t passing VT-x on with nested virtualization,

No: the flags you saw are from the level 2 guest (which is what matters wrt. pcid being usable or not in the system that runs a grsec kernel). That one doesn’t need VT-x. But the level 1 guest does have vmx, which is what matters. Right?

#15 Updated by cypherpunks 2017-03-09 09:17:35

intrigeri wrote:
> > So instead, add pax_nouderef to the command line, and see if that takes out all or most of the performance issues.
>
> Sure. Done on feature/7649-grsec-nouderef, will have results in a few hours.

I look forward to seeing the results.

> No: the flags you saw are from the level 2 guest (which is what matters wrt. pcid being usable or not in the system that runs a grsec kernel). That one doesn’t need VT-x. But the level 1 guest does have vmx, which is what matters. Right?

And you’re able to confirm that the level 1 hypervisor is making use of VT-x, rather than using binary translation?

What hypervisor is being used, anyway?

#16 Updated by intrigeri 2017-03-09 10:10:53

> And you’re able to confirm that the level 1 hypervisor is making use of VT-x, rather than using binary translation?

Technically, I haven’t verified this (I could if you tell me how). But I expect our test suite would run waaaaaaay slower if the level 1 host wasn’t using VT-x to run the level 2 VM. We didn’t notice any huge difference between this setup and dropping one level of virtualization.

> What hypervisor is being used, anyway?

Linux KVM.

#17 Updated by cypherpunks 2017-03-09 10:21:43

intrigeri wrote:
> > And you’re able to confirm that the level 1 hypervisor is making use of VT-x, rather than using binary translation?
>
> Technically, I haven’t verified this (I could if you tell me how). But I expect our test suite would run waaaaaaay slower if the level 1 host wasn’t using VT-x to run the level 2 VM. We didn’t notice any huge difference between this setup and dropping one level of virtualization.
>
> > What hypervisor is being used, anyway?
>
> Linux KVM.

Well, if it’s KVM, then it’s using VT-x: KVM is how things like QEMU or kvmtool are able to make use of hardware-accelerated virtualization. I guess we’ve ruled out lack of VT-x as a culprit. I’ll have to talk to spender or pipacs to see if the hypervisor could still be causing problems. In the meantime, I’ll just wait for the test results to come back.
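
For completeness, one way to double-check is to inspect the level 2 domain definition from inside the level 1 guest (the domain name below is an assumption; adjust to whatever the test suite actually defines). A type='kvm' result means hardware-assisted virtualization is in use; type='qemu' would mean TCG binary translation:

virsh dumpxml TailsToaster | grep -o "domain type='[a-z]*'"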

By the way, the 35% perf hit is just the total time it takes for all the tests to complete, right? You said the results would be a bit skewed, so how do you know that the tests that fail and time out (adding to the run time) aren’t responsible for more than half of that? For example, if UDEREF is in reality only contributing a 10% perf hit, and the failed tests hanging and timing out are adding another 25%, then without actual profiling data or more detailed test results the benchmarks are useless for this purpose.

#18 Updated by intrigeri 2017-03-09 10:26:56

> By the way, the 35% perf hit is just the total time it takes for all the tests to complete, right?

Yes.

> You said the results would be a bit skewed, so how do you know that the tests that fail and time out (adding to the run time) aren’t responsible for more than half of that?

I don’t. This would require investigating further, which I want to do but it’s pretty low priority for me right now.

#19 Updated by cypherpunks 2017-03-09 10:37:32

intrigeri wrote:
> > By the way, the 35% perf hit is just the total time it takes for all the tests to complete, right?
>
> Yes.
>
> > You said the results would be a bit skewed, so how do you know that the tests that fail and time out (adding to the run time) aren’t responsible for more than half of that?
>
> I don’t. This would require investigating further, which I want to do but it’s pretty low priority for me right now.

So just check the total CPU time used by the test VM processes. Or even just prepend time to the command. If the user/system time and the real (wall-clock) time are both about 35% higher than for a normal run, then the extra time really is being spent burning cycles. If, on the other hand, the user/system time is not much higher and only the real time is 35% higher, then you know the run is just spending that extra 35% sleeping or otherwise blocking.
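
A rough sketch of both approaches (the test suite wrapper name and the QEMU binary name are assumptions, adjust as needed):

/usr/bin/time -v ./run_test_suite
ps -o pid,etime,cputime,args -C qemu-system-x86_64

GNU time’s -v output lists user time, system time and elapsed (wall-clock) time separately; the ps line reports the cumulative CPU time of the libvirt-spawned QEMU processes without wrapping them.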

#20 Updated by intrigeri 2017-03-10 13:06:54

> So just check the total CPU time used by the test VM processes.

Thanks for the suggestion. I don’t know how to wrap the QEMU processes started by libvirt this way, so I would instead use other ways of analyzing the results (or fixing the tests!) than this, e.g. comparing runtime for successful tests only. Anyway, we’re back to “I want to do but it’s pretty low priority for me right now” (⇒ I’ll probably do it soon while procrastinating). Note that wrapping the full test suite run with time would not work: in many cases, the way our test suite waits (until a timeout happens) eats CPU cycles on its own.

#21 Updated by cypherpunks 2017-03-10 23:40:07

intrigeri wrote:
> > So just check the total CPU time used by the test VM processes.
>
> Thanks for the suggestion. I don’t know how to wrap the QEMU processes started by libvirt this way, so I would instead use other ways of analyzing the results (or fixing the tests!) than this, e.g. comparing runtime for successful tests only. Anyway, we’re back to “I want to do but it’s pretty low priority for me right now” (⇒ I’ll probably do it soon while procrastinating). Note that wrapping the full test suite run with time would not work: in many cases, the way our test suite waits (until a timeout happens) eats CPU cycles on its own.

Then honestly I think using the test suite as a benchmark is not going to work and we probably shouldn’t keep fighting it. We should instead just use unixbench and the industry standard (kernel compilation) and see what that gives us. If that gives us anything near 35%, then we’ll know something is up, but if the perf hit is negligible, then chances are, all the issues with the test suite are false positives. Remember that unixbench does many types of benchmarks, including graphical ones, so it ensures that the relevant processes are constantly maxed out.

Do you have the ability to schedule a test with unixbench or kernel compilation on grsec Tails, with UDEREF active?
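
For reference, a kernel-compilation run could look roughly like this (source tarball, config and -j value are illustrative only):

tar xf linux-*.tar.xz && cd linux-*/
make defconfig
/usr/bin/time -v make -j"$(nproc)" vmlinux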

#22 Updated by intrigeri 2017-03-11 07:22:40

> Do you have the ability to schedule a test with unixbench or kernel compilation on grsec Tails, with UDEREF active?

This would take me much more time than my currently preferred option to evaluate this, and would yield less value than testing more real-looking use cases, so I won’t do it myself. But anyone else is welcome to download a nightly built ISO from the various branches I’ve mentioned, and do it themselves :)

#23 Updated by cypherpunks 2017-03-11 08:52:13

intrigeri wrote:
> > Do you have the ability to schedule a test with unixbench or kernel compilation on grsec Tails, with UDEREF active?
>
> This would take me much more time than my currently preferred option to evaluate this, and would yield less value than testing more real-looking use cases, so I won’t do it myself. But anyone else is welcome to download a nightly built ISO from the various branches I’ve mentioned, and do it themselves :)

What are the results of feature/7649-grsec-nouderef? If it’s anywhere near +35%, then the idea of using the test suite as a benchmark is fatally flawed in the first place, since UDEREF will always have the biggest impact on performance. If it’s still quite high, then the vast majority of the slowdown is going to come from tests hanging and timing out, which invalidates the results. The best thing to do in that case would likely be either to design a test suite specifically for benchmarking, or to give grsec kernels to a sample of people (“Call for testing: Hardened Tails”?) and see whether they notice any performance issues.

#24 Updated by intrigeri 2017-03-11 10:23:55

> What are the results of feature/7649-grsec-nouderef?

Average of the last 4 runs of each of these branches:

  • feature/8415-overlayfs-stretch: 171 minutes
  • feature/7649-grsec-nouderef: 229 minutes
  • feature/7649-grsec-weakuderef: 255 minutes
  • feature/7649-grsec: 245 minutes
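
Taken purely as arithmetic on those averages (and keeping in mind the caveat below about the baseline), that is roughly +34% for nouderef ((229 - 171) / 171 ≈ 0.34), +49% for weakuderef, and +43% for plain grsec.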

So at first glance, nouderef does have an impact. But again, our test suite needs to be made more robust on Stretch-based ISOs (should happen next week) before we can draw any conclusion from such data (it might already make sense to compare results between feature/7649-grsec* branches, but comparing them with feature/8415-overlayfs-stretch is totally flawed at this point).

On my side I’ll now focus on getting the data I will soon be able to gather with very little effort (and that thankfully is the most valuable we can get), rather than debating for hours about how exactly I should do it and running in circles. Anyone is welcome to do any additional work they want: the ISOs are publicly available, and more data points are always welcome :)

#25 Updated by cypherpunks 2017-03-12 01:25:22

I asked on #grsecurity and showed them those results, and was told that trying to use the test suite as a benchmark is going to be completely useless, even more so than I thought. The differences seen in the test suite are going to be caused by page faults and excessive syscalls made by the tests. That’s not representative of actual use of the system.

> <strcat> browsers, gaming, etc. will all be 0% difference even with everything enabled
> <strcat> if it’s not 0% something is wrong with the testing methodology

Any benchmark of the system that is actually representative will unfortunately require some effort. Real benchmarks will need to be run, carefully picked to represent real-world use, e.g. from https://openbenchmarking.org/tests/pts.

Some possibly useful benchmarks from that page:

https://openbenchmarking.org/test/pts/cairo-demos
https://openbenchmarking.org/test/pts/compress-7zip
https://openbenchmarking.org/test/pts/ffmpeg
https://openbenchmarking.org/test/pts/fs-mark
https://openbenchmarking.org/test/pts/glmark2
https://openbenchmarking.org/test/pts/gnupg
https://openbenchmarking.org/test/pts/gtkperf
https://openbenchmarking.org/test/pts/network-loopback

I think I’ll set up Tails with grsecurity and run some of those tests when I have the time.
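
These are Phoronix Test Suite profiles, so (assuming phoronix-test-suite can be installed and run inside the Tails system under test) individual tests can be run along the lines of:

phoronix-test-suite benchmark pts/compress-7zip pts/gnupg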

#26 Updated by intrigeri 2017-03-12 06:46:09

> The differences seen in the test suite are going to be caused by page faults and excessive syscalls made by the tests. That’s not representative of actual use of the system.

I don’t understand how exercising precisely the actions we expect users to take (see *.feature in https://git-tails.immerda.ch/tails/tree/features) is not representative of actual use of the system. I’m really not willing to argue about it, but these strong affirmations of yours make me very curious, and wondering who got what wrong, and why :)

#27 Updated by cypherpunks 2017-03-12 10:12:44

intrigeri wrote:
> > The differences seen in the test suite are going to be caused by page faults and excessive syscalls made by the tests. That’s not representative of actual use of the system.
>
> I don’t understand how exercising precisely the actions we expect users to take (see *.feature in https://git-tails.immerda.ch/tails/tree/features) is not representative of actual use of the system. I’m really not willing to argue about it, but these strong affirmations of yours make me very curious, and wondering who got what wrong, and why :)

Because those actions are not benchmarks. They’re done once, rather than in a rapid loop. Isn’t it also done with simulated user input? It simply isn’t a benchmark. So while it’s representative of what users do, it’s not representative of where users are spinning their cycles. A user is going to spin their cycles on JavaScript-heavy websites or videos that require decoding large h.264 files, not on opening a single .png file, closing it, then trying to open another .png file to see whether AppArmor denies it. Instead, they’d want a browser benchmark, a video-decoding benchmark, and a GTK function benchmark.

I got even stronger affirmations on #grsecurity from strcat, who is used to these specific types of test suites, and when asking about this elsewhere as well. It also just seems like common sense to me, given the differences in results between pax_weakuderef and pax_nouderef, which do not seem plausible. But instead of arguing, as you say, it’s certainly better for me to wait and ask spender or pipacs, and in the meantime just keep collecting data points. The goal is just to find out what grsecurity’s perf hit is, after all.

#28 Updated by intrigeri 2017-03-12 16:35:49

> Isn’t it also done with simulated user input?

Yes, it is.

> It simply isn’t a benchmark. So while it’s representative of what users do, it’s not representative of where users are spinning their cycles. A user is going to spin their cycles on JavaScript-heavy websites or videos that require decoding large h.264 files, not on opening a single .png file, closing it, then trying to open another .png file to see whether AppArmor denies it. Instead, they’d want a browser benchmark, a video-decoding benchmark, and a GTK function benchmark.

OK, I agree in theory. Thanks!

It would be interesting to look more closely at exactly which operations are this much slower with grsec, because benchmark or not, our test suite clearly shows that there is a set of operations that’s much slower, and it would be good to know whether this can affect real users even when they’re not doing CPU-intensive stuff (e.g. boot time and Tor Browser startup time do matter). I might do that (purely for the sake of curiosity) one of these days.

Anyway, thanks again!

#29 Updated by intrigeri 2017-06-23 11:17:59

  • Status changed from Confirmed to Rejected

grsec is not FOSS anymore.