Feature #11179

Enable automatic NUMA balancing on lizard

Added by intrigeri 2016-02-29 01:23:55. Updated 2018-08-19 12:59:49.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
2016-02-29
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

After reading chapter 8 of https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Virtualization_Tuning_and_Optimization_Guide/index.html#sect-Virtualization_Tuning_Optimization_Guide-NUMA-Auto_NUMA_Balancing and other Red Hat performance tuning documentation, I came to the conclusion that we should enable automatic NUMA balancing on lizard (echo 1 > /proc/sys/kernel/numa_balancing, i.e.
sysctl::value { 'kernel.numa_balancing': value => 1 } in Puppet).

It is supposed to give us better performance, and the other options are not practical:

  • manual NUMA tuning: quite some initial + maintenance work, let’s avoid this if we can
  • numad: the package is in Stretch but not in Jessie, and according to Red Hat it’s not really more efficient than automatic NUMA balancing

To be able to evaluate how well this works, we can compare numastat output before and after enabling it.


Subtasks


Related issues

Related to Tails - Bug #15832: lizard kernel oops on 4.9.0-8 kernel Resolved 2018-08-22
Blocked by Tails - Feature #11178: Upgrade lizard host system to Jessie Resolved 2016-02-29
Blocked by Tails - Feature #11817: Optimize I/O settings on lizard Resolved 2016-09-20

History

#1 Updated by intrigeri 2016-02-29 01:24:12

  • blocked by Feature #11178: Upgrade lizard host system to Jessie added

#2 Updated by intrigeri 2016-02-29 01:24:31

  • Target version set to Tails_2.3

I’d like to do that in late March / early April. If we don’t manage to, no big deal, it’s not an emergency => we can postpone the target version quite a bit.

#4 Updated by intrigeri 2016-04-16 15:41:09

  • Target version changed from Tails_2.3 to Tails_2.4

#5 Updated by intrigeri 2016-04-29 14:26:02

  • Target version deleted (Tails_2.4)

#6 Updated by intrigeri 2016-09-20 09:44:54

  • Description updated

#7 Updated by intrigeri 2016-09-20 09:48:13

  • Description updated

#8 Updated by intrigeri 2016-09-20 15:17:03

#9 Updated by intrigeri 2016-09-20 15:17:27

(I don’t want to mix two concurrent experiments, so I’ll wait.)

#10 Updated by intrigeri 2016-09-20 15:19:10

  • Description updated

#11 Updated by intrigeri 2016-11-06 10:04:46

Before enabling:

sudo numastat -c qemu-system-x86_64

Per-node process memory usage (in MBs)
PID              Node 0 Node 1  Total
---------------  ------ ------ ------
8121 (sudo)           3      1      4
23663 (qemu-syst   8855     34   8889
23685 (qemu-syst    110    534    644
23709 (qemu-syst     89    529    618
23727 (qemu-syst    101    528    629
23745 (qemu-syst     79  23575  23653
23775 (qemu-syst    641     12    653
23793 (qemu-syst    168   8732   8900
23840 (qemu-syst    101  23570  23672
23866 (qemu-syst    130    539    668
23885 (qemu-syst  23645     12  23657
23925 (qemu-syst   1120     14   1134
23943 (qemu-syst    151   2745   2895
23961 (qemu-syst     90    530    620
24031 (qemu-syst    145   8738   8883
24140 (qemu-syst    100   1563   1663
24159 (qemu-syst   1193     17   1209
24179 (qemu-syst    189   4147   4336
24198 (qemu-syst     92  23577  23669
24235 (qemu-syst    155   1552   1707
24295 (qemu-syst   3342  20335  23676
24446 (qemu-syst   8869     32   8901
24468 (qemu-syst  23644     21  23665
31929 (qemu-syst  23645     22  23667
31970 (qemu-syst  23648     16  23664
46498 (qemu-syst   1083     25   1108
---------------  ------ ------ ------
Total            121386 121397 242783

#12 Updated by intrigeri 2016-11-06 10:23:51

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

And after enabling it, shutting down a bunch of VMs, and starting the ones that are allocated large amounts of RAM first:

sudo numastat -c qemu-system-x86_64 

Per-node process memory usage (in MBs)
PID              Node 0 Node 1  Total
---------------  ------ ------ ------
16421 (qemu-syst   8745      5   8750
16559 (qemu-syst     23   8731   8753
16625 (qemu-syst   8736      4   8740
17331 (qemu-syst     27  23562  23589
17447 (qemu-syst  23581     10  23590
17626 (qemu-syst     68  23523  23592
17717 (qemu-syst  23575     15  23590
19667 (qemu-syst     22    523    545
21982 (qemu-syst     15  23574  23590
22173 (qemu-syst   1725   2409   4134
22210 (qemu-syst   1552     22   1575
22659 (sudo)          3      1      4
23709 (qemu-syst     90    528    618
23727 (qemu-syst     99    530    629
23745 (qemu-syst     78  23575  23653
23775 (qemu-syst    639     14    653
23793 (qemu-syst    165   8735   8900
23866 (qemu-syst    130    538    668
23925 (qemu-syst   1117     16   1134
23943 (qemu-syst    150   2746   2896
23961 (qemu-syst     88    531    620
24140 (qemu-syst     97   1563   1661
24159 (qemu-syst   1193     16   1209
24468 (qemu-syst  23645     20  23665
31929 (qemu-syst  23645     22  23667
46498 (qemu-syst   1085     22   1108
---------------  ------ ------ ------
Total            120295 121237 241532

The isotesters (and all isobuilders but one that I did not restart) now have their memory on the right NUMA node. I’m not sure what will happen on the next reboot: will it look nice, or will it depend on the startup order of the VMs?

#13 Updated by intrigeri 2017-05-28 13:59:09

Ouch, things are still not ideally balanced:

$ sudo numastat
                           node0           node1
numa_hit               393402402       453822993
numa_miss                 284650       544264909
numa_foreign           544105750          443809
interleave_hit             59805           59815
local_node             393402402       453822993
other_node                     0               0
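These counters make the imbalance easy to quantify: divide numa_miss by numa_hit for each node. A minimal sketch in Python, with the values copied from the output above:

```python
# numa_hit / numa_miss counters copied from the numastat output above
counters = {
    "node0": {"numa_hit": 393402402, "numa_miss": 284650},
    "node1": {"numa_hit": 453822993, "numa_miss": 544264909},
}

for node, c in counters.items():
    # a ratio near zero means almost all allocations landed on the local node
    ratio = c["numa_miss"] / c["numa_hit"]
    print(f"{node}: miss/hit ratio = {ratio:.2%}")
```

With these numbers, node0 is essentially clean while node1 has more misses than hits, which is what "still not ideally balanced" refers to.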

$ sudo numastat -c qemu-system-x86_64 

Per-node process memory usage (in MBs)
PID              Node 0 Node 1  Total
---------------  ------ ------ ------
12071 (qemu-syst  26505     37  26542
12119 (qemu-syst   1566     24   1590
12160 (qemu-syst    775      6    780
12203 (qemu-syst      6   2744   2750
12269 (qemu-syst   1256     11   1267
12451 (qemu-syst    890  25650  26539
12511 (sudo)          1      2      4
12790 (qemu-syst   3106   1048   4155
12837 (qemu-syst    462   1122   1584
12977 (qemu-syst     11  26530  26541
13040 (qemu-syst     35    532    567
13083 (qemu-syst      9  14546  14555
13131 (qemu-syst      5    874    879
13172 (qemu-syst      7    553    560
13213 (qemu-syst    556      4    560
13254 (qemu-syst     27    533    560
13296 (qemu-syst  10489  16054  26543
13350 (qemu-syst     10    553    563
13393 (qemu-syst     10    554    564
13438 (qemu-syst  14551      5  14556
13492 (qemu-syst  24073   2469  26541
13548 (qemu-syst      9  14545  14553
13608 (qemu-syst      8    555    563
---------------  ------ ------ ------
Total             84367 108950 193316

So I’ve shut down all isobuilders and isotesters, and tried:

  • turning off automatic NUMA balancing, running numad, starting isotesters one after the other: several are badly aligned
  • turning on automatic NUMA balancing, stopping numad, starting 5 isotesters sequentially: several are badly aligned
  • turning off both automatic NUMA balancing and numad, starting 5 isotesters sequentially: all correctly aligned

So I’m not quite sure what’s going on. Possible explanations and ideas:

  • One of the algorithms used by automatic NUMA balancing is Migrate-on-Fault, so it’s plausible that the situation gets better over time, and looking at stats immediately after starting the VMs might be mostly worthless.
  • We use <vcpu placement='static'>, while <vcpu placement='auto'> would query numad; this is best combined with <numatune><memory mode='strict' placement='auto'/></numatune>, which also queries numad.
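A sketch of what that second option could look like in a libvirt domain XML (the guest name and vCPU count below are hypothetical; only the placement-related elements matter):

```xml
<domain type='kvm'>
  <!-- hypothetical guest; the relevant parts are vcpu and numatune -->
  <name>isotester1</name>
  <!-- 'auto' queries numad for an advisory CPU placement at guest start -->
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <!-- also query numad for memory placement, and bind strictly to it -->
    <memory mode='strict' placement='auto'/>
  </numatune>
</domain>
```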

#14 Updated by intrigeri 2017-05-28 14:10:48

intrigeri wrote:
> * We use <vcpu placement='static'>, while <vcpu placement='auto'> would query numad; this is best combined with <numatune><memory mode='strict' placement='auto'/></numatune>, that queries numad too.

I’ve tried that (with numad running), let’s see how it goes.

#15 Updated by intrigeri 2017-06-29 10:01:23

  • blocks Feature #13232: Core work 2017Q2: Sysadmin (Maintain our already existing services) added

#16 Updated by intrigeri 2017-06-29 13:33:33

  • blocked by deleted (Feature #13232: Core work 2017Q2: Sysadmin (Maintain our already existing services))

#17 Updated by intrigeri 2018-08-18 09:28:04

intrigeri wrote:
> intrigeri wrote:
> > * We use <vcpu placement='static'>, while <vcpu placement='auto'> would query numad; this is best combined with <numatune><memory mode='strict' placement='auto'/></numatune>, that queries numad too.
>
> I’ve tried that (with numad running), let’s see how it goes.

I get better results than previously: all but one of the iso{builder,tester}s are correctly placed on a single NUMA node. So I’ll configure the same settings for all our other VMs and will call this done.

#18 Updated by intrigeri 2018-08-18 09:35:22

  • Target version set to Tails_3.9
  • % Done changed from 10 to 70
  • QA Check set to Ready for QA

Done, but I have not restarted the VMs, so the changes are not really applied yet. I’ll check how things are after the next reboot.

Updated our VM creation doc accordingly.

#19 Updated by intrigeri 2018-08-19 12:59:49

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 70 to 100
  • QA Check changed from Ready for QA to Pass

Interestingly, 4 out of our 9 iso{build,test}ers still seem to be badly balanced. We could probably improve things by tweaking the order in which stuff is started. But the numa_miss / numa_hit ratio is 0 on node0 and ~1% on node1, which is much better than what I’ve seen before, so I’ll call this good enough.

#20 Updated by groente 2018-08-22 09:48:42

  • related to Bug #15832: lizard kernel oops on 4.9.0-8 kernel added