Bug #15832

lizard kernel oops on 4.9.0-8 kernel

Added by groente 2018-08-22 09:48:13. Updated 2018-10-11 17:27:19.

Status: Resolved
Priority: Normal
Assignee: groente
Category: Infrastructure
Target version:
Start date: 2018-08-22
Due date:
% Done: 80%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for:

Description

Booting lizard on a 4.9.0-8 kernel resulted in a non-functional system spewing out the following oops:

Aug 22 08:27:57 lizard kernel: [ 387.963350] Oops: 0000 [#1] SMP

Aug 22 08:27:57 lizard kernel: [ 388.244093] Call Trace:
Aug 22 08:27:57 lizard kernel: [ 388.246539] [] ? do_huge_pmd_numa_page+0xa6/0x5c0
Aug 22 08:27:57 lizard kernel: [ 388.252884] [] ? handle_mm_fault+0x676/0x12b0
Aug 22 08:27:57 lizard kernel: [ 388.258888] [] ? __do_page_fault+0x255/0x4f0
Aug 22 08:27:57 lizard kernel: [ 388.264799] [] ? page_fault+0x28/0x30

Some suggestions on how to proceed:

- perhaps we can reboot on the 4.9.0-8 kernel but disable the L1TF fixes (iirc one can enable/disable parts of them selectively); they’re the only change in -8, so they are likely the cause of the trouble

- reboot on the 4.9.0-8 kernel with libvirtd and numad disabled on the kernel cmdline (iirc systemd has means to disable service startup this way). Then log in, start numad and make sure it’s really up and ready (Type=forking does not guarantee the daemon is ready to answer requests), disable autostarting of all VMs, start libvirtd, start the biggest (RAM-wise) VMs one after the other, check NUMA allocation, and start everything else if there is no trouble. Maybe that would help diagnose what’s going on wrt NUMA.

- reboot on a much newer kernel, in the hope that the problem is the backport of this big pile of fixes to 4.9
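
For the first two suggestions, the extra kernel command-line parameters could be assembled roughly as below. This is a minimal sketch, not something that was run on lizard. `l1tf=off` is the upstream kernel parameter that turns off the hypervisor-side L1TF mitigations (assuming the 4.9 backport honours it), and `systemd.mask=` is the option read by systemd-debug-generator. The unit names `libvirtd.service` and `numad.service` and the helper `extra_cmdline` are assumptions made for illustration.

```python
def extra_cmdline(disable_l1tf=True, masked_units=()):
    """Return extra kernel command-line parameters as one string."""
    params = []
    if disable_l1tf:
        # Upstream kernel parameter; assumed to be honoured by the 4.9 backport.
        params.append("l1tf=off")
    # systemd.mask=UNIT masks a unit for the current boot only.
    params.extend(f"systemd.mask={unit}" for unit in masked_units)
    return " ".join(params)


if __name__ == "__main__":
    # Unit names are assumptions; check `systemctl list-unit-files` on the host.
    print(extra_cmdline(masked_units=("libvirtd.service", "numad.service")))
```

systemd-debug-generator documents the mask as lasting for the whole boot, so the later manual starts of numad and libvirtd may need another route than `systemctl start`.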


Subtasks


Related issues

Related to Tails - Feature #11179: Enable automatic NUMA balancing on lizard Resolved 2016-02-29
Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) Confirmed 2017-06-29

History

#1 Updated by groente 2018-08-22 09:48:41

  • related to Feature #11179: Enable automatic NUMA balancing on lizard added

#2 Updated by groente 2018-08-22 09:49:41

intri: i’ve assigned this to you for now since i won’t have time to properly look into it the next few weeks, please feel free to reassign to me if you don’t have time either :)

#3 Updated by intrigeri 2018-09-19 17:39:13

  • Assignee changed from intrigeri to groente

groente wrote:
> intri: i’ve assigned this to you for now since i won’t have time to properly look into it the next few weeks, please feel free to reassign to me if you don’t have time either :)

Indeed, I don’t have time either, so please go ahead. We’re now far enough from the major 3.9 release to afford a little bit of well-managed downtime. Please keep https://tails.boum.org/contribute/calendar/ in mind when scheduling this work :)

If you want to first try to revert my recent NUMA changes, in order to check whether they’re the culprit, that’s 0ac2378b0919c3778a41e59a5609317864d373f2 in lizard’s /etc.

#4 Updated by intrigeri 2018-09-19 17:39:31

  • blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added

#5 Updated by groente 2018-10-08 14:12:15

  • Status changed from Confirmed to Resolved

upgraded the kernel to 4.18 from stretch-backports and the problem disappeared \o/

#6 Updated by intrigeri 2018-10-11 11:54:28

  • Status changed from Resolved to In Progress
  • % Done changed from 0 to 80

I’ve reviewed the corresponding Puppet changes and have found two issues:

  • ensure => $ensure feels wrong/useless (there’s no $ensure variable in this context, is there?)
  • missing origin in the APT pinning; I think you want release o=Debian Backports,a=stretch-backports
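
The pin suggested in the second point would correspond to an APT preferences stanza along these lines. This is only a sketch: the package name `linux-image-amd64`, the priority 990, and the helper `render_pin` are assumptions, not taken from the Puppet manifests.

```python
def render_pin(package, release, priority):
    """Render one stanza in apt_preferences(5) format."""
    return (
        f"Package: {package}\n"
        f"Pin: release {release}\n"
        f"Pin-Priority: {priority}\n"
    )


if __name__ == "__main__":
    # Package name and priority are illustrative assumptions.
    print(
        render_pin(
            "linux-image-amd64",
            "o=Debian Backports,a=stretch-backports",
            990,
        ),
        end="",
    )
```

`apt-cache policy <package>` shows whether a pin like this is actually taken into account.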

#7 Updated by groente 2018-10-11 12:47:42

  • Assignee changed from groente to intrigeri
  • QA Check changed from Dev Needed to Info Needed

> * ensure => $ensure feels wrong/useless (there’s no $ensure variable in this context, is there?)
> * missing origin in the APT pinning; I think you want release o=Debian Backports,a=stretch-backports

Fixed both, but it does make me wonder about the diffoscope pinning for isobuilders, is the pinning in puppet-tails:manifests/iso_builder.pp on line 18 broken then?

#8 Updated by intrigeri 2018-10-11 12:56:41

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

> but it does make me wonder about the diffoscope pinning for isobuilders, is the pinning in puppet-tails:manifests/iso_builder.pp on line 18 broken then?

I think it’s a no-op indeed. No idea how diffoscope from backports got installed there anyway. Best would be to check: uninstall diffoscope, run Puppet again, and see which version it installs.

#9 Updated by groente 2018-10-11 13:23:17

  • Status changed from In Progress to Resolved

Okay, the diffoscope pin for isobuilders was indeed broken, that’s also fixed now, thanks for the review!

#10 Updated by intrigeri 2018-10-11 17:27:20

> Okay, the diffoscope pin for isobuilders was indeed broken, that’s also fixed now, thanks for the review!

Glad it had the side effect of fixing something else :)