Bug #9157

ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel

Added by intrigeri 2015-04-04 15:25:48 . Updated 2016-03-31 18:09:42 .

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
2015-04-04
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
0
Affected tool:
Deliverable for:
280

Description

… but works fine when booted with Wheezy’s kernel. In both cases, the system is running Jessie.

Symptoms: the system is unresponsive, and does not even reply to ping.

I can’t reproduce this bug in another Jessie VM (also doing nested virtualization) that runs on different hardware with a very similar setup. I can’t reproduce it on bare metal either.

More information I could gather:

[  152.976056] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 2} (detected by 1, t=5254 jiffies, g=7943, c=7942, q=6728)
[  360.096135] INFO: task kworker/u6:0:6 blocked for more than 120 seconds.
[  360.097693]       Not tainted 3.16.0-4-amd64 #1
[  360.098687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.100691] INFO: task jbd2/dm-0-8:139 blocked for more than 120 seconds.
[  360.102189]       Not tainted 3.16.0-4-amd64 #1
[  360.103178] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.105015] INFO: task jbd2/vdb-8:328 blocked for more than 120 seconds.
[  360.106475]       Not tainted 3.16.0-4-amd64 #1
[  360.107462] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.109328] INFO: task rs:main Q:Reg:700 blocked for more than 120 seconds.
[  360.110740]       Not tainted 3.16.0-4-amd64 #1
[  360.111654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.113449] INFO: task master:787 blocked for more than 120 seconds.
[  360.114731]       Not tainted 3.16.0-4-amd64 #1
[  360.115645] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.117405] INFO: task pickup:798 blocked for more than 120 seconds.
[  360.118682]       Not tainted 3.16.0-4-amd64 #1
[  360.119595] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.121365] INFO: task cucumber:2205 blocked for more than 120 seconds.
[  360.122598]       Not tainted 3.16.0-4-amd64 #1
[  360.123446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.125100] INFO: task qemu-system-x86:2308 blocked for more than 120 seconds.
[  360.126514]       Not tainted 3.16.0-4-amd64 #1
[  360.127370] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

and:

BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:34]

Files


Subtasks


Related issues

Related to Tails - Feature #9264: Consider buying more server hardware to run our automated test suite Resolved 2015-12-15
Related to Tails - Bug #10717: Concerning amount of test suite runs aborted on Jenkins due to timeout Rejected 2015-12-06
Blocked by Tails - Feature #11178: Upgrade lizard host system to Jessie Resolved 2016-02-29

History

#1 Updated by intrigeri 2015-04-04 15:32:49

Note that I’ve tried removing irqbalance, as it seemed to be a workaround for people affected by similar problems, but it didn’t help.

Next steps are to try and reproduce in various settings, e.g.:

  • starting the same ISO with QEMU, outside of the test suite environment and of libvirt (see the sketch after this list)
  • starting the same ISO with libvirt, outside of the test suite environment
  • changing the QEMU command line a bit
  • allocating HugePages statically
  • disabling automatic allocation of AnonHugePages
  • more ideas welcome!
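
For reference, here is a rough sketch of what the first and the HugePages-related items could look like (the ISO path is the one used later in this ticket; the -enable-kvm switch, the page count and the hugetlbfs mount point are illustrative assumptions, not what was actually run):

  # Boot the ISO with plain QEMU, outside of libvirt and the test suite:
  sudo qemu-system-x86_64 -enable-kvm -cdrom /srv/tails-i386-1.3.2.iso -m 1280 -nographic

  # Allocate HugePages statically (640 x 2 MiB pages = 1280 MiB, matching the guest RAM):
  echo 640 | sudo tee /proc/sys/vm/nr_hugepages
  # (QEMU would then need something like -mem-path /dev/hugepages to actually use them.)

  # Disable automatic allocation of AnonHugePages (transparent hugepages):
  echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled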

#2 Updated by intrigeri 2015-04-09 13:07:57

Next step is probably to retry with linux (3.16.7-ckt9-1) first.

#3 Updated by intrigeri 2015-04-17 13:01:00

With linux 3.16.7-ckt9-2, I see basically the same:

[  231.044003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=5250 jiffies g=3872 c=3871 q=50)
[  256.064003] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  256.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  284.064002] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  284.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  294.056003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=21003 jiffies g=3872 c=3871 q=50)
[  320.064002] BUG: soft lockup - CPU#1 stuck for 23s! [qemu-system-x86:1221]
[  320.080003] BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:2:328]
[  348.064003] BUG: soft lockup - CPU#1 stuck for 23s! [qemu-system-x86:1221]
[  348.080003] BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:2:328]
[  357.068003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=36756 jiffies g=3872 c=3871 q=50)
[  384.064002] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  384.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  412.064002] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  412.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  420.080003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=52509 jiffies g=3872 c=3871 q=50)
[  448.064004] BUG: soft lockup - CPU#1 stuck for 23s! [qemu-system-x86:1221]
[  448.080003] BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:2:328]
[  476.064006] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  476.080004] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]

#4 Updated by intrigeri 2015-04-17 13:20:58

With Linux 3.19.3-1~exp1:

[   76.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  104.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  132.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  160.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  188.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  216.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  244.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  272.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  300.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  328.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  356.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  384.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  412.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  440.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  468.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]

#5 Updated by intrigeri 2015-04-17 15:58:33

  • With the current Jessie kernel, I can’t reproduce the bug when booting with noapic disableapic. But then, of course, only one CPU is used, so that’s not a valid workaround. This leads me to think that there’s a deadlock between CPUs.
  • Booting with notsc nohpet noapictimer doesn’t help with the current Jessie kernel either (a sketch of setting such parameters via GRUB follows this list).
  • sudo qemu-system-x86_64 -cdrom /srv/tails-i386-1.3.2.iso -m 1280 -nographic seems to work just fine. As soon as I add -machine pc-0.15,accel=kvm, it doesn’t work anymore. Removing usb=off, adding mem-merge=off, or using -machine pc-q35-2.1 doesn’t help.
  • Both attached traces were produced by the current Jessie kernel, with sudo qemu-system-x86_64 -cdrom /srv/tails-i386-1.3.2.iso -m 1280 -nographic -machine pc-0.15,accel=kvm
  • Disabling transparent hugepages (echo never >/sys/kernel/mm/transparent_hugepage/enabled) doesn’t help.
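
For the record, a sketch of how such kernel command-line parameters would typically be made persistent on an isotester, assuming the standard GRUB setup (the exact parameter list varied between tests):

  # In /etc/default/grub, add the parameters to the kernel command line, e.g.:
  #   GRUB_CMDLINE_LINUX="notsc nohpet noapictimer"
  # then regenerate the GRUB configuration and reboot:
  sudo update-grub
  sudo reboot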

#6 Updated by intrigeri 2015-04-17 16:36:09

Reported on the Debian BTS: https://bugs.debian.org/782774

#7 Updated by hybridwipe 2015-07-07 21:38:33

How is the first kvm instance being started (i.e., what QEMU command line), and what image/ISO is it running? I tried booting a testing image to boot Tails 1.3.2 (to minimize changes), but kvm wouldn’t even load inside kvm instance 1.

#8 Updated by intrigeri 2015-07-08 01:42:52

> How is the first kvm instance being started (i.e., what QEMU command line), and what image/ISO is it running? I tried booting a testing image to boot Tails 1.3.2 (to minimize changes), but kvm wouldn’t even load inside kvm instance 1.

We’re using libvirt, both on the host and (via the test suite) in the level 1 guest.
I suggest taking this question to our support channels.

#9 Updated by intrigeri 2015-07-08 11:05:40

Reproduced with today’s Jessie kernel (3.16.7-ckt11-1).

#10 Updated by bertagaz 2015-09-19 09:40:09

Reproduced with the jessie-backports kernel (4.1.0-0.bpo.2 : 4.1.6-1~bpo8+1 (2015-09-09)).

I tried booting the L1 VM with nmi_watchdog=0, and also pinned the CPU frequency on the host, since I read this may be due to frequency scaling, but none of that has worked so far.
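
For reference, a sketch of how the host CPU frequency would typically be pinned (assuming the cpupower tool; the exact method used isn’t recorded here):

  # Force the performance governor so the CPUs don't scale their frequency down:
  sudo cpupower frequency-set -g performance
  # Check the resulting policy:
  cpupower frequency-info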

I’ve tried setting the L1 VM’s CPU to host-passthrough mode (the libvirt setting is sketched after the trace below), and then saw an error I hadn’t seen before:

[  231.456005] INFO: rcu_sched detected stalls on CPUs/tasks: { 2} (detected by 1, t=36762 jiffies, g=432, c=431, q=92)
[  231.456005] rcu_sched kthread starved for 36762 jiffies!
[  240.168087] INFO: task jbd2/dm-0-8:139 blocked for more than 120 seconds.
[  240.169158]       Tainted: G             L  4.1.0-0.bpo.2-amd64 #1
[  240.170057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  256.076002] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:3:329]
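
For reference, that change presumably amounts to editing the L1 domain’s definition on the level-0 host and setting its CPU element as follows (a sketch; the domain name comes from the original ticket subject, not necessarily the exact one used):

  # On the level-0 host:
  virsh edit isotester1
  # ... then set the CPU element to:
  #   <cpu mode='host-passthrough'/>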

#11 Updated by bertagaz 2015-09-20 04:02:22

Did some other tests with this backport kernel.

trace_3 is the basic trace.

trace_no_hpet_no_acpi is when starting the L2 guest’s qemu-system-x86_64 with the -no-hpet and -no-acpi options.

trace_no_disk_cache_L1 is when starting the L1 guest with cache='none' for all its attached disks.

I also failed to find a way to set enable_apicv to Y for the kvm_intel module inside the L1 guest (the usual approach is sketched below).
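
For what it’s worth, the usual way to set such a module parameter would be something like this inside the L1 guest (a sketch; the file name is arbitrary, and whether the nested kvm_intel honours enable_apicv at all is precisely what’s unclear):

  # Make the setting persistent:
  echo 'options kvm_intel enable_apicv=1' | sudo tee /etc/modprobe.d/kvm-apicv.conf
  # Reload the module (no L2 guest must be running), then check the result:
  sudo modprobe -r kvm_intel && sudo modprobe kvm_intel
  cat /sys/module/kvm_intel/parameters/enable_apicv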

I’ll post some links to pages that may interest us later.

There’s definitely a problem with interrupts or the APIC. I tried disabling X2APIC access for the L1 guest, but that didn’t change anything.

#12 Updated by bertagaz 2015-09-22 02:40:36

These tests were done with qemu-system-x86 from Jessie (which is 2.1: 2.1+dfsg-12+deb8u4). I’ve tried with the version from Stretch (2.4: 2.4+dfsg-3) and got a somewhat different trace.

#13 Updated by bertagaz 2015-09-22 02:47:42

And this other trace is from adding -smp cpus=1 to the usual qemu-system-x86_64 -cdrom $ISO -m 1280 -machine accel=kvm command used for all these tests.
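
That is, presumably:

  qemu-system-x86_64 -cdrom $ISO -m 1280 -machine accel=kvm -smp cpus=1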

#14 Updated by bertagaz 2015-09-25 03:30:54

Note that all these tests were done with the QEMU from Wheezy on the host, which is a bit old (1.1.2). Meanwhile there have been changes in QEMU related to the APIC and such. There’s a wheezy-backports version available, which is 2.1, so a next step could be to upgrade to it on the host and do some more testing.
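
A sketch of what that upgrade would look like on the level-0 host (the mirror URL is illustrative; the package name is the one mentioned in the previous note):

  # Enable wheezy-backports if it isn't already:
  echo 'deb http://http.debian.net/debian wheezy-backports main' | sudo tee /etc/apt/sources.list.d/wheezy-backports.list
  sudo apt-get update
  # Pull QEMU 2.1 from backports:
  sudo apt-get -t wheezy-backports install qemu-system-x86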

#15 Updated by intrigeri 2015-11-01 02:09:55

  • Assignee changed from intrigeri to bertagaz
  • QA Check set to Info Needed

bertagaz wrote:
> I’ve tried setting the L1 VM’s CPU to host-passthrough mode, and then saw an error I hadn’t seen before:
>
> […]

I don’t understand what you mean here: this is essentially the same error message as the one that’s in the ticket description.

#16 Updated by intrigeri 2015-11-01 02:12:58

bertagaz wrote:
> Note that all these tests were done with the QEMU from Wheezy on the host, which is a bit old (1.1.2). Meanwhile there have been changes in QEMU related to the APIC and such. There’s a wheezy-backports version available, which is 2.1, so a next step could be to upgrade to it on the host and do some more testing.

Good idea.

I suggest we hold on, and block on Feature #9264 though: if we’re going to stop doing nested virtualization soonish, then it’s not worth investing lots of time into fixing it now (fixing it 6 months ago would have been different). What do you think?

#17 Updated by bertagaz 2015-11-02 04:15:25

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:
> bertagaz wrote:
> > Note that all these tests were done with the QEMU from Wheezy on the host, which is a bit old (1.1.2). Meanwhile there have been changes in QEMU related to the APIC and such. There’s a wheezy-backports version available, which is 2.1, so a next step could be to upgrade to it on the host and do some more testing.
>
> Good idea.
>
> I suggest we hold on, and block on Feature #9264 though: if we’re going to stop doing nested virtualization soonish, then it’s not worth investing lots of time into fixing it now (fixing it 6 months ago would have been different). What do you think?

Well, I’m tempted to agree, but I’m also tempted to give this upgrade a try. There’s quite a large number of aborted test jobs in Jenkins, and I wonder whether it would help with that. Let’s say it’s not mandatory, but it’s something that can be played with while waiting to see how Feature #9264 goes?

#18 Updated by intrigeri 2015-11-04 02:37:40

  • Assignee changed from intrigeri to bertagaz

> Well, I’m tempted to agree, but I’m also tempted to give this upgrade a try.

I certainly won’t prevent you from spending your spare time on this, if you wish to do so :)

> There’s quite a large number of aborted test jobs in Jenkins, and I wonder whether it would help with that.

This is another problem. It looks like a severe one that needs a ticket.

> Let’s say it’s not mandatory, but it’s something that can be played with while waiting to see how Feature #9264 goes?

There are lots of things wrt. Feature #9264 and follow-up tasks that can be done without “waiting”, e.g. making sure that we’ll have more info when we come back to it, that we can host another box at the colo, that our contacts there will be available to purchase and plug in the new box, etc. If all this is well handled and you have spare time, well, again, feel free to use it as you wish :)

#19 Updated by intrigeri 2015-12-06 15:22:24

I think we should now do the following, in this order, stopping on success:

  1. upgrade the host (lizard) to Jessie: we have to do it soonish anyway, and it’ll give us the QEMU upgrade bertagaz wanted to test :)
  2. reproduce the bug on another isotesterN if not done yet
  3. test with Linux 4.3 on isotesterN
  4. make sure Feature #9264 saves the day

#20 Updated by intrigeri 2015-12-06 15:22:43

  • related to Feature #9264: Consider buying more server hardware to run our automated test suite added

#21 Updated by bertagaz 2015-12-07 04:25:22

  • related to Bug #10717: Concerning amount of test suite runs aborted on Jenkins due to timeout added

#22 Updated by kytv 2015-12-07 12:44:51

intrigeri wrote:

> 3. test with Linux 4.3 on isotesterN

I still need to try 4.3 (and 4.4-rc), but here’s a data point which may be helpful even though my hardware is wildly different.

With 4.2.x my kernel reliably OOPSes during the UEFI Tails-Installer test. I reverted to a self-compiled 4.1.13 in the meantime.

#23 Updated by kytv 2015-12-07 14:34:25

kytv wrote:
> intrigeri wrote:
>
> > 3. test with Linux 4.3 on isotesterN
>
> > I still need to try 4.3 (and 4.4-rc), but here’s a data point which may be helpful even though my hardware is wildly different.
>

The oops can be reproduced with 4.4.0-rc4, but the screen is still responsive, unlike with 4.2.x.

Perhaps isotesterX has better luck with it than I do.

#24 Updated by intrigeri 2016-02-29 01:18:36

  • blocked by Feature #11178: Upgrade lizard host system to Jessie added

#25 Updated by intrigeri 2016-02-29 15:26:43

FWIW:

  • I’ve set cpu mode=host-passthrough on all isotesters, as suggested by the OpenStack doc when doing nested KVM.
  • Since commit:8544f14485289653b8dde9a29dc496f79e768866 (introduced on feature/jessie, part of Tails 2.0+) the test suite is run with cpu mode='host-model'.

It might be that one of these changes fixes the problem this ticket is about => worth retrying with Linux 3.16 from Jessie, and maybe Linux 4.4 from sid. I might play with this one of these days when bored.

Also, wrt. the QEMU backport for Wheezy: it seems to be unmaintained and lacks a big pile of security fixes, so I don’t think it’s an option.

After the conclusions I’ve reached on Feature #11011, I think we’re back to needing to fix this problem before Wheezy’s Linux 3.2 is no longer maintained, that is before the end of wheezy-lts, since we can’t count on a 2nd testing machine to magically solve this problem for us => bertagaz, can you set an appropriate target version?

#26 Updated by intrigeri 2016-02-29 16:45:22

  • Target version set to Tails_2.3

Reproduced on an isotester running the current Jessie kernel (3.16.7-ckt20-1+deb8u3); same with the one from Stretch (4.3.0-1-amd64, version 4.3.5-1). I wonder if we should bisect the Linux 3.2..3.16 packages to see when the problem appeared.

Another option would be to try a newer kernel on the host.

I’m giving up until we’ve done Feature #11178, and setting the same tentative milestone as that one.

#27 Updated by anonym 2016-03-01 10:40:44

I think I’ve stumbled into this issue too now: Feature #6354#note-16.

intrigeri wrote:
> Another option would be to try a newer kernel on the host.

On my level-0 host I am running Debian Unstable, with kernel version 4.3.0-1-amd64 (and I’ve tried with the same backported kernels in the other two VMs), so I don’t think this will help. :/

#28 Updated by intrigeri 2016-03-01 18:31:57

> I think I’ve stumbled into this issue too now: Feature #6354#note-16.

> intrigeri wrote:
>> Another option would be to try a newer kernel on the host.

> On my level-0 host I am running Debian Unstable, with kernel version 4.3.0-1-amd64 (and I’ve tried with the same backported kernels in the other two VMs), so I don’t think this will help. :/

I’d like to confirm this is really the same problem we’re seeing, first, so: is it fixed for you if the level-1 VM (where Vagrant will be run) runs Wheezy’s kernel + Jessie’s userspace?

#29 Updated by anonym 2016-03-01 22:58:49

intrigeri wrote:
> > On my level-0 host I am running Debian Unstable, with kernel version 4.3.0-1-amd64 (and I’ve tried with the same backported kernels in the other two VMs), so I don’t think this will help. :/
>
> I’d like to confirm this is really the same problem we’re seeing, first, so: is it fixed for you if the level-1 VM (where Vagrant will be run) runs Wheezy’s kernel + Jessie’s userspace?

As mentioned on #tails-dev, this doesn’t fix the problem. Specifically I was running:

  • level 0: 4.4.0-1-amd64 (Sid kernel/user space)
  • level 1: 3.2.0-4-amd64 (Wheezy kernel, Jessie user space)
  • level 2: 3.16.0-4-amd64 (Jessie kernel/user space)

I tried twice, and the first time even level-0 crashed! Second time I got this in level-2:

kernel:[ 1092.132037] BUG: soft lockup - CPU#3 stuck for 22s! [apt-get:25441]

which actually isn’t exactly the same as what I reported in Feature #6354#note-16. I couldn’t see what happened in level-1 since it crashed, and the journal was lost.

I wonder if it matters that level-0 runs kernel 4.4.0. I guess lizard runs the Jessie (or Wheezy?) kernel?

#30 Updated by intrigeri 2016-03-02 10:43:27

> I wonder if it matters that level-0 runs kernel 4.4.0. I guess lizard runs the Jessie (or Wheezy?) kernel?

lizard, the level-0 host, runs 3.16 from wheezy-backports.

#31 Updated by anonym 2016-03-02 21:54:34

I do not have the issue when both level-0 and level-1 are running Debian Sid, see Feature #6354#note-18. So if my kernel issue is somehow related to the kernel issue this ticket is about, perhaps everything will be good once Debian Stretch is out and lizard and the isobuilders migrate to it.

An interesting experiment that we could do now would be installing the 4.4.0-1 kernel from jessie-backports on both lizard and the isotesters (and isobuilders?). What do you think? Perhaps the userland also matters.
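
A sketch of that experiment (the mirror URL is illustrative; linux-image-amd64 is the usual metapackage name, not something specific to our setup):

  # Enable jessie-backports and install the backported kernel metapackage:
  echo 'deb http://httpredir.debian.org/debian jessie-backports main' | sudo tee /etc/apt/sources.list.d/jessie-backports.list
  sudo apt-get update
  sudo apt-get -t jessie-backports install linux-image-amd64
  sudo reboot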

#32 Updated by intrigeri 2016-03-12 10:25:58

anonym wrote:
> I do not have the issue when both level-0 and level-1 are running Debian Sid, see Feature #6354#note-18. So if my kernel issue is somehow related to the kernel issue this ticket is about, perhaps everything will be good once Debian Stretch is out and lizard and the isobuilders migrate to it.
>
> An interesting experiment that we could do now would be installing the 4.4.0-1 kernel from jessie-backports on both lizard and the isotesters (and isobuilders?). What do you think? Perhaps the userland also matters.

Tried it, but sadly we can’t do that as long as we’re running Wheezy on the host system: “linux-image-4.3.0-0.bpo.1-amd64 : Breaks: udev (< 208-8~) but 175-7.2 is to be installed”. So we’re back to blocking on Feature #11178, as previously.

#33 Updated by intrigeri 2016-03-12 10:28:18

  • Subject changed from isotester1.lizard crashes when running the test suite with Jessie's kernel to ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel
  • QA Check deleted (Info Needed)

#34 Updated by intrigeri 2016-03-31 15:45:41

After upgrading the level-0 host to Jessie, this combination (quite expectedly) still fails:

  • level-0: Jessie userspace + kernel
  • level-1: Jessie userspace, 4.4 kernel from jessie-backports
  • level-2: Jessie userspace + kernel

But this works:

  • level-0: Jessie userspace, 4.4 kernel from jessie-backports
  • level-1: Jessie userspace, 4.4 kernel from jessie-backports
  • level-2: Jessie userspace + kernel

Thanks anonym for the lead!

#35 Updated by intrigeri 2016-03-31 18:09:09

  • Status changed from Confirmed to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 0 to 100

#36 Updated by intrigeri 2016-03-31 18:09:42

  • Deliverable for set to SponsorS_Internal