Bug #9707

Jessie: System sometimes does not poweroff after memory erasure

Added by intrigeri 2015-07-08 06:37:01. Updated 2016-11-28 12:55:23.

Status:
Rejected
Priority:
High
Assignee:
Category:
Test suite
Target version:
Start date:
2015-07-08
Due date:
% Done:

0%

Feature Branch:
test/9707-power-off-after-memory-erasure
Type of work:
Research
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

Reproduced in VMs and via the “I shutdown Tails and wait for the computer to power off” test suite step: I see a bunch of stars (sdmem’s progress output), then it stops and nothing happens anymore.


Subtasks


Related issues

Related to Tails - Bug #10487: Improve VM and OOM settings for erasing memory Resolved 2015-11-05
Related to Tails - Bug #11786: System often crashes during/after memory wipe since Linux 4.6 Duplicate 2016-09-08
Blocked by Tails - Bug #10733: Run our initramfs memory erasure hook earlier Resolved 2015-12-09

History

#1 Updated by romeopapa 2015-08-10 05:59:37

I have failed to reproduce this in VirtualBox 4.3.18_Debian.

#2 Updated by intrigeri 2015-08-10 06:49:02

> I have failed to reproduce this in VirtualBox 4.3.18_Debian.

With an ISO built from which branch/commit, exactly?

#3 Updated by romeopapa 2015-08-17 02:59:23

intrigeri wrote:
> With an ISO built from which branch/commit, exactly?

Sorry, hadn’t seen your reply.

7 days ago, when I first replied to this ticket, I had tested on a VM with my very own jessie build (it must have been f9530ea316716d117e7660045b694428be9700c9 of 2015-08-09 22:33:10).

I have tested again last night with nightly jessie 20150816T0408Z-e01d609 (yesterday’s): on a VBox VM, and on 2 laptops (an asus without EFI, a samsung ultrabook with EFI).

intrigeri, do you still see occurrences of this bug? Does it happen systematically? I haven’t seen it once since I’ve been looking for it. I might have said on IRC that I had already seen this bug in the wild, but now I’m wondering whether I perhaps dreamed it…

#4 Updated by intrigeri 2015-08-19 08:48:22

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

I’ve not seen it recently. It might have been a bug in my QEMU/libvirt/seabios/whatever. If nobody reproduces it within 1-2 months, I’ll close this ticket. Thanks!

#5 Updated by intrigeri 2015-09-17 00:23:55

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 20 to 100

Not seen it recently, closing.

#6 Updated by intrigeri 2015-12-07 14:09:44

  • Category set to Test suite
  • Status changed from Resolved to Confirmed
  • Assignee set to intrigeri
  • % Done changed from 100 to 0

Sadly I have to reopen this ticket, as I’ve seen it regularly when running the test suite against feature/jessie ISO images. When it happens I see “VM is still running after 600 seconds”. QEMU is eating a CPU core. The system under test’s screen is full of stars, but there is no progress.

IMO what’s really blocking 2.0 is the part that blocks the test suite update (Feature #7563): we can live with this bug, so a test-suite-specific workaround would be good enough, hence the category I’m setting. Now, it might be that a real fix is easier to find & implement than an ugly workaround :)

#7 Updated by intrigeri 2015-12-07 14:10:33

  • blocks Feature #7563: Update the automated test suite for Jessie ISO images added

#8 Updated by intrigeri 2015-12-07 14:24:56

Note that the test suite goes on (at least when it was rebooting after creating a persistent volume) if I kill the QEMU process at that point.

#9 Updated by intrigeri 2015-12-07 17:43:10

  • related to Bug #10487: Improve VM and OOM settings for erasing memory added

#10 Updated by intrigeri 2015-12-07 20:01:13

The (_) cursor after the last star character is still blinking… but very slowly (cycle period is ~7 seconds). Ideas to better understand what’s going on:

  • check if the never submitted stuff from feature/faster_sdmem_on_shutdown might help
  • raise the timeout (currently = 10 minutes): the display stays just the same (including the blinking cursor), just longer
  • ask libvirt/QEMU to send a key combination to the guest, so as to ask the guest kernel via SysRq what’s going on (see the sketch after this list)
  • try the QEMU console
  • dump the guest’s memory to see how much was erased (also sketched below)
  • remove “the running Tails is installed on USB drive” from the usb-install-tails-greeter checkpoint: saves ~2 minutes on all scenarios that install Tails to USB, which is totally welcome when running them repeatedly without keeping snapshots; but doesn’t help with the problem at hand
  • check that our hook runs at the right time (init-premount) compared to e.g. udev
  • bypass memory erasure once we’ve kexec’ed, unless specifically testing memory erasure; and add a failing test case reproducing this bug; at least this removes this one blocker for running all other USB tests
  • giving the guest 2 vCPUs doesn’t help
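A minimal sketch of the two host-side ideas above (SysRq and a memory dump), assuming a libvirt guest; the domain name TailsToaster is an assumption based on the test suite’s naming, and SysRq must be enabled in the guest kernel:

  # Hypothetical domain name; adjust to the actual guest.
  DOMAIN=TailsToaster

  # Alt+SysRq+t makes the guest kernel dump all task states to its console;
  # Alt+SysRq+m dumps its memory usage information.
  virsh send-key "$DOMAIN" --codeset linux KEY_LEFTALT KEY_SYSRQ KEY_T
  virsh send-key "$DOMAIN" --codeset linux KEY_LEFTALT KEY_SYSRQ KEY_M

  # Dump the guest's memory to a file on the host, to inspect how much of
  # it sdmem actually overwrote before the apparent freeze.
  virsh dump "$DOMAIN" /tmp/TailsToaster.mem --memory-only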

#11 Updated by intrigeri 2015-12-08 08:47:57

FWIW, the machine where I can reproduce this is running Jessie, and has TailsToasterStorage on ext4 / dm-crypt / SSD. I can’t reproduce this on another machine that’s running sid, with storage in a tmpfs. (scratch that: I’ve eventually seen one such failure on that other — and faster — machine when the system was under heavy load and swapping)

#12 Updated by anonym 2015-12-08 17:22:04

I have seen it when running the test suite on a baremetal i7-3520M system, also using “ext4 / dm-crypt / SSD” and Debian Sid.

#13 Updated by intrigeri 2015-12-09 06:21:04

  • related to Bug #10733: Run our initramfs memory erasure hook earlier added

#14 Updated by intrigeri 2015-12-09 06:21:45

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

Status: fine-tuning stuff for Bug #10487, which fixes this problem.

#15 Updated by intrigeri 2015-12-09 13:04:14

  • Assignee changed from intrigeri to anonym
  • % Done changed from 20 to 50
  • QA Check set to Ready for QA
  • Feature Branch set to bugfix/10487-improve-VM-and-OOM-settings-for-erasing-memory

Please review, test and merge into feature/jessie.

#16 Updated by intrigeri 2015-12-16 06:55:25

  • Priority changed from Elevated to High

#17 Updated by anonym 2015-12-18 04:13:47

  • Status changed from In Progress to Fix committed
  • % Done changed from 50 to 100

Applied in changeset commit:cd53d12c651c3713289d983e2c09551f8c14a898.

#18 Updated by intrigeri 2015-12-18 10:15:12

  • Assignee deleted (anonym)
  • QA Check changed from Ready for QA to Pass

#19 Updated by anonym 2016-01-03 00:25:03

  • Status changed from Fix committed to Confirmed
  • Assignee set to intrigeri
  • % Done changed from 100 to 0
  • Feature Branch deleted (bugfix/10487-improve-VM-and-OOM-settings-for-erasing-memory)

I’ve seen this again now, while running the automated test suite. I suppose the previous branch (bugfix/10487-improve-VM-and-OOM-settings-for-erasing-memory) at least improved the situation, or it was just coincidental that I saw no occurrences of it for quite a while during my tests of the branch.

To be very specific, what I see now is a screen filled with sdmem’s asterisks, the cursor is blinking (although veeeeery slowly), and the CPU is used at 100%. This reminds me slightly of when running fillram in erase_memory.feature without some of the kernel tweaks — when the memory is essentially 100% filled up, the system becomes unusable, with the VM’s CPU at 100%. Possibly the memory has become so scarce that the kernel cannot function properly? Perhaps vm.admin_reserve_kbytes and/or vm.min_free_kbytes needs to be bumped above their defaults like we currently do in erase_memory.feature?
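For reference, a minimal sketch of the kind of sysctl bump discussed here; the values below are placeholders, not the ones actually used in erase_memory.feature:

  # Keep more memory reserved for the kernel and for root-owned processes,
  # so that the kernel and sdmem can keep making progress even when almost
  # all memory has been overwritten. Placeholder values only.
  sysctl -w vm.min_free_kbytes=65536
  sysctl -w vm.admin_reserve_kbytes=131072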

#20 Updated by intrigeri 2016-01-03 15:35:40

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • QA Check changed from Pass to Dev Needed

(Please… at least on my test hardware, the branch for Bug #10487 made a huge difference :)

OK, back to the drawing board!

#21 Updated by intrigeri 2016-01-03 15:59:10

> To be very specific, what I see now is a screen filled with sdmem’s asterisks, the cursor is blinking (although veeeeery slowly), and the CPU is used at 100%.

This perfectly matches what I saw myself (Bug #9707#note-6, Bug #9707#note-10).

When you say “the CPU is used at 100%”, I guess you mean that QEMU is using a full CPU core, right? (Given we have no means at that point to know what’s happening inside the VM.)

Did you see that on the 32-bit or 64-bit kernel? (Not sure it’s relevant, but these tests are slow so if I need to run them hundreds of times to get stats, better focus on the cases that are known to expose the bug.)

Meta: I still stand behind what I wrote a month ago, that is “IMO what’s really blocking 2.0 is the part that blocks the test suite update (Feature #7563): we can live with this bug, so a test-suite-specific workaround would be good enough, hence the category I’m setting. Now, it might be that a real fix is easier to find & implement than an ugly workaround :)”

> Possibly the memory has become so scarce that the kernel cannot function properly? Perhaps vm.admin_reserve_kbytes and/or vm.min_free_kbytes needs to be bumped above their defaults like we currently do in erase_memory.feature?

I would guess the same. Now, I don’t know what’s best:

  • optimizing for erasing more memory, with the increased risk that occasionally the system ends up freezing and never powers off (this is the current situation on the devel branch);
  • optimizing for clean shutdown after memory erasure, with the increased risk of leaving important data unerased in memory.

A potentially relevant data point is that on most EFI systems I’ve seen, we currently don’t shut down properly after erasing memory. So on current hardware, as long as that other bug isn’t fixed, we can as well optimize for erasing memory more completely, no?

Thoughts, opinions?

Meanwhile, I’m going to focus on the test suite side of things:

  1. try and reproduce the bug, so I have baseline data to compare candidate changes with; I’m assuming I will finally manage to reproduce it at all, and I hope I’m right;
  2. have the test suite raise vm.admin_reserve_kbytes and vm.min_free_kbytes, in the kexec’ed system, the exact same way it does in the regular system, and measure if it makes any difference wrt. this robustness problem. I dunno if it’s acceptable that the test suite tests something slightly different than what happens normally, though — it feels risky. Perhaps it would be OK to tweak things this way in most cases, as long as we add one (fragile) test case for the real, unmodified memory erasure process.

#22 Updated by anonym 2016-01-03 16:32:17

intrigeri wrote:
> When you say “the CPU is used at 100%”, I guess you mean that QEMU is using a full CPU core, right? (Given we have no means at that point to know what’s happening inside the VM.)

Yes!

> Did you see that on the 32-bit or 64-bit kernel? (Not sure it’s relevant, but these tests are slow so if I need to run them hundreds of times to get stats, better focus on the cases that are known to expose the bug.)

This was with the default (in the test suite) 64-bit kernel (during the “Cat:ing a Tails isohybrid ...” scenario, at the “I shutdown Tails and wait for the computer to power off” step).

> Meta: I still stand behind what I wrote a month ago, that is “IMO what’s really blocking 2.0 is the part that blocks the test suite update (Feature #7563): we can live with this bug, so a test-suite-specific workaround would be good enough, hence the category I’m setting. Now, it might be that a real fix is easier to find & implement than an ugly workaround :)”

Sure. I hope frontdesk won’t suffer too much from it.

> > Possibly the memory has become so scarce that the kernel cannot function properly? Perhaps vm.admin_reserve_kbytes and/or vm.min_free_kbytes needs to be bumped above their defaults like we currently do in erase_memory.feature?
>
> I would guess the same. Now, I don’t know what’s best:
>
> * optimizing for erasing more memory, with the increased risk that occasionally the system ends up freezing and never powers off (this is the current situation on the devel branch);
> * optimizing for clean shutdown after memory erasure, with the increased risk of leaving important data unerased in memory.

I suppose we should optimize for maximal memory erasure.

> A potentially relevant data point is that on most EFI systems I’ve seen, we currently don’t shut down properly after erasing memory. So on current hardware, as long as that other bug isn’t fixed, we can as well optimize for erasing memory more completely, no?

True.

> Thoughts, opinions?
>
> Meanwhile, I’m going to focus on the test suite side of things:
>
> # try and reproduce the bug, so I have baseline data to compare candidate changes with; I’m assuming I will finally manage to reproduce it at all, and I hope I’m right;

@product
Feature: foo
  Scenario: bar
    Given I have started Tails from DVD without network and logged in
    When I shutdown Tails and wait for the computer to power off

But who knows, maybe some extra conditions are necessary? In my most recent occurrence, for instance, Tails had booted from an (isohybrid) USB drive. I doubt it, though.

> # have the test suite raise vm.admin_reserve_kbytes and vm.min_free_kbytes, in the kexec’ed system, the exact same way it does in the regular system, and measure if it makes any difference wrt. this robustness problem. I dunno if it’s acceptable that the test suite tests something slightly different than what happens normally, though — it feels risky. Perhaps it would be OK to tweak things this way in most cases, as long as we add one (fragile) test case for the real, unmodified memory erasure process.

To do it the “exact” same way requires starting the remote shell after kexec:ing, of course. And possibly some way to prevent the wiping from starting until you have applied the necessary kernel tweaks via the remote shell. Also, I would suggest --pause-on-fail — on the off chance that the remote shell remains responsive during the “freeze”, it would be extremely useful to interactively debug the situation with vm-execute.
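As a hedged sketch, such a debugging run could look like the following; the exact run_test_suite invocation is an assumption about the wrapper’s interface, while the flags themselves (--view, --pause-on-fail) are the ones mentioned in this discussion:

  # Watch the guest while the suite runs, and pause on failure so the frozen
  # VM can be inspected; if the remote shell still responds at that point,
  # commands can be run in the guest with the vm-execute helper.
  ./run_test_suite --view --pause-on-fail features/erase_memory.feature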

#23 Updated by intrigeri 2016-01-03 17:03:59

> To do it the “exact” same way requires starting the remote shell after kexec:ing, of course.

I thought I would instead piggy-back on the debug=wipemem code paths.

I don’t want to bother putting the remote shell and all its deps into the initramfs.

#24 Updated by intrigeri 2016-01-04 18:08:44

I’ve run that:

  Scenario: Memory erasure on a modern computer
    Given a computer
    And the computer is a modern 64-bit system
    And the computer has 8 GiB of RAM
    And I set Tails to boot with options "debug=wipemem"
    And I start Tails from DVD with network unplugged and I login
    Then the PAE kernel is running
    And at least 8 GiB of RAM was detected
    And process "memlockd" is running
    And process "udev-watchdog" is running
    And udev-watchdog is monitoring the correct device
    And I shutdown and wait for Tails to finish wiping the memory

… 500 times and didn’t see this bug. I guess I’ll retry with a scenario that involves booting from USB.
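For what it’s worth, such repeated runs can be scripted along these lines; the wrapper invocation and feature path are placeholders, adjust to the actual setup:

  # Run one feature N times in a row and count how many runs fail.
  N=500
  failures=0
  for i in $(seq 1 "$N"); do
      ./run_test_suite features/erase_memory.feature || failures=$((failures + 1))
  done
  echo "$failures failures out of $N runs"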

#25 Updated by intrigeri 2016-01-05 01:46:53

And now I’ve run:

@product
Feature: System memory erasure on shutdown

  Scenario: Test 1
    Given I have started Tails without network from a USB drive without a persistent partition and stopped at Tails Greeter's login screen  
    And I log in to a new session
    When I shutdown Tails and wait for the computer to power off

… 100 times, and no problem either :(

#26 Updated by anonym 2016-01-06 22:23:48

I saw it again now, when creating the

I have started Tails without network from a USB drive without a persistent partition and stopped at Tails Greeter's login screen


checkpoint, so, again, when running from USB.

I also noticed that if I take over the VM in virt-viewer, then the cursor starts blinking with a normal frequency again (it was slow in the vnc viewer window provided by --view). CPU is still stuck at 100% usage. I have no idea if this is relevant.
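Taking over the VM like this can be done with something along these lines; the domain name TailsToaster and the qemu:///system URI are assumptions about the test suite’s setup:

  # Attach a graphical console to the running guest.
  virt-viewer --connect qemu:///system TailsToaster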

#27 Updated by anonym 2016-01-06 22:25:58

BTW, these 500 + 100 runs you did, was that with a special IAO? Or were you just trying to reproduce the bug?

#28 Updated by intrigeri 2016-01-07 00:39:59

> BTW, these 500 + 100 runs you did, was that with a special IAO?

IAO?

> Or were you just trying to reproduce the bug?

Yes.

#29 Updated by anonym 2016-01-07 00:42:54

intrigeri wrote:
> > BTW, these 500 + 100 runs you did, was that with a special IAO?
>
> IAO?

ISO. Sorry! :)

> > Or were you just trying to reproduce the bug?
>
> Yes.

Ok. If you ever prepare a branch with the proposed debugging stuff, let me know and I’ll run it in parallel with you. Perhaps my setup makes it more likely to trigger, for some reason.

#30 Updated by intrigeri 2016-01-09 18:42:46

  • related to Bug #10776: Step "I shutdown and wait for Tails to finish wiping the memory" fails when memory wiping causes a freeze added

#31 Updated by intrigeri 2016-01-09 19:14:53

  • Feature Branch set to test/9707-power-off-after-memory-erasure

#32 Updated by intrigeri 2016-01-09 20:24:43

  • Assignee changed from intrigeri to anonym
  • % Done changed from 10 to 20
  • QA Check changed from Dev Needed to Ready for QA

Would this help?

#33 Updated by anonym 2016-01-23 00:37:39

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Ready for QA to Dev Needed

I started a run with the scenario from Bug #9707#note-25 repeated two hundred times in a single feature, and the bug already appeared during the sixth scenario. Then I did a second run, and all 200 scenarios passed. Also, I haven’t seen this issue since I last reported it in this thread (19 days ago).

I am not sure what to make of this. But the problem remains, however more, equally, or less frequent it may now be.

#34 Updated by intrigeri 2016-01-23 12:24:23

  • Target version changed from Tails_2.0 to Tails_2.2

#35 Updated by intrigeri 2016-02-05 13:35:40

  • blocked by deleted (Feature #7563: Update the automated test suite for Jessie ISO images)

#36 Updated by intrigeri 2016-03-08 13:45:40

  • Target version changed from Tails_2.2 to Tails_2.3

#37 Updated by intrigeri 2016-04-16 14:59:48

  • Target version changed from Tails_2.3 to Tails_2.4

#38 Updated by intrigeri 2016-04-29 13:57:25

  • Target version changed from Tails_2.4 to Tails_2.5

#39 Updated by intrigeri 2016-06-01 12:49:53

  • Assignee changed from intrigeri to anonym
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:
> Meanwhile, I’m going to focus on the test suite side of things:
>
> # try and reproduce the bug, so I have baseline data to compare candidate changes with; I’m assuming I will finally manage to reproduce it at all, and I hope I’m right;

Last time I tried, I could not reproduce the bug, but anonym could easily do it. So it feels safer to do it this way: anonym, can you reproduce it with the updated branch?

> # have the test suite raise vm.admin_reserve_kbytes and vm.min_free_kbytes, in the kexec’ed system

Done on the topic branch. Pushed, but the two affected scenarios are fragile so Jenkins won’t run them (and besides, I don’t recall seeing this issue on Jenkins anyway).

#40 Updated by intrigeri 2016-07-13 12:37:26

  • Target version changed from Tails_2.5 to Tails_2.7

#41 Updated by intrigeri 2016-07-19 03:14:47

I say let’s do Bug #10733 first.

#42 Updated by intrigeri 2016-07-19 10:23:21

Merged stable and pushed, so that we have Jenkins data once we resume working on it.

#43 Updated by intrigeri 2016-07-30 11:59:02

  • related to deleted (Bug #10733: Run our initramfs memory erasure hook earlier)

#44 Updated by intrigeri 2016-07-30 11:59:14

  • blocked by Bug #10733: Run our initramfs memory erasure hook earlier added

#45 Updated by anonym 2016-09-28 04:21:38

So the problems I had with this branch were from before Tails 2.0 was released, 9 months ago. I haven’t had these problems for a long time, so I guess we are good, and can just drop the new commit on this branch. Or, perhaps it is somehow related to Bug #11786 (Memory wipe fails more often since Linux 4.6), just a new way for it to manifest? That one I can easily reproduce, so let me see if I can reproduce it with an image built with this new commit.

#46 Updated by anonym 2016-09-28 07:47:43

  • Assignee changed from anonym to intrigeri

I ran this scenario 100 times both for an image built from this branch, and for Tails 2.6:

  Scenario: bar
    Given I have started Tails from DVD without network and logged in
    When I execute "poweroff" as root
    Then the computer shuts down within 60 seconds


and I got 11 and 13 failures respectively (all of them stop at the BusyBox shell). So, IMHO, let’s skip this branch since it doesn’t seem to improve things, and then close this ticket in favor of Bug #11786. Agreed?

#47 Updated by anonym 2016-09-28 07:48:10

  • related to Bug #11786: System often crashes during/after memory wipe since Linux 4.6 added

#48 Updated by intrigeri 2016-10-02 10:28:02

> So the problems I had with this branch were from before Tails 2.0 was released, 9 months ago. I haven’t had these problems for a long time, so I guess we are good,

Great!

> and can just drop the new commit on this branch.

Well, I’m unsure: what this branch does is basically synchronize our memory erasure settings with the ones we use in the test suite while filling memory. I think that still makes sense, even though it does not immediately fix a visible problem. What do you think?

#49 Updated by intrigeri 2016-10-02 10:28:42

  • Assignee changed from intrigeri to anonym
  • Target version changed from Tails_2.7 to Tails_2.9.1
  • QA Check changed from Info Needed to Ready for QA

> I ran this scenario 100 times both for an image built from this branch, and for Tails 2.6:
> […]
> and I got 11 and 13 failures respectively (all of them stop at the BusyBox shell).
> So, IMHO, let’s skip this branch since it doesn’t seem to improve things, and then
> close this ticket in favor of Bug #11786. Agreed?

Thanks for testing.

The fact that this branch does not fix what looks like a different problem (Bug #11786, which has different symptoms, i.e. the poweroff command fails after erasing memory, compared to a blinking cursor while apparently still erasing memory) is not enough to convince me that it’s entirely irrelevant. As said in my previous comment, I think this branch still makes sense.

So, I propose that we:

  • merge this branch for 2.8, on the grounds that the changes it brings make sense
  • close this ticket, on the grounds that we can’t reproduce the bug anymore, but not because the branch would fix it (nothing proves that)

What do you think?

#50 Updated by anonym 2016-11-02 13:40:01

  • related to deleted (Bug #10776: Step "I shutdown and wait for Tails to finish wiping the memory" fails when memory wiping causes a freeze)

#51 Updated by anonym 2016-11-28 12:55:23

  • Status changed from In Progress to Rejected
  • Assignee deleted (anonym)
  • % Done changed from 20 to 0
  • QA Check deleted (Ready for QA)

Ok, I agree, and have merged the branch.

So, to be clear: I’m rejecting this ticket on the grounds that we cannot reproduce the original problem, and Bug #11786 (which still is a problem) is considered unrelated.