Bug #12354
Fix shutdown and memory wipe regressions on 3.0~betaN
100%
Description
We’ve been reported a number of regressions vs. 2.x on 3.0~beta1 and beta2: on shutdown, the kernel is kexec’ed but then either nothing else happens (blinking caps lock == kernel panic) or the system fails to shut down and leaves the user facing an initramfs prompt.
So:
- Do we see any cheap way to debug this? If not:
- Is it better to have an unreliable memory wiping feature, that leaves the system in a weird (and suspicious) state when it fails, or no such feature at all? In other words, do we want to optimize for the high-risk users who need this feature and got hardware where it is reliable? Or for everybody else? And is it OK to provide this feature (and then some users will rely on it) even though it doesn’t work reliably (and then some users will be bitten because they rely on it and today / on other hardware) it fails?
Note that Feature #12089 might be enough (see discussion on Feature #12107 and tails-dev) to erase most memory without any special “memory wipe on shutdown” process.
Subtasks
Feature #12397: Evaluate how kernel memory poisoning works for tmpfs | Resolved | 100 |
|||
Feature #12398: Evaluate how kernel memory poisoning works for disk cache | Resolved | 100 |
|||
Feature #12428: Ensure disk caches and aufs read-write branch are emptied during emergency shutdown | Resolved | 100 |
Related issues
Related to Tails - |
Resolved | 2016-12-27 | |
Related to Tails - |
Resolved | 2017-03-20 | |
Related to Tails - |
Resolved | 2014-07-19 | |
Related to Tails - |
Resolved | 2017-05-18 | |
Has duplicate Tails - |
Duplicate | 2016-09-08 | |
Blocked by Tails - |
Resolved | 2017-05-16 |
History
#1 Updated by intrigeri 2017-03-17 10:20:37
- related to
Feature #12089: Enable the kernel page allocator poisoning added
#2 Updated by intrigeri 2017-03-17 10:20:45
- related to
Feature #12107: Can PAX_MEMORY_SANITIZE replace memory erasure on shutdown? added
#3 Updated by intrigeri 2017-03-17 10:21:12
Next step is to research this a bit and answer the “Do we see any cheap way to debug this?” question.
#4 Updated by emmapeel 2017-03-18 10:09:19
I am not sure if is part of this ticket, but the Emergency Shutdown now acts quickly, but the screen stayed rendered for much longer time than Tails > 3.0.
Before I would only see the screen for 2 seconds at most, before it went black.
Now it stays there and after 4 seconds start to corrupt slowly, but you can still see pretty much everything on it.
Shall I open another ticket about it, or it fits the scope of this ticket?
#5 Updated by intrigeri 2017-03-18 10:30:27
> I am not sure if is part of this ticket, but the Emergency Shutdown now acts quickly, but the screen stayed rendered for much longer time than Tails > 3.0.
> Before I would only see the screen for 2 seconds at most, before it went black.
> Now it stays there and after 4 seconds start to corrupt slowly, but you can still see pretty much everything on it.
Can you please clarify what’s “the screen”?
#6 Updated by Anonymous 2017-03-18 11:35:16
intrigeri wrote:
> > I am not sure if is part of this ticket, but the Emergency Shutdown now acts quickly, but the screen stayed rendered for much longer time than Tails > 3.0.
>
> > Before I would only see the screen for 2 seconds at most, before it went black.
>
> > Now it stays there and after 4 seconds start to corrupt slowly, but you can still see pretty much everything on it.
>
> Can you please clarify what’s “the screen”?
I think she meant to say that she continues to see the Tails desktop for 2 seconds.
I also had the feeling that it was slower on the 3.0 beta 2 but it’s just a feeling.
#7 Updated by emmapeel 2017-03-18 11:38:45
u wrote:
> >
> > Can you please clarify what’s “the screen”?
>
> I think she meant to say that she continues to see the Tails desktop for 2 seconds.
>
Yes, the whole Desktop of my session stays there for long… while the screen goes corrupting slowly…
> I also had the feeling that it was slower on the 3.0 beta 2 but it’s just a feeling.
More than 2 seconds. 2 seconds or 3 without any change, and then some 4 seconds of slow corruption. Plenty of time to discover you were reading the ‘lesbian teen support’ webpage or similar.
#8 Updated by intrigeri 2017-03-18 15:51:15
> Yes, the whole Desktop of my session stays there for long… while the screen goes corrupting slowly…
OK, then please file this as a separate ticket: this one is about failure to shutdown (and possibly to actually wipe the memory).
#9 Updated by intrigeri 2017-03-18 16:13:32
- related to
Bug #11786: System often crashes during/after memory wipe since Linux 4.6 added
#10 Updated by anonym 2017-03-18 17:35:50
- Status changed from Confirmed to In Progress
- % Done changed from 0 to 10
- Type of work changed from Research to Discuss
From the description I’m extracting the two problems we have: On shutdown:
- the kernel panics when kexec:ing: no memory is wiped, and the system is powered so the memory is kept in a perfect state unlike if the system had shut down; the memory will quickly start to decay without power. In other words, in this failure mode our memory wipe results in something worse than a normal shutdown. :S
- the kernel kexec:ing works, but then the system fails to shut down: the userland memory is (probably) wiped, but not the memory reserved for the kernel.
In both cases the user is faced with a system that is still running, with weird stuff on the screen, and has to manually power off the computer.
Problem 1 is due to hardware compatibility, i.e. kexec is broken on some hardware. New kernel releases may fix it for some hardware, and may break it for others. We’ve seen it go both ways, and it’s safe to say that we neither have the expertise, time nor access to enough hardware to follow this closely and make sure each kernel version doesn’t regress. So it seems impossible that we can completely (= for all hardware) solve problem 1. Conclusion: it seems kexec-based memory wiping will always remain unreliable on some of our users’ hardware.
Problem 2 is more tractable for us. We have several ideas of how to proceed, e.g. someone could spend a day or two to try e.g. Bug #11786#note-60. We can probably solve it, but it will still take quite some time.
So, given that we will not be able to solve problem 1, is it even worth spending time on problem 2? At best we’ll only be able to make “guarantees” that memory wiping works on certain specific hardware that we list somewhere, that very serious “high risk” users might invest in, but users with $random hardware will be left out, and some of them will be hit by this annoying issue and the uncertainty of whether the wiping worked at all.
So we have three options:
- a) We live with problem 1 but fix problem 2.
- b) Same as a), but we also enable
Feature #12090(memory allocated to userspace is wiped whenfree()
:d) - c) We drop our kexec-based memory wiping, enable
Feature #12090and rely on that as our only active memory wiping measure.
(b) is quite obviously better than (a) except that we cannot automatically test our memory wiping feature any more (Feature #12090 makes this test impossible). So, really, (b) or ©? More detailed:
- b): crappy UX and less safety on some hardware, but more safety when it actually works.
- c): more polished UX, and reliable though imperfect memory wiping
Another way to look at it (read “>” as “is better than”):
- Safety: (b) when it works > © > (b) when it fails
- UX: (b) when it works == © > (b) when it fails
So, yeah, it’s a nasty trade-off we have to decide.
#11 Updated by alant 2017-03-19 14:22:50
7 tails contributors discussed this question while working on 3.0.
Conclusion: we would go for solution © after verifing that it not only wipes application memory but also tmpfs and disk cache. If © doesn’t fill these requirements, we’ll need to discuss again wether (b) is best or not.
Reason: a reliable safe solution is better that an unreliable safer solution. When (b) fails with a kernel panic (problem 1), the user ends up an a worse situation than without any memory wiping, as there is no realistic way to solve that in a reliable way. © wipes less kernel memory, but we will know what it wipes or not and have a reliable implementation.
#12 Updated by intrigeri 2017-03-19 18:21:00
> Conclusion: we would go for solution © after verifing that it not only wipes application memory but also tmpfs and disk cache.
I’ll try to write automated tests for these two things. It’ll probably happen during the 6th Stretch sprint (mid-May), unless I find it an exciting enough candidate for structured procrastination earlier ;)
#13 Updated by intrigeri 2017-03-19 22:04:29
- % Done changed from 10 to 0
- Type of work changed from Discuss to Research
To test tmpfs: start Tails, check the known pattern cannot be found, mount a tmpfs of decent size (no need to aim for the maximum), fill it with a known pattern, dump memory and check the known pattern can be found, umount the tmpfs, verify that the known pattern cannot be found anymore. If it still can, test shut down and/or drop caches.
To test disk cache:
- background:
- start Tails
- create+unlock+mount a ext4-on-LUKS filesystem
- test write buffers:
- drop caches
- check that the known pattern cannot be found in memory
- fill the FS with a known pattern
- dump memory to check the known pattern can be found
- test read cache:
- drop caches
- check that the known pattern cannot be found
- read the entire filesystem
- check that the known pattern can be found
- test umount (after each of previous tests):
- umount the filesystem
- check if the known pattern can be found
- if it can, drop caches, check again, and if this works better: consider dropping caches during the shutdown process
#14 Updated by intrigeri 2017-03-19 22:06:30
anonym, I’m under the impression that you’ll be about twice faster than me to implement the test cases designed in my previous comment. Interested? Excited? If not, reassign to me :)
#15 Updated by intrigeri 2017-03-20 16:52:41
- related to deleted (
)Feature #12107: Can PAX_MEMORY_SANITIZE replace memory erasure on shutdown?
#16 Updated by alant 2017-03-20 18:07:52
Procedures looks fine, but
> create+unlock+mount a ext4-on-LUKS filesystem
I think it would be good to make the same tests with vfat. This would test the common use case of reading/saving a file from/to an USB stick, e.g. to share it. Then I expect that once the stick is unplugged and Tails is shut down, the file is not accessible anymore.
#17 Updated by intrigeri 2017-03-21 08:26:48
> Procedures looks fine,
Great! Thanks :)
> but
>> create+unlock+mount a ext4-on-LUKS filesystem
> I think it would be good to make the same tests with vfat. This would test the common use case of reading/saving a file from/to an USB stick, e.g. to share it. Then I expect that once the stick is unplugged and Tails is shut down, the file is not accessible anymore.
I would expect that in most situations when traces of activity in RAM matter, the file will still be (in cleartext) on the USB stick that’s still pretty close, so wiping memory doesn’t change much. But whatever: it’s trivial to also test this once we have the other tests, so let’s do it just to be on the safe side!
#18 Updated by intrigeri 2017-04-05 13:37:04
- Feature Branch set to bugfix/12354-drop-kexec-memory-wipe
#19 Updated by intrigeri 2017-04-05 15:48:49
- Assignee changed from intrigeri to anonym
- QA Check set to Ready for QA
I have written automated tests (that pass successfully and reliably on my system) that, I believe, answer the few questions that were raised when we discussed this. Next steps:
- anonym reviews the branch, confirms that kernel poisoning is good enough for us, and merges it up to commit:0c8e076a5dac72cc6898ad9d3ce4156699ed2d06 into feature/stretch if happy;
- I’m going to file a ticket about dropping the kexec-based implementation.
#20 Updated by intrigeri 2017-04-05 15:53:52
- related to deleted (
)Bug #11786: System often crashes during/after memory wipe since Linux 4.6
#21 Updated by intrigeri 2017-04-05 15:53:59
- has duplicate
Bug #11786: System often crashes during/after memory wipe since Linux 4.6 added
#22 Updated by intrigeri 2017-04-05 15:55:57
intrigeri wrote:
I’m going to file a ticket about dropping the kexec-based implementation.
Actually not, this topic is already spread over too many tickets. I’ll just reuse this ticket to actually fix the issue it’s about, i.e. drop the kexec-based implementation. So please reassign to me once you’ve reviewed’n’merged the branch (that only adds automated tests and enables poisoning, but that’s a good start and useful in itself).
#23 Updated by intrigeri 2017-04-05 16:11:49
- Subject changed from What to do wrt. shutdown regressions on 3.0~betaN? to Fix shutdown and memory wipe regressions on 3.0~betaN
- Type of work changed from Research to Code
#24 Updated by anonym 2017-04-17 13:13:20
intrigeri wrote:
> I have written automated tests (that pass successfully and reliably on my system) that, I believe, answer the few questions that were raised when we discussed this. Next steps:
>
> # anonym reviews the branch, confirms that kernel poisoning is good enough for us, and merges it up to commit:0c8e076a5dac72cc6898ad9d3ce4156699ed2d06 into feature/stretch if happy;
Review + testing done! Great job, I didn’t find any blockers! I’ve pushed some minor nitpicky stuff on top, and other than that I only have these non-blocker remarks relating to re-usability:
def mount_USB_drive(disk, fs)
[...]
if /\bencrypted with password\b/.match(fs)
password = /encrypted with password "([^"]+)"/.match(fs)[1]
IMHO the fs
variable shouldn’t be human-readable text, but optional arguments (I recommend a Hash
so a call could look like mount_USB_drive(some_disk, filesystem: 'ext4', luks_passphrase: 'asdf')
) so it fits in more caller contexts.
When(/^I plug and mount a (\d+) MiB USB drive with an? (.*)$/) do |size_MiB, fs|
Here you can parameterize the size unit instead of fixing on MiB
. The same applies to several other steps you created.
#25 Updated by intrigeri 2017-04-17 13:16:25
- Assignee changed from anonym to intrigeri
- QA Check changed from Ready for QA to Dev Needed
> IMHO the fs
variable shouldn’t be human-readable text, but optional arguments (I recommend a Hash
so a call could look like mount_USB_drive(some_disk, filesystem: 'ext4', luks_passphrase: 'asdf')
) so it fits in more caller contexts.
> […]
> Here you can parameterize the size unit instead of fixing on MiB
. The same applies to several other steps you created.
Agreed for both. I’ll do that when I resume work on this branch.
#26 Updated by intrigeri 2017-04-17 17:04:46
intrigeri wrote:
> # anonym reviews the branch, confirms that kernel poisoning is good enough for us, and merges it up to commit:0c8e076a5dac72cc6898ad9d3ce4156699ed2d06 into feature/stretch if happy;
This happened.
#27 Updated by intrigeri 2017-04-18 15:04:56
- Target version changed from Tails_3.0 to Tails_3.0~rc1
#28 Updated by intrigeri 2017-05-18 07:29:09
>
> def mount_USB_drive(disk, fs)
> [...]
> if /\bencrypted with password\b/.match(fs)
> password = /encrypted with password "([^"]+)"/.match(fs)[1]
>
> IMHO the fs
variable shouldn’t be human-readable text, but optional arguments (I recommend a Hash
so a call could look like mount_USB_drive(some_disk, filesystem: 'ext4', luks_passphrase: 'asdf')
) so it fits in more caller contexts.
Full ACK, done locally.
>
> When(/^I plug and mount a (\d+) MiB USB drive with an? (.*)$/) do |size_MiB, fs|
>
> Here you can parameterize the size unit instead of fixing on MiB
. The same applies to several other steps you created.
IMO the additional code complexity it would introduce is not deserved here: the MiB unit is suitable, in these steps, for all use cases I can think of. I’d rather avoid make stuff generic “just in case”, as it adds some maintenance cost.
#29 Updated by intrigeri 2017-05-18 08:50:18
- related to
Bug #12393: Emergency shutdown leaves the Desktop on for much longer in Stretch added
#30 Updated by intrigeri 2017-05-18 09:08:17
- Assignee changed from intrigeri to anonym
- QA Check changed from Dev Needed to Ready for QA
#31 Updated by intrigeri 2017-05-18 09:10:51
- related to
Feature #5417: Improve user experience when wiping memory at shutdown added
#32 Updated by intrigeri 2017-05-18 09:57:55
- blocked by
Feature #12554: Bump APT snapshots for 3.0~rc1 added
#33 Updated by anonym 2017-05-18 15:02:19
- Assignee changed from anonym to intrigeri
- QA Check changed from Ready for QA to Dev Needed
Awesome work! I found no blockers (so feel free to merge this yourself as long as fixing my remarks below are tracked somehow and fixed before Tails 3.0) and I was hard-pressed to find anything to comment on, but here goes:
In wiki/src/contribute/design/memory_erasure.mdwn
:
> In order to protect against memory recovery such as cold boot attack,
> most of the system RAM is overwritten when Tails is being shutdown or when the
> boot medium is physically removed.
I guess we could add “and immediately after applications exit” or something to that effect? That is an improvement over the old kexec
-based approach when treated in isolation.
> #### The big picture
Can we put the story about the old design in another sub section, like “##### Obsolete kexec-based approach
”, and the current one in “##### Current memory poisoning-based approach
”? I.e. just insert these two headers ( and possibly adapt the edges of the affected paragraphs so the text still makes sense). I’d just like to make the history lesson more optional. :) Is it really needed for understanding the current design?
> Different kinds of events trigger the memory erasure process. All lead
> to run the `tails-kexec` shutdown script.
There’s no tails-kexec
any more.
> [[!tails_gitweb features/erase_memory.feature desc=“Automated tests”]]
> ensure that the most important parts of memory are erased this way.
Can we describe “the most important parts of memory” a bit better than this (not necessarily at this place in the design page, btw)? Or do we want the .feature
file to enumerate what we think is “important”? I really think this design page should list any limitations of the memory poisoning approach.
In features/step_definitions/common_steps.rb
:
> -def mount_USB_drive(disk, fs)
> +def mount_USB_drive(disk, fs_options = {})
> + fs_options[:encrypted] ||= false
Just a FYI: nil
is treated as false
when interpreted as a boolean, so this assignment is unnecessary. That said, I don’t really mind making the default more explicit, for readability.
In features/step_definitions/erase_memory.rb
:
> kernel_mem_reserved_k = 64*1024 # Duplicated in /usr/share/initramfs-tools/scripts/init-premount/memory_wipe
> […]
> admin_mem_reserved_k = 128*1024 # Duplicated in /usr/share/initramfs-tools/scripts/init-premount/memory_wipe
These don’t duplicate anything any longer.
In config/chroot_local-includes/usr/share/initramfs-tools/hooks/shutdown
:
> # systemd-shutdown itself
> mkdir -p $DESTDIR/lib/systemd
> copy_exec /lib/systemd/systemd-shutdown /shutdown
>
> # Our shutdown hook (run from inside the initramfs)
> mkdir -p $DESTDIR/lib/systemd/system-shutdown
> copy_file "regular file" \
> @ /usr/local/lib/initramfs-pre-shutdown-hook \@
> @ /lib/systemd/system-shutdown/tails@
To me it looks like you are following the common pattern of mkdir -p
:ing the target directories for the subsequent copy actions, but these directories are not used as the target (in fact they are not used at all). I’m just double-checking that these mkdir
calls still are relevant, and no residues from some Git history rewrite.
In features/emergency_shutdown.feature
:
> Then I find very few patterns in the guest’s memory
> And Tails eventually shuts down
This means that we will always wait the full two minutes (the “Happy dumping!” timeout) for each of these scenarios: there’s currently four, so that’s almost 2*4 = 8 minutes of needless waiting. I’m tempted to suggest that we add a scenario dedicated to make sure that the shutdown happens without any of the memory erasure checks, and that we simply remove the Tails eventually shuts down
step from these four scenarios. In fact, I think you should move these four scenario to erase_memory.feature
as well — I know it can be argued both ways, but I slightly prefer what I suggest (so no strong opinion here, just ignore this move if you disagree).
#34 Updated by intrigeri 2017-05-18 15:35:19
- related to
Bug #12560: Polish the new memory wiping implementation and design doc added
#35 Updated by intrigeri 2017-05-18 15:37:41
- Status changed from In Progress to Fix committed
Applied in changeset commit:5f588e526056ed6384659e7e04ef39edf0a93634.
#36 Updated by intrigeri 2017-05-18 15:39:42
- Assignee deleted (
intrigeri) - QA Check changed from Dev Needed to Pass
Follow-ups are tracked on Bug #12560. Thanks!
#37 Updated by intrigeri 2017-05-23 09:13:25
- Status changed from Fix committed to Resolved