Bug #17104

"Erasure of memory freed by killed userspace processes" test scenario regression caused by the ugprade to Linux 5.2.0-3

Added by intrigeri 2019-09-29 12:56:58. Updated 2019-10-21 11:44:42.

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Target version:
Start date:
Due date:
% Done:

100%

Feature Branch:
bugfix/17104-memory-erasure-scenario-fragile
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

See Bug #17100.


Subtasks


Related issues

Related to Tails - Bug #17100: devel branch FTBFS: linux-image-5.2.0-2-amd64 not available Resolved
Related to Tails - Bug #17117: Upgrade to Linux 5.3 Resolved
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

History

#1 Updated by intrigeri 2019-09-29 12:57:14

#2 Updated by intrigeri 2019-09-29 12:57:20

  • related to Bug #17100: devel branch FTBFS: linux-image-5.2.0-2-amd64 not available added

#3 Updated by intrigeri 2019-10-03 08:37:53

  • related to Bug #17117: Upgrade to Linux 5.3 added

#4 Updated by intrigeri 2019-10-03 08:52:38

  • Assignee set to intrigeri

Will investigate via Bug #17117 whether more recent kernels are affected.

#5 Updated by intrigeri 2019-10-04 04:43:32

In the last 8 test suite runs on devel, this problem happened 5 times.
In the first 2 test suite runs with Linux 5.3-rc5, this problem did not happen.
This makes me hopeful, but that’s not enough data points to draw conclusions yet ⇒ I’ll keep my fingers crossed and keep an eye on this while working on Bug #17117.

#6 Updated by intrigeri 2019-10-05 07:59:54

  • Assignee deleted (intrigeri)

Unfortunately, Bug #17117 does not fix that problem: it still happens in most of the recent test suite runs for that branch on Jenkins.

#7 Updated by segfault 2019-10-06 10:52:56

  • Status changed from Confirmed to In Progress

Applied in changeset commit:tails|61bed2146c5e2739a90424109e90282bf5449cb9.

#8 Updated by intrigeri 2019-10-07 06:13:53

  • Feature Branch set to bugfix/17104-memory-erasure-scenario-fragile

@segfault, oh oh, interesting! :)

#9 Updated by segfault 2019-10-07 09:19:00

intrigeri wrote:
> @segfault, oh oh, interesting! :)

Not really, I just increased the memory constants as a wild guess. It didn’t work: in both test runs for this commit, the scenario failed again. I will delete the commit from the feature branch.

And I still don’t have a working setup to run the test suite locally, so I think I’m not in a good position to work on this.

#10 Updated by anonym 2019-10-07 10:48:05

  • Assignee set to anonym

Taking a look…

#11 Updated by anonym 2019-10-07 12:45:01

  • Assignee deleted (anonym)

So the problem reproduces in my setup with about the same frequency as on Jenkins, good (historically this has not been the case for me, which has made debugging a nightmare)!

I tried playing with the values we set in the “I prepare Tails for memory erasure tests” step:

  • Dropping all of them ⇒ same crash
  • Bumping {kernel,admin}_mem_reserved_k by various amounts ⇒ same crash
  • Trying things we used to do before, like vm.overcommit_memory = 2 with various values of vm.overcommit_ratio (see the sketch below) ⇒ the remote shell server crashes because it fails to allocate memory or to start a subprocess
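
For the record, roughly what that last overcommit experiment amounts to on the system under test (a minimal sketch; the ratio below is illustrative, not necessarily a value I tried):

    # Strict overcommit accounting: allocations fail once the commit limit
    # (swap + overcommit_ratio% of RAM) is reached, instead of the heuristic default.
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80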

So no luck. I also tried bumping the timeout for fillram to an hour, just to see whether the deadlock would eventually resolve itself, but it didn’t: the VM seems to have crashed completely.


Taking a step back, do we really have to attempt to fill all RAM and thus involve the OOM killer? It seems to me that it would be enough to fill a portion of free RAM, leaving a good safety margin to avoid these crashes (which we’ve had problems with before). In fact, this is effectively what the filesystem cache tests do. Thoughts?

#12 Updated by anonym 2019-10-07 14:59:31

I just noticed that this scenario lacks an “anti-test”, and I can see how it would be hard to implement: the point at which you want to dump the memory and verify that the patterns are there is right before the OOM killer kills fillram, which is pretty hard to pinpoint, to say the least. :) Still, it would inspire confidence in our testing method to have such a test!

Luckily, if we do a partial fill, as I suggest above, we retain full control: we can easily make the partial filler signal when it is done (e.g. by creating a file that we wait for via the remote shell), so we know when to dump the memory and verify that the pattern is there; then we kill the filler ourselves, after which we dump the memory again and verify that the pattern is gone.

So I am just wondering whether I am missing something important about filling all of RAM specifically. Back when Tails was 32-bit (or mixed with 64-bit), there was some uncertainty around the 4 GiB limit, which motivated us to fill a region larger than 4 GiB; without much thought, we might have just gone ahead and filled all of it. But, AFAICT, if we went with the approach I describe above and allocated >4 GiB, the test would still have done what we wanted.

Also note that in today’s 64-bit world we have no reason to allocate >4 GiB, so we could go with 128 MiB, as the other tests (arbitrarily?) do.
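
Something along these lines, say (a rough sketch only; the pattern string, size, and path are illustrative, not what we would actually ship):

    # Fill a fixed amount of RAM with a known pattern, signal readiness,
    # then stay alive until the test suite kills this process.
    pattern=$(yes wipe_didnt_work | head -c 128M)  # ~128 MiB of pattern held in this process' memory
    touch /tmp/partial_fill_done                   # file the test suite waits for via the remote shell
    sleep infinity                                 # keep the pattern resident until we are killed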

#13 Updated by intrigeri 2019-10-09 15:10:51

Hi @anonym,

> I just noticed that this scenario lacks an “anti-test”, and I can see how it would be hard to implement: the point at which you want to dump the memory and verify that the patterns are there is right before the OOM killer kills fillram, which is pretty hard to pinpoint, to say the least. :) Still, it would inspire confidence in our testing method to have such a test!

Right, there’s a possibility that the fillram instances fail for unrelated reasons, and then we’re testing exactly nothing here.

> Luckily, if we do a partial fill, as I suggest above, we retain full control: we can easily make the partial filler signal when it is done (e.g. by creating a file that we wait for via the remote shell), so we know when to dump the memory and verify that the pattern is there; then we kill the filler ourselves, after which we dump the memory again and verify that the pattern is gone.

> So I am just wondering whether I am missing something important about filling all of RAM specifically. Back when Tails was 32-bit (or mixed with 64-bit), there was some uncertainty around the 4 GiB limit, which motivated us to fill a region larger than 4 GiB; without much thought, we might have just gone ahead and filled all of it. But, AFAICT, if we went with the approach I describe above and allocated >4 GiB, the test would still have done what we wanted.

> Also note that in today’s 64-bit world we have no reason to allocate >4 GiB, so we could go with 128 MiB, as the other tests (arbitrarily?) do.

This proposal looks entirely good enough to me. And of course, it should solve this ticket :)

#14 Updated by anonym 2019-10-10 09:22:17

  • Assignee set to anonym

Then I’ll implement my proposal!

#15 Updated by anonym 2019-10-10 11:46:55

  • Status changed from In Progress to Needs Validation
  • % Done changed from 0 to 50

Let’s see what Jenkins thinks!

#16 Updated by anonym 2019-10-14 12:20:22

  • Assignee deleted (anonym)
  • Type of work changed from Research to Code

Out of six runs, four were failure-free, and the other two had only unrelated failures. Looks good to me!

#17 Updated by intrigeri 2019-10-14 12:31:43

  • Assignee set to intrigeri

> Out of six runs, four were failure-free, and the other two had only unrelated failures. Looks good to me!

Ooh yeah. I’ll try to look into it today. Failing that, it’ll be on Thursday or Friday, so if someone wants to take this review on Tue or Wed, that’s totally fine by me.

#18 Updated by intrigeri 2019-10-14 14:28:17

  • Assignee changed from intrigeri to anonym

Neat!

Just one nit: any good reason to use pgrep instead of $vm.has_process??
If there’s one, please merge into testing → devel yourself :)

#19 Updated by anonym 2019-10-14 14:49:40

  • Status changed from Needs Validation to Fix committed
  • Assignee deleted (anonym)
  • % Done changed from 50 to 100

intrigeri wrote:
> Just one nit: any good reason to use pgrep instead of $vm.has_process??

I tried $vm.has_process?, which is based on pidof -x, but it doesn’t match against the arguments (i.e. the script name used to identify the process) at all. I didn’t dare switch the implementation of $vm.has_process?.
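
To illustrate the difference (the process name is made up):

    pgrep -f fill_ram    # -f matches the pattern against the full command line, arguments included
    pidof -x fill_ram    # matches the command/script name only, which is not enough in our case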

I actually considered making the remote shell record the PID of the last executed command and store it in the RemoteShell::ShellCommand object that is returned (like $! in bash), but went with this simpler solution in the end.
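
(For reference, the bash behaviour I’m alluding to, with a made-up script name:)

    ./fill_ram &     # start the filler in the background
    filler_pid=$!    # $! holds the PID of the most recently started background job
    kill "$filler_pid"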

> If there’s one, please merge into testing → devel yourself :)

Ack, I’ll merge this and close this ticket, but if you think it’s worth spending time on any of the above, let me know and I’ll file a ticket.

#20 Updated by intrigeri 2019-10-14 15:30:15

Hi @anonym,

> I tried $vm.has_process?, which is based on pidof -x, but it doesn’t match against the arguments (i.e. the script name used to identify the process) at all. I didn’t dare switch the implementation of $vm.has_process?.

OK!

> I actually considered making the remote shell record the PID of the last executed command and store it in the RemoteShell::ShellCommand object that is returned (like $! in bash), but went with this simpler solution in the end.

In passing, mostly for your curiosity: I played a bit with systemd-run a few weeks ago for other reasons (finding a way to avoid having to reboot isotesters between test suite runs, which would be desirable for a number of reasons). I think it would provide exactly what we want here: a way to reliably monitor and manage a process’s life cycle (we can give it a unique identifier with --unit; --collect might be useful too). It’s probably not worth the effort in this specific case, where we already have something simple that works well, but now that I’ve re-discovered this tool, I’m pretty sure I’ll find cases where it’s exactly the one we need :)
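
For example, something like this (the unit name and command are made up, just to sketch the idea):

    systemd-run --unit=fillram-test --collect /usr/local/sbin/fill_ram  # run under a named transient service unit
    systemctl status fillram-test.service                               # reliably monitor its life cycle
    systemctl stop fillram-test.service                                 # reliably terminate it, cgroup and all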

> if you think it’s worth spending time on any of the above, let me know and I’ll file a ticket.

Let’s not bother IMO. Good enough is good enough!

#21 Updated by intrigeri 2019-10-21 11:44:42

  • Status changed from Fix committed to Resolved