Bug #12092

The Greeter keeps eating lots of memory after logging in

Added by intrigeri 2016-12-27 13:37:30 . Updated 2020-04-15 06:02:03 .

Status:
Resolved
Priority:
High
Assignee:
Category:
Hardware support
Target version:
Start date:
2016-12-27
Due date:
% Done:

0%

Feature Branch:
bugfix/12092-kill-gdm-session-after-login+force-all-tests
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Welcome Screen
Deliverable for:

Description

The test suite started failing on feature/stretch in Jenkins (Bug #12088) once we’ve merged the new Greeter in there. I’ve noticed that a full GNOME session is kept running as the Debian-gdm user once one has logged in. This seems like a big waste of memory. Can we somehow free that RAM, to keep supporting older hardware?

As part of this, revert commit:895387ece8cec86831cd617ee08dbdfc49cdc144.


Subtasks


Related issues

Related to Tails - Bug #12088: Test suite fails for feature/stretch on Jenkins due to lack of space in /tmp/TailsToaster Resolved 2016-12-26
Related to Tails - Bug #16305: GDM's GNOME Shell floods the Journal with XFIXES/cursor issues on Buster Resolved 2019-01-06
Related to Tails - Bug #17011: Disable GDM debug logs Resolved
Has duplicate Tails - Bug #12999: Stopping session-c1.scope after starting tails Duplicate 2017-06-23
Blocks Tails - Bug #16281: Update the test suite for Buster Resolved 2019-01-05
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

History

#1 Updated by intrigeri 2016-12-27 13:37:47

  • related to Bug #12088: Test suite fails for feature/stretch on Jenkins due to lack of space in /tmp/TailsToaster added

#2 Updated by alant 2016-12-31 18:15:51

On a stretch system using X session for both the greeter and the user session, I can reproduce this behaviour: after logging in, I have:

        ├─gdm3─┬─gdm-session-wor─┬─gdm-x-session(Debian-gdm)─┬─Xorg───{InputThread}
        │      │                 │                           ├─gnome-session-b─┬─gnome-settings-─┬─{dconf worker}
        │      │                 │                           │                 │                 ├─{gdbus}
        │      │                 │                           │                 │                 ├─{gmain}
        │      │                 │                           │                 │                 └─{pool}
        │      │                 │                           │                 ├─gnome-shell─┬─{JS GC Helper}
        │      │                 │                           │                 │             ├─{JS Sour~ Thread}
        │      │                 │                           │                 │             ├─{dconf worker}
        │      │                 │                           │                 │             ├─{gdbus}
        │      │                 │                           │                 │             ├─{gmain}
        │      │                 │                           │                 │             └─{threaded-ml}
        │      │                 │                           │                 ├─xbrlapi
        │      │                 │                           │                 ├─{dconf worker}
        │      │                 │                           │                 ├─{gdbus}
        │      │                 │                           │                 └─{gmain}
        │      │                 │                           ├─{gdbus}
        │      │                 │                           └─{gmain}
        │      │                 ├─{gdbus}
        │      │                 └─{gmain}
        │      ├─gdm-session-wor─┬─gdm-x-session(user)─┬─Xorg───{InputThread}
        │      │                 │                     ├─gnome-session-b─┬─evolution-alarm─┬─{dconf worker}
        │      │                 │                     │                 │                 ├─3*[{evolution-alarm}]
        │      │                 │                     │                 │                 ├─{gdbus}
        │      │                 │                     │                 │                 └─{gmain}

However, with a sid system using wayland sessions for both the greeter and the user, there are no Debian-gdm processes anymore after login.

We can try to find a way to (have these) processes killed, but I’m not sure it could be easy…

#3 Updated by intrigeri 2016-12-31 19:27:01

> We can try to find a way to (have these) processes killed, but I’m not sure it could be easy…

I’m open to an ugly hack to keep hardware requirements as low as possible, on the grounds that we can drop the hack once we use Wayland.

#4 Updated by intrigeri 2017-03-17 12:52:56

  • Priority changed from Normal to Elevated

#5 Updated by alant 2017-03-18 09:21:39

Let’s first evaluate the impact on memory usage, then decide what to do.

#6 Updated by alant 2017-05-11 15:42:54

  • Assignee changed from alant to intrigeri

I won’t handle that before May 15.

> Let’s first evaluate the impact on memory usage, then decide what to do.

Please assign it again to me if you decide it’s an issue and that deadline is 3.0 final.

#7 Updated by intrigeri 2017-05-16 13:55:20

  • Assignee changed from intrigeri to alant

> Please assign it again to me if you decide it’s an issue and that deadline is 3.0 final.

Yes, please try to get this done (possibly with an ugly hack we can drop once we go the Wayland way) by the end of May.

#8 Updated by intrigeri 2017-05-20 06:41:20

  • Priority changed from Elevated to Normal

Please focus on Bug #12364 first, but still, it would be nice to have this one fixed.

#9 Updated by intrigeri 2017-06-04 14:06:48

#10 Updated by intrigeri 2017-06-23 12:04:26

  • has duplicate Bug #12999: Stopping session-c1.scope after starting tails added

#11 Updated by intrigeri 2017-06-23 12:05:29

Bug #12999 has a potential solution to this problem.

#12 Updated by intrigeri 2018-05-17 13:04:09

https://siliconislandblog.wordpress.com/2018/05/16/gnome-performance-hackfest/ suggests it’s expected behaviour (but patches are WIP) unless one uses auto-login.

#13 Updated by intrigeri 2019-01-07 16:28:13

  • related to Bug #16305: GDM's GNOME Shell floods the Journal with XFIXES/cursor issues on Buster added

#14 Updated by intrigeri 2019-01-07 19:38:50

(Coming here from Bug #16305, since fixing this old issue would probably also fix that new one.)

On Buster, loginctl terminate-session c1 (run from a root terminal in the “amnesia” session) successfully kills the Debian-gdm session and its processes without breaking anything obvious at first glance. We could try to run that with sudo as a systemd --user/ service. Not super nice but we can drop it once we’ve switched to Wayland (in Tails 5.0, according to our current plans).

#15 Updated by sajolida 2019-01-22 18:33:27

#16 Updated by anonym 2019-08-16 07:57:40

  • Priority changed from Normal to Elevated
  • Target version set to Tails_4.0

This ticket is now a release blocker for Tails 4.0.

Rationale: in our test suite we have seen that 2 GB of memory (which is our advertised minimum) is dangerously close to low-memory situations (that essentially freezes the system) under very normal usage (e.g. install a package with a-s-p and then opening Tor Browser). More details:

#17 Updated by anonym 2019-08-16 07:57:54

  • blocks Bug #16281: Update the test suite for Buster added

#18 Updated by anonym 2019-08-16 08:00:48

#19 Updated by anonym 2019-08-16 08:01:12

  • Assignee deleted (alant)

I don’t want to force a blocker on you, @alant, so I’ll put this on the FT plate for now. Let us know if you want to work on this!

#20 Updated by intrigeri 2019-08-16 10:25:01

  • Description updated

#21 Updated by intrigeri 2019-08-19 07:58:25

  • Priority changed from Elevated to High
  • Type of work changed from Research to Code

This will fix a bunch of totally broken test cases: Bug #16281#note-35.

#22 Updated by intrigeri 2019-08-19 11:37:10

  • blocks Bug #16004: Make our automated test suite take into account USB image updates wrt. supported installation & update methods added

#23 Updated by intrigeri 2019-08-19 19:52:28

  • Status changed from Confirmed to In Progress
  • Assignee set to intrigeri
  • Feature Branch set to bugfix/12092-kill-gdm-session-after-login+force-all-tests

Giving it a very basic try.

#24 Updated by intrigeri 2019-08-20 07:34:33

This breaks ASP starting the Persistence wizard. I see the firewall blocks connections from the amnesia user to TCP port 6001 (x11-1). It looks like that’s because the ASP code sets DISPLAY to :1. Setting it to :0 seems to fix the problem (only tested manually so far) but I wonder if there was a good reason for :1 and if changing it will break other stuff; we’ll see!

#25 Updated by intrigeri 2019-08-21 10:58:10

intrigeri wrote:
> This breaks ASP starting the Persistence wizard. I see the firewall blocks connections from the amnesia user to TCP port 6001 (x11-1). It looks like that’s because the ASP code sets DISPLAY to :1. Setting it to :0 seems to fix the problem

Scratch this, it depends on when exactly we kill the Greeter’s session: if we kill it before amnesia’s X.Org is started (which we should not do), :0 is free and amnesia’s X.Org takes it. commit:05aadd63b0d4d3d19b7a49e3029ff0c8ad5af6c0 should fix the root cause that caused :0 to be used, which broke some ASP stuff.

#26 Updated by intrigeri 2019-08-22 07:46:53

Current code runs in a very robust manner on my local Jenkins: in 5 full test suite runs, only failures are Bug #11592 and some failed GnuPG key fetches. But on our shared Jenkins I see lots of failures. Analyzing the Journal seems to tell us that the whole thing is racing against GDM, that starts a new gdm-x-session immediately after we’ve killed the first one, which prevents the amnesia’s session from loading (or being displayed, or something, and in any case we won’t get the memory savings we’re looking for when that happens).

One way to fix this would be to have tails-kill-gdm-session.service disable gdm.service’s OnFailure=tails-gdm-failed-to-start.service, then kill the gdm process. But I’m not sure how the system will behave then: it might be that we’re sent back to plymouth or something, while we’d like to remain on the VT where amnesia’s X.Org is running.

Or, we could tweak our config/chroot_local-includes/usr/lib/gdm3/gdm-x-session.tails wrapper: if some flag file created by tails-kill-gdm-session.service exists, do nothing, keep running, so that GDM believes it has successfully started its X session. That’s arguably a bit ugly but if it works, it might be the easiest way to get what we want reliably.

#27 Updated by intrigeri 2019-08-22 19:12:12

intrigeri wrote:
> Or, we could tweak our config/chroot_local-includes/usr/lib/gdm3/gdm-x-session.tails wrapper: if some flag file created by tails-kill-gdm-session.service exists, do nothing, keep running, so that GDM believes it has successfully started its X session. That’s arguably a bit ugly but if it works, it might be the easiest way to get what we want reliably.

Instead I made our gdm-x-session wrapper exit 0 when it’s called after we’ve killed the GDM session. At first glance, this seems to work:

/usr/lib/gdm3/gdm-x-session[2920]: (II) Server terminated successfully (0). Closing log file.
systemd[1]: session-c1.scope: Succeeded.
systemd[1]: Stopped Session c1 of user Debian-gdm.
systemd-logind[2085]: Removed session c1.

Let’s see what these to Jenkins think.

#28 Updated by intrigeri 2019-08-23 05:58:55

It seems to work a bit better on Jenkins (one run worked pretty fine, which did not happen before), but I still see some failures. After we click Login in the Greeter, ~15s later the screen turns black, at the time when amnesia’s X.Org is started; nothing fishy in the logs that claim GNOME Shell started just fine, except that 10s after amnesia’s X.Org starts, as soon as we’ve successfully killed GDM’s session, a new one briefly starts (without starting X.Org though, thanks to commit:2e6edc540a58eddd23f495f50a526137e0c78b18):

Aug 22 20:54:10 amnesia systemd[1]: session-c1.scope: Succeeded.
Aug 22 20:54:10 amnesia systemd[1]: Stopped Session c1 of user Debian-gdm.
Aug 22 20:54:10 amnesia systemd-logind[2466]: Removed session c1.
Aug 22 20:54:10 amnesia gdm-launch-environment][7270]: pam_unix(gdm-launch-environment:session): session opened for user Debian-gdm by (uid=0)
Aug 22 20:54:10 amnesia systemd-logind[2466]: New session c10 of user Debian-gdm.
Aug 22 20:54:10 amnesia systemd[1]: Started Session c10 of user Debian-gdm.
Aug 22 20:54:10 amnesia gdm-launch-environment][7270]: pam_unix(gdm-launch-environment:session): session closed for user Debian-gdm
Aug 22 20:54:10 amnesia gdm3[3345]: Child process -7297 was already dead.
Aug 22 20:54:10 amnesia systemd[1]: session-c10.scope: Succeeded.
Aug 22 20:54:10 amnesia systemd-logind[2466]: Session c10 logged out. Waiting for processes to exit.
Aug 22 20:54:10 amnesia systemd-logind[2466]: Removed session c10.

I suspect that GDM switches VTs when doing so, which would explain why the screen remains black. I think the only robust solution, to take control on the screen away from GDM in a non-racy way, would be to:

  • Set KillSignal=SIGKILL for gdm.service. It may be that KillMode=control-group is needed too.
  • Have tails-kill-gdm-session.service run systemctl stop gdm before (or instead of) loginctl terminate-session c1.

This way, GDM should be left no chance to react to its first gdm-x-session aborting. Then commit:2e6edc540a58eddd23f495f50a526137e0c78b18 should not be needed anymore as we have a guarantee that our gdm-x-session wrapper is not started again once we’ve killed it along with the rest of the GDM session.

#29 Updated by intrigeri 2019-08-23 16:19:40

> After we click Login in the Greeter, ~15s later the screen turns black, at the time when amnesia’s X.Org is started; nothing fishy in the logs that claim GNOME Shell started just fine

I still see this occasionally at commit:107efb89771ad96cf74b025b9bad1e1f55c825e2 :/
But I have another lead for a possible fix :)

#30 Updated by intrigeri 2019-08-25 13:04:25

OK, this finally seems robust enough on Jenkins. I’ll test this on bare metal before I unleash this monstrosity to some poor soul who’ll review this. The good news is that presumably, all this can go away once we’re on Wayland.

#31 Updated by intrigeri 2019-08-25 13:06:01

@segfault, if/once you’re done with the other reviews, you might want to do a first code review here, in the hope that this allows us to merge this earlier, therefore unblocking various other things. Otherwise, no big deal, of course :)

#32 Updated by intrigeri 2019-08-26 13:23:17

> I’ll test this on bare metal […]

I’ve successfully started Tails and logged in 3 times for each combination among:

  • machine: ThinkPad X200 / Elitebook 840G1
  • USB stick: very fast / rather slow
  • administration password: set / unset

Additionally:

  • I tested the same thing in a VM started from a virtual USB drive: 3 times with an admin password + 3 times without any.
  • The full test suite has been run on the current state of this branch at least 5 times on each of the 2 Jenkins I’m using.

In none of these dozens of tests could I reproduce any of the bugs that were very common on Jenkins in earlier iterations of this work. So either there’s still a race condition bug somewhere but it’s very hard to trigger, or the current state of this branch has effectively got rid of all such issues. I’m rather confident it’s the latter but it’s of course hard to prove the absence of something.

Except that at (what looked like) the last minute, while doing these tests, I’ve just realized that this branch breaks unlocking the screen: gnome-shell: Failed to open reauthentication channel: Gio:DBusError: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.gnome.DisplayManager was not provided by any .service files. Ahem. So I’m back to the drawing board ⇒ will try to find some way to keep GDM running just enough to support the lock/unlock use case, but without eating too much memory.

FTR the goal here is to:

  • fix our automated test suite that currently often fails due to lack of “disk” space (memory snapshots get smaller)
  • avoid giving more RAM to our test VM, and in turn: avoid bumping the documented RAM requirements in 4.0
  • lower the effective RAM requirements compared to 3.x; we’re not going to lower the documented ones, but still :)

#33 Updated by intrigeri 2019-08-26 16:11:33

> @segfault, if/once you’re done with the other reviews, you might want to do a first code review here, in the hope that this allows us to merge this earlier, therefore unblocking various other things. Otherwise, no big deal, of course :)

I hope you did not waste time on the implementation that was on the branch back when I wrote that. Hold on! That implementation had a serious flaw (on top of being overly complex) and I’m now testing a much simpler one.

#34 Updated by segfault 2019-08-27 09:35:59

> I hope you did not waste time on the implementation that was on the branch back when I wrote that. Hold on! That implementation had a serious flaw (on top of being overly complex) and I’m now testing a much simpler one.

No worries, I didn’t look at it yet.

#35 Updated by intrigeri 2019-08-29 12:12:32

  • blocked by deleted (Bug #16004: Make our automated test suite take into account USB image updates wrt. supported installation & update methods)

#36 Updated by segfault 2019-08-29 21:34:28

I read the comments and think I somewhat understand the problem and I took a look at the code and understand what you’re trying to do. But I feel like this needs a more extensive review which I don’t have the time for right now :/

#37 Updated by intrigeri 2019-08-30 06:19:23

> I read the comments and think I somewhat understand the problem and I took a look at the code and understand what you’re trying to do. But I feel like this needs a more extensive review which I don’t have the time for right now :/

This sounds fair. Besides, one of my 5 test suite runs on Jenkins (against the state of the branch that you looked at) identified one remaining problem with that version of the implementation, so I had to change it (again × N) quite a bit ⇒ I’ll have to run a bunch of tests again and will have results late today, earliest. Even if these tests indicate that all robustness problems are gone, I don’t know if there’s a chance I find someone to review the branch by the end of the week, so I’m not confident anymore that this will get fixed in time for 4.0~beta2. That’s OKish, even though I would really have liked to ship this sort of risky changes in a beta, before including it in 4.0 final. Worst case we’ll have to release 4.0~rc1 early October, which could be a good idea anyway.

#38 Updated by intrigeri 2019-08-30 07:54:22

  • Subject changed from The new Greeter keeps eating lots of memory after logging in to The Greeter keeps eating lots of memory after logging in

#39 Updated by intrigeri 2019-08-31 05:46:11

Oooookay, it looks like this branch is finally robust enough for my taste:

  • For each previous iteration of this branch, I’ve run the full test suite 5 times on my local Jenkins + 5 times on our shared Jenkins, and the latter always exposed a race condition during at least one of these run; the former sometimes did too.
  • At commit:e540df45b13286b98597a7efb649fcd36d835ff3, I did the same 2×5 runs, and for the first time there’s been no regression vs. devel.

I’ll now test this on various bare metal machines, with fast and slow USB drives, and also from DVD, in the “hope” that any remaining race condition is triggered.

#40 Updated by intrigeri 2019-08-31 10:40:49

  • Status changed from In Progress to Needs Validation

> I’ll now test this on various bare metal machines, with fast and slow USB drives, and also from DVD, in the “hope” that any remaining race condition is triggered.

Done: 18 boots in total, in various combinations of slow/fast USB stick / DVD, default Greeter settings / persistence with Additional Software / French + disabled network; I did all these tests on a ThinkPad X200 and a HP Elitebook 840G1. Login was successful in all cases.

Also, in the last few days, thanks to GDM debug logging + reading lots of its source code, I’ve reached a point when I think I know reasonably well what I’m doing here. This contrasts with my initially semi-random fiddling, even though some finer points are still not 100% clear to me.

Finally, AFAIK we should be able to drop all these kludges in 5.0, as they should not be needed on Wayland anymore: in a sid GNOME VM, Debian-gdm’s GNOME+Wayland session vanishes once I log into a GNOME+Wayland session. Knowing that we’re going to carry this pile of hacks only during the 4.x era makes me more comfortable proposing this branch.

So I’m hereby sending this for review. Note that the branch should not be merged as-is! See below for more about this.

FTR the goal here is to:

  • Help fix our automated test suite that currently often fails due to lack of “disk” space (this branch should make memory snapshots smaller)
  • Avoid giving more RAM to our test VM, and in turn: avoid bumping the documented RAM requirements in 4.0
  • Lower the effective RAM requirements compared to 3.x; we’re not going to lower the documented ones, but still :)

I’ve left a bunch of “squash!” commits for now:

  • I think their commit message may help the reviewer understand what I’m doing, how, and why.
  • My plan is that once we reach an agreement, I’ll squash all this, move lots of commit messages text to comments in the code, verify that the diff between the reviewed+ack’ed branch and the rewritten one consists purely of updating comments, and finally merge.

I’m not sure if I should leave GDM debug logging enabled. Ideally, I’d like to:

  • Ship this in 4.0~betaN, or more likely in ~rc1, with debug logging enabled ⇒ if there’s any problem, I’ll have the info I need in WhisperBack reports.
  • Disable debug logging in 4.0 final. File a ticket about it so we don’t forget.

#41 Updated by intrigeri 2019-08-31 10:41:28

  • Assignee deleted (intrigeri)

#42 Updated by segfault 2019-08-31 17:51:09

  • Assignee set to intrigeri

I did another review and I agree that it would be good to have this in the beta release. So I propose you clean up the branch and merge into devel, if it’s not too late for 4.0~beta2.

Please also review the commit I pushed on top, which adds some quotes to parameter expansions and command substitutions.

#43 Updated by intrigeri 2019-08-31 18:25:05

  • related to Bug #17011: Disable GDM debug logs added

#44 Updated by intrigeri 2019-08-31 18:33:12

  • Status changed from Needs Validation to In Progress

Applied in changeset commit:tails|9e6df4512e6a88d88893446457fb6d1892fd3957.

#45 Updated by intrigeri 2019-08-31 18:36:04

> * My plan is that once we reach an agreement, I’ll squash all this, move lots of commit messages text to comments in the code, verify that the diff between the reviewed+ack’ed branch and the rewritten one consists purely of updating comments, and finally merge.

I’ve merged devel into the bugfix/12092-kill-gdm-session-after-login+force-all-tests branch → commit:5931565a7f2c945beb4bb17e3e5ae463b0eb9830.
I’ve done what I’m quoting in the bugfix/12092-kill-gdm-session-after-login-squashed branch → commit:edc71c6ff8a7577fc75a1d447fd954b7187ebd37.
The only diff is commit:edc71c6ff8a7577fc75a1d447fd954b7187ebd37, which only changes comments.
So I’m going to merge!

#46 Updated by intrigeri 2019-08-31 18:37:35

  • Status changed from In Progress to Fix committed
  • % Done changed from 0 to 100

Applied in changeset commit:tails|85b14c91bc9591eab8dc7880ade7a678bcfc70a3.

#47 Updated by intrigeri 2019-08-31 18:41:31

  • Status changed from Fix committed to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 100 to 0

#48 Updated by intrigeri 2020-04-15 06:02:03

  • Affected tool changed from Greeter to Welcome Screen