Bug #12589
Enabling LUKS-backed PVs on lizard takes ages in the initramfs
100%
Description
Noticed this today after upgrading to Stretch: it took several minutes after each PV unlocking. Each time, 4 pvscan processes were running for a while.
Random ideas:
- Maybe that’s because the initramfs is looking for the nested PVs (PV-on-LV) we have? They’re painful anyway, maybe we should get rid of them.
- Maybe there’s a timeout somewhere that we should lower.
Related issues
Blocks Tails - Feature #13284: Core work: Sysadmin (Adapt our infrastructure) | Confirmed | 2017-06-30 |
History
#1 Updated by intrigeri 2017-05-29 06:45:09
- Status changed from Confirmed to In Progress
- % Done changed from 0 to 10
intrigeri wrote:
> * Maybe that’s because the initramfs is looking for the nested PVs (PV-on-LV) we have? They’re painful anyway, maybe we should get rid of them.
Indeed, the 4 pvscan instances are looking for LVs whose VG is built on top of a PV that’s stored as a LV. These commands are run by lib/udev/rules.d/69-lvm-metad.rules. I’ve verified that the affected block devices are hard-coded nowhere in our initramfs.
We should do one of:
- get rid of this overly complex setup; we don’t have enough spare space to migrate the bitcoin-data one so we would have to download the entire blockchain again, but that’s no big deal;
- teach our initramfs not to wait for these LVs to appear: see filter and global_filter in lvm.conf, and perhaps use_lvmetad = 0.
The first option has more chances to work out-of-the-box, and it’s easy to predict the (limited) amount of time it’ll take; while the 2nd option may take a few retries (i.e. reboots), entails the risk of a non-booting machine, and I’m not even sure it’ll work in the end. Both options will cause some limited downtime. So I’ll go for the first option.
#2 Updated by intrigeri 2017-05-29 07:06:00
tl;dr: VG-on-PV-on-partition-inside-LV is OK; VG-on-PV-on-LV is not.
Migration procedure, for each of the four affected VG-on-PV-on-LV (as said above, bitcoin-data will require a different procedure though):
- in the VM that uses the VG:
  - comment out every line that relies on this VG in fstab
  - update the initramfs
  - power off the VM
- on the host system:
  - create a new LV with the same size as the problematic one
  - dd the filesystem hosted by the old LV-on-VG-on-PV-on-LV to the new LV
  - deactivate the VG-on-PV-on-LV
  - deactivate the old PV-on-LV
  - delete the old LV
  - give the new LV the name the old one had
  - ensure the new LV is backed by an appropriate PV, pvmove if needed
- update the VM accordingly:
  - start the VM
  - enter the VM
  - update fstab to point to the new backing storage location
  - update the initramfs
  - reboot the VM and check that everything is up
And finally, reboot lizard to confirm the problem is gone.
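The host-side steps of this procedure could be scripted roughly as follows. This is only a sketch under assumptions: the VG and LV names, the size, and the "-flat" suffix are hypothetical, and mapping "deactivate the old PV-on-LV" to lvchange -an is my reading of that step, not something the ticket states. It prints the commands instead of running them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed with a "+ " prefix, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# migrate_nested_lv HOST_VG OLD_LV INNER_VG INNER_LV SIZE
#   HOST_VG/OLD_LV    : the LV on the host that is currently used as a PV
#   INNER_VG/INNER_LV : the VG built on that PV and the LV holding the filesystem
#   SIZE              : size of OLD_LV, in lvcreate -L syntax (e.g. 20G)
# All names are placeholders; none of them come from the ticket.
migrate_nested_lv() {
    host_vg=$1
    old_lv=$2
    inner_vg=$3
    inner_lv=$4
    size=$5
    new_lv="${old_lv}-flat"   # hypothetical temporary name

    # Create a new LV with the same size as the problematic one.
    run lvcreate -L "$size" -n "$new_lv" "$host_vg"
    # Copy the filesystem from the nested LV to the new LV.
    run dd "if=/dev/$inner_vg/$inner_lv" "of=/dev/$host_vg/$new_lv" bs=4M
    # Deactivate the nested VG, then the LV that backs its PV.
    run vgchange -an "$inner_vg"
    run lvchange -an "$host_vg/$old_lv"
    # Delete the old LV and give the new LV the old name.
    run lvremove "$host_vg/$old_lv"
    run lvrename "$host_vg" "$new_lv" "$old_lv"
    # If the new LV landed on the wrong PV, a pvmove would go here.
}

# Example with made-up names:
migrate_nested_lv vg0 example-data example-data-vg data 20G
```

Reading the printed commands before setting DRY_RUN=0 seems wise, given that lvremove is destructive.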
Note: all our VMs have another VG-over-LV, that hosts their root filesystem. I think these didn’t cause problems because these VGs are backed by a PV that’s a partition inside the LV lizard can see, so the LV lizard can see is not a PV and our initramfs doesn’t care about it.
#3 Updated by intrigeri 2017-06-19 10:57:34
Reminder: bertagaz, you might hit this problem next time you reboot lizard. Last time I had to kill the faulty (see above) pvscan processes by hand a few times.
#4 Updated by intrigeri 2017-06-29 10:17:23
- blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added
#5 Updated by intrigeri 2017-07-06 06:05:03
bertagaz, did you notice this problem during the reboot you did 9 days ago? If yes, how did you solve/workaround it?
#6 Updated by bertagaz 2017-07-06 11:11:02
intrigeri wrote:
> bertagaz, did you notice this problem during the reboot you did 9 days ago? If yes, how did you solve/workaround it?
Yes, both times I rebooted lizard I hit the problem described in this ticket, and had to apply the same fix (killing the pvscan processes).
#7 Updated by intrigeri 2017-07-13 19:09:15
Another option could be to use filter in /etc/lvm/lvm.conf to make the host ignore guest LVM VGs, e.g. https://lists.debian.org/debian-devel/2017/07/msg00221.html.
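For illustration, here is a small helper that builds such a filter line from a list of accepted devices and rejects everything else. It is a sketch only: the device paths are hypothetical (the ticket doesn’t name the devices that back lizard’s own PVs), and the resulting line would go into the devices { } section of lvm.conf.

```shell
#!/bin/sh
# Print an lvm.conf "filter" line that accepts only the device regexes
# given as arguments and rejects every other block device.
# Patterns must not contain "|", which is used as the regex delimiter here.
lvm_filter_line() {
    printf 'filter = ['
    for pattern in "$@"; do
        printf ' "a|%s|",' "$pattern"
    done
    printf ' "r|.*|" ]\n'
}

# Hypothetical example: the host's PVs live on two unlocked LUKS mappings.
lvm_filter_line '^/dev/mapper/md1_crypt$' '^/dev/mapper/md2_crypt$'
```

With a final reject-all entry, any LV that a guest uses as a PV is never scanned by the host, whatever it is called.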
#8 Updated by intrigeri 2017-09-02 16:03:12
- blocked by deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))
#9 Updated by intrigeri 2017-09-02 16:03:23
- blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added
#10 Updated by intrigeri 2017-09-02 16:28:21
- % Done changed from 10 to 20
Hopefully fixed with commit 718e98cbcdfb9ae3a3fcaf75ac2af3abbde1e0c7 in our manifests repo. Refreshed initramfs, let’s see how it goes on next reboot.
#11 Updated by intrigeri 2017-09-04 12:20:13
Sadly, that’s not enough to fix the problem. I think that’s because lvmetad is not running in the initramfs so we might need to use filter instead of global_filter. Tried this (commit dac6c4c).
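Since each attempt can only be validated by a reboot, it may help to first check which of the two settings a given lvm.conf actually defines, for instance the copy shipped in the refreshed initramfs. The sketch below assumes that the initramfs carries a copy of lvm.conf (I believe Debian’s lvm2 hook does this, but it is not confirmed in this ticket); the helper names are made up.

```shell
#!/bin/sh
# Succeed if FILE ($1) sets KEY ($2) on an uncommented line.
# The match is anchored at the start of the line, so "filter" does not
# match a "global_filter" line.
conf_sets() {
    grep -Eq "^[[:space:]]*$2[[:space:]]*=" "$1"
}

# Report whether filter and global_filter are set in the given lvm.conf.
report_filters() {
    for key in filter global_filter; do
        if conf_sets "$1" "$key"; then
            echo "$key: set"
        else
            echo "$key: not set"
        fi
    done
}
```

One could run report_filters on the etc/lvm/lvm.conf found after extracting the initramfs into a scratch directory (for example with unmkinitramfs, if available), and compare the result with /etc/lvm/lvm.conf on the host.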
#12 Updated by intrigeri 2017-09-04 12:47:34
Still not enough => tried harder (commit c4f2fc7).
#13 Updated by intrigeri 2017-09-04 15:02:13
If that next try doesn’t work either, I’ll document the workaround one must apply at boot time and will give up: we can apply this workaround for many, many reboots, across many years, before it costs more than the more involved fix I’ve mentioned earlier on this thread.
So: groente & bertagaz, if you handle the next reboot, please pay attention and report back here wrt. whether the problem is solved or not.
#14 Updated by intrigeri 2017-09-04 18:08:53
- blocked by deleted (Feature #13284: Core work: Sysadmin (Adapt our infrastructure))
#15 Updated by intrigeri 2017-09-04 18:09:23
- Status changed from In Progress to Resolved
- % Done changed from 20 to 100
- Parent task deleted (Feature #12160)
Documented the workaround with instructions to report back here => calling this done.
#16 Updated by intrigeri 2017-09-04 18:09:32
- Parent task set to Feature #12160
#17 Updated by intrigeri 2017-09-04 18:09:42
- Subject changed from Enabling LUKS-backed PVs takes ages in the initramfs to Enabling LUKS-backed PVs on lizard takes ages in the initramfs
#18 Updated by intrigeri 2017-09-04 18:18:06
- blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added
#19 Updated by groente 2017-10-19 13:45:12
alas, the problem is not solved :(