Bug #15071

Make our server backup process more usable

Added by intrigeri 2017-12-18 16:36:15. Updated 2019-08-05 13:50:14.

Status: Resolved
Priority: Normal
Assignee: intrigeri
Category: Infrastructure
Target version:
Start date: 2018-11-28
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for:

Description


Subtasks

Feature #16165: make puppet-lizard-manifests suitable for masterless puppet (Resolved, groente, 100%)

Feature #16202: create a resource for backups (Resolved, 100%)

Bug #16211: move data away from virtual pv's (Resolved, 100%)

Feature #16214: Add stone to our VPN (Resolved, groente, 100%)

Feature #16215: Add monitoring to stone (Resolved, Sysadmins, 60%)

Feature #16234: add option for excludes to our backupscript (Resolved, groente, 100%)


Related issues

Blocks Tails - Feature #13284: Core work: Sysadmin (Adapt our infrastructure) Confirmed 2017-06-30

History

#1 Updated by intrigeri 2017-12-18 16:36:37

  • blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added

#2 Updated by intrigeri 2017-12-18 17:03:09

  • Subject changed from Make our server backup more usable to Make our server backup process more usable

#3 Updated by intrigeri 2018-04-08 13:06:49

  • Assignee set to groente

#4 Updated by groente 2018-08-15 10:55:47

  • Assignee changed from groente to intrigeri
  • % Done changed from 0 to 10
  • QA Check set to Info Needed

I would like to propose the following scheme:

We set up a dedicated backup machine called stone.
Root on lizard gets a passwordless ssh-key.
This ssh key allows for running borg serve --append-only on stone.
Sysadmins use key forwarding to run other borg commands on lizard.
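
For illustration, the restriction on stone could be enforced with a forced command in root's authorized_keys, something along these lines (key and repository path are placeholders):

command="borg serve --append-only --restrict-to-path /srv/backup/lizard",restrict ssh-ed25519 AAAAC3Nz...PLACEHOLDER root@lizard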

Now, for each logical volume on lizard, we create a borg repository:

export BORG_PASSPHRASE='ourbigsecret'
borg init --encryption=keyfile stone:lvname

Sysadmins then make a backup of root@lizard:~/.config/borg/keys as well as ‘ourbigsecret’ to their local machines.
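
For example (destination path is up to each sysadmin):

scp -r root@lizard:.config/borg/keys ./lizard-borg-keys/
# plus a copy of 'ourbigsecret', stored somewhere safe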

A script called by cron on lizard will iterate through the logical volumes, make snapshots, mount them, and send over the backup:

for lv in `ls /dev/lizard | grep -v '\-swap'`; do
lvcreate -L1G -s -n $lv-backup /dev/lizard/$lv;
guestmount -a /dev/lizard/$lv-backup -i --ro /mnt/backup;
borg create stone:~/$lv::`date +%g%m%d%H%M` /mnt/backup/;
umount /mnt/backup;
lvremove -f /dev/lizard/$lv-backup;
done

and the same for all volume groups.

this way we will have automated periodic backups with the following properties:

- snapshots ensure data on disk is in sync with databases

- compromise of a VM will not affect backups

- compromise of lizard can only lead to garbage being appended to backups, not to compromise of older backups

- compromise of stone will only reveal encrypted data
- sysadmins can extract both individual files and entire filesystems at a given timestamp for fast recovery in case of fire

caveats:

- sysadmins will need to manually purge old backups when stone’s disk runs full
- the guestmount will not do the trick when a vg inside a vm spans multiple pv’s

please let me know what you think.

#5 Updated by intrigeri 2018-08-18 16:24:26

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

> Root on lizard gets a passwordless ssh-key.

I’ll assume that root on each of our systems that’s outside of lizard’s realm also gets this.

> export BORG_PASSPHRASE='ourbigsecret'

What’s the benefit of using a passphrase on top of a keyfile? Where will the passphrase be stored on lizard?

> borg init --encryption=keyfile stone:lvname

If we want the new keyfile-blake2, looks like we simply need to install borgbackup from stretch-backports. Let’s decide now because changing our mind later implies re-uploading everything.
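
That is, the init would then become something like this (untested):

borg init --encryption=keyfile-blake2 stone:lvname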

> Sysadmins then make a backup of root@lizard:~/.config/borg/keys as well as ‘ourbigsecret’ to their local machines.

I guess we’ll want a git-remote-gcrypt repo for that.

> A script called by cron on lizard will iterate through the logical volumes, make snapshots, mount them, and send over the backup:

> for lv in `ls /dev/lizard | grep -v '\-swap'`; do
> lvcreate -L1G -s -n $lv-backup /dev/lizard/$lv;
> guestmount -a /dev/lizard/$lv-backup -i --ro /mnt/backup;
> borg create stone:~/$lv::`date +%g%m%d%H%M` /mnt/backup/;
> umount /mnt/backup;
> lvremove -f /dev/lizard/$lv-backup;
> done

Two comments here:

  • 1G will probably be too small in some cases.
  • I think we’ll want to keep backing up only selected directories and not whole filesystems, at least in some cases. I assume it’s not too hard to reuse our existing config for that. It’ll need to be adjusted a bit because we’ll need per-LV config, which is a finer granularity than the per-system config we have currently.

> and the same for volume groups.

What do you mean? Looping over each volume group whose PV is, or lives on, a LV? If that’s about the vg1 thing, see below :)

> - snapshots ensure data on disk is in sync with databases

This assumes that databases guarantee 100% consistent on-disk state. I’m not sure that’s the case for those we use. When in doubt, I think it makes sense to err on the safe side, i.e. keep the backupninja jobs we have, which dump the DBs somewhere under /var/backups, and let borg back that up.
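
For instance, the kind of dumps I have in mind, with paths and options that are only illustrative (not necessarily what our backupninja jobs do today):

mysqldump --all-databases --single-transaction | gzip > /var/backups/mysql/all.sql.gz
pg_dumpall | gzip > /var/backups/postgresql/all.sql.gz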

But regardless, yay for snapshots! They’ll definitely give us more consistent backups than what we have now, at least in other areas \o/

> - compromise of a VM will not affect backups

To be clear, it will only affect new generations of backups, right?

> - sysadmins will need to manually purge old backups when stone’s disk runs full

As long as it’s automated in a robust manner and we have monitoring alerts, fine by me!

> - the guestmount will not do the trick when a vg inside a vm spans multiple pv’s

I guess this is referring to the vg1 that most of our VMs have, with a PV that’s a partition itself hosted on a LV passed by lizard as a disk to the VM. I would like us to stop doing that because having 2 levels of LVM is a PITA in many cases. If we did migrate all this data to a simpler setup (1-1 mapping between LV-on-lizard and filesystem-in-a-VM), would the problem go away?

#6 Updated by groente 2018-08-19 10:51:53

  • Assignee changed from groente to intrigeri
  • QA Check changed from Dev Needed to Info Needed

Heya, thanks for the feedback!

Some quick answers to your questions:

- indeed, root on each of our systems that’s outside of lizard’s realm also gets a passwordless key

- there’s no real benefit to a passphrase on top of the keys; it’s just a borg default to have one

- personally i don’t care much for the blake2

- storing the keyfiles in a git-remote-gcrypt repo was indeed what i imagined

- i haven’t yet seen the 1GB run full on another system where i use this setup, what would you suggest if you think 1GB is insufficient?

- backing up only selected directories is definitely possible, but why would we want that? disk space is quite cheap and i really like the idea of being able to recover fast when shit hits the fan, which is a fair bit easier when you have full filesystems.

- i meant looping not only over lizard, but also over spinninglizard, that’s all.

- i definitely want to keep the backupninja database dumps, but with this we’ll already have a 95% chance things will just work (the effect would be similar to a hard power cycle on a running system).

- indeed, compromise will only affect new generations of backups

- the vg1 is no problem; guestmount takes care of that nicely. what i meant is things might go wrong when we have a volume group inside a vm that spans multiple virtual disks inside that same vm (so for example a volume group having pv’s in both vda and vdb). i don’t think this is the case on any of our vm’s at the moment, but it’s something we need to keep in mind.

#7 Updated by intrigeri 2018-08-19 12:48:58

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

I’m only commenting where I have something to tell. For everything else: OK!

> - i haven’t yet seen the 1GB run full on another system where i use this setup,

Maybe that other system does not spend most of its time creating 1.2GB ISO images? :)

> what would you suggest if you think 1GB is insufficient?

If I got it right the criterion is:

  • min: enough space to hold all changes made on the snapshotted LV during the backup process; lots of our systems deal with large amounts of ISO images (+ USB images soon) so it’s not uncommon that a few new GB are written to a filesystem in a few seconds; our disks are way faster than our uplink so if that filesystem is being backed up when such files are written, then 1GB won’t be enough.
  • max: whatever fits on our VGs (on lizard we’re good but I don’t know how much spare space we have on ecours and buse)

So I would say 12GB (so that “4 isobuilders * 2 * ISO size” fits comfortably) to start with and be ready to increase this value, because if it’s too small, the impact is not “broken backup run”, it’s “the services that use this LV are broken”.
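
For concreteness, that would mean something like this (sizes and names as discussed above, only a sketch):

lvcreate -L12G -s -n $lv-backup /dev/lizard/$lv
lvs -o lv_name,data_percent lizard/$lv-backup   # watch CoW usage during the backup run
lvextend -L+12G /dev/lizard/$lv-backup          # grow the snapshot before it hits 100% and gets invalidated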

> - backing up only selected directories is definitely possible, but why would we want that? disk space is quite cheap and i really like the idea of being able to recover fast when shit hits the fan, which is a fair bit easier when you have full filesystems.

OK. Maybe we still need a list of excluded directories/patterns though, such as /tmp, the ISOs built by Jenkins, and perhaps the time-based APT snapshots.
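
Something along these lines, with patterns that are purely illustrative:

borg create \
  --exclude '*/tmp/*' \
  --exclude '*/isos/*' \
  --exclude '*/time-based-snapshots/*' \
  stone:jenkins-data::`date +%g%m%d%H%M` /mnt/backup/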

> - the vg1 is no problem; guestmount takes care of that nicely. what i meant is things might go wrong when we have a volume group inside a vm that spans multiple virtual disks inside that same vm (so for example a volume group having pv’s in both vda and vdb). i don’t think this is the case on any of our vm’s at the moment, but it’s something we need to keep in mind.

ACK. I’m not sure how best to keep this in mind (or enforce this constraint). What’s the practical impact if we somehow forget and create the problematic situation?

#8 Updated by intrigeri 2018-08-19 16:24:07

  • related to #15779 added

#9 Updated by groente 2018-08-19 20:18:38

  • Assignee changed from groente to intrigeri
  • QA Check changed from Dev Needed to Info Needed

> > - i haven’t yet seen the 1GB run full on another system where i use this setup,
>
> Maybe that other system does not spend most of its time creating 1.2GB ISO images? :)

touché!

> > what would you suggest if you think 1GB is insufficient?
>
> If I got it right the criterion is:
>
> * min: enough space to hold all changes made on the snapshotted LV during the backup process; lots of our systems deal with large amounts of ISO images (+ USB images soon) so it’s not uncommon that a few new GB are written to a filesystem in a few seconds; our disks are way faster than our uplink so if that filesystem is being backed up when such files are written, then 1GB won’t be enough.
> * max: whatever fits on our VGs (on lizard we’re good but I don’t know how much spare space we have on ecours and buse)
>
> So I would say 12GB (so that “4 isobuilders * 2 * ISO size” fits comfortably) to start with and be ready to increase this value, because if it’s too small, the impact is not “broken backup run”, it’s “the services that use this LV are broken”.

makes sense, 12GB for lizard it is! there’s also 12GB spare on ecours; buse doesn’t even use LVM so snapshots are not really an option there :-/ I guess we’ll have to do buse the old-fashioned way and live with some possible data inconsistencies in the backup (or reconfigure buse into an lvm-based machine, but that’s a scary operation to perform on a live system).

> > - backing up only selected directories is definitely possible, but why would we want that? disk space is quite cheap and i really like the idea of being able to recover fast when shit hits the fan, which is a fair bit easier when you have full filesystems.
>
> OK. Maybe we still need a list of excluded directories/patterns though, such as /tmp, the ISOs built by Jenkins, and perhaps the time-based APT snapshots.

ok, i’ll go through the current script and see what really doesn’t make sense to backup.

> > - the vg1 is no problem; guestmount takes care of that nicely. what i meant is things might go wrong when we have a volume group inside a vm that spans multiple virtual disks inside that same vm (so for example a volume group having pv’s in both vda and vdb). i don’t think this is the case on any of our vm’s at the moment, but it’s something we need to keep in mind.
>
> ACK. I’m not sure how best to keep this in mind (or enforce this constraint). What’s the practical impact if we somehow forget and create the problematic situation?

The impact would be that we’d no longer be able to do snapshot-based backups from lizard (at least i don’t think libguestfs supports combining multiple disk images to mount a volume group that spans multiple pv’s). This could quite easily be circumvented by running the backups for the affected logical volumes from within that particular VM. A bit more annoying custom work, but no major blocker.

#10 Updated by intrigeri 2018-08-20 07:38:17

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

> buse doesn’t even use LVM so snapshots are not really an option there :-/ I guess we’ll have to do buse the old-fashioned way and live with some possible data inconsistencies in the backup (or reconfigure buse into an lvm-based machine, but that’s a scary operation to perform on a live system).

Note that we’ll have to migrate buse somewhere else soonish which will be a good opportunity to move storage to LVM (regardless of whether we reinstall the OS from scratch or not, which is left to be decided).

I agree with all your other points so please proceed with the next steps. I think the most urgent ones (for quite unrelated reasons, i.e. #15779) are whatever blocks buying the disks for the backup machine. I guess this means:

  1. specify what kind of storage (disk format, performance, power consumption, this sort of thing) we need;
  2. estimate how much storage space we need (it depends on deciding what exactly we’re ready to exclude from our backups, because e.g. ISOs built by Jenkins and time-based APT snapshots are huge);
  3. pick a vendor, model and seller (I think we can have CCT do the actual purchase from Germany, which would avoid one of us having to spend the money before getting reimbursed; if that’s needed let me know and I’ll ask them to confirm).

#11 Updated by groente 2018-08-20 09:04:21

  • Assignee changed from groente to intrigeri
  • QA Check changed from Dev Needed to Info Needed

Okay, great!

Hardware-wise I’m thinking of an APU2 board with a SATA multiplier; that will give us a nicely low-power machine (with coreboot <3) with a serial console as OOB (the colocation I have in mind provides serial access and remote power switching, so no need for IPMI).

If we simply back up everything, we’ll end up with roughly 1.2TB of data. Assuming not so very great deduplication with all the ISO images, let’s say every incremental run will cost around 250GB (I would consider this a very pessimistic estimate). Then we’d need 16GB of net storage to provide a year’s worth of weekly backups. Having three 8TB disks in RAID5 seems like the most economic solution for that.

As for disks, I’d like to get disks from three different vendors to prevent disks from the same production batch all giving up at the same time. I’d propose WD Purple, Seagate Ironwolf Pro, and HGST Ultrastar. They’re all reasonably priced and average around 7W of power use each. That way we’d have a backup machine running at less than 30W in total.
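
Software-side, assembling such an array would be roughly (device names are placeholders):

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc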

Two software considerations I’d like your feedback on are the following:

- let’s deliberately not puppetise the backup machine. puppetising it would allow for compromise of lizard to escalate to backup.

- i don’t think there’s much added value in FDE since the backup data is already encrypted, what do you think?

#12 Updated by intrigeri 2018-08-20 09:56:35

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

> Hardware-wise I’m thinking of an APU2 board […]

OK, nice. I assume you’ve already tested that this hardware can handle the kind of stuff borg will make it do.

> If we simply back up everything, we’ll end up with roughly 1.2TB of data. Assuming not so very great deduplication with all the ISO images, let’s say every incremental run will cost around 250GB (I would consider this a very pessimistic estimate). Then we’d need 16GB of net storage to provide a year’s worth of weekly backups.

s/16GB/16TB/

> Having three 8TB disks in RAID5 seems like the most economic solution for that.

I usually dislike RAID5 but this is one of these cases where it makes sense.

> As for disks, I’d like to get disks from three different vendors to prevent disks from the same production batch all giving up at the same time. I’d propose WD Purple, Seagate Ironwolf Pro, and HGST Ultrastar. They’re all reasonably priced and average around 7W of power use each. That way we’d have a backup machine running at less than 30W in total.

Great. So let’s find an appropriate seller, quote, and a way to buy them.

> - let’s deliberately not puppetise the backup machine. puppetising it would allow for compromise of lizard to escalate to backup.

Ah. Hmmm. Good point. I’m not very comfortable with having a system without config management whatsoever, so how about picking a rather simple, lightweight option that does not cause this kind of escalation:

  • manage the box with master-less Puppet, i.e. puppet apply $MANIFEST_DIRECTORY (then we can’t use exported resources but for the most part we won’t need them)
    • pro: this allows us to reuse as-is most, if not all, of the code we have to standardize our systems and keep them so ⇒ no wasted efforts
    • cons:
      • need to find a nice way to have the correct version of the needed Puppet modules installed on the backup box
      • Puppet might be on the heavy-weight side for the kind of low-power board you have in mind?
  • manage the box with Ansible, i.e. master-less as well
    • pros: no need for the Git repo to be accessible by any machine except the sysadmins’ personal laptop ⇒ no config management code deployment problem
    • cons:
      • the bar for doing Tails sysadmin is quite high already; adding a 2nd config management system won’t help with that :/
      • we’ll inevitably end up with writing and maintaining the very same code in Puppet and Ansible; but perhaps the subset of config management code that the backup box will need will be so small that it’s not a serious problem

Anyway, we can come back to this after purchasing the disks, e.g. in a month, no?

> - i don’t think there’s much added value in FDE since the backup data is already encrypted, what do you think?

Usually I approach this question from the other side: is there any reason not to do FDE? :)
Rationale: a reasoning that leads one to not encrypt stuff tends to become obsolete and it’s hard to notice when it does.

#13 Updated by groente 2018-08-20 10:25:09

  • Assignee changed from groente to intrigeri
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:
> > Hardware-wise I’m thinking of an APU2 board […]
>
> OK, nice. I assume you’ve already tested that this hardware can handle the kind of stuff borg will make it do.

Not personally, but a friend runs restic backups on this board, which should be the same kind of load.

> > If we simply back up everything, we’ll end up with roughly 1.2TB of data. Assuming not so very great deduplication with all the ISO images, let’s say every incremental run will cost around 250GB (I would consider this a very pessimistic estimate). Then we’d need 16GB of net storage to provide a year’s worth of weekly backups.
>
> s/16GB/16TB/

Ehr, yes, 16TB :)

> Great. So let’s find an appropriate seller, quote, and a way to buy them.

Okay, I’ll see what I can find and communicate quotes through other channels.

> > - let’s deliberately not puppetise the backup machine. puppetising it would allow for compromise of lizard to escalate to backup.
>
> Ah. Hmmm. Good point. I’m not very comfortable with having a system without config management whatsoever, so how about picking a rather simple, lightweight option that does not cause this kind of escalation:
>
> * manage the box with master-less Puppet, i.e. puppet apply $MANIFEST_DIRECTORY (then we can’t use exported resources but for the most part we won’t need them)
> pro: this allows us to reuse as-is most, if not all, of the code we have to standardize our systems and keep them so ⇒ no wasted efforts
> cons:
> * need to find a nice way to have the correct version of the needed Puppet modules installed on the backup box
> * Puppet might be on the heavy-weight side for the kind of low-power board you have in mind?
> * manage the box with Ansible, i.e. master-less as well
> pros: no need for the Git repo to be accessible by any machine except the sysadmins’ personal laptop ⇒ no config management code deployment problem
> cons:
> * the bar for doing Tails sysadmin is quite high already; adding a 2nd config management system won’t help with that :/
> * we’ll inevitably end up with writing and maintaining the very same code in Puppet and Ansible; but perhaps the subset of config management code that the backup box will need will be so small that it’s not a serious problem

Given what the machine will be doing, this all seems like a bit of an overkill. Basically, it’s setting up a stock debian box with FDE, apt install borgbackup unattended-upgrades, add ssh-keys for ecours, buse, lizard, and the sysadmins, and that’s it.
Those two packages I’m fine doing by hand and the ssh key management shouldn’t be pulled from a remote source lest we introduce an escalation vector. Or do you see other reasons for introducing config management or some way around key management that doesn’t imply possible escalation?

> Anyway, we can come back to this after purchasing the disks, e.g. in a month, no?

Yes, I’ll go after the hardware, take your time to think about config management, I’ll do the same.

> Usually I approach this question from the other side: is there any reason not to do FDE? :)
> Rationale: a reasoning that leads one to not encrypt stuff tends to become obsolete and it’s hard to notice when it does.

Fair enough, FDE it is.

#14 Updated by intrigeri 2018-08-21 07:22:10

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

>> OK, nice. I assume you’ve already tested that this hardware can handle the kind of stuff borg will make it do.

> Not personally, but a friend runs restic backups on this board, which should be the same kind of load.

OK, great!

> Given what the machine will be doing, this all seems like a bit of an overkill. Basically, it’s setting up a stock debian box with FDE, apt install borgbackup unattended-upgrades, add ssh-keys for ecours, buse, lizard, and the sysadmins, and that’s it.

Meta: I specifically don’t want to argue endlessly to have the final word on this. I’m fine with saying “your call but please consider coming back to it once the other basic things are done if there are leftovers on the budget line for this task”. Still, looks like we’re not on the same page so I’ll share my thoughts so that we understand each other better :)

FTR/FWIW I’m not fully convinced by this “that’s it”: taking a quick look at our tails::base class, I see a number of other basic setup things that we might want: monitoring (e.g. of disk space so we know when to clean things up), firewall, basic safeguards like safe-rm, outgoing email, sudo, sysctl hardening, time sync. I suspect that if we stick to “that’s it” initially, we’ll incrementally do most of these along the way later, and the initial justification for not doing config mgmt will quickly be less valid.

> Those two packages I’m fine doing by hand and the ssh key management shouldn’t be pulled from a remote source lest we introduce an escalation vector. Or do you see other reasons for introducing config management

I got used to being able to apply changes globally on our infra, e.g. the tails::base stuff mentioned above. We’ve occasionally used it e.g. to deploy a sysctl mitigation for critical kernel security issues until we could reboot. Having to think “there’s this special snowflake over there that I need to do stuff by hand on” feels error-prone, i.e. we’ll inevitably forget from time to time (maybe not you, but let’s not assume everyone on the current/future team is so rigorous :)

I think that’s all I had to say on this topic so maybe I should now shut up :)))

> or some way around key management that doesn’t imply possible escalation?

Sure: do config mgmt for everything else but handle keys by hand.

#15 Updated by intrigeri 2018-08-21 07:43:24

Just one more thing, that perhaps should be thought through before we buy $x TB of storage (because if the answer is “no” perhaps we’ll back up fewer things and then perhaps smaller disks would do): does the ~250GB/week i.e. ~1TB/month bandwidth cost fit into the budget we have for hosting+bandwidth, which FTR is 40€/month?

#16 Updated by groente 2018-08-21 21:35:51

  • Assignee changed from groente to intrigeri
  • QA Check changed from Dev Needed to Info Needed

> > Given what the machine will be doing, this all seems like a bit of an overkill. Basically, it’s setting up a stock debian box with FDE, apt install borgbackup unattended-upgrades, add ssh-keys for ecours, buse, lizard, and the sysadmins, and that’s it.
>
> Meta: I specifically don’t want to argue endlessly to have the final word on this. I’m fine with saying “your call but please consider coming back to it once the other basic things are done if there are leftovers on the budget line for this task”. Still, looks like we’re not on the same page so I’ll share my thoughts so that we understand each other better :)
>
> FTR/FWIW I’m not fully convinced by this “that’s it”: taking a quick look at our tails::base class, I see a number of other basic setup things that we might want: monitoring (e.g. of disk space so we know when to clean things up), firewall, basic safeguards like safe-rm, outgoing email, sudo, sysctl hardening, time sync. I suspect that if we stick to “that’s it” initially, we’ll incrementally do most of these along the way later, and the initial justification for not doing config mgmt will quickly be less valid.
>
> > Those two packages I’m fine doing by hand and the ssh key management shouldn’t be pulled from a remote source lest we introduce an escalation vector. Or do you see other reasons for introducing config management
>
> I got used to being able to apply changes globally on our infra, e.g. the tails::base stuff mentioned above. We’ve occasionally used it e.g. to deploy a sysctl mitigation for critical kernel security issues until we could reboot. Having to think “there’s this special snowflake over there that I need to do stuff by hand on” feels error-prone, i.e. we’ll inevitably forget from time to time (maybe not you, but let’s not assume everyone on the current/future team is so rigorous :)
>
> I think that’s all I had to say on this topic so maybe I should now shut up :)))

Oh, please don’t. I’m quite keen on doing this the right way, and you’re right that my “that’s it” was oversimplifying things; I’m just worried about possible escalation vectors.

I should probably read up on masterless puppet, but intuitively I can’t really see how you’d be able to automagically deploy e.g. sysctl-foo to all systems in one go without at the same time being able to get root access to all said systems.

Unless I’m missing something, any automated deployment of tails::base stuff will take commands from the repo hosted on the puppet-git VM on lizard and run them as root, no?

#17 Updated by groente 2018-08-21 21:41:02

intrigeri wrote:
> Just one more thing, that perhaps should be thought through before we buy $x TB of storage (because if the answer is “no” perhaps we’ll back up fewer things and then perhaps smaller disks would do): does the ~250GB/week i.e. ~1TB/month bandwidth cost fit into the budget we have for hosting+bandwidth, which FTR is 40€/month?

Yes, bandwidth use is no problem there.

#18 Updated by intrigeri 2018-08-22 10:19:56

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

> I should probably read up on the masterless puppet, but intuitively I can’t really see how you’d be able to automagically deploy e.g. sysctl-foo to all systems in one go and not at the same time be unable to get root access to all said systems.

Right, as I said we would need another way to push the Puppet code to the backup box. We’ll get root on that box but lizard should not.

> Unless I’m missing something, any automated deployment of tails::base stuff will take commands from the repo hosted on the puppet-git VM on lizard and run them as root, no?

With a masterless mode, we can decide which commit of puppet-tails.git is applied on the backup box, e.g. via its own manifests parent repo and submodules. So yes, plausibly the code comes from puppet-git.lizard but Git will check for us that it’s the code we want.
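
Roughly, on the backup box (repository name and paths are hypothetical):

cd /srv/stone-manifests                    # parent repo: site.pp plus puppet-tails pinned as a submodule
git pull --ff-only
git submodule update --init --recursive    # checks out exactly the puppet-tails commit recorded in the parent repo
puppet apply --modulepath=modules manifests/site.pp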

#19 Updated by intrigeri 2018-11-15 10:32:45

  • blocked by Bug #11181: Better abstraction of ::local::node added

#20 Updated by intrigeri 2018-11-29 10:25:15

  • Status changed from Confirmed to In Progress

#21 Updated by intrigeri 2018-11-29 10:49:26

  • blocks deleted (Bug #11181: Better abstraction of ::local::node)

#22 Updated by intrigeri 2018-12-10 08:09:41

Regarding the --append-only trick, a blog post I’ve read today taught me that it does not really reject non-append operations; it will merely queue & postpone them, and then they will take effect as soon as you write to the repo in non-append-only mode (e.g. prune, delete or create archives from an admin machine). So it seems to me that:

  • We have to avoid performing any automated write operation in non-append-only mode. My understanding is that the current design is fine in this respect.
  • When we eventually prune old backups manually, we’ll need to first check that no past transaction, made in append-only mode, includes queued non-append changes. Otherwise an attacker may make us delete past backups, which we’re trying to avoid here. I hope that somebody has already solved this problem, somehow, and we won’t have to write custom code ourselves.

Does this make sense to you?

#23 Updated by groente 2018-12-10 10:10:43

intrigeri wrote:
> Regarding the --append-only trick, a blog post I’ve read today taught me that it does not really reject non-append operations; it will merely queue & postpone them, and then they will take effect as soon as you write to the repo in non-append-only mode (e.g. prune, delete or create archives from an admin machine). So it seems to me that:
>
> * We have to avoid performing any automated write operation in non-append-only mode. My understanding is that the current design is fine in this respect.
> * When we eventually prune old backups manually, we’ll need to first check that no past transaction, made in append-only mode, includes queued non-append changes. Otherwise an attacker may make us delete past backups, which we’re trying to avoid here. I hope that somebody has already solved this problem, somehow, and we won’t have to write custom code ourselves.
>
> Does this make sense to you?

yes, this is expected behaviour.

ideally, we’d write a script to check whether there are any unvalidated DELETE tags in any of the segments. however, this is non-trivial at best.

a cheap way of checking would be to see whether the transaction log contains any entries that increment the transaction_id by just one (this means either close-to-zero data was sent, which in practice is unlikely to happen, or someone deleted something).
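
something like this could do as a first approximation (the repository path is just an example; this assumes the transactions log borg keeps in append-only mode, with lines like “transaction 123, UTC time …”):

prev=0
while read -r _ id _; do
  id=${id%,}
  if [ $((id - prev)) -eq 1 ]; then
    echo "transaction $id added almost no data (possibly a queued delete/prune)"
  fi
  prev=$id
done < /srv/backup/lizard-root/transactions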

#24 Updated by groente 2018-12-17 22:02:15

  • Assignee changed from groente to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

short update:

all lv’s on lizard are now being backed up, except:

- isobuilder*-data

- isobuilder*-libvirt
- isotest*-data

and:

- apt-snapshots
- jenkins-data

the first three i think do not contain any data worth backing up that cannot be recreated by puppet. can you confirm this?

the last two will need some exclude-options to the backup-script so we won’t waste resources like crazy.

after that, it’s time to have a look at lizard’s own rootfs and buse and ecours.

#25 Updated by groente 2018-12-17 22:02:39

  • Estimated time set to 0 h

#26 Updated by groente 2018-12-17 22:23:15

  • Estimated time deleted (0 h)

#27 Updated by intrigeri 2018-12-19 11:13:14

  • update the doc for creating new VMs so we set up backups for them (grep backup in our team’s repo)
  • remove rdiff-backup and ensure it’s not installed by default anymore

#28 Updated by intrigeri 2019-01-02 20:55:31

FYI until the 3.12 freeze (date not announced yet, likely around Jan 19) and possibly even until the 3.12 release (Jan 29), I’ll focus on things that need to go in that release, so I might have to postpone this review to after 3.12.

#29 Updated by intrigeri 2019-01-21 11:32:07

  • Assignee changed from intrigeri to groente
  • QA Check changed from Ready for QA to Dev Needed

On top of my previous comments, sysadmin.git:backups.mdwn needs an update (you would have found it with that git grep but still :)

Also, it would be nice to have the setup briefly described somewhere, including very rough & basic pointers to answer the “how do I restore X?” question (I wondered a few days ago because it would have been useful, but decided against learning how to do it from scratch in an emergency). I would like to test this as part of validating this work. Would make me feel more confident :)

> the first three i think do not contain any data worth backing up that cannot be recreated by puppet. can you confirm this?

Yes.

> the last two will need some exclude-options to the backup-script so we won’t waste resources like crazy.

We need to at least backup our “tagged” snapshots but we can probably skip the “time-based” ones (losing them would be a major disturbance, but perhaps it’s just too much ever-changing data to backup?).

Apart from that, I’ve unblocked you on the corresponding ticket, so please go ahead :)

> after that, it’s time to have a look at lizard’s own rootfs and buse and ecours.

OK.

Have you considered generating the many tails::borgbackup::lv resources from YAML with create_resources? Could be a good opportunity to learn but obviously there are pros & cons :) YMMV.

#30 Updated by intrigeri 2019-02-24 11:23:51

  • Target version changed from 2018 to Tails_3.13

2018 is over :)

#31 Updated by CyrilBrulebois 2019-03-20 14:35:11

  • Target version changed from Tails_3.13 to Tails_3.14

#32 Updated by CyrilBrulebois 2019-05-23 21:23:25

  • Target version changed from Tails_3.14 to Tails_3.15

#33 Updated by intrigeri 2019-06-02 13:28:25

groente wrote:
> after that, it’s time to have a look at lizard’s own rootfs and buse and ecours.

BTW, given the buse situation, I would feel much more comfortable if we had backups for it. If you have bandwidth for Tails sysadmin work in the near future, I suggest this (+ buse’s migration if I don’t handle it myself) is on top of your list :)

#34 Updated by CyrilBrulebois 2019-07-10 10:34:03

  • Target version changed from Tails_3.15 to Tails_3.16

#35 Updated by groente 2019-07-19 12:10:26

  • Status changed from In Progress to Needs Validation
  • Assignee changed from groente to intrigeri

All systems are now being backed up. There is documentation on the retrieval procedure in the sysadmins repo. Please let me know if you manage to restore some stuff!

#36 Updated by intrigeri 2019-08-01 20:41:18

  • Status changed from Needs Validation to In Progress
  • Assignee changed from intrigeri to groente

Yo!

> All systems are now being backed up.

Ooh yeah, great job! \o/

I’ve reviewed the code and the doc. I’ve found them very good and easy to understand! I have a few questions below but nothing fundamental is at stake here. In many cases it mostly shows that I know little about borgbackup and need a tiny bit more hand-holding :)

I’ve pushed some polishing and clarifications to all corresponding repos. Some of them are perhaps a matter of taste, but some of them might actually teach you a small thing or two about Puppet code design.

Wrt. the code:

  • I was initially confused because in runbackuplv.sh, we set RAWDISK=0 if the -d option is passed, i.e. if rawdisk => true is set. Do you mind if I invert this and have RAWDISK=1 mean “rawdisk mode is enabled” instead of the opposite?
  • If there’s a good reason to store configuration in /usr/local/etc/borgbackup (as opposed to /etc), please document it in tails::borgbackup. I see this file was moved there in a2b13201caf239b79a410a3cb2f1fca6ca2d83af but I don’t understand why.
  • It seems that you’ve missed Bug #15071#note-27: I see rdiff-backup is still installed and systems/lizard/install_base_vm.mdwn has obsolete instructions.

Wrt. the doc:

  • I understand that to backup a new LV, I need to do more than declaring tails::borgbackup::{lv,fs}, e.g. I should probably generate a key and add it to backups/keys/ in Git. I could probably reverse-engineer how to proceed from the code but it would be sweet to have some basic instructions :) Once this is documented, I’ll follow this doc for the new LV I’ve added in a793c3bb46b742523e7a091cb357715fc3707ba3.
  • “keep backups up-to-date” in https://tails.boum.org/contribute/working_together/roles/sysadmins/ is obsolete, and there’s nothing else we have to do instead during sysadmin shifts, right? Happy to update this myself once you confirm.

> There is documentation on the retrieval procedure in the sysadmins repo. Please let me know if you manage to restore some stuff!

It worked!

Wrt. the retrieval doc:

  • One thing that’s not 100% clear to me wrt. retrieving data from backups is file ownership & permissions. I guess I should run borg extract as root (and thus, have the borg config & keys done as root) to properly restore such metadata, right? But then, as root I don’t have SSH access to stone, so I’m a bit confused wrt. how I’m supposed to handle this. It would be nice to make this clear(er) in the doc.
  • If I want to restore a file from backups on one of our VMs, I have to first restore it to my own system, then scp it to the VM over the Internet, right? Or is there another way that would work better for large files / slow Internet connections, without compromising security?

#37 Updated by groente 2019-08-02 09:46:56

  • Status changed from In Progress to Needs Validation
  • Assignee changed from groente to Sysadmins

> I’ve pushed some polishing and clarifications to all corresponding repos. Some of them are perhaps a matter of taste, but some of them might actually teach you a small thing or two about Puppet code design.

thanks!

> * I was initially confused because in runbackuplv.sh, we set RAWDISK=0 if the -d option is passed, i.e. if rawdisk => true is set. Do you mind if I invert this and have RAWDISK=1 mean “rawdisk mode is enabled” instead of the opposite?

sure, go right ahead.

> * If there’s a good reason to store configuration in /usr/local/etc/borgbackup (as opposed to /etc), please document it in tails::borgbackup. I see this file was moved there in a2b13201caf239b79a410a3cb2f1fca6ca2d83af but I don’t understand why.

i’ve added some comments in the puppet code about this. the ‘problem’ is that once you start using borg as a client to retrieve data from archives, it creates an /etc/borgbackup directory with some configuration there. to avoid confusion with those config files (which should never exist on our servers that only push data), i’ve moved our server secrets and excludes files to /usr/local.

> * It seems that you’ve missed Bug #15071#note-27: I see rdiff-backup is still installed and systems/lizard/install_base_vm.mdwn has obsolete instructions.

ah, right, that should be fixed in 6c20311f0cc70881cbe7d3c1e5294456d7a156fd .

> Wrt. the doc:
>
> * I understand that to backup a new LV, I need to do more than declaring tails::borgbackup::{lv,fs}, e.g. I should probably generate a key and add it to backups/keys/ in Git. I could probably reverse-engineer how to proceed from the code but it would be sweet to have some basic instructions :) Once this is documented, I’ll follow this doc for the new LV I’ve added in a793c3bb46b742523e7a091cb357715fc3707ba3.

I’ve added instructions in our documentation. Basically, the keys are generated automagically, you just have to copy them to our repo after the first backup has run.

> * “keep backups up-to-date” in https://tails.boum.org/contribute/working_together/roles/sysadmins/ is obsolete, and there’s nothing else we have to do instead during sysadmin shifts, right? Happy to update this myself once you confirm.

Correct, we’ve automated ourselves away there, please remove it from our tasks :)

> > There is documentation on the retrieval procedure in the sysadmins repo. Please let me know if you manage to restore some stuff!
>
> It worked!

\o/

> Wrt. the retrieval doc:
>
> * One thing that’s not 100% clear to me wrt. retrieving data from backups is file ownership & permissions. I guess I should run borg extract as root (and thus, have the borg config & keys done as root) to properly restore such metadata, right? But then, as root I don’t have SSH access to stone, so I’m a bit confused wrt. how I’m supposed to handle this. It would be nice to make this clear(er) in the doc.

I’ve updated the doc with a section about file ownership and running as root. Please let me know if it’s clear.

> * If I want to restore a file from backups on one of our VMs, I have to first restore it to my own system, then scp it to the VM over the Internet, right? Or is there another way that would work better for large files / slow Internet connections, without compromising security?

Well, i guess you can use ssh agent forwarding, but then we’d have to reconfigure our ssh servers which currently have that feature disabled…

#38 Updated by intrigeri 2019-08-03 17:44:50

  • Status changed from Needs Validation to In Progress

Applied in changeset commit:tails|463a16652fb4343484acd96618d2870e345cf34b.

#39 Updated by intrigeri 2019-08-03 17:46:38

  • Assignee changed from Sysadmins to groente

>> * I was initially confused because in runbackuplv.sh, we set RAWDISK=0 if the -d option is passed, i.e. if rawdisk => true is set. Do you mind if I invert this and have RAWDISK=1 mean “rawdisk mode is enabled” instead of the opposite?

> sure, go right ahead.

Done in commit a678eedc0897872fbdc3bd630e3c1df8192192b9 (puppet-tails.git).

>> * If there’s a good reason to store configuration in /usr/local/etc/borgbackup (as opposed to /etc), please document it in tails::borgbackup. I see this file was moved there in a2b13201caf239b79a410a3cb2f1fca6ca2d83af but I don’t understand why.

> i’ve added some comments in the puppet code about this. the ‘problem’ is that once you start using borg as a client to retrieve data from archives, it creates an /etc/borgbackup directory with some configuration there. to avoid confusion between these config files (which should never exist on our servers that only push data), i’ve moved our server secrets and excludes files to /usr/local.

OK, I see. I understand the need to avoid confusion but somehow, I’d rather see config in /etc, so: would /etc/tails-borgbackup work for you? If not, I can swallow my concerns and live with the current state of things :) Note that I’m happy to deal with the migration to that new directory myself, in case it matters.

Once we reach agreement & implement it here, and once Feature #16165 is completed, we can close this ticket \o/

>> * It seems that you’ve missed Bug #15071#note-27: I see rdiff-backup is still installed and systems/lizard/install_base_vm.mdwn has obsolete instructions.

> ah, right, that should be fixed in 6c20311f0cc70881cbe7d3c1e5294456d7a156fd .

Very nice. I’m glad you did the refactoring while you were at it :)

> Basically, the keys are generated automagically, you just have to copy them to our repo after the first backup has run.

Done!

>> * “keep backups up-to-date” in https://tails.boum.org/contribute/working_together/roles/sysadmins/ is obsolete, and there’s nothing else we have to do instead during sysadmin shifts, right? Happy to update this myself once you confirm.

> Correct, we’ve automated ourselves away there, please remove it from our tasks :)

Done.

For everything else I’m not quoting here, I’m happy with the way you’ve promptly fixed it. Thanks!

#40 Updated by groente 2019-08-05 10:00:05

intrigeri wrote:
> Done in commit a678eedc0897872fbdc3bd630e3c1df8192192b9 (puppet-tails.git).

Thanks!

> >> * If there’s a good reason to store configuration in /usr/local/etc/borgbackup (as opposed to /etc), please document it in tails::borgbackup. I see this file was moved there in a2b13201caf239b79a410a3cb2f1fca6ca2d83af but I don’t understand why.
>
> > i’ve added some comments in the puppet code about this. the ‘problem’ is that once you start using borg as a client to retrieve data from archives, it creates an /etc/borgbackup directory with some configuration there. to avoid confusion between these config files (which should never exist on our servers that only push data), i’ve moved our server secrets and excludes files to /usr/local.
>
> OK, I see. I understand the need to avoid confusion but somehow, I’d rather see config in /etc, so: would /etc/tails-borgbackup work for you? If not, I can swallow my concerns and live with the current state of things :) Note that I’m happy to deal with the migration to that new directory myself, in case it matters.

Feel free to change directories; /etc/tails-borgbackup should work just as well. At this point it seems like an aesthetic issue, which I can’t say I’m particularly concerned about.

> For everything else I’m not quoting here, I’m happy with the way you’ve promptly fixed it. Thanks!

Thanks for the review & feedback!

#41 Updated by groente 2019-08-05 10:45:56

  • Assignee changed from groente to intrigeri

#42 Updated by intrigeri 2019-08-05 13:50:14

  • Status changed from In Progress to Resolved

> Feel free to change directories, /etc/tails-borgbackup should work just as well. At this point it seems like an aesthetical issue, which I can’t say I’m particularly concerned about.

Done. I think that all my concerns were resolved and all subtasks are done, so: closing! Great job \o/

I’ll now update https://tails.boum.org/contribute/roadmap/.