Bug #11830

Tagged APT snapshots' backup is impractical

Added by intrigeri 2016-09-23 10:11:15. Updated 2018-08-21 11:36:29.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
2016-09-23
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

For each release we currently add about 6-7GB of data to our backups, which is painful when running them over a poor Internet connection. I think we should investigate deduplication:

  • either in the source filesystem itself, which has the advantage of saving storage space on lizard;
    • using hardlink-based deduplication tools should work, e.g. http://jak-linux.org/projects/hardlink/, which we use during our ISO build process
    • using a filesystem that deduplicates data would not help on the backup side (unless we use tools specific to that filesystem to back up our data); and last time I checked, no such filesystem was ready for production use on Linux
  • or in the backup process itself, e.g. using bup instead of rdiff-backup
    • bup supports pull-style backups (see bup-on(1) and the sketch after this list)
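To illustrate the bup option: a minimal sketch of a pull-style backup run from the backup host, assuming bup is installed on both ends (the destination branch name is just an example). bup-on(1) runs the index/save on the remote host but stores the resulting data in the local repository, which is what makes this pull-style:

bup init                   # one-time: initialise the local repository
bup on apt.lizard index /srv/apt-snapshots/tagged/repositories
bup on apt.lizard save -n tagged-apt-snapshots /srv/apt-snapshots/tagged/repositories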

Subtasks


Related issues

Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) Confirmed 2017-06-29

History

#1 Updated by intrigeri 2016-09-23 10:16:57

  • Description updated

#2 Updated by intrigeri 2016-09-23 10:27:24

  • Status changed from Confirmed to In Progress
  • Assignee changed from intrigeri to bertagaz
  • Target version changed from 284 to Tails_2.7
  • % Done changed from 0 to 10
  • QA Check set to Info Needed

The hardlink(1)-based solution seems to be pretty efficient on our current 31GB tagged snapshots repo:

reprepro-tagged-snapshots@apt:~$ time hardlink --dry-run --ignore-time /srv/apt-snapshots/tagged/repositories/
Mode:     dry-run
Files:    38410
Linked:   29650 files
Compared: 0 xattrs
Compared: 41484 files
Saved:    22.34 GiB
Duration: 77.18 seconds

… and the amount of space saved will only grow as we add releases. I’m tempted to simply go this way and be done with it. Given it’s super easy to run this command via cron, I’m setting a closer target version (this can be done very quickly if we want).
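For instance, something along these lines in the reprepro-tagged-snapshots user’s crontab would do (a sketch: the schedule and exact options are still to be decided):

# m h dom mon dow   command
30 4 * * *   hardlink --ignore-time /srv/apt-snapshots/tagged/repositories/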

bertagaz, what do you think?

#3 Updated by intrigeri 2016-09-23 10:27:44

  • Type of work changed from Research to Sysadmin

#4 Updated by bertagaz 2016-10-01 02:06:58

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:
> bertagaz, what do you think?

That sounds pretty nice, and I agree that’s probably the best path to follow. Do you have an idea of how much it improves the backup time?

#5 Updated by intrigeri 2016-10-01 02:36:09

  • QA Check changed from Info Needed to Dev Needed

> That sounds pretty nice, and I agree that’s probably the best path to follow.

OK, I’ll go ahead then.

> Do you have an idea of how much it improves the backup time?

  • a full backup of our current tagged snapshot repo should require transferring about 9GB instead of 31GB (the dry run above shows hardlinking saves 22.34 GiB of duplicates)
  • the next incremental backup of our tagged snapshot repo should require transferring only the unique files that were added, instead of 6-7GB

#6 Updated by intrigeri 2016-10-01 03:00:07

  • % Done changed from 10 to 50
  • QA Check deleted (Dev Needed)

Deployed, let’s see what happens the first time the cronjob runs (in a couple hours).

#7 Updated by intrigeri 2016-10-01 05:23:10

As expected:

$ sudo du -csh /srv/apt-snapshots/tagged/repositories/*
5.4G    /srv/apt-snapshots/tagged/repositories/2.4
131M    /srv/apt-snapshots/tagged/repositories/2.4-rc1
577M    /srv/apt-snapshots/tagged/repositories/2.5
1.1G    /srv/apt-snapshots/tagged/repositories/2.6
615M    /srv/apt-snapshots/tagged/repositories/2.6-rc1
4.0K    /srv/apt-snapshots/tagged/repositories/robots.txt
7.8G    total

I’ll now try building an ISO that uses one of these tagged snapshots.

#8 Updated by intrigeri 2016-10-01 05:33:58

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 50 to 100
  • Deliverable for set to 270

The build system managed to download most of the .deb’s before I killed it. So it works!

#9 Updated by intrigeri 2017-04-01 09:20:05

  • Status changed from Resolved to In Progress
  • Assignee set to bertagaz
  • Target version changed from Tails_2.7 to Tails_2.12
  • % Done changed from 100 to 10
  • QA Check set to Info Needed

Argh, the actual consequence of this problem is still here. Apparently rdiff-backup ignores hardlinks, and as a result the same data is downloaded and stored N times (each tagged repo directory in my backup store takes multiple GB). Sorry I didn’t notice this earlier. bertagaz, can you confirm? (I’d like to make sure this is not due to some weirdness of my own system.)

#10 Updated by intrigeri 2017-04-01 09:20:17

  • Deliverable for changed from 270 to SponsorS_Internal

#11 Updated by intrigeri 2017-04-20 07:14:07

  • Target version changed from Tails_2.12 to Tails_3.0~rc1

You’re on duty next week, so you should be able to answer my question by the end of the month.

#12 Updated by intrigeri 2017-05-03 05:25:52

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 10 to 100
  • QA Check deleted (Info Needed)

intrigeri wrote:
> Argh, the actual consequence of this problem is still here. Apparently rdiff-backup ignores hardlinks, and as a result the same data is downloaded and stored N times (each tagged repo directory in my backup store takes multiple GB).

I’ve looked closer and actually I was wrong: each tagged snapshot takes exactly as much space locally as on apt.lizard, and I’ve verified with stat --format='%i' that .deb’s with the same name/version are de-duplicated via hardlinks locally as well. So everything works as expected here :)
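For the record, the check boils down to comparing inode numbers: if both lines print the same inode, the two .deb’s are hardlinks to the same on-disk data (the package paths below are hypothetical):

$ stat --format='%i %n' \
    2.5/pool/main/f/foo/foo_1.0_amd64.deb \
    2.6/pool/main/f/foo/foo_1.0_amd64.deb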

#13 Updated by intrigeri 2017-07-29 06:04:03

  • Status changed from Resolved to In Progress
  • Assignee set to intrigeri
  • Target version changed from Tails_3.0~rc1 to Tails_3.2
  • % Done changed from 100 to 80
  • QA Check set to Ready for QA

I’m back here again :/ It seems that we’re still transferring much more data than necessary. I think I know why; hardlink(1) says:

       -O or --keep-oldest
              Among equal files, keep the oldest file (least recent modification time).
              By default, the newest file is kept. If --maximize or --minimize is
              specified, the link count has a higher precedence than the time of
              modification.

The way I understand this, “by default, the newest file is kept” implies that in practice, files that were in the tagged repo for Tails version N-1 will become hardlinks to the same files in the tagged repo for Tails version N once it’s out. And indeed, the tagged repo for 3.0 is 4.7GB, while those for 3.0~betaN and 3.0~rcN are all 2.1GB or smaller. If I’m not mistaken, this implies that when performing a backup we will re-download these duplicated files, and the copy we already had will become a hardlink to the new copy. This feels wrong, and I think the --keep-oldest option should avoid this problem. I’ve made this change and deployed it, but it does not take effect immediately: it only impacts newly duplicated files, so we’ll only know how it went after the 3.1 release. I’ll thus evaluate the outcome during the 3.2 cycle.
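Concretely, the fix presumably amounts to adding this option to the invocation from note #2 (a sketch; the deployed cronjob itself isn’t shown here):

# before: among equal files, the newest copy was kept
hardlink --ignore-time /srv/apt-snapshots/tagged/repositories/
# after: keep the oldest copy, so files already present in backups keep their inode
hardlink --keep-oldest --ignore-time /srv/apt-snapshots/tagged/repositories/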

As a data point, what we have now is:

4.7G    3.0
225M    3.0.1
3.6G    3.0-alpha1
2.1G    3.0-beta1
1.5G    3.0-beta2
545M    3.0-beta3
1.2G    3.0-beta4
771M    3.0-rc1
47M     3.0-rc2

#14 Updated by intrigeri 2017-07-29 06:05:05

  • blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added

#15 Updated by intrigeri 2017-08-13 15:27:39

  • Assignee changed from intrigeri to bertagaz

We now have:

5.1G    3.0
225M    3.0.1
4.0G    3.0-alpha1
2.3G    3.0-beta1
1.6G    3.0-beta2
546M    3.0-beta3
1.2G    3.0-beta4
771M    3.0-rc1
47M     3.0-rc2
725M    3.1

The fact that the 3.1 snapshot is small seems to indicate that the problem has indeed been fixed. bertagaz, please confirm this while doing the backups during your current sysadmin shift.

#16 Updated by anonym 2017-09-28 18:29:31

  • Target version changed from Tails_3.2 to Tails_3.3

#17 Updated by intrigeri 2017-10-01 10:01:09

  • blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added

#18 Updated by intrigeri 2017-10-01 10:01:10

  • blocked by deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))

#19 Updated by anonym 2017-11-15 11:30:50

  • Target version changed from Tails_3.3 to Tails_3.5

#20 Updated by anonym 2018-01-23 19:52:36

  • Target version changed from Tails_3.5 to Tails_3.6

#21 Updated by bertagaz 2018-03-14 10:57:51

  • Target version changed from Tails_3.6 to Tails_3.7

#22 Updated by intrigeri 2018-04-08 19:36:11

  • Deliverable for deleted (SponsorS_Internal)

#24 Updated by bertagaz 2018-05-10 11:09:17

  • Target version changed from Tails_3.7 to Tails_3.8

#25 Updated by intrigeri 2018-06-26 16:27:54

  • Target version changed from Tails_3.8 to Tails_3.9

#26 Updated by Anonymous 2018-08-17 14:59:52

This seems to be waiting for review.

#27 Updated by intrigeri 2018-08-21 11:36:29

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 80 to 100
  • QA Check deleted (Ready for QA)

I’m giving up on waiting for feedback here: good progress is being made on our next-gen backup setup, the fix I applied a year ago works for me when I update my backups, and bertagaz has not done sysadmin shifts for a while, so there’s little chance he’ll confirm this himself.