Bug #11830
Tagged APT snapshots' backup is impractical
100%
Description
For each release we currently add about 6-7GB of data to backup, which is painful when running backups on a poor Internet connection. I think we should investigate deduplication:
- either in the source filesystem itself, which has the advantage of saving storage space on lizard;
- or using hardlink-based deduplication tools, which should work, e.g. http://jak-linux.org/projects/hardlink/ that we use during our ISO build process
- using a filesystem that deduplicates data would not help on the backup side (unless we use tools specific to that filesystem to back up our data); and last time I checked, no such filesystem was ready for production use on Linux
- or in the backup process itself, e.g. using bup instead of rdiff-backup
- bup supports pull-style backups (see bup-on(1))
Subtasks
Related issues
Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) | Confirmed | 2017-06-29 |
History
#1 Updated by intrigeri 2016-09-23 10:16:57
- Description updated
#2 Updated by intrigeri 2016-09-23 10:27:24
- Status changed from Confirmed to In Progress
- Assignee changed from intrigeri to bertagaz
- Target version changed from 284 to Tails_2.7
- % Done changed from 0 to 10
- QA Check set to Info Needed
The hardlink(1)-based solution seems to be pretty efficient on our current 31GB tagged snapshots repo:
reprepro-tagged-snapshots@apt:~$ time hardlink --dry-run --ignore-time /srv/apt-snapshots/tagged/repositories/
Mode: dry-run
Files: 38410
Linked: 29650 files
Compared: 0 xattrs
Compared: 41484 files
Saved: 22.34 GiB
Duration: 77.18 seconds
… and the amount of space saved will only grow as we add releases. I’m tempted to simply go this way and be done with it. Given it’s super easy to run this command via cron, I’m setting a closer target version (this can be done very quickly if we want).
bertagaz, what do you think?
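As a sketch of the cron-driven approach mentioned above (the schedule, file name, and user are assumptions; the repository path matches the dry-run output):

```
# /etc/cron.d/hardlink-tagged-snapshots -- hypothetical deployment sketch
# Re-run hardlink nightly so newly imported duplicate .deb's are merged
# into hardlinks before the next backup window.
30 2 * * * reprepro-tagged-snapshots /usr/bin/hardlink --ignore-time /srv/apt-snapshots/tagged/repositories/
```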
#3 Updated by intrigeri 2016-09-23 10:27:44
- Type of work changed from Research to Sysadmin
#4 Updated by bertagaz 2016-10-01 02:06:58
- Assignee changed from bertagaz to intrigeri
intrigeri wrote:
> bertagaz, what do you think?
That sounds pretty nice, and I agree that’s probably the best path to follow. Do you have an idea how much it ameliorates the backup time?
#5 Updated by intrigeri 2016-10-01 02:36:09
- QA Check changed from Info Needed to Dev Needed
> That sounds pretty nice, and I agree that’s probably the best path to follow.
OK, I’ll go ahead then.
> Do you have an idea how much it ameliorates the backup time?
- a full backup of our current tagged snapshot repo should require transferring 9GB instead of 31GB
- next incremental backup of our tagged snapshot repo should require transferring only unique files that were added, instead of 6-7GB
#6 Updated by intrigeri 2016-10-01 03:00:07
- % Done changed from 10 to 50
- QA Check deleted (Dev Needed)
Deployed, let’s see what happens the first time the cronjob runs (in a couple hours).
#7 Updated by intrigeri 2016-10-01 05:23:10
As expected:
$ sudo du -csh /srv/apt-snapshots/tagged/repositories/*
5.4G /srv/apt-snapshots/tagged/repositories/2.4
131M /srv/apt-snapshots/tagged/repositories/2.4-rc1
577M /srv/apt-snapshots/tagged/repositories/2.5
1.1G /srv/apt-snapshots/tagged/repositories/2.6
615M /srv/apt-snapshots/tagged/repositories/2.6-rc1
4.0K /srv/apt-snapshots/tagged/repositories/robots.txt
7.8G total
I’ll now try building an ISO that uses one of these tagged snapshots.
#8 Updated by intrigeri 2016-10-01 05:33:58
- Status changed from In Progress to Resolved
- Assignee deleted (intrigeri)
- % Done changed from 50 to 100
- Deliverable for set to 270
The build system managed to download most .deb’s and then I killed it. So it works!
#9 Updated by intrigeri 2017-04-01 09:20:05
- Status changed from Resolved to In Progress
- Assignee set to bertagaz
- Target version changed from Tails_2.7 to Tails_2.12
- % Done changed from 100 to 10
- QA Check set to Info Needed
Argh, the actual consequence of this problem is still here. Apparently rdiff-backup ignores hardlinks, and as a result the same data is downloaded and stored N times (each tagged repo directory in my backup store takes multiple GB). Sorry I didn’t notice this earlier. bertagaz, can you confirm? (I’d like to make sure this is not due to some weirdness of my own system.)
#10 Updated by intrigeri 2017-04-01 09:20:17
- Deliverable for changed from 270 to SponsorS_Internal
#11 Updated by intrigeri 2017-04-20 07:14:07
- Target version changed from Tails_2.12 to Tails_3.0~rc1
You’re on duty next week, so you should be able to answer my question by the end of the month.
#12 Updated by intrigeri 2017-05-03 05:25:52
- Status changed from In Progress to Resolved
- Assignee deleted (bertagaz)
- % Done changed from 10 to 100
- QA Check deleted (Info Needed)
intrigeri wrote:
> Argh, the actual consequence of this problem is still here. Apparently rdiff-backup ignores hardlinks, and as a result the same data is downloaded and stored N times (each tagged repo directory in my backup store takes multiple GB).
I’ve looked closer and actually I was wrong: each tagged snapshot takes exactly as much space locally as on apt.lizard, and I’ve verified with stat --format='%i' that .deb’s with the same name/version are de-duplicated via hardlinks locally as well. So everything works as expected here :)
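The inode check described above can be reproduced with a small self-contained emulation (file names and the temp directory are made up for illustration; ln stands in for what hardlink(1) does to equal files):

```shell
tmp=$(mktemp -d)
printf 'same payload\n' > "$tmp/foo_1.0_amd64.deb"
# hardlink(1) replaces a duplicate with a hard link; emulate that with ln
ln "$tmp/foo_1.0_amd64.deb" "$tmp/foo_1.0_amd64.deb.copy"
# identical inode numbers confirm the two names share a single copy on disk
ino_a=$(stat --format='%i' "$tmp/foo_1.0_amd64.deb")
ino_b=$(stat --format='%i' "$tmp/foo_1.0_amd64.deb.copy")
echo "$ino_a $ino_b"
rm -r "$tmp"
```

If the two inode numbers differ, the files are real duplicates and the dedup pass did not cover them.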
#13 Updated by intrigeri 2017-07-29 06:04:03
- Status changed from Resolved to In Progress
- Assignee set to intrigeri
- Target version changed from Tails_3.0~rc1 to Tails_3.2
- % Done changed from 100 to 80
- QA Check set to Ready for QA
I’m back here again :/ It seems that we’re still transferring much more data than we could. I think I know why; hardlink(1) says:
-O or --keep-oldest
Among equal files, keep the oldest file (least recent modification time). By default, the newest file is kept. If --maximize or --minimize is specified, the link count has a higher precedence than the time of modification.
The way I understand this, “by default, the newest file is kept” implies that in practice, files that were in the tagged repo for Tails version N-1 will become hardlinks to the same files in the tagged repo for Tails version N once it’s out. And indeed, the tagged repo for 3.0 is 4.7GB big, while those for 3.0~betaN and 3.0~rcN are all 2.1GB or smaller. If I’m not mistaken, this implies that we will re-download these duplicated files when performing a backup, and the copy we already had will become a hardlink to the new copy. This feels wrong, and I think the --keep-oldest option should avoid this problem. I’ve done this and deployed it, but it does not take effect immediately: it only impacts newly duplicated files, so we’ll only know how it went after the 3.1 release. I’ll thus evaluate the outcome during the 3.2 cycle.
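To illustrate why --keep-oldest matters for the backup side, here is a small emulation (file names and dates are made up; ln -f stands in for what hardlink does when it merges equal files while keeping the oldest one):

```shell
tmp=$(mktemp -d)
# the copy that was already backed up, imported with release N-1
printf 'same payload\n' > "$tmp/old.deb"
touch -d '2017-01-01' "$tmp/old.deb"
# the duplicate imported with release N
printf 'same payload\n' > "$tmp/new.deb"
# with --keep-oldest, hardlink keeps old.deb and turns new.deb into a link
# to it, so the inode (and mtime) the backup already knows about survives
ln -f "$tmp/old.deb" "$tmp/new.deb"
old_mtime=$(stat --format='%Y' "$tmp/old.deb")
new_mtime=$(stat --format='%Y' "$tmp/new.deb")
links=$(stat --format='%h' "$tmp/old.deb")
echo "mtime: $old_mtime/$new_mtime links: $links"
rm -r "$tmp"
```

With the default (keep newest) the direction of the link is reversed: old.deb would become a link to the fresh new.deb inode, which looks like new data to the backup tool.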
As a data point, what we have now is:
4.7G 3.0
225M 3.0.1
3.6G 3.0-alpha1
2.1G 3.0-beta1
1.5G 3.0-beta2
545M 3.0-beta3
1.2G 3.0-beta4
771M 3.0-rc1
47M 3.0-rc2
#14 Updated by intrigeri 2017-07-29 06:05:05
- blocks
Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added
#15 Updated by intrigeri 2017-08-13 15:27:39
- Assignee changed from intrigeri to bertagaz
We now have:
5.1G 3.0
225M 3.0.1
4.0G 3.0-alpha1
2.3G 3.0-beta1
1.6G 3.0-beta2
546M 3.0-beta3
1.2G 3.0-beta4
771M 3.0-rc1
47M 3.0-rc2
725M 3.1
The fact the 3.1 snapshot is small seems to indicate that the problem has indeed been fixed. bertagaz, please confirm this while doing the backups during your current sysadmin shift.
#16 Updated by anonym 2017-09-28 18:29:31
- Target version changed from Tails_3.2 to Tails_3.3
#17 Updated by intrigeri 2017-10-01 10:01:09
- blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added
#18 Updated by intrigeri 2017-10-01 10:01:10
- blocks deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))
#19 Updated by anonym 2017-11-15 11:30:50
- Target version changed from Tails_3.3 to Tails_3.5
#20 Updated by anonym 2018-01-23 19:52:36
- Target version changed from Tails_3.5 to Tails_3.6
#21 Updated by bertagaz 2018-03-14 10:57:51
- Target version changed from Tails_3.6 to Tails_3.7
#22 Updated by intrigeri 2018-04-08 19:36:11
- Deliverable for deleted (SponsorS_Internal)
#24 Updated by bertagaz 2018-05-10 11:09:17
- Target version changed from Tails_3.7 to Tails_3.8
#25 Updated by intrigeri 2018-06-26 16:27:54
- Target version changed from Tails_3.8 to Tails_3.9
#26 Updated by Anonymous 2018-08-17 14:59:52
This seems to be waiting for review.
#27 Updated by intrigeri 2018-08-21 11:36:29
- Status changed from In Progress to Resolved
- Assignee deleted (bertagaz)
- % Done changed from 80 to 100
- QA Check deleted (Ready for QA)
I’m giving up on waiting for feedback here: good progress is being made on our next-gen backup setup, the fix I applied a year ago works for me when I update my backups, and bertagaz has not done sysadmin shifts for a while, so there’s little chance he’ll give me feedback here.