Bug #17414
Slow networking for release management
Description
(Opening as suggested by intrigeri.)
I’m regularly getting very slow transfers when performing release management duties, e.g. syncing stuff with git-annex right now:
kibi@armor:~/work/clients/tails/release/isos.git$ git annex get tails-amd64-4.*
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.apt-sources (from origin...)
SHA256E-s455--8759b3d834ba480061a814178fcaba653183e063ec554d2792da7cab31244d1d
455 100% 444.34kB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.build-manifest (from origin...)
SHA256E-s115862--f1ad4992232b8f026e612affecfa5d44c28061736d3fd82d35fc34029b3c743e
115,862 100% 367.36kB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.buildlog (from origin...)
SHA256E-s1167089--21d7a15c8065b52e49662273c6a7a0d734ea4bf7c69f623266774767ecc767ec
1,167,089 100% 2.32MB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.img (from origin...)
SHA256E-s1136656384--384db4d74da56c31a4e50bf093526c962d0eb3dee19de3e127fd5acccc063f9b.img
1,136,656,384 100% 6.09MB/s 0:02:58 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.img.sig (from origin...)
SHA256E-s833--b8d1f3e4843d9586b811c16272df34759a6fab8c89c26839e0c0373f091ab343.img.sig
833 100% 813.48kB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.iso (from origin...)
SHA256E-s1126864896--c4c25d9689d8c5927f8ce1569454503fc92494ce53af236532ddb0d6fb34cff3.iso
1,126,864,896 100% 3.93MB/s 0:04:33 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.iso.sig (from origin...)
SHA256E-s833--924ed01855ed92471526c9d68db40d692d6b3dfe970554333f52f3297d6f4f1a.iso.sig
833 100% 813.48kB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.packages (from origin...)
SHA256E-s46699--7d7ce059b09604f08676561e2a62c6142ff3226878257b695f20269f4eab6e8d
46,699 100% 44.54MB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.build-manifest (from origin...)
SHA256E-s115583--95117b4429829ce7e0f29fa887789e1c29c37d6ff9414a2a8e898d88099f426e
115,583 100% 714.39kB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.buildlog (from origin...)
SHA256E-s1165670--352f0e02919afccacaf763a9831497678a9175211f423aae0fa63465018f79bd
1,165,670 100% 3.32MB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.img (from origin...)
SHA256E-s1136656384--920e48fb7b8ab07573f6ad334749dd965c453794b6d33766d545c943c21296ad.img
1,136,656,384 100% 4.05MB/s 0:04:27 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.img.sig (from origin...)
SHA256E-s228--212807a88aca27d186eaebe66e7045b88888ab7f8e806e77ca780a8a8646697e.img.sig
228 100% 222.66kB/s 0:00:00 (xfr#1, to-chk=0/1)
Meanwhile, I’m easily reaching 20+ MB/s when downloading some big files from other servers, so that’s definitely not a bandwidth issue on my side.
FWIW this is with a git-annex setup that uses a direct connection to git.puppet.tails.boum.org, defined as an SSH alias to lizard, so there’s no Tor involved.
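For context, such an alias is just a Host entry in ~/.ssh/config. The snippet below is only an illustrative sketch: the target hostname, user and key path are placeholders, not the actual configuration.
# Hypothetical ~/.ssh/config entry making git.puppet.tails.boum.org connect
# straight to lizard over SSH (no Tor). HostName, User and IdentityFile are
# placeholders, not the real values.
Host git.puppet.tails.boum.org
    HostName lizard.tails.boum.org
    User git
    IdentityFile ~/.ssh/id_ed25519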
At other times, I can get up to 15 MB/s from the ISO history HTTPS access, or over SSH. But that usually only happens when I’m double-checking the numbers after mentioning this issue to my fellow RMs, whereas I hit the slowness precisely when such transfers are on the critical path to a release.
Please let me know what you need from me to help you help me. Thanks!
History
#1 Updated by CyrilBrulebois 2020-01-09 19:02:29
Pushing to ISO history (that particular repository) is now also on the critical path, since one needs to upload the built images there to be able to build IUKs on Jenkins. Currently seeing this:
121,602,048 10% 954.03kB/s 0:18:10
and I’ve got ~ 2.4 GB to upload.
#2 Updated by CyrilBrulebois 2020-01-09 19:43:01
To be a tad more complete:
copy tails-amd64-4.2.1/tails-amd64-4.2.1.img (checking origin...) (to origin...)
SHA256E-s1161822208--19f20ad2dc3d28c695162479e5b1d527381baadddf125f3c66ece34c17c154d1.1.img
1,161,822,208 100% 1.87MB/s 0:09:52 (xfr#1, to-chk=0/1)
ok
copy tails-amd64-4.2.1/tails-amd64-4.2.1.iso (checking origin...) (to origin...)
SHA256E-s1151539200--4fcc4f2d0877f4ac7fdd5867467842eeed49094e37473e347a84798978064533.1.iso
1,151,539,200 100% 1.71MB/s 0:10:40 (xfr#1, to-chk=0/1)
ok
while I could upload those elsewhere ~ 10 times faster.
#3 Updated by intrigeri 2020-01-09 20:28:03
(18:45:54) intrigeri: taggart: fwiw, we're seeing <100 Mbps transfer rates for lizard (did not check if it's "always" or "only when we need it to be fast",
(18:46:09) intrigeri: taggart: which is not consistent with the fact it's supposed to be plugged on a gigabit switch now
(19:45:12) taggart: intrigeri: can you run ethtool and confirm it's got a gig link
(19:45:27) taggart: intrigeri: it's plugged straight into a gig port on our router
(19:46:11) taggart: intrigeri: do you have a bandwidth graph somewhere? also we might need to do some traceroutes to see how it's routing
(19:46:34) intrigeri: taggart: it says Speed: 1000Mb/s
(19:46:43) intrigeri: taggart: yeah, we are on your munin
(19:47:13) intrigeri: I'll report more details later, busy now, I just wanted to check if there was a known issue, sorry I made you context switch!
(19:54:18) taggart: intrigeri: found it https://munin.riseup.net/riseup.net/wren.riseup.net/if_eth7.html
(20:15:49) intrigeri: taggart: ok, so it does go slightly above 100Mb/s, but I do remember that it went closer to gigabit after it was plugged in.
[…]
(21:22:48) taggart: intrigeri: this is what I use for speed testing https://github.com/richb-hanover/OpenWrtScripts
(21:22:59) taggart: the betterspeedtest.sh script
(21:23:20) taggart: I think it needs netperf or flent installed (but the error will be clear)
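For reference, the checks discussed above boil down to something like the following sketch; the interface name is an assumption, and betterspeedtest.sh options may differ between script versions:
# Confirm the NIC negotiated a gigabit link (interface name eth0 is an assumption).
sudo ethtool eth0 | grep -i speed     # expect "Speed: 1000Mb/s"
# Run the speed test taggart suggested; it needs netperf (or flent) installed.
git clone https://github.com/richb-hanover/OpenWrtScripts
cd OpenWrtScripts
sh betterspeedtest.sh                 # exact options vary by script version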
#4 Updated by intrigeri 2020-01-11 10:00:13
- related to Bug #17361: Streamline our release process added
#5 Updated by intrigeri 2020-01-12 19:58:42
- Category set to Infrastructure
- Status changed from New to Confirmed
#6 Updated by zen 2020-01-20 16:08:43
@intrigeri, did you run any test that shows that the problem is not on our side?
I ran the speedtest tool available in Debian a few times in a row, and it consistently gave values around the following:
$ speedtest --secure --simple
Ping: 26.323 ms
Download: 716.46 Mbit/s
Upload: 240.98 Mbit/s
I’ll look for a way to run the proposed custom script, so that our numbers are more consistent with what the provider expects. Other suggestions for measuring bandwidth are also welcome.
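One possible option, sketched here only as a suggestion (it was not run as part of this ticket): a point-to-point measurement with iperf3 between an RM’s machine and lizard would test the actual path rather than the path to a public speedtest server. The hostname below is a placeholder, and iperf3 being installed on both ends is an assumption.
# Server side (e.g. on a lizard VM; needs the iperf3 package):
iperf3 -s
# Client side (hostname is a placeholder):
iperf3 -c lizard.example.org          # client -> server throughput
iperf3 -c lizard.example.org -R       # -R reverses the direction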
#7 Updated by zen 2020-01-20 16:12:00
- Assignee changed from Sysadmins to zen
#8 Updated by intrigeri 2020-01-25 17:18:15
> intrigeri, did you run any test that shows that the problem is not on our side?
I did not.
> I ran the speedtest tool available in Debian a few times in a row, and it consistently gave values around the following:
>
>
> $ speedtest --secure --simple
> Ping: 26.323 ms
> Download: 716.46 Mbit/s
> Upload: 240.98 Mbit/s
>
Hmmm. 240.98 Mbit/s is better than what kibi has seen, but still very far from download rates, let alone from maxing out a gigabit link. This suggests there’s indeed a problem somewhere.
#9 Updated by intrigeri 2020-02-05 08:50:58
FWIW, I have a hunch that the problem may not be a networking problem at all, but rather puppet-git.lizard being resource-constrained and slow. This hunch comes from the fact that the problem seems to happen mostly (only?) with git-annex operations, and not with operations that connect to another VM.
To confirm this, next time this sort of trouble happens, the RM could:
- Share the exact timestamp, so we can look into our Munin graphs and see what was going on around that time. Or, even better, if a sysadmin is around when the problem occurs, they could check current resource usage directly on puppet-git.lizard, which would give us finer-grained data.
- Try downloading a large file over HTTPS from lizard while the slow upload is ongoing, and see if that one is slow too (see the sketch below).
… and sysadmins could check:
- Was rsync.lizard uploading tons of data to mirrors? This could explain why there’s less bandwidth available for other needs.
- Was puppet-git.lizard bottlenecked by CPU, I/O, or anything else?
- Global lizard bandwidth usage.
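A minimal sketch of that RM-side check, assuming the ISO history HTTPS endpoint mentioned later in this ticket; the file path is a placeholder:
# Note the exact time (UTC) so sysadmins can correlate with the Munin graphs.
date -u
# While the slow git-annex transfer is still running, pull a large file over
# HTTPS from lizard and watch the rate curl reports (the path is a placeholder).
curl -o /dev/null https://iso-history.tails.boum.org/tails-amd64-4.2/tails-amd64-4.2.img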
I could try to be around next Monday, when anonym goes through the potentially affected steps of the release process, in order to check things live.
#10 Updated by intrigeri 2020-03-28 08:09:39
FWIW, yesterday between 3:00 and 5:00 PDT (presumably while mirrors were sync’ing from rsync.lizard), lizard pushed up to 800 Mbps of traffic, and was above 400 Mbps half of the time.
So it seems to me that network bandwidth alone does not explain the problem.
This is consistent with my hunch that the problem is specific to git-annex and puppet-git.lizard.
#11 Updated by CyrilBrulebois 2020-03-28 12:59:39
I don’t see anything matching /annex/i in wiki/src/blueprint/GitLab.mdwn (which I was kind of expecting), so the upcoming switch to GitLab at Immerda will probably not change anything here?
I’ll try to remember to download a big file from there next time I notice a slow push, to double-check the bandwidth aspect (which I didn’t remember to do this time, because of, let’s say, suboptimal working conditions).
#12 Updated by intrigeri 2020-03-28 14:04:43
> I don’t see anything matching /annex/i in wiki/src/blueprint/GitLab.mdwn (which I was kind of expecting), so the upcoming switch to GitLab at Immerda will probably not change anything here?
It won’t change anything directly: migrating git-annex repos to GitLab is out of scope (besides, AFAIK GitLab supports git-lfs, but not git-annex).
It may improve things indirectly, if the problem is merely “puppet-git.lizard is overloaded”, by migrating a little bit of the load away from that machine.
> I’ll try to remember to download a big file from there next time I notice a slow push, to double-check the bandwidth aspect
Great!
If one of our sysadmins happens to be around at the time, it would be nice if you asked them to take a look at what seems to be the limiting factor for the server-side git-annex processes.
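A minimal sketch of what such a server-side look could involve, assuming shell access to puppet-git.lizard and that the sysstat tools are installed there (both assumptions):
# Per-process CPU, I/O and memory usage, filtered to the processes serving the
# transfer (pidstat comes from the sysstat package).
pidstat -u -d -r -p ALL 5 | grep -E 'git-annex|rsync|ssh'
# Per-device disk utilization, to see whether I/O is the bottleneck.
iostat -x 5
# Run queue, memory pressure and swapping at a glance.
vmstat 5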
#13 Updated by CyrilBrulebois 2020-04-06 21:11:19
Just confirmed: downloading the 4.5~rc1 IMG/ISO from https://iso-history.tails.boum.org/ at a speed varying between 10 and 15 MB/s, while the upload through git-annex struggles to reach 3 MB/s. Uploading on my side doesn’t seem to be the issue: I’m over 20 MB/s when pushing stuff over SSH elsewhere.
Edit: It seems a tad better than other times, but not hugely:
kibi@armor:~/work/clients/tails/release/isos.git$ git annex copy --to origin tails-amd64-4.5
copy tails-amd64-4.5/tails-amd64-4.5.img (checking origin...) (to origin...)
SHA256E-s1177550848--b992d32826d572d80ddad5a7506e86daed1726661b52ba5a88513eae1e2cda65.5.img
1,177,550,848 100% 3.17MB/s 0:05:54 (xfr#1, to-chk=0/1)
ok
copy tails-amd64-4.5/tails-amd64-4.5.img.sig (checking origin...) (to origin...)
SHA256E-s833--92d79b7f94406dae41c78d8bb6cbde358bc9f6cc59f7b1bcce939c0f3ffc0c1d.img.sig
833 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=0/1)
ok
copy tails-amd64-4.5/tails-amd64-4.5.iso (checking origin...) (to origin...)
SHA256E-s1167990784--eddd88ab6726dbaf62a9ec1bf552e2a90d01b0ea539d7aaae4ad4772473d51ea.5.iso
1,167,990,784 100% 3.65MB/s 0:05:05 (xfr#1, to-chk=0/1)
ok
copy tails-amd64-4.5/tails-amd64-4.5.iso.sig (checking origin...) (to origin...)
SHA256E-s833--dd87b5b47416aae5003decd63d55a46acd8c0e0cb386c10edc0f5621656262e5.iso.sig
833 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=0/1)
ok
(recording state in git...)
This was on 2020-04-06, roughly 23:05 → 23:15 CEST (21:05 → 21:15 UTC).
I’ve seen a spike up to around 6 MB/s once, but it can drop to 1-2 MB/s at times, hence the averages reported above.