Bug #17647

Connection failures from VMs to git.tails.boum.org

Added by CyrilBrulebois 2020-04-22 10:12:23. Updated 2020-05-10 05:55:46.

Status:
Confirmed
Priority:
Elevated
Assignee:
Sysadmins
Category:
Target version:
Start date:
Due date:
% Done:

0%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

Hi sysadmins,

This issue has been somewhat intermittent over the last few days, but it can persist for a while, leading to test suite failures plus regular cron mails about git’s being unreachable.
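For context, the failing check is easy to reproduce by hand; a minimal sketch (hypothetical, the real cron job is part of our own tooling):

# git ls-remote only needs a TCP connection and a tiny HTTP exchange,
# so it tests reachability without pulling any repository data.
git ls-remote https://git.tails.boum.org/tails/ >/dev/null || echo 'git.tails.boum.org unreachable'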

Just tested this from the apt.lizard VM (192.168.122.3):

kibi@apt:~$ wget https://git.tails.boum.org/tails/
--2020-04-22 10:01:26--  https://git.tails.boum.org/tails/
Resolving git.tails.boum.org (git.tails.boum.org)... 212.103.72.241
Connecting to git.tails.boum.org (git.tails.boum.org)|212.103.72.241|:443... connected.

It took more than a minute to get to the “connected” state, and after 10 minutes I’m still waiting for a reply…
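To quantify where the time goes, here is a sketch of a timing check (not something I ran, but curl’s write-out variables separate the TCP connect phase from the HTTP response):

# A slow connect points at the network path; a fast connect with a slow
# total points at the service itself.
curl -o /dev/null -s -w 'connect: %{time_connect}s  total: %{time_total}s\n' https://git.tails.boum.org/tails/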

Meanwhile, I haven’t had any issues connecting to the same service (resolving to the same IP) from my home connection; the reply is instantaneous, so I don’t think this is due to git being too busy (hammered by bots or something) to reply.
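Since the same IP behaves so differently from the two vantage points, comparing routes might help; a hedged diagnostic would be running mtr’s report mode from both the VM and an outside host:

# 100 cycles; per-hop loss percentages show where packets start dying.
mtr --report --report-cycles 100 git.tails.boum.org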

Could you please have a look and let me know if you need more information? Thanks for your time.


Subtasks


History

#1 Updated by CyrilBrulebois 2020-04-22 10:15:52

For completeness, we know about at least two failure modes from apt.lizard (the aforementioned VM):

fatal: unable to access 'https://git.tails.boum.org/tails/': Operation timed out after 300045 milliseconds with 0 out of 0 bytes received

and:

fatal: unable to access 'https://git.tails.boum.org/tails/': Failed to connect to git.tails.boum.org port 443: Connection timed out

but that’s not limited to that VM; we’ve also seen this from isobuilders (e.g. isobuilder3.lizard/192.168.122.26):

hudson.plugins.git.GitException: Command "git submodule update --init --recursive submodules/jenkins-tools" returned status code 1:
stdout:
stderr: Cloning into 'https://jenkins.tails.boum.org/job/build_Tails_ISO_devel/ws/submodules/jenkins-tools'...
fatal: unable to access 'https://git-tails.immerda.ch/jenkins-tools/': Failed to connect to git-tails.immerda.ch port 443: Connection timed out
fatal: clone of 'https://git-tails.immerda.ch/jenkins-tools' into submodule path 'https://jenkins.tails.boum.org/job/build_Tails_ISO_devel/ws/submodules/jenkins-tools' failed
Failed to clone 'submodules/jenkins-tools'. Retry scheduled
Cloning into 'https://jenkins.tails.boum.org/job/build_Tails_ISO_devel/ws/submodules/jenkins-tools'...
fatal: unable to access 'https://git-tails.immerda.ch/jenkins-tools/': Operation timed out after 300020 milliseconds with 0 out of 0 bytes received
fatal: clone of 'https://git-tails.immerda.ch/jenkins-tools' into submodule path 'https://jenkins.tails.boum.org/job/build_Tails_ISO_devel/ws/submodules/jenkins-tools' failed
Failed to clone 'submodules/jenkins-tools' a second time, aborting
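A possible mitigation on the builders (untested, just a sketch using standard git options) would be to make git give up on stalled transfers sooner, so Jenkins gets to its retry faster:

# Abort a transfer that stays below 1000 bytes/s for 30 seconds,
# rather than waiting out the ~300 s timeouts in the logs above.
git config --global http.lowSpeedLimit 1000
git config --global http.lowSpeedTime 30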

#2 Updated by CyrilBrulebois 2020-04-22 10:17:18

  • Subject changed from “Connection failures from apt.lizard to git.tails.boum.org” to “Connection failures from VMs to git.tails.boum.org”

Adjusting title accordingly: this is not limited to connections coming from apt.lizard.

#3 Updated by groente 2020-04-23 15:09:16

  • Status changed from Confirmed to Needs Validation
  • Assignee changed from Sysadmins to CyrilBrulebois

Hey Cyril,

We had a lot of packet loss on incoming TCP packets, but that seems to have magically fixed itself after a shorewall restart… Are things working better for you now?
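For the record, one quick way to validate this from an affected VM (a sketch, not something the monitoring runs) is a quiet ping batch, whose summary line reports the loss percentage directly:

# 100 probes; the summary prints "X% packet loss".
ping -c 100 -q git.tails.boum.org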

#4 Updated by CyrilBrulebois 2020-04-23 15:28:42

  • Status changed from Needs Validation to Resolved

Hey @groente,

The last failure notification from cron seems to be dated “Thu, 23 Apr 2020 02:37:17 +0000”, and the test suite run between 07:57:12 and 13:05:07 (UTC) seems to have worked fine as well, so that looks good, thanks!

Closing for the time being; I’ll open another ticket if the problem comes back.

#5 Updated by CyrilBrulebois 2020-04-25 08:12:22

  • Status changed from Resolved to Confirmed
  • Assignee changed from CyrilBrulebois to Sysadmins

This has started again already; first notification received: Sat, 25 Apr 2020 07:07:16 +0000.

#6 Updated by CyrilBrulebois 2020-04-25 19:49:54

This seems to also affect updating the website when pushing the master branch:

remote: To redmine.tails.boum.org:/srv/repositories/tails.git
remote:    c161c88aff..5f8074c850  master -> master
remote:    d20a4f21c2..c161c88aff  labs/master -> labs/master
remote: ssh: connect to host git.tails.boum.org port 22: Connection timed out

I don’t think I have any means to trigger a rebuild on my own, i.e. I’m not aware of a workaround for this specific issue.
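If it helps triage, a plain TCP probe to the SSH port from the pushing host would tell whether this is the same connect-phase failure as on port 443 (hypothetical command, standard netcat flags):

# -z: only test that the connection opens; -w 10: give up after 10 s.
nc -vz -w 10 git.tails.boum.org 22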

#7 Updated by zen 2020-04-27 16:53:54

Update: it looks like the problem is not in our server, as our router is also affected. We’re investigating to pin down where the problem lies and fix it.

#8 Updated by intrigeri 2020-04-29 13:04:01

Hi,

> Update: it looks like the problem is not in our server, as our router is also affected. We’re investigating to pin down where the problem lies and fix it.

Thank you for investigating!

I’d like to propose a mitigation measure for the “RMs are spammed like crazy” user-facing problem: decrease the frequency of the cronjob that runs tails-update-reprepro-config.
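Concretely, that could be as small as a one-line change in the system crontab; a sketch, with the schedule and path assumed rather than checked:

# Hypothetical: run the update every 30 minutes instead of (say) every
# 5, so a flaky link produces far fewer failure mails.
*/30 * * * * root /usr/local/bin/tails-update-reprepro-config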

#9 Updated by CyrilBrulebois 2020-05-04 16:32:48

I’m having serious trouble releasing… This is happening between 16:00 UTC and 16:30 UTC, 2020-05-04:

At first:

./bin/tag-apt-snapshots "${BUILD_MANIFEST:?}" "${TAG:?}"
I: Preparing a workspace on apt.lizard
Authentication failed.

Now:

./bin/tag-apt-snapshots "${BUILD_MANIFEST:?}" "${TAG:?}"
I: Preparing a workspace on apt.lizard
ERROR: Got error response from SOCKS server: 6 (TTL expired).
FATAL: failed to begin relaying via SOCKS.
ssh_exchange_identification: Connection closed by remote host

It’s not entirely clear where errors are coming from, but I thought I’d mention them.

EDIT: The latter might have been tor reconnecting on my end or something. I’ve now managed to get past this step.
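One way to tell the two apart (a hedged sketch, assuming tor’s default SOCKS port on my machine) would have been to push a request through the same SOCKS path by hand:

# If this stalls too, the problem is the local tor client rather than
# lizard or the git service.
curl --socks5-hostname 127.0.0.1:9050 -sI https://git.tails.boum.org/tails/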

Extra info: I suppose you’re aware given the monitoring, but we’re seeing another round of disruption, spanning (at least) from Sat, 02 May 2020 13:18:23 +0000 to Mon, 04 May 2020 16:43:17 +0000 (again, trying to access git.tails.boum.org:443 from a VM on lizard).

#10 Updated by CyrilBrulebois 2020-05-04 21:29:04

On a related note, trying to push the 4.6 images (IMG+ISO, so a little over 2 GB), I’m seeing the transfer to lizard:3004 measured at… 20 kB/s!

(Meanwhile, I can push stuff elsewhere at 20+ MB/s, so the local connection is not the problem.)

We need to find a solution, soon. Uploading those images is a prerequisite for building IUKs on Jenkins (relatedly: Bug #17658), and waiting 20 hours is not acceptable.
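In the meantime, to at least avoid restarting a 2 GB upload from scratch whenever the link chokes, something like rsync’s partial-transfer mode could help (a sketch; the destination path and transport are assumptions, since the real upload goes to lizard:3004):

# --partial keeps already-transferred data if the connection drops,
# which matters a lot at 20 kB/s. The destination path is hypothetical.
rsync --partial --progress tails-amd64-4.6.img lizard:/srv/incoming/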

#11 Updated by CyrilBrulebois 2020-05-05 00:39:52

I suppose this might also explain why the test suite got killed after 60 minutes of inactivity (see Bug #17678)…

#12 Updated by CyrilBrulebois 2020-05-10 05:55:46

Could we please get the firewall kicked one more time?

This seems to be back; first notification timestamped Sun, 10 May 2020 04:04:19 +0000.
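For reference, the “kick” that helped in note #3 was presumably just the firewall restart (whether and when to run it again is the sysadmins’ call):

# Restarting shorewall cleared the inbound packet loss last time.
shorewall restart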