Bug #11557

Nightly ISO images are regularly unavailable for a couple minutes

Added by intrigeri 2016-07-04 02:38:31. Updated 2019-03-14 13:12:34.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2016-07-04
Due date:
% Done:

10%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

Reference: https://icingaweb2.tails.boum.org/monitoring/service/history?host=ecours.tails.boum.org&service=nightly_devel

IIRC I reported this bug during the initial dev phase of the monitoring system, and we decided it wasn’t a bug in the monitoring system but in the actual service. So let’s fix the bug in the actual service then :)

Today the 1st occurrence of this problem in www.lizard’s nginx error.log is at 07:54:08. Yesterday it was at 07:54:17. On July 2 it was at 07:54:26, etc. I doubt that’s a coincidence, and I suspect we have a cronjob or something that breaks this stuff every day or close to it (and then either our monitoring is fast enough to report it to us, or it’s not).


Subtasks


History

#2 Updated by intrigeri 2016-08-05 07:47:57

  • Target version set to Tails_2.7

#4 Updated by bertagaz 2016-09-22 05:47:35

  • Target version changed from Tails_2.7 to Tails_2.9.1

#5 Updated by anonym 2016-12-14 20:11:25

  • Target version changed from Tails_2.9.1 to Tails 2.10

#6 Updated by intrigeri 2016-12-18 10:02:48

  • Target version changed from Tails 2.10 to Tails_2.11

#7 Updated by bertagaz 2017-03-08 10:38:05

  • Target version changed from Tails_2.11 to Tails_2.12

#8 Updated by bertagaz 2017-03-08 11:09:21

  • Target version changed from Tails_2.12 to Tails_3.0

#9 Updated by bertagaz 2017-05-15 13:33:20

  • Target version changed from Tails_3.0 to Tails_3.1

#10 Updated by bertagaz 2017-06-08 16:27:30

  • Status changed from Confirmed to In Progress

Easiest lead: the manage_latest_iso_symlinks cronjob on the jenkins VM (run every 5 minutes) sometimes deletes a symlink that the nightly_* HTTP checks (also run every 5 minutes) are expecting to find. That sounds the most plausible explanation. Note that NFS is in play: the filesystem is exported by the jenkins VM and mounted in the www VM.

Also note that it does not raise a critical error, and no notification is sent to us, as most of the time the check resolves on the second retry. So in some regards, the monitoring system responds to this false positive quite well.

Changing the cronjob’s or the HTTP checks’ cycle should at least show whether this is related, if the number of false positives in the monitoring system drops. I’ll try that to validate this hypothesis.

#11 Updated by intrigeri 2017-06-08 17:04:02

> So in some regards, the monitoring system responds to this false positive quite well.

I don’t understand why you say it’s a false positive: the ticket description says it’s a bug in the actual service, which seems quite convincing.
So to me, the monitoring system’s response was good (it let us know we have a problem), but for the opposite reason to the one you’re giving here. Now, whether we should get an alert when this problem happens is another question, which I’d rather not discuss here and now.

Or did I miss something?

#12 Updated by intrigeri 2017-06-29 09:22:45

This happened again today and I had a quick look:

  • the last successful build of devel created its artifacts at 08:08, and finished archiving artifacts at 08:10; and indeed, in the artifacts store, build artifacts were last modified at 08:08, but their parent directory was last modified at 08:10
  • the /srv/nightly/www/build_Tails_ISO_devel/builds/lastSuccessfulBuild symlink was created at 08:10:27, and points to a directory (/srv/nightly/www/build_Tails_ISO_devel/builds/2017-06-29_06-56-00/) that was last modified at 08:10:28
  • the icinga check failed between 08:11:22 and 08:15:22 (according to nginx logs: ecours’ clock is wrong, so the icinga2 history is useless for such debugging; I’ll file a ticket about that); then it started working fine again at 08:22:10
  • the /srv/nightly/www/build_Tails_ISO_devel/builds/2017-06-29_06-56-00/archive/latest.* symlinks were last modified at 08:15:01

So I think that between 08:10 and 08:15, lastSuccessfulBuild was pointing to a build that had no latest.* symlink yet. Ideally, this should never be the case, so this suggests a mismatch between the design of our monitoring check and how we manage these symlinks. There are thus two obvious options:

  1. ensure the latest.* symlinks are never created after a build is flagged as the last successful one; I guess this means creating them as part of an ISO build job, somehow
  2. adjust the check so it verifies something that actually holds true, rather than something that’s only true most of the time; e.g. make it check that there’s an ISO in build_Tails_ISO_devel/lastSuccessful/archive/build-artifacts/ (that would be best; see the sketch after this list)
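
As an illustration of option 2, here is a minimal sketch of what such a check could look like, written as a stand-alone Icinga2/Nagios-style plugin in Python; the URL, the reliance on an auto-indexed directory listing, and the filename pattern are assumptions made for the example, not what our current check does:

    #!/usr/bin/env python3
    # Hypothetical check for option 2: verify that the lastSuccessful build
    # artifacts directory really contains an ISO, instead of relying on the
    # latest.* symlinks.
    import re
    import sys
    import urllib.request

    # Assumption: nginx serves an auto-indexed listing of this directory.
    URL = ("https://nightly.tails.boum.org/"
           "build_Tails_ISO_devel/lastSuccessful/archive/build-artifacts/")

    def main() -> int:
        try:
            with urllib.request.urlopen(URL, timeout=10) as response:
                listing = response.read().decode("utf-8", errors="replace")
        except Exception as exc:  # DNS failure, HTTP error, timeout...
            print(f"CRITICAL: cannot fetch {URL}: {exc}")
            return 2
        if re.search(r'href="[^"]+\.iso"', listing):
            print("OK: found an ISO in the lastSuccessful build artifacts")
            return 0
        print("CRITICAL: no ISO in the lastSuccessful build artifacts")
        return 2

    if __name__ == "__main__":
        sys.exit(main())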

Which one is best depends on what the latest.* symlinks are used for: if they have consumers (other than icinga2), then we should ensure they always exist whenever something needs them; if they have no consumer, then we can keep managing them as-is and adjust the monitoring check.

AFAIK at least jenkins.debian.net relies on these symlinks, so I think we should fix them.

#13 Updated by bertagaz 2017-07-07 13:39:13

  • % Done changed from 0 to 10
  • QA Check set to Dev Needed

intrigeri wrote:
> This happened again today and I had a quick look:

Thanks.

> So I think that between 08:10 and 08:15, lastSuccessfulBuild was pointing to a build that had no latest.* symlink yet. Ideally, this should never be the case, so this suggests a mismatch between the design of our monitoring check and how we manage these symlinks. There are thus two obvious options:

Seems to confirm my hypothesis about the manage_latest_iso_symlinks cronjob being responsible, since it runs asynchronously from the rest of the archiving managed by Jenkins.

> # ensure the latest.* symlinks are never created after a build is flagged as the last successful one; I guess this means creating them as part of an ISO build job, somehow
>
> AFAIK at least jenkins.debian.net relies on these symlinks, so I think we should fix them.

Ok, I think I came up with something that should work better: add a post-build script, after the artifacts archiving step, that will essentially ssh into the Jenkins master to run a script that creates the symlinks. No need for an async cronjob anymore, and the symlink will be created a few seconds after the ISO is available on nightly. Will try that option.
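
To make this more concrete, here is a rough sketch of the kind of symlink-creating script that post-build step could run on the master; the paths, the naming, and the atomic-rename trick are illustrative assumptions, not the existing manage_latest_iso_symlinks code:

    #!/usr/bin/env python3
    # Hypothetical symlink updater: point latest.iso at the newest ISO found in
    # a given artifacts directory, replacing the symlink atomically so that it
    # never briefly disappears for HTTP clients.
    import os
    import sys
    from pathlib import Path

    def update_latest_symlink(archive_dir: Path) -> None:
        isos = sorted(archive_dir.glob("*.iso"), key=lambda p: p.stat().st_mtime)
        if not isos:
            sys.exit(f"no ISO found in {archive_dir}")
        newest = isos[-1]
        link = archive_dir / "latest.iso"
        tmp = archive_dir / "latest.iso.tmp"
        # Build the new symlink under a temporary name, then rename it over the
        # old one: the rename is atomic, so readers always see some latest.iso.
        if tmp.is_symlink() or tmp.exists():
            tmp.unlink()
        tmp.symlink_to(newest.name)
        os.replace(tmp, link)

    if __name__ == "__main__":
        update_latest_symlink(Path(sys.argv[1]))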

#14 Updated by intrigeri 2017-07-07 14:42:03

> Ok, I think I came up with something that should work better: add a post-build script, after the artifacts archiving step, that will essentially ssh into the Jenkins master to run a script that creates the symlinks.

Do you mean our ISO builders will have SSH access to our Jenkins master?

#15 Updated by bertagaz 2017-07-07 15:54:56

intrigeri wrote:
> > Ok, I think I came up with something that should work better: add a post-build script, after the artifacts archiving step, that will essentially ssh into the Jenkins master to run a script that creates the symlinks.
>
> Do you mean our ISO builders will have SSH access to our Jenkins master?

Yes. I kinda (wrongly) remembered they already had, but I checked and I had mixed it up, probably with Gitolite access.

So, another way to do that would be to use a triggered build job on the master after the build that would run what command/script is necessary to create the symlinks. It brings a bit more Jenkins fragility, but it reuses the slave-to-master access rights we already have (i.e. the reboot job).

#16 Updated by intrigeri 2017-07-17 16:32:46

Background (I had to research this to refresh my memory, as the corresponding commits don’t document this): we have to do this on the master because of https://issues.jenkins-ci.org/browse/JENKINS-5597.

> So, another way to do that would be to use a triggered build job on the master after the build that would run what command/script is necessary to create the symlinks.

It’s ugly but it would work much better than what we have now, and I can’t think of a better way to do it. Note that to really fix this problem, the new triggered job’s completion must block the parent build from being marked as successful, otherwise the whole thing will still be racy (there would still be a window during which lastSuccessful points to a place where there’s no latest.* symlink yet). I guess one can do that with Jenkins, right?

And if that’s not feasible for some reason, i.e. if we have to live with this race condition anyway, then don’t bother making our Jenkins setup more complex: just run manage_latest_iso_symlinks in an endless loop (with a short interval between iterations), instead of via cron. Rationale: I can live with the added complexity if it really fixes the problem, but if we’re only aiming at making the problem happen less often, any added complexity doesn’t seem worth it.
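
If we do end up living with the race, the fallback wrapper could be as dumb as the following sketch (the script path and the interval are assumptions):

    #!/usr/bin/env python3
    # Hypothetical wrapper: keep running manage_latest_iso_symlinks with a short
    # pause between iterations, instead of every 5 minutes via cron.
    import subprocess
    import time

    SCRIPT = "/usr/local/bin/manage_latest_iso_symlinks"  # assumed location
    INTERVAL = 15  # seconds; much shorter than the current cron period

    while True:
        # A failed run must not kill the loop: report it and try again.
        result = subprocess.run([SCRIPT])
        if result.returncode != 0:
            print(f"{SCRIPT} exited with {result.returncode}, retrying")
        time.sleep(INTERVAL)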

#17 Updated by bertagaz 2017-07-25 19:15:16

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:
> Background (I had to research this to refresh my memory, as the corresponding commits don’t document this): we have to do this on the master because of https://issues.jenkins-ci.org/browse/JENKINS-5597.

First time I see this link.

> > So, another way to do that would be to use a triggered build job on the master after the build that would run what command/script is necessary to create the symlinks.
>
> It’s ugly but it would work much better than what we have now, and I can’t think of a better way to do it.

I just did: we can use an inotify systemd service that creates this symlink for any new ISO it finds. IIRC you were proposing another symlink for usability; maybe it can handle both?
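
Something along these lines, sketched here with the third-party watchdog library standing in for a raw inotify watch; the watched path and the reuse of the existing symlink script (and its location) are assumptions:

    #!/usr/bin/env python3
    # Hypothetical inotify-style watcher: whenever a new ISO shows up in the
    # artifacts store, immediately re-run the existing symlink-management script.
    import subprocess
    import time
    from pathlib import Path

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    WATCHED = Path("/srv/nightly/www")  # assumed root of the artifacts store

    class IsoCreated(FileSystemEventHandler):
        def on_created(self, event):
            if event.is_directory or not event.src_path.endswith(".iso"):
                return
            # This service only decides *when* to update the symlinks; the *how*
            # stays in the existing script (assumed location).
            subprocess.run(["/usr/local/bin/manage_latest_iso_symlinks"])

    if __name__ == "__main__":
        observer = Observer()
        observer.schedule(IsoCreated(), str(WATCHED), recursive=True)
        observer.start()
        try:
            while True:
                time.sleep(60)
        except KeyboardInterrupt:
            observer.stop()
        observer.join()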

> Note that to really fix this problem, the new triggered job’s completion must block the parent build from being marked as successful, otherwise the whole thing will still be racy (there would still be a window during which lastSuccessful points to a place where there’s no latest.* symlink yet). I guess one can do that with Jenkins, right?

I think so, there’s something like a “blocking” option we’re already using somewhere in the reboot job process. I don’t think we can change the default behavior, which is to fail the build if this blocking remote step fails, though.

> And if that’s not feasible for some reason, i.e. if we have to live with this race condition anyway, then don’t bother making our Jenkins setup more complex: just run manage_latest_iso_symlinks in an endless loop (with a short interval between iterations), instead of via cron. Rationale: I can live with the added complexity if it really fixes the problem, but if we’re only aiming at making the problem happen less often, any added complexity doesn’t seem worth it.

That’s why I’m not very fond of trying to do that in Jenkins: too much overhead and complexity, only to end up relying on a tool I don’t trust much for robustness. The systemd idea sounds better in this regard :)

#18 Updated by intrigeri 2017-07-26 07:31:30

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

> intrigeri wrote:
>> Background (I had to research this to refresh my memory, as the corresponding commits don’t document this): we have to do this on the master because of https://issues.jenkins-ci.org/browse/JENKINS-5597.

> First time I see this link.

You pointed me to it two years ago: Feature #9597#note-41.

>> > So, another way to do that would be to use a triggered build job on the master after the build that would run what command/script is necessary to create the symlinks.
>>
>> It’s ugly but it would work much better than what we have now, and I can’t think of a better way to do it.

> I just did: we can use an inotify systemd service that creates this symlink for any new ISO it finds.

Funny: I had initially mentioned this option in the draft of my previous comment :) And then I rejected it because it’s racy as well, so it didn’t seem any better than my hackish “run manage_latest_iso_symlinks in a loop” fallback option.

>> Note that to really fix this problem, the new triggered job’s completion must block the parent build from being marked as successful, otherwise the whole thing will still be racy (there would still be a window during which lastSuccessful points to a place where there’s no latest.* symlink yet). I guess one can do that with Jenkins, right?

> I think so, there’s something like a “blocking” option we’re already using somewhere in the reboot job process. I don’t think we can change the default behavior, which is to fail the build if this blocking remote step fails, though.

Good to know there is at least one proper solution that’s not racy.

So I see three candidate options:

  1. Jenkins triggered job; pros = it is the only proposed solution so far that removes the race condition, while our other options merely make it more likely that the right component wins the race; cons = a bit more complex (although we’re still in “basic Jenkins stuff” land), which might make us lose more robustness than what the non-racy property gives us
  2. run manage_latest_iso_symlinks in a loop
  3. something based on inotify

I must say I dislike our tendency to stick with solutions that we know are bound, by design, to fail N% of the time e.g. because they’re racy and we have no guarantee we’ll win the race. I think this tendency is one of the reasons why some of our infra is not very reliable. So on some level I vastly prefer the Jenkins option: Jenkins might not be as robust as we would like (although I personally can’t recall many issues that were caused by Jenkins itself, rather than by the stuff we implemented on top of it), but at least the design would be provably correct. Another reason why I like the Jenkins triggered job option is that it’s more self-contained, and thus easier to grasp and reason about: all the logic about symlinks (our own ones + the lastSuccessful & friends ones) would live in Jenkins-world, which makes it easier for me to reason about the interactions between these processes, as opposed to “Jenkins does one bit and we do something else independently, without any coordination” that is precisely the root cause of the bug this ticket is about.

Now, this being said, this might be one of these cases when KISS is better, even when it implements a broken design. I don’t fully get how much more complex the Jenkins option would be (at first glance it seems not substantially more work than the 2 other options). So I’ll let you pick your preferred option, picking a non-Jenkins one if it’s really simpler to understand/maintain/debug and/or faster to implement.

#19 Updated by bertagaz 2017-07-27 15:51:09

intrigeri wrote:
> You pointed me to it two years ago: Feature #9597#note-41.

Fun! Seems two years is out of my memory cache. :)

> Funny: I had initially mentioned this option in the draft of my previous comment :) And then I rejected it because it’s racy as well, so it didn’t seem any better than my hackish “run manage_latest_iso_symlinks in a loop” fallback option.

Racy by design, sure, but it lowers the time during which the symlink is missing so much (inotify notification + a tiny script) that the probability the Icinga2 check fails is very low. But racy, I agree. :)

> So I see three candidate options:
> [..]
> Now, this being said, this might be one of these cases when KISS is better, even when it implements a broken design. I don’t fully get how much more complex the Jenkins option would be (at first glance it seems not substantially more work than the 2 other options). So I’ll let you pick your preferred option, picking a non-Jenkins one if it’s really simpler to understand/maintain/debug and/or faster to implement.

Ack, I’ll try with the Jenkins one to see how robust it is then.

#20 Updated by intrigeri 2017-07-27 16:26:42

> […] the probability the Icinga2 check fails is very low.

Let’s please switch to a more user-centric approach. Icinga2 is never the user/client/stakeholder we care about, and we should thus essentially never reason in terms of making Icinga2 happy (unless the bug is in the monitoring check itself, of course). Let’s instead think about problems from the perspective of the actual users/clients of the piece of infra we’re working on. Deal? (Not as if it changed much in practice here, but still, let’s teach ourselves good habits instead of bad ones :)

#21 Updated by bertagaz 2017-08-09 14:29:25

  • Target version changed from Tails_3.1 to Tails_3.2

#22 Updated by bertagaz 2017-09-07 13:03:40

  • Target version changed from Tails_3.2 to Tails_3.3

#23 Updated by bertagaz 2017-10-13 10:05:57

intrigeri wrote:
> > So, another way to do that would be to use a triggered build job on the master after the build that would run what command/script is necessary to create the symlinks.
>
> It’s ugly but it would work much better than what we have now, and I can’t think of a better way to do it. Note that to really fix this problem, the new triggered job’s completion must block the parent build from being marked as successful, otherwise the whole thing will still be racy (there would still be a window during which lastSuccessful points to a place where there’s no latest.* symlink yet). I guess one can do that with Jenkins, right?

I’ve tried to implement this and came up with something in a dedicated branch. Sadly I didn’t find a way to get a triggered job to block its parent one, so this won’t fix the situation as much as I believed.

> And if that’s not feasible for some reason, i.e. if we have to live with this race condition anyway, then don’t bother making our Jenkins setup more complex: just run manage_latest_iso_symlinks in an endless loop (with a short interval between iterations), instead of via cron. Rationale: I can live with the added complexity if it really fixes the problem, but if we’re only aiming at making the problem happen less often, any added complexity doesn’t seem worth it.

So I’ll go on with this other option.

#24 Updated by bertagaz 2017-10-21 11:49:38

bertagaz wrote:
> intrigeri wrote:
> > > So, another way to do that would be to use a triggered build job on the master after the build that would run what command/script is necessary to create the symlinks.
> >
> > It’s ugly but it would work much better than what we have now, and I can’t think of a better way to do it. Note that to really fix this problem, the new triggered job’s completion must block the parent build from being marked as successful, otherwise the whole thing will still be racy (there would still be a window during which lastSuccessful points to a place where there’s no latest.* symlink yet). I guess one can do that with Jenkins, right?
>
> I’ve tried to implement this and came up with something in a dedicated branch. Sadly I didn’t find a way to get a triggered job to block its parent one, so this won’t fix the situation as much as I believed.

While working on Feature #12633, I discovered the downstream-ext Jenkins plugin, which adds a simple publisher to a job that can trigger a run of another job.

Publishers are part of the build process and are blocking operations AFAIK. So if we use this after the artifact archiving publisher, we should be able to start a job on the master to manage the “latest.iso” symlink in a timely manner.

This plugin has no dependencies, so it should be easy to install in our Jenkins. I’ll give this solution a try.

#25 Updated by anonym 2017-11-15 11:30:50

  • Target version changed from Tails_3.3 to Tails_3.5

#26 Updated by anonym 2018-01-23 19:52:36

  • Target version changed from Tails_3.5 to Tails_3.6

#27 Updated by bertagaz 2018-03-14 11:32:11

  • Target version changed from Tails_3.6 to Tails_3.7

#29 Updated by bertagaz 2018-05-10 11:09:16

  • Target version changed from Tails_3.7 to Tails_3.8

#30 Updated by intrigeri 2018-06-26 16:27:54

  • Target version changed from Tails_3.8 to Tails_3.9

#31 Updated by intrigeri 2018-09-05 16:26:53

  • Target version changed from Tails_3.9 to Tails_3.10.1

#32 Updated by intrigeri 2018-10-24 17:03:37

  • Target version changed from Tails_3.10.1 to Tails_3.11

#33 Updated by CyrilBrulebois 2018-12-16 13:54:12

  • Target version changed from Tails_3.11 to Tails_3.12

#34 Updated by anonym 2019-01-30 11:59:15

  • Target version changed from Tails_3.12 to Tails_3.13

#35 Updated by Anonymous 2019-03-14 13:12:08

  • Assignee deleted (bertagaz)
  • Deliverable for deleted (SponsorS_Internal)

A year later.

Problem: the latest.iso* symlinks for “nightly” ISO images are broken every day for ~5 minutes, which breaks their download over HTTPS for contributors, users asked to test something, early testers, and external infra like the reproducible one.

Blocking status: having these stable links was not explicitly part of bertagaz’ SponsorS job, and we use them less than we used to (for unrelated reasons, we’ve found them less useful than we thought), so the impact is rather low. So I think we can drop the “Deliverable for”, and then breathe, take a step back, and re-prioritize this as part of our big pile of small-ish sysadmin tasks that would be good to tackle sooner rather than later.

#36 Updated by Anonymous 2019-03-14 13:12:34

  • Target version deleted (Tails_3.13)