Bug #17364

The build of our production website should be self-healing

Added by intrigeri 2019-12-18 11:53:33. Updated 2020-05-07 09:05:04.

Status: Confirmed
Priority: Elevated
Assignee:
Category: Infrastructure
Target version:
Start date:
Due date:
% Done: 0%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for:

Description

In a variety of situations, an ikiwiki refresh triggered by a Git push fails and leaves the website in an unclean state. The only way to recover is then to ssh into the machine and manually start a full rebuild. This is painful because:

  • When this happens during a release process, the release can be left half-published until someone fixes it. That’s not fun for the RM.
  • It puts timing/availability/expectations pressure on sysadmins.
  • I suspect our technical writers have grown wary of pushing some kinds of changes that typically trigger this sort of problem. Not being able to do one’s job with a reasonable amount of confidence in oneself and in our infra is surely not fun.

Ideally, somehow our infra would notice this situation and run a full rebuild itself.
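One way the "notice and rebuild" idea could look is sketched below. This is only an illustration, not how our infra works today: it assumes a failed refresh can be detected from a non-zero ikiwiki exit status, which may not cover every unclean state. The function names and the setup file path are hypothetical; `--setup`, `--refresh` and `--rebuild` are the standard ikiwiki options.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command line and returns its exit status.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status (0 means success)."""
    return subprocess.run(list(cmd), check=False).returncode


def refresh_or_rebuild(setup_file: str, runner: Runner = run) -> str:
    """Try an incremental refresh; fall back to a full rebuild if it fails.

    Assumption: a broken refresh shows up as a non-zero exit status.
    """
    if runner(["ikiwiki", "--setup", setup_file, "--refresh"]) == 0:
        return "refreshed"
    if runner(["ikiwiki", "--setup", setup_file, "--rebuild"]) == 0:
        return "rebuilt"
    # Nothing more to try automatically: a human has to look at this.
    raise RuntimeError("both the refresh and the full rebuild failed")
```

A Git hook or a cron job could call `refresh_or_rebuild("/path/to/ikiwiki.setup")` (hypothetical path) and alert sysadmins only when the exception is raised.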


Related issues

Related to Tails - Bug #17361: Streamline our release process (Confirmed)

History

#1 Updated by intrigeri 2019-12-18 11:53:51

  • related to Bug #17361: Streamline our release process added

#2 Updated by intrigeri 2020-05-01 16:29:14

FWIW, I’ve played a bit with GitLab Pages for a different project and I liked the fact that:

  • the build happens in a controlled, mostly reproducible environment, so transitions between states are less likely to cause problems
  • everyone can look at the build output: not only the person who pushed, but also the person who should investigate and debug what happened
  • the output of the build is published only if it succeeded ⇒ no partly refreshed, half broken website in production
  • developers can fix stuff themselves via the GitLab CI config file, if needed

I don’t think we’ll want to serve our website via GitLab Pages any time soon. Still, the general idea of building the website via a CI job, and then deploying the output upon success, may solve most of the problems this issue is about. This is especially true if there’s a simple way for a developer or tech writer to force a full rebuild in the CI job, as opposed to the (default) incremental refresh that sometimes breaks and currently requires sysadmin intervention.
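To illustrate the "deploy the output only upon success" part, here is a minimal sketch. It assumes the web server serves the site through a symlink that can be switched atomically, which is an assumption and not a description of our current setup. The `build` callable, the `releases` directory and `live_link` are hypothetical names.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def build_and_publish(build: Callable[[Path], bool],
                      releases: Path, live_link: Path) -> bool:
    """Build into a fresh directory; switch live_link only if the build succeeded."""
    releases.mkdir(parents=True, exist_ok=True)
    out = Path(tempfile.mkdtemp(prefix="build-", dir=releases))
    out.chmod(0o755)  # mkdtemp creates 0700 directories; let the web server read it
    if not build(out):
        # live_link is untouched, so production keeps serving the previous build.
        shutil.rmtree(out, ignore_errors=True)
        return False
    tmp_link = live_link.with_name(live_link.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(out.resolve(), tmp_link)
    # Renaming over the existing symlink is atomic on POSIX filesystems.
    os.replace(tmp_link, live_link)
    return True
```

With this layout, visitors see either the old build or the new one, never a partly refreshed tree. Old build directories would need to be pruned separately.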

#3 Updated by sajolida 2020-05-04 17:57:07

I understand that this will delay each update of the production website
by one complete build time after a “push”, right? This might
affect:

- Technical writers and UX designers: I don’t think that we really care
about such a delay in our daily work and a slower but more stable
build would definitely be an improvement for my work.

- Release managers: It might get them back to where they were before the
top bar (Bug #17431).

Or could GitLab Pages try a “refresh” first and then a “rebuild” only if
it fails (maybe triggered manually)?

#4 Updated by intrigeri 2020-05-07 09:05:04

> Or could GitLab Pages try a “refresh” first and then a “rebuild” only if it fails (maybe triggered manually)?

My preference would be: the CI job refreshes by default, and if needed a developer can force a full rebuild (i.e. invalidate the cache) by passing a parameter or something.
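The "refresh by default, rebuild on request" behaviour could be sketched as follows. The `FORCE_REBUILD` variable is a hypothetical name for the parameter, for example a variable set when triggering the CI job by hand, and `ikiwiki_command` is a hypothetical helper that only picks the ikiwiki mode. Invalidating the CI cache itself depends on the CI system and is not shown.

```python
import os
from typing import List, Mapping


def ikiwiki_command(setup_file: str,
                    env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the ikiwiki command line: refresh by default, rebuild on request.

    FORCE_REBUILD is a hypothetical variable name, not an existing setting.
    """
    force = env.get("FORCE_REBUILD", "").strip().lower() in {"1", "true", "yes"}
    mode = "--rebuild" if force else "--refresh"
    return ["ikiwiki", "--setup", setup_file, mode]
```

A developer or tech writer would then re-run the job with `FORCE_REBUILD=1` when an incremental refresh has left things in a bad state, without needing a sysadmin.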