Bug #17364
The build of our production website should be self-healing
Description
In a variety of situations, an ikiwiki refresh triggered by a Git push fails and leaves the website in an unclean state. The only way to recover is then to SSH into the machine and manually start a full rebuild. This is painful because:
- When this happens during a release process, the release can be left half-published until someone fixes it. That’s not fun for the RM.
- It puts timing/availability/expectations pressure on sysadmins.
- I suspect our technical writers have grown wary of pushing some kinds of changes that typically trigger this sort of problem. Not being able to do one’s job with a reasonable amount of confidence in oneself and in our infra is surely not fun.
Ideally, somehow our infra would notice this situation and run a full rebuild itself.
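One way to picture the self-healing behaviour described above is a wrapper that tries the incremental refresh and falls back to a full rebuild when the refresh exits with an error. This is only a minimal sketch: `SETUP_FILE`, `run` and `build_site` are hypothetical names, the real setup file path is not given in this issue, and it assumes that a failed refresh is reliably signalled by a non-zero exit status.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical path: the real setup file is not named in this issue.
SETUP_FILE = "ikiwiki.setup"


def run(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def build_site(runner: Callable[[Sequence[str]], bool] = run) -> str:
    """Try an incremental refresh; fall back to a full rebuild if it fails."""
    if runner(["ikiwiki", "--setup", SETUP_FILE, "--refresh"]):
        return "refreshed"
    # Assumption: a full rebuild is enough to recover from the unclean state.
    if runner(["ikiwiki", "--setup", SETUP_FILE, "--rebuild"]):
        return "rebuilt"
    raise RuntimeError("both refresh and full rebuild failed")
```

The `runner` parameter exists only so that the fallback logic can be exercised without ikiwiki installed. If the rebuild fails too, the exception leaves room for alerting a human.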
History
#1 Updated by intrigeri 2019-12-18 11:53:51
- related to Bug #17361: Streamline our release process added
#2 Updated by intrigeri 2020-05-01 16:29:14
FWIW, I’ve played a bit with GitLab Pages for a different project and I liked the fact that:
- the build happens in a controlled, mostly reproducible environment, so problems caused by transitions between states are less of an issue
- everyone can look at the build output: not only the person who pushed, but also the person who should investigate and debug what happened
- the output of the build is published only if it succeeded ⇒ no partly refreshed, half broken website in production
- developers can fix stuff themselves via the GitLab CI config file, if needed
I don’t think we’ll want to serve our website via GitLab Pages any time soon. But the general idea of building the website in a CI job, and then deploying the output only upon success, may solve most of the problems this issue is about. That is especially true if there’s a simple way for a developer or tech writer to force a full rebuild in the CI job, as opposed to the (default) incremental refresh that sometimes breaks and currently requires sysadmin intervention.
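To illustrate the "deploy the output only upon success" idea, here is a rough sketch that builds into a fresh directory and repoints a symlink only when the build reports success. Everything in it is an assumption made for illustration: the `publish_if_built` name, the `releases` directory layout, and the premise that the web server serves the site through a symlink (`live_link`) on a POSIX filesystem.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def publish_if_built(
    build: Callable[[Path], bool], releases: Path, live_link: Path
) -> bool:
    """Build into a fresh directory; repoint live_link only on success."""
    releases.mkdir(parents=True, exist_ok=True)
    out = Path(tempfile.mkdtemp(prefix="build-", dir=releases))
    os.chmod(out, 0o755)  # mkdtemp creates a private (0700) directory
    if not build(out):
        # live_link still points at the previous good build.
        # The failed output is left in place for debugging.
        return False
    tmp_link = live_link.with_name(live_link.name + ".new")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(out, tmp_link)
    # Renaming over the old symlink is atomic on POSIX, so visitors see
    # either the old site or the new one, never a partly refreshed mix.
    os.replace(tmp_link, live_link)
    return True
```

A real deployment would also need to prune old build directories, which this sketch leaves out.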
#3 Updated by sajolida 2020-05-04 17:57:07
I understand that it will create a delay of 1 complete build time between a “push” and the update of the production website, right? This might affect:
- Technical writers and UX designers: I don’t think that we really care about such a delay in our daily work, and a slower but more stable build would definitely be an improvement for my work.
- Release managers: It might get them back to where they were before the top bar (Bug #17431).
Or could GitLab Pages try a “refresh” first and then a “rebuild” only if it fails (maybe triggered manually)?
#4 Updated by intrigeri 2020-05-07 09:05:04
> Or could GitLab Pages try a “refresh” first and then a “rebuild” only if it fails (maybe triggered manually)?
My preference would be: the CI job refreshes by default, and if needed a developer can force a full rebuild (i.e. invalidate the cache) by passing a parameter or something.
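As a sketch of what "passing a parameter" could look like, the build script might pick between refresh and rebuild based on a command-line flag or an environment variable set when the CI job is started manually. The `--full-rebuild` flag, the `FORCE_FULL_REBUILD` variable and the `build_mode` function are all made-up names for illustration, not existing options of our tooling.

```python
import argparse
from typing import Mapping, Sequence


def build_mode(argv: Sequence[str], env: Mapping[str, str]) -> str:
    """Return the ikiwiki mode flag: refresh by default, rebuild if forced."""
    parser = argparse.ArgumentParser(description="Build the website")
    parser.add_argument(
        "--full-rebuild",  # hypothetical flag
        action="store_true",
        help="ignore cached state and rebuild everything",
    )
    args = parser.parse_args(argv)
    # Hypothetical variable that a developer could set when manually
    # starting the CI job.
    forced = args.full_rebuild or env.get("FORCE_FULL_REBUILD", "") == "1"
    return "--rebuild" if forced else "--refresh"
```

For example, `build_mode(sys.argv[1:], os.environ)` would give the flag to pass to ikiwiki, so the default path stays the fast incremental refresh.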