Feature #9430

Make our build system more robust vs. apt-get transient errors

Added by anonym 2015-05-19 09:44:45. Updated 2016-06-08 01:34:19.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2015-05-19
Due date:
% Done:

100%

Feature Branch:
feature/9430-build-system-vs-apt-transient-errors
Type of work:
Research
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

For instance, for build failures like

W: Failed to fetch http://ftp.us.debian.org/debian/dists/experimental/main/binary-i386/Packages  Hash Sum mismatch

E: Some index files failed to download. They have been ignored, or old ones used instead.
Fetched 43.1 MB in 19s (2224 kB/s)
P: Begin unmounting filesystems...

we should teach Jenkins to detect them and trigger a rebuild, instead of notifying the responsible party about what is very often a meaningless error.

It would probably be even better to make our build system retry a few times (+ change mirrors, if any?) on such errors before giving up for real.
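
A minimal sketch of the "retry a few times and change mirrors" idea, in Python. Everything here is hypothetical and for illustration only: `fetch_with_retries`, the `fetch` callable and the mirror list are not part of the Tails build system.

```python
import itertools
import time


def fetch_with_retries(fetch, mirrors, max_attempts=3, delay=0.0):
    """Call fetch(mirror) until it succeeds or max_attempts is reached.

    `fetch` is a hypothetical callable that returns True on success and
    False on a transient failure. Mirrors are tried in round-robin order.
    Returns the mirror that worked, or None if every attempt failed.
    """
    if not mirrors:
        raise ValueError("at least one mirror is required")
    rotation = itertools.cycle(mirrors)
    for attempt in range(1, max_attempts + 1):
        mirror = next(rotation)
        if fetch(mirror):
            return mirror
        if attempt < max_attempts:
            time.sleep(delay)  # wait before trying the next mirror
    return None
```

Passing the fetch step in as a callable keeps the retry policy separate from whatever actually runs apt-get, so the policy can be exercised without network access.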


Subtasks


History

#1 Updated by anonym 2015-05-19 09:47:13

  • Assignee set to intrigeri
  • Target version set to Tails_1.4.1

Please change the milestone as you see fit.

#2 Updated by intrigeri 2015-05-22 16:35:22

For now, just pasting what I wrote on -dev@:

I guess that’s somehow possible with Jenkins only, but it most likely
requires twisting its semantics quite a bit. I’m happy to give
a closer look at it one of these days, in case there’s a neat solution
to this problem => please give me a research ticket :)

However, long-term I think we’ll have to use something like Zuul,
that’s dedicated to orchestrating jobs and to mediating between our
Jenkins job needs, their result, and whatever action should be taken.
IIRC that’s how the OpenStack project CI handles this kind
of problem.

On the short term, perhaps teaching APT or our build system to retry
such operations on failures would be a good enough, and muuuch
simpler, workaround.

#3 Updated by intrigeri 2015-06-13 05:30:46

  • Subject changed from Retry jenkins builds for transient errors to Have our build system more resistant to apt-get transient errors
  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

Looking into “teaching APT or our build system to retry such operations on failures” first.

First, in the APT configuration, for the Acquire group I see:

  • Retries: Number of retries to perform. If this is non-zero APT will retry failed files the given number of times.
  • ForceIPv4 might be useful: our networking config doesn’t support IPv6, which might cause issues
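
For illustration, the two options above can be passed to apt-get on the command line with `-o`. This is only a sketch: the helper name `apt_get_command` and the retry count of 3 are assumptions, and the feature branch may well set these options in an APT configuration file instead.

```python
def apt_get_command(args, retries=3, force_ipv4=True):
    """Build an apt-get argument list with the Acquire options set.

    The default of 3 retries is an arbitrary example value, not the one
    used by the Tails build system.
    """
    cmd = ["apt-get", "-o", f"Acquire::Retries={retries}"]
    if force_ipv4:
        # Only useful where IPv6 is not supported by the network setup.
        cmd += ["-o", "Acquire::ForceIPv4=true"]
    return cmd + list(args)
```

The resulting list can be handed to `subprocess.run` as-is, which avoids shell quoting issues.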

One should also look into acng’s configuration options.

#4 Updated by intrigeri 2015-06-13 05:37:47

  • Feature Branch set to feature/9430-build-system-vs-apt-transient-errors

#5 Updated by intrigeri 2015-06-13 06:11:01

  • Subject changed from Have our build system more resistant to apt-get transient errors to Make our build system more resistant to apt-get transient errors

#6 Updated by intrigeri 2015-06-13 06:12:47

  • % Done changed from 10 to 20

Merged that into experimental, we’ll see if transient errors still happen on Jenkins for that branch.

#7 Updated by intrigeri 2015-06-15 01:58:32

Also note that once Feature #5926 is done, we won’t have problems with hitting differently sync’d mirrors and hashsum mismatches anymore.

#8 Updated by intrigeri 2015-06-18 07:31:21

  • Subject changed from Make our build system more resistant to apt-get transient errors to Make our build system more robust to apt-get transient errors

#9 Updated by intrigeri 2015-06-18 07:31:39

  • Subject changed from Make our build system more robust to apt-get transient errors to Make our build system more robust vs. apt-get transient errors

#10 Updated by intrigeri 2015-06-28 02:17:54

  • Target version changed from Tails_1.4.1 to Tails_1.5

Let’s give it a few more weeks to see if the changes merged into experimental make a difference at all.

#11 Updated by intrigeri 2015-07-12 03:00:52

  • Target version changed from Tails_1.5 to Tails_1.6

I need to lighten my plate.

#12 Updated by intrigeri 2015-07-16 06:58:44

Look for “network problems” in bin/reproducible_maintenance.sh in https://anonscm.debian.org/gitweb/?p=qa/jenkins.debian.net.git — it reschedules builds that failed due to network problems.

#13 Updated by intrigeri 2015-08-26 00:42:46

We could try configuring acng on apt.lizard to use httpredir.debian.org, which could help.

#14 Updated by intrigeri 2015-08-26 00:55:38

intrigeri wrote:
> We could try configuring acng on apt.lizard to use httpredir.debian.org, which could help.

Done, let’s see how it goes.

#15 Updated by bertagaz 2015-08-26 04:00:39

intrigeri wrote:
> Look for “network problems” in bin/reproducible_maintenance.sh in https://anonscm.debian.org/gitweb/?p=qa/jenkins.debian.net.git — it reschedules builds that failed due to network problems.

Hmmm well, that’s a serious bit of scripts. They seem to create their own sqlite database to gather data and act on it. We may prefer other options.

#16 Updated by intrigeri 2015-08-26 06:48:17

> Hmmm well, that’s a serious bit of scripts. They seem to create their own sqlite database to gather data and act on it. We may prefer other options.

The idea I meant to point to is: grep failed build logs for 'E: Failed to fetch.*(Connection failed|Size mismatch|Cannot initiate the connection to|Bad Gateway)', and restart those that match.
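
A rough sketch of that log check in Python, using the pattern quoted above. The function name `is_transient_failure` is hypothetical, and how a matching build would then be restarted on Jenkins is left out. Note that this pattern would not match the example from the description, which is a `W:` line with "Hash Sum mismatch", so the list of messages would probably need extending.

```python
import re

# Pattern taken from the comment above; it only covers "E:" lines.
TRANSIENT_APT_ERROR = re.compile(
    r"E: Failed to fetch.*"
    r"(Connection failed|Size mismatch|"
    r"Cannot initiate the connection to|Bad Gateway)"
)


def is_transient_failure(log_text):
    """Return True if any line of the build log matches the pattern."""
    return any(
        TRANSIENT_APT_ERROR.search(line) for line in log_text.splitlines()
    )
```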

#17 Updated by intrigeri 2015-08-28 07:31:40

intrigeri wrote:
> intrigeri wrote:
> > We could try configuring acng on apt.lizard to use httpredir.debian.org, which could help.
>
> Done, let’s see how it goes.

It made things worse, so I reverted it, but I kept the upgraded acng to see whether the upgrade (and not httpredir) is what causes the issues.

#19 Updated by intrigeri 2015-08-31 06:30:20

I should retry httpredir on apt-proxy.lizard: I’m told that some mirrors were broken precisely during the days when I tested it, so it might be that the failures we’ve seen were not caused by httpredir.

#20 Updated by intrigeri 2015-09-22 12:18:48

  • Target version changed from Tails_1.6 to Tails_1.7

#21 Updated by intrigeri 2015-11-01 04:27:47

#22 Updated by intrigeri 2015-11-01 04:28:32

  • Target version changed from Tails_1.7 to Tails_2.3

Feature #5926 will magically solve 99% of this problem, so IMO I should not waste time trying to fix it differently here.

#23 Updated by intrigeri 2016-04-14 21:19:05

  • Target version changed from Tails_2.3 to Tails_2.4

#24 Updated by intrigeri 2016-05-18 18:44:33

  • Target version changed from Tails_2.4 to Tails_2.5

It’s only during the next cycle that we can confirm that the freezable APT repo has indeed improved things in this respect (even if I really can’t see how it could not be the case).

#25 Updated by intrigeri 2016-06-01 12:32:40

  • Status changed from In Progress to Fix committed
  • Assignee deleted (intrigeri)
  • Target version changed from Tails_2.5 to Tails_2.4
  • % Done changed from 20 to 100

I’ve thought about it more, and I can’t see how our freezable APT repo would not solve this.

#26 Updated by anonym 2016-06-08 01:33:52

We have enough of Feature #5926 solved so that the blocker can be removed, and this ticket resolved.

#27 Updated by anonym 2016-06-08 01:34:00

#28 Updated by anonym 2016-06-08 01:34:19

  • Status changed from Fix committed to Resolved