Bug #10229

ISO testing jobs seem to lack a timeout

Added by intrigeri 2015-09-22 01:50:15. Updated 2015-10-15 11:17:47.

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2015-09-22
Due date:
% Done:

100%

Feature Branch:
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:
267

Description

https://jenkins.tails.boum.org/job/test_Tails_ISO_isotester4/27/ and https://jenkins.tails.boum.org/job/test_Tails_ISO_isotester3/23/ have been running for 4 days now. ISO testing jobs should not keep Jenkins slaves busy forever, and cleaning that up should not require manual intervention.


Subtasks


History

#1 Updated by intrigeri 2015-09-22 01:50:27

#2 Updated by intrigeri 2015-09-22 01:51:00

  • blocks #8668 added

#3 Updated by bertagaz 2015-09-25 03:26:09

A timeout is a way to work around the problem, so it’s certainly the way to go at first.

But it would be far better if this didn’t happen. It seems that the isotesters get stuck at some point, and a fair amount of testing doesn’t take place because of that.

I’m very much inclined to think that this is a manifestation of Bug #9157, with the Wheezy kernel not noticing that the CPUs are stuck the way the >= Jessie one does. I tried a few things but so far haven’t found a way to get around this.

There’s a … Jenkins plugin to configure a timeout for jobs; I’ll use that in the meantime.

#4 Updated by intrigeri 2015-09-25 08:24:03

> A timeout is a way to work around the problem, so it’s certainly the way to go at first. But it would be far better if this didn’t happen.

Yes, a timeout is a way to limit the impact of such problems, so that we don’t suffer from them too much while the people responsible for fixing them do their part of the job.

> I’m very much inclined to think that this is a manifestation of Bug #9157, with the Wheezy kernel not noticing that the CPUs are stuck the way the >= Jessie one does.

Bug #9157 is about Jessie’s kernel crashing while running the test suite, while with Wheezy’s kernel we’re able to run it just fine most of the time. In other words, it is a regression in Jessie’s kernel in this respect. That regression might be caused by an actual improvement (noticing that CPUs are stuck) as you’re suggesting, but I’d rather not mix these two aspects on the same ticket before we’re sure they indeed have the same root cause. So please file a separate ticket from Bug #9157 to track the problems with the Wheezy kernel, and use Bug #9157 to keep track of what’s blocking us from running Jessie’s one.

#5 Updated by sajolida 2015-10-01 07:26:51

  • Priority changed from Normal to Elevated

Note that this is due on October 15, which is actually before Tails 1.7. Raising priority accordingly.

#6 Updated by bertagaz 2015-10-07 06:05:19

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Ready for QA

intrigeri wrote:
> Yes, timeout is a way to limit the impact of such problems so that we don’t suffer from it too much while the people responsible to fix it do their part of the job.

So I’ve installed the BuildTimeout Jenkins plugin, and set it up on the test_Tails_ISO_isotester1_metrics job as an example. It is set up to abort the job if no new entry appears in the logs for 35 minutes, a value chosen with anonym because our longest waiting time with Sikuli is 30 minutes.
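
Roughly, the relevant bit of the job definition looks like the sketch below (simplified from memory, in JJB YAML: the shell command is just a placeholder, and the exact parameter names and units may differ between build-timeout plugin and JJB versions):

    # Illustrative sketch only: abort the build when the console log shows
    # no new output for 35 minutes. Parameter names/units are assumptions.
    - job:
        name: test_Tails_ISO_isotester1_metrics
        wrappers:
          - timeout:
              type: no-activity   # trigger on log inactivity, not total runtime
              timeout: 35         # minutes without new log output
              abort: true         # abort the build instead of merely failing it
        builders:
          - shell: './run-test-suite'   # placeholder for the actual test command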

> > I’m very much inclined to think that this is a manifestation of Bug #9157, with the Wheezy kernel not noticing that the CPUs are stuck the way the >= Jessie one does.
>
> Bug #9157 is about Jessie’s kernel crashing while running the test suite, while with Wheezy’s kernel we’re able to run it just fine most of the time.

Well, not that fine: the isotesters get stuck quite often.

> In other words, it is a regression in Jessie’s kernel in this respect. That regression might be caused by an actual improvement (noticing that CPUs are stuck) as you’re suggesting, but I’d rather not mix these two aspects on the same ticket before we’re sure they indeed have the same root cause. So please file a separate ticket from Bug #9157 to track the problems with the Wheezy kernel, and use Bug #9157 to keep track of what’s blocking us from running Jessie’s one.

Ok, I’m a bit afraid this will cause Redmine overhead, because to me it’s clearly the same problem handled differently depending on the kernel that is used. Before opening a ticket, I think I’d like to try upgrading the lizard QEMU version to the one in backports, which is closer to the one used on the isotesters. I suspect it may help. If it doesn’t, I’ll open a new bug. Sounds fair?

#7 Updated by intrigeri 2015-10-13 23:30:34

  • Status changed from Confirmed to In Progress
  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 0 to 80
  • QA Check changed from Ready for QA to Dev Needed

> So I’ve installed the BuildTimeout Jenkins plugin, and set it up on the test_Tails_ISO_isotester1_metrics job as an example. It is set up to abort the job if no new entry appears in the logs for 35 minutes, a value chosen with anonym because our longest waiting time with Sikuli is 30 minutes.

Sounds good, yay!

My only concern with the “Try the build-timeout plugin version1.14.1.” commit:

  • What’s preventing us from using the latest version of this plugin?
  • How will we know, when doing plugin upgrades as part of sysadmin shifts, that we must not upgrade this plugin?

> Ok, I’m a bit afraid this will cause Redmine overhead, because to me it’s clearly the same problem handled differently depending on the kernel that is used. Before opening a ticket, I think I’d like to try upgrading the lizard QEMU version to the one in backports, which is closer to the one used on the isotesters. I suspect it may help. If it doesn’t, I’ll open a new bug. Sounds fair?

Yes, absolutely.

#8 Updated by bertagaz 2015-10-14 03:27:24

intrigeri wrote:
> My only concern with the “Try the build-timeout plugin version1.14.1.” commit:
>
> * What’s preventing us from using the latest version of this plugin?

I have to admit this is a week old and I don’t clearly remember why. I remember having a hard time finding a version compatible with our JJB version, and this one worked out after trying different versions of the plugin. I’ll give the latest version a try to check whether it in fact works.

> * How will we know, when doing plugin upgrades as part of sysadmin shifts, that we must not upgrade this plugin?

That’s a problem we face for other plugins too, as long as we keep this old Jenkins version. So I propose not to bother with this in this ticket, but rather open a new one, assigned to me since it’s my sysadmin shift, to document which plugins we can’t upgrade to their latest version and why. This should probably be done in the tails_sysadmins git repo.

> > Ok, I’m a bit afraid this will cause redmine overhead, because to me it’s clearly the same problem handled differently depending on the kernel that is used. Before opening a ticket, I think I’d like to try to upgrade the lizard qemu version to the one that are in backports, more close the the one used in the isotesters. I suspect it may help. If it doesn’t, I’ll open a new bug. Sounds fair?
>
> Yes, absolutely.

Awesome.

#9 Updated by bertagaz 2015-10-14 03:54:09

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

bertagaz wrote:
> intrigeri wrote:
> > My only concern with the “Try the build-timeout plugin version1.14.1.” commit:
> >
> > * What’s preventing us from using the latest version of this plugin?

Ok, now I remember: I had that working not because of the plugin version, but because the JJB upgrade command was not retrieving the plugin information except with commits @@ and @@ of the puppet-tails repo.

So I’ve upgraded the plugin to 1.15. The build timeout configuration of the test jobs in Jenkins didn’t change and is still OK, so I started build 2535 of the experimental branch to do another check.

For real testing, we should probably wait until a timeout actually happens, but it might take a while, and given the configuration hasn’t changed in Jenkins, we can be pretty confident it will work as it did with 1.14.1. That’s why I’m being a bit bold and putting this back in Ready for QA, but if you agree with these checks and with not waiting for the next timeout, maybe we can close this ticket once the other one about frozen plugin versions is created.

#10 Updated by intrigeri 2015-10-14 05:24:42

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> I have to admit this is a week old and I don’t clearly remember why.

Oops. By the way, I meant to ask you to try and write commit messages that tell something about why a change was made: in the puppet-tails repo, I’ve seen a bit too many commits recently whose message paraphrases the diff, which isn’t very helpful a week later :)

>> * How will we know, when doing plugin upgrades as part of sysadmin shifts, that we must not upgrade this plugin?

> That’s a problem we face for other plugins too, as long as we keep this old Jenkins version. So I propose not to bother with this in this ticket, but rather open a new one, assigned to me since it’s my sysadmin shift, to document which plugins we can’t upgrade to their latest version and why.

I can’t find this ticket so reassigning to you.

> This should probably be done in the tails_sysadmins git repo.

Proposal: add this info as comments in the tails::jenkins::master class definition, that is, in the same place as what we want to document, i.e. what version of which plugin we want.

#11 Updated by intrigeri 2015-10-14 05:27:39

> For real testing, we should probably wait until a timeout actually happens, but it might take a while

I don’t think we need to wait for this to happen, since we can easily trigger it with a minimal test case job that essentially does echo before ; sleep 40m; echo after, guarded by a timeout. No?
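
Something like the following, roughly (a sketch in JJB YAML; the job name is made up, and the timeout parameters are the same assumptions as in the example above):

    # Hypothetical throwaway job: the 40-minute silent sleep should trip
    # the 35-minute no-activity timeout and get the build aborted.
    - job:
        name: dummy_timeout_check        # made-up name, not an actual job
        wrappers:
          - timeout:
              type: no-activity
              timeout: 35                # minutes of log silence before aborting
              abort: true
        builders:
          - shell: 'echo before ; sleep 40m; echo after'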

#12 Updated by bertagaz 2015-10-15 00:28:24

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:
> > For real testing, we should probably wait until a timeout actually happens, but it might take a while
>
> I don’t think we need to wait for this to happen, since we can easily trigger it with a minimal test case job that essentially does echo before ; sleep 40m; echo after, guarded by a timeout. No?

Well, even better: we won’t need to push such a dummy test commit, since a timeout happened in the meantime in test 17 of the devel branch.

Seems to work fine with 1.15.

#13 Updated by bertagaz 2015-10-15 01:17:46

intrigeri wrote:
> I can’t find this ticket so reassigning to you.

Yes, I proposed to create one, and so Bug #10373 was born (today :D).

#14 Updated by intrigeri 2015-10-15 02:37:49

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100
  • QA Check changed from Ready for QA to Pass

> Well, even better: we won’t need to push such a dummy test commit, since a timeout happened in the meantime in test 17 of the devel branch.

> Seems to work fine with 1.15.

Cool, closing then.

#15 Updated by intrigeri 2015-10-15 02:40:33

>> I can’t find this ticket so reassigning to you.
> Yes, I proposed to create one, and so Bug #10373 was born (today :D).

I suggest you open the ticket straight away next time we’re in a similar situation, in order to avoid an unnecessary roundtrip (and to push me a bit more out of the loop :)

#16 Updated by intrigeri 2015-10-15 11:17:47

  • Assignee deleted (intrigeri)