Bug #10215

Suboptimal advance booking of Jenkins slaves for testing ISOs

Added by intrigeri 2015-09-17 10:38:51. Updated 2015-10-05 02:57:33.

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2015-09-17
Due date:
% Done:

100%

Feature Branch:
Type of work:
Research
Blueprint:

Starter:
Affected tool:
Deliverable for:
266

Description

Today we’ve seen a case where an isotesterN was idle while a test_Tails_ISO_isotesterN job was waiting in the queue. Ideally, this should not happen.

If I understood bertagaz’ analysis right: a given run of a given test_Tails_ISO_isotesterN job can be put in the queue because another job of that kind is already running. In such a case, the new run has already been scheduled to run on a specific slave (the one that was just rebooted). And then, when the old run is completed and no concurrency issue prevents the new run from starting, the slave it was scheduled to run on can very well be busy with another job run.

In practice, even if this suboptimal scheduling is not fixed, chances are good that we run the test suite often enough to avoid seeing idle isotesterN slaves, so we should not be wasting resources. We’ll just be running jobs in a suboptimal order. If we are inclined to decide it’s no big deal, we should first evaluate by how many hours this can delay a given test suite run in the worst case.
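
(A rough worst-case estimate, following the scenario above: once the blocking run finishes, the queued run only has to wait for whatever single run currently occupies the slave it was bound to. The extra delay caused by the suboptimal scheduling should therefore be bounded by roughly the duration of one full test suite run, on top of the wait for the blocking run, which exists either way.)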


Subtasks


Related issues

Related to Tails - Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones Resolved 2016-03-31

History

#1 Updated by intrigeri 2015-09-17 10:39:02

  • blocks #8668 added

#2 Updated by intrigeri 2015-09-17 10:39:13

#3 Updated by intrigeri 2015-09-17 13:05:59

If the approach I tried in feature/reboot-after-testing doesn’t work e.g. due to block: true (see Feature #9486#note-39), then an option would be to:

  • get rid of the current wrapper;
  • have the test job trigger a new wrapper job, running on the master, that handles the put offline + reboot + put back online dance, either as we do right now, or with something different (see below).

As a bonus, I suspect that the new 2nd job (the one that triggers the reboot) would actually be green, because it wouldn’t have to turn off the very slave it’s running on :)

The entire “put offline + reboot + put back online” dance could actually be moved from Jenkins and integrated, as a couple of services run by pid 1, into each isotesterN’s shutdown sequence: Jenkins would simply have to trigger the node’s reboot, and everything would follow. One service would put the Jenkins slave offline very early during the reboot sequence. Another service would put it back online soon after the jenkins-slave service has been stopped (if we put a rebooting slave back online while the jenkins-slave service is still running, then some job could be scheduled on the slave during the shutdown sequence), but before networking is shut down (otherwise it can’t talk to the master). Given how complicated ordering feels on Jenkins, I’d rather deal with this kind of stuff using my preferred pid 1. It also avoids having to do anything based on guesses about whether the slave has already rebooted, etc.
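
For illustration, a minimal sketch of what those two services could look like, assuming systemd as pid 1; the unit names and the helper scripts that talk to the master (e.g. wrappers around jenkins-cli’s offline-node and online-node commands) are hypothetical:

    # /etc/systemd/system/jenkins-slave-offline.service (hypothetical unit name)
    # Enabled and "started" at boot; its only real work happens in ExecStop=,
    # which systemd runs early in the shutdown sequence, before
    # jenkins-slave.service is stopped (units stop in reverse start order).
    [Unit]
    Description=Mark this Jenkins slave offline early during shutdown
    After=jenkins-slave.service network.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/bin/true
    # Hypothetical helper that asks the master to mark this node offline,
    # e.g. via "jenkins-cli offline-node".
    ExecStop=/usr/local/sbin/jenkins-node-offline

    [Install]
    WantedBy=multi-user.target

    # /etc/systemd/system/jenkins-slave-online.service (hypothetical unit name)
    # Its ExecStop= runs only after jenkins-slave.service has been stopped
    # (Before= at start time means "stopped later" at shutdown), but before
    # networking goes down (After=network.target), so it can still reach the
    # master and mark the node online again for when it comes back up.
    [Unit]
    Description=Mark this Jenkins slave back online once its agent has stopped
    Before=jenkins-slave.service
    After=network.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/bin/true
    # Hypothetical helper, e.g. "jenkins-cli online-node" against the master.
    ExecStop=/usr/local/sbin/jenkins-node-online

    [Install]
    WantedBy=multi-user.target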

#4 Updated by bertagaz 2015-09-18 01:12:30

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:
> Today we’ve seen a case where an isotesterN was idle while a test_Tails_ISO_isotesterN job was waiting in the queue. Ideally, this should not happen.
>
> If I understood bertagaz’ analysis right: a given run of a given test_Tails_ISO_isotesterN job can be put in the queue because another job of that kind is already running. In such a case, the new run has already been scheduled to run on a specific slave (the one that was just rebooted). And then, when the old run is completed and no concurrency issue prevents the new run from starting, the slave it was scheduled to run on can very well be busy with another job run.
>
> In practice, even if this suboptimal scheduling is not fixed, chances are good that we run the test suite often enough to avoid seeing idle isotesterN slaves, so we should not be wasting resources. We’ll just be running jobs in a suboptimal order. If we are inclined to decide it’s no big deal, we should first evaluate by how many hours this can delay a given test suite run in the worst case.

Yes, that’s the gist of the problem.

Now I’m wondering whether that’s a real one: do we really need to set these test jobs to concurrent = false, which is the root of this? In which case would two jobs for the same branch run concurrently? This would happen if an ISO was built from a first push and a test job ran on it; then, if a second push to that branch happened a bit later, another test job would be queued for the newly built ISO.

To me it’s not really a problem if these two tests run in parallel, and I think we could just as well set them to concurrent = true.
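
For illustration, a minimal jenkins-job-builder sketch of what flipping that switch could look like; the job name, node name and builder are placeholders, and the real definitions most likely differ:

    # Hypothetical jenkins-job-builder definition; only the "concurrent" knob
    # matters here, the rest is placeholder.
    - job:
        name: test_Tails_ISO_isotester2
        node: isotester2
        # Allow a second run (e.g. triggered by a later push to the same
        # branch) to start while a previous run of this job is still going on,
        # instead of sitting in the queue bound to a possibly busy slave.
        concurrent: true
        builders:
          - shell: './run_test_suite'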

> If the approach I tried in feature/reboot-after-testing doesn’t work e.g. due to block: true (see Feature #9486#note-39), then an option would be to:
>
> * get rid of the current wrapper;
> * have the test job trigger a new wrapper job, running on the master, that handles the put offline + reboot + put back online dance, either as we do right now, or with something different (see below).

I’ve tried to implement this in a different branch (feature/reboot-after_post_build_trigger). Not with a wrapper job though, and this path is a bit simpler.

> The entire “put offline + reboot + put back online” dance could actually be moved from Jenkins and integrated, as a couple of services run by pid 1, into each isotesterN’s shutdown sequence: Jenkins would simply have to trigger the node’s reboot, and everything would follow. One service would put the Jenkins slave offline very early during the reboot sequence.

I suspect this might be a bit tricky and raise race conditions: if the node isn’t kept busy by a job, another job could try to run on it while the reboot is starting to happen. So the difficult part would be to be able to kill the slave fast enough for this not to happen, or to have a fake job keep the node busy by being started before any other job in the queue.

> Another service would put it back online soon after the jenkins-slave service has been stopped (if we put a rebooting slave back online while the jenkins-slave service is still running, then some job could be scheduled on the slave during the shutdown sequence), but before networking is shut down (otherwise it can’t talk to the master). Given how complicated ordering feels on Jenkins, I’d rather deal with this kind of stuff using my preferred pid 1. It also avoids having to do anything based on guesses about whether the slave has already rebooted, etc.

Yes, handling this with pid 1 sounds more robust, but what I’m less convinced about is how Jenkins would interact with such a setup. Or maybe the job responsible for starting this reboot would be run on the node itself, and thus hold it?
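
To illustrate that last idea, a rough sketch of what the final build step of such a reboot-triggering job, run on the isotester itself, could look like; the jenkins-cli path, master URL and authentication setup are assumptions:

    #!/bin/sh
    # Hypothetical final build step, run on the isotester itself so the node
    # stays busy until the reboot is actually triggered.
    set -e

    MASTER_URL="https://jenkins.example.org/"   # placeholder
    NODE="${NODE_NAME:?}"                       # exported by Jenkins for every build

    # Mark ourselves offline first, so that no other job gets scheduled on this
    # node between the end of this build and the moment the slave agent goes away.
    java -jar /usr/share/jenkins/jenkins-cli.jar -s "$MASTER_URL" \
        offline-node "$NODE" -m "rebooting after a test suite run"

    # Trigger the reboot; the pid 1 services sketched in note 3 would then take
    # care of putting the node back online during the shutdown sequence.
    sudo reboot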

#5 Updated by intrigeri 2015-09-22 01:47:17

  • Assignee changed from intrigeri to bertagaz
  • QA Check set to Dev Needed

Sorry I missed this ticket since it had neither “Ready for QA” nor “Info Needed”, combined with a large backlog and suboptimal work scheduling on my side.

> Yes, that’s the gist of the problem.

Glad I got it right!

> Now I’m wondering whether that’s a real one: do we really need to set these test jobs to concurrent = false, which is the root of this?

I don’t know. At first glance I would say “I don’t think so” but we’ve always been using concurrent = false so it might be that some piece of our current design (or of the tentative ISO testing one) silently relies on it.

> To me it’s not really a problem if these two tests run in parallel, and I think we could just as well set them to concurrent = true.

OK, then yay, let’s try this :)

#6 Updated by intrigeri 2015-09-22 02:03:54

  • Assignee changed from bertagaz to intrigeri
  • Target version changed from Tails_1.6 to Tails_1.7
  • QA Check deleted (Dev Needed)

#7 Updated by intrigeri 2015-09-22 02:04:43

  • Assignee changed from intrigeri to bertagaz

#8 Updated by bertagaz 2015-09-28 03:35:38

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

intrigeri wrote:
>
> > Now I’m wondering whether that’s a real one: do we really need to set these test jobs to concurrent = false, which is the root of this?
>
> I don’t know. At first glance I would say “I don’t think so” but we’ve always been using concurrent = false so it might be that some piece of our current design (or of the tentative ISO testing one) silently relies on it.

Well, since you can specify per job whether it can be built concurrently, it shouldn’t affect the rest of our design.

> > To me it’s not really a problem if these two tests run in parallel, and I think we could just as well set them to concurrent = true.
>
> OK, then yay, let’s try this :)

Done: the jobs assigned to isotester{2,3,4} (that is, wrap_test_Tails_isotester{2,3,4} and their children) are now set up to run concurrently. I can’t seem to reproduce the bug we’ve described here with this new option.

So it’s not the nicest possible implementation in terms of job organization and dependencies, but it seems to do the job.

Assigning to you so that you can go to the web interface and click around to stress-test that a bit.

Have fun! :)

#9 Updated by sajolida 2015-10-01 07:20:18

Note that Tails 1.7 is past the milestone for SponsorS (October 15).

#10 Updated by sajolida 2015-10-01 07:21:05

Sorry, actually this was due on July 15 :)

#11 Updated by intrigeri 2015-10-05 02:42:09

sajolida wrote:
> Sorry, actually this was due on July 15 :)

FTR: this is incorrect. We identified this potential problem in September, and as the ticket description explains, it’s not even clear that this is a problem that needs solving in practice at all.

#12 Updated by intrigeri 2015-10-05 02:57:33

  • Status changed from Confirmed to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 0 to 100
  • QA Check changed from Info Needed to Pass

I could not reproduce the specific problem this ticket is about => closing, and I will raise any remaining concerns on the parent ticket and/or Feature #10117. Yay!

#13 Updated by intrigeri 2016-08-01 16:41:48

  • related to Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones added