Feature #9486

Support running multiple instances of the test suite in parallel

Added by intrigeri 2015-05-28 15:06:41. Updated 2015-10-13 05:04:18.

Status: Resolved
Priority: Elevated
Assignee:
Category: Continuous Integration
Target version:
Start date: 2015-06-25
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for: 266

Description

Make sure our infrastructure can cope with running more tests, and with running multiple instances of the test suite in parallel, which will follow from automation.

This may imply:

  • to set up more test VMs (as a follow-up on Feature #6564);
  • to adjust the Puppet module (already created via Feature #5683) and manifests to support multiple parallel test runners;
  • to design and implement a way to fire up a clean test runner VM before each test suite run.

Subtasks

Feature #9646: Upgrade jenkins-job-builder to 1.3.0+ Resolved 100%
Bug #10215: Suboptimal advance booking of Jenkins slaves for testing ISOs Resolved 100%


Related issues

Related to Tails - Feature #10117: Design how to run our test suite in Jenkins Resolved 2015-08-28
Related to Tails - Bug #10601: isotesterN:s are sometimes put offline and never back online Needs Validation 2015-11-23
Related to Tails - Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones Resolved 2016-03-31
Blocked by Tails - Bug #9645: Create at least a second VM for testing ISO images Resolved 2015-06-25 2015-07-15

History

#1 Updated by intrigeri 2015-05-28 15:07:07

  • blocks #8668 added

#3 Updated by intrigeri 2015-08-02 07:56:44

  • Due date set to 2015-07-15

#4 Updated by intrigeri 2015-08-02 08:01:15

  • blocked by Bug #9645: Create at least a second VM for testing ISO images added

#5 Updated by intrigeri 2015-08-02 12:35:29

  • Priority changed from Normal to High

#6 Updated by bertagaz 2015-08-05 05:33:50

  • Target version changed from Tails_1.5 to Tails_1.6

Postponing

#7 Updated by bertagaz 2015-08-21 05:18:56

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

Now that the 4 isotesters are up and running, I’m running the test suite on them. So far it seems to work well. I’ll run more in the coming days, but to me this seems to show that we can close this ticket.

#8 Updated by intrigeri 2015-08-22 05:25:33

> * to design and implement a way to fire up a clean test runner VM before each test suite run.

Not sure if we can close this ticket before making sure that we actually don’t need the aforementioned part.
Alternatively, it could be moved to another ticket that would be part of the same set of deliverables.

#9 Updated by bertagaz 2015-08-22 06:34:09

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

intrigeri wrote:
> > * to design and implement a way to fire up a clean test runner VM before each test suite run.
>
> Not sure if we can close this ticket before making sure that we actually don’t need the aforementioned part.
> Alternatively, it could be moved to another ticket that would be part of the same set of deliverables.

I don’t think it is of any interest to move that part to another ticket if it stays in the same deliverable.

I’m not very sure how to test this though.

Still, we’ll need to write a wrapper around the run_test_suite script to pass the correct arguments, and given how the filesystems on the isotesters are isolated, IMO if this wrapper cleans up /srv/git and /tmp/TailsToaster, I don’t see why we’d need to fire up fresh clean isotesters. Maybe on the memory usage front, but I’m not sure it will really be a problem. This would make us lose all the files cached in memory.

The only blocker I can think of is the local.yml config files. If we clean up /srv/git, they’ll vanish and we’ll need to put them back. I think in the end having a --local_config or similar option to run_test_suite would work around this issue: then we would be able to manage them with Puppet in /etc/TailsToaster or something.
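To make this concrete, here’s a minimal sketch of such a wrapper, assuming the cleanup approach described above; the --local_config option is hypothetical at this point, and the paths are only illustrative:

#!/bin/sh
# Hedged sketch: reset the state a previous run may have left behind.
set -e
rm -rf /srv/git /tmp/TailsToaster
# ... a fresh Git clone into /srv/git would happen here ...
# --local_config is the hypothetical option suggested above
/srv/git/run_test_suite --local_config /etc/TailsToaster/local.yml "$@"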

#10 Updated by intrigeri 2015-08-25 04:25:54

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

> I don’t think it is of any interest to move that part to another ticket if it stays in the same deliverable.

As you wish. Let’s keep tracking this here, then.

> I’m not very sure how to test this though.

The thing is, I’m not sure we’ll know if we need it before trying to run production-like workloads without such cleanup.

> Still, we’ll need to write a wrapper around the run_test_suite script to pass the correct arguments, and given how the filesystems on the isotesters are isolated, IMO if this wrapper cleans up /srv/git and /tmp/TailsToaster, I don’t see why we’d need to fire up fresh clean isotesters.

There have been instances where the test suite stalled, and others where it left processes running behind. Not sure if all of those are fixed already, and not sure no such regression will appear in the future. It would be super painful if we had to babysit the isotester VMs by hand. We’ll see, I guess: just schedule time to deal with this after the pre-production deployment, and before the deadline for a working production deployment.

> The only blocker I can think of is the local.yml config files. If we clean up /srv/git, they’ll vanish and we’ll need to put them back. I think in the end having a --local_config or similar option to run_test_suite would work around this issue: then we would be able to manage them with Puppet in /etc/TailsToaster or something.

Perhaps. Not sure what exact semantics would be appropriate: we already have 4 different ways of passing configuration to the test suite. OTOH deploying such config could very well be done as a step run after the Git clone, and before running the test suite, which would avoid the need to modify the test suite code.

#11 Updated by bertagaz 2015-08-26 03:45:57

intrigeri wrote:
>
> > I’m not very sure how to test this though.
>
> The thing is, I’m not sure we’ll know if we need it before trying to run production-like workloads without such cleanup.

That’s also why I proposed to -ci@ to set up a simple job that would run the test suite all day long on the 4 isotesters. That would help us gather such data.

> There have been instances where the test suite stalled, and others where it left processes running behind. Not sure if all of those are fixed already, and not sure no such regression will appear in the future. It would be super painful if we had to babysit the isotester VMs by hand. We’ll see, I guess: just schedule time to deal with this after the pre-production deployment, and before the deadline for a working production deployment.

I thought about something: we could use the Jenkins PostBuildScript plugin to issue a shutdown -r command at the end of the job. I found indications that people are using it like this already.

It’s supported by JJB.
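For illustration, this is roughly what the JJB side could look like; the postbuildscript schema varies across JJB and plugin versions, so the option names below are a best guess:

publishers:
  - postbuildscript:
      builders:
        - shell: |
            # Reboot the slave once the job is over; without "now",
            # shutdown -r schedules the reboot one minute later.
            /sbin/shutdown -r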

> Perhaps. Not sure what exact semantics would be appropriate: we already have 4 different ways of passing configuration to the test suite. OTOH deploying such config could very well be done as a step run after the Git clone, and before running the test suite, which would avoid the need to modify the test suite code.

Ok, the deployment after the git clone seems like a neat idea, I’ll do it this way then.

#12 Updated by intrigeri 2015-08-26 06:51:37

>> > I’m not very sure how to test this though.
>>
>> The thing is, I’m not sure we’ll know if we need it before trying to run production-like workloads without such cleanup.

> That’s also why I proposed to -ci@ to set up a simple job that would run the test suite all day long on the 4 isotesters. That would help us gather such data.

Yay.

> I thought about something: we could use the Jenkins PostBuildScript plugin to issue a shutdown -r command at the end of the job. I found indications that people are using it like this already.

> It’s supported by JJB.

I suspect it’ll make the job be marked as failed, since the master will lose connection to the slave. Hopefully I’m wrong.

Please see the automated_builds_and_tests/jenkins blueprint for existing research, and add this other option to it :)

#13 Updated by bertagaz 2015-08-26 07:40:11

intrigeri wrote:
>
> > I thought about something: we could use the Jenkins PostBuildScript plugin to issue a shutdown -r command at the end of the job. I found indications that people are using it like this already.
>
> > It’s supported by JJB.
>
> I suspect it’ll make the job be marked as failed, since the master will lose connection to the slave. Hopefully I’m wrong.

Might be, but since it’s not a build step but a post-build one, Jenkins may not care. I’ve also read about people putting the slave offline first, with a curl command to the master, then restarting the VM, and putting it back online afterward.
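A rough sketch of that dance; the node name and message are illustrative, and the endpoint is the one used elsewhere in this ticket (toggleOffline flips the node’s temporarily-offline state, so the same request is used both ways):

# take the node offline on the master so no new job gets scheduled on it
wget -O /dev/null --post-data "offlineMessage=Rebooting" \
  "http://jenkins.lizard:8080/computer/${NODE_NAME}/toggleOffline"
# reboot the slave
/sbin/shutdown -r now
# once it's back up, hit the same URL again to put the node back online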

I think I’ll try with a simple fake job later.

> Please see the automated_builds_and_tests/jenkins blueprint for existing research, and add this other option to it :)

Yes, that’s my intention, but I’m a bit too overloaded right now to update it with all my notes/ideas. Later too. :)

#14 Updated by intrigeri 2015-08-26 08:26:51

>> Please see the automated_builds_and_tests/jenkins blueprint for existing research, and add this other option to it :)

> Yes, that’s my intention, but I’m a bit too overloaded right now to update it with all my notes/ideas. Later too. :)

So I did the copy’n’paste myself.

#15 Updated by bertagaz 2015-08-30 05:46:33

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:
>
> > I thought about something: we could use the Jenkins PostBuildScript plugin to issue a shutdown -r command at the end of the job. I found indications that people are using it like this already.
>
> > It’s supported by JJB.
>
> I suspect it’ll make the job be marked as failed, since the master will lose connection to the slave. Hopefully I’m wrong.

Not that much in the end. I created a new job (test_postbuildscript_reboot.yaml) to test this trick, and well, running shutdown -r now marks the build as failed. But running shutdown -r does work, as the reboot is then scheduled one minute later.

Does that sound like a solution to you? If one minute seems too long to wait, we can probably write a wrapper script started by this postbuildscript publisher as a background process, that would sleep N seconds and issue a shutdown -r now afterward.

To be honest, I think one minute should be OK: there is little chance that a job gets rescheduled on the concerned isotester in the meantime.

#16 Updated by intrigeri 2015-08-30 09:44:22

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> Not that much in the end. I created a new job (test_postbuildscript_reboot.yaml) to test this trick, and well, running shutdown -r now marks the build as failed.

I’m not surprised, it’s made clear by various people in links I added to the relevant section of the blueprint a while ago. But certainly it was worth testing :)

> But running shutdown -r does work, as the reboot is then scheduled one minute later.
> Does that sound like a solution to you?

See below.

> If one minute seems too long to wait, we can probably write a wrapper script started by this postbuildscript publisher as a background process, that would sleep N seconds and issue a shutdown -r now afterward.

I think that a one minute delay is perfectly fine, compared to our test suite’s run time.

> To be honest, I think one minute should be OK: there is little chance that a job gets rescheduled on the concerned isotester in the meantime.

Whenever at least one job is waiting in the queue, Jenkins immediately starts it as soon as a suitable slave is available. This happens pretty often for our builders, e.g.:

  • with the daily builds
  • when merging a topic branch into stable -> devel -> {experimental, feature/jessie}

So I wonder what makes you think that it will be fundamentally different for our ISO tester slaves, to the point that we could ignore this potential race condition. I doubt that our testers will be idling too often. And if they are, well, it means we should test more stuff / more often! :)

Am I missing something that backs your “there is little chance […]” argument, or?

Anyway, perhaps we’d better spend time finding a robust solution than arguing over whether this racy one is good enough.

The best solution (short of moving the Jenkins slave off the isotesterN VMs) is probably to chain two jobs, one that runs the test suite, and the next one that reboots the slave (a rough sketch follows below), with some tricks to ensure that:

  1. the reboot job is always run on the same slave as the corresponding test suite job was run (I’ve read that the “Parameterized Trigger plugin allows you to set a downstream job to build on the same node as its upstream job”);
  2. the reboot jobs always have higher priority than testing jobs, so they bypass the queue and get executed as soon as a test suite job is completed.
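In JJB terms, the chaining could look roughly like this; job names are hypothetical, same-node comes from the Parameterized Trigger plugin, and priority-sorter from the PrioritySorter plugin:

- job:
    name: test_suite_example          # hypothetical test suite job
    builders:
      - trigger-builds:
          - project: reboot_slave
            same-node: true           # point 1: reboot the slave we just used

- job:
    name: reboot_slave                # hypothetical reboot job
    properties:
      - priority-sorter:
          priority: 150               # point 2: a priority that outranks test jobs
                                      # (value semantics depend on the plugin version)
    builders:
      - shell: '/sbin/shutdown -r now'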

#17 Updated by bertagaz 2015-08-31 09:20:21

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:
>
> Whenever at least one job is waiting in the queue, Jenkins immediately starts it as soon as a suitable slave is available. This happens pretty often for our builders, e.g.:
>
> * with the daily builds
> * when merging a topic branch into stable -> devel -> {experimental, feature/jessie}
>
> So I wonder what makes you think that it will be fundamentally different for our ISO tester slaves, to the point that we could ignore this potential race condition. I doubt that our testers will be idling too often. And if they are, well, it means we should test more stuff / more often! :)
>
> Am I missing something that backs your “there are little chances […]” argument, or?

Yes, I know that; the thing is to define “immediately”. I’ve always seen some lag between the moment a build ends and the moment Jenkins allocates a new job on this slave. I guess it’s just the time it takes for Jenkins to compute the fact that the builder is available and assign the next build in the queue. But I agree we can’t really count on that, and one minute is for sure too long and racy.

> Anyway, perhaps we’d better spend time finding a robust solution than arguing over whether this racy one is good enough.
>
>
> The best solution (short of moving the Jenkins slave off the isotesterN VMs) is probably to chain two jobs, one that runs the test suite, and the next one that reboots the slave, with some tricks to ensure that:
>
> 1. the reboot job is always run on the same slave as the corresponding test suite job was run (I’ve read that the “Parameterized Trigger plugin allows you to set a downstream job to build on the same node as its upstream job”);
> 2. the reboot jobs always have higher priority than testing jobs, so they bypass the queue and get executed as soon as a test suite job is completed.

Hmmm, the downside is that we’ll have a lot of these failing “builds” polluting the Jenkins views.

It’s also a bit hackish, and a bit more complex to deploy (I just can’t imagine the generation script).

I might have found an alternative: instead of issuing a shutdown -r, I’ve tried with a simple script that does:

#!/bin/sh
sleep 5
service jenkins-slave stop # ensure no jobs get assigned
/sbin/shutdown -r now

This script is called in a postbuildscript step, and started in the background (have a look at the new job definition in jenkins-jobs.git). I think that is what does the trick: putting the process in the background so that Jenkins gets an exit 0 and can consider the build successful, rather than complaining because the node goes offline immediately. I think we could even lower the sleep.

I highly doubt that Jenkins allocates the new job on the isotester within 5 or 6 seconds. So far, stress testing seems to indicate it’s enough.

That’s also a bit hackish, I admit, but it does seem to work and is far simpler.

What’s your opinion?

#18 Updated by intrigeri 2015-09-01 04:28:08

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

> But I agree we can’t really count on that, and one minute is for sure too long and racy.

Glad we agree on that :)

>> The best solution (short of moving the Jenkins slave off the isotesterN VMs) is probably to chain two jobs, one that runs the test suite, and the next one that reboots the slave, with some tricks to ensure that:

> Hmmm, the downside is that we’ll have a lot of these failing “builds” polluting the Jenkins views.

Right, that’s ugly. There might be a way to customize the Jenkins views to hide them by default, but let’s not spend time on that if we can find a better solution.

> It’s also a bit hackish, and a bit more complex to deploy (I just can’t imagine the generation script).

We’ll have jobs trigger others anyway, so adding two instead of one might not be that complicated. Anyway, same as above, let’s try to find something better first.

> I might have found an alternative: instead of issuing a shutdown -r, I’ve tried with a simple script that does:

>

> #!/bin/sh
> sleep 5
> service jenkins-slave stop # ensure no jobs get assigned
> /sbin/shutdown -r now
> 

> This script is called in a postbuildscript step, and started in the background (have a look at the new job definition in jenkins-jobs.git). I think that is what does the trick: putting the process in the background so that Jenkins gets an exit 0 and can consider the build successful, rather than complaining because the node goes offline immediately. I think we could even lower the sleep.

Got it. So that’s essentially your previous “we can probably write a wrapper script started by this postbuildscript publisher as a background process” idea, but this time meant to address the raciness instead of the too-long waiting time.

> I highly doubt that Jenkins allocates the new job on the isotester within 5 or 6 seconds. So far, stress testing seems to indicate it’s enough.

I did some stress-testing as well. I’ve duplicated the test job to make it easier and more realistic. Sadly, my test results confirm neither your own ones nor your doubts wrt. what can happen during these 5-6 seconds.

I’ve seen it fail in two ways:

That one: https://jenkins.tails.boum.org/job/test_postbuildscript_reboot/30/console … and then, no reboot happened, and the slave is left offline so no new job gets scheduled on it.

What’s more concerning is that it seems to be super-racy. These three jobs have run within a few seconds, meaning that the last 2 have been started on the slave during the sleep 5 (and more precisely, during the exact same second as the one at which the previous job ended):

https://jenkins.tails.boum.org/job/test_postbuildscript_reboot_2/3/console
https://jenkins.tails.boum.org/job/test_postbuildscript_reboot_3/3/console
https://jenkins.tails.boum.org/job/test_postbuildscript_reboot/34/console

So, sadly, it seems that this solution (as currently implemented at least) doesn’t cut it.

What we would need is to be able to put the Jenkins slave into offline mode as soon as it has communicated test results (that might include artifacts, I dunno how these things are ordered wrt. archiving). Perhaps there’s some flag file we can check for, but even that will be racy, and given how fast Jenkins seems to be when it comes to scheduling a queued job on a slave that becomes idle, I’m not very hopeful about that idea.

#19 Updated by bertagaz 2015-09-01 06:07:49

intrigeri wrote:
>
> I did some stress-testing as well. I’ve duplicated the test job to make it easier and more realistic. Sadly, my test results confirm neither your own ones nor your doubts wrt. what can happen during these 5-6 seconds.

Hmm, good catch, that’s a way more realistic way to test that than I did.

> I’ve seen it fail in two ways:
>
> That one: https://jenkins.tails.boum.org/job/test_postbuildscript_reboot/30/console … and then, no reboot happened, and the slave is left offline so no new job gets scheduled on it.

Yeah, this one I think died because the filesystem was full, with the 100MB file archiving step. I had to clean up /var/lib/jenkins/.

> What’s more concerning is that it seems to be super-racy. These three jobs have run within a few seconds, meaning that the last 2 have been started on the slave during the sleep 5 (and more precisely, during the exact same second as the one at which the previous job ended):
>
> https://jenkins.tails.boum.org/job/test_postbuildscript_reboot_2/3/console
> https://jenkins.tails.boum.org/job/test_postbuildscript_reboot_3/3/console
> https://jenkins.tails.boum.org/job/test_postbuildscript_reboot/34/console
>
> So, sadly, it seems that this solution (as currently implemented at least) doesn’t cut it.

I also did some retries lowering the sleep time (down to 0.1 even), using kill on the jenkins-slave process rather than a longer full service stop, or using the quiet-period job option, but none of that worked.

> What we would need is to be able to put the Jenkins slave into offline mode as soon as it has communicated test results (that might include artifacts, I dunno how these things are ordered wrt. archiving).

It doesn’t send the artifacts at that moment, they’re sent earlier. But yes, it seems that Jenkins allocates a pending job in the same communication where a slave reports the previous build status. So that’s pretty fast (faster than I thought; I think the delay I saw and talked about was more of a web interface rendering delay in the end), and there probably isn’t any way to be faster at shutting down the VM.

> Perhaps there’s some flag file we can check for, but even that will be racy, and given how fast Jenkins seems to be when it comes to scheduling a queued job on a slave that becomes idle, I’m not very hopeful about that idea.

Yeah, it seems that we won’t have much choice and will have to implement the downstream reboot job option in the end. I don’t think the flag file would work (unless I don’t get what you mean): by the time we’d see it, the next build would already have started.

Is that enough to close this ticket? I don’t think the downstream job reboot needs testing right now. After all it’s a pretty simple job to write.

#20 Updated by intrigeri 2015-09-01 08:54:10

> Is that enough to close this ticket? I don’t think the downstream job reboot needs testing right now.

Reminder: this is part of the deliverable this ticket is about: “if needed, design and implement a way to fire up a clean test runner virtual machine before each test suite run”. As discussed above, we already know that in some situations a failed test suite run doesn’t properly clean up after itself, so it seems that this is needed. I suggest creating a subtask to track this, so that it’s clear that everything else on this deliverable has been done already.

> After all it’s a pretty simple job to write.

I’m glad you now find it simple :)

#21 Updated by bertagaz 2015-09-02 05:38:43

intrigeri wrote:
>
> > After all it’s a pretty simple job to write.
>
> I’m glad you now find it simple :)

Well it is. What is less simple is that our current version of JJB doesn’t support the ParameterizedTrigger plugin. Surprise! :D

#22 Updated by bertagaz 2015-09-02 05:38:56

  • blocked by Feature #9646: Upgrade jenkins-job-builder to 1.3.0+ added

#23 Updated by bertagaz 2015-09-02 07:36:20

  • blocks deleted (Feature #9646: Upgrade jenkins-job-builder to 1.3.0+)

#24 Updated by bertagaz 2015-09-02 07:37:43

bertagaz wrote:
>
> Well it is. What is less simple is that our current version of JJB doesn’t support the ParameterizedTrigger plugin. Surprise! :D

I’m getting too tired, I think. In fact it does support it.

#25 Updated by bertagaz 2015-09-02 08:13:33

I gave the downstream reboot job a try, but so far the conclusion is not so good. Will try again later; getting a bit bored by it.

#26 Updated by bertagaz 2015-09-04 14:04:12

  • % Done changed from 20 to 50

Ok, good news: I think I’ve found a way to reboot the slave VMs. I’ve documented it there.

Bad news is that JJB support of the PrioritySorter plugin is old and outdated. Even the latest version is not compatible anymore. I’ll push a fix upstream, but it won’t be in JJB before 1.4. :/

Other bad news is that our current JJB version doesn’t support the raw config option. So we can’t use that trick for now, and we need to upgrade to 1.3 as soon as it has left the NEW queue in Debian.

#27 Updated by bertagaz 2015-09-05 02:24:11

bertagaz wrote:
> Bad news is that JJB support of the PrioritySorter plugin is old and outdated. Even the latest version is not compatible anymore. I’ll push a fix upstream, but it won’t be in JJB before 1.4. :/

Well, after some research, it seems that if we use version 1.3 of the PrioritySorter plugin, it works well. It’s the version used by jenkins.d.n, and the one supported by JJB. Later versions of it are big rewrites no one seems to care about using. This 1.3 version does the job.

#28 Updated by bertagaz 2015-09-15 01:21:25

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

bertagaz wrote:
> Ok, good news: I think I’ve found a way to reboot the slave VMs. I’ve documented it there.

This has been tested for a week or so and seems to work quite well. I think the priority trick is effective in preventing a new wrapper job from stealing the isotester from an ongoing real test job after it has been rebooted.

> Well, after some research, it seems that if we use version 1.3 of the PrioritySorter plugin, it works well. It’s the version used by jenkins.d.n, and the one supported by JJB. Later versions of it are big rewrites no one seems to care about using. This 1.3 version does the job.

So that’s what has been deployed in the current state, and it works. I don’t think we need more work in this area.

#29 Updated by intrigeri 2015-09-15 07:07:01

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Info Needed

> Ok, good news: I think I’ve found a way to reboot the slave VMs. I’ve documented it there.

OK. I have a few questions before I look into the implementation:

  • Would it be a problem in practice if a “third job” is not run on the isotester that was rebooted by the wrapper that triggered it? (e.g. because it’s racing with another “third job” that was itself waiting for an available slave)
  • I don’t understand “It waits a bit and put the node back online again”. Doesn’t the node come back online all by itself after rebooting? If not, what is “a bit” exactly?

> This has been tested for a week or so and seems to work quite well.

I’d like to exercise it a bit (last time I was quite good at breaking the prototype with inventive race conditions ;) — tell me once everything is back up and I can start playing.

> I think the priority trick is effective in preventing a new wrapper job from stealing the isotester from an ongoing real test job after it has been rebooted.

Indeed, clever trick!

#30 Updated by bertagaz 2015-09-15 07:55:28

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Info Needed to Ready for QA

intrigeri wrote:
> * Would it be a problem in practice if a “third job” is not run on the isotester that was rebooted by the wrapper that triggered it? (e.g. because it’s racing with another “third job” that was itself waiting for an available slave)

Hmm, I’m not sure I get your question and its meaning. I don’t think it would be a problem, as long as it is rebooted before being run, and the only way to be sure of that was to tie the first and third jobs to the same isotester.

> * I don’t understand “It waits a bit and put the node back online again”. Doesn’t the node come back online all by itself after rebooting? If not, what is “a bit” exactly?

Nope, if the node is marked as temporarily offline, then you have to explicitly set it back online for it to come back. This state seems to be stored on the master.

> > This has been tested for a week or so and seems to work quite well.
>
> I’d like to exercise it a bit (last time I was quite good at breaking the prototype with inventive race conditions ;) — tell me once everything is back up and I can start playing.

Sure. I did the same test you did last time that raised the issue you found: assign multiple jobs of that kind to the same isotester and start them all in a row. This time, the issue didn’t come back up. But we’ll see what your experiments find. Crossing fingers, because otherwise I’m not sure I have another option.

But you can start playing with it right now. Jenkins is back up in a good state. Just one thing: please make sure not to delete jobs, as we still need the old run statistics.

#31 Updated by intrigeri 2015-09-15 08:19:54

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Info Needed

> intrigeri wrote:
>> * Would it be a problem in practice if a “third job” is not run on the isotester that was rebooted by the wrapper that triggered it? (e.g. because it’s racing with another “third job” that was itself waiting for an available slave)

> Hmm, I’m not sure I get your question and its meaning.

The blueprint only explains what happens in a world where there is only one such set of 3 jobs. I assume that we’ll instead have a set of 3 jobs per active branch. Right? So I’m trying to think about potential concurrency issues, and the first one that came to mind was: if multiple “third jobs” are waiting for an isotester slave to be available and online, then they may very well end up running on different slaves than the one their respective wrapper has rebooted. That is:

  1. wrapper A starts its job 2 => reboots isotester1
  2. wrapper B starts its job 2 => reboots isotester2
  3. isotester2 comes back online
  4. wrapper B’s job 3 is started on isotester1
  5. wrapper A’s job 3 is started on isotester2

If that’s not a problem, fine.

> I don’t think it would be a problem, as long as it is rebooted before being run, and the only way to be sure of that was to tie the first and third jobs to the same isotester.

How are they tied?
Is that hard-coded for each wrapper job, or is Jenkins passing through the “wrapper job scheduled on slave X this time” info to ensure that the corresponding run of job 3 is dynamically tied to slave X as well?
In other words, does this mean that each per-branch test suite job will permanently be tied to a dedicated slave? (I hope not.)
At this point, perhaps you’ll prefer pointing me to the exact code/config I should look at :)

> Nope, if the node is marked as temporarily offline, then you have to explicitly set it back online for it to come back. This state seems to be stored on the master.

OK. So I’m curious about the “waits a bit” part => could you please point me to the corresponding code/config?

#32 Updated by bertagaz 2015-09-15 10:08:06

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Info Needed to Ready for QA

intrigeri wrote:
> The blueprint only explains what happens in a world where there is only one such set of 3 jobs. I assume that we’ll instead have a set of 3 jobs per active branch. Right? So I’m trying to think about potential concurrency issues, and the first one that came to mind was: if multiple “third jobs” are waiting for an isotester slave to be available and online, then they may very well end up running on different slaves than the one their respective wrapper has rebooted. That is:
>
> 1. wrapper A starts its job 2 => reboots isotester1
> 2. wrapper B starts its job 2 => reboots isotester2
> 3. isotester2 comes back online
> 4. wrapper B’s job 3 is started on isotester1
> 5. wrapper A’s job 3 is started on isotester2

Ok, got it, thanks for the clarification.

> > I don’t think it would be a problem, as long as it is rebooted before being run, and the only way to be sure of that was to tie the first and third jobs to the same isotester.
>
> How are they tied?
> Is that hard-coded for each wrapper job, or is Jenkins passing through the “wrapper job scheduled on slave X this time” info to ensure that the corresponding run of job 3 is dynamically tied to slave X as well?
> In other words, does this mean that each per-branch test suite job will permanently be tied to a dedicated slave? (I hope not.)
> At this point, perhaps you’ll prefer pointing me to the exact code/config I should look at :)

This won’t happen, as the wrapper job explicitly tells job 3 to run on the same node it ran on itself (same-node: true in the JJB test_Tails_ISO_* yamls). So they won’t be permanently tied to a given node we’d have to configure. They’ll just take care to run on the same node.
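For reference, roughly what this looks like in the wrapper job’s YAML (job names are illustrative):

builders:
  - trigger-builds:
      - project: reboot_job
        block: true                      # hold the isotester while it reboots
  - trigger-builds:
      - project: test_Tails_ISO_stable   # hypothetical "job 3"
        same-node: true                  # reuse the node the wrapper ran on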

> > Nope, if the node is marked as temporarily offline, then you have to explicitly set it back online for it to come back. This state seems to be stored on the master.
>
> OK. So I’m curious about the “waits a bit” part => could you please point me to the corresponding code/config?

Right, I forgot to answer this. The “wait a bit” is precisely there to avoid the isotester being put back online too fast, before the VM has rebooted, while another wrapper job is trying to run on it. I’ve put in a slight delay to avoid this. In theory the first wrapper run on the isotester is holding it, so that’s more of a “better safe than sorry” protection.

The “wait a bit” is the sleep 5 you can see in jenkins-jobs.git:reboot_job.yaml.

#33 Updated by intrigeri 2015-09-15 11:11:54

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Info Needed

> This won’t happen, as the wrapper job explicitly tells job 3 to run on the same node it ran on itself (same-node: true in the JJB test_Tails_ISO_* yamls). So they won’t be permanently tied to a given node we’d have to configure. They’ll just take care to run on the same node.

Excellent, woohoo! I like it :)

>> > Nope, if the node is marked as temporarily offline, then you have to explicitly set it back online for it to come back. This state seems to be stored on the master.
>>
>> OK. So I’m curious about the “waits a bit” part => could you please point me to the corresponding code/config?

> Right, I forgot to answer this. The “wait a bit” is precisely there to avoid the isotester being put back online too fast, before the VM has rebooted, while another wrapper job is trying to run on it. I’ve put in a slight delay to avoid this.

> The “wait a bit” is the sleep 5 you can see in jenkins-jobs.git:reboot_job.yaml.

I don’t understand how waiting can help us win a race here. But that’s probably because I don’t understand what’s supposed to happen exactly, and in which order: after these 5 seconds, has the isotester finished rebooting?

But perhaps I’m just confused because the reboot job uses, to turn the slave online, the very same request that’s used to turn it offline. Is there any explanation for that? Sorry if I missed something :/

> In theory the first wrapper run on the isotester is holding it, so that’s more of a “better safe than sorry” protection.

That’s a bit reassuring, given my confusion above.

#34 Updated by bertagaz 2015-09-16 01:19:10

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:
> I don’t understand how waiting can help us win a race here. But that’s probably because I don’t understand what’s supposed to happen exactly, and in which order: after these 5 seconds, has the isotester finished rebooting?

Not really. It’s a bit hard to know how much time it takes to reboot. It’s often fast, but if the TailsToaster VM is still running because the last test suite run didn’t shut down properly, then it’s longer. But even then, I’ve not seen a job getting its isotester stolen. I guess we could add a little step that ensures this VM (and its network) is destroyed, either in the wrapper job or maybe in the post_test_cleanup script.

> But perhaps I’m just confused because the reboot job uses, to turn the slave online, the very same request that’s used to turn it offline. Is there any explanation for that? Sorry if I missed something :/

Nope, the ‘turning slave offline’ step is the very first step of the wrapper; then it goes on and triggers the reboot. The reboot job issues the ‘shutdown -r now’ command in the background (otherwise it gets blocked at this step and fails, which won’t trigger its next ‘put slave back online’ step), and then waits a bit to put the node back online while the reboot is ongoing.

The reboot job is marked as blocking the wrapper one while it runs (block: true in JJB), so the wrapper job keeps running on the isotester being rebooted (that’s why it will always fail) and keeps it busy, so no other wrapper job is supposed to be run on it.

So there’s not really a race here; this sleep is more of a safeguard, in case another wrapper wants to sneak in between the second and third job of another one, which is unlikely to happen anyway. But as we rely on the job priority only to avoid this, I preferred to add this guard.

That’s quite complicated, so I’m not surprised you have a hard time getting it. I only do because I spent time testing the various settings. :)

#35 Updated by intrigeri 2015-09-16 05:05:15

  • Assignee changed from intrigeri to bertagaz

>> But perhaps I’m just confused because the reboot job uses, to turn the slave online, the very same request that’s used to turn it offline. Is there any explanation for that? Sorry if I missed something :/

> Nope, the ‘turning slave offline’ step is the very first step of the wrapper; then it goes on and triggers the reboot. The reboot job issues the ‘shutdown -r now’ command in the background (otherwise it gets blocked at this step and fails, which won’t trigger its next ‘put slave back online’ step), and then waits a bit to put the node back online while the reboot is ongoing.

Thanks for the explanation, but sadly this does little to address the confusion I’m raising above about the same HTTP request being supposed to do two opposite things, so I’m still very confused. Let me try to explain better what I don’t understand.

The wrapper runs (https://git-tails.immerda.ch/jenkins-jobs/tree/test_Tails_ISO_isotester2.yaml#n3):

wget -O /dev/null --post-data "offlineMessage=Taken_down_for_reboot" "http://jenkins.lizard:8080/computer/${NODE_NAME}/toggleOffline"

… and I’m not surprised this takes a node offline.

The reboot job runs (https://git-tails.immerda.ch/jenkins-jobs/tree/reboot_job.yaml#n4):

wget -O /dev/null --post-data "offlineMessage=Taken_down_for_reboot" "http://jenkins.lizard:8080/computer/${RESTART_NODE}/toggleOffline"

… and it looks very much the same to me as the previous command, plus the URL and POST data only are about “offline”, so I don’t understand how it can put the node back online. Is it really the command that is supposed to put the node back online, or am I not looking in the right place?

I’ll come back to the more general topic of the potential race and the sleep once this is clarified.

#36 Updated by bertagaz 2015-09-16 05:34:37

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:
> Thanks for the explanation, but sadly this does little to address the confusion I’m raising above about the same HTTP request being supposed to do two opposite things, so I’m still very confused. Let me try to explain better what I don’t understand.
>
> The wrapper runs (https://git-tails.immerda.ch/jenkins-jobs/tree/test_Tails_ISO_isotester2.yaml#n3):
>
> […]
>
> … and I’m not surprised this takes a node offline.
>
> The reboot job runs (https://git-tails.immerda.ch/jenkins-jobs/tree/reboot_job.yaml#n4):
>
> […]
>
> … and it looks very much the same to me as the previous command, plus the URL and POST data only are about “offline”, so I don’t understand how it can put the node back online. Is it really the command that is supposed to put the node back online, or am I not looking in the right place?

Aaaaah ok!

Well, that’s the way it’s done in Jenkins: hit this URL, and if the node is online, it’s taken offline; if it’s already offline, it’s put back online. Yes, that’s right, that’s confusing. Have a look at the toggle_temporarily_offline function in python-jenkinsapi if you’re in doubt.

#37 Updated by intrigeri 2015-09-16 06:31:59

> Aaaaah ok!

> Well, that’s the way it’s done in Jenkins: hit this URL, and if the node is online, it’s taken offline; if it’s already offline, it’s put back online.

OK, got it, thanks! Indeed there’s “toggle” in toggleOffline. I suggest changing the offlineMessage in the reboot job to make it clear that the intent is to put the node back online: if it wasn’t saying “Taken_down_for_reboot” I might have understood initially.
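For instance, something like this in the reboot job (the message wording is just a suggestion; it’s the same toggleOffline request either way):

wget -O /dev/null --post-data "offlineMessage=Back_online_while_rebooting" "http://jenkins.lizard:8080/computer/${RESTART_NODE}/toggleOffline"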

So I’ll now re-read your explanations in this light. Thanks again :)

#38 Updated by intrigeri 2015-09-16 07:20:39

  • Assignee changed from intrigeri to bertagaz
  • QA Check deleted (Info Needed)

bertagaz wrote:
> But even then, I’ve not seen a job getting its isotester stolen.

That’s good to know, but my understanding is that this doesn’t prove anything at all, because in the current setup, there’s no way for a job to steal another job’s isotester, due to the combination of:

  • we have concurrent: false so no more than 1 instance of each test_Tails_ISO_isotester*.yaml can be scheduled at a given time;
  • each test_Tails_ISO_isotester*.yaml is tied to a specific node.

… so it’s no surprise that this works fine without anyone stealing anything, right?

To test this more accurately, one needs to either set concurrent: true for these testing jobs, or have several test_Tails_ISO_isotester*.yaml jobs for each isotester, or have all these jobs be able to run on any isotesterN (node: isotester2 || isotester3 || isotester4). I think the last one is better, since it’s closer to what we’ll be doing for our real testing jobs: no concurrency for testing ISOs from a given branch presumably (?), but runs can happen on any isotesterN.
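In JJB terms, that last option would be something like this (job name is hypothetical; node takes a label expression):

- job:
    name: test_on_any_isotester   # hypothetical
    concurrent: false
    node: isotester2 || isotester3 || isotester4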

> I guess we could add a little step that ensures this VM (and its network) is destroyed, either in the wrapper job or maybe in the post_test_cleanup script.

> So there’s not really a race here; this sleep is more of a safeguard, in case another wrapper wants to sneak in between the second and third job of another one, which is unlikely to happen anyway. But as we rely on the job priority only to avoid this, I preferred to add this guard.

I still don’t get what exact problem that sleep is supposed to safeguard against. On the one hand you say there’s not really a race, but then OTOH you try to address the “another wrapper wants to sneak in” problem which looks very much like a race condition. Let me put together a more formal analysis before we both go mad with fuzzy language and misunderstandings:

  1. The resource jobs may be racing for is “isotesterN is online and available to run a job”.
  2. As long as the isotesterN reboot is not completed, there is no race since even though the node might be “online” from the master’s PoV, the slave still hasn’t connected to the master so it isn’t available for scheduling jobs on it yet. The sleep 5 doesn’t impact this.
  3. As long as the reboot job hasn’t turned isotesterN back online, there is no race since the node is still offline. The sleep 5 makes this happen a bit later.
  4. As soon as the reboot has completed (more accurately: jenkins-slave successfully connected to the jenkins master) and the reboot job has turned the node back online, then there is a race. IOW, the race starts at time = max(“reboot finished”, “5 seconds + wget”).
  5. Now I’ll be assuming that the job priority thing + isotesterN being “held” by the wrapper are not enough (which you seem to suggest might be the case, otherwise we wouldn’t be having this discussion). So we can have two candidate jobs that the Jenkins master might decide to run on isotesterN, namely: our own third job (that follows our reboot job) and another wrapper job. IOW, they are racing for the resource we’re talking about.

The way I understand it, the only potential effect that sleep 5 can have is to make the race start later, if and only if “5 seconds + wget” > “reboot finished”, that is basically if and only if the full reboot process takes less than 5 seconds (unlikely, but oh well). And once the race has started, then the sleep can’t possibly affect the outcome of the race.

I propose to:

  • implement the more realistic and accurate set of test jobs I suggested above, in order to actually be in a situation when a race might happen; without that, we can’t possibly evaluate a) if there’s a problem to solve, and b) if there is, if the solution we have actually solves it;
  • remove that sleep 5;
  • see if there’s a real-world problem to solve.

I suspect that the job priority is enough and it’ll just work without any special trick :)

#39 Updated by intrigeri 2015-09-17 10:32:07

We’ve noticed a problem with the scheduler today, bertagaz has understood the problem and will explain it in a ticket tomorrow.

I have a draft branch (feature/reboot-after-testing) that might help fix that, by rebooting after running tests instead of before, which also required reworking a bit how priorities are handled. I’ve tried to implement it in a way that can be deployed/tested in parallel with the existing testing jobs, so I’ll merge it tomorrow to give it a try unless you object. Of course, it won’t work if your theory about block: true for the first triggered job is correct — we’ll see.

#40 Updated by intrigeri 2015-09-17 10:39:56

intrigeri wrote:
> We’ve noticed a problem with the scheduler today, bertagaz has understood the problem and will explain it in a ticket tomorrow.

That’s Bug #10215, I’ll let you fix the explanation if I got it wrong.

#41 Updated by intrigeri 2015-09-17 13:06:31

intrigeri wrote:
> I have a draft branch (feature/reboot-after-testing) that […]

I’ve followed-up on Bug #10215, let’s not continue this discussion here.

#42 Updated by bertagaz 2015-09-23 01:31:28

  • Target version changed from Tails_1.6 to Tails_1.7

#43 Updated by sajolida 2015-10-01 07:23:13

This was due on July 15, please make sure we can at least deliver in time for the next milestone on October 15.

#44 Updated by intrigeri 2015-10-05 03:04:09

I know this is not assigned to me for review yet, but let me state my expectations clearly in advance:

All this is very complex, and it’ll be hard for me to understand it without a clear description of the current tentative design. Please update the blueprint and then reassign to me. You’ll need to do that for Feature #10117 anyway, so if you’re confident with your current design, then it should not add more work to your plate :)

I see we still have three calls to sleep in the jobs Git repo, but those are in three files that are now unused, right? If so:

  • Please delete all such obsolete files so it’s easier for me to understand what’s going on in there.
  • Congrats for coming up with a design that does not rely on hard-coded and racy sleep times :)

Thanks in advance!

#45 Updated by intrigeri 2015-10-05 03:06:05

  • related to Feature #10117: Design how to run our test suite in Jenkins added

#46 Updated by bertagaz 2015-10-06 03:55:42

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

intrigeri wrote:
> I know this is not assigned to me for review yet, but let me state my expectations clearly in advance:
>
> All this is very complex, and it’ll be hard for me to understand it without a clear description of the current tentative design. Please update the blueprint and then reassign to me. You’ll need to do that for Feature #10117 anyway, so if you’re confident with your current design, then it should not add more work to your plate :)

Ok, I tried to clarify things as said in Feature #10117#note-38. Hope it’s a bit clearer.

> I see we still have three calls to sleep in the jobs Git repo, but those are in three files that are now unused, right? If so:

No, they are in use: they are the ones I set up for you to try with the new concurrent: true setting that fixes the problem we’ve seen in Bug #10215.

Now I have an explanation for this sleep, as I’ve tried to remove it: if we don’t put such a delay after we issue the shutdown -r command, then the request to put the node back online is done very quickly after it, like in a matter of a second. In this case, it happens that a new job gets assigned to the isotester, because the VM reboot is at its very early stage and the Jenkins slave is still up. Adding a bit of a delay before putting the slave back online helps to avoid this situation.

> * Please delete all such obsolete files so it’s easier for me to understand what’s going on in there.

They are the ones you’re supposed to experiment with, so I won’t do that. :)

> * Congrats for coming up with a design that does not rely on hard-coded and racy sleep times :)

Well, the sleep is still there, but mostly to avoid a race condition.

> Thanks in advance!

Hope it helps. I’m still sick and don’t have all my brain, and the break we had last week doesn’t help with keeping this big pile in mind.

#47 Updated by intrigeri 2015-10-07 22:58:24

  • Assignee changed from intrigeri to bertagaz

>> I see we still have three calls to sleep in the jobs Git repo, but those are in three files that are now unused, right? If so:

> No, they are in use: they are the ones I set up for you to try with the new concurrent: true setting that fixes the problem we’ve seen in Bug #10215.

I don’t understand where keep_node_busy.yaml and put_node_online.yaml are used:

$ git grep sleep 
keep_node_busy.yaml:      - shell: "sleep 500"
put_node_online.yaml:      - shell: 'sleep 20 && wget -O /dev/null --post-data "offlineMessage=${MESSAGE}" "http://jenkins.lizard:8080/computer/${TARGET_NODE}/toggleOffline"'
reboot_job.yaml:      - shell: 'sleep 5 && wget -O /dev/null --post-data "offlineMessage=Back_online_while_rebooting" "http://jenkins.lizard:8080/computer/${RESTART_NODE}/toggleOffline"'
$ grep name keep_node_busy.yaml put_node_online.yaml reboot_job.yaml
keep_node_busy.yaml:    name: keep_node_busy
put_node_online.yaml:    name: put_node_online
reboot_job.yaml:    name: reboot_job
$ git grep -E 'keep_node_busy|put_node_online|reboot_job'               
keep_node_busy.yaml:    name: keep_node_busy
put_node_online.yaml:    name: put_node_online
reboot_job.yaml:    name: reboot_job
reboot_node.yaml:        - project: reboot_job
test_Tails_ISO_isotester1_metrics.yaml:        - project: reboot_job
test_Tails_ISO_isotester2.yaml:        - project: reboot_job
test_Tails_ISO_isotester3.yaml:        - project: reboot_job
test_Tails_ISO_isotester4.yaml:        - project: reboot_job

=> please clarify.

> Now I have an explanation for this sleep, as I’ve tried to remove it: if we don’t put such a delay after we issue the shutdown -r command, then the request to put the node back online is done very quickly after it, like in a matter of a second. In this case, it happens that a new job gets assigned to the isotester, because the VM reboot is at its very early stage and the Jenkins slave is still up. Adding a bit of a delay before putting the slave back online helps to avoid this situation.

Seems like we’re back to the argument we had weeks ago — I hope not; I’ve not re-read the ticket history and won’t have time before next week, so I don’t know. You won’t be surprised to read that bits like “helps to avoid”, “bit of delay”, and no explanation for the magical “5” value aren’t overly confidence-inspiring to me, since we’ve been through that before ;)

Let’s try to avoid any more misunderstanding. Let’s try and find a way that makes me not too sad, without putting too much work on your shoulders this month (I don’t see that as unexpected work, since you could totally expect that anything with a hardcoded and unproven sleep time would not be seen as a good and robust solution, but well, now it’s too late to argue about it I guess).

So I see two cases:

  • If you’re confident that this is provably not racy, then please document on the blueprint why “5” is the right sleep time and why it’s not racy, using formal language as I did last time we had a misunderstanding on a similar topic (please refer to the argument I made about this last time for inspiration).
  • If, on the contrary, “5” is a magic number that happens to work fine in our current setup, then I’ll be sad but I can live with giving it a try, as long as:
    • it’s clearly documented to be only worth what it is on the blueprint (that is, when exactly we want the “put back online” to happen, before what, after what; an example of why sleep 1 would be too small; an example of why sleep 20 would be too much);
    • we recognize it as a problem, i.e. there’s a ticket about finding a robust, provably non-racy design to handle this problem later (not sure I remember it clearly but IIRC that’s one thing that my feature/reboot-after-testing untested WIP branch was meant to fix);
    • if problems occur during the real-world deployment due to this arbitrary sleep, then they’re handled with top priority.

Fair enough?

#48 Updated by bertagaz 2015-10-08 07:06:00

  • Assignee changed from bertagaz to intrigeri

First things first: I understand that the comeback of this sleep doesn’t make you happy. I’m not pleased by it either, I agree it sounds hackish, and I’m sorry to have to discuss this again. But we’re dealing with Jenkins limitations, and our lack of deep understanding of its internals. We may consider a sleep as something not robust in the context of a script, but in this different Jenkins machinery context, well, we may not have much choice.

Please consider that I’ve spent hours and hours trying different possible ideas, and so far this one is the only one I’ve had in what we could call a working state.

Also remember that we had a full week of other, mostly offline activities, which doesn’t help me remember all the details of this previous testing either.

intrigeri wrote:
> I don’t understand where keep_node_busy.yaml and put_node_online.yaml are used:

They are leftovers of previous failed attempts to reboot the node after the test job. I’ll remove them if they are not necessary.

> > Now I have an explanation for this sleep, as I’ve tried to remove it: if we don’t put such a delay after we issue the shutdown -r command, then the request to put the node back online is done very quickly after it, like in a matter of a second. In this case, it happens that a new job gets assigned to the isotester, because the VM reboot is at its very early stage and the Jenkins slave is still up. Adding a bit of a delay before putting the slave back online helps to avoid this situation.
>
> Seems like we’re back to the argument we had weeks ago — I hope not; I’ve not re-read the ticket history and won’t have time before next week, so I don’t know. You won’t be surprised to read that bits like “helps to avoid”, “bit of delay”, and no explanation for the magical “5” value aren’t overly confidence-inspiring to me, since we’ve been through that before ;)

Sure, no surprise there. Here’s a tentative one, that I’ve also added to the blueprint:

This sleep is here to ensure that the jenkins slave gets killed by the shutdown process, meaning: wait enough time for this process to stop after receiving the SIGTERM signal. If we don’t use it, Jenkins is fast enough to run a new job on this isotester before that happens. So we need a delay that is not too short, and from the tests we’ve made, 5 seconds seems to be enough for that. Less would probably reintroduce this race condition. More would add unnecessary lagging time. We could measure the maximum time it takes to stop this daemon, but in the end we’ll conclude it takes a few seconds.
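For instance, a trivial way to get such a data point on an isotester would be:

time service jenkins-slave stop

(repeated a few times to get an idea of the worst case).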

> Let’s try to avoid any more misunderstanding. Let’s try and find a way that makes me not too sad, without putting too much work on your shoulders this month (I don’t see that as unexpected work, since you could totally expect that anything with a hardcoded and unproven sleep time would not be seen as a good and robust solution, but well, now it’s too late to argue about it I guess).

Well, for me it’s proven by the number of times I hit the build button for the wrapper jobs, as hopefully you did too. I’ve never seen this number introduce weird behaviors or race conditions. On the contrary, it solves the biggest one we had, as explained above.

> So I see two cases:
>
> * If you’re confident that this is provably not racy, then please document on the blueprint why “5” is the right sleep time and why it’s not racy, using formal language as I did last time we had a misunderstanding on a similar topic (please refer to the argument I made about this last time for inspiration).

Hopefully done. I’m not sure which previous topic you’re talking about, but I’ve tried to be as clear as I could.

> * If, on the contrary, “5” is a magic number that happens to work fine in our current setup, then I’ll be sad but I can live with giving it a try, as long as:
> * it’s clearly documented to be only worth what it is on the blueprint (that is, when exactly we want the “put back online” to happen, before what, after what; an example of why sleep 1 would be too small; an example of why sleep 20 would be too much);

Hope my explanation fits your request.

> * we recognize it as a problem, i.e. there’s a ticket about finding a robust, provably non-racy design to handle this problem later (not sure I remember it clearly but IIRC that’s one thing that my feature/reboot-after-testing untested WIP branch was meant to fix);

This branch of yours was not working: rebooting the VM in a builder would make the whole build fail, and the post-build steps like artifact archiving and cucumber reporting would not be able to take place, given that the VM is down at that moment. I didn’t even try it, because there’s no doubt it doesn’t work. Instead, I gave a try to an alternate solution in the feature/reboot-after_post_build_trigger branch. I tested that one, but came to the conclusion that it was not working either.

I’m not convinced there are other ways of doing it TBH. I’ve spent quite some time testing other possible alternatives, but it might be that this one is the only workable one.

Now we can also maybe consider that “robust” describes something that works when deployed. I mean, the mere fact that there is a sleep of N seconds doesn’t make a solution non-robust.

> * if problems occur during the real-world deployment due to this arbitrary sleep, then they’re handled with top priority.

Sure.

#49 Updated by intrigeri 2015-10-12 11:01:20

  • Assignee changed from intrigeri to bertagaz

Hi bertagaz,

There’s a lot in your last comment that I disagree with, but it’s neither the right time nor the right place to debate it. I’ll simply share a tiny bit of my feelings and will focus on resolving this ticket.

> Please consider that I’ve spent hours and hours trying different possible ideas,

I totally acknowledge this.

> and so far this one is the only one I’ve gotten into what we could call a working state.

First, congrats for putting the whole thing into a working state. That’s an amazing achievement! :)

What makes me sad is that with all the communication problems we’ve had on this ticket, it took basically a month until you described the actual problem this sleep is working around in a way that I understood clearly. I was excited at the idea of finding a more elegant solution to this problem, wanted to think about it, and possibly to be a useful rubber duck or slightly more, but I feel that I could not participate positively in this discussion: with the info I had, I could only play the “bad guy” part of my role here (which I apparently did thoroughly enough, if not too much). I think we should try work meetings on Mumble, or something, for this kind of topic, because Redmine wasn’t very helpful as a tool.

> Sure, no surprise there. Here’s a tentative one, that I’ve also added to the blueprint:

> This sleep is here to ensure that the Jenkins slave process gets killed by the shutdown process, that is, to wait long enough for this process to stop after receiving the SIGTERM signal. If we don’t wait, Jenkins is fast enough to run a new job on this isotester before that happens. So we need a delay that is not too short, and from the tests we’ve made, 5 seconds seems to be enough. Less would probably reintroduce this race condition; more would add unnecessary idle time. We could measure the maximum time it takes to stop this daemon, but in the end we’d conclude that it takes a few seconds.

Thanks a looooooot! That’s the info I wanted :)

In my own words, we want to turn the slave back online:

  • strictly after the Jenkins slave process disconnects from the master;
  • but as soon as possible, because the more we wait, the more working time of our ISO testers we’re wasting.

Short term, the 5 seconds sleep time we have sounds OK. If we see it’s not enough 100% of the time in the real world, we can bump the delay up to a few minutes without any significant effect, since that is still negligible compared to a test suite run. Given this, I wonder whether I shouldn’t bump it a bit right now, just to avoid the failure that happens 1% of the time in the real world because some other service, which the init system stops before Jenkins, takes its time to shut down; but I’m being superstitious, so ignore this for now :)

Longer term, if we had to bump the delay enough that it becomes too ugly or impractical, or if someone feels like making this nicer, then it could be nice to deterministically wait until the Jenkins slave process is dead, but no need to keep this in mind for now.
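For the record, here’s a hypothetical shape for that deterministic wait (untested, and assuming the master has SSH access to the isotester and that the agent runs from the usual slave.jar; adjust to whatever our setup actually uses):

    # after issuing "shutdown -r", poll until the slave JVM is actually gone;
    # an SSH connection failure means the VM is already down, which is fine too
    while ssh root@"$NODE" pgrep -f slave.jar >/dev/null 2>&1; do
        sleep 1
    done
    # only now is it safe to put the node back online
    java -jar jenkins-cli.jar -s "$JENKINS_URL" online-node "$NODE"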

>> So I see two cases:
>>
>> * If you’re confident that this is provably not racy, then please document on the blueprint why “5” is the right sleep time and why it’s not racy, using formal language as I did last time we had a misunderstanding on a similar topic (please refer to the argument I made about this last time for inspiration).

> Hopefully done. I’m not sure which previous topic you’re talking about, but I’ve tried to be as clear as I could.

Yes, it’s very clear now. Thanks!

> Now we can also maybe consider that “robust” describes something that works when deployed. I mean, the mere fact that there is a sleep of N seconds doesn’t make a solution non-robust.

I’m fine with such a pragmatic stance on it, given:

>> if problems occur during the real-world deployment due to this arbitrary sleep, then they’re handled with top priority.

> Sure.

:)

As far as I’m concerned, we’re done on this ticket. Please mark it as resolved if you agree.

#50 Updated by bertagaz 2015-10-13 05:04:18

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • QA Check deleted (Info Needed)

intrigeri wrote:
> What makes me sad is that with all the communication problems we’ve had on this ticket, it took basically a month until you described the actual problem this sleep is working around in a way that I understood clearly. I was excited at the idea of finding a more elegant solution to this problem, wanted to think about it, and possibly to be a useful rubber duck or slightly more, but I feel that I could not participate positively in this discussion: with the info I had, I could only play the “bad guy” part of my role here (which I apparently did thoroughly enough, if not too much). I think we should try work meetings on Mumble, or something, for this kind of topic, because Redmine wasn’t very helpful as a tool.

Yes, I had difficulties explaining this sleep clearly. Most of the discussion on this ticket happened sometimes a week after the work that was under review, and it wasn’t easy to remember what this trick was for.

> Thanks a looooooot! That’s the info I wanted :)

Finally! :D

> In my own words, we want to turn the slave back online:
>
> * strictly after the Jenkins slave process disconnects from the master;
> * but as soon as possible, because the more we wait, the more working time of our ISO testers we’re wasting.

Yes!

> Short term, the 5 seconds sleep time we have sounds OK. If we see it’s not enough 100% of the time in the real world, we can bump the delay up to a few minutes without any significant effect, since that is still negligible compared to a test suite run. Given this, I wonder whether I shouldn’t bump it a bit right now, just to avoid the failure that happens 1% of the time in the real world because some other service, which the init system stops before Jenkins, takes its time to shut down; but I’m being superstitious, so ignore this for now :)

So far I’ve never seen this sleep cause an issue with the build triggering process. I propose we wait and see whether it does, and raise the delay if necessary.

> Longer term, if we had to bump the delay enough that it becomes too ugly or impractical, or if someone feels like making this nicer, then it could be nice to deterministically wait until the Jenkins slave process is dead, but no need to keep this in mind for now.

Yes.

> > Now we can also maybe consider that “robust” describes something that works when deployed. I mean, the mere fact that there is a sleep of N seconds doesn’t make a solution non-robust.
>
> I’m fine with such a pragmatic stance on it, given:
>
> >> if problems occur during the real-world deployment due to this arbitrary sleep, then they’re handled with top priority.

Glad we found a common ground.

> As far as I’m concerned, we’re done on this ticket. Please mark it as resolved if you agree.

Yay!

#51 Updated by intrigeri 2015-11-23 01:13:23

  • related to Bug #10601: isotesterN:s are sometimes put offline and never back online added

#52 Updated by intrigeri 2016-02-08 12:50:55

  • related to deleted (Bug #10601: isotesterN:s are sometimes put offline and never back online)

#53 Updated by intrigeri 2016-02-08 12:52:42

  • related to Bug #10601: isotesterN:s are sometimes put offline and never back online added

#54 Updated by intrigeri 2019-09-23 09:55:01

  • related to Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones added