Bug #11295

Test jobs sometimes get scheduled on a busy isotester while there are available ones

Added by bertagaz 2016-03-31 15:12:45 . Updated 2020-01-25 11:54:40 .

Status:
Resolved
Priority:
Normal
Assignee:
intrigeri
Category:
Continuous Integration
Target version:
Start date:
2016-03-31
Due date:
% Done:

0%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
0
Affected tool:
Deliverable for:
280

Description

While investigating Bug #10601, we discovered that sometimes, after a reboot_job completes, instead of starting the test job that triggered it on this isotester, Jenkins assigns the same isotester to another test job, so the first test job waits for hours until the other one is over. See Bug #10601#note-5 for details.


Subtasks


Related issues

Related to Tails - Bug #10215: Suboptimal advance booking of Jenkins slaves for testing ISOs Resolved 2015-09-17
Related to Tails - Bug #10601: isotesterN:s are sometimes put offline and never back online Needs Validation 2015-11-23
Related to Tails - Bug #16959: Gather usability data about our current CI In Progress
Related to Tails - Feature #9486: Support running multiple instances of the test suite in parallel Resolved 2015-06-25
Related to Tails - Bug #17216: Make the test suite clean up after itself even in most tricky failure modes Confirmed
Blocked by Tails - Bug #10068: Upgrade to Jenkins 2.x, using upstream packages Resolved 2018-01-08

History

#1 Updated by intrigeri 2016-04-06 12:48:34

  • Description updated

I suggest first setting up a very simple test case to confirm what the deal is with job priority, and whether our current configuration is based on a correct understanding of how the Priority Sorter plugin works (Bug #10601#note-5 has more precise pointers about where this doubt of mine comes from).

Rationale: even if the bug isn’t obvious in our current setup for some reason, I’d rather not keep configuration designed on erroneous assumptions, since, if that’s the case, it will be confusing the next time I have to debug weird race conditions.

#2 Updated by bertagaz 2016-05-10 08:00:17

  • Target version changed from Tails_2.4 to Tails_2.5

#3 Updated by bertagaz 2016-06-10 16:42:47

  • Target version changed from Tails_2.5 to Tails_2.6

Probably won’t have time to work on it before that.

#4 Updated by intrigeri 2016-07-31 15:20:09

  • Subject changed from Test jobs sometimes get their isotester stolen by another one. to Test jobs sometimes get their isotester stolen by another one

I’ve just seen something similar happen again: https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/ is “(pending—Waiting for next available executor on isotester2) UPSTREAMJOB_BUILD_NUMBER=15” while https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/ is running on isotester2. Five other isotesters are available, so it’s a shame that job 15 was scheduled on isotester2 as well and now has to wait for 3 hours before it’s run.

Job 14 was run on Jul 31, 2016 9:18:49 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8542/ with parameter RESTART_NODE=isotester2. The wrap job had NODE_NAME=isotester2.

Job 15 was run on Jul 31, 2016 9:19:19 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8543/ with parameter RESTART_NODE=isotester2. The wrap job had NODE_NAME=isotester2.

As said in the ticket description, I’ve already investigated such a problem 8 months ago (Bug #10601#note-5), so the next debugging steps should be easy, if done before the corresponding system logs and Jenkins artifacts expire.

I believe this clearly answers the “We first need to see if this still happens or not” part of this ticket: something is wrong with our job priority setup.

#5 Updated by intrigeri 2016-08-01 16:40:12

  • Subject changed from Test jobs sometimes get their isotester stolen by another one to Test jobs sometimes get scheduled on a busy isotester while there are available ones

The same thing is happening as we speak, between https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-from-intrigeri-for-2.6/7/ and job 9 on the same project: here again, 2 isotesters are free, but job 9 is waiting for isotester1 to be available while job 7 is running there.

#6 Updated by intrigeri 2016-08-01 16:41:48

  • related to Bug #10215: Suboptimal advance booking of Jenkins slaves for testing ISOs added

#7 Updated by anonym 2016-09-20 16:54:06

  • Target version changed from Tails_2.6 to Tails_2.7

#8 Updated by bertagaz 2016-10-03 20:57:14

  • Target version changed from Tails_2.7 to Tails_2.9.1

#9 Updated by anonym 2016-12-14 20:11:24

  • Target version changed from Tails_2.9.1 to Tails 2.10

#10 Updated by intrigeri 2016-12-18 10:02:25

  • Target version changed from Tails 2.10 to Tails_2.11

#11 Updated by bertagaz 2017-03-08 10:38:05

  • Target version changed from Tails_2.11 to Tails_2.12

#12 Updated by bertagaz 2017-03-08 11:09:21

  • Target version changed from Tails_2.12 to Tails_3.0

#13 Updated by bertagaz 2017-04-06 14:29:28

  • Target version changed from Tails_3.0 to Tails_3.1

#14 Updated by bertagaz 2017-05-27 10:17:33

  • Target version changed from Tails_3.1 to Tails_3.2

#15 Updated by bertagaz 2017-08-11 16:16:29

  • Target version changed from Tails_3.2 to Tails_3.3

#16 Updated by bertagaz 2017-10-23 09:52:37

  • Target version changed from Tails_3.3 to Tails_3.5

Realistically reschedule for 3.4.

#17 Updated by bertagaz 2017-11-15 14:39:51

  • Target version changed from Tails_3.5 to Tails_3.6

#18 Updated by bertagaz 2018-03-14 11:32:11

  • Target version changed from Tails_3.6 to Tails_3.7

#19 Updated by intrigeri 2018-04-08 19:22:18

  • Description updated

#20 Updated by intrigeri 2018-04-08 19:27:57

  • related to Bug #10601: isotesterN:s are sometimes put offline and never back online added

#22 Updated by bertagaz 2018-05-10 11:09:16

  • Target version changed from Tails_3.7 to Tails_3.8

#23 Updated by intrigeri 2018-06-26 16:27:54

  • Target version changed from Tails_3.8 to Tails_3.9

#24 Updated by intrigeri 2018-09-05 16:26:53

  • Target version changed from Tails_3.9 to Tails_3.10.1

#25 Updated by intrigeri 2018-10-24 17:03:37

  • Target version changed from Tails_3.10.1 to Tails_3.11

#26 Updated by CyrilBrulebois 2018-12-16 14:11:20

  • Target version changed from Tails_3.11 to Tails_3.12

#27 Updated by anonym 2019-01-30 11:59:15

  • Target version changed from Tails_3.12 to Tails_3.13

#28 Updated by Anonymous 2019-03-14 13:22:10

  • blocked by Bug #10068: Upgrade to Jenkins 2.x, using upstream packages added

#29 Updated by Anonymous 2019-03-14 13:24:45

If I understand correctly, upgrading the Priority Sorter plugin will magically fix this. bertagaz will probably have to do this upgrade for Bug #10068 → so “blocked by”.

#30 Updated by Anonymous 2019-03-15 10:19:18

  • Assignee changed from bertagaz to intrigeri
  • Target version changed from Tails_3.13 to Tails_3.16

Once bertagaz has done the Jenkins update (Bug #10068), intrigeri will check, around the 3.16 release (July-Aug 2019), whether this issue was indeed magically corrected by the update.

#31 Updated by intrigeri 2019-03-15 10:20:38

(This problem generally arises when our CI is overloaded, which often happens in the last few days before a release. So once Bug #10068 is done by end of June, I should notice the problem before either the 3.15 or the 3.16 release. If I don’t, I’ll be happy to call this fixed.)

#32 Updated by intrigeri 2019-08-05 08:23:39

  • Target version changed from Tails_3.16 to Tails_3.17

u wrote:
> Once bertagaz has done the Jenkins update (Bug #10068), intrigeri will check, around the 3.16 release (July-Aug 2019), whether this issue was indeed magically corrected by the update.

bertagaz told me that he has made progress on Bug #10068, but it is not done yet, so I’ll postpone this by a month too.

#33 Updated by intrigeri 2019-09-01 13:29:26

  • related to Bug #16959: Gather usability data about our current CI added

#34 Updated by intrigeri 2019-09-12 14:25:12

  • Target version changed from Tails_3.17 to Tails_4.0

#35 Updated by intrigeri 2019-09-21 17:16:28

  • Status changed from Confirmed to Needs Validation

I’ve adapted our config to be compatible with the latest version of the Priority Sorter plugin, and made the configured priorities correctly implement the documented design: https://git.tails.boum.org/jenkins-jobs/commit/?id=d5586f9642fa4cd63f49016cbe090fa863149db8 might help. I’ll keep an eye on this problem and will report back if I see it again.

Note, however, that I don’t think that fixing the priorities is sufficient to fully fix the race condition: between “the time when reboot_job puts the node back online and wrap_test_* triggers test_Tails_ISO_” and “the time when test_Tails_ISO_ actually starts”, on my local Jenkins I see that 9 seconds have passed. So there’s still a chance that another job that was already in the queue, and that is allowed to run on that node, starts there in the meantime. Such a job can be, for example, another wrap_test_*. At this point I can think of no easy way to fix this within the current design of our pipeline. Given the uncertainty around the future of our Jenkins setup, I won’t spend much more time on it.

I suspect that our time would be better invested in making the test suite jobs clean up properly after themselves when stuff crashes, like we did for the build jobs when we switched them to Vagrant. This way, we could drop the whole “reboot before running the test suite” dance. Most likely, the outcome of this effort would still be valuable even if we move to GitLab CI or whatnot, which is why I think it could be a more worthwhile investment than fixing this part of our Jenkins setup 4 years after the problem was spotted. Also, making the test suite jobs clean up after themselves reliably is needed if we want to have nodes that can run both builds & tests, which would provide great performance improvements to our feedback loop.

#36 Updated by intrigeri 2019-09-22 16:37:09

  • Type of work changed from Research to Sysadmin

#37 Updated by intrigeri 2019-09-23 09:55:00

  • related to Feature #9486: Support running multiple instances of the test suite in parallel added

#38 Updated by intrigeri 2019-09-25 07:26:38

  • Status changed from Needs Validation to In Progress

> Note, however, that I don’t think that fixing the priorities is sufficient to fully fix the race condition: between “the time when reboot_job puts the node back online and wrap_test_* triggers test_Tails_ISO_” and “the time when test_Tails_ISO_ actually starts”, on my local Jenkins I see that 9 seconds have passed. So there’s still a chance that another job that was already in the queue, and that is allowed to run on that node, starts there in the meantime. Such a job can be, for example, another wrap_test_*. At this point I can think of no easy way to fix this within the current design of our pipeline. Given the uncertainty around the future of our Jenkins setup, I won’t spend much more time on it.

I’ve thought about this some more and tested some other options. I’m now convinced that job priorities alone (even fixed as done a few days ago) can’t solve this problem. At the very least, we would need wrap_test_Tails_ISO_* to wait before it frees the current node, to ensure the test_Tails_ISO_ it triggers has actually reached the queue.

The thing is, I understood that last part while testing a totally different approach (based on unfinished work started by myself and later improved by bertagaz in feature/reboot-after_post_build_trigger), which I think is pretty robust and easier to maintain and to fix if needed:

  • It includes lots of comments that explain why every workaround is needed and why it works.
  • Every time there’s a call to sleep, if it turns out that the sleep duration does not work reliably, we can simply increase it, with well understood consequences (that boil down to: every test suite run takes a little more time; given such a run takes hours, adding a few dozen seconds here and there is not a problem).

I don’t think my time would be well spent transplanting all this work onto the current design (I don’t even know if that’s possible) so what I’ll do is:

  1. Keep stress-testing this new design on my local Jenkins.
  2. Identify whatever corner cases I’ve not thought of yet. Handle them.
  3. After a while, propose it for merging. I will ask zen to review it: he now understands quite a bit about our CI setup.

This should solve Bug #11295 and Bug #10601 for good (modulo possible sleep duration bumps later).

Nevertheless, that’s still not really good in the long term, because we’re still blocking some CI performance improvements and potential future options:

> I suspect that our time would be better invested in making the test suite jobs clean up properly after themselves when stuff crashes, like we did for the build jobs when we switched them to Vagrant. This way, we could drop the whole “reboot before running the test suite” dance. Most likely, the outcome of this effort would still be valuable even if we move to GitLab CI or whatnot, which is why I think it could be a more worthwhile investment than fixing this part of our Jenkins setup 4 years after the problem was spotted. Also, making the test suite jobs clean up after themselves reliably is needed if we want to have nodes that can run both builds & tests, which would provide great performance improvements to our feedback loop.

I think I have a good lead for this. If I work on this further or if this ticket gets closed, I should move this to a dedicated ticket, but for now let’s keep tracking our various options here.

The idea is this:

1. Whenever the test suite completes (success, failure, abort for unexpected reasons), that is whenever the wrap_test_suite process exits, ensure that all its child processes are killed (even if they were started in the background). This matters for things like Xvfb, unclutter, tor processes started by Chutney, avconv, local services run by cucumber, etc.

I’ve got a simple PoC for this that uses systemd-run, which creates a new CGroup (currently all test suite processes are in the /system.slice/jenkins-slave.service CGroup, which we can’t kill because the Jenkins slave process is also part of it).

Applying this recipe to wrap_test_suite looks like this:

systemd-run --user \
    --collect \
    --service-type=oneshot \
    --unit=tails-wrap-test-suite.service \
    --setenv=USE_LAST_RELEASE_AS_OLD_ISO=yes \
    /usr/local/bin/wrap_test_suite \
    https://iso-history.tails.boum.org \
    /tmp/TailsToaster

We’ll need some trickery to send more environment variables over to this transient service. For example, those defined in tmp/tails-build-env.list. This seems totally feasible.
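
A minimal sketch of the kind of trickery this could involve, assuming tmp/tails-build-env.list contains one KEY=VALUE pair per line (the variable name is illustrative, not the actual implementation):

# Hypothetical: turn each KEY=VALUE line into a --setenv argument,
# then splice "${setenv_args[@]}" into the systemd-run command above.
setenv_args=()
while IFS= read -r pair; do
    setenv_args+=("--setenv=$pair")
done < tmp/tails-build-env.list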

A problem here is process ownership: wrap_test_suite runs all its child processes as root, with sudo, so I doubt that the systemd --user instance this would be run under is allowed to kill them. I suspect this can easily be solved by running wrap_test_suite itself as root instead of using sudo in this script. And then we can remove --user from the systemd-run command line.
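
If we go that route, the invocation could become something like this sketch (the same command as above, started directly by root, without sudo and without --user):

systemd-run \
    --collect \
    --service-type=oneshot \
    --unit=tails-wrap-test-suite.service \
    --setenv=USE_LAST_RELEASE_AS_OLD_ISO=yes \
    /usr/local/bin/wrap_test_suite \
    https://iso-history.tails.boum.org \
    /tmp/TailsToaster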

Developers could use a similar technique when running the test suite locally, which should fix the problems with zombie processes that we have there.

2. Clean up at the beginning of any new job started on a Jenkins node:

  • Forcibly kill the TailsToaster libvirt guest if it’s still running and, for good measure, restart libvirtd.
  • Empty /tmp/TailsToaster. Currently we do some of this in post_test_cleanup but it would be more reliable to do it before the test suite starts: there are probably cases when a test suite run aborts in such a broken way that post_test_cleanup is not run.
  • Ensure tails-wrap-test-suite.service is stopped and all the processes that are part of it are killed.

We can do this as another Jenkins builder step.
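
A rough sketch of what such a builder step could run, reusing the names mentioned above (TailsToaster, tails-wrap-test-suite.service); this is illustrative only, not the actual production code:

#!/bin/sh
set -eu

# Forcibly stop the test suite VM if a previous run left it behind,
# then restart libvirtd for good measure.
virsh destroy TailsToaster || true
systemctl restart libvirtd.service

# Remove leftovers from the previous test suite run.
rm -rf /tmp/TailsToaster/*

# Ensure the transient test suite unit and all its processes are gone
# (add --user if the unit still runs under the per-user manager).
systemctl stop tails-wrap-test-suite.service || true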

My current understanding is that this should free all the temporary resources that a given test suite run creates outside of the scope managed by the transient unit created by systemd-run. But there might be more such resources that I forgot; I’m confident they can be freed in the same way.

To start with, we would do this only when we start a test suite run. But eventually, we should do this when we start any job, so that we can use all our nodes for any kind of job.

Note, in passing, that this is already the strategy we picked for build jobs when we started using Vagrant and libvirt there: we have pre-build cleanup steps (remove_libvirt_volumes and cleanup_build_jobs_leftovers), and it works pretty well.

3. Don’t reboot Jenkins nodes between test suite jobs.

At this point we should not need to reboot nodes anymore between test suite runs.

In terms of strategy, at first glance all this can be done incrementally:

  1. We can start with step (1) alone, which would already solve some problems for developers who run the test suite locally outside of Jenkins: we all struggle with lingering zombie processes.
  2. Then we can do step (2), which should be a no-op as long as we keep rebooting nodes between test suite runs.
  3. Once we’re confident (1) and (2) work well, we can do step (3).

I can (stress-)test all this on my local Jenkins before we consider deploying to production.

#39 Updated by intrigeri 2019-10-18 16:01:35

intrigeri wrote:
> I don’t think my time would be well spent transplanting all this work onto the current design (I don’t even know if that’s possible) so what I’ll do is:
>
> # Keep stress-testing this new design on my local Jenkins.
> # Identify whatever corner cases I’ve not thought of yet. Handle them.
> # After a while, propose it for merging. I will ask zen to review it: he now understands quite a bit about our CI setup.

So far so good locally for steps 1 & 2.

Corner cases that I’ve seen being correctly handled:

  • A test suite job is deleted (by our job generation logic) while it’s running → the node gets rebooted as expected.
  • A build is aborted before starting to execute builders, that is, in practice, while cloning the Git repo → the node gets rebooted as expected.

After 4.0 is out, whenever I find some time for it, I’ll prepare branches, submit this for review, and move my systemd-run idea to a dedicated ticket.

#40 Updated by intrigeri 2019-10-18 16:02:11

  • Target version changed from Tails_4.0 to Tails_4.1

#41 Updated by intrigeri 2019-11-09 10:14:50

intrigeri wrote:
> > # After a while, propose it for merging. I will ask zen to review it: he now understands quite a bit about our CI setup.

I’ve asked zen if he wants to spend 2h with me to review this.

#42 Updated by intrigeri 2019-11-09 10:41:51

  • Feature Branch set to jenkins-jobs:SponsorS-leftovers

#43 Updated by intrigeri 2019-11-09 10:45:45

  • Feature Branch changed from jenkins-jobs:SponsorS-leftovers to jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers

#44 Updated by intrigeri 2019-11-09 10:59:20

  • related to Bug #17216: Make the test suite clean up after itself even in most tricky failure modes added

#45 Updated by intrigeri 2019-11-09 11:00:12

intrigeri wrote:
> After 4.0 is out, whenever I find some time for it, I’ll […] move my systemd-run idea to a dedicated ticket.

Done: Bug #17216

#46 Updated by intrigeri 2019-11-09 11:01:11

  • Status changed from In Progress to Needs Validation

#47 Updated by intrigeri 2019-11-28 08:52:29

  • Target version changed from Tails_4.1 to Tails_4.2

intrigeri wrote:
> intrigeri wrote:
> > > # After a while, propose it for merging. I will ask zen to review it: he now understands quite a bit about our CI setup.
>
> I’ve asked zen if he wants to spend 2h with me to review this.

I’ll ping zen about this during the 4.2 cycle.

#48 Updated by intrigeri 2019-12-13 08:54:29

  • Assignee changed from intrigeri to zen

On 2019-12-26, I will help zen review my work.

#49 Updated by intrigeri 2019-12-26 13:44:34

  • Assignee changed from zen to intrigeri

#50 Updated by intrigeri 2019-12-26 13:45:08

  • Status changed from Needs Validation to In Progress

#51 Updated by intrigeri 2019-12-26 13:48:35

zen and I spent 1.5 hours together looking at the proposed code, at how it works in a real-world scenario on my local Jenkins, and reasoning about possible failure modes. zen gave me his go ahead to deploy to production, see how it goes, and revert if it breaks stuff.

But I should first:

  • document in wiki/src/contribute/working_together/roles/sysadmins/automated_tests_in_Jenkins.mdwn the deployment constraint about number of executors on the Jenkins master
  • document the big picture in wiki/src/contribute/working_together/roles/sysadmins/automated_tests_in_Jenkins.mdwn and point to the actual “code”, that has plenty of comments, for details
  • fix references to “wait” in the commit message
  • make it clear in the commit message that this implementation is not supposed to be racy in practice as long as the “sleep N” durations are big enough
  • make it clear somewhere (in the commit message?) that ideally, we would not sleep, but instead somehow check whether what we’re waiting for has happened; this might raise non-trivial issues about API access control and corresponding security matters (see the sketch after this list)
  • document for developers that the success/failure of the keep_node_busy_during_cleanup job does not matter
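
For that second-to-last point, here is a rough sketch of what “check instead of sleep” could look like; the job name, JENKINS_URL, and querying the /queue/api/json endpoint from a node are assumptions, and granting nodes that API access is exactly where the access control questions above come in:

# Hypothetical: poll the Jenkins build queue until the downstream job
# shows up, instead of sleeping for a fixed duration.
downstream_job="test_Tails_ISO_some-branch"   # hypothetical job name
for i in $(seq 1 30); do
    if curl --silent --fail "$JENKINS_URL/queue/api/json" \
       | grep --quiet "\"name\":\"$downstream_job\""; then
        break
    fi
    sleep 2
done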

#52 Updated by intrigeri 2019-12-26 14:54:30

  • Feature Branch changed from jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers to SponsorS-leftovers, jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers

#53 Updated by intrigeri 2019-12-26 17:42:53

  • Status changed from In Progress to Needs Validation

Deployed in production, let’s see how it goes. I’ve not merged the doc updates to tails.git’s master branch yet: I would like it to be super simple to revert the whole thing in the next couple days, if we notice the new setup behaves badly and I can’t trivially fix it.

#54 Updated by intrigeri 2019-12-27 08:52:15

OK, so I had to do this:

Let’s see if that’s enough!

#55 Updated by intrigeri 2019-12-28 18:51:14

  • Assignee changed from intrigeri to zen

I had to fix a few more things (see puppet-tails.git and jenkins-jobs.git) and it now seems to be working flawlessly under high load (most of the day, all isotesters were busy and there were test jobs waiting in the queue).

#56 Updated by intrigeri 2019-12-28 18:52:17

  • Feature Branch deleted (SponsorS-leftovers, jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers)

#57 Updated by CyrilBrulebois 2020-01-07 18:00:37

  • Target version changed from Tails_4.2 to Tails_4.3

#58 Updated by zen 2020-01-20 16:31:40

  • Assignee changed from zen to intrigeri

intrigeri wrote:
> I had to fix a few more things (see puppet-tails.git and jenkins-jobs.git) and it now seems to be working flawlessly under high load (most of the day, all isotesters were busy and there were test jobs waiting in the queue).

I’ve looked at the fixes, they all make sense and look good. Is there something missing here or can we mark as resolved?

#59 Updated by intrigeri 2020-01-25 11:54:40

  • Status changed from Needs Validation to Resolved

> I’ve looked at the fixes, they all make sense and look good.
> Is there something missing here or can we mark as resolved?

Thank you for the review! So I think we can close this.
FTR, I’ve not seen this problem happen around the 4.2.x release times, which is typically when it would happen.

See you on Bug #10601 :)