Bug #10601

isotesterN:s are sometimes put offline and never back online

Added by intrigeri 2015-11-23 00:42:50 . Updated 2020-05-06 04:28:52 .

Status:
Needs Validation
Priority:
Normal
Assignee:
zen
Category:
Continuous Integration
Target version:
Start date:
2015-11-23
Due date:
% Done:

10%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:
280

Description

I see that isotester2 and isotester4 are currently offline from Jenkins’ PoV. Their page (https://jenkins.tails.boum.org/computer/isotester2/) reads “Disconnected by anonymous : Taken_down_for_reboot”. They are both actually running fine.

For isotester2, the last message I see in the logs is:

Nov 21, 2015 6:09:58 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
WARNING: NioChannelHub keys=6 gen=72376715: Computer.threadPoolForRemoting [#1] for + isotester2 terminated
java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@2bbbdba1[name=isotester2]

… which seems to be normal when we take a slave down (I’ve compared with another isotesterN that was successfully taken offline then back on). This means that it was taken offline 1 day and 14 hours ago, which happens to be its uptime, so it did reboot just fine.

My conclusion is that our way of putting a node back online while it’s rebooting is not working reliably.

I’m curious how often this happens. I don’t remember seeing this recently, but it might be that someone else has been dealing with it manually.


Subtasks


Related issues

Related to Tails - Feature #9486: Support running multiple instances of the test suite in parallel Resolved 2015-06-25
Related to Tails - Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones Resolved 2016-03-31
Related to Tails - Bug #16959: Gather usability data about our current CI In Progress
Related to Tails - Bug #17216: Make the test suite clean up after itself even in most tricky failure modes Confirmed
Blocked by Tails - Bug #10068: Upgrade to Jenkins 2.x, using upstream packages Resolved 2018-01-08

History

#1 Updated by intrigeri 2015-11-23 00:45:56

I’ve manually put both nodes back online and jobs were immediately scheduled on them. We’ll see what happens next.

#2 Updated by intrigeri 2015-11-23 01:13:23

  • related to Feature #9486: Support running multiple instances of the test suite in parallel added

#3 Updated by intrigeri 2015-11-23 01:20:55

  • Subject changed from isotesterN sometimes don't come back up online after rebooting to isotesterN:s sometimes disappear
  • Status changed from Confirmed to In Progress
  • Priority changed from Normal to High
  • % Done changed from 0 to 10

I’ve looked a bit closer at what happened when isotester2 got offline and never back.

https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_feature-jessie/64/ was scheduled on isotester2 on Nov 21, 2015 6:09:47 PM and triggered https://jenkins.tails.boum.org/job/reboot_job/1412/, which was scheduled on Nov 21, 2015 6:09:49 PM and took 6.1 seconds. The “back online” HTTP request ended at 18:09:56. This is confirmed by the Jenkins log:

Nov 21, 2015 6:09:56 PM hudson.model.Run execute
INFO: reboot_job #1412 main build action completed: SUCCESS
Nov 21, 2015 6:09:56 PM hudson.model.Run execute
INFO: wrap_test_Tails_ISO_feature-jessie #64 main build action completed: SUCCESS

https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_experimental/56/ was scheduled on isotester2 on Nov 21, 2015 6:09:56 PM. It failed (https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_experimental/56/console) because isotester2 disappeared at 18:09:58, which is not surprising given we asked it to reboot at 6:09:49 PM.

So I think we have two problems, and the former causes the latter:

  1. our offline + reboot + online dance is racy enough (we already knew that in theory) to allow, in practice, another wrap_test_Tails_ISO_* job to be scheduled at the exact time when it should not be;
  2. when the 2nd wrapper job fails due to loss of connection, Jenkins marks the slave as offline and it never gets back up from Jenkins’ PoV.

If we fix the first problem, then the 2nd one can’t happen anymore and we’re good. Let’s do that.

The main problem is then: we put the slave back online at (shutdown request + 5 seconds), while we should only do that after the jenkins-slave service has been stopped. We have already discussed this at length, and we have already agreed on a plan for if/when this happens: Feature #9486#note-49. I’m thus going to raise the sleep time (as you can guess, I still feel the right way to order this deterministically is a simple systemd unit file, but I can live with the pragmatic approach for now).
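
To make the race window concrete, here is a rough Python sketch of that dance (purely illustrative: the URL, node name, credentials and the way reboot_job is triggered are placeholders, and CSRF crumbs are ignored). The fixed sleep between the “offline” and “online” requests is exactly the window another wrapper job can sneak into:

import time
import requests

JENKINS_URL = "https://jenkins.example.org"  # placeholder
NODE = "isotester2"                          # placeholder
AUTH = ("user", "api-token")                 # placeholder credentials

def toggle_offline(message=""):
    # /computer/<node>/toggleOffline is the stock Jenkins endpoint that flips
    # the node's "temporarily offline" flag one way or the other.
    requests.post(
        f"{JENKINS_URL}/computer/{NODE}/toggleOffline",
        params={"offlineMessage": message},
        auth=AUTH,
    ).raise_for_status()

toggle_offline("Taken_down_for_reboot")  # step 1: mark the node offline
# step 2: trigger reboot_job here (elided)
time.sleep(30)                           # step 3: hope the node went down by now...
toggle_offline()                         # step 4: flip it back -- racy if step 3 guessed wrong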

You said such problems would be “handled with top-priority”, so bumping priority.

#4 Updated by intrigeri 2015-11-23 01:25:09

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 10 to 20

intrigeri wrote:
> I’m thus going to raise the sleep time

Done, up to 30 seconds. While bertagaz is away I’ll monitor this situation.

#5 Updated by intrigeri 2015-11-26 04:26:00

  • Assignee changed from intrigeri to bertagaz

(This should ideally be tracked in a dedicated ticket, but it seems it might be strongly related to the changes I did to address this one, so I’m keeping the discussion here until we figure it out.)

So, I’ve not seen isotesters disappearing anymore, but now I’ve seen 2 different test jobs (https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-8415-overlayfs/16/ and https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-9713-electrum-2.5/7/) scheduled more or less at the same time on isotester1, and the latter is waiting in queue for isotester1… while isotester4 is idling.

Is this a known issue I wasn’t aware of?

Or is it happening because I bumped the sleep to 30 seconds? If that’s the case, then it looks like the reasoning behind preventing raciness with the infamous sleep isn’t quite right.

Debugging info:

And logs:

Nov 26, 2015 9:48:59 AM hudson.model.Run execute
INFO: build_Tails_ISO_feature-8415-overlayfs #23 main build action completed: SUCCESS
Nov 26, 2015 9:59:57 AM hudson.model.Run execute
INFO: build_Tails_ISO_bugfix-9713-electrum-2.5 #10 main build action completed: SUCCESS
Nov 26, 2015 10:42:18 AM hudson.model.Run execute
INFO: test_Tails_ISO_feature-5926-freezable-apt-repository #19 main build action completed: FAILURE

Completion of the last test job above frees isotester1, and then:

Nov 26, 2015 10:42:52 AM hudson.model.Run execute
INFO: reboot_job #1495 main build action completed: SUCCESS
Nov 26, 2015 10:42:52 AM hudson.model.Run execute
INFO: wrap_test_Tails_ISO_feature-8415-overlayfs #16 main build action completed: SUCCESS
Nov 26, 2015 10:43:22 AM hudson.model.Run execute
INFO: reboot_job #1496 main build action completed: SUCCESS
Nov 26, 2015 10:43:22 AM hudson.model.Run execute
INFO: wrap_test_Tails_ISO_bugfix-9713-electrum-2.5 #7 main build action completed: SUCCESS

I’ve tried to understand what happened. At 10:42:52, we told the Jenkins master that isotester1 was back online. So far, so good. But then, why was
https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_bugfix-9713-electrum-2.5/7/console run on it, instead of https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-8415-overlayfs/16/consoleFull (which should in theory already have been waiting in the queue for isotester1)? I believe that’s the main problem here: as a result, the overlayfs job waits for one more reboot and, more importantly, the electrum one will be blocked for hours, while it could have run immediately on a different slave. https://tails.boum.org/blueprint/automated_builds_and_tests/jenkins/ says that priorities are set in a way that avoids such a situation, so it seems this does not work as intended.

On https://wiki.jenkins-ci.org/display/JENKINS/Priority+Sorter+Plugin I read “Please note that lower number means higher priority”, but that’s about version 3.4 of the plugin while we’re running 1.3, and on our Jenkins job configuration page I read “A job with higher priority is ran before jobs with lower priorities”.

bertagaz, I’ll let you assess how much of a problem this is, and if it’s a serious one, I’ll let you think about how to fix it. If help is needed to write systemd unit files to replace calls to sleep, let me know.

#6 Updated by bertagaz 2015-12-08 02:05:57

I wonder whether this could be related to the OOM situations we’ve seen lately that caused some of our VMs to be shut down. Let’s see if this issue happens again.

#7 Updated by intrigeri 2015-12-08 05:47:09

> I wonder whether this could be related to the OOM situations we’ve seen lately that caused some of our VMs to be shut down.

I don’t see how it can possibly relate to the problem at hand: we’re talking about well-understood reboots here. I think it’ll be clear to you if you carefully re-read the ticket description and my analysis.

> Let’s see if this issue happens again.

Right.

Shall I file another ticket for the follow-up bug I reported and analyzed on Bug #10601#note-5, or do you plan to answer it here?

#8 Updated by bertagaz 2015-12-15 03:38:41

  • Target version changed from Tails_1.8 to Tails_2.0

Postponing

#9 Updated by intrigeri 2015-12-19 10:58:40

We’ll keep an eye on it and see if it happens again. Whenever it does, drop here any useful info that can help future debugging. It’s not expected that we’ll fully solve this for 2.0 (as the problem this ticket is about may actually be fixed already).

It would be good to see the other problem(s?) discovered on this ticket tracked appropriately elsewhere.

#10 Updated by intrigeri 2016-01-06 14:20:44

  • blocks Feature #5288: Run the test suite automatically on autobuilt ISOs added

#11 Updated by bertagaz 2016-01-27 10:49:48

  • Target version changed from Tails_2.0 to Tails_2.2

#12 Updated by intrigeri 2016-02-08 12:50:54

  • related to deleted (Feature #9486: Support running multiple instances of the test suite in parallel)

#13 Updated by intrigeri 2016-02-08 12:51:00

  • blocked by deleted (Feature #5288: Run the test suite automatically on autobuilt ISOs)

#14 Updated by intrigeri 2016-02-08 12:52:04

  • Deliverable for changed from 268 to 267

#15 Updated by intrigeri 2016-02-08 12:52:22

#16 Updated by intrigeri 2016-02-08 12:52:41

  • related to Feature #9486: Support running multiple instances of the test suite in parallel added

#17 Updated by bertagaz 2016-03-10 18:51:01

  • Target version changed from Tails_2.2 to Tails_2.3

#18 Updated by bertagaz 2016-03-31 15:13:57

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 20 to 60
  • QA Check set to Ready for QA

I did not look closely every day to check whether the problem described in this ticket’s title still happens, but I don’t remember having seen it lately. So I think we’re good on this one. I’ve created Bug #11295 to track the other problem you found, about test jobs getting their isotester stolen by other ones.

#19 Updated by intrigeri 2016-04-06 12:49:13

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 60 to 100
  • QA Check changed from Ready for QA to Pass

bertagaz wrote:
> I did not look closely every day to check whether the problem described in this ticket’s title still happens, but I don’t remember having seen it lately.

Neither have I! Closing, then.

#20 Updated by intrigeri 2018-04-08 19:27:34

  • Status changed from Resolved to In Progress
  • Assignee set to bertagaz
  • Target version deleted (Tails_2.3)
  • % Done changed from 100 to 10
  • QA Check deleted (Pass)
  • Deliverable for changed from 267 to SponsorS_Internal

I see this happen regularly these days: at least once every two weeks, I find such an isotester “taken down for reboot” for hours if not days. Every time, I switch it back online and it’s good to go again.

See my initial debugging (Bug #10601#note-5) and the related Bug #11295.

#21 Updated by intrigeri 2018-04-08 19:27:46

  • Priority changed from High to Normal

#22 Updated by intrigeri 2018-04-08 19:27:58

  • related to Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones added

#24 Updated by Anonymous 2019-03-14 13:22:43

  • blocked by Bug #10068: Upgrade to Jenkins 2.x, using upstream packages added

#25 Updated by Anonymous 2019-03-14 13:24:55

If I understand correctly, then upgrading the Priority Sorter plugin will magically fix that. bertagaz will probably have to do this upgrade for Bug #10068 → so “blocked by”.

#26 Updated by Anonymous 2019-03-15 10:19:31

  • Assignee changed from bertagaz to intrigeri
  • Target version set to Tails_3.16

Once bertagaz has done the Jenkins update (Bug #10068), intrigeri will check, roughly before releasing 3.16 (July-Aug 2019), whether this issue was indeed magically fixed by the update.

#27 Updated by intrigeri 2019-03-15 10:21:03

Same as Bug #11295#note-31.

#28 Updated by intrigeri 2019-03-17 08:50:42

FTR this problem happened twice in the last few days, which have been heavy on Jenkins (lots of activity on WIP branches before the freeze for 3.13, probably explained in great part by the student syndrome).

#29 Updated by intrigeri 2019-08-05 08:23:20

  • Target version changed from Tails_3.16 to Tails_3.17

u wrote:
> Once bertagaz has done the Jenkins update (Bug #10068), intrigeri will check, roughly before releasing 3.16 (July-Aug 2019), whether this issue was indeed magically fixed by the update.

bertagaz told me that he made progress on Bug #10068 but it is not done yet, so I’ll postpone this by a month too.

#30 Updated by intrigeri 2019-08-12 10:06:34

  • Subject changed from isotesterN:s sometimes disappear to isotesterN:s are sometimes put offline and never back online

#31 Updated by intrigeri 2019-09-01 13:29:38

  • related to Bug #16959: Gather usability data about our current CI added

#32 Updated by intrigeri 2019-09-12 14:25:05

  • Target version changed from Tails_3.17 to Tails_4.0

#33 Updated by intrigeri 2019-09-21 17:17:19

  • Status changed from In Progress to Needs Validation

Same as Bug #11295#note-35.

#34 Updated by intrigeri 2019-09-22 16:37:16

  • Type of work changed from Research to Sysadmin

#35 Updated by intrigeri 2019-09-25 07:27:11

  • Status changed from Needs Validation to In Progress

Same as Bug #11295#note-38.

#36 Updated by intrigeri 2019-10-18 16:02:17

  • Target version changed from Tails_4.0 to Tails_4.1

Same as Bug #11295#note-39.

#37 Updated by intrigeri 2019-11-09 10:42:02

  • Feature Branch set to jenkins-jobs:SponsorS-leftovers

#38 Updated by intrigeri 2019-11-09 10:45:14

  • Feature Branch changed from jenkins-jobs:SponsorS-leftovers to jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers

#39 Updated by intrigeri 2019-11-09 10:59:14

  • related to Bug #17216: Make the test suite clean up after itself even in most tricky failure modes added

#40 Updated by intrigeri 2019-11-09 11:01:38

  • Status changed from In Progress to Needs Validation

#41 Updated by intrigeri 2019-11-28 08:52:38

  • Target version changed from Tails_4.1 to Tails_4.2

#42 Updated by intrigeri 2019-12-13 08:54:17

  • Assignee changed from intrigeri to zen

On 2019-12-26, I will help zen review my work.

#43 Updated by intrigeri 2019-12-26 13:44:32

  • Assignee changed from zen to intrigeri

#44 Updated by intrigeri 2019-12-26 13:45:07

  • Status changed from Needs Validation to In Progress

#45 Updated by intrigeri 2019-12-26 14:54:36

  • Feature Branch changed from jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers to SponsorS-leftovers, jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers

#46 Updated by intrigeri 2019-12-26 17:42:57

  • Status changed from In Progress to Needs Validation

#47 Updated by intrigeri 2019-12-27 16:04:36

In passing, I’ve replaced the calls to toggleOffline (context-dependent, so inherently fragile and harder to reason about) with the enable_node and disable_node setters, that are nowadays available via the Jenkins API. This should avoid putting an isotester offline while we wanted to bring it back online (I’ve no idea if that problem happened in practice, but I prefer to not have to reason about it).
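
The idea, sketched with the python-jenkins bindings (URL, node name and credentials are placeholders; this is not the exact code that was deployed):

import jenkins  # python-jenkins

server = jenkins.Jenkins("https://jenkins.example.org",
                         username="user", password="api-token")

# Unlike raw toggleOffline calls, these express the desired end state, so
# repeating them is not supposed to flip the node the other way.
server.disable_node("isotester2", msg="Taken_down_for_reboot")
# ... node reboots here ...
server.enable_node("isotester2")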

#48 Updated by intrigeri 2019-12-28 18:51:50

  • Assignee changed from intrigeri to zen

(For the last changes I did in this area in puppet-tails.git and jenkins-jobs.git. Nothing major compared to the version we looked at together.)

#49 Updated by intrigeri 2019-12-28 18:52:12

  • Feature Branch deleted (SponsorS-leftovers, jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers)

#50 Updated by CyrilBrulebois 2020-01-07 18:00:36

  • Target version changed from Tails_4.2 to Tails_4.3

#51 Updated by intrigeri 2020-01-25 11:59:06

  • Status changed from Needs Validation to In Progress
  • Assignee changed from zen to intrigeri

intrigeri wrote:
> In passing, I’ve replaced the calls to toggleOffline (context-dependent, so inherently fragile and harder to reason about) with the enable_node and disable_node setters, that are nowadays available via the Jenkins API. This should avoid putting an isotester offline while we wanted to bring it back online (I’ve no idea if that problem happened in practice, but I prefer to not have to reason about it).

OK, this was very nice in theory, but since I made this change, on my local Jenkins I regularly see jenkins-slave.service put the node offline after a reboot (IIRC due to yet another race condition between enable_node and the rest of the startup code). I’ve never seen this happen on lizard, but still, I don’t think it’s good enough: the very goal of this ticket was to fix this class of bugs.
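
For illustration only, and not necessarily what will end up being deployed: one generic way to harden the “bring it back online” step against this class of race is to verify the node’s state after enabling it and retry, e.g. with python-jenkins:

import time
import jenkins

server = jenkins.Jenkins("https://jenkins.example.org",
                         username="user", password="api-token")

def enable_and_verify(node, attempts=5, delay=10):
    # "temporarilyOffline" is the flag that disable_node sets and enable_node
    # clears; if something races us and sets it again right after we cleared
    # it, simply try again a few times before giving up loudly.
    for _ in range(attempts):
        server.enable_node(node)
        time.sleep(delay)
        if not server.get_node_info(node)["temporarilyOffline"]:
            return
    raise RuntimeError(f"{node} is still marked as temporarily offline")

enable_and_verify("isotester2")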

#52 Updated by intrigeri 2020-02-07 08:46:17

  • Target version changed from Tails_4.3 to Tails_4.4

#53 Updated by intrigeri 2020-02-16 12:21:33

  • Status changed from In Progress to Needs Validation

intrigeri wrote:
> intrigeri wrote:
> > In passing, I’ve replaced the calls to toggleOffline (context-dependent, so inherently fragile and harder to reason about) with the enable_node and disable_node setters, that are nowadays available via the Jenkins API. This should avoid putting an isotester offline while we wanted to bring it back online (I’ve no idea if that problem happened in practice, but I prefer to not have to reason about it).
>
> OK, this was very nice in theory, but since I made this change, on my local Jenkins I regularly see jenkins-slave.service put the node offline after a reboot (IIRC due to yet another race condition between enable_node and the rest of the startup code). I’ve never seen this happen on lizard, but still, I don’t think it’s good enough: the very goal of this ticket was to fix this class of bugs.

I hope I’ve fixed this with https://git.tails.boum.org/puppet-tails/commit/?id=46ad3c627b16c54342bf6a380116a4c223e1450e. At least I can’t easily reproduce the bug on my local Jenkins, where it happened pretty often before. Time will tell.

#54 Updated by intrigeri 2020-02-16 13:04:38

Note to myself: if this works, I’ll want to move the 2 Python one-liners from jenkins-slave.service to an external script that’ll be vastly more readable and easier to work with.
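
For illustration, such an external script could look roughly like this (the config path, key names and the way the node name is derived are all hypothetical); the unit file would then simply call the script instead of embedding the one-liners:

#!/usr/bin/env python3
# Hypothetical standalone replacement for the in-unit one-liners: mark this
# node as enabled on the Jenkins master, reading URL and credentials from a
# config file instead of hard-coding them in the unit file.
import configparser
import socket
import sys

import jenkins  # python-jenkins

config = configparser.ConfigParser()
config.read("/etc/jenkins/slave-api.ini")  # hypothetical config path
api = config["jenkins"]

server = jenkins.Jenkins(api["url"],
                         username=api["username"],
                         password=api["api_token"])

# Default to this machine's hostname (assuming it matches the Jenkins node name).
node = sys.argv[1] if len(sys.argv) > 1 else socket.gethostname()
server.enable_node(node)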

#55 Updated by intrigeri 2020-02-17 09:38:18

  • Assignee changed from intrigeri to zen

Hi @zen,

intrigeri wrote:
> Note to myself: if this works, I’ll want to move the 2 Python one-liners from jenkins-slave.service to an external script that’ll be vastly more readable and easier to work with.

Done.

I’ve exercised this updated implementation quite heavily on my local Jenkins (where I saw the problem super often) and a bit on lizard (where I had previously seen the bug only once), and could not reproduce it, so I think it’s a good time to do a review.

Obviously, I’ll keep an eye on these 2 Jenkins setups and will follow up if I see the problem come back. Time will tell!

#56 Updated by CyrilBrulebois 2020-03-12 09:55:40

  • Target version changed from Tails_4.4 to Tails_4.5

#57 Updated by intrigeri 2020-03-19 08:49:51

I’ve not seen the problem happen again on either of these 2 Jenkins setups since I deployed the new implementation 1 month ago! :)

#58 Updated by CyrilBrulebois 2020-04-07 17:05:08

  • Target version changed from Tails_4.5 to Tails_4.6

#59 Updated by CyrilBrulebois 2020-05-06 04:28:52

  • Target version changed from Tails_4.6 to Tails_4.7