Bug #11858

Monitor if isobuilders systems are running fine

Added by bertagaz 2016-10-03 08:41:14. Updated 2019-08-18 19:24:01.

Status: Resolved
Priority: Normal
Assignee: groente
Category: Infrastructure
Target version:
Start date: 2016-10-03
Due date:
% Done: 100%
Feature Branch: puppet-tails:feature/11858-monitor-systemd
Type of work: Sysadmin
Blueprint:
Starter: 1
Affected tool:
Deliverable for:

Description

We have had periods where our isobuilders gradually all went down because a branch triggered an out-of-memory (OOM) condition during its build.

We should use our monitoring system to check, via systemd and/or any other means, whether the isobuilder systems are running fine, so that we know when we have to restart them or their jenkins-slave service.


Subtasks


Related issues

Related to Tails - Bug #11632: ISO builds from branch that need more RAM can break all our Jenkins isobuilders without us being notified Resolved 2016-08-11
Related to Tails - Bug #12009: Jenkins ISO builders are highly unreliable Resolved 2016-12-01
Related to Tails - Bug #13582: Monitoring bridge Duplicate 2017-08-04
Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) Confirmed 2017-06-29
Blocked by Tails - Bug #8508: jenkins-slave service sometimes fails to start correctly on boot Resolved 2015-01-01

History

#1 Updated by intrigeri 2016-10-03 08:53:58

  • Assignee set to bertagaz

(Assuming that’s what you meant given you’ve set a target version.)

#2 Updated by intrigeri 2016-10-03 09:01:11

  • related to Bug #11632: ISO builds from branch that need more RAM can break all our Jenkins isobuilders without us being notified added

#3 Updated by bertagaz 2016-11-08 20:23:59

  • Target version changed from Tails_2.7 to Tails_2.9.1

#4 Updated by intrigeri 2016-12-01 12:37:30

  • related to Bug #12009: Jenkins ISO builders are highly unreliable added

#5 Updated by anonym 2016-12-14 20:11:27

  • Target version changed from Tails_2.9.1 to Tails 2.10

#6 Updated by anonym 2017-01-24 20:48:52

  • Target version changed from Tails 2.10 to Tails_2.11

#7 Updated by bertagaz 2017-03-08 10:38:06

  • Target version changed from Tails_2.11 to Tails_2.12

#8 Updated by bertagaz 2017-03-08 11:09:21

  • Target version changed from Tails_2.12 to Tails_3.0

#9 Updated by bertagaz 2017-04-06 14:29:28

  • Target version changed from Tails_3.0 to Tails_3.1

#10 Updated by bertagaz 2017-05-21 16:37:36

  • Target version changed from Tails_3.1 to Tails_3.2

#11 Updated by intrigeri 2017-06-29 10:16:53

  • blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added

#12 Updated by groente 2017-08-06 11:32:11

  • Starter set to Yes

A simple check whether

systemctl --quiet is-failed \*

returns 0 (in which case at least one unit is in the failed state) should do the trick, both for the isobuilders and for Bug #13582.
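
The check suggested above could be wrapped as a monitoring plugin roughly as follows. This is only a sketch, not the check that was eventually deployed: the function names and messages are hypothetical, and the exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN). Note the inverted logic: `systemctl is-failed` exits with 0 when at least one matching unit has failed.

```python
import subprocess

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def interpret_is_failed(returncode):
    """Map the exit status of `systemctl --quiet is-failed '*'` to a plugin result.

    systemctl exits with 0 when at least one matching unit is in the failed
    state, so a zero exit status is the bad case here.
    """
    if returncode == 0:
        return CRITICAL, "at least one unit is in the failed state"
    return OK, "no failed units"


def check_failed_units(run=subprocess.run):
    """Run the check; `run` is injectable (hypothetical hook) for testing."""
    try:
        # No shell is involved, so the glob is passed to systemctl unescaped.
        proc = run(["systemctl", "--quiet", "is-failed", "*"])
    except OSError as exc:
        return UNKNOWN, f"could not run systemctl: {exc}"
    return interpret_is_failed(proc.returncode)
```

A plugin script would call `check_failed_units()`, print the message, and exit with the returned code.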

#13 Updated by groente 2017-08-06 11:32:47

#14 Updated by intrigeri 2017-08-06 11:54:03

systemctl is-system-running might do exactly what we want.
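
For illustration, here is a minimal sketch of how the output of `systemctl is-system-running` might be turned into a monitoring status. The state names are the ones systemd documents for this command; the mapping to plugin severities (for example treating `degraded` as CRITICAL) is an assumption, not something decided on this ticket, and the function names are hypothetical.

```python
import subprocess

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed severity for each documented system state; anything else is UNKNOWN.
STATE_TO_STATUS = {
    "running": OK,
    "initializing": WARNING,
    "starting": WARNING,
    "stopping": WARNING,
    "degraded": CRITICAL,     # one or more units have failed
    "maintenance": CRITICAL,  # rescue or emergency target is active
}


def classify_system_state(state):
    """Return the plugin status for a state string printed by systemctl."""
    return STATE_TO_STATUS.get(state.strip(), UNKNOWN)


def check_system_running(run=subprocess.run):
    """Query systemd; `run` is injectable (hypothetical hook) for testing."""
    try:
        proc = run(["systemctl", "is-system-running"],
                   capture_output=True, text=True)
    except OSError as exc:
        return UNKNOWN, f"could not run systemctl: {exc}"
    state = proc.stdout.strip()
    return classify_system_state(state), f"system state: {state or 'no output'}"
```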

#15 Updated by bertagaz 2017-09-07 13:02:35

  • Target version changed from Tails_3.2 to Tails_3.3

#16 Updated by bertagaz 2017-09-07 13:33:52

  • blocked by deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))

#17 Updated by bertagaz 2017-09-07 13:34:08

  • blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added

#18 Updated by bertagaz 2017-10-19 10:15:59

pynagsystemd sounds like a good candidate. I’ll give a try to this one.

#19 Updated by bertagaz 2017-10-21 10:06:14

  • Status changed from Confirmed to In Progress
  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 0 to 50
  • QA Check set to Ready for QA
  • Feature Branch set to puppet-tails:feature/11858-monitor-systemd

bertagaz wrote:
> pynagsystemd sounds like a good candidate. I’ll give a try to this one.

I’ve committed everything to the dedicated branch, merged it into master and deployed it. We now have a systemd check on all agents, as discussed in Bug #13582. To test it, find a check that will run soon and make one service fail on the corresponding host (e.g. by misconfiguring it and restarting it so that it fails to start). You should then see an alert in icinga2 about this service failing.

#20 Updated by intrigeri 2017-10-22 05:53:50

  • Assignee changed from intrigeri to groente

(As per “Shifts for 2018Q1 + intrigeri’s involvement in the sysadmin team”.)

#21 Updated by intrigeri 2017-11-10 15:13:16

As reported by groente today, apparently this does not work for the jenkins-slave service, which is precisely the one that made us create this ticket in the first place.

#22 Updated by anonym 2017-11-15 11:30:51

  • Target version changed from Tails_3.3 to Tails_3.5

#23 Updated by intrigeri 2017-11-28 09:41:55

  • Assignee changed from groente to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

intrigeri wrote:
> As reported by groente today, apparently this does not work for the jenkins-slave service, which is precisely the one that made us create this ticket in the first place.

Reproduced again: isotester4 was offline in Jenkins for ~1.5 days but the jenkins-slave service was seen as successfully started by systemd. jenkins-slave.log said Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar. So I guess this ticket shall be blocked by a new one about making the jenkins-slave service able to report its state reliably.
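
To illustrate why a systemd-level check alone missed this case, here is a sketch of a check that looks at the jar itself instead of the unit state. It is not the fix adopted in Bug #8508 (which is not described on this ticket); `jar_looks_valid` is a hypothetical helper that relies only on the fact that a jar file is a zip archive.

```python
import zipfile


def jar_looks_valid(path):
    """Return True if `path` is a readable zip archive whose members pass CRC checks.

    A missing, truncated, or otherwise corrupt file yields False, which is the
    situation behind an "Invalid or corrupt jarfile" error.
    """
    try:
        with zipfile.ZipFile(path) as jar:
            # testzip() returns the name of the first corrupt member, or None.
            return jar.testzip() is None
    except (OSError, zipfile.BadZipFile):
        return False
```

For example, `jar_looks_valid("/var/run/jenkins/slave.jar")` could be evaluated before the service is reported as healthy.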

#24 Updated by anonym 2018-01-23 19:52:37

  • Target version changed from Tails_3.5 to Tails_3.6

#25 Updated by bertagaz 2018-03-14 11:32:12

  • Target version changed from Tails_3.6 to Tails_3.7

#26 Updated by bertagaz 2018-05-10 11:09:17

  • Target version changed from Tails_3.7 to Tails_3.8

#27 Updated by intrigeri 2018-06-26 16:27:54

  • Target version changed from Tails_3.8 to Tails_3.9

#28 Updated by intrigeri 2018-09-05 16:26:54

  • Target version changed from Tails_3.9 to Tails_3.10.1

#29 Updated by intrigeri 2018-10-24 17:03:38

  • Target version changed from Tails_3.10.1 to Tails_3.11

#30 Updated by CyrilBrulebois 2018-12-16 13:54:19

  • Target version changed from Tails_3.11 to Tails_3.12

#31 Updated by anonym 2019-01-30 11:59:15

  • Target version changed from Tails_3.12 to Tails_3.13

#32 Updated by CyrilBrulebois 2019-03-20 14:35:09

  • Target version changed from Tails_3.13 to Tails_3.14

#33 Updated by intrigeri 2019-03-20 15:52:16

  • related to Bug #8508: jenkins-slave service sometimes fails to start correctly on boot added

#34 Updated by intrigeri 2019-03-20 15:53:17

intrigeri wrote:
> Reproduced again: isotester4 was offline in Jenkins for ~1.5 days but the jenkins-slave service was seen as successfully started by systemd. jenkins-slave.log said Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar. So I guess this ticket shall be blocked by a new one about making the jenkins-slave service able to report its state reliably.

We already have it: Bug #8508.

#35 Updated by intrigeri 2019-03-20 15:54:57

  • related to deleted (Bug #8508: jenkins-slave service sometimes fails to start correctly on boot)

#36 Updated by intrigeri 2019-03-20 15:54:59

  • blocked by Bug #8508: jenkins-slave service sometimes fails to start correctly on boot added

#37 Updated by intrigeri 2019-03-27 11:23:09

  • QA Check changed from Dev Needed to Ready for QA

Fixed, see Bug #8508. Stopping jenkins-slave.service made https://icingaweb2.tails.boum.org/monitoring/host/services?host=isobuilder3.lizard#!/monitoring/service/show?host=isobuilder3.lizard&service=systemd%40isobuilder3.lizard correctly detect the problem, which was the goal here. (And Puppet will start it again next time it runs on the affected node so we also have auto-recovery now.)

@bertagaz, please review.

#38 Updated by CyrilBrulebois 2019-05-23 21:23:21

  • Target version changed from Tails_3.14 to Tails_3.15

#39 Updated by intrigeri 2019-06-02 14:42:54

  • Status changed from In Progress to Needs Validation

#40 Updated by CyrilBrulebois 2019-07-10 10:33:58

  • Target version changed from Tails_3.15 to Tails_3.16

#41 Updated by groente 2019-08-01 09:31:13

  • Assignee changed from bertagaz to Sysadmins

#42 Updated by intrigeri 2019-08-01 20:45:41

To review the code: it’s the same as Bug #8508 (Bug #8508#note-12) so better review both together.

To test behavior: systemctl stop jenkins-slave on an idle isobuilder and verify that our monitoring notices there’s a problem.
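
While running this test, the unit state on the isobuilder can be confirmed independently of the monitoring with something like the sketch below. It only wraps `systemctl is-active`, which exits with 0 when the unit is active; the helper name is hypothetical and this is not how the deployed Icinga check works.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True only if `systemctl is-active` reports `unit` as active.

    `run` is injectable (hypothetical hook) so the logic can be tested
    without systemd.
    """
    proc = run(["systemctl", "is-active", "--quiet", unit])
    return proc.returncode == 0
```

After `systemctl stop jenkins-slave`, `unit_is_active("jenkins-slave.service")` is expected to return False, and the monitoring should raise an alert on its next run.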

#43 Updated by intrigeri 2019-08-01 20:48:33

  • Target version deleted (Tails_3.16)

(A review would be very nice to have, but this was deployed in production 4 months ago, so at this point, it does not make a big difference whether we review this now or in 2 months anymore.)

#44 Updated by groente 2019-08-18 19:24:01

  • Status changed from Needs Validation to Resolved
  • Assignee changed from Sysadmins to groente
  • % Done changed from 50 to 100

intrigeri wrote:
> (A review would be very nice to have, but this was deployed in production 4 months ago, so at this point, it does not make a big difference whether we review this now or in 2 months anymore.)

Indeed, icinga starts to complain when jenkins-slave is down. I’ve reviewed this together with Bug #8508 and am calling this done.