Bug #11858
Monitor if isobuilders systems are running fine
100%
Description
We have had periods where our isobuilders gradually all went down because a branch triggered an OOM condition during its build.
We should use our monitoring system to check, via systemd and/or any other means, whether the isobuilder systems are running fine, so that we know when we have to restart them or their jenkins-slave service.
Subtasks
Related issues
Related to Tails - |
Resolved | 2016-08-11 | |
Related to Tails - |
Resolved | 2016-12-01 | |
Related to Tails - |
Duplicate | 2017-08-04 | |
Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) | Confirmed | 2017-06-29 | |
Blocked by Tails - |
Resolved | 2015-01-01 |
History
#1 Updated by intrigeri 2016-10-03 08:53:58
- Assignee set to bertagaz
(Assuming that’s what you meant given you’ve set a target version.)
#2 Updated by intrigeri 2016-10-03 09:01:11
- related to Bug #11632: ISO builds from branch that need more RAM can break all our Jenkins isobuilders without us being notified added
#3 Updated by bertagaz 2016-11-08 20:23:59
- Target version changed from Tails_2.7 to Tails_2.9.1
#4 Updated by intrigeri 2016-12-01 12:37:30
- related to Bug #12009: Jenkins ISO builders are highly unreliable added
#5 Updated by anonym 2016-12-14 20:11:27
- Target version changed from Tails_2.9.1 to Tails 2.10
#6 Updated by anonym 2017-01-24 20:48:52
- Target version changed from Tails 2.10 to Tails_2.11
#7 Updated by bertagaz 2017-03-08 10:38:06
- Target version changed from Tails_2.11 to Tails_2.12
#8 Updated by bertagaz 2017-03-08 11:09:21
- Target version changed from Tails_2.12 to Tails_3.0
#9 Updated by bertagaz 2017-04-06 14:29:28
- Target version changed from Tails_3.0 to Tails_3.1
#10 Updated by bertagaz 2017-05-21 16:37:36
- Target version changed from Tails_3.1 to Tails_3.2
#11 Updated by intrigeri 2017-06-29 10:16:53
- blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added
#12 Updated by groente 2017-08-06 11:32:11
- Starter set to Yes
A simple check whether
systemctl --quiet is-failed \*
returns 0 (in which case something is wrong) should do the trick, both for the isobuilders and for Bug #13582.
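The check suggested above could be wrapped as a Nagios/Icinga-style plugin along these lines. This is a minimal sketch, not the check that was eventually deployed: the function name `check_failed_units` and the `SYSTEMCTL` override variable are hypothetical (the override only exists so the function can be exercised on a machine without systemd), and exit codes 0 and 2 are the usual plugin conventions for OK and CRITICAL.

```sh
#!/bin/sh
# Sketch: turn "systemctl --quiet is-failed '*'" into a plugin-style result.
# SYSTEMCTL is a hypothetical override; it defaults to the real systemctl.
check_failed_units() {
    # is-failed exits 0 when at least one matching unit is in the "failed" state.
    if "${SYSTEMCTL:-systemctl}" --quiet is-failed '*'; then
        echo "CRITICAL - at least one systemd unit is in the failed state"
        return 2
    fi
    # Caveat: a non-zero exit can also mean systemctl itself could not run,
    # so a production check would want to tell those cases apart.
    echo "OK - no failed systemd units"
    return 0
}
```

Calling `check_failed_units` on a host prints a single status line and returns the matching exit code, which is the shape an Icinga command check expects.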
#13 Updated by groente 2017-08-06 11:32:47
- related to Bug #13582: Monitoring bridge added
#14 Updated by intrigeri 2017-08-06 11:54:03
systemctl is-system-running
might do exactly what we want.
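To illustrate how that command's output could feed a monitoring check, here is a rough sketch that maps the state string printed by systemctl is-system-running to plugin exit codes. The function name `system_state_to_nagios` is hypothetical, and the severity assigned to each state is an assumption made for illustration, not something decided on this ticket.

```sh
#!/bin/sh
# Sketch: map the state printed by "systemctl is-system-running"
# to plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
# The severity chosen for each state is an assumption.
system_state_to_nagios() {
    case "$1" in
        running)
            echo "OK - system state is $1"; return 0 ;;
        initializing|starting|stopping)
            echo "WARNING - system state is $1"; return 1 ;;
        degraded|maintenance)
            echo "CRITICAL - system state is $1"; return 2 ;;
        *)
            echo "UNKNOWN - unexpected system state: $1"; return 3 ;;
    esac
}

# Usage on a real host:
#   system_state_to_nagios "$(systemctl is-system-running)"
```

A "degraded" state means at least one unit has failed, so this covers the same ground as the is-failed check while also reporting hosts that are still booting or shutting down.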
#15 Updated by bertagaz 2017-09-07 13:02:35
- Target version changed from Tails_3.2 to Tails_3.3
#16 Updated by bertagaz 2017-09-07 13:33:52
- blocked by deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))
#17 Updated by bertagaz 2017-09-07 13:34:08
- blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added
#18 Updated by bertagaz 2017-10-19 10:15:59
pynagsystemd sounds like a good candidate. I’ll give a try to this one.
#19 Updated by bertagaz 2017-10-21 10:06:14
- Status changed from Confirmed to In Progress
- Assignee changed from bertagaz to intrigeri
- % Done changed from 0 to 50
- QA Check set to Ready for QA
- Feature Branch set to puppet-tails:feature/11858-monitor-systemd
bertagaz wrote:
> pynagsystemd sounds like a good candidate. I’ll give a try to this one.
I’ve committed everything in the dedicated branch, merged it into master, and deployed it. We now have a systemd check on all agents, as we discussed in Bug #13582. To test it, find a check that will run soon and make one service fail on the related host (e.g. by misconfiguring it and restarting it so that it fails to start). You will then see an alert in icinga2 about this service failing.
#20 Updated by intrigeri 2017-10-22 05:53:50
- Assignee changed from intrigeri to groente
(As per “Shifts for 2018Q1 + intrigeri’s involvement in the sysadmin team”.)
#21 Updated by intrigeri 2017-11-10 15:13:16
As reported by groente today, apparently this does not work for the jenkins-slave service, which is precisely the one that made us create this ticket in the first place.
#22 Updated by anonym 2017-11-15 11:30:51
- Target version changed from Tails_3.3 to Tails_3.5
#23 Updated by intrigeri 2017-11-28 09:41:55
- Assignee changed from groente to bertagaz
- QA Check changed from Ready for QA to Dev Needed
intrigeri wrote:
> As reported by groente today, apparently this does not work for the jenkins-slave service, which is precisely the one that made us create this ticket in the first place.
Reproduced again: isotester4 was offline in Jenkins for ~1.5 days but the jenkins-slave service was seen as successfully started by systemd. jenkins-slave.log said Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar. So I guess this ticket shall be blocked by a new one about making the jenkins-slave service able to report its state reliably.
#24 Updated by anonym 2018-01-23 19:52:37
- Target version changed from Tails_3.5 to Tails_3.6
#25 Updated by bertagaz 2018-03-14 11:32:12
- Target version changed from Tails_3.6 to Tails_3.7
#26 Updated by bertagaz 2018-05-10 11:09:17
- Target version changed from Tails_3.7 to Tails_3.8
#27 Updated by intrigeri 2018-06-26 16:27:54
- Target version changed from Tails_3.8 to Tails_3.9
#28 Updated by intrigeri 2018-09-05 16:26:54
- Target version changed from Tails_3.9 to Tails_3.10.1
#29 Updated by intrigeri 2018-10-24 17:03:38
- Target version changed from Tails_3.10.1 to Tails_3.11
#30 Updated by CyrilBrulebois 2018-12-16 13:54:19
- Target version changed from Tails_3.11 to Tails_3.12
#31 Updated by anonym 2019-01-30 11:59:15
- Target version changed from Tails_3.12 to Tails_3.13
#32 Updated by CyrilBrulebois 2019-03-20 14:35:09
- Target version changed from Tails_3.13 to Tails_3.14
#33 Updated by intrigeri 2019-03-20 15:52:16
- related to Bug #8508: jenkins-slave service sometimes fails to start correctly on boot added
#34 Updated by intrigeri 2019-03-20 15:53:17
intrigeri wrote:
> Reproduced again: isotester4 was offline in Jenkins for ~1.5 days but the jenkins-slave service was seen as successfully started by systemd. jenkins-slave.log said Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar. So I guess this ticket shall be blocked by a new one about making the jenkins-slave service able to report its state reliably.
We already have it: Bug #8508.
#35 Updated by intrigeri 2019-03-20 15:54:57
- related to deleted (Bug #8508: jenkins-slave service sometimes fails to start correctly on boot)
#36 Updated by intrigeri 2019-03-20 15:54:59
- blocked by Bug #8508: jenkins-slave service sometimes fails to start correctly on boot added
#37 Updated by intrigeri 2019-03-27 11:23:09
- QA Check changed from Dev Needed to Ready for QA
Fixed, see Bug #8508. Stopping jenkins-slave.service made https://icingaweb2.tails.boum.org/monitoring/host/services?host=isobuilder3.lizard#!/monitoring/service/show?host=isobuilder3.lizard&service=systemd%40isobuilder3.lizard correctly detect the problem, which was the goal here. (And Puppet will start it again next time it runs on the affected node so we also have auto-recovery now.)
@bertagaz, please review.
#38 Updated by CyrilBrulebois 2019-05-23 21:23:21
- Target version changed from Tails_3.14 to Tails_3.15
#39 Updated by intrigeri 2019-06-02 14:42:54
- Status changed from In Progress to Needs Validation
#40 Updated by CyrilBrulebois 2019-07-10 10:33:58
- Target version changed from Tails_3.15 to Tails_3.16
#41 Updated by groente 2019-08-01 09:31:13
- Assignee changed from bertagaz to Sysadmins
#42 Updated by intrigeri 2019-08-01 20:45:41
To review the code: it’s the same as Bug #8508 (Bug #8508#note-12) so better review both together.
To test behavior: run systemctl stop jenkins-slave on an idle isobuilder and verify that our monitoring notices there’s a problem.
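Before waiting for the next monitoring run, the stopped state can be confirmed locally with something like the sketch below. This is not the check our monitoring runs; the function name `check_unit_active` and the `SYSTEMCTL` override variable are hypothetical, and the override only exists so the function can be tried without systemd.

```sh
#!/bin/sh
# Sketch: report whether a single unit is active, plugin style.
# SYSTEMCTL is a hypothetical override; it defaults to the real systemctl.
check_unit_active() {
    unit="$1"
    # is-active exits 0 only when the unit is active.
    if "${SYSTEMCTL:-systemctl}" --quiet is-active "$unit"; then
        echo "OK - $unit is active"
        return 0
    fi
    echo "CRITICAL - $unit is not active"
    return 2
}

# Usage on an idle isobuilder, after stopping the service:
#   check_unit_active jenkins-slave
```

After the service has been stopped, this should report CRITICAL locally, and the alert in the monitoring web interface should follow once the scheduled check has run.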
#43 Updated by intrigeri 2019-08-01 20:48:33
- Target version deleted (Tails_3.16)
(A review would be very nice to have, but this was deployed in production 4 months ago, so at this point it no longer makes a big difference whether we review this now or in 2 months.)
#44 Updated by groente 2019-08-18 19:24:01
- Status changed from Needs Validation to Resolved
- Assignee changed from Sysadmins to groente
- % Done changed from 50 to 100
intrigeri wrote:
> (A review would be very nice to have, but this was deployed in production 4 months ago, so at this point it no longer makes a big difference whether we review this now or in 2 months.)
Indeed, Icinga starts to complain when jenkins-slave is down. I’ve reviewed this together with Bug #8508 and am calling this done.