Bug #8508
jenkins-slave service sometimes fails to start correctly on boot
% Done: 100%
Description
Every time lizard is booted, we have to manually start the jenkins-slave daemon on the isobuilders. I don’t remember that being necessary in the past.
It would be so nice if they started on their own!
Subtasks
Related issues
Blocks Tails - Bug #11858: Monitor if isobuilders systems are running fine | Resolved | 2016-10-03
History
#1 Updated by intrigeri 2015-09-25 03:29:51
I think we have a race condition that leads to Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar, after which the slave is offline. The service is still marked as started, though (thanks, “good old initscript”). It might be a race condition between the startup of the slave and the startup of the master, or a totally different problem.
#2 Updated by intrigeri 2019-03-15 09:48:00
This is a rare problem nowadays, but I see it happen at every boot on ant01: an empty slave.jar is saved. Given how non-robust /usr/share/jenkins/bin/download-slave.sh is, I’m not very surprised. I am testing changes locally that should at least mark the service as failed when this happens; then systemd should restart the service until it starts successfully. But of course it would be better to properly order the startup of jenkins-slave.service so it doesn’t start before everything it needs is ready. Given ant01 is using systemd-networkd, I’ll try something based on After=systemd-networkd-wait-online.service.
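A drop-in along these lines could express that ordering — this is only a sketch, and the drop-in path and file name are assumptions, not the actual ant01 configuration:

```ini
# /etc/systemd/system/jenkins-slave.service.d/wait-online.conf (sketch)
[Unit]
# Don't start the slave before systemd-networkd-wait-online reports
# that the network is up, so download-slave.sh can reach the master.
Wants=systemd-networkd-wait-online.service
After=systemd-networkd-wait-online.service network-online.target
```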
#3 Updated by intrigeri 2019-03-19 07:13:10
- Subject changed from Investigate why jenkins-slaves don't start on boot to jenkins-slave service sometimes fails to start correctly on boot
intrigeri wrote:
> This is a rare problem nowadays, but I see it happen at every boot on ant01: an empty slave.jar is saved. Given how non-robust /usr/share/jenkins/bin/download-slave.sh is, I’m not very surprised. I am testing changes locally that should at least mark the service as failed when this happens; then systemd should restart the service until it starts successfully. But of course it would be better to properly order the startup of jenkins-slave.service so it doesn’t start before everything it needs is ready. Given ant01 is using systemd-networkd, I’ll try something based on After=systemd-networkd-wait-online.service.
I confirm I now have a robust solution on ant01, based on systemd-networkd and systemd-networkd-wait-online.service. For other Jenkins workers, we could either migrate them to systemd-networkd or use something like ifupdown-wait-online.service (in Buster’s ifupdown, not in Stretch).
#4 Updated by intrigeri 2019-03-19 07:13:22
- Category changed from Infrastructure to Continuous Integration
#5 Updated by intrigeri 2019-03-19 07:20:03
Actually, a nicer solution would be to make /usr/share/jenkins/bin/download-slave.sh (used by jenkins-slave.service to download slave.jar) retry the download until it succeeds. This way, there would be no need to block the whole boot on *-wait-online.service.
#6 Updated by intrigeri 2019-03-20 15:52:16
- Related to Bug #11858: Monitor if isobuilders systems are running fine added
#7 Updated by intrigeri 2019-03-20 15:54:57
- Related to deleted (Bug #11858: Monitor if isobuilders systems are running fine)
#8 Updated by intrigeri 2019-03-20 15:54:59
- Blocks Bug #11858: Monitor if isobuilders systems are running fine added
#9 Updated by intrigeri 2019-03-20 15:57:19
Here’s what I have now:
#!/bin/sh

set -eu

SLAVE_JAR=/var/run/jenkins/slave.jar
JENKINS_URL=${1:-}

if [ -z "$JENKINS_URL" ]
then
    echo "URL of jenkins server must be provided" >&2
    exit 1
fi

# Retrieve the slave JAR from the master server.
echo "Downloading slave.jar from ${JENKINS_URL}..."
wget -q -O "${SLAVE_JAR}.new" "${JENKINS_URL}/jnlpJars/slave.jar"

# Check that slave.jar was actually downloaded and is not empty.
if [ -f "${SLAVE_JAR}.new" ] && [ -s "${SLAVE_JAR}.new" ]
then
    mv "${SLAVE_JAR}.new" "${SLAVE_JAR}"
    exit 0
else
    exit 1
fi
Add a loop to retry the download until it succeeds ⇒ that should fix this ticket and allow us to reject Bug #11858, because we already have a monitoring check that tells us if services didn’t start properly.
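As a sketch, that retry loop could be factored into a small helper — the names below are hypothetical, and the actual change in puppet-tails may look different:

```shell
#!/bin/sh
# Hypothetical sketch: a generic retry helper that re-runs a command
# until it succeeds, sleeping between attempts.
retry() {
    delay=$1
    shift
    until "$@"
    do
        echo "Command failed, retrying in ${delay}s..." >&2
        sleep "$delay"
    done
}

# In download-slave.sh this could wrap the download, e.g.:
#   retry 10 wget -q -O "${SLAVE_JAR}.new" "${JENKINS_URL}/jnlpJars/slave.jar"
```

With this, a boot-time race only delays the slave rather than leaving it with an empty slave.jar.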
#10 Updated by intrigeri 2019-03-26 16:49:24
- Status changed from Confirmed to In Progress
- Assignee set to intrigeri
I might as well finish this, now that I’ve debugged and understood the problem and have a solution in mind.
#11 Updated by intrigeri 2019-03-26 17:00:31
Given the auto-generated systemd unit file for jenkins-slave.service has Type=forking, the service is considered to have started successfully even when the initscript returns non-zero. So, to solve this in a way that also fixes Bug #11858, I’ll write a unit file for that service, and boom!
#12 Updated by intrigeri 2019-03-27 11:20:30
- Assignee changed from intrigeri to bertagaz
- % Done changed from 0 to 50
- QA Check set to Ready for QA
Implemented.
@bertagaz, please review https://git.tails.boum.org/puppet-tails/commit/?id=4eb0497e8efeb2d83a3dd1f011d63edcd5017df2
#13 Updated by intrigeri 2019-06-02 14:42:53
- Status changed from In Progress to Needs Validation
#14 Updated by intrigeri 2019-08-01 20:44:51
- Assignee changed from bertagaz to Sysadmins
Anyone doing this review: it’s best to batch it with Bug #11858 (which should be mostly trivial and take only a few minutes).
#15 Updated by groente 2019-08-18 19:19:01
- Status changed from Needs Validation to Resolved
- Assignee changed from Sysadmins to groente
- % Done changed from 50 to 100
looks good!