Bug #8508

jenkins-slave service sometimes fails to start correctly on boot

Added by bertagaz 2015-01-01 15:46:42. Updated 2019-08-18 19:19:01.

Status:
Resolved
Priority:
Normal
Assignee:
groente
Category:
Continuous Integration
Target version:
Start date:
2015-01-01
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
0
Affected tool:
Deliverable for:

Description

Every time lizard is booted, the jenkins-slave daemon has to be started manually on the isobuilders. I don’t remember that being necessary in the past.

It would be so nice if they started on their own at boot!


Subtasks


Related issues

Blocks Tails - Bug #11858: Monitor if isobuilders systems are running fine (Resolved, 2016-10-03)

History

#1 Updated by intrigeri 2015-09-25 03:29:51

I think we have a race condition that leads to Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar, after which the slave stays offline. The service is still marked as started, though (thanks, “good old initscript”). It might be a race condition between the startup of the slave and the startup of the master, or a totally different problem.
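
For reference, a quick way to check for that symptom on an isobuilder (a diagnostic sketch only; slave.jar is a regular zip archive, so unzip -t can verify it):

# Diagnostic sketch: is the downloaded slave.jar empty or corrupt?
ls -l /var/run/jenkins/slave.jar
unzip -t /var/run/jenkins/slave.jar > /dev/null 2>&1 \
    && echo "slave.jar looks OK" \
    || echo "slave.jar is corrupt or empty"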

#2 Updated by intrigeri 2019-03-15 09:48:00

This is a rare problem nowadays, but I see it happen at every boot on ant01: an empty slave.jar is saved. Given how non-robust /usr/share/jenkins/bin/download-slave.sh is, I’m not very surprised. I am testing changes locally that should at least mark the service as failed when this happens; systemd should then restart the service until it starts successfully. But of course it would be better to properly order the startup of jenkins-slave.service so that it doesn’t start before everything it needs is ready. Given that ant01 uses systemd-networkd, I’ll try something based on After=systemd-networkd-wait-online.service.
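
Roughly the kind of ordering drop-in I have in mind (a sketch only; the drop-in path and file name are illustrative, not the deployed configuration):

# /etc/systemd/system/jenkins-slave.service.d/wait-online.conf  (illustrative path)
[Unit]
Wants=systemd-networkd-wait-online.service
After=systemd-networkd-wait-online.service

(plus a systemctl daemon-reload for systemd to pick it up)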

#3 Updated by intrigeri 2019-03-19 07:13:10

  • Subject changed from Investigate why jenkins-slaves don't start on boot to jenkins-slave service sometimes fails to start correctly on boot

intrigeri wrote:
> This is a rare problem nowadays, but I see it happen at every boot on ant01: an empty slave.jar is saved. Given how non-robust /usr/share/jenkins/bin/download-slave.sh is, I’m not very surprised. I am testing changes locally that should at least mark the service as failed when this happens; systemd should then restart the service until it starts successfully. But of course it would be better to properly order the startup of jenkins-slave.service so that it doesn’t start before everything it needs is ready. Given that ant01 uses systemd-networkd, I’ll try something based on After=systemd-networkd-wait-online.service.

I confirm I now have a robust solution on ant01, based on systemd-networkd and systemd-networkd-wait-online.service. For other Jenkins workers, we could either migrate them to systemd-networkd or use something like ifupdown-wait-online.service (in Buster’s ifupdown, not in Stretch).

#4 Updated by intrigeri 2019-03-19 07:13:22

  • Category changed from Infrastructure to Continuous Integration

#5 Updated by intrigeri 2019-03-19 07:20:03

Actually, a nicer solution would be to make /usr/share/jenkins/bin/download-slave.sh (used by jenkins-slave.service to download slave.jar) retry the download until it succeeds. This way, there’ll be no need to block the whole boot on *-wait-online.service.

#6 Updated by intrigeri 2019-03-20 15:52:16

  • related to Bug #11858: Monitor if isobuilders systems are running fine added

#7 Updated by intrigeri 2019-03-20 15:54:57

  • related to deleted (Bug #11858: Monitor if isobuilders systems are running fine)

#8 Updated by intrigeri 2019-03-20 15:54:59

  • blocks Bug #11858: Monitor if isobuilders systems are running fine added

#9 Updated by intrigeri 2019-03-20 15:57:19

Here’s what I have now:

#!/bin/sh
SLAVE_JAR=/var/run/jenkins/slave.jar
JENKINS_URL=$1

set -eu

if [ -z "$JENKINS_URL" ]
then
    echo URL of jenkins server must be provided
    exit 1
fi

# Retrieve Slave JAR from Master Server
echo "Downloading slave.jar from ${JENKINS_URL}..."
wget -q -O ${SLAVE_JAR}.new ${JENKINS_URL}/jnlpJars/slave.jar

# Check to make sure slave.jar was downloaded.
if [ -f ${SLAVE_JAR}.new ] && [ -s ${SLAVE_JAR}.new ]
then
    mv ${SLAVE_JAR}.new ${SLAVE_JAR}
    exit 0
else
    exit 1
fi

Adding a loop to retry the download until it succeeds should fix this ticket and allow us to reject Bug #11858, because we already have a monitoring check that tells us when services didn’t start properly.
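
For example, the retry could look roughly like this (a sketch only; the 10-second delay is arbitrary and the final code may differ):

# Sketch: retry until a non-empty slave.jar has been retrieved.
# (Commands in the condition of an until loop are not affected by set -e.)
until wget -q -O "${SLAVE_JAR}.new" "${JENKINS_URL}/jnlpJars/slave.jar" \
      && [ -s "${SLAVE_JAR}.new" ]
do
    echo "Failed to download slave.jar from ${JENKINS_URL}, retrying in 10s..."
    sleep 10
done
mv "${SLAVE_JAR}.new" "${SLAVE_JAR}"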

#10 Updated by intrigeri 2019-03-26 16:49:24

  • Status changed from Confirmed to In Progress
  • Assignee set to intrigeri

I might as well finish this myself, now that I’ve debugged and understood the problem and have a solution in mind.

#11 Updated by intrigeri 2019-03-26 17:00:31

Given that the auto-generated systemd unit file for jenkins-slave.service has Type=forking, the service is considered to have started successfully even when the initscript returns non-zero. So, to solve this in a way that also fixes Bug #11858, I’ll write a proper unit file for that service, and boom!
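
Roughly the kind of unit file I have in mind (a sketch only: the paths, Jenkins URL and agent options are illustrative placeholders, and the real unit will live in puppet-tails):

# /etc/systemd/system/jenkins-slave.service  (illustrative sketch, not the deployed unit)
[Unit]
Description=Jenkins slave agent
Wants=network-online.target
After=network-online.target

[Service]
# Run the agent in the foreground so that failures are actually reported,
# unlike with the generated Type=forking unit described above.
Type=simple
User=jenkins
# Fetch slave.jar first; the URL is a placeholder.
ExecStartPre=/usr/share/jenkins/bin/download-slave.sh https://jenkins.example.org/
# The actual agent options (JNLP URL, secret, ...) depend on how the node
# is configured on the master and are omitted here.
ExecStart=/usr/bin/java -jar /var/run/jenkins/slave.jar
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target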

#12 Updated by intrigeri 2019-03-27 11:20:30

  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 0 to 50
  • QA Check set to Ready for QA

Implemented.

@bertagaz, please review https://git.tails.boum.org/puppet-tails/commit/?id=4eb0497e8efeb2d83a3dd1f011d63edcd5017df2

#13 Updated by intrigeri 2019-06-02 14:42:53

  • Status changed from In Progress to Needs Validation

#14 Updated by intrigeri 2019-08-01 20:44:51

  • Assignee changed from bertagaz to Sysadmins

Whoever does this review should batch it with Bug #11858 (which should be mostly trivial and take only a few minutes).

#15 Updated by groente 2019-08-18 19:19:01

  • Status changed from Needs Validation to Resolved
  • Assignee changed from Sysadmins to groente
  • % Done changed from 50 to 100

looks good!