Bug #17703

Monitoring broken (since April 29?)

Added by intrigeri 2020-05-10 07:48:03. Updated 2020-05-14 06:49:28.

Status:
Resolved
Priority:
Elevated
Assignee:
groente
Category:
Infrastructure
Target version:
Start date:
Due date:
% Done:
100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

On https://icingaweb2.tails.boum.org/monitoring/health/info I see that the last Icinga status update happened on April 29.

On ecours, I see that icinga2.service cannot start because the config files in the teels.tails.boum.org zone refer to a zone that is not declared anymore. Indeed, in etckeeper’s log on ecours, I see that Puppet removed that zone on April 29 (6ffca75bb39d22133064d5d3f306c5be77a9eb46). I could not find where that zone was configured in Puppet so I tried deleting /etc/icinga2/zones.d/teels.tails.boum.org/, which allowed icinga2.service to start. Then I ran Puppet on ecours, and those files did not come back, so I’m confused.

Then I saw the exact same problem on monitor.lizard, and applied the same solution. Here again, running Puppet on that host did not bring back the config files I had deleted.
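For future occurrences, the check behind this diagnosis could be sketched roughly as below. This is only an illustration, not something that was run for this ticket. It assumes that zones are declared with `object Zone "<name>"` statements in a single file (the stock `/etc/icinga2/zones.conf`; our Puppet setup may put them elsewhere), and it simplifies the problem to "a directory under zones.d/ whose zone of the same name is no longer declared". The function names are made up for the example, and declarations inside `/* ... */` block comments are not handled.

```python
import re
from pathlib import Path
from typing import Iterable, List, Set

# Matches lines such as: object Zone "teels.tails.boum.org" {
# Lines commented out with // or # do not match, because "object" must be
# the first non-blank token on the line.
ZONE_DECL = re.compile(r'^\s*object\s+Zone\s+"([^"]+)"', re.MULTILINE)


def declared_zones(conf_text: str) -> Set[str]:
    """Return the zone names declared in the given Icinga 2 config text."""
    return set(ZONE_DECL.findall(conf_text))


def undeclared_zone_dirs(dir_names: Iterable[str], conf_text: str) -> List[str]:
    """Return the zones.d/ directory names that have no matching Zone object."""
    declared = declared_zones(conf_text)
    return sorted(name for name in dir_names if name not in declared)


def undeclared_on_disk(icinga_dir: Path = Path("/etc/icinga2")) -> List[str]:
    """Run the check against a host's config tree (paths are assumptions)."""
    conf_text = (icinga_dir / "zones.conf").read_text()
    dir_names = [p.name for p in (icinga_dir / "zones.d").iterdir() if p.is_dir()]
    return undeclared_zone_dirs(dir_names, conf_text)
```

Calling `undeclared_on_disk()` on ecours or monitor.lizard would list candidate directories to look at; it does not delete anything, and a hit only means "check this by hand".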

I’ll stop here for today. I hope I did more good than harm.


Related issues

Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) Confirmed 2017-06-29

History

#1 Updated by groente 2020-05-11 18:45:43

  • Assignee changed from Sysadmins to groente

#2 Updated by groente 2020-05-11 18:46:51

  • blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added

#3 Updated by groente 2020-05-13 19:10:31

  • Status changed from Confirmed to Resolved
  • % Done changed from 0 to 100

The cause seems to have been a Puppet agent process that had been running on teels since April 22nd. Once I killed that process and ran Puppet again on teels, ecours, and monitor, everyone was happy again (except for monitor, which didn’t have enough memory to run Puppet, but after bumping its memory a bit, even monitor was happy).
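To spot this kind of stuck agent earlier, something along these lines might help. It is a rough sketch, not what was actually used here: it assumes a Linux `ps` from procps-ng (for the `etimes` column, elapsed time in seconds), assumes the agent's command line contains the string "puppet agent", and uses an arbitrary 6 hour threshold. If the agent runs as a long-lived daemon rather than as one-off runs, process age alone says nothing and this check does not apply. All function names are made up for the example.

```python
import subprocess
from typing import List, Tuple

# (pid, elapsed seconds, command line)
PsRow = Tuple[int, int, str]


def parse_ps(ps_output: str) -> List[PsRow]:
    """Parse header-less `ps` output with the columns pid, etimes, args."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        pid, etimes, args = parts
        rows.append((int(pid), int(etimes), args.strip()))
    return rows


def stale_processes(
    ps_output: str,
    needle: str = "puppet agent",      # assumed substring of the command line
    max_age_seconds: int = 6 * 3600,   # arbitrary threshold for this sketch
) -> List[PsRow]:
    """Return processes matching `needle` that are older than the threshold."""
    return [
        row for row in parse_ps(ps_output)
        if needle in row[2] and row[1] > max_age_seconds
    ]


def current_ps_output() -> str:
    """Collect the process list; separate -o options suppress the header."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

`stale_processes(current_ps_output())` returns the suspicious processes, if any, so it could back a simple cron job or monitoring check; whether to kill them is left to a human.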

#4 Updated by groente 2020-05-13 19:11:18

Oh, and @intrigeri - thanks for catching this issue and applying the quick fix!

#5 Updated by intrigeri 2020-05-14 06:49:28

> The cause seems to have been a Puppet agent process that had been running on teels since April 22nd. Once I killed that process and ran Puppet again on teels, ecours, and monitor, everyone was happy again (except for monitor, which didn’t have enough memory to run Puppet, but after bumping its memory a bit, even monitor was happy).

Awesome detective work \o/ :)

Cheers!