Feature #8645

Research and decide what monitoring solution to use

Added by intrigeri 2015-01-09 17:12:33. Updated 2016-04-25 03:08:18.

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Infrastructure
Target version:
Start date:
2015-01-09
Due date:
2015-09-28
% Done:

100%

Feature Branch:
Type of work:
Research
Starter:
Affected tool:
Deliverable for:
268

Description


Subtasks


Related issues

Blocked by Tails - Feature #8649: Specify our monitoring needs and build an inventory of the services that need monitoring Resolved 2015-01-09
Blocked by Tails - Feature #9480: Set up a VM for monitoring tools experiments Resolved 2015-05-28

History

#1 Updated by intrigeri 2015-01-09 17:13:15

  • blocks Feature #8646: Research and decide where to host the monitoring software added

#2 Updated by intrigeri 2015-05-08 01:40:46

https://bugs.debian.org/734453 makes it clear that Nagios is not future-proof, at least for us Debian users.

#3 Updated by Dr_Whax 2015-05-26 21:52:58

It was mentioned that the focus should be on Icinga vs. Zabbix.

#4 Updated by intrigeri 2015-05-28 14:12:00

  • blocks #8668 added

#5 Updated by intrigeri 2015-05-28 14:12:32

  • blocked by Feature #8649: Specify our monitoring needs and build an inventory of the services that need monitoring added

#6 Updated by intrigeri 2015-05-28 14:12:54

  • blocked by Feature #9480: Set up a VM for monitoring tools experiments added

#7 Updated by intrigeri 2015-05-28 14:20:12

  • blocks Feature #9481: Reset the VM used for monitoring tools experiments to a clean state added

#8 Updated by intrigeri 2015-05-28 14:20:59

  • Target version changed from Tails_1.8 to Tails_1.5

Adjusting target version to better fit DrWhax’s availability.

#9 Updated by intrigeri 2015-05-28 19:59:32

  • Blueprint set to https://tails.boum.org/blueprint/monitor_servers/

See the blueprint for specs.

#10 Updated by Dr_Whax 2015-07-18 14:32:38

  • Status changed from Confirmed to New
  • % Done changed from 0 to 20

#11 Updated by intrigeri 2015-07-19 01:27:48

  • Status changed from New to In Progress

> % Done changed from 0 to 20

Where can we see this?

#12 Updated by intrigeri 2015-08-19 11:42:22

  • Target version changed from Tails_1.5 to Tails_1.6

#13 Updated by bertagaz 2015-09-23 01:32:28

  • Target version changed from Tails_1.6 to Tails_1.7

#14 Updated by Dr_Whax 2015-09-25 10:11:15

Icinga2 (2.1.1-1)

- Works with Medium-High security level within TBB.

- Can send e-mail alerts.

- Supports configuring alerts with per-check/per-service granularity.

- No CVEs in Debian packages.

- Latest package for Jessie (2.1.1-1) is from September 2014, and for Stretch it’s 2.3.10-1 from June 17th, 2015, with consistent uploads and development in Debian testing.

- Seems upstream is pretty active.

- No fancy graphs. (requires a bunch of custom fiddling)

- Wheezy backport available (2.1.1-1~bpo70+1), so we can run an agent.

- DFSG = yes!
- Has good enough Puppet support that we could base our infrastructure on it.

Hence I say: let’s go with Icinga2, even though it’s beta software and a 1.0.0 release is upcoming.

Zabbix (1:2.2.7+dfsg-2)

- Doesn’t work with Medium-High security level within TBB. Medium-Low does work.

- Can send e-mail alerts.

- Supports configuring alerts with per-check/per-service granularity.

- Had 4 CVEs patched in 2014 (http://metadata.ftp-master.debian.org/changelogs//main/z/zabbix/zabbix_2.2.7+dfsg-2_changelog)

- Latest package for Jessie (2.2.7) is from January 14th, and for Stretch it’s 2.4.5 from April 26th. Sporadic development; not sure if patches get merged easily.

- Has a wheezy-backport: zabbix-agent (1:2.2.5+dfsg-1~bpo70+1)

- Can also plot fancy graphs out of the box!
- DFSG = yes!

#15 Updated by intrigeri 2015-09-26 06:43:18

Last step before we decide this is done: go through the spec and list which of our SHOULD/MUST/etc. items it does not satisfy.

#16 Updated by intrigeri 2015-09-26 06:45:20

  • Due date set to 2015-09-28

#17 Updated by intrigeri 2015-09-26 07:42:07

  • Description updated

#18 Updated by Dr_Whax 2015-09-27 09:14:24

Hi, this is a report showing that Icinga2 passes all the MUSTs listed in the monitoring blueprint. In addition, it gives an overview of the trade-offs regarding SHOULDs that might not be entirely covered.

Human interface

All the MUSTs are met for this section. A read-only version isn’t included, so the only MAY in this section isn’t met.

Threat model

Compromised monitored machine

All the MUSTs are met for this section. The SHOULD NOT is met: it can’t alter information about other monitored machines.

Compromised monitoring machine

All the MUSTs are met for this section. The SHOULD NOT is also met.

Network attacker

All the MUSTs are met for this section. The SHOULD NOT is also met.

Availability, sustainability

All the MUSTs are met for this section. The SHOULDs are also met.

Configuration

SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.

^ This requirement is met.

Additionally, if this optional (but warmly welcome) requirement is satisfied, then the “shared Puppet modules” we use SHOULD already support the chosen monitoring system (hint: in practice, this means something compatible with Nagios).

^ This is an open question for me.

Humans can easily review service check configuration in files. Configuration is mostly static and unaffected when mutating service checks.

https://github.com/Icinga/puppet-icinga2

Adequacy to our resources

All the MUSTs are met for this section. Of course, it has a small learning curve, as with everything new, but it doesn’t get more complex as more and more monitored systems are added.

Miscellaneous

This MUST isn’t met. When researching Zabbix and Icinga2, I came to the conclusion that neither software package supports SOCKS proxies.

We can either write a plugin that allows us to use torsocks, or use an iptables setup.

Hosting of the monitoring machine

This is ongoing.

#19 Updated by Dr_Whax 2015-09-27 09:14:50

  • Assignee changed from Dr_Whax to intrigeri
  • QA Check set to Ready for QA

#20 Updated by intrigeri 2015-09-28 02:42:56

  • blocked by deleted (Feature #9481: Reset the VM used for monitoring tools experiments to a clean state)

#21 Updated by intrigeri 2015-09-28 03:07:08

  • QA Check changed from Ready for QA to Dev Needed

> h2. Threat model

> h3. Compromised monitored machine

> All the MUSTS’s are met for this section.

How is “It MUST NOT be able to DoS the sysadmin(s) in charge, e.g. by flooding them with alerts.” achieved?

> h3. Compromised monitoring machine

> All the MUSTS’s are met for this section. The SHOULD NOT is also met.

How are checks (= arbitrary code, presumably) distributed to the monitored machines? Is the plan to do that externally (e.g. via Puppet and Debian packages)? Can the monitoring machine distribute such checks (and if yes, how can we tell the monitored machines to refuse it)?

> h3. Network attacker

> All the MUSTS’s are met for this section. The SHOULD NOT is also met.

How is the network traffic protected?

> h2. Configuration

> SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.

> ^ This requirement is met.

This is great. It remains to be seen how we’ll convey that information from the puppetmaster to the monitoring system, but that task has been moved to another ticket on bertagaz’ plate.

> Additionally, if this optional (but warmly welcome) requirement is satisfied, then
> the “shared Puppet modules” we use SHOULD already support the chosen monitoring
> system (hint: in practice, this means something compatible with Nagios).

> ^ This is an open question for me.

OK ⇒ “Dev Needed”.

> Humans can easily review service check configuration in files. Configuration is
> mostly static and unaffected when mutating service checks.

Can we please see an example set of configuration files for:

  • the service checks configuration
  • the global configuration of the server components

?

> h3. Miscellaneous

> This MUST isn’t met. When researching Zabbix and Icinga2. I came to the conclusion
> that both software packages don’t support SOCKS proxies.

> Either we can write a plugin that allows us to use torsocks or an iptables setup.

OK. I want to know how we’ll meet this requirement before we decide that Icinga2 does the job ⇒ “Dev Needed”.

So we have two candidate options:

  1. using transparent torification with iptables/netfilter: that’s the fallback, and at first glance it seems to be the cheapest option; before deciding to go this way we should make sure that it’s not going to break unrelated stuff; to start with, we need to know whether we can torify the monitoring checks only (e.g. with per-UID rules) instead of the whole system => under what UID(s) do these checks run? I guess you got the idea of the kind of thinking that needs to be put into it.
  2. wrap checks with torsocks somehow; how would we do that? You said “we can write a plugin that allows us to use torsocks”. Can a plugin really impact all other checks this way? (e.g. it can only do that for checks that fork another process, not for in-process ones) I’d like to know more about how you see this happen. But if you think the iptables/netfilter option is better/easier, forget it and focus on that one instead.

> h2. Hosting of the monitoring machine

> This is on-going.

… and out-of-scope on this ticket.

#22 Updated by intrigeri 2015-10-05 08:20:19

  • blocked by deleted (Feature #8646: Research and decide where to host the monitoring software)

#23 Updated by intrigeri 2015-10-05 13:18:34

  • Assignee changed from intrigeri to Dr_Whax

#24 Updated by Dr_Whax 2015-10-25 11:15:49

  • Assignee changed from Dr_Whax to intrigeri

intrigeri wrote:
>
> How is “It MUST NOT be able to DoS the sysadmin(s) in charge, e.g. by flooding them with alerts.” achieved?

One can disable notifications for a monitored machine until the problem is resolved. Notifications can only be pulled by the monitoring machine, not by the monitored machine.
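
For illustration, notifications can be switched off per host (or per service) directly in the monitoring machine’s configuration; a minimal sketch with a made-up host name and address:

object Host "noisy-host.example.org" {
        import "generic-host"
        address = "192.0.2.5"

        // Stop alerting for this host while the underlying problem is dealt with
        enable_notifications = false
}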

>
> > h3. Compromised monitoring machine
>
> > All the MUSTS’s are met for this section. The SHOULD NOT is also met.
>
> How are checks (= arbitrary code, presumably) distributed to the monitored machines? Is the plan to do that externally (e.g. via Puppet and Debian packages)? Can the monitoring machine distribute such checks (and if yes, how can we tell the monitored machines to refuse it)?

There is a fully fledged Icinga2 “agent” which has its own SSL certificate per monitored machine. The monitoring machine distributes the checks. We can refuse such checks by disabling the check or firewalling it.

>
> > h3. Network attacker
>
> > All the MUSTS’s are met for this section. The SHOULD NOT is also met.
>
> How is the network traffic protected?

When a service check is created on the monitoring machine, an SSL certificate for the monitored machine is created. A network attacker would be able to enumerate machines and possibly checks. A DoS wouldn’t be possible on the monitoring machine (see Compromised monitored machine); however, it would be possible to spoof reports, test results and the like, but this is within the threat model. The monitored machine itself can’t send results, it’s pull only. However, if one had full (shell + root) access on the monitoring machine, one could of course send arbitrary code to the monitored machines.

>
> > h2. Configuration
>
> > SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.
>
> > ^ This requirement is met.
>
> This is great. It remains to be seen how we’ll convey that information from the puppetmaster to the monitoring system, but that task has been moved to another ticket on bertagaz’ plate.
>
> > Additionally, if this optional (but warmly welcome) requirement is satisfied, then
> > the “shared Puppet modules” we use SHOULD already support the chosen monitoring
> > system (hint: in practice, this means something compatible with Nagios).
>
> > ^ This is an open question for me.
>
> OK ⇒ “Dev Needed”.

ACK

>
> > Humans can easily review service check configuration in files. Configuration is
> > mostly static and unaffected when mutating service checks.
>
> Can we please see an example set of configuration files for:
>
> * the service checks configuration
> * the global configuration of the server components
>

https://github.com/DrWhax/icinga2-configuration

>
> > h3. Miscellaneous
>
> > This MUST isn’t met. When researching Zabbix and Icinga2. I came to the conclusion
> > that both software packages don’t support SOCKS proxies.
>
> > Either we can write a plugin that allows us to use torsocks or an iptables setup.
>
> OK. I want to know how we’ll meet this requirement before we decide that Icinga2 does the job ⇒ “Dev Needed”.
>
> So we have two candidate options:
>
> # using transparent torification with iptables/netfilter: that’s the fallback and at first glance it seems to be the cheapest option; before deciding to go this way we should make sure that it’s not going to break unrelated stuff; to start with, we need to know if we can torify the monitoring checks only (e.g. with per-UID rules) instead of the whole system => under what UID (s) do these check run? I guess you got the idea of the kind of thinking that needs to be put into it.

This option is the cheapest.

> # wrap checks with torsocks somehow; how would we do that? You said “we can write a plugin that allows us to use torsocks”. Can a plugin really impact all other checks this way? (e.g. it can only do that for checks that fork another process, not for in-process ones) I’d like to know more about how you see this happen. But if you think the iptables/netfilter option is better/easier, forget it and focus on that one instead.

I think iptables+netfilter is the better one.

#25 Updated by intrigeri 2015-10-31 08:30:28

  • QA Check changed from Dev Needed to Ready for QA

(This ticket hadn’t landed on my “people waiting for me” high-priority list, since it wasn’t set as Ready for QA.)

#26 Updated by intrigeri 2015-10-31 10:43:02

  • QA Check changed from Ready for QA to Dev Needed

>> > h3. Compromised monitoring machine
>>
>> > All the MUSTS’s are met for this section. The SHOULD NOT is also met.
>>
>> How are checks (= arbitrary code, presumably) distributed to the monitored machines? Is the plan to do that externally (e.g. via Puppet and Debian packages)? Can the monitoring machine distribute such checks (and if yes, how can we tell the monitored machines to refuse it)?

> There is a fully fledged Icinga2 “agent” which has its own SSL certificate per monitored machine that is being monitored. The monitoring machine is distributing the checks.

and later you write:

> However, if one would have a full compromise (shell+root) access on the monitoring machine, one could of course send arbitrary code to the monitored machines.

So it’s not 100% clear to me why you stated that Icinga2 is satisfying “It MUST NOT be able to run arbitrary code as root on any of the monitored machines”. May you please clarify? It can run arbitrary code, just not as root, maybe?

(Please re-read the specs we’re talking about, and clarify precisely: given our combined latencies, better avoid yet another back’n’forth :)

> We can refuse such checks by disabling the check or firewalling it.

I don’t understand how this is applicable to a monitored machine protecting itself against a compromised monitoring machine (especially when we don’t know yet that it’s been compromised). Please clarify if it’s worth it, or dismiss it if it was not relevant.

>> > h3. Network attacker
>>
>> > All the MUSTS’s are met for this section. The SHOULD NOT is also met.
>>
>> How is the network traffic protected?

> When a service check is being created on the monitoring machine a SSL certificate for the monitored machine will be created.

This is, by far, insufficient for us to evaluate the thing. E.g. you didn’t tell whether the SSL certificate was meant for the monitoring machine to authenticate itself to the monitored one, or the opposite. Please clarify where the private key is created, what the CA is, where the private key is stored, what it is used for, etc.

You see where I’m headed, so don’t hesitate to read between the lines and address the more general bootstrapping problem here (including “how can the monitored machine know it’s talking to the right monitoring machine?”).

>> > h2. Configuration
>>
>> > SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.
[…]

>> > Additionally, if this optional (but warmly welcome) requirement is satisfied, then the “shared Puppet modules” we use SHOULD already support the chosen monitoring system (hint: in practice, this means something compatible with Nagios).
>>
>> > ^ This is an open question for me.
>>
>> OK ⇒ “Dev Needed”.

> ACK

FTR I was not waiting for an ACK, but for you to answer this question (or clearly state that you’re not going to do it).

>> > Humans can easily review service check configuration in files. Configuration is mostly static and unaffected when mutating service checks.
>>
>> Can we please see an example set of configuration files for:
>>
>> * the service checks configuration
>> * the global configuration of the server components
>>

> https://github.com/DrWhax/icinga2-configuration

Thanks. I’ve found one check in there (https://github.com/DrWhax/icinga2-configuration/blob/master/conf.d/tails.conf). Looks sane. Did I miss others?

>> > h3. Miscellaneous
>>
>> > This MUST isn’t met. When researching Zabbix and Icinga2. I came to the conclusion that both software packages don’t support SOCKS proxies.
>>
>> > Either we can write a plugin that allows us to use torsocks or an iptables setup.
>>
>> OK. I want to know how we’ll meet this requirement before we decide that Icinga2 does the job ⇒ “Dev Needed”.
>>
>> So we have two candidate options:
>>
>> # using transparent torification with iptables/netfilter: that’s the fallback and
>> at first glance it seems to be the cheapest option; before deciding to go this way we should make sure that it’s not going to break unrelated stuff; to start with, we need to know if we can torify the monitoring checks only (e.g. with per-UID rules) instead of the whole system => under what UID (s) do these check run? I guess you got the idea of the kind of thinking that needs to be put into it.

> This option is the cheapest.

I tend to agree but I need you to answer the question I’ve asked, before we can conclude that with certainty.

Cheers!

#27 Updated by intrigeri 2015-11-01 03:36:46

  • Assignee changed from intrigeri to Dr_Whax
  • % Done changed from 20 to 40

#28 Updated by bertagaz 2015-11-02 04:01:04

  • Priority changed from Normal to Elevated
  • Target version changed from Tails_1.7 to Tails_1.8

Two comments I hope will be helpful on top of intrigeri’s one below.

I’m postponing this ticket as it clearly won’t be done by this release (meaning tomorrow), but marking it as elevated because it was supposed to be completed 2 releases ago, and it will clearly become a pain if it goes on this way during the 1.8 release. I suggest this ticket be treated with high priority and worked on so it is completed ASAP.
As you’ll see I’m starting to do research myself on this ticket, and I admit I dislike that: it’s not really what we agreed on.

intrigeri wrote:
> So it’s not 100% clear to me why you stated that Icinga2 is satisfying “It MUST NOT be able to run arbitrary code as root on any of the monitored machines”. May you please clarify? It can run arbitrary code, just not as root, maybe?
>
> (Please re-read the specs we’re talking about, and clarify precisely: given our combined latencies, better avoid yet another back’n’forth :)

Agree it’s quite unclear at the moment how this MUST NOT is met.

To get this a bit forward, as we’re late on this and it will block us quickly in the next release cycle, I’ve done a quick 20-minute research myself (searching for “icinga2 test distribution”) and ended up on the icinga2 client documentation, which I guess you’ve read.

It states that you can run “Clients with Local Configuration”, meaning each of them is an “independent satellite using a local scheduler, configuration […]”. It seems to be a different setup than the “Clients as Command Execution Bridge”, where “the remote Icinga 2 client will only execute commands the master instance is sending”.

So it seems possible to run clients as independent nodes that only own and run their own locally configured checks and report to the master, and thus don’t just run random code that the master sends to them.

Do you confirm it’s possible? Is that the kind of setup you tested yourself? Are the configuration files you posted on github made that way?

> > When a service check is being created on the monitoring machine a SSL certificate for the monitored machine will be created.
>
> This is, by far, insufficient for us to evaluate the thing. E.g. you didn’t tell if the SSL certificate was meant for the monitoring machine to authenticate itself to the monitored one, or the opposite. Please clarify where is the private key created, what’s the CA, where is the private key stored, what it is used for, etc.
>
> You see where I’m headed to, so don’t hesitate reading between the lines and addressing the more general bootstrapping problem here (including “how can the monitored machine know it’s talking to the right monitoring machine?”).

My 2 cents on this: I’ve watched the talk given at DebConf 2015 by one of the Icinga developers. It was a bit interesting: it mentions that icinga2 could use the Puppet CA and node certificates to authenticate the agents. This documentation seems to say it’s possible to do so.

Now, our Puppet CA is hosted on a particular VM, while our monitoring software would be on another machine, so I’m not sure that’s something we want to do. We may want these two different services (Puppet and icinga2) to use different CAs. I’ll let you, DrWhax, think about all this.

#29 Updated by Dr_Whax 2015-11-14 11:38:44

  • Assignee changed from Dr_Whax to bertagaz

bertagaz wrote:
> Two comments I hope will be helpful on top of intrigeri’s one below.
>
> I’m postponing this ticket as it clearly won’t be done until this release (meaning tomorrow), but then mark it as elevated because it had to be completed 2 releases ago, and it will clearly become a pain if it goes this way during the 1.8 release. I suggest this ticket is treated with high priority and is worked on to be completed ASAP.
> As you’ll see I’m starting to do research myself on this ticket, and I admit I dislike that: it’s not really what we agreed on.
>
> intrigeri wrote:
> > So it’s not 100% clear to me why you stated that Icinga2 is satisfying “It MUST NOT be able to run arbitrary code as root on any of the monitored machines”. May you please clarify? It can run arbitrary code, just not as root, maybe?

This is indeed the case: it could run arbitrary code, just not as root. This would be the case with NRPE, SNMP or the icinga2 satellite system (client/server).

> >
> > (Please re-read the specs we’re talking about, and clarify precisely: given our combined latencies, better avoid yet another back’n’forth :)
>
> Agree it’s quite unclear at the moment how this MUST NOT is met.
>
> To get this a bit forward as we’re late on this and it will block us quickly in the next release cycle, I’ve made a quick 20 minutes research myself (with “icinga2 test distribution”) and ended up on the icinga2 client documentation, which I guess you’ve read.
>
> It states that you can run “Clients with Local Configuration”, which means that they are “independant satellite using a local scheduler, configuration […]”. It seems to be a different setup than the “Clients as Command Execution Bridge” where “the remote Icinga 2 client will only execute commands the master instance is sending”.
>
> So it seems to be possible to run clients as independant nodes that only own and run their own locally configured checks and reports to the master, so possibly don’t just run random code that the master send to them.
>
> Do you confirm it’s possible? Is that the kind of setup you tested yourself? Are the configuration files you posted on github made that way?

My configuration files on GitHub don’t currently have that. Looks like the client with local configuration is plausible. It seems you install Icinga2 on a monitored host and add the monitored host as a node. On the monitoring machine you add the monitored host (“Manually Discover Clients on the Master”).

>
> > > When a service check is being created on the monitoring machine a SSL certificate for the monitored machine will be created.
> >
> > This is, by far, insufficient for us to evaluate the thing. E.g. you didn’t tell if the SSL certificate was meant for the monitoring machine to authenticate itself to the monitored one, or the opposite. Please clarify where is the private key created, what’s the CA, where is the private key stored, what it is used for, etc.

The PKI here works as follows (always use the FQDN): the monitoring host creates SSL certificates and a CA in order to communicate with the monitored hosts. See here for setting up a monitored host (http://docs.icinga.org/icinga2/snapshot/search?q=PKI#!/icinga2/snapshot/doc/module/icinga2/chapter/icinga2-client?highlight-search=PKI#icinga2-client-installation-master-setup). The monitoring host will generate the key (as far as I understand) and use it to authenticate the monitored host. It could also do CSR auto-signing, but that’s probably something we wouldn’t want to do.

> >
> > You see where I’m headed to, so don’t hesitate reading between the lines and addressing the more general bootstrapping problem here (including “how can the monitored machine know it’s talking to the right monitoring machine?”).

If there is anything unclear about the above, please let me know.

>
> 2 cents on this, I’ve watched the conference at the 2015 debconf from one of the icinga developer. It was a bit interesting: it is mentionned that icinga2 could use the puppet CA and nodes certificates to authenticate the agents. This documentation seems to say it’s possible to do so.
>
> Now, our CA in puppet is hosted on a particular VM, while our monitoring software would be on another machine, so I’m not sure that’s something we want to do. We may want this two different services (puppet and icinga2) to use a different CA. I’ll let you DrWhax think about all this.

We’d ideally want separate CAs, but I wonder what the acceptable threat model is here. How bad would it be if we used the same CA for Puppet and icinga2 and the icinga2 box got compromised? How would we handle that key compromise? The $adversary would still have to compromise the Puppet box.

#30 Updated by bertagaz 2015-12-15 03:37:03

  • Target version changed from Tails_1.8 to Tails_2.0

Postponing

#31 Updated by bertagaz 2016-01-27 10:49:43

  • Target version changed from Tails_2.0 to Tails_2.2

#32 Updated by bertagaz 2016-01-31 14:20:26

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 40 to 50
  • QA Check changed from Dev Needed to Info Needed

Trying to sort out and clarify this looong discussion below. Hope I didn’t miss any of the remaining items to be discussed.

intrigeri wrote:
> >> > h3. Compromised monitoring machine
> >>
> >> > All the MUSTS’s are met for this section. The SHOULD NOT is also met.
> >>
> >> How are checks (= arbitrary code, presumably) distributed to the monitored machines? Is the plan to do that externally (e.g. via Puppet and Debian packages)? Can the monitoring machine distribute such checks (and if yes, how can we tell the monitored machines to refuse it)?

My plan is to distribute the checks through Debian packages (monitoring-plugins-*) and Puppet. The monitored machines, when using icinga2 or nagios-nrpe-server, don’t accept checks that the monitoring machine would distribute unless told to do so with a configuration option.

> > There is a fully fledged Icinga2 “agent” which has its own SSL certificate per monitored machine that is being monitored. The monitoring machine is distributing the checks.
>
> and later you write:
>
> > However, if one would have a full compromise (shell+root) access on the monitoring machine, one could of course send arbitrary code to the monitored machines.
>
> So it’s not 100% clear to me why you stated that Icinga2 is satisfying “It MUST NOT be able to run arbitrary code as root on any of the monitored machines”. May you please clarify? It can run arbitrary code, just not as root, maybe?

Yes, the checks are run as the nagios user. But as said above, unless the icinga2 agent is configured with accept_commands set to true, it won’t use checks sent by the monitoring machine; they have to be installed on the monitored machine another way.
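
For reference, this is part of the agent’s “api” feature configuration; a minimal sketch of what it could look like on a monitored machine (attribute and path names are from memory, based on icinga2 2.4’s defaults, so double-check before relying on them):

object ApiListener "api" {
        cert_path = SysconfDir + "/icinga2/pki/" + NodeName + ".crt"
        key_path = SysconfDir + "/icinga2/pki/" + NodeName + ".key"
        ca_path = SysconfDir + "/icinga2/pki/ca.crt"

        // Refuse commands and configuration pushed by the master
        accept_commands = false
        accept_config = false
}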

> >> > h3. Network attacker
> >>
> >> > All the MUSTS’s are met for this section. The SHOULD NOT is also met.
> >>
> >> How is the network traffic protected?
>
> > When a service check is being created on the monitoring machine a SSL certificate for the monitored machine will be created.
>
> This is, by far, insufficient for us to evaluate the thing. E.g. you didn’t tell if the SSL certificate was meant for the monitoring machine to authenticate itself to the monitored one, or the opposite. Please clarify where is the private key created, what’s the CA, where is the private key stored, what it is used for, etc.

The SSL certificate is used by the monitored machine to authenticate to the monitoring machine. The private key can be created on any host that has access to the CA private key, which is hosted on the monitoring machine. If the network attacker doesn’t have access to the SSL private keys, in theory she can’t read the communications.

> >> > h2. Configuration
> >>
> >> > SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.

This can be done using exported resources.
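
A rough sketch of the pattern (the tails::monitoring::service define below is made up for illustration, not actual code from our manifests): the Puppet class describing a service exports a check declaration, and the monitoring machine’s manifest collects everything that was exported.

# On the monitored node, in the Puppet class describing the service:
@@tails::monitoring::service { "apache-${::fqdn}":
  host_name     => $::fqdn,
  check_command => 'http',
}

# On the monitoring machine, collect all exported check declarations:
Tails::Monitoring::Service <<| |>>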

> >> > Additionally, if this optional (but warmly welcome) requirement is satisfied, then the “shared Puppet modules” we use SHOULD already support the chosen monitoring system (hint: in practice, this means something compatible with Nagios).

Since the upstream icinga2 Puppet module is not in really good shape (the master branch is old and requires a different puppet-apt module than the one we use, and the dev branch is undergoing a rewrite that makes it unusable for us at the moment), we’ll have to use our own exported resources and icinga2 manifests.

The nagios module used by the shared puppet modules is not compatible with icinga2.

> > https://github.com/DrWhax/icinga2-configuration
>
> Thanks. I’ve found one check in there (https://github.com/DrWhax/icinga2-configuration/blob/master/conf.d/tails.conf). Looks sane. Did I miss others?

That’s quite a good overview of the way to configure icinga2, and of host and service checks.

> >> > h3. Miscellaneous
> >>
> >> > This MUST isn’t met. When researching Zabbix and Icinga2. I came to the conclusion that both software packages don’t support SOCKS proxies.

Here I’ve found a better option: with icinga2 one can easily override the configuration of a check by creating a new one that just imports it and changes only the parts that need to be different. So say you have an already existing check like this:

object CheckCommand "ftp" {
        import "plugin-check-command"

        command = [ PluginDir + "/check_ftp" ]

        arguments = {
                "-H" = "$ftp_address$"
        }

        vars.ftp_address = "$address$"
}

We can just create a new one that would be:

object CheckCommand "torified_ftp" {
        import "ftp"

        command = [ "torsocks", PluginDir + "/check_ftp" ]

}

Then this new check will have the same properties as its parent, but will be run using torsocks.

I’ll post later today a description of the setup I’ve done locally, so that you’ll be able to review it and ack if I can go on and start deploying it. I’ll put this ticket in RfQA at that moment.

#33 Updated by bertagaz 2016-01-31 17:28:36

  • % Done changed from 50 to 70
  • QA Check changed from Info Needed to Ready for QA

bertagaz wrote:
> I’ll post later today a description of the setup I’ve done locally, so that you’ll be able to review it and ack if I can go on and start deploying it. I’ll put this ticket in RfQA at that moment.

So, the way I planned to deploy icinga2 according to the setup I’ve done locally is:

  • Install a full icinga2 server on Ecours. This means it will have the MySQL server to store the data and the web interface, and will be the one responsible for sending the notifications.
  • Install another icinga2 server on Lizard (the host). This one will be configured as a satellite of Ecours’ one, meaning it will be responsible for gathering data from the other monitored systems (atm Lizard’s VMs) and sending it back to Ecours. For future systems outside of Lizard, we’ll be able to install another satellite on them.
  • Icinga2 takes quite a lot of memory, so I’m not sure we should install it on every VM. The other option is to use either the nagios-nrpe-server package (but then we have to decide whether the bad SSL implementation described on the package’s bug page is a blocker, considering they will communicate on the libvirt interface only), or ssh to run the checks from Lizard’s icinga2.

The checks will be installed with the monitoring-plugins-* packages or distributed through Puppet. Using exported resources, Ecours’ manifest will be configured to collect every check set up for other hosts. Lizard’s icinga2 will be configured the same way. To ensure that Lizard’s icinga2 is not listening directly on the Internet to communicate with Ecours’ one, it will use the VPN we’ll probably set up for Feature #10760. A rough sketch of the resulting zone layout follows below.
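
A hedged sketch of what the corresponding zone/endpoint layout could look like (host names shortened, the VPN address is a placeholder, nothing here is deployed configuration):

object Endpoint "ecours" {
}

object Endpoint "lizard" {
        // Reached over the VPN we'll probably set up for Feature #10760
        host = "10.0.0.2"
}

object Zone "master" {
        endpoints = [ "ecours" ]
}

object Zone "satellite-lizard" {
        endpoints = [ "lizard" ]
        parent = "master"
}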

Does it sound relevant to you? I think this way we isolate Ecours quite well and avoid giving it too much access to other systems.

#34 Updated by intrigeri 2016-02-06 15:58:19

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> My plan is to distribute the checks through Debian packages (monitoring-plugins-*) and Puppet. The monitored machines when using icinga2 or the nagios-nrpe-server don’t accept checks that the monitoring machine would distribute them unless told to do so with a configuration option.
> […]
> Yes, the checks are run as the nagios user. But as said above, unless the icinga2 agent is configured with accept_commands to true, it won’t use checks sent by the monitoring machine. They have to be installed on the monitored machine another way.

Sounds good.

> The SSL certificate is used by the monitored machine to authenticate to the monitoring machine. The private key can be created on any host that has access to the CA private key. This one is hosted on the monitoring machine.

This was still very unclear to me, so I’ve read some doc and my conclusion is that the way icinga2 uses SSL is very similar to Puppet’s one.

I didn’t look at how it works exactly, but at first glance “CSR Auto-Signing” can’t possibly be safe unless we’re on a trusted network. Given we’re going to run this over either virtual bridges, or VPN connections, perhaps it’s a valid option for us (that would simplify VM creation), instead of having to manually manage client certificates.

>>>>> SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.

> This can be done using exported resources.

Cool.

>>>>> Additionally, if this optional (but warmly welcome) requirement is satisfied, then the “shared Puppet modules” we use SHOULD already support the chosen monitoring system (hint: in practice, this means something compatible with Nagios).

> Since the upstream icinga2 puppet module is not in a really good shape (master branch is old and require a different puppet-apt module than the one we use, and the dev branch is under an ongoing rewrite that makes it not usable for us at the moment), we’ll have to use our own exported resources and icinga2 manifests.

Any pointer to an example of how it would be done in practice?

Did I get it right that the nagios_* Puppet resources are not suitable for our needs? If yes, can you please very quickly tell me why?

>>>>> This MUST isn’t met. When researching Zabbix and Icinga2. I came to the conclusion that both software packages don’t support SOCKS proxies.

> Here I’ve found a better option: with icinga2 one can easily override the configuration of a check by creating a new one that just import it, and change only the part it needs to be different. So say you have a already existing check that is:
> […]
> Then this new check will have the same properties than its parent, but will be run using torsocks.

Looks perfect, thanks!

> So, the way I planned to deploy icinga2 according to the setup I’ve done locally is:

> * Install a full icinga2 server on Ecours. It means it will have the mysql server to store the data and the web interface, as well as be the one responsible of sending the notifications.

OK.

> * Install another icinga2 server on Lizard (host). This one will be configured as a satellite of Ecours’ one, meaning it will be the one responsible of gathering datas from others monitored systems (atm Lizard’s VMs), and send them back to Ecours. For other future systems outside of Lizard, we’ll be able to install another satellite on them.

I have two questions about this part:

  • What’s the advantage of doing this, over monitoring each monitored system independently?
  • We need a very good reason to run services on a virtualization host. So, in this case: why run this icinga2 server on the host?

> * Icinga2 is taking quite a lot of memory, so I’m not sure we should install it on every VMs.

What is “quite a lot of memory”?

Did I get it right that if we want to implement the “Clients with Local Configuration” scenario, we need to run a full-blown icinga2 daemon (the icinga2 Debian package) on each VM?

> The other option is to use either the nagios-nrpe-server package (but then we have to decide if the bad SSL implementation described in the package bug page is a blocker, considering they will communicate on the libvirt interface only),

There are quite a lot of bug reports on that page, so I’ll assume you’re referring to https://bugs.debian.org/547092.

Please make it clear what impact this bug has in terms of compliance to the specification we wrote regarding network security. Then the answer of whether it’s acceptable or not should follow, for free.

Also, you might have missed that according to the upstream bug report, it is fixed in nrpe 2.16-rc2: see the corresponding documentation (I didn’t check how good the fix is, though).

> or use ssh to run the checks from Lizard’s icinga2.

If the checks are configured via Puppet on that icinga2 satellite system, and run using a non-privileged user account, IIRC this would satisfy our specs. Please check, if this is your preferred solution.

Have you looked into check-mk? IIRC that’s what some friends of ours use, and e.g. 1.2.6p12-1 is available from Wheezy to sid.

> The checks will be installed with the monitoring-plugins-* packages or distributed through puppet. Using exported resources, Ecours’ manifest will be configured to collect every checks set up for other hosts. Lizard’s icinga2 will be configured the same way. To ensure that Lizard’s icinga2 will not the listening directly on the internet to communicate with Ecour’s one, it will use the VPN we’ll probably setup for Feature #10760.

OK.

> Does it sound relevant to you?

Yes!

I have a few more questions.

This proposal seems to focus on local checks (CPU, swap, etc.), which makes sense since it’s the hardest problem you were facing. So: where would the client run for network checks? E.g. when checking if https://tails.boum.org/ is up, which system will initiate the HTTP connection?

What version of the monitoring system did you test, and on which OS? Same for the satellite component. We’re still running some Wheezy systems, so inter-operability between 2.1 and 2.4 seems important.

Just to clarify: if we’re going to configure the checks on the monitored machines (as opposed to doing it on the monitoring machine), we don’t need Puppet exported resources shared between the monitoring machine and the monitored ones, right? So in the end, we’re managing the monitoring machine with our existing puppetmaster for convenience only, not because the monitoring setup we choose requires it, right? I’m pretty sure I got it wrong, so please explain :)

#35 Updated by bertagaz 2016-02-09 16:54:45

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:
>
> > The SSL certificate is used by the monitored machine to authenticate to the monitoring machine. The private key can be created on any host that has access to the CA private key. This one is hosted on the monitoring machine.
>
> This was still very unclear to me, so I’ve read some doc and my conclusion is that the way icinga2 uses SSL is very similar to Puppet’s one.
>
> I didn’t look at how it works exactly, but at first glance “CSR Auto-Signing” can’t possibly be safe unless we’re on a trusted network. Given we’re going to run this over either virtual bridges, or VPN connections, perhaps it’s a valid option for us (that would simplify VM creation), instead of having to manually manage client certificates.

Beware that the online documentation is about icinga2 2.4 only, which has a bunch of new features compared to 2.1. I’ve already tested this, and it appears that the icinga2 CLI app they mention, which helps with CSR auto-signing, is not available in the 2.1 version that is currently in Jessie. So we’ll have to manage the certificates manually until we get Ecours running Stretch.

> > Since the upstream icinga2 puppet module is not in a really good shape (master branch is old and require a different puppet-apt module than the one we use, and the dev branch is under an ongoing rewrite that makes it not usable for us at the moment), we’ll have to use our own exported resources and icinga2 manifests.
>
> Any pointer to an example of how it would be done in practice?

Not really, but it will be a matter of creating our own defines for hosts, then gathering them where needed with exported resources (master and satellite). Thanks to the icinga2 configuration we can add “groups” to these host declarations that will enable a bunch of checks common to several of them.

> Did I get it right that the nagios_* Puppet resources are not suitable for our needs? If yes, can you please very quickly tell me why?

Nope, the way they are used to configure Nagios checks is no longer compatible with the way checks are configured in icinga2.

> > * Install another icinga2 server on Lizard (host). This one will be configured as a satellite of Ecours’ one, meaning it will be the one responsible of gathering datas from others monitored systems (atm Lizard’s VMs), and send them back to Ecours. For other future systems outside of Lizard, we’ll be able to install another satellite on them.
>
> I have two questions about this part:
>
> * What’s the advantage of doing this, over monitoring each monitored system independently?

Simplicity of the configuration, mainly: we won’t have to configure each host and its checks plus the monitoring software, but can use exported resources to configure only the satellite (to collect data from the other systems) and the master.

> * We need a very good reason to run services on a virtualization host. So, in this case: why run this icinga2 server on the host?

My assumption, which is now wrong, was that we didn’t have enough resources to use another VM just for this task. I also assumed that, given Lizard is the host, it already has a lot of power over the VMs, so giving it access to every one of them wasn’t lowering overall security.

But the situation regarding resources has changed a bit now, so we could envisage hosting it in a dedicated VM. Still, depending on the local agent we choose (nagios-nrpe-server, ssh, or a full icinga2 instance), it may have unprivileged access to every other VM AND the Lizard host.

> > * Icinga2 is taking quite a lot of memory, so I’m not sure we should install it on every VMs.
>
> What is “quite a lot of memory”?

My icinga2 instances (icinga2 is a huge C++ app that uses libs not all needed by other software) are using:

  • for the monitoring system: VSZ: 655544, RSS: 16704
  • for the satellite system: VSZ: 972848, RSS: 18284

So that’s not so huge in terms of RSS, but given it depends on a bunch of libboost libraries that are not needed by other software on those systems, it’s still quite a lot in terms of VSZ.

> Did I get it right that if we want to implement the “Clients with Local Configuration” scenario, on each VM me need to run a full blown icinga2 daemon? (icinga2 Debian package)

In the setup I’ve proposed, the “Client with local configuration” would be the satellite system only. The others would only have agents of some sort, or ssh access.

> > The other option is to use either the nagios-nrpe-server package (but then we have to decide if the bad SSL implementation described in the package bug page is a blocker, considering they will communicate on the libvirt interface only),
>
> There are quite a lot of bug reports on that page, so I’ll assume you’re referring to https://bugs.debian.org/547092.

Yes.

> Please make it clear what impact this bug has in terms of compliance to the specification we wrote regarding network security. Then the answer of whether it’s acceptable or not should follow, for free.

If I get it right, it means that any compromised monitored machine using this package would be able to decrypt (and probably re-encrypt, so fully tamper with) the communication of other monitored machines that are using this software. In the proposed design, this would mean that the communications on the libvirt bridge won’t be very safe, but that’d be on the internal network only.

> Also, you might have missed that according to the upstream bug report, it is fixed in nrpe 2.16-rc2: see the corresponding documentation (I didn’t check how good the fix is, though).

Good catch. Unfortunately, I’m not sure it will be helpful. We don’t have time to wait for this package to be updated in Debian and backported for Jessie.

> > or use ssh to run the checks from Lizard’s icinga2.
>
> If the checks are configured via Puppet on that icinga2 satellite system, and run using a non-privileged user account, IIRC this would satisfy our specs. Please check, if this is your preferred solution.

That’s the way it would be deployed, yes. I admit I don’t have a preferred solution here, but this one is not more difficult to deploy.
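
For what it’s worth, an ssh-based check could be wrapped in a CheckCommand much like the torsocks example above; a minimal sketch using check_by_ssh from monitoring-plugins (the login, plugin path and thresholds are placeholders):

object CheckCommand "disk_by_ssh" {
        import "plugin-check-command"

        command = [ PluginDir + "/check_by_ssh" ]

        arguments = {
                "-H" = "$address$"
                "-l" = "nagios"
                "-C" = "/usr/lib/nagios/plugins/check_disk -w 20% -c 10%"
        }
}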

> Have you looked into check-mk? IIRC that’s what some friends of ours use, and e.g. 1.2.6p12-1 is available from Wheezy to sid.

It’s not compatible with icinga2, only with icinga1, because icinga1 was a Nagios fork and the code was completely rewritten for icinga2.

> I have a few more questions.
>
> This proposal seems to focus on local checks (CPU, swap, etc.), which makes sense since it’s the hardest problem you were facing. So: where would the client run for network checks? E.g. when checking if https://tails.boum.org/ is up, which system will initiate the HTTP connection?

Right, forgot this part. For the external checks, my plan was to run them on ecours.t.b.o.

> What version of the monitoring system did you test, and on which OS? Same for the satellite component. We’re still running some Wheezy systems, so inter-operability between 2.1 and 2.4 seems important.

I tested 2.1 in Jessie, and installed 2.4 in Jessie using the upstream repo on another host, to check interoperability between them, and it works fine.

> Just to clarify: if we’re going to configure the checks on the monitored machines (as opposed to doing it on the monitoring machine), we don’t need Puppet exported resources shared between the monitoring machine and the monitored ones, right? So in the end, we’re managing the monitoring machine with our existing puppetmaster for convenience only, not because the monitoring setup we choose requires it, right? I’m pretty sure I got it wrong, so please explain :)

The checks will be configured on the satellite system and the master only. The satellite will either query an agent (icinga2 or nagios-nrpe-server) to get the results of the checks, or run them through ssh on the monitored machines it is responsible for. So here exported resources will be handy to configure both the satellite and the master, by having their manifests collect the host and check declarations. So we need the ability to use the puppetmaster and its puppetdb on the monitoring machine in this scenario.

#36 Updated by intrigeri 2016-02-09 18:28:28

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

> So we’ll have to manage the certificates manually until we get Ecours running Stretch.

I don’t quite understand what you mean: I see icinga2 2.4.1-1~bpo8+1 in jessie-backports, our spec anyway makes it clear that we’re willing to run Debian testing if needed on the monitoring machine, and in most cases we might be able to produce backports if needed anyway.

>> Any pointer to an example of how it would be done in practice?

> Not really, but that will be a matter of creating our own defines for hosts, then gather them where needed with exported resources (master and satellite).

This sounds nice and all in theory. I guess you’ll understand why I’ll really believe it only once I see some actual working code samples :)

>> Did I get it right that the nagios_* Puppet resources are not suitable for our needs? If yes, can you please very quickly tell me why?

> Nop, the way they are used to configure nagios checks are not compatible anymore with the way checks are configured in icinga2.

So this is a problem on the collecting side, that produces the configuration. OK.

Can’t we collect these nagios_* resources ourselves, and use them to configure stuff? This would give us compatibility with existing Puppet modules that configure other services and declare them using nagios_* to the monitoring system.

>> * What’s the advantage of doing this, over monitoring each monitored system independently?

> Simplicity of the configuration mainly: we won’t have to configure each host and its checks, plus the monitoring software, but can use exported resource to configure only the satellite to collect data from other systems, and the master.

I don’t quite understand what you mean, let’s chat about it tomorrow.

>> * We need a very good reason to run services on a virtualization host. So, in this case: why run this icinga2 server on the host?

> […] But the situation has changed a bit now regarding resources, so we could envisage to host it in a dedicated VM.

OK, so let’s do this, unless some new very good reason pops up and forces us to run it on the host.

>> What is “quite a lot of memory”?

> My icinga2 instances (which is a huge C app that is using libs not all used by other software), are using:

We’re talking about live instances that have been doing their job for a while, right?

> * for the monitoring system: VSZ: 655544, RSS: 16704
> * for the satellite system: VSZ: 972848, RSS: 18284

> So that’s not so huge in term of RSS, but given it depends on a bunch of liboost libs that are not needed on other systems by other softwares, it’s still quite a lot in term of VSZ.

I don’t think we care about this high VSZ as long as these shared libraries are not actually loaded in memory.

So looking at the RSS value, this seems pretty small to me, compared to the fact we run Puppet on these systems, and the kind of memory we grant to our VMs. So the memory argument in favour of not running Icinga 2 elsewhere seems to be extremely weak.

>> Please make it clear what impact this bug has in terms of compliance to the specification we wrote regarding network security. Then the answer of whether it’s acceptable or not should follow, for free.

> If I get it right, it means that any compromised monitored machine using this package would be able to decrypt (and probably re-encypt, so full tampering of the communication) the communication of other monitored machines that are using this software. In the proposed design, this would mean that the communications on the libvirt bridge won’t be very much safe, but that’d be on the internal network only.

This much I had already understood, it’s pretty obvious when reading the bug report. At some point you’ll need to re-read my question and answer it, if this is the way you want to go (or as a cheap way to remove an unacceptable solution from the table, and simplify decision making, perhaps; we’ll see).

> I tested 2.1 in Jessie, and installed 2.4 in Jessie using the upstream repo on another host, to check interoperability between them, and it works fine.

Cool, thanks! Any reason why you didn’t use the package from jessie-backports? (Perhaps because it’s only been there for 3 weeks?)

> The checks will be configured on the satellite system and the master only. The satellite will either query an agent (icinga2 or nagio-nrpe-server) to get the result of the checks, or run them through ssh on the monitored machines it is responsible for. So here exported resources will be handful to configure both the satellite and the master, by having their manifest collecting the hosts and checks declarations. So we need the ability to use the puppetmaster and its puppetdb on the monitoring machine in this scenario.

OK, this is still pretty blurry to me. I guess I’ll understand better once I’ve seen actual code samples that implement this design.

#37 Updated by bertagaz 2016-02-14 13:39:59

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 70 to 80
  • QA Check changed from Dev Needed to Info Needed

So after a chat about this, we took some decisions:

  • We won’t bother automating the SSL cert creation and signing. We won’t have to do this operation that often once the setup is deployed.
  • We’ll install icinga2 instances as agents on the monitored machines; my weak understanding of how memory is managed led me to wrong conclusions.
  • We won’t use icinga2’s internal configuration distribution from the master/satellite to the monitored machines, but will use Puppet to configure and install the checks and hosts everywhere needed.

I’ll start deploying icinga2 instances with Puppet with this design in mind, and then go on testing by hand and puppetizing what works, in a more agile fashion.

Do you agree that’s what we came up with?

#38 Updated by intrigeri 2016-02-15 16:04:26

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

> Do you agree that’s what we came up with?

I don’t remember, so I’ll trust the notes you’ve hopefully kept during/after our meeting.

#39 Updated by bertagaz 2016-02-16 13:14:45

intrigeri wrote:
> > Do you agree that’s what we came up with?
>
> I don’t remember, so I’ll trust the notes you’ve hopefully kept during/after our meeting.

Yeah, I’ve kept the conversation window open and tried to sum up from there.

#40 Updated by intrigeri 2016-02-16 14:07:44

> Yeah, I’ve kept the conversation window opened and tried to sum up from there.

Awesome :)

#41 Updated by bertagaz 2016-03-10 18:51:02

  • Target version changed from Tails_2.2 to Tails_2.3

#42 Updated by bertagaz 2016-04-20 08:04:34

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

Hmm, I think we kept this ticket open to discuss details, but in the end this kind of discussion actually happens on other tickets, and this ticket hasn’t been active for 2 months. I’ll mark this ticket as resolved, and go on like that.

#43 Updated by intrigeri 2016-04-25 02:34:17

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 80 to 100
  • QA Check changed from Ready for QA to Pass

OK. Keep in mind that we’ll need a high-level description of how the whole thing works in the end, somewhere.

#44 Updated by bertagaz 2016-04-25 03:08:18

intrigeri wrote:
> OK. Keep in mind that we’ll need a high-level description of how the whole thing works in the end, somewhere.

Right, created Feature #11366 to track that.