Feature #7631

Get a server able to run our automated test suite

Added by intrigeri 2014-07-20 18:48:20. Updated 2015-03-22 12:12:20.

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Continuous Integration
Target version:
Start date:
2015-01-01
Due date:
% Done:
100%

Feature Branch:
Type of work:
Communicate
Starter:
0
Affected tool:
Deliverable for:

Description

Either a dedicated server (but then we have to set up proper communication channels with lizard), or an upgrade of lizard to Haswell (so that nested virtualization works fast).


Subtasks

Bug #8506: Fix virtual network problems on lizard (Resolved, 100% done)


Related issues

Related to Tails - Feature #9264: Consider buying more server hardware to run our automated test suite Resolved 2015-12-15
Blocks Tails - Feature #6564: Deploy a platform for automated testing Resolved

History

#1 Updated by intrigeri 2014-07-20 19:02:25

  • Description updated

#2 Updated by intrigeri 2014-07-20 19:04:29

#3 Updated by intrigeri 2014-07-20 19:04:58

  • blocks Feature #6564: Deploy a platform for automated testing added

#4 Updated by intrigeri 2014-07-20 19:05:08

  • Category changed from Infrastructure to Continuous Integration

#5 Updated by intrigeri 2014-08-02 16:42:02

  • Target version set to Hardening_M1

(Subtask of a ticket flagged for 3.0.)

#6 Updated by intrigeri 2014-10-08 11:26:24

Our needs

  • 12 * current VMs (without builders) + 6 * upcoming VMs (average = 1 CPU thread, 1GB of RAM)
    • 18 CPU threads
    • 18 GB of RAM
  • 2 * ISO builders
    • 16 CPU threads
    • 20 GB of RAM
  • 2-4 * ISO testers
    • 6-12 CPU threads
    • 40-80 GB of RAM

=> Maximum total = 46 threads, 118 GB of RAM
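
As a quick sanity check, here is a minimal sketch of the arithmetic behind the maximum total above; the per-class figures come from the list, and the script itself is only illustrative:

    # Illustrative only: recompute the "maximum total" from the per-class
    # estimates listed above, taking the upper bound of each range.
    requirements = {
        "misc VMs (12 current + 6 upcoming)": {"threads": 18, "ram_gb": 18},
        "2 ISO builders": {"threads": 16, "ram_gb": 20},
        "2-4 ISO testers (upper bound)": {"threads": 12, "ram_gb": 80},
    }

    total_threads = sum(r["threads"] for r in requirements.values())
    total_ram_gb = sum(r["ram_gb"] for r in requirements.values())
    print(f"Maximum total = {total_threads} threads, {total_ram_gb} GB of RAM")
    # => Maximum total = 46 threads, 118 GB of RAM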

Our options

(Removed; superseded by the blueprint, which uses different option naming, so this part was confusing.)

#7 Updated by sajolida 2014-10-09 00:16:53

We already had part of this discussion in private, but could you detail
the pros and cons of each option a bit better? And maybe give an
approximate budget.

#8 Updated by bertagaz 2014-10-17 11:10:21

  • Status changed from Confirmed to In Progress
  • Blueprint set to https://tails.boum.org/blueprint/hardware_for_automated_tests/

I’ve created a blueprint to take notes on the quotes for the different options, based on the specifications written here, along with their pros and cons. Please review and comment.

#9 Updated by intrigeri 2014-10-17 14:26:26

Excellent job, thanks! I’ll take a look at it in the next few days.

#10 Updated by sajolida 2014-10-18 00:57:40

  • blocks #8117 added

#11 Updated by sajolida 2014-10-18 01:11:12

Thanks a lot for that excellent work!

I’m not in the sysadmin team so I might lack some technical knowledge
regarding our needs. But reading your blueprint I got quite convinced by
option A. I’m not worried at all by the small extra cost for option A
over option B:

  • Option B will require more sysadmin work.
  • Hosting two machines will likely cost more money in the long run (rack
    space, energy, broken parts, etc.).

I think that a key aspect of that decision is whether this machine will
be enough for us for the next 2-3 years. Because if we will need to go
through the extra sysadmin work anyway, not so long from now, then I
might reconsider going for option B now, with a possible CPU upgrade in a
couple of years.

If we’re too much in doubt regarding our future needs for the next 2-3
years I would still go for option A, I think.

Regarding selling our hardware back to Riseup: that’s fine with me. But
I would also consider donating it. I think that our accounting will be
fine either way.

#12 Updated by bertagaz 2014-10-18 04:34:28

Let’s assume that the canonical option names are the ones in the blueprint.

sajolida wrote:
> I’m not in the sysadmin team so I might lack some technical knowledge
> regarding our needs. But reading your blueprint I got quite convinced by
> option A. I’m not worried at all by the small extra cost for option A
> over option B:
>
> * Option B will require more sysadmin work.
> * Hosting two machines will likely cost more money in the long run (rack
> space, energy, broken parts, etc.).
>
> I think that a key aspect of that decision is whether this machine will
> be enough for us for the next 2-3 years. Because if we will need to go
> through the extra sysadmin work anyway, not so long from now, then I
> might reconsider going for option B now, with a possible CPU upgrade in a
> couple of years.
>
> If we’re too much in doubt regarding our future needs for the next 2-3
> years I would still go for option A, I think.

Well, option B would probably come with some burdens anyway, the first one
being the headache of deciding where to put our VMs. They won’t easily
fit with 24 threads on 2 boxes. We would probably have to split the ISO
builders across the 2 boxes, which is probably too much of a nightmare. How
does that sound to the other admins?

We’re probably left with options A and C in reality, but correct me if
I’m wrong.

I’m wondering whether option A is sustainable enough for the coming 2-3
years. It means only 2 CPU threads would be left.

On the other hand, for “not much” more ($800 + hosting cost), option C
brings the advantages that go with having all the ISO builders and
testers on one dedicated machine, with the possibility of deploying more
if needed.

With that option, I’m not sure the sysadmin work would be that much higher;
at least the question of how the two machines exchange the ISOs and
test results becomes irrelevant. It also brings a bit of isolation between
the lizard services and our testing infrastructure.

We also wouldn’t need to think about new hardware anytime soon with that
one. But it’s expensive.

I admit I’m a bit undecided between these two options; I’m not sure about
the expected growth of our infrastructure needs in the upcoming years.

> Regarding selling our hardware back to Riseup: that’s fine with me. But
> I would also consider donating it. I think that our accounting will be
> fine either way.

I’d be all for the donation route, I think. Riseup just started its funding
campaign.

#13 Updated by sajolida 2014-10-19 11:40:52

OK, so if option B has to be discarded, then yeah, I could prefer option
C too, which seems like spending a bit more money to buy a bit of
tranquility.

Just for the sake of having all the relevant money information at hand,
would the hosting of this server cost the same as for lizard?

Because it seems really cheap to me, to the point of being almost
irrelevant to this discussion (at least in comparison with the cost of
the hardware).

I’m not sure whether I can paste that info here; intrigeri, feel free to
do so :)

Maybe we can wait for intrigeri to answer on this before making a final
decision as he’s the other half of the sysadmin team.

#14 Updated by bertagaz 2014-10-27 10:13:42

sajolida wrote:
> OK, so if option B has to be discarded, then yeah, I could prefer option
> C too, which seems like spending a bit more money to buy a bit of
> tranquility.
>
> Just for the sake of having all the relevant money information at hand,
> would the hosting of this server cost the same as for lizard?

The hosting cost would probably be a bit higher than for lizard: if option C
is chosen, the new machine would have twice as many CPUs as lizard,
albeit also low-TDP ones. And electricity is the limiting factor
at the colo…

> Maybe we can wait for intrigeri to answer on this before making a final
> decision as he’s the other half of the sysadmin team.

Sure.

A bit more math about option C though.

With 48 CPU threads for builders and testers only, that’d give:

- 3 builders (24 CPU threads)
- 8 testers (24 CPU threads)
or
- 4 builders (32 threads)
- 5 testers (15 threads)

The first option sounds more reasonable, as test runs take way longer than
builds (when they run the whole test suite); see the sketch at the end of
this comment for the per-VM thread assumptions behind these figures.

Hard to say for now how that would cope with Feature #5288.
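
For what it’s worth, here is a small sketch of the arithmetic above. The assumption of 8 threads per builder and 3 threads per tester is mine; it is consistent with the figures quoted, but not stated explicitly:

    # Sketch only: assumes 8 threads per ISO builder and 3 per ISO tester,
    # which matches the figures quoted above but is not stated explicitly.
    THREADS_PER_BUILDER = 8
    THREADS_PER_TESTER = 3
    TOTAL_THREADS = 48

    def used_threads(builders: int, testers: int) -> int:
        """CPU threads consumed by a given mix of builder and tester VMs."""
        return builders * THREADS_PER_BUILDER + testers * THREADS_PER_TESTER

    for builders, testers in [(3, 8), (4, 5)]:
        print(f"{builders} builders + {testers} testers "
              f"-> {used_threads(builders, testers)}/{TOTAL_THREADS} threads")
    # 3 builders + 8 testers -> 48/48 threads
    # 4 builders + 5 testers -> 47/48 threads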

#15 Updated by intrigeri 2014-10-27 10:24:54

> The hosting cost would probably be a bit higher than for lizard: if option C
> is chosen, the new machine would have twice as many CPUs as lizard,
> albeit also low-TDP ones. And electricity is the limiting factor
> at the colo…

Do we have information about which one(s) of the options are actually possible at the colo?

#16 Updated by bertagaz 2014-10-27 10:33:47

intrigeri wrote:
> Do we have information about which one(s) of the options are actually possible at the colo?

Not yet; I didn’t contact them because I wanted to refine the plans a bit before asking. Maybe now is the right time, and I can start discussing it with them.

#17 Updated by intrigeri 2014-10-27 22:15:52

Sorry for the delay; I needed some time to look at this. I agree with most things that have been said here, so thanks a lot for working this out! I think that we have a consensus to drop option B, so I’ll ignore it in what follows.

A few comments and opinions:

  • The purchase cost difference between option A and option C is too small to be worth taking into account. Our other criteria seem more important to me, so I’ll dismiss this one.
  • I’m pretty sure that option A will be enough to cover our plans for 2015, and all currently planned or envisioned short-term additional needs. I’d like to point out that even option A leaves 8 CPU threads and 20 GiB of RAM for the host system and future needs. But our plans for the years after are unclear, so it’s hard to know for sure whether it’ll be enough for 2-3 years. OTOH, hardware gets cheaper every day, so buying computation power now because we might need it in 1.5 years doesn’t look like a good bet to me. IMO we simply lack, at the moment, the foresight that would be needed to make it worth purchasing bonus hardware and letting it sleep for a year or two. So, the “option A might not be enough in 2 years” argument doesn’t convince me at all.
  • I acknowledge that it would be good to isolate our Jenkins stuff from the rest of our infrastructure. This does speak in favor of option C.
  • I also acknowledge that if we put all builder and tester VMs on a dedicated box (option C), then the additional sysadmin work caused by having a 2nd box is not that big. But it exists nevertheless: we’ll still need to set up and maintain duplicated infrastructure (off the top of my head: puppetmaster, web server, backups, APT proxy, Git repo hosting), which hasn’t been taken into account in the plans that were happily drawn up for option C, as far as I can see. I’m wary of betting that we’ll have the resources to maintain more systems in the long run. Things have been improving recently, and more people have committed to giving a hand in the future, but I’m very much tempted to let time tell. When looking at things from this side, option A still seems much more reasonable than option C.
  • Regarding what we do with the current hardware we have, if we pick option A: I’d rather discuss that privately.

So, all in all, to me it looks like the compelling arguments are “better isolation” (option C) vs. “less work” (option A). As you have guessed already, I’m strongly leaning towards option A. We can still revisit this in 18-36 months, once we have some field experience with Feature #5288, and know our needs better.

Now, I think we lack some information to make a decision:

  • Do the quotes on the blueprint take into account sales tax? (It varies greatly depending on the state.)
  • Does our current case really work with the mobo from option A? (I know bertagaz is pretty sure it does, but we’ve already had bad surprises, and I’d like to know what our preferred expert thinks.)
  • Do option A’s and option C’s CPU / RAM / mobo combinations work? Same as above, I’d like to see bertagaz’ draft validated by our preferred hardware geek.
  • Does the chosen CPU support fast nested KVM? I guess so, since it’s Haswell, but I’d like to see this confirmed by actual data. IIRC, a few months ago I posted to -dev@ some links to documentation that explains exactly which CPU features are needed. If not, look up “nested virtualization” and “shadow turtles” (yes!) online. (A quick way to check the relevant flags on a running host is sketched after this list.)
  • Are there maybe better/alternate CPUs for our needs? I’d love to know what our preferred hardware expert thinks. In particular, if we’re allowed to pick CPUs that are not low-voltage, it could be a game changer (I’m thinking of the E5-2690 v3 and E5-2680 v3, which have faster cores that would be useful right now, as opposed to more cores that would mostly be sleeping until we find out how to use them, as in option C). This is related to my next points:
  • Can the colo handle upgrading lizard to a pair of CPUs with 120W TDP each? If yes, what would be the monthly cost?
  • Can the colo handle option A power-wise? If yes, what would be the monthly cost?
  • Can the colo handle option C, space and power-wise? If yes, what would be the monthly cost?
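
Regarding the nested KVM question above, here is a minimal sketch of how one could check the relevant bits on a running Linux host with an Intel CPU. The choice of flags and module parameters to inspect (vmx, ept, vpid, and the kvm_intel nested / enable_shadow_vmcs parameters) is my assumption of what matters for nested virtualization performance, not an authoritative list:

    # Minimal sketch, assuming a Linux host with an Intel CPU: report a few
    # CPU flags and kvm_intel parameters relevant to nested virtualization.
    from pathlib import Path

    def cpu_flags() -> set:
        """Collect the CPU flags reported in /proc/cpuinfo."""
        flags = set()
        for line in Path("/proc/cpuinfo").read_text().splitlines():
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
        return flags

    def kvm_intel_param(name: str) -> str:
        """Read a kvm_intel module parameter, if the module is loaded."""
        path = Path("/sys/module/kvm_intel/parameters") / name
        return path.read_text().strip() if path.exists() else "n/a"

    flags = cpu_flags()
    for flag in ("vmx", "ept", "vpid"):
        print(f"CPU flag {flag}: {'present' if flag in flags else 'missing'}")
    for param in ("nested", "enable_shadow_vmcs"):
        print(f"kvm_intel {param}: {kvm_intel_param(param)}")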

#18 Updated by sajolida 2014-10-28 17:29:02

> So, all in all, to me it looks like the compelling arguments are
> “better isolation” (option C) vs. “less work” (option A). As you have
> guessed already, I’m strongly leaning towards option A.

I’m all for that reasoning.

#19 Updated by sajolida 2014-11-06 20:34:23

  • Target version changed from Hardening_M1 to Tails_1.2.2

#20 Updated by intrigeri 2014-11-10 12:37:18

  • % Done changed from 0 to 20

Other information that would be useful to make a decision: if we can use CPUs that are not low-power, are there suitable AMD options?

Rationale: according to the “Nested virtualization” talk at KVM Forum, nested virtualization still seems to be much more mature and efficient on AMD than on Intel.

#21 Updated by bertagaz 2014-11-17 12:53:19

I agree that it’s probably too soon for us to go for option C. Option A sounds like a good first step to get a better vision of what we’ll need for Feature #5288. That’s where my mind has landed too, in the end.

I’ll get in touch with whoever is needed to get some input on the questions raised here.

Thanks for the useful inputs.

#22 Updated by sajolida 2014-11-23 11:14:54

  • Due date set to 2015-01-06

I’m setting the due date for this to 2015-01-06 (1.2.2). You’ll have it for the sprint!

#23 Updated by intrigeri 2014-12-05 23:35:58

  • Type of work changed from Research to Communicate

#24 Updated by anonym 2014-12-12 16:41:51

  • Target version changed from Tails_1.2.2 to Tails_1.2.3

#25 Updated by bertagaz 2015-01-01 13:01:46

  • related to Bug #8506: Fix virtual network problems on lizard added

#26 Updated by sajolida 2015-01-06 18:25:21

I’ve seen an invoice for that, so I guess that we have something now. What’s left before we can close that ticket?

#27 Updated by bertagaz 2015-01-06 19:02:08

sajolida wrote:
> I’ve seen an invoice for that, so I guess that we have something now. What’s left before we can close that ticket?

Yes, the hardware is there, but not yet installed. We are having a hard time finding a moment when the Tails sysadmins are online at the same time to do the switch.

Intrigeri, you said you’d have time for that after the 4th, but it seems complicated right now. I won’t have much availability before the CI sprint to do that either. If I ever do, is there a problem if I take care of it alone?

Otherwise, let’s assume we’ll do the switch at the very beginning of the CI sprint.

#28 Updated by intrigeri 2015-01-07 10:22:54

  • % Done changed from 20 to 50

bertagaz wrote:
> Yes, the hardware is there, but not yet installed.

Indeed, let’s keep this ticket open until the new server is actually running and usable.

> Intrigeri, you said you’d have time for that after the 4th, but it seems complicated right now. I won’t have much availability before the CI sprint to do that either. If I ever do, is there a problem if I take care of it alone?

Please do.

> Otherwise, let’s assume we’ll do the switch at the very beginning of the CI sprint.

During the CI sprint, I’d rather see us focus on whatever we should do with other attendees.

#29 Updated by intrigeri 2015-01-13 12:54:04

bertagaz, could you please sum up what’s left to do on this front (in your personal todo list, I guess) before we can close this ticket?

#30 Updated by bertagaz 2015-01-13 18:20:26

  • % Done changed from 50 to 80

Yes, sorry to have kept that “secret”.

For now, what is probably mandatory is to fix the DHCP firewalling. A patch will come soon.

Apart from that, I have only noted things like doing some research about X2APIC, activating USB 3 in the BIOS, and maybe getting a hash of every firmware on the box to compare them to the vendor ones. Those last ones are not blocking us from closing this ticket, I believe.

#31 Updated by intrigeri 2015-01-13 20:02:21

> For now, what is probably mandatory is to fix the DHCP firewalling. A patch will come soon.

Agreed.

> Apart from that, I have only noted things like doing some research about X2APIC,
> activating USB 3 in the BIOS, and maybe getting a hash of every firmware on the box
> to compare them to the vendor ones. Those last ones are not blocking us from closing
> this ticket, I believe.

Agreed.

#32 Updated by intrigeri 2015-01-15 03:54:48

  • related to deleted (Bug #8506: Fix virtual network problems on lizard)

#33 Updated by intrigeri 2015-01-15 03:56:03

  • Target version changed from Tails_1.2.3 to Tails_1.3

Postponing, even though it’s already 3.5 months late.

#34 Updated by bertagaz 2015-02-25 09:54:22

  • Target version changed from Tails_1.3 to Tails_1.3.2

Deferring to 1.3.1.

#35 Updated by sajolida 2015-02-26 15:13:15

I’m concerned to see that this ticket has been postponed for the fourth time. Take into account that this is blocking a funders’ deliverable (#8117) and cannot be postponed again this time. So I’m marking this as Elevated.

#36 Updated by intrigeri 2015-02-26 15:25:39

> I’m concerned to see that this ticket has been postponed for the fourth time.

Yay. The goal is to deal with it during this week, I think.

#37 Updated by bertagaz 2015-02-26 19:25:28

  • Assignee changed from bertagaz to intrigeri

Yes, I’ve just fixed Feature #8605, so once it is reviewed, we’ll be able to close this ticket. Assigning to intrigeri as he is in charge of this review.

#38 Updated by intrigeri 2015-02-26 21:18:18

  • Assignee changed from intrigeri to bertagaz

#39 Updated by intrigeri 2015-02-28 13:42:10

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • QA Check set to Pass

Congrats!

#40 Updated by BitingBird 2015-03-22 12:12:20

  • Target version changed from Tails_1.3.2 to Tails_1.3.1

#41 Updated by intrigeri 2015-04-19 16:09:47

  • related to Feature #9264: Consider buying more server hardware to run our automated test suite added

#42 Updated by intrigeri 2015-06-11 14:00:09

  • related to deleted (Feature #9264: Consider buying more server hardware to run our automated test suite)

#43 Updated by intrigeri 2015-11-23 01:43:57

  • related to Feature #9264: Consider buying more server hardware to run our automated test suite added