Feature #10503

Run erase_memory.feature first to optimize test suite performance

Added by intrigeri 2015-11-06 11:40:05. Updated 2016-03-08 19:03:16.

Status: Resolved
Priority: Normal
Assignee:
Category: Test suite
Target version:
Start date: 2015-11-06
Due date:
% Done: 100%
Feature Branch: test/10503-reorder-feature-execution
Type of work: Code
Blueprint:
Starter:
Affected tool:
Deliverable for: 270

Description

According to Feature #9401#note-4, the test suite’s peak disk usage in /tmp/TailsToaster should be around 15GB these days. This matters since we mount an ext4 filesystem on that directory and count on the Linux kernel’s caching to write as little of that data to disk as possible. At peak usage the system under test is allocated 8GB of RAM, so to avoid any writes to disk we would need at the very least 15+8GB of RAM on each isotesterN. We “only” give them 20GB each, so inevitably a few GB have to be written to disk at some point while running erase_memory.feature (and possibly re-read from disk into the cache soon after). This adds I/O load to lizard that we could happily do without (we run a bunch of other services there), and on less powerful systems reading and writing that many GB can have a significant impact on the test suite’s runtime.

According to Feature #9401#note-4 again, peak disk usage could be lowered to 8GB just by running erase_memory.feature first, which would magically fix all the aforementioned problems. It seems to be such low-hanging fruit, with such obvious advantages, that perhaps we can exceptionally allow ourselves this premature optimization and just do it. It should be good enough to verify that this change indeed has the intended effect on peak disk usage and doesn’t make things worse (the test suite’s run time should be a good enough metric).

Moreover, for Feature #9264 (deadline = end of 2015) we may want to run the test suite with absolutely everything in RAM, so decreasing the maximum RAM usage for such use cases by 30% could impact the hardware specs quite a bit.
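
As an aside, a minimal way to spot-check whether this data actually hits the disk on an isotester is to watch the kernel’s dirty/writeback counters while the suite runs; this is purely an illustrative sketch, nothing this ticket depends on:

# Illustrative only: sample, every 5 seconds, how much data the kernel is
# holding back in the page cache vs. actually writing out.
while true; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 5
done
# Alternatively, "vmstat 5" shows the block-out rate ("bo") during a run.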


Subtasks


Related issues

Related to Tails - Feature #9264: Consider buying more server hardware to run our automated test suite (Resolved, 2015-12-15)
Related to Tails - Bug #11582: Some upgrade test scenarios fail due to lack of disk space on Jenkins (Resolved, 2016-07-21)

History

#1 Updated by intrigeri 2015-11-06 11:40:20

  • blocks #8668 added

#2 Updated by anonym 2015-11-06 12:16:38

  • Target version set to Tails_1.8

#3 Updated by intrigeri 2015-12-05 05:44:01

  • related to Feature #9264: Consider buying more server hardware to run our automated test suite added

#4 Updated by anonym 2015-12-16 14:05:58

  • Target version changed from Tails_1.8 to Tails_2.0

#5 Updated by intrigeri 2016-01-06 13:57:22

  • Target version changed from Tails_2.0 to Tails_2.2

#6 Updated by intrigeri 2016-02-05 14:51:26

FYI we’re in the process of updating/refining the hardware config used to run the test suite via Jenkins, so it would be helpful to reach a conclusion here.

#7 Updated by anonym 2016-02-05 16:17:49

  • Status changed from Confirmed to In Progress

Applied in changeset commit:e6aefe9200b585cd10c81676c2b850331a69f2ae.

#8 Updated by anonym 2016-02-05 16:20:22

  • Assignee changed from anonym to intrigeri
  • % Done changed from 0 to 50
  • QA Check set to Ready for QA
  • Feature Branch set to test/10503-reorder-feature-execution

I believe the current situation yields the highest possible peak disk usage. The feature branch will, I believe, bring it to the smallest possible peak disk usage. :)

#9 Updated by anonym 2016-02-05 18:28:19

I ran a custom feature that runs all the non-temporary snapshot-restoring steps, and it came in at around 7.5 GiB. So, considering only the obvious disk space offenders (excluding logs, videos, screenshots and other such optional artifacts), the peak disk usage is:

  max($biggest_ram_dump, $biggest_non_snapshot_disk, $internal_snapshot_disk)
= max(8 GiB, $tails_iso_size, 7.5 GiB)
= 8 GiB.


(Note: for $biggest_non_snapshot_disk = $tails_iso_size, see the “Tails can boot from live systems stored on hard drives” scenario in untrusted_partitions.feature)

Earlier I think it must have been

8 GiB + $internal_snapshot_disk = 8 + 7.5 GiB = 15.5 GiB,


so this is quite a nice improvement. :)

#10 Updated by anonym 2016-02-05 18:46:19

Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.

#11 Updated by intrigeri 2016-02-05 19:25:39

> Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.

The reasoning in the ticket description, or the latest one you posted?

#12 Updated by anonym 2016-02-05 19:36:42

anonym wrote:
> Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.

Running usb_install.feature while all other snapshots exist brings us to 8.6 GiB. Pushed commit:f35824c to deal with this.

#13 Updated by anonym 2016-02-05 19:39:18

intrigeri wrote:
> > Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.
>
> The reasoning in the ticket description, or the latest one you posted?

The latest one. So much has changed regarding snapshots that the old reasoning isn’t completely valid any more, so I’m mostly disregarding it.

#14 Updated by intrigeri 2016-02-09 18:46:56

  • Assignee changed from intrigeri to anonym
  • QA Check changed from Ready for QA to Info Needed

Code review passes, but I wonder why we need the if not intersection.empty? conditional: the 3 statements it guards would be no-ops anyway if the condition were false, no?

I’ll now run some tests. Do you already have some measurements for the peak disk usage during a complete run? (Don’t bother if you don’t have this info handy.)

#15 Updated by intrigeri 2016-02-11 00:46:09

I see peak disk usage is 12553240 here.

#16 Updated by intrigeri 2016-02-11 00:56:31

intrigeri wrote:
> I see peak disk usage is 12553240 here.

On Bug #10396#note-10 we measured that it was “almost 12G” on lizard early in December. I expected that the reordering would lower this number. Maybe the @fragile tagging on lizard explains why we don’t save much (because I’m comparing a full run today against a partial run on lizard two months ago).

Anyway: regardless of peak disk space usage, just running memory erasure first should decrease the maximum of (RAM allocated to the system under test + temporary disk space we would like to keep in the memory cache), which is what this ticket is actually about IIRC. We haven’t measured that yet IIRC, but if I’m happy with this branch and merge it, I’ll measure how much RAM an isotester needs these days to run the test suite without hitting disk. I’ll need this for Feature #11011 anyway.

#17 Updated by intrigeri 2016-02-11 01:24:03

  • % Done changed from 50 to 90
  • QA Check changed from Info Needed to Pass

Tests pass. Feel free to merge, with or without the minor style change I suggested above.

#18 Updated by intrigeri 2016-02-11 01:24:21

  • Subject changed from Consider running erase_memory.feature first to optimize test suite performance to Run erase_memory.feature first to optimize test suite performance
  • Type of work changed from Discuss to Code

#19 Updated by anonym 2016-02-11 21:06:05

  • Status changed from In Progress to Fix committed
  • % Done changed from 90 to 100

Applied in changeset commit:c701b57d503318168bea5a0bdc62ddd848a6d0df.

#20 Updated by anonym 2016-02-11 21:07:01

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Pass to Info Needed

intrigeri wrote:
> Code review passes, but I wonder why we need the if not intersection.empty? conditional: the 3 statements it guards would be no-ops anyway if the condition were false, no?

That is what I initially did, but I found it a bit clearer and less “clever” (in the bad sense) to keep the explicit guard, so I think I’ll just leave it as-is.

> I’ll now run some tests. Do you already have some measurements for the peak disk usage during a complete run? (Don’t bother if you don’t have this info handy.)

Yeah, I did a full run which I apparently forgot to post here, sorry. I ran this:

# Every 5 seconds, sum the sizes of all memory state dumps and
# TailsToaster storage volumes, and append the total to a log file.
while true; do
  stat -c%s "${TMP}"/*.memstate "${TMP}"/TailsToasterStorage/* | \
    sum-column >> /tmp/sizes-monitor
  sleep 5
done


where sum-column is just a simple script that does what you’d expect.
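
Such a sum-column could be as trivial as the following awk one-liner; this is only an illustrative stand-in, not necessarily the exact script used:

#!/bin/sh
# Illustrative stand-in for sum-column: add up the numbers in the first
# column of stdin (here, file sizes in bytes from stat -c%s) and print the total.
awk '{ total += $1 } END { print total }'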

So I only monitored the storage and snapshots and I got a peak of 9104.2 MiB, but the other artifacts shouldn’t add up to much (<20 MiB at any point, typically).

> I see peak disk usage is 12553240 here.

I’m surprised you see over 3 GiB more than me. Did you run with --keep-snapshots or something? In that case the temporary snapshots are kept, and there are five of them, which I believe would add up to that difference. Since we don’t run the automated test suite with --keep-snapshots on Jenkins, your test runs shouldn’t use it either, if that’s what you did.

> Anyway: regardless of peak disk space usage, just running memory erasure first should decrease the maximum of (RAM allocated to the system under test + temporary disk space we would like to keep in the memory cache), which is what this ticket is actually about IIRC. We haven’t measured that yet IIRC, but if I’m happy with this branch and merge it, I’ll measure how much RAM an isotester needs these days to run the test suite without hitting disk. I’ll need this for Feature #11011 anyway.

ACK; I merged it into devel so you’re not blocked any more. I’m leaving the ticket as Info Needed, though, until you’ve answered my question above (I’d like to open another ticket to investigate this if you did not run with --keep-snapshots). Please close this ticket properly once you’ve answered!

#21 Updated by intrigeri 2016-02-13 01:37:40

> That is what I initially did, but I found it a bit clearer and less “clever” (in the bad sense) to keep the explicit guard, so I think I’ll just leave it as-is.

OK.

> Yeah, I did a full run which I apparently forgot to post here, sorry. I ran this:
> […]
> So I only monitored the storage and snapshots and I got a peak of 9104.2 MiB, but the other artifacts shouldn’t add up to much (<20 MiB at any point, typically).

>> I see peak disk usage is 12553240 here.

> I’m surprised you see over 3 GiB more than me.

So am I :)

> Did you run with --keep-snapshots or something?

I didn’t. But I had a test.feature lying around, not sure what was in there, nor if it matters.
I’ll re-run in a clean environment (because again, I need these results for Feature #11011).

> Please close this ticket properly once you’ve answered!

It’s Fix committed and will stay this way until 2.2 is out, so I think I should not close it.
Anyway, I’ll follow up on Feature #11011 or Feature #11113 about it.

#22 Updated by intrigeri 2016-02-14 13:29:35

  • Assignee deleted (intrigeri)
  • QA Check changed from Info Needed to Pass

#23 Updated by anonym 2016-03-08 19:03:16

  • Status changed from Fix committed to Resolved

#24 Updated by intrigeri 2016-07-21 02:57:11

  • related to Bug #11582: Some upgrade test scenarios fail due to lack of disk space on Jenkins added