Bug #16161

optimise PV placement for I/O performance

Added by groente 2018-11-28 11:30:32. Updated 2018-12-16 11:30:15.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version: Tails_3.12
Start date: 2018-11-28
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:
Deliverable for:

Description

LVs should be spread across the different RAID arrays for optimal I/O performance.
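
A minimal sketch of how the current LV-to-PV mapping can be inspected before deciding what to move (generic LVM reporting; nothing lizard-specific is assumed here):

    # List each LV together with the PV (RAID array) its extents live on,
    # so the heaviest I/O consumers can be spread across md3/md4/md5.
    sudo lvs -o lv_name,vg_name,lv_size,devices --units g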


Subtasks


Related issues

Related to Tails - Feature #16041: Replace rotating drives with new SSDs on lizard (Resolved, 2018-10-11)

History

#1 Updated by groente 2018-11-28 11:30:42

  • Related to Feature #16041: Replace rotating drives with new SSDs on lizard added

#2 Updated by intrigeri 2018-11-28 12:36:38

  • Target version set to Tails_3.12
  • % Done changed from 0 to 20

I’m done with the initial attempt. I’ll check Munin in a week or two and will adjust things as needed.

#3 Updated by intrigeri 2018-11-29 13:23:47

OK, no need to wait a week to adjust a bit: after 24h it’s already clear that md4 still gets way more I/O than md3 and md5, and all the metrics agree (IOPS, latency, throughput). That’s not surprising: my initial plan only covered the top I/O consumers, and almost everything else is still on md4. I’ll move some more stuff out of it.

#4 Updated by intrigeri 2018-11-29 13:31:12

Moved root, jenkins-system and bittorrent-system from md4 to md5.
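
For reference, moves like this can be done online with pvmove; a minimal sketch, assuming the arrays are exposed as the PVs /dev/md4 and /dev/md5 (the actual device names on lizard may differ):

    # Move the extents of one LV off the md4 PV onto the md5 PV, online.
    # LV and device names here are illustrative assumptions.
    sudo pvmove -n jenkins-system /dev/md4 /dev/md5
    sudo pvmove -n bittorrent-system /dev/md4 /dev/md5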

#5 Updated by intrigeri 2018-12-08 08:01:20

Over the last 7 days:

  • read IOPS:
    • the average is fairly well balanced, though md4 is noticeably lower than the other arrays
    • the max is much higher on md5 than elsewhere
  • write IOPS:
    • the average is much higher on md4 than on the other arrays
    • the max is very low on md3, and 72% higher on md4 than on md5
  • disk latency:
    • the average is good on md3 and md5 but 16 times higher on md4
    • the max is OK on md3 and md5 but much higher (2.13s) on md4, which is somewhat concerning
  • throughput, utilization: nothing remarkable
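
These numbers come from Munin; for an ad-hoc spot check directly on the host, something like the following works (a sketch, assuming the sysstat package is installed and the arrays show up as md3/md4/md5):

    # Extended per-device statistics for the md arrays: IOPS (r/s, w/s),
    # throughput, await latency and %util. The second report covers a
    # fresh 60-second interval (the first one is averages since boot).
    iostat -dx 60 2 md3 md4 md5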

#6 Updated by intrigeri 2018-12-08 08:24:16

Zooming in, only the md4 write IOPS and latency are still concerning. They are fully correlated with jenkins-data, and they increased substantially since we started generating USB images on all branches (Feature #16154), i.e. just after my initial analysis (Feature #16041#note-13) of the per-LV I/O consumption… too bad, but we knew this would have to be an incremental process anyway. Apart from jenkins-data, the biggest consumer of write I/O on md4 is isobuilder3-data, so I’m moving it to md3, which is relatively underloaded in terms of write I/O. This does not leave much free space on md3 (30GB), but it should improve things until we actually need more space there.
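
A sketch of how the remaining headroom on each array can be checked after such a move (plain LVM reporting; the PVs being the md arrays is an assumption):

    # Show total and free space per PV, to confirm how much room is left
    # on md3 after isobuilder3-data has been moved there.
    sudo pvs -o pv_name,vg_name,pv_size,pv_free --units g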

#7 Updated by intrigeri 2018-12-16 11:30:15

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 20 to 100

md4’s average latency (73ms) is now “only” about 7.5 times higher than on the other arrays, i.e. about half of what it was before my last changes. md4’s average write IOPS is now “only” about twice that of the other arrays. I think this is good enough. Other metrics are OK.

The only way I see to balance this even better would be to spread jenkins-data over multiple arrays. Such an allocation scheme would be more difficult to maintain: we already have a hard time ensuring LVs remain on “their” PV when we grow them as part of day-to-day system operations, and I’d rather not make this even harder => IMO spending more time on this ticket would not be worth the cost/benefit.
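
For completeness, “spreading jenkins-data over multiple arrays” would mean something like a striped LV; a sketch of what that would look like, with the VG name, size and PV names as illustrative assumptions (doing this for an existing LV would also require migrating its data):

    # Create an LV striped across two PVs; every lvextend afterwards must
    # preserve the same striping, which is the maintenance burden mentioned above.
    # 'vg0', the size and the PV names are assumptions, not lizard's real layout.
    sudo lvcreate --stripes 2 --stripesize 64k -L 200G -n jenkins-data vg0 /dev/md3 /dev/md5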