Bug #16131

Broken Samsung SSD 850 EVO 1TB on lizard

Added by intrigeri 2018-11-17 09:42:51. Updated 2018-11-20 19:33:08.

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Infrastructure
Target version:
Start date:
2018-11-17
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

One of our two Samsung SSD 850 EVO 1TB drives on lizard is visible neither in the BIOS setup nor in the system. At boot I see:

[   57.604069] ata6: COMRESET failed (errno=-16)
[   62.660068] ata6: COMRESET failed (errno=-16)
[   62.664439] ata6: reset failed, giving up

This failure might be what caused the crazy system state that started around 01:40 today: 300+ load, no obvious CPU-hungry culprit.

So md0 and md1 are running in a degraded state. We need to either speed up Feature #16041 and migrate everything off the remaining 1TB SSD to the new drives, or replace the faulty 1TB drive (hoping it is the drive that is faulty, not the SATA port).
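For the record, a minimal sketch of how to inspect the degraded arrays (generic mdadm commands, nothing lizard-specific assumed):

# overview of all software RAID arrays; a degraded mirror shows up as [U_] or [_U]
cat /proc/mdstat
# per-array detail, including which member device dropped out
sudo mdadm --detail /dev/md0 /dev/md1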


Subtasks


Related issues

Related to Tails - Feature #16041: Replace rotating drives with new SSDs on lizard Resolved 2018-10-11
Blocks Tails - Feature #13242: Core work: Sysadmin (Maintain our already existing services) Confirmed 2017-06-29

History

#1 Updated by intrigeri 2018-11-17 09:43:20

  • blocks Feature #13242: Core work: Sysadmin (Maintain our already existing services) added

#2 Updated by intrigeri 2018-11-17 09:43:53

  • related to Feature #16041: Replace rotating drives with new SSDs on lizard added

#3 Updated by groente 2018-11-17 14:52:45

  • Assignee changed from groente to intrigeri
  • QA Check set to Ready for QA

i’d like to propose running:

pvmove /dev/mapper/md1_crypt /dev/disk/by-id/raid-md4_crypt

luckily there’s still enough room to move everything away from the degraded array, so this would get all our data redundant again.
since we had two identical disks in raid1 in /dev/md1, chances are quite considerable the last disk will fail soon as well.
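A rough sketch of the checks around that move, assuming the device paths above (the -b flag just backgrounds the pvmove so progress can be watched separately):

# confirm there is enough free space on the target PV before moving
sudo pvs -o pv_name,vg_name,pv_size,pv_free
# run the move in the background and watch its progress
sudo pvmove -b /dev/mapper/md1_crypt /dev/disk/by-id/raid-md4_crypt
sudo lvs -a -o name,copy_percent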

#4 Updated by intrigeri 2018-11-17 15:25:09

  • Status changed from Confirmed to In Progress
  • Assignee changed from intrigeri to groente
  • % Done changed from 0 to 10
  • QA Check deleted (Ready for QA)

Good idea! But please take note on Feature #16041 of which LVs were moved: I’ve put quite some effort into spreading the load across the different arrays (e.g. at some point I made sure that half the isobuilders used one array and the other half another). This spreading should probably be re-done at some point, but once the new drives are plugged in, at the very least I’d like to move the LVs you’re going to move “back” to their new array, so that md4 does not have to sustain the additional load alone forever. sudo lvdisplay --maps | grep -E 'LV Path|Physical volume' can be useful.
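One possible way to keep that note around, simply reusing the command above (the output file name is just an example):

# record the current LV-to-PV mapping before the pvmove, for re-balancing later
sudo lvdisplay --maps | grep -E 'LV Path|Physical volume' > lv-layout-before-pvmove.txt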

#5 Updated by groente 2018-11-17 19:20:40

  • Assignee changed from groente to intrigeri
  • % Done changed from 10 to 20
  • QA Check set to Info Needed

ok, i’ve made a note on Feature #16041 and initiated the move. this could take a while.

once the pvmove is done, i would suggest we pvremove /dev/md1_crypt, stop luks there, remove the md1 array, and unplug both our 1TB disks together with the spinning disks.

once the new disks are plugged in, we can have them fill the missing slots in md0.

does that sound like a good plan to you?
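For reference, a rough sketch of that teardown in command form, assuming the device names used above (the VG name is a placeholder, and the pvremove only works once the pvmove has finished and the PV holds no extents):

# remove the now-empty PV from the volume group, then wipe its PV label
sudo vgreduce lizard-vg /dev/mapper/md1_crypt
sudo pvremove /dev/mapper/md1_crypt
# close the LUKS mapping and stop the degraded array
sudo cryptsetup luksClose md1_crypt
sudo mdadm --stop /dev/md1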

#6 Updated by intrigeri 2018-11-17 19:38:23

  • Assignee changed from intrigeri to groente
  • QA Check changed from Info Needed to Dev Needed

> once the pvmove is done, i would suggest we pvremove /dev/md1_crypt, stop luks there, remove the md1 array, and unplug both our 1TB disks together with the spinning disks.

> once the new disks are plugged in, we can have them fill the missing slots in md0.

> does that sound like a good plan to you?

Sure. I’ve documented the steps to do so on the other ticket; it can be used as a checklist.

+ donate these 2 * 1TB SSDs to Riseup, being clear on their faulty/risky status, in case they want to use them :)

#7 Updated by groente 2018-11-18 20:47:41

  • Status changed from In Progress to Resolved
  • % Done changed from 20 to 100
  • QA Check deleted (Dev Needed)

so, i’ve:

- deactivated md1 (and md2, which was popping up again)

- taken sde1 out of md0 and shrunk the array to two disks

- zeroed all the relevant superblocks to keep systemd from bringing the arrays back up

- dd’d /dev/urandom 30 times in a row over the first 256MB of sde, sdg, and sdf

i’m closing this ticket now; everything left to be done falls under Feature #16041.
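For reference, roughly what those steps look like as commands (the exact partition names besides sde1 are assumptions here, and only a single dd pass over one disk is shown):

# stop the arrays that are no longer needed
sudo mdadm --stop /dev/md1 /dev/md2
# drop sde1 from md0 and shrink the mirror to two members
sudo mdadm /dev/md0 --fail /dev/sde1 --remove /dev/sde1
sudo mdadm --grow /dev/md0 --raid-devices=2
# keep the old members from being auto-assembled again
sudo mdadm --zero-superblock /dev/sde1 /dev/sdf1 /dev/sdg1
# overwrite the first 256MB of one of the old disks (repeat per disk and per pass)
sudo dd if=/dev/urandom of=/dev/sde bs=1M count=256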

#8 Updated by intrigeri 2018-11-19 07:39:07

> - taken sde1 out of md0 and shrunk the array to two disks

Accordingly, I’ve run dpkg-reconfigure grub-pc, which told me “The GRUB boot loader was previously installed to a disk that is no longer present”, and unchecked /dev/sde in the list of GRUB install devices.
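A quick way to double-check the result afterwards (just a debconf query, it changes nothing):

# list the devices debconf now has GRUB configured to install to
sudo debconf-show grub-pc | grep install_devices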

#9 Updated by intrigeri 2018-11-20 19:33:08

  • Assignee deleted (groente)

The old drives have been unplugged from the machine. Let’s handle the follow-ups: 1. via sysadmin shifts for the basic setup, as suggested over email today; 2. myself, for moving the data back there (Feature #16041).