NAS Upgrade: 16TB RAID 1 on Debian 12#

In this post, we will replace six old 3TB disks with two new 16TB disks. We will use RAID 1 for maximum redundancy [1].

This is an upgrade in capacity from 9TB to 16TB and a reduction in power usage of ~66%, as now only two disks are needed instead of six.

../_images/02-18_19-51-30.jpg

Our enterprise machine. On the right, inside the bay, you can see the six old WD disks. In front, there are the new Toshiba disks and the boot SSD.#

Why do we do this? There is The Cloud (TM), isn’t there? Of course, we make good use of snapshots and backup servers on Hetzner and AWS, but what if one of their data centers goes up in flames? It happens! Just search for "data center fire news".

Having off-site backups allows us to do disaster recovery in the worst case. We could of course use another cloud provider to host backups, but downstream bandwidth is cheap and, for storage, physical hardware is even cheaper than cloud offerings [2].

Our NAS is a simple desktop machine with 6 disk bays. It may not be the most power-efficient option, but the nice thing about an i7 processor and 32GB RAM is: you can do other stuff with it. ;)

Before we start, let’s have a look at the disks present in the system:

# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda         8:0    0   2,7T  0 disk
└─sda1      8:1    0   2,7T  0 part
  └─md0     9:0    0   2,7T  0 raid1
sdb         8:16   0   2,7T  0 disk
└─sdb1      8:17   0   2,7T  0 part
  └─md128   9:128  0   2,7T  0 raid1
sdc         8:32   0   2,7T  0 disk
└─sdc1      8:33   0   2,7T  0 part
  └─md1     9:1    0   2,7T  0 raid1
sdd         8:48   0   2,7T  0 disk
└─sdd1      8:49   0   2,7T  0 part
  └─md0     9:0    0   2,7T  0 raid1
sde         8:64   0   2,7T  0 disk
└─sde1      8:65   0   2,7T  0 part
  └─md1     9:1    0   2,7T  0 raid1
sdf         8:80   0   2,7T  0 disk
└─sdf1      8:81   0   2,7T  0 part
  └─md128   9:128  0   2,7T  0 raid1
sdg         8:96   0 931,5G  0 disk
├─sdg1      8:97   0     1G  0 part  /boot/efi
└─sdg2      8:98   0 930,5G  0 part  /
sdh         8:112  0  14,6T  0 disk
sdi         8:128  0  14,6T  0 disk

First, there is the old storage built from sda … sdf: six 2.7 TiB disks grouped into three RAID 1 mirrors (md0, md1 and md128), for a total capacity of 8.1 TiB. Then there is sdg, a small SSD with Debian 12 installed on its root partition. Lastly, we have two brand-new Toshiba disks with 14.6 TiB each.

"But wait! Didn’t you say in the beginning that we had 9TB and upgraded to 16TB?" Yes, my dear astute reader! Welcome to the world of marketing, where counting is done in SI units while we care about powers of 1024. Suddenly 3 TB turn into 2.7 TiB and 16 TB become a meager 14.6 TiB. Oh well…
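For reference, the conversion is just SI powers of ten versus binary powers of two. A quick check with python3 (which ships with Debian 12):

python3 -c 'print( 3e12 / 2**40)'   #  3 TB -> ~2.73 TiB (the "2,7T" lsblk shows per old disk)
python3 -c 'print(16e12 / 2**40)'   # 16 TB -> ~14.55 TiB (the "14,6T" of each new disk)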

Be S.M.A.R.T.#

Before we start, we’ll run some SMART checks.

https://unix.stackexchange.com/a/588253 says

To interpret the SMART attributes, you have to know they are normalized to 100, and lower is worse.

See also https://superuser.com/a/1171905 for which statistics are relevant and which ones you can ignore.

sudo apt install smartmontools ripgrep  # ripgrep provides the rg command used below

git init ~/hdds
cd ~/hdds

# gather raw data
mkdir -p _raw _tables
for dev in /dev/sd?; do
  disk=$(basename $dev)
  sudo smartctl -a $dev > _raw/$disk || echo "WARNING: errors in $dev" >&2
done

# extract smart attribute tables
for f in _raw/*; do
  disk=$(basename $f)
  rg '^ID#' -A18 $f > _tables/$disk || true
done

# show relevant statistics
# (single-threaded to keep file order)
rg '^\s*(5|187|188|197|198) ' -j1 $(find _tables/ -type f | sort)

git add .
git commit -m initial

E.g.

[...]
_tables/sdh
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0

_tables/sdi
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0

All the values at the end are zeros. Good.

Create New RAID 1#

Become root

sudo -i

Install prerequisites

apt-get --yes install gdisk mdadm lvm2 cryptsetup

Double-check the disk identifiers (sdh and sdi in this case):

root@enterprise ~ # lsblk /dev/sd{h,i}
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdh    8:112  0 14,6T  0 disk
sdi    8:128  0 14,6T  0 disk

Create partitions:

gdisk /dev/sdh
# create a new empty GUID partition table (GPT)
o
y
w
y
# add a new partition (type: Linux RAID)
gdisk /dev/sdh
n
<ENTER>  # partition number: accept default
<ENTER>  # first sector: accept default
<ENTER>  # last sector: accept default (use the whole disk)
fd00     # partition type code: Linux RAID
w
y
<ENTER>

# REPEAT for the second drive
gdisk /dev/sdi

Result

root@enterprise ~ # lsblk /dev/sd{h,i}
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdh      8:112  0 14,6T  0 disk
└─sdh1   8:113  0 14,6T  0 part
sdi      8:128  0 14,6T  0 disk
└─sdi1   8:129  0 14,6T  0 part

Choose an md device name. As md0, md1 and md128 are already taken on this system, we will use md2.

mdadm --create --verbose /dev/md2 --level=1 --raid-devices=2 /dev/sdh1 /dev/sdi1
# y
mdadm --detail /dev/md2

Add the array to mdadm.conf

mdadm --detail --scan | rg /dev/md/2 >> /etc/mdadm/mdadm.conf

vi /etc/mdadm/mdadm.conf  # format / verify
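The appended line should look roughly like this (the name and UUID will differ on your system; the UUID here is only a placeholder):

ARRAY /dev/md/2 metadata=1.2 name=enterprise:2 UUID=00000000:00000000:00000000:00000000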

As instructed by mdadm.conf, update the initramfs

update-initramfs -u

Check raid status:

cat /proc/mdstat

Example output:

md2 : active raid1 sdi1[1] sdh1[0]
      15625745408 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.2% (43325248/15625745408) finish=1290.3min speed=201267K/sec
      bitmap: 117/117 pages [468KB], 65536KB chunk

This should take around 21 hours:

root@enterprise ~ # echo $(( 1290 / 60 ))
21

See you later! 😅

Actually, you don’t need to wait for resync to finish (see below). You can use the array right away.
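If you do want to keep an eye on the resync, or nudge the kernel’s speed limits for it (values are in KiB/s), a small sketch:

watch -n 10 cat /proc/mdstat
# current limits for md resync/rebuild speed
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# example: raise the minimum so the resync is not throttled while the box is idle anyway
sysctl -w dev.raid.speed_limit_min=100000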

But first:

Let’s Encrypt!#

First, be sure to update cryptsetup to the latest version in order to get good defaults:

apt update
apt upgrade
reboot
uname -r
cryptsetup --version

As per the excellent archlinux wiki, run a benchmark, because

If certain AES ciphers excel with a considerable higher throughput, these are probably the ones with hardware support in the CPU.

cryptsetup benchmark

Encrypt the RAID device. The default cipher is aes-xts-plain64 (as of 2024-02).

cryptsetup --verify-passphrase luksFormat /dev/md2
# YES

Open the encrypted RAID device as crypt2024-02

cryptsetup luksOpen /dev/md2 crypt2024-02

This gives us the block device that we can use as an LVM PV:

root@enterprise ~ # lsblk /dev/mapper/crypt2024-02
NAME         MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
crypt2024-02 253:0    0 14,6T  0 crypt

Setup LVM#

pvcreate /dev/mapper/crypt2024-02
pvdisplay
vgcreate raid2024-02 /dev/mapper/crypt2024-02
vgdisplay
lvcreate --name storage --extents 100%VG raid2024-02
lvdisplay
ls /dev/raid2024-02/storage

Format File System And Mount#

mkfs.ext4 -L storage2024-02 /dev/raid2024-02/storage
mkdir /media/storage2024-02
mount /dev/raid2024-02/storage /media/storage2024-02/
df /media/storage2024-02/

E.g.

root@enterprise ~ # df /media/storage2024-02/
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/raid2024--02-storage   15T   28K   14T   1% /media/storage2024-02

Copy Old Data#

In our case, we want to move the contents of /media/storage to /media/storage2024-02, preserving ownership and permissions.

tmux
rsync -azh --info=progress2 /media/storage/ /media/storage2024-02/

Verify Old Data#

Do whatever it is you need to do to verify old data. If you mainly host backups, then it would be advisable to try restoring from them. For general files, try to verify their integrity by other means, for example by reading them.
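One low-effort sanity check is to let rsync compare both trees by checksum without changing anything. This is only a sketch, and it re-reads every file, so it takes a while:

rsync --archive --checksum --dry-run --itemize-changes /media/storage/ /media/storage2024-02/
# no file lines in the output means the two trees match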

Remove Old Disks#

Removing the old disks is easy enough, but what SATA ports should we use?

https://download.asrock.com/Manual/Z87 Extreme6.pdf states on page 29

If the eSATA port on the rear I/O has been connected, the internal SATA3_A4 will not function.

also:

To minimize the boot time, use Intel® Z87 SATA ports (SATA3_0) for your bootable devices.

We use the connectors as listed on page 12. On the mainboard itself, there are markings showing that SATA3_0 is at the front of the SATA3_0_1 dual slot. For our setup:

  • SATA3_0_1 for boot SSD

  • SATA3_2_3 for disk 1

  • SATA3_4_5 for disk 2

../_images/nas-done.jpg

Two disks and a little bit of cable management :)#

Double Check#

sudo -i
lsblk

Shows the boot device and both RAID disks:

NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda       8:0    0 931,5G  0 disk
├─sda1    8:1    0     1G  0 part  /boot/efi
└─sda2    8:2    0 930,5G  0 part  /
sdb       8:16   0  14,6T  0 disk
└─sdb1    8:17   0  14,6T  0 part
  └─md2   9:2    0  14,6T  0 raid1
sdc       8:32   0  14,6T  0 disk
└─sdc1    8:33   0  14,6T  0 part
  └─md2   9:2    0  14,6T  0 raid1

The array looks good too:

root@enterprise ~ # cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active (auto-read-only) raid1 sdc1[1] sdb1[0]
      15625745408 blocks super 1.2 [2/2] [UU]
      bitmap: 0/117 pages [0KB], 65536KB chunk

unused devices: <none>
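The (auto-read-only) state is normal for a freshly assembled array that has not been written to yet; it switches to read-write on the first write. To flip it explicitly:

mdadm --readwrite /dev/md2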

Mount#

This is the decrypt.sh script:

#!/bin/bash
set -euo pipefail

# apt-get install mdadm cryptsetup lvm2
# mdadm --assemble --scan
#
# @felix: See git blame for multiple devices.

md=/dev/md2
crypt=crypt2024-02
raid_dev=/dev/raid2024-02/storage
TARGET=/media/storage

cryptsetup luksOpen $md $crypt

# wait until block device is available
# to trigger this manually, run `vgchange -a y raid` (activate vg)
while [[ ! -b $raid_dev ]]; do
  echo -n '.'
  sleep 0.25
done
echo
mkdir -p $TARGET
sudo mount $raid_dev $TARGET
ls $TARGET
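For completeness, the reverse direction (unmount and lock everything again) would look roughly like this, assuming the names from the script above:

umount /media/storage
vgchange -a n raid2024-02          # deactivate the volume group
cryptsetup luksClose crypt2024-02
# optionally stop the array as well
# mdadm --stop /dev/md2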

Celebrate#

Be sure to treat yourself to some well earned rest. Go outside, relax and enjoy life! 🌲

Further Reading#

What does resync do?#

From man 8 mdadm:

A 'resync' process is started to make sure that the array is consistent (e.g. both sides of a mirror contain the same data) but the content of the device is left otherwise untouched.

See also https://raid.wiki.kernel.org/index.php/Initial_Array_Creation

In case of failure, one can stop the array and force a rebuild.
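A sketch for the md2 array from this post; check the wiki linked above and man 8 mdadm before running any of this on real data:

# stop the array
mdadm --stop /dev/md2
# force (re)assembly from its member partitions
mdadm --assemble --force /dev/md2 /dev/sdh1 /dev/sdi1
# or, on a running but inconsistent array, trigger a repair pass
echo repair > /sys/block/md2/md/sync_action
cat /proc/mdstat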

Check Disk Activity#

iostat -hN 1

Tuning Disk Parameters#

The metric Load_Cycle_Count counts the number of times the disk’s head unit was parked. Parking reduces power consumption, but also introduces wear to the head unit.
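You can watch this counter with smartctl (device path as used later in this section):

smartctl -A /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG | rg Load_Cycle_Count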

In our case, the drives should only be parked after a significant amount of time. There are two kinds of workloads running on the machine:

First, there are backup jobs that produce steady IO. They are mainly limited by network speed and run from start to finish. Second, there is on-demand computing, like checking backups for consistency or extracting historical data. When done manually, there can be pauses of up to 30 minutes between IO requests. During this time, the disks should keep spinning.

Install hdparm

apt install hdparm
man 8 hdparm

Look for the disk IDs to make sure that you operate on the right ones:

ls -l /dev/disk/by-id/ata-*

Check current settings

hdparm -I /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG
hdparm -I /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG

Results: https://gist.github.com/felixhummel/e044f5947a3e8f4b13f4579804c3a1ac

For the hdparm parameters I found https://superuser.com/a/1218031 useful: set -B to 127 to enable Advanced Power Management while still allowing spin-down, and set -S to 242, which corresponds to a spindown timeout of 1 hour (values from 241 to 251 count in units of 30 minutes). You can find the details in man 8 hdparm.

hdparm -B 127 -S 242 /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG

See the output in the gist linked above.

This looks good, so let’s persist this:

cp /etc/hdparm.conf /var/backups/
cat <<'EOF' > /etc/hdparm.conf

/dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG {
  apm = 127
  spindown_time = 242
}
/dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG {
  apm = 127
  spindown_time = 242
}
EOF

And reload hdparm config params (from man 5 hdparm.conf):

/usr/lib/pm-utils/power.d/95hdparm-apm resume

To show the current power mode status:

hdparm -C /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG

tags: hdparm, Power_Cycle_Count, Load_Cycle_Count


Constant Writes#

After mounting, we heard the disks constantly seeking. iostat reports writes, but only a few transactions with little bandwidth:

root@enterprise # iostat -hN 1 /dev/sd? /dev/mapper/raid2024--02-storage
Linux 6.4.0-0.deb12.2-amd64 (enterprise) 	19.02.2024 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,6%    0,0%    0,3%    0,8%    0,0%   98,3%

      tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd Device
    11,45        40,7k      1012,9k         0,0k      90,4M       2,2G       0,0k raid2024--02-storage
    26,12       563,7k         1,5M         0,0k       1,2G       3,4G       0,0k sda
     8,43        24,3k      1022,6k         0,0k      54,0M       2,2G       0,0k sdb
     8,30        19,5k      1022,6k         0,0k      43,3M       2,2G       0,0k sdc

A Reddit comment suspects ext4 is doing some bookkeeping for its lazy initialization.

As iotop (and htop) operate on the process level, there is no insight to be gained here, but there is blktrace [3]. Looking for syntax examples, I stumbled upon this pro-linux post.

blktrace -d /dev/mapper/raid2024--02-storage -o - | blkparse -i -

As suspected, this is ext4 doing its thing:

253,1    6        2     0.610765369 11584  Q  WS 9793720576 + 2048 [ext4lazyinit]

I love the visibility that Linux offers down to the deepest depths. 🤓