NAS Upgrade: 16TB RAID 1 on Debian 12#
In the following post, we will replace 6 old 3TB disks with 2 new 16TB disks. We will use RAID 1 for maximum redundancy [1].
This is an upgrade in capacity from 9TB to 16TB and a reduction in power usage of ~66%, as now only two disks are needed instead of six.
Why do we do this? There is The Cloud (TM), isn’t there? Of course, we make good use of snapshots and backup servers on Hetzner and AWS, but what if one of their data centers goes up in flames? It happens! Just search for "data center fire news".
Having off-site backups allows us to do disaster recovery in the worst case. We could of course use another cloud provider to host backups, but downstream bandwidth is cheap, and physical hardware is even cheaper for storage compared to cloud offerings [2].
Our NAS is a simple desktop machine with 6 disk bays. It may not be the most power-efficient thing, but the nice thing about an i7 processor and 32 GB of RAM is: you can do other stuff with it. ;)
Before we start, let’s have a look at the disks present in the system:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 2,7T 0 disk
└─sda1 8:1 0 2,7T 0 part
└─md0 9:0 0 2,7T 0 raid1
sdb 8:16 0 2,7T 0 disk
└─sdb1 8:17 0 2,7T 0 part
└─md128 9:128 0 2,7T 0 raid1
sdc 8:32 0 2,7T 0 disk
└─sdc1 8:33 0 2,7T 0 part
└─md1 9:1 0 2,7T 0 raid1
sdd 8:48 0 2,7T 0 disk
└─sdd1 8:49 0 2,7T 0 part
└─md0 9:0 0 2,7T 0 raid1
sde 8:64 0 2,7T 0 disk
└─sde1 8:65 0 2,7T 0 part
└─md1 9:1 0 2,7T 0 raid1
sdf 8:80 0 2,7T 0 disk
└─sdf1 8:81 0 2,7T 0 part
└─md128 9:128 0 2,7T 0 raid1
sdg 8:96 0 931,5G 0 disk
├─sdg1 8:97 0 1G 0 part /boot/efi
└─sdg2 8:98 0 930,5G 0 part /
sdh 8:112 0 14,6T 0 disk
sdi 8:128 0 14,6T 0 disk
First, there is the old RAID 10 (called raid1), built from three RAID 1 pairs (md0, md1 and md128) on sda … sdf. Each disk holds 2.7 TiB, for a total usable capacity of 8.1 TiB.
Then there is sdg, a small SSD with Debian 12 installed on its root partition.
Lastly, we have two brand new Toshiba disks with 14.6 TiB each.
"But wait! Didn’t you say in the beginning that we had 9 TB and upgraded to 16 TB?" Yes, my dear astute reader! Welcome to the world of marketing, where capacities are counted in SI units while we care about powers of 1024. Suddenly 3 TB is only 2.7 TiB and 16 TB becomes a meager 14.6 TiB. Oh well…
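For the curious, the conversion is just a division by 2^40 (this uses bc, which ships with a standard Debian install):
# 3 TB and 16 TB expressed in TiB (powers of 1024)
echo 'scale=2; 3 * 10^12 / 2^40' | bc    # 2.72
echo 'scale=2; 16 * 10^12 / 2^40' | bc   # 14.55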
Be S.M.A.R.T.#
Before we start, we’ll run some SMART checks.
https://unix.stackexchange.com/a/588253 says
To interpret the SMART attributes, you have to know they are normalized to 100, and lower is worse.
See also https://superuser.com/a/1171905 for statistics that are relevant and also the ones that you can ignore.
sudo apt install smartmontools
git init ~/hdds
cd ~/hdds
# gather raw data
mkdir -p _raw _tables
for dev in /dev/sd?; do
disk=$(basename $dev)
sudo smartctl -a $dev > _raw/$disk || echo "WARNING: errors in $dev" >&2
done
# extract smart attribute tables
for f in _raw/*; do
disk=$(basename $f)
rg '^ID#' -A18 $f > _tables/$disk || true
done
# show relevant statistics
# (single-threaded to keep file order)
rg '^\s*(5|187|188|197|198) ' -j1 $(find _tables/ -type f | sort)
git add .
git commit -m initial
E.g.
[...]
_tables/sdh
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
_tables/sdi
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
All the values at the end are zeros. Good.
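As an optional extra (not part of the run above), one could also start a long SMART self-test on the brand new disks before trusting them with data:
# kick off a long offline self-test; it runs inside the drive firmware in the background
sudo smartctl -t long /dev/sdh
sudo smartctl -t long /dev/sdi
# check progress and results later
sudo smartctl -l selftest /dev/sdh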
Create New RAID 1#
Become root
sudo -i
Install prerequisites
apt-get --yes install gdisk mdadm lvm2 cryptsetup
Double check disk identifiers (sdh,sdi in this case):
root@enterprise ~ # lsblk /dev/sd{h,i}
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sdh 8:112 0 14,6T 0 disk
sdi 8:128 0 14,6T 0 disk
Create partitions:
gdisk /dev/sdh
# create a new empty GUID partition table (GPT)
o
y
w
y
# add a new partition (type: Linux RAID)
gdisk /dev/sdh
n
<ENTER>
<ENTER>
<ENTER>
fd00
w
y
<ENTER>
# REPEAT for the second drive
gdisk /dev/sdi
Result
root@enterprise ~ # lsblk /dev/sd{h,i}
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sdh 8:112 0 14,6T 0 disk
└─sdh1 8:113 0 14,6T 0 part
sdi 8:128 0 14,6T 0 disk
└─sdi1 8:129 0 14,6T 0 part
Choose an md device name. As md0, md1 and md128 are already taken on this system, we will use md2.
mdadm --create --verbose /dev/md2 --level=1 --raid-devices=2 /dev/sdh1 /dev/sdi1
# y
mdadm --detail /dev/md2
Add the array to mdadm.conf
mdadm --detail --scan | rg /dev/md/2 >> /etc/mdadm/mdadm.conf
vi /etc/mdadm/mdadm.conf # format / verify
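The appended line should look roughly like this (name and UUID below are placeholders, your values will differ):
ARRAY /dev/md/2 metadata=1.2 name=enterprise:2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx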
As instructed by mdadm.conf, update the initramfs
update-initramfs -u
Check raid status:
cat /proc/mdstat
Example output:
md2 : active raid1 sdi1[1] sdh1[0]
15625745408 blocks super 1.2 [2/2] [UU]
[>....................] resync = 0.2% (43325248/15625745408) finish=1290.3min speed=201267K/sec
bitmap: 117/117 pages [468KB], 65536KB chunk
This should take around 21 hours:
root@enterprise ~ # echo $(( 1290 / 60 ))
21
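If you want to keep an eye on the resync in the meantime, something simple like this does the job:
watch -n 60 cat /proc/mdstat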
See you later! 😅
…
Actually, you don’t need to wait for resync to finish (see below). You can use the array right away. But first:
Let’s Encrypt!#
First, be sure to update cryptsetup to the latest version in order to get good defaults:
apt update
apt upgrade
reboot
uname -r
cryptsetup --version
As per the excellent archlinux wiki, run a benchmark, because
If certain AES ciphers excel with a considerable higher throughput, these are probably the ones with hardware support in the CPU.
cryptsetup benchmark
Encrypt the RAID device. The default cipher is aes-xts-plain64 (as of 2024-02):
cryptsetup --verify-passphrase luksFormat /dev/md2
# YES
Open the encrypted RAID device as crypt2024-02
cryptsetup luksOpen /dev/md2 crypt2024-02
This gives us the block device that we can use as an LVM PV:
root@enterprise ~ # lsblk /dev/mapper/crypt2024-02
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
crypt2024-02 253:0 0 14,6T 0 crypt
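Optionally (not done above), it is worth backing up the LUKS header to a safe place off the machine, because a damaged header makes the data unrecoverable even with the correct passphrase. The file name below is just an example:
cryptsetup luksHeaderBackup /dev/md2 --header-backup-file /root/md2-luks-header-backup.img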
Setup LVM#
pvcreate /dev/mapper/crypt2024-02
pvdisplay
vgcreate raid2024-02 /dev/mapper/crypt2024-02
vgdisplay
lvcreate --name storage --extents 100%VG raid2024-02
lvdisplay
ls /dev/raid2024-02/storage
Format File System And Mount#
mkfs.ext4 -L storage2024-02 /dev/raid2024-02/storage
mkdir /media/storage2024-02
mount /dev/raid2024-02/storage /media/storage2024-02/
df /media/storage2024-02/
E.g.
root@enterprise ~ # df /media/storage2024-02/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/raid2024--02-storage 15T 28K 14T 1% /media/storage2024-02
Copy Old Data#
In our case, we want to move the contents of /media/storage to /media/storage2024-02, preserving ownership and permissions.
tmux
rsync -azh --info=progress2 /media/storage/ /media/storage2024-02/
Verify Old Data#
Do whatever it is you need to do to verify old data. If you mainly host backups, then it would be advisable to try restoring from them. For general files, try to verify their integrity by other means, for example by reading them.
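For example, one thorough (but slow) check is to compare checksums of the old and the new copy. A rough sketch, assuming the paths from above:
# hash everything on the old mount, then verify the hashes against the new copy
cd /media/storage && find . -type f -print0 | xargs -0 sha256sum > /tmp/old.sha256
cd /media/storage2024-02 && sha256sum --check --quiet /tmp/old.sha256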
Remove Old Disks#
Removing the old disks is easy enough, but what SATA ports should we use?
https://download.asrock.com/Manual/Z87 Extreme6.pdf states on page 29
If the eSATA port on the rear I/O has been connected, the internal SATA3_A4 will not function.
also:
To minimize the boot time, use Intel® Z87 SATA ports (SATA3_0) for your bootable devices.
We use the connectors as listed on page 12. On the mainboard itself, there are markers showing that SATA3_0 is at the front of the SATA3_0_1 dual slot.
For our setup:
SATA3_0_1 for boot SSD
SATA3_2_3 for disk 1
SATA3_4_5 for disk 2
Double Check#
sudo -i
lsblk
Shows the boot device and both RAID disks:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 931,5G 0 disk
├─sda1 8:1 0 1G 0 part /boot/efi
└─sda2 8:2 0 930,5G 0 part /
sdb 8:16 0 14,6T 0 disk
└─sdb1 8:17 0 14,6T 0 part
└─md2 9:2 0 14,6T 0 raid1
sdc 8:32 0 14,6T 0 disk
└─sdc1 8:33 0 14,6T 0 part
└─md2 9:2 0 14,6T 0 raid1
The array looks good too:
root@enterprise ~ # cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active (auto-read-only) raid1 sdc1[1] sdb1[0]
15625745408 blocks super 1.2 [2/2] [UU]
bitmap: 0/117 pages [0KB], 65536KB chunk
unused devices: <none>
Mount#
This is the decrypt.sh script:
#!/bin/bash
# Decrypt the RAID and mount the storage LV. Run as root.
set -euo pipefail
# apt-get install mdadm cryptsetup lvm2
# mdadm --assemble --scan
#
# @felix: See git blame for multiple devices.
md=/dev/md2
crypt=crypt2024-02
raid_dev=/dev/raid2024-02/storage
TARGET=/media/storage
cryptsetup luksOpen "$md" "$crypt"
# wait until the LVM block device is available
# to trigger this manually, run `vgchange -a y raid2024-02` (activate the VG)
while [[ ! -b $raid_dev ]]; do
  echo -n '.'
  sleep 0.25
done
echo
mkdir -p "$TARGET"
mount "$raid_dev" "$TARGET"
ls "$TARGET"
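Run it as root after each reboot; it asks for the LUKS passphrase and then mounts the storage:
sudo bash decrypt.sh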
Celebrate#
Be sure to treat yourself to some well earned rest. Go outside, relax and enjoy life! 🌲
Further Reading#
What does resync do?#
From man 8 mdadm:
A 'resync' process is started to make sure that the array is consistent (e.g. both sides of a mirror contain the same data) but the content of the device is left otherwise untouched.
See also https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
In case of failure, one can stop the array and force a rebuild:
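A rough sketch of what that could look like, assuming the array and member partitions from the creation step (double check the device names before running anything here):
mdadm --stop /dev/md2
# re-assemble, forcing mdadm to accept slightly out-of-sync members
mdadm --assemble --force /dev/md2 /dev/sdh1 /dev/sdi1
# or, if a member was kicked out, re-add it to trigger a rebuild
mdadm --manage /dev/md2 --re-add /dev/sdh1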
Check Disk Activity#
iostat -hN 1
Tuning Disk Parameters#
The metric Load_Cycle_Count counts the number of times that the disk’s head unit was parked. Parking reduces power consumption, but also introduces wear to the disk’s head unit.
In our case, the drives should only be parked after a significant amount of time. There are two kinds of workloads running on the machine:
First, there are backup jobs that do constant IO. They are mainly limited by network speed and run from start to finish. Second, there is on-demand computing, like checking backups for consistency or extracting historical data. When done manually, there can be pauses of up to 30 minutes between IO requests. During this time, the disks should keep spinning.
Install hdparm
apt install hdparm
man 8 hdparm
Look for the disk IDs to make sure that you operate on the right ones:
ls -l /dev/disk/by-id/ata-*
Check current settings
hdparm -I /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG
hdparm -I /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG
Results: https://gist.github.com/felixhummel/e044f5947a3e8f4b13f4579804c3a1ac
For hdparm parameters I found https://superuser.com/a/1218031 to be useful:
Set -B 127 to enable Advanced Power Management, but allow spin-down.
Set -S to 242, which equals 1 hour. You can find details in man 8 hdparm.
hdparm -B 127 -S 242 /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG
See output here.
This looks good, so let’s persist this:
cp /etc/hdparm.conf /var/backups/
cat <<'EOF' > /etc/hdparm.conf
/dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG {
apm = 127
spindown_time = 242
}
/dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG {
apm = 127
spindown_time = 242
}
EOF
And reload the hdparm config params (from man 5 hdparm.conf):
/usr/lib/pm-utils/power.d/95hdparm-apm resume
To show the current power mode status:
hdparm -C /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_93K0A0DPFVGG
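To verify that the tuning actually helps, keep an eye on Load_Cycle_Count over the next couple of days, e.g.:
smartctl -A /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TE_14W0A05NFWTG | rg Load_Cycle_Count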
tags: hdparm, Power_Cycle_Count, Load_Cycle_Count
sources:
https://superuser.com/questions/1217983/why-is-my-hard-drive-cycle-count-increasing-so-fast
https://superuser.com/questions/1696041/debian-nas-server-how-to-achieve-zero-disk-activity
https://superuser.com/questions/153982/home-server-hard-drive-186k-start-stop-cycles-in-325-days
https://www.reddit.com/r/archlinux/comments/g9vksk/best_tlp_settings_for_wd_red_htpcnasseedbox/
Constant Writes#
After mounting, we heard the disks constantly seeking. iostat reports writes, but only a few transactions with small bandwidth:
root@enterprise # iostat -hN 1 /dev/sd? /dev/mapper/raid2024--02-storage
Linux 6.4.0-0.deb12.2-amd64 (enterprise) 19.02.2024 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0,6% 0,0% 0,3% 0,8% 0,0% 98,3%
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
11,45 40,7k 1012,9k 0,0k 90,4M 2,2G 0,0k raid2024--02-storage
26,12 563,7k 1,5M 0,0k 1,2G 3,4G 0,0k sda
8,43 24,3k 1022,6k 0,0k 54,0M 2,2G 0,0k sdb
8,30 19,5k 1022,6k 0,0k 43,3M 2,2G 0,0k sdc
A reddit comment suspects ext4 doing some bookkeeping for its lazy initialization.
As iotop (and htop) operate on the process level, there is no insight to be gained here, but there is blktrace [3]. Looking for syntax examples, I stumbled upon this pro-linux post.
blktrace -d /dev/mapper/raid2024--02-storage -o - | blkparse -i -
As suspected, this is ext4 doing its thing:
253,1 6 2 0.610765369 11584 Q WS 9793720576 + 2048 [ext4lazyinit]
I love the visibility that Linux offers down to the deepest depths. 🤓
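Bonus tip: if the background writes bother you, ext4 can do the full initialization at mkfs time instead via -E lazy_itable_init=0,lazy_journal_init=0 (shown for reference only; mkfs then takes much longer, and re-running it would of course wipe an existing file system):
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -L storage2024-02 /dev/raid2024-02/storage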