QNAP TS-1655 + Proxmox + TrueNAS + Drive Planning Query

ridgedale · September 26, 2023, 2:54pm

I've just taken delivery of a QNAP TS-1655 NAS device. It will have 128GB RAM installed once that arrives, as well as an Intel X520-SR2 10GbE adapter.
The device has 12x SATA HDD bays, 4x SATA SSD bays and 2x M.2 NVMe slots (on the motherboard) and will be primarily used as a network storage server as well as for hosting VMs, web development and forensic data image storage.
The intention is to populate the 12x SATA HDD bays with 12x 12TB HDDs, and to add 2x 480GB or 240GB SATA SSDs (note: the cost difference between the two is negligible), 2x 1.92 or 2TB SATA SSDs and 2x 1.92 or 2TB M.2 NVMe SSDs, all enterprise-level drives.
The intention is to use the 2x 480GB or 240GB SATA SSDs (mirrored) to host Proxmox, the 2x 1.92 or 2TB M.2 NVMe SSDs (mirrored) to access and store VMs, and the 12x 12TB HDDs for network storage (deploying TrueNAS as a VM). What I am unsure of is how the 2x 1.92 or 2TB SATA SSDs can be best deployed. Is there any value in those drives for caching given the RAM configuration?
If anyone else within the community has set up a similarly configured device/server to run Proxmox and TrueNAS as a VM, any pointers or gotchas would be greatly appreciated.

ThatGuyB · September 26, 2023, 10:37pm

Welcome to the forum!

If this is a home build, don’t do any caching. Neither ZIL nor L2ARC is that useful. My suggestion is to install Proxmox on 2x 240GB SSDs and use (part of) the 2x 2TB NVMe drives (in mirror) as a ZFS Metadata Special Device.

What this does is move all the directory structure (the metadata) over to the SSDs while keeping the data on the spinning-drive pool. So all your file listing and searching of the FS tree will be SSD-fast, but actually accessing the data will be HDD-fast, i.e. find /tank will move like it would on an SSD, but cat /tank/file will move like an HDD.

If this is a homelab and you want to save on storage, go with a Proxmox ZFS RAID-Z3 pool on 11 drives (you read that right). If you want to get more performance and you don’t mind wasting half the size of your pool, go with striped mirrors (raid 10 with 12 drives).

If this is production, go either raid-z3 or with a 4x striped 3x mirror (4 sets of 3x mirror vdevs). ZFS is smart enough that, unlike traditional raid, read performance scales with the number of drives in a mirror vdev (3x mirror = 3x the read performance). With 4 stripes, that translates to 4x write speed and 12x read speed. With raid-z3 you don’t get much benefit, other than getting about 72% usable capacity (with 11 drives).

The reason for not using 12 drives for the z3 pool is the default cluster size (recordsize) in ZFS.

For a raid-z3 with 11 drives, you get a usable capacity of around 96TB. The rule of thumb is 0.3% of the pool capacity for the metadata device, i.e. you need about 300GB for it. With 12 drives in striped mirrors, you get about 72TB usable capacity, meaning you’d only need about 250GB for the special dev. With 12 drives in striped 3-way mirrors, you’d get 48TB usable capacity, meaning about 180GB for the metadata special dev.
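As a rough check on those numbers, here is a minimal Python sketch of the sizing arithmetic. It assumes 12TB drives, counts 1TB as 1000GB, and applies the 0.3% rule of thumb literally. The function names are made up for illustration and the results ignore ZFS overhead, so treat them as ballpark figures only.

```python
# Rough sizing sketch; the drive size and the 0.3% rule of thumb come from the post.
DRIVE_TB = 12
SPECIAL_FRACTION = 0.003  # 0.3% of usable pool capacity


def usable_tb(data_drives: int, drive_tb: int = DRIVE_TB) -> int:
    """Nominal usable capacity: data-bearing drives times drive size."""
    return data_drives * drive_tb


def special_vdev_gb(pool_tb: float, fraction: float = SPECIAL_FRACTION) -> float:
    """Suggested metadata special device size in GB (1 TB taken as 1000 GB)."""
    return pool_tb * 1000 * fraction


layouts = {
    "raid-z3, 11 drives": usable_tb(11 - 3),           # 8 data drives
    "striped 2-way mirrors, 12 drives": usable_tb(12 // 2),
    "striped 3-way mirrors, 12 drives": usable_tb(12 // 3),
}

for name, tb in layouts.items():
    print(f"{name}: {tb} TB usable, ~{special_vdev_gb(tb):.0f} GB special device")
```

A strict 0.3% works out to roughly 288GB, 216GB and 144GB for the three layouts, so the figures quoted above for the two mirror layouts are on the generous side.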

You can make 2 partitions on each NVMe drive: partition 1 = between 180 and 300GB (depending on which raid you choose for the spinning rust) and partition 2 = the rest of the disk. You make 2 different mirrored pools, one as the metadata special device pool, the other just as a normal SSD pool which you can use for the VM OS partitions.

For raid-z3, just use the last drive as either a cold or hot spare. I prefer cold spares and a good alerting system to let me know to replace it (Zabbix, Prometheus + Alertmanager, Centreon, whichever fits your needs). Otherwise, it's still good to have an alerting system to tell you if a pool is degraded anyway, lmao.

ridgedale · September 27, 2023, 10:12am

Thanks for all the helpful feedback, @ThatGuyB. Much appreciated.
This will be a small business/home build. Reliability is the requirement over performance.
Your feedback has, however, led to further questions as none of the SSDs have been purchased yet:

Not sure I understand that. How is that calculated? When I searched online I came across https://jro.io/capacity/ that appears to indicate that 12 drives for a z3 pool is possible.
Would it be better to install Proxmox on a mirrored pair of 960GB NVMe drives (partitioned as 240GB for the Proxmox host OS and 720GB for a ZFS Metadata Special Device) and use the 4x 2.5" bays for storing and running VMs instead?
To try to optimise the amount of available storage capacity with redundancy, would 6x HDDs for 2x z2 pools be worth considering?

ThatGuyB · September 27, 2023, 11:57am

The default ZFS cluster size is 128 KB. That means, if you don’t want write amplification in your pool when using any kind of parity raid, then you need to arrange your pool in a way that allows writing only in chunks of 128 KB at once.

The calculation was in the second link I posted. You take 128 and divide by ( number of total drives - number of parity drives). Let’s start with easy raid-z1, only 1 parity drive, with 3, 4 and 5 drives:

128 / ( 3 - 1 ) = 64 KB (good!)
128 / ( 4 - 1 ) = 42.6(6) KB (bad!)
128 / ( 5 - 1 ) = 32 KB (good!)

Moving on to raid-z2 (which has 2 parity drives) with 4, 5, 8 and 10 drives.

128 / ( 4 - 2 ) = 64 KB (good!)
128 / ( 5 - 2 ) = 42.6(6) KB (bad!)
128 / ( 8 - 2 ) = 21.3(3) KB (bad!)
128 / ( 10 - 2 ) = 16 KB (good!)

Onto raid-z3 (3 parity drives) with 5, 10, 11 and 12 drives.

128 / ( 5 - 3 ) = 64 KB (good!)
128 / ( 10 - 3 ) = 18.28 KB (bad!)
128 / ( 11 - 3 ) = 16 KB (good!)
128 / ( 12 - 3 ) = 14.2(2) KB (bad!)
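The good/bad verdicts above can be reproduced with a short Python sketch. It assumes that "good" simply means the 128 KB record divides evenly across the data drives (total drives minus parity drives), which is how the examples read. The function names are hypothetical and this is only the rule of thumb from this post, not anything ZFS itself exposes.

```python
RECORD_KB = 128  # default record ("cluster") size used in the post


def data_drives(total_drives: int, parity_drives: int) -> int:
    """Number of drives holding data in a parity vdev."""
    count = total_drives - parity_drives
    if count <= 0:
        raise ValueError("need at least one data drive")
    return count


def per_disk_kb(total_drives: int, parity_drives: int, record_kb: int = RECORD_KB) -> float:
    """Share of one record that lands on each data drive, in KB."""
    return record_kb / data_drives(total_drives, parity_drives)


def is_balanced(total_drives: int, parity_drives: int, record_kb: int = RECORD_KB) -> bool:
    """True when the record splits into a whole number of KB per data drive."""
    return record_kb % data_drives(total_drives, parity_drives) == 0


# Same drive counts as the worked examples above.
examples = {1: (3, 4, 5), 2: (4, 5, 8, 10), 3: (5, 10, 11, 12)}

for parity, counts in examples.items():
    for n in counts:
        verdict = "good" if is_balanced(n, parity) else "bad"
        print(f"raid-z{parity}, {n} drives: {per_disk_kb(n, parity):.2f} KB ({verdict})")
```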

You want to avoid write amplification at all costs if you are planning to run VMs. For just general data like a movie collection, you don’t have to worry about it (well, especially movie collections, which are static, never-changing data). What happens is that, if your pool is not balanced, you will add one or more additional writes to compensate for the mismatched data (also affecting the longevity of the drives).

A z3 pool is possible with any number of disks, but you need to account for the low-level stuff to get optimal performance (and potentially life) out of your drives (be they spinning rust or flash).

When people complain that their parity zfs pool is slow, this is typically why. In the link you shared, the recordsize is what I’m referring to as cluster size. But their calculations are a bit different than what I’m used to. Still, they mention the padding for their example (which zfs does automatically). Give that entire web page a read.

In my example, I skip the block size for the parity (parity + 1). In ZFS z3, you have p = 4 (3 parity disks + 1), meaning, by their example, 128 / 4 = 32 blocks. Then we divide 32 by the number of data disks, which for an 11- or 12-drive z3 pool is 8 and 9 respectively: 32 / 8 = 4 blocks (perfect for the 11-drive z3) and 32 / 9 = 3.55 (bad), so my easier calculation still stands. As long as the pool is balanced, you don’t have to worry about write amplification (again, this doesn't relate to the parity / partial-parity and the ZFS padding, but to the actual data writes, which is what tanks your pool’s performance).
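To see that the two ways of doing the arithmetic give the same verdict for raid-z3, here is a small Python sketch comparing them for 5 to 12 drives. It is only a reading of the numbers in this post (128 KB record, p = 4 for z3), not a reimplementation of the linked calculator, and the function names are hypothetical.

```python
RECORD_KB = 128
Z3_PARITY = 3


def simple_rule(total_drives: int) -> bool:
    """Post's shortcut: 128 must divide evenly by the number of data drives."""
    return RECORD_KB % (total_drives - Z3_PARITY) == 0


def parity_plus_one_rule(total_drives: int) -> bool:
    """Variant from the post: 128 / (parity + 1) = 32 blocks, split over data drives."""
    blocks = RECORD_KB // (Z3_PARITY + 1)  # 128 / 4 = 32
    return blocks % (total_drives - Z3_PARITY) == 0


for total in range(5, 13):
    a, b = simple_rule(total), parity_plus_one_rule(total)
    print(f"{total} drives: simple={a}, parity+1={b}, agree={a == b}")
```

For this range both checks flag 5, 7 and 11 drives as balanced and agree on every other count, including 12.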

Out of curiosity, what will be the workload besides just VMs? Any particular programs, like web servers, databases etc.? If you plan on one or two DBs, z3 is fine, but if you plan to have a lot of active DBs, go with triple mirrors striped across 4 vdevs (i.e. you first create 4x mirror vdevs with 3 drives each, then you stripe the 4 together), even though you only get about 33% usable capacity, unless you are fine with recovering from backups (hope you have backups, no matter what redundancy level you use).