I built a 5x 16TB RAIDz2, filled it with data, then I discovered the following.
Sequentially reading a single file from the file system gave me around 40MB/s. Reading multiple files in parallel brought the total throughput into the hundreds of MB/s - where I’d expect it. This is really weird: the 5 disks show 100% utilization during single-file reads, writes are supremely fast whether single-threaded or parallel, and reading directly from each disk gives >200MB/s.
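For reference, this is roughly how I measured it (pool and file names below are placeholders, not my exact commands):

```bash
# Export/import to drop the pool's data from ARC, so reads actually hit the disks
sudo zpool export tank && sudo zpool import tank

# Single sequential read
dd if=/tank/data/bigfile1 of=/dev/null bs=1M status=progress

# Several different files in parallel
for f in /tank/data/bigfile2 /tank/data/bigfile3 /tank/data/bigfile4; do
    dd if="$f" of=/dev/null bs=1M &
done
wait

# Per-disk utilization, watched from another terminal while the reads run
iostat -x 1
```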
Splitting the RAIDz2 into two RAIDz1s, or into one RAIDz1 and a mirror, improved reads to 100-something MB/s. Better, but still not where it should be.
I have an existing RAIDz1 made of 4x 8TB disks on the same machine. That one reads at 250-350MB/s. I made an equivalent 4x 16TB RAIDz1 from the new drives and it read at about 100MB/s. Much slower.
All of this was done with `ashift=12` and the default `recordsize`. The disks’ datasheets say their block size is 4096.
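For what it’s worth, this is how the reported sector sizes and the ashift actually in use can be checked (device and pool names are examples):

```bash
# Physical and logical sector size the drives report to the kernel
lsblk -d -o NAME,MODEL,PHY-SEC,LOG-SEC

# Same information straight from sysfs for one drive
cat /sys/block/sdb/queue/physical_block_size
cat /sys/block/sdb/queue/logical_block_size

# The ashift actually used by each vdev in the pool
sudo zdb -C tank | grep ashift
```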
I decided to try RAIDz2 with `ashift=13`, even though the disks really do report a 4K physical block size. Lo and behold, single-file reads went to over 150MB/s. 🤔
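Roughly what that test looked like (pool name and device paths are placeholders):

```bash
# Recreate the 5-disk RAIDz2 with an 8K ashift instead of the 4K the drives advertise
sudo zpool create -o ashift=13 tank raidz2 \
    /dev/disk/by-id/wwn-0xAAAA /dev/disk/by-id/wwn-0xBBBB \
    /dev/disk/by-id/wwn-0xCCCC /dev/disk/by-id/wwn-0xDDDD \
    /dev/disk/by-id/wwn-0xEEEE

# Confirm what the pool ended up with
zpool get ashift tank
```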
Following on from there, I got full throughput when I increased the `recordsize` to 1M, and that holds even with `ashift=12`. My existing 4x 8TB RAIDz1 pools with `ashift=12` and `recordsize=128K` read single files fast.
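For completeness, the change itself is just a dataset property (dataset name is a placeholder). Note that it only applies to data written after the change, so existing files have to be rewritten to pick it up:

```bash
# Larger records mean fewer, bigger I/Os per file on RAIDZ
sudo zfs set recordsize=1M tank/data

# Verify
zfs get recordsize tank/data
```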
Here’s a diff of the queue dumps of the old and new drives. The left side is a WD 8TB from the existing RAIDz1, the right side is one of the new HC550 16TB drives:
```diff
< max_hw_sectors_kb: 1024
---
> max_hw_sectors_kb: 512
20c20
< max_sectors_kb: 1024
---
> max_sectors_kb: 512
25c25
< nr_requests: 2
---
> nr_requests: 60
36c36
< write_cache: write through
---
> write_cache: write back
38c38
< write_zeroes_max_bytes: 0
---
> write_zeroes_max_bytes: 33550336
```
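The dumps were produced with something along these lines (device names and the little helper are illustrative):

```bash
# Dump every /sys/block/<dev>/queue parameter as "name: value"
dump_queue() {
    for f in /sys/block/"$1"/queue/*; do
        printf '%s: %s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null)"
    done
}

dump_queue sdb > wd-8tb.txt      # old drive
dump_queue sdg > hc550-16tb.txt  # new drive
diff wd-8tb.txt hc550-16tb.txt
```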
Could the `max_*_sectors_kb` being half as large on the new drives have something to do with it?
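One way I can think of to isolate that variable: cap the drives of the old (fast) pool to the same 512K request size limit and re-run the single-file read test there (device names are examples; the setting resets at reboot):

```bash
# Cap every member of the old pool (sdb..sde here) to 512K requests
for d in sdb sdc sdd sde; do
    echo 512 | sudo tee /sys/block/"$d"/queue/max_sectors_kb
done

# ...re-run the single-file dd read test on the old pool...

# Restore the original limit afterwards
for d in sdb sdc sdd sde; do
    echo 1024 | sudo tee /sys/block/"$d"/queue/max_sectors_kb
done
```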
Can anyone make any sense of any of this?
Ok :-)
Then you probably shouldn’t optimize it for a workload of many files (which is what the defaults assume, of course).