So, last month, my Kubernetes cluster decided to literally eat shit while I was away at a work conference.
When I returned, I decided to try something a tad different and rolled out Proxmox to all of my servers.
I am a huge fan of hyper-converged, clustered architectures for my home network / lab, so I decided to give Ceph another try.
I have used it before with relative success on Kubernetes (via rook/ceph), and I currently use Longhorn.
Cluster Details
- Kube01 - Optiplex SFF
  - i7-8700 / 32G DDR4
  - 1T Samsung 980 NVMe
  - 128G KIOXIA NVMe (boot disk)
  - 512G SATA SSD
  - 10G via ConnectX-3
- Kube02 - R730XD
  - 2x E5-2697A v4 (32c / 64t)
  - 256G DDR4
  - 128T of spinning disk
  - 2x 1T 970 EVO
  - 2x 1T 970 EVO Plus
  - A few more NVMe and SATA drives
  - Nvidia Tesla P4 GPU
  - 2x Google Coral TPU
  - 10G Intel networking
- Kube05 - HP Z240
  - i5-6500 / 28G RAM
  - 2T Samsung 970 EVO Plus NVMe
  - 512G Samsung boot NVMe
  - 10G via ConnectX-3
- Kube06 - Optiplex Micro
  - i7-6700 / 16G DDR4
  - Liteon 256G SATA SSD (boot)
  - 1T Samsung 980
Attempt number one.
I installed and configured Ceph on Kube01 and Kube05.
I used a mixture of 5x 970 EVO / 970 EVO Plus / 980 NVMe drives, and expected it to work pretty decently.
It didn’t. The I/O was so bad that it was causing my servers to crash.
I ended up removing Ceph and using LVM / ZFS for the time being.
Here are some benchmarks I found online:
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
https://www.proxmox.com/images/download/pve/docs/Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf
The TL;DR, after lots of research: don’t use consumer SSDs. Only use enterprise SSDs.
Attempt / Experiment Number 2.
I ended up ordering 5x 1T Samsung PM863a enterprise SATA drives.
After reinstalling Ceph, I put three of the drives into Kube05 and one more into Kube01 (no ports / power for adding more than a single SATA disk…).
Then I put the cluster together. At first, performance wasn’t great… (but it was still 10x the performance of the first attempt!). After updating the CRUSH map to set the failure domain to OSD rather than host, performance picked up quite dramatically.
This is due to the current imbalance of storage per host. Kube05 has 3T of drives, Kube01 has 1T. No storage elsewhere.
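To make the imbalance concrete, here is a toy placement model in Python. It is only a sketch, not Ceph's real CRUSH algorithm: it picks OSDs uniformly at random, and the replica count of 2 is an assumption, since the pool size isn't stated above. The OSD names and the number of placement groups are made up for illustration.

```python
import random
from collections import Counter

# Layout from the post: three OSDs on Kube05, one on Kube01.
# The osd.N names are hypothetical labels for this sketch.
OSDS = {
    "kube05": ["osd.0", "osd.1", "osd.2"],
    "kube01": ["osd.3"],
}
REPLICAS = 2      # assumption: the post does not state the pool's replica count
NUM_PGS = 10_000  # arbitrary number of placement groups for the toy model


def place_by_host(rng):
    """Failure domain = host: one OSD from each of REPLICAS distinct hosts."""
    hosts = rng.sample(sorted(OSDS), REPLICAS)
    return [rng.choice(OSDS[h]) for h in hosts]


def place_by_osd(rng):
    """Failure domain = osd: REPLICAS distinct OSDs, regardless of host."""
    all_osds = [osd for osds in OSDS.values() for osd in osds]
    return rng.sample(all_osds, REPLICAS)


def share_per_osd(place, seed=0):
    """Fraction of placement groups that keep a replica on each OSD."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(NUM_PGS):
        counts.update(place(rng))
    return {osd: counts[osd] / NUM_PGS for osd in sorted(counts)}


if __name__ == "__main__":
    print("failure domain = host:", share_per_osd(place_by_host))
    print("failure domain = osd: ", share_per_osd(place_by_osd))
```

In this model, the lone Kube01 drive holds a replica of every placement group when the failure domain is host, and only about half of them when it is OSD. The trade-off is that with an OSD failure domain, both replicas of a placement group can land on Kube05, so losing that one host can take data offline.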
BUT… since this was a very successful test, and it was able to deliver enough IOPS to run my I/O-heavy Kubernetes workloads… I decided to take it up another step.
A few notes:
Can you guess which drive is the Samsung 980, and which drives are enterprise SATA SSDs? (Look at the latency column.)
Future - Attempt #3
The next goal is to properly distribute OSDs.
Since I am maxed out on the number of 2.5" SATA drives I can deploy… I picked up some NVMe.
5x 1T Samsung PM963 M.2 NVMe.
I picked up a pair of dual-slot half-height bifurcation cards for Kube02. This will allow me to place 4 of these into it, with dedicated bandwidth to the CPU.
The remaining one will be placed inside Kube01, to replace the 1T Samsung 980 NVMe.
This should give me a pretty decent distribution of data, and with all enterprise drives, it should deliver pretty acceptable performance.
More to come…
Ceph seems neat, but the fact that it can’t even function with normal SSDs points to something very wrong with how it’s designed. It seems like it has an absurd overhead.
I believe it’s a data-safety thing, similar to how ZFS’s ZIL works.
That is, a write isn’t acknowledged until it’s actually on stable storage. With consumer SSDs, that means waiting for the write to reach flash. With enterprise SSDs, it only has to reach the drive’s write cache, which is safe to count as written thanks to PLP (power-loss protection).
As with anything, though, you can disable those safety features.
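If you want a rough feel for that cost on your own drives, here is a minimal Python sketch that times plain buffered writes against writes followed by `fsync()`. It is not how Ceph itself issues I/O, and a proper benchmark tool will give much better numbers. The 4 KiB block size, the write count, and the scratch file name are arbitrary choices for illustration.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 4096  # 4 KiB per write; an arbitrary size for this sketch


def time_writes(path, count, sync):
    """Return average seconds per write, optionally fsync()ing each one."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, BLOCK)
            if sync:
                os.fsync(fd)  # block until the OS reports the data as durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count


def compare(directory=None, count=200):
    """Time buffered vs. fsync'd writes to a scratch file in `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "scratch.bin")
        buffered = time_writes(path, count, sync=False)
        synced = time_writes(path, count, sync=True)
    return buffered, synced


if __name__ == "__main__":
    buffered, synced = compare()
    print(f"buffered: {buffered * 1e6:8.1f} us/write")
    print(f"fsync:    {synced * 1e6:8.1f} us/write")
```

Pass `directory=` a path on the disk you want to test. On a drive without PLP, I would expect the fsync'd number to be far worse than the buffered one, but the actual gap depends on the drive, filesystem, and OS.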
“absurd overhead.”
Actually a massive understatement. I threw together over 5 million IOPS worth of disks, to barely squeeze 100k IOPS out of the cluster! It’s EXTREMELY inefficient, compared to… well, pretty much any other option. I mean, writing encrypted zip files to SD card storage can be faster in some circumstances. lol
But it’s reliable, fault-tolerant storage, which is instantly available (i.e., no replication, syncing, etc.).