Overview

Introduction

Ceph is a reliable distributed storage system giving you object, block and filesystem storage... for free! Handily, for us, it is also well integrated into OpenStack.

Reliability is gained by maintaining a number of replicas of each blob of storage (data is stored in 4KB blobs). The number of replicas is your choice -- it defaults to three -- and can be changed while the cluster is running. Once running you can add and remove storage dynamically and Ceph does The Right Thing.

It sounds too good to be true; there must be some kind of catch. Not really. However, Ceph is designed for "the bigger player" and whilst you can run it on a single system, a small setup masks some of the architectural difficulties. Architecturally it gets really quite complicated really quite quickly and every cluster is different. Also note that to achieve some sense of resilience you are going to consume physical hosts quite quickly.

It also unmasks some fancy new (consumer) hardware as being, frankly, a bit rubbish. Yes, I'm looking at you, NVMe drive-plus-motherboard combos!

We'll take the position of the smallest viable player in the discussion below. What can you get away with?

For a view of what you should be doing in a Ceph deployment take a look at the Proxmox Benchmark paper.

Architecture

There are, in essence, three components in a Ceph cluster:

  1. the storage nodes
  2. the client systems/applications using the storage
  3. a set of monitor nodes coordinating the clients and storage nodes
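
Once a cluster is up you can see all three components from any node with admin credentials; a quick sketch (the output will obviously vary with your cluster):

  ceph -s        # one-page summary: monitors, managers, OSDs and data health
  ceph osd tree  # the storage nodes and the OSDs hanging off each of them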

Networking

Technically, so long as all parties can route packets to one another you are good to go. However, the clients and storage nodes are going to be passing a lot of data between each other so they should be on the same LAN. The storage nodes will be rebalancing from time to time so should clearly be on the same LAN. So, that done, the monitors might as well be there too.

Note

There's no magic involved: if you have three replicas then the data has to be written out over the network three times.
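
If you do have spare network capacity, Ceph can split client traffic from replication traffic. A minimal ceph.conf sketch, with made-up subnets standing in for yours:

  [global]
  # clients and monitors talk to the OSDs over this network
  public_network = 192.168.1.0/24
  # OSD-to-OSD replication and rebalancing traffic
  cluster_network = 192.168.2.0/24

With a single LAN you simply leave cluster_network unset and everything shares the public network.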

Storage Nodes

A storage node has the physical (or logical) disks attached to it, d'uh! The disk is called an Object Storage Device (OSD) and Ceph uses a daemon per OSD called, unhelpfully, a Ceph OSD. You'll see systemd instances called ceph-osd@N for some N, the OSD ID within the cluster. OSD IDs are integers starting at 0 and can be re-used (assuming all previous use has ceased).
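
On a storage node you can see these daemons directly; for example (the OSD IDs will, of course, be whatever your cluster assigned):

  systemctl list-units 'ceph-osd@*'   # one unit per OSD hosted here
  systemctl status ceph-osd@0         # details for OSD 0, if it lives on this node
  ceph osd tree                       # cluster-wide view of which OSD sits on which node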

Ceph prefers that you give it entire disks to do its thing with. This isn't unreasonable as it will keep the disks chuntering all day and all night. Remember that every client write means three storage nodes writing some data, and every client read means at least one node reading some data.

How many storage nodes? Not unreasonably, Ceph deems a storage host a likely point of failure and so arranges the distribution of replicas to keep them on distinct storage nodes. With the default replication of three, that suggests we need at least three storage nodes.

How many disks per storage node? Go nuts! Ceph doesn't care, it'll spread the replicas about as necessary.

Actually it does care or, rather, it will enforce some replication rules that will make you care. In a pathological case you might put a single disk in two nodes and the rest of your disks in the third. You're going to get bitten here as those replication rules say that Ceph must keep replicas on separate nodes, which means your available space is limited to as many blobs as can fit on the smallest of your storage nodes. Most of the storage on that third storage node will do nothing.
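
You can inspect both the replica count and the "keep replicas on separate hosts" rule from the command line. A sketch, assuming the stock replicated_rule and a pool called rbd (substitute your own pool name):

  ceph osd pool get rbd size                 # the pool's replica count (default 3)
  ceph osd crush rule dump replicated_rule   # look for "type": "host" -- the failure domain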

Ceph will also chew up RAM under certain circumstances which may be pertinent in your case. Check the Ceph hardware recommendations.

So spread the disks very evenly, then? Well, no, not necessarily. We're poking about in the lower end of viability so we're hitting some edge cases. Maybe we can (should?) buy a bit more kit!

If you have a lot of small storage nodes and a few big ones then you'll probably not notice so much: with a replication count of three, one of the three replicas will probably land on a big storage node while the other two get spread across however many small storage nodes you have. So the aggregate of all your small storage nodes, divided by two, is what gets matched up against your big node.

Let's work through an example. Assume a standard disk size and ten small nodes, each with one disk. How many disks are useful to have in one big system? Well, assuming that Ceph chooses to put two of the replicas across the ten small nodes, our effective capacity is 10 / 2, that is, five disks' worth. That means we can put five disks in our big box and get useful work done. Any more than that and their effectiveness will wear off.

It's not all lost, though. You'll get the benefit of more OSDs responding to data requests meaning overall throughput will increase. Which you might like the idea of.

Of course, the spread of replicas around the systems is not that straightforward but as a back-of-the-envelope planning system, it'll do.

Disks

There are some complicated instructions on setting up an OSD manually but you are considerably better off using the ceph-volume tool to set things up. Handily, it prints out the commands you would have run manually. It's a long list.

For reasons I can't quite fathom it uses LVM -- although see the next section.

LVM requires a partition on the disk.
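
A minimal sketch of creating an OSD, assuming /dev/sdb is a disk you are happy to hand over to Ceph (it will be wiped):

  ceph-volume lvm create --data /dev/sdb   # LVM setup, bluestore format, starts ceph-osd@N
  ceph-volume lvm list                     # what ceph-volume has done, per device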

OSD Backend

In Ceph Nautilus the default is to use bluestore as the backend for the Ceph OSD. bluestore stores the blobs of data directly on the OSD without an intervening filesystem.

The older backend was filestore which, perhaps unsurprisingly, stored objects in a filesystem on the disk. I presume the intervening LVM on the disk is a legacy of this. It's no hardship for bluestore and it allows you to switch to filestore if you want.
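
You can check which backend an existing OSD is using; for example, for OSD 0:

  ceph osd metadata 0 | grep osd_objectstore   # "bluestore" or "filestore"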

Disk Throughput

Here's the shocker. It turns out these fancy NVMe drives can be rubbish!

Say what, now? Yes, they are whippety-quick and outshine SSDs in normal operation but it turns out they can be rubbish -- worse than SSDs and in some cases worse than an HDD! Why so? Well, Ceph is a bit of a stickler for reliability and there's nothing quite as reliable as sync'ing every write to disk and it seems that some NVMe drives are particularly poor at actually syncing data.

You can test this out as dd has some useful flags: oflag=direct (side-step any kernel page caching) and, more importantly, oflag=dsync (wait for each write to complete). Ceph uses both (and you can't turn that off).

Sebastien Han is more thorough but let's test some handy disks with a few Ceph-ish 4KB writes:

  dd if=/dev/zero of=foo bs=4k count=10000 oflag=dsync

Machine  Type  Device Model                              kB/s  Format
A        NVMe  ADATA XPG SX8200 Pro SX8200PNP            6300  M.2 PCIe
A        SSD   Crucial MX500 CT1000MX500SSD1             4200  SATA 2.5
A        HDD   WD RED WD40EFRX-68N                        152  SATA 3.5
B        NVMe  SAMSUNG SM951 MZVPV256HDGL                 500  M.2 PCIe
B        SSD   Crucial MX500 CT1000MX500SSD1             3700  SATA 2.5
B        HDD   WD RED WD10JFCX-68N                         95  SATA 2.5
C        NVMe  WD Blue SN500 WDS500G1B0C                 3900  M.2 PCIe
C        HDD   Seagate Constellation ES.3 ST1000NM0033    146  SATA 3.5

Note

For comparison, try the dd without the dsync flag -- you should get something more on par with your nominal hardware limit (although write several GB of data to flush through any caches; try count=1M). The exact invocations are sketched after the list below.

  • PCIe 3 x4: 3.94 GB/s
  • PCIe 3 x2: 1.97 GB/s
  • PCIe 2 x4: 2.00 GB/s
  • PCIe 2 x2: 1.00 GB/s
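
For the record, the two invocations being compared look like this (foo is just a scratch file on the disk under test; the table above came from the first form):

  # Ceph-ish: wait for every 4KB write to reach the disk
  dd if=/dev/zero of=foo bs=4k count=10000 oflag=dsync

  # for comparison: no sync, and several GB to blow through any caches
  dd if=/dev/zero of=foo bs=4k count=1M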

You can see machine A drives the Crucial SSD a little quicker, so adjust the numbers accordingly, and machine B's WD Red is a laptop HDD! Machine A's WD RED is hardly going to be a performance drive either but it is there for storing stuff.

What can we read into this? For a start, look how slow everything is! The best we can see is 6MB/s. Six?!? Six! Everything is going to be slow. Well, maybe. Day to day, you're pretty unlikely to be writing out data in a solid stream but there are certain very obvious times when you are. With our OpenStack hats on, when we add images to Glance we'll be writing a lot of data. That will hurt. Luckily Ceph has a copy-on-write mode so that using that data (ie. cloning an instance from it) just costs the new writes.

I did mention that Ceph is for the big boys where those 4KB blobs are spread across potentially hundreds of OSDs?

Also look at what a rubbish combination the NVMe drive and machine B make! It has less than a seventh of the throughput of the Crucial SSD. In normal usage it's great; turn the dsync flag on and it goes to pot. Something is afoot there, as the NVMe drive, a Samsung SM951, is PCIe 3 x4 and the motherboard's (Gigabyte H170N-WIFI) M.2 slot is PCIe 3 x4. Samsung's marketing blurb claims a

Peak Write Sequential Performance Up to 1550 MB/s

yet we're seeing 0.5MB/s. Maybe Peak is not for very long.

January 2020 Update: spend some money

Let's take a look at a shiny new Samsung PM883 which, according to the blurb, runs at 320-520 MB/s -- the 240GB model is at the slower end. With the 240GB model (320 MB/s) in a USB 3.1 (Gen2) caddy (ie. nominally 10Gb/s, so the caddy should not be the bottleneck) I see 23MB/s. That's an order of magnitude slower than the blurb but, more importantly for us, an order of magnitude faster than anything else we have.

January 2020 pricing has a 1TB Crucial MX500 at ~£105 and the 960GB PM883 at ~£166. So, a 60% bump in price for an order of magnitude improvement in throughput.

Block DBs

With bluestore you can keep the block database (block.db) on a device separate from the block data. As that document notes, the journal goes on the fastest device available, and the journal and metadata fall back to the primary device if not specified (or full!).

Nominally, you'd have a small chunk of fast(er) journal/metadata space on a fast device with the data being sync'd out to the primary device in due course.

Given that NVMe and SSD drives are one to two orders of magnitude quicker than the HDDs it makes sense to use them.

There's some suggestion that you want about 4% of the OSD's size for the journal so that is roughly 40GB per TB of OSD.

With our minimum viable product hats on we can probably get away with making some suitably sized partitions on our NVMe/SSD boot disk (which is otherwise doing very little).
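
A sketch of what that looks like with ceph-volume, assuming /dev/sdb is the HDD and /dev/nvme0n1p4 is a suitably sized (roughly 40GB per TB of HDD) partition carved out of the boot NVMe:

  ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p4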

Disk Types

Ceph has some sense of the type of drive: Ceph Nautilus recognises hdd and ssd device classes.
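
You can see, and override, what Ceph thinks each OSD is:

  ceph osd crush class ls   # device classes currently in use
  ceph osd tree             # the CLASS column shows hdd/ssd per OSD
  # override if auto-detection got it wrong (rm-device-class first if already set)
  ceph osd crush set-device-class ssd osd.3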

Monitors

The monitors maintain the cluster map which indicates where the replicas for any blob of storage are across the cluster. They will coordinate the copying of any data to maintain the replica count should an OSD disappear for any length of time.

You can, of course, have more than one monitor, an odd number being preferable, which will chit chat amongst themselves, have repeated elections, and generally do the right thing.
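
You can watch this going on:

  ceph mon stat        # which monitors exist and which are in quorum
  ceph quorum_status   # more detail, including the current leader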

Warning

Of note, but not made terribly clear, is that the cluster map is live and only live. That is, there is no persistent copy of the cluster map.

Whoa! "So what," I hear you cry, "happens if my only monitor node goes down?" Well, your cluster won't do anything new. "New" includes clients/storage nodes re-joining. The cluster will continue doing anything already set up. It won't handle the loss of any OSD as that is managed by your one (now dead) monitor node.

Let's extrapolate that a little further (not suggesting I discovered this by painful experience): what if your single monitor node was an instance in OpenStack using Ceph as a backend on the same hosts and you reboot the hosts? When the Ceph OSDs attempt to start they hang -- because there's no monitor node to tell that they are up -- and OpenStack cannot start the monitor instance because the backing store is "down." Oh dear. And clearly it isn't helped by having multiple monitor nodes if they all live in the same OpenStack cluster.

So what are we saying here? The monitor nodes can be VMs -- they don't consume much resource -- but they cannot be dependent on the Ceph storage they are managing. So that means they must be VMs running on other hosts and, preferably, on more than one other host so that if you reboot/crash and burn any one of these other hosts the Ceph cluster will keep running.

I suppose, if you're feeling a bit risky and are short of hardware, you could run KVM instances out of the root filesystems of the Ceph storage nodes for the monitor nodes...

Managers

You should start a manager daemon. It is the interface for metrics. Run it on your nominal primary monitor. If it isn't running you don't get any useful performance metrics, and there's no real downside to running it.
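
A few commands to check on it (the prometheus module is just one example of a metrics interface, not a requirement):

  ceph -s | grep mgr                  # the status output lists the active manager and any standbys
  ceph mgr module ls                  # which manager modules are enabled
  ceph mgr module enable prometheus   # e.g. expose metrics for Prometheus to scrape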

Summary

There are a few things to remind ourselves of:

  1. You need at least three physical hosts with your disks for the storage nodes.
  2. You probably want to spread the storage fairly evenly across the nodes but it matters less the more storage nodes you have.
  3. You'll have better performance if you can put an NVMe/SSD journal in front of any HDDs.
  4. You need at least three VMs not dependent on the Ceph storage for your monitor nodes (and the manager can piggyback on one).
