Setup

Hardware Configuration

We have our OpenStack hardware to play with, which for Ceph means:

  • nodes 1 & 2 have an NVMe boot disk, a 1TB SSD and a 1TB HDD
  • node 3 has an NVMe boot disk and a couple of 4TB HDDs

We also have a few collected PCs lying about which we can run the monitor VMs on.

The NVMe disks have 64GB boot partitions and the rest is split into journal partitions for the OSD. The SSDs and HDDs are dedicated to Ceph.
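
One way the journal space might have been carved out (a sketch only, assuming GPT and sgdisk; partition number 4 matches the LVM physical volume that appears later):

# sketch: give the rest of the NVMe to Ceph as a single LVM-typed partition
# the partition number and the rest of the layout are assumptions
sgdisk -n 4:0:0 -t 4:8e00 -c 4:ceph-db /dev/nvme0n1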

Warning

Using a small (128GB) NVMe drive as a journal might not have been the smartest choice, as you can read in NVMe Burnout. Essentially, after a few months of VMs chuntering away the system had written 23TB of data to the 64GB of journal space, which was "over" the manufacturer's lifetime expectancy. It died.

Setup

We're following the manual deployment docs. They have an admin node but we'll use one of the monitors.

This will probably seem overly complicated but at heart there are a few things going on:

  1. we need to bootstrap the live cluster map
  2. Ceph uses some very simple ACLs which we need to keep in mind
  3. keep your ducks in a row. We'll use some variables to help show what's going on
  4. check whether things are owned by ceph or not -- you can accidentally leave things owned by root, which doesn't work (a quick check is sketched below).
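
For point 4, a quick way to spot stray root-owned files is something like:

find /var/lib/ceph ! -user ceph -ls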

Networking

We'll use 192.168.8.0/24 for the Ceph public network -- that's the one the clients use to talk to the monitors and storage nodes.

We'll also use 192.168.9.0/24 for the Ceph cluster network. This is entirely optional and is used by the storage nodes to copy blobs between themselves. If it isn't set up they'll use the public network.

Software Install

Ceph and its dependencies are covered by the CentOS OpenStack repos:

yum install centos-release-openstack-stein
yum install ceph

Note

Do the above and install the OpenStack SIG release rather than yum install centos-release-ceph-nautilus (the Ceph Storage SIG release) because the former covers the dependencies.

This is the same on all Ceph nodes (storage and monitors).

Bootstrap

We'll do this on our initial monitor node which is, for some reason, mon3.

Let's set up a few variables for the rest of the setup. We'll use the short hostname where applicable.

One thing you will want to consider is the number of placement groups, 128 in the example below. The number you need is dependent on the number of pools (of data), their replication number and the number of OSDs. There is a calculator.
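
As a rough rule of thumb (the calculator does this properly) the target is somewhere around 100 placement groups per OSD: multiply the OSD count by 100, divide by the replication size, round to a power of two and share that across your pools. A back-of-the-envelope sketch, with the OSD and replica counts purely as assumptions:

# ~100 PGs per OSD is the usual target
OSDS=6
REPLICAS=3
echo $(( OSDS * 100 / REPLICAS ))   # 200 -> round to 256 PGs across all pools

Back to the variables: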

HOSTNAME=$(uname -n)
HOST=${HOSTNAME%%.*}
MON_IP=192.168.8.3
CN=ceph
FSID=$(uuidgen)

Note

The cluster's name, CN, defaults to ceph however we'll use the variable to show where it is being used.

ceph.conf

cat <<EOF > /etc/ceph/${CN}.conf
[global]
fsid = ${FSID}
mon initial members = ${HOST}
mon host = ${MON_IP}
public network = 192.168.8.0/24
cluster network = 192.168.9.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 128
osd pool default pgp num = 128

# 1 normally, 0 for a single machine hack
osd crush chooseleaf type = 1

[mon.${HOST}]
mon allow pool delete = true
EOF

chown ceph:ceph /etc/ceph/${CN}.conf

Replication Ratio

Two parameters govern the replication ratio, osd pool default size is the actual replication ratio and osd pool default min size is what we're prepared to live with (temporarily). The cluster will be degraded if a pool loses an OSD and drops below the default size (3) but won't panic. I think there is a tacit assumption that the OSD has gone away temporarily (the host has been rebooted, say) and will be back soon enough.

Of course, the state of the OSD is verified when it does come back: Ceph is reliable!

If the OSD doesn't come back for a while then Ceph will make some attempt to move data to an alternate OSD (if one is available).

If you do drop below the minimum size then Ceph will mark the pool as read-only. This is probably fatal if that pool is the backing store for your VMs!

If, as in our minimum viable product scenario, two of your three storage nodes have rebooted then there's not much Ceph can do anyway, and when the two come back there'll be plenty of checking going on!
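
For reference, those defaults only apply when a pool is created; you can inspect or change the values per pool later (using one of the pool names we create further down):

ceph osd pool get cinder-volumes size
ceph osd pool get cinder-volumes min_size
ceph osd pool set cinder-volumes min_size 2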

Monitor

Again, this looks complicated but it's mostly about creating ACLs for various purposes.

Keyrings bind a name (-n *name*) with a set of capabilities (--cap *thing* *ACL*).

  • Create a temporary keyring for the monitors to use -- the monitors only need to see the, er, monitors:

    sudo -u ceph ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
    

    Note

    The name mon. is used to retrieve the monitor key when we add more monitors.

  • Create a permanent keyring for admins (us!) to use -- we need access to everything:

    sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
    
  • Create a permanent keyring for bootstrapping OSDs:

    sudo ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd'
    
  • Add the two permanent keyrings (admin, bootstrap OSD) to the temporary monitor keyring:

    sudo ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
    sudo ceph-authtool /tmp/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
    
  • Create the initial monitor map with this host in it using the temporary keyring:

    monmaptool --create --add ${HOST} ${MON_IP} --fsid ${FSID} /tmp/monmap
    
  • Create the working directory for the monitor daemon on this host:

    sudo -u ceph mkdir /var/lib/ceph/mon/${CN}-${HOST}
    
  • Start the monitor with the initial monitor map:

    sudo -u ceph ceph-mon --mkfs -i ${HOST} --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
    

    Note

    The monitor is now running. We've just run the above command in the foreground and, given that nothing else is running in our cluster, we can ^C it and restart it with systemd in two ticks.

  • The instructions say "create the done file" so, er, we do it:

    sudo -u ceph touch /var/lib/ceph/mon/${CN}-${HOST}/done
    
  • Finally, kick off the monitor daemon on this host:

    sudo systemctl enable ceph-mon@${HOST}
    sudo systemctl start ceph-mon@${HOST}
    

That's it, our monitor daemon is now live. Remember, the monitor map is live. That initial monitor map we created remains just that, a point-in-time instance of the map. If we rebooted now the monitor would revert to that old map, which might not be ideal.
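
If you want to see the live map and compare it with the point-in-time file we generated:

ceph mon dump                    # the cluster's current monitor map
monmaptool --print /tmp/monmap   # the static bootstrap copy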

How do we know it is running?

# ceph -s

You should get something even if it reports some errors about the cluster state. We only have a monitor node so it shouldn't be too surprising that the cluster is in ill-health.

You can also run a long term variant of the above:

ceph -w

which prints out the same initial status then hangs about printing any state changes. Most of the time nothing much happens other than Ceph complaining that time is out of sync (for reasons unknown as these systems are generally sub-millisecond in sync). However, if you restart an OSD you'll get some more interesting messages.

Firewall

Just to tidy things up, open up the firewall to allow other nodes to talk to us:

# port 6789
firewall-cmd --add-service=ceph-mon --permanent

In addition (beyond the official docs) I believe you need to open up the general Ceph ports (6800-7300) as the monitors maintain connections to all the storage nodes:

firewall-cmd --add-service=ceph --permanent

Manager

You can (should?) run a manager on each monitor node. The manager is used to collate metrics so it's not required but interesting.

The instructions kick off the manager daemon interactively:

(
  umask 007
  mkdir /var/lib/ceph/mgr/${CN}-${HOST}
  umask 077
  ceph auth get-or-create mgr.${HOST} mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/${CN}-${HOST}/keyring
  chown -R ceph:ceph /var/lib/ceph/mgr/${CN}-${HOST}
  ceph-mgr -i ${HOST}
)

So you might want to prep the manager daemon to start automatically for next time:

systemctl enable ceph-mgr@${HOST}

Firewall

# mgr!
firewall-cmd --add-port=6810/tcp --permanent

Dashboard

The Ceph Manager supports a decent Web UI dashboard:

ceph mgr module enable dashboard

(you only need to do this once per cluster)

and open the firewall on each manager node:

# mgr dashboard
firewall-cmd --add-port=8443/tcp --permanent

One nice aspect is that you can bookmark any of the monitor/manager nodes in your browser and you will be automatically redirected to the current live one.
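
To find out which manager is currently active (and therefore where the dashboard is actually being served from) ask the cluster; the output below is illustrative:

# ceph mgr services
{
    "dashboard": "https://mon3:8443/"
}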

Other Nodes

Once we have one node up and running and the monitor is live we can start adding more.

Installing the software is the same. You then need to bootstrap the new node from the old. We're copying the vital bootstrap config. So, from the original node targeting our second node, ceph2:

cd /etc/ceph
scp ceph.client.admin.keyring ${CN}.conf ceph2:$PWD
cd /var/lib/ceph/bootstrap-osd
scp ceph.keyring ceph2:$PWD

Other Monitors

Other monitors are a little more involved as the cluster map is live. We need to get a copy of the live map, add our new monitor and then kick off the daemon.

Having installed the software and copied the config as above:

  • Log into the new monitor and prep the working directory:

    ssh mon2
    HOSTNAME=$(uname -n)
    HOST=${HOSTNAME%%.*}
    
    mkdir /var/lib/ceph/mon/ceph-${HOST}
    
  • Collect the mon. key and the current cluster map:

    ceph auth get mon. -o key
    ceph mon getmap -o map
    

    mon. is the name we used for the key shared by all monitors.

  • Kick off the new monitor using the key and map we just saved:

    ceph-mon -i ${HOST} --mkfs --monmap map --keyring key
    

    Note

    The cluster map is live so the act of this new monitor starting (with the right credentials and latest map) means the cluster immediately has an extra monitor. Nothing more to do here.

  • Ready this system to start the monitor on reboot:

    chown -R ceph:ceph /var/lib/ceph/mon/ceph-${HOST}
    systemctl enable ceph-mon@${HOST}
    

OSDs

Adding storage with ceph-volume is very easy. We're complicating things a bit by using an NVMe partition as the journal.

Actually, we've created an LVM volume group on the NVMe partition and we'll use a logical volume in the volume group for the journal.

If you recall we want something like 4% of the OSD for the journal. Our OSDs are 1TB (and 4TB) so we need something like 40GB (and 160GB) of space per OSD. Here I've used 80GB per OSD on this example host for reasons I don't recall:

# pvs /dev/nvme0n1p4
  PV             VG      Fmt  Attr PSize   PFree
  /dev/nvme0n1p4 ceph-db lvm2 a--  173.27g 13.27g
# lvs ceph-db
 LV      VG      Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
 db-sda1 ceph-db -wi-ao---- 80.00g
 db-sdb1 ceph-db -wi-ao---- 80.00g
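
For the record, the volume group and logical volumes above would have been created with something like this (a sketch: the partition and sizes are taken from the output above):

vgcreate ceph-db /dev/nvme0n1p4
lvcreate -L 80G -n db-sda1 ceph-db
lvcreate -L 80G -n db-sdb1 ceph-db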

Creating the OSD is simple, here for partition 1 on sdb:

ceph-volume lvm create --data /dev/sdb1 --block.db ceph-db/db-sdb1

The OSD will be given an ID, a small integer starting at 0.

ceph osd tree

At any point you can run ceph osd tree to see your accumulated set of OSDs:

# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
 -1       9.70517 root default
 -2       1.54099     host c1c
  3   hdd 0.77049         osd.3     up  1.00000 1.00000
  2   ssd 0.77049         osd.2     up  1.00000 1.00000
-10       1.81839     host c2c
  4   hdd 0.90919         osd.4     up  1.00000 1.00000
  5   ssd 0.90919         osd.5     up  1.00000 1.00000
 -5       6.34579     host c3c
  0   hdd 3.17290         osd.0     up  1.00000 1.00000
  1   hdd 3.17290         osd.1     up  1.00000 1.00000

Here I don't have the full 1TB/4TB (essentially the weight column) as I've chewed off disk partitions for an OpenStack Swift implementation.

OSD Class

Note that Ceph has recognised SSDs and HDDs and designated the appropriate class. It doesn't differentiate NVMes (at the time of writing) and marks them as SSDs. You can revisit this if you desire with something like:

ceph osd crush rm-device-class ${osd_id}
ceph osd crush set-device-class nvme ${osd_id}

Does the class do anything? Well, not immediately (or at least not immediately obviously). The device class is used (or at least available to be used) in placement decision making. We can force it to be used for particular pools.

OSD removal

You can "cleanly" remove an OSD if you decide to repurpose your disk. Suppose we want to remove sdb1 then first figure out which OSD is using sdb1:

ceph-volume lvm list

Now you can mark the corresponding osd.N as down:

systemctl stop ceph-osd@N

and now actually zap the disk:

ceph-volume lvm zap /dev/sdb1

which may get so far and then fail: it hasn't figured out the volume groups. You might want to make some appropriate decision about who's using what, then manually remove the volume group (and contained logical volume) and re-run ceph-volume lvm zap. Something like:

pvs /dev/sdb1

to figure out the volume group used by sdb1 then:

vgremove -y $UUID

and finally repeat:

ceph-volume lvm zap /dev/sdb1

What we're aiming for is to have the OSD ID freed up. The ultimate test of the success of that is that a subsequent ceph-volume lvm create will re-use that freed up OSD ID.

If it didn't then you may need to remove references to the OSD ID from various places:

ceph osd rm osd.$ID
ceph osd crush rm osd.$ID
ceph auth rm osd.$ID
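
More recent releases (Luminous onwards) can collapse those three steps into one; as a sketch:

# purge removes the OSD from the CRUSH map, deletes its cephx key and removes the OSD entry
ceph osd purge osd.$ID --yes-i-really-mean-it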

OSD Usage

You can get a view of how your OSDs are being used, something like:

# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
 3   hdd 0.77049  1.00000 789 GiB 375 GiB 294 GiB 100 KiB 1024 MiB 414 GiB 47.55 2.78 102     up
 2   ssd 0.77049  1.00000 789 GiB 130 GiB  49 GiB  20 KiB 1024 MiB 659 GiB 16.52 0.97  90     up
 4   hdd 0.90919  1.00000 931 GiB 289 GiB 288 GiB  28 KiB  1.2 GiB 642 GiB 31.06 1.82  97     up
 5   ssd 0.90919  1.00000 931 GiB  57 GiB  56 GiB  24 KiB 1024 MiB 874 GiB  6.07 0.36  95     up
 0   hdd 3.17290  1.00000 3.2 TiB 406 GiB 155 GiB 176 KiB 1024 MiB 2.8 TiB 12.50 0.73  90     up
 1   hdd 3.17290  1.00000 3.2 TiB 440 GiB 189 GiB  96 KiB 1024 MiB 2.7 TiB 13.53 0.79 102     up
                    TOTAL 9.7 TiB 1.7 TiB 1.0 TiB 445 KiB  6.2 GiB 8.0 TiB 17.08

Although I'm not sure you can really tell much from that as a user. Notice that the larger disks (IDs 0 and 1) don't necessarily have much more data on them. We're back to that balancing across failure points (hosts) issue.

Pools

Our manual installation of Ceph nautilus does not automatically create a pool called rbd which is a bit unfortunate as several commands, notably the rbd command itself, use it as the fallback/default pool if you don't explicitly pass one.
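
If you want that fallback pool to exist anyway, something like this should do it (the placement group count is an assumption, see the calculator):

ceph osd pool create rbd 64
rbd pool init rbd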

We're targeting OpenStack which will have pools something like: glance-images, cinder-volumes and nova-vms.

When we create a pool we indicate the number of placement groups the pool should use. Wait, we indicate the number? We don't know what these placement groups are or what they mean or anything! That's true but we still have to do it!

Ceph now starts to get a bit unreasonable. ceph -s may complain about too few PGs per OSD (X < min Y) for some numbers X and Y, for example 12 < min 30.

We made an effort in Bootstrap to use the calculator to figure out the number of placement groups we need. The act of creating pools will under- or over-use the number of placement groups we defined in the monitor setup and Ceph will whinge, grr! If this was the first of several pools to be added then add the other pools to increase the total number of placement groups and therefore the number of placement groups per OSD!

If you still get some whinging then you may require some judicious modification of the number of placement groups to appease it.

Leaving you to fiddle with parameters, here for the nova-vms pool:

ceph osd pool set nova-vms pg_num 512
ceph osd pool set nova-vms pgp_num 512

Pool Creation

Pool creation is easy enough, barring this unclear decision about placement groups:

ceph osd pool create glance-images 64

where 64 is the number of placement groups to use. If you decide you're unhappy with something about your pool you can fiddle with its parameters:

ceph osd pool [get|set] cinder-volumes parameter [...]

If you omit the parameter you'll get a list of possible parameters.

At this point we can do a simple list of the pools:

# ceph osd lspools
1 glance-images
2 cinder-volumes
3 nova-vms

although a rados df is more illuminating about the overall state of play (rados is a command for interacting with a Ceph object storage cluster):

# rados df
POOL_NAME         USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED    RD_OPS      RD    WR_OPS      WR USED COMPR UNDER COMPR
cinder-volumes  13 GiB    1134      0   3402                  0       0        0  55083257  49 GiB   3440171 119 GiB        0 B         0 B
glance-images  706 GiB   30019      0  90057                  0       0        0  54244420 230 GiB    238858 329 GiB        0 B         0 B
nova-vms       315 GiB   15452      0  46356                  0       0        0 384551762 676 GiB 385853274  13 TiB        0 B         0 B

total_objects    46605
total_used       1.7 TiB
total_avail      8.0 TiB
total_space      9.7 TiB

Remember that the usage is going to reflect your replication ratio.

Pool Destruction

Getting rid of pools isn't quite so easy. We need to be allowed to do it and we need to express our determination to do so.

Firstly, we need to be allowed to delete a pool at all. You'll need to edit /etc/ceph/ceph.conf:

[mon.mon3]
mon allow pool delete = true

and then restart the daemon:

systemctl restart ceph-mon@mon3

The actual delete command requires extra effort too:

# ceph osd pool delete foo
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored in pool foo.  If you are *ABSOLUTELY CERTAIN* that is what you want, pass the pool name *twice*, followed by --yes-i-really-really-mean-it.

OK:

# ceph osd pool delete foo foo --yes-i-really-really-mean-it
pool 'foo' removed

which is trivially scripted.
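
For example, a throwaway loop over some (hypothetical) scratch pools:

for pool in foo bar baz; do
    ceph osd pool delete $pool $pool --yes-i-really-really-mean-it
done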

OSD CRUSH Rule

We can create specialised rules to use, say, only specific classes of OSD. Suppose we want to have a "fast" pool that only uses SSDs.

The key command is to create a rule that uses a specific class:

ceph osd crush rule create-replicated *rule-name* default host *osd-class*

Note

default refers to the root of the OSD tree and host refers to the differentiator for replication. Here we say replicas should appear on different hosts.

An example for creating a CRUSH rule using SSDs only:

ceph osd crush rule create-replicated fast-ssd default host ssd

If we wanted to create a pool up front using the specific CRUSH rule then we need to supply the number of placement groups for placement (pgp_num -- just use the same value as pg_num) as well as the new rule name:

ceph osd pool create cinder-volumes-fast 512 512 replicated fast-ssd

Alternatively, you can direct a pool to use the new CRUSH rule after it is up and running:

ceph osd pool set cinder-volumes crush_rule fast-ssd
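
You can confirm which rule a pool is now using with:

ceph osd pool get cinder-volumes crush_rule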
