Maintenance

Node Restarts

Nodes restarting can be a bit of a pain, depending on the service. For most microservices the user of the service will probably be unaware that any change has occurred (unless they catch a singleton mid-redeployment). For some services, notably the GitLab Runners described previously, a new Runner means a new, er, Runner, meaning we have to cycle round any interested GitLab Projects, either unlocking the new Runner for wider use or enabling it in other Projects.

In the meantime we have the mechanics of a node shutting down. Ideally, for worker nodes, we would kubectl cordon the node from further use, kubectl drain it (passing increasingly vicious arguments to bypass the safety warnings) to clear it of running Pods, and then restart the operating system. Remember, of course, to kubectl uncordon it upon its return.
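
In practice that looks something like the following -- which drain arguments you need depends on what's running on the node; --ignore-daemonsets, --delete-emptydir-data and, ultimately, --force are the usual escalations:

# kubectl cordon <node-name>
# kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
... reboot the node ...
# kubectl uncordon <node-name>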

Graceful Shutdown

Maybe the idea that you might want to re-use a node with an updated kernel/libc (as opposed to instantiating a new node with the new kernel/libc) was anathema, as it wasn't until Kubernetes 1.21 that the idea of a Graceful Node Shutdown was introduced to delay the operating system shutdown whilst Pods are evicted.

The wording is slightly squirrelly as it's not clear whether shutdownGracePeriod alone is enough or whether shutdownGracePeriodCriticalPods must be defined as well. Importantly, it's even less clear where you set these values. They are kubelet settings, so that's a start.

Naively, on Fedora, there is /etc/kubernetes/kubelet.conf, but that doesn't look right, and /etc/sysconfig/kubelet is usually about supplying extra arguments to the daemon (and, in this instance, supplies none).

kubelet is (probably) a SystemD service, so there's likely to be a /usr/lib/systemd/system/kubelet.service. That's a bit short of clues, but there's also a /usr/lib/systemd/system/kubelet.service.d directory which contains 10-kubeadm.conf, which finally gives us some hope.

Mind you it references three config files through environment variables (and a separate environment file). Plodding through those we end up with /var/lib/kubelet/config.yaml which does, finally, define shutdownGracePeriod/shutdownGracePeriodCriticalPods. Woo!
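
If you'd rather let systemd do the plodding, the following should dump the unit plus all of its drop-ins in one go, the --config=/var/lib/kubelet/config.yaml argument included (at least on a kubeadm-built node):

# systemctl cat kubelet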

It doesn't feel right to be editing that file but the kubelet config file documentation doesn't suggest you can pass more than one --config= option.

kubelet --help suggests more options than you can shake a stick at but sadly, none involving shutdown grace periods (critical or otherwise).

So, set those values in that file to something reasonable for your site -- 30s and 10s are suggested values -- and at least we should have covered some coordinated rescheduling of Pods when we reboot a worker node.
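
In other words, /var/lib/kubelet/config.yaml gains (or has amended) something along these lines -- the 30s/10s figures are only the suggested starting point:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
...
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s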

You'll need to restart kubelet and, with SystemD, you can check whether graceful shutdown has been successfully applied as kubelet will have added an "inhibitor":

# systemd-inhibit --list
WHO            UID USER PID     COMM           WHAT     WHY                                        MODE
...
kubelet        0   root 1153267 kubelet        shutdown Kubelet needs time to handle node shutdown delay

(might take a few seconds to appear after restarting kubelet)

etcd backup

etcd holds everything together so we probably want to back it up. You can back up etcd with etcdctl, preferably on a quiescent system (I have no idea how to achieve that).

Distraction

The general idea is that you use etcdctl on your master node. Hmm, there's no etcdctl on my master node. Further, etcd is, of course, running in a Pod:

# kubectl get pods -A | grep etcd
kube-system               etcd-k8s-m1      ...

where the Pod name is etcd-<node-name>.
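
If you want to skip the grep, the kubeadm-managed etcd Pod should also carry a component=etcd label, so something like this ought to find it:

# kubectl -n kube-system get pods -l component=etcd -o name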

Given that, there's a possibility that my master node's etcdctl (were I to install one) would be a just-slightly-different-and-therefore-corrupting version away from the running etcd. Can we get etcd in the Pod to back itself up?

To do anything in there we'll need:

# kubectl -n kube-system exec etcd-k8s-m1 -- <cmd>

and, if the <cmd> is sh (actually, bash) you might want:

# kubectl -n kube-system exec etcd-k8s-m1 -it -- sh

Whereupon you'll discover that you have /bin/cp and /bin/sh to play with.

We'll actually use:

# kubectl -n kube-system exec etcd-k8s-m1 -- sh -c '...'

because it allows us to set some environment variables.

etcdctl

In the first instance, the generally assumed form of the command to run is:

ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save <filename>

and we can put that in our sh -c '...' snippet.
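
Putting the two together gives us something like (still with <filename> to be decided):

# kubectl -n kube-system exec etcd-k8s-m1 -- \
    sh -c 'ETCDCTL_API=3 etcdctl \
             --cacert=/etc/kubernetes/pki/etcd/ca.crt \
             --cert=/etc/kubernetes/pki/etcd/server.crt \
             --key=/etc/kubernetes/pki/etcd/server.key \
             snapshot save <filename>'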

There's a few things to unpick, here:

  • ETCDCTL_API=3 -- that isn't the default? Running etcdctl version suggests:

    etcdctl version: 3.5.1
    API version: 3.5
    

    We're shooting a bit blind here so let's go with the flow.

  • --cacert etc. -- etcdctl will not do much, indeed might hang, if these arguments are not supplied. If you logged in with sh you can see those files exist, but technically they are from whatever the Pod was launched with. You can get those by querying the Pod:

    # kubectl -n kube-system get pod/etcd-k8s-m1 -o yaml
    ...
    spec:
      containers:
      - command:
        - etcd
        - ...
        - --cert-file=/etc/kubernetes/pki/etcd/server.crt
        - ...
        - --key-file=/etc/kubernetes/pki/etcd/server.key
        - ...
        - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    

    Leaving us with the enticing possibility for any automation to parse YAML-encoded arguments to a command with inconsistent names between options and parameters.

    The equivalent jsonpath is -o jsonpath='{$.spec.containers[0].command}' (on the same kubectl get pod command), leaving us to parse JSON instead:

    [
     "etcd",
     ...,
     "--cert-file=/etc/kubernetes/pki/etcd/server.crt",
     ...,
     "--key-file=/etc/kubernetes/pki/etcd/server.key",
     ...,
     "--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
    ]
    
  • <filename> is self-explanatory.

    However, it is also quite literal. Thinking ahead about how to get the file out of the Pod, I tried a couple of variations:

    • - for stdout, obviously. It will start creating a file called -.part
    • /dev/fd/1 for stdout, obviously. It will fail to create a file called /dev/fd/1.part -- /dev/fd/1 does exist but even root can't create files at random in /dev.

    I now have my experimental files permanently in my Pod, see below.

Hmm. So we can create a file called, say, /tmp/x and all is good. We can use the suggested verification command (which doesn't require the --cacert etc. arguments):

# kubectl -n kube-system exec etcd-k8s-m1 -- \
    sh -c 'ETCDCTL_API=3 etcdctl --write-out=table \
             snapshot status /tmp/x'
Deprecated: Use `etcdutl snapshot status` instead.

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 776f4bd2 |  2496568 |        949 |     7.2 MB |
+----------+----------+------------+------------+

Which seems OK (albeit the suggested command is deprecated).
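
If the etcd image in the Pod ships etcdutl (I haven't checked; the upstream 3.5 releases do), the non-deprecated form should be much the same:

# kubectl -n kube-system exec etcd-k8s-m1 -- \
    sh -c 'etcdutl --write-out=table snapshot status /tmp/x'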

Now here's the tricky question, how do we get it out?

There is a kubectl cp command but, rather curiously, it requires tar inside the Pod. It's easy to see why it would require tar if asked to copy a directory hierarchy, but it still requires tar when copying a single file. Our Pod does not have tar.

However, this time, cp and /dev/fd/1 do come to the rescue and we can use cp /tmp/x /dev/fd/1 in our snippet and redirect the whole kubectl exec ... into a file:

# kubectl exec ... -- sh -c 'cp /tmp/x /dev/fd/1' > etcd.bkp

Finally, we can delete /tmp/x. Yikes! Oh no we can't. There's no rm in the container and bash does not have an unlink(2) builtin. Oops! We could limit the mess with a cp /dev/null /tmp/x but that raises the question: how much disk space is there, and does creating this backup file cause a problem?

After rummaging around -- I couldn't figure out any in-Pod mechanism -- we can ask the container runtime:

# crictl imagefsinfo etcd
{
  "status": {
    "timestamp": "1649346179359457165",
    "fsId": {
      "mountpoint": "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
    },
    "usedBytes": {
      "value": "2736508928"
    },
    "inodesUsed": {
      "value": "36661"
    }
  }
}

and on the host:

# df -h /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/fedora_fedora-root   39G  8.9G   30G  24% /

OK, so it looks like we have some room for manoeuvre.

# kubectl -n kube-system exec etcd-k8s-m1 -- sh -c 'cp /dev/null /tmp/x'
# crictl imagefsinfo etcd
{
  "status": {
    "timestamp": "1649346249359355028",
    "fsId": {
      "mountpoint": "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
    },
    "usedBytes": {
      "value": "2729287680"
    },
    "inodesUsed": {
      "value": "36661"
    }
  }
}

and by my Maths (read: bc) that is a 7221248 byte drop in usedBytes. How big is our file?

# ls -l etcd.bkp
-rw-r--r--. 1 root root 7217184 Apr  7 16:27 etcd.bkp

So that looks quite plausible. Of course, there's no drop in inodesUsed as we haven't (can't!) deleted anything.

Nominal Route

The nominal route is to use etcdctl on the node with the same arguments and the additional --endpoints=https://127.0.0.1:2379 as the Pod has mapped its ports onto the node. In practice it is also listening on the main interface so that the other nodes can talk to it. Try:

# ss -ntlp sport = :2379

on your master node.

Further advantages include not having to mess about with undeletable and hard-to-extract files...
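
In which case, with a matching etcdctl installed on the node, the backup boils down to:

# ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save etcd.bkp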

etcd restore

Restoration is a bit more obtuse. You would think that, if the occasion requires it, the system will be quiescent -- stop kubelet on the other nodes, I guess, if it isn't.

There are two parts to the restoration:

  1. restoring the etcdctl backup
  2. poking the Pod to use the restored data

Data Restore

Here we need to specify a few more bootstrap arguments and a safe place to perform the restoration to.

This latter element feels a bit awkward:

# ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --name=k8s-m1 \
    --data-dir /var/lib/etcd-from-backup \
    --initial-cluster=k8s-m1=https://127.0.0.1:2380 \
    --initial-cluster-token etcd-cluster-1 \
    --initial-advertise-peer-urls=https://127.0.0.1:2380 \
    snapshot restore etcd.bkp

Here, then, it looks much the same plus:

  • --name -- presumably that should be the same name as you used before (from the spec.containers[0].command above)

  • --data-dir -- the separate place to restore to. The normal data dir is /var/lib/etcd, so avoid that, I guess.

  • some initial cluster args pointing at ourselves

    I understand that when restoring subsequent master nodes, using the same backup file, those initial cluster args should point at the first master you restored.

Poke the Pod

Here we need to adjust the configuration of the etcd Pod, a side-effect of which causes the Pod to restart. The parameters we're going to modify are those for the data dir and a bootstrap token.

The configuration file in question is /etc/kubernetes/manifests/etcd.yaml (it is a static Pod manifest, and its presence in /etc/kubernetes/manifests is, apparently, why editing it prompts the kubelet to restart the associated Pod) and /var/lib/etcd is mentioned in three places (sketched below):

  • --data-dir in spec.containers[0].command
  • a mountPath in spec.containers[0].volumeMounts
  • a path in spec.volumes
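
A sketch of the edited fragments -- assuming the usual kubeadm manifest layout, where the volume is called etcd-data; use whatever names your manifest already has:

spec:
  containers:
  - command:
    - etcd
    - ...
    - --data-dir=/var/lib/etcd-from-backup
    ...
    volumeMounts:
    - mountPath: /var/lib/etcd-from-backup
      name: etcd-data
  volumes:
  - hostPath:
      path: /var/lib/etcd-from-backup
    name: etcd-data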

I guess what we're seeing here is that the new Pod will use these new data dir parameters but, uh, permanently. What happens the next time?

Perhaps the restored data dir should be something more reflective of the time of the backup so at least it's clear what data set the Pod was started on (whatever state it is in now).

Mumshad, here, is suggesting we also need to update the cluster token in the etcd argument list:

- --initial-cluster-token=etcd-cluster-1
