* [Introduction](#introduction)
* [Kubernetes & OS Compatibility](#kubernetes-&-os-compatibility)
* [Installation](#installation)
* [Configuration](#configuration)
* [Reboot Sentinel File & Period](#reboot-sentinel-file-&-period)
* [Blocking Reboots via Alerts](#blocking-reboots-via-alerts)
* [Prometheus Metrics](#prometheus-metrics)
* [Slack Notifications](#slack-notifications)
* [Overriding Lock Configuration](#overriding-lock-configuration)
* [Operation](#operation)
* [Testing](#testing)
* [Disabling Reboots](#disabling-reboots)
* [Manual Unlock](#manual-unlock)
* [Building](#building)
## Introduction
Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that
performs safe automatic node reboots when the need to do so is
indicated by the package management system of the underlying OS.
* Watches for the presence of a reboot sentinel e.g. `/var/run/reboot-required`
* Utilises a lock in the API server to ensure only one node reboots at
a time
* Optionally defers reboots in the presence of active Prometheus alerts
* Cordons & drains worker nodes before reboot, uncordoning them after
## Kubernetes & OS Compatibility
The daemon image contains a 1.7.x `k8s.io/client-go` and `kubectl`
binary for the purposes of maintaining the lock and draining worker
nodes. Whilst it has only been tested on a 1.7.x cluster, Kubernetes
typically has good forwards/backwards compatibility so there is a
reasonable chance it will work on adjacent versions; please file an
issue if this is not the case.
Additionally, the image contains a `systemctl` binary from Ubuntu
16.04 in order to command reboots. Again, although this has not been
tested against other systemd distributions there is a good chance that
it will work.
## Installation
To obtain a default installation without Prometheus alerting interlock
or Slack notifications:
```
kubectl apply -f https://github.com/weaveworks/kured/releases/download/1.0.0/kured-ds.yaml
```
If you want to customise the installation, download the manifest and
edit it in accordance with the following section before application.
## Configuration
The following arguments can be passed to kured via the daemonset pod template:
```
Flags:
--alert-filter-regexp value alert names to ignore when checking for active alerts
--ds-name string namespace containing daemonset on which to place lock (default "kube-system")
--ds-namespace string name of daemonset on which to place lock (default "kured")
--lock-annotation string annotation in which to record locking node (default "weave.works/kured-node-lock")
--period duration reboot check period (default 1h0m0s)
--prometheus-url string Prometheus instance to probe for active alerts
--reboot-sentinel string path to file whose existence signals need to reboot (default "/var/run/reboot-required")
--slack-hook-url string slack hook URL for reboot notfications
--slack-username string slack username for reboot notfications (default "kured")
```
### Reboot Sentinel File & Period
By default kured checks for the existence of
`/var/run/reboot-required` every sixty minutes; you can override these
values with `--reboot-sentinel` and `--period`. Each replica of the
daemon uses a random offset derived from the period on startup so that
nodes don't all contend for the lock simultaneously.
### Blocking Reboots via Alerts
You may find it desirable to block automatic node reboots when there
are active alerts - you can do so by providing the URL of your
Prometheus server:
```
--prometheus-url=http://prometheus.monitoring.svc.cluster.local
```
By default the presence of *any* active (pending or firing) alerts
will block reboots, however you can ignore specific alerts:
```
--alert-filter-regexp=^(RebootRequired|AnotherBenignAlert|...$
```
An important application of this filter will become apparent in the
next section.
### Prometheus Metrics
Each kured pod exposes a single gauge metric (`:8080/metrics`) that
indicates the presence of the sentinel file:
```
# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="ip-xxx-xxx-xxx-xxx.ec2.internal"} 0
```
The purpose of this metric is to power an alert which will summon an
operator if the cluster cannot reboot itself automatically for a
prolonged period:
```
# Alert if a reboot is required for any machines. Acts as a failsafe for the
# reboot daemon, which will not reboot nodes if there are pending alerts save
# this one.
ALERT RebootRequired
IF max(kured_reboot_required) != 0
FOR 24h
LABELS { severity="warning" }
ANNOTATIONS {
summary = "Machine(s) require being rebooted, and the reboot daemon has failed to do so for 24 hours",
impact = "Cluster nodes more vulnerable to security exploits. Eventually, no disk space left.",
description = "Machine(s) require being rebooted, probably due to kernel update.",
}
```
If you choose to employ such an alert and have configured kured to
probe for active alerts before rebooting, be sure to specify
`--alert-filter-regexp=^RebootRequired$` to avoid deadlock!
### Slack Notifications
If you specify a Slack hook via `--slack-hook-url`, kured will notify
you immediately prior to rebooting a node:
We recommend setting `--slack-username` to be the name of the
environment, e.g. `dev` or `prod`.
### Overriding Lock Configuration
The `--ds-name` and `--ds-namespace` arguments should match the name and
namespace of the daemonset used to deploy the reboot daemon - the locking is
implemented by means of an annotation on this resource. The defaults match
the daemonset YAML provided in the repository.
Similarly `--lock-annotation` can be used to change the name of the
annotation kured will use to store the lock, but the default is almost
certainly safe.
## Operation
The example commands in this section assume that you have not
overriden the default lock annotation, daemonset name or namespace;
if you have, you will have to adjust the commands accordingly.
### Testing
You can test your configuration by provoking a reboot on a node:
```
sudo touch /var/run/reboot-required
```
### Disabling Reboots
If you need to temporarily stop kured from rebooting any nodes, you
can take the lock manually:
```
kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'
```
Don't forget to release it afterwards!
### Manual Unlock
In exceptional circumstances, such as a node experiencing a permanent
failure whilst rebooting, manual intervention may be required to
remove the cluster lock:
```
kubectl -n kube-system annotate ds kured weave.works/kured-node-lock-
```
> NB the `-` at the end of the command is important - it instructs
> `kubectl` to remove that annotation entirely.
## Building
```
dep ensure && make
```