<img src="https://github.com/weaveworks/kured/raw/update-docs/img/logo.png" align="right"/>

* [Introduction](#introduction)
* [Kubernetes & OS Compatibility](#kubernetes-&-os-compatibility)
* [Installation](#installation)
* [Configuration](#configuration)
	* [Reboot Sentinel File & Period](#reboot-sentinel-file-&-period)
	* [Blocking Reboots via Alerts](#blocking-reboots-via-alerts)
	* [Prometheus Metrics](#prometheus-metrics)
	* [Slack Notifications](#slack-notifications)
	* [Overriding Lock Configuration](#overriding-lock-configuration)
* [Operation](#operation)
	* [Testing](#testing)
	* [Disabling Reboots](#disabling-reboots)
	* [Manual Unlock](#manual-unlock)
* [Building](#building)

## Introduction

Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that
performs safe automatic node reboots when the need to do so is
indicated by the package management system of the underlying OS.

* Watches for the presence of a reboot sentinel e.g. `/var/run/reboot-required` 
* Utilises a lock in the API server to ensure only one node reboots at
  a time
* Optionally defers reboots in the presence of active Prometheus alerts
* Cordons & drains worker nodes before reboot, uncordoning them after

## Kubernetes & OS Compatibility

The daemon image contains a 1.7.x `k8s.io/client-go` and `kubectl`
binary for the purposes of maintaining the lock and draining worker
nodes. Whilst it has only been tested on a 1.7.x cluster, Kubernetes
typically has good forwards/backwards compatibility so there is a
reasonable chance it will work on adjacent versions; please file an
issue if this is not the case.

Additionally, the image contains a `systemctl` binary from Ubuntu
16.04 in order to command reboots. Again, although this has not been
tested against other systemd distributions there is a good chance that
it will work.

## Installation

To obtain a default installation without Prometheus alerting interlock
or Slack notifications:

```
kubectl apply -f https://github.com/weaveworks/kured/releases/download/1.0.0/kured-ds.yaml
```

If you want to customise the installation, download the manifest and
edit it in accordance with the following section before application.

## Configuration

The following arguments can be passed to kured via the daemonset pod template:

```
Flags:
      --alert-filter-regexp value   alert names to ignore when checking for active alerts
      --ds-name string              name of daemonset on which to place lock (default "kured")
      --ds-namespace string         namespace containing daemonset on which to place lock (default "kube-system")
      --lock-annotation string      annotation in which to record locking node (default "weave.works/kured-node-lock")
      --period duration             reboot check period (default 1h0m0s)
      --prometheus-url string       Prometheus instance to probe for active alerts
      --reboot-sentinel string      path to file whose existence signals need to reboot (default "/var/run/reboot-required")
      --slack-hook-url string       slack hook URL for reboot notifications
      --slack-username string       slack username for reboot notifications (default "kured")
```

### Reboot Sentinel File & Period

By default kured checks for the existence of
`/var/run/reboot-required` every sixty minutes; you can override these
values with `--reboot-sentinel` and `--period`. Each replica of the
daemon uses a random offset derived from the period on startup so that
nodes don't all contend for the lock simultaneously.
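
The effect of the random offset can be illustrated with a small shell sketch. This is not kured's own code (kured is a Go program); `random_offset` is a hypothetical helper that only shows the idea of picking a start-up delay somewhere within one period.

```shell
# Hypothetical helper: pick a random start-up delay in [0, period_seconds).
# awk's srand() seeds from the clock, so two calls within the same second
# return the same value; that is good enough for an illustration.
random_offset() {
  period_seconds=$1
  awk -v p="$period_seconds" 'BEGIN { srand(); print int(rand() * p) }'
}

# With the default period of one hour (3600 seconds):
offset=$(random_offset 3600)
echo "would wait ${offset}s before the first sentinel check"
```

Because each replica draws its own delay, checks on different nodes are spread across the period rather than all firing at the same moment.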

### Blocking Reboots via Alerts

You may find it desirable to block automatic node reboots when there
are active alerts - you can do so by providing the URL of your
Prometheus server:

```
--prometheus-url=http://prometheus.monitoring.svc.cluster.local
```

By default the presence of *any* active (pending or firing) alerts
will block reboots; however, you can ignore specific alerts:

```
--alert-filter-regexp=^(RebootRequired|AnotherBenignAlert|...)$
```
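
To get a feel for which alert names such an expression would ignore, you can try it locally. The sketch below uses `grep -E` as a stand-in for kured's matching, which is an assumption: kured is a Go program, and for a simple anchored alternation like this one the two syntaxes agree. `blocking_alerts` and the alert name `KubeNodeDown` are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: read alert names on stdin and print those that would
# still block a reboot, i.e. the names NOT matched by the filter expression.
blocking_alerts() {
  filter=$1
  grep -Ev -- "$filter"
}

printf '%s\n' RebootRequired AnotherBenignAlert KubeNodeDown \
  | blocking_alerts '^(RebootRequired|AnotherBenignAlert)$'
# prints: KubeNodeDown
```

The `^` and `$` anchors matter here: without them, an expression such as `RebootRequired` would also match any longer alert name that merely contains that string.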

An important application of this filter will become apparent in the
next section.

### Prometheus Metrics

Each kured pod exposes a single gauge metric (`:8080/metrics`) that
indicates the presence of the sentinel file:

```
# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="ip-xxx-xxx-xxx-xxx.ec2.internal"} 0
```
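
If you want to inspect this gauge by hand, a small filter over the metrics text is enough. The following is a minimal sketch that assumes the output format shown above; `nodes_needing_reboot` is a hypothetical helper, not part of kured.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the node label of every kured_reboot_required sample whose value is not 0.
nodes_needing_reboot() {
  awk 'index($1, "kured_reboot_required{") == 1 && $2 != 0 {
         split($1, parts, "\"")   # parts[2] is the value of the node label
         print parts[2]
       }'
}

# Sample input in the format shown above (node names are made up):
printf '%s\n' \
  'kured_reboot_required{node="node-a"} 0' \
  'kured_reboot_required{node="node-b"} 1' \
  | nodes_needing_reboot
# prints: node-b
```

Against a running pod you would feed it the real endpoint instead, for example by piping the output of `curl -s` on the pod's `:8080/metrics` address into `nodes_needing_reboot`.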

The purpose of this metric is to power an alert which will summon an
operator if the cluster cannot reboot itself automatically for a
prolonged period:

```
# Alert if a reboot is required for any machines. Acts as a failsafe for the
# reboot daemon, which will not reboot nodes if there are pending alerts save
# this one.
ALERT RebootRequired
  IF          max(kured_reboot_required) != 0
  FOR         24h
  LABELS      { severity="warning" }
  ANNOTATIONS {
    summary = "Machine(s) require being rebooted, and the reboot daemon has failed to do so for 24 hours",
    impact = "Cluster nodes more vulnerable to security exploits. Eventually, no disk space left.",
    description = "Machine(s) require being rebooted, probably due to kernel update.",
  }
```

If you choose to employ such an alert and have configured kured to
probe for active alerts before rebooting, be sure to specify
`--alert-filter-regexp=^RebootRequired$` to avoid deadlock!

### Slack Notifications

If you specify a Slack hook via `--slack-hook-url`, kured will notify
you immediately prior to rebooting a node:

<img src="https://github.com/weaveworks/kured/raw/update-docs/img/slack-notification.png"/>

We recommend setting `--slack-username` to be the name of the
environment, e.g. `dev` or `prod`.

### Overriding Lock Configuration

The `--ds-name` and `--ds-namespace` arguments should match the name and
namespace of the daemonset used to deploy the reboot daemon - the locking is
implemented by means of an annotation on this resource. The defaults match
the daemonset YAML provided in the repository.

Similarly `--lock-annotation` can be used to change the name of the
annotation kured will use to store the lock, but the default is almost
certainly safe.

## Operation

The example commands in this section assume that you have not
overridden the default lock annotation, daemonset name or namespace;
if you have, you will have to adjust the commands accordingly.

### Testing

You can test your configuration by provoking a reboot on a node:

```
sudo touch /var/run/reboot-required
```

### Disabling Reboots

If you need to temporarily stop kured from rebooting any nodes, you
can take the lock manually:

```
kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'
```

Don't forget to release it afterwards!
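
Before taking or releasing the lock it can help to see who currently holds it. The sketch below assumes the default daemonset name, namespace and annotation, and that the annotation value has the JSON shape used in the command above; `lock_holder` and `current_lock_holder` are hypothetical helpers.

```shell
# Hypothetical helper: extract the nodeID value from the lock annotation JSON
# read on stdin, e.g. {"nodeID":"manual"}.
lock_holder() {
  sed -n 's/.*"nodeID"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Requires cluster access: reads the annotation from the daemonset and prints
# the holder. Prints nothing if the annotation is absent (lock not held).
current_lock_holder() {
  kubectl -n kube-system get ds kured \
    -o jsonpath='{.metadata.annotations.weave\.works/kured-node-lock}' \
    | lock_holder
}

# Offline demonstration of the parsing step:
printf '%s\n' '{"nodeID":"manual"}' | lock_holder
# prints: manual
```

The dot in `weave.works` is escaped in the JSONPath expression so that `kubectl` treats it as part of the annotation key rather than as a path separator.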

### Manual Unlock

In exceptional circumstances, such as a node experiencing a permanent
failure whilst rebooting, manual intervention may be required to
remove the cluster lock:

```
kubectl -n kube-system annotate ds kured weave.works/kured-node-lock-
```
> NB the `-` at the end of the command is important - it instructs
> `kubectl` to remove that annotation entirely.

## Building

```
dep ensure && make
```