Commit 8fb0321e authored by Belhorn, Matt

Adds troubleshooting and better security.

Adds common troubleshooting cases. Encourages users to use a virtualenv for the
python deployment, and strengthens encouragement to use TLS and password
protection.
parent 82eb2f4f
+13 −10
@@ -5,10 +5,6 @@
#PBS -o jupyter.log
#PBS -j oe

# Setup all the optional paths.
# WORK defined in user's bashrc
VENV_DIR="$HOME/.venvs"
VENV="$VENV_DIR/rhea-pyms"

# Change the login and client ports to suitable values.
# Be aware your preferred login port may be in use by other users. A login port
@@ -18,9 +14,14 @@ LOGIN_PORT=XXXXX # FIXME: Choose a *RANDOM* unused port number in the range 10k-
SERVER_PORT=8082
COMMAND="${HOME}/.jupyter_connect"

# Setup the environment.
source $HOME/.venvs/venv-activator.sh
venvctl-rhea-app
# Set up the environment. This block assumes the use of a python virtualenv under
# which Jupyter and all the python packages needed for your work are installed.
# It also assumes there is a script `$VENV/bin/setup_environment_modules` that
# has all the environment module commands needed by packages installed to the
# virtualenv. Change this block as needed.
VENV="$HOME/.venvs/jupyter-on-rhea"
source $VENV/bin/setup_environment_modules
source $VENV/bin/activate

cd $HOME

@@ -41,9 +42,9 @@ cat << EOF > $COMMAND
#
# ssh -f -L 127.0.0.1:$CLIENT_PORT:127.0.0.1:$LOGIN_PORT $USER@rhea.ccs.ornl.gov $COMMAND
#
# Then, on your local machine, navigate to "http://127.0.0.1:$CLIENT_PORT" in
# the browser of your choice. Use 'https' if the server is configured to use
# TLS/SSL encryption.
# Then, on your local machine, navigate to "https://127.0.0.1:$CLIENT_PORT" in
# the browser of your choice. It is ill-advised to leave your server unencrypted,
# but replace 'https' with 'http' in the URL if you have not set up TLS/SSL encryption.

ssh -q -L 127.0.0.1:$LOGIN_PORT:127.0.0.1:$SERVER_PORT \
  $HOSTNAME.ccs.ornl.gov sleep $PBS_WALLTIME
@@ -52,4 +53,6 @@ EOF
trap finish EXIT
chmod a+x $COMMAND

# Change the log-level to a preferred value. DEBUG is useful for troubleshooting
# new server deployments.
jupyter-notebook --no-browser --port=$SERVER_PORT --log-level='DEBUG'
+107 −63
@@ -2,38 +2,56 @@ Setting up Jupyter on Rhea
==========================

The following procedure can be used to run Jupyter server instances on Rhea
while allowing connections to them from a local (e.g., a laptop) browser. This
is a band-aid procedure in lieu of dedicated infrastructure for spinning up
while allowing connections to them from a local (e.g., a laptop) browser.

This is a band-aid procedure in lieu of dedicated infrastructure for spinning up
Jupyter instances at the OLCF. Ideally we would offer a host running a
JupyterHub service where users could spin up secured, private Jupyter servers
that offload work transparently to dynamically started ipycluster backend jobs
through the batch system.

If you like this idea, please mention it in the OLCF User Survey - with enough
voices and support behind it we could push to acquire the necessary
infrastructure.
If you like the above idea, please mention it in the OLCF User Survey. With enough
voices of support, we could push to acquire the necessary infrastructure.

Meanwhile
---------

The jupyter compute kernels should be run on reserved batch nodes (i.e., not
shared login nodes where they can be killed without warning) and the web
browser used to access the notebook interface is best run on your local machine
as you are already used to.
The Jupyter compute kernels should be run on *reserved batch nodes*. That is to
say **not** on the shared login nodes. Compute/resource intensive processes on
the login nodes can be killed without warning. The web browser used to access
the notebook interface is best run on your local machine, so the connection to
the notebook server must be tunneled out of the compute nodes.

## Installation

The `jupyter-on-rhea.pbs` batch script launches a jupyter notebook server on a
single batch node and sets up a script to create the necessary SSH tunnel to
access it. In order to work, you will need to have jupyter installed somewhere
in your PYTHONPATH. This can either be in `/ccs/proj/...` or (for Rhea) simply
in your home directory using pip:
The [jupyter-on-rhea.pbs](jupyter-on-rhea.pbs) batch script in this repo
launches a Jupyter server on a single batch node and sets up a script to create
the necessary SSH tunnel to access it. In order to work, you will need to have
Jupyter installed somewhere in your PYTHONPATH. This can either be in
`/ccs/proj/...` or, on Rhea, simply in your user site-packages directory. It is
recommended that you use a virtualenv or alternate Python install to manage your
Python environment for this app.

```bash
$ module load python/2.7.9 python_setuptools python_pip
$ pip install --user jupyter
$ module load python/2.7.9 python_virtualenv
$ virtualenv $MYVENVPATH # install wherever you prefer
$ . $MYVENVPATH/bin/activate
(myvenv)$ pip install --trusted-host pypi.python.org -U pip
(myvenv)$ pip install jupyter
```

You should then create a skeleton configuration file and set an access password
(see the caveats below for an explanation):
The rest of these instructions assume you will have the installed Jupyter package
in your PATH either through an activated virtualenv or other scheme.
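
A quick way to confirm that assumption before submitting a job is sketched below. It is only an illustration, not part of this repo: the helper names `in_virtualenv` and `find_jupyter` are hypothetical, and the sketch assumes a Python 3 interpreter (for `shutil.which`), which differs from the `python/2.7.9` module loaded above.

```python
import shutil
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtualenv."""
    # Classic virtualenv sets sys.real_prefix; the stdlib venv module instead
    # makes sys.base_prefix differ from sys.prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


def find_jupyter(path=None):
    """Return the full path of the `jupyter` executable, or None if absent."""
    # path=None searches the current PATH environment variable.
    return shutil.which("jupyter", path=path)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("jupyter found at:", find_jupyter())
```

If `find_jupyter()` returns `None` after activating your environment, the batch script will most likely fail to start the server for the same reason.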

## Secure the Server

The connection to the server is not encrypted nor password protected by
default. As anyone can SSH to any node on Rhea, it is possible for other users
to connect to your notebook server and then **generate and run code as your
user if left unsecured!**

To harden the server, create a skeleton configuration file and **set an access
password**:

```bash
$ jupyter notebook --generate-config
@@ -46,77 +64,71 @@ c.NotebookApp.password = 'sha1:123:some:456:secret:789:password:012:hash:3456789
add the output hashed password line to the profile config file (typically
`$HOME/.jupyter/jupyter_notebook_config.py`)

Starting the server is done by launching the PBS script with appropriate PBS
options:
Now that access is password protected, you must **encrypt all communication
traffic**, otherwise someone can simply intercept the unencrypted stream and
read your data and passwords. To enable encryption, generate a self-signed x509
server certificate/key pair. The DN data used is mostly arbitrary, but you must
use `127.0.0.1` for the hostname/server name as modern browsers will reject the
certificate if it does not match the URL that will be used to access the
notebook server:

```bash
$ qsub jupyter-on-rhea.pbs
$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout ~/.jupyter/mykey.key -out ~/.jupyter/mycert.pem
$ chmod go= ~/.jupyter/mykey.key
```

The job will place an executable at `$HOME/.jupyter_connect` which contains
instructions on how to attach to the server. Typically you would issue a
variation of
and set the options in `$HOME/.jupyter/jupyter_notebook_config.py`:

```bash
ssh -f -L 127.0.0.1:8080:127.0.0.1:8081 $USER@rhea.ccs.ornl.gov /ccs/home/$USER/.jupyter_connect
```python
c.NotebookApp.certfile = '/ccs/home/$USER/.jupyter/mycert.pem'
c.NotebookApp.keyfile = '/ccs/home/$USER/.jupyter/mykey.key'
```
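
Because the config file is ordinary Python, a literal `$USER` inside a string is not expanded by the shell, so you would normally write your actual username there. As an alternative, the paths can be built at load time. The following is a minimal sketch under that assumption; the helper name `tls_settings` is hypothetical, and the file names are the ones produced by the `openssl` command above.

```python
import os


def tls_settings(jupyter_dir=None):
    """Return absolute certificate and key paths for the notebook config."""
    if jupyter_dir is None:
        # Default to ~/.jupyter, resolved for whichever user runs the server.
        jupyter_dir = os.path.join(os.path.expanduser("~"), ".jupyter")
    return {
        "certfile": os.path.join(jupyter_dir, "mycert.pem"),
        "keyfile": os.path.join(jupyter_dir, "mykey.key"),
    }


# Inside jupyter_notebook_config.py, where Jupyter provides the `c` object,
# the values would be assigned like this:
#
#     settings = tls_settings()
#     c.NotebookApp.certfile = settings["certfile"]
#     c.NotebookApp.keyfile = settings["keyfile"]
```

Keeping the path logic in one small function makes it easy to check that the key file name in the config matches the file that was actually generated.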

on your local workstation and direct your local browser to `http://127.0.0.1:8080`.

When using TLS encryption, you must explicitly use `https` instead of `http` in
the URL used to access the server.
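
Once TLS is enabled, the certificate also gives you a way to confirm that the server behind the tunnel is really yours. The sketch below compares the SHA-256 fingerprint of a local copy of `mycert.pem` with the certificate served through the tunnel. It is an illustration only: the function names are hypothetical, it assumes you have copied `mycert.pem` to your local machine, and it assumes the tunnel is open on local port 8080 as in the example commands in this document.

```python
import hashlib
import ssl


def pem_fingerprint(pem_text):
    """Return the SHA-256 hex digest of a PEM-encoded certificate."""
    # Fingerprints are computed over the DER (binary) form of the certificate.
    der = ssl.PEM_cert_to_DER_cert(pem_text)
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host="127.0.0.1", port=8080):
    """Fetch the certificate presented at host:port and fingerprint it."""
    # No CA bundle is passed, so a self-signed certificate is retrieved
    # without being validated.
    pem = ssl.get_server_certificate((host, port))
    return pem_fingerprint(pem)


def is_my_server(cert_path, host="127.0.0.1", port=8080):
    """Return True if the served certificate matches the local copy."""
    with open(cert_path) as handle:
        local = pem_fingerprint(handle.read())
    return local == served_fingerprint(host, port)
```

Calling `is_my_server("mycert.pem")` with the tunnel open should return `True` for your own server and `False` if the port is forwarding to a server that uses a different certificate.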

## Caveats

### Security
## Running the Server

The connection to the server is not encrypted nor password protected by
default.

As anyone can SSH to any node on Rhea, it is possible for other users
to connect to your notebook server and then generate and run code as your user
if left unsecured. It is a good idea to setup TLS/SSL encryption if you are
concerned about the possibility of someone sniffing data packets sent to the
notebook server. To enable encryption, generate a self-signed x509 server
certificate/key pair. The DN data used is mostly arbitrary, but you must use
`127.0.0.1` for the hostname/server name as modern browsers will reject the
certificate if it does not match the URL in the browser:
Starting the server is done by launching the PBS script with appropriate PBS
options:

```bash
$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout ~/.jupyter/mykey.key -out ~/.jupyter/mycert.pem
$ chmod go= ~/.jupyter/mykey.key
$ qsub jupyter-on-rhea.pbs
```

and set the options in `$HOME/.jupyter/jupyter_notebook_config.py`:
The job will place an executable at `$HOME/.jupyter_connect` which contains
instructions on how to attach to the server. Typically, you must issue a
variation of

```python
c.NotebookApp.certfile = '/ccs/home/$USER/.jupyter/mycert.pem'
c.NotebookApp.keyfile = '/ccs/home/$USER/.jupyter/mykey.key'
```bash
ssh -f -L 127.0.0.1:8080:127.0.0.1:8081 $USER@rhea.ccs.ornl.gov /ccs/home/$USER/.jupyter_connect
```

If you add TLS encryption (*you should*), you must connect using `https://127.0.0.1:8080`
instead of `http`.
on your local workstation and direct your local browser to `http://127.0.0.1:8080`.

### MPI Capabilities and Ipyparallel
## MPI Capabilities and Ipyparallel

The kernels all run on a single node. It is possible to extend this
setup to use the ipython cluster 'ipcluster' backend and the $PBS_NODEFILE to
allow the kernel to run parallel tasks. See the notebook
`interactive_notebooks_with_mpi_on_rhea.ipynb` in this repo for instructions on
setting up an ipycluster backend.
Set up as per the above instructions, the kernels all run on a single node. It is
possible to extend this setup to use the ipython cluster *ipcluster* backend and
the `$PBS_NODEFILE` to allow the kernel to run parallel tasks. See the notebook
[interactive_notebooks_with_mpi_on_rhea.ipynb](interactive_notebooks_with_mpi_on_rhea.ipynb)
in this repo for instructions on setting up an ipycluster backend.

### Server Uptime
## Server Uptime

The server is killed at least every 48 hours so you will want to make sure
your work is saved often. You can add a line like:
The server is killed at least every 48 hours so you will want to **make sure
your work is saved often**. You can add a line like:

```bash
qsub -W depend=afternotok:$PBS_JOBID jupyter-on-rhea.pbs
```

near the top of `ok_jupyter.pbs` to resubmit the job automatically to keep a
server up, but you will still need to re-establish the tunnel each time it goes
down. 
near the top of the `jupyter-on-rhea.pbs` batch script to resubmit the job
automatically to keep a server up, but you will still need to re-establish the
tunnel each time it goes down. 

### Allocation Consumption
## Allocation Consumption

This does consume your Rhea allocation so just keeping the server up and not
using it to crunch numbers is wasteful. It is perhaps the best practice to
@@ -124,10 +136,42 @@ do interactive development work on a local jupyter instance and then run a
dedicated python script in a batch job to make the most efficient use of your
allocation.

### Custom Re-configuration
## Custom Re-configuration

Any of the configuration details should be tuned to your needs.
Specifically, the ports may need to be different for your case. You may want to
change `c.NotebookApp.notebook_dir` to use a different path than the default
so as to keep your top-level `$HOME` directory tidy.

## Troubleshooting

Issues with this approach that could be fixed but have not yet been addressed
include:

1. **Cannot connect through local `CLIENT_PORT`**
   * Check that a server job is actually running with `showq -u $USER`.
   * Check that you are using the `https` prefix in your URL if using TLS encryption.
   * Check that you only have one tunnel open on the local machine. The tunnels
     are supposed to close when the connection is broken, but if they hang open,
     new tunnels will be assigned different ports than `CLIENT_PORT`. There is a
     fix for this, but I have not documented it...

1. **Login password is rejected even though it is entered correctly.**
   * This is a symptom of multiple users attempting to access different servers
     through the same login node and port number. The connection will succeed
     but the password will be rejected because the server you are accessing is
     someone else's. Even though the login screen will look the same, you can
     use the TLS certificate information in your browser to verify if you are
     interacting with your server (i.e., it is using your certificate) or someone
     else's. If this happens, use a different *random* LOGIN_PORT in the batch
     script.

1. **Cannot start a new server job even though no server job is running.**
   * New servers won't start while a `$HOME/.jupyter_connect` script exists.
     This script is used as a lockfile, but is sometimes not removed when the
     last server job exits. Verify that no server job is running using `showq -u
     $USER` and if none is running, simply delete `$HOME/.jupyter_connect` to
     allow a new server job to start. This is a symptom of the crude method of
     creating and clearing lockfiles used by this technique and should be
     improved.
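
Since a clash on `LOGIN_PORT` is behind the rejected-password case above, it can help to pick the port programmatically rather than by hand. The following is a rough sketch, not part of this repo: the function names are hypothetical, and the range bounds are left as required arguments because they depend on the range recommended in the batch script. The check is only meaningful when run on the login node where the port will be bound, and a port that is free now may still be taken by another user before your job starts.

```python
import random
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing on this machine is currently bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding fails when the port is already in use (or not permitted).
            return False
    return True


def pick_random_port(low, high, attempts=50, host="127.0.0.1"):
    """Return a random port in [low, high] that is currently unbound."""
    for _ in range(attempts):
        port = random.randint(low, high)
        if port_is_free(port, host):
            return port
    raise RuntimeError("no free port found in the requested range")


if __name__ == "__main__":
    # The bounds here are purely illustrative; use the range given in the
    # batch script's comments.
    print(pick_random_port(20000, 20100))
```

The chosen value would then replace the `XXXXX` placeholder for `LOGIN_PORT` in the batch script.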