Commit e8c826fb authored by Belhorn, Matt

Initial commit of NCCS Python reference scripts.

README.md

Custom Python installations on NCCS Resources
=============================================

These are personal notes and scripts for installing and managing custom Python
installations on various NCCS resources.

# Python

Python is available through environment modules. The base `python` environment
module (2016-09-01) provides either Python v2.7 or Python v3 and requires extra
environment modules to be loaded to provide core extensions such as `pip` and
`virtualenv` as well as center-provided builds of common packages like `numpy`.

Python packages loaded from extra environment modules should supersede any that
are provided in the base environment module because environment modules
*prepend* packages to the `PYTHONPATH`.
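As a toy illustration of that ordering (throwaway temp directories and a hypothetical `numpyish` module, not real OLCF paths):

```shell
# Two stand-in package directories; the one earlier in PYTHONPATH wins.
BASE="$(mktemp -d)"; EXTRAS="$(mktemp -d)"
printf 'VERSION = "base"\n'   > "$BASE/numpyish.py"
printf 'VERSION = "extras"\n' > "$EXTRAS/numpyish.py"
# Prepending EXTRAS shadows the same-named module in BASE.
RESULT="$(PYTHONPATH="$EXTRAS:$BASE" python3 -c 'import numpyish; print(numpyish.VERSION)')"
echo "$RESULT"   # extras
```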

# Extending Python

Python packages not available through environment modules can be installed by
individual users. There are several methods by which users can extend Python
for their needs, each with its own benefits and drawbacks. The principal
considerations when choosing the best method are:

1. Will the package be used on Titan/Eos compute nodes?
2. Will the package be used by many users?

If the package will be used on different resources or under different
environment modules, also consider:

3. Does the package provide compiled non-python binaries or shared objects?
4. Does the package depend on shared libraries from other environment modules?

Questions (1) and (2) establish *on which filesystem* the package should be
installed. Packages to be used on Cray compute nodes must reside in a
**virtual environment** or **alternate root** on a filesystem that is visible
to those nodes, nominally `/ccs/proj/${PROJECT_ID}`, which is readable by the
compute nodes and is not purged.

Packages that will be shared by multiple users should be installed to an
appropriately accessible **virtual environment** or **alternate root** such as a
subdirectory of `/ccs/proj/${PROJECT_ID}`.

Single-user packages that are not used on the Cray compute nodes may be
installed under your home directory. In this case, if questions (3) and (4) do
not apply or are answered 'no', the package can be installed to the standard
**user install directory**, which is automatically added to the python package
search path.

Questions (3) and (4) establish *how* the package should be installed if it will
be used on different resources or under different runtime environments.

Packages that provide non-python binaries or shared objects ('yes' to question
(3)) cannot generally be assumed to produce architecture-independent code that
runs on all OLCF resources or Cray node types. In simple cases, the package
can be installed as a generically pre-compiled **python wheel** in the standard
package search path. However, this may cause illegal-instruction errors at
runtime on some resources. This is a major issue with distributions like
*anaconda*, which prefer wheels, when run on Cray compute nodes.

Likewise for question (4), CPU instruction sets are generally different for each
system and in the case of Titan and Eos are different between the service nodes
and the compute nodes. The available shared libraries also generally change
between systems and CrayPE programming environments. 

Packages with specific runtime or architecture dependencies should be installed to either

* a **virtual environment** that is activated in the appropriate environment,
* an **alternate root** explicitly added to the PYTHONPATH when appropriate,
* or provided by the OLCF as an environment module. 

Examples of such packages include optimized `numpy`, `mpi4py`, `h5py`,
and `python-netcdf`.

# Installing Packages

Popular packages can generally be installed from online package indexes such as
the Python Package Index, *PyPI*, using the tool `pip` (aka `pip2`) or `pip3`
depending on which version of python is being used. These commands are added to
your `PATH` when the base python environment module is loaded.

Alternatively, many packages can be installed by running a `setup.py` script
provided with the package.  These scripts use a number of distribution tools
that are provided with the core python installation.

## User install directory

Packages can be easily installed to your **user install directory** (typically
`$HOME/.local/lib/pythonV.v/site-packages`) using 

`pip install --user -v [--no-binary :all:] PACKAGE`

where the optional flag `--no-binary` instructs `pip` to avoid pre-compiled
binaries and wheels and instead compile any binaries for the current environment
using the system compiler. Packages installed this way are known to the python
interpreter without any extra setup.
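To see where that directory actually is on a given system (nothing OLCF-specific is assumed here, only a working interpreter):

```shell
# Ask the interpreter for its per-user site-packages directory; pip's
# --user flag installs into exactly this path.
USER_SITE="$(python3 -c 'import site; print(site.getusersitepackages())')"
echo "$USER_SITE"
```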

## Virtual Environments

A robust way to build a customized python stack is to build it from
scratch in a **virtual environment**, which is sometimes shortened to
*virtualenv* or simply *venv*.

Virtual environments allow you to maintain a personal python stack that is fully
under your control. To create a venv, load the base python environment module
and issue the command

`virtualenv [-p PYTHON] VENVPATH`

which will create a clean Python distribution directory structure at any
arbitrary virtual environment path `VENVPATH`. The optional flag `-p PYTHON`
allows you to specify a specific python interpreter version for the venv to use.

It is recommended to give your virtual environments clear names and organize
them in standard locations. For example, given shared applications or projects
named `foo` and `bar` which have environment-specific binaries and private apps
`baz` and `widget`, one might choose the following virtual environment paths:

```
/ccs/proj/<PROJECTID>/venvs/titan-pgi-foo
/ccs/proj/<PROJECTID>/venvs/titan-intel-foo
/ccs/proj/<PROJECTID>/venvs/rhea-bar
/home/$USER/.venvs/baz
/home/$USER/.venvs/widget
```

To use a venv, it must be *activated* by sourcing:

`. VENVPATH/bin/activate`

A venv can be *de-activated* by calling a shell function

`deactivate`

This function is created when the venv is activated.
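A self-contained sketch of the whole cycle, using python3's built-in `venv` module in a throwaway directory for illustration (the module-provided `virtualenv` behaves the same way):

```shell
# Create a venv in a temp dir; --without-pip keeps the example offline.
VENVPATH="$(mktemp -d)/demo-venv"
python3 -m venv --without-pip "$VENVPATH"

. "$VENVPATH/bin/activate"   # activation just sources a script...
echo "$VIRTUAL_ENV"          # ...which exports VIRTUAL_ENV...
command -v python            # ...and puts the venv's python first in PATH

deactivate                   # shell function defined by activate
```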

It is important that environment modules are not changed while a venv is
activated. Any environment modules that are dependencies of packages installed
in the venv **must be loaded prior to both creating and subsequently
activating** the virtual environment. An active venv must be de-activated before
making any changes to the loaded environment modules.

While a venv is active, the `python` interpreter used will be the one installed
in the virtual environment path. Likewise, all packages installed to the
"system" site-packages directory, for instance using:

`pip install -v [--no-binary :all:] PACKAGE`

will in fact be installed under the virtual environment site-packages. This
allows you, as an unprivileged user, to install any package you like using
`pip` or `setup.py` into a customized python stack. 

It is typically necessary to install all packages that you would like to use
into the virtualenv. In this way, it is possible to create a customized python
stack for each resource or programming environment which is optimized with
potentially architecture specific binaries and any extra python packages that
are not available in the base distribution.

### Library links in `$VIRTUAL_ENV/lib`

Shared libraries that are not made available through environment modules can be
linked to within `$VIRTUAL_ENV/lib`.

### Enhancing `activate`

Adventurous users can add additional commands to `activate`, if they wish. This
is dangerous as parts of the script are called multiple times during activation. A
safer alternative is to write a source-able script to activate the environment.
See `venv-activator.sh` in this repo, for instance.

## Alternate Roots

It is possible to install most python packages to any location that you like and
make them available by manually adding them to your `PYTHONPATH`.
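The mechanism can be demonstrated without pip at all, since any directory added to `PYTHONPATH` is searched for modules (the temp directory and `mypkg` module below are hypothetical). In practice you would populate such a root with `pip install --target=ROOT PACKAGE` and then export the same path:

```shell
# A throwaway "alternate root" containing one module.
ROOT="$(mktemp -d)"
printf 'GREETING = "hello from the alternate root"\n' > "$ROOT/mypkg.py"
# Adding the root to PYTHONPATH makes its packages importable.
OUT="$(PYTHONPATH="$ROOT${PYTHONPATH:+:$PYTHONPATH}" \
       python3 -c 'import mypkg; print(mypkg.GREETING)')"
echo "$OUT"
```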

## Anaconda

The author does not like Anaconda, primarily because it conflicts with the
system python and with python inadvertently loaded from environment modules,
and because it tends to favor pre-compiled binaries that can cause runtime
errors when used on Crays. However, it can be used successfully on Rhea. If a
user needs additional packages, an Anaconda virtual environment (clone) should
be made in a directory where the user has write permissions.

## The Nuclear Option: Installing a core Python stack directly.

For when a virtualenv just isn't enough, users can build Python (including dual
Python2+Python3 deployments) directly from source in a directory of their
choosing. All that needs to be done to use it is to add the relevant parts of
the install to the `PATH`, `LD_LIBRARY_PATH`, and possibly `PYTHONPATH`
variables. To keep these changes from conflicting with modifications made by 
environment modules, it is recommended to construct and use a modulefile to
enable the custom stack under the module name `python`.

See `build_raw_python.sh` for an overview of what is involved to deploy a custom
python stack from source.
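Whether done in a modulefile or a sourced shell fragment, the environment changes amount to the following (the install prefix below is hypothetical):

```shell
# Put the custom stack's binaries and shared libraries ahead of the system's.
PYROOT="/ccs/proj/abc123/opt/python/usr"   # hypothetical install prefix
export PATH="$PYROOT/bin:$PATH"
export LD_LIBRARY_PATH="$PYROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```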

build_raw_python.sh

#!/bin/bash

# Installs python2 and python3 side-by-side in a customized directory $TOPDIR.
# Installation includes pip, virtualenv, and core packages for an all-in-one
# useful python stack.

PY2_VER="2.7.12"
PY3_VER="3.5.2"

WHEEL_EXTRAS=(nose
              PyYAML
              jsonschema
              pep8
              argcomplete
              psutil
             )
COMPILED_EXTRAS=(numpy
                 cython
                 matplotlib
                 ipython
                 pandas
                 sympy
                 )

# NOTE: bash's special GROUPS array ignores assignments, so use our own name.
declare -a PROJ_GROUPS
for g in $(groups | grep -m 1 -oE "\<[a-z]{3}[0-9]{3}\>"); do
  PROJ_GROUPS+=("$g")
done
echo "Under which project do you want to install Python?"
select grp in "${PROJ_GROUPS[@]}"; do break; done
TOPDIR="/ccs/proj/$grp/opt/python"
BUILD="$TOPDIR/build"
USR="$TOPDIR/usr"
PY2LOG="$TOPDIR/py2_build.log"
PY3LOG="$TOPDIR/py3_build.log"

notify () {
  if [ $1 -gt 0 ]; then
    printf "[FAILED]\n"
  else
    printf "[  OK  ]\n"
  fi
}

echo "Removing existing installation"
rm -fr $TOPDIR
mkdir -p $BUILD $USR
# FIXME: Installation group and permissions should be considered. It may be
#        prudent to set the group sticky bit.
cd $BUILD

echo "Obtaining source files"
echo "==============="
echo "Python2"
wget https://www.python.org/ftp/python/${PY2_VER}/Python-${PY2_VER}.tgz
echo "Python3"
wget https://www.python.org/ftp/python/${PY3_VER}/Python-${PY3_VER}.tar.xz
echo "Pip bootstrap script"
curl -O https://bootstrap.pypa.io/get-pip.py
printf "===============\n\n"

printf "%-41s" "Installing Python2"
cd $BUILD
tar xf Python-${PY2_VER}.tgz >> $PY2LOG 2>&1
cd $BUILD/Python-${PY2_VER}
./configure --prefix=$TOPDIR/usr --enable-shared >> $PY2LOG 2>&1
make >> $PY2LOG 2>&1
make install  >> $PY2LOG 2>&1
notify $?

printf "%-41s" "Installing Python3"
cd $BUILD
tar xf Python-${PY3_VER}.tar.xz >> $PY3LOG 2>&1
cd $BUILD/Python-${PY3_VER}
./configure --prefix=$TOPDIR/usr --enable-shared >> $PY3LOG 2>&1
make >> $PY3LOG 2>&1
make install >> $PY3LOG 2>&1
notify $?

export LD_LIBRARY_PATH="$USR/lib:$LD_LIBRARY_PATH"
export PATH="$USR/bin:$PATH"

cd $BUILD
printf "%-41s" "Installing pip3"
python3 get-pip.py  >> $PY3LOG 2>&1 # Install pip3 first, so that
notify $?

printf "%-41s" "Installing pip2"
python get-pip.py   >> $PY2LOG 2>&1 # pip2 overwrites default `pip`
notify $?

printf "%-41s" "Installing virtualenv for Python2"
# Version Specific Packages
## Python3 provides pyvenv
pip2 install -v virtualenv >> $PY2LOG 2>&1
notify $?

printf "\nInstalling extra packages:\n"
# Install packages for python3 first so that entry points with version-less
# names (like 'ipython' as opposed to 'ipython2') end up pointing at python2.
for pipx in $(which pip3) $(which pip2); do
  case "$(basename ${pipx})" in
    pip2) log="$PY2LOG"; name="python2" ;;
    pip3) log="$PY3LOG"; name="python3" ;;
    *) exit 1
  esac
  for package in ${WHEEL_EXTRAS[@]}; do
    printf "  %s: %-30s" $name $package
    $pipx install -v $package >> $log 2>&1
    notify $?
  done

  for package in ${COMPILED_EXTRAS[@]}; do
    printf "  %s: %-30s" $name $package
    $pipx install -v --no-binary :all: $package >> $log 2>&1
    notify $?
  done
done


printf "\nFinished! See build logs\n  python2: $PY2LOG\n  python3: $PY3LOG\nfor details.\n\n"

build_virtualenv.sh

#!/bin/bash

# NOTE: bash's special GROUPS array ignores assignments, so use our own name.
declare -a PROJ_GROUPS
for g in $(groups | grep -m 1 -oE "\<[a-z]{3}[0-9]{3}\>"); do
  PROJ_GROUPS+=("$g")
done
echo "Under which project do you want to install Python?"
select grp in "${PROJ_GROUPS[@]}"; do break; done

# Setup all the optional paths.
VENV_DIR="/ccs/proj/$grp/.venvs"
VENV="$VENV_DIR/rhea-pyms"
TMPDIR=/tmp/$USER/venvbuild

# Set the environment.
# !!!!!!!!!!!!!!!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# ENVIRONMENT MODULE CHANGES CANNOT BE MADE INSIDE AN ACTIVE VIRTUALENV
module swap PE-intel PE-gnu
module load netcdf python/2.7.9 python_virtualenv/12.0.7 

# Make the necessary directories
mkdir -p $VENV_DIR $TMPDIR

# Make and activate a virtualenv for this python stack
virtualenv $VENV
source $VENV/bin/activate

# Link against RHEL atlas - this is kind of gross. These are probably the least
# optimized lapack/blas implementations I've ever seen...
for i in /usr/lib64/atlas/*.so.3; do
  BASELIB="$(basename "$i")"
  ln -s "$i" "$VIRTUAL_ENV/lib/${BASELIB%.3}"
done

# Install indexed dependencies outright.
pip install --upgrade pip
pip install -v --no-binary :all: numpy
pip install -v --no-binary :all: scipy
pip install -v jupyter matplotlib nose mock ipyparallel mpi4py

# Install non-indexed packages manually.
cd $TMPDIR

wget -O pycdf-0.6-3b.tar.gz "http://downloads.sourceforge.net/project/pysclint/pycdf/pycdf-0.6.3b/pycdf-0.6-3b.tar.gz?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Fpysclint%2Ffiles%2Fpycdf%2Fpycdf-0.6.3b%2F&ts=1471285723&use_mirror=superb-sea2"
tar xf pycdf-0.6-3b.tar.gz
cd pycdf-0.6-3b
python setup.py install

jupyter-on-rhea.pbs

#!/bin/bash -l
#PBS -A FIXME
#PBS -q batch
#PBS -l walltime=48:00:00,nodes=1
#PBS -o jupyter.log
#PBS -j oe

# Setup all the optional paths.
# WORK defined in user's bashrc
VENV_DIR="$HOME/.venvs"
VENV="$VENV_DIR/rhea-pyms"

# Change the login and client ports to suitable values.
# Be aware your preferred login port may be in use by other users. A login port
# used by another project will cause dire confusion at runtime.
CLIENT_PORT=8080
LOGIN_PORT=XXXXX # FIXME: Choose a *RANDOM* unused port number in the range 10k-64k.
SERVER_PORT=8082
COMMAND="${HOME}/.jupyter_connect"

# Setup the environment.
source $HOME/.venvs/venv-activator.sh
venvctl-rhea-app

cd $HOME

function finish {
  rm $COMMAND
}

if [ -f "$COMMAND" ]; then
  echo "A Jupyter server is already running."
  echo "See '$COMMAND' for details."
  exit 1
fi

cat << EOF > $COMMAND
#!/bin/bash
# To open a tunnel to the notebook server/kernels running on the compute node,
# issue the following command from your local machine:
#
# ssh -f -L 127.0.0.1:$CLIENT_PORT:127.0.0.1:$LOGIN_PORT $USER@rhea.ccs.ornl.gov $COMMAND
#
# Then, on your local machine, navigate to "http://127.0.0.1:$CLIENT_PORT" in
# the browser of your choice. Use 'https' if the server is configured to use
# TLS/SSL encryption.

ssh -q -L 127.0.0.1:$LOGIN_PORT:127.0.0.1:$SERVER_PORT \
  $HOSTNAME.ccs.ornl.gov sleep $PBS_WALLTIME
EOF

trap finish EXIT
chmod a+x $COMMAND

jupyter-notebook --no-browser --port=$SERVER_PORT --log-level='DEBUG'

venv-activator.sh

#!/bin/bash
#
# This script provides shell function utilities for activating and
# deactivating python virtual environments that have dependencies on Tcl
# environment modules.
#
#------------------------------------------------------------------------------
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#------------------------------------------------------------------------------
#

function venvmodulectl () {
  # A single function to activate and deactivate a python virtualenv that has
  # prerequisite environment module dependencies. The function will apply a
  # sequence of environment module commands prior to activating a specific venv.
  # If the venv is already active, it will be deactivated and the sequence of
  # module commands will be reversed to undo the environment changes.
  #
  # USAGE:
  # venvmodulectl "/PATH/TO/VENV" "MODULE CMD" ["MODULE CMD"...]
  # 
  # Where "/PATH/TO/VENV" points to the root of the virtual environment and
  # each "MODULE CMD" is a double-quoted string of instructions to `modulecmd`
  # of the forms:
  #    "swap MODULEA MODULEB" or
  #    "load MODULE1 MODULE2 ... MODULEN"
  #
  # The script will likely fail if the sequence of module commands conflicts
  # with the modules that are loaded when the function is first called.  It is
  # intended that the sequence of module commands be chosen such that they are
  # applied from a clean login environment. 

  declare _ENVNAME="$1"
  declare -a _COMMANDS
  for jj in "${@:2}"; do
    _COMMANDS+=("$jj")
  done
  if [ -z "$_ENVNAME" ] || [ "${#_COMMANDS[@]}" -eq 0 ]; then
    echo "Could not interpret input."
    return
  fi

  declare _CMD
  if [ -z "$VIRTUAL_ENV" ] && [ -z "$MYENV" ]; then
    # Load the modules.
    for _CMD in "${_COMMANDS[@]}"; do 
      echo "module ${_CMD}"
      eval "module ${_CMD}"
    done
    # Activate the virtualenv.
    . "$_ENVNAME/bin/activate"
    # Keep track of what's been done.
    export MYENV="$_ENVNAME"
  elif [[ "$MYENV" == "$_ENVNAME" ]]; then
    # Deactivate the virtualenv.
    [ -n "$VIRTUAL_ENV" ] && deactivate
    # Unload the modules. The double loop is gross and inefficient
    # but more shell-agnostic and readable than other approaches.
    declare -a SDNAMMOC_
    for _CMD in "${_COMMANDS[@]}"; do
      _CMD="$(echo ${_CMD} | sed 's/^lo\(.*\)$/unlo\1/' | \
              awk '{ printf("%s ", $1)
                     for (i=NF; i>2; i--) printf("%s ",$i)
                     print $2 }')"
      SDNAMMOC_=("$_CMD" "${SDNAMMOC_[@]}")
    done
    for _CMD in "${SDNAMMOC_[@]}"; do
      echo "module $_CMD"
      eval "module $_CMD"
    done
    # Cleanup the environment.
    unset MYENV
  else
    echo "ERROR - Cannot alter $_ENVNAME environment:"
    [ -n "$VIRTUAL_ENV" ] && echo "        $VIRTUAL_ENV already active."
    [ -n "$MYENV" ] && echo "        Run script for $MYENV first"
  fi
}

# Usage Examples
# ==============
# Environment module requirements for a venv are unlikely to change often.
# Therefore it makes sense to use the above function in other functions or
# aliases crafted for specific venvs deployed on OLCF resources.

# These examples assume a venv is located at $HOME/.venvs/rhea-app that was
# built using the python/2.7.9 module on Rhea and has dependencies on the PE-gnu
# and netcdf environment modules.

# Within a shell function
function venvctl-rhea-app () {
  declare -a COMMANDS
  COMMANDS=(
    "swap PE-intel PE-gnu"
    "load netcdf python/2.7.9"
  )
  venvmodulectl "$HOME/.venvs/rhea-app" "${COMMANDS[@]}"
}

# Within an alias
alias venvctl-rhea-app-alias='venvmodulectl "$HOME/.venvs/rhea-app" "swap PE-intel PE-gnu" "load netcdf python/2.7.9"'