Changes

Doak, Peter W. · 57ba7100
--- a/Monitoring-Codes.md
+++ b/Monitoring-Codes.md
+## [Watch Server](https://code.ornl.gov/CNMS/CNMS_Computing_Resources/blob/master/utility/watch_server)
+### Beta -- [Help](mailto:doakpw@ornl.gov), or even faster, @pdoak on Cades Condos slack.
+This is a simple Python server that watches logs and writes ticks to an InfluxDB database. It finds the logs and names each one based on the directory the log is in. Each instance of the server can handle all your calculations with a particular code in a particular directory tree.
+
+You can then watch your codes run on a [grafana dashboard](http://128.219.185.137:3000/dashboard/db/jobs-influx?from=1487230378812&to=1487359978813).
+
+**Waste less CPU time and catch calculation problems quickly.**
+
+### Basic Usage
+Clone the repo. The commands below assume you cloned it into your home directory.
+
+```shell-session
+[you@or-condo-login02 ~]$ export CCRWS=~/CNMS_Computing_Resources/utility/watch_server
+[you@or-condo-login02 ~]$ mkdir ~/watch && cd ~/watch
+[you@or-condo-login02 ~]$ cp $CCRWS/watch_whatever_code.yaml ./
+[you@or-condo-login02 ~]$ vim watch_whatever_code.yaml   # or emacs, nano, etc.
+```
+Edit `root_dir`, `user_name` and `log_name` if needed.  
+The rest should be properly set already.
+
+```shell-session
+[you@or-condo-login02 ~]$ nohup python $CCRWS/watch_server.py ./watch_whatever_code.yaml > whatever_code.out 2>&1 &
+```
+Now when the server sees a log file appear anywhere in the `root_dir` tree, it will begin to read it and write real-time data to the InfluxDB server.
+
+### Grafana Dashboard
+[Go here](http://128.219.185.137:3000/)
+
+Eventually this will be integrated with UCAMS, but for the beta just sign up for an account.  
+You should be able to see a few sample dashboards on the dashboard list. They have a drop-down userid selector at the upper left. Once you've written some data from your calculations, your name should show up. Select it and you should see ticks from your jobs.  
+If you want to modify the dashboard you should be able to click the gear icon and save as.
+
+### Beta Note!
+If you can't get this all working smoothly, don't worry: you're one of the first to try this. It's been cobbled together in my (Peter Doak) spare time, but I think it's worth sharing. I will help you get set up, and we will improve the docs and the system.
+
+### YAML configuration file
+```yaml
+log_name: out #name of your outfile for this code
+dropoff: 20 #minutes since last status change to no longer find log
+host: 128.219.185.137 #influxdb host
+port: 8086 #influxdb port
+root_dir: /your/calculation/root/dir #top of your calculation tree
+user_name: your id #your user id
+job_prefix: 'espresso' #generally the code
+job_suffix: 'condo' #generally the server
+influx: True #we're using influxdb here
+start: 'PWSCF.*starts' #what to match to know a calculation is starting
+init: #measurements that want a first tick
+  'tcpu': 0 #startup tick value for tcpu
+header: #these are header matches, we expect them once per run
+  'Parallel version (MPI), running on': [[6, 'nproc']]
+  'K-points.*npool': [[5, 'kpar']]
+  'number of k points': [[5, 'kpoints']]
+  'number of atoms/cell': [[5, 'natoms']]
+  'number of Kohn-Sham states': [[5, 'nstates']]
+parse: #these are recurring measurements
+# regex: column, tick_name
+  'total cpu time spent': [[9, 'tcpu']]
+  'total energy': [[4, 'toten']]
+  'estimated': [[5, 'eacc']]
+  'Total force': [[4, 'tforce']]
+finish: #this is matched when the job finishes neatly
+- 'JOB DONE'
+idle_count: 30 #how many times a log can be idle before it is dropped
+find_sleep: 60 #time between idle checks.
+```
+The comments provide sufficient explanation except for the actual parsing items.
+
+The format works like this  
+```
+'regex of line with data': [ [ column of data, 'measurement name'] , ... ]
+```
+So a bit in the spirit of awk.
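
To make the awk analogy concrete, here is a minimal Python sketch of how a rule in that format could be applied to a single log line. It is not the code in `watch_server.py`: the function name `extract_ticks`, the sample log line, and the choice of 1-indexed, whitespace-separated columns (as with awk's `$1`, `$2`, ...) are all assumptions made for illustration.

```python
import re

# Hypothetical rule table in the same shape as the `parse:` section above.
PARSE_RULES = {
    'total cpu time spent': [[9, 'tcpu']],
    'estimated': [[5, 'eacc']],
}


def extract_ticks(line, rules):
    """Return {measurement_name: value} for every rule whose regex matches line."""
    ticks = {}
    fields = line.split()  # whitespace-separated columns, like awk's default
    for pattern, targets in rules.items():
        if not re.search(pattern, line):
            continue
        for column, name in targets:
            # Assumption: columns are 1-indexed, as in awk.
            if column > len(fields):
                continue
            try:
                ticks[name] = float(fields[column - 1])
            except ValueError:
                # The column did not hold a number; skip it.
                pass
    return ticks


# Illustrative log line, not taken from a real run.
sample = "     total cpu time spent up to now is        1.2 secs"
print(extract_ticks(sample, PARSE_RULES))
```

Under these assumptions the sample line matches only the `total cpu time spent` rule, and its ninth column becomes the `tcpu` tick. If the real server counts columns from zero or splits lines differently, the column numbers in your YAML file would need to follow that convention instead.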