## [Watch Server](https://code.ornl.gov/CNMS/CNMS_Computing_Resources/blob/master/utility/watch_server)

### Beta -- [Help](mailto:doakpw@ornl.gov), or even faster, @pdoak on Cades Condos slack.

This is a simple Python server that watches logs and writes ticks to an InfluxDB database. It finds the logs and names each one after the directory it is in. Each instance of the server can handle all your calculations with a particular code in a particular directory tree.

You can then watch your codes run on a [Grafana dashboard](http://128.219.185.137:3000/dashboard/db/jobs-influx?from=1487230378812&to=1487359978813).

**Waste less CPU time and catch calculation problems quickly.**

### Basic Usage

Clone the repo. The commands here assume you cloned it into your home directory.

```shell-session
[you@or-condo-login02 ~]$ export CCRWS=~/CNMS_Computing_Resources/utility/watch_server
[you@or-condo-login02 ~]$ mkdir ~/watch; cd ~/watch
[you@or-condo-login02 watch]$ cp $CCRWS/watch_whatever_code.yaml ./
[you@or-condo-login02 watch]$ vim watch_whatever_code.yaml   # or emacs, nano, ...
```

Edit `root_dir`, `user_name`, and `log_name` if needed. The rest should already be set properly.

```shell-session
[you@or-condo-login02 watch]$ nohup python $CCRWS/watch_server.py ./watch_whatever_code.yaml > whatever_code.out 2>&1 &
```

Now when the server sees a log file appear anywhere in the `root_dir` tree, it will begin to read it and write real-time data to the InfluxDB server.

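To make the discovery step concrete, here is a minimal sketch of how log files could be found under `root_dir` and named after their directory. It is not taken from `watch_server.py`: the function names `find_logs` and `job_name`, and the `prefix_directory_suffix` naming scheme, are assumptions made for illustration.

```python
import os
import tempfile


def find_logs(root_dir, log_name):
    """Yield the path of every file called `log_name` below `root_dir`."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if log_name in filenames:
            yield os.path.join(dirpath, log_name)


def job_name(log_path, root_dir, job_prefix, job_suffix):
    """Build a job name from the log's directory (hypothetical naming scheme)."""
    rel_dir = os.path.relpath(os.path.dirname(log_path), root_dir)
    # "." means the log sits directly in root_dir; otherwise flatten the path.
    label = "root" if rel_dir == "." else rel_dir.replace(os.sep, "_")
    return "_".join([job_prefix, label, job_suffix])


if __name__ == "__main__":
    # Build a throwaway directory tree so the example runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        run_dir = os.path.join(root, "bulk", "relax")
        os.makedirs(run_dir)
        open(os.path.join(run_dir, "out"), "w").close()
        for path in find_logs(root, "out"):
            # prints espresso_bulk_relax_condo
            print(job_name(path, root, "espresso", "condo"))
```

The arguments mirror the `root_dir`, `log_name`, `job_prefix`, and `job_suffix` keys of the YAML file; how the real server combines them may differ.
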
### Grafana Dashboard

[Go here](http://128.219.185.137:3000/)

Eventually this will be integrated with UCAMS, but for the beta just sign up for an account.

You should be able to see a few sample dashboards in the dashboard list. They have a drop-down userid selector at the upper left. Once you've written some data from your calculations, your name should show up. Select it and you should see ticks from your jobs.

If you want to modify a dashboard, you should be able to click the gear icon and choose "Save as".

### Beta Note!

If you can't get this all working smoothly, don't worry: you're one of the first to try this. It's been cobbled together in my (Peter Doak) spare time, but I think it's worth sharing. I will help you get set up, and we will improve the docs and system.

### YAML configuration file

```yaml
log_name: out # name of your outfile for this code
dropoff: 20 # minutes since last status change to no longer find log
host: 128.219.185.137 # influxdb host
port: 8086 # influxdb port
root_dir: /your/calculation/root/dir # top of your calculation tree
user_name: your id # your user id
job_prefix: 'espresso' # generally the code
job_suffix: 'condo' # generally the server
influx: True # we're using influxdb here
start: 'PWSCF.*starts' # what to match to know a calculation is starting
init: # measurements that want a first tick
  'tcpu': 0 # startup tick value for tcpu
header: # these are header matches, we expect them once per run
  'Parallel version (MPI), running on': [[6, 'nproc']]
  'K-points.*npool': [[5, 'kpar']]
  'number of k points': [[5, 'kpoints']]
  'number of atoms/cell': [[5, 'natoms']]
  'number of Kohn-Sham states': [[5, 'nstates']]
parse: # these are recurring measurements
  # regex: column, tick_name
  'total cpu time spent': [[9, 'tcpu']]
  'total energy': [[4, 'toten']]
  'estimated': [[5, 'eacc']]
  'Total force': [[4, 'tforce']]
finish: # this is matched when the job finishes neatly
  - 'JOB DONE'
idle_count: 30 # how many times a log can be idle before it is dropped
find_sleep: 60 # time between idle checks
```

The comments provide sufficient explanation except for the actual parsing items.

The format works like this:

```
'regex of line with data': [[column of data, 'measurement name'], ...]
```

So it works a bit in the spirit of awk: match a line with a regex, then pick out the data by column.

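As a rough illustration of that format, the sketch below applies rules of this shape to single log lines. It is not the code in `watch_server.py`: the helper `extract_ticks` is hypothetical, the sample log lines are made up, and it assumes columns are 1-based whitespace-separated fields, as they are in awk.

```python
import re

# Rules in the same shape as the `parse:` section of the YAML file.
PARSE_RULES = {
    "total cpu time spent": [[9, "tcpu"]],
    "total energy": [[4, "toten"]],
}


def extract_ticks(line, rules):
    """Return {measurement_name: value} for every rule whose regex matches `line`."""
    ticks = {}
    fields = line.split()  # whitespace-separated, like awk's default
    for pattern, targets in rules.items():
        if not re.search(pattern, line):
            continue
        for column, name in targets:
            # Assumption: columns are 1-based, as in awk ($1 is the first field).
            if 1 <= column <= len(fields):
                try:
                    ticks[name] = float(fields[column - 1])
                except ValueError:
                    pass  # the field is not numeric, so skip it
    return ticks


if __name__ == "__main__":
    # Made-up log lines, for illustration only.
    print(extract_ticks("total cpu time spent up to now is 12.5 secs", PARSE_RULES))
    print(extract_ticks("total energy = -93.45 Ry", PARSE_RULES))
```

Because each regex maps to a list of `[column, name]` pairs, one matching line can feed several measurements.
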