Help, or even faster, @pdoak on Cades Condos slack.
Beta --
This is a simple Python server that watches logs and writes ticks to an influxdb database. It finds the logs and names each one after the directory the log is in. Each instance of the server can handle all your calculations with a particular code in a particular directory tree.
You can then watch your codes run on a grafana dashboard.
Waste less cpu time and catch calculation problems quickly.
Clone the repo. I'm assuming you've done this in your home directory.
[you@or-condo-login02 ~]$ export CCRWS=~/CNMS_Computing_Resources/utility/watch_server
[you@or-condo-login02 ~]$ mkdir ~/watch; cd ~/watch
[you@or-condo-login02 ~]$ cp $CCRWS/watch_whatever_code.yaml ./
[you@or-condo-login02 ~]$ vim watch_whatever_code.yaml   # or emacs, nano, ...
Edit root_dir, user_name and log_name if needed.
The rest should be properly set already.
Then you can start the watch server like so:
[you@or-condo-login02 ~]$ nohup python $CCRWS/watch_server.py ./watch_whatever_code.yaml > whatever_code.out 2>&1 &
Now when the server sees a log file appear anywhere in the root_dir tree, it will begin to read it and write real-time data to the influxdb server.
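As a rough illustration of the discovery step, here is a minimal sketch of walking a root_dir tree for files called log_name and deriving a name from each log's directory. The function name find_logs and the naming scheme (prefix, directory components and suffix joined with underscores) are assumptions made for this example, not the actual logic in watch_server.py.

```python
import os


def find_logs(root_dir, log_name, job_prefix="", job_suffix=""):
    """Map a directory-derived job name to each log file under root_dir.

    Hypothetical helper: the real watch_server.py may name jobs differently.
    """
    logs = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if log_name not in filenames:
            continue
        # Directory of the log, relative to the top of the calculation tree.
        rel = os.path.relpath(dirpath, root_dir)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        # Assumed naming scheme: prefix, directory components, suffix.
        name = "_".join(p for p in [job_prefix, *parts, job_suffix] if p)
        logs[name] = os.path.join(dirpath, log_name)
    return logs
```

With the sample configuration values, a log at root_dir/si/relax/out would be named espresso_si_relax_condo under this assumed scheme.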
Grafana Dashboard
Eventually this will be integrated with UCAMS, but for the beta just sign up for an account.
You should be able to see a few sample dashboards on the dashboard list. They have a drop-down userid selector at the upper left. Once you've written some data from your calculations, your name should show up. Select it and you should see ticks from your jobs.
If you want to modify the dashboard you should be able to click the gear icon and save as.
The default yaml files point at a backend that is available to any internal ORNL IP address. The more of us that send data there, the more awesome the computational science at the CNMS looks. Additionally, historical data about how different codes have run could lead to enhanced optimization. If you aren't comfortable with this, see below; you can still participate in the beta.
Gotcha: calculation restart issue, see
If you can't get this all working smoothly, don't worry, you're one of the first to try this. It's been cobbled together in my (Peter Doak) spare time, but I think it's worth sharing. I will help you get set up, and we will improve the docs and system.
But I'm a private/paranoid Scientist
You can have your own influxdb and grafana server that only you can access. I can help you set the back end up as a pair of docker containers on your cades-openstack birthright computing. Then you can break, extend or neglect that part of the watchserver system completely on your own.
Yaml configuration file
log_name: out                          #name of your outfile for this code
dropoff: 20                            #minutes since last status change to no longer find log
host: 18.104.22.168                    #influxdb host
port: 8086                             #influxdb port
root_dir: /your/calculation/root/dir   #top of your calculation tree
user_name: your id                     #your user id
job_prefix: 'espresso'                 #generally the code
job_suffix: 'condo'                    #generally the server
influx: True                           #we're using influxdb here
start: 'PWSCF.*starts'                 #what to match to know a calculation is starting
init:                                  #measurements that want a first tick
  'tcpu': 0                            #startup tick value for tcpu
header:                                #these are header matches, we expect them once per run
  'Parallel version (MPI), running on': [[6, 'nproc']]
  'K-points.*npool': [[5, 'kpar']]
  'number of k points': [[5, 'kpoints']]
  'number of atoms/cell': [[5, 'natoms']]
  'number of Kohn-Sham states': [[5, 'nstates']]
parse:                                 #these are recurring measurements
  # regex: column, tick_name
  'total cpu time spent': [[9, 'tcpu']]
  'total energy': [[4, 'toten']]
  'estimated': [[5, 'eacc']]
  'Total force': [[4, 'tforce']]
finish:                                #this is matched when the job finishes neatly
  - 'JOB DONE'
idle_count: 30                         #how many times a log can be idle before it is dropped
find_sleep: 60                         #time between idle checks.
The comments provide sufficient explanation except for the actual parsing items.
The format works like this:
'regex of line with data': [ [ column of data, 'measurement name'] , ... ]
So a bit in the spirit of awk.
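To make the rule format concrete, here is a minimal sketch of how one such rule could be applied to a log line. It assumes awk-style columns, that is whitespace-separated fields counted from 1, which is inferred from the awk comparison rather than confirmed against watch_server.py. The function name extract_ticks and the sample log line are made up for illustration.

```python
import re


def extract_ticks(line, rules):
    """Return {measurement name: float} for every rule whose regex matches line.

    Assumption: columns are whitespace-separated fields counted from 1, as in awk.
    """
    ticks = {}
    fields = line.split()
    for pattern, targets in rules.items():
        if not re.search(pattern, line):
            continue
        for column, name in targets:
            if not 1 <= column <= len(fields):
                continue  # line has fewer fields than the rule expects
            try:
                ticks[name] = float(fields[column - 1])
            except ValueError:
                pass  # field is not numeric, so no tick is produced
    return ticks


# Two rules copied from the parse section of the sample configuration.
rules = {
    "total energy": [[4, "toten"]],
    "estimated": [[5, "eacc"]],
}

# Made-up sample line, shaped like a pw.x output line.
print(extract_ticks("     total energy   =   -93.45 Ry", rules))
# prints {'toten': -93.45}
```

Because each rule holds a list of [column, name] pairs, a single matching line can yield several measurements.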