@@ -28,5 +28,24 @@ Let's put together a list of things we'd like to have in the harness and maybe t
Possible additional tasks for harness - WJ
==========================================
* LSF integration
* convert to python 3 (crest has python3.4, titan has python3.4.3 loadable through module)
* get the threaded version running robustly
* code checking with pylint
* adopt a consistent coding style, e.g., the Google Python style guide - not absolutely necessary, but if we can agree on one, it would improve the quality of the product
* in-code documentation - docstrings etc.
* user manual, getting started guide, FAQ - to whatever level of detail we think is needed
* integration of a documentation generation tool
* assertions or other fine-grained error checking throughout code
* unit testing
* CI testing
* notification of failed jobs, e.g., email, IM, text message (possibly outside the harness proper)
* Nagios interface to notify operators of a critical failure condition (possibly outside the harness proper)
* reporting of the global status of an acceptance campaign, e.g., via Splunk
* separate class to handle the scheduler choice (LSF, PBS, etc.) to make it easier to extend to new schedulers (see the scheduler sketch after this list)
* way to define many tests with minimal or no redundancy in what the user specifies
* careful examination of robustness under instabilities, e.g., a flaky file system, to make the harness robust under any conditions, within reason
* give the harness a "trusted" file system location (not Lustre) that will be reliable with a very high degree of confidence, for storing records and logs safely even in the case of transient errors/failures
* some way to navigate the files relevant to a run more easily - e.g., in the work dir put symlinks to the matching build dir, status file, run archive file/dir, scripts dir, source dir, job id file, etc. - cross-links to make it easy to move around (see the symlink sketch after this list). The usability question: how can we do our failure diagnosis with the minimal number of clicks/keystrokes?
* agreed-upon policy for what constitutes "failure" of different kinds, as a basis for consistent reporting
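
For the scheduler item above, one shape the abstraction could take is a base class with one subclass per batch system and a small factory, so supporting a new scheduler means adding a single subclass. This is only a sketch under assumptions: the names (BaseScheduler, LSFScheduler, make_scheduler) are hypothetical, not from the existing harness, and the exact parsing of bsub/bjobs/qsub output is illustrative rather than definitive::

    import abc
    import subprocess


    class BaseScheduler(abc.ABC):
        """Interface the harness would code against, independent of batch system."""

        @abc.abstractmethod
        def submit(self, script_path):
            """Submit a batch script; return the job id as a string."""

        @abc.abstractmethod
        def status(self, job_id):
            """Return a coarse state such as 'PENDING', 'RUNNING', or 'DONE'."""


    class LSFScheduler(BaseScheduler):
        def submit(self, script_path):
            # bsub reads the script on stdin and prints e.g.
            # "Job <12345> is submitted to queue <batch>."
            with open(script_path) as script:
                out = subprocess.check_output(["bsub"], stdin=script).decode()
            return out.split("<")[1].split(">")[0]

        def status(self, job_id):
            out = subprocess.check_output(["bjobs", "-noheader", "-o", "stat", job_id])
            return out.decode().strip()


    class PBSScheduler(BaseScheduler):
        def submit(self, script_path):
            # qsub prints the job id on stdout.
            return subprocess.check_output(["qsub", script_path]).decode().strip()

        def status(self, job_id):
            out = subprocess.check_output(["qstat", "-f", job_id]).decode()
            # Extract the job_state field from the full record.
            for line in out.splitlines():
                if "job_state" in line:
                    return line.split("=")[1].strip()
            return "UNKNOWN"


    def make_scheduler(name):
        """Factory so a test only names its scheduler in configuration."""
        return {"lsf": LSFScheduler, "pbs": PBSScheduler}[name.lower()]()

The rest of the harness would only ever call make_scheduler() and the two methods, so adding another scheduler later would touch one file rather than the whole code base.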
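
For the cross-linking item, a minimal sketch (the helper name and the particular link names are hypothetical) of dropping symlinks into a run's work directory so the related locations are one step away::

    import os


    def link_related_paths(work_dir, related):
        """Create symlinks in work_dir pointing at the paths related to this run.

        `related` maps a link name to a target path, e.g.
        {"build_dir": ..., "status_file": ..., "scripts_dir": ..., "job_id_file": ...}.
        """
        for name, target in related.items():
            link = os.path.join(work_dir, name)
            if not os.path.lexists(link):
                os.symlink(os.path.abspath(target), link)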