Improved error messaging

Problem: Users need better notification. Either they are not notified of issues at all, or they receive raw error messages that they generally cannot interpret. They are confused, and think the problem is theirs to solve. Below are specific cases where error messaging can be improved, with suggested improvements.

Timeouts (diagnostic file: scheduler_stderr.txt)

The message from slurm (stderr.txt) says the job was cancelled, but the user does not receive an explicit notification that they reached a timeout. (There is a separate document dealing with restarts appended to the bottom of this document.)

  srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
  slurmstepd: *** STEP 14509054.0 ON comet-29-72 CANCELLED AT 2018-02-26T14:43:48 DUE TO TIME LIMIT ***
  srun: got SIGCONT
  srun: forcing job termination
  srun: error: comet-29-72: task 0: Terminated

Or, for scheduler_stderr.txt:

  slurmstepd: *** STEP 14509054.0 ON comet-29-72 CANCELLED AT 2018-02-26T14:43:48 DUE TO TIME

When this is how a job ends, the user should receive a specific job-timeout email instead of the usual email that says their job has completed. The text of the email depends on whether or not the code supports restarts, as follows:

The message starts: "Your CIPRES job has reached the end of its maximum allowed run time."

If code X supports restarts, the body of the message continues: "This job can be continued using the X_restart interface, but only if you act within 14 days."

If the code does not support restarts, the message continues: "Unfortunately this job can't be continued, but you can clone and resubmit using a longer maximum wall time."

The methodology for handling restarts is covered in greater detail in a separate document.

When CanSubmit = 0, or the user exceeds the data limit

Both of these conditions currently throw the same error message, which appears in the user interface at the top of the page, in red:

  - Sorry, job submission from your account has been temporarily suspended. Most likely this is due to heavy consumption of community resources from this account. For information on reactivating your account, or if you think you received this message in error, please contact mmiller@sdsc.edu

These two conditions should give descriptive messaging specific to the issue. If the user has used too many SUs, the message should be:

  - Sorry, job submission from your account has been temporarily suspended because you have reached the maximum number of core hours for the current allocation year. For information on reactivating your account, or if you think you received this message in error, please contact mmiller@sdsc.edu

If the user has exceeded the data allocation, the message should read:

  - Sorry, job submission from your account has been temporarily suspended because you have exceeded the maximum amount of data storage permitted. You can reactivate your account by deleting some of your data. If you have questions, or think you received this message in error, please contact mmiller@sdsc.edu

Submission failures

Job submissions can fail for a number of reasons. I am keeping a catalogue of the potential messages users might receive. I would like us to create an intervening utility that parses job submission failure messages, translates them into something useful, and forwards a user-friendly version of the message to the user. The raw email message can still come to CIPRES staff. The goal is to implement a form of smart retrying; a rough sketch of such a retry loop appears just below.
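The sketch below illustrates one way the "retry for 24 hours, then give up" behavior could be implemented for failures we decide are retryable. It is only a sketch: the helper names (submit_job, notify_user, notify_staff) and the 15-minute spacing between attempts are assumptions, not existing CIPRES code.

    import time

    RETRY_WINDOW_SECONDS = 24 * 60 * 60   # give up after 24 hours, per the message text
    RETRY_INTERVAL_SECONDS = 15 * 60      # assumed spacing between attempts

    def submit_with_retries(job, submit_job, notify_user, notify_staff):
        """Retry a submission that failed with a retryable error."""
        deadline = time.time() + RETRY_WINDOW_SECONDS
        last_error = None
        while time.time() < deadline:
            ok, raw_error = submit_job(job)   # hypothetical call into the scheduler layer
            if ok:
                notify_user(job, "Your job was submitted successfully.")
                return True
            last_error = raw_error
            time.sleep(RETRY_INTERVAL_SECONDS)
        # Gave up: tell the user, and forward the raw message to CIPRES staff.
        notify_user(job, "CIPRES could not submit your job within 24 hours.")
        notify_staff(job, last_error)
        return False

Which failures are treated as retryable, and which canned message the user receives, is determined by the message classification described next.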
Some error messages will fail consistently on retry, while others are likely to succeed on retry. For example, the scheduler can be out for just a minute or two, but a problem with the user's account or authentication is unlikely to be resolved within 4 hours. There are five types of messages that might appear. The first two indicate that we will retry the job automatically (perhaps we send an email when submission is successful, or when we have given up at the timeout). The last three are for jobs that are unlikely to succeed on retry.

1. Communication problem, will retry: Your job submission was unsuccessful because the scheduler is temporarily off line or too busy to respond. CIPRES will retry your submission for the next 24 hours before giving up. (You may cancel the retries if you wish?) You will be notified when submission is successful. If you have questions, or you get bored of waiting, please contact mmiller@sdsc.edu and include the raw error message, which is: (text of raw error message).

2. Temporary issue, will retry: Your job submission was unsuccessful because of a temporary load issue of some kind. CIPRES will retry your submission for the next 24 hours before giving up. (You may cancel the retries if you wish?) You will be notified when submission is successful. If you have questions, or you get bored of waiting, please contact mmiller@sdsc.edu and include the raw error message, which is: (text of raw error message).

3. Your job submission was unsuccessful because of a problem with authentication on our side. Please send the raw error message to mmiller@sdsc.edu.

4. Your job submission was unsuccessful because of an unexpected problem on our side. Please send the raw error message to mmiller@sdsc.edu.

5. Your job submission was unsuccessful, most likely because of a bug in the CIPRES interface. Please send the raw error message to mmiller@sdsc.edu.

The table below lists phrases that can be parsed out of the diagnostic messages received over the last 18 months, whether submission should be retried when each phrase appears (1 = retry, 0 = do not retry), and which of the five messages above to send. A sketch of a classifier built directly from this table follows it.

  Phrase                                                        Retry  Message
  Connectivity issues
    Unable to contact slurm controller                            1      1
    Socket timed out on send/recv operation                       1      1
    Connection closed by remote host                              1      1
    Connection timed out                                          1      1
    Connection closed by 198.                                     1      1
    Connect to host comet.sdsc.edu port 22: No route to host      1      1
    Broken Pipe                                                   1      1
    There was a problem while connecting                          1      1
    Connection reset by peer                                      1      1
    Cannot connect to server tscc                                 1      1
    Transport endpoint is not connected                           1      1
    Connection refused                                            1      1
    No such device or address                                     1      1
    Resource temporarily unavailable                              1      1
  Other issues
    Authentication failed                                         0      3
    Expired user                                                  0      3
    Invalid account                                               0      3
    Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,keyboard-interactive,hostbased).   0   3
    Required partition not available                              0      4
    Invalid email format in line                                  0      4
    No space left on device                                       0      4
    REMOTE HOST IDENTIFICATION HAS CHANGED                        0      4
    ValueError: invalid literal for int()                         0      5
    Requested node configuration is not available                 0      5
    No row in task_input_source_documents                         1      2
    Could not chdir to home directory                             1      2
    Too many tasks                                                1      2
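A minimal sketch of that classifier, assuming the raw scheduler or ssh output is available as a single string. The rule triples come straight from the table above; the function name classify_failure, the case-insensitive substring matching, and the fallback to message 4 for unrecognized text are assumptions.

    # (phrase, retry, message number) triples copied from the table above.
    RULES = [
        # Connectivity issues: retry, message 1
        ("Unable to contact slurm controller", True, 1),
        ("Socket timed out on send/recv operation", True, 1),
        ("Connection closed by remote host", True, 1),
        ("Connection timed out", True, 1),
        ("Connection closed by 198.", True, 1),
        ("Connect to host comet.sdsc.edu port 22: No route to host", True, 1),
        ("Broken Pipe", True, 1),
        ("There was a problem while connecting", True, 1),
        ("Connection reset by peer", True, 1),
        ("Cannot connect to server tscc", True, 1),
        ("Transport endpoint is not connected", True, 1),
        ("Connection refused", True, 1),
        ("No such device or address", True, 1),
        ("Resource temporarily unavailable", True, 1),
        # Other issues
        ("Authentication failed", False, 3),
        ("Expired user", False, 3),
        ("Invalid account", False, 3),
        ("Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,keyboard-interactive,hostbased).", False, 3),
        ("Required partition not available", False, 4),
        ("Invalid email format in line", False, 4),
        ("No space left on device", False, 4),
        ("REMOTE HOST IDENTIFICATION HAS CHANGED", False, 4),
        ("ValueError: invalid literal for int()", False, 5),
        ("Requested node configuration is not available", False, 5),
        ("No row in task_input_source_documents", True, 2),
        ("Could not chdir to home directory", True, 2),
        ("Too many tasks", True, 2),
    ]

    def classify_failure(raw_message):
        """Return (retry, message_number) for a raw submission-failure message.

        Matching is case-insensitive substring search; unrecognized messages
        fall through to message 4 ("unexpected problem on our side") with no
        retry, which is an assumption, not part of the table.
        """
        lowered = raw_message.lower()
        for phrase, retry, message_number in RULES:
            if phrase.lower() in lowered:
                return retry, message_number
        return False, 4

The intervening utility would then send the user the template selected by the message number, keep the raw text for the email to CIPRES staff, and hand retryable failures to the retry loop sketched earlier.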
Improved restarts

Better restart capabilities: the user has a way to restart a job in a directory that is still in the /ARCHIVE folder.

Specifications:

Use case: The user runs a job that supports checkpointing/restarting. The job times out at Y hours (or whatever time). The user receives an email message whose attention line says the CIPRES job has timed out.

Maybe: the user interface displays a blue button that says ... instead of ... .

The text of the email is as follows: "Your CIPRES job has reached the end of its maximum allowed run time." If code X supports restarts, the body of the message says: "This job can be continued using the X_restart interface, but only if you act within 14 days." If the code does not support restarts, the message says: "Unfortunately this job can't be continued, but you can clone and resubmit using a longer maximum wall time."

Each code X has a corresponding X_Restart interface. The X_Restart interface could work like this (open to suggestions/improvements): the input file could be the original input file, or it could be _JOBINFO.txt, which could be parsed to identify the directory. The behavior needed for each of the currently restartable codes is listed below; a sketch of a helper that assembles the restart command appears after the list.

Current restartable codes:

MrBayes: the original input file must have the expression append=yes on the mcmc command line. If there is no MrBayes block, one has to be created and appended to the original data set, with the original parameters identical to run 1 and append=yes included on the mcmc line. One way to do this is to rename the original input file in the directory.

BEAST2: re-run from the command line in the same directory, adding the -resume option.

IQ-TREE: we can just re-run the job. If an IQ-TREE run was interrupted for some reason, rerunning it with the same command line and input data will automatically resume the analysis from the last stopping point, and the previous log file will be appended. If a run finished successfully, IQ-TREE will refuse to rerun and will issue an error message. A few options control checkpointing behavior:

Phylobayes: to restart an existing chain: pb
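A rough sketch of how an X_Restart helper could tie these behaviors together. The _JOBINFO.txt parsing assumes simple "key: value" lines with the working directory under a key containing the word "directory"; the real file layout may differ. The per-code recipes encode only what is written above, and Phylobayes is left out because the note on it is incomplete.

    def find_archive_directory(jobinfo_path):
        """Pull the job's working directory out of _JOBINFO.txt (assumed key: value layout)."""
        with open(jobinfo_path) as fh:
            for line in fh:
                if ":" not in line:
                    continue
                key, _, value = line.partition(":")
                if "directory" in key.strip().lower():
                    return value.strip()
        raise ValueError("No directory entry found in " + jobinfo_path)

    def restart_command(tool, original_command):
        """Return the command to rerun in the archived directory, per the notes above.

        original_command is the job's original command line as a list of strings.
        """
        tool = tool.lower()
        if tool == "mrbayes":
            # Precondition (see above): append=yes must be added to the mcmc line of
            # the input file (creating a MrBayes block if there is none) before rerunning.
            return list(original_command)
        if tool == "beast2":
            # BEAST2 resumes when -resume is added to the original command line.
            return list(original_command) + ["-resume"]
        if tool == "iqtree":
            # IQ-TREE resumes automatically when rerun with the same command line
            # and input data in the same directory.
            return list(original_command)
        raise NotImplementedError("No restart recipe recorded for " + tool)

The X_Restart interface would then change into the recovered /ARCHIVE directory, make the MrBayes input-file edit where needed, and launch the returned command.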