- 24 Aug, 2016 2 commits
- 08 Aug, 2016 1 commit
-
-
Paul Rich authored
Fixing a situation where setting locations in a reservation job causes issues.
-
- 06 Aug, 2016 1 commit
-
-
Paul Rich authored
There was one further step needed for running jobs. Also fixing a potential statefile issue with prior versions.
-
- 03 Aug, 2016 2 commits
- 01 Aug, 2016 4 commits
- 31 Jul, 2016 1 commit
-
-
Paul Rich authored
The node state update was resetting an admin down. Added an additional flag so we can differentiate between admin down and hardware down. If a node is marked down with an admin command, it will remain marked down no matter what the hardware reports.
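A minimal sketch of the admin-down flag described above. The `Node` class, its attribute names, and the status strings are hypothetical illustrations, not Cobalt's actual code; the point is only that a hardware status update is ignored while the admin flag is set.

```python
class Node(object):
    """Hypothetical node record; not Cobalt's actual node class."""

    def __init__(self, name):
        self.name = name
        self.status = "idle"
        self.admin_down = False  # set only by an explicit admin command

    def admin_mark_down(self):
        # An admin command sets the flag and marks the node down.
        self.admin_down = True
        self.status = "down"

    def admin_mark_up(self):
        # Only an admin command may clear the admin flag.
        self.admin_down = False
        self.status = "idle"

    def update_from_hardware(self, hw_status):
        """Apply a hardware-reported status unless an admin hold is set."""
        if self.admin_down:
            return  # admin down always wins over the hardware report
        self.status = hw_status
```

With this shape, a periodic state refresh can call `update_from_hardware` unconditionally without clobbering a node that an administrator took out of service.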
-
- 29 Jul, 2016 1 commit
-
-
Paul Rich authored
-
- 27 Jul, 2016 1 commit
-
-
Paul Rich authored
Support for apkill added to kill the user's ALPS instance in interactive jobs. Kachina testing pending.
-
- 18 Jul, 2016 1 commit
-
-
Paul Rich authored
Resources for interactive jobs are now appropriately released. There is still a known issue with currently running aprun instances. That will be addressed in a further patch.
-
- 06 Jul, 2016 1 commit
-
-
Paul Rich authored
-
- 24 Jun, 2016 2 commits
- 23 Jun, 2016 1 commit
-
-
Paul Rich authored
Rereservations were broken for long (>5 min) startups. This should allow the CAPMC scripts to do their thing.
-
- 13 Jun, 2016 1 commit
-
-
Paul Rich authored
-
- 10 Jun, 2016 1 commit
-
-
Paul Rich authored
-
- 02 Jun, 2016 2 commits
-
-
Paul Rich authored
maxtotaljobs limit added. This adds the limiter for maximum jobs overall running in queue. Useful for profiling machines with noisy network environments. This also adds output to cqadm for this information, and an entry in the cqadm manpage. See merge request !10
-
Paul Rich authored
This adds the limiter for maximum jobs overall running in queue. Useful for profiling machines with noisy network environments. This also adds output to cqadm for this information, and an entry in the cqadm manpage.
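A rough sketch of how a per-queue limit on total running jobs could be checked. The function name, the job dictionaries, and the `max_total_jobs` parameter are assumptions made for illustration; the actual `maxtotaljobs` implementation in cqm may differ.

```python
def can_start_job(running_jobs, queue_name, max_total_jobs):
    """Return True if one more job may start in queue_name.

    running_jobs: iterable of dicts with a "queue" key (hypothetical shape)
    max_total_jobs: integer limit, or None when no limit is configured
    """
    if max_total_jobs is None:
        return True  # no limit set on this queue
    running_in_queue = sum(1 for job in running_jobs
                           if job["queue"] == queue_name)
    return running_in_queue < max_total_jobs
```

Setting the limit to 1 on a profiling queue, for example, would keep a second job from starting while one is already running, which matches the noisy-network use case mentioned above.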
-
- 25 May, 2016 1 commit
-
-
Paul Rich authored
Multiple forkers: Support for multiple forkers (i.e. multiple script hosts) for Cray systems. See merge request !9
-
- 24 May, 2016 1 commit
-
-
Paul Rich authored
-
- 23 May, 2016 1 commit
-
-
Paul Rich authored
-
- 20 May, 2016 1 commit
-
-
Paul Rich authored
Well, we're at least back at original functionality. Forkers are automatically acquired and dispatch to at least the single forker case appears to work.
-
- 11 May, 2016 1 commit
-
-
Paul Rich authored
The ALPS forker can now rename itself at runtime. This will be needed to identify multiple components. The output redirect for consoles still needs some work in the init script.
-
- 09 May, 2016 1 commit
-
-
Paul Rich authored
-
- 04 May, 2016 3 commits
-
-
Paul Rich authored
Fix 17 compact format pbs: This fixes the PBS file format so the nodes for Cray systems are in a compact format. Other systems should not have their records affected. Some of this logic should probably be extended into the cluster systems themselves. See merge request !8
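As an illustration of what a compact node format can look like, the sketch below collapses consecutive node ids into ranges. This assumes "compact" means a range-collapsed list such as `1-3,5`; the commit message does not spell out the exact format, and `compact_node_list` is a hypothetical helper, not the function used in Cobalt.

```python
def compact_node_list(node_ids):
    """Collapse node ids into a comma-separated list of ranges."""
    ids = sorted(set(int(n) for n in node_ids))
    ranges = []
    start = prev = None
    for n in ids:
        if start is None:
            start = prev = n          # first id opens a range
        elif n == prev + 1:
            prev = n                  # extend the current range
        else:
            ranges.append((start, prev))
            start = prev = n          # gap found, open a new range
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else "%d-%d" % (a, b)
                    for a, b in ranges)
```

For large Cray allocations this keeps each accounting record short compared with listing every node id individually.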
-
Paul Rich authored
This was a less trivial change than I thought. Had to do this all in the system component to avoid Cray handling logic leakage into other components like cqm.
-
Paul Rich authored
-
- 03 May, 2016 1 commit
-
-
Paul Rich authored
On restart, if Cobalt was shut down abruptly (like with a power failure or a kill -9), there was a way to lose the forker child process of a process group. The process group would never finish cleaning up, and the associated resources would keep being put into cleanup-pending by the reserve_resources_until code. Now the orphaned process group(s) are cleaned up automatically. CQM jobs that reference these should get back an error stating that the underlying task no longer exists/cannot be found. This circumstance should be rare in production (I hope), but I could see this scenario being triggered during abnormal operations (like a facility power/cooling failure).
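The orphan cleanup can be pictured roughly as follows. This is a simplified sketch under stated assumptions: process groups are modeled as a plain dict from a process group id to a forker child id, and both function names are hypothetical. The real system component tracks much more state than this.

```python
def find_orphaned_process_groups(process_groups, live_child_ids):
    """Return ids of process groups whose forker child no longer exists.

    process_groups: dict mapping process group id -> forker child id
    live_child_ids: iterable of child ids the forker still reports
    """
    live = set(live_child_ids)
    return sorted(pg_id for pg_id, child_id in process_groups.items()
                  if child_id not in live)


def cleanup_orphans(process_groups, live_child_ids):
    """Drop orphaned groups in place and return the ids that were removed."""
    orphans = find_orphaned_process_groups(process_groups, live_child_ids)
    for pg_id in orphans:
        del process_groups[pg_id]
    return orphans
```

Once an orphaned group is removed, a later lookup by a job that still references it fails, which corresponds to the "task no longer exists" error described above.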
-
- 26 Apr, 2016 1 commit
-
-
Paul Rich authored
info.
-
- 22 Apr, 2016 3 commits
-
-
Paul Rich authored
Enh 13 cobalt reservations: This brings reservation support to Cray systems. See merge request !7
-
Paul Rich authored
-
Paul Rich authored
The overlap check was failing. It has been modified for Cray systems so that the check is entirely local, with no call to the system component. There is no reason for the remote information at this point because there is no possibility of node overlap.
-
- 21 Apr, 2016 1 commit
-
-
Paul Rich authored
-
- 20 Apr, 2016 1 commit
-
-
Paul Rich authored
There was a bug that counted each active reservation node as 2 nodes when determining how many nodes were left in the non-reservation queue.
-
- 19 Apr, 2016 2 commits