- 31 Jul, 2016 1 commit
Paul Rich authored
The node state update was resetting admin-down nodes. Added an additional flag so we can differentiate between admin down and hardware down. If a node is marked down by an admin command, it will remain marked down no matter what the hardware reports.
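A minimal sketch of the two-flag scheme described above. The attribute and method names here (`admin_down`, `hardware_down`, `update_from_hardware`) are invented for illustration and are not Cobalt's actual fields:

```python
class Node(object):
    """Hypothetical node-state sketch; names are invented for illustration."""
    def __init__(self, name):
        self.name = name
        self.admin_down = False     # set only by an explicit admin command
        self.hardware_down = False  # tracks what the hardware reports

    def mark_admin_down(self):
        self.admin_down = True

    def update_from_hardware(self, hardware_up):
        # A state update may flip hardware_down, but it never clears an
        # admin down; only an explicit admin action can do that.
        self.hardware_down = not hardware_up

    @property
    def down(self):
        return self.admin_down or self.hardware_down

node = Node("c0-0c0s0n0")
node.mark_admin_down()
node.update_from_hardware(hardware_up=True)  # hardware reports healthy
assert node.down  # still down: the admin mark survives the update
```

The key design point is that the periodic state update only ever writes the hardware flag, so an admin down can never be silently cleared by an inventory refresh.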
- 24 Jun, 2016 2 commits
- 23 Jun, 2016 1 commit
Paul Rich authored
Re-reservations were broken for long (>5 min) startups. This should allow the CAPMC scripts to do their thing.
- 13 Jun, 2016 1 commit
Paul Rich authored
- 10 Jun, 2016 1 commit
Paul Rich authored
- 02 Jun, 2016 2 commits
Paul Rich authored
maxtotaljobs limit added. This adds the limiter for the maximum number of jobs running in a queue overall. Useful for profiling machines with noisy network environments. This also adds output to cqadm for this information, and an entry in the cqadm manpage. See merge request !10
Paul Rich authored
This adds the limiter for the maximum number of jobs running in a queue overall. Useful for profiling machines with noisy network environments. This also adds output to cqadm for this information, and an entry in the cqadm manpage.
- 25 May, 2016 1 commit
Paul Rich authored
Multiple forkers: Support for multiple Forkers (i.e., multiple script hosts) for Cray systems. See merge request !9
- 24 May, 2016 1 commit
Paul Rich authored
- 23 May, 2016 1 commit
Paul Rich authored
- 20 May, 2016 1 commit
Paul Rich authored
Well, we're at least back to the original functionality. Forkers are automatically acquired, and dispatch appears to work, at least in the single-forker case.
- 11 May, 2016 1 commit
Paul Rich authored
The alps forker can now rename itself at runtime. This will be needed to identify multiple components. The console output redirect still needs some work in the init script.
- 09 May, 2016 1 commit
Paul Rich authored
- 04 May, 2016 3 commits
Paul Rich authored
Fix 17 compact format pbs: This fixes the PBS file format so the nodes for Cray systems are in a compact format. Other systems should not have their records affected. Some of this logic should probably be extended into the cluster systems themselves. See merge request !8
Paul Rich authored
This was a less trivial change than I thought. Had to do this all in the system component to avoid Cray handling logic leakage into other components like cqm.
Paul Rich authored
- 03 May, 2016 1 commit
Paul Rich authored
On restart, if Cobalt was shut down abruptly (as with a power failure or a kill -9), it was possible to lose the forker child process of a process group. The process group would never finish cleaning up, and the associated resources would keep being put into cleanup-pending by the reserve_resources_until code. Now the orphaned process group(s) are cleaned up automatically. CQM jobs that reference these should get back an error stating that the underlying task no longer exists/cannot be found. This circumstance should be rare in production (I hope), but I could see this scenario being triggered during abnormal operations (like a facility power/cooling failure).
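The recovery logic described here can be sketched roughly as follows. The names (`process_groups`, `live_children`, `cleanup_orphans`) are invented for illustration, not Cobalt's real structures; the idea is that any restored process group whose forker child can no longer be found is treated as orphaned and dropped, instead of being re-queued for resource cleanup forever:

```python
def cleanup_orphans(process_groups, live_children):
    """Drop process groups whose forker child no longer exists.

    process_groups: dict mapping pgid -> forker child id (hypothetical)
    live_children:  set of child ids the forker still knows about
    Returns the orphaned pgids so callers can report that the underlying
    task no longer exists / cannot be found.
    """
    orphaned = [pgid for pgid, child in process_groups.items()
                if child not in live_children]
    for pgid in orphaned:
        del process_groups[pgid]
    return orphaned

groups = {101: "child-a", 102: "child-b"}
assert cleanup_orphans(groups, {"child-a"}) == [102]
assert groups == {101: "child-a"}
```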
- 26 Apr, 2016 1 commit
Paul Rich authored
info.
- 22 Apr, 2016 3 commits
Paul Rich authored
Enh 13 cobalt reservations: This brings reservation support to Cray systems. See merge request !7
Paul Rich authored
Paul Rich authored
The overlap check was failing. It has been modified for Cray systems so that the check is entirely local, with no call to the system component. The remote information is unnecessary at this point because node overlap is not possible.
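A purely local check like this can be as simple as a set intersection over each reservation's node list. This is a sketch under an assumed representation (explicit node-id lists per reservation), not the actual Cobalt code:

```python
def reservations_overlap(nodes_a, nodes_b):
    """Local overlap test between two reservations' node lists.

    On Cray systems the node lists are explicit, so the test can be
    decided entirely locally; no system-component query is needed.
    """
    return bool(set(nodes_a) & set(nodes_b))

assert reservations_overlap([1, 2, 3], [3, 4])
assert not reservations_overlap([1, 2], [3, 4])
```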
- 21 Apr, 2016 1 commit
Paul Rich authored
- 20 Apr, 2016 1 commit
Paul Rich authored
There was a bug that counted each active reservation node as two nodes when determining how many nodes were left in the non-reservation queue.
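In effect, the fix ensures each reserved node is subtracted from the pool exactly once. A toy illustration (the function and its arguments are invented, not the actual code):

```python
def nodes_left_outside_reservations(all_nodes, reserved_nodes):
    # Set subtraction removes each reserved node exactly once; the bug
    # was double-counting nodes that sat in active reservations.
    return len(set(all_nodes) - set(reserved_nodes))

assert nodes_left_outside_reservations(range(10), [0, 1, 2]) == 7
```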
- 19 Apr, 2016 2 commits
- 18 Apr, 2016 2 commits
- 14 Apr, 2016 2 commits
Paul Rich authored
This is an interim check-in of reservation handling. This code is not yet functional. Saving prior to a branch switch.
Paul Rich authored
Fix 16 node cleanup: Merging in a change to fix a bug where nodes were not cleaning up as long as any reservations were on the system. This also fixes Cobalt ignoring node roles. It will now only try to schedule on batch nodes. See merge request !6
- 13 Apr, 2016 1 commit
Paul Rich authored
New statuses in 1.7 are supported. Now correctly marking nodes non-idle for alps-interactive (effectively a 'down' state). Removed a redundant release check.
- 12 Apr, 2016 1 commit
Paul Rich authored
Resources weren't actually exiting the cleanup state when there were other reservations on the system. The check to mark nodes idle was not actually occurring when a reservation existed.
- 08 Apr, 2016 2 commits
Paul Rich authored
Enh 14 use system reservednodes: A smaller system query was added. RESERVENODES support will be added in a later ticket. This also includes fixes from first encounters with Kachina. See merge request !5
Paul Rich authored
This should significantly reduce the overhead of the system inventory used to update state. It also gets memory statuses. Dynamic attribute updates for running systems still need to be added.
- 07 Apr, 2016 1 commit
Paul Rich authored
We can now get and properly display node attributes via the system type query.
- 04 Apr, 2016 1 commit
Paul Rich authored
- 01 Apr, 2016 1 commit
Paul Rich authored
Cray's documentation on what depth and nppn do isn't all that clear. Apparently this arrangement will actually reserve the proper number of nodes. Full allocation now works reliably.
- 29 Mar, 2016 1 commit
Paul Rich authored
The alps_script_forker has to call BASIL, too. Make sure they all point to the same one.
- 15 Mar, 2016 2 commits
Paul Rich authored
Fix 12 scheduler loop performance: Fixes to improve the performance of the scheduling loop and the performance of clients during state updates and job scheduling. See merge request !4
Paul Rich authored
This is an issue I saw with higher job counts. This works on a per-queue basis and will prevent an attempt at placement if Cobalt already knows there aren't enough available nodes for a job.
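A sketch of that per-queue guard under assumed names (the `(job_id, nodes_requested)` job shape and the per-queue available-node count are hypothetical, not the actual Cobalt API):

```python
def placeable_jobs(jobs, available_nodes):
    """Skip placement attempts for jobs that cannot possibly fit.

    jobs: list of (job_id, nodes_requested) tuples for one queue.
    available_nodes: node count currently free in that queue.
    Filtering here avoids a full placement attempt for any job Cobalt
    already knows has no chance of running.
    """
    return [job_id for job_id, requested in jobs
            if requested <= available_nodes]

jobs = [(1, 64), (2, 512), (3, 128)]
assert placeable_jobs(jobs, 128) == [1, 3]
```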