- 26 Sep, 2016 2 commits
-
-
Paul Rich authored
Backfilling has an epsilon of 2 minutes by default. This can be altered in the cobalt config file.
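A minimal sketch of how such a knob could be read, assuming an INI-style config file; the section and option names ("system", "backfill_epsilon") are assumptions, not confirmed Cobalt configuration keys.

```python
# Hypothetical sketch (modern Python); "system" and "backfill_epsilon"
# are assumed names, not confirmed Cobalt configuration keys.
from configparser import ConfigParser

DEFAULT_BACKFILL_EPSILON = 120  # seconds (the 2-minute default)

def get_backfill_epsilon(path="cobalt.conf"):
    parser = ConfigParser()
    parser.read(path)
    return parser.getint("system", "backfill_epsilon",
                         fallback=DEFAULT_BACKFILL_EPSILON)
```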
-
Paul Rich authored
There was a way to set up reservations across disjoint queues that caused one set of queues to ignore that a reservation was pending, because the reservation wasn't associated with that equivalence class. This caused forbidden locations to not be set.
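Roughly, the intent of the fix could look like the sketch below, with invented data structures: a reservation's locations must be marked forbidden in every equivalence class that shares any of its queues, not only the one class the reservation happens to be associated with.

```python
# Illustrative sketch only; the structures and field names are invented.
def set_forbidden_locations(equiv_classes, reservations):
    """Mark reservation locations forbidden in every equivalence
    class that shares a queue with a pending reservation."""
    for eq_class in equiv_classes:
        forbidden = set()
        for res in reservations:
            # Previously a reservation spanning disjoint queues could be
            # skipped here, leaving forbidden locations unset for this class.
            if eq_class["queues"] & res["queues"]:
                forbidden |= set(res["locations"])
        eq_class["forbidden"] = sorted(forbidden)
```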
-
- 23 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 19 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 16 Sep, 2016 1 commit
-
-
Paul Rich authored
Draining and backfilling are passing basic tests. More test cases need to be added to the automated suite, along with corner cases around the queues/reservations/locations list. Backfill-time display also still needs to be added to nodelist/nodeadm -l.
-
- 14 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 13 Sep, 2016 3 commits
- 08 Sep, 2016 1 commit
-
-
Paul Rich authored
This should get rid of the bulk of the 1234567 exit statuses by forcing a timeout; the timeout goes away once the job is started. This should close the process group initialization/start gap.
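A hedged sketch of that guard, with invented names: a timer armed when the process group is created and cancelled on start, so groups that never start get reaped instead of lingering with the sentinel exit status.

```python
# Sketch with invented names; not the actual Cobalt implementation.
import threading

STARTUP_TIMEOUT = 300  # seconds; assumed value

class ProcessGroup(object):
    def __init__(self):
        # Arm a timeout covering the initialization/start gap.
        self._start_timer = threading.Timer(STARTUP_TIMEOUT, self._on_timeout)
        self._start_timer.start()

    def start(self):
        self._start_timer.cancel()  # job started: the timeout goes away

    def _on_timeout(self):
        # Never started: terminate instead of leaving the 1234567
        # sentinel exit status behind.
        self.terminate()

    def terminate(self):
        pass  # placeholder for the real cleanup
```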
-
- 07 Sep, 2016 1 commit
-
-
Paul Rich authored
Checking in fixes for find_queue_equivalence_classes that impact draining. Drain-status clearing is now working; a stub for drain selection is in place.
-
- 01 Sep, 2016 2 commits
- 24 Aug, 2016 5 commits
-
-
Paul Rich authored
Make sure that an update cannot change this list mid-flight for a job. The caller also holds this lock at this point in time, so the node_lock must be reentrant!
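In Python terms this is the difference between threading.Lock and threading.RLock; a minimal sketch (the surrounding functions are hypothetical):

```python
import threading

node_lock = threading.RLock()  # reentrant: the same thread may re-acquire

def update_node_list(job, nodes):
    with node_lock:
        # ... mutate the job's node list ...
        refresh_node_state(nodes)  # hypothetical helper, also locks

def refresh_node_state(nodes):
    # Called while this thread already holds node_lock; with a
    # non-reentrant threading.Lock this acquisition would deadlock.
    with node_lock:
        pass  # ... read/update per-node state ...
```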
-
Paul Rich authored
-
Paul Rich authored
Thanks Eric! Duplicated nids are now avoided.
-
Paul Rich authored
Fixes attrs locations evading cobalt admin-down on nodes.
-
Paul Rich authored
Non-idle nodes are now fully respected, and we consistently get string nid lists out of this. ValueError no longer gets raised if the attrs location exists straddling a reservation (still in the queue, but not available due to the reservation).
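A tiny sketch of the nid normalization, with an invented helper name:

```python
# Illustrative only; the helper name is invented.
def to_string_nids(nids):
    """Consistently return nids as a list of strings, whatever mix
    of ints and numeric strings came in."""
    return [str(int(nid)) for nid in nids]

assert to_string_nids([1, "2", 3]) == ["1", "2", "3"]
```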
-
- 15 Aug, 2016 1 commit
-
-
Paul Rich authored
-
- 11 Aug, 2016 2 commits
-
-
Paul Rich authored
The apid fetch wasn't restricting itself to the actual ALPS reservation. This was causing everything to get killed.
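The shape of the fix, roughly (names invented; the real apid data comes from ALPS): filter the apids down to the job's own ALPS reservation before signaling anything.

```python
# Sketch with invented names; the real lookup goes through ALPS.
def apids_for_reservation(apid_to_resid, alps_res_id):
    """Return only the apids belonging to this ALPS reservation.
    Without this restriction, cleanup was killing everything."""
    return [apid for apid, res_id in apid_to_resid.items()
            if res_id == alps_res_id]
```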
-
Paul Rich authored
System component restart on the fly should be safe again. We recover the process groups properly now. Found this while testing other changes in the fix for aggressive cleanup.
-
- 08 Aug, 2016 1 commit
-
-
Paul Rich authored
Fixing a situation where locations, when set on a reservation job, caused issues.
-
- 06 Aug, 2016 1 commit
-
-
Paul Rich authored
There was one further step needed for running jobs. Also fixing a potential statefile issue with prior versions.
-
- 03 Aug, 2016 2 commits
- 01 Aug, 2016 1 commit
-
-
Paul Rich authored
-
- 31 Jul, 2016 1 commit
-
-
Paul Rich authored
The node-state update was resetting an admin down. An additional flag was added so we can differentiate between admin down and hardware down. If a node is marked down with an admin command, it will remain marked down no matter what.
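A minimal sketch of the two-flag scheme; the flag names are assumptions, not Cobalt's actual attributes.

```python
# Sketch; flag names are assumed, not Cobalt's actual attributes.
class Node(object):
    def __init__(self):
        self.admin_down = False     # set only by an admin command
        self.hardware_down = False  # set by node-state updates

    def update_hardware_status(self, is_up):
        # State updates touch only the hardware flag, so a node marked
        # down by an admin stays down no matter what the hardware reports.
        self.hardware_down = not is_up

    @property
    def down(self):
        return self.admin_down or self.hardware_down
```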
-
- 29 Jul, 2016 1 commit
-
-
Paul Rich authored
-
- 27 Jul, 2016 1 commit
-
-
Paul Rich authored
Support for apkill added to kill the user's ALPS instance in interactive jobs. Kachina testing pending.
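At its simplest, killing the user's ALPS instance amounts to invoking apkill on the apid; a hedged sketch (how the apid is discovered is left out):

```python
# Sketch: send apkill's default signal to an ALPS apid.
import subprocess

def kill_alps_instance(apid):
    subprocess.check_call(["apkill", str(apid)])
```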
-
- 18 Jul, 2016 1 commit
-
-
Paul Rich authored
Resources for interactive jobs are now appropriately released. There is still a known issue with currently running aprun instances. That will be addressed in a further patch.
-
- 06 Jul, 2016 1 commit
-
-
Paul Rich authored
-
- 23 Jun, 2016 1 commit
-
-
Paul Rich authored
Re-reservations were broken for long (>5 min) startups. This should allow the CAPMC scripts to do their thing.
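A sketch of the intent, using the reserve_resources_until call mentioned elsewhere in this log (the loop shape, interval, and helper signatures are assumptions): keep extending the reservation while startup work such as CAPMC power-up is still in flight.

```python
# Illustrative sketch; the interval and helper signatures are assumptions.
import time

RERESERVE_INTERVAL = 300  # seconds (the old 5-minute horizon)

def hold_resources_during_startup(job, reserve_resources_until, startup_done):
    """Re-reserve resources until startup completes, so reservations
    no longer lapse during long (>5 min) startups."""
    while not startup_done(job):
        reserve_resources_until(job, time.time() + RERESERVE_INTERVAL)
        time.sleep(RERESERVE_INTERVAL / 2.0)
```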
-
- 13 Jun, 2016 1 commit
-
-
Paul Rich authored
-
- 20 May, 2016 1 commit
-
-
Paul Rich authored
Well, we're at least back to the original functionality. Forkers are automatically acquired, and dispatch appears to work in at least the single-forker case.
-
- 04 May, 2016 1 commit
-
-
Paul Rich authored
This was a less trivial change than I thought. Had to do this all in the system component to avoid leaking Cray-handling logic into other components like cqm.
-
- 03 May, 2016 1 commit
-
-
Paul Rich authored
On restart, if cobalt was shut down abruptly (such as by a power failure or a kill -9), there was a way to lose the forker child process of a process group. The process group would never finish cleaning up, and the associated resources would keep being put into cleanup-pending by the reserve_resources_until code. Now the orphaned process group(s) are cleaned up automatically, and CQM jobs that reference them should get back an error stating that the underlying task no longer exists/cannot be found. This circumstance should be rare in production (I hope), but I could see it being triggered during abnormal operations (like a facility power/cooling failure).
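A sketch of that recovery pass, with invented names: on restart, any process group whose forker child can no longer be found is treated as orphaned and removed, and the job that references it gets the "task no longer exists" error.

```python
# Sketch only; names are invented, and the real logic lives in the
# system component's restart path.
def reap_orphaned_process_groups(process_groups, live_forker_children):
    """Clean up process groups whose forker child was lost in an
    abrupt shutdown (power failure, kill -9)."""
    for pg_id, pg in list(process_groups.items()):
        if pg.forker_task_id not in live_forker_children:
            # Report the loss instead of letting reserve_resources_until
            # push the resources into cleanup-pending forever.
            pg.error = "underlying task no longer exists/cannot be found"
            del process_groups[pg_id]
```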
-
- 22 Apr, 2016 1 commit
-
-
Paul Rich authored
-
- 20 Apr, 2016 1 commit
-
-
Paul Rich authored
There was a bug that counted each active reservation node as 2 nodes for the purposes of determining how many nodes were left in the non-reservation queue.
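In effect the arithmetic looked like the sketch below (numbers and names invented): reserved nodes were subtracted once as reserved and again as busy, so the non-reservation queue appeared smaller than it was.

```python
# Illustrative arithmetic only.
total_nodes = 100
reserved = {10, 11, 12}   # nodes in an active reservation
busy = {10, 11, 12, 40}   # busy nodes, including the reserved ones

remaining_buggy = total_nodes - len(reserved) - len(busy)  # 93: reserved nodes counted twice
remaining_fixed = total_nodes - len(reserved | busy)       # 96: each node counted once
```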
-
- 19 Apr, 2016 1 commit
-
-
Paul Rich authored
-
- 18 Apr, 2016 1 commit
-
-
Paul Rich authored
-