- 23 Nov, 2016 1 commit
-
-
Paul Rich authored
-
- 22 Nov, 2016 1 commit
-
-
Paul Rich authored
-
- 15 Nov, 2016 1 commit
-
-
Paul Rich authored
-
- 11 Nov, 2016 1 commit
-
-
Paul Rich authored
-
- 03 Nov, 2016 1 commit
-
-
Paul Rich authored
This could happen when the node goes down while a job is running, causing the node to still show up in the job end_times.
-
- 06 Oct, 2016 1 commit
-
-
Paul Rich authored
-
- 26 Sep, 2016 2 commits
-
-
Paul Rich authored
Backfilling has an epsilon of 2 minutes by default. This can be altered in the cobalt config file.
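A minimal sketch of how such an epsilon might be applied, assuming it acts as a safety margin added to a job's walltime before comparing against the backfill window. The function name, the constant, and the exact comparison are hypothetical and not taken from the Cobalt source; only the 2-minute default comes from the message above.

```python
# Hypothetical illustration: names and semantics are assumptions.
DEFAULT_BACKFILL_EPSILON = 120  # seconds (2 minutes, per the commit message)


def fits_backfill_window(walltime_s, window_s, epsilon_s=DEFAULT_BACKFILL_EPSILON):
    """Return True if a job of walltime_s seconds, padded by epsilon_s,
    fits inside a backfill window of window_s seconds."""
    return walltime_s + epsilon_s <= window_s
```

Under this assumption, a 10-minute job needs a window of at least 12 minutes to be backfilled with the default margin.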
-
Paul Rich authored
There was a way to set up reservations across disjoint queues that caused one set of queues to ignore that a reservation was pending because the reservation wasn't associated with that equivalence class. This caused forbidden locations to not be set.
-
- 23 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 19 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 16 Sep, 2016 1 commit
-
-
Paul Rich authored
Draining and backfilling are passing basic tests. Need to add more test cases to the automated suite and test corner cases around queues/reservations/locations list. Also need to add backfill time display to nodelist/nodeadm -l.
-
- 14 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 13 Sep, 2016 3 commits
- 08 Sep, 2016 1 commit
-
-
Paul Rich authored
This should get rid of the bulk of the 1234567 exit statuses. Forces a timeout. The timeout goes away when the job is started. This should fix the process group initialization/start gap.
-
- 07 Sep, 2016 1 commit
-
-
Paul Rich authored
Checking in fixes for find queue equivalence classes that impact draining. Drain-status-clear now working. Stub for drain selection.
-
- 01 Sep, 2016 2 commits
- 24 Aug, 2016 5 commits
-
-
Paul Rich authored
Make sure that an update cannot change this list midflight for a job. Caller also holds this lock at this point in time. The node_lock must be reentrant!
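A minimal sketch of why the lock has to be reentrant, assuming a caller that already holds `node_lock` calls a helper that acquires it again in the same thread. Every name other than `node_lock` is hypothetical. With a plain `threading.Lock` the inner acquire would deadlock; `threading.RLock` allows it.

```python
import threading

# node_lock is named in the commit message; everything else is illustrative.
node_lock = threading.RLock()
job_locations = {1: ["12", "13"]}


def reserve_locations(jobid):
    """Copy the job's location list while holding the lock, so a
    concurrent update cannot change it midflight."""
    with node_lock:  # second acquire by the same thread: fine with RLock
        return list(job_locations[jobid])


def run_job(jobid):
    """Caller that already holds node_lock when it asks for the list."""
    with node_lock:  # first acquire
        return reserve_locations(jobid)
```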
-
Paul Rich authored
-
Paul Rich authored
Thanks Eric! Duplicated nids are now avoided.
-
Paul Rich authored
Fixes attrs location evading cobalt admin down on nodes.
-
Paul Rich authored
Non-idle nodes are now fully respected. Consistently get string nid lists out of this. ValueError doesn't get raised if the attrs location exists straddling a reservation (still in the queue, but not available due to the reservation).
-
- 15 Aug, 2016 1 commit
-
-
Paul Rich authored
-
- 11 Aug, 2016 2 commits
-
-
Paul Rich authored
The apid fetch wasn't restricting itself to the actual ALPS reservation. This was causing everything to get killed.
-
Paul Rich authored
System component restart on the fly should be safe again. We recover the process groups properly now. Found this while testing other changes in the fix for aggressive cleanup.
-
- 08 Aug, 2016 1 commit
-
-
Paul Rich authored
Fixing a situation where locations set in a reservation job cause issues.
-
- 06 Aug, 2016 1 commit
-
-
Paul Rich authored
There was one further step needed for running jobs. Also fixing a potential statefile issue with prior versions.
-
- 03 Aug, 2016 2 commits
- 01 Aug, 2016 1 commit
-
-
Paul Rich authored
-
- 31 Jul, 2016 1 commit
-
-
Paul Rich authored
Update node state was resetting an admin down. Added an additional flag so we can differentiate between admin down and hardware down. If a node is marked down with an admin command, then no matter what, it will remain marked down.
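A minimal sketch of the behavior described above, assuming a separate boolean flag records an admin-initiated down so that hardware state updates cannot clear it. The class, attribute, and method names are hypothetical and not taken from the Cobalt source.

```python
class Node:
    """Hypothetical node record separating admin down from hardware down."""

    def __init__(self, name):
        self.name = name
        self.status = "idle"
        self.admin_down = False  # set only by an explicit admin command

    def mark_admin_down(self):
        self.admin_down = True
        self.status = "down"

    def mark_admin_up(self):
        self.admin_down = False
        self.status = "idle"

    def update_from_hardware(self, hw_status):
        # An admin down wins no matter what the hardware reports.
        if self.admin_down:
            return
        self.status = hw_status
```

In this sketch a node marked down by an admin stays down across hardware updates until an admin explicitly brings it back.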
-
- 29 Jul, 2016 1 commit
-
-
Paul Rich authored
-
- 27 Jul, 2016 1 commit
-
-
Paul Rich authored
Support for apkill added to kill the user's ALPS instance in interactive jobs. Kachina testing pending.
-
- 18 Jul, 2016 1 commit
-
-
Paul Rich authored
Resources for interactive jobs are now appropriately released. There is still a known issue with currently running aprun instances. That will be addressed in a further patch.
-
- 06 Jul, 2016 1 commit
-
-
Paul Rich authored
-
- 23 Jun, 2016 1 commit
-
-
Paul Rich authored
Rereservations were broken for long (>5 min) startups. This should allow the CAPMC scripts to do their thing.
-
- 13 Jun, 2016 1 commit
-
-
Paul Rich authored
-
- 20 May, 2016 1 commit
-
-
Paul Rich authored
Well, we're at least back at original functionality. Forkers are automatically acquired and dispatch to at least the single forker case appears to work.
-