- 11 Aug, 2017 3 commits
- 31 Jul, 2017 1 commit
-
-
Paul Rich authored
-
- 28 Jul, 2017 1 commit
-
-
Paul Rich authored
-
- 27 Jul, 2017 1 commit
-
-
Paul Rich authored
Slight reorganization. Also added some things to make the presentation elsewhere in Cobalt more consistent with the rest of Cray's stack.
-
- 10 Jul, 2017 1 commit
-
-
Paul Rich authored
-
- 03 Jul, 2017 2 commits
-
-
Paul Rich authored
In light of this bug, added checks to make sure that we don't accidentally add bad values to reservations again.
-
Paul Rich authored
This was traced to a call that could add a non-string key to the alps_reservation dictionary, resulting in one copy of the reservation under an integer jobid key and a second under a string jobid key. These should be keyed with strings. As further mitigation, added a check for an integer version of a key to clean. If there is one, log that it happened and clean that one, too. The triggering condition is an interactive job where the initial ALPS reservation times out.
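
A minimal sketch of the mitigation described above, assuming the reservations live in a plain dict keyed by job ID. The function names `add_reservation` and `release_reservation` are hypothetical and are not taken from Cobalt's source:

```python
import logging

log = logging.getLogger(__name__)


def add_reservation(reservations, jobid, reservation):
    """Store a reservation under a string key, whatever type jobid arrives as."""
    reservations[str(jobid)] = reservation


def release_reservation(reservations, jobid):
    """Remove the reservation for jobid, plus any stray integer-keyed copy."""
    released = reservations.pop(str(jobid), None)
    try:
        int_key = int(jobid)
    except (TypeError, ValueError):
        return released  # jobid has no integer form, nothing more to clean
    if int_key in reservations:
        # Report the stray key so the underlying bug stays visible.
        log.warning("Integer key %r found in reservations; cleaning it up", int_key)
        stray = reservations.pop(int_key)
        if released is None:
            released = stray
    return released
```

Normalizing on insert prevents the duplicate entry; the extra check on release only cleans up entries created before the fix.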
-
- 30 Jun, 2017 1 commit
-
-
Paul Rich authored
-
- 27 Jun, 2017 2 commits
- 23 Jun, 2017 1 commit
-
-
Paul Rich authored
This reverts merge request !43
-
- 19 Jun, 2017 1 commit
-
-
Paul Rich authored
-
- 14 Apr, 2017 1 commit
-
-
Paul Rich authored
-
- 13 Apr, 2017 2 commits
- 12 Apr, 2017 1 commit
-
-
Paul Rich authored
Adding a better validator to prevent issues with users typing bad NUMA/MCDRAM modes. Also adding a default setting if none is provided.
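
A rough sketch of what such a validator could look like. The mode names, the default values, and the `validate_modes` function are assumptions made for illustration; the values Cobalt actually accepts may differ:

```python
# Assumed mode names, for illustration only.
VALID_MCDRAM = {"flat", "cache", "split", "equal"}
VALID_NUMA = {"a2a", "snc2", "snc4", "hemi", "quad"}
DEFAULT_MCDRAM = "cache"  # hypothetical default
DEFAULT_NUMA = "quad"     # hypothetical default


def validate_modes(attrs):
    """Return a copy of attrs with validated, defaulted mcdram/numa entries."""
    result = dict(attrs)
    checks = (("mcdram", VALID_MCDRAM, DEFAULT_MCDRAM),
              ("numa", VALID_NUMA, DEFAULT_NUMA))
    for key, valid, default in checks:
        raw = result.get(key)
        # Fall back to the default when the user gave nothing.
        value = default if raw is None else str(raw).strip().lower()
        if value not in valid:
            raise ValueError("%s mode %r is invalid; expected one of: %s"
                             % (key, value, ", ".join(sorted(valid))))
        result[key] = value
    return result
```

Rejecting a bad mode at submission time gives the user an immediate error instead of a failure later at node boot.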
-
- 11 Apr, 2017 1 commit
-
-
Paul Rich authored
-
- 04 Jan, 2017 1 commit
-
-
Paul Rich authored
A well-timed (or poorly timed, depending on how you look at it) qdel could cause Cobalt to put a node into cleanup but never complete the cleanup, because there was no ALPS backend reservation to clean up. This would clear if no jobs were currently running; otherwise, it would hang nodes.
-
- 08 Dec, 2016 1 commit
-
-
Paul Rich authored
If the child fetch succeeds but cleanup fails, make sure we use the initially fetched data, rather than replacing it with the now potentially lost child data.
-
- 28 Nov, 2016 1 commit
-
-
Paul Rich authored
There was the possibility of losing a PID if child cleanup was interrupted. This ensures retries until the child process is actually dead.
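
The retry idea can be sketched as follows. This is a generic POSIX illustration, not Cobalt's implementation; `reap_child` and `poll_interval` are hypothetical names:

```python
import os
import time


def reap_child(pid, poll_interval=0.05):
    """Poll pid until it has really exited and return its raw wait status.

    Returns None if the child was already reaped elsewhere.
    """
    while True:
        try:
            reaped, status = os.waitpid(pid, os.WNOHANG)
        except InterruptedError:
            continue  # interrupted by a signal: retry rather than drop the PID
        except ChildProcessError:
            return None  # no such child; it has already been reaped
        if reaped == pid:
            return status
        time.sleep(poll_interval)  # child still running, poll again
```

The loop only exits once `os.waitpid` reports the child as gone, so an interruption part-way through cleanup cannot leave the PID untracked.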
-
- 23 Nov, 2016 2 commits
- 22 Nov, 2016 1 commit
-
-
Paul Rich authored
-
- 15 Nov, 2016 1 commit
-
-
Paul Rich authored
-
- 11 Nov, 2016 1 commit
-
-
Paul Rich authored
-
- 03 Nov, 2016 1 commit
-
-
Paul Rich authored
This could happen when the node goes down while a job is running, causing the node to still show up in the job end_times.
-
- 28 Oct, 2016 1 commit
-
-
Paul Rich authored
In the end, the buffer size had to be increased to avoid timing issues. Added further try-except safety checks to prevent system component issues if this runs long again.
-
- 06 Oct, 2016 1 commit
-
-
Paul Rich authored
-
- 26 Sep, 2016 2 commits
-
-
Paul Rich authored
Backfilling has an epsilon of 2 minutes by default. This can be altered in the cobalt config file.
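
One plausible reading of the epsilon, sketched under the assumption that it acts as a safety margin between the end of a backfilled job and the time the nodes are needed. The function `can_backfill` and its parameters are hypothetical, and times are in seconds:

```python
DEFAULT_BACKFILL_EPSILON = 120  # seconds; the 2-minute default


def can_backfill(now, walltime, drain_until, epsilon=DEFAULT_BACKFILL_EPSILON):
    """True if a job started at `now` ends at least `epsilon` before `drain_until`."""
    return now + walltime + epsilon <= drain_until
```

With the default margin, an 8-minute job fits into a 10-minute window, while a job one second longer does not.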
-
Paul Rich authored
There was a way to set up reservations across disjoint queues that caused one set of queues to ignore that a reservation was pending, because the reservation wasn't associated with that equivalence class. This caused forbidden locations to not be set.
-
- 23 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 19 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 16 Sep, 2016 1 commit
-
-
Paul Rich authored
Draining and backfilling are passing basic tests. Need to add more test cases to the automated suite and test corner cases around queues/reservations/locations list. Also need to add backfill time display to nodelist/nodeadm -l.
-
- 14 Sep, 2016 1 commit
-
-
Paul Rich authored
-
- 13 Sep, 2016 3 commits
- 08 Sep, 2016 1 commit
-
-
Paul Rich authored
This should get rid of the bulk of the 1234567 exit statuses. Forces a timeout, which goes away when the job is started. This should fix the process group initialization/start gap.
-