1. 26 Sep, 2016 2 commits
  2. 23 Sep, 2016 1 commit
  3. 19 Sep, 2016 1 commit
  4. 16 Sep, 2016 1 commit
    • Draining and backfilling basics operational. · 7856d38b
      Paul Rich authored
      Draining and backfilling are passing basic tests.  Need to add more test
      cases to the automated suite and test corner cases around
      queues/reservations/locations list.
      
      Also need to add backfill time display to nodelist/nodeadm -l.
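      To make the draining/backfilling idea in 7856d38b concrete, here is a minimal Python sketch. It assumes that each draining node records a backfill time, meaning the moment it must be free for the job it is draining for, and that a smaller job may run on it only if it finishes by then. `DrainingNode`, `backfill_time` and `backfill_candidates` are hypothetical names, not Cobalt's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DrainingNode:
    """Hypothetical node record used only for this illustration."""
    name: str
    backfill_time: Optional[float]  # time the node must be free; None = not draining


def backfill_candidates(nodes: List[DrainingNode], now: float, walltime: float) -> List[str]:
    """Return draining nodes a job of `walltime` seconds can use without delaying the drain."""
    return [
        n.name
        for n in nodes
        # Only draining nodes qualify, and the job must end by the backfill time.
        if n.backfill_time is not None and now + walltime <= n.backfill_time
    ]


if __name__ == "__main__":
    pool = [DrainingNode("n1", 1000.0), DrainingNode("n2", 500.0), DrainingNode("n3", None)]
    print(backfill_candidates(pool, now=0.0, walltime=600.0))
```

      The comparison uses `<=`, so a job that ends exactly at the backfill time is still allowed. That boundary choice is also an assumption.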
  5. 14 Sep, 2016 1 commit
  6. 13 Sep, 2016 3 commits
  7. 08 Sep, 2016 1 commit
  8. 07 Sep, 2016 1 commit
  9. 01 Sep, 2016 2 commits
  10. 24 Aug, 2016 5 commits
  11. 15 Aug, 2016 1 commit
  12. 11 Aug, 2016 2 commits
    • Fix for the aggressive cleanup · 06c5d122
      Paul Rich authored
      The apid fetch wasn't restricting itself to the actual ALPS reservation.
      This was causing everything to get killed.
    • Fixed error in recovering pgroups. · d9595cc8
      Paul Rich authored
      System component restart on the fly should be safe again.  We recover
      the process groups properly now.  Found this while testing other changes
      in the fix for aggressive cleanup.
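      The aggressive-cleanup fix in 06c5d122 restricts the apid fetch to the ALPS reservation being cleaned up. Below is a minimal Python sketch of that filtering step. It assumes each application record carries the ID of the reservation it runs under. The record layout and `apids_for_reservation` are hypothetical and do not reflect how Cobalt actually queries ALPS.

```python
from typing import Dict, List


def apids_for_reservation(app_records: List[Dict[str, int]], reservation_id: int) -> List[int]:
    """Return only the apids that belong to the given ALPS reservation.

    Hypothetical record layout: {"apid": ..., "reservation_id": ...}.
    Without the reservation_id filter, every apid on the system would be
    returned and then killed, which is the behaviour the fix removed.
    """
    return [rec["apid"] for rec in app_records if rec["reservation_id"] == reservation_id]


if __name__ == "__main__":
    records = [
        {"apid": 101, "reservation_id": 7},
        {"apid": 102, "reservation_id": 8},
        {"apid": 103, "reservation_id": 7},
    ]
    print(apids_for_reservation(records, 7))  # [101, 103]
```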
  13. 08 Aug, 2016 1 commit
  14. 06 Aug, 2016 1 commit
  15. 03 Aug, 2016 2 commits
  16. 01 Aug, 2016 1 commit
  17. 31 Jul, 2016 1 commit
    • Admin down was not getting properly detected. · 93155c72
      Paul Rich authored
      Updating node state was resetting an admin down.  Added an additional flag
      so we can differentiate between admin down and hardware down.
      
      If a node is marked down with an admin command, then no matter what, it
      will remain marked down.
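      Commit 93155c72 adds a separate flag so that an admin down survives later state updates. The Python sketch below shows that rule in minimal form. `Node`, `admin_down`, `mark_admin_down` and `update_state` are hypothetical names chosen for illustration. The sketch also leaves out how an admin would clear the flag, because the log does not describe that.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node record; field names are illustrative only."""
    name: str
    status: str = "idle"
    admin_down: bool = False  # set only by an admin "down" command

    def mark_admin_down(self) -> None:
        """Admin command: remember *why* the node is down, not just that it is."""
        self.admin_down = True
        self.status = "down"

    def update_state(self, hardware_status: str) -> None:
        """Apply a state update reported by the hardware side.

        An admin-down node stays down no matter what the hardware reports;
        otherwise the reported status (including a hardware "down") is used.
        """
        self.status = "down" if self.admin_down else hardware_status


if __name__ == "__main__":
    node = Node("nid00001")
    node.mark_admin_down()
    node.update_state("idle")
    print(node.status)  # down
```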
  18. 29 Jul, 2016 1 commit
  19. 27 Jul, 2016 1 commit
    • apkill support added · 0fcbb56e
      Paul Rich authored
      Added apkill support to kill the user's ALPS instance in interactive jobs.
      Kachina testing pending.
  20. 18 Jul, 2016 1 commit
    • Interactive cleanup now working. · 6af7cebf
      Paul Rich authored
      Resources for interactive jobs are now appropriately released.  There is
      still a known issue with currently running aprun instances.  That will
      be addressed in a later patch.
  21. 06 Jul, 2016 1 commit
  22. 23 Jun, 2016 1 commit
  23. 13 Jun, 2016 1 commit
  24. 20 May, 2016 1 commit
  25. 04 May, 2016 1 commit
    • PBS records now in compact format · cf6412a1
      Paul Rich authored
      This was a less trivial change than I thought.  Had to do this all in
      the system component to avoid Cray handling logic leakage into other
      components like cqm.
  26. 03 May, 2016 1 commit
    • Fixed process group w/o process startup error · 8aeec73f
      Paul Rich authored
      On restart, if Cobalt had been shut down abruptly (e.g. by a power failure
      or a kill -9), there was a way to lose the forker child process of a
      process group.  The process group would never finish cleaning up, and
      the associated resources would keep being put into cleanup-pending by
      the reserve_resources_until code.
      
      Now the orphaned process group(s) are cleaned up automatically.  CQM
      jobs that reference these should get back an error stating that the
      underlying task no longer exists/cannot be found.
      
      This circumstance should be rare in production (I hope), but I could
      see this scenario being triggered during abnormal operations (like a
      facility power/cooling failure).
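      Commit 8aeec73f describes cleaning up orphaned process groups on restart so their resources stop cycling through cleanup-pending. The Python sketch below is one possible version. It assumes that after a restart the system component knows which forker children are still alive, and that it finishes any unfinished process group whose child is missing. `ProcessGroup`, `child_id`, `clean_orphaned_pgroups` and the exit status value are all hypothetical and are not Cobalt's real structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ProcessGroup:
    """Hypothetical process-group record restored from saved state."""
    pgid: int
    child_id: Optional[int]          # forker child id; None if it was never recorded
    exit_status: Optional[int] = None
    error: Optional[str] = None


def clean_orphaned_pgroups(pgroups: Dict[int, ProcessGroup], live_children: Set[int]) -> List[int]:
    """Finish every unfinished process group whose forker child no longer exists.

    Returns the cleaned-up pgids so their resources can be released instead of
    being put back into cleanup-pending indefinitely.
    """
    orphaned = []
    for pgid, pg in pgroups.items():
        if pg.exit_status is not None:
            continue  # already finished; nothing to do
        if pg.child_id is None or pg.child_id not in live_children:
            pg.exit_status = 1  # hypothetical non-zero status
            pg.error = "underlying task no longer exists"
            orphaned.append(pgid)
    return orphaned


if __name__ == "__main__":
    groups = {1: ProcessGroup(1, 10), 2: ProcessGroup(2, 20), 3: ProcessGroup(3, None)}
    print(clean_orphaned_pgroups(groups, live_children={10}))  # [2, 3]
```

      Recording an error string on the group is intended to mirror the log's note that CQM jobs referencing an orphaned group should receive an error saying the underlying task cannot be found.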
  27. 22 Apr, 2016 1 commit
  28. 20 Apr, 2016 1 commit
    • Fix for negative nodes. · 923f6691
      Paul Rich authored
      There was a bug that counted active reservation nodes as 2 nodes for
      the purposes of determining how many nodes were left in the
      non-reservation queue.
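      The negative-node bug in 923f6691 came from counting reserved nodes twice. The sketch below is not the actual patch, since the log does not show it. It illustrates one way to enforce the invariant in Python: collect reserved nodes into a set so each is subtracted once. `nodes_left_for_queue` is a hypothetical helper.

```python
from typing import Iterable, Set


def nodes_left_for_queue(all_nodes: Set[str], reservation_node_lists: Iterable[Iterable[str]]) -> int:
    """Count nodes still available to the non-reservation queue.

    Reserved nodes are gathered into a set, so a node appearing in more than
    one list is subtracted exactly once. Set difference also means the result
    can never go negative.
    """
    reserved: Set[str] = set()
    for nodes in reservation_node_lists:
        reserved.update(nodes)
    return len(all_nodes - reserved)


if __name__ == "__main__":
    system = {"n1", "n2", "n3", "n4"}
    # "n2" is listed twice; a naive count would give 4 - 3 = 1 instead of 2.
    print(nodes_left_for_queue(system, [["n1", "n2"], ["n2"]]))  # 2
```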
  29. 19 Apr, 2016 1 commit
  30. 18 Apr, 2016 1 commit