Shows which Cobalt components are currently registered, and which address/port each is presently located at. This makes it easy to verify that everything is up.
Job and Queue Administration
This is the primary administrative command handler for Cobalt jobs and queues. Most administrative alterations to jobs go through this command.
It can also be used to control queue state and set queue rules as shown below.
All of these commands require a list of jobids to be appended to the command line:
--admin-hold: Set an admin_hold on a job. Jobs in this state will not accrue priority, nor will they run until released.
--admin-release: Release an admin_hold on a job.
--run=: Tell Cobalt to immediately run a job at the specified location. This bypasses all normal sanity checks and attempts to run the job regardless of resource status.
--preempt: Cause a job to enter the preempted state.
--kill: Kill a user job as a normal termination request.
--delete: Forcibly kill a user job. This will remove the job from cqm but may not release resources.
--time=: Change the walltime of a job.
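Putting the job flags together, a typical sequence might look like the following sketch (the jobid and partition name are hypothetical):

```shell
# Place an administrative hold on job 1234; it stops accruing priority.
cqadm --admin-hold 1234
# Later, release the hold so the job can be scheduled again.
cqadm --admin-release 1234
# Extend the job's walltime to 90 minutes.
cqadm --time=90 1234
# Bypass the scheduler entirely and run the job on a specific partition.
cqadm --run=ANL-R00-M0-512 1234
```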
These take a list of queue names to operate on:
--addq: Add a queue to the queue-manager.
--delq: Delete a queue entirely from the queue-manager.
--getq: List the queues in the queue-manager.
--setq 'prop=value prop=value ...': Set properties for a queue.
--unsetq 'prop prop ...': Delete/revert to default property values for the target queue(s).
--drainq: Stop accepting new jobs to the target queue(s), but let already-submitted jobs run. State is set to "draining".
--killq: Stop accepting new jobs and stop starting new jobs. Jobs already started will continue to run. State is set to "dead".
--force: Force the deletion of a queue. May also be used with --delete to forcibly remove a job from the queue-manager. If used to delete a job, the job will be removed from the queue without the queue-manager waiting for any backend job cleanup to occur.
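As a sketch, a queue's lifecycle under these flags might look like this (the queue name 'debug' is hypothetical):

```shell
# Create a queue and confirm it exists.
cqadm --addq debug
cqadm --getq
# Stop accepting new submissions, but let already-queued jobs run.
cqadm --drainq debug
# Once drained, remove the queue entirely.
cqadm --delq debug
```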
''maxtime'': Maximum time in minutes allowed for jobs in the target queue (default: ?? min)
''mintime'': Minimum time in minutes allowed for jobs in the target queue (default: 10 min)
''maxqueued'': Maximum number of jobs a user can have in the queue, running as well as queued (default: ??)
''state'': Set the queue to one of the following states:
''running'': Queue accepting new jobs/is scheduling submitted jobs
''stopped'': Queue accepting new jobs/not scheduling jobs
''draining'': Queue not accepting new jobs/is scheduling submitted jobs
''dead'': Queue not accepting new jobs/not scheduling jobs/existing running jobs continue to run
''adminemail'': A colon-separated list of email addresses to send a message to for the start and stop of every job in the queue
''maxrunning'': The maximum number of jobs that a single user is allowed to have running at once in a given queue
''maxusernodes'': The maximum number of nodes a user is allowed to request for any single job in the queue
''totalnodes'': The maximum number of nodes that can be allocated among all jobs in a queue
''priority'': The priority for a queue, higher number is higher priority (default: 0)
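For example, the properties above might be set like this (the queue name and values are hypothetical):

```shell
# Constrain jobs in 'debug' to 10-60 minutes, cap each user at 5 jobs
# queued-or-running, and raise the queue's priority.
cqadm --setq 'maxtime=60 mintime=10 maxqueued=5 priority=10' debug
# Open the queue for submission and scheduling.
cqadm --setq 'state=running' debug
# Revert maxqueued to its default value.
cqadm --unsetq 'maxqueued' debug
```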
Set the next jobid: cqadm -j <next_jobid>
Dump a statefile: --savestate <filename>
All commands can take the '-d' flag for better debugging information.
This command can be used to set and force things to occur at the queue-manager level. Be careful if using a force-type command, as things like cleanup at the system component level may not be guaranteed.
The jobid can only be set to a higher value than the previous one; you can't go back.
These commands can affect the scheduling behavior of the queue-manager overall, instead of just at the queue level:
--stop: Stop scheduling jobs.
--start: Start scheduling jobs.
--reread-policy: Reread the utility function definition file.
--savestate filename: Save cqm's current state to filename.
The following take a list of jobs, as cqadm does:
--score=adjust: Reset the scores of the specified jobs to adjust.
--inherit=dep_frac: Change the fraction of the score inherited by the jobs in the argument list.
If --stop is used, none of the queues will show as being in the stopped state.
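Assuming these are flags to Cobalt's schedctl utility (the jobid is hypothetical), a sketch of their use:

```shell
# Pause scheduling entirely; note that queues will not show as stopped.
schedctl --stop
# Pick up edits to the utility function definition file without a restart.
schedctl --reread-policy
# Reset job 1234's score to 0.
schedctl --score=0 1234
# Resume scheduling.
schedctl --start
```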
This command acts more like two separate commands depending on whether or not you give it the '-m' flag.
Note: This command doesn't take times in anything resembling a standard format; it looks for YYYY_MM_DD-HH:MM.
Reservations also use minutes as their atomic unit for requesting time, though calculations are done in seconds.
Keep this in mind if you're making reservations that follow one right after another. The reservation will also
round to the minute should you provide the duration in the HH:MM:SS format.
While this command will warn you if you set overlapping reservations, it does not stop you from doing so.
The exact scheduling behavior between overlapping reservations is not well defined.
If you're creating a reservation:
setres -n <res_name> -s <starttime> -d <duration> -c <cycletime> -p <partition> -q <queue_name> -u <user_list> -f <list of partitions, if not using -p>
If you're modifying a reservation (i.e. using the -m flag):
setres -m -n <res_name> -s <new_starttime> -d <new_duration> -c <new_cycletime> -D -q <new_queue_name> -u <new_userlist> -p <new target partition> -f <list of partitions if not using -p and changing partitions>
ID-modification commands are not compatible with other modifications to reservations.
The --force_id flag will force the altered ids to any arbitrary value, including ones already in use. Use this with extreme caution.
-D can only be used in conjunction with -m. Also, it cannot be used in conjunction with altering the cycle time, duration or starttime of a reservation.
-n must always be used with -m so that the target reservation name is specified.
Partitions can be set to an individual partition (and by extension, blocking all parent, child and partially overlapping partitions) with the -p flag. You can also specify a list of partitions as positional arguments at the end of your setres command. Mixing the two is not recommended.
-f bypasses the partition checks. Like any other force-type flag, be careful with this.
Including multiple partitions: Instead of using -p, you can specify a set of partitions by passing them as positional arguments to setres, e.g. setres -n hw.foo ...more flags... ANL-R00-R07-8192 ANL-R30-R37-8192 would set a reservation for rows 0 and 3 and let users submit jobs to either of them.
Used in conjunction with cqsub/qsub's "--attrs location=ANL-SOME-PARTITION" option, this can be used to set up variable-sized, disjoint partitions that users can then direct jobs to, as appropriate.
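Tying the flags together, a creation/modification/release cycle might look like this sketch (the reservation name, user, and partition are hypothetical):

```shell
# Reserve a row for user 'alice': starts 2020_06_01-09:00, runs 60 minutes,
# and recycles every 24 hours.
setres -n hw.alice -s 2020_06_01-09:00 -d 60 -c 24:00:00 \
       -p ANL-R00-R07-8192 -u alice
# Later, modify only the duration of the existing reservation.
setres -m -n hw.alice -d 120
# Release the reservation when it is no longer needed.
setres -m -n hw.alice -D
```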
This is the main command for administrative tasks for partitions as well as extracting information about them. Like setres, this
is also effectively a multi-modal command, depending on the flags you pass it.
Partition scheduler addition/deletion: [-a|-d] : Add or delete partitions from scheduling. Pass '*' (quoted, so the shell doesn't expand it) to add all available partitions.
Partition functionality: [--activate|--deactivate] : Mark partitions as being functional (or not). '*' will activate all partitions added to scheduling.
Partition scheduling: [--enable|--disable] : Enable/disable scheduling on the listed partitions. The same trick of passing '*' in place of the list works to enable every partition for scheduling.
Queue association: --queue=queue1:queue2:... part1 part2 ... partn: Associate the listed partitions with the list of queues. This is in addition to creating the queue using cqadm.
Diagnostics: --diag=diag_name target_partition : Run the named diagnostic script on the target partition. Valid diagnostics are set in ''cobalt.conf''.
Partition failure and recovery: [--fail|--unfail] : Mark partitions as failed (as though they had failed diagnostics), or bring a partition out of a failed state. Jobs will not be scheduled on failed partitions.
-l lists partitions with a more detailed view than partlist gives.
--xml dumps out an xml file that can be used with the cobalt simulator's system component mock-up.
--dump will dump the state to a text file.
--savestate will save a copy of the statefile for the system to the target filename.
Recursive Mode: -r can be used with commands if you want the command you're running to affect all child partitions. Doesn't work for --diag.
Most of the flags for partadm are mutually exclusive, and multiple operations in a single pass aren't really supported; i.e., you can't pass -a --activate --enable all in one command. It must be three separate commands.
As a rule, most of the commands will take an asterisk as an argument, which will expand out to all the partitions the system component knows about.
Since the system component actually goes through the bridge API, some of these commands may be slower than expected.
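Since a single pass supports only one operation, bringing a partition fully into service takes separate commands, e.g. (the partition and queue names are hypothetical):

```shell
# Add the partition to scheduling, mark it functional, then enable it.
partadm -a ANL-R00-M0-512
partadm --activate ANL-R00-M0-512
partadm --enable ANL-R00-M0-512
# Associate it with queues previously created via cqadm.
partadm --queue=default:debug ANL-R00-M0-512
```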
This is the counterpart to partadm for the cluster_system component. It is much simpler than partadm.
--down: Marks a node as down, as though cleanup failed.
--up: Marks a node as being back up; jobs can be scheduled on it again.
--queue=queue1:queue2... node1 node2 ...: Associates queues with nodes, much like partadm.
-l: Lists node states.
-b <node_id_list>: Lists all state information for the listed nodes (Cray XC40 only).
Should something go very wrong in cleanup and a job stops but the node remains in the allocated state, you can mark the node down and then back up to return it to a schedulable state. Make sure that any node you do this to has actually been cleared of user processes before marking it back up.
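As a sketch of that recovery procedure (the node id is hypothetical):

```shell
# First confirm no user processes remain on the node, then cycle its state.
nodeadm --down node42
nodeadm --up node42
# Verify the node shows as schedulable again.
nodeadm -l
```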