Skip to content

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
    • Help
    • Support
    • Submit feedback
    • Contribute to GitLab
  • Sign in
C
Cobalt
  • Project overview
    • Project overview
    • Details
    • Activity
    • Releases
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 9
    • Issues 9
    • List
    • Boards
    • Labels
    • Milestones
  • Merge Requests 0
    • Merge Requests 0
  • CI / CD
    • CI / CD
    • Pipelines
    • Jobs
    • Schedules
  • Analytics
    • Analytics
    • CI / CD
    • Repository
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Collapse sidebar
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
  • AIG-public
  • Cobalt
  • Wiki
    • Cmdref
  • cobalt.conf.5

cobalt.conf.5

Last edited by Paul Rich Jul 17, 2019
Page history

NAME

cobalt.conf - configuration parameters for cobalt components

SYNOPSIS

/etc/cobalt.conf

DESCRIPTION

The Cobalt configuration file is an "ini-style" configuration file. This configuration file has sections for all Cobalt components and clients in a given instance. The general format of a section is:

[section]

Values that are lists are ":"-delimited. In the event that a key is defined multiple times in a section, the value of the last key in the section will be the value used. Comments may be made in a file by beginning a like with a '#'. Comments must not be inline with key-value pairs. The sections that follow describe the various sections and their options.

If a configuration value definition is mandatory, that will be noted.

General Sections

[components]

service-location

The url:port of the service-locator component (slp). The default port is 8256. This must be specified for a given Cobalt instance

python

The path to the python interpreter to use. If omitted, the default is /usr/bin/python

[communication]

SSL configuration for Cobalt. These must be specified per-install

key

Key to use for SSL communication. May be generated via openssl(1)

cert

This is a locally stored certification for authenticating the key.

ca

Certification authority for the

entry. This is typically the same as

password

Required to be set in configuration file. This is a shared secret for all Cobalt daemons and clients to use, and is the password required for Cobalt’s internal

communication.

[statefiles]

Options for Cobalt’s statefile persistence.

location

Path to where the statefiles are stored.

[system]

Common system configuration settings. These apply to all types of systems.

elogin_hosts

A ':'-delimited list of hostnames of hosts that users can qsub interactive jobs on that then have to be run on another node. This is ususally due to restrictions in the authentication and authorization mechanisms for the mpirun equivalent on a given system. This is most commonly required for Cray systems using eLogin nodes.

resource_name

This is the resource name for this Cobalt instance for accounting purposes. This is typically the name of the cluster being run. If unspecified the resource name for accounting logs is "NOTSET".

size

Maximum size of a given system in nodes.

[forker]

Common option for Cobalt forker components.

ignore_setgroup_errors

Default false. If set to true, then setuid/setgid failures will not kill jobs. Necessary for running local or subinstances of Cobalt, as well as running any non-root simulation mode. If this is set and the forker components are not running as root, this will cause any job ran to be run as the user to run as the user that the forker component is running as.

pipe_buffsize

The size in bytes of the buffer to use for reading from a pipe connected to a child process standard out. Increase this if you are getting particularly large messages from standard out. This is most likely to occur with large Cray systems using the BASIL interface to ALPS. Default 16777216 bytes.

save_me_interval

The minimum interval that Cobalt will wait between saving statefiles for this component, in seconds. By default the interval is 10.0 seconds. Under periods of high load on the component, the interval between statefiles may be longer.

use_cgroups

The default is false. If enabled, cgroup support is enabled. When cgroup support is enabled, Cobalt will use cgclassify during script process startup to place Cobalt initiated scripts into an administrator-specified cgroup. This is generally used if proc connector is disabled on a given system. Cgroup-related options may be set on a per forker-type basis, or on a per- instance basis. This is currently supported for Cray systems, and for site- specified system scripts, such as job and resource pre and postscripts.

cgroup_failure_fatal

The default is false. If cgclassify fails to set a cgroup and cgroup_failure_fatal is set to true, then script startup will fail and the process will exit with a nonzero status.

cgclassify_path

This is the path to the cgclassify executable. By default this is /usr/bin/cgclassify. Use this if cgclassify is in a different location.

cgclassify_args

Arguments to pass to cgclassify. No default arguments are provided by Cobalt. See cgclassify(1) for information on options to cgclassify.

[forker.system]

This applies cgroup options to the system_script_forker. Any options not specified here will default to the values set by the general forker section. These options will affect any auxiliary scripts that Cobalt from the system or queue-manager components. These options will not be applied to any user-provided scripts.

use_cgroups

If true, cgclassify support is enabled.

cgroup_failure_fatal

If cgclassify fails to set a cgroup and cgroup_failure_fatal is set to true, then script startup will fail and the process will exit with a nonzero status.

cgclassify_path

This is the path to the cgclassify executable. By default this is /usr/bin/cgclassify. Use this if cgclassify is in a different location.

cgclassify_args

Arguments to pass to cgclassify. No default arguments are provided by Cobalt. See cgclassify(1) for information on options to cgclassify.

[forker.alps]

This applies cgroup options to all alps_script_forker instances. Any options not specified here will default to the values set by the general forker section. These options will be applied to all alps_script_forkers that are not individually overridden. These options will affect all user-run jobs on a Cray system.

use_cgroups

If true, cgclassify support is enabled.

cgroup_failure_fatal

If cgclassify fails to set a cgroup and cgroup_failure_fatal is set to true, then script startup will fail and the process will exit with a nonzero status.

cgclassify_path

This is the path to the cgclassify executable. By default this is /usr/bin/cgclassify. Use this if cgclassify is in a different location.

cgclassify_args

Arguments to pass to cgclassify. No default arguments are provided by Cobalt. See cgclassify(1) for information on options to cgclassify.

[forker.<alps_script_forker_instance_name>]

Applies these configuration options to an individual forker instance. If these are not defined then the values used or passed along by the "[forker.alps]" section will be used.

use_cgroups

If true, cgclassify support is enabled.

cgroup_failure_fatal

If true, if cgclassify fails to set a cgroup, then script startup will fail and the process will exit with a nonzero status.

cgclassify_path

This is the path to the cgclassify executable. By default this is /usr/bin/cgclassify. Use this if cgclassify is in a different location.

cgclassify_args

Arguments to pass to cgclassify. No default arguments are provided by Cobalt. See cgclassify(1) for information on options to cgclassify.

[logger]

This section handles cobalt component logging and default levels. Valid logging levels in this section are

and

to_syslog

If true, send logging data to the syslog daemon.

syslog_level

Only send messages to syslog at this level or higher. The default level is INFO

syslog_location

Location of logfile

syslog_facility

Logger facility to send logs to. The default is local0

to_console

Send logging data to console or stdout/stderr as appropriate. This defaults to true.

console_level

Only send messages to the console at this level or higher. The default level is INFO

[bgsched]

default_reservation_policy

If set, this is the score accrual policy that will be used on reservation queues. The default policy is "default" (fifo).

db_flush_interval

The minimum frequency with which messages are sent to the database component. use_db_logging must be set to true, and the default interval is 10 seconds. log_dir The directory to place reservation accounting logs.

overflow_file

This is a file location to use for holding database messages should use_db_logging be set to true, but the CobaltDB writer component is unavailable for an extended period of time. If this file is present, then on cdbwriter startup, messages from this file will be pushed to the component and added to the database, followed by in-memory pending messages.

max_queued_messages

This is the number of messages to keep in memory before flushing to the

If set to -1, the component will never flush to the overflow file. If this is not set, then the overflow file will not be used.

save_me_interval

The minimum interval that Cobalt will wait between saving statefiles for this component, in seconds. By default the interval is 10.0 seconds. Under periods of high load on the component, the interval between statefiles may be longer.

schedule_jobs_interval

This is the minimum interval between iterations of the scheduling loop. The default time is 10 seconds.

utility_file

Location of file for site-defined utility functions.

use_db_logging

If true, send messages to CobaltDB, or cache the messages that would be sent if the CobaltDB writer is currently unavailable for later writing. The default is false

[cqm]

These are options for the queue-manager component, cqm. Cqm handles queueing and overall job tracking operations.

filters

A colon-delimited list of paths to scripts to run. These are run by the clients that work with cqm(8), specifically, qsub(1), qalter(1), and qmove(1). These are invoked from the clients and these scripts must run return an exit status of 0 prior to the job, or job modification being passed into cqm. These are intended as site-specific validation scripts. Scripts recieve job parameters as key=value pairs as arguments, and any key=value pairs written to stdout will modify job parameters accordingly, for instance a non-default initial score of 500 may be written to stdout as score=500. If a job would fail to pass the filter entirely, then it should return a nonzero exit status. A note as to which filter failed should be presented to the user. It should be noted that cqadm(1) as an admin-level command does not run these filters. Since the filters are invoked as a part of client invocation, any change to this parameter to a running Cobalt instance will have an immediate effect without signaling or restart.

job_prescripts

A colon-delimited list of scripts to run when the job is scheduled, but prior to job invocation. These are run once per job, whether or not it is preempted. Nonzero exit statuses in these scripts are fatal to a job starting up.

job_postscripts

A colon-delimited list of scripts to run after the job has ended. These are run once per job, whether or not it is preempted. Nonzero exit statuses in these scripts have no effect on a job.

resource_prescripts

A colon-delimited list of scripts to run when the job is scheduled, but prior to job invocation. These are run once per task, prior to resuming from preemption. Nonzero exit statuses in these scripts are fatal to a job starting up.

resource_postscripts

A colon-delimited list of scripts to run after the job has ended. These are run after each preemption step. Nonzero exit statuses at the end of a job in these scripts have no effect on a job.

dep_frac

The floating-point fraction of a job’s score that a dependent job inherits. This sets a default value and may be overridden on a per-job basis by the schedctl(1) command. The default is 0.5.

scale_dep_frac

If set to true, the dependency fraction inherited by jobs will be modified by the ratio of the size of the resources the dependent job to the job it is inheriting score from. This only applies to dependent jobs that are smaller than the job they are inheriting from. For instance, a 4 node job depending on an 8 node job would inherit half the score fraction than an 8 node job that depended on an 8-node job.

mailserver

The address of the mailserver to use for sending admin emails and requested user emails for startup and termination notification.

force_kill_delay

The length of time, in seconds, to wait between sending a SIGTERM and a SIGKILL to a job. The default is 300 seconds.

log_dir

The directory to place job accounting logs.

overflow_file

This is a file location to use for holding database messages should use_db_logging be set to true, but the CobaltDB writer component is unavailable for an extended period of time. If this file is present, then on cdbwriter startup, messages from this file will be pushed to the component and added to the database, followed by in-memory pending messages.

max_queued_messages

This is the number of messages to keep in memory before flushing to the

If set to -1, the component will never flush to the overflow file. If this is not set, then the overflow file will not be used.

save_me_interval

The minimum interval that Cobalt will wait between saving statefiles for this component, in seconds. By default the interval is 10.0 seconds. Under periods of high load on the component, the interval between statefiles may be longer.

utility_file

Location of file for site-defined utility functions.

use_db_logging

If true, send messages to CobaltDB, or cache the messages that would be sent if the CobaltDB writer is currently unavailable for later writing. The default is false

poll_process_groups_interval

The interval in seconds between queries to the system component for process group status.

use_db_jobid_generator

If true, use CobaltDB to generate a unique jobid. This may be used to ensure unique jobids across multiple Cobalt instances on related resources. Default false.

progress_interval

The minimum time in seconds between job statemachine steps. Default 10 seconds.

max_walltime

If set, defines a general maximum requested walltime for all queues. May be overriden by setting the MaxWalltime property on a given queue. If this is not set, then there is no default limit on the length of time a user job may request, unless explicitly set as a part of a given queue.

compute_utility_interval

The minimum time in seconds to wait between score calculation iterations. The default is 10 seconds.

cqstat_header

A colon-delimited list of display headers to use in qstat(1)'s default display. A default set of headers will be used if this is not set.

cqstat_header_full

A colon-delimited list of display headers to use with qstat(1)'s -f flag. If not set, a default set of display headers are used. This does not change the -f -l combination for display.

starttime_estimate_shadow

A floating point time to add to the current time for a minimum start time estimate. This will force a minimum start time in the future to handle situations where there is an ongoing cleanup or other system issue where a job may be running long. This only affects display of Est_Start_Time in qstat(1)'s display. The default is 300.0 seconds.

[cdbwriter]

log_dir

The directory to place cdbwriter message overflow files.

user

The user to connect to DB2. It is recommended to use a user identity that only has access to the Cobalt database. This user requires read, write, and update permissions on the Cobalt database.

pwd

This is the password that the user will use to connect to the Cobalt database.

database

The name of the database in DB2 to connect to that contains the Cobalt database.

schema

The name of the DB2 schema where the Cobalt database resides. Multiple schemas may exist in the same database, which is useful for handling multiple, related, Cobalt instances.

save_me_interval

The minimum interval that Cobalt will wait between saving statefiles for this component, in seconds. By default the interval is 10.0 seconds. Under periods of high load on the component, the interval between statefiles may be longer.

Cluster System Sections

[cluster_system]

simulation_mode

Set the cluster_system component to run in a simulation mode. In this mode, The cluster system will not actually run jobs on target nodes in its configuration, but it will instead run the

which will provide statistics on what would have ran. Otherwise the system component will track and allocate resources as though it was actually running on a multi-node cluster, with a confguration sprcified in the

entry if true. This defaults to false.

simulation_executable

Instead of running pre and postscripts, run the specified executable. This must be specified if running in simulation_mode. Output from this script is logged to the cluster_system component’s logs.

run_remote

If set to false, do not attempt to run pre/postscripts on remote resources. The default is true.

hostfile

This is a list of hostnames for nodes that the cluster system component can schedule. Nodes may be added or removed, and the list of available nodes is updated at restart.

epilogue

This is a colon-delimited set of scripts to run on a per-node basis on task termination on a resource. If any script returns a non-zero exit status, the node will be marked down, and no new jobs will be scheduled on that resource.

epilogue_timeout

The amount of time in seconds to wait for each script to complete. If the script has not completed and exited with a status of 0 before this timeout is reached, that node will be marked down.

prologue

Not currently used. Per-node scripts are currently launched as a part of the cqm(8) resource_prologue

prologue_timeout

This is not currently used within the cluster system component

allocation_timeout

This is the time in seconds to wait when resources are allocated, but have not had a job started on them. This usually occurs when a user deletes a job while it is starting up. After this timeout has elapsed the resources will be returned to the pool of available nodes, and a new job may be scheduled on the resources. The default timeout is 300 seconds.

drain_mode

This sets the backfill mode to use and may be one of backfill, drain_only, or first_fit. The first_fit mode will run the highest scored job that can immediately run on resources available. The drain_only mode will run the highest scored job, if sufficient resources are available or it will start draining nodes and then run the job once sufficient resources are available. The backfill mode will run and drain resources as the drain_only mode, but will also attempt to run jobs on the empty, but draining nodes in a score-order first-fit manner. It is recommended that backfill be used if draining is permitted for improved utilization of cluster resources.

minimum_backfill_window

This is the minimum amount of backfill time to set for a set of resources that being cleaned by post-job epilogue scripts. The default is 300 seconds.

BlueGene/P Sections

[bgpm]

mmcs_server_ip

The IP address of the BlueGene mmcs_server.

mpirun

The location of the BlueGene mpirun binary. This is typically /bgsys/drivers/ppcfloor/bin/mpirun

[bgsystem]

kernel

If true, allow the use of alternative kernels

bootprofiles

This is a path to the directory that holds the alternate kernel subdirectories. If alternate kernel support is being used, then this must be set. This is the location of where symlinks to the current profiles of

partitions

should be made. Cobalt will autogenerate these symlinks as a part of the boot process on an as-needed basis.

bgtype

The type of BlueGene being run on. For BlueGene/Q this should be set to 'bgp'.

stress_comm_code

Enables an extra function to place the system component under high-communication stress for race-condition debugging and fault-handling testing if True. This is False by default. This applies only to the brooklyn system simulation environment.

BlueGene/Q Sections

[bgpm]

runjob

The location of the BlueGene runjob binary. This is typically /bgsys/drivers/ppcfloor/bin/runjob

[bgsystem]

allow_alternate_kernels

If set to true, allow alternate kernels to be run by users using the --kernel or --io_kernel flags to qsub(1). This defaults to false.

bootprofiles

This is a path to the directory that holds the alternate kernel subdirectories. If alternate kernel support is being used, then this must be set. This is the location of where symlinks to the current profiles of

partitions

should be made. Cobalt will autogenerate these symlinks as a part of the boot process on an as-needed basis.

default_kernel

The default compute-node kernel image to use. This name should be a directory found at the path indicated by

This value is set to 'default' by default.

default_kernel_options

A list of options to pass to the default kernel image.

ion_default_kernel

The default IO-node kernel image to use. This name should be a directory found at the path indicated by ion_default_kernel_options A list of options to pass to the default kernel image.

This value is set to 'default' by default.

subblock_prefix

This is a location prefix to attach to subblock names. Usually this is the resource’s prefix for the Cobalt instance. The default for subblock use is "COBALT".

subblock_config

Sets a configuration for subblock use. This is a key-value list of the form: + _ "[blockname1:min_size1],[blockname2:min_size2],…​" _ + Blocks must be specified in the BlueGene control system. Pseudoblocks will be generated down to the specified minimum size. Valid minimum sizes are 64, 32, 16, 8, 4, 2, 1. Subblock geometries are per-IBM’s recommendations in runjob(1) where appropriate. If + is specified then + may also be specified.

ignore_subblock_sizes

A colon-delimited list of sizes to skip when generating pseudoblocks for automatic subblock use.

terminal_boot_timeout

Sets an automatic timeout in seconds for block boots initiated by Cobalt’s boot_block(1) command. The default is 300 seconds.

bgtype

The type of BlueGene being run on. For BlueGene/Q this should be set to 'bgq'.

CRAY SECTIONS

[alps]

basil

The path to Cray’s apbasil command. The default path is /opt/cray/default/alps/bin/apbasil

apkill

The path to Cray’s apkill command. The default path is /opt/cray/alps/default/bin/apkill

cray_mom_qsub

The path to qsub on the mom (or other alps_script_forker) nodes to use when using interactive qsub from the eLogin hosts on Cray systems. This must be a fully qualified path. The default is /usr/bin/qsub

default_depth

The default processors per node. This should be set to the number of KNL cores on each node for XC40 systems. The default value is 72.

[alpssystem]

min_ssd_size

The size of the smallest SSD available on the system in GB.

pgroup_startup_timeout

The time to allow for process group startup in seconds. The default is 120 seconds.

save_me_interval

The minimum interval that Cobalt will wait between saving statefiles for this component, in seconds. By default the interval is 10.0 seconds. Under periods of high load on the component, the interval between statefiles may be longer.

temp_reservation_time

The default time for the temporary allocation reservation for starting jobs in seconds. The default is 300 seconds.

update_thread_timeout

The polling interval for state updates from ALPS in seconds. The default is 10 seconds.

[capmc]

path

Path to CAPMC command front-end. If unset, the default is /opt/cray/capmc/default/bin/capmc

[system]

backfill_epsillon

Set the amount of time to subtract from the remaining drain window, in seconds, when placing backfill jobs. This allows time for cleanup for backfill jobs to prior to the exit time of the job causing the drain to occur. The default is 120 seconds.

cleanup_drain_window

Set the draining time to set for nodes in cleanup statuses. The time is in seconds. The default time is 300 seconds.

drain_mode

Set the draining algorithm to use. This may be backfill or first-fit. The default is first-fit.

ENVIRONMENT

COBALT_CONFIG_FILES If set, Cobalt will use the configuration pointed to by this path.

FILES

/etc/cobalt.conf

This is the default location for the configuration file used by all Cobalt daemons and clients. Due to the potential for abuse of the

interfaces, access to this file should be carefully controlled. This file does not to be writable under normal conditions, and only must be readable by the user used by Cobalt’s setgid wrappers. By default, this is the

user.

SEE ALSO

slp(8), bgpm(8), bgsched(8), cqm(8)

Clone repository
  • Installation
  • ScriptDirectives
  • Simulation
  • SystemConfig
  • SystemSpecificFunctions
  • admin quickref
  • bgq interactive jobs
  • bgq user block booting
  • brooklyn system simulator
  • cmdref
    • CommandReference
    • bgsched.8
    • bgsystem.8
    • boot block.1
    • cobalt.conf.5
    • cqadm.8
View All Pages