Balsam Launcher
===============
.. automodule:: balsam.launcher
.. automodule:: balsam.launcher.runners
.. autoclass:: balsam.launcher.runners.Runner
   :members: __init__, start, update_jobs, finished
.. autoclass:: balsam.launcher.runners.RunnerGroup
.. autoclass:: balsam.launcher.runners.MPIRunner
.. autoclass:: balsam.launcher.runners.MPIEnsembleRunner
.. automodule:: balsam.launcher.jobreader
.. automodule:: balsam.launcher.worker
MPI Command Templates
---------------------
.. automodule:: balsam.launcher.mpi_commands

Transition Processes
--------------------
.. automodule:: balsam.launcher.transitions

Launcher: Main Executable
-------------------------
.. automodule:: balsam.launcher.launcher

MPI Ensemble: serial jobs
-------------------------
.. automodule:: balsam.launcher.mpi_ensemble
.. automodule:: balsam.launcher.util
This document outlines both short- and long-term goals for Balsam development.
Argo Service
------------
* Establish the Argo--Balsam communication protocol
* Think carefully about scenarios where a job resides @Argo, @Balsam1, @Balsam2
* Missing jobs are easy: Argo can send jobs that are missing from a list of PKs
* What if a job is out-of-date? A simple timestamp or version number is not
  enough. Do we enforce that only one instance can modify jobs? It would
  be very useful to let an ancestor job @Balsam1 modify the dependencies of
  some other job @Balsam2
* Balsam job sync service
* Balsam site scheduler: when a BalsamJob can run at more than one Balsam site
* Job source: automatic generation of BalsamJobs from some event (filesystem change, external DB write)
* Web interface
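One baseline for the out-of-date question above is optimistic versioning: a site's update is accepted only if it was based on the record's current version. The sketch below is purely illustrative (``JobRecord``, ``apply_update``, and ``SyncConflict`` are hypothetical names, not Balsam or Argo API) and shows both the mechanism and why a bare version number forces conflict handling once two sites modify the same job concurrently:

```python
class SyncConflict(Exception):
    pass

class JobRecord:
    """Hypothetical server-side job record with a version counter."""
    def __init__(self, pk, data):
        self.pk = pk
        self.data = data
        self.version = 0

    def apply_update(self, data, base_version):
        # Accept the update only if it was computed against the current
        # version; otherwise the caller must re-fetch and merge.
        if base_version != self.version:
            raise SyncConflict(f"job {self.pk}: stale base version {base_version}")
        self.data = data
        self.version += 1

job = JobRecord(pk=1, data={"state": "CREATED"})
job.apply_update({"state": "PREPROCESSED"}, base_version=0)

# A second site that also read version 0 now conflicts:
conflict = False
try:
    job.apply_update({"state": "USER_KILLED"}, base_version=0)
except SyncConflict:
    conflict = True
```

The conflict branch is exactly where "timestamp or version number is not enough": some policy (single writer, merge, or last-writer-wins) still has to resolve it.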
File Transfer Services
----------------------
* Transparent data movement: data flow between Balsam sites
* Stage-in logic should figure out where input files reside; choose appropriate protocol
* Support Globus SDK, GridFTP, scp, cp, ln
* Simplify user_settings: uncluttered configuration/credentials
Balsam Jobs
-----------
* Database

  * SQLite + global locks is likely to become a bottleneck. Evaluate scalability with increasing writes.
  * Move to a DB that supports real concurrency: MySQL or Postgres?
  * DB installation, configuration, and running the DB server should all be "hidden" from the user in setuptools and the Balsam service itself, keeping the simplicity of the SQLite user experience.
  * Queue save() events and manually implement a serialized writer process

* Implement a script job type: for users who want to ``balsam qsub --mode script``

  * uses a ScriptRunner that bypasses MPI
  * will need to parse the user script and translate mpirun to system-specific commands
  * ensure the script does not try to use more nodes than requested
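The "serialized writer process" idea above can be prototyped with stdlib primitives: producers enqueue save() events, and a single consumer drains the queue so that only one writer ever touches the DB. This is a minimal sketch, not Balsam code; the dict stands in for the real database:

```python
import threading
import queue

save_queue = queue.Queue()
db = {}  # stands in for the real database

def writer():
    # Single writer thread: all save() events are serialized here, so
    # concurrent producers never contend on the DB itself.
    while True:
        item = save_queue.get()
        if item is None:        # sentinel: shut down
            break
        pk, fields = item
        db.setdefault(pk, {}).update(fields)
        save_queue.task_done()

t = threading.Thread(target=writer)
t.start()

# Any number of producers may enqueue saves concurrently:
for pk in range(3):
    save_queue.put((pk, {"state": "CREATED"}))
save_queue.put((1, {"state": "JOB_FINISHED"}))

save_queue.put(None)
t.join()
```

The real version would replace the dict with SQLite calls; the key property is that writes are applied in queue order by exactly one process.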
Launcher Functionality
----------------------
* Runner creation strategies: optimize the scheduling strategy given some runnable jobs and workers
* MPIEnsemble

  * Right now it's fixed jobs per worker; try master-worker (job-pull) instead?
  * What if an MPIEnsembleRunner runs continuously, rather than being dispatched with a fixed set of jobs?
  * If there are also MPI jobs mixed in, we don't want it to hog resources
  * How many serial job ranks per node?

* How long to wait between Runner creations?
* Transitions

  * Move to the Metascheduler service? Do the transitions really need to happen on the Launcher side?
  * This especially makes sense if there is heavy data movement; why wait until the compute allocation to stage in?

* Improving concurrency with coroutines?

  * We could keep 1-10 transition processes, but increase the concurrency in each by using asynchronous I/O for DB reads/writes, and especially for time-consuming stage-in/stage-out transfers
  * Rewriting transition functions as coroutines may be a very natural choice

* PriorityQueue implementation: the Manager process incurs some overhead; how much faster is it than FIFO?
* Currently the Launcher runs in place of user-submitted batch scripts; this happens on a single "head" node of an allocation, from which MPI applications are initiated. This design choice is based on current constraints at ALCF. How can/should this improve with future abilities like:

  * Containerized applications with Singularity
  * Direct ssh access to compute nodes
  * MPI Spawn processes

* Support BG/Q workers: the worker setup() routine should boot subblocks of the requested size
* Wrap post-process steps (which run arbitrary user code) in a transaction, so
  that the DB can be rolled back if the Launcher crashes in the middle of some
  DAG manipulations
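The coroutine idea in the list above can be sketched with asyncio: a single transition process runs many stage-in "transitions" concurrently, each yielding control during its (here simulated) transfer. All names are illustrative stand-ins, not the real ``balsam.launcher.transitions`` module:

```python
import asyncio

async def stage_in(job_name, delay):
    # Simulate a time-consuming transfer; awaiting yields control so
    # other transitions in the same process make progress meanwhile.
    await asyncio.sleep(delay)
    return f"{job_name}: staged in"

async def run_transitions(jobs):
    # Run all stage-in transitions concurrently within one process.
    tasks = [stage_in(name, delay) for name, delay in jobs]
    return await asyncio.gather(*tasks)

jobs = [("task0", 0.02), ("task1", 0.01), ("task2", 0.0)]
results = asyncio.run(run_transitions(jobs))
```

With real transfers, ``asyncio.sleep`` would be replaced by non-blocking I/O (or a thread-pool executor for blocking clients); the 1-10 process pool then multiplies this per-process concurrency.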
Metascheduler Service
---------------------
* Argo--Balsam job sync service

  * periodically send requests for new jobs and updates

* Abstractions for queue queries, application performance models, and scheduling strategy
* Bayesian updates of the performance model from application timing data
* Aggressive search for holes in the local batch queue
* Scheduling algorithms to minimize time-to-solution

  * Constrained by maximum node-hour usage and minimum parallel efficiency
User Experience
---------------
* Move hard-coded logic out of the source to simplify user configuration
* Update documentation
* Document prototype/patterns of the Brain Imaging workflow
* Document prototype/patterns of the Solar Windows workflow
* Document prototype/patterns of the Hyperparameter Opt workflow
* Python CWL parser: automatic workflow generation from a CWL script
Hyperparameter Optimization
===========================
.. note::
    The code examples below are heavily pruned to illustrate only essentials of
    the Balsam interface. Refer to the `dl-hps repo <>`_ for runnable examples
    and code.
In this workflow, a light-weight driver runs either on a login node, or as a
long-running, single-core job on one of the compute nodes. The driver executes
a hyperparameter search algorithm, which requires that the model loss is
evaluated for several points in hyperparameter space. It evaluates these points
by dynamically adding jobs to the database::

    import json
    import balsam.launcher.dag as dag
    from balsam.service.models import BalsamJob, END_STATES

    def create_job(x, eval_counter, cfg):
        '''Add a new evaluatePoint job to the Balsam DB'''
        task = {}
        task['x'] = x
        task['params'] = cfg.params

        jname = f"task{eval_counter}"
        fname = f"{jname}.dat"
        with open(fname, 'w') as fp:
            fp.write(json.dumps(task))

        child = dag.spawn_child(name=jname, workflow="dl-hps",
                                application="eval_point", wall_time_minutes=2,
                                num_nodes=1, ranks_per_node=1,
                                input_files=fname, application_args=fname,
                                wait_for_parents=False)
        return child.job_id
In this function, the point is represented by the Python dictionary ``x``. This
and other configuration data is dumped into a JSON dictionary on disk. The job
to evaluate this point is created with ``dag.spawn_child`` and the
``wait_for_parents=False`` argument, which indicates that the child job may
overlap with the running driver job. The job carries out application ``"eval_point"``,
which is pre-registered as the worker application.
The ``input_files`` argument is set to the name of the JSON file, which causes
it to be transferred into the working directory of the child automatically.
Also, ``application_args`` is set to this filename in order to pass it as an
argument to the ``eval_point`` application.
The driver itself makes calls to this ``create_job`` function in order to
dispatch new jobs. It also queries the Balsam job database for newly finished
jobs, in order to assimilate the results and inform the optimizer::

    while not opt_is_finished():
        # Spawn new jobs
        XX = opt.ask(n_points=num_tocreate) if num_tocreate else []
        for x in XX:
            eval_counter += 1
            key = str(x)
            jobid = create_job(x, eval_counter, cfg)
            my_jobs.append(jobid)

        # Read in new results
        new_jobs = BalsamJob.objects.filter(job_id__in=my_jobs)
        new_jobs = new_jobs.filter(state="JOB_FINISHED")
        new_jobs = new_jobs.exclude(job_id__in=finished_jobs)
        for job in new_jobs:
            result = json.loads(job.read_file_in_workdir('result.dat'))
            finished_jobs.add(job.job_id)
The strength of Balsam in this workflow is decoupling the optimizer almost
entirely from the evaluation of points. The ``eval_point`` jobs take some
input JSON specification; besides this, they are free to run arbitrary code as
single-core or multi-node MPI jobs. The Balsam launcher and job database allow
an ensemble of serial and MPI jobs to run concurrently, and they are robust to
allocation time-outs or unexpected failures. The driver and/or jobs can be
killed at any time and restarted, provided the driver itself is checkpointing
data as necessary.
Frequently Asked Questions
==========================

Why isn't the launcher running my jobs?
---------------------------------------
Check the log for how many workers the launcher is assigning jobs to. It may be
that a long-running job is hogging more nodes than you think, so there aren't
enough idle nodes to run any other jobs. Also, the launcher will only start
jobs whose estimated run-time fits within the allocation's remaining wall-time.
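The filtering described above amounts to something like the following. This is a simplified sketch of the idea, not the launcher's actual scheduling code, and the dict fields are illustrative:

```python
def runnable_jobs(jobs, idle_nodes, remaining_minutes):
    # Keep only jobs that fit on the idle nodes AND whose estimated
    # run time fits inside the allocation's remaining wall-time.
    return [job for job in jobs
            if job["num_nodes"] <= idle_nodes
            and job["wall_time_minutes"] <= remaining_minutes]

jobs = [
    {"name": "small", "num_nodes": 1, "wall_time_minutes": 10},
    {"name": "hog",   "num_nodes": 8, "wall_time_minutes": 10},
    {"name": "long",  "num_nodes": 1, "wall_time_minutes": 120},
]
ready = runnable_jobs(jobs, idle_nodes=2, remaining_minutes=30)
```

Here only ``small`` survives: ``hog`` wants more nodes than are idle, and ``long`` cannot finish before the allocation ends.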
Where does the output of my jobs go?
------------------------------------
Look in the ``data/`` subdirectory of your :doc:`Balsam database directory
<../quick/db>`. Jobs are organized into folders according to the name of their
workflow, and each job's working directory is in turn given a unique name
derived from its job name and UUID.
All stdout/stderr from a job is directed into the file ``<jobname>.out``, along with job timing
information. Any files created by the job will be placed in its working directory, unless another
location is specified explicitly.
How can I move the output of my jobs to an external location?
-------------------------------------------------------------
This is easy to do with the "stage out" feature of BalsamJobs.
You need to specify two fields, ``stage_out_url`` and ``stage_out_files``,
either from the ``balsam job`` command line interface or as arguments to
``dag.create_job()`` or ``dag.spawn_child()``.

Set ``stage_out_url`` to the location where you want files to go. Balsam
supports a number of protocols for remote and local transfers (scp, GridFTP,
etc.). If you just want the files to move to another directory in the same
file system, use the ``local`` protocol.

``stage_out_files`` is a whitespace-separated list of shell file-patterns, for
example::

    stage_out_files='result.out *.log simulation*.dat'

Any file matching any of the patterns in this field will get copied to the
``stage_out_url`` location.
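The pattern matching for ``stage_out_files`` behaves like shell globbing. A sketch of the selection logic using the stdlib ``fnmatch`` module (illustrative only, not Balsam's implementation):

```python
from fnmatch import fnmatch

def select_stage_out(filenames, stage_out_files):
    # stage_out_files is a whitespace-separated list of shell patterns;
    # a file is staged out if it matches ANY of the patterns.
    patterns = stage_out_files.split()
    return [f for f in filenames if any(fnmatch(f, p) for p in patterns)]

files = ["result.out", "run.log", "simulation1.dat", "notes.txt"]
staged = select_stage_out(files, "result.out *.log simulation*.dat")
```

With the patterns from the example above, ``result.out``, ``run.log``, and ``simulation1.dat`` are selected, while ``notes.txt`` is left behind.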
How can I control the way an application runs in my workflow?
-------------------------------------------------------------
There are several optional fields that can be set for each BalsamJob. These
fields can be set at run-time, during the dynamic creation of jobs, which
gives a lot of flexibility in the way an application is run:

* Command-line arguments passed to the application (``application_args``)
* Environment variables to be set for the duration of the application execution
* Which files are "staged-in" from the working directories of parent jobs
  (``input_files``). This follows the same shell file-pattern format as the
  ``stage_out_files`` field mentioned above. It is intended to facilitate
  data-flow from parent to child jobs in a DAG, without resorting to stage-out
  functionality.
* The pre- and post-processing scripts which run before and after the
  application is executed. You can override the defaults, which are defined
  alongside the application.
I want my program to wait on the completion of a job it created
---------------------------------------------------------------
If you need to wait for a job to finish, you can set up a polling function like
the following::

    from balsam.launcher import dag
    import time

    def poll_until_state(job, state, timeout_sec=60.0, delay=5.0):
        start = time.time()
        while time.time() - start < timeout_sec:
            time.sleep(delay)
            job.refresh_from_db()
            if job.state == state:
                return True
        return False
Then you can check for any state with a specified maximum waiting time and delay.
For finished jobs, you can do::

    newjob = dag.add_job( ... )
    success = poll_until_state(newjob, 'JOB_FINISHED')

There is a convenience function for reading files in a job's working directory::

    if success:
        output = newjob.read_file_in_workdir('output.dat')  # contents of file in a string
Querying the Job database
-------------------------
You can perform complex queries on the BalsamJob database thanks to Django. If
you ever need to filter the jobs according to some criteria, the entire
database is available via ``dag.BalsamJob``.
See the `official documentation <>`_ for lots of examples, which apply directly
wherever you can replace ``Entry`` with ``BalsamJob``. For example, say you
want to filter for all jobs containing "simulation" in their name, but exclude
jobs that are already finished::

    from balsam.launcher import dag

    BalsamJob = dag.BalsamJob
    pending_simulations = BalsamJob.objects.filter(name__contains="simulation").exclude(state="JOB_FINISHED")
You could count this query::

    num_pending = pending_simulations.count()

Or iterate over the pending jobs and kill them::

    for sim in pending_simulations:
        dag.kill(sim)
Useful command lines
--------------------
Create a dependency between two jobs, and view the dependencies in a tree::

    balsam dep <parent> <child>  # where <parent>, <child> are the first few characters of job ID
    balsam ls --tree             # see a tree view showing the dependencies between jobs

Reset a failed job state after some changes were made::

    balsam modify jobs b0e --attr state --value CREATED  # where b0e is the first few characters of the job id

See the state history of your jobs and any error messages that were recorded
while the jobs ran::

    balsam ls --hist | less

Remove all jobs with substring "task"::

    balsam rm jobs --name task
Useful Python scripts
---------------------
You can use the ``balsam.launcher.dag`` API to automate a lot of tasks that
might be tedious from the command line. For example, say you want to
**delete** all jobs that contain "master" in their name, but reset all jobs
that start with "task" to the "CREATED" state, so they may run again::

    import balsam.launcher.dag as dag

    dag.BalsamJob.objects.filter(name__contains="master").delete()

    for job in dag.BalsamJob.objects.filter(name__startswith="task"):
        job.state = "CREATED"
        job.save()
.. balsam documentation master file, created by
   sphinx-quickstart on Fri Dec 15 13:22:13 2017.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Balsam - HPC Workflow and Edge Service
======================================
Balsam is a Python-based service that handles the cumbersome process of running
many jobs across one or more HPC resources. It runs on the login nodes, keeping
track of all your jobs and submitting them to the local scheduler on your
behalf.
Why do I want this?
-------------------
Whereas a local batch scheduler like Cobalt runs on behalf of **all users**,
with the goals of fair resource sharing and maximizing overall utilization,
Balsam runs on **your** behalf, interacting with the scheduler to check for
idle resources and sizing jobs to minimize time-to-solution.
You could use Balsam as a drop-in replacement for ``qsub``, simply using
``balsam qsub`` to submit your jobs with absolutely no restrictions. Let Balsam
throttle submission to the local queues, package jobs into ensembles for you,
and dynamically size these packages to exploit local scheduling policies.
There is :doc:`much more <userguide/overview>` to Balsam, which is a complete
service for managing complex workflows and optimized scheduling across multiple
HPC resources.
.. toctree::
:maxdepth: 2
:caption: Quickstart
.. toctree::
:maxdepth: 2
:caption: User Guide
.. _dev_docs:
.. toctree::
:maxdepth: 3
:caption: Developer Documentation
.. toctree::
:maxdepth: 2
:caption: Use Cases
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
The Balsam Database
===================
Every Balsam instance is associated with a **database** of jobs. If you use the
Balsam scheduling service, it will package and submit these jobs for concurrent
ensemble execution. You can also bypass the service and directly run the Balsam
launcher, which is the pilot responsible for processing and executing jobs.
When Balsam is first installed, a SQLite database is created in the default **Balsam DB directory**.
This directory is named ``default_balsamdb`` and located in the top level of the Balsam repo.
Each Balsam DB directory contains the following:

* The actual database on disk (in the case of SQLite, a single file). For
  heavier-duty databases like Postgres, this file is replaced by an entire
  directory containing the various database files.
* A JSON datastore used to coordinate database clients and servers.
* ``data/``: the directory containing all the output and files generated by
  your jobs. It is organized into subdirectories for each workflow name, so
  all jobs assigned to the same workflow are placed in the same folder. This
  workflow folder then contains the working directories of its jobs, which
  are given unique names based on the job name.
* ``log/``: log files, which are very useful for monitoring the state and
  health of the Balsam applications, are written into this subdirectory.
Creating a new Balsam DB
------------------------
You may want to set up one or more Balsam DB directories for different
projects; or perhaps one SQLite DB for testing and a heavier-duty one for
production. Create a new Balsam DB simply by specifying the path of the new
DB directory::

    $ balsam init ~/testdb --db-type sqlite3

This will create a new directory and populate it with the files mentioned
above. If you ever ``git pull`` and the database schema goes out-of-date,
you can also use this command to refresh the existing DB.
Starting up the Balsam DB Server
--------------------------------
With SQLite3, processes that use the Balsam DB can simply make a direct
connection by opening the file on disk. This approach breaks down when
independent processes (e.g. the metascheduler and launcher) or user
applications try to **write concurrently** to the database. Because
SQLite+Django has poor support for concurrency, Balsam provides a DB proxy
service, which runs in front of the database and serializes all write calls
through a ZeroMQ socket. Start and stop the DB server with::

    $ balsam dbserver --path=~/testdb
    $ balsam dbserver --stop

This will run a "watchdog" process that keeps the server alive and establishes
the appropriate port for TCP communication to the server. All programs in your
workflow will automatically communicate with this server instead of trying to
write directly to the DB.
In the case of Postgres or MySQL, running the **dbserver command is not
optional**, because **all** DB read and write queries are handled by the
server! Fortunately, the details of setting up and running a DB server are
completely encapsulated by the commands ``balsam init`` and ``balsam dbserver``.
Specifying which DB to use
--------------------------
If you constantly want to use one Balsam DB other than ``default_balsamdb/``,
you can change the path in ``balsam/``. Set the variable
``default_db_path`` at the top of the file equal to the full path of your DB
directory.

To override the ``default_db_path`` for a particular session, you can set the
environment variable ``BALSAM_DB_PATH``. Either export this variable or prefix
your command lines with it, for instance::

    $ BALSAM_DB_PATH=~/testdb balsam ls   # will list jobs in the ~/testdb location
    $ # or equivalently...
    $ export BALSAM_DB_PATH=~/testdb
    $ balsam ls
Hello World and Testing
=======================

Hello World (on Cooley)
-----------------------
The launcher pulls jobs from the database and invokes MPI to run the jobs.
To try it out interactively, grab a couple nodes on Cooley and remember to
load the appropriate environment::

    qsub -A datascience -n 2 -q debug -t 30 -I
    soft add +anaconda
    source activate balsam

The **balsam** command-line tool will have been added to your path.
There are a number of sub-commands to try; to explore the options, use
the ``--help`` flag::

    balsam --help
    balsam ls --help
Let's set up a Balsam DB in our home directory for testing and start up a DB
server to manage that database::

    balsam init ~/testdb
    export BALSAM_DB_PATH=~/testdb
    balsam dbserver --path ~/testdb

With the ``BALSAM_DB_PATH`` environment variable set, all ``balsam`` programs
will refer to this database. Now let's create a couple of dummy jobs and see
them listed in the database::

    balsam ls                        # no jobs in ~/testdb yet
    balsam qsub "echo hello world" --name hello -t 0
    balsam make_dummies 2
    balsam ls --hist                 # history view of jobs in ~/testdb

Useful log messages will be sent to ``~/testdb/log/`` in real time. You can
change the verbosity, and many other Balsam runtime parameters, in
``balsam/``. Finally, let's run the launcher::

    balsam launcher --consume --time 0.5   # run for 30 seconds
    balsam ls --hist                       # jobs are now done
    balsam ls --verbose
    balsam rm jobs --all
Comprehensive Test Suite
------------------------
The **balsam-test** command line tool invokes tests in the ``tests/``
directory. You can run specific tests by passing the test module names, or run
all of them just by calling **balsam-test** with no arguments. You can provide
the ``--temp`` parameter to run certain tests in a temporary test directory::

    $ balsam-test tests.test_dag --temp  # this should be quick

You should see this at the end of the test output::

    Ran 3 tests in 1.575s

To run the comprehensive set of unit and functional tests, you must create a
persistent test DB and run a DB server in front of it. To help prevent users
from running tests in their production database, Balsam requires that the DB
directory contain the substring "test". (The ``~/testdb`` DB created above
would suffice here.) Be sure that a DB server is running in front of the test
DB::

    export BALSAM_DB_PATH=~/testdb
    balsam dbserver --path ~/testdb
    balsam-test  # the test_functional module might take over 10 minutes!
Install Balsam
==============
.. note::
    If you are reading this documentation from GitHub/GitLab, some of the
    example code is not displayed! Take a second and build this documentation
    on your own machine (until it's hosted somewhere accessible from the
    internet)::

        $ pip install --user sphinx
        $ cd docs
        $ make html
        $ firefox _build/html/index.html  # or navigate to the file from a browser