Commit 213b886f authored by Michael Salim

Updated documentation

parent 53a5dccf
Hyperparameter Optimization
===============================
.. note::
    The code examples below are heavily pruned to illustrate only the essentials of
    the Balsam interface. Refer to the
    `dl-hps repo <https://xgitlab.cels.anl.gov/pbalapra/dl-hps/tree/balsam-port>`_
    for runnable examples and code.
In this workflow, a lightweight driver runs either on a login node or as a
long-running, single-core job on one of the compute nodes. The driver executes
a hyperparameter search algorithm, which requires the model loss to be
evaluated at several points in hyperparameter space. It evaluates these points
by dynamically adding jobs to the database::
    import json

    import balsam.launcher.dag as dag
    from balsam.service.models import BalsamJob, END_STATES

    def create_job(x, eval_counter, cfg):
        '''Add a new evaluatePoint job to the Balsam DB'''
        task = {}
        task['x'] = x
        task['params'] = cfg.params

        # Write the point (and any extra configuration) to a JSON input file
        jname = f"task{eval_counter}"
        fname = f"{jname}.dat"
        with open(fname, 'w') as fp:
            fp.write(json.dumps(task))

        child = dag.spawn_child(name=jname, workflow="dl-hps",
                                application="eval_point", wall_time_minutes=2,
                                num_nodes=1, ranks_per_node=1,
                                input_files=f"{jname}.dat",
                                application_args=f"{jname}.dat",
                                wait_for_parents=False)
        return child.job_id
In this function, the point is represented by the Python dictionary ``x``. This
and other configuration data is dumped to a JSON file on disk. The job
to evaluate this point is created with ``dag.spawn_child``; the
``wait_for_parents=False`` argument indicates that the child job may
overlap with the running driver job. The job runs the ``eval_point``
application, which is pre-registered as the worker application.
The ``input_files`` argument is set to the name of the JSON file, which causes
it to be transferred into the working directory of the child automatically.
Also, ``application_args`` is set to this filename in order to pass it as an
argument to the ``eval_point`` application.
The driver itself makes calls to this ``create_job`` function in order to
dispatch new jobs. It also queries the Balsam job database for newly finished
jobs, in order to assimilate the results and inform the optimizer::
    while not opt_is_finished():
        # Spawn new jobs
        XX = opt.ask(n_points=num_tocreate) if num_tocreate else []
        for x in XX:
            eval_counter += 1
            key = str(x)
            jobid = create_job(x, eval_counter, cfg)
            my_jobs.append(jobid)

        # Read in new results
        new_jobs = BalsamJob.objects.filter(job_id__in=my_jobs)
        new_jobs = new_jobs.filter(state="JOB_FINISHED")
        new_jobs = new_jobs.exclude(job_id__in=finished_jobs)

        for job in new_jobs:
            result = json.loads(job.read_file_in_workdir('result.dat'))
            resultsList.append(result)
            finished_jobs.append(job.job_id)
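The pruned loop above only looks for ``JOB_FINISHED`` jobs. Since ``END_STATES``
is imported alongside ``BalsamJob``, a driver will typically also want to notice
evaluations that reached a terminal state without succeeding. A minimal sketch of
such a check (assuming ``END_STATES`` contains ``JOB_FINISHED`` together with the
failure states) might look like::

    # Jobs that reached a terminal state but did not finish successfully
    failed_jobs = (BalsamJob.objects
                   .filter(job_id__in=my_jobs, state__in=END_STATES)
                   .exclude(state="JOB_FINISHED")
                   .exclude(job_id__in=finished_jobs))

    for job in failed_jobs:
        # How to react is up to the driver: re-create the job, skip the point,
        # or report a penalty value back to the optimizer.
        finished_jobs.append(job.job_id)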
The strength of Balsam in this workflow is that it decouples the optimizer almost
entirely from the evaluation of points. The ``eval_point`` jobs take some
input JSON specification; besides this, they are free to run arbitrary code as
single-core or multi-node MPI jobs. The Balsam launcher and job database allow
an ensemble of serial and MPI jobs to run concurrently, and they are robust to
allocation time-outs or unexpected failures. The driver and/or jobs can be
killed at any time and restarted, provided the driver itself is checkpointing
data as necessary.
Frequently Asked Questions
==========================
Why isn't the launcher running my jobs?
---------------------------------------------
Check the log for how many workers the launcher is assigning jobs to. It may be
that a long-running job is hogging more nodes than you think, and there aren't enough
idle nodes to run any jobs. Also, the launcher will only start running jobs that have an
estimated run-time below the allocation's remaining wall-time.
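One quick way to check, sketched below, is to query the job database for
unfinished jobs and inspect their node counts and requested wall-times
(``num_nodes`` and ``wall_time_minutes`` are the same fields used when jobs are
created; the exact set of in-flight states may differ in your setup)::

    from balsam.launcher import dag

    # Print every job that has not yet finished, with its size and time estimate
    for job in dag.BalsamJob.objects.exclude(state="JOB_FINISHED"):
        print(job.name, job.state, job.num_nodes, job.wall_time_minutes)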
Where does the output of my jobs go?
---------------------------------------
Look in the ``data/`` subdirectory of your :doc:`Balsam database directory
<../quick/db>`. Jobs are organized into folders according to the name of their
workflow, and each job's working directory is in turn given a unique name
derived from the job name and UUID.
All stdout/stderr from a job is directed into the file ``<jobname>.out``, along with job timing
information. Any files created by the job will be placed in its working directory, unless another
location is specified explicitly.
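For illustration only (the UUID suffix and file names will differ), a job named
``task0`` belonging to the ``dl-hps`` workflow would end up with a layout along
these lines::

    <balsam-db-dir>/data/
        dl-hps/
            task0_<uuid>/
                task0.dat     # staged-in input file
                task0.out     # stdout/stderr plus job timing information
                result.dat    # any files created by the job itself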
How can I move the output of my jobs to an external location?
--------------------------------------------------------------------
This is easy to do with the "stage out" feature of BalsamJobs.
You need to specify two fields, ``stage_out_url`` and ``stage_out_files``,
either from the ``balsam job`` command line interface or as arguments to
``dag.create_job()`` or ``dag.spawn_child()``.
stage_out_url
    Set this field to the location where you want files to go. Balsam supports
    a number of protocols for remote and local transfers (scp, GridFTP, etc.).
    If you just want the files to move to another directory in the same file system, use
    the ``local`` protocol like this::

        stage_out_url="local:/path/to/my/destination"
stage_out_files
    This is a whitespace-separated list of shell file-patterns, for example::

        stage_out_files='result.out *.log simulation*.dat'

    Any file matching any of the patterns in this field will get copied to the
    ``stage_out_url``.
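Putting the two fields together, a new job that ships its results to another
directory might be created like this (the paths, job name, and application
below are placeholders for illustration)::

    from balsam.launcher import dag

    job = dag.add_job(name="sim0", workflow="demo",
                      application="eval_point",
                      stage_out_url="local:/projects/myproject/results",
                      stage_out_files="result.dat *.log")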
How can I control the way an application runs in my workflow?
------------------------------------------------------------------
There are several optional fields that can be set for each BalsamJob. These
fields can be set at run-time, during the dynamic creation of jobs, which
gives a lot of flexibility in the way an application is run (a short example
follows the list below).
application_args
    Command-line arguments passed to the application.

environ_vars
    Environment variables to be set for the duration of the application's execution.

input_files
    Which files are "staged in" from the working directories of parent jobs. This
    follows the same shell file-pattern format as the ``stage_out_files`` field
    mentioned above. It is intended to facilitate data-flow from parent to child
    jobs in a DAG, without resorting to stage-out functionality.

preprocess and postprocess
    You can override the default pre- and post-processing scripts which run before and after
    the application is executed. (The default processing scripts are defined alongside the
    application.)
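As a rough sketch, a dynamically created job might set several of these fields
at once. The application name, argument string, and the single-variable
``environ_vars`` value below are assumptions for illustration, not taken from a
real workflow (in particular, the separator used for multiple environment
variables may differ)::

    import balsam.launcher.dag as dag

    child = dag.spawn_child(name="analyze0", workflow="dl-hps",
                            application="eval_point",
                            application_args="task0.dat --verbose",
                            input_files="task0.dat *.log",      # staged in from parent workdirs
                            environ_vars="OMP_NUM_THREADS=4",   # single variable; multi-var format may differ
                            wait_for_parents=True)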
I want my program to wait on the completion of a job it created.
-----------------------------------------------------------------
If you need to wait for a job to finish, you can set up a polling function like the following::
    from balsam.launcher import dag
    import time

    def poll_until_state(job, state, timeout_sec=60.0, delay=5.0):
        start = time.time()
        while time.time() - start < timeout_sec:
            time.sleep(delay)
            job.refresh_from_db()
            if job.state == state:
                return True
        return False
Then you can check for any state with a specified maximum waiting time and delay.
For finished jobs, you can do::
    newjob = dag.add_job( ... )
    success = poll_until_state(newjob, 'JOB_FINISHED')
There is a convenience function for reading files in a job’s working directory::
    if success:
        output = newjob.read_file_in_workdir('output.dat')  # contents of the file as a string
Querying the Job database
---------------------------
You can perform complex queries on the BalsamJob database thanks to Django. If
you ever need to filter the jobs according to some criteria, the entire
database is available via ``dag.BalsamJob``.
See the `official documentation
<https://docs.djangoproject.com/en/2.0/topics/db/queries>`_ for lots of
examples, which directly apply wherever you can replace ``Entry`` with
``BalsamJob``. For example, say you want to filter for all jobs containing
“simulation” in their name, but exclude jobs that are already finished::
    from balsam.launcher import dag

    BalsamJob = dag.BalsamJob
    pending_simulations = BalsamJob.objects.filter(name__contains="simulation").exclude(state="JOB_FINISHED")
You could count this query::

    num_pending = pending_simulations.count()

Or iterate over the pending jobs and kill them::

    for sim in pending_simulations:
        dag.kill(sim)
Useful command lines
----------------------
Create a dependency between two jobs::

    balsam dep <parent> <child>   # where <parent>, <child> are the first few characters of the job IDs
    balsam ls --tree              # see a tree view showing the dependencies between jobs

Reset a failed job's state after some changes were made::

    balsam modify jobs b0e --attr state --value CREATED   # where b0e is the first few characters of the job id

See the state history of your jobs and any error messages that were recorded while the job ran::

    balsam ls --hist | less

Remove all jobs with the substring "task" in their name::

    balsam rm jobs --name task
Useful Python scripts
----------------------
You can use the ``balsam.launcher.dag`` API to automate a lot of tasks that
might be tedious from the command line. For example, say you want to
**delete** all jobs that contain "master" in their name, but reset all jobs
that start with "task" to the "CREATED" state, so they may run again::
    import balsam.launcher.dag as dag

    dag.BalsamJob.objects.filter(name__contains="master").delete()

    for job in dag.BalsamJob.objects.filter(name__startswith="task"):
        job.update_state("CREATED")
.. toctree::
    :maxdepth: 2
    :caption: Quickstart

    quick/quickstart.rst
    quick/tutorial.rst
    quick/db.rst
    quick/hello.rst
.. toctree::
    :maxdepth: 2
    :caption: User Guide

    userguide/overview
    userguide/tutorial.rst
    userguide/dag
.. _dev_docs:
    devguide/roadmap
    devguide/launcher
.. toctree::
    :maxdepth: 2
    :caption: Use Cases

    example/dl-hps.rst
    example/recipes.rst
Indices and tables
The Balsam Database
===================
Every Balsam instance is associated with a **database** of jobs. If you use the
Balsam scheduling service, it will package and submit these jobs for concurrent
ensemble execution. You can also bypass the service and directly run the Balsam
launcher, which is the pilot responsible for processing and executing jobs
concurrently.
When Balsam is first installed, a SQLite database is created in the default **Balsam DB directory**.
This directory is named ``default_balsamdb`` and located in the top level of the Balsam repo.
Each Balsam DB directory contains the following files:
db.sqlite3
    The actual database (in the case of SQLite, a single file) on disk. For
    heavy-duty databases like Postgres, this file is replaced by an entire
    directory containing the various database files.

dbwriter_address
    JSON datastore used to coordinate database clients and servers.

data/
    Directory containing all the output and files generated by your jobs. The
    data/ directory is organized into subdirectories for each workflow name. All
    jobs assigned to the same workflow will be placed in the same folder. This
    workflow folder then contains all the working directories of its jobs,
    which are given unique names based on the given job name.

log/
    Log files, which are very useful for monitoring the state and health of the Balsam
    applications, are written into this logging subdirectory.
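For orientation, a freshly used SQLite DB directory with a single workflow might
look roughly like this (the workflow and job names are illustrative)::

    default_balsamdb/
        db.sqlite3
        dbwriter_address
        data/
            dl-hps/
                task0_<uuid>/
                task1_<uuid>/
        log/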
Creating a new Balsam DB
--------------------------
You may want to set up one or more Balsam DB directories for different
projects; or perhaps one SQLite DB for testing and a heavier-duty one for
production. Create a new Balsam DB simply by specifying the path of the new
directory::
    $ balsam init ~/testdb --db-type sqlite3
This will create a new directory and populate it with the files mentioned
above. If you ever ``git pull`` and the database schema goes out-of-date,
you can also use this command to refresh the existing DB.
Starting up the Balsam DB Server
----------------------------------
With SQLite3, processes that use the Balsam DB can simply make a direct connection by opening the file on disk.
This approach breaks down when independent processes (e.g. the metascheduler and launcher) or user applications
try to **write concurrently** to the database. Because SQLite+Django has poor support for concurrency, Balsam
provides a DB proxy service, which runs in front of the database and serializes all calls to write through a
ZeroMQ socket. Start and stop the DB server with::
    $ balsam dbserver --path=~/testdb
    $ balsam dbserver --stop
This will run a "watchdog" process that keeps the server alive and establishes
the appropriate port for TCP communication to the server. All programs in your
workflow will automatically communicate with this server instead of trying to
write directly to the DB.
In the case of Postgres or MySQL, running the **dbserver command is not
optional**, because **all** DB read and write queries are handled by the
server! Fortunately, the details of setting up and running a DB server are
completely encapsulated by the commands ``balsam init`` and ``balsam
dbserver``.
Specifying which DB to use
----------------------------
If you routinely want to use a Balsam DB other than ``default_balsamdb/``,
you can change the default path in ``balsam/user_settings.py``. Set the variable
``default_db_path`` at the top of the file to the full path of your DB
directory.
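For example, the top of ``balsam/user_settings.py`` might then read (the path is
just a placeholder)::

    default_db_path = '/home/me/projects/my_balsam_db'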
To override the ``default_db_path`` for a particular session, you can set the environment
variable ``BALSAM_DB_PATH``. Either export this variable or prefix your command lines with
it, for instance::
    $ BALSAM_DB_PATH=~/testdb balsam ls   # will list jobs in the ~/testdb location
    $
    $ # or equivalently...
    $ export BALSAM_DB_PATH=~/testdb
    $ balsam ls
Hello World and Testing
=========================
Hello World (on Cooley)
------------------------
The launcher pulls jobs from the database and invokes MPI to run the jobs.
To try it out interactively, grab a couple nodes on Cooley and remember to
load the appropriate environment::
    qsub -A datascience -n 2 -q debug -t 30 -I
    soft add +anaconda
    source activate balsam
The **balsam** command-line tool will have been added to your path.
There are a number of sub-commands to try; to explore the options, use
the ``--help`` flag::
    balsam --help
    balsam ls --help
Let's set up a Balsam DB in our home directory for testing and start up a DB server to
manage that database::

    balsam init ~/testdb
    export BALSAM_DB_PATH=~/testdb
    balsam dbserver --path ~/testdb
With the ``BALSAM_DB_PATH`` environment variable set, all ``balsam`` programs will refer to this
database. Now let's create a couple dummy jobs and see them listed in
the database::
    balsam ls             # no jobs in ~/testdb yet
    balsam qsub "echo hello world" --name hello -t 0
    balsam make_dummies 2
    balsam ls --hist      # history view of jobs in ~/testdb
Useful log messages will be sent to ``~/testdb/log/`` in real time. You can
change the verbosity, and many other Balsam runtime parameters, in
``balsam/user_settings.py``. Finally, let's run the launcher::

    balsam launcher --consume --time 0.5   # run for 30 seconds
    balsam ls --hist                       # jobs are now done
    balsam ls --verbose
    balsam rm jobs --all
Comprehensive Test Suite
------------------------
The **balsam-test** command line tool invokes tests in the ``tests/`` directory.
You can run specific tests by passing the test module names, or run all of them
just by calling **balsam-test** with no arguments. You can provide the ``--temp`` parameter
to run certain tests in a temporary test directory::
    $ balsam-test tests.test_dag --temp   # this should be quick

You should see this at the end of the test output::

    ----------------------------------------------------------------------
    Ran 3 tests in 1.575s

    OK
To run the comprehensive set of unit and functional tests, you must create a
persistent test DB and run a DB server in front of it. To help prevent users
from running tests in their production database, Balsam requires that the DB
directory contains the substring "test". (The ``~/testdb`` DB created above
would suffice here). Be sure that a DB server is running in front of the test
database::
    export BALSAM_DB_PATH=~/testdb
    balsam dbserver --path ~/testdb
    balsam-test      # the test_functional module might take over 10 minutes!
Install Balsam
==========================
.. note::
    If you are reading this documentation from GitHub/GitLab, some of the example code
    is not displayed! Take a second and build this documentation on your
    own machine (until it's hosted somewhere accessible from the internet)::

        $ pip install --user sphinx
        $ cd docs
        $ make html
        $ firefox _build/html/index.html   # or navigate to the file from a browser
Prerequisites
-------------
Balsam requires **Python 3.6** and **mpi4py** to be set up and running prior to installation.
The best approach is to set up an Anaconda environment just for Balsam. This environment can
easily be cloned or extended to suit the needs of your own workflows that use Balsam.
.. note::
    A working **mpi4py** installation is somewhat system-dependent and therefore this
    dependency is not packaged with Balsam. Below are guidelines to get it set up
    on a few systems where Balsam has been tested.
Mac OS X
^^^^^^^^^

Cooley (@ALCF)
^^^^^^^^^^^^^^^

::

    $ soft add +anaconda
    $ conda config --add channels intel
    $ conda create --name balsam_cooley intelpython3_full python=3
    $ source activate balsam_cooley # mpi4py just works
Theta (@ALCF)
^^^^^^^^^^^^^^^^^^^^^^^
::

    $ cp /opt/cray/pe/mpt/7.6.0/gni/mpich-intel-abi/16.0/lib/libmpi* ~/.conda/envs/balsam/lib/ # need to link to intel ABI
    $ export LD_LIBRARY_PATH=~/.conda/envs/balsam/lib:$LD_LIBRARY_PATH # add to .bash_profile
.. warning::
    If running Balsam on two systems with a shared file system, keep in mind
    that a **separate** conda environment should be created for each (e.g.
    balsam_theta and balsam_cooley).
Environment
-----------
Before installing Balsam, and whenever you subsequently use it, remember that the appropriate
environment must be loaded! Thus, for every new login session or in each job submission script, be sure
to do the following:
Mac OS X
^^^^^^^^^
.. code:: bash

    source activate balsam
Cooley
^^^^^^^^^
.. code:: bash

    soft add +anaconda
    source activate balsam_cooley
Theta
^^^^^^^^^
.. code:: bash

    source ~/.bash_profile # this is not auto-sourced on MOM nodes
    source activate balsam
Get Balsam
-----------
Check out the development branch of Balsam:
.. code:: bash
    git clone git@xgitlab.cels.anl.gov:turam/hpc-edge-service.git
    cd hpc-edge-service
    git checkout develop
Pip/setuptools will take care of the remaining dependencies (``django``, etc...) and run the
necessary code to set up the default Balsam database.
.. code:: bash
    pip install -e . # your balsam environment is already loaded
Quick Tests
-------------
The ``balsam-test`` command-line utility will have been added to your path. To
check the installation, try running one of the quick tests. The ``--temp`` parameter
creates a temporary test database for the duration of the unit tests::
    $ balsam-test --temp tests.test_dag
Workflow Tutorial
===================
The goal of Balsam is to manage and optimize workflow execution with a minimum of user intervention,
Let's first write these mock applications in Python. Create a new folder and populate
it with the following scripts:
.. literalinclude:: balsam_tutorial/reduce.py
    :caption: reduce.py
We can set up a Balsam DB just for this tutorial::

    $ source activate balsam
    $ balsam init tutorial_db
    $ export BALSAM_DB_PATH=$PWD/tutorial_db
    $ balsam dbserver --path=tutorial_db
.. note::
    This dbserver must be reachable from all the Balsam processes. On
    Theta@ALCF, you can run these commands on a login node. Jobs running on the
    MOM/compute nodes will automatically reach back to this server, as long as
    ``BALSAM_DB_PATH`` is set correctly in the batch script/interactive session.
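As an illustration, a minimal Cobalt batch script for Theta could look like the
following; the project, node count, and times are placeholders, and the paths
assume the environment described in the installation guide::

    #!/bin/bash
    #COBALT -A datascience -n 2 -t 30 -q debug-cache-quad

    source ~/.bash_profile           # puts conda on the PATH
    source activate balsam

    export BALSAM_DB_PATH=/path/to/tutorial_db   # the DB the dbserver is fronting
    balsam launcher --consume --time 25          # leave some margin before the walltime ends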
Writing dynamic workflows
^^^^^^^^^^^^^^^^^^^^^^^^^^^
from os import path
import os
import time

def auto_setup_db():
    here = path.abspath(path.dirname(__file__))
    default_db_path = os.path.join(here, 'default_balsamdb')
    try:
        os.mkdir(default_db_path, mode=0o755)
    except OSError:
        # The default DB directory already exists: remove it and recreate it
        from shutil import rmtree
        rmtree(default_db_path)
        os.mkdir(default_db_path, mode=0o755)
    time.sleep(1)
    cwd = os.getcwd()
    os.chdir(default_db_path)