Host Management

ecFlow is ultimately a framework for executing tasks, but task execution requires a context. pyflow uses a Host object to supply that context, and therefore requires a Host object to be defined before it will generate any executable nodes in the tree. The host can be set at any level (Suite, Family or Task) and is inherited unless overridden.

If the default behaviour of ecFlow is required, and task execution is being managed explicitly, the host may be set to NullHost() at the Suite level. This will suppress all host-related behaviour inside pyflow.

For task handling, it is important that the ecflow_client is configured (via appropriate environment variables) and that it is correctly called to trigger changes of state in the server. Further, any and all errors that may occur in a script must be correctly caught and reported to the ecFlow server.
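In outline, this is the classic ecFlow head/tail wrapping of the job script. The following is a simplified, self-contained sketch of that pattern - the scripts pyflow actually generates differ in detail, and ecflow_client is stubbed here so the sketch runs anywhere:

```shell
#!/bin/bash
# Simplified sketch of the error-trapping pattern used in generated jobs.
# The real ecflow_client binary is stubbed out so this runs standalone;
# in a real job it talks to the ecFlow server via the ECF_* environment.
ecflow_client() { ECF_CALLS="${ECF_CALLS}${1} "; echo "ecflow_client $*"; }

set -e                          # abort the script on the first failing command
ERROR() {
  set +e                        # avoid recursive trap invocation
  ecflow_client --abort         # report the failure to the ecFlow server
  exit 1
}
trap ERROR ERR                  # catch failing commands
trap ERROR HUP INT QUIT TERM    # catch signals (cf. the trap_signals argument)

ecflow_client --init "$$"       # tell the server the task is active
echo "task body runs here"
ecflow_client --complete        # tell the server the task finished cleanly
trap - ERR HUP INT QUIT TERM    # clear traps on normal completion
```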

Host objects must also know how to transfer data to/from the host to be able to implement the Deployable Resources functionality.

Host Arguments

Host classes have many configurable options, but some are common to all host classes and configure the base Host class. Other than name, all of these are optional keyword arguments with sensible defaults.

  • name - the name used for the host. Required (positional argument).

  • hostname - The hostname to run the task on. Defaults to name if not supplied.

  • scratch_directory - The path in which tasks will be run, unless otherwise specified. Also to be used within suites when a scratch location is needed.

  • log_directory - The directory to use for script output. Defaults to ECF_HOME, but may need to be changed on systems with scheduling systems to make the output visible to the ecFlow server.

  • resources_directory - The directory to use for suite resources. By default, scratch_directory is used.

  • limit - How many tasks can run on the node simultaneously.

  • extra_paths - Paths that are to be added to PATH on the host.

  • extra_variables - A dictionary of additional ECFLOW variables that should be set to configure the host (e.g. {'SCHOST': 'hpc'}).

  • environment_variables - Additional environment variables to export into all scripts.

  • module_source - The shell script to source to initialise the module system. Default None.

  • modules - Modules to load (via module load).

  • purge_modules - Should a module purge command be run (before loading any modules). Default False.

  • label_host - Whether to create an exec_host label on nodes where this host is freshly set. Default True.

  • user - The user running the script. May be used to determine paths, or for login details. Defaults to current user.

  • ecflow_path - The directory containing the ecflow_client executable.

  • server_ecfvars - If true, do not define the ECF_JOB_CMD, ECF_KILL_CMD, ECF_STATUS_CMD and ECF_OUT variables; the server defaults are used instead.

  • submit_arguments - A dictionary of arguments to pass to the scheduler when submitting jobs, in which each key is a label that can be referenced when creating tasks with the Host instance.

  • workdir - Work directory for every task executed within the Host instance, if not overridden for a Node.

  • trap_signals - The list of signals to trap. A default list is used if not set.
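To make a few of these options concrete, here is a minimal, hypothetical sketch - not pyflow's actual implementation - of how module_source, purge_modules, modules, extra_paths and environment_variables could be rendered into the preamble of a generated script (all paths and module names below are placeholders):

```python
# Hypothetical sketch: render a handful of host options into a script
# preamble. This is NOT pyflow's implementation; the parameter names
# simply mirror the Host arguments described above.

def render_preamble(module_source=None, purge_modules=False, modules=(),
                    extra_paths=(), environment_variables=None):
    lines = []
    if module_source:
        lines.append(f". {module_source}")            # initialise the module system
    if purge_modules:
        lines.append("module purge")                  # clean slate before loading
    for mod in modules:
        lines.append(f"module load {mod}")
    for path in extra_paths:
        lines.append(f'export PATH="{path}:$PATH"')   # prepend to PATH
    for key, value in (environment_variables or {}).items():
        lines.append(f'export {key}="{value}"')
    return "\n".join(lines)

preamble = render_preamble(
    module_source="/etc/profile.d/modules.sh",
    purge_modules=True,
    modules=["ecflow", "python3"],
    extra_paths=["/usr/local/apps/tools/bin"],
    environment_variables={"STORAGE_ROOT": "/data/scratch"},
)
print(preamble)
```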

Existing Host Classes

A number of host classes have already been defined. These can be extended, and alternatives provided.

LocalHost

This is essentially a trivial host. It runs tasks as background processes on the current node - i.e. on the machine running the ecFlow server, as the same user as the server. Beyond its use in examples, this is extremely useful for running tasks that update labels, meters, events and variables, on a node that is certain to have ecflow_client working correctly and with no job queuing delay.

[2]:
host = pf.LocalHost()

SSHHost

Run a script on a remote host, accessed via SSH. The name argument is treated as the target hostname unless the hostname keyword argument is explicitly supplied. By default the user that generated the pyflow suite is used, unless the user argument is supplied.

The SSHHost is special in that it does not require the ecflow_client to be installed on the remote host and does not require the presence of any shared filesystems or log servers to make output logs visible to the user. All of the ecflow_client commands required are executed on the server side, and the script output is piped back through the SSH command.

For these connections to be established, it is necessary that the ecflow server is configured to have SSH access to the target systems using SSH keys. Further, as this requires an SSH connection to be maintained for each of the running commands, it imposes a practical limit on the number of commands that can be run simultaneously on any remote host. There may be value in setting up SSH connections that persist across multiple commands, by making use of the ControlMaster, ControlPath and ControlPersist options in the ssh config file.
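For example, a host entry along these lines in ~/.ssh/config enables connection sharing (the hostname here is a placeholder, and the timings are only a suggestion):

```
Host dhs9999
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```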

[3]:
host = pf.SSHHost('dhs9999', user='max', scratch_directory='/data/a_mounted_filesystem/tmp')

The SSHHost class can also take the additional optional arguments indirect_host and indirect_user. If indirect_host is supplied then a two-hop connection is made: an SSH connection is made to the indirect_host, and from there a further SSH connection is made to the real host. Note that this is not the same as using a ProxyCommand configured for a normal SSH connection - the credentials for the second hop are held on the intermediate system. indirect_user defaults to user if it is not supplied.

[4]:
host = pf.SSHHost('cloud-mvr001',
                  user='mover-user',
                  indirect_host='cloud-gateway',
                  indirect_user='cloud-user')

PBSHost

Connects to a remote host by SSH, and submits a job on the batch scheduling system. As this task will run asynchronously on a remote system this requires the ecflow_client to be available, and if it is not at the default location this should be configured with the ecflow_path keyword argument.

It is anticipated that for real use this class will be subclassed to add and configure site-specific functionality (such as knowledge, and handling, of queues).

It is likely that the log_directory will need to be modified, and the ECF_LOGHOST and ECF_LOGPORT variables will likely need to be set so that a log server can make the job output fully visible to the ecFlow server.

SlurmHost

This executes scripts on a remote system, by connecting via SSH and submitting to the SLURM job scheduling system. It is closely analogous to the PBSHost.

Limits

Host objects accept an argument limit=. This can be used to construct a limit (preferably in a sensible location within the suite). Once this has been set up, any Task that is created using this host object will automatically be added to the limit for the given host.

Note that this implies that the same host object should be used to configure Tasks throughout the suite, rather than just using host objects that refer to the same host.

[6]:
with CourseSuite('limits', host=pf.LocalHost(limit=3)) as s:

    with pf.Family('limits'):
        s.host().build_limits()

    pf.Task('t1', script='I am limited')

s
[6]:
suite limits
  defstatus suspended
  edit ECF_FILES '/path/to/scratch/files/limits'
  edit ECF_HOME '/path/to/scratch/out'
  edit ECF_JOB_CMD 'bash -c 'export ECF_PORT=%ECF_PORT%; export ECF_HOST=%ECF_HOST%; export ECF_NAME=%ECF_NAME%; export ECF_PASS=%ECF_PASS%; export ECF_TRYNO=%ECF_TRYNO%; export PATH=/usr/local/apps/ecflow/%ECF_VERSION%/bin:$PATH; ecflow_client --init="$$" && %ECF_JOB% && ecflow_client --complete || ecflow_client --abort ' 1> %ECF_JOBOUT% 2>&1 &'
  edit ECF_KILL_CMD 'pkill -15 -P %ECF_RID%'
  edit ECF_STATUS_CMD 'true'
  edit ECF_OUT '%ECF_HOME%'
  label exec_host "localhost"
  family limits
    limit localhost 3
  endfamily
  task t1
    inlimit /limits/limits:localhost
endsuite

Job Characteristics

In pyflow, a task is generated as a synthesis of multiple pieces of information:

  • The Task object in the suite - when to run

  • The Script object (script attribute on Task) - what to run

  • The Host object - how to run

The combination of these three components provides the information to determine when, what, and how a task should be executed. The Host object is important as it provides two major components:

  1. A mechanism by which a task should be executed. This reduces to the ECF_JOB_CMD and associated machinery.

  2. Preamble and postamble material that is used for constructing the script to execute.

Unfortunately, the breakdown is not nearly so clear in real life. Consider the case of one of the HPC machines. We can:

  • Run a task on the head node as a simple SSHHost

  • Submit a serial, fractional or parallel job

  • Submit jobs using various (machine specific) resource requirements

This is a problem. Conceptually, properties such as the number of cores and nodes, or whether to use hyperthreading or huge pages, are properties of the Task - but they depend very strongly on the Host.

Currently all properties that determine the execution process must belong to the Host. These can be parameterised to use ecFlow variables that are set on Families or Tasks, but this is a bit of a hack. We would like this parameterisation to only be needed if those properties should be changeable at runtime (e.g. by the operators).

The Host submit_arguments dictionary is used to pass arguments to the scheduler when submitting jobs. Each key in this dictionary is a label that can be referenced when creating tasks with the Host instance. This allows for flexible job submission configurations based on the host’s capabilities and requirements.
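As an illustration - this is not pyflow's actual code - the mapping from one submit_arguments entry to SLURM scheduler directives can be pictured like this:

```python
# Illustrative only: sketches how a submit_arguments entry might be
# rendered into SLURM #SBATCH directives; not pyflow's implementation.

submit_arguments = {
    "simple_jobs": {"job_name": "%TASK%", "partition": "compute", "time": "01:00:00"},
}

def sbatch_directives(label, table=submit_arguments):
    opts = table[label]
    # SLURM option names use hyphens where the dictionary keys use underscores
    return [f"#SBATCH --{key.replace('_', '-')}={value}"
            for key, value in opts.items()]

directives = sbatch_directives("simple_jobs")
print("\n".join(directives))
```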

Example

from pyflow import Suite, Task, SlurmHost
# Create a suite whose tasks run on a SLURM host
suite = Suite(
    "example_suite",
    host=SlurmHost(
        name="slurm_host",
        submit_arguments={"simple_jobs": {"job_name": "%TASK%", "partition": "compute", "time": "01:00:00"}},
        workdir="$JOBSWDIR",
    ),
)
with suite:
    # Add a task to the suite
    Task("example_task", script="echo 'Hello, World!'", submit_arguments="simple_jobs")

The above code will generate a task that runs the command echo 'Hello, World!' on a SLURM-managed host. The task will be submitted with the specified job name, partition, and time limit. The generated script will look something like this:

#!/bin/bash
# This file is generated by pyflow
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --job-name=%TASK%

[[ -d "$JOBSWDIR" ]] || mkdir -p "$JOBSWDIR"
cd "$JOBSWDIR"
echo "Current working directory: $(pwd)"

%nopp
echo 'Hello, World!'
(...)