Using `f3dasm` on a High-Performance Cluster Computer

Your f3dasm workflow can be seamlessly translated to a high-performance computing cluster.
The advantage is that you can distribute the experiments across the nodes of the cluster and run them in parallel.
This is especially useful when you have a large number of experiments to run.

This example has been tested on the following high-performance computing cluster systems:

The hpc06 cluster of Delft University of Technology, using the TORQUE resource manager.

DelftBlue, the TU Delft supercomputer, using the SLURM resource manager.

The OSCAR compute cluster from Brown University, using the SLURM resource manager.

from time import sleep

import numpy as np

from f3dasm import (
    HPC_JOBID,
    ExperimentData,
    create_datagenerator,
    create_sampler,
)
from f3dasm.design import make_nd_continuous_domain

We will create the following data-driven process:

Create a 20D continuous Domain.
Sample from the domain using a Latin-hypercube sampler.
Evaluate the samples across multiple nodes with a data generation function, here the "Ackley" function from the benchmark functions.

We want to ensure that the sampling is done only once, and that the data generation is performed in parallel.
Therefore, we can divide the different nodes into two categories:

The first node (f3dasm.HPC_JOBID == 0) will be the master node, responsible for creating the design-of-experiments and sampling (the create_experimentdata function).

def create_experimentdata():
    """Design of Experiment"""
    # Create a domain object
    domain = make_nd_continuous_domain(
        bounds=np.tile([0.0, 1.0], (20, 1)), dimensionality=20
    )

    # Create the ExperimentData object
    data = ExperimentData(domain=domain)

    sampler = create_sampler(sampler="latin", seed=42)

    # Sampling from the domain
    data = sampler.call(data, n_samples=10)

    # Store the data to disk
    data.store()
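
As an illustration of what the Latin-hypercube step does, the following standard-library sketch splits each dimension into `n_samples` equal strata and places exactly one point in each. `latin_hypercube` is a hypothetical helper written for this page, not the sampler that `create_sampler` returns, whose implementation may differ.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=None):
    """Return n_samples points in [0, 1)^n_dims, one per stratum in every dimension."""
    rng = random.Random(seed)
    # Start with one empty row per sample
    samples = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        # Shuffle the stratum indices independently for each dimension
        strata = list(range(n_samples))
        rng.shuffle(strata)
        for i, s in enumerate(strata):
            # Draw uniformly inside stratum s, i.e. [s/n, (s+1)/n)
            samples[i][d] = (s + rng.random()) / n_samples
    return samples


# Same shape as the example above: 10 samples in a 20D unit hypercube
points = latin_hypercube(n_samples=10, n_dims=20, seed=42)
```

Because every stratum of every dimension is hit exactly once, the 10 points cover each axis evenly even though 10 samples are far too few to fill a 20D grid.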

All the other nodes (f3dasm.HPC_JOBID > 0) will be process nodes, which will retrieve the ExperimentData from disk and proceed directly to the data generation function.

def worker_node():
    """Data Generation"""
    # Extract the experimentdata from disk
    data = ExperimentData.from_file(project_dir=".")

    # Use the data-generator to evaluate the initial samples
    data_generator = create_datagenerator(data_generator="Ackley")

    data_generator.arm(data=data)

    data = data_generator.call(data=data, mode="cluster")
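
For reference, the textbook form of the Ackley function can be sketched with the standard library alone. This is an assumption about what the benchmark computes: the "Ackley" generator built into f3dasm may scale or offset its inputs, so its values need not match this sketch exactly. The constants `a`, `b`, and `c` are the commonly used defaults.

```python
import math


def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    """Textbook Ackley function; its global minimum is 0 at the origin."""
    d = len(x)
    mean_sq = sum(xi * xi for xi in x) / d
    mean_cos = sum(math.cos(c * xi) for xi in x) / d
    return -a * math.exp(-b * math.sqrt(mean_sq)) - math.exp(mean_cos) + a + math.e


# Evaluate at the origin (the global minimum) and at another 20D point
value_at_origin = ackley([0.0] * 20)
value_elsewhere = ackley([0.5] * 20)
```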

The entrypoint of the script can now check the jobid of the current node and decide whether to create the experiment data or to run the data generation function:

if __name__ == "__main__":
    # Check the jobid of the current node
    if HPC_JOBID is None:
        # No job ID was passed, so there is nothing to run
        pass

    elif HPC_JOBID == 0:
        create_experimentdata()
        worker_node()
    elif HPC_JOBID > 0:
        # Stagger the jobs to avoid race conditions
        sleep(HPC_JOBID)
        worker_node()
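
Staggering the start times reduces contention, but a fixed delay cannot by itself guarantee that job 0 has finished storing the data before a worker reads it. A more defensive variant is to poll for the stored output. The sketch below uses a hypothetical helper, `wait_for_path`, that is not part of f3dasm. Which path to wait for depends on how `ExperimentData.store` lays out its files, which is not specified here.

```python
import time
from pathlib import Path


def wait_for_path(path, timeout=60.0, poll_interval=0.5):
    """Block until `path` exists; return True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if Path(path).exists():
            return True
        time.sleep(poll_interval)
    # One final check so a path created during the last sleep is not missed
    return Path(path).exists()
```

A worker could call `wait_for_path(...)` before `worker_node()` and exit early if it returns `False`. Note that a path existing does not prove the write has completed, so this narrows the race window rather than eliminating it.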

Running the Program

You can run the workflow by submitting the bash script to the HPC queue.
Make sure you have miniconda3 installed on the cluster and that you have created a conda environment (in this example named f3dasm_env) with the necessary packages.

TORQUE Example

#!/bin/bash
# Torque directives (#PBS) must always be at the start of a job script!
#PBS -N ExampleScript
#PBS -q mse
#PBS -l nodes=1:ppn=12,walltime=12:00:00

# Make sure I'm the only one that can read my output
umask 0077

# The PBS_JOBID looks like 1234566[0].
# With the following line, we extract the PBS_ARRAYID, the number inside the brackets []:
PBS_ARRAYID=$(echo "${PBS_JOBID}" | sed -n 's/.*\[\([0-9]*\)\].*/\1/p')

module load use.own
module load miniconda3
cd $PBS_O_WORKDIR

# Activate my conda environment:
source activate f3dasm_env

# Limit the number of threads
export OMP_NUM_THREADS=12

# If the PBS_ARRAYID is empty or not set, set it to None
if [ -z "${PBS_ARRAYID:-}" ]; then
    PBS_ARRAYID=None
fi

# Execute my Python program with the jobid flag
python main.py --jobid=${PBS_ARRAYID}
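
The script's comments describe pulling the array index out of a job ID such as `1234566[0]`. The same extraction can be checked offline with a short Python sketch. `array_index_from_jobid` is a hypothetical helper for illustration only, and the job ID format is taken from the comment in the script.

```python
import re


def array_index_from_jobid(pbs_jobid):
    """Return the integer inside the first [...] of a job ID, or None if absent."""
    match = re.search(r"\[(\d+)\]", pbs_jobid)
    return int(match.group(1)) if match else None


# Job ID format as given in the script's comment
index = array_index_from_jobid("1234566[0]")
```

A job ID without brackets yields `None`, which corresponds to the case where the script passes `--jobid=None`.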

SLURM Example

#!/bin/bash -l

#SBATCH -J "ExampleScript"                # Name of the job
#SBATCH --get-user-env                    # Set environment variables

#SBATCH --partition=compute
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --account=research-eemcs-me
#SBATCH --array=0-2

source activate f3dasm_env

# Execute my Python program with the jobid flag
python3 main.py --jobid=${SLURM_ARRAY_TASK_ID}

The following commands submit an array of 3 jobs, in which f3dasm.HPC_JOBID takes the values 0, 1, and 2.

TORQUE Example

qsub -t 0-2 pbsjob.sh

SLURM Example

sbatch --array 0-2 pbsjob.sh
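
To make the mapping from the `--jobid` flag to a node's role concrete, here is a minimal sketch. How f3dasm itself fills `HPC_JOBID` is not shown on this page, so `parse_jobid` and `role_for` are hypothetical helpers that only mirror the branching of the entrypoint above.

```python
import argparse


def parse_jobid(argv):
    """Return the --jobid value as an int, or None if missing, empty, or 'None'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--jobid", default=None)
    # Ignore any other arguments the script may receive
    args, _unknown = parser.parse_known_args(argv)
    if args.jobid in (None, "None", ""):
        return None
    return int(args.jobid)


def role_for(jobid):
    """Mirror the entrypoint: no job ID -> idle, 0 -> master, >0 -> worker."""
    if jobid is None or jobid < 0:
        return "idle"
    # In the entrypoint, job 0 also runs the worker step after creating the data
    return "master" if jobid == 0 else "worker"


# Roles for the three array jobs submitted above
roles = [role_for(parse_jobid([f"--jobid={i}"])) for i in range(3)]
```

For the array `0-2`, this gives one master and two workers, matching the two node categories described earlier.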
