Hellbender

Hellbender is a traditional HPC research cluster built at MU in 2023 to support the research efforts of UM researchers. Mizzou's Division of Research, Innovation & Impact (DRII) is the primary source of funding for the hardware and support of Hellbender. Hellbender is made up of 112 compute nodes, each containing 2 AMD 7713 processors and 512GB of RAM, and 17 GPU nodes, each containing 4 Nvidia A100 80GB GPUs, 2 Intel Xeon 6338 processors, and 256GB of system RAM.

DRII has made it clear that the mission of Hellbender is to accelerate Mizzou Forward initiatives. There are two access levels available to UM researchers: general access and priority access. General access is free and available to all UM researchers, and provides an equal share of at least 50% of the resources available to all users. Priority access provides dedicated access to some number of nodes on Hellbender and is available through either direct investment or DRII allocations. Direct investments are subsidized by DRII at a rate of 25% of the investment. For more specific details regarding access levels and costs for investment, please see the computing model document for Hellbender.

Requesting access to Hellbender can be done through our request form. Each form entry will need a faculty sponsor listed as the principal investigator (PI) for the group, who will be the primary contact for the request. The PI will be the responsible party for managing group members. The form entry can also request access to our Research Data Ecosystem (RDE) at the same time as an HPC request, or the RDE request can be made separately later if you find a need for it.

Cluster Status

Regular maintenance is scheduled for the 2nd Tuesday of every month. Jobs will run if they are scheduled to complete before the maintenance window begins; jobs that cannot finish in time will start once maintenance is complete.

  • More information can also be found in the cluster's message of the day (MOTD), which displays when you log in.

Subscribe to the RSS Announcement List for the latest information on scheduled maintenance, training, investment opportunities, and other news.

Request a Hellbender Account

Please follow the link below and fill out the form to request an account on Hellbender.

https://missouri.qualtrics.com/jfe/form/SV_e4E2UyI77SpiYMC

When logging in to Hellbender from within the campus network or VPN, you can use your UM System password with the userid that is assigned to you and shared in your initial welcome e-mail. If you plan on accessing Hellbender from off campus, we will need a password-protected SSH key pair to associate with your account.

Hellbender Investment Model

Overview

The newest High Performance Computing (HPC) resource, Hellbender, has been provided through a partnership with the Division of Research, Innovation and Impact (DRII) and is intended to work in conjunction with DRII policies and priorities. This outline provides definitions of how fairshare, general access, priority access, and researcher contributions will be handled for Hellbender. HPC has been identified as a continually growing need for researchers; as such, DRII has invested in Hellbender to be an institutional resource. This investment is intended to increase ease of access to these resources, provide cutting-edge technology, and grow the pool of resources available.

Fairshare

To understand how general access and priority access differ, fairshare must first be defined. Fairshare is an algorithm used by the scheduler to assign priority to jobs from users in a way that gives every user a fair chance at the resources available. The algorithm weighs several metrics for any given job waiting in the queue, such as job size, wait time, current and recent usage, and individual user priority levels. Administrators can tune the fairshare algorithm to adjust how it determines which jobs run next once resources become available.
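
As a quick illustration, Slurm's sshare command reports the usage and fairshare values the scheduler tracks for your account; this is a minimal sketch, and the exact fields shown may vary with the site configuration:

# Show raw usage, effective usage, and the resulting fairshare factor for your user
sshare -u $USER --format=Account,User,RawUsage,EffectvUsage,FairShare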

Resources Available to Everyone: General Access

General access will be open to any research or teaching faculty, staff, and students from any UM System campus. General access is defined as open access to all resources available to users of the cluster at an equal fairshare value. This means that all users will have the same level of access to the general resource. Research users of the general access portion of the cluster will be given the RDE Standard Allocation to operate from. Larger storage allocations will be provided through RDE Advanced Allocations and are independent of HPC priority status.

Hellbender Advanced: Priority Access

When researcher needs are not being met at the general access level, researchers may request an advanced allocation on Hellbender to gain priority access. Priority access gives research groups a limited set of resources that are available to them without competition from general access users. Priority access is provided to a specific set of hardware through a priority partition that contains these resources. This partition will be created and limited to use by the user and their associated group. These resources will also be in an overlapping pool of resources available to general access users. This pool is administered such that if a priority access user submits jobs to their priority partition, any jobs running on those resources from the overlapping partition will be requeued and begin execution again on another resource in that partition if one is available, or return to wait in the queue for resources. Priority access users retain general access status, and fairshare still plays a part in moderating their access to the general resource. Fairshare inside a priority partition determines which user's jobs are selected for execution next inside that partition. Jobs running inside a priority partition also affect a user's fairshare calculation for resources in the general access partition, meaning that running a large number of jobs inside a priority partition will lower a user's priority for the general resources as well.
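
To see how these factors combine for jobs you have waiting in the queue, Slurm's sprio command reports the individual priority components; this is a sketch, and the relative weight of each factor is set by the cluster administrators:

# List the priority components (fairshare, age, partition, etc.) for your pending jobs
sprio -u $USER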

Priority Designation

Hellbender Advanced Allocations are eligible for DRII Priority Designation. This means that DRII has determined the proposed use case (such as a core or grant-funded project) presents a strategic advantage or high priority service to the university. In this case, DRII fully subsidizes the resources used to create the Advanced Allocation.

Traditional Investment

Hellbender Advanced Allocation requests that are not approved for DRII Priority Designation may be treated as traditional investments with the researcher paying for the resources used to create the Advanced Allocation at the defined rate. These rates are subject to change based on the determination of DRII, and hardware costs.

Resource Management

Information Technology Research Support Solutions (ITRSS) will procure, set up, and maintain the resource. ITRSS will work in conjunction with MU Division of Information Technology and Facility Services to provide adequate infrastructure for the resource.

Resource Growth

Priority access resources will generally be made available from existing hardware in the general access pool, and the funds will be retained to allow a larger pool of funds to accumulate for expansion of the resource. This will allow the greatest return on investment over time. If the general availability resources fall below 50% of the overall resource, an expansion cycle will be initiated to ensure all users still have access to a significant amount of resources. If a researcher or research group contributes a large amount of funding, it may trigger an expansion cycle if that is determined to be advantageous at the time of the contribution.

Benefits of Investing

The primary benefit of investing is receiving “shares” and a priority access partition for you or your research group. Shares are used to calculate the percentage of the cluster owned by an investor. As long as an investor has used less than they own, investors will be able to use their shares to get higher priorities in the general queue. FairShare is by far the largest factor in queue placement and wait times.

Investors will be granted Slurm accounts to use in order to charge their investment (FairShare). These accounts can contain the same members as a POSIX group (storage group) or any other set of users, at the request of the investor.

To use an investor account in an sbatch script, use:

#SBATCH --account=<investor account>
#SBATCH --partition=<investor partition>                          # for CPU jobs
#SBATCH --partition=<investor partition>-gpu --gres=gpu:A100:1    # for GPU jobs: requests 1 A100 GPU

To use a QOS in an sbatch script, use:

#SBATCH --qos=<qos>
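
Putting these together, a job script header for a hypothetical investor account and GPU partition might look like the sketch below; lab_account and lab_partition are placeholders for the actual names RSS provides to your group:

#!/bin/bash
#SBATCH --account=lab_account           # hypothetical investor account
#SBATCH --partition=lab_partition-gpu   # hypothetical investor GPU partition
#SBATCH --gres=gpu:A100:1               # request one A100 GPU
#SBATCH --time=02:00:00                 # two-hour time limit

srun nvidia-smi                         # show the GPU assigned to the job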

HPC Pricing

The HPC Service is available at any time at the following rates for year 2023:

Service                     Rate        Unit          Support
Hellbender HPC Compute      $2,702.00   Per Node      Year to Year
GPU Compute                 $7,691.38   Per Node      Year to Year
High Performance Storage    $95.00      Per TB/Year   Year to Year
General Storage             $25.00      Per TB/Year   Year to Year
GPRS Storage                $7.00       Per TB/Month  Month to Month

Connecting

To connect to Hellbender please first make sure that you have an account. To get an account please fill out our account request form.

https://missouri.qualtrics.com/jfe/form/SV_e4E2UyI77SpiYMC

Once you have been notified by the RSS team that your account has been created on Hellbender, open a terminal and type ssh [SSO]@hellbender-login.rnet.missouri.edu. Using your UM System password, you will be able to log in directly to Hellbender if you are on campus or on the VPN. Once connected you will land on the login node and will see a screen similar to this: [HELLBENDER LANDING PAGE EXAMPLE].
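
For example, replacing SSO with the userid from your welcome e-mail:

# Connect to the Hellbender login node (from campus or the VPN)
ssh <SSO>@hellbender-login.rnet.missouri.edu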

You are now on the login node and are ready to proceed to submit jobs and work on the cluster.

SSH

If you won't primarily be connecting to Hellbender from on campus and do not want to use the VPN, another option is to use public/private key authentication. You can add SSH key pairs from any number of computers, and those computers will be able to access Hellbender from outside the campus network.

Generating an SSH Key on Windows

  1. To generate an SSH key on a Windows computer, you will need to first download a terminal program; we suggest MobaXterm (https://mobaxterm.mobatek.net/).
  2. Once you have MobaXterm downloaded, start a new session by selecting “Start Local Terminal”.
  3. [Insert local terminal mobaxterm image here]
  4. Type ssh-keygen and, when prompted, press enter to save the key in the default location /home/<username>/.ssh/id_rsa, then enter a strong passphrase (required) twice.
  5. After you generate your key, you will need to send us the public key. To see what your public key is, type: cat ~/.ssh/id_rsa.pub. The output will be a string of characters and numbers. Please copy this information and send it to RSS and we will add the key to your account.

Generating an SSH Key on MacOS/Linux

  1. Open your terminal application of choice.
  2. Type ssh-keygen and, when prompted, press enter to save the key in the default location /home/<username>/.ssh/id_rsa, then enter a strong passphrase (required) twice.
  3. After you generate your key, you will need to send us the public key. To see what your public key is, type: cat ~/.ssh/id_rsa.pub. The output will be a string of characters and numbers. Please copy this information and send it to RSS and we will add the key to your account.
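
For reference, the full sequence looks roughly like the sketch below (the same commands apply in a MobaXterm local terminal on Windows); the -i flag on the final ssh command is optional when the key is in the default location:

# Generate an RSA key pair; accept the default path and set a strong passphrase
ssh-keygen

# Print the public key so it can be copied and sent to RSS
cat ~/.ssh/id_rsa.pub

# Once RSS has added the key, connect from off campus using the key pair
ssh -i ~/.ssh/id_rsa <SSO>@hellbender-login.rnet.missouri.edu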

Hellbender Partitions

Partition Usage Guidelines

General

The general partition is intended for non-investors to run multi-node, multi-day jobs.

  • default time limit: 1 hour
  • maximum time limit: 7 days

Requeue

The requeue partition is intended for non-investor jobs that have been requeued because they landed on an investor-owned node.

  • default time limit: 10 minutes
  • maximum time limit: 7 days

Gpu

The Gpu partition is composed of Nvidia A100 cards (4 per node). Acceptable use includes jobs that utilize a GPU for the majority of the run.

  • default time limit: 1 hour
  • maximum time limit: 7 days

Interactive

This partition is designed for short interactive testing, interactive debugging, and general interactive jobs. Please use this for light testing as opposed to the login node.

  • default time limit: 1 hour
  • maximum time limit: 7 days

Logical_cpu

This partition is designed for workloads that can make use of hyperthreaded hardware.

  • default time limit: 1 hour
  • maximum time limit: 7 days

Job Submission

By default, jobs submitted without specifying a partition will land on requeue. If your job lands on an investor-owned node in requeue, that job is subject to being stopped and requeued at any point if the investor needs to run on the same node at the same time.
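
To check which partitions exist and what their time limits are before submitting, Slurm's sinfo command can be used; this is a sketch, and the columns shown depend on the format string:

# List each partition with its time limit, node count, and node state
sinfo --format="%P %l %D %t"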

Slurm Overview

Slurm is a cluster management and job scheduling system, and all RSS clusters use it. This document gives an overview of how to run jobs, check job status, and make changes to submitted jobs. To learn more about specific flags or commands, please visit Slurm's website (https://slurm.schedmd.com).

All jobs must be run using srun or sbatch to prevent them from running on the login node. Jobs found running on the login node will be immediately terminated, followed by a notification e-mail to the user.

Slurm Commands and Options

Job submission

sbatch - Submit a batch script for execution in the future (non-interactive)

srun - Obtain a job allocation and run an application interactively

Option Description
-A, --account=<account> Account to be charged for resources used
-a, --array=<index> Job array specification (sbatch only)
-b, --begin=<time> Initiate job after specified time
-C, --constraint=<features> Required node features
--cpu-bind=<type> Bind tasks to specific CPUs (srun only)
-c, --cpus-per-task=<count> Number of CPUs required per task
-d, --dependency=<state:jobid> Defer job until specified jobs reach specified state
-m, --distribution=<method[:method]> Specify distribution methods for remote processes
-e, --error=<filename> File in which to store job error messages (sbatch and srun only)
-x, --exclude=<name> Specify host names to exclude from job allocation
--exclusive Reserve all CPUs and GPUs on allocated nodes
--export=<name=value> Export specified environment variables (e.g., all, none)
--gpus-per-task=<list> Number of GPUs required per task
-J, --job-name=<name> Job name
-l, --label Prepend task ID to output (srun only)
--mail-type=<type> E-mail notification type (e.g., begin, end, fail, requeue, all)
--mail-user=<address> E-mail address
--mem=<size>[units] Memory required per allocated node (e.g., 16GB)
--mem-per-cpu=<size>[units] Memory required per allocated CPU (e.g., 2GB)
-w, --nodelist=<hostnames> Specify host names to include in job allocation
-N, --nodes=<count> Number of nodes required for the job
-n, --ntasks=<count> Number of tasks to be launched
--ntasks-per-node=<count> Number of tasks to be launched per node
-o, --output=<filename> File in which to store job output (sbatch and srun only)
-p, --partition=<names> Partition in which to run the job
--signal=[B:]<num>[@time] Signal job when approaching time limit
-t, --time=<time> Limit for job run time
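
As an illustration of how several of these options combine in a single batch script, the sketch below uses placeholder values; adjust the resources, partition, and e-mail address to suit your own job:

#!/bin/bash
#SBATCH --job-name=example_job      # name shown in the queue
#SBATCH --partition=general         # run in the general partition
#SBATCH --nodes=1                   # one node
#SBATCH --ntasks=4                  # four tasks
#SBATCH --mem=16G                   # 16 GB of memory on the node
#SBATCH --time=1-00:00:00           # one-day time limit
#SBATCH --output=example-%j.out     # %j expands to the job ID
#SBATCH --mail-type=END,FAIL        # e-mail when the job ends or fails
#SBATCH --mail-user=<your address>  # where to send the notification

srun ./my_program                   # placeholder for the real command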

Interactive Slurm Job

Interactive jobs typically run for a few minutes. This is a basic example of an interactive job using srun and -n to use one CPU:

srun -n 1 hostname

An example of the output from this code would be:

[bjmfg8@hellbender-login ~]$ srun -n 1 hostname
srun: Warning, you are submitting a job the to the requeue partition. There is a chance that your job will be preempted by priority partition jobs and have to start over from the beginning.
g003.mgmt.hellbender
[bjmfg8@hellbender-login ~]$

As noted, submitting with no partition specified will result in the job landing in the requeue partition. Specifying a partition with -p avoids this:

[bjmfg8@hellbender-login ~]$ srun -p general -n 1 hostname
c006.mgmt.hellbender
[bjmfg8@hellbender-login ~]
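
Beyond single commands, a full interactive shell on a compute node can be requested with srun's --pty option; the partition and resource values below are only examples:

# Start a bash shell on a compute node in the interactive partition
# (1 task, 4 GB of memory, 1 hour time limit)
srun -p interactive -n 1 --mem=4G -t 1:00:00 --pty /bin/bash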

Batch Slurm Job

Batch jobs can launch multiple tasks, potentially across multiple nodes, and typically take a few hours to a few days to complete. Most of the time you will use an sbatch script to launch your jobs. This example shows how to put Slurm options in the file saving_the_world.sh and then submit the job to the queue. To learn more about the partitions available on Hellbender and the specifics of each partition, please read our Partition Policy.

#! /bin/bash
 
#SBATCH -p general  # use the general partition
#SBATCH -J saving_the_world  # give the job a custom name
#SBATCH -o results-%j.out  # give the job output a custom name
#SBATCH -t 0-02:00  # two hour time limit
 
#SBATCH -N 2  # number of nodes
#SBATCH -n 2  # number of cores (AKA tasks)
 
# Commands without srun run only on the first node of the allocation
echo "$(hostname), reporting for duty."
 
# Commands launched with srun run once per task in the allocation
srun echo "Let's save the world!"
srun hostname

Once the sbatch script is ready to go, start the job with:

sbatch saving_the_world.sh

Output is found in the file results-<job id here>.out. Example below:

[bjmfg8@hellbender-login ~]$ sbatch saving_the_world.sh
Submitted batch job 86439
[bjmfg8@hellbender-login ~]$ cat results-86439.out
c006.mgmt.hellbender, reporting for duty.
Let's save the world!
Let's save the world!
c006.mgmt.hellbender
c015.mgmt.hellbender
[bjmfg8@hellbender-login ~]$ 
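
Once a job has been submitted, the standard Slurm commands below can be used to check on it or cancel it; the job ID 86439 from the example above is used purely for illustration:

# Show your queued and running jobs
squeue -u $USER

# Show accounting information (state, run time, exit code) for a finished job
sacct -j 86439

# Cancel a job that is no longer needed
scancel 86439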

Software

Anaconda

[ADD CONTENT]

CUDA

[ADD CONTENT]

Gaussian

[ADD CONTENT]

Globus

[ADD CONTENT]

Introduction to Basic Linux

[ADD CONTENT]

Matlab

[ADD CONTENT]

MobaXterm

[ADD CONTENT]

Environment Modules User Guide

[ADD CONTENT]

MPI

[ADD CONTENT]

R

[ADD CONTENT]

Singularity

[ADD CONTENT]

Julia

[ADD CONTENT]

TensorFlow

[ADD CONTENT]

Services

RSS provides the following services:

  • [High Performance Compute](/services/#high-performance-compute)
  • [Storage](/services/#storage)
  • [Teaching Cluster](/services/#teaching-cluster)
  • [Grant Assistance](/services/#grant-assistance)

Training

RSS provides training for our high-performance computing resources. This training covers hardware basics, how to use the scheduler, how to use secure shell key-based authentication, and any other questions you may have.

  • [Training](/training)
  • [Office Hours](/training/#rcss-office-hours)
  • [Introduction to Basic Linux](/software/linux-intro)
  • [Introduction to Lewis and Clark Clusters](/lewis-and-clark-clusters)

Policies

Following are RSS policies and guidelines for different services and groups:

  • [Partition Policy](/policies/#partition-policy)
  • [Storage Policy](/policies/#storage-policy)
  • [Software and Procurement Policy](/policies/#software-and-procurement-policy)
  • [Research Network Policy](/policies/#research-network-policy)
  • [General Purpose Research Network Policy](/policies/#general-purpose-research-network-policy)
  • [Teaching Cluster Policy](/policies/#teaching-cluster-policy)
  • [Teaching VM Policy](/policies/#teaching-vm-policy)
  • [NSF MRI-RC Policy](/policies/#nsf-mri-rc-policy)
  • [Howto](/howto)
  • [Getting Help](/getting-help)
  • [Getting Started](/getting-started)
  • [Software Install Request](/software-install-request)
  • [Frequently Asked Questions](/frequently-asked-questions)