====== Hellbender ======

===== Request an account =====

You can request an account on Hellbender by filling out the form found at https://request.itrss.umsystem.edu

===== System Information =====

==== Maintenance ====

**__Regular maintenance is scheduled for the 2nd Tuesday of every month.__** Jobs will run if scheduled to complete before the maintenance window begins, and queued jobs will start once maintenance is complete.

==== Links ====

  * UM IT System Status: https://status.missouri.edu
  * RSS Announcement List: https://po.missouri.edu/scripts/wa.exe?SUBED1=RSSHPC-L&A=1
  * RSS Services: https://missouri.qualtrics.com/jfe/form/SV_6zkkwGYn0MGvMyO
  * Hellbender Account Request Form: https://request.itrss.umsystem.edu/
  * Hellbender Add User to Existing Account Form: https://missouri.qualtrics.com/jfe/form/SV_9LAbyCadC4hQdBY
  * Hellbender Classes Student Form: https://missouri.qualtrics.com/jfe/form/SV_6FpWJ3fYAoKg5EO

==== Services ====

Order RSS services by filling out the form found at https://missouri.qualtrics.com/jfe/form/SV_6zkkwGYn0MGvMyO

RSS offers:
  * HPC Compute: Compute Node
  * HPC Compute: GPU Node
  * RDE Storage: General storage
  * RDE Storage: High performance storage
  * Software

==== Add People to Existing Account(s) ====

Add users to existing compute (SLURM), storage, or software groups by filling out the form found at https://missouri.qualtrics.com/jfe/form/SV_9LAbyCadC4hQdBY

==== Software ====

Hellbender was built and is managed with Puppet. The underlying OS for Hellbender is Alma Linux 8.9. For resource management and scheduling we are using the Slurm workload manager, version 22.05.11.

==== Hardware ====

=== Management nodes ===

The head nodes and login nodes are virtual, which is one of the key differences from the previous generation cluster, Lewis.

=== Compute nodes ===

Dell C6525: half-rack-unit servers, each containing dual 64-core AMD EPYC Milan 7713 CPUs with a base clock of 2 GHz and a boost clock of up to 3.675 GHz. Each C6525 node contains 512 GB of DDR4 system memory.

^ Model ^ CPU Cores ^ System Memory ^ Node Count ^ Local Scratch ^ Total Core Count ^
| Dell C6525 | 128 | 512 GB | 112 | 1.6 TB | 14336 |

=== GPU nodes ===

^ Model ^ CPU Cores ^ System Memory ^ GPU ^ GPU Memory ^ GPU Count ^ Local Scratch ^ Node Count ^
| Dell XE9640 | 104 | 2048 GB | H100 | 80 GB | 4 | 3.2 TB | 2 |
| Dell R740xa | 64 | 356 GB | A100 | 80 GB | 4 | 1.6 TB | 17 |

A specially formatted ''%%sinfo%%'' command can be run on Hellbender to report live information about the nodes and the hardware/features they have:

<code bash>
sinfo -o "%5D %4c %8m %28f %35G"
</code>

==== Investment Model ====

=== Overview ===

The newest High Performance Computing (HPC) resource, Hellbender, has been provided through a partnership with the Division of Research, Innovation and Impact (DRII) and is intended to work in conjunction with DRII policies and priorities. This outline provides definitions of how fairshare, general access, priority access, and researcher contributions are handled for Hellbender. HPC has been identified as a continually growing need for researchers; as such, DRII has invested in Hellbender as an institutional resource. This investment is intended to increase ease of access to these resources, provide cutting-edge technology, and grow the pool of resources available.

=== Fairshare ===

To understand how general access and priority access differ, fairshare must first be defined. Fairshare is an algorithm used by the scheduler to assign priority to jobs from users in a way that gives every user a fair chance at the resources available. The algorithm weighs several metrics for each job waiting in the queue, such as job size, wait time, current and recent usage, and individual user priority levels. Administrators can tune these weights to adjust how the scheduler determines which jobs run next once resources become available.
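If you are curious where you currently stand, Slurm can report the values that feed this calculation. A minimal sketch, assuming the standard Slurm client tools are on your path (output columns can vary with the Slurm version and site configuration):

<code bash>
# Show fairshare data for your own user associations only.
# --Users (-U) limits output to user rows; --long (-l) adds the
# normalized usage and the resulting FairShare factor.
sshare -U -l

# List pending jobs sorted by priority to see the effect on the queue:
# %i = job ID, %P = partition, %u = user, %Q = priority value.
squeue --state=PENDING --sort=-p -o "%.10i %.12P %.10u %.10Q"
</code>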
=== Resources Available to Everyone: General Access ===

General access will be open to any research or teaching faculty, staff, and students from any UM System campus. General access is defined as open access to all resources available to users of the cluster at an equal fairshare value. This means that all users will have the same level of access to the general resource. Research users of the general access portion of the cluster will be given the RDE Standard Allocation to operate from. Larger storage allocations will be provided through RDE Advanced Allocations, independent of HPC priority status.

=== Hellbender Advanced: Priority Access ===

When researcher needs are not being met at the general access level, researchers may request an advanced allocation on Hellbender to gain priority access. Priority access will give research groups a limited set of resources that will be available to them without competition from general access users. Priority access will be provided to a specific set of hardware through a priority partition which contains these resources. This partition will be created and limited to use by the user and their associated group. These resources will also be in an overlapping pool of resources available to general access users. This pool will be administered such that if a priority access user submits jobs to their priority partition, any jobs running on those resources from the overlapping partition will be requeued and begin execution again on another resource in that partition if available, or return to wait in the queue for resources. Priority access users will retain general access status, and fairshare will still play a part in moderating their access to the general resource. Fairshare inside a priority partition determines which user's jobs are selected for execution next inside that partition. Jobs running inside a priority partition will also affect a user's fairshare calculations for resources in the general access partition, meaning that running a large number of jobs inside a priority partition will lower a user's priority for the general resources as well.

=== Priority Designation ===

Hellbender Advanced Allocations are eligible for DRII Priority Designation. This means that DRII has determined the proposed use case (such as a core or grant-funded project) presents a strategic advantage or high priority service to the university. In this case, DRII fully subsidizes the resources used to create the Advanced Allocation.

=== Traditional Investment ===

Hellbender Advanced Allocation requests that are not approved for DRII Priority Designation may be treated as traditional investments, with the researcher paying for the resources used to create the Advanced Allocation at the defined rate. These rates are subject to change based on the determination of DRII and hardware costs.

=== Resource Management ===

Information Technology Research Support Solutions (ITRSS) will procure, set up, and maintain the resource.
ITRSS will work in conjunction with the MU Division of Information Technology and Facility Services to provide adequate infrastructure for the resource.

=== Resource Growth ===

Priority access resources will generally be made available from existing hardware in the general access pool, and the funds will be retained to allow a larger pool of funds to accumulate for a future expansion of the resource. This will allow the greatest return on investment over time. If the general availability resources fall below 50% of the overall resource, an expansion cycle will be initiated to ensure all users still have access to a significant amount of resources. If a researcher or research group is contributing a large amount of funding, it may trigger an expansion cycle if that is determined to be advantageous at the time of the contribution.

=== Benefits of Investing ===

The primary benefit of investing is receiving "shares" and a priority access partition for you or your research group. Shares are used to calculate the percentage of the cluster owned by an investor. As long as an investor has used less than they own, they will be able to use their shares to get higher priority in the general queue. **FairShare is by far the largest factor in queue placement and wait times.**

Investors will be granted Slurm accounts to use in order to charge their investment (FairShare). These accounts can contain the same members as a POSIX group (storage group) or any other set of users at the request of the investor.

To use an investor account in an sbatch script, use:

<code bash>
#SBATCH --account=<account>                                # investor account to charge
#SBATCH --partition=<partition>                            # investor partition, for CPU jobs
#SBATCH --partition=<partition>-gpu --gres=gpu:A100:1      # investor GPU partition; requests 1 A100 GPU for GPU jobs
</code>

To use a QOS in an sbatch script, use:

<code bash>
#SBATCH --qos=<qos>
</code>

=== HPC Pricing ===

The HPC service is available at any time at the following rates for the 2023-2024 year:

^ Service ^ Rate ^ Unit ^ Support ^
| Hellbender HPC Compute | $2,702.00 | Per Node/Year | Year to Year |
| GPU Compute* | $7,691.38 | Per Node/Year | Year to Year |
| High Performance Storage | $95.00 | Per TB/Year | Year to Year |
| General Storage | $25.00 | Per TB/Year | Year to Year |

*Note: The GPU compute service is no longer active. We have reached 50% of the GPU nodes in the cluster under investment; if you need GPU capacity beyond the general pool, we are able to plan and work with your grant submissions to add additional capacity to Hellbender.

==== Policies ====

**__Under no circumstances should your code be running on the login node.__**

=== Software and Procurement ===

Open source software installed cluster-wide must have an open source license (https://opensource.org/licenses) or be obtained utilizing the procurement process, even if there is no cost associated with it. Licensed software (any software that requires a license or agreement to be accepted) must follow the procurement process to protect users, their research, and the University. Software must be cleared via the ITSRQ. For more information about this process please reach out to us!

For widely used software, RSS can facilitate the sharing of license fees and/or may support the cost, depending on the cost and situation. Otherwise, users are responsible for funding fee-licensed software, and RSS can handle the procurement process. We require that, if the license does not preclude it and there are no node or other resource limits, the software be made available to all users on the cluster. All licensed software installed on the cluster is to be used following the license agreement. We will do our best to install and support a wide range of scientific software as resources and circumstances dictate, but in general we only support scientific software that will run on RHEL in an HPC cluster environment. RSS may not support software that is implicitly or explicitly deprecated by the community.

=== Containers, Singularity/Apptainer/Docker ===

A majority of scientific software and software libraries can be installed in users' accounts or in group space. We also provide limited support for Singularity for advanced users who require more control over their computing environment. We cannot knowingly assist users to install software that may put them, the University, or their intellectual property at risk.
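For advanced users who do work with containers, a typical workflow is to pull a public image and run software from it entirely in user space. A minimal sketch, assuming Singularity/Apptainer is available on the node you are running on (the image used here is only an illustration, and on some systems the command is ''%%apptainer%%'' rather than ''%%singularity%%''):

<code bash>
# Pull a public image from Docker Hub and convert it to a local SIF file.
singularity pull docker://rockylinux:9

# Run a command inside the container; your home directory is bind-mounted by default.
singularity exec rockylinux_9.sif cat /etc/os-release
</code>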
=== Storage ===

**__None of the cluster attached storage available to users is backed up in any way by us.__** This means that if you delete something and do not have a copy somewhere else, it is gone. Please note that data stored on cluster attached storage is limited to Data Class 1 and 2 as defined by the UM System Data Classifications. If you need to store data classified as DCL3 or DCL4, please contact us so we may find a solution for you.

^ Storage Type ^ Location ^ Quota ^ Description ^
| Home | /home/$USER | 5 GB | Available to all users |
| Pixstor | /home/$USER/data | 500 GB | Available to all users |
| Local Scratch | /local/scratch | 1.6-3.2 TB | Available to all users |
| Pixstor | /cluster/pixstor, /mnt/pixstor | Varies | For investment, cluster attached |
| Vast | /cluster/VAST | Varies | For investment, cluster/instrument attached |

=== Research Network ===

Research Network DNS: The domain name for the Research Network (RNet) is rnet.missouri.edu, and it is for research purposes only. All hosts on RNet will have a .rnet.missouri.edu domain. Subdomains and CNAMEs are not permitted. Reverse records will always point to a host in the .rnet.missouri.edu domain.

==== Partitions ====

|                      ^ Default Time Limit ^ Maximum Time Limit ^ Description ^
^ general              | 1 hour | 2 days | For non-investors to run multi-node, multi-day jobs. |
^ requeue              | 10 minutes | 2 days | For non-investor jobs that have been requeued because they landed on an investor-owned node. |
^ gpu                  | 1 hour | 2 days | Acceptable use includes jobs that utilize a GPU for the majority of the run. Composed of Nvidia A100 cards, 4 per node. |
^ interactive          | 1 hour | 2 days | For short interactive testing, interactive debugging, and general interactive jobs. Use this for light testing as opposed to the login node. |
^ logical_cpu          | 1 hour | 2 days | For workloads that can make use of hyperthreaded hardware. |
^ priority partitions  | 1 hour | 28 days | For investors. |
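As a point of reference, a minimal batch script targeting the general partition might look like the sketch below. The resource sizes and program name are illustrative placeholders, not site defaults; adjust them to your workload and keep the time request within the partition limits in the table above:

<code bash>
#!/bin/bash
#SBATCH --job-name=example         # name shown in squeue output
#SBATCH --partition=general        # see the partition table above for limits
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # cores your program can actually use
#SBATCH --mem=16G                  # memory for the whole job
#SBATCH --time=0-02:00:00          # d-hh:mm:ss, must fit the partition's maximum

srun ./my_program                  # replace with your actual command
</code>

For quick interactive work on a compute node (rather than the login node), something along the lines of ''%%srun --partition=interactive --mem=4G --time=01:00:00 --pty bash%%'' should start a shell, subject to the limits shown above.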
==== Citation ====

We ask that when you cite any of the RSS clusters in a publication, you send an email to muitrss@missouri.edu and share a copy of the publication with us. To cite the use of any of the RSS clusters in a publication please use:

**//The computation for this work was performed on the high performance computing infrastructure operated by Research Support Solutions in the Division of IT at the University of Missouri, Columbia MO DOI:[[https://doi.org/10.32469/10355/97710]]//**

===== Quick Start =====

==== Open OnDemand ====

  * [[https://ondemand.rnet.missouri.edu|Hellbender Open OnDemand portal (OOD)]]
  * [[https://ondemand-classes.rnet.missouri.edu|Hellbender Classes Open OnDemand portal (OOD classes)]]

OnDemand provides an integrated, single access point for all of your HPC resources. The following apps are currently available on Hellbender's Open OnDemand:

  * Jupyter Notebook
  * RStudio Server
  * Virtual Desktop
  * VSCode

==== Teaching Cluster ====

Hellbender can be used by instructors, TAs, and students for instructional work via the [[https://ondemand-classes.rnet.missouri.edu|Hellbender Classes Open OnDemand portal (OOD classes)]]. Below is the process for setting up a class on the OOD portal.

  - Send the class name, the list of students and TAs, and any shared storage requirements to itrss-support@umsystem.edu.
  - We will add the students to the group allowing them access to OOD.
  - If a student does not yet have a Hellbender account, they will be presented with a link to a form to fill out requesting one.
  - We activate the student account, and the student will receive an Account Request Complete email.

If desired, the professor can perform step 2 themselves. You may already be able to modify your class groups here: https://netgroups.apps.mst.edu/auth-cgi-bin/cgiwrap/netgroups/netmngt.pl If the class size is large, we can perform steps 3 and 4.

==== Connecting ====

You can request an account on Hellbender by filling out the form found at https://request.itrss.umsystem.edu

Once you have been notified by the RSS team that your account has been created on Hellbender, open a terminal and type ''%%ssh [SSO]@hellbender-login.rnet.missouri.edu%%''. Using your UM System password, you will be able to log in directly to Hellbender if you are on campus or on the VPN. Once connected you will land on the login node and will see a screen similar to this:

{{:pub:hpc:hellbender_landing.png?400|}}

You are now on the login node and are ready to proceed to submit jobs and work on the cluster.

==== SSH ====

If you will not primarily be connecting to Hellbender from on campus and do not want to use the VPN, another option is public/private key authentication. You can add your SSH key pairs to any number of computers, and they will be able to access Hellbender from outside the campus network.

=== Generating an SSH Key on Windows ===

  - To generate an SSH key on a Windows computer, you will first need to download a terminal program; we suggest MobaXterm (https://mobaxterm.mobatek.net/).
  - Once you have MobaXterm downloaded, start a new session by selecting "Start Local Terminal". [Insert local terminal mobaxterm image here]
  - Type ''%%ssh-keygen%%'' and when prompted press enter to save the key in the default location ''%%/home//.ssh/id_rsa%%'', then enter a strong passphrase (required) twice.
  - After you generate your key, you will need to send us the public key. To see what your public key is, you can type ''%%cat ~/.ssh/id_rsa.pub%%''. The output will be a string of characters and numbers. Please copy this information and send it to RSS, and we will add your key to your account.

=== Generating an SSH Key on MacOS/Linux ===

  - Open your terminal application of choice.
  - Type ''%%ssh-keygen%%'' and when prompted press enter to save the key in the default location ''%%/home//.ssh/id_rsa%%'', then enter a strong passphrase (required) twice.
  - After you generate your key, you will need to send us the public key. To see what your public key is, you can type ''%%cat ~/.ssh/id_rsa.pub%%''. The output will be a string of characters and numbers. Please copy this information and send it to RSS, and we will add your key to your account.
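Once RSS has added your public key, you can save some typing by adding a host alias to your local SSH client configuration. A minimal sketch (the username shown is a placeholder, and the ''%%IdentityFile%%'' path only needs changing if you saved your key somewhere other than the default):

<code bash>
# Append a host alias to your SSH client config, then connect with "ssh hellbender".
cat >> ~/.ssh/config <<'EOF'
Host hellbender
    HostName hellbender-login.rnet.missouri.edu
    User your_sso_id
    IdentityFile ~/.ssh/id_rsa
EOF

ssh hellbender
</code>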
==== Job Submission ====

By default, jobs submitted without a partition will land on the requeue partition. If your job lands on a node that is owned by an investor, that job is subject to being stopped and requeued at any point if the investor needs to run on the same node at the same time.

==== Slurm Overview ====

Slurm is used for cluster management and job scheduling, and all RSS clusters use Slurm. This document gives an overview of how to run jobs, check job status, and make changes to submitted jobs. To learn more about specific flags or commands, please visit Slurm's website (https://slurm.schedmd.com/).

All jobs must be run using srun or sbatch to prevent them from running on the Hellbender login node. Jobs found running on the login node will be terminated immediately, followed by a notification email to the user.

=== Slurm Commands and Options ===

**Job submission**

**sbatch** - Submit a batch script for execution in the future (non-interactive)

**srun** - Obtain a job allocation and run an application interactively

^ Option ^ Description ^
| -A, --account= | Account to be charged for resources used |
| -a, --array= | Job array specification (sbatch only) |
| -b, --begin=