====== Getting Started with HPC ======

===== Choosing a Resource =====

You're not limited to only choosing one however due to differences in funding sources for the resource, some of them have different access rules. This section will cover some of the system details of each resource and requirements for account approval.

====Hellbender====

Hellbender is a traditional HPC research cluster built at MU in 2023 to support research efforts for UM researchers. Mizzou's Division of Research, Innovation & Impact (DRII) is the primary source of funding for the hardware and support of Hellbender. Hellbender started as 112 compute nodes containing 2 AMD 7713 processors and 512GB of RAM and 17 GPU nodes containing 4 Nvidia A100 80GB RAM GPUs, 2 Intel Xeon 6338 processors, and 256GB of system RAM. Since then, it has expanded thanks to researcher investments and repurposing the newest portions of the previous HPC cluster. See a more detailed overview of Hellbender architecture at {{ :pub:hpc:hellbender_system_overview.pdf |}}.

DRII has made it clear the mission of Hellbender is to accelerate Mizzou Forward initiatives. There are 2 access levels made available for UM researchers, general and priority access. General access is free and available to all UM researchers. General access provides an equal share of at least 50% of the resources available to all users. Priority access provides dedicated access to some number of nodes on Hellbender and is available through investment. 

Requesting access to Hellbender can be done through our [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041| Hellbender account request form]]. Each prospective user will need a faculty sponsor listed as the principal investigator(PI)/sponsor. The PI/sponsor will be the responsible party for managing members who have access to the Hellbender cluster. To request a 5TB storage group each PI/sponsor can submit the following form:[[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1043| group storage request form]] to request access to our [[pub:rde:start|Research Data Ecosystem]] (RDE).

Hellbender now uses [[pub:hpc:cilogon:|CILogon]] for access.

[[pub:hpc:hellbender|Hellbender Documentation]]

====The Mill====

The Mill is a traditional HPC research cluster built at S&T in 2023. The Mill currently consists of 229 compute nodes; 25 with 512GB of DDR4 RAM, a single 1.6Tb NVME drive for local scratch and 128 cores; and 160 with 256GB of DDR4 RAM, a single 2.6TB NVME drive for local scratch and 64 cores; 44 with 192GB of DDR4 RAM, 2.6TB of local scratch and 40 cores, totaling 15,200 compute cores across the system. There are also 8 GPU nodes; 6 with 4 Nvidia V100 GPUs; 1 with 8 H100 GPUs; 1 with 2 V100s, totaling 34 GPUs. The network is based on an HDR backbone which provides up to 200 gigabits of data point-to-point on the network. Each node is attached to the backbone with an HDR-100 InfiniBand connection capable of providing 100 gigabits of data throughput per second to each node. The Mill is connected to 250Tb of high-performance all-flash InfiniBand storage (VAST) as well as 800Tb of utility CEPH storage. Storage lab allocations are protected by associated security groups applied to the share, with the ability for the PI to manage group access.

Priority access to dedicated hardware can be made through investment in hardware.

We ask that when you cite any of the RSS clusters in a publication to send an email to itrss-support@umsystem.edu as well as share a copy of the publication with us. To cite the use of The Mill in a publication please use: "The computation for this work was performed on the high-performance computing infrastructure provided by Research Support Solutions at Missouri University of Science and Technology https://doi.org/10.71674/PH64-N397".

[[pub:hpc:mill|Mill Documentation]]

====The Foundry====

The Foundry is a traditional HPC research cluster built at S&T in 2019 to support regional higher education institutions. Primarily the funding for the Foundry came from the National Science Foundation's (NSF) Major Research Infrastructure (MRI) program, additional funding for support and infrastructure has been provided by the S&T campus. The Foundry is comprised of 160 5 compute nodes with 2 AMD Epyc 7502 processors and 256GB of RAM and 6 GPU nodes with 2x Intel Xeon 6248 processors 192GB of system RAM and 4 Nvidia V100 GPUs with 32GB of GPU RAM each.

With the mission of the MRI being that of making The Foundry a regional resource for higher ed, general resources are available freely to any Missouri University by emailing a one page request for resources to foundry-access@mst.edu. For UM system researchers a request to the IT help desk is all that is necessary to gain general access. Priority access to dedicated hardware can be made through investment in hardware.

In all publications or products resulting from work performed using the Foundry the NSF Grant which provided funding for the Foundry must be acknowledged. This can be done by adding this sentence to the publication “This work was supported in part by the National Science Foundation under Grant No. OAC-1919789.” Or something to this effect.

[[pub:hpc:foundry|Foundry Documentation]]

====Nautilus====
Researchers with workflows that involve AI, machine learning, simulation, or similar computation that can be parallelized at the job level may be interested to know of another available HPC resource: the NSF National Research Platform Nautilus cluster. NSF grants orchestrated by faculty at the University of Missouri and in the Great Plains Network have contributed substantially to this resource.
 
The Nautilus HPC System is a public cluster utilizing Kubernetes containerization. Its resources include 1,352 compute nodes, 32 NVIDIA A100 nodes, 26 petabytes of DDN storage and a 200 Gbps NVIDIA Mellanox InfiniBand interconnect.
 
Users will need to be learn how to use Kubernetes and containers, a very different system than the SLURM HPC Systems we have at the University, but may find more resource availability.  Nautilus also differs from other NSF programs (like ACCESS) as it is not based on proposals and approvals to gain access. All resources requested are expected to be used, so there is a strong need for users to understand what their jobs require. There is GitHub documentation to assist with these learning points.
 
Data is expected to be DCL 1 or 2; higher classifications of data are not appropriate for this cluster.  Researchers should understand that the work they are doing on Nautilus should be considered open to the public.
 
If you think that Nautilus may be a resource that can help your research excel, please let us know of your interest.

[[pub:hpc:nautilus|Nautilus Documentation]]
=====General Policies=====

The following are RSS policies and guidelines for different services and groups:

===Software and Procurement Policy===

==Open Source Software==

Software installed cluster-wide must have an open source (https://opensource.org/licenses) license or be obtained utilizing the procurement process even if there is not a cost associated with it.

==Licensed Software==

Licensed software (any software that requires a license or agreement to be accepted) must follow the procurement process to protect users, their research, and the University. To ensure this, for RSS to install and support licensed software RSS must manage the license and the license server.

For widely used software RSS can facilitate the sharing of license fees and/or may support the cost depending on the cost and situation. Otherwise, user are responsible for funding for fee licensed software and RSS can handle the procurement process. We require that if the license does not preclude it, and there are not node or other resource limits, that the software is make made available to all users on the cluster. All licensed software installed on the cluster is to be used following the license agreement. We will do our best to install and support a wide rage of scientific software as resources and circumstances dictate but in general we only support scientific software that will run on Linux in a HPC cluster environment. RSS may not support software that is implicitly/explicitly deprecated by the community.

==Singularity Support==
A majority of scientific software and software libraries can be installed in users’ accounts or in group space. We also provide limited support for Singularity (https://sylabs.io/docs/) for advanced users who require more control over their computing environment. We cannot knowingly assist users to install software that may put them, the University, or their intellectual property at risk.

====Use Policy====

====Data Policy====

=== Research Data Ecosystem Storage (RDE) Allocation Model ===

The Research Data Ecosystem (RDE) is a collection of technical resources supporting research data. These resources are intended to meet researcher needs for physical storage, data movement, and institutional management. They are provided in partnership with the Division of Research Innovation and Impact (DRII) and are intended to work in conjunction with DRII policy drivers and priorities. 
The Ecosystem brings together storage platforms, data movement, metadata management, data protection, technical and practice expertise, and administration to fully support the research data lifecycle. 
DRII has invested in RDE as an institutional resource. Details of the specific underlying platforms may change over time, but always directed towards ease of use, access, and performance to purpose. 
Throughout May 2023, portions of the RDE are moving into production. These include:

  * On-premises high-performance and general-purpose research storage 
  * Globus for advanced data movement 
  * Specialized need consultation
  * Data archival

These resources work in conjunction with RSS services related to grant support, HPC infrastructure, and data management plan development. Capabilities that are not yet generally available but in development include:

  * Data backup
  * Data analytics and reporting

Additionally, effective use of some resources may require changes to network architecture, so additional limitations may apply at this time.  
We invite researchers needing solutions (including those dependent on resources not yet generally available) to consult with RSS. We may be able to find effective workarounds or make use of pilot projects when appropriate. We are committed to finding solutions supporting your research productivity needs!

=== Resources available to all researchers: RDE Standard Allocation ===

All researchers are eligible for the RDE Standard Allocation. The Standard Allocation provides a base level of storage suitable for use in connection with High Performance Computing clusters, or for general-purpose lab shares (SMB or NFS). The exact capacity is subject to change based on utilization and DRII direction. See the appendices for current specifications.

=== RDE Advanced Allocation ===

For needs beyond the RDE Standard Allocation, researchers may request one or more RDE Advanced Allocations. Advanced Allocations can provide for larger or specialized research storage needs. Storage is provided at a per-TB/per-year rate which is subject to change under DRII guidance. See the appendices for current rates and how the cost model is being implemented. RDE Advanced Allocations should be associated with research services or defined projects. Research Cores, Research Centers, labs providing services to other labs, and RSS may be considered research services. Defined projects include sponsored programs or otherwise well-defined initiatives. 
All RDE allocations must include appropriate data management planning. A plan may be a formal Data Management Plan associated with a grant, or an operational workflow description appropriate for a core or service entity, as long as data protection, transfer, and disposition requirements are documented.  
Advanced allocations require consultation with RSS. RSS will work with researchers to match allocated resources with capacity, performance, protection, and connectivity needs.

**Priority Designation**

RDE Advanced Allocations are eligible for DRII Priority Designation. This means DRII has determined the proposed use case (such as a core or grant-funded project) presents a strategic advantage or high priority service to the University and agrees to subsidize the resources assigned in that designation. DRII is responsible for determination of criteria for Priority Designation.

**Traditional Investment**

RDE Advanced Allocations that are not approved for DRII Priority Designation or that inherently receive funding for storage may be treated as traditional investments, with the researcher paying for the allocation at the defined rate.

**Data Compliance**

By default, the RDE environment supports data classified as DCL 1 or 2. It may be possible to arrange for a higher DCL, but this must be vetted and approved by appropriate security and compliance authorities.

**Allocation Maintenance**

Researchers are expected to ensure allocation information is kept current. Annual confirmation that the allocation is still needed will be required for all Standard Allocations. For lab (group) and Advanced Allocations, annual vetting of group membership will be required, as well as updates to data management planning if changes (duration, disposition, etc.) are needed.

=== Appendix ===

**Appendix A: RDE Standard Allocation**

  * Individual researcher: 500GB (In addition to 50GB home directory space for HPC users) 
  * Lab group: 5TB 
  * Duration/renewal cycle: Annual

**Appendix B: RDE Advanced Allocation**

  * Capacities and duration determined in consultation. 
  * Cost per TB per year (equipment/licensing): $95 for high performance storage, $25 for general purpose storage* 
  * **Supplemental services**
    * Snapshotting (pending implementation and potential cost evaluation) 
    * Performance optimization 
    * Backup (pending implementation and potential cost evaluation) 
    * Archival (pending implementation and potential cost evaluation) 
    * Globus endpoint

**Note**: All currently available storage is high performance. As capacity is consumed, general purpose (lower tier) storage will be added to the hardware environment and data priced as “general purpose” will be subject to automatic migration to the lower tier. 


=====Getting Help=====
====Office Hours====

RSS office hours are now virtual. In person library RSS office hours have been suspended until further notice. Our team will be still available to help during the office hours via Zoom.


^Office Hours	^Date and Time	^Location ^
|RSS	|Wed 10:00 - 12:00	|https://umsystem.zoom.us/j/910886067|
|Engineering/RSS	|Mon, Tue, Thu 2:00 - 4:00	|https://umsystem.zoom.us/j/910886067|
|BioCompute	|Please message RSS or join	|zoom above for Biocompute questions|
Note that above Zoom links are password protected, please contact us for receiving sessions password.

We are also available to answer questions outside of these hours via email: itrss-support@umsystem.edu

====Grant Proposal Assistance====

The RSS team is here to help with grants. We offer consultations and project reviews that include but are not limited to:

  * **Security Reviews**
  * **Vendor Quotes**
    * We work with university approved vendors to get preferred pricing and support
  * **Letters of Support**
    * Some grants require proof of campus IT knowledge/support of the proposal
  * **Regional Partnerships**
    * We are active members in several different regional network groups (Great Plains Network, CIMUSE, Campus Champions and more) that can be assets in finding partnerships for multi-institution projects.
  * **Data Management Plans**
    * MU Libraries has good resources for DMP's for many of the most common granting agencies [[https://libraryguides.missouri.edu/datamanagement| MU Libraries Data Management]]
  * **Facilities Description**
    * Please reach out to us at itrss-support@umsystem.edu for the most up to date version of RSS facilities for inclusion in your proposals. {{ :pub:hpc: rss_overview.docx |}}