Differences
pub:hpc:hellbender [2025/10/29 19:15] – [GPU Node Lease] bjmfg8
pub:hpc:hellbender [2026/04/21 16:15] (current) – [What is Hellbender?] redmonp
Line 3: Line 3:
 **Request an Account:**
 You can request an account for access to Hellbender by filling out the form found at:
-[[https://request.itrss.umsystem.edu/| Hellbender/RDE Account Request Form]]
+[[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041| Hellbender Account Request Form]]
  
 ==== What is Hellbender? ====
Line 9: Line 9:
 **Hellbender** is the latest High Performance Computing (HPC) resource available to researchers and students (with sponsorship by a PI) within the UM-System.
  
-**Hellbender** consists of 222 mixed x86-64 CPU nodes providing 22,272 cores as well as 28 GPU nodes consisting of mix of Nvidia GPU'(see hardware section for more details). Hellbender is attached to our Research Data Ecosystem ('RDE') that consists of 8PB of high performance and general purpose research storage. RDE can be accessible from other devices outside of Hellbender to create a single research data location across different computational environments.
+**Hellbender** consists of 263 mixed x86-64 CPU nodes providing 24,844 cores, as well as 41 GPU nodes with a total of 132 Nvidia GPUs spanning the Volta, Ampere, and Hopper generations; more details are at [[https://docs.itrss.umsystem.edu/pub/hpc/hellbender#hardware]]. Hellbender is attached to our Research Data Ecosystem ('RDE'), which consists of 8PB of high-performance and general-purpose research storage. RDE can be accessed from other devices outside of Hellbender to create a single research data location across different computational environments.
  
 ==== Investment Model ====
Line 25: Line 25:
  
 General access will be open to any research or teaching faculty, staff, and students from any UM system campus. General access is defined as open access to all resources available to users of the cluster at an equal fairshare value. This means that all users will have the same level of access to the general resource.
-Research users of the general access portion of the cluster will be given the RDE Standard Allocation to operate from. Larger storage allocations will be provided through RDE Advanced Allocations, and independent of HPC priority status
+Research users of the general access portion of the cluster will be given the RDE Standard Allocation to operate from. Larger storage allocations can be attained via investment, independent of HPC priority status.
-
-=== Hellbender Advanced: Priority Access ===
-
-When researcher needs are not being met at the general access level, researchers may request an advanced allocation on Hellbender to gain priority access. Priority access will give research groups a limited set of resources that will be available to them without competition from general access users. Priority Access will be provided to a specific set of hardware through a priority partition which contains these resources. This partition will be created, and limited to use by the user and their associated group. These resources will also be in an overlapping pool of resources available to general access users. This pool will be administered such that if a priority access user submits jobs to their priority access partition, any jobs running on those resources from the overlapping partition will be requeued and begin execution again on another resource in that partition if available, or return to wait in the queue for resources. Priority access users will retain general access status, fairshare will still play a part in moderating their access to the general resource. Fairshare inside a priority partition determine which user’s jobs are selected for execution next inside this partition. The jobs running inside this priority partition will also affect a user’s fairshare calculations even for resources in the general access partition. Meaning that running a large amount of jobs inside a priority partition will lower a user’s priority for the general resources as well.
-
-=== Traditional Investment ===
-
-Hellbender Advanced Allocation requests that are not approved for DRII Priority Designation may be treated as traditional investments with the researcher paying for the resources used to create the Advanced Allocation at the defined rate. These rates are subject to change based on the determination of DRII, and hardware costs.
  
 === Resource Management ===
Line 42: Line 34:
  
 Priority access resources will generally be made available from existing hardware in the general access pool and the funds will be retained for a future time to allow a larger pool of funds to accumulate for expansion of the resource. This will allow the greatest return on investment over time. If the general availability resources are less than 50% of the overall resource, an expansion cycle will be initiated to ensure all users will still have access to a significant amount of resources. If a researcher or research group is contributing a large amount of funding, it may trigger an expansion cycle if that is determined to be advantageous at the time of the contribution.
+
+=== Hellbender Advanced: Priority Access - Investment ===
+
+When researcher needs are not being met at the general access level, researchers may request an advanced allocation on Hellbender to gain priority access via investment. Priority access will give research groups a limited set of resources that will be available to them without competition from general access users. Priority access will be provided through a priority partition containing a specific set of hardware. This partition will be created and limited to use by the user and their associated group. These resources will also be in an overlapping pool of resources available to general access users. This pool will be administered such that if a priority access user submits jobs to their priority access partition, any jobs running on those resources from the overlapping partition will be requeued and begin execution again on another resource in that partition if available, or return to wait in the queue for resources. Priority access users will retain general access status, and fairshare will still play a part in moderating their access to the general resource. Fairshare inside a priority partition determines which user’s jobs are selected for execution next inside that partition. The jobs running inside a priority partition also affect a user’s fairshare calculations for resources in the general access partition, meaning that running a large number of jobs inside a priority partition will lower a user’s priority for the general resources as well.
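As a sketch of how priority access looks in practice, a Slurm job script for a priority-access group might resemble the following. The partition and account names here are hypothetical examples, not Hellbender's actual names; your real priority partition is created and named when the advanced allocation is set up.

```shell
#!/bin/bash
# Hypothetical Slurm job script for a priority-access group on Hellbender.
# NOTE: partition and account names below are assumptions for illustration only.
#SBATCH --partition=priority-labname   # group-limited priority partition (hypothetical name)
#SBATCH --account=labname-group        # hypothetical group account
#SBATCH --ntasks=8                     # request 8 tasks
#SBATCH --time=1-00:00:00              # 1 day of walltime

srun ./my_program                      # replace with your actual workload
```

Jobs submitted this way still count toward the group's fairshare, so heavy use of a priority partition lowers the group's priority for the general resources as described above.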
  
 === Benefits of Investing ===
Line 70: Line 66:
 ==== How Much Does Investing Cost? ====
  
-See our rates for FY 2024-2025:
+See our rates for FY 2025-2026:
  
 ^ Service                              ^ Rate        ^ Unit         ^ Support        ^
Line 91: Line 87:
   * When running on the 'General' partition, users' jobs are queued according to their fairshare score. The maximum running time is 2 days.
   * When running on the 'Requeue' partition, users' jobs are subject to pre-emption if those jobs happen to land on an investor-owned node. The maximum running time is 2 days.
-  * To get started please fill out our [[https://request.itrss.umsystem.edu/| Hellbender/RDE Account Request Form]]
+  * To get started please fill out our [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041| Hellbender Account Request Form]]
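The two general-tier behaviors above can be sketched with plain `sbatch` submissions. This is a sketch only: the lowercase partition names `general` and `requeue` are assumptions based on the names quoted above, and `job.sh` is a placeholder for your own batch script.

```shell
# Fairshare-ordered submission to the general pool, up to the stated 2-day maximum walltime.
sbatch --partition=general --time=2-00:00:00 job.sh

# Same job on the requeue partition: --requeue lets Slurm restart the job elsewhere
# if it lands on an investor-owned node and is preempted by a priority job.
sbatch --partition=requeue --time=2-00:00:00 --requeue job.sh
```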
  
 - **Paid access (Investor) tier compute**:
Line 101: Line 97:
   * All accounts are given 50GB of storage in /home/$USER as well as 500GB in /home/$USER/data at no cost.
   * MU PIs are eligible for one free 5TB group storage allocation in our RDE environment.
-  * To get started please fill our our general [[https://request.itrss.umsystem.edu/RSS Account Request Form]]
+  * To get started please fill out our general [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041| Hellbender Account Request Form]] for a Hellbender account and our [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1043| RDE Group Storage Request Form]] for the free 5TB group storage.
  
 - **Paid access (Investor) tier storage**:
Line 169: Line 165:
 **The 2025 pricing is: General Storage: $25/TB/Year, High Performance Storage: $95/TB/Year**
  
-To order storage please fill out our [[https://missouri.qualtrics.com/jfe/form/SV_6zkkwGYn0MGvMyO| RSS Services Order Form]]
+To order storage please fill out our [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceDet?ID=605| RSS Services Order Form]]
  
-=== Research Data Archive ===
+==== Research Data Archive ====
 **Use Case**
  
Line 185: Line 181:
  
 To utilize the tape archive functionality that RSS has set up, the data to be archived will need to be copied to RDE storage if it does not exist there already. This would require the following steps.
-  * Submit a RDE storage request if the data resides locally and a RDE share is not already available to the researcher: [[http://request.itrss.umsystem.edu|RSS Account Request Form]]
+  * Submit an RDE storage request if the data resides locally and an RDE share is not already available to the researcher: [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1043|RSS Group Storage Form]]
   * Create an archive folder or folders in the relevant RDE storage share to hold the data you would like to archive. The folder(s) can be named to signify the contents, but we ask that the name includes _archive at the end. For example, something akin to: labname_projectx_archive_2024.
   * Copy the contents to be archived to the newly created archive folder(s) within the RDE storage share.
Line 234: Line 230:
   * **[[https://LISTS.UMSYSTEM.EDU/scripts/wa-UMS.exe?SUBED1=RSSHPC-L&A=1&SUB=1| RSS Announcement List: Please Sign Up]]**
   * **[[https://missouri.qualtrics.com/jfe/form/SV_6zkkwGYn0MGvMyO|RSS Services: Order Form]]**
-  * **[[https://request.itrss.umsystem.edu/|Hellbender: Account Request Form]]**
-  * **[[https://missouri.qualtrics.com/jfe/form/SV_9LAbyCadC4hQdBY|Hellbender: Add User to Existing Account Form]]**
-  * **[[https://missouri.qualtrics.com/jfe/form/SV_6FpWJ3fYAoKg5EO|Hellbender: Course Request Form]]**
+  * **[[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041|Hellbender: Account Request Form]]**
+  * **[[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceDet?ID=1820|Hellbender: Add/Remove Users to Existing Groups]]**
+  * **[[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/TicketRequests/NewForm?ID=lE2Zh7XD%7e-o_&RequestorType=Service|Hellbender: Course Request Form]]**
  
  
Line 274: Line 270:
 === GPU nodes ===
  
-| **Model**   | **Nodes** | **Cores/Node** | **System Memory** | **GPU**  | **GPU Memory** | **GPUs/Node** | **Local Scratch** | **Cores** | **Node Names** |
+| **Model**   | **Nodes** | **Cores/Node** | **System Memory** | **GPU**  | **GPU Memory/GPU** | **GPUs/Node** | **Local Scratch** | **Cores** | **Node Names** |
 | Dell R750xa | 17        | 64             | 490 GB            | A100     | 80 GB          | 4        | 1.6 TB            | 1088    | g001-g017      |
 | Dell XE8640 | 2         | 104            | 2002 GB           | H100     | 80 GB          | 4        | 3.2 TB            | 208     | g018-g019      |
Line 282: Line 278:
 | Dell R740xd | 2         | 40             | 364 GB            | V100     | 32 GB          | 3        | 240 GB            | 80      | g026-g027      |
 | Dell R740xd | 1         | 44             | 364 GB            | V100     | 32 GB          | 3        | 240 GB            | 44      | g028           |
-| Dell R760xa | 6         | 64             | 490 GB            | H100     | 94 GB          | 2        | 1.8 TB            | 384     | g029-g034      |
-| Dell R760   | 6         | 64             | 490 GB            | L40S     | 45 GB          | 2        | 3.5 TB            | 384     | g035-g040      |
-* = Available Oct 14 | | | | | Total GPU | 124 | | Total Cores 2476 | |
+| Dell R760xa | 6         | 64             | 490 GB            | H100     | 94 GB          | 2        | 1.8 TB            | 384     | g029-g034      |
+| Dell R760   | 6         | 64             | 490 GB            | L40S     | 45 GB          | 2        | 3.5 TB            | 384     | g035-g040      |
+| Dell XE9680 | 1         | 96             | 2048 GB           | H200     | 141 GB         | 8        | 28 TB             | 96      | g041           |
+| **Total**   | 41        |                |                   |          |                | 132     |                   | 2572    |                |
  
 A specially formatted sinfo command can be run on Hellbender to report live information about the nodes and the hardware/features they have.
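The exact format string used on Hellbender is not reproduced here, but a similar live per-node report can be produced with standard `sinfo` format options. A sketch (the column widths are arbitrary choices):

```shell
# One line per node: node name, CPU count, memory (MB), feature tags, and GRES (GPUs).
# %N, %c, %m, %f, and %G are standard sinfo format specifiers.
sinfo -N -o "%.12N %.6c %.10m %.40f %.20G"
```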
Line 553: Line 550:
  
 ==== Moving Data ====
+
+**Use one of the following options to move data. Do not move data on the login node.**
  
 === Globus ===
Line 559: Line 558:
  
   * Hellbender Collection Name: U MO ITRSS RDE
-  * Lewis Collection Name:  MU RCSS Lewis Home Directories 
   * Mill Collection Name: Missouri S&T Mill
-  * Foundry Collection Name: Missouri S&T HPC Storage 
  
 More detailed information on how to use Globus is at [[https://docs.itrss.umsystem.edu/pub/hpc/hellbender#globus1]]
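For scripted transfers, the collection names above can also be used with the Globus CLI. A sketch, assuming the Globus CLI is installed and you have run `globus login`; the endpoint UUIDs below are placeholders you would look up first:

```shell
# Find the UUID of the Hellbender/RDE collection by its display name.
globus endpoint search "U MO ITRSS RDE"

# Transfer a directory between two collections (UUIDs and paths are placeholders).
SRC_EP=00000000-0000-0000-0000-000000000000
DST_EP=11111111-1111-1111-1111-111111111111
globus transfer --recursive --label "RDE transfer" \
    "$SRC_EP:/path/to/source" "$DST_EP:/path/to/destination"
```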