**Request an Account:**
You can request an account for access to Hellbender by filling out the form found at:
[[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041|Hellbender Account Request Form]]
  
==== What is Hellbender? ====
**Hellbender** is the latest High Performance Computing (HPC) resource available to researchers and students (with sponsorship by a PI) within the UM-System.
  
**Hellbender** consists of 222 mixed x86-64 CPU nodes providing 22,272 cores as well as 40 GPU nodes with a mix of Nvidia GPUs (see the hardware section for more details). Hellbender is attached to our Research Data Ecosystem ('RDE'), which consists of 8PB of high performance and general purpose research storage. RDE is also accessible from devices outside of Hellbender, creating a single research data location across different computational environments.
  
==== Investment Model ====
^ Service                              ^ Rate        ^ Unit         ^ Support        ^
|Hellbender CPU Node | $2,702.00 | Per Node/Year | Year to Year |
|Hellbender A100 GPU Node* | $7,691.38 | Per Node/Year | Year to Year |
|Hellbender L40s GPU Node* | $4,785.00 | Per Node/Year | Year to Year |
|Hellbender H100 GPU Node* | $13,123.00 | Per Node/Year | Year to Year |
|RDE Storage: High Performance | $95.00 | Per TB/Year | Year to Year |
|RDE Storage: General Performance | $25.00 | Per TB/Year | Year to Year |
  
***Update 10/2025**: Additional GPU priority partitions cannot be allocated at this time, as GPU investment has reached beyond the 50% threshold. If you require capacity beyond the general pool, we are able to plan and work with your grant submissions to add additional capacity to Hellbender.
  
  
  * When running on the 'General' partition, users' jobs are queued according to their fairshare score. The maximum running time is 2 days (see the example batch script after this list).
  * When running on the 'Requeue' partition, users' jobs are subject to pre-emption if those jobs happen to land on an investor-owned node. The maximum running time is 2 days.
  * To get started please fill out our [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041|Hellbender Account Request Form]]
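A minimal sketch of a batch script for the free tier (the lowercase partition name ''general'', the resource values, and the program name are assumptions, not Hellbender's documented defaults; check ''sinfo'' on the cluster for the authoritative partition names):

<code bash>
#!/bin/bash
#SBATCH --job-name=example_job      # hypothetical job name
#SBATCH --partition=general         # assumed name of the 'General' partition
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=2-00:00:00           # 2 days, the stated maximum running time

srun ./my_program                   # hypothetical executable
</code>

Submitting the same script with ''--partition=requeue'' instead trades the risk of pre-emption for access to idle investor-owned nodes.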
  
- **Paid access (Investor) tier compute**:
  * All accounts are given 50GB of storage in /home/$USER as well as 500GB in /home/$USER/data at no cost (a usage-check example follows this list).
  * MU PIs are eligible for one free 5TB group storage share in our RDE environment.
  * To get started please fill out our general [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041|Hellbender Account Request Form]] for a Hellbender account and our [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1043|RDE Group Storage Request Form]] for the free 5TB group storage.
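To see how much of these allocations you are using, standard tools suffice; a minimal sketch (assuming each area is backed by its own quota-limited mount, otherwise fall back to ''du -sh''):

<code bash>
# Show capacity and usage of the filesystems backing the free allocations
df -h /home/$USER /home/$USER/data
</code>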
  
- **Paid access (Investor) tier storage**:
The investment structure for GPU nodes is the same as CPU: per node, per year. If you have funds available that you would like to use to pay for multiple years up front, we can accommodate that. Once 50% of the total GPU nodes in the cluster are investor-owned, we will restrict additional leases until more nodes become available, either purchased or surrendered by other PIs. The GPU nodes available for investment comprise the following:
  
| Model       | # Nodes | Cores/Node | System Memory | GPU  | GPU Memory | # GPU/Node | Local Scratch | # Cores |
| Dell R750xa | 17      | 64         | 490 GB        | A100 | 80 GB      | 4          | 1.6 TB        | 1088    |
| Dell R760xa | 6       | 64         | 490 GB        | H100 | 94 GB      | 2          | 1.8 TB        | 384     |
| Dell R760   | 6       | 64         | 490 GB        | L40S | 45 GB      | 2          | 3.5 TB        | 384     |
  
  * A100 Node: $7,691.38 Per Node/Year
  * H100 Node: $13,123.00 Per Node/Year
  * L40S Node: $4,785.00 Per Node/Year
  
==== Storage: Research Data Ecosystem ('RDE') ====
**Costs**
  
The cost associated with using the RDE tape archive is $8/TB for short-term data kept inside the tape library for 1-3 years, or $144 per tape (rounded up to a whole number of tapes) for tapes sent offsite for long-term retention of up to 10 years. We send these tapes to records management, where they are stored in a climate-controlled environment. Each tape from the current generation (LTO 9) holds approximately 18 TB of data. These are flat, one-time costs, and you have the option to do both a short-term in-library copy and a longer-term offsite copy, or one or the other, providing flexibility.
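For example, sending 40 TB offsite would require ceil(40 / 18) = 3 LTO 9 tapes, a one-time cost of 3 × $144 = $432; adding a short-term in-library copy of the same 40 TB would cost a further 40 × $8 = $320.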
  
**Request Process**
  
To utilize the tape archive functionality that RSS has set up, the data to be archived will need to be copied to RDE storage if it does not exist there already. This would require the following steps.
  * Submit an RDE storage request if the data resides locally and an RDE share is not already available to the researcher: [[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1043|RSS Group Storage Form]]
  * Create an archive folder or folders in the relevant RDE storage share to hold the data you would like to archive. The folder(s) can be named to signify the contents, but we ask that the name includes _archive at the end. For example, something akin to: labname_projectx_archive_2024.
  * Copy the contents to be archived to the newly created archive folder(s) within the RDE storage share (see the example copy command after this list).
  * Submit an RDE tape archive request: [[https://archiverequest.itrss.umsystem.edu]]
  * Once the tape archive jobs are completed, ITRSS will notify you and send you an Archive job report, after which you can delete the contents of the archive folder.
  * We request that subsequent archive jobs be added to a separate folder, or the initial folder renamed to something that signifies the time of archive for easier retrieval: *_archive2024, *_archive2025, etc.
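A minimal sketch of the copy step referenced above (both paths are hypothetical; substitute your actual source directory and the mount point of your RDE share):

<code bash>
# Copy data into the archive folder, preserving permissions and timestamps
rsync -av --progress /path/to/local/projectx/ \
      /mnt/rde/labname/labname_projectx_archive_2024/
</code>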
  * **[[https://LISTS.UMSYSTEM.EDU/scripts/wa-UMS.exe?SUBED1=RSSHPC-L&A=1&SUB=1| RSS Announcement List: Please Sign Up]]**
  * **[[https://missouri.qualtrics.com/jfe/form/SV_6zkkwGYn0MGvMyO|RSS Services: Order Form]]**
  * **[[https://tdx.umsystem.edu/TDClient/36/DoIT/Requests/ServiceOfferingDet?ID=1041|Hellbender: Account Request Form]]**
  * **[[https://missouri.qualtrics.com/jfe/form/SV_9LAbyCadC4hQdBY|Hellbender: Add User to Existing Account Form]]**
  * **[[https://missouri.qualtrics.com/jfe/form/SV_6FpWJ3fYAoKg5EO|Hellbender: Course Request Form]]**
Dell C6420: 0.5 unit server containing dual 24 core Intel Xeon Gold 6252 CPUs with a base clock of 2.1 GHz. Each C6420 node contains 384 GB DDR4 system memory.
  
Dell R6625: 1 unit server containing dual 128 core AMD EPYC 9754 CPUs with a base clock of 2.25 GHz. Each R6625 node contains 1 TB DDR5 system memory.

Dell R6625: 1 unit server containing dual 128 core AMD EPYC 9754 CPUs with a base clock of 2.25 GHz. Each R6625 node contains 6 TB DDR5 system memory.
  
| **Model**  | **Nodes** | **Cores/Node** | **System Memory** | **CPU**                                  | **Local Scratch**   | **Cores** | **Node Names** |
| Dell C6420 | 64        | 48             | 364 GB            | Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz | 1 TB                | 3072      | c146-c209      |
| Dell R6625 | 12        | 256            | 994 GB            | AMD EPYC 9754 128-Core Processor         | 1.5 TB              | 3072      | c210-c221      |
| Dell R6625 | 2         | 256            | 6034 GB           | AMD EPYC 9754 128-Core Processor         | 1.6 TB              | 512       | c222-c223      |
|            |           |                |                   |                                          | Total Cores         | 22272     |                |
  
=== GPU nodes ===
  
| **Model**   | **Nodes** | **Cores/Node** | **System Memory** | **GPU**  | **GPU Memory** | **GPUs/Node** | **Local Scratch** | **Cores** | **Node Names** |
| Dell R750xa | 17        | 64             | 490 GB            | A100     | 80 GB          | 4             | 1.6 TB            | 1088      | g001-g017      |
| Dell XE8640 | 2         | 104            | 2002 GB           | H100     | 80 GB          | 4             | 3.2 TB            | 208       | g018-g019      |
| Dell XE9640 | 1         | 112            | 2002 GB           | H100     | 80 GB          | 8             | 3.2 TB            | 112       | g020           |
| Dell R740xd | 2         | 40             | 364 GB            | V100     | 32 GB          | 3             | 240 GB            | 80        | g026-g027      |
| Dell R740xd | 1         | 44             | 364 GB            | V100     | 32 GB          | 3             | 240 GB            | 44        | g028           |
| Dell R760xa | 6         | 64             | 490 GB            | H100     | 94 GB          | 2             | 1.8 TB            | 384       | g029-g034      |
| Dell R760   | 6         | 64             | 490 GB            | L40S     | 45 GB          | 2             | 3.5 TB            | 384       | g035-g040      |
|             |           |                |                   |          | Total GPUs     | 124           | Total Cores       | 2476      |                |
  
A specially formatted sinfo command can be run on Hellbender to report live information about the nodes and the hardware/features they have.
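A minimal sketch of such a query (this format string is illustrative, not the exact one used on Hellbender):

<code bash>
# One line per node: name, CPU count, memory (MB), GRES (GPUs), feature tags
sinfo -N -o "%.10N %.6c %.10m %.25G %f"
</code>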
Below is the process for setting up a class on the OOD portal.
  
  - Send the class name, the list of students and TAs, and any shared storage requirements to itrss-support@umsystem.edu. This can also be accomplished by filling out our [[https://missouri.qualtrics.com/jfe/form/SV_6FpWJ3fYAoKg5EO|Hellbender: Course Request Form]].
  - We will add the students to the group allowing them access to OOD.
  - If the student does not have a Hellbender account yet, they will be presented with a link to a form to fill out requesting a Hellbender account.
  
**Documentation**: http://docs.nvidia.com/cuda/index.html

==== RStudio ====

[[https://youtu.be/WuAwXMUYE_Y]]
  
==== Visual Studio Code ====