pub:hpc:hellbender (current revision 2026/04/21 16:15 by redmonp; previous revision 2025/01/31 16:04 by bjmfg8)
**Request an Account:**

You can request an account for access to Hellbender by filling out the form found at:
[[https://tdx.umsystem.edu/

==== What is Hellbender? ====
**Hellbender** is the latest High Performance Computing (HPC) resource available to researchers and students (with sponsorship by a PI) within the UM-System.

**Hellbender** consists of 263 mixed x86-64 CPU nodes providing

==== Investment Model ====
General access will be open to any research or teaching faculty, staff, and students for any UM system campus. General access is defined as open access to all resources available to users of the cluster at an equal fairshare value. This means that all users will have the same level of access to the general resource.

Research users of the general access portion of the cluster will be given the RDE Standard Allocation to operate from. Larger storage allocations
| - | + | ||
| - | === Hellbender Advanced: Priority Access === | + | |
| - | + | ||
| - | When researcher needs are not being met at the general access level, researchers may request an advanced allocation on Hellbender to gain priority access. Priority access will give research groups a limited set of resources that will be available to them without competition from general access users. Priority Access will be provided to a specific set of hardware through a priority partition which contains these resources. This partition will be created, and limited to use by the user and their associated group. These resources will also be in an overlapping pool of resources available to general access users. This pool will be administered such that if a priority access user submits jobs to their priority access partition, any jobs running on those resources from the overlapping partition will be requeued and begin execution again on another resource in that partition if available, or return to wait in the queue for resources. Priority access users will retain general access status, fairshare will still play a part in moderating their access to the general resource. Fairshare inside a priority partition determine which user’s jobs are selected for execution next inside this partition. The jobs running inside this priority partition will also affect a user’s fairshare calculations even for resources in the general access partition. Meaning that running a large amount of jobs inside a priority partition will lower a user’s priority for the general resources as well. | + | |
| - | + | ||
| - | === Traditional Investment === | + | |
| - | + | ||
| - | Hellbender Advanced Allocation requests that are not approved for DRII Priority Designation may be treated as traditional investments with the researcher paying for the resources used to create the Advanced Allocation at the defined rate. These rates are subject to change based on the determination of DRII, and hardware costs. | + | |
=== Resource Management ===
Priority access resources will generally be made available from existing hardware in the general access pool and the funds will be retained for a future time to allow a larger pool of funds to accumulate for expansion of the resource. This will allow the greatest return on investment over time. If the general availability resources are less than 50% of the overall resource, an expansion cycle will be initiated to ensure all users will still have access to a significant amount of resources. If a researcher or research group is contributing a large amount of funding, it may trigger an expansion cycle if that is determined to be advantageous at the time of the contribution.
| + | |||
| + | === Hellbender Advanced: Priority Access - Investment === | ||
| + | |||
| + | When researcher needs are not being met at the general access level, researchers may request an advanced allocation on Hellbender to gain priority access via investment. Priority access will give research groups a limited set of resources that will be available to them without competition from general access users. Priority Access will be provided to a specific set of hardware through a priority partition which contains these resources. This partition will be created, and limited to use by the user and their associated group. These resources will also be in an overlapping pool of resources available to general access users. This pool will be administered such that if a priority access user submits jobs to their priority access partition, any jobs running on those resources from the overlapping partition will be requeued and begin execution again on another resource in that partition if available, or return to wait in the queue for resources. Priority access users will retain general access status, fairshare will still play a part in moderating their access to the general resource. Fairshare inside a priority partition determine which user’s jobs are selected for execution next inside this partition. The jobs running inside this priority partition will also affect a user’s fairshare calculations even for resources in the general access partition. Meaning that running a large amount of jobs inside a priority partition will lower a user’s priority for the general resources as well. | ||
=== Benefits of Investing ===
==== How Much Does Investing Cost? ====

See our rates for FY 2025-2026:

^ Service ^ Cost ^ Unit ^ Term ^
| Hellbender CPU Node | $2,702.00 | Per Node/Year | Year to Year |
| Hellbender GPU Node* | $7,691.38 | Per Node/Year | Year to Year |
| Hellbender L40s GPU Node* | $4,785.00 | Per Node/Year | Year to Year |
| Hellbender H100 GPU Node* | $13,
| RDE Storage: High Performance | $95.00 | Per TB/Year | Year to Year |
| RDE Storage: General Performance | $25.00 | Per TB/Year | Year to Year |

***Update
  * When running on the '
  * When running on the '
  * To get started please fill out our [[https://tdx.umsystem.edu/
- **Paid access (Investor) tier compute**:
  * All accounts are given 50GB of storage in /home/$USER as well as 500GB in /
  * MU PIs are eligible for 1 free 5TB group storage allocation in our RDE environment
  * To get started please fill out our general [[https://tdx.umsystem.edu/
- **Paid access (Investor) tier storage**:
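Once an account is active, work on the cluster is submitted through Slurm. The sketch below writes a minimal batch script for the general-access tier; the partition name ''general'' and the resource values are assumptions based on the tiers described above, so confirm the real partition names with ''sinfo'' on Hellbender before submitting.

```shell
# Sketch of a minimal Slurm batch script for the general-access tier.
# "general" as the partition name is an assumption; confirm with `sinfo`.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=general   # assumed general-access partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00
srun hostname
EOF
echo "Submit with: sbatch job.sh"
```

On the cluster itself you would then run ''sbatch job.sh'' and check the queue with ''squeue -u $USER''.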
| Dell C6525 | 112 | 128 | 490 GB | 1.6 TB | 14336 | c001-c112 |

**The 2025 pricing is: $2,702 per node per year.**

==== GPU Node Lease ====
The investment structure for GPU nodes is the same as CPU - per node per year. If you have funds available that you would like to pay for multiple years up front, we can accommodate that. Once Hellbender has hit 50% of the total GPU nodes in the cluster being investor-owned, we will restrict additional leases until more nodes become available, either purchased or surrendered by other PIs. The GPU nodes available for investment comprise the following:

| Model | # Nodes | Cores/Node | System Memory | GPU | GPU Memory | # GPU/Node | Local Scratch | # Cores |
| Dell R740xa | 17 | 64 |
| Dell R740xa | 6 | 64 | 490 GB | H100 | 94 GB | 2 | 1.8 TB | 384 |
| Dell R760 | 6 | 64 | 490 GB | L40S | 45 GB | 2 | 3.5 TB | 384 |
==== Storage: Research Data Ecosystem ('RDE') ====
  * Storage lab allocations are protected by associated security groups applied to the share, with group member access administered by the assigned PI or appointed representative.

**What is the Difference between High Performance and General Performance Storage?**

On Pixstor, which is used for standard HPC allocations,

On VAST, which is used for non HPC and mixed HPC / SMB workloads, the disks are all flash but general storage allocations have a QOS policy attached that limits IOPS to prevent the share from the possibility of saturating the disk pool to the point where high-performance allocations are impacted.
  * Workloads that require sustained use of low latency read and write IO with multiple GB/s, generally generated from jobs utilizing multiple NFS mounts

**Snapshots**

  * VAST default policy retains 7 daily and 4 weekly snapshots for each share
  * Pixstor default policy is 10 daily snapshots

**__None of the cluster attached storage available to users is backed up in any way by us__**; this means that if you delete something and don't have a copy somewhere else, it is gone. Please note the data stored on cluster attached storage is limited to Data Class 1 and 2 as defined by [[https://

**The 2025 pricing is: General Storage: $25/

To order storage please fill out our [[https://tdx.umsystem.edu/TDClient/36/DoIT/
==== Research Data Archive ====

**Use Case**
**Costs**

The cost associated with using the RDE tape archive is $8/TB for short-term data kept inside the tape library for 1-3 years, or $144 per tape (rounded up to a whole number of tapes) for tapes sent offsite for long-term retention of up to 10 years. We send these tapes to records management, where they are stored in a climate-controlled environment. Each tape from the current generation LTO 9 holds approximately 18TB of data. These are flat, one-time costs, and you have the option to do both a short-term in-library copy and a longer-term offsite copy, or one or the other, providing flexibility.
**Request Process**

To utilize the tape archive functionality that RSS has set up, the data to be archived will need to be copied to RDE storage if it does not exist there already. This requires the following steps.

  * Submit an RDE storage request if the data resides locally and an RDE share is not already available to the researcher: [[https://tdx.umsystem.edu/
  * Create an archive folder or folders in the relevant RDE storage share to hold the data you would like to archive. The folder(s) can be named to signify the contents, but we ask that the name includes _archive at the end. For example, something akin to: labname_projectx_archive_2024.
  * Copy the contents to be archived to the newly created archive folder(s) within the RDE storage share.
  * Submit an RDE tape archive request: [[https://archiverequest.itrss.umsystem.edu]]
  * Once the tape archive jobs are completed, ITRSS will notify you and send you an Archive job report, after which you can delete the contents of the archive folder.
  * We request that subsequent archive jobs be added to a separate folder, or the initial folder renamed to something that signifies the time of archive for easier retrieval: *_archive2024,
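The staging steps above can be sketched in shell. The share path and payload below are placeholders (not real RDE paths); only the "*_archive" naming convention comes from the instructions above.

```shell
# Sketch: stage data into an archive folder following the "*_archive"
# naming convention. The share path is a placeholder, not a real RDE share.
share="./labname_share"                        # stand-in for your RDE storage share
archive="$share/labname_projectx_archive_2024" # name signifies contents + _archive
mkdir -p "$archive"

# Example payload standing in for the data you want archived.
mkdir -p ./results_to_archive
echo "example data" > ./results_to_archive/data.txt

# Copy the contents to be archived into the archive folder.
cp -r ./results_to_archive/. "$archive"/
ls "$archive"
```

After the copy completes, you would submit the tape archive request form linked above.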
**Recovery**
  * **[[https://
  * **[[https://
  * **[[https://
  * **[[https://
  * **[[https://
  * **[[https://
==== Software ====

Hellbender

==== Hardware ====
Dell C6420: .5 unit server containing dual 24 core Intel Xeon Gold 6252 CPUs with a base clock of 2.1 GHz. Each C6420 node contains 384 GB DDR4 system memory.

Dell R6625: 1 unit server containing dual 128 core AMD EPYC 9754 CPUs with a base clock of 2.25 GHz. Each R6625 node contains 1 TB DDR5 system memory.

Dell R6625: 1 unit server containing dual 128 core AMD EPYC 9754 CPUs with a base clock of 2.25 GHz. Each R6625 node contains 6 TB DDR5 system memory.

| **Model** | **# Nodes** | **Cores/Node** | **System Memory** | **CPU** |
| Dell C6525 | 112 | 128 | 490 GB | AMD EPYC 7713 64-Core |
| Dell R640 | 32 | 40 |
| Dell C6420 | 64 | 48 |
| Dell R6625 | 12 | 256 | 994 GB |
| Dell R6625 | 2 | 256 | 6034 GB | AMD EPYC 9754 128-Core Processor |
=== GPU nodes ===
| **Model** | **# Nodes** | **Cores/Node** | **System Memory** | **GPU** | **GPU Memory** | **# GPU/Node** | **Local Scratch** | **# Cores** | **Nodes** |
| Dell R750xa |
| Dell XE8640 | 2 | 104 | 2002 GB | H100 | 80 GB | 4 | 3.2 TB | 208 | g018-g019 |
| Dell XE9640 | 1 | 112 | 2002 GB | H100 | 80 GB | 8 | 3.2 TB | 112 | g020 |
| Dell R730 | 4 | 20 |
| Dell R7525 | 1 |
| Dell R740xd |
| Dell R740xd | 1 | 44 |
| Dell R760xa | 6 | 64 | 490 GB | H100 | 94 GB | 2 | 1.8 TB | 384 | g029-g034 |
| Dell R760 | 6 | 64 | 490 GB | L40S | 45 GB | 2 | 3.5 TB | 384 | g035-g040 |
| Dell XE9680 | 1 | 96 |

A specially formatted sinfo command can be run on Hellbender to report live information about the nodes and the hardware/
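As a sketch of such a command: the generic Slurm format string below reports per-node CPU, memory, GPU (GRES), and state information. The exact format string used on Hellbender may differ; this is an illustrative assumption, not the documented command.

```shell
# Sketch: per-node hardware report via Slurm's sinfo. Format specifiers:
# %n=hostname, %c=CPUs, %m=memory (MB), %G=GRES (e.g. gpu:...), %T=node state.
# The specially formatted command used on Hellbender may differ from this one.
SINFO_FORMAT="%n %c %m %G %T"
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -h -o "$SINFO_FORMAT" | sort -u
else
    echo "sinfo not found; run this on a Hellbender login node"
fi
```

Run on a login node, this prints one line per node, which is handy for checking which nodes carry GPUs before targeting a partition.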
==== Open OnDemand ====

  * https://
  * https://

OnDemand provides an integrated, single access point for all of your HPC resources. The following apps are currently available on Hellbender'
==== Moving Data ====

**Use one of the following options to move data. Do not move data on the login node.**

=== Globus ===

  * Hellbender Collection Name: U MO ITRSS RDE
  * Mill Collection Name: Missouri S&T Mill

More detailed information on how to use Globus is at [[https://
Below is the process for setting up a class on the OOD portal.

  - Send the class name, the list of students and TAs, and any shared storage requirements to itrss-support@umsystem.edu.
  - We will add the students to the group allowing them access to OOD.
  - If the student does not have a Hellbender account yet, they will be presented with a link to a form to fill out requesting a Hellbender account.
**Documentation**:

==== RStudio ====

[[https://

==== Visual Studio Code ====
After you’ve signed up and logged in to Globus, you’ll begin at the File Manager.

**note:
https://
If symlinks need to be copied, consider using rsync on the DTN with the -l flag**
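The rsync approach can be sketched as follows. The paths here are local placeholders so the example is self-contained; on the DTN you would point rsync at the real source and destination trees.

```shell
# Sketch: copy a tree while preserving symlinks, as rsync's -l flag does.
# Paths are placeholders; on the DTN, use the real source/destination paths.
mkdir -p demo_src demo_dst
echo "payload" > demo_src/file.txt
ln -sf file.txt demo_src/link_to_file   # a symlink Globus would not transfer
if command -v rsync >/dev/null 2>&1; then
    rsync -a -l demo_src/ demo_dst/     # -a implies -l; -l copies symlinks as symlinks
else
    cp -a demo_src/. demo_dst/          # fallback for this demo only; also keeps links
fi
ls -l demo_dst/
```

After the copy, ''demo_dst/link_to_file'' is still a symlink rather than a duplicated file, which is the behavior the note above recommends when links matter.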
| + | |||
| + | |||
| + | |||
The first time you use the File Manager, all fields will be blank: