Investing in Discovery

Overview

  • The DISCOVERY general-purpose cluster is an exciting opportunity for researchers to participate in creating a world-class supercomputer devoted to furthering research at Dartmouth.
  • Researchers considering their own purchase of a Linux cluster are invited to consider the advantages of joining the cooperative Discovery cluster.
  • Discovery has been running since the fall of 2005 and currently contains over 4000 cores for jobs. 

Benefits of participating in the Discovery cluster

  • Access to a 4000+ core cluster.
  • Full systems support, which allows users to focus on their research without having to worry about running a system.
  • We will install your software, help get your code running and make sure the system continues to meet the needs of its user base.
  • Access to more resources than purchased. Users may utilize additional resources if these are not in use.
  • The cluster’s administrator ensures that your home directory data is backed up, that the system is secure, and that it has the updates and system changes needed to meet your research needs.
  • All stakeholders of the Discovery cluster can help determine the direction of Discovery via the Discovery User group.

The Discovery support team provides

  • General user support and training on cluster utilization.
  • Systems administration support as well as software installation.
  • Programming, debugging, parallelization and optimization support.
  • Electrical power (including an uninterruptible power supply), machine room space, and cooling.
  • Network capacity.
  • System software, clustering software, setup, configuration, and some standard compilers and research applications (e.g., MATLAB, Fortran 90/95).
  • Prompt repairs.
  • Local, temporary storage for use during computational runs.

"Free tier": Research Computing Bill-of-Rights level of access

Dartmouth offers a subvented tier free of charge to researchers through the DISCOVERY cluster. This tier is designed to give users immediate access to computing resources, so that researchers can address computationally intensive tasks without incurring significant expenses.

Each account within this tier is provided access to:

  • Compute:
  • Research Data Storage:
    • everyone: 50GB Home directory on DartFS, up to 5TB of global scratch space (no backup, temporary storage)
    • faculty: 1TB of Lab space for shared research data storage on DartFS
  • Consulting and Facilitation services:
    • Consulting and facilitation services are available for everyday support, training, and proof-of-concept development. Expertise ranges from high-performance computing management to data science, professional software engineering, GIS, scientific and research programming, and much more. Please reach out to Research.Computing@dartmouth.edu for more information.

Paid tiers and how to join the Discovery cluster

There are four models:

  1. We sell CPU shares for $1,000/year each; one share lets you schedule up to 80 cores. This is a community model: if the cluster has 80 cores available, all of your jobs run immediately. At busy times, however, your jobs will queue and run when resources become available. With the purchase of one CPU share, at least 20 cores should be available to you at all times.
  2. For GPU shares we offer V100s ($2,500/year) and A100s ($3,500/year). GPU shares also use the community model, but the community is much smaller. On the A100s we can enable Multi-Instance GPU (MIG) if desired. We work with each group that buys in to ensure we are meeting its target service level.
  3. In the third model, we buy dedicated hardware with your funds and make it part of the cluster. You are given a priority queue that allows you to make full use of your hardware at any time. We also create a "preemptable" queue that other researchers can submit to. Users of these preemptable queues understand that they can run jobs when the hardware is free, but that a job may be killed at any point if the owning group needs the hardware.
  4. In the last model, we buy private nodes with your funds and they are not part of the cluster. We do this most often with collaborators in research and development areas who need specialized systems, require full access at all times, or work with sensitive data that must be isolated. In this case we establish a statement of work (SOW) with the research group and charge a System Administrator fee to manage the systems. The costs vary and are negotiated based on the amount of staff effort required to support the custom environment.
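
Because a job in a preemptable queue may be killed at any point, it can help to write such jobs so that a restarted run picks up where the previous one stopped. The following is a minimal sketch of that idea in Python, assuming the work can be split into numbered steps. The function names and the checkpoint file are hypothetical illustrations, not part of Discovery's software or configuration.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next step to run (0 if no checkpoint exists)."""
    try:
        with open(path) as fh:
            return json.load(fh)["next_step"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_step: int) -> None:
    """Write the checkpoint via a temporary file so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"next_step": next_step}, fh)
    os.replace(tmp_path, path)  # rename over the old checkpoint in one operation


def run(path: str, total_steps: int, do_step) -> None:
    """Run the remaining steps, recording progress after each one."""
    for step in range(load_checkpoint(path), total_steps):
        do_step(step)
        save_checkpoint(path, step + 1)
```

If the job is killed during a step, resubmitting it repeats only that step and the ones after it. The checkpoint should live on storage that survives the job, not on node-local temporary storage.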

Summary of cost structures

Item                                      Cost
One CPU share (80 CPU cores)              $1,000/year
One V100 GPU share                        $2,500/year
One A100 GPU share                        $3,500/year
Dedicated hardware, part of the cluster   Variable: just the cost of the hardware
Private hardware outside the cluster      Variable: the cost of the hardware and a negotiated System Administrator fee
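
For budgeting, the share prices in the cost summary above can be combined with simple arithmetic. The sketch below is only an illustration: the `SharePlan` class is hypothetical, and it assumes that the 80-core figure for one CPU share scales linearly with the number of shares, which this article does not state.

```python
from dataclasses import dataclass

# Prices and core count taken from the cost summary in this article.
CPU_SHARE_PRICE = 1_000     # USD per year, per CPU share
V100_SHARE_PRICE = 2_500    # USD per year, per V100 GPU share
A100_SHARE_PRICE = 3_500    # USD per year, per A100 GPU share
CORES_PER_CPU_SHARE = 80    # schedulable cores for one CPU share


@dataclass
class SharePlan:
    cpu_shares: int = 0
    v100_shares: int = 0
    a100_shares: int = 0

    def annual_cost(self) -> int:
        """Total yearly cost in USD for the shares in this plan."""
        return (self.cpu_shares * CPU_SHARE_PRICE
                + self.v100_shares * V100_SHARE_PRICE
                + self.a100_shares * A100_SHARE_PRICE)

    def schedulable_cores(self) -> int:
        """Upper bound on CPU cores, ASSUMING cores scale linearly with shares."""
        return self.cpu_shares * CORES_PER_CPU_SHARE


plan = SharePlan(cpu_shares=2, a100_shares=1)
print(plan.annual_cost())        # 2 * 1000 + 1 * 3500 = 5500
print(plan.schedulable_cores())  # 2 * 80 = 160
```

Dedicated and private hardware are left out because their costs are variable and negotiated case by case.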

Running jobs and scheduling

  • If there are available resources, a user can run jobs on up to 4 times more cores than they purchased.
  • In general there are always available nodes on the cluster, but when the cluster is fully utilized, jobs may need to wait in the queue.
  • You may contact us in advance of any deadlines so that we can ensure resources are available to run your jobs during high-use periods.
  • Scheduling priority is based on the number of nodes purchased.
  • Users who purchase more nodes will be able to run more jobs and their queued jobs will get to the run queue faster.
  • Users are allowed to log in to nodes directly to check on the status of their jobs, but all jobs need to be submitted through the queue.
  • Slurm is used to manage jobs on the cluster and users are able to check the status of their jobs.
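
As an illustration of submitting work through the queue, the sketch below builds the text of a small Slurm batch script in Python. The `#SBATCH` options shown (`--job-name`, `--ntasks`, `--time`, `--output`) are standard Slurm options; the helper function, the job name, and `./my_program` are hypothetical. Partition and account names are site-specific and are deliberately omitted because this article does not list them.

```python
def render_batch_script(job_name: str, ntasks: int, time_limit: str, command: str) -> str:
    """Build the text of a Slurm batch script from a few common options."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",   # wall-clock limit, e.g. HH:MM:SS
        "#SBATCH --output=%x-%j.out",      # %x = job name, %j = job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


script = render_batch_script("demo", ntasks=4, time_limit="01:00:00",
                             command="srun ./my_program")
print(script)
```

Saving the output to a file such as `demo.sh` and running `sbatch demo.sh` places the job in the queue, and `squeue -u $USER` then lists your pending and running jobs.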

Software and operating environment

  • Home directories are on an NFS server that all the nodes can see.
  • Discovery is run as a production cluster in the Berry Machine Room (BMR), which reduces downtime due to power interruptions.
  • Currently, a 64-bit version of Red Hat Enterprise Linux 8.10 is the base OS on the nodes.
  • Portland Group, Intel, and GNU compilers for C, C++, and Fortran are installed on the system, as well as multiple versions of MPI.
  • Java, Perl, and Python, as well as other open source programs, are installed.
  • Additional software will be installed upon request.

Frequently asked questions

  1. How do I get help or ask questions about Discovery?
    1. For help or to ask questions, send email to research.computing@dartmouth.edu.
  2. I’m on a deadline. How do I make sure my jobs run through quickly?
    1. We will work hard to accommodate users who have deadlines. The sooner we know about an expected high-demand period, the better able we will be to meet your needs.
    2. With one week's notice, we can typically provide a user with many more resources than they purchased. However, if many users have deadlines at the same time, all users will be allocated the number of nodes they purchased.
  3. My group has a special software package that we alone are allowed to run. Will other people be able to access this package when it’s installed on Discovery?
    1. No, other users will not be able to access your software. We will install licensed software that you own and will set up the system so that only users in your group will be able to use this software.

Details

Article ID: 133133
Created: Thu 6/10/21 10:06 AM
Modified: Mon 12/2/24 8:56 AM