AI Institute

Welcome to the AI Institute landing page. Here you will find information about the available computing resources and how to start using them.

Hardware

 

The AI Institute has three computational resources: the AI Cluster, AI-Orion, and the DGX A100. The AI Cluster and the DGX A100 share a 100Gb Ethernet network and a 50TB all-flash storage array, so all user data is accessible from both. AI-Orion has separate 18TB storage.

 

The AI Cluster is a heterogeneous GPU cluster of 4 servers (detailed below). Jobs are submitted via the submit node (submit.ai.stonybrook.edu) through SLURM; a few commands for inspecting the nodes are shown after the server list.

 

  • Ai02
    • 8x Tesla V100-SXM2 GPUs (each with 32GB RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • Ai03
    • 8x Tesla V100-SXM2 GPUs (each with 32GB RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • Ai04
    • 8x Quadro RTX 8000 GPUs (each with 48GB GDDR6 RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • Ai05
    • 8x Quadro RTX 8000 GPUs (each with 48GB GDDR6 RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • DGX A100 is a stand-alone server (details below)
    • 8x A100 GPUs with 320GB total GPU memory
    • 6x NVIDIA NVSwitches providing 4.8TB/s bi-directional bandwidth
    • Dual AMD Rome 7742 CPUs (each with 2.25GHz, 64 cores)
    • 1TB DDR4 RAM
    • For more information, see NVIDIA's DGX A100 documentation.

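Once you have an account (see "Access to AI Cluster" below), the state of these nodes can be checked from the submit node with standard SLURM commands. A quick sketch (the lowercase node name is an assumption; sinfo prints the real ones):

    sinfo -N -l               # list all nodes with their state and resources
    squeue -u $USER           # show your own queued and running jobs
    scontrol show node ai02   # detailed view of one node, including its GPUs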

Software:

AI Cluster:

  • Linux OS: "Ubuntu 18.04.5 LTS"
  • CUDA Version: 11.2
  • Slurm Workload Manager: 20.02.6

AI-Orion (stand-alone server):

  • Linux OS: "Ubuntu 18.04.3 LTS"
  • CUDA Version: 10.1
  • Docker version 18.09.6

DGX A100 (stand-alone server):

  • Linux OS: "Ubuntu 18.04.5 LTS"
  • CUDA Version: 11.0
  • Docker version 19.03.8

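To confirm what a given machine is actually running, the standard version commands should work after you log in; note that Docker is only present on AI-Orion and the DGX, and sinfo only on the AI Cluster:

    lsb_release -d      # Linux distribution and release
    nvidia-smi          # GPU driver and the CUDA version it supports
    nvcc --version      # installed CUDA toolkit, if it is on your PATH
    sinfo -V            # SLURM version (AI Cluster submit node)
    docker --version    # Docker version (AI-Orion and DGX A100)
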
Which machine to use and why?

 

Because the machines vary in the resources they offer, it is a good idea to choose the server that best suits your needs.

 

If your workload is CPU-intensive and mainly involves processing a dataset, then AI-Orion is the best choice. The machine has a powerful CPU and plenty of memory, which makes it ideal for libraries like pandas/NumPy.

 

For purely deep learning based applications and similar tasks, the AI Cluster machines are the recommended servers. Please remember that these servers cannot be accessed directly; jobs can only be submitted through the submit node, submit.ai.stonybrook.edu. A minimal example batch script is shown below. Both server types are equipped with NVLink connections between the GPUs, which effectively increases the GPU memory available to a single job, since a model can be split across linked GPUs. Please make sure to visit the AI Institute Usage page before starting to use the cluster.

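As a minimal sketch, a batch script for the cluster might look like the following; the resource numbers and file names are placeholders, and no partition is specified (add an "#SBATCH --partition=..." line if the admins define one):

    #!/bin/bash
    #SBATCH --job-name=my-experiment   # name shown in squeue
    #SBATCH --gres=gpu:1               # request one GPU
    #SBATCH --cpus-per-task=4          # CPU cores for data loading
    #SBATCH --mem=32G                  # host memory
    #SBATCH --time=24:00:00            # wall-clock limit
    #SBATCH --output=job_%j.log        # stdout/stderr (%j expands to the job id)

    # Run a script that exits when training finishes, freeing the GPU
    python train.py

Save it as, for example, job.sbatch and submit it from the submit node with "sbatch job.sbatch"; "squeue -u $USER" shows its progress.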

If you need to train a huge model from scratch (as opposed to fine-tuning a pretrained model), you should use the DGX A100. This server has 8 GPUs based on NVIDIA's Ampere architecture, and it is the best choice for tasks that are too slow or not possible on the other machines.

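If you want to verify the GPU interconnect on any of these machines before committing to a multi-GPU run, nvidia-smi can print the link topology:

    nvidia-smi topo -m   # GPU-to-GPU link matrix; NV# entries indicate NVLink/NVSwitch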

Key pointers on usage:

  • Experiments should primarily be run on the AI Cluster.
  • Running time: long-running jobs (over days or weeks) should be run on the cluster rather than on the stand-alone servers (AI-Orion and DGX).
  • For Docker: use the DGX.
  • AI-Institute Usage

Access to AI Cluster

 

Please follow these steps to get access to the machines:

  1. Please fill out the form to request access. In case of any queries, please email rt [at] cs.stonybrook.edu. Make sure to include your NetID in the request you send. (Note that your NetID is not the same as your CS ID; the NetID is what you use to log in to Blackboard and other campus SSO websites.)
  2. Log into submit.ai.stonybrook.edu (using your NetID/Password). The first login should take some time as your home directory is set up. After that you should have access to the machine. 
  3. Using SLURM commands is the best way to go about using the servers (a short sample session is sketched after this list). If you prefer Docker, we have it available and configured to use the GPUs. If your work does not pertain to deep learning (and the corresponding high-level Python libraries), you may need some custom software installed; please send us an email so that we can do the root-level installations.
  4. Usage of notebooks (Jupyter/Colab) on the server is not allowed. Use scripts that end after execution and free the GPU. If a GPU is not freed for multiple days, we will send a reminder; if we receive no response, the process will be killed.
  5. Use the storage only to retain datasets you are actively working on. Please use another resource for long-term storage (external drives, lab servers, etc.).
  6. If you have any questions, email shuagrawal [at] cs.stonybrook.edu or jimjoseph [at] cs.stonybrook.edu; we will be happy to help.

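For reference, a typical first session might look like the following; replace netid with your own NetID, and note that the interactive srun test assumes the site configuration permits it:

    ssh netid@submit.ai.stonybrook.edu   # step 2: log in to the submit node
    sinfo                                # confirm SLURM responds and nodes are up
    srun --gres=gpu:1 --pty nvidia-smi   # quick one-off command on a GPU node
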
Access to AI-Orion

  1. Please fill out the form to request access. In case of any queries, please email rt [at] cs.stonybrook.edu. Make sure to include your NetID in the request you send. (Note that your NetID is not the same as your CS ID; the NetID is what you use to log in to Blackboard and other campus SSO websites.) Send us an email to be given access to the Docker commands.
  2. Log into aiorion.ai.stonybrook.edu on port 130 (using your NetID/password). The first login may take some time while your home directory is set up; after that you should have access to the machine.
  3. The server aiorion.ai.stonybrook.edu has local storage (not connected to the shared network storage), so the maximum available space per user is 1.5 TB. If you need more space, please contact the admins.
  4. Using conda environments is the best way to go about using the server (see the sketch after this list). If you prefer Docker, we have it available and configured to use the GPUs. If your work does not pertain to deep learning (and the corresponding high-level Python libraries), you may need some custom software installed; please send us an email so that we can do the root-level installations.
  5. Usage of notebooks (Jupyter/Colab) on the server is not allowed. Use scripts that end after execution and free the GPU. If a GPU is not freed for multiple days, we will send a reminder; if we receive no response, the process will be killed.
  6. Use the storage only to retain datasets you are actively working on. Please use another resource for long-term storage (external drives, lab servers, etc.).
  7. If you have any questions, email shuagrawal [at] cs.stonybrook.edu or jimjoseph [at] cs.stonybrook.edu; we will be happy to help.

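A sketch of a first session on AI-Orion, assuming conda is already installed in your home directory (the environment name and script below are placeholders):

    ssh -p 130 netid@aiorion.ai.stonybrook.edu   # note the non-default port from step 2

    # Create and use an isolated environment for a CPU-heavy data-processing job
    conda create -n dataproc python=3.9 pandas numpy
    conda activate dataproc
    python process_dataset.py   # a script that exits when done, freeing resources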

Access to DGX A100

 

Please follow these steps to get access to the machines:

 

  1. Please fill out the form to request access. In case of any queries, please email rt [at] cs.stonybrook.edu. Make sure to include your NetID in the request you send. (Note that your NetID is not the same as your CS ID; the NetID is what you use to log in to Blackboard and other campus SSO websites.)
  2. Log into 130.245.162.235 on port 130 (using your NetID/password). The first login may take some time while your home directory is set up; after that you should have access to the machine. Send us an email to be given access to the Docker commands.
  3. The DGX server has access to the same storage as the AI Cluster servers, so you should see exactly the same files from both. However, the software configuration on the DGX machine differs from the other servers.
  4. NVIDIA recommends only using Docker containers on the DGX machines. Building your own GPU environments is not recommended, as the stable releases of popular libraries may not work with these GPUs. Instead, build on top of the containers NVIDIA provides free of charge (an example is sketched after this list). Note that most popular libraries (PyTorch, TensorFlow, etc.) are available as containers, so you do not need to compile binaries in order to use the deep learning frameworks.
  5. Usage of notebooks (Jupyter/Colab) on the server is not allowed. Use scripts that end after execution and free the GPU. If a GPU is not freed for multiple days, we will send a reminder; if we receive no response, the process will be killed.
  6. Use the storage only to retain datasets you are actively working on. Please use another resource for long-term storage (external drives, lab servers, etc.).
  7. If you have any questions, email shuagrawal [at] cs.stonybrook.edu or jimjoseph [at] cs.stonybrook.edu; we will be happy to help.

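As an illustration, a typical container-based run might look like this; the container tag is a placeholder (pick a current one from NVIDIA's NGC catalog), and Docker access must first have been granted as described in step 2:

    # Pull an official PyTorch container from NVIDIA's NGC registry
    docker pull nvcr.io/nvidia/pytorch:21.07-py3

    # Run it with all GPUs visible and your home directory mounted inside
    docker run --gpus all -it --rm \
        -v $HOME:/workspace/home \
        nvcr.io/nvidia/pytorch:21.07-py3 \
        python /workspace/home/train.py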

Contact:

 

For any system-related questions:

 

Shubham Agrawal: shuagrawal [at] cs.stonybrook.edu

Jimmy Joseph: jimjoseph [at] cs.stonybrook.edu

 

For any access- or network-related questions:

rt [at] cs.stonybrook.edu