Computing Resources
The AI Institute offers three computational resources: the AI Cluster, AI orion, and the DGX.
[Figure: AI Institute Resource Architecture]
Onboarding Instructions
1. Please fill out this form to request access. For any queries regarding onboarding, reach out to rt@cs.stonybrook.edu. Make sure to include your Net ID in the request (note that your Net ID is not the same as your CS ID; the Net ID is what you use to log into Brightspace and other campus SSO websites).
2. Using your Net ID and password, log into the following nodes as needed. Note that the first login may take some time, since your home directory is being set up. Example login commands are shown after the list below.
- AI cluster: submit.ai.stonybrook.edu on port 22. All jobs on the AI Cluster are scheduled through SLURM; we have also compiled a guide on using the cluster here.
- DGX: 130.245.162.235 on port 130. Nvidia recommends running workloads only inside Docker containers on the DGX machines. Building your own GPU environments is discouraged, since stable releases of popular libraries may not work with these GPUs; instead, build on top of the ample containers Nvidia provides for free. Note that most popular libraries (PyTorch, TensorFlow, etc.) are already available as containers, so you do not need to compile binaries to use these frameworks.
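For example, assuming a standard OpenSSH client, logging into each node looks like the following (replace <NetID> with your own Net ID; the -p flag selects the port):

    ssh -p 22 <NetID>@submit.ai.stonybrook.edu
    ssh -p 130 <NetID>@130.245.162.235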
3. The use of notebooks (Jupyter/Colab) on the servers is not allowed. Use scripts that terminate after execution and free their resources; one way to adapt notebook code is sketched below. If a GPU is not freed up after multiple days, we will send a reminder. If we receive no response, the process will be killed.
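If your code currently lives in a notebook, a simple way to comply (a sketch; train.ipynb is a hypothetical file name) is to export it to a plain script and run that instead:

    jupyter nbconvert --to script train.ipynb   # writes train.py
    python train.py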
4. Use the storage only to retain datasets that you are actively working on. Please use other resources for long-term storage (external drives, lab servers, etc.).
5. Please make sure to read the AI Institute Usage Policy before starting to use the resources.
Which machine to use and why?
Because the machines vary in the resources they offer, choose the server that best suits your workload.
If your workload is CPU-intensive and mainly involves processing a dataset, AI orion is the best choice. It has powerful CPUs and a large amount of memory (1.5TB), which makes it ideal for libraries like pandas and numpy.
For purely deep learning applications and other GPU-heavy tasks, the AI Cluster machines are the recommended servers. Remember that these servers cannot be accessed directly; jobs can only be submitted through the submit node. Please look at this document on how to submit jobs on the AI Cluster (example here; a minimal sketch also follows below). The tesla and quadro servers are equipped with NVLink connections between the GPUs, effectively increasing the available GPU memory. Please make sure to visit the AI Institute Usage Policy page before using the cluster.
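A minimal sketch of a SLURM batch script, assuming hypothetical job and script names (consult the cluster guide for the partitions and resource limits that actually apply):

    #!/bin/bash
    #SBATCH --job-name=example       # name shown in the queue
    #SBATCH --gres=gpu:1             # request one GPU
    #SBATCH --cpus-per-task=4        # CPU cores for the job
    #SBATCH --mem=32G                # host memory for the job
    #SBATCH --time=24:00:00          # wall-clock limit
    python train.py

Save it as job.sbatch, submit it from the submit node with sbatch job.sbatch, and monitor it with squeue -u $USER.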
If you need to train a large model from scratch (as opposed to fine-tuning a pretrained model), use the DGX-A100. This server has 8 GPUs based on the Ampere architecture, and it is the best choice for tasks that are too slow or infeasible on the other machines.
Key pointers on usage:
- Experiments should primarily be run on the AI cluster
- Running time: longer-running jobs (over days or weeks) should be run on the cluster rather than on the standalone servers (AI orion and DGX)
- For Docker workloads: use the DGX (see the container sketch after this list)
- AI-Institute Usage Policy Document
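As a sketch of the container workflow on the DGX (the image tag below is an assumption; browse Nvidia's NGC catalog for current PyTorch/TensorFlow tags):

    docker pull nvcr.io/nvidia/pytorch:24.01-py3    # fetch an NGC PyTorch image
    docker run --gpus all -it --rm \
        -v $HOME:/workspace/home \
        nvcr.io/nvidia/pytorch:24.01-py3            # interactive shell with GPU access and home dir mounted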
Hardware specifications
AI Cluster
The AI Cluster is a heterogeneous GPU cluster consisting of 6 servers (detailed below). Jobs are submitted via the submit node through SLURM.
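Once logged into the submit node, the standard SLURM commands show the state of the cluster, for example:

    sinfo               # list partitions and node availability
    squeue -u $USER     # list your queued and running jobs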
tesla1
- 8x Tesla V100-SXM2 GPUs (each with 32GB RAM & NVLink)
- Dual Intel Xeon Silver 4216 CPU (each with 2.1GHz, 16 cores, 22MB Cache)
- 384GB DDR4 2666MHz ECC REG Memory
- 100Gb/s Ethernet network
tesla2
- 8x Tesla V100-SXM2 GPUs (each with 32GB RAM & NVLink)
- Dual Intel Xeon Silver 4216 CPU (each with 2.1GHz, 16 cores, 22MB Cache)
- 384GB DDR4 2666MHz ECC REG Memory
- 100Gb/s Ethernet network
quadro1
- 8x Quadro RTX 8000 (each with 48GB GDDR6 RAM & NVLink)
- Dual Intel Xeon Silver 4216 CPU (each with 2.1GHz, 16 cores, 22MB Cache)
- 384GB DDR4 2666MHz ECC REG Memory
- 100Gb/s Ethernet network
quadro2
- 8x Quadro RTX 8000 (each with 48GB GDDR6 RAM & NVLink)
- Dual Intel Xeon Silver 4216 CPU (each with 2.1GHz, 16 cores, 22MB Cache)
- 384GB DDR4 2666MHz ECC REG Memory
- 100Gb/s Ethernet network
h100
- 4x H100 GPUs (each with 80GB HBM2e RAM)
- AMD EPYC 9124 CPU (3.0GHz, 16 cores, 1MB Cache)
- 387GB DDR5 2.4GHz Memory
- 100Gb/s Ethernet network
AI orion
- 4x Tesla V100-SXM2 with 128GB total GPU memory
- Dual Intel Xeon Gold 6140 CPUs (2.30GHz, 36 cores total, 25MB Cache each)
- 1.5TB DDR4 RAM
- 100Gb/s Ethernet network
DGX
The Nvidia DGX A100 is a standalone server.
- 8x A100 GPUs with 320GB total GPU memory
- 6x NVIDIA NVSwitches providing 4.8TB/s bi-directional bandwidth
- Dual AMD Rome 7742 CPUs (each with 2.25GHz, 64 cores)
- 1TB DDR4 RAM
- 100Gb/s Ethernet network
For more information on the DGX A100 server, visit the official page.
Storage
All the servers use a high-speed 100 Gigabit Ethernet network and share a 75 TB all-flash storage array. Storage quota for users is capped at 500 GB but can be temporarily increased to accommodate special requests.
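To keep an eye on your usage, a quick check from any node (a sketch; the exact quota tooling depends on how the storage array is mounted) is:

    du -sh $HOME        # total size of your home directory
    df -h $HOME         # free space on the underlying filesystem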
Contacts
For system-related questions: aiadmins@cs.stonybrook.edu
For access and network-related questions: rt@cs.stonybrook.edu