AI Institute

Welcome to the AI Institute landing page. Here you will find information about the available computing resources and how to start using them.

Hardware

 

The AI Institute has three computational resources: the AI Cluster, AI-Orion, and the DGX A100. The AI Cluster and the DGX A100 share a 100Gb Ethernet network and a 50TB all-flash storage array, so all user data is accessible from both. AI-Orion has separate 18TB storage.

 

The AI Cluster is a heterogeneous GPU cluster of 4 servers (detailed below). Jobs are submitted via the submit node (submit.ai.stonybrook.edu) through SLURM; a few commands for inspecting the nodes are shown after the server list.

 

  • Ai02
    • 8x Tesla V100-SXM2 GPUs (each with 32GB RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • Ai03
    • 8x Tesla V100-SXM2 GPUs (each with 32GB RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • Ai04
    • 8x Quadro RTX 8000 GPUs (each with 48GB GDDR6 RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • Ai05
    • 8x Quadro RTX 8000 GPUs (each with 48GB GDDR6 RAM & NVLink)
    • Dual Intel Xeon Silver 4216 CPUs (each with 2.1GHz, 16 cores, 22MB cache)
    • 384GB DDR4 2666MHz ECC REG memory
    • 100Gb/s Ethernet network
  • DGX A100 is a stand-alone server (details below)
    • 8x A100 GPUs with 320GB total GPU memory
    • 6x NVIDIA NVSwitches providing 4.8TB/s bi-directional bandwidth
    • Dual AMD Rome 7742 CPUs (each with 2.25GHz, 64 cores)
    • 1TB DDR4 RAM
    • For more information, see NVIDIA's DGX A100 documentation.

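Once you have an account (see "Access to AI Cluster" below), the state of these nodes can be checked from the submit node with standard SLURM commands. A quick sketch (the lowercase node name is an assumption; sinfo prints the real ones):

    sinfo -N -l               # list all nodes with their state and resources
    squeue -u $USER           # show your own queued and running jobs
    scontrol show node ai02   # detailed view of one node, including its GPUs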

Software:

AI Cluster:

  • Linux OS: "Ubuntu 18.04.5 LTS"
  • CUDA Version: 11.2
  • Slurm Workload Manager: 20.02.6

AI-Orion (stand-alone server):

  • Linux OS: "Ubuntu 18.04.3 LTS"
  • CUDA Version: 10.1
  • Docker version 18.09.6

DGX A100 (stand-alone server):

  • Linux OS: "Ubuntu 18.04.5 LTS"
  • CUDA Version: 11.0
  • Docker version 19.03.8

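To confirm what a given machine is actually running, the standard version commands should work after you log in; note that Docker is only present on AI-Orion and the DGX, and sinfo only on the AI Cluster:

    lsb_release -d      # Linux distribution and release
    nvidia-smi          # GPU driver and the CUDA version it supports
    nvcc --version      # installed CUDA toolkit, if it is on your PATH
    sinfo -V            # SLURM version (AI Cluster submit node)
    docker --version    # Docker version (AI-Orion and DGX A100)
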
Which machine to use and why?

 

Because the machines vary in the resources they offer, it is a good idea to choose the server that best suits your needs.

 

If your workload is CPU-intensive and mainly involves processing a dataset, then AI-Orion is the best choice. The machine has a powerful CPU and plenty of memory, which makes it ideal for libraries like pandas/NumPy.

 

For purely deep learning based applications and similar tasks, the AI Cluster machines are the recommended servers. Please remember that these servers cannot be accessed directly; jobs can only be submitted through the submit node, submit.ai.stonybrook.edu. A minimal example batch script is shown below. Both server types are equipped with NVLink connections between the GPUs, which effectively increases the GPU memory available to a single job, since a model can be split across linked GPUs. Please make sure to visit the AI Institute Usage page before starting to use the cluster.

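As a minimal sketch, a batch script for the cluster might look like the following; the resource numbers and file names are placeholders, and no partition is specified (add an "#SBATCH --partition=..." line if the admins define one):

    #!/bin/bash
    #SBATCH --job-name=my-experiment   # name shown in squeue
    #SBATCH --gres=gpu:1               # request one GPU
    #SBATCH --cpus-per-task=4          # CPU cores for data loading
    #SBATCH --mem=32G                  # host memory
    #SBATCH --time=24:00:00            # wall-clock limit
    #SBATCH --output=job_%j.log        # stdout/stderr (%j expands to the job id)

    # Run a script that exits when training finishes, freeing the GPU
    python train.py

Save it as, for example, job.sbatch and submit it from the submit node with "sbatch job.sbatch"; "squeue -u $USER" shows its progress.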

If you need to train a huge model from scratch (as opposed to fine-tuning a pretrained model), you should use the DGX A100. This server has 8 GPUs based on NVIDIA's Ampere architecture, and it is the best choice for tasks that are too slow or not possible on the other machines.

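If you want to verify the GPU interconnect on any of these machines before committing to a multi-GPU run, nvidia-smi can print the link topology:

    nvidia-smi topo -m   # GPU-to-GPU link matrix; NV# entries indicate NVLink/NVSwitch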

Key pointers on usage:

  • Experiments should primarily be run on the AI Cluster.
  • Running time: long-running jobs (over days or weeks) should be run on the cluster rather than on the stand-alone servers (AI-Orion and DGX).
  • For Docker: use the DGX.
  • AI-Institute Usage

Access to AI Cluster

 

Please follow these steps to get access to the machines:

  1. Please fill out the form to request access. In case of any queries, please email rt [at] cs.stonybrook.edu. Make sure to include your NetID in the request you send. (Note that your NetID is not the same as your CS ID; the NetID is what you use to log in to Blackboard and other campus SSO websites.)
  2. Log into submit.ai.stonybrook.edu (using your NetID/Password). The first login should take some time as your home directory is set up. After that you should have access to the machine. 
  3. Using SLURM commands is the best way to go about using the servers (a short sample session is sketched after this list). If you prefer Docker, we have it available and configured to use the GPUs. If your work does not pertain to deep learning (and the corresponding high-level Python libraries), you may need some custom software installed; please send us an email so that we can do the root-level installations.
  4. Usage of notebooks (Jupyter/Colab) on the server is not allowed. Use scripts that end after execution and free the GPU. If a GPU is not freed for multiple days, we will send a reminder; if we receive no response, the process will be killed.
  5. Use the storage only to retain datasets you are actively working on. Please use another resource for long-term storage (external drives, lab servers, etc.).
  6. If you have any questions, email shuagrawal [at] cs.stonybrook.edu or jimjoseph [at] cs.stonybrook.edu; we will be happy to help.

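For reference, a typical first session might look like the following; replace netid with your own NetID, and note that the interactive srun test assumes the site configuration permits it:

    ssh netid@submit.ai.stonybrook.edu   # step 2: log in to the submit node
    sinfo                                # confirm SLURM responds and nodes are up
    srun --gres=gpu:1 --pty nvidia-smi   # quick one-off command on a GPU node
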
Access to AI-Orion

  1. Please fill out the form to request access. In case of any queries, please email rt [at] cs.stonybrook.edu. Make sure to include your NetID in the request you send. (Note that your NetID is not the same as your CS ID; the NetID is what you use to log in to Blackboard and other campus SSO websites.) Send us an email to be given access to the Docker commands.
  2. Log into aiorion.ai.stonybrook.edu on port 130 (using your NetID/password). The first login may take some time while your home directory is set up; after that you should have access to the machine.
  3. The server aiorion.ai.stonybrook.edu has local storage (not connected to the shared network storage), so the maximum available space per user is 1.5 TB. If you need more space, please contact the admins.
  4. Using conda environments is the best way to go about using the server (see the sketch after this list). If you prefer Docker, we have it available and configured to use the GPUs. If your work does not pertain to deep learning (and the corresponding high-level Python libraries), you may need some custom software installed; please send us an email so that we can do the root-level installations.
  5. Usage of notebooks (Jupyter/Colab) on the server is not allowed. Use scripts that end after execution and free the GPU. If a GPU is not freed for multiple days, we will send a reminder; if we receive no response, the process will be killed.
  6. Use the storage only to retain datasets you are actively working on. Please use another resource for long-term storage (external drives, lab servers, etc.).
  7. If you have any questions, email shuagrawal [at] cs.stonybrook.edu or jimjoseph [at] cs.stonybrook.edu; we will be happy to help.

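A sketch of a first session on AI-Orion, assuming conda is already installed in your home directory (the environment name and script below are placeholders):

    ssh -p 130 netid@aiorion.ai.stonybrook.edu   # note the non-default port from step 2

    # Create and use an isolated environment for a CPU-heavy data-processing job
    conda create -n dataproc python=3.9 pandas numpy
    conda activate dataproc
    python process_dataset.py   # a script that exits when done, freeing resources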

Access to DGX A100

 

Please follow these steps to get access to the machines:

 

  1. Please fill out the form to request access. In case of any queries, please email rt [at] cs.stonybrook.edu. Make sure to include your NetID in the request you send. (Note that your NetID is not the same as your CS ID; the NetID is what you use to log in to Blackboard and other campus SSO websites.)
  2. Log into 130.245.162.235 on port 130 (using your NetID/password). The first login may take some time while your home directory is set up; after that you should have access to the machine. Send us an email to be given access to the Docker commands.
  3. The DGX server has access to the same storage as the AI Cluster servers, so you should see exactly the same files from both. However, the software configuration on the DGX machine differs from the other servers.
  4. NVIDIA recommends only using Docker containers on the DGX machines. Building your own GPU environments is not recommended, as the stable releases of popular libraries may not work with these GPUs. Instead, build on top of the containers NVIDIA provides free of charge (an example is sketched after this list). Note that most popular libraries (PyTorch, TensorFlow, etc.) are available as containers, so you do not need to compile binaries in order to use the deep learning frameworks.
  5. Usage of notebooks (Jupyter/Colab) on the server is not allowed. Use scripts that end after execution and free the GPU. If a GPU is not freed for multiple days, we will send a reminder; if we receive no response, the process will be killed.
  6. Use the storage only to retain datasets you are actively working on. Please use another resource for long-term storage (external drives, lab servers, etc.).
  7. If you have any questions, email shuagrawal [at] cs.stonybrook.edu or jimjoseph [at] cs.stonybrook.edu; we will be happy to help.

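As an illustration, a typical container-based run might look like this; the container tag is a placeholder (pick a current one from NVIDIA's NGC catalog), and Docker access must first have been granted as described in step 2:

    # Pull an official PyTorch container from NVIDIA's NGC registry
    docker pull nvcr.io/nvidia/pytorch:21.07-py3

    # Run it with all GPUs visible and your home directory mounted inside
    docker run --gpus all -it --rm \
        -v $HOME:/workspace/home \
        nvcr.io/nvidia/pytorch:21.07-py3 \
        python /workspace/home/train.py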

Contact:

 

For any system-related questions:

 

Shubham Agrawal: shuagrawal [at] cs.stonybrook.edu

Jimmy Joseph: jimjoseph [at] cs.stonybrook.edu

 

For any access- or network-related questions:

rt [at] cs.stonybrook.edu