UniBo NLP Logo

Cluster Usage Guide

Official Website

Visit our NLP research site at University of Bologna.

LinkedIn

Connect for updates on research and opportunities.

Hugging Face

Explore our models and contribute to our projects.

1. SLURM 🚥

SLURM is a cluster manager that allows us to dynamically schedule and allocate jobs on the GPUs of our servers.
Our computing resources are limited, while the users (e.g., undergraduates, PhD students, PostDocs) and their projects are many. Coordinating verbally (e.g., "I need GPU 0 on the X server for two hours, please don't use it") or via shared files would be a nightmare. SLURM effectively manages the competition for GPUs within our cluster, ensuring equitable access for all users and projects while maximizing resource utilization and system efficiency.


2. Servers 🗂

Our SLURM cluster includes the following servers with GPUs:

Server Name   IP Address        SSH Port   GPUs
Faretra       137.204.107.40    37335      💻 4 × NVIDIA GeForce RTX 3090 (24GB)
Deeplearn2    137.204.107.153   37335      💻 1 × NVIDIA GeForce RTX 3090 (24GB), 💻 1 × NVIDIA Titan XP (12GB)
Moro232       137.204.107.232   37335      💻 1 × NVIDIA GeForce RTX 3090 (24GB)



2.1. Notation

We usually refer to nodes by the last group of digits of their IP address, the only part in which they differ (e.g., 40, 153, 232).

2.2. Master Node

faretra (or 40) is the master node: the one on which you should store all your code and data, and from which you should run your commands. The other nodes are used only for computation and storage. For instance, if you run a training script that gets executed on node 153, you should access that node only to check the saved output files (e.g., model checkpoints, metric results) and possibly move them back to the 40 server.
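
For reference, a hedged sketch of how you might pull results back to the master node with rsync (username and paths are illustrative):

    # run from the master node (40): copy outputs produced by a job on node 153
    rsync -avz -e 'ssh -p 37335' molfetta@137.204.107.153:~/project1/outputs/ ~/project1/outputs/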

2.3. Accessing

ssh username@ip -p port

To avoid typing your password on every connection, install your SSH key on each server with ssh-copy-id; to avoid repeating the IP and port every time, add the hosts to your ~/.ssh/config (see Section 8).
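
A minimal example (username and node are illustrative):

    # copy your public key to the master node, then log in without a password
    ssh-copy-id -p 37335 molfetta@137.204.107.40
    ssh -p 37335 molfetta@137.204.107.40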


3. Working Mode 🚧


4. Preliminary Steps 🛫

On each server...

  1. Before starting, refresh your Linux knowledge with a cheatsheet for engineers.
  2. Make sure you have an account by logging in with the credentials you have been provided.
  3. Use passwd to replace your default password with a private one of at least 10 characters; mix lower-case and upper-case letters, numbers, and symbols. Avoid reusing personal passwords from other services.
  4. Install docker rootless by running the script install_rootless_docker.sh.
    • After the installation is complete, restart the shell and check that docker is working by running docker ps (check that no errors are returned).
    • If docker is not working, try executing systemctl --user start docker.
    • If it still doesn't work, try running docker_rootless_fix.sh, which will delete your current installation (along with all your images and containers!!) and reinstall docker rootless.
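
A quick sanity-check sequence you can run after the installation (the hello-world test is optional):

    docker ps                        # should print an empty table, not an error
    systemctl --user start docker    # only needed if the daemon is not running
    docker run --rm hello-world      # optional end-to-end test of rootless Docker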

On the master node (40)...

  1. Execute sinfo --Format=NodeAddr,CPUs:10,Gres:80 to make sure SLURM is working properly.
    The output should be:

    
                        NODE_ADDR           CPUS      GRES                                                                            
                        137.204.107.153     16        gpu:titan_xp:1(S:0),gpu:nvidia_geforce_rtx_3090:1(S:0)                          
                        137.204.107.40      48        gpu:nvidia_geforce_rtx_3090:4(S:0)                                              
                        137.204.107.49      32        (null)                                                                          
                        137.204.107.157     48        (null)                                                                          
                        137.204.107.232     4         gpu:nvidia_geforce_rtx_3090:1(S:0)    
                    
  2. Upon logging in, you'll land in your home directory, which is initially empty. To start your project, upload all required files, either by pulling from a GitHub repository or by using a file transfer tool like WinSCP (e.g., local-remote drag and drop).
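
Two illustrative ways to get your files onto the master node (repository URL and paths are placeholders):

    # on the master node: pull from GitHub
    git clone https://github.com/your-repo/project1.git ~/project1

    # or, from your local machine: copy a local folder over SSH
    scp -P 37335 -r ./project1 username@137.204.107.40:~/project1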

5. Executing Tasks with SLURM 🚥

In SLURM, a shared virtual queue serves as a centralized point to which jobs of any user can be appended. The SLURM scheduler efficiently manages and allocates resources for execution based on specified in-queue job requirements and system availability. Once suitable GPUs are identified, the scheduler allocates them to a job based on its queue position. Importantly, the SLURM queue operates with a dynamic priority assignment rather than adhering to a strict First-In-First-Out strategy. Instead of solely relying on the order of job submission, SLURM calculates an integer priority for each task by considering a multitude of factors, including load balancing between users. For example, if a user before you queues 50 jobs, you will not be in 51st position.

💡 Tip: In the event that you find yourself at the bottom of the queue and require immediate execution of a job, such as due to an impending deadline, you can reach out to the SLURM administrator via Microsoft Teams (lorenzo.molfetta@unibo.it). Should the request be deemed reasonable, your priority value will be elevated accordingly.


This automatic GPU assignment ensures fair allocation and efficient resource utilization. Ideally, the GPUs should operate continuously, 24/7. When a job is assigned one or more GPUs, the index or indices of these GPUs are stored in the environment variable $CUDA_VISIBLE_DEVICES, which is specific to the node the GPU(s) belong to. Note that each job created within SLURM is assigned a unique identifier.
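
A minimal sketch of a job script that inspects these values (the script name is illustrative; the commands are standard SLURM/NVIDIA tools):

    #!/bin/bash
    # inspect_gpus.sh: print the job ID and the GPUs SLURM assigned to this job
    echo "Job ID:        $SLURM_JOB_ID"
    echo "Node:          $(hostname)"
    echo "GPUs assigned: $CUDA_VISIBLE_DEVICES"
    nvidia-smi -L    # list the GPUs visible to this job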

5.1. Asynchronous Job Scheduling (SBATCH)

The command sbatch is used to schedule the execution of a script file (e.g., the main one containing your training loop). The job is handled in the background by SLURM and is no longer linked to the shell you used to submit it. This means that, after submission, you can log out and close the terminal without consequences: when your turn comes, your job will be executed, and the GPUs will be freed upon its completion (i.e., non-blocking behavior). In fact, once an sbatch script completes its execution, SLURM automatically releases the allocated resources, including GPU locks, and moves on to the next task in the queue. This allows the cluster to minimize GPU wastage and maximize overall throughput (i.e., tasks completed within a given time frame). Within our SLURM web application, users operating in sbatch mode can be identified by the inclusion of the script name (e.g., "run_docker.sh") alongside the job.
By default, standard output and standard error are redirected to a file named "slurm-%j.out", where "%j" is replaced with the job ID. If your job ends with an error, this file will help you identify and troubleshoot any issues that may have occurred.

📝 Note: The "slurm-%j.out" file will be generated on the node where the job was allocated (NOT the one you ran the scheudling command from).

📝 Note: You should use on-cloud WandB logging for tracking your runs (not print commands).

Utilization:
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 train.sh
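
For reference, a hedged sketch of what train.sh might look like (its contents are illustrative); the #SBATCH directives duplicate the command-line options above and are optional in that case:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --gpus=nvidia_geforce_rtx_3090:1

    echo "Running on $(hostname) with GPUs: $CUDA_VISIBLE_DEVICES"
    bash run_docker.sh    # e.g., a wrapper that starts your Docker container (see Section 7)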


📝 Note: The sbatch command is not suitable for debugging. If you need to debug your script, run it interactively; waiting in the queue for the job to be executed is not an effective way to iterate on your code and can lead to long wait times, especially if the queue is busy and resources are not immediately available.

We recommend using Colab for this purpose. Test your code in Colab with smaller models, datasets, and batch sizes. Once you are sure that your code works, you can move it to the cluster with enhanced configurations.


5.2. Job Management and Monitoring

📝 Note: If your script is already running, the scancel command may not directly stop your process—it might only remove the job from the SLURM queue. Depending on your script type, your script could continue running, leaving the GPU occupied without SLURM's awareness 😱.

Therefore, after canceling a running script with scancel, always verify GPU usage with nvidia-smi. Identify the owner of any lingering processes using ps -aux | grep <PID>. If you detect unwanted processes, terminate them manually with kill -9 <PID>.

Conversely, if you manually kill your processes without removing the job from the SLURM queue using scancel, your job may remain in the queue, preserving its priority and potentially blocking GPUs unnecessarily.
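
A hedged example of the full check-and-clean sequence (job and process IDs are illustrative):

    squeue -u $USER          # list your jobs still queued or running
    scancel 12345            # remove job 12345 from the SLURM queue
    nvidia-smi               # verify no lingering process is still holding a GPU
    ps -aux | grep 67890     # identify the owner of a leftover process
    kill -9 67890            # terminate it manually (only if it is yours)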

5.3. General Recommendations

It is advisable not to occupy a GPU for more than 3 days continuously. This practice helps maintain fast access to GPU resources and facilitates their recirculation, benefiting all users. Any deviation from this guideline should be approved by your supervisor. If you encounter the need to execute a longer task, you should divide it into multiple jobs. For instance, you can opt for incremental training stages that resume from the last saved checkpoint.
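
One hedged way to split a long run into resumable stages is to chain jobs with sbatch dependencies (train_stage.sh is an illustrative script assumed to resume from the last saved checkpoint):

    # submit the first stage and capture its job ID
    jid1=$(sbatch --parsable -N 1 --gpus=nvidia_geforce_rtx_3090:1 train_stage.sh)
    # the second stage starts only after the first completes successfully
    sbatch --dependency=afterok:$jid1 -N 1 --gpus=nvidia_geforce_rtx_3090:1 train_stage.sh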

WARNING: SLURM supports another allocation command (the one who shall not be named). Commands other than sbatch are not allowed on our cluster. Processes not using this command will be killed without notice. Please always remember to use sbatch for all your job submissions.



6. Docker 🚀

There are many users (just run "ls .." from your home directory to see how many active-user homes exist). Each user can have several projects (e.g., the proposed method and some baselines). Each project comes with its distinct set of requirements (Python libraries and their specific versions). Directly installing or updating libraries on the physical machine would be impractical. Hence, we heavily rely on Docker, where each user executes a specific project within a sandbox: a virtual environment equipped with all the necessary files and dependencies.

In Docker, there are three fundamental components: Dockerfile, Image, and Container. These components are interdependent, meaning they build upon each other in an incremental manner. See Docker in a nutshell.
💡 Tip: OBVIOUSLY you can use any name for these files.
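
As a reference point, a minimal Dockerfile sketch (the base image tag and file names are illustrative, assuming a Python-based project; adapt them to your own dependencies):

    # build/Dockerfile: minimal example for a Python project
    FROM pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime
    WORKDIR /workdir
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt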

🚸 CAREFUL: Docker containers and images are often the cause of disk-space saturation. Please ensure that you delete any unused containers and images. You can use the docker system prune command to remove all stopped containers, dangling images, and unused networks (add --volumes to also remove unused volumes).
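
A few standard commands for checking and reclaiming disk space (all are stock Docker CLI commands):

    docker system df     # summary of disk space used by images, containers, and volumes
    docker ps -a         # all containers, including stopped ones
    docker images        # all images and their sizes
    docker system prune  # remove stopped containers, dangling images, and unused networks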


7. Independent File Systems and Code Distribution 🛜

We lack a distributed file system within the cluster. WHY is this a problem❓ As we said, SLURM dynamically decides where to allocate your job, independently of the location you ran the command from.

📝 Note: Slurm only determines which machine will run your command. It does not automatically transfer or synchronize your code files to that machine for execution. In this context, it's crucial to highlight that you cannot predict which server SLURM will allocate a GPU from.


Then, WHAT IF SLURM executes your task on a machine where your code doesn't exist ❓ The job will fail because the machine won't have access to the necessary files needed to run your program. In other words, if you create a file on a server, it won't automatically propagate to all other ones.

📝 Example: You're logged into server 40, where all your project files reside. You submit a job to the queue with sbatch, requesting execution of a training script on an NVIDIA RTX 3090. SLURM promptly allocates a 3090, but not on server 40; it's on server 153. SLURM searches for your specified file but doesn't find it. Consequently, the job terminates with an error.


Given the uncertainty of the node where your job will execute, it's imperative to ensure synchronization of the directory containing your project files. This way, regardless of the allocated node, the job remains executable. Storing your code on GitHub and simply pulling it onto the other servers is an ideal solution.

📝 Note: Not only the code: you should also recreate the Docker image with the same name on all the servers. This way, you can be sure that the environment is the same on all the machines and the job won't fail.


🔢 TL;DR: To sum up, here are the steps to follow:
  1. Create a GitHub Repository and Clone It

    Set up a GitHub repository and clone it to your home directory.

    git clone https://github.com/your-repo/project1.git /home/molfetta/project1
    cd /home/molfetta/project1
  2. Create a Build Directory and Build the Docker Image

    Inside your project folder, create a build directory and build the Docker image.

    mkdir /home/molfetta/project1/build

    Run the following command to build the Docker image:

    docker build -f build/Dockerfile -t project1_image_name .
  3. Create Shell Scripts for Running Docker Containers

    Write .sh scripts that use docker run, referencing the image name created in the previous step.

    Example script:

    docker run -v /home/molfetta/project1:/workdir \
               -v /llms:/llms project1_image_name \
               ... \
               /workspace/RELATIVE_PATH_TO_TRAIN.sh ...
  4. Ensure the Correct Image Name in run_docker.sh

    Before submitting a job, verify that the run_docker.sh script specifies the correct Docker image name.

  5. Submit the Job

    Use the following command to submit the job via sbatch:

    sbatch run_docker.sh
  6. You're Done! 🎉

    Your job is now scheduled and will be executed automatically. Just wait for the results!
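
Since the same code and image must exist on every node, a hedged helper loop like the following can keep the servers in sync (host aliases come from the ~/.ssh/config in Section 8; paths and image name are illustrative):

    # pull the latest code and rebuild the image on every GPU node
    for host in faretra deeplearn2 moro232; do
        ssh $host "cd ~/project1 && git pull && docker build -f build/Dockerfile -t project1_image_name ."
    done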



📝 Note: In very, very ... very rare cases, you may be working with resources that are too large to replicate on all servers (e.g., very large datasets). ONLY in that case, you can add an argument to the sbatch command to force the scheduler to run the job on the machine that holds your data. This argument is -w followed by the specific node name.
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w faretra train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w cloudifaicdlw001-System-Product-Name train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w deeplearn2 train.sh
USE IT SPARINGLY. Forcing the destination machine makes poor use of SLURM's scheduling and dynamic allocation capabilities and may lead to a waste of resources.



8. Visualizing and Interacting with your Files 👀

Accessing the server via ssh and manually copying files using scp can be tedious and time-consuming. Every time you need to edit or transfer a file, you must run multiple commands, making development inefficient.

Instead, we recommend using Visual Studio Code's Remote - SSH extension. This extension allows you to connect to a remote server and interact with files as if they were on your local machine, directly from your editor.

🛠 Installation Guide: VS Code Remote - SSH

Follow these steps to set up and use the Remote - SSH extension in VS Code:

  1. Install Visual Studio Code

    If you haven't installed VS Code yet, download it from the official website:

    🔗 VS Code Download

  2. Install the Remote - SSH Extension

    Open VS Code and install the extension:

    • Click on the Extensions icon (Ctrl+Shift+X).
    • Search for "Remote - SSH".
    • Click "Install".

    Alternatively, install it directly from the marketplace:

    🔗 Remote - SSH Extension

  3. Configure SSH in VS Code

    To enable seamless SSH connections, configure the SSH settings ON YOUR LOCAL MACHINE in ~/.ssh/config (create the file if it doesn't exist yet). Copy-paste the following text into that file (replacing the username with yours):

    Host faretra
        HostName 137.204.107.40
        Port 37335
        User molfetta
    
    Host moro232
        HostName 137.204.107.232
        Port 37335
        User molfetta
    
    Host deeplearn2
        HostName 137.204.107.153
        Port 37335
        User molfetta

    Once installed, close and re-open VS Code. Then, at the bottom-left of your VS Code window, a green icon similar to "><" should appear. Click on it and select the machine you want to connect to from the drop-down menu (those names are taken from the ".ssh/config" file).

  4. Start Coding on the Remote Server 🛡

    Once connected, you can:

    • Use the built-in file explorer to navigate remote files.
    • Open, edit, and save files directly on the server.
    • Run commands in the VS Code terminal without opening a separate SSH session.

    Now, you can work on your remote machine as if it were local!

✅ Done! Now you can interact with your files efficiently using VS Code instead of manually copying them with scp.

📝 Note: You can get more information about the Remote - SSH extension and its features on the official VS Code documentation page: 🔗 VS Code Remote - SSH Documentation

🚸 CAREFUL: Even though files are now directly accessible and editable on the servers, ALWAYS remember to push your changes to the repository so the other nodes stay in sync (see Section 7).


9. Before Graduation or Project Completion 🚩




