SLURM is a cluster manager that allows us to dynamically schedule and allocate jobs on the GPUs of our servers.
Our computing resources are limited, while the number of users (e.g., undergraduates, PhD Students, PostDocs) and their projects are many.
Coordinating vocally (e.g., "I need GPU 0 on the X server for two hours, please don't use it") or via shared files would be a nightmare.
SLURM effectively manages the competition for GPUs within our cluster, ensuring equitable access for all users and projects while maximizing resource utilization and system efficiency.
Our SLURM cluster includes 4 servers with GPUs:
| Server Name | IP Address | SSH Port | GPUs |
|---|---|---|---|
| Faretra | 137.204.107.40 | 37335 | 💻 4 × NVIDIA GeForce RTX 3090 (24GB) |
| Deeplearn2 | 137.204.107.153 | 37335 | 💻 1 × NVIDIA GeForce RTX 3090 (24GB), 💻 1 × NVIDIA Titan XP (12GB) |
| Moro232 | 137.204.107.232 | 37335 | 💻 1 × NVIDIA GeForce RTX 3090 (24GB) |
We usually refer to nodes by the last octet of their IP address, the only part in which they differ (e.g., 40, 153, 232).
faretra (or 40) is the master node, the one on which you should store all your code and data and from which you should execute commands. The other nodes are used only for computation and storage. For instance, if you run a training script that gets executed on the 153 node, you should access that node only to check the saved output files (e.g., model checkpoints, metric results) and, if needed, move them back to the 40 server.
ssh username@ip -p port
To avoid entering your password at every connection, install your SSH key on the server with ssh-copy-id.
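For example, assuming you do not already have a key pair on your local machine, the following commands generate one and install it on the 40 node (replace username with your own):

```bash
# Generate an SSH key pair locally (skip if you already have one in ~/.ssh)
ssh-keygen -t ed25519

# Install the public key on the master node
ssh-copy-id -p 37335 username@137.204.107.40

# From now on, this connection no longer asks for your password
ssh username@137.204.107.40 -p 37335
```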
Write code you are proud of.
Take your time to craft high-quality implementations. Quality rather than quantity.
Strive to create code that not only accomplishes the task at hand but also stands as a testament to your skills and professionalism.
Write scripts that are well-structured, efficient, and easy to understand.
Do not use cryptic variable names, prioritize modular components, write unit tests when possible.
Start with notable examples from authoritative repositories or notebooks, take the best from them, and raise the quality bar.
Follow PEP 8, the de facto style guide for Python code.
Go on the cluster only when you are ready!
You should always prototype on Google Colab and move your codebase to the 40 node only when you are sure that it works properly (e.g., the training loop with a small-version model on a dataset sample starts without errors).
Note that a free-tier account on Google Colab gives you access to a T4 GPU with 15GB of VRAM; a single GeForce RTX 3090 on our servers has 24GB of VRAM and is significantly faster, with no usage limits.
Keep your home clean and essential.
The disk space of our servers is not infinite.
Students have a quota of 150GB on each server (i.e., you can be the owner of files for a total of 150GB).
You should carefully organize your home in directories.
Please, make sure to delete all the unnecessary files, including docker images and containers (see Section 7 for details).
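A few standard commands help you spot what is eating your quota; a minimal sketch (the prune command only removes unused Docker data, but review its prompt before confirming):

```bash
# Size of each top-level directory in your home, sorted
du -sh ~/* | sort -h

# Disk space used by Docker images, containers, and volumes
docker system df

# Remove stopped containers and dangling images
docker system prune
```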
Remote development using SSH.
Editing complex files on the 40 server using nano or vim may not always be the most convenient option.
Popular IDEs such as Visual Studio Code and PyCharm offer the capability to open a remote folder on any remote machine, virtual machine, or container with a running SSH server.
In this way, you can enjoy a local-quality development experience—including completions, code navigation, and debugging—regardless of where your code is hosted.
Visual Studio Code Remote - SSH extension (discussed in more detail below).
Reduce development time with GitHub Copilot (X).
By leveraging Large Language Models trained on vast code repositories, GitHub Copilot accelerates development by offering context-aware code suggestions and completions.
Seamlessly integrated into popular code editors like Visual Studio Code, Copilot analyzes code snippets and suggests relevant solutions in real-time, empowering developers to write high-quality code with speed and precision.
Your @studio.unibo.it credentials enable free access to this AI companion and its X version. Use it.
Switch to byobu before every long-running experiment.
Before initiating a lengthy computational task, such as model training, inference, or data processing that spans several hours or even days, it is crucial to ensure uninterrupted execution without keeping your local terminal open and your computer on.
However, closing the terminal session might terminate the remote task.
Byobu is a terminal multiplexer that enables you to detach from your session while preserving the running processes on our cluster.
With byobu, you can safely disconnect from the remote machine without fear of interrupting the ongoing computation.
Later, you can reattach to the session at any time to check the progress (e.g., training logs) or manage the task.
This ensures seamless execution of long-running experiments while maintaining flexibility and convenience in your workflow.
Create a session (default name)
byobu
After entering this command, a new byobu session will be created with the default name. From this point, you can proceed with your work as usual within the byobu session. You can safely close your local terminal or even turn off your local machine without interrupting the tasks running on the remote server. When you are ready to resume your work, you can simply reconnect to the server via SSH and retype byobu to reattach to your existing byobu session. Note: with byobu, you can open additional terminal windows or split the current window into multiple panes using keybindings.
Create a named session
byobu new -s {session-name}
Named sessions provide a convenient way to organize and manage multiple byobu sessions, especially when working on different experiments. You can easily identify and switch between named sessions using their designated names. If you type just byobu and there are multiple sessions or if you want to choose a specific session, byobu will prompt you to select the session from a list.
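A few byobu commands you will use all the time (byobu is a thin wrapper around tmux, so the usual tmux sub-commands work); a quick sketch with a hypothetical session name:

```bash
# List existing sessions
byobu ls

# Reattach to a specific named session
byobu attach -t my-experiment

# Detach from the current session (it keeps running): press F6

# Kill a session you no longer need
byobu kill-session -t my-experiment
```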
Do not use GPUs with more VRAM than you need.
In an era dominated by LLMs, it is tempting to assume that 12GB of VRAM is inadequate. However, for numerous projects, a GPU like the Titan Xp can more than suffice. It is crucial to match your GPU resources to the requirements of your specific tasks, avoiding unnecessary overhead and ensuring efficient resource utilization.
Fun fact: in 2019, Julien Chaumond (CTO @ HuggingFace) deemed himself GPU-rich when he owned a server equipped with merely two Titan Xps, a setup akin to our 153 node [Source].
Monitor your runs.
Especially when running an experiment for the first time, it is important to keep an eye on the server's RAM utilization to prevent slowdowns or unexpected interruptions. Utilizing tools like htop allows you to monitor system resources in real-time and identify any anomalies or excessive resource usage.
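For example, you can keep two byobu panes open, one for CPU/RAM and one for the GPUs:

```bash
# Interactive view of CPU, RAM, and per-process resource usage
htop

# Refresh GPU utilization and memory every 2 seconds
watch -n 2 nvidia-smi
```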
Track all your runs with WandB.
Keeping track of experiments and their results can become challenging over time. WandB (Weights & Biases) provides a comprehensive solution to this problem. It's a cloud service that offers powerful tools for experiment tracking, visualization, and collaboration. By registering for free and creating a project on WandB, you can efficiently manage and share your experiments with your supervisor or team members.

WandB allows you to log various aspects of your runs, including hyperparameters, prompts, predictions, loss values, and metrics. What's more, the logging is done in real time, enabling you to monitor the progress of your experiments from anywhere, even from your smartwatch. One of the standout features of WandB is its automatic tracking of system resources, such as the operating system, GPU type and usage, network traffic, RAM consumption, and running times. This information is invaluable for understanding the performance of your experiments and diagnosing any issues that may arise.

WandB keeps you informed about the status of your runs by sending emails or Slack messages when a run completes or crashes. It also generates insightful interactive visualizations, which you can easily download as .csv coordinate files and import into LaTeX for inclusion in reports or papers. Moreover, WandB offers advanced filtering and navigation capabilities, allowing you to easily navigate through all your experiments and smartly filter them based on criteria such as hyperparameter values, model types, dataset names, run status, or the user who launched the experiment.

Instead of using "prints" or other local Python logging systems, try to keep everything organized in a WandB repository and log everything there.
In WandB, it is even possible to upload per-run artifacts such as model checkpoints and files in general.
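Getting started takes a couple of commands; a minimal sketch, assuming your training script already calls wandb.init() and that "project1" is just an example project name:

```bash
# Install the client (same version pinned in our requirements.txt) and authenticate
pip install wandb==0.19.7
wandb login        # paste the API key from your WandB account settings

# Optionally select the target project via environment variable, then run as usual
export WANDB_PROJECT=project1
python3 main.py
```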
Save your checkpoints.
Consider privately storing your best checkpoints in your Hugging Face account.
On each server...
On the master node (40)...
Execute sinfo --Format=NodeAddr,CPUs:10,Gres:80 to make sure SLURM is working properly.
The output should be:
```
NODE_ADDR        CPUS  GRES
137.204.107.153  16    gpu:titan_xp:1(S:0),gpu:nvidia_geforce_rtx_3090:1(S:0)
137.204.107.40   48    gpu:nvidia_geforce_rtx_3090:4(S:0)
137.204.107.49   32    (null)
137.204.107.157  48    (null)
137.204.107.232  4     gpu:nvidia_geforce_rtx_3090:1(S:0)
```
In SLURM, a shared virtual queue serves as a centralized point to which jobs of any user can be appended.
The SLURM scheduler efficiently manages and allocates resources for execution based on specified in-queue job requirements and system availability.
Once suitable GPUs are identified, the scheduler allocates them to a job based on its queue position.
Importantly, the SLURM queue operates with a dynamic priority assignment rather than adhering to a strict First-In-First-Out strategy.
Instead of solely relying on the order of job submission, SLURM calculates an integer priority for each task by considering a multitude of factors, including load balancing between users.
For example, if a user before you queues 50 jobs, you will not be in 51st position.
The IDs of the GPU(s) assigned to your job are exposed through the environment variable `$CUDA_VISIBLE_DEVICES`, which is specific to the node to which the GPU(s) belong(s).
Note that each job created within SLURM is assigned a unique identifier (job ID).
The command sbatch is used to schedule the execution of a script file (e.g., the main one containing your training loop).
The job is handled in the background by SLURM and is no longer linked to the shell you used to submit it.
This means that, after submission, you can log out and close the terminal without consequences: when your turn comes, your job will be executed and the GPUs freed upon its completion (i.e., non-blocking behavior).
In fact, once a sbatch script completes its execution, SLURM releases the allocated resources automatically, including GPU locks, and moves on to the next task in the queue.
This allows the cluster to minimize GPU wastage and maximize overall throughput (i.e., tasks completed within a given time frame).
Within our SLURM web application, users operating in sbatch mode can be identified by the inclusion of the script name (e.g., "run_docker.sh") alongside the job.
By default, standard output and standard error are redirected to a file named "slurm-%j.out", where "%j" is replaced with the job ID.
If your job ends with an error, this file will help you identify and troubleshoot any issues that may have occurred.
"slurm-%j.out"
file will be generated on the node where the job was allocated (NOT the one you ran the scheudling command from).
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 train.sh
- `-N 1` tells SLURM to run the task on one node (you will never need to edit it).
- `--gpus=nvidia_geforce_rtx_3090:1` specifies constraints to SLURM, such as the number of GPUs you need and their type. Every wish is an order. In this example, we assume that your job is particularly expensive and 12GB of VRAM are not enough; therefore, you ask the scheduler for one NVIDIA GeForce RTX 3090 GPU. Use `--gpus=1` to request any GPU.
- `train.sh` is the name of the script you want to execute (e.g., containing "python main.py"). Note that the file lookup happens when your job's turn comes up and resources are allocated for it. Therefore, the latest version of your script is used when your job is actually executed, even if you make modifications after the initial queue submission.
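If you prefer to collect the slurm-%j.out log files in a dedicated directory, sbatch lets you override the default paths; a small sketch (the logs/ directory is just an example and must already exist on the node that runs the job):

```bash
# %j expands to the job ID, as in the default file name
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 \
    --output=logs/%j.out --error=logs/%j.err train.sh
```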
Don't forget to ensure that the script is executable, especially if the file was transferred from GitHub or WinSCP. You can do this by running chmod +x train.sh.

Exercise caution when relying solely on our SLURM web application. It works by periodically queuing snapshot jobs with root permissions (maximum priority). As a result, while user assignments and runtimes remain current, metrics such as temperature, memory, and usage may lag behind, depending on the frequency of job recirculation. For precise resource usage statistics, access the server of your choice and execute the nvidia-smi command.

To view the comprehensive resource allocation status within SLURM, use the squeue command. It provides details (such as JobID, execution time, and NodeHost) for both running and pending jobs. The output should look like this:
```
JOBID     PARTITION  NAME           USER        ST  TIME        NODES  NODELIST(REASON)
21273412  main       fol_trainin    zangrillo   PD  0:00        1      (Resources)
21182964  main       graph_extr     zeng        PD  0:00        1      (ReqNodeNotAvail, UnavailableNodes:faretra)
21185262  main       progr_synth    delvecchio  PD  0:00        1      (ReqNodeNotAvail, UnavailableNodes:faretra)
21187148  main       reform_graph   freddi      PD  0:00        1      (ReqNodeNotAvail, UnavailableNodes:faretra)
21183062  main       prompt_learn   fantazzini  R   1-05:13:04  1      faretra
21182878  main       med_retr       frisoni     R   1:29:17     1      faretra
21055690  main       run_summ       ragazzi     R   1-16:35:12  1      faretra
21224223  main       llm_bench      cocchier    R   17:16:20    1      cloudifaicdlw001-System-Product-Name
20990504  main       diff_sampl     italiani    R   1-22:14:28  1      faretra
21267175  main       legal_llm_inf  moro        R   3:54:47     1      moro232
21261646  main       run_kg_inject  molfetta    R   3:55:43     1      deeplearn2
```

ST=status, PD=pending, R=running.
To cancel one of your own jobs, use the scancel command: scancel <job_id>.

After cancelling a job (or after a crash), run nvidia-smi to check that no processes are still occupying the GPUs. Identify the owner of any lingering process using ps -aux | grep <PID>. If you detect unwanted processes of yours, terminate them manually with kill -9 <PID>.

Keep in mind that if you only kill the underlying processes without also running scancel, your job may remain in the queue, preserving its priority and potentially blocking GPUs unnecessarily.
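Putting it together, a typical cleanup after a misbehaving run might look like this (the job ID and PID below are placeholders):

```bash
# Remove the job from the SLURM queue
scancel 21273412

# Check whether any process is still holding GPU memory
nvidia-smi

# Find out who owns a lingering process...
ps -aux | grep 12345

# ...and, if it is yours and unwanted, terminate it
kill -9 12345
```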
```dockerfile
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

LABEL maintainer="UniboNLP"

# Zero interaction (default answers to all questions)
ENV DEBIAN_FRONTEND=noninteractive

# Set work directory
WORKDIR /workdir
ENV APP_PATH=/workdir

# Install general-purpose dependencies
RUN apt-get update -y && \
    apt-get install -y curl \
        git \
        bash \
        nano \
        python3.11 \
        python3-pip && \
    apt-get autoremove -y && \
    apt-get clean -y && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip
RUN pip install wrapt --upgrade --ignore-installed
RUN pip install gdown

COPY build/requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 install flash-attn==2.7.4.post1 --no-build-isolation

# Back to default frontend
ENV DEBIAN_FRONTEND=dialog
```
```
datasets==3.3.1
torch==2.6.0
transformers==4.49.0
colorlog==6.9.0
wandb==0.19.7
einops==0.8.0
python-dotenv==1.0.1
sentence-transformers==3.4.1
pretty-errors==1.2.25
accelerate==1.4.0
```
- `-f` is the name of the Dockerfile to use for building the image (default: "$(pwd)/Dockerfile").
- `-t` is used to define your image's name and, optionally, a tag (format: "name:tag").
- `<image-name>` is a placeholder; replace it with the name of your image (e.g., "project1-image:latest").

An example `train.sh` (the entry point that will run inside the container):

```bash
#!/bin/bash

flags="--model_checkpoint ${1} --dataset ${2}"
python3 main.py $flags
```
```bash
#!/bin/bash

PHYS_DIR="your project dir"  # e.g., /home/molfetta/project1

docker run \
    -v "$PHYS_DIR":/workspace \
    --rm \
    --memory="30g" \
    --gpus '"device='"$CUDA_VISIBLE_DEVICES"'"' \
    <image-name> \
    "/workspace/train.sh" \
    "${1}" "${2}"  # ... parameters to pass to the main function
```
If several of you work with the same large model (e.g., meta/Llama-3.1-8B), it would be a waste of disk space if you all downloaded a copy of it into your home directories. For this reason, a shared model cache is available at /llms on all machines. To use it, you have to mount it in your container as follows:
```bash
#!/bin/bash

PHYS_DIR="your project dir"  # e.g., /home/molfetta/project1
LLM_CACHE_DIR="/llms"
DOCKER_INTERNAL_CACHE_DIR="/llms"

docker run \
    -v "$PHYS_DIR":/workspace \
    -v "$LLM_CACHE_DIR":"$DOCKER_INTERNAL_CACHE_DIR" \
    -e HF_HOME="$DOCKER_INTERNAL_CACHE_DIR" \
    --rm \
    --memory="30g" \
    --gpus '"device='"$CUDA_VISIBLE_DEVICES"'"' \
    <image-name> \
    "/workspace/train.sh" \
    "${1}" "${2}"  # ... parameters to pass to the main function
```
The -e flag sets the environment variable HF_HOME to the path where the LLMs are stored. This way, the Hugging Face library will look for the models in the shared directory, saving disk space and time. (This is an example for models from HuggingFace; you can do the same for any other framework by changing the variable names, but 99.99% of the time you will use this configuration.)
sbatch_script.sh
```bash
#!/bin/bash

# PARAMS:
# 1: model_checkpoint
# 2: dataset

# 1: model_checkpoint
bart_base="facebook/bart-base"
bart_large="facebook/bart-large"

# 2: dataset
pubmed="pubmed"
arxiv="arxiv"

# run bart-base on arxiv
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_base" "$arxiv"

# run bart-large on arxiv
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_large" "$arxiv"

# run bart-base on pubmed
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_base" "$pubmed"

# run bart-large on pubmed
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_large" "$pubmed"
```
Set up a GitHub repository and clone it to your home directory.
git clone https://github.com/your-repo/project1.git /home/molfetta/project1
cd /home/molfetta/project1
Inside your project folder, create a build directory and build the Docker image.
mkdir /home/molfetta/project1/build
Run the following command to build the Docker image:
docker build -f build/Dockerfile -t project1_image_name .
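You can then verify that the image exists (and check its size) with:

```bash
docker images | grep project1_image_name
```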
Write .sh scripts that use docker run, referencing the image name created in the previous step.
Example script:
```bash
docker run -v /home/molfetta/project1:/workdir \
    -v /llms:/llms project1_image_name \
    ... \
    /workspace/RELATIVE_PATH_TO_TRAIN.sh ...
```
Before submitting a job, verify that the run_docker.sh script specifies the correct Docker image name. Then use the following command to submit the job via sbatch:
sbatch run_docker.sh
Your job is now scheduled and will be executed automatically. Just wait for the results!
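Once submitted, you can check where your job stands and follow its output; a quick sketch (the job ID is a placeholder):

```bash
# Show only your jobs, both running and pending
squeue -u $USER

# Follow the log file on the node where the job was allocated
tail -f slurm-21273412.out
```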
If you need to run your job on a specific machine, add the flag -w followed by the server name:
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w faretra train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w cloudifaicdlw001-System-Product-Name train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w deeplearn2 train.sh

USE IT SPARINGLY. Forcing the destination machine makes poor use of SLURM's scheduling and dynamic allocation capabilities, and may lead to a waste of resources.
Accessing the server via ssh and manually copying files using scp can be tedious and time-consuming. Every time you need to edit or transfer a file, you must run multiple commands, making development inefficient.
Instead, we recommend using Visual Studio Code's Remote - SSH extension. It allows you to connect to a remote server and interact with files as if they were on your local machine, with local-quality editing, code navigation, and debugging.
Follow these steps to set up and use the Remote - SSH extension in VS Code:
If you haven't installed VS Code yet, download it from the official website:
Open VS Code and install the extension: open the Extensions panel (Ctrl+Shift+X) and search for "Remote - SSH". Alternatively, install it directly from the marketplace.
To enable seamless SSH connections, ON YOUR LOCAL MACHINE configure the SSH settings in ~/.ssh/config (create the file if it doesn't exist yet). Copy-paste the following text into that file (using your username):
```
Host faretra
    HostName 137.204.107.40
    Port 37335
    User molfetta

Host moro232
    HostName 137.204.107.232
    Port 37335
    User molfetta

Host deeplearn2
    HostName 137.204.107.153
    Port 37335
    User molfetta
```
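With this configuration in place, plain terminal connections also become much shorter, for example:

```bash
# Equivalent to: ssh molfetta@137.204.107.40 -p 37335
ssh faretra
```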
Once installed, close and re-open VS Code. Then, at the bottom-left of your VS Code window, a green icon similar to "><" should appear. Click on it and select the machine you want to connect to from the drop-down menu (those names are taken from the ".ssh/config" file).
Once connected, you can browse and edit your remote files, use the integrated terminal, and debug directly from VS Code. Now, you can work on your remote machine as if it were local!
✅ Done! Now you can interact with your files efficiently using VS Code instead of manually copying them with scp.
Ensure reproducibility.
Make sure that (i) the code is well documented, (ii) files and directories have meaningful names.
Lastly, do not forget to write a README file detailing all the necessary steps for replicating the experiments outlined in your thesis/project.
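If your exact package versions are not already pinned in build/requirements.txt, a quick way to snapshot them for the README is (run inside the environment or container you actually used):

```bash
# Freeze the exact dependency versions used for the experiments
pip3 freeze > requirements.txt
```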
Push your code to GitHub and add us contributors.
Ensure that all your code is pushed to a GitHub repository. This will allow you to access your code from anywhere and share it with others. Then add us as contributors to your repository so we can access your code for evaluation and to run some tests.
Collect and clean your files.
Ensure to remove unnecessary files and images.
It is important to have all code and data stored on the 40 server, as only your home on the main server will be backed up.
Backup your files.
Before leaving, make sure to back up all your files to an external hard drive or cloud storage.
Remember that your home directory on the main server will be deleted after graduation.