SLURM is a cluster manager that allows us to dynamically schedule and allocate jobs on the GPUs of our servers.
Our computing resources are limited, while the number of users (e.g., undergraduates, PhD Students, PostDocs) and their projects are many.
Coordinating vocally (e.g., "I need GPU 0 on the X server for two hours, please don't use it") or via shared files would be a nightmare.
SLURM effectively manages the competition for GPUs within our cluster, ensuring equitable access for all users and projects while maximizing resource utilization and system efficiency.
Our SLURM cluster includes 4 servers with GPUs:
| Server Name | IP Address | SSH Port | GPUs |
|---|---|---|---|
| Faretra | 137.204.107.40 | 37335 | 💻 4 × NVIDIA GeForce RTX 3090 (24GB) |
| Deeplearn2 | 137.204.107.153 | 37335 | 💻 1 × NVIDIA GeForce RTX 3090 (24GB), 💻 1 × NVIDIA Titan XP (12GB) |
| Moro232 | 137.204.107.232 | 37335 | 💻 1 × NVIDIA GeForce RTX 3090 (24GB) |
We usually refer to nodes by the last octet of their IP address, the only part in which they differ (e.g., 40, 153, 232).
faretra (or 40) is the master node, the one on which you should store all your code and data and from which you should execute commands. The other nodes are used only for computation and storage. For instance, if you run a training script that gets executed on the 153 node, you should access that node only to check the saved output files (e.g., model checkpoints, metric results) and, if needed, move them back to the 40 server.
ssh username@ip -p port
To avoid entering your password at every connection, install your SSH key on the server with ssh-copy-id.
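For example, assuming you do not already have a key pair on your local machine, the following commands generate one and install it on the 40 node (replace username with your own):

```bash
# Generate an SSH key pair locally (skip if you already have one in ~/.ssh)
ssh-keygen -t ed25519

# Install the public key on the master node
ssh-copy-id -p 37335 username@137.204.107.40

# From now on, this connection no longer asks for your password
ssh username@137.204.107.40 -p 37335
```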
Write code you are proud of.
Take your time to craft high-quality implementations. Quality rather than quantity.
Strive to create code that not only accomplishes the task at hand but also stands as a testament to your skills and professionalism.
Write scripts that are well-structured, efficient, and easy to understand.
Do not use cryptic variable names, prioritize modular components, write unit tests when possible.
Start with notable examples from authoritative repositories or notebooks, take the best from them, and raise the quality bar.
Follow PEP 8, the de facto style guide for Python code.
Go on the cluster only when you are ready!
You should always prototype on Google Colab and move your codebase to the 40 node only when you are sure that it works properly (e.g., the training loop with a small-version model on a dataset sample starts without errors).
Note that a free-tier account on Google Colab gives you access to a T4 GPU with 15GB of VRAM; a single GeForce RTX 3090 on our servers has 24GB of VRAM and is significantly faster, with no usage limits.
Keep your home clean and essential.
The disk space of our servers is not infinite.
Students have a quota of 150GB on each server (i.e., you can be the owner of files for a total of 150GB).
You should carefully organize your home in directories.
Please, make sure to delete all the unnecessary files, including docker images and containers (see Section 7 for details).
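A few standard commands help you spot what is eating your quota; a minimal sketch (the prune command only removes unused Docker data, but review its prompt before confirming):

```bash
# Size of each top-level directory in your home, sorted
du -sh ~/* | sort -h

# Disk space used by Docker images, containers, and volumes
docker system df

# Remove stopped containers and dangling images
docker system prune
```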
Remote development using SSH.
Editing complex files on the 40 server using nano or vim may not always be the most convenient option.
Popular IDEs such as Visual Studio Code and PyCharm offer the capability to open a remote folder on any remote machine, virtual machine, or container with a running SSH server.
In this way, you can enjoy a local-quality development experience—including completions, code navigation, and debugging—regardless of where your code is hosted.
Visual Studio Code Remote - SSH extension (discussed in more detail below).
Reduce development time with GitHub Copilot (X).
By leveraging Large Language Models trained on vast code repositories, GitHub Copilot accelerates development by offering context-aware code suggestions and completions.
Seamlessly integrated into popular code editors like Visual Studio Code, Copilot analyzes code snippets and suggests relevant solutions in real-time, empowering developers to write high-quality code with speed and precision.
Your @studio.unibo.it credentials enable free access to this AI companion and its X version. Use it.
Switch to byobu before every long-running experiment.
Before initiating a lengthy computational task, such as model training, inference, or data processing that spans several hours or even days, it is crucial to ensure uninterrupted execution without keeping your local terminal open and your computer on.
However, closing the terminal session might terminate the remote task.
Byobu is a terminal multiplexer that enables you to detach from your session while preserving the running processes on our cluster.
With byobu, you can safely disconnect from the remote machine without fear of interrupting the ongoing computation.
Later, you can reattach to the session at any time to check the progress (e.g., training logs) or manage the task.
This ensures seamless execution of long-running experiments while maintaining flexibility and convenience in your workflow.
Create a session (default name)
byobu
After entering this command, a new byobu session will be created with the default name. From this point, you can proceed with your work as usual within the byobu session. You can safely close your local terminal or even turn off your local machine without interrupting the tasks running on the remote server. When you are ready to resume your work, you can simply reconnect to the server via SSH and retype byobu to reattach to your existing byobu session. Note: with byobu, you can open additional terminal windows or split the current window into multiple panes using keybindings.
Create a named session
byobu new -s {session-name}
Named sessions provide a convenient way to organize and manage multiple byobu sessions, especially when working on different experiments. You can easily identify and switch between named sessions using their designated names. If you type just byobu and there are multiple sessions or if you want to choose a specific session, byobu will prompt you to select the session from a list.
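A few byobu commands you will use all the time (byobu is a thin wrapper around tmux, so the usual tmux sub-commands work); a quick sketch with a hypothetical session name:

```bash
# List existing sessions
byobu ls

# Reattach to a specific named session
byobu attach -t my-experiment

# Detach from the current session (it keeps running): press F6

# Kill a session you no longer need
byobu kill-session -t my-experiment
```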
Do not use GPUs with more VRAM than you need.
In an era dominated by LLMs, it is tempting to assume that 12GB of VRAM is inadequate. However, for numerous projects, a GPU like the Titan Xp can more than suffice. It is crucial to match your GPU resources to the requirements of your specific tasks, avoiding unnecessary overhead and ensuring efficient resource utilization.
Fun fact: in 2019, Julien Chaumond (CTO @ HuggingFace) deemed himself GPU-rich when he owned a server equipped with merely two Titan Xps, a setup akin to our 153 node [Source].
Monitor your runs.
Especially when running an experiment for the first time, it is important to keep an eye on the server's RAM utilization to prevent slowdowns or unexpected interruptions. Utilizing tools like htop allows you to monitor system resources in real-time and identify any anomalies or excessive resource usage.
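For example, you can keep two byobu panes open, one for CPU/RAM and one for the GPUs:

```bash
# Interactive view of CPU, RAM, and per-process resource usage
htop

# Refresh GPU utilization and memory every 2 seconds
watch -n 2 nvidia-smi
```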
Track all your runs with WandB.
Keeping track of experiments and their results can become challenging over time. WandB (Weights & Biases) provides a comprehensive solution to this problem. It's a cloud service that offers powerful tools for experiment tracking, visualization, and collaboration. By registering for free and creating a project on WandB, you can efficiently manage and share your experiments with your supervisor or team members.

WandB allows you to log various aspects of your runs, including hyperparameters, prompts, predictions, loss values, and metrics. What's more, the logging is done in real time, enabling you to monitor the progress of your experiments from anywhere, even from your smartwatch. One of the standout features of WandB is its automatic tracking of system resources, such as the operating system, GPU type and usage, network traffic, RAM consumption, and running times. This information is invaluable for understanding the performance of your experiments and diagnosing any issues that may arise.

WandB keeps you informed about the status of your runs by sending emails or Slack messages when a run completes or crashes. It also generates insightful interactive visualizations, which you can easily download as .csv coordinate files and import into LaTeX for inclusion in reports or papers. Moreover, WandB offers advanced filtering and navigation capabilities, allowing you to easily navigate through all your experiments and smartly filter them based on criteria such as hyperparameter values, model types, dataset names, run status, or the user who launched the experiment.

Instead of using "prints" or other local Python logging systems, try to keep everything organized in a WandB repository and log everything there.
In WandB, it is even possible to upload per-run artifacts such as model checkpoints and files in general.
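Getting started takes a couple of commands; a minimal sketch, assuming your training script already calls wandb.init() and that "project1" is just an example project name:

```bash
# Install the client (same version pinned in our requirements.txt) and authenticate
pip install wandb==0.19.7
wandb login        # paste the API key from your WandB account settings

# Optionally select the target project via environment variable, then run as usual
export WANDB_PROJECT=project1
python3 main.py
```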
Save your checkpoints.
Consider privately storing your best checkpoints in your Hugging Face account.
On each server...
On the master node (40)...
Execute sinfo --Format=NodeAddr,CPUs:10,Gres:80 to make sure SLURM is working properly.
The output should be:
```
NODE_ADDR        CPUS  GRES
137.204.107.153  16    gpu:titan_xp:1(S:0),gpu:nvidia_geforce_rtx_3090:1(S:0)
137.204.107.40   48    gpu:nvidia_geforce_rtx_3090:4(S:0)
137.204.107.49   32    (null)
137.204.107.157  48    (null)
137.204.107.232  4     gpu:nvidia_geforce_rtx_3090:1(S:0)
```
In SLURM, a shared virtual queue serves as a centralized point to which jobs of any user can be appended.
The SLURM scheduler efficiently manages and allocates resources for execution based on specified in-queue job requirements and system availability.
Once suitable GPUs are identified, the scheduler allocates them to a job based on its queue position.
Importantly, the SLURM queue operates with a dynamic priority assignment rather than adhering to a strict First-In-First-Out strategy.
Instead of solely relying on the order of job submission, SLURM calculates an integer priority for each task by considering a multitude of factors, including load balancing between users.
For example, if a user before you queues 50 jobs, you will not be in 51st position.
The IDs of the GPU(s) assigned to your job are exposed through the environment variable `$CUDA_VISIBLE_DEVICES`, which is specific to the node to which the GPU(s) belong(s).
Note that each job created within SLURM is assigned a unique identifier (job ID).
The command sbatch is used to schedule the execution of a script file (e.g., the main one containing your training loop).
The job is handled in the background by SLURM and is no longer linked to the shell you used to submit it.
This means that, after submission, you can log out and close the terminal without consequences: when your turn comes, your job will be executed and the GPUs freed upon its completion (i.e., non-blocking behavior).
In fact, once a sbatch script completes its execution, SLURM releases the allocated resources automatically, including GPU locks, and moves on to the next task in the queue.
This allows the cluster to minimize GPU wastage and maximize overall throughput (i.e., tasks completed within a given time frame).
Within our SLURM web application, users operating in sbatch mode can be identified by the inclusion of the script name (e.g., "run_docker.sh") alongside the job.
By default, standard output and standard error are redirected to a file named "slurm-%j.out", where "%j" is replaced with the job ID.
If your job ends with an error, this file will help you identify and troubleshoot any issues that may have occurred.
"slurm-%j.out"
file will be generated on the node where the job was allocated (NOT the one you ran the scheudling command from).
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 train.sh
- `-N 1` tells SLURM to run the task on one node (you will never need to edit it).
- `--gpus=nvidia_geforce_rtx_3090:1` specifies constraints to SLURM, such as the number of GPUs you need and their type. Every wish is an order. In this example, we assume that your job is particularly expensive and 12GB of VRAM are not enough; therefore, you ask the scheduler for one NVIDIA GeForce RTX 3090 GPU. Use `--gpus=1` to request any GPU.
- `train.sh` is the name of the script you want to execute (e.g., containing "python main.py"). Note that the file lookup happens when your job's turn comes up and resources are allocated for it. Therefore, the latest version of your script is used when your job is actually executed, even if you make modifications after the initial queue submission.
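If you prefer to collect the slurm-%j.out log files in a dedicated directory, sbatch lets you override the default paths; a small sketch (the logs/ directory is just an example and must already exist on the node that runs the job):

```bash
# %j expands to the job ID, as in the default file name
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 \
    --output=logs/%j.out --error=logs/%j.err train.sh
```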
Don't forget to ensure that the script is executable, especially if the file was transferred from GitHub or WinSCP. You can do this by running chmod +x train.sh.

Exercise caution when relying solely on our SLURM web application. It works by periodically queuing snapshot jobs with root permissions (maximum priority). As a result, while user assignments and runtimes remain current, metrics such as temperature, memory, and usage may lag behind, depending on the frequency of job recirculation. For precise resource usage statistics, access the server of your choice and execute the nvidia-smi command.

To view the comprehensive resource allocation status within SLURM, use the squeue command. It provides details (such as JobID, execution time, and NodeHost) for both running and pending jobs. The output should look like this:
```
JOBID     PARTITION  NAME           USER        ST  TIME        NODES  NODELIST(REASON)
21273412  main       fol_trainin    zangrillo   PD  0:00        1      (Resources)
21182964  main       graph_extr     zeng        PD  0:00        1      (ReqNodeNotAvail, UnavailableNodes:faretra)
21185262  main       progr_synth    delvecchio  PD  0:00        1      (ReqNodeNotAvail, UnavailableNodes:faretra)
21187148  main       reform_graph   freddi      PD  0:00        1      (ReqNodeNotAvail, UnavailableNodes:faretra)
21183062  main       prompt_learn   fantazzini  R   1-05:13:04  1      faretra
21182878  main       med_retr       frisoni     R   1:29:17     1      faretra
21055690  main       run_summ       ragazzi     R   1-16:35:12  1      faretra
21224223  main       llm_bench      cocchier    R   17:16:20    1      cloudifaicdlw001-System-Product-Name
20990504  main       diff_sampl     italiani    R   1-22:14:28  1      faretra
21267175  main       legal_llm_inf  moro        R   3:54:47     1      moro232
21261646  main       run_kg_inject  molfetta    R   3:55:43     1      deeplearn2
```

ST=status, PD=pending, R=running.
To cancel one of your own jobs, use the scancel command: scancel <job_id>.

After cancelling a job (or after a crash), run nvidia-smi to check that no processes are still occupying the GPUs. Identify the owner of any lingering process using ps -aux | grep <PID>. If you detect unwanted processes of yours, terminate them manually with kill -9 <PID>.

Keep in mind that if you only kill the underlying processes without also running scancel, your job may remain in the queue, preserving its priority and potentially blocking GPUs unnecessarily.
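Putting it together, a typical cleanup after a misbehaving run might look like this (the job ID and PID below are placeholders):

```bash
# Remove the job from the SLURM queue
scancel 21273412

# Check whether any process is still holding GPU memory
nvidia-smi

# Find out who owns a lingering process...
ps -aux | grep 12345

# ...and, if it is yours and unwanted, terminate it
kill -9 12345
```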
```dockerfile
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

LABEL maintainer="UniboNLP"

# Zero interaction (default answers to all questions)
ENV DEBIAN_FRONTEND=noninteractive

# Set work directory
WORKDIR /workdir
ENV APP_PATH=/workdir

# Install general-purpose dependencies
RUN apt-get update -y && \
    apt-get install -y curl \
        git \
        bash \
        nano \
        python3.11 \
        python3-pip && \
    apt-get autoremove -y && \
    apt-get clean -y && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip
RUN pip install wrapt --upgrade --ignore-installed
RUN pip install gdown

COPY build/requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 install flash-attn==2.7.4.post1 --no-build-isolation

# Back to default frontend
ENV DEBIAN_FRONTEND=dialog
```
```
datasets==3.3.1
torch==2.6.0
transformers==4.49.0
colorlog==6.9.0
wandb==0.19.7
einops==0.8.0
python-dotenv==1.0.1
sentence-transformers==3.4.1
pretty-errors==1.2.25
accelerate==1.4.0
```
- `-f` is the name of the Dockerfile to use for building the image (default: "$(pwd)/Dockerfile").
- `-t` is used to define your image's name and, optionally, a tag (format: "name:tag").
- `<image-name>` is a placeholder; replace it with the name of your image (e.g., "project1-image:latest").

An example `train.sh` (the entry point that will run inside the container):

```bash
#!/bin/bash

flags="--model_checkpoint ${1} --dataset ${2}"
python3 main.py $flags
```
```bash
#!/bin/bash

PHYS_DIR="your project dir"  # e.g., /home/molfetta/project1

docker run \
    -v "$PHYS_DIR":/workspace \
    --rm \
    --memory="30g" \
    --gpus '"device='"$CUDA_VISIBLE_DEVICES"'"' \
    <image-name> \
    "/workspace/train.sh" \
    "${1}" "${2}"  # ... parameters to pass to the main function
```
If several of you work with the same large model (e.g., meta/Llama-3.1-8B), it would be a waste of disk space if you all downloaded a copy of it into your home directories. For this reason, a shared model cache is available at /llms on all machines. To use it, you have to mount it in your container as follows:
```bash
#!/bin/bash

PHYS_DIR="your project dir"  # e.g., /home/molfetta/project1
LLM_CACHE_DIR="/llms"
DOCKER_INTERNAL_CACHE_DIR="/llms"

docker run \
    -v "$PHYS_DIR":/workspace \
    -v "$LLM_CACHE_DIR":"$DOCKER_INTERNAL_CACHE_DIR" \
    -e HF_HOME="$DOCKER_INTERNAL_CACHE_DIR" \
    --rm \
    --memory="30g" \
    --gpus '"device='"$CUDA_VISIBLE_DEVICES"'"' \
    <image-name> \
    "/workspace/train.sh" \
    "${1}" "${2}"  # ... parameters to pass to the main function
```
The -e flag sets the environment variable HF_HOME to the path where the LLMs are stored. This way, the Hugging Face library will look for the models in the shared directory, saving disk space and time. (This is an example for models from HuggingFace; you can do the same for any other framework by changing the variable names, but 99.99% of the time you will use this configuration.)
sbatch_script.sh
```bash
#!/bin/bash

# PARAMS:
# 1: model_checkpoint
# 2: dataset

# 1: model_checkpoint
bart_base="facebook/bart-base"
bart_large="facebook/bart-large"

# 2: dataset
pubmed="pubmed"
arxiv="arxiv"

# run bart-base on arxiv
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_base" "$arxiv"

# run bart-large on arxiv
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_large" "$arxiv"

# run bart-base on pubmed
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_base" "$pubmed"

# run bart-large on pubmed
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_large" "$pubmed"
```
Set up a GitHub repository and clone it to your home directory.
git clone https://github.com/your-repo/project1.git /home/molfetta/project1
cd /home/molfetta/project1
Inside your project folder, create a build directory and build the Docker image.
mkdir /home/molfetta/project1/build
Run the following command to build the Docker image:
docker build -f build/Dockerfile -t project1_image_name .
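You can then verify that the image exists (and check its size) with:

```bash
docker images | grep project1_image_name
```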
Write .sh scripts that use docker run, referencing the image name created in the previous step.
Example script:
```bash
docker run -v /home/molfetta/project1:/workdir \
    -v /llms:/llms project1_image_name \
    ... \
    /workspace/RELATIVE_PATH_TO_TRAIN.sh ...
```
Before submitting a job, verify that the run_docker.sh script specifies the correct Docker image name. Then use the following command to submit the job via sbatch:
sbatch run_docker.sh
Your job is now scheduled and will be executed automatically. Just wait for the results!
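Once submitted, you can check where your job stands and follow its output; a quick sketch (the job ID is a placeholder):

```bash
# Show only your jobs, both running and pending
squeue -u $USER

# Follow the log file on the node where the job was allocated
tail -f slurm-21273412.out
```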
If you need to run your job on a specific machine, add the flag -w followed by the server name:
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w faretra train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w cloudifaicdlw001-System-Product-Name train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w deeplearn2 train.sh

USE IT SPARINGLY. Forcing the destination machine makes poor use of SLURM's scheduling and dynamic allocation capabilities, and may lead to a waste of resources.
Accessing the server via ssh and manually copying files using scp can be tedious and time-consuming. Every time you need to edit or transfer a file, you must run multiple commands, making development inefficient.
Instead, we recommend using Visual Studio Code's Remote - SSH extension. It allows you to connect to a remote server and interact with files as if they were on your local machine, with local-quality editing, code navigation, and debugging.
Follow these steps to set up and use the Remote - SSH extension in VS Code:
If you haven't installed VS Code yet, download it from the official website:
Open VS Code and install the extension: open the Extensions panel (Ctrl+Shift+X) and search for "Remote - SSH". Alternatively, install it directly from the marketplace.
To enable seamless SSH connections, ON YOUR LOCAL MACHINE configure the SSH settings in ~/.ssh/config (create the file if it doesn't exist yet). Copy-paste the following text into that file (using your username):
```
Host faretra
    HostName 137.204.107.40
    Port 37335
    User molfetta

Host moro232
    HostName 137.204.107.232
    Port 37335
    User molfetta

Host deeplearn2
    HostName 137.204.107.153
    Port 37335
    User molfetta
```
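With this configuration in place, plain terminal connections also become much shorter, for example:

```bash
# Equivalent to: ssh molfetta@137.204.107.40 -p 37335
ssh faretra
```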
Once installed, close and re-open VS Code. Then, at the bottom-left of your VS Code window, a green icon similar to "><" should appear. Click on it and select the machine you want to connect to from the drop-down menu (those names are taken from the ".ssh/config" file).
Once connected, you can browse and edit your remote files, use the integrated terminal, and debug directly from VS Code. Now, you can work on your remote machine as if it were local!
✅ Done! Now you can interact with your files efficiently using VS Code instead of manually copying them with scp.
Ensure reproducibility.
Make sure that (i) the code is well documented, (ii) files and directories have meaningful names.
Lastly, do not forget to write a README file detailing all the necessary steps for replicating the experiments outlined in your thesis/project.
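If your exact package versions are not already pinned in build/requirements.txt, a quick way to snapshot them for the README is (run inside the environment or container you actually used):

```bash
# Freeze the exact dependency versions used for the experiments
pip3 freeze > requirements.txt
```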
Push your code to GitHub and add us contributors.
Ensure that all your code is pushed to a GitHub repository. This will allow you to access your code from anywhere and share it with others. Then add us as contributors to your repository so we can access your code for evaluation and to run some tests.
Collect and clean your files.
Ensure to remove unnecessary files and images.
It is important to have all code and data stored on the 40 server, as only your home on the main server will be backed up.
Backup your files.
Before leaving, make sure to back up all your files to an external hard drive or cloud storage.
Remember that your home directory on the main server will be deleted after graduation.