
Cluster Usage Guide

Official Website

Visit our NLP research site at University of Bologna.

LinkedIn

Connect for updates on research and opportunities.

Hugging Face

Explore our models and contribute to our projects.

1. SLURM 🚥

SLURM is a cluster manager that allows us to dynamically schedule and allocate jobs on the GPUs of our servers.
Our computing resources are limited, while the users (e.g., undergraduates, PhD students, postdocs) and their projects are many. Coordinating verbally (e.g., "I need GPU 0 on server X for two hours, please don't use it") or via shared files would be a nightmare. SLURM effectively manages the competition for GPUs within our cluster, ensuring equitable access for all users and projects while maximizing resource utilization and system efficiency.


2. Servers 🗂

Our SLURM cluster includes 4 servers with GPUs:

Server Name   IP Address        SSH Port   GPUs
Faretra       137.204.107.40    37335      💻 4 × NVIDIA GeForce RTX 3090 (24GB)
Moro43        137.204.107.43    22         💻 1 × NVIDIA GeForce RTX 5090 (32GB)
Deeplearn2    137.204.107.153   37335      💻 1 × NVIDIA GeForce RTX 3090 (24GB)
                                           💻 1 × NVIDIA Titan XP (12GB)
Moro232       137.204.107.232   37335      💻 1 × NVIDIA GeForce RTX 3090 (24GB)



2.1. Notation

We usually refer to nodes by the last octet of their IP address, the only part in which they differ (e.g., 40, 153, 232).

2.2. Master Node

faretra (or 40) is the master node: the one where you should store all your code and data and from which you should run your commands. The other nodes are used only for computation and storage. For instance, if you run a training script that gets executed on node 153, you should access that node only to check the saved output files (e.g., model checkpoints, metric results) and, if needed, move them back to the 40 server.

2.3. Accessing

ssh username@ip -p port

To avoid typing your password at every login, copy your SSH key to each server with ssh-copy-id; to avoid repeating the IP and port as well, add a host entry to your local ~/.ssh/config (see Section 8).
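A minimal sketch, assuming the master node's address and port from the table above and molfetta as a placeholder username:

# Generate a key pair on your local machine (skip if you already have one)
ssh-keygen -t ed25519

# Copy the public key to the server (asks for your password one last time)
ssh-copy-id -p 37335 molfetta@137.204.107.40

# From now on, log in without typing a password
ssh -p 37335 molfetta@137.204.107.40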


3. Working Mode 🚧


4. Preliminary Steps 🛫

On each server...

  1. Before starting, refresh your Linux knowledge with a cheatsheet for engineers.
  2. Make sure you have an account by logging in with the credentials that you have been provided.
  3. Use passwd to replace your default password with a private one of at least 10 characters, mixing lower-case and upper-case letters, numbers, and symbols. Avoid reusing personal passwords from other services.
  4. Install docker rootless by running the script install_rootless_docker.sh.
    • After the installation is complete, restart the shell and check that docker is working by running docker ps (no errors should be returned).
    • If docker is not working, try executing systemctl --user start docker.
    • If it still doesn't work, try running docker_rootless_fix.sh, which will delete your current installation (along with all your images and containers!!) and reinstall docker rootless.

On the master node (40)...

  1. Execute sinfo --Format=NodeAddr,CPUs:10,Gres:80 to make sure SLURM is working properly.
    The output should be:

    
                        NODE_ADDR           CPUS      GRES                                                                            
                        137.204.107.153     16        gpu:titan_xp:1(S:0),gpu:nvidia_geforce_rtx_3090:1(S:0)                          
                        137.204.107.40      48        gpu:nvidia_geforce_rtx_3090:4(S:0)                                              
                        137.204.107.232     4         gpu:nvidia_geforce_rtx_3090:1(S:0)                                              
                        137.204.107.43      32        gpu:nvidia_geforce_rtx_5090:1(S:0-23)                                           
                        137.204.107.49      32        (null)                                                                          
                        137.204.107.157     48        (null)   
                    
  2. Upon logging in, you'll land in your home directory, which is initially empty. To start your project, upload all required files, either by pulling from a GitHub repository or by using a file transfer tool like WinSCP (e.g., local-remote drag and drop), as sketched below.
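    Both routes can also be done from the command line. A short sketch, assuming a repository named project1 (as in Section 7) and molfetta as a placeholder username:

    # Option A: pull your code from GitHub (run on the server)
    git clone https://github.com/your-repo/project1.git /home/molfetta/project1

    # Option B: copy a local folder to the server (run from your local machine)
    scp -P 37335 -r ./project1 molfetta@137.204.107.40:/home/molfetta/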

5. Executing Tasks with SLURM 🚥

In SLURM, a shared virtual queue serves as a centralized point to which jobs of any user can be appended. The SLURM scheduler efficiently manages and allocates resources for execution based on specified in-queue job requirements and system availability. Once suitable GPUs are identified, the scheduler allocates them to a job based on its queue position. Importantly, the SLURM queue operates with a dynamic priority assignment rather than adhering to a strict First-In-First-Out strategy. Instead of solely relying on the order of job submission, SLURM calculates an integer priority for each task by considering a multitude of factors, including load balancing between users. For example, if a user before you queues 50 jobs, you will not be in 51st position.

💡 Tip: If you find yourself at the bottom of the queue and need a job to run immediately (e.g., due to an impending deadline), you can reach out to the SLURM administrator via Microsoft Teams (lorenzo.molfetta@unibo.it). If the request is deemed reasonable, your priority value will be raised accordingly.


This automatic GPU assignment ensures fair allocation and efficient resource utilization. Ideally, the GPUs should operate continuously, 24/7. When a job is assigned one or more GPUs, their indices are stored in the environment variable $CUDA_VISIBLE_DEVICES, which is specific to the node the GPU(s) belong to. Note that each job created within SLURM is assigned a unique identifier.
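To see where your jobs sit in the queue and what was allocated to them, the standard SLURM inspection commands are enough. A short sketch (the job ID is the one printed by sbatch at submission time):

# List all queued and running jobs (ST column: R = running, PD = pending)
squeue

# Show only your own jobs
squeue -u $USER

# Inspect a specific job: allocated node, requested GPUs, priority, and the reason it is still pending
scontrol show job <job-id>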

5.1. Asynchronous Job Scheduling (SBATCH)

The command sbatch is used to schedule the execution of a script file (e.g., the main one containing your training loop). The job is handled in the background by SLURM and is no longer linked to the shell you used to submit it. This means that, after submission, you can log out and close the terminal without consequences: when your turn comes, your job will be executed, and the GPUs will be freed upon its completion (i.e., non-blocking behavior). Once an sbatch script completes its execution, SLURM automatically releases the allocated resources, including GPU locks, and moves on to the next task in the queue. This allows the cluster to minimize GPU wastage and maximize overall throughput (i.e., tasks completed within a given time frame). Within our SLURM web application, users operating in sbatch mode can be identified by the inclusion of the script name (e.g., "run_docker.sh") alongside the job.
By default, standard output and standard error are redirected to a file named "slurm-%j.out", where "%j" is replaced with the job ID. If your job ends with an error, this file will help you identify and troubleshoot any issues that may have occurred.

📝 Note: The "slurm-%j.out" file will be generated on the node where the job was allocated (NOT the one you ran the scheduling command from).

📝 Note: You should use on-cloud WandB logging to track your runs (not print statements).

Utilization:
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 train.sh
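
If you want the log file to have a more recognizable name and location, sbatch also accepts --job-name and --output. A sketch (the job name and the /home/molfetta/logs directory are placeholders; the directory must already exist on the node where the job lands, see Section 7; %x expands to the job name and %j to the job ID):

sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 \
       --job-name=bart-pubmed \
       --output=/home/molfetta/logs/%x-%j.out \
       train.sh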


📝 Note: The sbatch command is not suitable for debugging. If you need to debug your script, run it interactively: waiting for a queued job to start is a slow way to iterate, especially when the queue is busy and resources are not immediately available.

We recommend using Colab for this purpose. Test your code in Colab with smaller models, datasets, and batch sizes. Once you are sure that your code works, you can move it to the cluster with enhanced configurations.

📝 GPU Selection Strategy: Consider your requirements carefully when choosing between RTX 3090 and RTX 5090. The RTX 5090 offers superior performance with 32GB VRAM and is highly optimized for both training and inference tasks. However, since there's only one RTX 5090 in the cluster, requesting it specifically may result in longer queue times. In contrast, we have multiple RTX 3090 GPUs available, which might lead to faster job execution despite their lower individual performance. If your workload can fit within 24GB VRAM and you prefer shorter wait times, consider using RTX 3090. Reserve the RTX 5090 for memory-intensive tasks that truly require the additional 8GB VRAM or benefit significantly from its enhanced performance.


5.2. Job Management and Monitoring

Jobs can be removed from the queue (or cancelled while running) with scancel <job-id>.

📝 Note: If your script is already running, the scancel command may not directly stop your process: it might only remove the job from the SLURM queue. Depending on your script type, your script could continue running, leaving the GPU occupied without SLURM's awareness 😱.

Therefore, after canceling a running script with scancel, always verify GPU usage with nvidia-smi. Identify the owner of any lingering processes using ps -aux | grep <PID>. If you detect unwanted processes, terminate them manually with kill -9 <PID>.

Conversely, if you manually kill your processes without removing the job from the SLURM queue using scancel, your job may remain in the queue, preserving its priority and potentially blocking GPUs unnecessarily.
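Putting both cases together, a minimal cleanup sketch, where <job-id> and <PID> are placeholders for the values you read from squeue and nvidia-smi:

# 1. Remove the job from the SLURM queue
scancel <job-id>

# 2. Check whether a process is still holding the GPU
nvidia-smi

# 3. Find out who owns a lingering process
ps aux | grep <PID>

# 4. If it is yours, terminate it manually
kill -9 <PID>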

5.3. General Recommendations

It is advisable not to occupy a GPU for more than 3 days continuously. This practice helps maintain fast access to GPU resources and facilitates their recirculation, benefiting all users. Any deviation from this guideline should be approved by your supervisor. If you encounter the need to execute a longer task, you should divide it into multiple jobs. For instance, you can opt for incremental training stages that resume from the last saved checkpoint.
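One way to split a long run, sketched below, is to chain jobs with sbatch's --dependency flag so that each stage starts only after the previous one completes successfully and resumes from the checkpoint your script saved (the checkpoint loading itself is up to your training code):

# Stage 1: first training chunk (should save a checkpoint before exiting)
jid=$(sbatch --parsable -N 1 --gpus=nvidia_geforce_rtx_3090:1 train.sh)

# Stage 2: starts only if stage 1 ended successfully; resumes from the saved checkpoint
sbatch --dependency=afterok:$jid -N 1 --gpus=nvidia_geforce_rtx_3090:1 train.sh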

WARNING: SLURM supports another allocation command (the one who shall not be named). Commands other than sbatch are not allowed on our cluster: jobs launched any other way will be killed without notice. Please always remember to use sbatch for all your job submissions.



6. Docker 🚀

There are many users (just run ls .. from your home directory to see how many active-user homes exist). Each user can have several projects (e.g., the proposed method and some baselines). Each project comes with its own distinct set of requirements (Python libraries and their specific versions). Directly installing or updating libraries on the physical machine would be impractical. Hence, we rely heavily on Docker, where each user executes a specific project within a sandbox: a virtual environment equipped with all the necessary files and dependencies.

In Docker, there are three fundamental components: Dockerfile, Image, and Container. These components are interdependent, meaning they build upon each other in an incremental manner. See Docker in a nutshell.
  1. Dockerfile. The Dockerfile serves as the blueprint for creating your virtual environment. It contains instructions that specify which dependencies and files you will find once you enter the environment. Dockerfiles are typically written in a simple, declarative syntax and can be version-controlled alongside your project code. Indeed, a Dockerfile is a raw file without extensions ("nano Dockerfile"). Instructions in a Dockerfile include FROM (specifying the base image), WORKDIR (name of the "home directory" for the virtual environment), RUN (executing commands inside the image), COPY (copying files into the image), and CMD (defining the default command to run when a container is started). It is essential to indicate the specific version for each package in the Dockerfile. This practice ensures reproducibility and prevents unexpected behavior due to potential updates or changes in package versions.

    As a reference and starting point to customize to your needs, you can use the following Dockerfile:

    FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
    LABEL maintainer="UniboNLP"
    
    # Zero interaction (default answers to all questions)
    ENV DEBIAN_FRONTEND=noninteractive
    
    # Set work directory
    WORKDIR /workspace
    ENV APP_PATH=/workspace
    
    # Install general-purpose dependencies
    RUN apt-get update -y && \
        apt-get install -y curl \
                            git \
                            bash \
                            nano \
                            python3.11 \
                            python3-pip && \
        apt-get autoremove -y && \
        apt-get clean -y && \
        rm -rf /var/lib/apt/lists/*
    
    RUN pip install --upgrade pip
    RUN pip install wrapt --upgrade --ignore-installed
    RUN pip install gdown
    
    
    COPY build/requirements.txt .
    
    RUN pip install --no-cache-dir -r requirements.txt
    
    RUN VLLM_FLASH_ATTN_VERSION=2 MAX_JOBS=16 pip install flash-attn --no-build-isolation
    
    # Back to default frontend
    ENV DEBIAN_FRONTEND=dialog


    We recommend avoiding modifications to the WORKDIR name (standardized naming convention).

    For clarity and maintainability, you can organize the list of dependencies and their specific versions in a requirements.txt file. You can modify the following according to your specific project's requirements.

    datasets==3.3.1
    torch==2.6.0
    transformers==4.49.0
    colorlog==6.9.0
    wandb==0.19.7
    einops==0.8.0
    python-dotenv==1.0.1
    sentence-transformers==3.4.1
    pretty-errors==1.2.25
    accelerate==1.4.0


    💡 Tip: If you encounter any difficulties, you have the option to refer to the Dockerfile of your colleagues. You possess read permissions on the home directories of other undergraduates.


    RTX 5090 Special Requirements 📄

    Our cluster now includes a cutting-edge NVIDIA GeForce RTX 5090 with 32GB VRAM, providing exceptional computational power for the most demanding AI workloads. However, due to its recent release and advanced architecture, this GPU requires specific library versions and compilation configurations that differ from our standard setup.

    ⚠️ Important: The RTX 5090 uses the latest CUDA architecture and requires libraries to be compiled from source rather than installed from PyPI. This is necessary to ensure compatibility with the GPU's advanced features and optimal performance.

    When targeting the RTX 5090, you'll need to use a specialized Dockerfile that handles local compilation of critical libraries like flash-attn and PyTorch. Here's the recommended Dockerfile for RTX 5090 projects:

    FROM nvidia/cuda:12.8.0-devel-ubuntu24.04
    LABEL maintainer="disi-Unibo-NLP"
    
    ENV DEBIAN_FRONTEND=noninteractive
    WORKDIR /workspace
    ENV APP_PATH=/workspace
    
    
    ENV TORCH_CUDA_ARCH_LIST="12.0"
    
    # Install dependencies including python3.12-venv
    RUN apt-get update -y && \
        apt-get install -y curl \
                           git \
                           bash \
                           nano \
                           python3.12 \
                           python3-pip \
                           python3.12-venv && \
        apt-get autoremove -y && \
        apt-get clean -y && \
        rm -rf /var/lib/apt/lists/*
    
    # Create and activate virtual environment
    RUN python3 -m venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"
    
    # Now pip commands work normally
    RUN pip install --upgrade pip
    RUN pip install wrapt --upgrade --ignore-installed
    RUN pip install gdown
    
    
    COPY build/requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    
    # Install PyTorch with CUDA 12.8 support for RTX 5090
    RUN pip install --no-cache-dir \
        torch==2.7.1+cu128 \
        torchvision==0.22.1+cu128 \
        torchaudio==2.7.1+cu128 \
        --index-url https://download.pytorch.org/whl/cu128
    
    
    
    # Install flash-attn with RTX 5090 specific compilation flags
    RUN VLLM_FLASH_ATTN_VERSION=2 MAX_JOBS=16 FLASH_ATTN_CUDA_ARCHS=128 pip install flash-attn --no-build-isolation
    
    
    ENV DEBIAN_FRONTEND=dialog


    📝 Note: You can use the same requirements.txt file, but make sure to remove the torch package from it (PyTorch is installed separately above).

    🔧 Flash-Attention Issues: If you encounter errors with flash-attn during runtime, add the following line after your imports in your Python script:

    torch._dynamo.config.cache_size_limit = 32

    This resolves compilation cache issues specific to the RTX 5090's architecture.


  2. Image. An image is a template for creating containers based on the instructions specified in the Dockerfile. Each instruction in the Dockerfile adds a layer to the image. You can visualize an image as a stamp that can be applied repeatedly to create multiple identical containers. Images are immutable, meaning they cannot be modified once they are created. However, they can be used as the basis for creating new images (see FROM in a Dockerfile) or running containers. Images can be stored in the Docker Hub for sharing and distribution.
    • Create an image.
      • docker build -f build/Dockerfile -t <image-name> .
        • -f is the name of the Dockerfile to use for building the image (default: "$(pwd)/Dockerfile")
        • -t defines your image's name and, optionally, a tag (format: "name:tag")
        • <image-name> is a placeholder; replace it with the name of your image (e.g., "project1-image:latest").
        • The last argument is the build context; "." indicates the current directory (paths referenced by COPY are resolved relative to it, while the Dockerfile itself is the one passed with -f).
        • A single Docker image can occupy a significant amount of disk space (e.g., 20GB). Creating an image can take some time, particularly during the build process, as Docker needs to download, extract, and process all the required layers and dependencies specified in the Dockerfile. To optimize image size and build times, it is recommended to use small base images and to minimize the number of layers in the Dockerfile by consolidating related commands.
        • When you recreate an image with the same name and an updated Dockerfile, Docker strives to reconstruct the image efficiently by reusing cached layers and only creating new layers for the changes or additions made since the last build.
    • Verify existing images.
      • docker images
    • Delete an image.
      • docker rmi <image-id>
        • Given that Docker images can occupy substantial amounts of disk space, we kindly request that you delete any images that are not currently in use.
      • docker image prune
        • This removes all dangling (unused) Docker images, which are images not associated with any containers. It is a good practice for cleaning up Docker resources in one go.

  3. Container.
    A container is a runtime instance of a Docker image. Each container runs as an isolated process on the host system, with its own filesystem, networking, and process space. Containers are created from Docker images using the docker run command. In the following paragraphs, you will delve deeper into containers and explore their integration with SLURM.
    • Container + SBATCH.
      We suggest a pipeline of three files.
      1. train.sh
        #!/bin/bash

        flags="--model_checkpoint ${1} --dataset ${2}"

        python3 main.py $flags

        • This bash script simply executes the Python file of your interest (e.g., model training).
        • As a proficient AI developer, it is common practice to explore several hyperparameters or experimental configurations (e.g., learning rate, number of epochs, base model, dataset). However, it is crucial to avoid creating separate files with nearly identical code except for these minor changes. Instead, you should design your script to accept hyperparameters as arguments, allowing for flexibility and reusability.
      2. run_docker.sh
        #!/bin/bash

        PHYS_DIR="your project dir" # e.g., /home/molfetta/project1

        docker run \
            -v "$PHYS_DIR":/workspace \
            --rm \
            --memory="30g" \
            --gpus '"device='"$CUDA_VISIBLE_DEVICES"'"' \
            <image-name> \
            "/workspace/train.sh" \
            "${1}" "${2}" # ... parameters to pass to the main function

        • When creating a container, we must specify which image (stamp) to use.
        • The asynchronous operational mode of `sbatch` is also applicable to Docker containers. This encompasses requesting non-interactive execution of a file within a container or initiating an interactive shell session on the container. We pass the file generated in step (1) along with its associated arguments.
        • By default, a container does not have visibility of the underlying GPUs in the physical machines. If GPU access is required within a container, you need to explicitly specify the --gpus flag when starting the container to enable GPU support. With $CUDA_VISIBLE_DEVICES, you utilize only the GPU that has been assigned to you by the SLURM scheduler.
        • --rm indicates disposability. Once the execution of the file is completed, the container, along with all the output files within, will be deleted. However, if the computation spans several days, deleting all outputs may not be desirable. To address this, the -v flag provides a solution. It establishes a portal between a directory on the physical machine (left) and a directory on the container (right). Any data present or saved in one directory is mirrored in the other. This is why the train.sh script can be found in the WORKDIR despite setting an empty virtual environment in the Dockerfile (i.e., no file imported, only ready-to-use libraries). As a result, no output is lost, as it is automatically backed up in the project directory on the underlying machine.


      📝 Note: HELP US SAVE SOME DISK SPACE. The number of students and researchers using these servers is increasing. To keep up with the demand, we would be glad if you could help us save some disk space. Since many of you may use the same Large Language Model (e.g., meta-llama/Llama-3.1-8B), it would be a waste of space if you all downloaded a copy of this model into each of your home directories.

        Instead, we suggest downloading the model to a shared path (where the model you need may already be present!!) and mounting this directory in your container. This way, you will save disk space and you won't waste time downloading the same model over and over again.

        HOW TO DO THAT❓
        On all our machines there is a shared directory, /llms, where LLMs are saved; it is accessible to everyone. To use it, mount it in your container as follows:

        #!/bin/bash
         
        PHYS_DIR="your project dir" # e.g., /home/molfetta/project1
        LLM_CACHE_DIR="/llms"
        DOCKER_INTERNAL_CACHE_DIR="/llms"
        
        docker run \
            -v "$PHYS_DIR":/workspace \
            -v "$LLM_CACHE_DIR":"$DOCKER_INTERNAL_CACHE_DIR" \
            -e HF_HOME="$DOCKER_INTERNAL_CACHE_DIR" \
            --rm \
            --memory="30g" \
            --gpus '"device='"$CUDA_VISIBLE_DEVICES"'"' \
            <image-name> \
            "/workspace/train.sh" \
            "${1}" "${2}" # ... parameters to pass to the main function


        In the example above, which you can copy-paste into your projects, we mount the shared folder to make it visible from within the container and set the environment variable HF_HOME to the path where the LLMs are stored. This way, the Hugging Face library will look for models in the shared directory, saving disk space and time. (This example is for Hugging Face models; you can do the same for any other framework by changing the variable names, but 99.99% of the time you will use this configuration.)


      3. sbatch_script.sh

        #!/bin/bash

        # PARAMS:
        # 1: model_checkpoint
        # 2: dataset

        # 1: model_checkpoint
        bart_base="facebook/bart-base"
        bart_large="facebook/bart-large"

        # 2: dataset
        pubmed="pubmed"
        arxiv="arxiv"

        # run bart-base on arxiv
        sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_base" "$arxiv"
        # run bart-large on arxiv
        sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_large" "$arxiv"

        # run bart-base on pubmed
        sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_base" "$pubmed"
        # run bart-large on pubmed
        sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh "$bart_large" "$pubmed"

        • 4 sbatch = 4 jobs in the queue.
        • Overall, you are asking SLURM to execute a script with 4 different hyperparameter combinations. For each job, you ask to create a container with all the required dependencies and then execute the file inside.
💡 Tip: OBVIOUSLY you can use any name for these files.

🚸 CAREFUL: Docker containers and images are often the cause of disk space saturation. Please ensure that you delete any unused containers and images. You can use the docker system prune command to remove all stopped containers, dangling images, and unused networks and volumes.


7. Independent File Systems and Code Distribution 🛜

We lack a distributed file system within the cluster. WHY is this a problem❓ As we said, SLURM dynamically decides where to allocate your job, independently of the location you ran the command from.

📝 Note: SLURM only determines which machine will run your command. It does not automatically transfer or synchronize your code files to that machine for execution. In this context, it's crucial to highlight that you cannot predict which server SLURM will allocate a GPU from.


Then, WHAT IF SLURM executes your task on a machine where your code doesn't exist ❓ The job will fail because the machine won't have access to the necessary files needed to run your program. In other words, if you create a file on a server, it won't automatically propagate to all other ones.

📝 Example: You're logged into server 40, where all your project files reside. You submit a job to the queue with sbatch, requesting execution of a training script on an NVIDIA RTX 3090. SLURM promptly allocates a 3090, but not on server 40; it's on server 153. SLURM searches for your specified file but doesn't find it. Consequently, the job terminates with an error.


Given the uncertainty of the node where your job will execute, it's imperative to ensure synchronization of the directory containing your project files. This way, regardless of the allocated node, the job remains executable. Storing your code on GitHub and simply pulling it onto the other servers is an ideal solution.

📝 Note: Not only the code: you should also recreate the Docker image with the same name on all the servers. This way, you can be sure that the environment is the same on all machines and the job won't fail. See the sketch below.
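A minimal sketch of what to repeat on each node, assuming the project directory and image name used in the TL;DR below:

# After ssh-ing into each of the other nodes (e.g., 153, 232, 43):
cd /home/molfetta/project1
git pull
docker build -f build/Dockerfile -t project1_image_name .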


🔢 TL;DR: To sum up, here are the steps to follow:
  1. Create a GitHub Repository and Clone It

    Set up a GitHub repository and clone it to your home directory.

    git clone https://github.com/your-repo/project1.git /home/molfetta/project1
    cd /home/molfetta/project1
  2. Create a Build Directory and Build the Docker Image

    Inside your project folder, create a build directory and build the Docker image.

    mkdir /home/molfetta/project1/build

    Run the following command to build the Docker image:

    docker build -f build/Dockerfile -t project1_image_name .
  3. Create Shell Scripts for Running Docker Containers

    Write .sh scripts that use docker run, referencing the image name created in the previous step.

    Example script:

    docker run -v /home/molfetta/project1:/workspace \
               -v /llms:/llms \
               ... \
               project1_image_name \
               /workspace/RELATIVE_PATH_TO_TRAIN.sh ...

    (The ... stands for the remaining flags shown in run_docker.sh above, e.g., --rm, --memory, --gpus.)
  4. Ensure the Correct Image Name in run_docker.sh

    Before submitting a job, verify that the run_docker.sh script specifies the correct Docker image name.

  5. Submit the Job

    Use the following command to submit the job via sbatch:

    sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 run_docker.sh
  6. You're Done! 🎉

    Your job is now scheduled and will be executed automatically. Just wait for the results!



📝 Note: In very very ... very rare cases, you may be working with resources that are too large to replicate on all servers (e.g., very large datasets). ONLY in that case, you can add an extra argument to the sbatch command to force the scheduler to stay on the machine that holds your data. This argument is -w, followed by the specific server name.
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w faretra train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_5090:1 -w moro43 train.sh
sbatch -N 1 --gpus=nvidia_geforce_rtx_3090:1 -w deeplearn2 train.sh
sbatch -N 1 --gpus=titan_xp:1 -w deeplearn2 train.sh
USE IT SPARINGLY. Forcing the destination machine makes poor use of SLURM's scheduling and dynamic allocation capabilities and may lead to a waste of resources.



8. Visualizing and Interacting with your Files 👀

Accessing the server via ssh and manually copying files using scp can be tedious and time-consuming. Every time you need to edit or transfer a file, you must run multiple commands, making development inefficient.

Instead, we recommend using Visual Studio Code's Remote - SSH extension. This extension allows you to connect to a remote server and interact with files as if they were on your local machine (see step 4 below for what this enables).

🛠 Installation Guide: VS Code Remote - SSH

Follow these steps to set up and use the Remote - SSH extension in VS Code:

  1. Install Visual Studio Code

    If you haven't installed VS Code yet, download it from the official website:

    🔗 VS Code Download

  2. Install the Remote - SSH Extension

    Open VS Code and install the extension:

    • Click on the Extensions icon (Ctrl+Shift+X).
    • Search for "Remote - SSH".
    • Click "Install".

    Alternatively, install it directly from the marketplace:

    🔗 Remote - SSH Extension

  3. Configure SSH in VS Code

    To enable seamless SSH connections, configure the SSH settings in ~/.ssh/config ON YOUR LOCAL MACHINE (create the file if it doesn't exist yet). Copy-paste the following text into that file (using your own username):

    Host faretra
        HostName 137.204.107.40
        Port 37335
        User molfetta
    
    Host moro232
        HostName 137.204.107.232
        Port 37335
        User molfetta
    
    Host moro43
        HostName 137.204.107.43
        Port 22
        User molfetta
    
    Host deeplearn2
        HostName 137.204.107.153
        Port 37335
        User molfetta

    Once installed, close and re-open VS Code. Then, at the bottom-left of your VS Code window, a green icon similar to "><" should appear. Click on it and select the machine you want to connect to from the drop-down menu (those names are taken from the ".ssh/config" file).

  4. Start Coding on the Remote Server 🛡

    Once connected, you can:

    • Use the built-in file explorer to navigate remote files.
    • Open, edit, and save files directly on the server.
    • Run commands in the VS Code terminal without opening a separate SSH session.

    Now, you can work on your remote machine as if it were local!

✅ Done! Now you can interact with your files efficiently using VS Code instead of manually copying them with scp.

📝 Note: You can get more information about the Remote - SSH extension and its features on the official VS Code documentation page: 🔗 VS Code Remote - SSH Documentation

🚸 CAREFUL: Even if files are now completely accessible and visible on the servers, ALWAYS remember to push your changes.


9. Before Graduation or Project Completion 🚩




