Frequently Asked Questions

I have problems when building the Docker image: ERROR: failed to solve ...

I receive the following error:

ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 34b176e5-cfa7-4af4-a1e1-3f6aa8cd8431::qbj4u1eq6ct730v5oyplxo8ic: "/build/requirements.txt": not found

The error message indicates that the Docker build process cannot find the requirements.txt file. This file is essential for installing the required Python packages.

Please ensure that the requirements.txt file is located in the same directory as your Dockerfile. We recommend creating a build folder inside your project directory that contains both the Dockerfile and the requirements.txt file. Then, from your project folder (e.g. /home/molfetta/my_project), build the Docker image with:

docker build -f build/Dockerfile -t IMAGE_NAME .

Make sure to replace IMAGE_NAME with the name you want for your Docker image and structure the files as suggested. Note that the trailing . sets the build context to your project folder, so the paths used in the Dockerfile's COPY instructions must be given relative to that folder (e.g. build/requirements.txt), not relative to the Dockerfile itself. This should resolve the issue.
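
For reference, here is a sketch of the suggested layout together with the corresponding Dockerfile lines; the src folder and the destination path inside the image are illustrative, not required names:

my_project/
├── build/
│   ├── Dockerfile
│   └── requirements.txt
└── src/                # your code

# Inside build/Dockerfile: COPY paths are relative to the build context
# (the trailing "." in the build command), i.e. the project folder
COPY build/requirements.txt /workspace/requirements.txt
RUN pip install -r /workspace/requirements.txt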

What computational resources are available on the cluster?

The UniboNLP Cluster features:

  • 6 NVIDIA RTX 3090 GPUs (24GB each)
  • 4 NVIDIA TitanX GPUs (12GB each)
  • Total of 192GB RAM across all nodes

Resource allocation is managed through our job scheduling system to ensure fair usage across all research projects.
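
For reference, a minimal SLURM batch script requesting a single GPU might look like the sketch below; the resource values and the launch command are placeholders, not the cluster's actual configuration:

#!/bin/bash
#SBATCH --job-name=my_experiment   # placeholder job name
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --cpus-per-task=4          # placeholder CPU count
#SBATCH --mem=32G                  # placeholder RAM request
#SBATCH --time=24:00:00            # placeholder time limit

# Placeholder: replace with however you normally launch your job,
# e.g. your docker run command
python train.py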

I'm running into CUDA out of memory errors when training my model. How can I fix this?

CUDA out of memory (OOM) errors occur when your model's memory requirements exceed the available GPU VRAM. Here are several strategies to address this issue:

  1. Reduce batch size: This is the simplest solution. Try halving your batch size and see if it resolves the issue.
  2. Enable gradient accumulation: This allows you to effectively increase the batch size without increasing memory usage. Example in PyTorch:
    # Accumulate gradients over 4 micro-batches
    # (inputs and labels are assumed to each hold 4 micro-batches)
    optimizer.zero_grad()
    for i in range(4):
        outputs = model(inputs[i])
        loss = loss_fn(outputs, labels[i])
        loss = loss / 4  # Scale so the accumulated gradient matches one large batch
        loss.backward()  # Gradients from each micro-batch add up
    optimizer.step()     # Single weight update after accumulation
  3. Use mixed precision training: This can reduce memory usage by using float16 instead of float32 for most operations:
    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        with autocast():                   # Run the forward pass in float16 where safe
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
        scaler.scale(loss).backward()      # Scale the loss to avoid float16 gradient underflow
        scaler.step(optimizer)             # Unscale gradients, then update the weights
        scaler.update()                    # Adjust the scale factor for the next iteration
  4. Use gradient checkpointing: This trades computation for memory by not storing all intermediate activations:
    # For Hugging Face Transformers models; plain PyTorch modules can use torch.utils.checkpoint
    model.gradient_checkpointing_enable()
  5. Optimize model architecture: Consider using more efficient architectures or reducing model size through pruning, quantization, or distillation (see the quantization sketch after this list).
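
As an example of the quantization route, the sketch below loads a Transformers model with 4-bit weights via bitsandbytes; the model name and model class are placeholders, and it assumes the transformers, bitsandbytes, and accelerate packages are installed in your environment:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store the weights in 4-bit precision to substantially reduce VRAM usage
quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",                 # placeholder: the model you actually use
    quantization_config=quant_config,
    device_map="auto",                 # place layers on the available GPU(s)
)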

If you continue to experience issues after trying these solutions, please contact the cluster administrators for further assistance.

I'm getting a permission error when trying to download a model from Hugging Face. How do I fix this?

If you're encountering permission errors when attempting to download models from Hugging Face, it's likely because you're trying to access a gated model that requires authentication. To resolve this issue:

  1. Create a .env file in your project directory if it doesn't already exist
  2. Add your Hugging Face token to the file as follows:
    HF_TOKEN=your_huggingface_token_here
  3. Make sure your code loads the environment variables, for example using the python-dotenv package:
    from dotenv import load_dotenv
    load_dotenv()
  4. If you're using the transformers library, it should automatically pick up the token from the environment variables (see the sketch below if you prefer to pass it explicitly)
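
If you prefer not to rely on the automatic pickup, here is a minimal sketch of passing the token explicitly, assuming a recent transformers version whose from_pretrained accepts a token argument; the model name is a placeholder:

import os
from dotenv import load_dotenv
from transformers import AutoTokenizer, AutoModelForCausalLM

load_dotenv()  # Reads HF_TOKEN from the .env file into the environment
hf_token = os.getenv("HF_TOKEN")

# Pass the token explicitly when downloading a gated model (name is a placeholder)
tokenizer = AutoTokenizer.from_pretrained("gated-model-name", token=hf_token)
model = AutoModelForCausalLM.from_pretrained("gated-model-name", token=hf_token)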

To obtain your Hugging Face token:

  1. Log in to your Hugging Face account at huggingface.co
  2. Go to your profile settings and navigate to the "Access Tokens" section
  3. Create a new token with at least "read" access
  4. Copy the generated token and add it to your .env file as shown above

If you're using Docker, you'll need to pass the token as an environment variable to your container:

docker run -e HF_TOKEN=$HF_TOKEN -v $PWD:/workspace --rm --gpus ... image-name command
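
Alternatively, since the token already lives in your .env file, you can hand the whole file to the container with Docker's --env-file option (the rest of the command is unchanged):

docker run --env-file .env -v $PWD:/workspace --rm --gpus ... image-name command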

How do I synchronize my files across different servers in the cluster?

Since our cluster doesn't have a distributed file system, you need to ensure your project files are synchronized across servers. This is critical because SLURM may allocate resources on any server, and your job will fail if the required files aren't available there.

The recommended approach is to use GitHub (or another Git hosting service) for managing and synchronizing your code:

  1. Create a GitHub repository for your project
  2. Initialize Git in your project directory on the master node (40):
    cd /home/your_username/your_project
    git init
    git remote add origin https://github.com/your_username/your_repo.git
  3. Add and commit your project files:
    git add .
    git commit -m "Initial commit"
  4. Push your code to the remote repository:
    git push -u origin main
  5. Clone your repository on each server where you need your code:
    ssh username@137.204.107.xx -p port
    cd /home/your_username
    git clone https://github.com/your_username/your_repo.git
  6. Synchronize changes whenever you update your code:
    # On the master node where you made changes
    git add .
    git commit -m "Update code"
    git push
    
    # On other servers
    cd /home/your_username/your_repo
    git pull

This approach offers several advantages over file synchronization tools:

  • Version control with commit history
  • Easy rollback if something breaks
  • Conflict resolution when changes are made on different servers
  • Branch management for experimental features
  • Easier collaboration with other researchers

For large data files that shouldn't be in version control, consider using shared directories or Git LFS.
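
For example, here is a minimal sketch of tracking large files with Git LFS, assuming git-lfs is installed on the servers you use; the file pattern and file name are placeholders:

git lfs install                      # One-time setup per machine
git lfs track "*.bin"                # Placeholder pattern: store matching files via LFS
git add .gitattributes               # Commit the tracking rule alongside your code
git add data/model.bin               # Placeholder file name
git commit -m "Track large files with Git LFS"
git push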

Can't find what you're looking for? Contact us at lorenzo.molfetta@unibo.it