Frequently Asked Questions

I have problems when building the Docker image: ERROR: failed to solve ...

I receive the following error:

ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 34b176e5-cfa7-4af4-a1e1-3f6aa8cd8431::qbj4u1eq6ct730v5oyplxo8ic: "/build/requirements.txt": not found

The error message indicates that the Docker build process cannot find the requirements.txt file. This file is essential for installing the required Python packages.

Please ensure that the requirements.txt file is located in the same directory as your Dockerfile. We recommend creating a build folder inside your project directory that contains both the Dockerfile and the requirements.txt file. Then, from your project folder (e.g. /home/molfetta/my_project), build the Docker image with:

docker build -f build/Dockerfile -t IMAGE_NAME .

Make sure to replace IMAGE_NAME with the name you want for your Docker image and structure the files as suggested. Note that the trailing . sets the build context to your project folder, so the paths used in the Dockerfile's COPY instructions must be given relative to that folder (e.g. build/requirements.txt), not relative to the Dockerfile itself. This should resolve the issue.
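
For reference, here is a sketch of the suggested layout together with the corresponding Dockerfile lines; the src folder and the destination path inside the image are illustrative, not required names:

my_project/
├── build/
│   ├── Dockerfile
│   └── requirements.txt
└── src/                # your code

# Inside build/Dockerfile: COPY paths are relative to the build context
# (the trailing "." in the build command), i.e. the project folder
COPY build/requirements.txt /workspace/requirements.txt
RUN pip install -r /workspace/requirements.txt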

What computational resources are available on the cluster?

The UniboNLP Cluster features:

  • 6 NVIDIA RTX 3090 GPUs (24GB each)
  • 4 NVIDIA TitanX GPUs (12GB each)
  • Total of 192GB RAM across all nodes

Resource allocation is managed through our job scheduling system to ensure fair usage across all research projects.
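
For reference, a minimal SLURM batch script requesting a single GPU might look like the sketch below; the resource values and the launch command are placeholders, not the cluster's actual configuration:

#!/bin/bash
#SBATCH --job-name=my_experiment   # placeholder job name
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --cpus-per-task=4          # placeholder CPU count
#SBATCH --mem=32G                  # placeholder RAM request
#SBATCH --time=24:00:00            # placeholder time limit

# Placeholder: replace with however you normally launch your job,
# e.g. your docker run command
python train.py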

I'm running into CUDA out of memory errors when training my model. How can I fix this?

CUDA out of memory (OOM) errors occur when your model's memory requirements exceed the available GPU VRAM. Here are several strategies to address this issue:

  1. Reduce batch size: This is the simplest solution. Try halving your batch size and see if it resolves the issue.
  2. Enable gradient accumulation: This allows you to effectively increase the batch size without increasing memory usage. Example in PyTorch:
    # Accumulate gradients over 4 micro-batches
    # (inputs and labels are assumed to each hold 4 micro-batches)
    optimizer.zero_grad()
    for i in range(4):
        outputs = model(inputs[i])
        loss = loss_fn(outputs, labels[i])
        loss = loss / 4  # Scale so the accumulated gradient matches one large batch
        loss.backward()  # Gradients from each micro-batch add up
    optimizer.step()     # Single weight update after accumulation
  3. Use mixed precision training: This can reduce memory usage by using float16 instead of float32 for most operations:
    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        with autocast():                   # Run the forward pass in float16 where safe
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
        scaler.scale(loss).backward()      # Scale the loss to avoid float16 gradient underflow
        scaler.step(optimizer)             # Unscale gradients, then update the weights
        scaler.update()                    # Adjust the scale factor for the next iteration
  4. Use gradient checkpointing: This trades computation for memory by not storing all intermediate activations:
    # For Hugging Face Transformers models; plain PyTorch modules can use torch.utils.checkpoint
    model.gradient_checkpointing_enable()
  5. Optimize model architecture: Consider using more efficient architectures or reducing model size through pruning, quantization, or distillation (see the quantization sketch after this list).
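
As an example of the quantization route, the sketch below loads a Transformers model with 4-bit weights via bitsandbytes; the model name and model class are placeholders, and it assumes the transformers, bitsandbytes, and accelerate packages are installed in your environment:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store the weights in 4-bit precision to substantially reduce VRAM usage
quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",                 # placeholder: the model you actually use
    quantization_config=quant_config,
    device_map="auto",                 # place layers on the available GPU(s)
)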

If you continue to experience issues after trying these solutions, please contact the cluster administrators for further assistance.

I'm getting a permission error when trying to download a model from Hugging Face. How do I fix this?

If you're encountering permission errors when attempting to download models from Hugging Face, it's likely because you're trying to access a gated model that requires authentication. To resolve this issue:

  1. Create a .env file in your project directory if it doesn't already exist
  2. Add your Hugging Face token to the file as follows:
    HF_TOKEN=your_huggingface_token_here
  3. Make sure your code loads the environment variables, for example using the python-dotenv package:
    from dotenv import load_dotenv
    load_dotenv()
  4. If you're using the transformers library, it should automatically pick up the token from the environment variables (see the sketch below if you prefer to pass it explicitly)
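
If you prefer not to rely on the automatic pickup, here is a minimal sketch of passing the token explicitly, assuming a recent transformers version whose from_pretrained accepts a token argument; the model name is a placeholder:

import os
from dotenv import load_dotenv
from transformers import AutoTokenizer, AutoModelForCausalLM

load_dotenv()  # Reads HF_TOKEN from the .env file into the environment
hf_token = os.getenv("HF_TOKEN")

# Pass the token explicitly when downloading a gated model (name is a placeholder)
tokenizer = AutoTokenizer.from_pretrained("gated-model-name", token=hf_token)
model = AutoModelForCausalLM.from_pretrained("gated-model-name", token=hf_token)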

To obtain your Hugging Face token:

  1. Log in to your Hugging Face account at huggingface.co
  2. Go to your profile settings and navigate to the "Access Tokens" section
  3. Create a new token with at least "read" access
  4. Copy the generated token and add it to your .env file as shown above

If you're using Docker, you'll need to pass the token as an environment variable to your container:

docker run -e HF_TOKEN=$HF_TOKEN -v $PWD:/workspace --rm --gpus ... image-name command
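
Alternatively, since the token already lives in your .env file, you can hand the whole file to the container with Docker's --env-file option (the rest of the command is unchanged):

docker run --env-file .env -v $PWD:/workspace --rm --gpus ... image-name command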

How do I synchronize my files across different servers in the cluster?

Since our cluster doesn't have a distributed file system, you need to ensure your project files are synchronized across servers. This is critical because SLURM may allocate resources on any server, and your job will fail if the required files aren't available there.

The recommended approach is to use GitHub (or another Git hosting service) for managing and synchronizing your code:

  1. Create a GitHub repository for your project
  2. Initialize Git in your project directory on the master node (40):
    cd /home/your_username/your_project
    git init
    git remote add origin https://github.com/your_username/your_repo.git
  3. Add and commit your project files:
    git add .
    git commit -m "Initial commit"
  4. Push your code to the remote repository:
    git push -u origin main
  5. Clone your repository on each server where you need your code:
    ssh username@137.204.107.xx -p port
    cd /home/your_username
    git clone https://github.com/your_username/your_repo.git
  6. Synchronize changes whenever you update your code:
    # On the master node where you made changes
    git add .
    git commit -m "Update code"
    git push
    
    # On other servers
    cd /home/your_username/your_repo
    git pull

This approach offers several advantages over file synchronization tools:

  • Version control with commit history
  • Easy rollback if something breaks
  • Conflict resolution when changes are made on different servers
  • Branch management for experimental features
  • Easier collaboration with other researchers

For large data files that shouldn't be in version control, consider using shared directories or Git LFS.
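
For example, here is a minimal sketch of tracking large files with Git LFS, assuming git-lfs is installed on the servers you use; the file pattern and file name are placeholders:

git lfs install                      # One-time setup per machine
git lfs track "*.bin"                # Placeholder pattern: store matching files via LFS
git add .gitattributes               # Commit the tracking rule alongside your code
git add data/model.bin               # Placeholder file name
git commit -m "Track large files with Git LFS"
git push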

Can't find what you're looking for? Contact us at lorenzo.molfetta@unibo.it