If you have any questions about the UniboNLP Cluster, please send them to us and we will add more detailed information to this FAQ page.
I receive the following error:
```
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 34b176e5-cfa7-4af4-a1e1-3f6aa8cd8431::qbj4u1eq6ct730v5oyplxo8ic: "/build/requirements.txt": not found
```
The error message indicates that the Docker build process cannot find the `requirements.txt` file, which is essential for installing the required Python packages.

Please ensure that the `requirements.txt` file is located in the same directory as your Dockerfile. We recommend creating a `build` folder inside your project directory containing both the Dockerfile and the `requirements.txt` file. Then, assuming you are in your project folder (`/home/molfetta/my_project`), run the following command to create the Docker image:
```bash
docker build -f build/Dockerfile -t IMAGE_NAME .
```
Make sure to replace `IMAGE_NAME` with the name you want for your Docker image and structure the files as suggested. This should resolve the issue.
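For reference, a minimal Dockerfile consistent with this layout might look like the sketch below. The base image and install steps are illustrative assumptions, not a required setup; the key point is that `COPY` paths are resolved relative to the build context (the trailing `.` in the command above), so the file is addressed as `build/requirements.txt`:

```dockerfile
# build/Dockerfile (illustrative sketch, not the cluster's mandated setup)
FROM python:3.10-slim

WORKDIR /workspace

# Paths are relative to the build context (the project root),
# hence build/requirements.txt rather than just requirements.txt.
COPY build/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project
COPY . .
```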
Resource allocation on the UniboNLP Cluster is managed through our job scheduling system (SLURM) to ensure fair usage across all research projects.
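Jobs are submitted through SLURM. A minimal submission script sketch is shown below; the job name, GPU count, and log path are illustrative assumptions, not cluster defaults:

```bash
#!/bin/bash
#SBATCH --job-name=my_experiment   # illustrative name
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --output=logs/%j.out       # %j expands to the SLURM job ID

# Run the training script with the allocated resources
python train.py
```

Submit it with `sbatch script.sh` and check its status with `squeue`.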
CUDA out of memory (OOM) errors occur when your model's memory requirements exceed the available GPU VRAM. Here are several strategies to address this issue:
**Gradient accumulation**: process several smaller micro-batches and call `optimizer.step()` once, trading extra compute for lower peak memory:

```python
# Accumulate gradients over 4 micro-batches before a single optimizer step
optimizer.zero_grad()
for i in range(4):
    outputs = model(inputs[i])       # inputs[i] is the i-th micro-batch
    loss = loss_fn(outputs, labels[i])
    loss = loss / 4                  # normalize so the update matches one large batch
    loss.backward()                  # gradients accumulate across iterations
optimizer.step()
```
**Mixed-precision training**: run the forward pass in half precision, roughly halving activation memory:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    with autocast():                  # forward pass in float16 where safe
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
    scaler.scale(loss).backward()     # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```
**Gradient checkpointing**: recompute activations during the backward pass instead of storing them all:

```python
# For Hugging Face Transformers models; plain PyTorch modules can use
# torch.utils.checkpoint to achieve the same effect.
model.gradient_checkpointing_enable()
```
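To see how close you are to the limit while debugging, PyTorch exposes simple memory counters. A minimal sketch, assuming a single GPU at device index 0:

```python
import torch

# Current and peak GPU memory allocated by tensors on device 0, in GiB
print(f"allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"peak:      {torch.cuda.max_memory_allocated(0) / 2**30:.2f} GiB")
```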
If you continue to experience issues after trying these solutions, please contact the cluster administrators for further assistance.
If you're encountering permission errors when attempting to download models from Hugging Face, it's likely because you're trying to access a gated model that requires authentication. To resolve this issue:

1. Create a `.env` file in your project directory if it doesn't already exist.
2. Add your token to it:

```
HF_TOKEN=your_huggingface_token_here
```

3. Load it at the start of your code:

```python
from dotenv import load_dotenv

load_dotenv()  # reads .env and exports HF_TOKEN into the environment
```

To obtain your Hugging Face token, go to your account settings on huggingface.co (Settings → Access Tokens), generate a token with read access, and paste it into the `.env` file as shown above.

If you're using Docker, you'll need to pass the token as an environment variable to your container:
```bash
docker run -e HF_TOKEN=$HF_TOKEN -v $PWD:/workspace --rm --gpus ... image-name command
```
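Once the token is available in the environment, you can pass it explicitly when loading a gated model. A minimal sketch assuming a recent version of the `transformers` library; `some-org/gated-model` is a placeholder, not a real model ID:

```python
import os

from dotenv import load_dotenv
from transformers import AutoTokenizer

load_dotenv()

# Substitute the gated model you actually need for this placeholder ID
tokenizer = AutoTokenizer.from_pretrained(
    "some-org/gated-model",
    token=os.environ["HF_TOKEN"],  # authenticates the download
)
```

Recent versions of `huggingface_hub` also read the `HF_TOKEN` environment variable automatically, so the explicit `token=` argument is mostly for clarity.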
Since our cluster doesn't have a distributed file system, you need to ensure your project files are synchronized across servers. This is critical because SLURM may allocate resources on any server, and your job will fail if the required files aren't available there.
The recommended approach is to use GitHub (or another Git hosting service) for managing and synchronizing your code:
```bash
# Initialize the repository in your project folder
cd /home/your_username/your_project
git init
git remote add origin https://github.com/your_username/your_repo.git

# Commit and push your code
git add .
git commit -m "Initial commit"
git push -u origin main
```
Then, on each of the other servers, clone the repository:

```bash
ssh username@137.204.107.xx -p port
cd /home/your_username
git clone https://github.com/your_username/your_repo.git
```
Whenever you make changes, synchronize them with a push and pull:

```bash
# On the master node where you made changes
git add .
git commit -m "Update code"
git push

# On the other servers
cd /home/your_username/your_repo
git pull
```
This approach offers several advantages over file synchronization tools:

- A full version history, so broken changes can be rolled back
- Atomic, explicit updates instead of partially copied files
- Easy collaboration and code review through the hosting service
For large data files that shouldn't be in version control, consider using shared directories or Git LFS.
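If some large files do need to travel with the repository, Git LFS replaces them with lightweight pointers. A minimal sketch; the `*.bin` pattern is just an example:

```bash
git lfs install                 # one-time setup per machine
git lfs track "*.bin"           # track matching files (e.g. model checkpoints) via LFS
git add .gitattributes          # commit the tracking rule alongside your code
```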
Can't find what you're looking for? Contact us at lorenzo.molfetta@unibo.it