Learn how to fine-tune LLMs on multiple GPUs using parallelism methods with Unsloth.
Unsloth currently supports multi-GPU setups through libraries like Accelerate and DeepSpeed. This means you can already leverage parallelism methods such as FSDP and DDP with Unsloth.
We know that the process can be complex and requires manual setup. We’re working hard to make multi-GPU support much simpler and more user-friendly, and we’ll be announcing official multi-GPU support for Unsloth soon.
In the meantime, to enable multi-GPU training with DDP, do the following:
Create your training script as train.py (or similar). For example, you can adapt a training script from one of our notebooks!
Run accelerate launch train.py or torchrun --nproc_per_node N_GPUS train.py, where N_GPUS is the number of GPUs you have.
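Under DDP, the launcher starts one copy of train.py per GPU and tells each copy who it is through environment variables, which Accelerate and PyTorch read internally. As a rough sketch of what happens (the helper name ddp_info is ours, not part of Unsloth or torchrun):

```python
import os

def ddp_info():
    """Read the per-process environment variables a DDP launcher sets.

    torchrun (and accelerate launch) start N_GPUS copies of the script,
    giving each copy its own RANK / LOCAL_RANK / WORLD_SIZE so it can
    pick its GPU and process its shard of the data.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),          # global process index
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total number of processes
    }

# Simulate the values the launcher would set for process 1 of a 4-GPU run.
os.environ.update({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
print(ddp_info())  # {'rank': 1, 'local_rank': 1, 'world_size': 4}
```

You normally never read these variables yourself; the point is that the same train.py runs unchanged on every GPU, which is why no code changes are needed to go from one GPU to several.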
Pipeline / model splitting
If a single GPU does not have enough VRAM to load, say, Llama 70B, no worries: we will split the model across your GPUs for you! To enable this, use the device_map = "balanced" flag:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,
    device_map = "balanced",
)
Stay tuned for our official announcement! For more details, check out our ongoing Pull Request discussing multi-GPU support.