Loading multiple LoRAs into 1 pipeline in parallel, 1 LoRA into 2 pipelines on 2 GPUs #11914

Closed
@vahe-toffee

Description

Hi everyone,

I have the following scenario.

I have a machine with 2 GPUs and a running service that keeps two pipelines loaded, one per device. I also have a list of LoRAs (say 10). On each request I split the batch into 2 parts (each request also carries the information about which LoRA to use), load the LoRAs, and run the forward pass.
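For reference, a minimal sketch of that setup, assuming diffusers (the checkpoint name is just a placeholder):

import torch
from diffusers import DiffusionPipeline

devices = ["cuda:0", "cuda:1"]
pipes = [
    DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder checkpoint
        torch_dtype=torch.float16,
    ).to(device)
    for device in devices
]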

The problem I encounter is that, with every parallelization method I have tried (threading, multiprocessing), the best I have achieved is pre-loading the LoRA state dicts on the CPU, then moving them to the GPU, and only after that calling load_lora_weights from the state dict.
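For context, the CPU-side cache is built once at startup, roughly like this (a sketch assuming each LoRA is a local .safetensors file; lora_paths is a placeholder for however the files are actually discovered):

from safetensors.torch import load_file

lora_paths = {
    "style_a": "loras/style_a.safetensors",  # placeholder paths
    "style_b": "loras/style_b.safetensors",
}
lora_cache = {
    name: {"state_dict": load_file(path, device="cpu")}
    for name, path in lora_paths.items()
}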

Even when I attempt to parallelize this by running the loading chunk in multiple threads, the pipe starts to produce either complete noise or a black image.

Where I would really appreciate help:

  1. Advice on an elegant way to load multiple LoRAs into one pipe at once (all examples in the documentation indicate that one needs to do it one by one).
  2. If I have 2 pipes on 2 different devices, how to parallelize loading the same LoRA into both pipes on their corresponding devices.

Here is my current implementation:
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import torch

logger = logging.getLogger(__name__)


def apply_multiple_loras_from_cache(pipes, adapter_names, lora_cache, lora_names, lora_strengths, devices):
    for device_index, pipe in enumerate(pipes):
        logger.info(f"Starting setup for device {devices[device_index]}")
        
        # Step 1: Unload LoRAs
        start = time.time()
        pipe.unload_lora_weights(reset_to_overwritten_params=False)
        logger.info(f"[Device {device_index}] Unload time: {time.time() - start:.3f}s")

        # Step 2: Parallelize the CPU → GPU state_dict move.
        # Note: non_blocking=True only overlaps the copy when the CPU tensors
        # live in pinned memory; otherwise the transfer is synchronous.
        def move_to_device(name):
            return name, {
                k: v.to(devices[device_index], dtype=pipe.dtype, non_blocking=True)
                for k, v in lora_cache[name]['state_dict'].items()
            }

        start = time.time()
        with ThreadPoolExecutor() as executor:
            future_to_name = {executor.submit(move_to_device, name): name for name in adapter_names}
            results = [future.result() for future in as_completed(future_to_name)]
        logger.info(f"[Device {device_index}] State dict move + dtype conversion time: {time.time() - start:.3f}s")

        # Step 3: Load each adapter from its (already on-device) state dict
        start = time.time()
        for adapter_name, state_dict in results:
            pipe.load_lora_weights(
                pretrained_model_name_or_path_or_dict=state_dict,
                adapter_name=adapter_name,
            )
        logger.info(f"[Device {device_index}] Load adapter weights time: {time.time() - start:.3f}s")
        logger.info(f"[Device {device_index}] Load adapter weights time: {time.time() - start:.3f}s")

        # Step 4: Set adapter weights
        start = time.time()
        pipe.set_adapters(lora_names, adapter_weights=lora_strengths)
        logger.info(f"[Device {device_index}] Set adapter weights time: {time.time() - start:.3f}s")

    torch.cuda.empty_cache()
    logger.info("All LoRAs applied and GPU cache cleared.")
