Accelerating Stable Diffusion on Windows with an AMD GPU via pytorch-directml

Editor: Y ┊ Date: January 5, 2024

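On Windows, automatic1111's webui only supports NVIDIA graphics cards. I did not want to set up a Linux dual boot, so I had to make do with the CPU, which is very slow: a single iteration often took 90 seconds, sometimes more than two minutes, and in the worst cases 150 to 200 seconds. Searching online, I found two approaches that should work in theory: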

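The first is ort-nightly-directml; see this tutorial for details. However, my graphics card has only 2 GB of dedicated GPU memory, and OnnxStableDiffusionPipeline could not even generate a 512*512 image: it reported that the GPU was out of memory. A 256*256 image did generate, but the output was entirely white.

The second is pytorch-directml.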

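In December 2022, Microsoft finally released version 1.13, which, at least in theory, is compatible with the webui. After a lot of debugging, however, I found that one operator was not implemented in pytorch-directml (issue: pytorch-directml not working with Stable Diffusion).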

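Then, in 2023, Microsoft released a newer version that supports that operator. Because of a bug in torch-directml (issue: torch-directml : RuntimeError on torch.cumsum with bool input), the webui still fails with a runtime error. This one is fairly easy to fix: a bool tensor in the webui just needs to be converted to int. After that, remember to add --lowvram --skip-torch-cuda-test --precision full --no-half to the launch arguments, and the webui runs normally.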

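Even so, the largest image I could generate was 384*384. The text encoder and image information creator stages ran without problems, but the image decoder stage failed because the GPU ran out of memory.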

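After some digging, I found that the first step of image decoding runs on the GPU in the code, and that is what exhausted the memory. I moved this step to the CPU, which eliminated the memory error, and I could then generate images of 768*768 and even larger. One iteration takes about half a minute at best and 60 seconds at worst; at 768 it is only about 70 to 80 seconds.
(Note: this issue has been resolved in the original article.)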

Method and Details

First, manually create a virtual environment and install torch-directml:

python -m venv venv
.\venv\scripts\activate
pip install torch-directml
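
Once the install finishes, a quick check along these lines can confirm that tensors actually land on the DirectML device. This is a minimal sketch and not part of the webui: the function name directml_sanity_check is made up for illustration, and it assumes torch_directml.device() returns the default DirectML device, as in Microsoft's introductory sample.

```python
def directml_sanity_check():
    """Return 3 if a tiny addition runs on the DirectML device,
    or None if torch / torch-directml are not installed."""
    try:
        import torch
        import torch_directml
    except ImportError:
        return None

    dml = torch_directml.device()      # default DirectML device
    a = torch.tensor([1]).to(dml)      # move both operands to the GPU
    b = torch.tensor([2]).to(dml)
    return (a + b).item()              # copy the result back as a Python int


if __name__ == "__main__":
    result = directml_sanity_check()
    if result is None:
        print("torch-directml is not installed in this environment")
    else:
        print(f"1 + 2 computed on DirectML = {result}")
```

Run it inside the activated venv; if the import fails there, the webui will not be able to use DirectML either.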

Next, change the get_optimal_device_name function in stable-diffusion-webui/modules/devices.py to the following:

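def get_optimal_device_name():
    # Prefer CUDA when an NVIDIA GPU is available.
    if torch.cuda.is_available():
        return get_cuda_device_string()
    # Otherwise fall back to DirectML if torch-directml can be imported;
    # "privateuseone" is the device name its tensors report.
    try:
        import torch_directml
        return "privateuseone"
    except Exception:
        pass
    if has_mps():
        return "mps"

    return "cpu"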

Then modify the sigma_to_t function in stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py:

def sigma_to_t(self, sigma, quantize=None):
    quantize = self.quantize if quantize is None else quantize
    log_sigma = sigma.log()
    dists = log_sigma - self.log_sigmas[:, None]
    if quantize:
        return dists.abs().argmin(dim=0).view(sigma.shape)
    # Original line: cumsum on a bool tensor raises a RuntimeError under torch-directml.
    # low_idx = dists.ge(0).cumsum(dim=0).argmax(dim=0).clamp(max=self.log_sigmas.shape[0] - 2)
    # Workaround: cast the bool mask to int before calling cumsum.
    low_idx = dists.ge(0).int().cumsum(dim=0).argmax(dim=0).clamp(max=self.log_sigmas.shape[0] - 2)
    high_idx = low_idx + 1
    low, high = self.log_sigmas[low_idx], self.log_sigmas[high_idx]
    w = (low - log_sigma) / (low - high)
    w = w.clamp(0, 1)
    t = (1 - w) * low_idx + w * high_idx
    return t.view(sigma.shape)
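
To see why the int cast does not change the result, here is a pure-Python sketch of the non-quantized branch for a single sigma value. It is only an illustration under my own assumptions: sigma_to_t_sketch is a hypothetical name, log_sigmas is a plain ascending list of floats, and the real function operates on torch tensors. The running count of "sigma is at or above this entry" flags peaks at the last such entry, so the first index of the maximum is the lower neighbour.

```python
import math
from itertools import accumulate


def sigma_to_t_sketch(sigma, log_sigmas):
    """Map one sigma to a fractional timestep by interpolating in log space."""
    log_sigma = math.log(sigma)
    # Boolean mask: True where log_sigma >= table entry (dists.ge(0)).
    flags = [log_sigma - ls >= 0 for ls in log_sigmas]
    # Cast to int, then running sum (the .int().cumsum(dim=0) step).
    running = list(accumulate(int(f) for f in flags))
    # First index of the maximum (argmax), clamped to leave room for high_idx.
    low_idx = min(running.index(max(running)), len(log_sigmas) - 2)
    high_idx = low_idx + 1
    low, high = log_sigmas[low_idx], log_sigmas[high_idx]
    # Interpolation weight between the two neighbours, clamped to [0, 1].
    w = (low - log_sigma) / (low - high)
    w = min(max(w, 0.0), 1.0)
    return (1 - w) * low_idx + w * high_idx


if __name__ == "__main__":
    table = [math.log(s) for s in (0.1, 1.0, 10.0)]
    for s in (0.01, 1.0, math.sqrt(10.0), 100.0):
        print(s, sigma_to_t_sketch(s, table))
```

Since True and False count as 1 and 0, summing the int-cast mask gives the same running counts a bool cumsum would, which is why the workaround is safe.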

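The optimization mentioned at the end of the introduction is as follows (stable-diffusion-webui/modules/processing.py):

 # Move the latents to the CPU so that the first-stage decode runs there.
 samples_ddim = samples_ddim.to(devices.dtype_vae).to("cpu")
 x_samples_ddim = decode_first_stage(p.sd_model, samples_ddim)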

And in stable-diffusion-webui/modules/lowvram.py:

def first_stage_model_decode_wrap(z):
        # Commented out so the first-stage model is not moved to the GPU before decoding.
        #send_me_to_gpu(first_stage_model, None)
        return first_stage_model_decode(z)
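
A rough back-of-the-envelope calculation suggests why decoding, and not sampling, is what exhausts a 2 GB card. The sketch below is an estimate under stated assumptions, not a measurement: it assumes 4-channel latents at 1/8 resolution per side, a decoder whose last stage holds 128 channels at full output resolution (as I understand the standard Stable Diffusion 1.x VAE to do), and 4 bytes per value because of --no-half. The helper names are hypothetical.

```python
MIB = 2 ** 20  # bytes per mebibyte


def latent_mib(height, width, channels=4, bytes_per_value=4):
    """Size of the latent tensor handed to the decoder (1/8 resolution per side)."""
    return (height // 8) * (width // 8) * channels * bytes_per_value / MIB


def full_res_activation_mib(height, width, channels=128, bytes_per_value=4):
    """Size of ONE full-resolution activation in the decoder (assumed 128 channels)."""
    return height * width * channels * bytes_per_value / MIB


if __name__ == "__main__":
    for side in (384, 512, 768):
        print(
            f"{side}*{side}: latents {latent_mib(side, side):.3f} MiB, "
            f"one decoder activation {full_res_activation_mib(side, side):.0f} MiB"
        )
```

Under these assumptions a single full-resolution activation at 768*768 is 288 MiB, and the decoder keeps several alive at once, while the latents themselves are well under 1 MiB, so copying them to the CPU for decoding is cheap.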
