Task: Text2Image
Stable Diffusion is a latent diffusion model conditioned on the text embeddings of a CLIP text encoder, which allows you to create images from text prompts. It builds upon the CVPR'22 work High-Resolution Image Synthesis with Latent Diffusion Models. The official code was released at stable-diffusion, and an implementation is also available in diffusers. We support this algorithm here so that the community can study it together and compare it with other text2image methods.
| Model | Dataset | Download |
| --- | --- | --- |
| stable_diffusion_v1.5 | - | model |
We use the Stable Diffusion v1.5 weights. The model bundles several components, including the VAE, the UNet, and the CLIP text encoder.
Download the weights from stable-diffusion-1.5 and set `pretrained_model_path` in the config to the directory containing them.
Download with git:

```shell
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
```
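If you prefer not to use git-lfs, the same snapshot can be fetched with the `huggingface_hub` Python package; a minimal sketch, where the `local_dir` target path is just an example:

```python
# Alternative download via huggingface_hub (pip install huggingface_hub).
# Recent versions of the package support downloading into a local directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='runwayml/stable-diffusion-v1-5',
    local_dir='./stable-diffusion-v1-5',  # hypothetical target directory
)
```

Either way, the resulting directory should contain subfolders such as `vae`, `unet`, `text_encoder`, and `tokenizer`; point `pretrained_model_path` at this directory.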
Running the following code, you can generate an image from a text prompt.
```python
from mmengine import MODELS, Config
from mmengine.registry import init_default_scope
from torchvision import utils

init_default_scope('mmedit')

# Load the Stable Diffusion config and point it at the downloaded weights.
config = 'configs/stable_diffusion/stable-diffusion_ddim_denoisingunet.py'
config = Config.fromfile(config).copy()
config.model.init_cfg.pretrained_model_path = '/path/to/your/stable-diffusion-v1-5'

# Build the model from the registry and move it to the GPU.
StableDiffuser = MODELS.build(config.model)
StableDiffuser = StableDiffuser.to('cuda')

# Generate an image from the prompt and save it to disk.
prompt = 'A mecha robot in a favela in expressionist style'
image = StableDiffuser.infer(prompt)['samples']
utils.save_image(image, 'robot.png')
```
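For reference, the same checkpoint can also be run directly through the diffusers library that this implementation builds on; a minimal sketch, assuming diffusers is installed and a CUDA device is available:

```python
# Equivalent generation with diffusers; the local path is the directory
# cloned or downloaded above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    '/path/to/your/stable-diffusion-v1-5',
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
)
pipe = pipe.to('cuda')

image = pipe('A mecha robot in a favela in expressionist style').images[0]
image.save('robot.png')
```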
Our codebase for the Stable Diffusion models builds heavily on the diffusers codebase, and the model weights are taken from stable-diffusion-1.5.
Thanks for the efforts of the community!
```bibtex
@misc{rombach2021highresolution,
  title={High-Resolution Image Synthesis with Latent Diffusion Models},
  author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
  year={2021},
  eprint={2112.10752},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```