DeepFloyd IF
Core Features & Characteristics: DeepFloyd IF is a modular neural model composed of a frozen text encoder and three cascaded pixel diffusion modules: a base model that generates 64x64 px images from a text prompt, and two super-resolution models that generate images of increasing resolution, 256x256 px and 1024x1024 px. All stages use a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. According to its authors, the model outperformed the state-of-the-art models available at release, achieving a zero-shot FID score of 6.66 on the COCO dataset, which underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models.
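To make the cascade concrete, here is a minimal sketch that only models the output resolutions named above. The `Stage` class, the `CASCADE` tuple, and `upscale_factors` are illustrative names, not part of any DeepFloyd API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    out_px: int  # side length of the square output image, in pixels


# Resolutions as stated in the description above; stage labels are informal.
CASCADE = (
    Stage("IF-I (base)", 64),
    Stage("IF-II (super-resolution)", 256),
    Stage("IF-III (super-resolution)", 1024),
)


def upscale_factors(stages):
    """Return the side-length ratio between each pair of consecutive stages."""
    return [b.out_px // a.out_px for a, b in zip(stages, stages[1:])]


if __name__ == "__main__":
    for stage in CASCADE:
        print(f"{stage.name}: {stage.out_px}x{stage.out_px} px")
    print("upscale factors:", upscale_factors(CASCADE))
```

Each super-resolution stage multiplies the side length by four, so the pixel count grows sixteenfold per stage.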
Main Modes & Use Cases:
- Dream (Text-to-Image): Generates images based on text prompts with customizable parameters like guidance_scale and sample_timestep_respacing.
- Zero-shot Image-to-Image Translation (Style Transfer): The output for the prompt is rendered in the style of the supplied support_pil_img; supported styles include professional origami, oil art, plastic building bricks, and classic anime.
- Super Resolution: Users can run IF-II and IF-III, or 'Stable x4', to upscale low-resolution images to high resolution, including images that were not generated by IF.
- Zero-shot Inpainting: Performs local image repainting based on the provided original image, inpainting mask, and text prompt.
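The Dream mode and its per-stage parameters could be driven roughly as follows. This is a sketch that assumes the deepfloyd_if package exposes `IFStageI`, `IFStageII`, `T5Embedder`, and a `dream` pipeline function as in the project README; verify the names against the installed version. The helper `build_dream_kwargs` and the numeric values are illustrative choices, not recommended settings.

```python
def build_dream_kwargs(guidance_scale=7.0, respacing="smart100"):
    """Build the per-stage keyword arguments mentioned in the Dream bullet.

    Hypothetical helper: it only packages the two parameters named above.
    """
    if guidance_scale <= 0:
        raise ValueError("guidance_scale must be positive")
    return {
        "guidance_scale": float(guidance_scale),
        "sample_timestep_respacing": respacing,
    }


def run_dream(prompt, seed=42, device="cuda:0"):
    """Run stages I and II on one prompt (assumed API, needs a GPU and weights)."""
    # Imports are deferred so the helper above stays usable without the library.
    from deepfloyd_if.modules import IFStageI, IFStageII
    from deepfloyd_if.modules.t5 import T5Embedder
    from deepfloyd_if.pipelines import dream

    if_I = IFStageI("IF-I-XL-v1.0", device=device)
    if_II = IFStageII("IF-II-L-v1.0", device=device)
    t5 = T5Embedder(device="cpu")

    return dream(
        t5=t5,
        if_I=if_I,
        if_II=if_II,
        prompt=[prompt],
        seed=seed,
        if_I_kwargs=build_dream_kwargs(7.0, "smart100"),
        if_II_kwargs=build_dream_kwargs(4.0, "smart50"),
    )
```

Keeping the keyword arguments in a small helper makes it easy to vary guidance strength and step respacing per stage without touching the pipeline call.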
Usage Instructions & Integration:
- Integrated with the Hugging Face Diffusers library, which uses model CPU offloading to run the whole IF pipeline with as little as 14 GB of VRAM. With torch>=2.0.0, all enable_xformers_memory_efficient_attention() calls must be removed.
- Users must have a Hugging Face account, accept the license on the model card, and log in locally using huggingface_hub with an access token.
- Local installation via the deepfloyd_if Python library is also available, requiring xformers and CLIP.
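The Diffusers route and the torch>=2.0.0 note can be sketched as below. The helper names `should_enable_xformers` and `load_stage_1` are hypothetical; the Diffusers calls (`DiffusionPipeline.from_pretrained`, `enable_model_cpu_offload`, `enable_xformers_memory_efficient_attention`) are real, but loading assumes the license has been accepted, a login was done once (for example with `huggingface_hub.login()`), and that `accelerate` is installed for offloading.

```python
def should_enable_xformers(torch_version: str) -> bool:
    """Return True only for torch versions below 2.0.0.

    Follows the note above: on torch>=2.0.0 the xformers attention
    call should not be used.
    """
    major = int(torch_version.split(".")[0])
    return major < 2


def load_stage_1(model_id="DeepFloyd/IF-I-XL-v1.0"):
    """Load the first IF stage with CPU offloading (needs a GPU and weights)."""
    # Deferred imports: the version check above has no heavy dependencies.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        model_id, variant="fp16", torch_dtype=torch.float16
    )
    if should_enable_xformers(torch.__version__):
        pipe.enable_xformers_memory_efficient_attention()
    # Moves submodules to the GPU only while they are in use.
    pipe.enable_model_cpu_offload()
    return pipe
```

Guarding the xformers call with a version check lets the same script run on both older and newer torch installs.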
Hardware Requirements:
- Minimum 16 GB VRAM for IF-I-XL (4.3B) & IF-II-L (1.2B); 24 GB VRAM is required to also include the Stable x4 upscaler (to 1024x1024). Requires xformers and setting the environment variable FORCE_MEM_EFFICIENT_ATTN=1.
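As a rough illustration of these requirements, the sketch below sets the environment variable and maps available VRAM to the stage combinations listed above. Both function names are hypothetical, and the thresholds simply restate the 16 GB and 24 GB figures for the local install; they do not account for the lower figure quoted for Diffusers with CPU offloading.

```python
import os


def configure_attention_env(env=os.environ):
    """Set FORCE_MEM_EFFICIENT_ATTN=1 in the given mapping and return it.

    Assumption: the variable should be set before the IF library is imported.
    """
    env["FORCE_MEM_EFFICIENT_ATTN"] = "1"
    return env


def stages_for_vram(vram_gb):
    """Return the stage combination that fits, per the figures stated above."""
    if vram_gb >= 24:
        return ["IF-I-XL", "IF-II-L", "Stable x4"]
    if vram_gb >= 16:
        return ["IF-I-XL", "IF-II-L"]
    return []  # below the stated minimum for this configuration


if __name__ == "__main__":
    configure_attention_env()
    print(stages_for_vram(16))
```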
Model Zoo & Scale: Includes models of various parameter sizes such as IF-I-M (400M), IF-I-L (900M), IF-I-XL (4.3B), IF-II-M (450M), IF-II-L (1.2B), and IF-III-L (700M).
Pricing & License: The code is released under a bespoke license (with specific restricted points). The initial release of the IF model is temporarily under a restricted, research-only license, with the intention to release a fully open-source model later. Model weights are accessible for free via Hugging Face.