Infinity-2B GGUF with SageAttention


Unofficial Q8_0 GGUF quantization of Infinity-2B with SageAttention support for even faster generation.

Features

  • ✨ SageAttention Integration - 2-5x faster than FlashAttention with automatic fallback
  • 🎨 Gradio Web UI - Easy-to-use interface for image generation
  • 💾 Q8_0 Quantization - ~75% memory reduction (vs FP32) with minimal quality loss
  • 🚀 Optimized Inference - T5 encoder on CPU, efficient VRAM usage
  • 🔧 GGUF Support - On-the-fly dequantization with flexible deployment

Quick Start

Web UI (Recommended)

python gradio_webui.py --autoload

Then open http://127.0.0.1:7860 in your browser.

Command Line

python generate_image_2b_q8_gguf.py \
  --prompt "an astronaut riding a horse on the moon" \
  --output output.png

Installation

1. Basic Requirements

pip install -r Infinity/requirements.txt
pip install gradio gguf

2. Install SageAttention (Optional, Recommended)

For faster generation:

pip install "sageattention>=2.2.0" --no-build-isolation

Requirements: CUDA ≥12.0 (CUDA 12.8+ for Blackwell GPUs like RTX 50-series)

Note: SageAttention is optional. The code automatically picks the fastest available attention backend, in this order (sketched below):

  1. SageAttention (if installed) - 2-5x faster ✨
  2. FlashAttention (if available) - faster than PyTorch
  3. PyTorch SDPA (always works) - built-in fallback
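
A minimal sketch of such a backend-selection chain (illustrative only, not the repo's actual code; the SageAttention and FlashAttention calls follow their published APIs):

import torch.nn.functional as F

# Pick the fastest available attention backend once, at import time.
try:
    from sageattention import sageattn              # 1. SageAttention
    _BACKEND = "sage"
except ImportError:
    try:
        from flash_attn import flash_attn_func      # 2. FlashAttention
        _BACKEND = "flash"
    except ImportError:
        _BACKEND = "sdpa"                            # 3. PyTorch SDPA

def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    if _BACKEND == "sage":
        return sageattn(q, k, v, tensor_layout="HND")
    if _BACKEND == "flash":
        # flash_attn_func expects (batch, seq_len, heads, head_dim)
        return flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                               v.transpose(1, 2)).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v)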

3. Download Models

You'll need:

  • infinity_2b_reg_Q8_0.gguf - Infinity-2B model (~2.1 GB)
  • flan-t5-xl-encoder-Q8_0.gguf - T5 text encoder (~1.0 GB)
  • Infinity/infinity_vae_d32_reg.pth - VAE decoder (~0.5 GB)
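
If the GGUF files are hosted in this repository on the Hugging Face Hub, they can be fetched with huggingface_hub (a sketch; it assumes the files sit at the repo root, and that the VAE checkpoint comes from the original Infinity release):

from huggingface_hub import hf_hub_download

repo_id = "kzopp/Infinity-2B-GGUF_UNOFFICIAL"  # this model card's repo

infinity_gguf = hf_hub_download(repo_id, "infinity_2b_reg_Q8_0.gguf")
t5_gguf = hf_hub_download(repo_id, "flan-t5-xl-encoder-Q8_0.gguf")

# The VAE (Infinity/infinity_vae_d32_reg.pth) is part of the upstream
# Infinity release and is expected at that relative path.
print(infinity_gguf)
print(t5_gguf)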

Memory Requirements

Component            VRAM Usage
Infinity-2B (Q8_0)   ~2.5 GB
VAE                  ~0.5 GB
Working Memory       ~1-2 GB
Total (1M res)       ~4-5 GB

T5 encoder runs on CPU to save VRAM!

Recommended: 8GB+ VRAM for comfortable 1M (1024×1024) generation
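
A quick pre-flight check of free VRAM with plain PyTorch (not part of the repo, just a convenience snippet):

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"{torch.cuda.get_device_name(0)}: "
          f"{free / 1024**3:.1f} GB free / {total / 1024**3:.1f} GB total")
    if free < 5 * 1024**3:
        print("Warning: under ~5 GB free - 1M (1024x1024) generation may OOM.")
else:
    print("No CUDA device found.")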

Web UI Features

The Gradio web interface provides:

  • Model Management: Load models once, reuse for all generations
  • Full Parameter Control: CFG scale, tau, resolution, aspect ratio, seed
  • Real-time Preview: See your images as they generate
  • Progress Tracking: Visual feedback during loading and generation
  • Clean Layout: Model paths banner, settings on left, output on right

Web UI Options

# Basic usage
python gradio_webui.py

# Auto-load models on startup (faster)
python gradio_webui.py --autoload

# Create public share link
python gradio_webui.py --share

# Custom port
python gradio_webui.py --server-port 8080

# Full options
python gradio_webui.py \
  --autoload \
  --server-port 7860 \
  --infinity-gguf path/to/infinity.gguf \
  --t5-gguf path/to/t5.gguf \
  --vae-path path/to/vae.pth

Command-Line Options

python generate_image_2b_q8_gguf.py [OPTIONS]

Option                   Description                         Default
--prompt TEXT            Text prompt for image generation    "an astronaut..."
--infinity-gguf PATH     Path to Infinity GGUF file          infinity_2b_reg_Q8_0.gguf
--t5-gguf PATH           Path to T5 encoder GGUF             flan-t5-xl-encoder-Q8_0.gguf
--vae-path PATH          Path to VAE checkpoint              Infinity/infinity_vae_d32_reg.pth
--output PATH            Output image path                   output.png
--cfg-scale FLOAT        CFG scale (1.0-10.0)                3.0
--tau FLOAT              Temperature (0.1-1.0)               0.5
--seed INT               Random seed for reproducibility     42
--pn {0.06M,0.25M,1M}    Resolution preset                   1M
--aspect-ratio FLOAT     Aspect ratio (height/width)         1.0
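
For example, a portrait-format render at the 0.25M preset with a fixed seed (all flags as listed above; the prompt is just an illustration):

python generate_image_2b_q8_gguf.py \
  --prompt "a lighthouse on a cliff at sunset, oil painting" \
  --pn 0.25M \
  --aspect-ratio 1.5 \
  --cfg-scale 4.0 \
  --tau 0.5 \
  --seed 123 \
  --output lighthouse.png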

Technical Details

Quantization

  • Q8_0 format: 8-bit quantization with minimal quality loss
  • On-the-fly dequantization: custom GGUFLinear layers dequantize weights at inference time (sketched below)
  • Memory savings: ~50% reduction vs FP16 (~75% vs FP32)
  • Quality: Nearly identical to FP16
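
A minimal sketch of the idea behind a GGUFLinear-style layer, assuming the standard Q8_0 layout (blocks of 32 int8 weights plus one fp16 scale); the repo's actual implementation may differ:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def dequantize_q8_0(raw, out_features, in_features):
    # Q8_0: each 34-byte block = one fp16 scale followed by 32 int8 quants.
    # Assumes in_features is a multiple of 32 and raw holds exactly one weight matrix.
    blocks = raw.reshape(-1, 34)
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n_blocks, 1)
    quants = blocks[:, 2:].view(np.int8).astype(np.float32)            # (n_blocks, 32)
    return torch.from_numpy((scales * quants).reshape(out_features, in_features))

class GGUFLinear(nn.Module):
    # Stores the quantized payload and dequantizes at forward time
    # (a real implementation would cache or fuse this on the GPU).
    def __init__(self, raw_bytes, out_features, in_features):
        super().__init__()
        self.raw = raw_bytes  # uint8 numpy array, kept on the CPU
        self.out_features, self.in_features = out_features, in_features

    def forward(self, x):
        w = dequantize_q8_0(self.raw, self.out_features, self.in_features)
        return F.linear(x, w.to(x.device, x.dtype))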

Architecture

  • Infinity-2B: 2.0B parameters, embed_dim=2048, depth=32
  • T5-XL Encoder: 2048-dim text embeddings
  • VAE: d32 with dynamic resolution support

GGUF Support

The implementation includes:

  • Import utilities for GGUF tensors
  • Custom GGUFLinear layers for on-the-fly dequantization
  • Patched attention mechanisms for compatibility
  • F16 dtype handling for head layers

See patch_infinity_for_gguf.sh for implementation details.
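
For reference, the gguf Python package's GGUFReader can be used to inspect the quantized tensors (a sketch; actual tensor names depend on how the GGUF was exported):

from gguf import GGUFReader

reader = GGUFReader("infinity_2b_reg_Q8_0.gguf")
for t in reader.tensors[:5]:
    # t.data is the raw (still quantized) payload; t.tensor_type says
    # which format it is in (e.g. Q8_0, or F16 for the head layers).
    print(t.name, t.tensor_type.name, list(t.shape), t.data.nbytes, "bytes")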

Credits

License

MIT
