Infinity-2B GGUF with SageAttention


Unofficial Q8_0 GGUF quantization of Infinity-2B with SageAttention support for even faster generation.

Features

  • ✨ SageAttention Integration - 2-5x faster than FlashAttention with automatic fallback
  • 🎨 Gradio Web UI - Easy-to-use interface for image generation
  • 💾 Q8_0 Quantization - ~75% memory reduction (vs FP32) with minimal quality loss
  • 🚀 Optimized Inference - T5 encoder on CPU, efficient VRAM usage
  • 🔧 GGUF Support - On-the-fly dequantization with flexible deployment

Quick Start

Web UI (Recommended)

python gradio_webui.py --autoload

Then open http://127.0.0.1:7860 in your browser.

Command Line

python generate_image_2b_q8_gguf.py \
  --prompt "an astronaut riding a horse on the moon" \
  --output output.png

Installation

1. Basic Requirements

pip install -r Infinity/requirements.txt
pip install gradio gguf

2. Install SageAttention (Optional, Recommended)

For faster generation:

pip install "sageattention>=2.2.0" --no-build-isolation

Requirements: CUDA ≥12.0 (CUDA 12.8+ for Blackwell GPUs like RTX 50-series)

Note: SageAttention is optional. The code automatically picks the fastest available attention backend, in this order (sketched below):

  1. SageAttention (if installed) - 2-5x faster ✨
  2. FlashAttention (if available) - faster than PyTorch
  3. PyTorch SDPA (always works) - built-in fallback
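
A minimal sketch of such a backend-selection chain (illustrative only, not the repo's actual code; the SageAttention and FlashAttention calls follow their published APIs):

import torch.nn.functional as F

# Pick the fastest available attention backend once, at import time.
try:
    from sageattention import sageattn              # 1. SageAttention
    _BACKEND = "sage"
except ImportError:
    try:
        from flash_attn import flash_attn_func      # 2. FlashAttention
        _BACKEND = "flash"
    except ImportError:
        _BACKEND = "sdpa"                            # 3. PyTorch SDPA

def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    if _BACKEND == "sage":
        return sageattn(q, k, v, tensor_layout="HND")
    if _BACKEND == "flash":
        # flash_attn_func expects (batch, seq_len, heads, head_dim)
        return flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                               v.transpose(1, 2)).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v)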

3. Download Models

You'll need:

  • infinity_2b_reg_Q8_0.gguf - Infinity-2B model (~2.1 GB)
  • flan-t5-xl-encoder-Q8_0.gguf - T5 text encoder (~1.0 GB)
  • Infinity/infinity_vae_d32_reg.pth - VAE decoder (~0.5 GB)
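
If the GGUF files are hosted in this repository on the Hugging Face Hub, they can be fetched with huggingface_hub (a sketch; it assumes the files sit at the repo root, and that the VAE checkpoint comes from the original Infinity release):

from huggingface_hub import hf_hub_download

repo_id = "kzopp/Infinity-2B-GGUF_UNOFFICIAL"  # this model card's repo

infinity_gguf = hf_hub_download(repo_id, "infinity_2b_reg_Q8_0.gguf")
t5_gguf = hf_hub_download(repo_id, "flan-t5-xl-encoder-Q8_0.gguf")

# The VAE (Infinity/infinity_vae_d32_reg.pth) is part of the upstream
# Infinity release and is expected at that relative path.
print(infinity_gguf)
print(t5_gguf)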

Memory Requirements

Component            VRAM Usage
Infinity-2B (Q8_0)   ~2.5 GB
VAE                  ~0.5 GB
Working Memory       ~1-2 GB
Total (1M res)       ~4-5 GB

T5 encoder runs on CPU to save VRAM!

Recommended: 8GB+ VRAM for comfortable 1M (1024×1024) generation
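
A quick pre-flight check of free VRAM with plain PyTorch (not part of the repo, just a convenience snippet):

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"{torch.cuda.get_device_name(0)}: "
          f"{free / 1024**3:.1f} GB free / {total / 1024**3:.1f} GB total")
    if free < 5 * 1024**3:
        print("Warning: under ~5 GB free - 1M (1024x1024) generation may OOM.")
else:
    print("No CUDA device found.")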

Web UI Features

The Gradio web interface provides:

  • Model Management: Load models once, reuse for all generations
  • Full Parameter Control: CFG scale, tau, resolution, aspect ratio, seed
  • Real-time Preview: See your images as they generate
  • Progress Tracking: Visual feedback during loading and generation
  • Clean Layout: Model paths banner, settings on left, output on right

Web UI Options

# Basic usage
python gradio_webui.py

# Auto-load models on startup (faster)
python gradio_webui.py --autoload

# Create public share link
python gradio_webui.py --share

# Custom port
python gradio_webui.py --server-port 8080

# Full options
python gradio_webui.py \
  --autoload \
  --server-port 7860 \
  --infinity-gguf path/to/infinity.gguf \
  --t5-gguf path/to/t5.gguf \
  --vae-path path/to/vae.pth

Command-Line Options

python generate_image_2b_q8_gguf.py [OPTIONS]

Option                   Description                         Default
--prompt TEXT            Text prompt for image generation    "an astronaut..."
--infinity-gguf PATH     Path to Infinity GGUF file          infinity_2b_reg_Q8_0.gguf
--t5-gguf PATH           Path to T5 encoder GGUF             flan-t5-xl-encoder-Q8_0.gguf
--vae-path PATH          Path to VAE checkpoint              Infinity/infinity_vae_d32_reg.pth
--output PATH            Output image path                   output.png
--cfg-scale FLOAT        CFG scale (1.0-10.0)                3.0
--tau FLOAT              Temperature (0.1-1.0)               0.5
--seed INT               Random seed for reproducibility     42
--pn {0.06M,0.25M,1M}    Resolution preset                   1M
--aspect-ratio FLOAT     Aspect ratio (height/width)         1.0
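
For example, a portrait-format render at the 0.25M preset with a fixed seed (all flags as listed above; the prompt is just an illustration):

python generate_image_2b_q8_gguf.py \
  --prompt "a lighthouse on a cliff at sunset, oil painting" \
  --pn 0.25M \
  --aspect-ratio 1.5 \
  --cfg-scale 4.0 \
  --tau 0.5 \
  --seed 123 \
  --output lighthouse.png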

Technical Details

Quantization

  • Q8_0 format: 8-bit quantization with minimal quality loss
  • On-the-fly dequantization: custom GGUFLinear layers dequantize weights at inference time (sketched below)
  • Memory savings: ~50% reduction vs FP16 (~75% vs FP32)
  • Quality: Nearly identical to FP16
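
A minimal sketch of the idea behind a GGUFLinear-style layer, assuming the standard Q8_0 layout (blocks of 32 int8 weights plus one fp16 scale); the repo's actual implementation may differ:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def dequantize_q8_0(raw, out_features, in_features):
    # Q8_0: each 34-byte block = one fp16 scale followed by 32 int8 quants.
    # Assumes in_features is a multiple of 32 and raw holds exactly one weight matrix.
    blocks = raw.reshape(-1, 34)
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n_blocks, 1)
    quants = blocks[:, 2:].view(np.int8).astype(np.float32)            # (n_blocks, 32)
    return torch.from_numpy((scales * quants).reshape(out_features, in_features))

class GGUFLinear(nn.Module):
    # Stores the quantized payload and dequantizes at forward time
    # (a real implementation would cache or fuse this on the GPU).
    def __init__(self, raw_bytes, out_features, in_features):
        super().__init__()
        self.raw = raw_bytes  # uint8 numpy array, kept on the CPU
        self.out_features, self.in_features = out_features, in_features

    def forward(self, x):
        w = dequantize_q8_0(self.raw, self.out_features, self.in_features)
        return F.linear(x, w.to(x.device, x.dtype))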

Architecture

  • Infinity-2B: 2.0B parameters, embed_dim=2048, depth=32
  • T5-XL Encoder: 2048-dim text embeddings
  • VAE: d32 with dynamic resolution support

GGUF Support

The implementation includes:

  • Import utilities for GGUF tensors
  • Custom GGUFLinear layers for on-the-fly dequantization
  • Patched attention mechanisms for compatibility
  • F16 dtype handling for head layers

See patch_infinity_for_gguf.sh for implementation details.
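
For reference, the gguf Python package's GGUFReader can be used to inspect the quantized tensors (a sketch; actual tensor names depend on how the GGUF was exported):

from gguf import GGUFReader

reader = GGUFReader("infinity_2b_reg_Q8_0.gguf")
for t in reader.tensors[:5]:
    # t.data is the raw (still quantized) payload; t.tensor_type says
    # which format it is in (e.g. Q8_0, or F16 for the head layers).
    print(t.name, t.tensor_type.name, list(t.shape), t.data.nbytes, "bytes")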

Credits

License

MIT
