Upload 27 files

- .gitattributes +10 -0
- README.md +105 -12
- assets/figures/CreatiDesign_logo.png +3 -0
- assets/figures/Qualitative_results.jpg +3 -0
- assets/figures/Quantitative_results.png +3 -0
- assets/figures/architecture.jpg +3 -0
- assets/figures/dataset.jpg +3 -0
- assets/figures/loop_edit.jpg +3 -0
- assets/figures/motivation.jpg +3 -0
- assets/figures/teaser.jpg +3 -0
- dataloader/__pycache__/creatidesign_dataset_benchmark.cpython-310.pyc +0 -0
- dataloader/arial.ttf +3 -0
- dataloader/creatidesign_dataset_benchmark.py +554 -0
- eval/layout.py +194 -0
- eval/subject.py +233 -0
- eval/text.py +184 -0
- modules/common/__pycache__/lora.cpython-310.pyc +0 -0
- modules/common/lora.py +26 -0
- modules/flux/__pycache__/attention_processor_flux_creatidesign.cpython-310.pyc +3 -0
- modules/flux/__pycache__/transformer_flux_creatidesign.cpython-310.pyc +0 -0
- modules/flux/attention_processor_flux_creatidesign.py +0 -0
- modules/flux/transformer_flux_creatidesign.py +1004 -0
- modules/semantic_layout/__pycache__/layout_encoder.cpython-310.pyc +0 -0
- modules/semantic_layout/layout_encoder.py +139 -0
- pipeline/__pycache__/pipeline_flux_creatidesign.cpython-310.pyc +0 -0
- pipeline/pipeline_flux_creatidesign.py +1068 -0
- requirements.txt +14 -6
- test_creatidesign_benchmark.py +210 -0
.gitattributes
CHANGED
@@ -33,3 +33,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/figures/architecture.jpg filter=lfs diff=lfs merge=lfs -text
+assets/figures/CreatiDesign_logo.png filter=lfs diff=lfs merge=lfs -text
+assets/figures/dataset.jpg filter=lfs diff=lfs merge=lfs -text
+assets/figures/loop_edit.jpg filter=lfs diff=lfs merge=lfs -text
+assets/figures/motivation.jpg filter=lfs diff=lfs merge=lfs -text
+assets/figures/Qualitative_results.jpg filter=lfs diff=lfs merge=lfs -text
+assets/figures/Quantitative_results.png filter=lfs diff=lfs merge=lfs -text
+assets/figures/teaser.jpg filter=lfs diff=lfs merge=lfs -text
+dataloader/arial.ttf filter=lfs diff=lfs merge=lfs -text
+modules/flux/__pycache__/attention_processor_flux_creatidesign.cpython-310.pyc filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,12 +1,105 @@

# <img src='assets/figures/CreatiDesign_logo.png' alt="CreatiDesign Logo" width='24px' /> CreatiDesign


<img src='assets/figures/teaser.jpg' width='100%' />

<br>
<a href="https://arxiv.org/pdf/2505.19114"><img src="https://img.shields.io/static/v1?label=Paper&message=2505.19114&color=red&logo=arxiv"></a>
<a href="https://huizhang0812.github.io/CreatiDesign/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a>
<a href="https://huggingface.co/datasets/HuiZhang0812/CreatiDesign_dataset"><img src="https://img.shields.io/badge/🤗_HuggingFace-Dataset-ffbd45.svg" alt="HuggingFace"></a>
<a href="https://huggingface.co/datasets/HuiZhang0812/CreatiDesign_benchmark"><img src="https://img.shields.io/badge/🤗_HuggingFace-Benchmark-ffbd45.svg" alt="HuggingFace"></a>
<a href="https://huggingface.co/HuiZhang0812/CreatiDesign"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>


> <img src='assets/figures/CreatiDesign_logo.png' alt="CreatiDesign Logo" width='15px' /> **CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design**
> <br>
> [Hui Zhang](https://huizhang0812.github.io/),
> [Dexiang Hong](https://scholar.google.com.hk/citations?user=DUNijlcAAAAJ&hl=zh-CN),
> Maoke Yang,
> Yutao Cheng,
> Zhao Zhang,
> Jie Shao,
> [Xinglong Wu](https://scholar.google.com/citations?user=LVsp9RQAAAAJ&hl=zh-CN),
> [Zuxuan Wu](https://zxwu.azurewebsites.net/),
> and
> [Yu-Gang Jiang](https://scholar.google.com/citations?user=f3_FP8AAAAAJ)
> <br>
> Fudan University & ByteDance Intelligent Creation.
> <br>

## 🎯 Introduction
CreatiDesign tackles automated graphic design generation, which requires precise control over multiple heterogeneous elements: primary visual elements (product images), secondary visual elements (decorative objects), and textual elements (slogans, titles). CreatiDesign introduces a unified multi-conditional diffusion transformer that achieves flexible and harmonious integration of diverse design elements with minimal architectural modifications.

<img src='assets/figures/motivation.jpg' width='100%' />

## ✨ Key Features

- **🎨 Multi-Conditional Image Generation**: Unified architecture supporting image and semantic layout conditions simultaneously
- **🎯 Precise Element Control**: Multimodal attention mask mechanism prevents condition interference
- **🗂️ Graphic Design Datasets**: 400K graphic design samples with multi-condition annotations, constructed by an automatic pipeline
- **📊 Comprehensive Benchmark**: Rigorous evaluation of multi-subject preservation and semantic layout alignment
- **✏️ Zero-Shot Editing**: Natural extension to editing tasks without any additional training


## Quick Start
### Setup
1. **Environment setup**
```bash
conda create -n creatidesign python=3.10 -y
conda activate creatidesign
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
```
2. **Requirements installation**
```bash
pip install -r requirements.txt
```
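To confirm that the pinned PyTorch build can see your GPU before moving on, a minimal sanity check (not part of the official setup):
```python
import torch
import torchvision

# Expect 2.4.1 / 0.19.1 from the conda install above, and CUDA available on a GPU machine.
print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```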


## Dataset and Benchmark
### CreatiDesign Datasets <a href="https://huggingface.co/datasets/HuiZhang0812/CreatiDesign_dataset"><img src="https://img.shields.io/badge/🤗_HuggingFace-Dataset-ffbd45.svg" alt="HuggingFace"></a>
Our CreatiDesign dataset contains **400K high-quality graphic design samples** with comprehensive multi-condition annotations, constructed through our fully automated pipeline. The dataset covers diverse design categories including movie posters, product advertisements, brand promotions, and social media content.
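For a first look at the data, the annotations can be browsed directly with the Hugging Face `datasets` library. A minimal sketch, assuming a `train` split and a `metadata` column similar to the benchmark's (check the dataset card for the authoritative schema):
```python
import json
from datasets import load_dataset

# Split name and column names are assumptions; see the dataset card.
ds = load_dataset("HuiZhang0812/CreatiDesign_dataset", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the available fields
if "metadata" in sample:
    print(json.loads(sample["metadata"]).keys())  # per-sample annotation structure
```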

### CreatiDesign Benchmark <a href="https://huggingface.co/datasets/HuiZhang0812/CreatiDesign_benchmark"><img src="https://img.shields.io/badge/🤗_HuggingFace-Benchmark-ffbd45.svg" alt="HuggingFace"></a>
Our comprehensive benchmark contains **1,000 carefully curated samples** designed to rigorously evaluate graphic design generation capabilities across multiple dimensions. The benchmark assesses both fine-grained condition adherence and overall visual quality.
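The benchmark is consumed through the `DesignDataset` wrapper in `dataloader/creatidesign_dataset_benchmark.py`. A minimal iteration sketch, mirroring that file's own `__main__` block (argument values are the defaults used there):
```python
from torch.utils.data import DataLoader

from dataloader.creatidesign_dataset_benchmark import DesignDataset, collate_fn

dataset = DesignDataset(
    dataset_name="HuiZhang0812/CreatiDesign_benchmark",  # HF repo of the benchmark
    resolution=1024,
    condition_resolution=512,
    neg_condition_image="same",
    background_color="gray",
    use_bucket=True,
    condition_resolution_scale_ratio=0.5,
)
loader = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=1, collate_fn=collate_fn)

for batch in loader:
    # Global caption, padded layout boxes with captions, and the subject condition image.
    print(batch["id"][0], batch["caption"][0])
    print(batch["objects_boxes"].shape, batch["condition_img"].shape)
    break
```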

To evaluate the model's graphic design generation capabilities on our benchmark, follow these steps:

Generate images:
```bash
python test_creatidesign_benchmark.py
```
Evaluate multi-subject preservation:
```bash
python eval/subject.py
```
Evaluate semantic layout alignment:
```bash
python eval/layout.py
```
Evaluate text rendering accuracy:
```bash
python eval/text.py
```
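`eval/subject.py` takes its paths from command-line flags (see the `argparse` block in that script), so the benchmark repo and the directory of generated images can be overridden; `eval/layout.py` instead reads its `gen_root` path from a variable near the top of the file. For example:
```bash
python eval/subject.py \
    --benchmark_repo HuiZhang0812/CreatiDesign_benchmark \
    --gen_root outputs/CreatiDesign_benchmark \
    --device cuda \
    --outfile outputs/CreatiDesign_benchmark/scores.csv
```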


## Models
**Multi-Conditional Graphic Design:**

| Model | Base model | Description |
| ----- | ---------- | ----------- |
| <a href="https://huggingface.co/HuiZhang0812/CreatiDesign"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a> | FLUX.1-dev | Model used in the paper |
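To fetch the released weights locally before running the test script, the standard Hugging Face Hub tooling works; the target directory below is arbitrary and only an example:
```python
from huggingface_hub import snapshot_download

# Downloads the CreatiDesign weights; how they are wired into the FLUX pipeline
# is defined in test_creatidesign_benchmark.py and pipeline/pipeline_flux_creatidesign.py.
snapshot_download(repo_id="HuiZhang0812/CreatiDesign", local_dir="checkpoints/CreatiDesign")
```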

## ✒️ Citation

If you find our work useful for your research and applications, please cite using this BibTeX:

```latex
@article{zhang2025creatidesign,
  title={CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design},
  author={Zhang, Hui and Hong, Dexiang and Yang, Maoke and Cheng, Yutao and Zhang, Zhao and Shao, Jie and Wu, Xinglong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2505.19114},
  year={2025}
}
```

assets/figures/CreatiDesign_logo.png
ADDED
Git LFS Details

assets/figures/Qualitative_results.jpg
ADDED
Git LFS Details

assets/figures/Quantitative_results.png
ADDED
Git LFS Details

assets/figures/architecture.jpg
ADDED
Git LFS Details

assets/figures/dataset.jpg
ADDED
Git LFS Details

assets/figures/loop_edit.jpg
ADDED
Git LFS Details

assets/figures/motivation.jpg
ADDED
Git LFS Details

assets/figures/teaser.jpg
ADDED
Git LFS Details

dataloader/__pycache__/creatidesign_dataset_benchmark.cpython-310.pyc
ADDED
Binary file (13.1 kB)
dataloader/arial.ttf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:35c0f3559d8db569e36c31095b8a60d441643d95f59139de40e23fada819b833
size 275572

dataloader/creatidesign_dataset_benchmark.py
ADDED
@@ -0,0 +1,554 @@
import os
import json
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import torch
import numpy as np
import random
from datasets import load_dataset
from tqdm import tqdm

def find_nearest_bucket_size(input_width, input_height, mode="x64", ratio=1):
    buckets = [
        (512, 2048),
        (512, 1984),
        (512, 1920),
        (512, 1856),
        (576, 1792),
        (576, 1728),
        (576, 1664),
        (640, 1600),
        (640, 1536),
        (704, 1472),
        (704, 1408),
        (704, 1344),
        (768, 1344),
        (768, 1280),
        (832, 1216),
        (832, 1152),
        (896, 1152),
        (896, 1088),
        (960, 1088),
        (960, 1024),
        (1024, 1024),
        (1024, 960),
        (1088, 960),
        (1088, 896),
        (1152, 896),
        (1152, 832),
        (1216, 832),
        (1280, 768),
        (1344, 768),
        (1408, 704),
        (1472, 704),
        (1536, 640),
        (1600, 640),
        (1664, 576),
        (1728, 576),
        (1792, 576),
        (1856, 512),
        (1920, 512),
        (1984, 512),
        (2048, 512)
    ]
    aspect_ratios = [w / h for (w, h) in buckets]

    assert mode in ["x64", "x8"]
    if mode == "x64":
        asp = input_width / input_height
        diff = [abs(ar - asp) for ar in aspect_ratios]
        bucket_id = int(np.argmin(diff))
        gen_width, gen_height = buckets[bucket_id]
    elif mode == "x8":
        max_pixels = 1024 * 1024
        ratio = (max_pixels / (input_width * input_height)) ** (0.5)
        gen_width, gen_height = round(input_width * ratio), round(input_height * ratio)
        gen_width = gen_width - gen_width % 8
        gen_height = gen_height - gen_height % 8
    else:
        raise NotImplementedError

    return (int(gen_width * ratio), int(gen_height * ratio))

def adjust_and_normalize_bboxes(bboxes, orig_width, orig_height):
    # Adjust and normalize bbox
    normalized_bboxes = []
    for bbox in bboxes:
        x1, y1, x2, y2 = bbox
        x1_norm = round(x1 / orig_width, 2)
        y1_norm = round(y1 / orig_height, 2)
        x2_norm = round(x2 / orig_width, 2)
        y2_norm = round(y2 / orig_height, 2)

        normalized_bboxes.append([x1_norm, y1_norm, x2_norm, y2_norm])

    return normalized_bboxes

def img_transforms(image, height=512, width=512):
    transform = transforms.Compose(
        [
            transforms.Resize(
                (height, width), interpolation=transforms.InterpolationMode.BILINEAR
            ),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ]
    )
    image_transformed = transform(image)
    return image_transformed

def mask_transforms(mask, height=512, width=512):
    transform = transforms.Compose(
        [
            transforms.Resize(
                (height, width),
                interpolation=transforms.InterpolationMode.NEAREST
            ),
            transforms.ToTensor(),
        ]
    )
    mask_transformed = transform(mask)
    return mask_transformed


class DesignDataset(Dataset):

    def __init__(
        self,
        dataset_name,
        resolution=512,
        condition_resolution=512,
        condition_resolution_scale_ratio=0.5,
        max_boxes_per_image=10,
        neg_condition_image='same',
        background_color='gray',
        use_bucket=True,
        box_confidence_th=0.0
    ):

        print(f"Loading dataset from Hugging Face: {dataset_name}")

        self.dataset = load_dataset(dataset_name, split="test")
        print(f"Loaded {len(self.dataset)} samples")
        self.max_boxes_per_image = max_boxes_per_image
        self.resolution = resolution
        self.condition_resolution = condition_resolution
        self.neg_condition_image = neg_condition_image
        self.use_bucket = use_bucket
        self.condition_resolution_scale_ratio = condition_resolution_scale_ratio
        self.box_confidence_th = box_confidence_th

        if background_color == 'white':
            self.background_color = (255, 255, 255)
        elif background_color == 'black':
            self.background_color = (0, 0, 0)
        elif background_color == 'gray':
            self.background_color = (128, 128, 128)
        else:
            raise ValueError("Invalid background color. Use 'white', 'black' or 'gray'.")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]
        image_source = sample['original_image']
        subject_image = sample['condition_gray_background']
        subject_mask = sample['subject_mask']
        json_data = json.loads(sample['metadata'])

        # img info
        img_info = json_data['img_info']
        img_id = img_info['img_id']
        orig_width, orig_height = int(img_info["img_width"]), int(img_info["img_height"])

        if self.use_bucket:
            target_width, target_height = find_nearest_bucket_size(orig_width, orig_height)
            condition_width = int(target_width * self.condition_resolution_scale_ratio)
            condition_height = int(target_height * self.condition_resolution_scale_ratio)
        else:
            target_width = target_height = self.resolution
            condition_width = condition_height = self.condition_resolution

        img_tensor = img_transforms(image_source, height=target_height, width=target_width)

        # global caption
        global_caption = json_data['global_caption']

        # object_annotations
        object_annotations = json_data['object_annotations']

        # object bbox list
        objects_bbox = [item['bbox'] for item in object_annotations]

        # object bbox caption
        objects_caption = [item['bbox_detail_description'] for item in object_annotations]

        # object bbox score
        objects_bbox_score = [item['score'][0] for item in object_annotations]

        # text
        text_list = json_data["text_list"]
        txt_bboxs = [item['bbox'] for item in text_list]
        txt_captions = ["text:" + item['text'] for item in text_list]

        txt_scores = [1.0 for _ in txt_bboxs]
        # combine bbox and description lists
        objects_bbox.extend(txt_bboxs)
        objects_caption.extend(txt_captions)
        objects_bbox_score.extend(txt_scores)

        objects_bbox = torch.tensor(adjust_and_normalize_bboxes(objects_bbox, orig_width, orig_height))

        objects_bbox_score = torch.tensor(objects_bbox_score)

        boxes_mask = objects_bbox_score > self.box_confidence_th
        objects_bbox_raw = objects_bbox[boxes_mask]
        objects_caption = [object_caption for object_caption, box_mask in zip(objects_caption, boxes_mask) if box_mask]

        num_boxes = objects_bbox_raw.shape[0]
        objects_boxes_padded = torch.zeros((self.max_boxes_per_image, 4))
        objects_masks_padded = torch.zeros(self.max_boxes_per_image)

        objects_caption = objects_caption[:self.max_boxes_per_image]
        objects_boxes_padded[:num_boxes] = objects_bbox_raw[:self.max_boxes_per_image]
        objects_masks_padded[:num_boxes] = 1.

        # objects_masks_maps
        objects_masks_maps_padded = torch.zeros((self.max_boxes_per_image, target_height, target_width))
        for idx in range(num_boxes):
            x1, y1, x2, y2 = objects_boxes_padded[idx]

            x1_pixel = int(x1 * target_width)
            y1_pixel = int(y1 * target_height)
            x2_pixel = int(x2 * target_width)
            y2_pixel = int(y2 * target_height)

            x1_pixel = max(0, min(x1_pixel, target_width - 1))
            y1_pixel = max(0, min(y1_pixel, target_height - 1))
            x2_pixel = max(0, min(x2_pixel, target_width - 1))
            y2_pixel = max(0, min(y2_pixel, target_height - 1))

            objects_masks_maps_padded[idx, y1_pixel:y2_pixel + 1, x1_pixel:x2_pixel + 1] = 1.0

        # subject
        original_size_subject_tensor = img_transforms(subject_image, height=target_height, width=target_width)
        subject_tensor = img_transforms(subject_image, height=condition_height, width=condition_width)
        subject_mask_tensor = mask_transforms(subject_mask, height=condition_height, width=condition_width)

        if self.neg_condition_image == 'black':
            subject_image_black = Image.new('RGB', (orig_width, orig_height), (0, 0, 0))
            subject_image_neg_tensor = img_transforms(subject_image_black, height=condition_height, width=condition_width)
        elif self.neg_condition_image == 'white':
            subject_image_white = Image.new('RGB', (orig_width, orig_height), (255, 255, 255))
            subject_image_neg_tensor = img_transforms(subject_image_white, height=condition_height, width=condition_width)
        elif self.neg_condition_image == 'gray':
            subject_image_gray = Image.new('RGB', (orig_width, orig_height), (128, 128, 128))
            subject_image_neg_tensor = img_transforms(subject_image_gray, height=condition_height, width=condition_width)
        elif self.neg_condition_image == 'same':
            subject_image_neg_tensor = subject_tensor

        output = dict(
            id=img_id,
            caption=global_caption,
            objects_boxes=objects_boxes_padded,
            objects_caption=objects_caption,
            objects_masks=objects_masks_padded,
            objects_masks_maps=objects_masks_maps_padded,
            img=img_tensor,
            condition_img_masks_maps=subject_mask_tensor,
            condition_img=subject_tensor,
            original_size_condition_img=original_size_subject_tensor,
            neg_condtion_img=subject_image_neg_tensor,
            img_info=img_info,
            target_width=target_width,
            target_height=target_height,
        )

        return output


def collate_fn(examples):

    collated_examples = {}

    for key in ['id', 'objects_caption', 'caption', 'img_info', 'target_width', 'target_height']:
        collated_examples[key] = [example[key] for example in examples]

    for key in ['img', 'objects_boxes', 'objects_masks', 'condition_img', 'neg_condtion_img', 'objects_masks_maps', 'condition_img_masks_maps', 'original_size_condition_img']:
        collated_examples[key] = torch.stack([example[key] for example in examples]).float()

    return collated_examples


from typing import Dict

import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageOps
import random

def draw_mask(mask, draw, random_color=True):
    """Draws a mask with a specified color on an image.

    Args:
        mask (np.array): Binary mask as a NumPy array.
        draw (ImageDraw.Draw): ImageDraw object to draw on the image.
        random_color (bool): Whether to use a random color for the mask.
    """
    if random_color:
        color = (
            random.randint(0, 255),
            random.randint(0, 255),
            random.randint(0, 255),
            153,
        )
    else:
        color = (30, 144, 255, 153)

    nonzero_coords = np.transpose(np.nonzero(mask))

    for coord in nonzero_coords:
        draw.point(coord[::-1], fill=color)

def visualize_bbox(image_pil: Image,
                   result: Dict,
                   draw_width: float = 6.0,
                   return_mask=True) -> Image:
    """Plot bounding boxes and labels on an image with text wrapping for long descriptions.

    Args:
        image_pil (PIL.Image): The input image as a PIL Image object.
        result (Dict[str, Union[torch.Tensor, List[torch.Tensor]]]): The target dictionary containing
            the bounding boxes and labels. The keys are:
                - boxes (List[int]): A list of bounding boxes in shape (N, 4), [x1, y1, x2, y2] format.
                - labels (List[str]): A list of labels for each object
                - masks (List[PIL.Image], optional): A list of masks in the format of PIL.Image

    Returns:
        PIL.Image: The input image with plotted bounding boxes, labels, and masks.
    """
    # Get the bounding boxes and labels from the target dictionary
    boxes = result["boxes"]
    categorys = result["labels"]
    masks = result.get("masks", [])

    color_list = [(255, 162, 76), (177, 214, 144),
                  (13, 146, 244), (249, 84, 84), (54, 186, 152),
                  (74, 36, 157), (0, 159, 189),
                  (80, 118, 135), (188, 90, 148), (119, 205, 255)]

    # Use smaller font size to allow more text to be displayed
    font_size = 30  # Reduce font size
    font = ImageFont.truetype("dataloader/arial.ttf", font_size)

    # Get image dimensions
    img_width, img_height = image_pil.size

    # Find all unique categories and build a cate2color dictionary
    cate2color = {}
    unique_categorys = sorted(set(categorys))
    for idx, cate in enumerate(unique_categorys):
        cate2color[cate] = color_list[idx % len(color_list)]

    # Create a PIL ImageDraw object to draw on the input image
    if isinstance(image_pil, np.ndarray):
        image_pil = Image.fromarray(image_pil)
    draw = ImageDraw.Draw(image_pil)

    # Create a new binary mask image with the same size as the input image
    mask = Image.new("L", image_pil.size, 0)
    # Create a PIL ImageDraw object to draw on the mask image
    mask_draw = ImageDraw.Draw(mask)

    # Draw boxes, labels, and masks for each box and label in the target dictionary
    for box, category in zip(boxes, categorys):
        # Extract the box coordinates
        x0, y0, x1, y1 = box
        x0, y0, x1, y1 = int(x0), int(y0), int(x1), int(y1)
        box_width = x1 - x0
        box_height = y1 - y0
        color = cate2color.get(category, color_list[0])  # Default color

        # Draw the box outline on the input image
        draw.rectangle([x0, y0, x1, y1], outline=color, width=int(draw_width))

        # Allow text box to be maximum 2 times the bounding box width, but not exceed image boundaries
        max_text_width = min(box_width * 2, img_width - x0)

        # Determine the maximum height for text background area
        max_text_height = min(box_height * 2, 200)  # Also allow more text display, but limit height

        # Handle long text based on bounding box width, split text into lines
        lines = []
        words = category.split()
        current_line = words[0]

        for word in words[1:]:
            # Try to add the next word
            test_line = current_line + " " + word
            # Use textbbox or textlength to check if width fits the maximum text width
            if hasattr(draw, "textbbox"):
                # Use textbbox method
                bbox = draw.textbbox((0, 0), test_line, font=font)
                w = bbox[2] - bbox[0]
            elif hasattr(draw, "textlength"):
                # Use textlength method
                w = draw.textlength(test_line, font=font)
            else:
                # Fallback - estimate width
                w = len(test_line) * (font_size * 0.6)  # Estimate average character width

            if w <= max_text_width - 20:  # Leave some margin
                current_line = test_line
            else:
                lines.append(current_line)
                current_line = word

        lines.append(current_line)  # Add the last line

        # Limit number of lines to prevent overflow
        max_lines = max_text_height // (font_size + 2)  # Line height (font size + spacing)
        if len(lines) > max_lines:
            lines = lines[:max_lines - 1]
            lines.append("...")  # Add ellipsis

        # Calculate actual required width for each line
        line_widths = []
        for line in lines:
            if hasattr(draw, "textbbox"):
                bbox = draw.textbbox((0, 0), line, font=font)
                line_width = bbox[2] - bbox[0]
            elif hasattr(draw, "textlength"):
                line_width = draw.textlength(line, font=font)
            else:
                line_width = len(line) * (font_size * 0.6)  # Estimate width
            line_widths.append(line_width)

        # Determine actual required width for text box
        if line_widths:
            needed_text_width = max(line_widths) + 10  # Add small margin
        else:
            needed_text_width = 0

        # Use bounding box width as minimum, only expand when needed
        text_bg_width = max(box_width, min(needed_text_width, max_text_width))

        # Ensure it doesn't exceed image boundaries
        text_bg_width = min(text_bg_width, img_width - x0)

        # Calculate text background height
        text_bg_height = len(lines) * (font_size + 2)

        # Ensure text background doesn't exceed image bottom
        if y0 + text_bg_height > img_height:
            # If it would exceed bottom, adjust text position to above the bounding box bottom
            text_y0 = max(0, y1 - text_bg_height)
        else:
            text_y0 = y0

        # Draw text background - note RGBA color handling
        if image_pil.mode == "RGBA":
            # For RGBA mode, we can directly use alpha color
            bg_color = (*color, 180)  # Semi-transparent background
        else:
            # For RGB mode, we cannot use alpha
            bg_color = color

        draw.rectangle([x0, text_y0, x0 + text_bg_width, text_y0 + text_bg_height], fill=bg_color)

        # Draw text
        for i, line in enumerate(lines):
            y_pos = text_y0 + i * (font_size + 2)
            draw.text((x0 + 5, y_pos), line, fill="white", font=font)

    # Draw the mask on the input image if masks are provided
    if len(masks) > 0 and return_mask:
        size = image_pil.size
        mask_image = Image.new("RGBA", size, color=(0, 0, 0, 0))
        mask_draw = ImageDraw.Draw(mask_image)
        for mask in masks:
            mask = np.array(mask)[:, :, -1]
            draw_mask(mask, mask_draw)

        image_pil = Image.alpha_composite(image_pil.convert("RGBA"), mask_image).convert("RGB")

    return image_pil

import torchvision.transforms as T
from PIL import Image, ImageDraw, ImageFont, ImageChops

def tensor_to_pil(img_tensor):
    """Convert a normalized tensor back to a PIL image."""
    img_tensor = img_tensor.cpu()
    # Undo the ([0.5], [0.5]) normalization
    img_tensor = img_tensor * 0.5 + 0.5
    img_tensor = torch.clamp(img_tensor, 0, 1)
    return T.ToPILImage()(img_tensor)

def make_image_grid_RGB(images, rows, cols, resize=None):
    """
    Prepares a single grid of images. Useful for visualization purposes.
    """
    assert len(images) == rows * cols

    if resize is not None:
        images = [img.resize((resize, resize)) for img in images]

    w, h = images[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))

    for i, img in enumerate(images):
        grid.paste(img.convert("RGB"), box=(i % cols * w, i // cols * h))
    return grid

if __name__ == "__main__":
    resolution = 1024
    condition_resolution = 512
    neg_condition_image = 'same'
    background_color = 'gray'
    use_bucket = True
    condition_resolution_scale_ratio = 0.5

    benchmark_repo = 'HuiZhang0812/CreatiDesign_benchmark'  # huggingface repo of benchmark

    datasets = DesignDataset(dataset_name=benchmark_repo,
                             resolution=resolution,
                             condition_resolution=condition_resolution,
                             neg_condition_image=neg_condition_image,
                             background_color=background_color,
                             use_bucket=use_bucket,
                             condition_resolution_scale_ratio=condition_resolution_scale_ratio
                             )
    test_dataloader = DataLoader(datasets, batch_size=1, shuffle=False, num_workers=1, collate_fn=collate_fn)

    for i, batch in enumerate(tqdm(test_dataloader)):
        prompts = batch["caption"]
        imgs_id = batch['id']
        objects_boxes = batch["objects_boxes"]
        objects_caption = batch['objects_caption']
        objects_masks = batch['objects_masks']
        condition_img = batch['condition_img']
        neg_condtion_img = batch['neg_condtion_img']
        objects_masks_maps = batch['objects_masks_maps']
        subject_masks_maps = batch['condition_img_masks_maps']
        target_width = batch['target_width'][0]
        target_height = batch['target_height'][0]

        img_info = batch["img_info"][0]
        filename = img_info["img_id"] + '.jpg'

eval/layout.py
ADDED
@@ -0,0 +1,194 @@
import os
import json
from PIL import Image
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer
import torch
from datasets import load_dataset

if __name__ == "__main__":
    model_id = "openbmb/MiniCPM-V-2_6"
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                      attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
    model = model.eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    # evaluation
    benchmark_repo = 'HuiZhang0812/CreatiDesign_benchmark'  # huggingface repo of benchmark
    benchmark = load_dataset(benchmark_repo, split="test")
    gen_root = "outputs/CreatiDesign_benchmark/images"
    print("processing:", gen_root)
    save_json_path = gen_root.replace("images", "minicpm-vqa.json")
    temp_root = gen_root.replace("images", "images-perarea")
    os.makedirs(temp_root, exist_ok=True)

    skipped_files_log = gen_root.replace("images", "skipped_files.log")
    skipped_files = []
    image_stats = {}

    for case in tqdm(benchmark):
        json_data = json.loads(case["metadata"])
        case_info = json_data["img_info"]
        case_id = case_info["img_id"]
        file_name = f"{case_id}.jpg"
        generated_img_path = os.path.join(gen_root, file_name)
        global_caption = json_data["global_caption"]
        object_annotations = json_data["object_annotations"]
        detial_region_caption_list = [item["bbox_detail_description"] for item in object_annotations]
        region_caption_list = [item["class_name"] for item in object_annotations]
        region_bboxes_list = [item["bbox"] for item in object_annotations]

        img = Image.open(generated_img_path).convert("RGB")
        width, height = img.size

        orignal_img_width = json_data["img_info"]["img_width"]
        orignal_img_height = json_data["img_info"]["img_height"]

        temp_save_root = os.path.join(temp_root, file_name.split('.')[0])
        os.makedirs(temp_save_root, exist_ok=True)

        bbox_count = len(region_caption_list)

        # Initialize scores
        img_score_spatial = 0
        img_score_color = 0
        img_score_texture = 0
        img_score_shape = 0
        for i, (bbox, detial_region_caption, region_caption) in enumerate(zip(region_bboxes_list, detial_region_caption_list, region_caption_list)):
            x1, y1, x2, y2 = bbox
            x1 = int(x1 / orignal_img_width * width)
            y1 = int(y1 / orignal_img_height * height)
            x2 = int(x2 / orignal_img_width * width)
            y2 = int(y2 / orignal_img_height * height)

            cropped_img = img.crop((x1, y1, x2, y2))

            # save crop img
            description = region_caption.replace('/', '')
            detail_description = detial_region_caption.replace('/', '')
            cropped_img_path = os.path.join(temp_save_root, f'{description}.jpg')
            cropped_img.save(cropped_img_path)

            # spatial
            question = f'Is the subject "{description}" present in the image? Strictly answer with "Yes" or "No", without any irrelevant words.'

            msgs = [{'role': 'user', 'content': [cropped_img, question]}]

            res = model.chat(
                image=None,
                msgs=msgs,
                tokenizer=tokenizer,
                seed=42
            )

            if "Yes" in res or "yes" in res:
                score_spatial = 1.0
            else:
                score_spatial = 0.0

            score_color, score_texture, score_shape = 0.0, 0.0, 0.0
            # attribute
            if score_spatial == 1.0:
                # color
                question_color = f'Is the subject in "{description}" in the image consistent with the color described in the detailed description: "{detail_description}"? Strictly answer with "Yes" or "No", without any irrelevant words. If the color is not mentioned in the detailed description, the answer is "Yes".'
                msgs_color = [{'role': 'user', 'content': [cropped_img, question_color]}]

                color_attribute = model.chat(
                    image=None,
                    msgs=msgs_color,
                    tokenizer=tokenizer,
                    seed=42
                )

                if "Yes" in color_attribute or "yes" in color_attribute:
                    score_color = 1.0
            # texture
            if score_spatial == 1.0:
                question_texture = f'Is the subject in "{description}" in the image consistent with the texture described in the detailed description: "{detail_description}"? Strictly answer with "Yes" or "No", without any irrelevant words. If the texture is not mentioned in the detailed description, the answer is "Yes".'
                msgs_texture = [{'role': 'user', 'content': [cropped_img, question_texture]}]

                texture_attribute = model.chat(
                    image=None,
                    msgs=msgs_texture,
                    tokenizer=tokenizer,
                    seed=42
                )
                if "Yes" in texture_attribute or "yes" in texture_attribute:
                    score_texture = 1.0
            # shape
            if score_spatial == 1.0:
                question_shape = f'Is the subject in "{description}" in the image consistent with the shape described in the detailed description: "{detail_description}"? Strictly answer with "Yes" or "No", without any irrelevant words. If the shape is not mentioned in the detailed description, the answer is "Yes".'
                msgs_shape = [{'role': 'user', 'content': [cropped_img, question_shape]}]

                shape_attribute = model.chat(
                    image=None,
                    msgs=msgs_shape,
                    tokenizer=tokenizer,
                    seed=42
                )

                if "Yes" in shape_attribute or "yes" in shape_attribute:
                    score_shape = 1.0

            # Update total scores
            img_score_spatial += score_spatial
            img_score_color += score_color
            img_score_texture += score_texture
            img_score_shape += score_shape

        # Store image stats
        image_stats[os.path.basename(file_name)] = {
            "bbox_count": bbox_count,
            "score_spatial": img_score_spatial,
            "score_color": img_score_color,
            "score_texture": img_score_texture,
            "score_shape": img_score_shape,
        }

        if len(image_stats) % 50 == 0:
            with open(save_json_path, 'w', encoding='utf-8') as json_file:
                json.dump(image_stats, json_file, indent=4)

    # Save the image_stats dictionary to a JSON file
    with open(save_json_path, 'w', encoding='utf-8') as json_file:
        json.dump(image_stats, json_file, indent=4)

    print(f"Image statistics saved to {save_json_path}")

    score_save_path = save_json_path.replace('minicpm-vqa.json', 'minicpm-vqa-score.txt')

    # Read the JSON file containing image statistics
    with open(save_json_path, "r") as f:
        json_data = json.load(f)

    total_num = 0
    total_bbox_num = 0
    total_score_spatial = 0
    total_score_color = 0
    total_score_texture = 0
    total_score_shape = 0

    miss_match = 0
    # Iterate over the JSON data
    for key, value in json_data.items():

        total_num += value["bbox_count"]
        total_score_spatial += value["score_spatial"]
        total_score_color += value["score_color"]
        total_score_texture += value["score_texture"]
        total_score_shape += value["score_shape"]

        if value["bbox_count"] != value["score_spatial"] or value["bbox_count"] != value["score_color"] or value["bbox_count"] != value["score_texture"] or value["bbox_count"] != value["score_shape"]:
            print(key, value["bbox_count"], value["score_spatial"], value["score_color"], value["score_texture"], value["score_shape"])
            miss_match += 1

    print(miss_match)
    # save total_score_spatial, total_score_color, total_score_texture, total_score_shape
    with open(score_save_path, "w") as f:
        f.write(f"Total number of bbox: {total_num}\n")
        f.write(f"Total score of spatial: {total_score_spatial}; Average score of spatial: {round(total_score_spatial/total_num,4)}\n")
        f.write(f"Total score of color: {total_score_color}; Average score of color: {round(total_score_color/total_num,4)}\n")
        f.write(f"Total score of texture: {total_score_texture}; Average score of texture: {round(total_score_texture/total_num,4)}\n")
        f.write(f"Total score of shape: {total_score_shape}; Average score of shape: {round(total_score_shape/total_num,4)}\n")

eval/subject.py
ADDED
@@ -0,0 +1,233 @@
| 1 |
+
import os, sys, json, math, argparse, glob
|
| 2 |
+
from pathlib import Path
|
| 3 |
+
from typing import List
|
| 4 |
+
import torch
|
| 5 |
+
from PIL import Image
|
| 6 |
+
import pandas as pd
|
| 7 |
+
from tqdm import tqdm
|
| 8 |
+
from transformers import (
|
| 9 |
+
AutoProcessor, CLIPModel,
|
| 10 |
+
AutoImageProcessor, AutoModel
|
| 11 |
+
)
|
| 12 |
+
from datasets import load_dataset
|
| 13 |
+
|
| 14 |
+
def scale_bbox(bbox, ori_size, target_size):
|
| 15 |
+
x_min, y_min, x_max, y_max = bbox
|
| 16 |
+
ori_width, ori_height = ori_size
|
| 17 |
+
target_width, target_height = target_size
|
| 18 |
+
|
| 19 |
+
width_ratio = target_width / ori_width
|
| 20 |
+
height_ratio = target_height / ori_height
|
| 21 |
+
|
| 22 |
+
scaled_x_min = int(x_min * width_ratio)
|
| 23 |
+
scaled_y_min = int(y_min * height_ratio)
|
| 24 |
+
scaled_x_max = int(x_max * width_ratio)
|
| 25 |
+
scaled_y_max = int(y_max * height_ratio)
|
| 26 |
+
|
| 27 |
+
scaled_x_min = max(0, scaled_x_min)
|
| 28 |
+
scaled_y_min = max(0, scaled_y_min)
|
| 29 |
+
scaled_x_max = min(target_width, scaled_x_max)
|
| 30 |
+
scaled_y_max = min(target_height, scaled_y_max)
|
| 31 |
+
|
| 32 |
+
return [scaled_x_min, scaled_y_min, scaled_x_max, scaled_y_max]
|
| 33 |
+
|
| 34 |
+
@torch.no_grad()
|
| 35 |
+
def encode_clip(imgs: List[Image.Image]) -> torch.Tensor:
|
| 36 |
+
features_list = []
|
| 37 |
+
for img in imgs:
|
| 38 |
+
inputs = clip_processor(images=img, return_tensors="pt").to(device)
|
| 39 |
+
image_features = clip_model.get_image_features(**inputs)
|
| 40 |
+
|
| 41 |
+
normalized_features = image_features / image_features.norm(dim=1, keepdim=True)
|
| 42 |
+
features_list.append(normalized_features.squeeze().cpu())
|
| 43 |
+
return torch.stack(features_list)
|
| 44 |
+
|
| 45 |
+
@torch.no_grad()
|
| 46 |
+
def encode_dino(imgs: List[Image.Image]) -> torch.Tensor:
|
| 47 |
+
features_list = []
|
| 48 |
+
for img in imgs:
|
| 49 |
+
inputs = dino_processor(images=img, return_tensors="pt").to(device)
|
| 50 |
+
outputs = dino_model(**inputs)
|
| 51 |
+
image_features = outputs.last_hidden_state.mean(dim=1)
|
| 52 |
+
normalized_features = image_features / image_features.norm(dim=1, keepdim=True)
|
| 53 |
+
features_list.append(normalized_features.squeeze().cpu())
|
| 54 |
+
return torch.stack(features_list)
|
| 55 |
+
|
| 56 |
+
@torch.no_grad()
|
| 57 |
+
def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
|
| 58 |
+
return (a @ b.T).squeeze()
|
| 59 |
+
|
| 60 |
+
# ------------- Command line arguments -----------------
|
| 61 |
+
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
| 62 |
+
parser.add_argument("--benchmark_repo", type=str, default="HuiZhang0812/CreatiDesign_benchmark",
|
| 63 |
+
help="Root directory for one thousand cases")
|
| 64 |
+
parser.add_argument("--gen_root", type=str, default="outputs/CreatiDesign_benchmark",
|
| 65 |
+
help="Root directory for generated images (should have images/<case_id>.jpg underneath)")
|
| 66 |
+
parser.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
|
| 67 |
+
parser.add_argument("--outfile", type=str,
|
| 68 |
+
help="Path for result CSV; by default written to gen_root")
|
| 69 |
+
args = parser.parse_args()
|
| 70 |
+
|
| 71 |
+
print("handling:", args.gen_root)
|
| 72 |
+
if args.outfile is None:
|
| 73 |
+
args.outfile = os.path.join(args.gen_root,"scores.csv")
|
| 74 |
+
|
| 75 |
+
# Convert outfile to Path object
|
| 76 |
+
outfile_path = Path(args.outfile)
|
| 77 |
+
|
| 78 |
+
device = torch.device(args.device if torch.cuda.is_available() else "cpu")
|
| 79 |
+
print(f"[INFO] Using device: {device}")
|
| 80 |
+
|
| 81 |
+
# ------------- Loading models -------------------
|
| 82 |
+
print("[INFO] loading CLIP...")
|
| 83 |
+
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
| 84 |
+
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
|
| 85 |
+
clip_model.eval()
|
| 86 |
+
|
| 87 |
+
print("[INFO] loading DINOv2...")
|
| 88 |
+
dino_processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
dino_model = AutoModel.from_pretrained('facebook/dinov2-base').to(device)
dino_model.eval()

benchmark = load_dataset(args.benchmark_repo, split="test")

DEBUG = True
if DEBUG:
    subject_save_root = os.path.join(args.gen_root, "subject-eval-visual")
    os.makedirs(subject_save_root, exist_ok=True)
records = []
for case in tqdm(benchmark):
    json_data = json.loads(case["metadata"])
    case_info = json_data["img_info"]
    case_id = case_info["img_id"]

    # ---------- Read reference subjects ----------
    ref_imgs = case['condition_white_variants']
    if len(ref_imgs) == 0:
        print(f"[WARN] {case_id} has no reference subject, skipping")
        continue

    # ---------- Read generated image ----------
    gen_path = os.path.join(args.gen_root, "images", f"{case_id}.jpg")
    gen_img = Image.open(gen_path).convert("RGB")
    # Get width and height of the generated image
    gen_width, gen_height = gen_img.size
    ref_bbox_id = [item["bbox_idx"] for item in sorted(json_data["subject_annotations"], key=lambda x: x["bbox_idx"])]
    ref_bbox = [item["bbox"] for item in sorted(json_data["subject_annotations"], key=lambda x: x["bbox_idx"])]
    ori_width, ori_height = json_data["img_info"]["img_width"], json_data["img_info"]["img_height"]
    # Extract the corresponding crops from the generated image
    gen_imgs = []
    for bbox in ref_bbox:
        # Scale the bounding box from the original resolution to the generated resolution
        scaled_bbox = scale_bbox(
            bbox,
            (ori_width, ori_height),
            (gen_width, gen_height)
        )

        # Crop the image area
        x_min, y_min, x_max, y_max = scaled_bbox
        cropped_img = gen_img.crop((x_min, y_min, x_max, y_max))
        gen_imgs.append(cropped_img)
    if DEBUG:
        folder_root = os.path.join(subject_save_root, case_id)
        os.makedirs(folder_root, exist_ok=True)
        # Save cropped images
        for i, (img, img_id) in enumerate(zip(gen_imgs, ref_bbox_id)):
            img.save(os.path.join(folder_root, f"{img_id}.png"))

    # ---------- Features ----------
    ref_clip = encode_clip(ref_imgs)  # (n, dim)
    gen_clip = encode_clip(gen_imgs)  # (n, dim)

    ref_dino = encode_dino(ref_imgs)  # (n, dim)
    gen_dino = encode_dino(gen_imgs)  # (n, dim)

    # ---------- Similarity ----------
    clip_sims = torch.nn.functional.cosine_similarity(ref_clip, gen_clip)
    dino_sims = torch.nn.functional.cosine_similarity(ref_dino, gen_dino)

    clip_i = clip_sims.mean().item()
    dino_avg = dino_sims.mean().item()
    m_dino = dino_sims.prod().item()

    records.append(dict(
        case_id=case_id,
        num_subject=len(ref_imgs),
        clip_i=clip_i,
        dino=dino_avg,
        m_dino=m_dino
    ))

# ---------------- Result statistics -----------------
df = pd.DataFrame(records).sort_values("case_id")
overall = df[["clip_i", "dino", "m_dino"]].mean().to_dict()

print("\n========== Overall Average ==========")
for k, v in overall.items():
    print(f"{k:>8}: {v:.6f}")
print("=====================================\n")

# Group by number of subjects
df_by_subjects = {}
avg_by_subjects = {}

# Create a subset for each subject count (1-5)
for i in range(1, 6):
    # Filter records with subject count == i
    subset = df[df["num_subject"] == i]

    if len(subset) > 0:
        # Calculate the average for this group
        subset_avg = subset[["clip_i", "dino", "m_dino"]].mean().to_dict()
        avg_by_subjects[i] = subset_avg

        # Build an average row for this subset
        avg_row = {"case_id": f"average_subject_{i}", "num_subject": i}
        avg_row.update(subset_avg)

        # Append the average row to the subset
        subset_with_avg = pd.concat([subset, pd.DataFrame([avg_row])], ignore_index=True)
        df_by_subjects[i] = subset_with_avg

        # Print the average for this group
        print(f"\n=== Subject {i} Average (n={len(subset)}) ===")
        for k, v in subset_avg.items():
            print(f"{k:>8}: {v:.6f}")

        # Save the subset (fixed path handling)
        subject_path = outfile_path.parent / f"{outfile_path.stem}_subject{i}_location_prior{outfile_path.suffix}"
        subset_with_avg.to_csv(subject_path, index=False, float_format="%.6f")
        print(f"[INFO] Subject {i} results written to {subject_path}")

# Save the overall average to CSV (fixed path handling)
overall_df = pd.DataFrame([overall], index=["overall"])
overall_path = outfile_path.parent / f"{outfile_path.stem}_overall_location_prior{outfile_path.suffix}"
overall_df.to_csv(overall_path, float_format="%.6f")
print(f"[INFO] Overall results written to {overall_path}")

# Write the per-case CSV
df.to_csv(args.outfile, index=False, float_format="%.6f")
print(f"[INFO] Written to {args.outfile}")

# Create a statistics table with the averages of all groups
if avg_by_subjects:
    # Merge the per-group averages into one table
    stats_rows = []
    for num_subject, avg_dict in avg_by_subjects.items():
        row = {"num_subject": num_subject}
        row.update(avg_dict)
        stats_rows.append(row)

    # Add the overall average
    overall_row = {"num_subject": "all"}
    overall_row.update(overall)
    stats_rows.append(overall_row)

    # Create the summary statistics table (fixed path handling)
    stats_df = pd.DataFrame(stats_rows)
    stats_path = outfile_path.parent / f"{outfile_path.stem}_stats_location_prior{outfile_path.suffix}"
    stats_df.to_csv(stats_path, index=False, float_format="%.6f")
    print(f"[INFO] All group statistics written to {stats_path}")
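To make the aggregation above concrete: `clip_i` and `dino` average the per-subject cosine similarities of a case, while `m_dino` multiplies them, so a single badly preserved subject pulls the multiplicative score down much harder than the mean. A minimal, self-contained illustration with hypothetical similarity values (not benchmark data):

import torch

dino_sims = torch.tensor([0.92, 0.88, 0.35])  # hypothetical per-subject DINO similarities; the third subject is poorly preserved
dino_avg = dino_sims.mean().item()            # about 0.717 -> reported as `dino`
m_dino = dino_sims.prod().item()              # about 0.283 -> reported as `m_dino`
print(f"dino={dino_avg:.3f}  m_dino={m_dino:.3f}")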
eval/text.py
ADDED
@@ -0,0 +1,184 @@
import os, json, csv, re, cv2, numpy as np, torch
from tqdm import tqdm
from editdistance import eval as edit_distance
from paddleocr import PaddleOCR
from datasets import load_dataset
# -------------------------------------------------------------------
# Paths
benchmark_repo = 'HuiZhang0812/CreatiDesign_benchmark'  # huggingface repo of benchmark
benchmark = load_dataset(benchmark_repo, split="test")
root_gen = "outputs/CreatiDesign_benchmark/images"

save_root = root_gen.replace("images", "text_eval")  # Output directory
os.makedirs(save_root, exist_ok=True)
DEBUG = True
# -------------------------------------------------------------------
# 1. OCR initialization (must be det=True)
ocr = PaddleOCR(det=True, rec=True, cls=False, use_angle_cls=False, lang='en')

# -------------------------------------------------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"

# -------------------------------------------------------------------
# 3. Utility functions

def spatial_match_iou(det_res, gt_box, gt_text_fmt, iou_thr=0.5):
    best_iou = 0.0
    if det_res is None or len(det_res) == 0:
        return best_iou

    for item in det_res:
        poly = item[0]      # Detection box coordinates
        txt_info = item[1]  # Text information tuple
        txt = txt_info[0]   # Text content

        if min_ned_substring(normalize_text(txt), gt_text_fmt) <= 0.7:  # When calculating spatial, allow some degree of text error
            iou_val = iou(quad2bbox(poly), gt_box)
            best_iou = max(best_iou, iou_val)
    return best_iou

# ① New tool: Minimum NED substring
def min_ned_substring(pred_fmt: str, tgt_fmt: str) -> float:
    """
    Find a substring in pred_fmt with the same length as tgt_fmt that minimizes normalized edit distance.
    Return the minimum value (0 ~ 1).
    """
    Lp, Lg = len(pred_fmt), len(tgt_fmt)
    if Lg == 0:
        return 0.0
    if Lp < Lg:  # If the prediction string is shorter than the target, calculate directly
        return normalized_edit_distance(pred_fmt, tgt_fmt)

    best = Lg  # Maximum possible distance
    for i in range(Lp - Lg + 1):
        sub = pred_fmt[i:i+Lg]
        d = edit_distance(sub, tgt_fmt)
        if d < best:
            best = d
        if best == 0:  # Early exit
            break
    return best / Lg  # Normalize

def normalize_text(txt: str) -> str:
    txt = txt.lower().replace(" ", "")
    return re.sub(r"[^\w\s]", "", txt)

def normalized_edit_distance(pred: str, gt: str) -> float:
    if not gt and not pred:
        return 0.0
    return edit_distance(pred, gt) / max(len(gt), len(pred))

def iou(boxA, boxB) -> float:
    xA, yA = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    xB, yB = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    if inter == 0:
        return 0.0
    areaA = (boxA[2]-boxA[0]) * (boxA[3]-boxA[1])
    areaB = (boxB[2]-boxB[0]) * (boxB[3]-boxB[1])
    return inter / (areaA + areaB - inter)

def quad2bbox(quad):
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return [min(xs), min(ys), max(xs), max(ys)]

def crop(img, box):
    h, w = img.shape[:2]
    x1, y1, x2, y2 = map(int, box)
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w-1, x2), min(h-1, y2)
    if x2 <= x1 or y2 <= y1:
        return np.zeros((1, 1, 3), np.uint8)
    return img[y1:y2, x1:x2]


# -------------------------------------------------------------------
# 4. Main loop
per_img_rows, all_sen_acc, all_ned, all_spatial, text_pairs = [], [], [], [], []

for case in tqdm(benchmark):
    json_data = json.loads(case["metadata"])
    case_info = json_data["img_info"]
    case_id = case_info["img_id"]

    gt_list = json_data["text_list"]  # [{'text':..., 'bbox':[x1,y1,x2,y2]}, ...]
    ori_w, ori_h = json_data["img_info"]["img_width"], json_data["img_info"]["img_height"]

    img_path = os.path.join(root_gen, f"{case_id}.jpg")

    img = cv2.imread(img_path)
    H, W = img.shape[:2]
    wr, hr = W / ori_w, H / ori_h  # GT → Generated image scaling ratio

    # ---------- 1) Full image OCR ----------
    pred_lines = []  # Save OCR line text
    ocr_res = ocr.ocr(img, cls=False)
    if ocr_res and ocr_res[0]:
        for quad, (txt, conf) in ocr_res[0]:
            pred_lines.append(txt.strip())

    # Concatenate into full text and normalize
    pred_full_fmt = normalize_text(" ".join(pred_lines))

    # ==========================================================
    # ③ For each GT sentence, do "substring minimum NED" ---- no longer using IoU
    img_sen_hits, img_neds, img_spatials = [], [], []

    for t_idx, gt in enumerate(gt_list):
        gt_text_orig = gt["text"].replace("\n", " ").strip()
        gt_text_fmt = normalize_text(gt_text_orig)

        # ---- Pure text matching ----
        ned = min_ned_substring(pred_full_fmt, gt_text_fmt)
        acc = 1.0 if ned == 0 else 0.0
        img_sen_hits.append(acc)
        img_neds.append(ned)

        # ---------- Spatial consistency, using IoU ----------
        gt_box = [v*wr if i % 2 == 0 else v*hr for i, v in enumerate(gt["bbox"])]
        det_res = ocr_res[0] if ocr_res else []
        spatial_score = spatial_match_iou(det_res, gt_box, gt_text_fmt)
        img_spatials.append(spatial_score)  # Can be used directly or binarized
        crop_box_int = list(map(int, gt_box))
        img_crop = crop(img, crop_box_int)
        if DEBUG:
            # Save cropped image
            img_crop_for_ocr_save_root = os.path.join(save_root, case_id)
            os.makedirs(img_crop_for_ocr_save_root, exist_ok=True)
            safe_text = gt_text_orig.replace('/', '_').replace('\\', '_')
            safe_filename = f"{t_idx}_{safe_text}.jpg"
            cv2.imwrite(os.path.join(img_crop_for_ocr_save_root, safe_filename), img_crop)

        # --------- Record text pairs ----------
        text_pairs.append({
            "image_id": case_id,
            "text_id": t_idx,
            "gt_original": gt_text_orig,
            "gt_formatted": gt_text_fmt
        })

    # ---------- 3) Summarize to image level ----------
    sen_acc = float(np.mean(img_sen_hits))
    ned = float(np.mean(img_neds))
    spatial = float(np.mean(img_spatials))

    per_img_rows.append([case_id, sen_acc, ned, spatial])
    all_sen_acc.append(sen_acc)
    all_ned.append(ned)
    all_spatial.append(spatial)

# -------------------------------------------------------------------
# 5. Write results
result_root = root_gen.replace("images", "")
csv_perimg = os.path.join(result_root, "text_results_per_image.csv")
with open(csv_perimg, "w", newline='', encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["image_id", "sen_acc", "ned", "score_spatial"])
    w.writerows(per_img_rows)


with open(os.path.join(result_root, "text_overall.txt"), "w", encoding="utf-8") as f:
    f.write(f"Images evaluated : {len(per_img_rows)}\n")
    f.write(f"Global Sen ACC : {np.mean(all_sen_acc):.4f}\n")
    f.write(f"Global NED : {np.mean(all_ned):.4f}\n")
    f.write(f"Global Spatial : {np.mean(all_spatial):.4f}\n")

print("✓ Done! Results saved to", result_root)
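A quick, self-contained check of the matching logic above (toy strings, not benchmark data; assumes the helpers defined in this module are in scope): the ground-truth phrase is normalized, then slid over the normalized full-page OCR string. Sentence accuracy only fires on an exact window match, while the NED keeps partial credit for near misses.

gt = normalize_text("Grand Opening!")                    # -> "grandopening"
pred_full = normalize_text("SALE Grand Openin 50% off")  # OCR dropped the final "g"

ned = min_ned_substring(pred_full, gt)   # best 12-char window differs by one edit -> 1/12, about 0.083
acc = 1.0 if ned == 0 else 0.0           # sentence accuracy stays 0 for the near miss
print(ned, acc)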
modules/common/__pycache__/lora.cpython-310.pyc
ADDED
Binary file (1.17 kB).
modules/common/lora.py
ADDED
@@ -0,0 +1,26 @@
import torch
import torch.nn as nn

class LoRALinearLayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, network_alpha=None, device=None, dtype=None):
        super().__init__()

        self.down = nn.Linear(in_features, rank, bias=False, device=device, dtype=dtype)
        self.up = nn.Linear(rank, out_features, bias=False, device=device, dtype=dtype)
        self.network_alpha = network_alpha
        self.rank = rank

        nn.init.normal_(self.down.weight, std=1 / rank)
        nn.init.zeros_(self.up.weight)

    def forward(self, hidden_states):
        orig_dtype = hidden_states.dtype
        dtype = self.down.weight.dtype

        down_hidden_states = self.down(hidden_states.to(dtype))
        up_hidden_states = self.up(down_hidden_states)

        if self.network_alpha is not None:
            up_hidden_states *= self.network_alpha / self.rank

        return up_hidden_states.to(orig_dtype)
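For context on how this layer is used elsewhere in this commit: the CreatiDesign transformer blocks wrap it in `nn.Sequential(nn.SiLU(), LoRALinearLayer(dim, dim * k, rank, network_alpha))` to produce extra shift/scale/gate terms for the subject and object branches. A minimal sketch of that pattern (standalone; the dimensions and batch size here are assumed for illustration):

import torch
import torch.nn as nn

dim, rank = 3072, 16
adaln_lora = nn.Sequential(nn.SiLU(), LoRALinearLayer(dim, dim * 3, rank=rank, network_alpha=16))

temb = torch.randn(2, dim)                               # pooled timestep/text embedding
shift, scale, gate = adaln_lora(temb).chunk(3, dim=1)    # one chunk per modulation term

# `up` is zero-initialized, so this branch is a no-op at the start of training
# and the frozen FLUX modulation path is left untouched until the LoRA learns.
assert torch.allclose(shift, torch.zeros_like(shift))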
modules/flux/__pycache__/attention_processor_flux_creatidesign.cpython-310.pyc
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e534db89ad40a8e61c4c32b8bbeb3084e7d01a83667a66f426dbdfdf93a13936
size 127465
modules/flux/__pycache__/transformer_flux_creatidesign.cpython-310.pyc
ADDED
Binary file (25.8 kB).
modules/flux/attention_processor_flux_creatidesign.py
ADDED
The diff for this file is too large to render. See raw diff.
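The attention processor itself is not rendered above, so the following is only a rough sketch of the joint-attention idea its call sites in transformer_flux_creatidesign.py imply, not the repository's actual implementation: per-head queries, keys, and values from the object-layout, text, image, and subject streams are concatenated into one sequence, attended with an optional block mask, and split back into per-branch outputs. The names, concatenation order, and shapes here are assumptions.

import torch
import torch.nn.functional as F

def joint_design_attention_sketch(q_obj, k_obj, v_obj,
                                  q_txt, k_txt, v_txt,
                                  q_img, k_img, v_img,
                                  q_sub, k_sub, v_sub,
                                  attention_mask=None):
    # All tensors are (batch, heads, seq_len, head_dim); the stream order is an assumption.
    q = torch.cat([q_obj, q_txt, q_img, q_sub], dim=2)
    k = torch.cat([k_obj, k_txt, k_img, k_sub], dim=2)
    v = torch.cat([v_obj, v_txt, v_img, v_sub], dim=2)
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
    # Split the joint output back into the four streams by their original lengths.
    lens = [q_obj.shape[2], q_txt.shape[2], q_img.shape[2], q_sub.shape[2]]
    obj, txt, img, sub = out.split(lens, dim=2)
    return img, txt, sub, obj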
modules/flux/transformer_flux_creatidesign.py
ADDED
@@ -0,0 +1,1004 @@
| 1 |
+
# Copyright 2024 Black Forest Labs, The HuggingFace Team and The InstantX Team. All rights reserved.
|
| 2 |
+
#
|
| 3 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
| 4 |
+
# you may not use this file except in compliance with the License.
|
| 5 |
+
# You may obtain a copy of the License at
|
| 6 |
+
#
|
| 7 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
| 8 |
+
#
|
| 9 |
+
# Unless required by applicable law or agreed to in writing, software
|
| 10 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
| 11 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 12 |
+
# See the License for the specific language governing permissions and
|
| 13 |
+
# limitations under the License.
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
from typing import Any, Dict, Optional, Tuple, Union
|
| 17 |
+
|
| 18 |
+
import numpy as np
|
| 19 |
+
import torch
|
| 20 |
+
import torch.nn as nn
|
| 21 |
+
import torch.nn.functional as F
|
| 22 |
+
|
| 23 |
+
from diffusers.configuration_utils import ConfigMixin, register_to_config
|
| 24 |
+
from diffusers.loaders import FluxTransformer2DLoadersMixin, FromOriginalModelMixin, PeftAdapterMixin
|
| 25 |
+
from diffusers.models.attention import FeedForward
|
| 26 |
+
from modules.flux.attention_processor_flux_creatidesign import (
|
| 27 |
+
Attention,
|
| 28 |
+
AttentionProcessor,
|
| 29 |
+
DesignFluxAttnProcessor2_0,
|
| 30 |
+
FluxAttnProcessor2_0_NPU,
|
| 31 |
+
FusedFluxAttnProcessor2_0,
|
| 32 |
+
)
|
| 33 |
+
from diffusers.models.modeling_utils import ModelMixin
|
| 34 |
+
from diffusers.models.normalization import AdaLayerNormContinuous, AdaLayerNormZero, AdaLayerNormZeroSingle
|
| 35 |
+
from diffusers.utils import USE_PEFT_BACKEND, is_torch_version, logging, scale_lora_layers, unscale_lora_layers
|
| 36 |
+
from diffusers.utils.import_utils import is_torch_npu_available
|
| 37 |
+
from diffusers.utils.torch_utils import maybe_allow_in_graph
|
| 38 |
+
from diffusers.models.embeddings import CombinedTimestepGuidanceTextProjEmbeddings, CombinedTimestepTextProjEmbeddings, FluxPosEmbed
|
| 39 |
+
from diffusers.models.modeling_outputs import Transformer2DModelOutput
|
| 40 |
+
from modules.semantic_layout.layout_encoder import ObjectLayoutEncoder,ObjectLayoutEncoder_noFourier
|
| 41 |
+
from modules.common.lora import LoRALinearLayer
|
| 42 |
+
|
| 43 |
+
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
@maybe_allow_in_graph
|
| 50 |
+
class FluxSingleTransformerBlock(nn.Module):
|
| 51 |
+
r"""
|
| 52 |
+
A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.
|
| 53 |
+
|
| 54 |
+
Reference: https://arxiv.org/abs/2403.03206
|
| 55 |
+
|
| 56 |
+
Parameters:
|
| 57 |
+
dim (`int`): The number of channels in the input and output.
|
| 58 |
+
num_attention_heads (`int`): The number of heads to use for multi-head attention.
|
| 59 |
+
attention_head_dim (`int`): The number of channels in each head.
|
| 60 |
+
context_pre_only (`bool`): Boolean to determine if we should add some blocks associated with the
|
| 61 |
+
processing of `context` conditions.
|
| 62 |
+
"""
|
| 63 |
+
|
| 64 |
+
def __init__(self, dim, num_attention_heads, attention_head_dim, mlp_ratio=4.0, rank=16,network_alpha=16,lora_weight=1.0,attention_type="design"):
|
| 65 |
+
super().__init__()
|
| 66 |
+
self.mlp_hidden_dim = int(dim * mlp_ratio)
|
| 67 |
+
|
| 68 |
+
self.norm = AdaLayerNormZeroSingle(dim)
|
| 69 |
+
self.proj_mlp = nn.Linear(dim, self.mlp_hidden_dim)
|
| 70 |
+
self.act_mlp = nn.GELU(approximate="tanh")
|
| 71 |
+
self.proj_out = nn.Linear(dim + self.mlp_hidden_dim, dim)
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
if is_torch_npu_available():
|
| 75 |
+
processor = FluxAttnProcessor2_0_NPU()
|
| 76 |
+
else:
|
| 77 |
+
processor = DesignFluxAttnProcessor2_0()
|
| 78 |
+
self.attn = Attention(
|
| 79 |
+
query_dim=dim,
|
| 80 |
+
cross_attention_dim=None,
|
| 81 |
+
dim_head=attention_head_dim,
|
| 82 |
+
heads=num_attention_heads,
|
| 83 |
+
out_dim=dim,
|
| 84 |
+
bias=True,
|
| 85 |
+
processor=processor,
|
| 86 |
+
qk_norm="rms_norm",
|
| 87 |
+
eps=1e-6,
|
| 88 |
+
pre_only=True,
|
| 89 |
+
)
|
| 90 |
+
|
| 91 |
+
self.attention_type = attention_type
|
| 92 |
+
self.rank = rank
|
| 93 |
+
self.network_alpha = network_alpha
|
| 94 |
+
self.lora_weight = lora_weight
|
| 95 |
+
if attention_type == "design":
|
| 96 |
+
self.layernorm_subject = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6) # layernorm for subject
|
| 97 |
+
self.norm_subject_lora = nn.Sequential(
|
| 98 |
+
nn.SiLU(),
|
| 99 |
+
LoRALinearLayer(dim, dim*3, self.rank, self.network_alpha) # lora for adalinear of subject
|
| 100 |
+
)
|
| 101 |
+
self.layernorm_object_bbox = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6) # layernorm for object
|
| 102 |
+
self.norm_object_lora = nn.Sequential(
|
| 103 |
+
nn.SiLU(),
|
| 104 |
+
LoRALinearLayer(dim, dim*3, self.rank, self.network_alpha) # lora for adalinear of object
|
| 105 |
+
)
|
| 106 |
+
def single_block_adaln_lora_forward(self, x, temb, adaln, adaln_lora, layernorm, lora_weight):
|
| 107 |
+
norm_x, x_gate = adaln(x, emb=temb)
|
| 108 |
+
lora_shift_msa, lora_scale_msa, lora_gate_msa = adaln_lora(temb).chunk(3, dim=1)
|
| 109 |
+
norm_x = norm_x + lora_weight * (layernorm(x)* (1 + lora_scale_msa[:, None]) + lora_shift_msa[:, None])
|
| 110 |
+
x_gate = x_gate + lora_weight * lora_gate_msa
|
| 111 |
+
return norm_x, x_gate
|
| 112 |
+
|
| 113 |
+
def forward(
|
| 114 |
+
self,
|
| 115 |
+
hidden_states: torch.Tensor,
|
| 116 |
+
temb: torch.Tensor,
|
| 117 |
+
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
|
| 118 |
+
subject_hidden_states = None,
|
| 119 |
+
subject_rotary_emb = None,
|
| 120 |
+
object_bbox_hidden_states = None,
|
| 121 |
+
object_rotary_emb = None,
|
| 122 |
+
design_scale = 1.0,
|
| 123 |
+
attention_mask=None,
|
| 124 |
+
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
|
| 125 |
+
) -> torch.Tensor:
|
| 126 |
+
residual = hidden_states
|
| 127 |
+
|
| 128 |
+
# handle hidden_states
|
| 129 |
+
norm_hidden_states, gate = self.norm(hidden_states, emb=temb)
|
| 130 |
+
mlp_hidden_states = self.act_mlp(self.proj_mlp(norm_hidden_states))
|
| 131 |
+
#creatidesign
|
| 132 |
+
use_subject = True if self.attention_type == "design" and subject_hidden_states is not None and design_scale!=0.0 else False
|
| 133 |
+
use_object = True if self.attention_type == "design" and object_bbox_hidden_states is not None and design_scale!=0.0 else False
|
| 134 |
+
# handle subejct_hidden_states
|
| 135 |
+
if use_subject:
|
| 136 |
+
residual_subject_hidden_states = subject_hidden_states
|
| 137 |
+
norm_subject_hidden_states, subject_gate = self.single_block_adaln_lora_forward(subject_hidden_states, temb, self.norm, self.norm_subject_lora, self.layernorm_subject, self.lora_weight)
|
| 138 |
+
mlp_subject_hidden_states = self.act_mlp(self.proj_mlp(norm_subject_hidden_states))
|
| 139 |
+
if use_object:
|
| 140 |
+
residual_object_bbox_hidden_states = object_bbox_hidden_states
|
| 141 |
+
norm_object_bbox_hidden_states, object_gate = self.single_block_adaln_lora_forward(object_bbox_hidden_states, temb, self.norm, self.norm_object_lora, self.layernorm_object_bbox, self.lora_weight)
|
| 142 |
+
mlp_object_bbox_hidden_states = self.act_mlp(self.proj_mlp(norm_object_bbox_hidden_states))
|
| 143 |
+
joint_attention_kwargs = joint_attention_kwargs or {}
|
| 144 |
+
attn_output, subject_attn_output, object_attn_output = self.attn(
|
| 145 |
+
hidden_states=norm_hidden_states,
|
| 146 |
+
image_rotary_emb=image_rotary_emb,
|
| 147 |
+
subject_hidden_states=norm_subject_hidden_states,
|
| 148 |
+
subject_rotary_emb=subject_rotary_emb,
|
| 149 |
+
object_bbox_hidden_states=norm_object_bbox_hidden_states,
|
| 150 |
+
object_rotary_emb=object_rotary_emb,
|
| 151 |
+
attention_mask = attention_mask,
|
| 152 |
+
**joint_attention_kwargs,
|
| 153 |
+
)
|
| 154 |
+
# handle hidden states
|
| 155 |
+
hidden_states = torch.cat([attn_output, mlp_hidden_states], dim=2)
|
| 156 |
+
gate = gate.unsqueeze(1)
|
| 157 |
+
hidden_states = gate * self.proj_out(hidden_states)
|
| 158 |
+
hidden_states = residual + hidden_states
|
| 159 |
+
#handle subject_hidden_states
|
| 160 |
+
if use_subject:
|
| 161 |
+
subject_hidden_states = torch.cat([subject_attn_output, mlp_subject_hidden_states], dim=2)
|
| 162 |
+
subject_gate = subject_gate.unsqueeze(1)
|
| 163 |
+
subject_hidden_states = subject_gate * self.proj_out(subject_hidden_states)
|
| 164 |
+
subject_hidden_states = residual_subject_hidden_states + subject_hidden_states
|
| 165 |
+
|
| 166 |
+
#handle object_bbox_hidden_states
|
| 167 |
+
if use_object:
|
| 168 |
+
object_bbox_hidden_states = torch.cat([object_attn_output, mlp_object_bbox_hidden_states], dim=2)
|
| 169 |
+
object_gate = object_gate.unsqueeze(1)
|
| 170 |
+
object_bbox_hidden_states = object_gate * self.proj_out(object_bbox_hidden_states)
|
| 171 |
+
object_bbox_hidden_states = residual_object_bbox_hidden_states + object_bbox_hidden_states
|
| 172 |
+
if hidden_states.dtype == torch.float16:
|
| 173 |
+
hidden_states = hidden_states.clip(-65504, 65504)
|
| 174 |
+
|
| 175 |
+
return hidden_states, subject_hidden_states, object_bbox_hidden_states
|
| 176 |
+
|
| 177 |
+
|
| 178 |
+
@maybe_allow_in_graph
|
| 179 |
+
class FluxTransformerBlock(nn.Module):
|
| 180 |
+
r"""
|
| 181 |
+
A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.
|
| 182 |
+
|
| 183 |
+
Reference: https://arxiv.org/abs/2403.03206
|
| 184 |
+
|
| 185 |
+
Args:
|
| 186 |
+
dim (`int`):
|
| 187 |
+
The embedding dimension of the block.
|
| 188 |
+
num_attention_heads (`int`):
|
| 189 |
+
The number of attention heads to use.
|
| 190 |
+
attention_head_dim (`int`):
|
| 191 |
+
The number of dimensions to use for each attention head.
|
| 192 |
+
qk_norm (`str`, defaults to `"rms_norm"`):
|
| 193 |
+
The normalization to use for the query and key tensors.
|
| 194 |
+
eps (`float`, defaults to `1e-6`):
|
| 195 |
+
The epsilon value to use for the normalization.
|
| 196 |
+
"""
|
| 197 |
+
|
| 198 |
+
def __init__(
|
| 199 |
+
self, dim: int, num_attention_heads: int, attention_head_dim: int, qk_norm: str = "rms_norm", eps: float = 1e-6, rank=16, network_alpha=16, lora_weight=1.0,attention_type="design"
|
| 200 |
+
):
|
| 201 |
+
super().__init__()
|
| 202 |
+
|
| 203 |
+
self.norm1 = AdaLayerNormZero(dim)
|
| 204 |
+
|
| 205 |
+
self.norm1_context = AdaLayerNormZero(dim)
|
| 206 |
+
|
| 207 |
+
if hasattr(F, "scaled_dot_product_attention"):
|
| 208 |
+
processor = DesignFluxAttnProcessor2_0()
|
| 209 |
+
else:
|
| 210 |
+
raise ValueError(
|
| 211 |
+
"The current PyTorch version does not support the `scaled_dot_product_attention` function."
|
| 212 |
+
)
|
| 213 |
+
self.attn = Attention(
|
| 214 |
+
query_dim=dim,
|
| 215 |
+
cross_attention_dim=None,
|
| 216 |
+
added_kv_proj_dim=dim,
|
| 217 |
+
dim_head=attention_head_dim,
|
| 218 |
+
heads=num_attention_heads,
|
| 219 |
+
out_dim=dim,
|
| 220 |
+
context_pre_only=False,
|
| 221 |
+
bias=True,
|
| 222 |
+
processor=processor,
|
| 223 |
+
qk_norm=qk_norm,
|
| 224 |
+
eps=eps,
|
| 225 |
+
)
|
| 226 |
+
|
| 227 |
+
self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
|
| 228 |
+
self.ff = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")
|
| 229 |
+
|
| 230 |
+
self.norm2_context = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
|
| 231 |
+
self.ff_context = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")
|
| 232 |
+
|
| 233 |
+
# let chunk size default to None
|
| 234 |
+
self._chunk_size = None
|
| 235 |
+
self._chunk_dim = 0
|
| 236 |
+
|
| 237 |
+
# creatidesign
|
| 238 |
+
self.attention_type = attention_type
|
| 239 |
+
self.rank = rank
|
| 240 |
+
self.network_alpha = network_alpha
|
| 241 |
+
self.lora_weight = lora_weight
|
| 242 |
+
|
| 243 |
+
if self.attention_type == "design":
|
| 244 |
+
# lora for handle subject (img branch)
|
| 245 |
+
self.norm1_subject_lora = nn.Sequential(
|
| 246 |
+
nn.SiLU(),
|
| 247 |
+
LoRALinearLayer(dim, dim*6, self.rank, self.network_alpha) # lora for adalinear
|
| 248 |
+
)
|
| 249 |
+
self.layernorm_subject = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6) # norm for subject
|
| 250 |
+
|
| 251 |
+
# lora for handle object (txt branch)
|
| 252 |
+
self.norm1_object_lora = nn.Sequential(
|
| 253 |
+
nn.SiLU(),
|
| 254 |
+
LoRALinearLayer(dim, dim*6, self.rank, self.network_alpha) # lora for adalinear
|
| 255 |
+
)
|
| 256 |
+
self.layernorm_object = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6) # norm for object
|
| 257 |
+
|
| 258 |
+
def double_block_adaln_lora_forward(self, x, temb, adaln, adaln_lora, layernorm, lora_weight):
|
| 259 |
+
norm_x, x_gate_msa, x_shift_mlp, x_scale_mlp, x_gate_mlp = adaln(x, emb=temb)
|
| 260 |
+
lora_shift_msa, lora_scale_msa, lora_gate_msa, lora_shift_mlp, lora_scale_mlp, lora_gate_mlp = adaln_lora(temb).chunk(6, dim=1)
|
| 261 |
+
norm_x = norm_x + lora_weight * (layernorm(x)* (1 + lora_scale_msa[:, None]) + lora_shift_msa[:, None])
|
| 262 |
+
x_gate_msa = x_gate_msa + lora_weight*lora_gate_msa
|
| 263 |
+
x_shift_mlp = x_shift_mlp + lora_weight*lora_shift_mlp
|
| 264 |
+
x_scale_mlp = x_scale_mlp + lora_weight*lora_scale_mlp
|
| 265 |
+
x_gate_mlp = x_gate_mlp + lora_weight*lora_gate_mlp
|
| 266 |
+
return norm_x, x_gate_msa, x_shift_mlp, x_scale_mlp, x_gate_mlp
|
| 267 |
+
def forward(
|
| 268 |
+
self,
|
| 269 |
+
hidden_states: torch.Tensor,
|
| 270 |
+
encoder_hidden_states: torch.Tensor,
|
| 271 |
+
temb: torch.Tensor,
|
| 272 |
+
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
|
| 273 |
+
subject_hidden_states = None,
|
| 274 |
+
subject_rotary_emb = None,
|
| 275 |
+
object_bbox_hidden_states = None,
|
| 276 |
+
object_rotary_emb = None,
|
| 277 |
+
design_scale = 1.0,
|
| 278 |
+
attention_mask=None,
|
| 279 |
+
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
|
| 280 |
+
) -> Tuple[torch.Tensor, torch.Tensor]:
|
| 281 |
+
norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(hidden_states, emb=temb)
|
| 282 |
+
|
| 283 |
+
norm_encoder_hidden_states, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.norm1_context(
|
| 284 |
+
encoder_hidden_states, emb=temb
|
| 285 |
+
)
|
| 286 |
+
joint_attention_kwargs = joint_attention_kwargs or {}
|
| 287 |
+
|
| 288 |
+
|
| 289 |
+
use_subject = True if self.attention_type == "design" and subject_hidden_states is not None and design_scale!=0.0 else False
|
| 290 |
+
use_object = True if self.attention_type == "design" and object_bbox_hidden_states is not None and design_scale!=0.0 else False
|
| 291 |
+
if use_subject:
|
| 292 |
+
# subject adalinear
|
| 293 |
+
norm_subject_hidden_states, subject_gate_msa, subject_shift_mlp, subject_scale_mlp, subject_gate_mlp = self.double_block_adaln_lora_forward(
|
| 294 |
+
subject_hidden_states, temb, self.norm1, self.norm1_subject_lora, self.layernorm_subject, self.lora_weight
|
| 295 |
+
)
|
| 296 |
+
if use_object:
|
| 297 |
+
# object adalinear
|
| 298 |
+
norm_object_bbox_hidden_states, object_gate_msa, object_shift_mlp, object_scale_mlp, object_gate_mlp = self.double_block_adaln_lora_forward(
|
| 299 |
+
object_bbox_hidden_states, temb, self.norm1_context, self.norm1_object_lora, self.layernorm_object, self.lora_weight
|
| 300 |
+
)
|
| 301 |
+
|
| 302 |
+
|
| 303 |
+
attn_output, context_attn_output, subject_attn_output, object_attn_output = self.attn(
|
| 304 |
+
hidden_states=norm_hidden_states,
|
| 305 |
+
encoder_hidden_states=norm_encoder_hidden_states,
|
| 306 |
+
image_rotary_emb=image_rotary_emb,
|
| 307 |
+
subject_hidden_states=norm_subject_hidden_states if use_subject else None,
|
| 308 |
+
subject_rotary_emb=subject_rotary_emb if use_subject else None,
|
| 309 |
+
object_bbox_hidden_states=norm_object_bbox_hidden_states if use_object else None,
|
| 310 |
+
object_rotary_emb=object_rotary_emb if use_object else None,
|
| 311 |
+
attention_mask = attention_mask,
|
| 312 |
+
**joint_attention_kwargs,
|
| 313 |
+
)
|
| 314 |
+
|
| 315 |
+
# Process attention outputs for the `hidden_states`.
|
| 316 |
+
attn_output = gate_msa.unsqueeze(1) * attn_output
|
| 317 |
+
hidden_states = hidden_states + attn_output
|
| 318 |
+
|
| 319 |
+
norm_hidden_states = self.norm2(hidden_states)
|
| 320 |
+
norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
|
| 321 |
+
|
| 322 |
+
ff_output = self.ff(norm_hidden_states)
|
| 323 |
+
ff_output = gate_mlp.unsqueeze(1) * ff_output
|
| 324 |
+
|
| 325 |
+
hidden_states = hidden_states + ff_output
|
| 326 |
+
|
| 327 |
+
|
| 328 |
+
|
| 329 |
+
# Process attention outputs for the `encoder_hidden_states`.
|
| 330 |
+
|
| 331 |
+
context_attn_output = c_gate_msa.unsqueeze(1) * context_attn_output
|
| 332 |
+
encoder_hidden_states = encoder_hidden_states + context_attn_output
|
| 333 |
+
|
| 334 |
+
norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states)
|
| 335 |
+
norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]
|
| 336 |
+
|
| 337 |
+
context_ff_output = self.ff_context(norm_encoder_hidden_states)
|
| 338 |
+
encoder_hidden_states = encoder_hidden_states + c_gate_mlp.unsqueeze(1) * context_ff_output
|
| 339 |
+
|
| 340 |
+
|
| 341 |
+
# process attention outputs for the `subject_hidden_states`.
|
| 342 |
+
if use_subject:
|
| 343 |
+
subject_attn_output = subject_gate_msa.unsqueeze(1) * subject_attn_output
|
| 344 |
+
subject_hidden_states = subject_hidden_states + subject_attn_output
|
| 345 |
+
norm_subject_hidden_states = self.norm2(subject_hidden_states)
|
| 346 |
+
norm_subject_hidden_states = norm_subject_hidden_states * (1 + subject_scale_mlp[:, None]) + subject_shift_mlp[:, None]
|
| 347 |
+
subject_ff_output = self.ff(norm_subject_hidden_states)
|
| 348 |
+
subject_hidden_states = subject_hidden_states + subject_gate_mlp.unsqueeze(1) * subject_ff_output
|
| 349 |
+
|
| 350 |
+
# process attention outputs for the `object_bbox_hidden_states`.
|
| 351 |
+
if use_object:
|
| 352 |
+
object_attn_output = object_gate_msa.unsqueeze(1) * object_attn_output
|
| 353 |
+
object_bbox_hidden_states = object_bbox_hidden_states + object_attn_output
|
| 354 |
+
norm_object_bbox_hidden_states = self.norm2_context(object_bbox_hidden_states)
|
| 355 |
+
norm_object_bbox_hidden_states = norm_object_bbox_hidden_states * (1 + object_scale_mlp[:, None]) + object_shift_mlp[:, None]
|
| 356 |
+
object_ff_output = self.ff_context(norm_object_bbox_hidden_states)
|
| 357 |
+
object_bbox_hidden_states = object_bbox_hidden_states + object_gate_mlp.unsqueeze(1) * object_ff_output
|
| 358 |
+
|
| 359 |
+
if encoder_hidden_states.dtype == torch.float16:
|
| 360 |
+
encoder_hidden_states = encoder_hidden_states.clip(-65504, 65504)
|
| 361 |
+
|
| 362 |
+
return encoder_hidden_states, hidden_states, subject_hidden_states, object_bbox_hidden_states
|
| 363 |
+
|
| 364 |
+
|
| 365 |
+
class FluxTransformer2DModel(
|
| 366 |
+
ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin, FluxTransformer2DLoadersMixin
|
| 367 |
+
):
|
| 368 |
+
"""
|
| 369 |
+
The Transformer model introduced in Flux.
|
| 370 |
+
|
| 371 |
+
Reference: https://blackforestlabs.ai/announcing-black-forest-labs/
|
| 372 |
+
|
| 373 |
+
Args:
|
| 374 |
+
patch_size (`int`, defaults to `1`):
|
| 375 |
+
Patch size to turn the input data into small patches.
|
| 376 |
+
in_channels (`int`, defaults to `64`):
|
| 377 |
+
The number of channels in the input.
|
| 378 |
+
out_channels (`int`, *optional*, defaults to `None`):
|
| 379 |
+
The number of channels in the output. If not specified, it defaults to `in_channels`.
|
| 380 |
+
num_layers (`int`, defaults to `19`):
|
| 381 |
+
The number of layers of dual stream DiT blocks to use.
|
| 382 |
+
num_single_layers (`int`, defaults to `38`):
|
| 383 |
+
The number of layers of single stream DiT blocks to use.
|
| 384 |
+
attention_head_dim (`int`, defaults to `128`):
|
| 385 |
+
The number of dimensions to use for each attention head.
|
| 386 |
+
num_attention_heads (`int`, defaults to `24`):
|
| 387 |
+
The number of attention heads to use.
|
| 388 |
+
joint_attention_dim (`int`, defaults to `4096`):
|
| 389 |
+
The number of dimensions to use for the joint attention (embedding/channel dimension of
|
| 390 |
+
`encoder_hidden_states`).
|
| 391 |
+
pooled_projection_dim (`int`, defaults to `768`):
|
| 392 |
+
The number of dimensions to use for the pooled projection.
|
| 393 |
+
guidance_embeds (`bool`, defaults to `False`):
|
| 394 |
+
Whether to use guidance embeddings for guidance-distilled variant of the model.
|
| 395 |
+
axes_dims_rope (`Tuple[int]`, defaults to `(16, 56, 56)`):
|
| 396 |
+
The dimensions to use for the rotary positional embeddings.
|
| 397 |
+
"""
|
| 398 |
+
|
| 399 |
+
_supports_gradient_checkpointing = True
|
| 400 |
+
_no_split_modules = ["FluxTransformerBlock", "FluxSingleTransformerBlock"]
|
| 401 |
+
|
| 402 |
+
@register_to_config
|
| 403 |
+
def __init__(
|
| 404 |
+
self,
|
| 405 |
+
patch_size: int = 1,
|
| 406 |
+
in_channels: int = 64,
|
| 407 |
+
out_channels: Optional[int] = None,
|
| 408 |
+
num_layers: int = 19,
|
| 409 |
+
num_single_layers: int = 38,
|
| 410 |
+
attention_head_dim: int = 128,
|
| 411 |
+
num_attention_heads: int = 24,
|
| 412 |
+
joint_attention_dim: int = 4096,
|
| 413 |
+
pooled_projection_dim: int = 768,
|
| 414 |
+
guidance_embeds: bool = False,
|
| 415 |
+
axes_dims_rope: Tuple[int] = (16, 56, 56),
|
| 416 |
+
attention_type="design",
|
| 417 |
+
max_boxes_token_length=30,
|
| 418 |
+
rank = 16,
|
| 419 |
+
network_alpha = 16,
|
| 420 |
+
lora_weight = 1.0,
|
| 421 |
+
use_attention_mask = True,
|
| 422 |
+
use_objects_masks_maps=True,
|
| 423 |
+
use_subject_masks_maps=True,
|
| 424 |
+
use_layout_encoder=True,
|
| 425 |
+
drop_subject_bg=False,
|
| 426 |
+
gradient_checkpointing=False,
|
| 427 |
+
use_fourier_bbox=True,
|
| 428 |
+
bbox_id_shift=True
|
| 429 |
+
):
|
| 430 |
+
super().__init__()
|
| 431 |
+
# #creatidesign
|
| 432 |
+
self.attention_type = attention_type
|
| 433 |
+
self.max_boxes_token_length = max_boxes_token_length
|
| 434 |
+
self.rank = rank
|
| 435 |
+
self.network_alpha = network_alpha
|
| 436 |
+
self.lora_weight = lora_weight
|
| 437 |
+
self.use_attention_mask = use_attention_mask
|
| 438 |
+
self.use_objects_masks_maps= use_objects_masks_maps
|
| 439 |
+
self.num_attention_heads=num_attention_heads
|
| 440 |
+
self.use_layout_encoder = use_layout_encoder
|
| 441 |
+
self.use_subject_masks_maps = use_subject_masks_maps
|
| 442 |
+
self.drop_subject_bg = drop_subject_bg
|
| 443 |
+
self.gradient_checkpointing = gradient_checkpointing
|
| 444 |
+
self.use_fourier_bbox = use_fourier_bbox
|
| 445 |
+
self.bbox_id_shift = bbox_id_shift
|
| 446 |
+
|
| 447 |
+
|
| 448 |
+
self.out_channels = out_channels or in_channels
|
| 449 |
+
self.inner_dim = num_attention_heads * attention_head_dim
|
| 450 |
+
|
| 451 |
+
self.pos_embed = FluxPosEmbed(theta=10000, axes_dim=axes_dims_rope)
|
| 452 |
+
|
| 453 |
+
text_time_guidance_cls = (
|
| 454 |
+
CombinedTimestepGuidanceTextProjEmbeddings if guidance_embeds else CombinedTimestepTextProjEmbeddings
|
| 455 |
+
)
|
| 456 |
+
self.time_text_embed = text_time_guidance_cls(
|
| 457 |
+
embedding_dim=self.inner_dim, pooled_projection_dim=pooled_projection_dim
|
| 458 |
+
)
|
| 459 |
+
|
| 460 |
+
self.context_embedder = nn.Linear(joint_attention_dim, self.inner_dim)
|
| 461 |
+
self.x_embedder = nn.Linear(in_channels, self.inner_dim)
|
| 462 |
+
|
| 463 |
+
self.transformer_blocks = nn.ModuleList(
|
| 464 |
+
[
|
| 465 |
+
FluxTransformerBlock(
|
| 466 |
+
dim=self.inner_dim,
|
| 467 |
+
num_attention_heads=num_attention_heads,
|
| 468 |
+
attention_head_dim=attention_head_dim,
|
| 469 |
+
attention_type=self.attention_type,
|
| 470 |
+
rank=self.rank,
|
| 471 |
+
network_alpha=self.network_alpha,
|
| 472 |
+
lora_weight=self.lora_weight,
|
| 473 |
+
)
|
| 474 |
+
for _ in range(num_layers)
|
| 475 |
+
]
|
| 476 |
+
)
|
| 477 |
+
|
| 478 |
+
self.single_transformer_blocks = nn.ModuleList(
|
| 479 |
+
[
|
| 480 |
+
FluxSingleTransformerBlock(
|
| 481 |
+
dim=self.inner_dim,
|
| 482 |
+
num_attention_heads=num_attention_heads,
|
| 483 |
+
attention_head_dim=attention_head_dim,
|
| 484 |
+
attention_type=self.attention_type,
|
| 485 |
+
rank=self.rank,
|
| 486 |
+
network_alpha=self.network_alpha,
|
| 487 |
+
lora_weight=self.lora_weight,
|
| 488 |
+
)
|
| 489 |
+
for _ in range(num_single_layers)
|
| 490 |
+
]
|
| 491 |
+
)
|
| 492 |
+
|
| 493 |
+
self.norm_out = AdaLayerNormContinuous(self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6)
|
| 494 |
+
self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True)
|
| 495 |
+
|
| 496 |
+
|
| 497 |
+
if self.attention_type =="design":
|
| 498 |
+
if self.use_layout_encoder:
|
| 499 |
+
if self.use_fourier_bbox:
|
| 500 |
+
self.object_layout_encoder = ObjectLayoutEncoder(
|
| 501 |
+
positive_len=self.inner_dim, out_dim=self.inner_dim, max_boxes_token_length=self.max_boxes_token_length
|
| 502 |
+
)
|
| 503 |
+
else:
|
| 504 |
+
self.object_layout_encoder = ObjectLayoutEncoder_noFourier(
|
| 505 |
+
in_dim=self.inner_dim, out_dim=self.inner_dim
|
| 506 |
+
)
|
| 507 |
+
|
| 508 |
+
|
| 509 |
+
@property
|
| 510 |
+
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
|
| 511 |
+
def attn_processors(self) -> Dict[str, AttentionProcessor]:
|
| 512 |
+
r"""
|
| 513 |
+
Returns:
|
| 514 |
+
`dict` of attention processors: A dictionary containing all attention processors used in the model with
|
| 515 |
+
indexed by its weight name.
|
| 516 |
+
"""
|
| 517 |
+
# set recursively
|
| 518 |
+
processors = {}
|
| 519 |
+
|
| 520 |
+
def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
|
| 521 |
+
if hasattr(module, "get_processor"):
|
| 522 |
+
processors[f"{name}.processor"] = module.get_processor()
|
| 523 |
+
|
| 524 |
+
for sub_name, child in module.named_children():
|
| 525 |
+
fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
|
| 526 |
+
|
| 527 |
+
return processors
|
| 528 |
+
|
| 529 |
+
for name, module in self.named_children():
|
| 530 |
+
fn_recursive_add_processors(name, module, processors)
|
| 531 |
+
|
| 532 |
+
return processors
|
| 533 |
+
|
| 534 |
+
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.set_attn_processor
|
| 535 |
+
def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
|
| 536 |
+
r"""
|
| 537 |
+
Sets the attention processor to use to compute attention.
|
| 538 |
+
|
| 539 |
+
Parameters:
|
| 540 |
+
processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
|
| 541 |
+
The instantiated processor class or a dictionary of processor classes that will be set as the processor
|
| 542 |
+
for **all** `Attention` layers.
|
| 543 |
+
|
| 544 |
+
If `processor` is a dict, the key needs to define the path to the corresponding cross attention
|
| 545 |
+
processor. This is strongly recommended when setting trainable attention processors.
|
| 546 |
+
|
| 547 |
+
"""
|
| 548 |
+
count = len(self.attn_processors.keys())
|
| 549 |
+
|
| 550 |
+
if isinstance(processor, dict) and len(processor) != count:
|
| 551 |
+
raise ValueError(
|
| 552 |
+
f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
|
| 553 |
+
f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
|
| 554 |
+
)
|
| 555 |
+
|
| 556 |
+
def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
|
| 557 |
+
if hasattr(module, "set_processor"):
|
| 558 |
+
if not isinstance(processor, dict):
|
| 559 |
+
module.set_processor(processor)
|
| 560 |
+
else:
|
| 561 |
+
module.set_processor(processor.pop(f"{name}.processor"))
|
| 562 |
+
|
| 563 |
+
for sub_name, child in module.named_children():
|
| 564 |
+
fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
|
| 565 |
+
|
| 566 |
+
for name, module in self.named_children():
|
| 567 |
+
fn_recursive_attn_processor(name, module, processor)
|
| 568 |
+
|
| 569 |
+
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections with FusedAttnProcessor2_0->FusedFluxAttnProcessor2_0
|
| 570 |
+
def fuse_qkv_projections(self):
|
| 571 |
+
"""
|
| 572 |
+
Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
|
| 573 |
+
are fused. For cross-attention modules, key and value projection matrices are fused.
|
| 574 |
+
|
| 575 |
+
<Tip warning={true}>
|
| 576 |
+
|
| 577 |
+
This API is 🧪 experimental.
|
| 578 |
+
|
| 579 |
+
</Tip>
|
| 580 |
+
"""
|
| 581 |
+
self.original_attn_processors = None
|
| 582 |
+
|
| 583 |
+
for _, attn_processor in self.attn_processors.items():
|
| 584 |
+
if "Added" in str(attn_processor.__class__.__name__):
|
| 585 |
+
raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")
|
| 586 |
+
|
| 587 |
+
self.original_attn_processors = self.attn_processors
|
| 588 |
+
|
| 589 |
+
for module in self.modules():
|
| 590 |
+
if isinstance(module, Attention):
|
| 591 |
+
module.fuse_projections(fuse=True)
|
| 592 |
+
|
| 593 |
+
self.set_attn_processor(FusedFluxAttnProcessor2_0())
|
| 594 |
+
|
| 595 |
+
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
|
| 596 |
+
def unfuse_qkv_projections(self):
|
| 597 |
+
"""Disables the fused QKV projection if enabled.
|
| 598 |
+
|
| 599 |
+
<Tip warning={true}>
|
| 600 |
+
|
| 601 |
+
This API is 🧪 experimental.
|
| 602 |
+
|
| 603 |
+
</Tip>
|
| 604 |
+
|
| 605 |
+
"""
|
| 606 |
+
if self.original_attn_processors is not None:
|
| 607 |
+
self.set_attn_processor(self.original_attn_processors)
|
| 608 |
+
|
| 609 |
+
def _set_gradient_checkpointing(self, module, value=False):
|
| 610 |
+
if hasattr(module, "gradient_checkpointing"):
|
| 611 |
+
module.gradient_checkpointing = value
|
| 612 |
+
|
| 613 |
+
def forward(
|
| 614 |
+
self,
|
| 615 |
+
hidden_states: torch.Tensor,
|
| 616 |
+
encoder_hidden_states: torch.Tensor = None,
|
| 617 |
+
pooled_projections: torch.Tensor = None,
|
| 618 |
+
timestep: torch.LongTensor = None,
|
| 619 |
+
img_ids: torch.Tensor = None,
|
| 620 |
+
txt_ids: torch.Tensor = None,
|
| 621 |
+
guidance: torch.Tensor = None,
|
| 622 |
+
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
|
| 623 |
+
controlnet_block_samples=None,
|
| 624 |
+
controlnet_single_block_samples=None,
|
| 625 |
+
return_dict: bool = True,
|
| 626 |
+
controlnet_blocks_repeat: bool = False,
|
| 627 |
+
design_kwargs: dict | None = None,
|
| 628 |
+
design_scale =1.0
|
| 629 |
+
) -> Union[torch.Tensor, Transformer2DModelOutput]:
|
| 630 |
+
"""
|
| 631 |
+
The [`FluxTransformer2DModel`] forward method.
|
| 632 |
+
|
| 633 |
+
Args:
|
| 634 |
+
hidden_states (`torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)`):
|
| 635 |
+
Input `hidden_states`.
|
| 636 |
+
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)`):
|
| 637 |
+
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
| 638 |
+
pooled_projections (`torch.Tensor` of shape `(batch_size, projection_dim)`): Embeddings projected
|
| 639 |
+
from the embeddings of input conditions.
|
| 640 |
+
timestep ( `torch.LongTensor`):
|
| 641 |
+
Used to indicate denoising step.
|
| 642 |
+
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
|
| 643 |
+
A list of tensors that if specified are added to the residuals of transformer blocks.
|
| 644 |
+
joint_attention_kwargs (`dict`, *optional*):
|
| 645 |
+
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
| 646 |
+
`self.processor` in
|
| 647 |
+
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
| 648 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
| 649 |
+
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
| 650 |
+
tuple.
|
| 651 |
+
|
| 652 |
+
Returns:
|
| 653 |
+
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
| 654 |
+
`tuple` where the first element is the sample tensor.
|
| 655 |
+
"""
|
| 656 |
+
if joint_attention_kwargs is not None:
|
| 657 |
+
joint_attention_kwargs = joint_attention_kwargs.copy()
|
| 658 |
+
lora_scale = joint_attention_kwargs.pop("scale", 1.0)
|
| 659 |
+
else:
|
| 660 |
+
lora_scale = 1.0
|
| 661 |
+
|
| 662 |
+
if USE_PEFT_BACKEND:
|
| 663 |
+
# weight the lora layers by setting `lora_scale` for each PEFT layer
|
| 664 |
+
scale_lora_layers(self, lora_scale)
|
| 665 |
+
else:
|
| 666 |
+
if joint_attention_kwargs is not None and joint_attention_kwargs.get("scale", None) is not None:
|
| 667 |
+
logger.warning(
|
| 668 |
+
"Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective."
|
| 669 |
+
)
|
| 670 |
+
|
| 671 |
+
|
| 672 |
+
hidden_states = self.x_embedder(hidden_states)
|
| 673 |
+
|
| 674 |
+
timestep = timestep.to(hidden_states.dtype) * 1000
|
| 675 |
+
if guidance is not None:
|
| 676 |
+
guidance = guidance.to(hidden_states.dtype) * 1000
|
| 677 |
+
else:
|
| 678 |
+
guidance = None
|
| 679 |
+
|
| 680 |
+
temb = (
|
| 681 |
+
self.time_text_embed(timestep, pooled_projections)
|
| 682 |
+
if guidance is None
|
| 683 |
+
else self.time_text_embed(timestep, guidance, pooled_projections)
|
| 684 |
+
)
|
| 685 |
+
encoder_hidden_states = self.context_embedder(encoder_hidden_states)
|
| 686 |
+
|
| 687 |
+
if txt_ids.ndim == 3:
|
| 688 |
+
# logger.warning(
|
| 689 |
+
# "Passing `txt_ids` 3d torch.Tensor is deprecated."
|
| 690 |
+
# "Please remove the batch dimension and pass it as a 2d torch Tensor"
|
| 691 |
+
# )
|
| 692 |
+
txt_ids = txt_ids[0]
|
| 693 |
+
if img_ids.ndim == 3:
|
| 694 |
+
# logger.warning(
|
| 695 |
+
# "Passing `img_ids` 3d torch.Tensor is deprecated."
|
| 696 |
+
# "Please remove the batch dimension and pass it as a 2d torch Tensor"
|
| 697 |
+
# )
|
| 698 |
+
img_ids = img_ids[0]
|
| 699 |
+
|
| 700 |
+
attention_mask_batch = None
|
| 701 |
+
# handle design infos
|
| 702 |
+
if self.attention_type=="design" and design_kwargs is not None:
|
| 703 |
+
|
| 704 |
+
# handle objects
|
| 705 |
+
objects_boxes = design_kwargs['object_layout']['objects_boxes'].to(dtype=hidden_states.dtype, device=hidden_states.device) # [B,10,4]
|
| 706 |
+
objects_bbox_text_embeddings = design_kwargs['object_layout']['bbox_text_embeddings'].to(dtype=hidden_states.dtype, device=hidden_states.device) # [B,10,512,4096]
|
| 707 |
+
objects_bbox_masks = design_kwargs['object_layout']['bbox_masks'].to(dtype=hidden_states.dtype, device=hidden_states.device) # [B,10]
|
| 708 |
+
#token Truncation
|
| 709 |
+
objects_bbox_text_embeddings = objects_bbox_text_embeddings[:,:,:self.max_boxes_token_length,:]# [B,10,30,4096]
|
| 710 |
+
|
| 711 |
+
# [B,10,30,4096] -> [B*10,30,4096] -> [B*10,30,3072] -> [B,10,30,3072]
|
| 712 |
+
B, N, S, C = objects_bbox_text_embeddings.shape
|
| 713 |
+
objects_bbox_text_embeddings = objects_bbox_text_embeddings.reshape(-1, S, C) #[B*10,30,4096]
|
| 714 |
+
objects_bbox_text_embeddings = self.context_embedder(objects_bbox_text_embeddings) #[B*10,30,3072]
|
| 715 |
+
objects_bbox_text_embeddings = objects_bbox_text_embeddings.reshape(B, N, S, -1) # [B,10,30,3072]
|
| 716 |
+
|
| 717 |
+
if self.use_layout_encoder:
|
| 718 |
+
if self.use_fourier_bbox:
|
| 719 |
+
object_bbox_hidden_states = self.object_layout_encoder(
|
| 720 |
+
boxes=objects_boxes,
|
| 721 |
+
masks=objects_bbox_masks,
|
| 722 |
+
positive_embeddings=objects_bbox_text_embeddings,
|
| 723 |
+
)# [B,10,30,3072]
|
| 724 |
+
else:
|
| 725 |
+
object_bbox_hidden_states = self.object_layout_encoder(
|
| 726 |
+
positive_embeddings=objects_bbox_text_embeddings,
|
| 727 |
+
)# [B,10,30,3072]
|
| 728 |
+
else:
|
| 729 |
+
object_bbox_hidden_states = objects_bbox_text_embeddings
|
| 730 |
+
|
| 731 |
+
object_bbox_hidden_states = object_bbox_hidden_states.contiguous().view(B, N*S, -1) # [B,300,3072]
|
| 732 |
+
|
| 733 |
+
# bbox_id shift
|
| 734 |
+
if self.bbox_id_shift:
|
| 735 |
+
object_bbox_ids = -1 * torch.ones(object_bbox_hidden_states.shape[0], object_bbox_hidden_states.shape[1], 3).to(device=object_bbox_hidden_states.device, dtype=object_bbox_hidden_states.dtype)
|
| 736 |
+
else:
|
| 737 |
+
object_bbox_ids = torch.zeros(object_bbox_hidden_states.shape[0], object_bbox_hidden_states.shape[1], 3).to(device=object_bbox_hidden_states.device, dtype=object_bbox_hidden_states.dtype)
|
| 738 |
+
if object_bbox_ids.ndim == 3:
|
| 739 |
+
object_bbox_ids = object_bbox_ids[0] #[300,3]
|
| 740 |
+
object_rotary_emb = self.pos_embed(object_bbox_ids)
|
| 741 |
+
|
| 742 |
+
|
| 743 |
+
|
| 744 |
+
# handle subjects
|
| 745 |
+
subject_hidden_states = design_kwargs['subject_contion']['condition_img']
|
| 746 |
+
subject_hidden_states = self.x_embedder(subject_hidden_states)
|
| 747 |
+
subject_ids = design_kwargs['subject_contion']['condition_img_ids']
|
| 748 |
+
if subject_ids.ndim == 3:
|
| 749 |
+
subject_ids = subject_ids[0]
|
| 750 |
+
subject_rotary_emb = self.pos_embed(subject_ids)
|
| 751 |
+
|
| 752 |
+
|
| 753 |
+
|
| 754 |
+
if self.use_attention_mask:
|
| 755 |
+
num_objects = N
|
| 756 |
+
tokens_per_object = self.max_boxes_token_length
|
| 757 |
+
total_object_tokens = object_bbox_hidden_states.shape[1]
|
| 758 |
+
assert total_object_tokens == num_objects * tokens_per_object, "Total object tokens do not match expected value"
|
| 759 |
+
encoder_tokens = encoder_hidden_states.shape[1]
|
| 760 |
+
img_tokens = hidden_states.shape[1]
|
| 761 |
+
subject_tokens = subject_hidden_states.shape[1]
|
| 762 |
+
# Total number of tokens
|
| 763 |
+
total_tokens = total_object_tokens + encoder_tokens + img_tokens + subject_tokens
|
| 764 |
+
|
| 765 |
+
attention_mask_batch = torch.zeros((B,total_tokens, total_tokens), dtype=hidden_states.dtype,device=hidden_states.device)
|
| 766 |
+
img_H, img_W = design_kwargs['object_layout']['img_token_h'], design_kwargs['object_layout']['img_token_w']
|
| 767 |
+
objects_masks_maps = design_kwargs['object_layout']['objects_masks_maps'].to(dtype=hidden_states.dtype, device=hidden_states.device) # [B,512,512]
|
| 768 |
+
subject_H,subject_W = design_kwargs['subject_contion']['subject_token_h'], design_kwargs['subject_contion']['subject_token_w']
|
| 769 |
+
subject_masks_maps = design_kwargs['subject_contion']['subject_masks_maps'].to(dtype=hidden_states.dtype, device=hidden_states.device) # [B,512,512]
|
| 770 |
+
for m_idx in range(B):
|
| 771 |
+
# Create the base mask (all False/blocked)
|
| 772 |
+
attention_mask = torch.zeros((total_tokens, total_tokens), dtype=hidden_states.dtype,device=hidden_states.device)
|
| 773 |
+
|
| 774 |
+
# Define token ranges
|
| 775 |
+
o_ranges = [] # Ranges for each object
|
| 776 |
+
start_idx = 0
|
| 777 |
+
for i in range(num_objects):
|
| 778 |
+
end_idx = start_idx + tokens_per_object
|
| 779 |
+
o_ranges.append((start_idx, end_idx))
|
| 780 |
+
start_idx = end_idx
|
| 781 |
+
|
| 782 |
+
encoder_range = (total_object_tokens, total_object_tokens + encoder_tokens)
|
| 783 |
+
img_range = (encoder_range[1], encoder_range[1] + img_tokens)
|
| 784 |
+
subject_range = (img_range[1], img_range[1] + subject_tokens)
|
| 785 |
+
|
| 786 |
+
# Fill in the mask
|
| 787 |
+
|
| 788 |
+
# 1. Object self-attention (diagonal o₁-o₁, o₂-o₂, o₃-o₃)
|
| 789 |
+
for o_start, o_end in o_ranges:
|
| 790 |
+
attention_mask[o_start:o_end, o_start:o_end] = True
|
| 791 |
+
|
| 792 |
+
# 2. Objects to img and img to objects
|
| 793 |
+
|
| 794 |
+
if not self.use_objects_masks_maps:
|
| 795 |
+
# all objects can attend to img and img can attend to all objects
|
| 796 |
+
for o_start, o_end in o_ranges:
|
| 797 |
+
attention_mask[o_start:o_end, img_range[0]:img_range[1]] = True
|
| 798 |
+
# img can attend to all
|
| 799 |
+
attention_mask[img_range[0]:img_range[1], :] = True
|
| 800 |
+
else:
|
| 801 |
+
# all objects can only attend to the bbox area (defined by objects_mask) of img
|
| 802 |
+
for idx, (o_start, o_end) in enumerate(o_ranges):
|
| 803 |
+
mask = objects_masks_maps[m_idx][idx]
|
| 804 |
+
mask = torch.nn.functional.interpolate(mask[None, None, :, :], (img_H, img_W), mode='nearest-exact').flatten().unsqueeze(1).repeat(1, tokens_per_object) #shape: [img_tokens,tokens_per_object]
|
| 805 |
+
|
| 806 |
+
# objects to img
|
| 807 |
+
attention_mask[o_start:o_end, img_range[0]:img_range[1]] = mask.transpose(-1, -2)
|
| 808 |
+
|
| 809 |
+
# img to objects
|
| 810 |
+
attention_mask[img_range[0]:img_range[1], o_start:o_end] = mask
|
| 811 |
+
|
| 812 |
+
|
| 813 |
+
# img to img
|
| 814 |
+
attention_mask[img_range[0]:img_range[1], img_range[0]:img_range[1]] = True
|
| 815 |
+
|
| 816 |
+
# img to prompt
|
| 817 |
+
attention_mask[img_range[0]:img_range[1], encoder_range[0]:encoder_range[1]] = True
|
| 818 |
+
|
| 819 |
+
# img to subject
|
| 820 |
+
subject_mask = subject_masks_maps[m_idx][0]
|
| 821 |
+
|
| 822 |
+
if not self.use_subject_masks_maps:
|
| 823 |
+
# all img can attend to subject
|
| 824 |
+
attention_mask[img_range[0]:img_range[1], subject_range[0]:subject_range[1]] = True
|
| 825 |
+
else:
|
| 826 |
+
# img can only attend to the bbox area (defined by subject_mask) of subject
|
| 827 |
+
|
| 828 |
+
subject_mask_img = torch.nn.functional.interpolate(subject_mask[None, None, :, :], (img_H, img_W), mode='nearest-exact').flatten().unsqueeze(1).repeat(1, subject_tokens) #shape: [img_tokens,subject_tokens]
|
| 829 |
+
|
| 830 |
+
# img to subject (restricted to the subject mask area)
|
| 831 |
+
attention_mask[img_range[0]:img_range[1], subject_range[0]:subject_range[1]] = subject_mask_img
|
| 832 |
+
|
| 833 |
+
|
| 834 |
+
|
| 835 |
+
# 3. prompt to prompt, prompt to img, and prompt to subject
|
| 836 |
+
|
| 837 |
+
# prompt to prompt
|
| 838 |
+
attention_mask[encoder_range[0]:encoder_range[1], encoder_range[0]:encoder_range[1]] = True
|
| 839 |
+
# prompt to img
|
| 840 |
+
attention_mask[encoder_range[0]:encoder_range[1], img_range[0]:img_range[1]] = True
|
| 841 |
+
|
| 842 |
+
# prompt to subject
|
| 843 |
+
if not self.use_subject_masks_maps:
|
| 844 |
+
attention_mask[encoder_range[0]:encoder_range[1], subject_range[0]:subject_range[1]] = True
|
| 845 |
+
else:
|
| 846 |
+
subject_mask_prompt = torch.nn.functional.interpolate(subject_mask[None, None, :, :], (subject_H, subject_W), mode='nearest-exact').flatten().unsqueeze(1).repeat(1, encoder_tokens) #shape: [subject_tokens,encoder_tokens]
|
| 847 |
+
attention_mask[encoder_range[0]:encoder_range[1], subject_range[0]:subject_range[1]] = subject_mask_prompt.transpose(-1, -2)
|
| 848 |
+
|
| 849 |
+
|
| 850 |
+
# 4. subject to prompt, subject to img, subject to subject
|
| 851 |
+
# subject to prompt
|
| 852 |
+
if not self.use_subject_masks_maps:
|
| 853 |
+
attention_mask[subject_range[0]:subject_range[1], encoder_range[0]:encoder_range[1]] = True
|
| 854 |
+
else:
|
| 855 |
+
attention_mask[subject_range[0]:subject_range[1], encoder_range[0]:encoder_range[1]] = False
|
| 856 |
+
|
| 857 |
+
# subject to img
|
| 858 |
+
if not self.use_subject_masks_maps:
|
| 859 |
+
attention_mask[subject_range[0]:subject_range[1], img_range[0]:img_range[1]] = True
|
| 860 |
+
else:
|
| 861 |
+
attention_mask[subject_range[0]:subject_range[1], img_range[0]:img_range[1]] = subject_mask_img.transpose(-1, -2)
|
| 862 |
+
# subject to subject
|
| 863 |
+
if not self.use_subject_masks_maps:
|
| 864 |
+
attention_mask[subject_range[0]:subject_range[1], subject_range[0]:subject_range[1]] = True
|
| 865 |
+
else:
|
| 866 |
+
# block the non-subject region
|
| 867 |
+
if not self.drop_subject_bg:
|
| 868 |
+
attention_mask[subject_range[0]:subject_range[1], subject_range[0]:subject_range[1]] = True
|
| 869 |
+
else:
|
| 870 |
+
attention_mask[subject_range[0]:subject_range[1], subject_range[0]:subject_range[1]] = subject_mask_img
|
| 871 |
+
|
| 872 |
+
|
| 873 |
+
attention_mask_batch[m_idx] = attention_mask
|
| 874 |
+
|
| 875 |
+
attention_mask_batch = attention_mask_batch.unsqueeze(1).to(dtype=torch.bool, device=hidden_states.device)#[B,2860,2860]->[B,1,2860,2860]
|
| 876 |
+
|
| 877 |
+
|
| 878 |
+
ids = torch.cat((txt_ids, img_ids), dim=0)
|
| 879 |
+
image_rotary_emb = self.pos_embed(ids)
|
| 880 |
+
|
| 881 |
+
if joint_attention_kwargs is not None and "ip_adapter_image_embeds" in joint_attention_kwargs:
|
| 882 |
+
ip_adapter_image_embeds = joint_attention_kwargs.pop("ip_adapter_image_embeds")
|
| 883 |
+
ip_hidden_states = self.encoder_hid_proj(ip_adapter_image_embeds)
|
| 884 |
+
joint_attention_kwargs.update({"ip_hidden_states": ip_hidden_states})
|
| 885 |
+
|
| 886 |
+
|
| 887 |
+
for index_block, block in enumerate(self.transformer_blocks):
|
| 888 |
+
if torch.is_grad_enabled() and self.gradient_checkpointing:
|
| 889 |
+
|
| 890 |
+
def create_custom_forward(module, return_dict=None):
|
| 891 |
+
def custom_forward(*inputs):
|
| 892 |
+
if return_dict is not None:
|
| 893 |
+
return module(*inputs, return_dict=return_dict)
|
| 894 |
+
else:
|
| 895 |
+
return module(*inputs)
|
| 896 |
+
|
| 897 |
+
return custom_forward
|
| 898 |
+
|
| 899 |
+
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
|
| 900 |
+
encoder_hidden_states, hidden_states, subject_hidden_states, object_bbox_hidden_states = torch.utils.checkpoint.checkpoint(
|
| 901 |
+
create_custom_forward(block),
|
| 902 |
+
hidden_states,
|
| 903 |
+
encoder_hidden_states,
|
| 904 |
+
temb,
|
| 905 |
+
image_rotary_emb,
|
| 906 |
+
subject_hidden_states,
|
| 907 |
+
subject_rotary_emb,
|
| 908 |
+
object_bbox_hidden_states,
|
| 909 |
+
object_rotary_emb,
|
| 910 |
+
design_scale,
|
| 911 |
+
attention_mask_batch,
|
| 912 |
+
**ckpt_kwargs,
|
| 913 |
+
)
|
| 914 |
+
|
| 915 |
+
else:
|
| 916 |
+
encoder_hidden_states, hidden_states, subject_hidden_states, object_bbox_hidden_states = block(
|
| 917 |
+
hidden_states=hidden_states,
|
| 918 |
+
encoder_hidden_states=encoder_hidden_states,
|
| 919 |
+
temb=temb,
|
| 920 |
+
image_rotary_emb=image_rotary_emb,
|
| 921 |
+
subject_hidden_states=subject_hidden_states,
|
| 922 |
+
subject_rotary_emb=subject_rotary_emb,
|
| 923 |
+
object_bbox_hidden_states=object_bbox_hidden_states,
|
| 924 |
+
object_rotary_emb=object_rotary_emb,
|
| 925 |
+
design_scale = design_scale,
|
| 926 |
+
attention_mask = attention_mask_batch,
|
| 927 |
+
joint_attention_kwargs=joint_attention_kwargs,
|
| 928 |
+
)
|
| 929 |
+
|
| 930 |
+
# controlnet residual
|
| 931 |
+
if controlnet_block_samples is not None:
|
| 932 |
+
interval_control = len(self.transformer_blocks) / len(controlnet_block_samples)
|
| 933 |
+
interval_control = int(np.ceil(interval_control))
|
| 934 |
+
# For Xlabs ControlNet.
|
| 935 |
+
if controlnet_blocks_repeat:
|
| 936 |
+
hidden_states = (
|
| 937 |
+
hidden_states + controlnet_block_samples[index_block % len(controlnet_block_samples)]
|
| 938 |
+
)
|
| 939 |
+
else:
|
| 940 |
+
hidden_states = hidden_states + controlnet_block_samples[index_block // interval_control]
|
| 941 |
+
hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)
|
| 942 |
+
|
| 943 |
+
for index_block, block in enumerate(self.single_transformer_blocks):
|
| 944 |
+
if torch.is_grad_enabled() and self.gradient_checkpointing:
|
| 945 |
+
|
| 946 |
+
def create_custom_forward(module, return_dict=None):
|
| 947 |
+
def custom_forward(*inputs):
|
| 948 |
+
if return_dict is not None:
|
| 949 |
+
return module(*inputs, return_dict=return_dict)
|
| 950 |
+
else:
|
| 951 |
+
return module(*inputs)
|
| 952 |
+
|
| 953 |
+
return custom_forward
|
| 954 |
+
|
| 955 |
+
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
|
| 956 |
+
hidden_states, subject_hidden_states, object_bbox_hidden_states = torch.utils.checkpoint.checkpoint(
|
| 957 |
+
create_custom_forward(block),
|
| 958 |
+
hidden_states,
|
| 959 |
+
temb,
|
| 960 |
+
image_rotary_emb,
|
| 961 |
+
subject_hidden_states,
|
| 962 |
+
subject_rotary_emb,
|
| 963 |
+
object_bbox_hidden_states,
|
| 964 |
+
object_rotary_emb,
|
| 965 |
+
design_scale,
|
| 966 |
+
attention_mask_batch,
|
| 967 |
+
**ckpt_kwargs,
|
| 968 |
+
)
|
| 969 |
+
|
| 970 |
+
else:
|
| 971 |
+
hidden_states, subject_hidden_states, object_bbox_hidden_states = block(
|
| 972 |
+
hidden_states=hidden_states,
|
| 973 |
+
temb=temb,
|
| 974 |
+
image_rotary_emb=image_rotary_emb,
|
| 975 |
+
subject_hidden_states=subject_hidden_states,
|
| 976 |
+
subject_rotary_emb=subject_rotary_emb,
|
| 977 |
+
object_bbox_hidden_states=object_bbox_hidden_states,
|
| 978 |
+
object_rotary_emb=object_rotary_emb,
|
| 979 |
+
design_scale=design_scale,
|
| 980 |
+
attention_mask = attention_mask_batch,
|
| 981 |
+
joint_attention_kwargs=joint_attention_kwargs,
|
| 982 |
+
)
|
| 983 |
+
|
| 984 |
+
# controlnet residual
|
| 985 |
+
if controlnet_single_block_samples is not None:
|
| 986 |
+
interval_control = len(self.single_transformer_blocks) / len(controlnet_single_block_samples)
|
| 987 |
+
interval_control = int(np.ceil(interval_control))
|
| 988 |
+
hidden_states[:, encoder_hidden_states.shape[1] :, ...] = (
|
| 989 |
+
hidden_states[:, encoder_hidden_states.shape[1] :, ...]
|
| 990 |
+
+ controlnet_single_block_samples[index_block // interval_control]
|
| 991 |
+
)
|
| 992 |
+
hidden_states = hidden_states[:, encoder_hidden_states.shape[1] :, ...]
|
| 993 |
+
|
| 994 |
+
hidden_states = self.norm_out(hidden_states, temb)
|
| 995 |
+
output = self.proj_out(hidden_states)
|
| 996 |
+
|
| 997 |
+
if USE_PEFT_BACKEND:
|
| 998 |
+
# remove `lora_scale` from each PEFT layer
|
| 999 |
+
unscale_lora_layers(self, lora_scale)
|
| 1000 |
+
|
| 1001 |
+
if not return_dict:
|
| 1002 |
+
return (output,)
|
| 1003 |
+
|
| 1004 |
+
return Transformer2DModelOutput(sample=output)
|
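A minimal sketch (not from the repository) of the block-wise attention layout that the forward pass above builds when neither use_objects_masks_maps nor use_subject_masks_maps is set. The token counts below are toy assumptions, chosen only to illustrate the [object boxes | prompt | image | subject] range ordering used above.

import torch

num_objects, tokens_per_object = 2, 3            # assumed toy sizes
encoder_tokens, img_tokens, subject_tokens = 4, 6, 5
total_object_tokens = num_objects * tokens_per_object
total = total_object_tokens + encoder_tokens + img_tokens + subject_tokens

mask = torch.zeros(total, total, dtype=torch.bool)
o_ranges = [(i * tokens_per_object, (i + 1) * tokens_per_object) for i in range(num_objects)]
enc = (total_object_tokens, total_object_tokens + encoder_tokens)
img = (enc[1], enc[1] + img_tokens)
sub = (img[1], img[1] + subject_tokens)

for s, e in o_ranges:
    mask[s:e, s:e] = True                        # each object attends to itself
    mask[s:e, img[0]:img[1]] = True              # objects attend to the image tokens
mask[img[0]:img[1], :] = True                    # image tokens attend to everything
mask[enc[0]:enc[1], enc[0]:enc[1]] = True        # prompt -> prompt
mask[enc[0]:enc[1], img[0]:img[1]] = True        # prompt -> image
mask[enc[0]:enc[1], sub[0]:sub[1]] = True        # prompt -> subject
mask[sub[0]:sub[1], enc[0]:enc[1]] = True        # subject -> prompt
mask[sub[0]:sub[1], img[0]:img[1]] = True        # subject -> image
mask[sub[0]:sub[1], sub[0]:sub[1]] = True        # subject -> subject
print(mask.shape)                                # torch.Size([21, 21])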
modules/semantic_layout/__pycache__/layout_encoder.cpython-310.pyc
ADDED
|
Binary file (4.26 kB).
|
|
|
modules/semantic_layout/layout_encoder.py
ADDED
|
@@ -0,0 +1,139 @@
|
| 1 |
+
import torch
|
| 2 |
+
import torch.nn as nn
from diffusers.models.activations import FP32SiLU  # needed by the "silu_fp32" branch of PixArtAlphaTextProjection below
|
| 3 |
+
def zero_module(module):
|
| 4 |
+
"""
|
| 5 |
+
Zero out the parameters of a module and return it.
|
| 6 |
+
"""
|
| 7 |
+
for p in module.parameters():
|
| 8 |
+
p.detach().zero_()
|
| 9 |
+
return module
|
| 10 |
+
|
| 11 |
+
def get_fourier_embeds_from_boundingbox(embed_dim, box):
|
| 12 |
+
"""
|
| 13 |
+
Args:
|
| 14 |
+
embed_dim: int
|
| 15 |
+
box: a 3-D tensor [B x N x 4] representing the bounding boxes for GLIGEN pipeline
|
| 16 |
+
Returns:
|
| 17 |
+
[B x N x embed_dim * 2 * 4] tensor of sin/cos positional embeddings over the four box coordinates
|
| 18 |
+
"""
|
| 19 |
+
|
| 20 |
+
batch_size, num_boxes = box.shape[:2]
|
| 21 |
+
|
| 22 |
+
emb = 100 ** (torch.arange(embed_dim) / embed_dim)
|
| 23 |
+
emb = emb[None, None, None].to(device=box.device, dtype=box.dtype)
|
| 24 |
+
emb = emb * box.unsqueeze(-1)
|
| 25 |
+
|
| 26 |
+
emb = torch.stack((emb.sin(), emb.cos()), dim=-1)
|
| 27 |
+
emb = emb.permute(0, 1, 3, 4, 2).reshape(batch_size, num_boxes, embed_dim * 2 * 4)
|
| 28 |
+
|
| 29 |
+
return emb
|
| 30 |
+
|
| 31 |
+
class PixArtAlphaTextProjection(nn.Module):
|
| 32 |
+
"""
|
| 33 |
+
Projects caption embeddings. Also handles dropout for classifier-free guidance.
|
| 34 |
+
|
| 35 |
+
Adapted from https://github.com/PixArt-alpha/PixArt-alpha/blob/master/diffusion/model/nets/PixArt_blocks.py
|
| 36 |
+
"""
|
| 37 |
+
|
| 38 |
+
def __init__(self, in_features, hidden_size, out_features=None, act_fn="gelu_tanh"):
|
| 39 |
+
super().__init__()
|
| 40 |
+
if out_features is None:
|
| 41 |
+
out_features = hidden_size
|
| 42 |
+
self.linear_1 = nn.Linear(in_features=in_features, out_features=hidden_size, bias=True)
|
| 43 |
+
if act_fn == "gelu_tanh":
|
| 44 |
+
self.act_1 = nn.GELU(approximate="tanh")
|
| 45 |
+
elif act_fn == "silu":
|
| 46 |
+
self.act_1 = nn.SiLU()
|
| 47 |
+
elif act_fn == "silu_fp32":
|
| 48 |
+
self.act_1 = FP32SiLU()
|
| 49 |
+
else:
|
| 50 |
+
raise ValueError(f"Unknown activation function: {act_fn}")
|
| 51 |
+
self.linear_2 = nn.Linear(in_features=hidden_size, out_features=out_features, bias=True)
|
| 52 |
+
|
| 53 |
+
def forward(self, caption):
|
| 54 |
+
hidden_states = self.linear_1(caption)
|
| 55 |
+
hidden_states = self.act_1(hidden_states)
|
| 56 |
+
hidden_states = self.linear_2(hidden_states)
|
| 57 |
+
return hidden_states
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
class ObjectLayoutEncoder(nn.Module):
|
| 61 |
+
def __init__(self, positive_len, out_dim, fourier_freqs=8 ,max_boxes_token_length=30):
|
| 62 |
+
super().__init__()
|
| 63 |
+
self.positive_len = positive_len
|
| 64 |
+
self.out_dim = out_dim
|
| 65 |
+
|
| 66 |
+
self.fourier_embedder_dim = fourier_freqs
|
| 67 |
+
self.position_dim = fourier_freqs * 2 * 4 # 2: sin/cos, 4: xyxy #64
|
| 68 |
+
|
| 69 |
+
if isinstance(out_dim, tuple):
|
| 70 |
+
out_dim = out_dim[0]
|
| 71 |
+
|
| 72 |
+
self.null_positive_feature = torch.nn.Parameter(torch.zeros([max_boxes_token_length, self.positive_len]))
|
| 73 |
+
self.null_position_feature = torch.nn.Parameter(torch.zeros([self.position_dim]))
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
self.linears = PixArtAlphaTextProjection(in_features=self.positive_len + self.position_dim,hidden_size=out_dim//2,out_features=out_dim, act_fn="silu")
|
| 77 |
+
|
| 78 |
+
def forward(
|
| 79 |
+
self,
|
| 80 |
+
boxes, # [B,10,4]
|
| 81 |
+
masks, # [B,10]
|
| 82 |
+
positive_embeddings, # [B,10,30,3072]
|
| 83 |
+
):
|
| 84 |
+
|
| 85 |
+
B, N, S, C = positive_embeddings.shape # B: batch_size, N: 10, S: 30, C: 3072
|
| 86 |
+
|
| 87 |
+
positive_embeddings = positive_embeddings.reshape(B*N, S, C) # [B*10,30,3072]
|
| 88 |
+
masks = masks.reshape(B*N, 1, 1) # [B*10,1,1]
|
| 89 |
+
|
| 90 |
+
# Process positional encoding
|
| 91 |
+
xyxy_embedding = get_fourier_embeds_from_boundingbox(self.fourier_embedder_dim, boxes) # [B,10,64]
|
| 92 |
+
xyxy_embedding = xyxy_embedding.reshape(B*N, -1) # [B*10,64]
|
| 93 |
+
xyxy_null = self.null_position_feature.view(1, -1) # [1,64]
|
| 94 |
+
|
| 95 |
+
# Expand positional encoding to match sequence dimension
|
| 96 |
+
xyxy_embedding = xyxy_embedding.unsqueeze(1).expand(-1, S, -1) # [B*10,30,64]
|
| 97 |
+
xyxy_null = xyxy_null.unsqueeze(0).expand(B*N, S, -1) # [B*10,30,64]
|
| 98 |
+
|
| 99 |
+
# Apply mask
|
| 100 |
+
xyxy_embedding = xyxy_embedding * masks + (1 - masks) * xyxy_null # [B*10,30,64]
|
| 101 |
+
|
| 102 |
+
# Process feature encoding
|
| 103 |
+
positive_null = self.null_positive_feature.view(1, S, -1).expand(B*N, -1, -1) # [B*10,30,3072]
|
| 104 |
+
positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null # [B*10,30,3072]
|
| 105 |
+
|
| 106 |
+
# Concatenate positional encoding and feature encoding
|
| 107 |
+
combined = torch.cat([positive_embeddings, xyxy_embedding], dim=-1) # [B*10,30,3072+64]
|
| 108 |
+
|
| 109 |
+
# Process each box's features independently
|
| 110 |
+
objs = self.linears(combined) # [B*10,30,3072]
|
| 111 |
+
|
| 112 |
+
# Restore original shape
|
| 113 |
+
objs = objs.reshape(B, N, S, -1) # [B,10,30,3072]
|
| 114 |
+
|
| 115 |
+
return objs
|
| 116 |
+
|
| 117 |
+
class ObjectLayoutEncoder_noFourier(nn.Module):
|
| 118 |
+
def __init__(self, in_dim, out_dim):
|
| 119 |
+
super().__init__()
|
| 120 |
+
self.in_dim = in_dim
|
| 121 |
+
self.out_dim = out_dim
|
| 122 |
+
|
| 123 |
+
self.linears = PixArtAlphaTextProjection(in_features=self.in_dim,hidden_size=out_dim//2,out_features=out_dim, act_fn="silu")
|
| 124 |
+
|
| 125 |
+
def forward(
|
| 126 |
+
self,
|
| 127 |
+
positive_embeddings, # [B,10,30,3072]
|
| 128 |
+
):
|
| 129 |
+
|
| 130 |
+
B, N, S, C = positive_embeddings.shape # B: batch_size, N: 10, S: 30, C: 3072
|
| 131 |
+
positive_embeddings = positive_embeddings.reshape(B*N, S, C) # [B*10,30,3072]
|
| 132 |
+
|
| 133 |
+
# Process each box's features independently
|
| 134 |
+
objs = self.linears(positive_embeddings) # [B*10,30,3072]
|
| 135 |
+
|
| 136 |
+
# Restore original shape
|
| 137 |
+
objs = objs.reshape(B, N, S, -1) # [B,10,30,3072]
|
| 138 |
+
|
| 139 |
+
return objs
|
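A minimal usage sketch (not from the repository) for the ObjectLayoutEncoder defined above, assuming the repository root is on PYTHONPATH; the sizes follow the shape comments (10 boxes, 30 text tokens per box, 3072-dim features after the transformer's context_embedder).

import torch
from modules.semantic_layout.layout_encoder import ObjectLayoutEncoder

encoder = ObjectLayoutEncoder(positive_len=3072, out_dim=3072, fourier_freqs=8, max_boxes_token_length=30)

B, N, S, C = 2, 10, 30, 3072
boxes = torch.rand(B, N, 4)                  # normalized xyxy boxes
masks = (torch.rand(B, N) > 0.5).float()     # 1 = real box, 0 = padded slot
feats = torch.randn(B, N, S, C)              # per-box text features

objs = encoder(boxes=boxes, masks=masks, positive_embeddings=feats)
print(objs.shape)                            # torch.Size([2, 10, 30, 3072])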
pipeline/__pycache__/pipeline_flux_creatidesign.cpython-310.pyc
ADDED
|
Binary file (32.3 kB).
|
|
|
pipeline/pipeline_flux_creatidesign.py
ADDED
|
@@ -0,0 +1,1068 @@
|
| 1 |
+
# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
|
| 2 |
+
#
|
| 3 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
| 4 |
+
# you may not use this file except in compliance with the License.
|
| 5 |
+
# You may obtain a copy of the License at
|
| 6 |
+
#
|
| 7 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
| 8 |
+
#
|
| 9 |
+
# Unless required by applicable law or agreed to in writing, software
|
| 10 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
| 11 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 12 |
+
# See the License for the specific language governing permissions and
|
| 13 |
+
# limitations under the License.
|
| 14 |
+
|
| 15 |
+
import inspect
|
| 16 |
+
from typing import Any, Callable, Dict, List, Optional, Union
|
| 17 |
+
|
| 18 |
+
import numpy as np
|
| 19 |
+
import torch
|
| 20 |
+
from transformers import (
|
| 21 |
+
CLIPImageProcessor,
|
| 22 |
+
CLIPTextModel,
|
| 23 |
+
CLIPTokenizer,
|
| 24 |
+
CLIPVisionModelWithProjection,
|
| 25 |
+
T5EncoderModel,
|
| 26 |
+
T5TokenizerFast,
|
| 27 |
+
)
|
| 28 |
+
|
| 29 |
+
from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
|
| 30 |
+
from diffusers.loaders import FluxIPAdapterMixin, FluxLoraLoaderMixin, FromSingleFileMixin, TextualInversionLoaderMixin
|
| 31 |
+
from diffusers.models.autoencoders import AutoencoderKL
|
| 32 |
+
from diffusers.models.transformers import FluxTransformer2DModel
|
| 33 |
+
from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
|
| 34 |
+
from diffusers.utils import (
|
| 35 |
+
USE_PEFT_BACKEND,
|
| 36 |
+
is_torch_xla_available,
|
| 37 |
+
logging,
|
| 38 |
+
replace_example_docstring,
|
| 39 |
+
scale_lora_layers,
|
| 40 |
+
unscale_lora_layers,
|
| 41 |
+
)
|
| 42 |
+
from diffusers.utils.torch_utils import randn_tensor
|
| 43 |
+
from diffusers.pipelines.pipeline_utils import DiffusionPipeline
|
| 44 |
+
from diffusers.pipelines.flux.pipeline_output import FluxPipelineOutput
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
if is_torch_xla_available():
|
| 48 |
+
import torch_xla.core.xla_model as xm
|
| 49 |
+
|
| 50 |
+
XLA_AVAILABLE = True
|
| 51 |
+
else:
|
| 52 |
+
XLA_AVAILABLE = False
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
|
| 56 |
+
|
| 57 |
+
EXAMPLE_DOC_STRING = """
|
| 58 |
+
Examples:
|
| 59 |
+
```py
|
| 60 |
+
>>> import torch
|
| 61 |
+
>>> from diffusers import FluxPipeline
|
| 62 |
+
|
| 63 |
+
>>> pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
|
| 64 |
+
>>> pipe.to("cuda")
|
| 65 |
+
>>> prompt = "A cat holding a sign that says hello world"
|
| 66 |
+
>>> # Depending on the variant being used, the pipeline call will slightly vary.
|
| 67 |
+
>>> # Refer to the pipeline documentation for more details.
|
| 68 |
+
>>> image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
|
| 69 |
+
>>> image.save("flux.png")
|
| 70 |
+
```
|
| 71 |
+
"""
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def calculate_shift(
|
| 75 |
+
image_seq_len,
|
| 76 |
+
base_seq_len: int = 256,
|
| 77 |
+
max_seq_len: int = 4096,
|
| 78 |
+
base_shift: float = 0.5,
|
| 79 |
+
max_shift: float = 1.16,
|
| 80 |
+
):
|
| 81 |
+
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
|
| 82 |
+
b = base_shift - m * base_seq_len
|
| 83 |
+
mu = image_seq_len * m + b
|
| 84 |
+
return mu
|
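# A hypothetical sanity check (not part of the original file): a 1024x1024 generation packs
# the latents into (1024 // 16) ** 2 = 4096 image tokens, which the defaults above map to max_shift.
_mu_example = calculate_shift(image_seq_len=4096)
assert abs(_mu_example - 1.16) < 1e-6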
| 85 |
+
|
| 86 |
+
|
| 87 |
+
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
|
| 88 |
+
def retrieve_timesteps(
|
| 89 |
+
scheduler,
|
| 90 |
+
num_inference_steps: Optional[int] = None,
|
| 91 |
+
device: Optional[Union[str, torch.device]] = None,
|
| 92 |
+
timesteps: Optional[List[int]] = None,
|
| 93 |
+
sigmas: Optional[List[float]] = None,
|
| 94 |
+
**kwargs,
|
| 95 |
+
):
|
| 96 |
+
r"""
|
| 97 |
+
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
|
| 98 |
+
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
|
| 99 |
+
|
| 100 |
+
Args:
|
| 101 |
+
scheduler (`SchedulerMixin`):
|
| 102 |
+
The scheduler to get timesteps from.
|
| 103 |
+
num_inference_steps (`int`):
|
| 104 |
+
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
|
| 105 |
+
must be `None`.
|
| 106 |
+
device (`str` or `torch.device`, *optional*):
|
| 107 |
+
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
|
| 108 |
+
timesteps (`List[int]`, *optional*):
|
| 109 |
+
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
|
| 110 |
+
`num_inference_steps` and `sigmas` must be `None`.
|
| 111 |
+
sigmas (`List[float]`, *optional*):
|
| 112 |
+
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
|
| 113 |
+
`num_inference_steps` and `timesteps` must be `None`.
|
| 114 |
+
|
| 115 |
+
Returns:
|
| 116 |
+
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
|
| 117 |
+
second element is the number of inference steps.
|
| 118 |
+
"""
|
| 119 |
+
if timesteps is not None and sigmas is not None:
|
| 120 |
+
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
|
| 121 |
+
if timesteps is not None:
|
| 122 |
+
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
|
| 123 |
+
if not accepts_timesteps:
|
| 124 |
+
raise ValueError(
|
| 125 |
+
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
|
| 126 |
+
f" timestep schedules. Please check whether you are using the correct scheduler."
|
| 127 |
+
)
|
| 128 |
+
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
|
| 129 |
+
timesteps = scheduler.timesteps
|
| 130 |
+
num_inference_steps = len(timesteps)
|
| 131 |
+
elif sigmas is not None:
|
| 132 |
+
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
|
| 133 |
+
if not accept_sigmas:
|
| 134 |
+
raise ValueError(
|
| 135 |
+
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
|
| 136 |
+
f" sigmas schedules. Please check whether you are using the correct scheduler."
|
| 137 |
+
)
|
| 138 |
+
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
|
| 139 |
+
timesteps = scheduler.timesteps
|
| 140 |
+
num_inference_steps = len(timesteps)
|
| 141 |
+
else:
|
| 142 |
+
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
|
| 143 |
+
timesteps = scheduler.timesteps
|
| 144 |
+
return timesteps, num_inference_steps
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
class FluxPipeline(
|
| 148 |
+
DiffusionPipeline,
|
| 149 |
+
FluxLoraLoaderMixin,
|
| 150 |
+
FromSingleFileMixin,
|
| 151 |
+
TextualInversionLoaderMixin,
|
| 152 |
+
FluxIPAdapterMixin,
|
| 153 |
+
):
|
| 154 |
+
r"""
|
| 155 |
+
The Flux pipeline for text-to-image generation.
|
| 156 |
+
|
| 157 |
+
Reference: https://blackforestlabs.ai/announcing-black-forest-labs/
|
| 158 |
+
|
| 159 |
+
Args:
|
| 160 |
+
transformer ([`FluxTransformer2DModel`]):
|
| 161 |
+
Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
|
| 162 |
+
scheduler ([`FlowMatchEulerDiscreteScheduler`]):
|
| 163 |
+
A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
|
| 164 |
+
vae ([`AutoencoderKL`]):
|
| 165 |
+
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
|
| 166 |
+
text_encoder ([`CLIPTextModel`]):
|
| 167 |
+
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
|
| 168 |
+
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
|
| 169 |
+
text_encoder_2 ([`T5EncoderModel`]):
|
| 170 |
+
[T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically
|
| 171 |
+
the [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
|
| 172 |
+
tokenizer (`CLIPTokenizer`):
|
| 173 |
+
Tokenizer of class
|
| 174 |
+
[CLIPTokenizer](https://huggingface.co/docs/transformers/en/model_doc/clip#transformers.CLIPTokenizer).
|
| 175 |
+
tokenizer_2 (`T5TokenizerFast`):
|
| 176 |
+
Second Tokenizer of class
|
| 177 |
+
[T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).
|
| 178 |
+
"""
|
| 179 |
+
|
| 180 |
+
model_cpu_offload_seq = "text_encoder->text_encoder_2->image_encoder->transformer->vae"
|
| 181 |
+
_optional_components = ["image_encoder", "feature_extractor"]
|
| 182 |
+
_callback_tensor_inputs = ["latents", "prompt_embeds"]
|
| 183 |
+
|
| 184 |
+
def __init__(
|
| 185 |
+
self,
|
| 186 |
+
scheduler: FlowMatchEulerDiscreteScheduler,
|
| 187 |
+
vae: AutoencoderKL,
|
| 188 |
+
text_encoder: CLIPTextModel,
|
| 189 |
+
tokenizer: CLIPTokenizer,
|
| 190 |
+
text_encoder_2: T5EncoderModel,
|
| 191 |
+
tokenizer_2: T5TokenizerFast,
|
| 192 |
+
transformer: FluxTransformer2DModel,
|
| 193 |
+
image_encoder: CLIPVisionModelWithProjection = None,
|
| 194 |
+
feature_extractor: CLIPImageProcessor = None,
|
| 195 |
+
):
|
| 196 |
+
super().__init__()
|
| 197 |
+
|
| 198 |
+
self.register_modules(
|
| 199 |
+
vae=vae,
|
| 200 |
+
text_encoder=text_encoder,
|
| 201 |
+
text_encoder_2=text_encoder_2,
|
| 202 |
+
tokenizer=tokenizer,
|
| 203 |
+
tokenizer_2=tokenizer_2,
|
| 204 |
+
transformer=transformer,
|
| 205 |
+
scheduler=scheduler,
|
| 206 |
+
image_encoder=image_encoder,
|
| 207 |
+
feature_extractor=feature_extractor,
|
| 208 |
+
)
|
| 209 |
+
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
|
| 210 |
+
# Flux latents are turned into 2x2 patches and packed. This means the latent width and height has to be divisible
|
| 211 |
+
# by the patch size. So the vae scale factor is multiplied by the patch size to account for this
|
| 212 |
+
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
|
| 213 |
+
self.tokenizer_max_length = (
|
| 214 |
+
self.tokenizer.model_max_length if hasattr(self, "tokenizer") and self.tokenizer is not None else 77
|
| 215 |
+
)
|
| 216 |
+
self.default_sample_size = 128
|
| 217 |
+
|
| 218 |
+
def _get_t5_prompt_embeds(
|
| 219 |
+
self,
|
| 220 |
+
prompt: Union[str, List[str]] = None,
|
| 221 |
+
num_images_per_prompt: int = 1,
|
| 222 |
+
max_sequence_length: int = 512,
|
| 223 |
+
device: Optional[torch.device] = None,
|
| 224 |
+
dtype: Optional[torch.dtype] = None,
|
| 225 |
+
):
|
| 226 |
+
device = device or self._execution_device
|
| 227 |
+
dtype = dtype or self.text_encoder.dtype
|
| 228 |
+
|
| 229 |
+
prompt = [prompt] if isinstance(prompt, str) else prompt
|
| 230 |
+
batch_size = len(prompt)
|
| 231 |
+
|
| 232 |
+
if isinstance(self, TextualInversionLoaderMixin):
|
| 233 |
+
prompt = self.maybe_convert_prompt(prompt, self.tokenizer_2)
|
| 234 |
+
|
| 235 |
+
text_inputs = self.tokenizer_2(
|
| 236 |
+
prompt,
|
| 237 |
+
padding="max_length",
|
| 238 |
+
max_length=max_sequence_length,
|
| 239 |
+
truncation=True,
|
| 240 |
+
return_length=False,
|
| 241 |
+
return_overflowing_tokens=False,
|
| 242 |
+
return_tensors="pt",
|
| 243 |
+
)
|
| 244 |
+
text_input_ids = text_inputs.input_ids
|
| 245 |
+
untruncated_ids = self.tokenizer_2(prompt, padding="longest", return_tensors="pt").input_ids
|
| 246 |
+
|
| 247 |
+
if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(text_input_ids, untruncated_ids):
|
| 248 |
+
removed_text = self.tokenizer_2.batch_decode(untruncated_ids[:, self.tokenizer_max_length - 1 : -1])
|
| 249 |
+
logger.warning(
|
| 250 |
+
"The following part of your input was truncated because `max_sequence_length` is set to "
|
| 251 |
+
f" {max_sequence_length} tokens: {removed_text}"
|
| 252 |
+
)
|
| 253 |
+
|
| 254 |
+
prompt_embeds = self.text_encoder_2(text_input_ids.to(device), output_hidden_states=False)[0]
|
| 255 |
+
|
| 256 |
+
dtype = self.text_encoder_2.dtype
|
| 257 |
+
prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
|
| 258 |
+
|
| 259 |
+
_, seq_len, _ = prompt_embeds.shape
|
| 260 |
+
|
| 261 |
+
# duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
|
| 262 |
+
prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
|
| 263 |
+
prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
|
| 264 |
+
|
| 265 |
+
return prompt_embeds
|
| 266 |
+
|
| 267 |
+
def _get_clip_prompt_embeds(
|
| 268 |
+
self,
|
| 269 |
+
prompt: Union[str, List[str]],
|
| 270 |
+
num_images_per_prompt: int = 1,
|
| 271 |
+
device: Optional[torch.device] = None,
|
| 272 |
+
):
|
| 273 |
+
device = device or self._execution_device
|
| 274 |
+
|
| 275 |
+
prompt = [prompt] if isinstance(prompt, str) else prompt
|
| 276 |
+
batch_size = len(prompt)
|
| 277 |
+
|
| 278 |
+
if isinstance(self, TextualInversionLoaderMixin):
|
| 279 |
+
prompt = self.maybe_convert_prompt(prompt, self.tokenizer)
|
| 280 |
+
|
| 281 |
+
text_inputs = self.tokenizer(
|
| 282 |
+
prompt,
|
| 283 |
+
padding="max_length",
|
| 284 |
+
max_length=self.tokenizer_max_length,
|
| 285 |
+
truncation=True,
|
| 286 |
+
return_overflowing_tokens=False,
|
| 287 |
+
return_length=False,
|
| 288 |
+
return_tensors="pt",
|
| 289 |
+
)
|
| 290 |
+
|
| 291 |
+
text_input_ids = text_inputs.input_ids
|
| 292 |
+
untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids
|
| 293 |
+
if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(text_input_ids, untruncated_ids):
|
| 294 |
+
removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer_max_length - 1 : -1])
|
| 295 |
+
logger.warning(
|
| 296 |
+
"The following part of your input was truncated because CLIP can only handle sequences up to"
|
| 297 |
+
f" {self.tokenizer_max_length} tokens: {removed_text}"
|
| 298 |
+
)
|
| 299 |
+
prompt_embeds = self.text_encoder(text_input_ids.to(device), output_hidden_states=False)
|
| 300 |
+
|
| 301 |
+
# Use pooled output of CLIPTextModel
|
| 302 |
+
prompt_embeds = prompt_embeds.pooler_output
|
| 303 |
+
prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)
|
| 304 |
+
|
| 305 |
+
# duplicate text embeddings for each generation per prompt, using mps friendly method
|
| 306 |
+
prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt)
|
| 307 |
+
prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, -1)
|
| 308 |
+
|
| 309 |
+
return prompt_embeds
|
| 310 |
+
|
| 311 |
+
def encode_prompt(
|
| 312 |
+
self,
|
| 313 |
+
prompt: Union[str, List[str]],
|
| 314 |
+
prompt_2: Union[str, List[str]],
|
| 315 |
+
device: Optional[torch.device] = None,
|
| 316 |
+
num_images_per_prompt: int = 1,
|
| 317 |
+
prompt_embeds: Optional[torch.FloatTensor] = None,
|
| 318 |
+
pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
|
| 319 |
+
max_sequence_length: int = 512,
|
| 320 |
+
lora_scale: Optional[float] = None,
|
| 321 |
+
):
|
| 322 |
+
r"""
|
| 323 |
+
|
| 324 |
+
Args:
|
| 325 |
+
prompt (`str` or `List[str]`, *optional*):
|
| 326 |
+
prompt to be encoded
|
| 327 |
+
prompt_2 (`str` or `List[str]`, *optional*):
|
| 328 |
+
The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
|
| 329 |
+
used in all text-encoders
|
| 330 |
+
device: (`torch.device`):
|
| 331 |
+
torch device
|
| 332 |
+
num_images_per_prompt (`int`):
|
| 333 |
+
number of images that should be generated per prompt
|
| 334 |
+
prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 335 |
+
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
|
| 336 |
+
provided, text embeddings will be generated from `prompt` input argument.
|
| 337 |
+
pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 338 |
+
Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
|
| 339 |
+
If not provided, pooled text embeddings will be generated from `prompt` input argument.
|
| 340 |
+
lora_scale (`float`, *optional*):
|
| 341 |
+
A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
|
| 342 |
+
"""
|
| 343 |
+
device = device or self._execution_device
|
| 344 |
+
|
| 345 |
+
# set lora scale so that monkey patched LoRA
|
| 346 |
+
# function of text encoder can correctly access it
|
| 347 |
+
if lora_scale is not None and isinstance(self, FluxLoraLoaderMixin):
|
| 348 |
+
self._lora_scale = lora_scale
|
| 349 |
+
|
| 350 |
+
# dynamically adjust the LoRA scale
|
| 351 |
+
if self.text_encoder is not None and USE_PEFT_BACKEND:
|
| 352 |
+
scale_lora_layers(self.text_encoder, lora_scale)
|
| 353 |
+
if self.text_encoder_2 is not None and USE_PEFT_BACKEND:
|
| 354 |
+
scale_lora_layers(self.text_encoder_2, lora_scale)
|
| 355 |
+
|
| 356 |
+
prompt = [prompt] if isinstance(prompt, str) else prompt
|
| 357 |
+
|
| 358 |
+
if prompt_embeds is None:
|
| 359 |
+
prompt_2 = prompt_2 or prompt
|
| 360 |
+
prompt_2 = [prompt_2] if isinstance(prompt_2, str) else prompt_2
|
| 361 |
+
|
| 362 |
+
# We only use the pooled prompt output from the CLIPTextModel
|
| 363 |
+
pooled_prompt_embeds = self._get_clip_prompt_embeds(
|
| 364 |
+
prompt=prompt,
|
| 365 |
+
device=device,
|
| 366 |
+
num_images_per_prompt=num_images_per_prompt,
|
| 367 |
+
)
|
| 368 |
+
prompt_embeds = self._get_t5_prompt_embeds(
|
| 369 |
+
prompt=prompt_2,
|
| 370 |
+
num_images_per_prompt=num_images_per_prompt,
|
| 371 |
+
max_sequence_length=max_sequence_length,
|
| 372 |
+
device=device,
|
| 373 |
+
)
|
| 374 |
+
|
| 375 |
+
if self.text_encoder is not None:
|
| 376 |
+
if isinstance(self, FluxLoraLoaderMixin) and USE_PEFT_BACKEND:
|
| 377 |
+
# Retrieve the original scale by scaling back the LoRA layers
|
| 378 |
+
unscale_lora_layers(self.text_encoder, lora_scale)
|
| 379 |
+
|
| 380 |
+
if self.text_encoder_2 is not None:
|
| 381 |
+
if isinstance(self, FluxLoraLoaderMixin) and USE_PEFT_BACKEND:
|
| 382 |
+
# Retrieve the original scale by scaling back the LoRA layers
|
| 383 |
+
unscale_lora_layers(self.text_encoder_2, lora_scale)
|
| 384 |
+
|
| 385 |
+
dtype = self.text_encoder.dtype if self.text_encoder is not None else self.transformer.dtype
|
| 386 |
+
text_ids = torch.zeros(prompt_embeds.shape[1], 3).to(device=device, dtype=dtype)
|
| 387 |
+
|
| 388 |
+
return prompt_embeds, pooled_prompt_embeds, text_ids
|
| 389 |
+
|
| 390 |
+
def encode_image(self, image, device, num_images_per_prompt):
|
| 391 |
+
dtype = next(self.image_encoder.parameters()).dtype
|
| 392 |
+
|
| 393 |
+
if not isinstance(image, torch.Tensor):
|
| 394 |
+
image = self.feature_extractor(image, return_tensors="pt").pixel_values
|
| 395 |
+
|
| 396 |
+
image = image.to(device=device, dtype=dtype)
|
| 397 |
+
image_embeds = self.image_encoder(image).image_embeds
|
| 398 |
+
image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)
|
| 399 |
+
return image_embeds
|
| 400 |
+
|
| 401 |
+
def prepare_ip_adapter_image_embeds(
|
| 402 |
+
self, ip_adapter_image, ip_adapter_image_embeds, device, num_images_per_prompt
|
| 403 |
+
):
|
| 404 |
+
image_embeds = []
|
| 405 |
+
if ip_adapter_image_embeds is None:
|
| 406 |
+
if not isinstance(ip_adapter_image, list):
|
| 407 |
+
ip_adapter_image = [ip_adapter_image]
|
| 408 |
+
|
| 409 |
+
if len(ip_adapter_image) != len(self.transformer.encoder_hid_proj.image_projection_layers):
|
| 410 |
+
raise ValueError(
|
| 411 |
+
f"`ip_adapter_image` must have same length as the number of IP Adapters. Got {len(ip_adapter_image)} images and {len(self.transformer.encoder_hid_proj.image_projection_layers)} IP Adapters."
|
| 412 |
+
)
|
| 413 |
+
|
| 414 |
+
for single_ip_adapter_image, image_proj_layer in zip(
|
| 415 |
+
ip_adapter_image, self.transformer.encoder_hid_proj.image_projection_layers
|
| 416 |
+
):
|
| 417 |
+
single_image_embeds = self.encode_image(single_ip_adapter_image, device, 1)
|
| 418 |
+
|
| 419 |
+
image_embeds.append(single_image_embeds[None, :])
|
| 420 |
+
else:
|
| 421 |
+
for single_image_embeds in ip_adapter_image_embeds:
|
| 422 |
+
image_embeds.append(single_image_embeds)
|
| 423 |
+
|
| 424 |
+
ip_adapter_image_embeds = []
|
| 425 |
+
for i, single_image_embeds in enumerate(image_embeds):
|
| 426 |
+
single_image_embeds = torch.cat([single_image_embeds] * num_images_per_prompt, dim=0)
|
| 427 |
+
single_image_embeds = single_image_embeds.to(device=device)
|
| 428 |
+
ip_adapter_image_embeds.append(single_image_embeds)
|
| 429 |
+
|
| 430 |
+
return ip_adapter_image_embeds
|
| 431 |
+
|
| 432 |
+
def check_inputs(
|
| 433 |
+
self,
|
| 434 |
+
prompt,
|
| 435 |
+
prompt_2,
|
| 436 |
+
height,
|
| 437 |
+
width,
|
| 438 |
+
negative_prompt=None,
|
| 439 |
+
negative_prompt_2=None,
|
| 440 |
+
prompt_embeds=None,
|
| 441 |
+
negative_prompt_embeds=None,
|
| 442 |
+
pooled_prompt_embeds=None,
|
| 443 |
+
negative_pooled_prompt_embeds=None,
|
| 444 |
+
callback_on_step_end_tensor_inputs=None,
|
| 445 |
+
max_sequence_length=None,
|
| 446 |
+
):
|
| 447 |
+
if height % (self.vae_scale_factor * 2) != 0 or width % (self.vae_scale_factor * 2) != 0:
|
| 448 |
+
logger.warning(
|
| 449 |
+
f"`height` and `width` have to be divisible by {self.vae_scale_factor * 2} but are {height} and {width}. Dimensions will be resized accordingly"
|
| 450 |
+
)
|
| 451 |
+
|
| 452 |
+
if callback_on_step_end_tensor_inputs is not None and not all(
|
| 453 |
+
k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
|
| 454 |
+
):
|
| 455 |
+
raise ValueError(
|
| 456 |
+
f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
|
| 457 |
+
)
|
| 458 |
+
|
| 459 |
+
if prompt is not None and prompt_embeds is not None:
|
| 460 |
+
raise ValueError(
|
| 461 |
+
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
|
| 462 |
+
" only forward one of the two."
|
| 463 |
+
)
|
| 464 |
+
elif prompt_2 is not None and prompt_embeds is not None:
|
| 465 |
+
raise ValueError(
|
| 466 |
+
f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
|
| 467 |
+
" only forward one of the two."
|
| 468 |
+
)
|
| 469 |
+
elif prompt is None and prompt_embeds is None:
|
| 470 |
+
raise ValueError(
|
| 471 |
+
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
|
| 472 |
+
)
|
| 473 |
+
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
|
| 474 |
+
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
|
| 475 |
+
elif prompt_2 is not None and (not isinstance(prompt_2, str) and not isinstance(prompt_2, list)):
|
| 476 |
+
raise ValueError(f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}")
|
| 477 |
+
|
| 478 |
+
if negative_prompt is not None and negative_prompt_embeds is not None:
|
| 479 |
+
raise ValueError(
|
| 480 |
+
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
|
| 481 |
+
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
|
| 482 |
+
)
|
| 483 |
+
elif negative_prompt_2 is not None and negative_prompt_embeds is not None:
|
| 484 |
+
raise ValueError(
|
| 485 |
+
f"Cannot forward both `negative_prompt_2`: {negative_prompt_2} and `negative_prompt_embeds`:"
|
| 486 |
+
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
|
| 487 |
+
)
|
| 488 |
+
|
| 489 |
+
if prompt_embeds is not None and negative_prompt_embeds is not None:
|
| 490 |
+
if prompt_embeds.shape != negative_prompt_embeds.shape:
|
| 491 |
+
raise ValueError(
|
| 492 |
+
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
|
| 493 |
+
f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
|
| 494 |
+
f" {negative_prompt_embeds.shape}."
|
| 495 |
+
)
|
| 496 |
+
|
| 497 |
+
if prompt_embeds is not None and pooled_prompt_embeds is None:
|
| 498 |
+
raise ValueError(
|
| 499 |
+
"If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed. Make sure to generate `pooled_prompt_embeds` from the same text encoder that was used to generate `prompt_embeds`."
|
| 500 |
+
)
|
| 501 |
+
if negative_prompt_embeds is not None and negative_pooled_prompt_embeds is None:
|
| 502 |
+
raise ValueError(
|
| 503 |
+
"If `negative_prompt_embeds` are provided, `negative_pooled_prompt_embeds` also have to be passed. Make sure to generate `negative_pooled_prompt_embeds` from the same text encoder that was used to generate `negative_prompt_embeds`."
|
| 504 |
+
)
|
| 505 |
+
|
| 506 |
+
if max_sequence_length is not None and max_sequence_length > 512:
|
| 507 |
+
raise ValueError(f"`max_sequence_length` cannot be greater than 512 but is {max_sequence_length}")
|
| 508 |
+
|
| 509 |
+
@staticmethod
|
| 510 |
+
def _prepare_latent_image_ids(batch_size, height, width, device, dtype, scale_h=1.0, scale_w=1.0):
|
| 511 |
+
latent_image_ids = torch.zeros(height, width, 3)
|
| 512 |
+
latent_image_ids[..., 1] = latent_image_ids[..., 1] + torch.arange(height)[:, None]* scale_h
|
| 513 |
+
latent_image_ids[..., 2] = latent_image_ids[..., 2] + torch.arange(width)[None, :]* scale_w
|
| 514 |
+
|
| 515 |
+
latent_image_id_height, latent_image_id_width, latent_image_id_channels = latent_image_ids.shape
|
| 516 |
+
|
| 517 |
+
latent_image_ids = latent_image_ids.reshape(
|
| 518 |
+
latent_image_id_height * latent_image_id_width, latent_image_id_channels
|
| 519 |
+
)
|
| 520 |
+
|
| 521 |
+
return latent_image_ids.to(device=device, dtype=dtype)
|
| 522 |
+
|
| 523 |
+
@staticmethod
|
| 524 |
+
def _pack_latents(latents, batch_size, num_channels_latents, height, width):
|
| 525 |
+
latents = latents.view(batch_size, num_channels_latents, height // 2, 2, width // 2, 2)
|
| 526 |
+
latents = latents.permute(0, 2, 4, 1, 3, 5)
|
| 527 |
+
latents = latents.reshape(batch_size, (height // 2) * (width // 2), num_channels_latents * 4)
|
| 528 |
+
|
| 529 |
+
return latents
|
| 530 |
+
|
| 531 |
+
@staticmethod
|
| 532 |
+
def _unpack_latents(latents, height, width, vae_scale_factor):
|
| 533 |
+
batch_size, num_patches, channels = latents.shape
|
| 534 |
+
|
| 535 |
+
# VAE applies 8x compression on images but we must also account for packing which requires
|
| 536 |
+
# latent height and width to be divisible by 2.
|
| 537 |
+
height = 2 * (int(height) // (vae_scale_factor * 2))
|
| 538 |
+
width = 2 * (int(width) // (vae_scale_factor * 2))
|
| 539 |
+
|
| 540 |
+
latents = latents.view(batch_size, height // 2, width // 2, channels // 4, 2, 2)
|
| 541 |
+
latents = latents.permute(0, 3, 1, 4, 2, 5)
|
| 542 |
+
|
| 543 |
+
latents = latents.reshape(batch_size, channels // (2 * 2), height, width)
|
| 544 |
+
|
| 545 |
+
return latents
|
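# Aside (not part of the original file): _pack_latents turns a [B, C, H, W] latent into
# [B, (H/2) * (W/2), C * 4] and _unpack_latents reverses it exactly, e.g. assuming a
# vae_scale_factor of 8:
#   x = torch.randn(1, 16, 128, 128)
#   packed = FluxPipeline._pack_latents(x, 1, 16, 128, 128)         # -> [1, 4096, 64]
#   restored = FluxPipeline._unpack_latents(packed, 1024, 1024, 8)  # torch.equal(restored, x)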
| 546 |
+
|
| 547 |
+
def enable_vae_slicing(self):
|
| 548 |
+
r"""
|
| 549 |
+
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
|
| 550 |
+
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
|
| 551 |
+
"""
|
| 552 |
+
self.vae.enable_slicing()
|
| 553 |
+
|
| 554 |
+
def disable_vae_slicing(self):
|
| 555 |
+
r"""
|
| 556 |
+
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
|
| 557 |
+
computing decoding in one step.
|
| 558 |
+
"""
|
| 559 |
+
self.vae.disable_slicing()
|
| 560 |
+
|
| 561 |
+
def enable_vae_tiling(self):
|
| 562 |
+
r"""
|
| 563 |
+
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
|
| 564 |
+
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
|
| 565 |
+
processing larger images.
|
| 566 |
+
"""
|
| 567 |
+
self.vae.enable_tiling()
|
| 568 |
+
|
| 569 |
+
def disable_vae_tiling(self):
|
| 570 |
+
r"""
|
| 571 |
+
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
|
| 572 |
+
computing decoding in one step.
|
| 573 |
+
"""
|
| 574 |
+
self.vae.disable_tiling()
|
| 575 |
+
|
| 576 |
+
    def prepare_latents(
        self,
        batch_size,
        num_channels_latents,
        height,
        width,
        dtype,
        device,
        generator,
        latents=None,
    ):
        # VAE applies 8x compression on images but we must also account for packing which requires
        # latent height and width to be divisible by 2.
        height = 2 * (int(height) // (self.vae_scale_factor * 2))
        width = 2 * (int(width) // (self.vae_scale_factor * 2))

        shape = (batch_size, num_channels_latents, height, width)

        if latents is not None:
            latent_image_ids = self._prepare_latent_image_ids(batch_size, height // 2, width // 2, device, dtype)
            return latents.to(device=device, dtype=dtype), latent_image_ids

        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
        latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)

        latent_image_ids = self._prepare_latent_image_ids(batch_size, height // 2, width // 2, device, dtype)

        return latents, latent_image_ids

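    # Illustrative note (added, not in the original file): when `latents` is supplied to prepare_latents
    # above, it is assumed to already be in packed (batch, seq_len, channels) form; only freshly sampled
    # noise is packed here, together with its matching RoPE image ids.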
    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def joint_attention_kwargs(self):
        return self._joint_attention_kwargs

    @property
    def num_timesteps(self):
        return self._num_timesteps

    @property
    def interrupt(self):
        return self._interrupt

    @torch.no_grad()
    @replace_example_docstring(EXAMPLE_DOC_STRING)
    def __call__(
        self,
        prompt: Union[str, List[str]] = None,
        prompt_2: Optional[Union[str, List[str]]] = None,
        negative_prompt: Union[str, List[str]] = None,
        negative_prompt_2: Optional[Union[str, List[str]]] = None,
        true_cfg_scale: float = 3.0,
        height: Optional[int] = None,
        width: Optional[int] = None,
        num_inference_steps: int = 28,
        sigmas: Optional[List[float]] = None,
        guidance_scale: float = 3.5,
        num_images_per_prompt: Optional[int] = 1,
        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
        latents: Optional[torch.FloatTensor] = None,
        prompt_embeds: Optional[torch.FloatTensor] = None,
        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
        ip_adapter_image: Optional[PipelineImageInput] = None,
        ip_adapter_image_embeds: Optional[List[torch.Tensor]] = None,
        negative_ip_adapter_image: Optional[PipelineImageInput] = None,
        negative_ip_adapter_image_embeds: Optional[List[torch.Tensor]] = None,
        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,
        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        max_sequence_length: int = 512,
        objects_boxes=None,
        objects_caption=None,
        objects_masks=None,
        objects_masks_maps=None,
        subject_masks_maps=None,
        condition_img=None,
        neg_condtion_img=None,
        max_boxes_per_image=10,
        position_delta=[0, -64],
        scale_h=1.0,
        scale_w=1.0,
        use_bucket=False,
    ):
r"""
|
| 672 |
+
Function invoked when calling the pipeline for generation.
|
| 673 |
+
|
| 674 |
+
Args:
|
| 675 |
+
prompt (`str` or `List[str]`, *optional*):
|
| 676 |
+
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
|
| 677 |
+
instead.
|
| 678 |
+
prompt_2 (`str` or `List[str]`, *optional*):
|
| 679 |
+
The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
|
| 680 |
+
will be used instead
|
| 681 |
+
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
|
| 682 |
+
The height in pixels of the generated image. This is set to 1024 by default for the best results.
|
| 683 |
+
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
|
| 684 |
+
The width in pixels of the generated image. This is set to 1024 by default for the best results.
|
| 685 |
+
num_inference_steps (`int`, *optional*, defaults to 50):
|
| 686 |
+
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
|
| 687 |
+
expense of slower inference.
|
| 688 |
+
sigmas (`List[float]`, *optional*):
|
| 689 |
+
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
|
| 690 |
+
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
|
| 691 |
+
will be used.
|
| 692 |
+
guidance_scale (`float`, *optional*, defaults to 7.0):
|
| 693 |
+
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
|
| 694 |
+
`guidance_scale` is defined as `w` of equation 2. of [Imagen
|
| 695 |
+
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
|
| 696 |
+
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
|
| 697 |
+
usually at the expense of lower image quality.
|
| 698 |
+
num_images_per_prompt (`int`, *optional*, defaults to 1):
|
| 699 |
+
The number of images to generate per prompt.
|
| 700 |
+
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
|
| 701 |
+
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
|
| 702 |
+
to make generation deterministic.
|
| 703 |
+
latents (`torch.FloatTensor`, *optional*):
|
| 704 |
+
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
|
| 705 |
+
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
|
| 706 |
+
tensor will ge generated by sampling using the supplied random `generator`.
|
| 707 |
+
prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 708 |
+
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
|
| 709 |
+
provided, text embeddings will be generated from `prompt` input argument.
|
| 710 |
+
pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 711 |
+
Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
|
| 712 |
+
If not provided, pooled text embeddings will be generated from `prompt` input argument.
|
| 713 |
+
ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
|
| 714 |
+
ip_adapter_image_embeds (`List[torch.Tensor]`, *optional*):
|
| 715 |
+
Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
|
| 716 |
+
IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. If not
|
| 717 |
+
provided, embeddings are computed from the `ip_adapter_image` input argument.
|
| 718 |
+
negative_ip_adapter_image:
|
| 719 |
+
(`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
|
| 720 |
+
negative_ip_adapter_image_embeds (`List[torch.Tensor]`, *optional*):
|
| 721 |
+
Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
|
| 722 |
+
IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. If not
|
| 723 |
+
provided, embeddings are computed from the `ip_adapter_image` input argument.
|
| 724 |
+
output_type (`str`, *optional*, defaults to `"pil"`):
|
| 725 |
+
The output format of the generate image. Choose between
|
| 726 |
+
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
|
| 727 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
| 728 |
+
Whether or not to return a [`~pipelines.flux.FluxPipelineOutput`] instead of a plain tuple.
|
| 729 |
+
joint_attention_kwargs (`dict`, *optional*):
|
| 730 |
+
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
| 731 |
+
`self.processor` in
|
| 732 |
+
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
| 733 |
+
callback_on_step_end (`Callable`, *optional*):
|
| 734 |
+
A function that calls at the end of each denoising steps during the inference. The function is called
|
| 735 |
+
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
|
| 736 |
+
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
|
| 737 |
+
`callback_on_step_end_tensor_inputs`.
|
| 738 |
+
callback_on_step_end_tensor_inputs (`List`, *optional*):
|
| 739 |
+
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
|
| 740 |
+
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
|
| 741 |
+
`._callback_tensor_inputs` attribute of your pipeline class.
|
| 742 |
+
max_sequence_length (`int` defaults to 512): Maximum sequence length to use with the `prompt`.
|
| 743 |
+
|
| 744 |
+
Examples:
|
| 745 |
+
|
| 746 |
+
Returns:
|
| 747 |
+
[`~pipelines.flux.FluxPipelineOutput`] or `tuple`: [`~pipelines.flux.FluxPipelineOutput`] if `return_dict`
|
| 748 |
+
is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated
|
| 749 |
+
images.
|
| 750 |
+
"""
|
| 751 |
+
|
| 752 |
+
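        # Illustrative note (added, not in the original file): the CreatiDesign-specific arguments are not
        # covered by the docstring above; the shapes below are inferred from how they are used in this method:
        #   objects_boxes        (B, max_boxes_per_image, 4) normalized xyxy layout boxes
        #   objects_caption      per-box text descriptions, encoded with the T5 text encoder further down
        #   objects_masks        (B, max_boxes_per_image) validity mask over the padded box slots
        #   objects_masks_maps / subject_masks_maps   spatial masks forwarded to the transformer via design_kwargs
        #   condition_img / neg_condtion_img          subject condition image and its negative counterpart
        #   position_delta, scale_h, scale_w, use_bucket   control the RoPE offset/scaling of condition tokens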
        height = height or self.default_sample_size * self.vae_scale_factor
        width = width or self.default_sample_size * self.vae_scale_factor

        # 1. Check inputs. Raise error if not correct
        self.check_inputs(
            prompt,
            prompt_2,
            height,
            width,
            negative_prompt=negative_prompt,
            negative_prompt_2=negative_prompt_2,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
            callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
            max_sequence_length=max_sequence_length,
        )

        self._guidance_scale = guidance_scale
        self._joint_attention_kwargs = joint_attention_kwargs
        self._interrupt = False

        # 2. Define call parameters
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        device = self._execution_device

        lora_scale = (
            self.joint_attention_kwargs.get("scale", None) if self.joint_attention_kwargs is not None else None
        )
        # creatidesign
        negative_prompt = negative_prompt if negative_prompt is not None else [""] * batch_size

        do_true_cfg = true_cfg_scale > 1 and negative_prompt is not None
        (
            prompt_embeds,
            pooled_prompt_embeds,
            text_ids,
        ) = self.encode_prompt(
            prompt=prompt,
            prompt_2=prompt_2,
            prompt_embeds=prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
            device=device,
            num_images_per_prompt=num_images_per_prompt,
            max_sequence_length=max_sequence_length,
            lora_scale=lora_scale,
        )
        if do_true_cfg:
            (
                negative_prompt_embeds,
                negative_pooled_prompt_embeds,
                _,
            ) = self.encode_prompt(
                prompt=negative_prompt,
                prompt_2=negative_prompt_2,
                prompt_embeds=negative_prompt_embeds,
                pooled_prompt_embeds=negative_pooled_prompt_embeds,
                device=device,
                num_images_per_prompt=num_images_per_prompt,
                max_sequence_length=max_sequence_length,
                lora_scale=lora_scale,
            )

        # 4. Prepare latent variables
        num_channels_latents = self.transformer.config.in_channels // 4
        latents, latent_image_ids = self.prepare_latents(
            batch_size * num_images_per_prompt,
            num_channels_latents,
            height,
            width,
            prompt_embeds.dtype,
            device,
            generator,
            latents,
        )

        # 5. Prepare timesteps
        sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
        image_seq_len = latents.shape[1]
        mu = calculate_shift(
            image_seq_len,
            self.scheduler.config.base_image_seq_len,
            self.scheduler.config.max_image_seq_len,
            self.scheduler.config.base_shift,
            self.scheduler.config.max_shift,
        )
        timesteps, num_inference_steps = retrieve_timesteps(
            self.scheduler,
            num_inference_steps,
            device,
            sigmas=sigmas,
            mu=mu,
        )
        num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
        self._num_timesteps = len(timesteps)

        # handle guidance
        if self.transformer.config.guidance_embeds:
            guidance = torch.full([1], guidance_scale, device=device, dtype=torch.float32)
            guidance = guidance.expand(latents.shape[0])
        else:
            guidance = None

        if (ip_adapter_image is not None or ip_adapter_image_embeds is not None) and (
            negative_ip_adapter_image is None and negative_ip_adapter_image_embeds is None
        ):
            negative_ip_adapter_image = np.zeros((width, height, 3), dtype=np.uint8)
        elif (ip_adapter_image is None and ip_adapter_image_embeds is None) and (
            negative_ip_adapter_image is not None or negative_ip_adapter_image_embeds is not None
        ):
            ip_adapter_image = np.zeros((width, height, 3), dtype=np.uint8)

        if self.joint_attention_kwargs is None:
            self._joint_attention_kwargs = {}

        image_embeds = None
        negative_image_embeds = None
        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
                ip_adapter_image,
                ip_adapter_image_embeds,
                device,
                batch_size * num_images_per_prompt,
            )
        if negative_ip_adapter_image is not None or negative_ip_adapter_image_embeds is not None:
            negative_image_embeds = self.prepare_ip_adapter_image_embeds(
                negative_ip_adapter_image,
                negative_ip_adapter_image_embeds,
                device,
                batch_size * num_images_per_prompt,
            )

        # creatidesign
        objects_boxes = objects_boxes.to(device=device, dtype=latents.dtype).repeat_interleave(batch_size, dim=0)
        objects_masks = objects_masks.to(device=device, dtype=latents.dtype).repeat_interleave(batch_size, dim=0)
        objects_masks_maps = objects_masks_maps.to(device=device, dtype=latents.dtype).repeat_interleave(batch_size, dim=0)
        subject_masks_maps = subject_masks_maps.to(device=device, dtype=latents.dtype).repeat_interleave(batch_size, dim=0)
        N = len(objects_caption[0])
        print("N", N)
        bbox_text_embeddings = torch.zeros(
            max_boxes_per_image, max_sequence_length, 4096, device=device, dtype=latents.dtype
        )
        if N > 0:
            bbox_text_embeddings_temp, _, _ = self.encode_prompt(
                prompt=objects_caption[0],
                prompt_2=None,
                device=device,
                num_images_per_prompt=num_images_per_prompt,
                max_sequence_length=max_sequence_length,
            )
            bbox_text_embeddings[:N] = bbox_text_embeddings_temp
        bbox_text_embeddings = bbox_text_embeddings.unsqueeze(0).to(device=device, dtype=latents.dtype).repeat_interleave(batch_size, dim=0)  # [B, max_boxes_per_image, max_sequence_length, 4096]

        # Convert condition images to latent space
        condition_img = condition_img.to(device=device, dtype=self.vae.dtype).repeat_interleave(batch_size, dim=0)
        condition_img_input = self.vae.encode(condition_img).latent_dist.sample()
        condition_img_input = (condition_img_input - self.vae.config.shift_factor) * self.vae.config.scaling_factor
        condition_img_input = condition_img_input.to(dtype=latents.dtype)
        condition_latent_image_ids = self._prepare_latent_image_ids(
            condition_img_input.shape[0],
            condition_img_input.shape[2] // 2,
            condition_img_input.shape[3] // 2,
            device,
            latents.dtype,
            scale_h=scale_h,
            scale_w=scale_w,
        )

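        # Illustrative note (added, not in the original file): each layout box gets its own T5 embedding of
        # shape (max_sequence_length, 4096); empty slots up to max_boxes_per_image stay zero and are presumably
        # switched off downstream via the bbox_masks entry passed to the transformer in design_kwargs below.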
        # shift condition image ids
        if use_bucket:
            # offset determined by condition image width and scale
            condition_latent_image_ids[:, 1] += 0  # H dimension unchanged
            condition_latent_image_ids[:, 2] += -1 * (condition_img_input.shape[3] * scale_w // 2)
        else:
            # shift condition image ids
            condition_latent_image_ids[:, 1] += position_delta[0]  # H dimension unchanged
            condition_latent_image_ids[:, 2] += position_delta[1]  # W dimension shift left

        packed_clean_condition_input = self._pack_latents(
            condition_img_input,
            batch_size=condition_img_input.shape[0],
            num_channels_latents=condition_img_input.shape[1],
            height=condition_img_input.shape[2],
            width=condition_img_input.shape[3],
        )

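        # Illustrative note (added, not in the original file): shifting the W coordinate of the condition
        # tokens by a negative offset places them on a virtual canvas to the left of the generated image, so
        # their RoPE positions do not collide with the target latents. With use_bucket the offset is derived
        # from the scaled condition width; otherwise the fixed position_delta (e.g. [0, -64] at 1024) is used.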
        design_kwargs = {
            "object_layout": {
                "objects_boxes": objects_boxes,  # [B, max_boxes_per_image, 4]
                "bbox_text_embeddings": bbox_text_embeddings,  # [B, max_boxes_per_image, max_sequence_length, 4096]
                "bbox_masks": objects_masks,  # [B, max_boxes_per_image]
                "objects_masks_maps": objects_masks_maps,
                "img_token_h": (int(height) // (self.vae_scale_factor * 2)),
                "img_token_w": (int(width) // (self.vae_scale_factor * 2)),
            },
            "subject_contion": {
                "condition_img": packed_clean_condition_input,
                "subject_masks_maps": subject_masks_maps,
                "condition_img_ids": condition_latent_image_ids,
                "subject_token_h": condition_img_input.shape[2] // 2,
                "subject_token_w": condition_img_input.shape[3] // 2,
            },
        }

        neg_objects_masks = torch.zeros_like(objects_masks).to(device=device, dtype=latents.dtype)

        neg_condtion_img = neg_condtion_img.to(device=device, dtype=self.vae.dtype).repeat_interleave(batch_size, dim=0)
        neg_condtion_img_input = self.vae.encode(neg_condtion_img).latent_dist.sample()
        neg_condtion_img_input = (neg_condtion_img_input - self.vae.config.shift_factor) * self.vae.config.scaling_factor
        neg_condtion_img_input = neg_condtion_img_input.to(dtype=latents.dtype)
        neg_condition_latent_image_ids = self._prepare_latent_image_ids(
            neg_condtion_img_input.shape[0],
            neg_condtion_img_input.shape[2] // 2,
            neg_condtion_img_input.shape[3] // 2,
            device,
            latents.dtype,
            scale_h=scale_h,
            scale_w=scale_w,
        )

        if use_bucket:
            # offset determined by condition image width and scale
            neg_condition_latent_image_ids[:, 1] += 0  # H dimension unchanged
            neg_condition_latent_image_ids[:, 2] += -1 * (condition_img_input.shape[3] * scale_w // 2)
        else:
            # shift negative condition image ids
            neg_condition_latent_image_ids[:, 1] += position_delta[0]  # H dimension shift
            neg_condition_latent_image_ids[:, 2] += position_delta[1]  # W dimension shift

        packed_clean_neg_condtion_input = self._pack_latents(
            neg_condtion_img_input,
            batch_size=neg_condtion_img_input.shape[0],
            num_channels_latents=neg_condtion_img_input.shape[1],
            height=neg_condtion_img_input.shape[2],
            width=neg_condtion_img_input.shape[3],
        )

        neg_subject_masks_maps = subject_masks_maps
        neg_objects_masks_maps = objects_masks_maps
        neg_design_kwargs = {
            "object_layout": {
                "objects_boxes": objects_boxes,
                "bbox_text_embeddings": bbox_text_embeddings,
                "bbox_masks": neg_objects_masks,
                "objects_masks_maps": neg_objects_masks_maps,
                "img_token_h": (int(height) // (self.vae_scale_factor * 2)),
                "img_token_w": (int(width) // (self.vae_scale_factor * 2)),
            },
            "subject_contion": {
                "condition_img": packed_clean_neg_condtion_input,
                "subject_masks_maps": neg_subject_masks_maps,
                "condition_img_ids": neg_condition_latent_image_ids,
                "subject_token_h": condition_img_input.shape[2] // 2,
                "subject_token_w": condition_img_input.shape[3] // 2,
            },
        }

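        # Illustrative note (added, not in the original file): the negative branch reuses the same boxes and
        # box embeddings but zeroes bbox_masks (neg_objects_masks), so layout guidance is effectively dropped
        # in the negative prediction, while the caller-supplied negative condition image is kept.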
        # 6. Denoising loop
        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                if self.interrupt:
                    continue

                if image_embeds is not None:
                    self._joint_attention_kwargs["ip_adapter_image_embeds"] = image_embeds
                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
                timestep = t.expand(latents.shape[0]).to(latents.dtype)
                noise_pred = self.transformer(
                    hidden_states=latents,
                    timestep=timestep / 1000,
                    guidance=guidance,
                    pooled_projections=pooled_prompt_embeds,
                    encoder_hidden_states=prompt_embeds,
                    txt_ids=text_ids,
                    img_ids=latent_image_ids,
                    joint_attention_kwargs=self.joint_attention_kwargs,
                    return_dict=False,
                    design_kwargs=design_kwargs,
                )[0]

                if do_true_cfg:
                    if negative_image_embeds is not None:
                        self._joint_attention_kwargs["ip_adapter_image_embeds"] = negative_image_embeds
                    neg_noise_pred = self.transformer(
                        hidden_states=latents,
                        timestep=timestep / 1000,
                        guidance=guidance,
                        pooled_projections=negative_pooled_prompt_embeds,
                        encoder_hidden_states=negative_prompt_embeds,
                        txt_ids=text_ids,
                        img_ids=latent_image_ids,
                        joint_attention_kwargs=self.joint_attention_kwargs,
                        return_dict=False,
                        design_kwargs=neg_design_kwargs,
                    )[0]
                    noise_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
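                    # Illustrative note (added, not in the original file): this is standard classifier-free
                    # guidance applied on top of FLUX's distilled guidance embedding, i.e.
                    # pred = neg + s * (pos - neg) with s = true_cfg_scale.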
                # compute the previous noisy sample x_t -> x_t-1
                latents_dtype = latents.dtype
                latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]

                if latents.dtype != latents_dtype:
                    if torch.backends.mps.is_available():
                        # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
                        latents = latents.to(latents_dtype)

                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)
                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)

                # call the callback, if provided
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

                if XLA_AVAILABLE:
                    xm.mark_step()

if output_type == "latent":
|
| 1054 |
+
image = latents
|
| 1055 |
+
|
| 1056 |
+
else:
|
| 1057 |
+
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
|
| 1058 |
+
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
|
| 1059 |
+
image = self.vae.decode(latents, return_dict=False)[0]
|
| 1060 |
+
image = self.image_processor.postprocess(image, output_type=output_type)
|
| 1061 |
+
|
| 1062 |
+
# Offload all models
|
| 1063 |
+
self.maybe_free_model_hooks()
|
| 1064 |
+
|
| 1065 |
+
if not return_dict:
|
| 1066 |
+
return (image,)
|
| 1067 |
+
|
| 1068 |
+
return FluxPipelineOutput(images=image)
|
requirements.txt
CHANGED
@@ -1,6 +1,14 @@
diffusers
accelerate
transformers
sentencepiece
protobuf
bitsandbytes
prodigyopt
opencv-python
beautifulsoup4
xformers==0.0.27.post2
flash-attn
gradio
test_creatidesign_benchmark.py
ADDED
@@ -0,0 +1,210 @@
from random import uniform
import torch
import os
from torch.utils.data import DataLoader
from tqdm import tqdm
import time
from IPython.core.debugger import set_trace
from dataloader.creatidesign_dataset_benchmark import DesignDataset, visualize_bbox, collate_fn, tensor_to_pil, make_image_grid_RGB
import numpy as np
from PIL import Image
from safetensors.torch import save_file, load_file
from accelerate import load_checkpoint_and_dispatch
from modules.flux.transformer_flux_creatidesign import FluxTransformer2DModel
from pipeline.pipeline_flux_creatidesign import FluxPipeline
import json
from huggingface_hub import snapshot_download
from datasets import load_dataset

if __name__ == "__main__":
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    weight_dtype = torch.bfloat16
    resolution = 1024
    condition_resolution = 512
    neg_condition_image = 'same'
    background_color = 'gray'
    use_bucket = True
    condition_resolution_scale_ratio = 0.5

    benchmark_repo = 'HuiZhang0812/CreatiDesign_benchmark'  # huggingface repo of benchmark

    datasets = DesignDataset(
        dataset_name=benchmark_repo,
        resolution=resolution,
        condition_resolution=condition_resolution,
        neg_condition_image=neg_condition_image,
        background_color=background_color,
        use_bucket=use_bucket,
        condition_resolution_scale_ratio=condition_resolution_scale_ratio,
    )
    test_dataloader = DataLoader(datasets, batch_size=1, shuffle=False, num_workers=4, collate_fn=collate_fn)

    model_path = "black-forest-labs/FLUX.1-dev"

    ckpt_repo = "HuiZhang0812/CreatiDesign"  # huggingface repo of ckpt

    ckpt_path = snapshot_download(
        repo_id=ckpt_repo,
        repo_type="model",
        local_dir="./CreatiDesign_checkpoint",
        local_dir_use_symlinks=False
    )

    # Load transformer config from checkpoint
    with open(os.path.join(ckpt_path, "transformer", "config.json"), 'r') as f:
        config = json.load(f)

    transformer = FluxTransformer2DModel(**config)
    transformer = load_checkpoint_and_dispatch(transformer, checkpoint=os.path.join(model_path, "transformer"), device_map=None)

    # Load lora parameters using safetensors
    state_dict = load_file(os.path.join(ckpt_path, "transformer", "model.safetensors"))

    # Load parameters, allow partial loading
    missing_keys, unexpected_keys = transformer.load_state_dict(state_dict, strict=False)

    print(f"Loaded parameters: {len(state_dict)}", state_dict.keys())
    print(f"Missing keys: {len(missing_keys)}", missing_keys)
    print(f"Unexpected keys: {len(unexpected_keys)}", unexpected_keys)

    transformer = transformer.to(dtype=torch.bfloat16)

    pipe = FluxPipeline.from_pretrained(model_path, transformer=transformer, torch_dtype=torch.bfloat16)
    pipe = pipe.to("cuda")

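    # Illustrative note (added, not in the original script): the base FLUX.1-dev transformer weights are
    # loaded first, and the CreatiDesign checkpoint is then applied with strict=False, since its
    # model.safetensors appears to carry only the newly added / adapted parameters rather than a full model.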
    seed = 42
    num_samples = 1
    true_cfg_scale = 3.5
    guidance_scale = 1.0
    if resolution == 512:
        position_delta = [0, -32]
    else:
        position_delta = [0, -64]
    if use_bucket:
        scale_h = 1 / condition_resolution_scale_ratio
        scale_w = 1 / condition_resolution_scale_ratio
    else:
        scale_h = resolution / condition_resolution
        scale_w = resolution / condition_resolution

    num_inference_steps = 28

    # Create save directory based on benchmark directory name
    save_root = os.path.join("outputs", benchmark_repo.split("/")[-1])
    os.makedirs(save_root, exist_ok=True)

    img_save_root = os.path.join(save_root, "images")
    os.makedirs(img_save_root, exist_ok=True)

    img_withgt_save_root = os.path.join(save_root, "images_with_gt")
    os.makedirs(img_withgt_save_root, exist_ok=True)

    total_time = 0
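    # Illustrative note (added, not in the original script): position_delta shifts the condition tokens'
    # RoPE width coordinate by the width of the latent token grid (512 / 8 / 2 = 32, 1024 / 8 / 2 = 64),
    # matching the "shift left" placement of condition image ids inside the pipeline.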
    for i, batch in enumerate(tqdm(test_dataloader)):
        prompts = batch["caption"]
        imgs_id = batch['id']
        objects_boxes = batch["objects_boxes"]
        objects_caption = batch['objects_caption']
        objects_masks = batch['objects_masks']
        condition_img = batch['condition_img']
        neg_condtion_img = batch['neg_condtion_img']
        objects_masks_maps = batch['objects_masks_maps']
        subject_masks_maps = batch['condition_img_masks_maps']
        target_width = batch['target_width'][0]
        target_height = batch['target_height'][0]

        img_info = batch["img_info"][0]
        filename = img_info["img_id"] + '.jpg'
        start_time = time.time()
        with torch.no_grad():
            images = pipe(
                prompt=prompts * num_samples,
                generator=torch.Generator(device="cuda").manual_seed(seed),
                num_inference_steps=num_inference_steps,
                objects_boxes=objects_boxes,
                objects_caption=objects_caption,
                objects_masks=objects_masks,
                objects_masks_maps=objects_masks_maps,
                condition_img=condition_img,
                subject_masks_maps=subject_masks_maps,
                neg_condtion_img=neg_condtion_img,
                height=target_height,
                width=target_width,
                true_cfg_scale=true_cfg_scale,
                position_delta=position_delta,
                guidance_scale=guidance_scale,
                scale_h=scale_h,
                scale_w=scale_w,
                use_bucket=use_bucket
            )
        images = images.images
        # accumulate pure inference time once per sample
        use_time = time.time() - start_time
        total_time += use_time

        make_image_grid_RGB(images, rows=1, cols=num_samples).save(os.path.join(img_save_root, filename))

        # Process original image and bounding boxes
        ori_image = tensor_to_pil(batch['img'][0])
        orig_width, orig_height = ori_image.size
        normalized_boxes = batch['objects_boxes'][0].cpu().numpy()
        denormalized_boxes = []
        for box in normalized_boxes:
            x1, y1, x2, y2 = box
            denorm_box = [
                x1 * orig_width,   # x1
                y1 * orig_height,  # y1
                x2 * orig_width,   # x2
                y2 * orig_height   # y2
            ]
            denormalized_boxes.append(denorm_box)

        objects_result = {
            "boxes": denormalized_boxes,
            "labels": batch['objects_caption'][0],
            "masks": []
        }

        # Only keep boxes and captions where mask is 1
        valid_boxes = []
        valid_labels = []
        for box, label, mask in zip(objects_result['boxes'],
                                    objects_result['labels'],
                                    batch['objects_masks'][0]):
            if mask:
                valid_boxes.append(box)
                valid_labels.append(label)

        objects_result['boxes'] = valid_boxes
        objects_result['labels'] = valid_labels

        ori_image_with_bbox = visualize_bbox(ori_image, objects_result)

        # Concatenate images: original + condition + generated samples
        total_width = ori_image.width + ori_image.width + num_samples * ori_image.width
        max_height = ori_image.height

        # Create a new blank image to hold the concatenated images
        new_image = Image.new('RGB', (total_width, max_height))

        new_image.paste(ori_image_with_bbox, (0, 0))

        # Process condition image
        condition_img = tensor_to_pil(batch['original_size_condition_img'][0])
        subject_canvas_with_bbox = visualize_bbox(condition_img, objects_result)

        new_image.paste(subject_canvas_with_bbox, (ori_image.width, 0))

        # Paste generated images
        for j, image in enumerate(images):
            save_name = os.path.join(img_withgt_save_root, filename)

            image_with_bbox = visualize_bbox(image, objects_result)

            new_image.paste(image_with_bbox, (ori_image.width * (j + 2), 0))

        new_image.save(save_name)

    print(f"Total inference time: {total_time:.2f} seconds")
    print(f"Average time per image: {total_time/len(test_dataloader):.2f} seconds")
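    # Illustrative note (added, not in the original script): generated images are written to
    # outputs/CreatiDesign_benchmark/images, and side-by-side grids (ground truth + condition + generation,
    # each overlaid with the layout boxes) to outputs/CreatiDesign_benchmark/images_with_gt.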