Recent advancements in diffusion models have demonstrated remarkable outcomes in text-to-image synthesis. However, relying solely on a text prompt can introduce spatial ambiguity and limit user controllability. In this project, we introduce a training-free method to enhance spatial object placement in diffusion models without the need for additional training data or complex architectures. This method leverages the inherent cross-attention mechanism within diffusion models.
Our method gives users precise control over spatial object placement by adjusting the cross-attention map. This map links specific image regions to the corresponding object descriptors in the text prompt. Users define which regions should contain which objects, which improves controllability without significantly increasing computational cost. The result is a method that balances the prompt and the conditioning information and produces aesthetically pleasing images.
The method is developed and implemented with the Hugging Face Diffusers library. All result images were generated from models with architectures similar to Stable Diffusion v1.5.
- Spatial Control: Adjust cross-attention maps to control the placement of objects in generated images.
- Training-free Method: Achieve spatial control without requiring additional training data or complex model architectures.
- High-Aesthetic Image Generation: Maintain a balance between prompt and conditional information for high-quality image generation.
The user provides a text prompt together with the words in it that they want to control, one for each object. The region map is a set of region masks, with one mask associated with each controlled word.
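As a rough illustration of the region map described above, the sketch below builds one binary mask per controlled word from user-drawn rectangles. The function name `make_region_map`, the rectangle format, and the 8x8 grid are assumptions made for this example; the project's own code may represent regions differently.

```python
def make_region_map(boxes, height, width):
    """Build one binary mask per controlled word.

    `boxes` maps a word to a hypothetical (top, left, bottom, right)
    rectangle in grid coordinates; bottom and right are exclusive.
    """
    region_map = {}
    for word, (top, left, bottom, right) in boxes.items():
        mask = [[0] * width for _ in range(height)]
        for row in range(top, bottom):
            for col in range(left, right):
                mask[row][col] = 1
        region_map[word] = mask
    return region_map


# Two controlled words from the prompt "A girl sitting on the bridge."
boxes = {
    "A girl": (1, 2, 6, 5),  # 5 rows x 3 columns
    "bridge": (5, 0, 8, 8),  # bottom 3 rows of the grid
}
region_map = make_region_map(boxes, height=8, width=8)

for word, mask in region_map.items():
    covered = sum(map(sum, mask))
    print(f"{word!r}: {covered} of {8 * 8} cells selected")
```

Keeping one mask per word, rather than a single label image, lets regions overlap, as the two rectangles here do on row 5.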
The results were obtained with different $S$ and $S'$ values for "1girl", with $S$ and $S'$ for "sun" held constant at 0.5 and 0, respectively.
Hyperparameter selection is crucial for this method. In the figure, images were generated with the same seed, a guidance scale of 7.5, and the DPM++ 2M Karras sampler with 25 steps, all on the same device. Proper values for
Visualizing the method's process. Users manipulate object placement by choosing phrases like 'A girl' and 'bridge'. User-designated masks enhance the importance of these phrases in the attention matrix within cross-attention layers.
Cross-attention maps of a Stable Diffusion model, obtained with DAAM for the prompt "A girl sitting on the bridge." and the two chosen phrases "A girl" and "bridge".
Visualizing cross-attention maps for Stable Diffusion with DAAM. The top row depicts the scenario without our method, while the bottom row demonstrates its impact. Highlighted pixels in heatmaps show stronger relationships with each word, showcasing the network's focus on distinct pixels for individual words.
Through experimentation, we observed that using a scale
where:
- $\sigma$ is the current noise level;
- $a$ is the result of combining Q, K, and the attention mask, given by $a = Q \cdot K^{T} + M$.
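The expression $a = Q \cdot K^{T} + M$ can be illustrated with a toy example. The sketch below is an illustration under stated assumptions, not the project's implementation: it assumes $M$ adds a constant `strength` (standing in for a value such as $S$) to the logits of one token at the pixels inside that token's region and 0 elsewhere, and it leaves out the noise-level-dependent scaling. The helper names `softmax` and `biased_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(Q, K, M):
    """Return softmax(Q K^T + M): one row per pixel, one column per token.

    The usual 1/sqrt(d) scaling is omitted to mirror the formula in the text.
    """
    logits = [
        [sum(q * k for q, k in zip(q_row, k_row)) + M[i][j]
         for j, k_row in enumerate(K)]
        for i, q_row in enumerate(Q)
    ]
    return [softmax(row) for row in logits]


# Toy sizes: 4 pixels, 3 prompt tokens, feature dimension 2.
Q = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [0.3, 0.0]]
K = [[0.5, 0.1], [0.2, 0.4], [0.3, 0.3]]

# Assumption: token 1 is the controlled word; its region covers pixels 0 and 1.
region = [1, 1, 0, 0]
strength = 2.0
no_bias = [[0.0] * 3 for _ in range(4)]
M = [[strength * region[i] if j == 1 else 0.0 for j in range(3)]
     for i in range(4)]

before = biased_attention(Q, K, no_bias)
after = biased_attention(Q, K, M)

for i in range(4):
    print(f"pixel {i}: token-1 weight {before[i][1]:.3f} -> {after[i][1]:.3f}")
```

Because the bias is added before the softmax, raising one token's logit inside its region lowers the relative weight of the other tokens there, while pixels outside the region are left unchanged.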
For each prompt, all results used the same configuration and a region map. The region map shows the area assigned to each instance, and the hyperparameters used to generate each image are annotated on its region map. For each prompt, we also generated images without the region map, using multiple seeds. With Stable Diffusion, our method took 5 seconds to generate an image, compared with 4.6 seconds without it (roughly 9% overhead), on a computer with one NVIDIA T4 GPU.
- To ensure fairness, we ran all methods on identical hardware using the "QuinceMix v2.0" model, which is structurally similar to Stable Diffusion v1.5, with the following hyperparameters: negative prompt ("bad quality, low quality, jpeg artifact, cropped"), clip skip = 2, guidance scale = 7.5, and a fixed image size of 512x512. All methods used the same seeds and the DPM++ 2M Karras sampler with 25 sampling steps for the reverse diffusion process.
- To ensure optimal performance, we adopted the hyperparameters from the respective papers for the MultiDiffusion and Masked-Attention Guidance methods. For MultiDiffusion, we used bootstrapping over 20% of the denoising steps. Masked-Attention Guidance's guidance scale ($\alpha$) and loss weight ($\lambda$) were set to 0.08 and 0.5, respectively. For our method, we customized $S$ and did not use $S'$, as indicated in the region maps. We randomly selected seeds for each prompt and region map, and every method generated images for all of the selected seeds. The results were assessed on three criteria:
- Region Map Compliance: Evaluates the faithfulness of generated objects to the predefined region. Higher scores indicate better alignment with the specified region. This is a crucial criterion.
- Prompt Compliance: Evaluates how well the generated image includes all objects from the prompt, with a higher score indicating better adherence to the prompt.
- Secondary Criterion: Awards additional points for aesthetically pleasing and high-quality generated images, serving as a supplementary evaluation criterion.
- By considering these three criteria, our evaluation aims to provide a comprehensive and nuanced perspective on the performance of the methods under various conditions.
Generated images for each method and each pair of prompt and region map. For each method, the image on the left is the result that satisfies the criteria least, and the image on the right is the result that satisfies them most.
- Relying solely on these two cases for conclusions may be misleading; thus, we provide the generated images here for a comprehensive individual evaluation.
Images generated by different diffusion models for each prompt with our proposed method, using the same configuration and seed. All generated images have the same resolution of 512x512.
The proposed method enhances Stable Diffusion's ability to prevent prompt manipulation in image generation. Columns show results with the same seed, and rows show results with and without a region map, all at a consistent size of 1920x1088.
With a general descriptive prompt, combining ControlNet with our method makes it possible to create images tailored to the user's needs. Columns show results with the same seed, while rows show results with and without a region map.
Note that without our method, many instances are missing.
Visualization of images generated with IP-Adapter, combining region maps and prompts. Each column shows results with the same seed and the same input in each row. The first row shows results without a region map, and the second row uses a region map. All results have the same resolution of 768x512.
All images were generated with the same configuration, the same seed, and the same dimensions of 768x512; only the IP-Adapter scale differs. The object's region image describes the character's position in the picture. It can be observed that our method performs better than IP-Adapter attention masking. "Regular" means that the images were generated normally, without applying masks to the objects.
- From the images generated with corresponding region maps, this method performs quite well across various cases and sizes of test images. However, due to its reliance on cross-attention refinement to highlight specific regions, there may be instances where it does not work optimally. For example, if the region of interest is relatively small or exhibits unusual characteristics, the model may fail to generate an appropriate image.
- Furthermore, if the user chooses object positions that the model has not been trained on, it will not generate the desired images. In addition, users need to set the input hyperparameters appropriately to obtain the desired images. If the hyperparameters are set too high, the generated images will be very poor; if they are set too low, the images will not meet the user's expectations.
To use our project, follow these installation steps:
git clone https://github.com/duongve13112002/DiffusionSpatialControl.git
cd DiffusionSpatialControl/source
pip install -r requirements.txt
To make the method convenient to use, we have implemented it in a simple web application built with the Gradio library.
cd DiffusionSpatialControl/source
python app.py
We welcome contributions! Follow these steps to contribute to our project:
- Fork the repository
- Create a new branch:
git checkout -b feature/your-feature
- Make your changes and commit them:
git commit -m 'Add new feature'
- Push to the branch:
git push origin feature/your-feature
- Submit a pull request
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.