Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject’s spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model’s prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model’s original distribution.
Noise-induced layouts leverage the fact that the initial random noise in a diffusion model already encodes a natural spatial arrangement based on the model's prior. Instead of forcing an external layout (made by the user or an LLM), our method extracts a layout directly from this noise and refines it throughout the denoising process.
Here, we display the noise-induced layouts obtained from the initial noise for different random seeds.
Our method is based on the interplay between soft- and hard-layouts.
A soft-layout is a feature map that represents each pixel's potential to associate with other pixels in composing a single subject. We extract the soft-layout from the diffusion model's features.
A hard-layout is derived by clustering the soft-layout according to the number of subjects described in the prompt.
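The clustering step can be illustrated with a minimal sketch. Here the soft-layout is taken to be an (H, W, D) map of per-pixel feature vectors, and a plain k-means with farthest-point initialization stands in for whatever clustering procedure the method actually uses; the function name and all details below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cluster_soft_layout(soft_layout, num_subjects, iters=10, seed=0):
    """Cluster per-pixel feature vectors into `num_subjects` regions.

    soft_layout: (H, W, D) array of per-pixel association features.
    Returns an (H, W) integer hard-layout, one label per subject region.
    NOTE: toy k-means stand-in, not the paper's actual clustering.
    """
    H, W, D = soft_layout.shape
    feats = soft_layout.reshape(-1, D).astype(float)
    rng = np.random.default_rng(seed)

    # Farthest-point initialization keeps initial centers well separated.
    centers = [feats[rng.integers(len(feats))]]
    for _ in range(num_subjects - 1):
        d = np.min([((feats - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(feats[int(d.argmax())])
    centers = np.stack(centers)

    # Standard Lloyd iterations: assign pixels, then recompute centers.
    for _ in range(iters):
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(num_subjects):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(H, W)
```

For a prompt describing K subjects, `num_subjects` would be set to K, so every pixel is committed to exactly one subject region.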
This figure illustrates the progression of the soft- and hard-layouts in two cases.
The bottom row shows vanilla SDXL, where we present the soft-layouts extracted throughout generation. The generated image doesn't adhere to the prompt.
The top row shows results from our method, where the model is encouraged to produce soft-layouts that respect the prompt-aligned hard-layouts, which are refined between each step. The refined hard-layouts respect the subject boundaries from previous timesteps, resulting in a “decisive” generation process, where each subject is consistently delegated to a specific region.
Our method steers the denoising process by applying iterative guidance (turquoise box) after each denoising step (orange regions). At denoising step t (left orange box), we predict a soft-layout St from the diffusion model’s features and cluster it to form a prompt-aligned hard-layout Mt (purple box). This hard-layout is then used to control the layout of the next denoising step (right orange box). In the guidance stage, we optimize the latent image so that its updated soft-layout aligns with the hard-layout Mt.
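The guidance stage above can be sketched as a gradient step on the latent. In this toy stand-in (an assumption, not the paper's implementation), the soft-layout is modeled as a softmax over `num_subjects` channels of the latent, and the guidance objective is a per-pixel cross-entropy against the one-hot hard-layout; for that pairing the gradient with respect to the logits is simply `softmax - one_hot`, so the update can be written without autograd.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guidance_step(latent, hard_layout, num_subjects, lr=0.5):
    """One layout-guidance update on the latent.

    latent:      (H, W, K) logits; softmax over K gives the soft-layout.
    hard_layout: (H, W) integer subject labels (the hard-layout Mt).
    Returns the updated latent and the current guidance loss.
    NOTE: toy model; the real soft-layout comes from diffusion features.
    """
    probs = softmax(latent)                      # (H, W, K) soft-layout
    one_hot = np.eye(num_subjects)[hard_layout]  # (H, W, K) target
    # Mean per-pixel cross-entropy between soft- and hard-layouts.
    loss = -np.mean(np.sum(one_hot * np.log(probs + 1e-8), axis=-1))
    # Gradient of the mean cross-entropy w.r.t. the logits.
    grad = (probs - one_hot) / (latent.shape[0] * latent.shape[1])
    return latent - lr * grad, loss
```

Iterating this step pushes each pixel's soft-layout mass toward its assigned subject region, which is the role the guidance box plays between denoising steps.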