Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
Method Overview
SEGA turns fixed attention scaling into dynamic, content-aware scaling by looking at the latent's frequency content during denoising.
At each denoising step, SEGA analyzes the current latent in the frequency domain, maps spectral energy to RoPE dimensions, and adaptively rescales attention so the model preserves both global structure and fine detail at high resolution.
SEGA modifies RoPE by multiplying each rotary dimension by a dynamic scale. The scale changes across dimensions, axes, samples, and denoising steps.
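As a rough illustration of per-dimension scaling, the sketch below applies standard RoPE to queries and keys and then weights each rotary pair of the query by its own factor, so that each pair's contribution to the attention logit is rescaled. This is only one possible reading of the description above: the function names `apply_rope` and `scaled_rope_logits`, the choice to scale the query after rotation, and the single-axis setup are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def apply_rope(x, positions, freqs):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x: (N, D) float array with D even; positions: (N,); freqs: (D/2,).
    """
    angles = positions[:, None] * freqs[None, :]  # (N, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def scaled_rope_logits(q, k, positions, freqs, scales):
    """Attention logits where rotary pair i contributes with weight scales[i].

    Hypothetical helper: scaling the rotated query is an assumed design choice.
    """
    q_rot = apply_rope(q, positions, freqs)
    k_rot = apply_rope(k, positions, freqs)
    # Each rotary pair spans two feature dimensions, so repeat each scale twice.
    q_scaled = q_rot * np.repeat(scales, 2)[None, :]
    return q_scaled @ k_rot.T / np.sqrt(q.shape[1])
```

With all scales equal to 1 this reduces to ordinary RoPE attention logits. A fixed global attention scale corresponds to one shared value for every pair, whereas a dynamic scheme would supply a different vector per axis, sample, and denoising step.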
During sampling, the diffusion transformer maintains a noisy latent representation. SEGA reads this latent directly, without adding new networks or learned parameters.
SEGA reshapes the latent into its spatial layout and applies an FFT to estimate how much energy lies in different spatial-frequency bands.
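The spectral analysis step can be sketched as follows, assuming the latent has already been reshaped to a `(channels, height, width)` array. The helper name `axis_band_energies`, the number of bands, the linear band edges, and the per-axis treatment are illustrative assumptions; the source does not specify how bands are defined.

```python
import numpy as np

def axis_band_energies(latent, axis, num_bands=4):
    """Fraction of spectral energy per frequency band along one spatial axis.

    latent: (C, H, W) array. axis: 1 for height, 2 for width.
    Bands split |frequency| in [0, 0.5] cycles per latent position evenly
    (an assumed band layout).
    """
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2  # (C, H, W)
    other = tuple(a for a in range(3) if a != axis)
    profile = power.sum(axis=other)  # energy per frequency bin on this axis
    freqs = np.abs(np.fft.fftfreq(latent.shape[axis]))  # values in [0, 0.5]
    edges = np.linspace(0.0, 0.5, num_bands + 1)
    band_idx = np.digitize(freqs, edges[1:-1])  # integers in 0..num_bands-1
    energy = np.bincount(band_idx, weights=profile, minlength=num_bands)
    return energy / max(energy.sum(), 1e-12)
```

Calling this once with `axis=1` and once with `axis=2` at each denoising step would give separate energy profiles for the height and width axes, which matches the statement that the scale varies across axes.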
Individual RoPE dimensions encode specific spatial frequencies. SEGA maps the spectral energy directly onto the relevant positional embeddings instead of treating all dimensions equally.
Low-energy bands receive stronger scaling to recover under-resolved structure, while high-energy bands receive weaker scaling to avoid over-amplification.
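A minimal sketch of the energy-to-scale mapping is given below. It assigns each rotary pair to a frequency band using the standard RoPE frequencies, then gives low-energy bands a larger factor than high-energy bands. The linear mapping, the `max_boost` parameter, and the helper names are assumptions chosen only to illustrate the inverse relationship; the actual function used by SEGA is not stated here.

```python
import numpy as np

def rope_frequencies(num_pairs, base=10000.0):
    """Standard RoPE angular frequencies in radians per position."""
    return base ** (-np.arange(num_pairs) / num_pairs)

def energy_to_scales(band_energy, band_edges, rope_freqs, max_boost=0.5):
    """Return one scale per rotary pair from per-band spectral energy.

    band_energy: (B,) energy per band. band_edges: (B + 1,) edges in cycles
    per position. The band with the most energy gets scale 1.0 and an empty
    band gets 1.0 + max_boost (an assumed, illustrative rule).
    """
    cycles = rope_freqs / (2.0 * np.pi)  # radians -> cycles per position
    band_idx = np.digitize(cycles, band_edges[1:-1])
    band_idx = np.clip(band_idx, 0, len(band_energy) - 1)
    rel = band_energy / max(band_energy.max(), 1e-12)  # values in [0, 1]
    band_scale = 1.0 + max_boost * (1.0 - rel)  # low energy -> larger scale
    return band_scale[band_idx]
```

Under linearly spaced edges and the usual base of 10000, most rotary pairs fall into the lowest band, so a practical band layout would likely need to be finer at low frequencies; that choice is left open in this sketch.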
Fixed scaling creates a trade-off: preserving coarse layout can blur detail, while emphasizing detail can destabilize global structure. SEGA avoids this by adapting scaling to the image's actual spectral structure.
Prompt: "Polar bear swimming underwater, paws paddling, fur streaming, bubbles, blue-green Arctic water, bioluminescent plankton" · 4096×4096 · YaRN, DyPE, UltraImage & SEGA (Flux)
Prompt: "A cat and a bunny sitting on a table having a tea party." · 4096×4096 · YaRN, DyPE, UltraImage & SEGA (Qwen)
Prompt: "A staircase made of stacked books leading up to a glowing door." · 4096×4096 · YaRN, DyPE, UltraImage & SEGA (Qwen)
Prompt: "Nile crocodile eye at surface level, golden eye above waterline, reflection of savanna sky, scales in perfect detail" · 4096×4096 · YaRN, DyPE, UltraImage & SEGA (Flux)
Prompt: "A koala reading a newspaper while sitting on a park bench." · 4096×4096 · ScaleDiff, I-Max, HiFlow & SEGA (Flux)
Prompt: "A train driving through a tunnel made of giant colorful candy." · 4096×4096 · ScaleDiff, I-Max, HiFlow & SEGA (Flux)
@misc{rajabi2026,
title = {Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers},
author = {Rajabi, Javad and Shaban, Kimia and Roohi, Koorosh and Lindell, David B. and Taati, Babak},
year = {2026}
}