SEGA

Abstract

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.

Method Overview

How SEGA Works

SEGA turns fixed attention scaling into dynamic, content-aware scaling by looking at the latent's frequency content during denoising.

TL;DR

At each denoising step, SEGA analyzes the current latent in the frequency domain, maps spectral energy to RoPE dimensions, and adaptively rescales attention so the model preserves both global structure and fine detail at high resolution.

The core idea

SEGA modifies RoPE by multiplying each rotary dimension by a dynamic scale. The scale changes across dimensions, axes, samples, and denoising steps.

f_SEGA(x, n, d) = m_d · f_RoPE(x, n, d)
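The formula above can be read as "rotate with standard RoPE, then rescale each rotary dimension". The sketch below illustrates that reading in NumPy under stated assumptions: x is a token feature vector, n its position, d indexes rotary channel pairs, channels are paired in interleaved order, and the frequency base is 10000. The helper names rope_rotate and scaled_rope are hypothetical, and the scale vector m is simply passed in. The page does not say whether the scale is applied to queries, keys, or both, so the sketch leaves that open.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Standard 1D RoPE on x of shape (seq, dim), with dim even.

    Assumption: channels (2i, 2i+1) form rotary pair i.
    """
    seq, dim = x.shape
    half = dim // 2
    # One rotation frequency (radians per token) per channel pair.
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]        # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def scaled_rope(x, positions, m, base=10000.0):
    """f_SEGA = m_d * f_RoPE: multiply rotary pair d by m[d]."""
    rotated = rope_rotate(x, positions, base)
    # Repeat each scale twice so both channels of a pair share it.
    return rotated * np.repeat(m, 2)[None, :]

# Example: 5 tokens, 8 channels (4 rotary pairs), hypothetical scales.
x = np.random.default_rng(0).standard_normal((5, 8))
m = np.array([1.0, 1.1, 1.2, 1.3])
y = scaled_rope(x, np.arange(5), m)
```

With m set to all ones, the sketch reduces to plain RoPE, which matches the idea of fixed scaling as a special case of per-dimension scaling.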

1. Start from the latent

During sampling, the diffusion transformer maintains a noisy latent representation. SEGA reads this latent directly, without adding new networks or learned parameters.

2. Measure spectral energy

SEGA reshapes the latent into its spatial layout and applies an FFT to estimate how much energy lies in different spatial-frequency bands.
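A minimal sketch of this measurement, assuming the latent arrives as a row-major token sequence of shape (height × width, channels). The radial binning, the number of bands, and the function name band_energy are illustrative choices, not details taken from the paper.

```python
import numpy as np

def band_energy(tokens, height, width, num_bands=8):
    """Fraction of spectral energy in each radial spatial-frequency band.

    tokens: array of shape (height * width, channels), row-major layout
    (an assumption about how the latent is flattened).
    """
    channels = tokens.shape[-1]
    latent = tokens.reshape(height, width, channels)
    # 2D FFT over the spatial axes, power averaged over channels.
    spectrum = np.fft.fft2(latent, axes=(0, 1))
    power = (np.abs(spectrum) ** 2).mean(axis=-1)        # (height, width)
    # Radial frequency of every FFT bin, in cycles per token.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Assign each bin to one of num_bands equal-width radial bands.
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    band = np.clip(np.digitize(radius, edges) - 1, 0, num_bands - 1)
    energy = np.bincount(band.ravel(), weights=power.ravel(),
                         minlength=num_bands)
    # Guard against an all-zero latent.
    return energy / max(energy.sum(), 1e-12)

# Example: a random 32x32 latent with 16 channels.
tokens = np.random.default_rng(0).standard_normal((32 * 32, 16))
energy = band_energy(tokens, 32, 32)
```

Band 0 holds the lowest frequencies (coarse layout) and the last band holds the highest (fine detail). Since the page says scaling varies across axes, a per-axis profile may be closer to the actual method than radial bands.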

3. Match energy to RoPE dimensions

Individual RoPE dimensions encode specific spatial frequencies. SEGA maps the spectral energy directly onto the relevant positional embeddings instead of treating all dimensions equally.
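One way to picture this mapping: a rotary pair with angular frequency θ radians per token oscillates at θ / 2π cycles per token, which can be compared directly with FFT bin frequencies. The sketch below assigns each rotary pair the energy of its nearest FFT bin along one axis. The nearest-bin rule, the base of 10000, and the names rope_frequencies and energy_per_rope_dim are assumptions for illustration; the actual mapping used by SEGA may differ.

```python
import numpy as np

def rope_frequencies(num_pairs, base=10000.0):
    """Angular frequency (radians per token) of each rotary pair."""
    return base ** (-np.arange(num_pairs) / num_pairs)

def energy_per_rope_dim(axis_energy, axis_len, num_pairs, base=10000.0):
    """Look up, for each rotary pair, the energy of the nearest FFT bin.

    axis_energy: energy at each np.fft.rfftfreq(axis_len) bin,
    shape (axis_len // 2 + 1,). Returns (energy per pair, bin index per pair).
    """
    bin_freqs = np.fft.rfftfreq(axis_len)                # cycles per token
    # Convert radians per token to cycles per token.
    rope_cycles = rope_frequencies(num_pairs, base) / (2.0 * np.pi)
    distance = np.abs(rope_cycles[:, None] - bin_freqs[None, :])
    nearest = distance.argmin(axis=1)
    return axis_energy[nearest], nearest

# Example: a hypothetical energy profile along a 64-token axis.
axis_energy = np.linspace(1.0, 0.1, 64 // 2 + 1)
dim_energy, bins = energy_per_rope_dim(axis_energy, 64, num_pairs=8)
```

Under these assumptions the slowest rotary pairs all fall into the lowest bin, because their wavelengths exceed the axis length, so a practical mapping would likely need a rule for that case.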

4. Adapt attention scaling

Low-energy bands receive stronger scaling to recover under-resolved structure, while high-energy bands receive weaker scaling to avoid over-amplification.
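The page does not give the formula that turns energy into a scale, so the following is only a sketch of the stated behavior: scale decreases as energy increases. The log-domain normalization, the [min_scale, max_scale] range, and the name energy_to_scale are all hypothetical.

```python
import numpy as np

def energy_to_scale(dim_energy, min_scale=1.0, max_scale=1.3, eps=1e-12):
    """Map per-dimension energy to a scale m_d, inversely to energy.

    Lowest-energy dimension gets max_scale, highest-energy gets min_scale.
    The range and the log normalization are illustrative assumptions.
    """
    log_e = np.log(np.asarray(dim_energy, dtype=float) + eps)
    span = log_e.max() - log_e.min()
    if span < eps:
        # Flat spectrum: no dimension stands out, use the midpoint.
        return np.full(log_e.shape, 0.5 * (min_scale + max_scale))
    # 0 for the lowest-energy dimension, 1 for the highest.
    normalized = (log_e - log_e.min()) / span
    return max_scale - normalized * (max_scale - min_scale)

# Example: energy concentrated in the first dimension.
scales = energy_to_scale([0.7, 0.2, 0.1])
```

Because the energy is recomputed from the current latent, the resulting scales would change from sample to sample and from one denoising step to the next, which is the content-aware behavior described above.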

Noisy latent → FFT spectrum → Per-dimension scaling

Why it helps

Fixed scaling creates a trade-off: preserving coarse layout can blur detail, while emphasizing detail can destabilize global structure. SEGA avoids this by adapting scaling to the image's actual spectral structure.

No training · No new weights · No architecture changes · Works at inference time

Results


Baseline comparisons

Polar bear swimming underwater, paws paddling, fur streaming, bubbles, blue-green Arctic water, bioluminescent plankton

4096×4096 · YaRN, DyPE, UltraImage & SEGA (Flux)

A cat and a bunny sitting on a table having a tea party.

4096×4096 · YaRN, DyPE, UltraImage & SEGA (Qwen)

A staircase made of stacked books leading up to a glowing door.

4096×4096 · YaRN, DyPE, UltraImage & SEGA (Qwen)

Nile crocodile eye at surface level, golden eye above waterline, reflection of savanna sky, scales in perfect detail

4096×4096 · YaRN, DyPE, UltraImage & SEGA (Flux)

Citation

@article{rajabi2026sega,
  title={SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers},
  author={Rajabi, Javad and Shaban, Kimia and Roohi, Koorosh and Lindell, David B and Taati, Babak},
  journal={arXiv preprint arXiv:2605.22668},
  year={2026}
}