CondRef-AR: Condition-as-a-Reference Randomized Autoregressive Modelling for Controllable Aerial Image Generation
Pu Jin
Paper | GitHub Repo | Hugging Face


Controllable image generation, particularly for complex aerial scenes, demands both high-fidelity synthesis and precise adherence to input conditions such as semantic maps or edge diagrams. Aerial imagery is characterized by extensive spatial dependencies, where geographical features and objects exhibit strong correlations across the entire scene; accurately modeling these long-range interactions requires a broad contextual field of view. Recently, autoregressive (AR) models have made significant progress in controllable generation, but their inherent unidirectional context modeling limits the global coherence and structural fidelity of generated images. To overcome this constraint, we propose CondRef-AR, a novel Condition-as-a-Reference randomized autoregressive modelling framework for controllable aerial image generation. Our approach is built upon a key insight: in controllable tasks, the input condition is not merely a static guide but a dynamic, intrinsic reference that specifies the content of each spatial location. To leverage this, CondRef-AR employs a randomized training strategy in which the image token sequence is permuted during training, exposing the model to diverse contexts. Crucially, instead of relying on auxiliary mechanisms to resolve the prediction ambiguities introduced by randomization, our core innovation is to directly use the corresponding control signal as the primary reference for predicting each token. This substitution allows the model to naturally disambiguate predictions while achieving superior control precision, thereby enhancing both generation quality and fidelity within the efficient autoregressive paradigm.

CondRef-AR Method


This section details our proposed Condition-as-a-Reference Randomized Autoregressive Modelling (CondRef-AR) framework. Our core insight is that in controllable generation tasks, the input condition should not be treated as a static, global guide, but rather as a dynamic, intrinsic reference that provides precise guidance for the prediction at each spatial location. Building upon this insight, CondRef-AR employs a randomized training strategy to learn rich bidirectional contexts and introduces a novel mechanism that leverages the control signal itself as the primary means to resolve the prediction ambiguities introduced by randomization.
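To make the data flow concrete, the following is a minimal sketch of how a randomized training batch could be assembled under this idea, assuming discrete image tokens and spatially aligned condition tokens. All names here (`build_condref_batch`, the tensor layout) are illustrative assumptions, not the paper's actual implementation:

```python
import torch

def build_condref_batch(img_tokens, cond_tokens, generator=None):
    """Permute the image token sequence and pair each prediction step with
    the condition token at the *same spatial location* as its target.

    img_tokens:  (B, N) discrete image token ids
    cond_tokens: (B, N) condition token ids, spatially aligned with img_tokens
    Returns:
      inputs:  (B, N-1) permuted image tokens seen as context (teacher forcing)
      refs:    (B, N)   condition tokens acting as the per-step reference
      targets: (B, N)   permuted image tokens to predict
      perm:    (B, N)   the sampled permutation per sample
    """
    B, N = img_tokens.shape
    # One random prediction order per sample: randomization exposes the
    # model to diverse (effectively bidirectional) contexts.
    perm = torch.stack([torch.randperm(N, generator=generator) for _ in range(B)])
    targets = torch.gather(img_tokens, 1, perm)
    # Instead of an auxiliary "which position comes next" signal, the
    # reference for step t is the condition token at the target's location.
    refs = torch.gather(cond_tokens, 1, perm)
    inputs = targets[:, :-1]  # shifted permuted sequence as AR context
    return inputs, refs, targets, perm
```

In a full model, `refs` would be embedded and injected at each decoding step (e.g. added to or cross-attended by the token stream) so that the control signal itself tells the model *what* to predict *where*, resolving the ambiguity that random ordering would otherwise introduce.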

Figure 1: Overview of CondRef-AR framework


Results


We demonstrate the effectiveness of CondRef-AR through extensive experiments on the DesignEarth dataset, showcasing its superior performance in generating high-fidelity aerial images. By varying the input conditions and prompts, CondRef-AR can generate diverse aerial images:

Figure 2: Generated Aerial Images



CondRef-AR can generate continuous, plausible, high-resolution sequences of land-use change images from a series of temporal semantic condition maps. As shown in the figure below, the model successfully simulates the entire process, from a pristine forest gradually transforming into a modern residential urban area:

Figure 3: Generated Land-Use Change Sequences
