The proliferation of unmanned aerial vehicles (UAVs) has led to an exponential increase in the volume of aerial imagery, simultaneously spurring a growing demand for automated interpretation and synthesis methods. While significant strides have been made in general image generation and remote sensing image analysis, a crucial gap remains in datasets that cohesively integrate diverse modalities (high-resolution aerial photos, rich map representations, and comprehensive semantic descriptions) to facilitate controllable aerial image generation. Existing remote sensing datasets often focus on single modalities or lack the explicit structural and semantic annotations required for fine-grained generative control. To address this need, we introduce DesignEarth, a novel, large-scale, multi-modal dataset specifically designed to enable and benchmark controllable aerial image generation. DesignEarth comprises 265,247 georeferenced aerial images together with their corresponding multi-layered condition images, i.e., map, pencil sketch, Canny edge, and line-art images, as well as detailed semantic descriptions totaling more than 38.4M tokens. This synergistic combination empowers models to synthesize aerial scenes not merely from textual prompts (text-to-image), but also with precise spatial and structural guidance derived from various map inputs (map-to-image, controllable generation). Furthermore, we establish a comprehensive benchmark on DesignEarth, evaluating state-of-the-art generative models across a variety of tasks, including text-to-image generation and conditioned image generation, offering baseline results and highlighting persistent challenges. Our extensive experiments underscore the dataset's utility in advancing research toward highly controllable and semantically consistent aerial scene synthesis.