Addressing Data Limitations in Power Grid Research with Synthetic Datasets
Data-driven methods rely critically on the availability of large, high-quality datasets. While such data is abundant for domains like text and image generation, information on the operation of critical infrastructures is rarely public, and when available it is often aggregated and incomplete. In this talk, we present a method for generating large-scale synthetic datasets of power injections in a transmission grid model of continental Europe. The approach combines structural information on the grid (its line admittances and the location, type, and capacity of its generators) with publicly available aggregated load data from ENTSO-E. This enables the creation of arbitrarily long time series that are statistically consistent with real-world behavior. The resulting datasets have been validated against empirical measurements and provide a valuable resource for developing and testing data-driven methods in power systems research. We will also discuss ongoing work on enriching the dataset with additional features and on sampling strategies that better capture rare and critical operating conditions.

