Background & Summary
Water Distribution Networks (WDNs) are considered critical infrastructures as they provide clean and safe water to humans, which is one of the Sustainable Development Goals proposed by the United Nations. Water providers have to deal with critical challenges during the design, planning, and management phases of a WDN in order to fulfill this goal, such as adaptability and robustness to an ever changing environment. Climate, consumer behavior, aging infrastructure, failures, all can lead to drastic changes in the conditions under which WDNs must continue working adequately. Monitoring of WDN operations plays an important role in guaranteeing the water supply. The state of the network must be known at any given time to prevent unwanted situations, e.g., pipe leaks.
Hydraulic modeling has been the most straightforward approach for practitioners to simulate WDN dynamics and aid design, planning, and management. While purely physics-based hydraulic modeling is still commonly used in the water domain, water engineering research and practice are experiencing a shift towards hybrid data-driven approaches. Such approaches combine the power of physics-based and mathematical simulation tools with data-driven deep learning models to solve water engineering problems. Paradoxically, while data are the key to such approaches, WDN operation data are scarce and seldom shared among practitioners and researchers due to privacy, safety, and other domain-related constraints1,2. A notable example is nodal demand patterns. Demand is one of the most important inputs for solving the WDN hydraulics3. Surprisingly, it is one of the inputs that is rarely found in WDN asset description files. It is common to find just a few demand patterns reused many times on several nodes in the network4, or no demand patterns at all5. The stochastic nature of water demand explains some of the uncertainties found in WDNs6, which should be properly modeled and considered in the simulations7. Hence, reusing the same demand patterns on multiple nodes assumes that several users/consumers have exactly the same water consumption behavior, which is unrealistic. This not only harms the robustness of the models to uncertainties, but also limits the variety of the data, which is especially important for data-driven deep learning approaches.
Benchmark datasets for WDN data analysis are, in fact, very limited8,9. LeakDB8 is a dataset commonly used in research at the moment, but it only includes a single small WDN with limited variability of scenarios, as explained in Section Technical Validation. This limits the diversity of the data for training data-driven models. More commonly found in practice is a collection of static asset descriptions of water networks, together with different algorithms and implementations for generating data from them. Researchers and practitioners working on data-driven approaches for water engineering lack data to train their models, and rely on static asset descriptions of WDNs rather than operational data. Those asset descriptions are represented as configuration files, which serve as input to physics-based mathematical tools that simulate the data required for training data-driven models. Although simulating WDN hydraulics from well-defined configuration files seems straightforward, it is a cumbersome process that involves expert knowledge, time-consuming model calibration, uncertainties, and computational complexity, among other challenges. Moreover, such configuration files only allow practitioners to simulate the WDN states determined by the explicitly specified input parameters. Hence, if new data for a different WDN are needed, or a different condition of the input parameters needs to be evaluated, the whole process has to be repeated from scratch.
Contrary to previous approaches, we provide operational data that are ready to use for model training. The aim of this work is to support the shift towards data-driven approaches for WDN data analysis. We provide a multi-purpose dataset generated from 36 publicly available WDNs, which includes 228 million network state snapshots captured under normal operating conditions. The synthetic nature of the data eliminates privacy and safety concerns, facilitating data sharing among researchers, and within the commercial sector, without any risks. Even models trained on our data can be shared, because they learn generic patterns that are not tied to any particular clients or real water utility assets. Those pre-trained models can be adapted to solve use-case-specific problems at a later stage. This approach can lower institutional barriers and enhance collaboration between water utilities. While data and models can be shared unrestrictedly among practitioners and researchers, the way models are applied and the sensitive use-case data will still remain private and safe within each water company or research institution.
Previous work demonstrates the transferability of deep learning models trained on synthetic data to real-world applications10. It shows that a model trained on synthetic data performs well in real pressure estimation scenarios with data provided by a water utility company. In addition, it shows that real operational conditions can be incorporated by using them to fine-tune a model pre-trained on synthetic data, adapting it to the real use-case data. Examples of tasks supported by our dataset are surrogate modeling, state estimation, and demand forecasting. The data provided include all the inputs used for the simulations and their respective outputs, which allows researchers to work on surrogate modeling of physics-based mathematical simulation tools. The large number of provided snapshots allows practitioners to work on state estimation models. The data include unique demand patterns per node, facilitating demand forecasting.
Methods
To create operational data, we begin by collecting available WDNs represented as EPANET input files. The detailed procedure and file format are explained in the Data acquisition section. Based on these inputs, we implement a generation pipeline to produce synthetic scenarios. The pipeline first extracts the available hydraulic parameters, using their fields to construct an optimization configuration and recording their values in a global profiler, as detailed in the Preprocessing step. The configuration defines sampling boundary values for each parameter field, which serve as objectives of an optimization process. Throughout this process, the global profiler guides the selection and validation of potential sampling values, as described in the Hydraulic Sampling Parameters Optimization (HSPO) section. The optimized configuration is used to sample parameter values that are then fed into EPANET to simulate scenarios. Finally, only error-free scenarios are retained and packaged into a compressed format, as outlined in the Simulation section. We now describe each step in detail.
Data acquisition
The described synthetic dataset was created based on publicly available WDNs. In order to achieve this, we collected data related to the topology and the physical properties of the networks' components. As mentioned before, such information is available as configuration files describing the assets of the WDNs. Initially, we collected the asset description data of 55 WDNs. In those initial files, we found duplicated data related to the same WDN but under different names. We also found configuration files with unreadable characters that prevented proper data reading. After a data cleaning process, we included 36 WDNs in our final dataset. The full list of the WDNs included in our dataset, and their main components, is shown in Supplementary Table S1.
Our data are generated using the EPANET11 and WNTR12 physics-based simulation tools, which allow us to run simulations of the hydraulic behavior of WDNs. These tools are widely used by researchers and practitioners in the water domain. All collected configuration files are represented in the EPANET input file format (.inp).
The input file contains the metadata about the WDN and the description of the components of the network, the system’s operation, water quality, and other options used at simulation time. The file is organized in sections, where each section begins with a keyword enclosed in brackets. For example, the sections related to the network components include: [TITLE], [JUNCTIONS], [RESERVOIRS], [TANKS], [PIPES], [PUMPS], [VALVES], and [EMITTERS]. The sections related to the system’s operation include: [CURVES], [PATTERNS], [ENERGY], [STATUS], [CONTROLS], [RULES], and [DEMANDS]. The complete description of the input file format can be found in the EPANET 2.2 User Manual13.
In the context of this work, the EPANET input file represents the input to the data generation process. Accordingly, each section represents a collection of parameters that needs to be optimized in order to obtain a simulation outcome that is considered to be a valid state of the network. The [PATTERNS] section is used to specify the water consumption patterns associated with each junction node. The pattern is represented as a list of values, where each element of the list represents the water consumption at time step t. Another important section is [TIMES], where we can specify the duration of the simulation and the time step, i.e., the sampling rate of the simulation’s outputs.
At runtime, the simulation generates a set of outputs corresponding to each time step t, representing the state of the network at that given time. In our work, each network state is called a snapshot. The collection of snapshots spanning the entire duration of the simulation is called a scenario. This dataset includes 10 WDNs where each scenario spans 24 hours, and 26 WDNs where each scenario spans 1 year, with a 1-hour time step in both cases. The complete dataset comprises 1,000 scenarios per network, which represent 228 million snapshots of water networks' states.
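As a minimal illustration of how such scenarios can be produced, the following sketch loads an asset description with WNTR, configures a short-term scenario, and runs an EPANET-based simulation; the input file name is a hypothetical placeholder.

```python
import wntr

# Load a static asset description (hypothetical file name).
wn = wntr.network.WaterNetworkModel('hanoi.inp')

# Configure a short-term scenario: 24 hours with a 1-hour time step.
wn.options.time.duration = 24 * 3600          # seconds
wn.options.time.hydraulic_timestep = 3600     # seconds
wn.options.time.report_timestep = 3600        # seconds

# Run the EPANET-based simulation; each reported time step is one snapshot.
sim = wntr.sim.EpanetSimulator(wn)
results = sim.run_sim()

pressure = results.node['pressure']   # DataFrame: rows = snapshots, cols = nodes
print(pressure.shape)                 # (25, n_nodes), including t = 0
```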
Data generation
Following the network collection, we present a data generation scheme. Overall, the scheme involves three subsequent steps: Data Preprocessing, Hydraulic Parameter Optimization, and Simulation. The first step filters the targeted simulation parameters and collects statistics across the available networks. Both are then fed into an optimization algorithm that determines the sampling strategy and corresponding bounds for specific parameters. The last step samples concrete values, performs the simulations, and encapsulates the data in a compressed format. The following sections explain each step in detail.
Preprocessing step
A static network description from an input file, described in Section Data Acquisition, contains both useful simulation-oriented data and irrelevant information, such as titles, labels, and water quality parameters. Since this study focuses on hydraulic-related parameters, it is crucial to filter and retain only these specific data before proceeding to the next stage. Table 1 lists the selected parameters and their corresponding information. For some parameters, the original measurement units vary depending on the geographical region of each water network. For example, demand is measured in liters per second (LPS) in the hanoi WDN, but in gallons per minute (GPM) in the ky8 WDN. For the sake of consistency, they are converted to the corresponding International System of Units (SI) using the wrapper tool WNTR12.
These selected parameters are then stored in a YAML configuration file. It is similar to the input file but contains essential metadata for both optimization and simulation phases, such as computed duration, time step, and names of skipped nodes. The configuration also records the sampling strategy and bounds for available parameters in a specific network. This metadata is included in the final delivery for reproducibility purposes.
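For illustration, the sketch below mimics what such a configuration might contain, written as a Python dict; all field names and values are hypothetical assumptions and may differ from the released YAML files.

```python
# Hypothetical sketch of a per-network optimization configuration
# (field names are illustrative, not the released schema).
config = {
    'network': 'hanoi',
    'duration': 24 * 3600,        # seconds
    'time_step': 3600,            # seconds
    'skip_names': [],             # nodes excluded from generation
    'parameters': {
        'junction_elevation': {'strategy': 'terrain'},
        'junction_demand':    {'strategy': 'adg'},
        'pipe_diameter':      {'strategy': 'factor', 'bounds': [0.8, 1.2]},
        'pipe_roughness':     {'strategy': 'keep'},
    },
}
```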
Besides the configuration files, another type of information is computed by the profiler, which calculates statistics of the 38 parameters collected from the original water networks. For each parameter, the profiler captures the minimum, maximum, mean, standard deviation, first quartile, third quartile, parameter dimension, and the number of components that have this parameter. The statistics are computed for each baseline network and, additionally, for a global network representing the overall perspective. We leverage them to 1) determine the sampling range and size, and 2) perform data imputation in case of missing values in the following step.
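A minimal sketch of the per-parameter statistics the profiler records could look as follows; this is our illustration, and the released implementation may differ.

```python
import numpy as np

def profile(values: np.ndarray) -> dict:
    """Summary statistics the profiler records for one hydraulic parameter."""
    return {
        'min': float(np.min(values)),
        'max': float(np.max(values)),
        'mean': float(np.mean(values)),
        'std': float(np.std(values)),
        'q1': float(np.quantile(values, 0.25)),
        'q3': float(np.quantile(values, 0.75)),
        'dim': values.ndim,                 # scalar vs. series parameter
        'n_components': len(values),        # components carrying this parameter
    }
```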
Hydraulic Sampling Parameters Optimization (HSPO)
Consider a WDN with $n$ nodes and $m$ links, where each node and link can have three types of parameters: static, pattern, and curve. A static parameter is a scalar or categorical value assigned per component, such as elevation or status. A pattern is a time series that typically changes throughout the scenario (e.g., junction input demand, head pattern). A curve defines the relationship between two measurements, such as a pump curve, which reflects a pump's operating capacity based on flow rate and head. Assume a node has $s_n$ static parameters, $p_n$ patterns, and $c_n$ curves, each curve with a maximum length $l_n$, with corresponding parameters for edges denoted $s_e$, $p_e$, $c_e$, $l_e$. Given a simulation duration $d$, a simulation candidate lies in a space of $s_n + p_n d + c_n l_n + s_e + p_e d + c_e l_e$ dimensions. Given this high dimensionality, we consider an alternative approach: identifying a sampling strategy that defines appropriate values per parameter to generate a simulation candidate, reducing the search space while preserving data diversity. We call this the HSPO problem.
In particular, HSPO aims to identify stable pairs of sampling strategies and value bounds for each hydraulic parameter of all components. In other words, given a baseline WDN, the goal is to model numerous network variants and validate their parameters to ensure data stability within a specified time frame. There are two time frames, corresponding to the two dataset types: short-term and long-term. The short-term dataset includes scenarios observed over a 24-hour period, while the long-term dataset covers scenarios with a span of 1 year. Both use an hourly time step for sampling. Before diving into details, we outline the following potential sampling strategies to determine the value range of a specific parameter:
Keep. Following the principle "Doing nothing is better than doing anything", this strategy preserves the parameter's state as in the baseline network. This approach significantly reduces the search space and, therefore, mitigates the optimization complexity14.
Series. This strategy applies an existing series of a particular parameter across all components. For instance, a pump curve can be retrieved from the manual supplied by the manufacturer and applied to every pump within the network. The value is then shared across all scenarios.
Sampling. Given a predefined range [min, max] of a particular parameter, we uniformly sample a new value for the hydraulic parameter per component. This approach ensures that every component has its own distinct value. For patterns and curves, this strategy leverages statistics from the profiler to sample series accordingly (a code sketch of several of these strategies follows this list).
Perturbation. For a parameter, we gather the mean and standard deviation from the baseline WDN and sample from a Gaussian distribution. This strategy is beneficial when the parameter’s value is unavailable in the target network, allowing us to use values from the global perspective.
Factor. We sample a scale and a bias to apply a linear transformation to existing values gathered from the baseline network. This approach ensures consistency, which is essential for certain parameters. For example, three adjacent pipes should have similar diameters; in such a case, Factor serves as a suitable strategy, whereas Sampling and Perturbation could create a pipe bottleneck, which is a modeling anomaly in practice.
Substitute. This strategy randomly selects an existing value of the target parameter and shares it with all components. It also injects minor noise into the values to maintain diversity. Similar to the Factor strategy, it respects consistency in modeling.
Terrain. This is a special strategy applicable to junction elevations. In particular, we employ the Diamond-Square algorithm with appropriate noise to generate a 2D height map15. Given the nodal coordinates from the input file, we project the network onto the map to obtain new elevations.
Automatic Demand Generator (ADG). This sampling strategy is specially tailored for junctions’ input demands, the most crucial but scarce parameter. Due to its importance, Section Automatic Demand Generator is dedicated to describing this approach.
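The sketch below illustrates how the Sampling, Perturbation, Factor, and Substitute strategies could be realized; it is our simplified interpretation under stated assumptions, and all numeric ranges are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def sampling(lo, hi, n):
    """Sampling: one uniform draw in [lo, hi] per component."""
    return rng.uniform(lo, hi, size=n)

def perturbation(mean, std, n):
    """Perturbation: Gaussian draws around baseline (or global) statistics."""
    return rng.normal(mean, std, size=n)

def factor(baseline, scale_range=(0.8, 1.2), bias_range=(-0.1, 0.1)):
    """Factor: one shared linear transform preserves relative consistency,
    e.g., adjacent pipes keep similar diameters."""
    scale = rng.uniform(*scale_range)
    bias = rng.uniform(*bias_range)
    return scale * np.asarray(baseline) + bias

def substitute(baseline, noise_std=0.01):
    """Substitute: share one existing value across components, plus minor noise."""
    value = rng.choice(np.asarray(baseline))
    return value + rng.normal(0.0, noise_std, size=len(baseline))
```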
For each target WDN, a default blueprint configuration is set up as follows: ADG for junction input demand, Terrain for junction elevation, Factor (or Substitute) for pipe diameter, and Keep for all remaining parameters. This configuration is then fed into the HSPO algorithm, which iteratively refines the sampling strategy and sampling values for all parameters until convergence.
Particle Swarm Optimization (PSO)
Assuming Sampling is the default generation strategy, each parameter needs a lower bound $l_b$ and an upper bound $u_b$ to construct the sample space. This yields a total of $2D$ sampling parameters in the HSPO problem. To address this challenge, we chose Particle Swarm Optimization (PSO)16, a simple yet robust approach extensively validated in water-related parameter optimization tasks17,18,19.
Mathematically, PSO constructs a solution $\mathbf{X}\in\mathbb{R}^{D\times 2}$ representing a sampling configuration. This configuration drives a function $\mathrm{gensim}:\mathbb{R}^{D\times 2}\to\mathbb{R}^{out\times d}$ that yields a scenario corresponding to $out$ measurements, each a $d$-length time series. Internally, $\mathrm{gensim}$ implicitly solves a system of equations12 and typically produces a large batch of diverse scenarios in practice. From this perspective, only $N_{cases}$ created scenarios are considered to evaluate a sampling solution. Nevertheless, their measurements could exhibit anomalies, such as negative pressure or time inconsistency. As the dataset is expected to be clean, this violates our assumption. To alleviate this, we form a set of rules $R=\{r_1, r_2, \ldots\}$ in which each rule judges whether simulated outcomes are valid.
Formally, a binary function $\mathrm{validate}:\mathbb{R}^{D\times 2}\to\{1,0\}$ is defined as follows:
$$\mathrm{validate}(X)=\begin{cases}1 & \text{if }\forall r\in R,\ r(\mathrm{gensim}(X))=\mathrm{true}\\ 0 & \text{otherwise}\end{cases}$$
(1)
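In code, such a rule-based check might look like the following sketch; the rule set and the structure of the simulated outcome are illustrative assumptions.

```python
def validate(X, rules, gensim):
    """Equation (1): a candidate is valid only if every rule accepts the
    simulated outcome. Rule and outcome structures are illustrative."""
    outcome = gensim(X)
    return 1 if all(rule(outcome) for rule in rules) else 0

# Example expert-defined rules over a hypothetical outcome dict of
# pandas objects (illustrative thresholds):
rules = [
    lambda out: (out['pressure'] >= 0).all().all(),              # no negative pressure
    lambda out: out['pressure'].index.is_monotonic_increasing,   # time consistency
]
```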
Nevertheless, empirical trials indicate that simultaneously optimizing all sampling parameters struggles to converge, most frequently in long-term cases (t = 8,760). This can be attributed to the large search space. To mitigate this, we augment PSO with a divide-and-conquer approach. As shown in Fig. 1, PSO considers the sampling set of one particular parameter while keeping the other sets fixed at every step. This isolation reduces the complexity and makes PSO more manageable than addressing all parameters simultaneously. After an iteration, the updated optimal value for the selected parameter is fixed for the remainder of the epoch. A new PSO run is then executed to optimize a random candidate from the parameter list, iterating until the list is empty. This process repeats across multiple epochs until the maximum number of epochs is reached or an intermediate solution is deemed satisfactory.
Fig. 1
Illustration of the dataset generation. The left figure (a) shows a divide-and-conquer PSO optimizing a strategy’s configuration. The right figure (b) depicts the usage of the optimized configuration to sample parameter sets and simulate diverse scenarios with unique characteristics (e.g., per-node demand patterns).
At each iteration, a sampling solution can be formed as a concatenation of the latest optimized set and the other, fixed sets. We evaluate the "goodness" of this solution with a fitness function $f_{success}:\mathbb{R}^{D\times 2}\to\mathbb{R}$ computing the average success rate over $N_{cases}$ generation cases:
$$f_{success}(X)=\frac{\sum_{i=1}^{N_{cases}}\mathrm{validate}(X)}{N_{cases}}$$
(2)
Considering the stochastic nature of the generation, we set $N_{cases}$ to 100 to estimate the goodness of each sampling solution. However, merely relying on $f_{success}$ leads to a collapse of the solution, since particles tend to shrink into a local optimum, which is unrealistic and results in poor generalization. For instance, in one case of junction elevation, PSO proposed a narrow sampling range of [0.12, 0.12], resulting in flat terrain. To restrict such cases, a customized fitness function was designed.
Assume a particle $i$ has its position represented as a solution $x_i\in\mathbb{R}^{D\times 2}$; we designed a fitness function $f_{pso}:\mathbb{R}^{D\times 2}\to\mathbb{R}$ as follows:
$$f_{pso}(x_i)=f_{success}(x_i)\left(\alpha f_{ubiqr}(x_i)+(1-\alpha)f_{range}(x_i)\right)$$
(3)
where $\alpha$ is a hyper-parameter balancing the two auxiliary criteria: the diversity indicator $f_{ubiqr}$ and the range expansion measurement $f_{range}$. While the success ratio $f_{success}$ still plays a crucial role in assessing goodness, we encourage PSO to find optimal solutions beyond the baseline scenario. $f_{ubiqr}$ computes the Upper Bound of the Inter-Quartile Range (UBIQR), a statistical measure of the spread of populations20. In this study, we compare the UBIQR of junction output demand between a generated case and the baseline, denoted as $y_i$ and $y_{bl}$, respectively. For brevity, we imply a simulation executed before computing this fitness (i.e., $y_i=\mathrm{sim}(x_i)$). Mathematically, the comparison can be written as:
$$f_{ubiqr}(y_i)=\frac{\mathrm{UBIQR}(y_i)}{\mathrm{UBIQR}(y_{bl})}$$
(4)
The last fitness term, $f_{range}$, encourages the expansion of the sampling range. For the Sampling strategy with two normalized value bounds $(v_{min}, v_{max})$, the calculation is expressed as:
$$f_{range}=|v_{max}-v_{min}|$$
(5)
Using the combined fitness given in Equation (3), the modified PSO algorithm iteratively evaluates and "exploits" values of the sampling set of a specific hydraulic parameter while holding the latest states of the other parameters constant over an extended timeframe. In addition, as shown in Fig. 1, parameter permutation introduces uncertainty, allowing PSO to explore solutions within a dynamic landscape. This strategy enables us to retrieve optimal sampling sets of hydraulic parameters for all available networks. These sets are stored in the corresponding networks' configurations and are therefore leveraged by the simulation to produce data points on a large scale.
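To make the divide-and-conquer scheme concrete, the following minimal sketch optimizes one parameter's normalized bounds with a plain PSO using the fitness of Equation (3). This is our illustration, not the authors' released code; all fitness terms here are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_pso(x, f_success, f_ubiqr, f_range, alpha=0.5):
    # Equation (3): the success rate gates a convex mix of the diversity
    # indicator (UBIQR ratio) and the sampling-range width.
    return f_success(x) * (alpha * f_ubiqr(x) + (1 - alpha) * f_range(x))

def pso_single_parameter(fitness, n_particles=20, iters=50, dim=2):
    # Plain PSO over one parameter's normalized (lower, upper) bounds while
    # all other parameter sets stay fixed (divide-and-conquer scheme).
    pos = rng.uniform(0.0, 1.0, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_val)].copy()
    for _ in range(iters):
        r1 = rng.uniform(size=(n_particles, dim))
        r2 = rng.uniform(size=(n_particles, dim))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        vals = np.array([fitness(p) for p in pos])
        better = vals > pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest

# Toy stand-ins for the fitness terms (illustrative only):
f_success = lambda x: 1.0                  # assume all cases validate
f_ubiqr = lambda x: 1.0                    # UBIQR ratio vs. baseline
f_range = lambda x: abs(x[1] - x[0])       # Equation (5)
best = pso_single_parameter(lambda x: f_pso(x, f_success, f_ubiqr, f_range))
print(best)   # converges towards a wide [v_min, v_max] range
```

In the full scheme, an outer loop would run this per-parameter PSO for one randomly chosen parameter at a time, fix its result for the rest of the epoch, and repeat until the parameter list is exhausted.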
Simulation
Subfigure (b) of Fig. 1 illustrates the simulation workflow. Overall, the entire workflow leverages multi-core processing powered by a high-performance computing cluster. From the previous stage, an optimal sampling set associated with its strategy was transferred to the Generator, where we sampled the actual simulation parameters. Thanks to the prior optimization process, the sampling yields statistically sound parameter values aligned with plausible scenarios. These parameters were batched and passed through a Simulator, where EPANET was used to simulate the outcome scenarios. Importantly, only successfully executed scenarios, indicated by Error Code 0 in the EPANET feedback, were accepted. This ensures their hydraulic feasibility. Additionally, scenario validity was further assessed using expert-defined rules (e.g., in-range pressure and time consistency). Finally, the input and output parameters of the validated scenarios were encapsulated in a compressed file.
Automatic Demand Generator
The ADG algorithm aims to generate the junctions’ demand patterns for each node in a WDN. The demand pattern is defined following an additive model of three components: a daily pattern, a yearly seasonal pattern, and noise, as expressed in Equation (6).
$$D=\mathrm{daily}(x_t)+\mathrm{yearly}(x_t)+\varepsilon_t,\quad t\in T$$
(6)
where $D$ is the demand pattern of each node in the network, $\mathrm{daily}(x_t)$ is the daily pattern, $\mathrm{yearly}(x_t)$ is the yearly seasonal pattern, and $\varepsilon_t$ is white noise. The generated demand patterns are multiplier time series, i.e., factors that are multiplied by the base demand of each node specified in the configuration file of each WDN. The generated time series are normalized to the range [0, 1].
Daily Pattern
The daily pattern defines the water consumption per day based on consumption profiles: household, commercial, extreme-demand, and zero-demand. The consumption profiles are determined by splitting the 24 hours of a day into four 6-hour segments. Thus, starting at midnight, these segments represent the water consumption from 00:00 to 06:00, 06:00 to 12:00, 12:00 to 18:00, and 18:00 to 00:00. Each segment is assigned either a low, medium, or high consumption level. The ranges for low, medium, and high consumption are defined by lower and upper bounds determined at random. Thus, from N random numbers in the range [0.00, 1.00], we compute the quantiles Q1 and Q3. Then, low consumption covers [0.00, Q1), medium consumption covers [Q1, Q3), and high consumption lies in the range [Q3, 1]. For example, the household profile is represented as (low, high, medium, low). It assumes low consumption between midnight and six in the morning, followed by a peak in the morning when people are preparing for work. After noon, the demand gradually decreases during the day because people are at work, and finally the demand is low again at the end of the day when people are going to bed. In a similar way, the commercial profile is defined as (high, high, high, medium), assuming high consumption most of the time with a small decrease at the end of the day.
Using the consumption ranges described before, we generate random samples for each of the 6-hour segments. The number of samples_per_hour is determined based on the sampling frequency (time_step) defined in the configuration file. The 6-hour segments are then concatenated to compose the 24 hours corresponding to one day. These 24-hour samples are repeated to span the entire duration of the demand pattern. The daily demand pattern is generated using the periodic function described in Equation (7).
$$\mathrm{daily}(x_t)=\cos(x_t)+\sin(x_t)+z_t,\quad t\in T$$
(7)
where the $\cos(\cdot)$ and $\sin(\cdot)$ terms introduce the daily periodicity into the time series, $x_t$ represents the previously generated random sample at time $t$, and the $z_t$ term represents white noise. The noise component guarantees that each repetition of the 24-hour pattern along the time series is not an exact copy of the previous one. Finally, we use the Savitzky-Golay filter21,22 to smooth the generated time series.
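A minimal sketch of this construction, assuming a 1-hour time step and illustrative noise and filter parameters, could look as follows:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng()

# Consumption bounds from quantiles of N uniform draws (illustrative N).
draws = rng.uniform(0.0, 1.0, 1000)
q1, q3 = np.quantile(draws, [0.25, 0.75])
ranges = {'low': (0.0, q1), 'medium': (q1, q3), 'high': (q3, 1.0)}

def daily_pattern(profile, samples_per_hour=1):
    """Equation (7): cos(x_t) + sin(x_t) + z_t over four 6-hour segments,
    smoothed with a Savitzky-Golay filter."""
    x = np.concatenate([
        rng.uniform(*ranges[level], size=6 * samples_per_hour)
        for level in profile
    ])
    z = rng.normal(0.0, 0.02, size=x.size)      # white noise
    series = np.cos(x) + np.sin(x) + z
    return savgol_filter(series, window_length=7, polyorder=2)

household = daily_pattern(('low', 'high', 'medium', 'low'))   # 24 hourly values
```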
After the consumption profiles are defined, they have to be assigned to each node in the network. Hence, we need to determine which nodes belong to the household profile and which to the commercial one. Domain knowledge indicates that commercial nodes are grouped in certain regions of a WDN. To resemble this characteristic, we propose clustering the nodes into two main groups: household and commercial. The clusters are computed using the Louvain community detection algorithm, a heuristic approach that maximizes the modularity of the network23. This algorithm works in two phases. In the first phase, each node i starts isolated in its own community C. Then, the modularity gain is computed for moving each node to its neighboring communities. If there is no positive gain in modularity, the node remains in its original community. This phase is repeated until no individual move can improve the modularity. For directed graphs, the modularity gain is computed as follows23,24,25:
$$\Delta Q=\frac{k_{i,\mathrm{in}}}{m}-\gamma\frac{k_i^{\mathrm{out}}\cdot\Sigma_{\mathrm{tot}}^{\mathrm{in}}+k_i^{\mathrm{in}}\cdot\Sigma_{\mathrm{tot}}^{\mathrm{out}}}{m^2}$$
(8)
where $m$ is the size of the network, $\gamma$ is the resolution parameter controlling the size of the communities26, $k_i^{\mathrm{out}}$ and $k_i^{\mathrm{in}}$ are the outer and inner weighted degrees of node $i$, and $\Sigma_{\mathrm{tot}}^{\mathrm{in}}$ and $\Sigma_{\mathrm{tot}}^{\mathrm{out}}$ are the sums of in-going and out-going links incident to nodes in community $C$.
In the second phase, the communities found in the previous step become nodes in a new graph, and the sums of the weights within the corresponding communities become the link weights of that graph. Then the whole algorithm is applied again. The algorithm stops when no modularity gain is achieved or when the gain falls below a certain threshold.
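For illustration, the clustering step could be reproduced with the Louvain implementation shipped in networkx (available in networkx 3.x); the toy graph below is a hypothetical stand-in for a WDN topology.

```python
import networkx as nx

# Toy stand-in for a WDN topology (hypothetical node names).
G = nx.Graph()
G.add_edges_from([('n1', 'n2'), ('n2', 'n3'), ('n3', 'n4'), ('n4', 'n1'),
                  ('n4', 'n5'), ('n5', 'n6'), ('n6', 'n7'), ('n7', 'n5')])

# Louvain community detection; resolution controls community size.
communities = nx.community.louvain_communities(G, resolution=1.0, seed=42)
print(communities)   # e.g., [{'n1', 'n2', 'n3', 'n4'}, {'n5', 'n6', 'n7'}]
```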
At this stage, we have coherent communities within each WDN. Now, we need to define the number of nodes from those communities that will be assigned to either the commercial or the household profile. According to statistics provided by the association of water companies in the Netherlands, about 28% of users belong to the commercial sector27,28,29. We randomly choose the percentage_commercial from the range (0.25, 0.35), which also allows us to resemble commercial consumption profiles in other countries around the world. We set the number of nodes assigned to the commercial consumption profile as num_nodes_commercial = floor(percentage_commercial × total_number_of_nodes). After that, we iterate over the communities found in the previous stage and sequentially assign the nodes in each community to the commercial consumption profile until we reach num_nodes_commercial. The remaining nodes are then assigned to the household profile. While the household and commercial profiles are self-explanatory, extreme-demand and zero-demand are special types of consumption profiles.
The extreme-demand profile is a special case for some nodes with very high water consumption. Thus, the extreme-demand profile is represented as (high, high, high, high). Usually, an extreme node represents a group of nodes, commonly external to the water network but connected to it. We set extreme_dem_rate = 0.02, i.e., 2% of the scenarios will have nodes whose demand is always high. In addition, we limited the number of nodes per scenario that can have extreme demand values; specifically, we set max_extreme_dem_junctions = 2. Domain knowledge can help determine this parameter if the number of extreme nodes is known beforehand. The nodes assigned an extreme-demand profile are chosen at random and excluded from the nodes in the household and commercial profiles. Then, for these nodes, the demand is randomly generated in the range [Q3, 1], as described before.
The zero-demand profile is another special case, representing nodes that do not consume water but are part of the network. Thus, these nodes always have a zero demand value. Such nodes are used for monitoring and control of the network operation, or they are modeled due to a planned expansion of the network. We set zero_dem_rate = 0.05, i.e., 5% of the scenarios will have nodes whose demand is zero. Likewise, 5% of the total number of nodes in the WDN will be assigned the zero-demand profile. Alternatively, zero_dem_rate can be set to the ratio of the number of nodes in the baseline network whose base demand is zero to the total number of nodes, with the number of nodes belonging to this profile set accordingly. The zero-demand nodes are chosen at random and excluded from the remaining household and commercial profiles.
The presence and use of both extreme-demand and zero-demand nodes when modeling WDNs is observed in the baselines and also confirmed by experts in the water management domain. Including these two profiles in the generated data enables covering a wider range of pressures and demands compared to the baselines. Otherwise, if the baselines contain those types of nodes but our generation algorithm does not include them, a mismatch arises between the baselines and our data. Our goal is to extend the range of the generated data while still covering and resembling the WDN baselines.
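A minimal sketch of the community-wise profile assignment, reusing the communities from the Louvain sketch above, could look as follows (our illustration; extreme- and zero-demand nodes would be drawn afterwards and excluded from these two groups):

```python
import math
import numpy as np

rng = np.random.default_rng()

def assign_profiles(communities, total_nodes):
    """Assign nodes community by community to the commercial profile until
    the sampled commercial share (25-35%) is reached; the rest default to
    the household profile."""
    percentage_commercial = rng.uniform(0.25, 0.35)
    num_nodes_commercial = math.floor(percentage_commercial * total_nodes)
    profiles, assigned = {}, 0
    for community in communities:
        for node in community:
            if assigned < num_nodes_commercial:
                profiles[node] = 'commercial'
                assigned += 1
            else:
                profiles[node] = 'household'
    return profiles

# e.g., with the communities from the Louvain sketch above:
profiles = assign_profiles([{'n1', 'n2', 'n3', 'n4'}, {'n5', 'n6', 'n7'}], 7)
```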
Yearly Pattern
The yearly pattern defines the trend of water consumption over the entire year, considering a seasonal component with a peak consumption in summer. The default configuration assumes the European summer season starting in June with a 3-month span. In addition, to introduce variability into the data, which is beneficial for training deep learning models, we randomly move the summer period along the year for approximately 20% of the simulated scenarios. This approach introduces the seasonal patterns of other regions across the globe. The yearly pattern is composed of a yearly component, a seasonal component, and noise, as described in Equation (9).
$$\mathrm{yearly}(x_t)=y(x_t)+s(x_t)+z_t,\quad t\in T$$
(9)
where $y(x_t)$ is the yearly component generated using a Fourier time series as described by Equation (10), $s(x_t)$ is the seasonal pattern generated using a periodic cosine function as described by Equation (11), and $z_t$ is white noise.
$$y(x_t)=A_0+\sum_{n=1}^{H}\left(A_n\cos\left(2\pi\frac{n x_t}{\mathrm{num\_samples}}\right)+B_n\sin\left(2\pi\frac{n x_t}{\mathrm{num\_samples}}\right)\right),\quad t\in T$$
(10)
where the Fourier coefficients $A_n$ and $B_n$ determine the amplitude of the signal and are randomly sampled from a uniform distribution in the range [0, 1), $H$ represents the number of harmonics used for the time series, and the periodicity of the signal is 24 hours for the short-term dataset and 1 year for the long-term dataset. The periodicity is given by the number-of-samples parameter num_samples.
$$s(x_t)=C\cos\left(2\pi\frac{x_t-s_{peak}}{\mathrm{num\_samples}}\right),\quad t\in T$$
(11)
where $C$ is a constant representing the amplitude of the signal, which reaches its maximum value at the summer peak $s_{peak}$, and num_samples defines the periodicity of the signal. Finally, the yearly time series is normalized to the range [0, 1].
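The following sketch implements Equations (10) and (11) for the long-term case; the values of H, C, s_peak, and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def yearly_pattern(num_samples=8760, H=3, C=0.3, s_peak=4344):
    """Equations (10)-(11): a random Fourier series plus a seasonal cosine
    peaking at s_peak (roughly mid-summer for the default configuration)."""
    t = np.arange(num_samples)
    A = rng.uniform(0, 1, H + 1)       # A_0 .. A_H
    B = rng.uniform(0, 1, H + 1)
    y = A[0] + sum(A[n] * np.cos(2 * np.pi * n * t / num_samples)
                   + B[n] * np.sin(2 * np.pi * n * t / num_samples)
                   for n in range(1, H + 1))
    s = C * np.cos(2 * np.pi * (t - s_peak) / num_samples)
    z = rng.normal(0.0, 0.01, num_samples)            # white noise
    series = y + s + z
    # Normalize to [0, 1] as described in the text.
    return (series - series.min()) / (series.max() - series.min())
```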
Noise
The noise component $\varepsilon_t$ from Equation (6) is used to model high and unexpected fluctuations in water consumption. Such variations can be caused by unpredictable changes in consumer behavior, network maintenance, transients, or other unforeseen circumstances8. The noise component is sampled from a Gaussian distribution centered at zero, with a standard deviation randomly sampled from a uniform distribution in the range [min_noise, max_noise].
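Putting the pieces together, Equation (6) can be assembled as in the sketch below, which reuses daily_pattern() and yearly_pattern() from the earlier sketches; the noise bounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

# Daily pattern tiled across the year, plus yearly pattern and noise (Eq. 6).
daily = np.tile(daily_pattern(('low', 'high', 'medium', 'low')), 365)  # 8760 steps
yearly = yearly_pattern(num_samples=8760)
sigma = rng.uniform(0.01, 0.05)            # drawn from [min_noise, max_noise]
eps = rng.normal(0.0, sigma, size=8760)

D = daily + yearly + eps                   # demand multiplier series
D = (D - D.min()) / (D.max() - D.min())    # normalized to [0, 1]
```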
Data Records
The DiTEC-WDN dataset30 is available at https://doi.org/10.57967/hf/6341 under a CC BY 4.0 license. This dataset comprises 36,000 synthesized scenarios devised from publicly available WDNs that served as structural backbones. Specifically, the network topology, node names, and link names remain unchanged, while the other parameter values are machine-generated automatically. The repository where the raw dataset is located supports several data interface options to read and process .parquet files, allowing practitioners to select a concrete parameter or a subset of networks. Before use, the downloaded dataset requires an additional preprocessing step: the columns corresponding to the nodes listed in skip_names in the metadata, along with their adjacent links, must be removed. Network metadata is accessible in any .parquet file in the corresponding folder. Additionally, to analyze the graph topology, the metadata contains adj_list, formatted as a list of tuples (source node, adjacent link, destination node).
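As a brief illustration, the .parquet files can be read with standard tooling such as pandas; the file path and skip list below are hypothetical placeholders, and the actual repository layout may differ.

```python
import pandas as pd

# Hypothetical file path within the repository (actual layout may differ).
pressure = pd.read_parquet('ditec-wdn/hanoi/pressure.parquet')

# Drop columns for skipped nodes; in practice, read skip_names from the metadata.
skip_names = ['node_123']
pressure = pressure.drop(columns=[c for c in skip_names if c in pressure.columns])
```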
The DiTEC-WDN dataset30 comprises 36 WDNs, which include ky131, ky231, ky331, ky431, ky531, ky631, ky731, ky831, ky1031, ky1331, ky1431