Author(s): Sonia Guerra-Rodriguez; David J. Vicente; Jorge Rodriguez-Chueca; Alejandro Perez-Aja; Encarnacion Rodriguez; Fernando Salazar
Keywords: Advanced oxidation process; Feature importance; Machine learning; Water disinfection
Abstract: Advanced Oxidation Processes (AOPs) are emerging as alternatives to conventional treatments because of their efficiency in removing organic and biological pollutants from water. AOPs are based on the generation of free radical species with high oxidation potential, which come from the decomposition of oxidants such as H₂O₂ or peroxymonosulfate (PMS). This research paper presents the use of Machine Learning (ML) as a tool to optimize water disinfection through different AOPs. The data were taken from real experimental laboratory tests. The experimental framework consisted of 91 trials in which different configurations of techniques and substances related to water disinfection were combined. In each trial, the Enterococcus sp. concentration (CFU/mL) was measured over time. The parameters that define the design of each experiment were the following: (i) type of water (distilled, saline, and simulated wastewater), (ii) whether or not ultraviolet light was used, (iii) type of oxidant (PMS, H₂O₂, and sulfites), and (iv) type of catalyst (none, Fe(II), and Fe(III)-Cit). Each of these parameters, together with the variable ‘Time’, was used as an ‘input feature’ to build different data-driven models through ML techniques; the ‘output’ of these models was the ‘Enterococcus sp. Concentration’. The models were built with Scikit-learn, an ML library for the Python programming language. Three supervised ML algorithms from this library were used: decision trees (DT), random forests (RF), and gradient boosted regression trees (GBRT). DTs are easy to interpret but are also prone to overfitting. RF and GBRT are categorized as tree ensemble methods, since they combine multiple DTs to create more robust and less biased models. They were used because they have proven effective for regression on a wide range of datasets, and because they can quantify the importance of the input variables.
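As a rough illustration of the modelling setup described above, the following sketch one-hot encodes the four design parameters, keeps ‘Time’ as a numeric feature, and fits the three Scikit-learn regressors. The data are synthetic and the column names (`water`, `uv`, `oxidant`, `catalyst`, `time_min`), the log10 target, and all model settings are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the experimental table (NOT the authors' data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "water": rng.choice(["distilled", "saline", "simulated_wastewater"], n),
    "uv": rng.integers(0, 2, n),                      # 1 = UV light on
    "oxidant": rng.choice(["PMS", "H2O2", "sulfite"], n),
    "catalyst": rng.choice(["none", "Fe(II)", "Fe(III)-Cit"], n),
    "time_min": rng.uniform(0, 60, n),                # hypothetical time unit
})

# Hypothetical target: log10 of the concentration, decaying faster with UV.
y = 6.0 - 0.05 * df["time_min"] * (1 + df["uv"]) + rng.normal(0, 0.2, n)

# One-hot encode the categorical design parameters.
X = pd.get_dummies(df, columns=["water", "oxidant", "catalyst"], dtype=float)

models = {
    "DT": DecisionTreeRegressor(max_depth=4, random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GBRT": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

# Fit each model and collect its impurity-based feature importances.
importances = {}
for name, model in models.items():
    model.fit(X, y)
    importances[name] = pd.Series(
        model.feature_importances_, index=X.columns
    ).sort_values(ascending=False)

print(importances["GBRT"].head())
```

Because the encoding splits each categorical parameter into several columns, the importance of a parameter such as ‘type of oxidant’ would be obtained by summing the importances of its dummy columns.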
The hyperparameters of the models were optimized, i.e., number of nodes, maximum depth, number of trees (for RF and GBRT), and learning rate (for GBRT). Since the dataset was small, k-fold cross-validation was used to evaluate the models. Moreover, due to the complexity of the analyzed data and the heterogeneous nature of the variables, several models were assessed in order to find the one that best fitted the experimental data. As a result, model interpretation yielded information about the relative importance of the parameters used in the experiments.
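The hyperparameter optimization and k-fold evaluation described above could be sketched as a grid search over a GBRT model. This is a minimal sketch on synthetic data: the grid values, the five folds, the RMSE scoring, and the mapping of ‘number of nodes’ to Scikit-learn's `max_leaf_nodes` are all assumptions for illustration, not the settings reported by the authors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the encoded feature matrix (NOT the authors' data).
rng = np.random.default_rng(1)
n = 150
X = pd.DataFrame({
    "uv": rng.integers(0, 2, n).astype(float),
    "oxidant_PMS": rng.integers(0, 2, n).astype(float),
    "time_min": rng.uniform(0, 60, n),
})
y = 6.0 - 0.04 * X["time_min"] * (1 + X["uv"]) + rng.normal(0, 0.2, n)

# Hypothetical search grid over the hyperparameters named in the abstract.
param_grid = {
    "n_estimators": [50, 100],       # number of trees
    "max_depth": [2, 3],             # maximum depth
    "max_leaf_nodes": [None, 8],     # assumed meaning of "number of nodes"
    "learning_rate": [0.05, 0.1],    # GBRT only
}

# Shuffled 5-fold cross-validation scores every combination in the grid.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
cv_rmse = -search.best_score_        # sign flipped back to a positive RMSE
print(best_params, round(cv_rmse, 3))
```

When several rows come from the same trial, as with concentrations measured over time, splitting them across folds can make the scores optimistic; grouping folds by trial (for example with `GroupKFold`) is one possible alternative.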
DOI: https://doi.org/10.3850/IAHR-39WC252171192022903
Year: 2022