Cluster and Factor Analyses as Contributions to the Groundwater Quality Monitoring of the Marizal/São Sebastião Aquifer System, Alagoinhas

The Marizal/São Sebastião aquifer system is the main water supply of the municipality of Alagoinhas in the state of Bahia. However, anthropic interventions contribute to soil and groundwater pollution, increasing the need for related research. Multivariate statistical analysis is a widely used tool, helping in the investigation of groundwater quality while being capable of simultaneously evaluating diverse variables of a sample set. In this study, factor analysis and multivariate cluster analysis methodologies were applied. Ten of the most influential variables for groundwater quality were selected and then grouped into two factors. The first factor included electrical conductivity, salinity, calcium, chloride, sulfate, manganese, and iron, which are indicators of water salinity. The second factor encompassed pH, bicarbonate, and phosphate, indicating anthropic interventions and alkalinity in the environment. The multivariate cluster analysis was applied to the parameters of both factors, resulting in dendrograms with four clusters. The present study showed that the multivariate statistical analysis is an efficient tool for monitoring and can contribute to the management of groundwater quality.


Introduction
The Marizal/São Sebastião aquifer system has high hydrogeological potential and is considered the main source of water supply for the municipality of Alagoinhas (Moraes et al. 2004; Nascimento et al. 2006). However, urban and rural occupation contributes to soil and groundwater pollution, which has mobilized researchers from different fields to work with methods capable of characterizing the hydrogeological environment quantitatively and qualitatively. In general, these studies enable better management of water resources and of land use and occupation.
In this context, multivariate statistical analysis is a widely used tool. It comprises a set of statistical procedures capable of simultaneously evaluating variables from a sample set, gathering and associating similar components, investigating their interdependence, and testing hypotheses (Vicini 2005; Ferreira 1996).
When studying the similarity between variables, two methodologies are of particular interest: factor analysis (FA), using principal component analysis (PCA), and hierarchical cluster analysis (HCA). The former reduces the original variables to a smaller set of factors with minimal loss of information; the latter groups the samples of a set according to their similarities.
By applying a multivariate analysis, it is possible to select the parameters that best characterize groundwater and define the physicochemical characteristics that should be monitored, reducing costs by excluding less relevant analyses. In this context, Gomes et al. (2017), Gomes and Cavalcante (2017), Gomes and Franca (2019), and Costa et al. (2020) used the FA and HCA techniques as contributions to the management of groundwater in the state of Ceará, Brazil, a semiarid region that presents high environmental vulnerability, similar to the present study area.
The objective of the present study was to select and identify the similarity of determinant variables for groundwater quality in the Marizal/São Sebastião aquifer system in the municipality of Alagoinhas, state of Bahia, Brazil, based on factor analysis and hierarchical cluster analysis methodologies.

Geological Setting
The study area is located in the northern part of the Recôncavo Sedimentary Basin, Bahia, Brazil, featuring the Marizal/São Sebastião Aquifer System, in the municipality of Alagoinhas (Figure 1). The Marizal Formation (Km) aquifer section is classified as unconfined and consists mainly of sandstones and conglomerates, and secondarily of siltstones, shales, and limestones, deposited in an environment of alluvial fans and braided fluvial systems (Ribeiro 2008; Nascimento & Alves 2014). The aquifer portion representing the São Sebastião Formation (Kss) is classified as unconfined and semiconfined and consists of three members: Joanes River (upper member), Passagem dos Teixeiras (middle member), and Paciência (lower member). It comprises gray-yellowish, pinkish, and yellow-reddish sandstones, with particle sizes varying from fine to medium to coarse-grained. The sandstones are massive and show tabular geometry and low-angle plane-parallel stratification and/or trough cross-stratification (Nascimento & Alves 2014; Alves 2015).
Costa et al. Anu. Inst. Geociênc., 2023;46:54180

Methodology and Data
Twenty groundwater samples from wells in the Marizal/São Sebastião aquifer system were evaluated (Table 1). They were collected at the end of the rainy season (August 2021), with field support from SAAE (the water and sewage service provider of Alagoinhas). The wells were chosen because they draw water from the Marizal/São Sebastião Formations and on the basis of their spatial distribution.
In total, 18 physicochemical parameters were used in the study: pH, electrical conductivity (EC), salinity, calcium, sodium, magnesium, potassium, nitrate, chloride, bicarbonate, sulfate, fluoride, phosphate, manganese, aluminum, iron, copper, and lead. These parameters were selected considering the relevance of the major cations and anions in groundwater classification and minor elements listed in the contamination area, according to Nascimento et al. (2006), Pereira and Lima (2007), Nascimento and Alves (2014), and Alves (2015).
Physicochemical parameters were analyzed in the ALS (Life Sciences Brazil) laboratory and statistically processed using SPSS Statistics version 17.0, Excel, and PHREEQC Interactive.
A factor analysis (FA) and a hierarchical cluster analysis (HCA) were applied using SPSS Statistics software. FA aims to reduce the number of initial variables, with the lowest possible loss of information, and group them into basic categories called factors. The analysis calculates linear combinations of the original variables and then indicates the correlations within the new factor set. Together with the FA, a principal component analysis (PCA) and a normalized varimax rotation were used. The variables are first converted into standard scores (Z scores); the varimax rotation then reassigns the data to a simplified position on the Cartesian axes, so that each variable has high loadings on few factors and near-zero loadings on the others (Vicini 2005; Araújo et al. 2013).
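The Z-score conversion mentioned above can be illustrated with a minimal sketch. It assumes the sample standard deviation (n - 1 denominator); the readings are hypothetical values used only for illustration, not data from this study.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return standard scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # ASSUMPTION: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("constant variable: standard deviation is zero")
    return [(x - m) / s for x in values]

# Hypothetical electrical conductivity readings (µS/cm), illustration only.
ec = [20.0, 35.0, 41.0, 100.0, 64.0]
ec_z = z_scores(ec)

# After standardization the variable has mean 0 and standard deviation 1,
# so parameters measured in different units (pH, µS/cm, mg/L) become comparable.
print([round(z, 2) for z in ec_z])
```

Standardizing each parameter in this way keeps variables with large numeric ranges, such as electrical conductivity, from dominating the factor extraction.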
The suitability of the data for FA was assessed with the Bartlett sphericity test and the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The Bartlett sphericity test assesses the null hypothesis that the variables are uncorrelated. The KMO evaluates the adequacy of the input variables; values in the range of 0.5 to 0.9 indicate that the FA can be performed (Hair Jr et al. 1998; Araújo et al. 2013; Franca et al. 2018). The KMO value is calculated according to Equation 1 as:
$$\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}} \qquad (1)$$
where $r_{ij}$ is the correlation coefficient between variables i and j, and $a_{ij}$ is the partial correlation coefficient between variables i and j.
The hierarchical cluster analysis (HCA) aims to find and sort objects into similar clusters. It was conducted considering the maximum number of variables strongly explained by one single factor. This was achieved through the Ward method, and the squared Euclidean distance was adopted as a measure of similarity. According to Vicini (2005), the Euclidean distance is obtained through the Pythagorean theorem for a multidimensional space, where there are n individuals and each one of them has values for p variables. Considering two individuals, I and I', the distance between them is calculated according to Equation 2 as:
$$d_{II'} = \sqrt{\sum_{j=1}^{p} \left(x_{Ij} - x_{I'j}\right)^{2}} \qquad (2)$$
where $x_{Ij}$ is the value of variable j for individual I.
The HCA resulted in dendrograms with cut-off points defining the number of groups formed. The cut-off point was determined from the variation in the adjusted (rescaled) distance of the agglomeration coefficient (Ferreira 1996).
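As an illustration of the KMO measure described above, the sketch below computes the overall KMO from a correlation matrix, taking the partial correlations from the inverse of that matrix. It is a minimal sketch and not the SPSS implementation; the function names and the 3-variable matrix are hypothetical.

```python
import math

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def kmo(corr):
    """Overall KMO: sum of r_ij^2 over (sum of r_ij^2 + sum of a_ij^2), i != j."""
    n = len(corr)
    inv = invert(corr)
    sum_r2 = 0.0
    sum_a2 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Partial correlation between i and j, given all other variables.
                a_ij = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_a2 += a_ij ** 2
    return sum_r2 / (sum_r2 + sum_a2)

# Hypothetical correlation matrix with every pairwise r equal to 0.5.
example = [[1.0, 0.5, 0.5],
           [0.5, 1.0, 0.5],
           [0.5, 0.5, 1.0]]
kmo_value = kmo(example)
print(round(kmo_value, 3))  # about 0.692 for this matrix
```

The ratio approaches 1 when the partial correlations are small relative to the simple correlations, which is the situation in which a factor model is expected to summarize the variables well.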
The correlation among attributes was assessed through Pearson's correlation matrix using Excel. The coefficient indicates the degree of linear relationship between two variables, which can be either positive, when an increase in one variable corresponds to an increase in the other, or negative, when an increase in one variable corresponds to a decrease in the other. Coefficient values of 1 and -1 indicate a perfect correlation (positive and negative, respectively), while a coefficient value of 0 indicates no linear relationship between the variables. In this study, the cut-off value adopted was 0.7, for a 95% confidence level (Silva 2014).
The chemical reactions in groundwater were evaluated by calculating the saturation index (SI) with PHREEQC Interactive 3.0 software. As stated by Nascimento and Alves (2011), SI (Equation 3) is defined from the ratio between the ion activity product (IAP) of the cations and anions dissociated in an aqueous medium and the equilibrium constant (Ksp) at a given temperature, as follows:
$$\mathrm{SI} = \log_{10}\left(\frac{\mathrm{IAP}}{K_{sp}}\right) \qquad (3)$$
Lastly, the results obtained from the physicochemical analyses were compared to the maximum permissible limit (MPL) set by Ordinance No. 888/2021 of the Ministry of Health (Brasil 2021), which sets the procedures for controlling and monitoring the quality of water for human consumption and its potability standards, and by Ordinance No. 396/2008 of CONAMA (Brasil 2008), which establishes a classification and environmental directives for groundwater.
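To make the saturation index calculation mentioned above concrete, the following sketch evaluates the base-10 logarithm of IAP over Ksp for a single mineral. It does not reproduce PHREEQC, which also handles speciation and activity corrections. The ion activities are hypothetical, the calcite constant (log Ksp = -8.48 at 25 °C) is an assumed value, and the equilibrium tolerance is an arbitrary choice for illustration.

```python
import math

def saturation_index(iap, ksp):
    """SI = log10(IAP / Ksp); both arguments must be positive."""
    if iap <= 0 or ksp <= 0:
        raise ValueError("IAP and Ksp must be positive")
    return math.log10(iap / ksp)

def classify(si, tolerance=0.05):
    """Qualitative reading of SI; the tolerance is a hypothetical choice."""
    if si > tolerance:
        return "supersaturated (mineral tends to precipitate)"
    if si < -tolerance:
        return "undersaturated (mineral tends to dissolve)"
    return "near equilibrium"

# ASSUMPTION: calcite solubility product at 25 °C, log Ksp = -8.48.
KSP_CALCITE = 10 ** -8.48

# Hypothetical ion activities (dimensionless), for illustration only.
a_ca = 1e-4    # Ca2+
a_co3 = 1e-6   # CO3 2-
si_calcite = saturation_index(a_ca * a_co3, KSP_CALCITE)

print(round(si_calcite, 2), classify(si_calcite))
```

A positive SI means the solution holds more of the dissolved constituents than equilibrium with the mineral allows, so precipitation is thermodynamically favored; a negative SI indicates the opposite.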

Results
The factor analysis through PCA was first carried out with all 18 physicochemical variables: pH, electrical conductivity, salinity, calcium, sodium, magnesium, potassium, nitrate, chloride, bicarbonate, sulfate, fluoride, phosphate, manganese, aluminum, iron, copper, and lead. To obtain a satisfactory result, four successive simulations were necessary according to the FA criteria, which reduced the number of variables to about 55.5% of the initial set (from 18 to 10). The final simulation preserved 10 variables (pH, electrical conductivity, salinity, calcium, chloride, bicarbonate, sulfate, phosphate, manganese, and iron), which were then distributed between two factors that together account for 88% of the total variance. The KMO was 0.72 and the Bartlett sphericity test was significant (p < 0.01).
The correlation matrix for the 10 variables selected during the last factor simulation is shown in Table 2. Of the 45 correlation coefficients, 66.6% showed significant values and 51.1% were within the range of 0.6 ≤ | r | < 0.9, which, according to Callegari-Jacques (2003), represents a strong correlation. Of the remaining coefficients (48.9%), 11.1% were within the range of 0.3 ≤ | r | < 0.6, which characterizes a moderate correlation, and 37.8% presented values of | r | < 0.3, which characterizes a low correlation.
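The classification of coefficients described above can be sketched as follows. The example computes Pearson's r for every pair of variables and assigns the strength classes quoted in the text. The data are hypothetical, and the label used for | r | ≥ 0.9 is an assumption, since that class is not named in the text.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Strength classes quoted in the text (after Callegari-Jacques 2003)."""
    a = abs(r)
    if a < 0.3:
        return "low"
    if a < 0.6:
        return "moderate"
    if a < 0.9:
        return "strong"
    return "very strong"  # ASSUMPTION: label for |r| >= 0.9, not given in the text

def n_pairs(n_vars):
    """Number of distinct off-diagonal coefficients for n_vars variables."""
    return n_vars * (n_vars - 1) // 2

# Hypothetical measurements for three parameters in five wells.
data = {
    "EC": [20.0, 35.0, 41.0, 100.0, 64.0],
    "chloride": [2.1, 4.0, 4.4, 11.8, 7.5],
    "pH": [5.1, 4.8, 5.6, 5.0, 5.3],
}
matrix = {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}
labels = {pair: strength(r) for pair, r in matrix.items()}

for pair, r in matrix.items():
    print(pair, round(r, 3), labels[pair])
print(n_pairs(10))  # 45 coefficients for the 10 variables retained by the FA
```

The pair count explains the 45 coefficients reported for the 10 selected variables: each variable is paired once with each of the other nine.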
The correlation matrix exhibited a strong correlation between electrical conductivity and calcium, chloride, sulfate, manganese, and iron (Table 3), which indicates a possible influence of these ions on the salinity of the groundwater in the area. Bicarbonate showed a strong correlation with pH, consistent with its alkalinizing capacity (Kehew 2001). The matrix indicated moderate correlations between pH and sulfate and phosphate; between bicarbonate and manganese and electrical conductivity; and between phosphate and chloride. In addition, low correlations were found between pH and electrical conductivity, calcium, chloride, manganese, and iron; between bicarbonate and calcium, chloride, and iron; and between phosphate and electrical conductivity, calcium, chloride, manganese, and iron.
The first factor explained approximately 63.5% of the total variance of the sample set, showing a strong correlation among the following variables: electrical conductivity, salinity, calcium, chloride, sulfate, manganese, and iron (Table 4). Chloride (0.981), electrical conductivity (0.945), and salinity (0.945) had the highest factor loadings and are indicators of groundwater salinity (mineralization). They also showed the greatest influence in characterizing groundwater quality.
The constituent parameters of factor 1 are present in the lithological composition of the Marizal/São Sebastião System (Alves 2015) and, according to the saturation index (SI) calculated with PHREEQC, can precipitate in groundwater, preferably as calcite (CaCO₃), dolomite [CaMg(CO₃)₂], alunite [KAl₃(SO₄)₂(OH)₆], and jarosite [KFe₃(SO₄)₂(OH)₆].
The second factor explained about 24.5% of the total variance of the sample set and showed a strong correlation among the variables pH, bicarbonate, and phosphate (Table 4). As stated by Nossa (2011), Peixinho (2016), and Gomes and Franca (2019), pH and bicarbonate are strongly correlated with each other and are indicators of water alkalinity, whereas phosphate plays a secondary role in the process (Pina 2012). These variables may be related to the presence of calcium carbonate in the Marizal and São Sebastião Formations. Another explanation is related to the distance traveled by groundwater from the recharge zone, along which more bicarbonates and carbonates are incorporated (Nascimento & Alves 2011). Moreover, the influence of phosphate on the environment may be associated with anthropic activities. As reported by Nascimento and Barbosa (2005), some fertilizers and pesticides used in agriculture are derived from phosphate compounds that can accumulate in the unsaturated zone of the soil, representing a source of pollution to groundwater. According to the same study, phosphate is less concerning than nitrate, as it is less mobile in the aqueous medium and is easily adsorbed by the aquifer lithology during ionic exchange, which limits its dispersion in the water. It is also found alongside feces and cleaning products, such as soaps and detergents, in domestic sewage.
The cluster analysis used the components selected from the factor analysis to group chemically similar samples. According to Franca et al. (2018), in this case, the number of groups is defined by the first major difference between the rescaled coefficients of the clustering. In the present study, these coefficients indicated a cut-off point of 3, which yielded two dendrograms, one for factor 1 and one for factor 2, each with four clusters of wells (Figure 2).
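The grouping step can be sketched with a small pure-Python version of Ward's method on squared Euclidean distances. This is not the SPSS procedure: the coefficients are not rescaled to the SPSS dendrogram scale, and the largest jump between successive merge costs is used here as a simple stand-in for the "first major difference" rule. The well scores are hypothetical.

```python
def ward(points):
    """Agglomerative clustering with Ward's criterion.

    Returns (costs, partitions): costs[t] is the increase in within-cluster
    sum of squares caused by merge t, and partitions[t] is the grouping
    (lists of sample indices) after t merges.
    """
    clusters = [([i], [float(v) for v in p]) for i, p in enumerate(points)]
    costs = []
    partitions = [[list(m) for m, _ in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                # Squared Euclidean distance between the two centroids.
                d2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
                cost = na * nb / (na + nb) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * u + nb * v) / (na + nb) for u, v in zip(ca, cb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((ma + mb, centroid))
        costs.append(cost)
        partitions.append(sorted(sorted(m) for m, _ in clusters))
    return costs, partitions

def suggest_k(costs):
    """Number of clusters left just before the largest jump in merge cost."""
    jumps = [later - earlier for earlier, later in zip(costs, costs[1:])]
    t = max(range(len(jumps)), key=jumps.__getitem__)
    n_samples = len(costs) + 1
    return n_samples - (t + 1)

# Hypothetical standardized scores of six wells on two variables.
wells = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (10.1, 0.2)]
costs, partitions = ward(wells)
k = suggest_k(costs)
groups = partitions[len(wells) - k]
print(k, groups)
```

Cutting the merge sequence just before the cost rises sharply keeps together the wells that are chemically close and separates those whose merger would greatly increase the within-cluster variance.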
Iron concentration was above the maximum permissible limit for human consumption in wells P-01 (0.44 mg/L) and P-09 (1.7 mg/L). According to the saturation index (SI) data, this result reflects the presence of iron oxides in the environment, which favors the precipitation of goethite and hematite in the groundwater of the Marizal/São Sebastião system. Excessive iron intake, as reported by Nutti et al. (2006), can harm the immune system and cause developmental delay, anemia, reduced work capacity, and death.
Manganese concentration was above the maximum permissible limit for human consumption in wells P-02 (0.11 mg/L) and P-10 (0.11 mg/L). The saturation index (SI) value reflects the presence of manganese carbonate in the water, enabling the precipitation of rhodochrosite in the groundwater of the Marizal/São Sebastião system. As stated by Ishimine (2002), excessive manganese intake directly affects the central nervous system and can cause headaches, irritability, and emotional instability.
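The comparison against the maximum permissible limits reported above can be sketched as a simple screening step. The concentrations are those quoted in the text for wells P-01, P-09, P-02, and P-10. The limit values (0.3 mg/L for iron and 0.1 mg/L for manganese) are not stated in the text and are assumed here; they should be confirmed against the ordinance before any use.

```python
# ASSUMPTION: maximum permissible limits in mg/L; not stated in the text.
MPL_MG_L = {"iron": 0.3, "manganese": 0.1}

# Concentrations (mg/L) reported in the text for the exceeding wells.
measured = {
    "P-01": {"iron": 0.44},
    "P-09": {"iron": 1.7},
    "P-02": {"manganese": 0.11},
    "P-10": {"manganese": 0.11},
}

def exceedances(samples, limits):
    """Return (well, parameter, value, value/limit) for each value above its limit."""
    flagged = []
    for well, params in sorted(samples.items()):
        for name, value in params.items():
            limit = limits.get(name)
            if limit is not None and value > limit:
                flagged.append((well, name, value, round(value / limit, 2)))
    return flagged

flagged = exceedances(measured, MPL_MG_L)
for well, name, value, ratio in flagged:
    print(f"{well}: {name} = {value} mg/L ({ratio} x limit)")
```

Reporting the ratio to the limit, rather than only a pass or fail flag, helps rank wells for follow-up monitoring.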
Regarding water quality for irrigation, the samples from cluster 1 met the maximum permissible limit established by CONAMA Ordinance No. 396/2008, except for sulfate, which does not have a defined maximum concentration. Cluster 1 was also characterized by waters with low levels of salinity and electrical conductivity ranging from 19.8 to 40.8 µS/cm.
Cluster 2 comprised 35% (7 wells) of the samples analyzed, all of which met the potability standards established by Ordinance No. 888/2021 of the Ministry of Health and by CONAMA Ordinance No. 396/2008 regarding the concentration of chloride, sulfate, manganese, and iron. Thus, this cluster showed the best quality level among the waters analyzed. The water from this cluster is suitable for irrigation, with the concentrations of chloride, manganese, and iron within the maximum permissible limits set by CONAMA Ordinance No. 396/2008. Lastly, it showed electrical conductivity values varying between 35.5 and 100.9 µS/cm, which indicates low salinity.