Using Machine Learning and Multivariate Statistical Analyses to Study Stream Health
Lynn Tao
Thomas Jefferson High School for Science and Technology
This paper was originally included in the 2021 print publication of the Teknos Science Journal.
Abstract
Impervious surface area is projected to triple within the next three decades as a direct consequence of proliferating urbanization. Impervious surfaces, which are man-made architectural features such as buildings and roads that prevent absorption of water, profoundly affect surface runoff and physio-chemical properties of stream systems. Thus, quantifying and studying impervious surfaces is crucial to understanding the breadth of anthropogenic influence. However, current methods of quantifying impervious surfaces require complex procedures, expensive software, and experienced personnel. As an alternative, we designed a novel machine learning approach that uses Google Maps and a K-Nearest-Neighbors (KNN) supervised algorithm to quantify the percentage of impervious surfaces (PoIS) surrounding 21 urban stream sites in Fairfax County, VA. Non-metric Multidimensional Scaling (nMDS) based on the Bray-Curtis (Sørensen) distance matrix was conducted to analyze the relationship between PoIS and 10 water quality parameters. Permutational Multivariate Analysis of Variance (PERMANOVA) was used to detect the strength of dissimilarities among stream sites. Our research demonstrates that impervious surfaces are negatively correlated with the ecological health of Fairfax County streams. In addition, the machine learning algorithm developed to quantify PoIS may serve as a useful tool to identify high-risk streams that should be monitored. The algorithm will help both managers and the general public better understand our urban stream environment, serving as a foundation for cost-effective water-resource management.
Introduction
The U.S. Cities Factsheet projects that the combined urbanized land area will triple within the next three decades [1]. Urbanization increases the extent of impervious surfaces, that is, surfaces covered by impenetrable materials [2]. This increase significantly changes the biological, physical, and chemical conditions of local freshwater ecosystems by influencing the structure of macroinvertebrate communities, reducing natural landscape complexity, and diminishing sustainable drainage to water bodies [3][4].
These impacts are especially evident in areas with large population growth. Fairfax County, Virginia, has seen a more than 10-fold increase in population size in the past 70 years, from 98,255 in 1950 to 1,147,532 in 2019 [5]. Consequently, most streams in Fairfax County show symptoms of “urban stream syndrome”, the consistent ecological degradation of a stream ecosystem [6]. Unfortunately, the balance between urbanization and preservation is delicate. Thus, research into the consequences of urban development is essential to cost-effective water-resource management [7].
Current methods of quantifying impervious surfaces, such as using ArcGIS or ISAT, require complex procedures, expensive software, and experienced personnel. In addition, although various statistical techniques such as Cluster Analysis, Principal Component Analysis, and Multidimensional Scaling (MDS) have been used to investigate the anthropogenic impacts on the water ecosystems [8][9], impervious surface has not been employed as a quantifiable measure in these analyses. This is likely because the relationship between impervious surfaces and stream health factors is not well understood.
Thus, the present study had a two-fold purpose. First, we wanted to develop a user-friendly, efficient, and accurate method to quantify impervious surface area around a sampling site. Second, we used this method to test whether there is a significant relationship between impervious surfaces and biological and physio-chemical stream health factors in Fairfax County. We designed a novel K-Nearest-Neighbors (KNN) supervised machine learning (ML) algorithm to classify satellite image pixels as either impervious or pervious. This algorithm extracted statistical profiles from supervised training sets and stored these profiles as three-dimensional cluster centers, then calculated the Euclidean distance from each unknown image pixel to these cluster centers to determine class membership using votes from the K nearest neighbors. Then, we applied this ML algorithm to satellite images of Fairfax County streams (n = 21) and quantified the surrounding impervious surfaces. We incorporated the percentage of impervious surfaces (PoIS) as a numerical variable in statistical analyses. We found that impervious surfaces in Fairfax County had a significant negative correlation with ecological stream health score, i.e., as impervious surface area increased, stream health declined.
Methods and Materials
Water sample collection
Streams in Fairfax County (n = 21) were assessed following the Virginia Save Our Streams (VASOS) protocol in the autumn of 2019 as part of the Integrated Biology, English, and Technology (IBET) program at Thomas Jefferson High School for Science and Technology (Table 1). The sampling points were labeled on a satellite image (Figure 1). Physical, chemical, and biological data were collected, including a multi-metric index based on macroinvertebrate richness, in addition to E. coli growth, pH, alkalinity, chloride, dissolved oxygen, phosphate, nitrates, the number of riffles, water temperature, and transparency. Selected streams had also been previously categorized as “healthy”, “moderate”, or “unhealthy” by the Fairfax County Park Authority and had an overall health score assigned following the protocols of the Virginia Department of Environmental Quality.
A health score, also called a stream ecological number or multi-metric score, is an important parameter for water quality monitoring and is part of the VASOS standard protocol. It is a weighted sum based on the percentages of various macroinvertebrate groups living in a stream, calculated using the Rocky Bottom method following the field guide to aquatic macroinvertebrates. The resulting ecological number classifies stream health into three categories: Acceptable Ecological Condition (9–12), Ecological conditions cannot be determined at this time (8), and Unacceptable Ecological Condition (0–7) [10].
Machine Learning (ML) Supervised Classification
ML supervised classification was used to calculate PoIS from satellite imaging near each water sample collection point (Figure 2). Impervious surfaces were defined as man-made architectural features that prevent absorption of water, e.g., buildings, roads, parking lots, sidewalks, brick, and asphalt. Pervious surfaces include surfaces composed of vegetation, water bodies, and bare soil.
K-Nearest-Neighbors (KNN)
The k-nearest neighbors (KNN) classifier is a supervised ML algorithm that assigns an unknown object, represented by a feature vector, to a known class by a plurality vote of its neighbors: the object is assigned to the class most common among its k nearest neighbors. KNN involves a supervised training phase and a classification phase [17].
Supervised training phase: A software program was developed in this study to assist in collecting training samples of known impervious and pervious surface types. The statistical profiles of these training examples were extracted and stored as vectors in a three-dimensional feature space (RGB). To improve accuracy and reduce noise and outliers in the sample data, the cluster center of each distinct surface type was calculated and used as the training sample vector.
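As a rough illustration of the training phase, the sketch below computes one cluster center per surface type as the mean RGB vector of its training pixels. The function name `cluster_center`, the surface-type names, and all pixel values are hypothetical; the study's own software is not reproduced here, and using the arithmetic mean as the cluster center is an assumption.

```python
from statistics import fmean

def cluster_center(samples):
    """Return the mean (R, G, B) vector of a list of RGB training pixels."""
    if not samples:
        raise ValueError("need at least one training pixel")
    # zip(*samples) regroups the pixels into three channel sequences.
    return tuple(fmean(channel) for channel in zip(*samples))

# Hypothetical training pixels, grouped by surface type (values are made up).
training = {
    "asphalt": [(60, 60, 62), (70, 68, 66), (65, 64, 64)],
    "grass": [(50, 120, 40), (60, 130, 50), (55, 125, 45)],
}

# One representative vector per surface type.
centers = {name: cluster_center(pixels) for name, pixels in training.items()}
print(centers)
```

Averaging many sampled pixels into a single vector per surface type is what damps the influence of individual noisy or outlying pixels.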
Classification phase: Each satellite image pixel was processed and its Euclidean distance to all the cluster centers of these sample vectors was calculated. The pixel was then classified as either impervious or pervious based on the plurality vote of its k nearest neighbors. In our usage of KNN based on distinct surface types, the k value chosen was 3.
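The classification phase can be sketched as follows, assuming each cluster center carries an impervious or pervious label. The center values, the names `CENTERS`, `classify_pixel`, and `percent_impervious`, and the sample pixels are all hypothetical and serve only to illustrate the k = 3 plurality vote described above.

```python
import math
from collections import Counter

# Hypothetical cluster centers: ((R, G, B), class label). Values are made up.
CENTERS = [
    ((65, 64, 64), "impervious"),     # dark asphalt
    ((110, 108, 105), "impervious"),  # road surface
    ((180, 175, 170), "impervious"),  # concrete or rooftop
    ((150, 80, 70), "impervious"),    # brick
    ((55, 125, 45), "pervious"),      # grass
    ((30, 80, 35), "pervious"),       # tree canopy
    ((40, 70, 110), "pervious"),      # water
    ((140, 110, 80), "pervious"),     # bare soil
]

def classify_pixel(rgb, centers=CENTERS, k=3):
    """Label a pixel by plurality vote among its k nearest cluster centers."""
    ranked = sorted(centers, key=lambda c: math.dist(rgb, c[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def percent_impervious(pixels, centers=CENTERS, k=3):
    """Return PoIS: the percentage of pixels classified as impervious."""
    labels = [classify_pixel(p, centers, k) for p in pixels]
    return 100.0 * labels.count("impervious") / len(labels)

# Four hypothetical pixels: gray road, green lawn, pale rooftop, blue water.
sample = [(100, 98, 96), (50, 120, 50), (175, 172, 168), (35, 75, 100)]
print(percent_impervious(sample))
```

With two classes and an odd k, the vote can never tie, which is one practical reason to prefer an odd value such as 3.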
Statistical Analysis
Two multivariate statistical techniques and a regression analysis were used to study the relationship between PoIS and the stream health factors.
Non-metric Multidimensional Scaling (nMDS) Analysis
MDS is a powerful tool for dimensionality reduction. It maps the original data into a low-dimensional space in which the distances between objects correspond to their dissimilarities [11][12]. nMDS uses an iterative process to find the optimal transformation that minimizes the stress, which is calculated as below:
where d_ij is the dissimilarity of sample i to j and D_ij is the distance between samples i and j in the Cartesian space of the ordination.
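For reference, one commonly used form of the stress is Kruskal's stress-1, written here with the symbols defined above. This is offered as an assumption rather than as the exact expression used in the study: the monotone transformation f applied to the dissimilarities and the choice of normalizing denominator vary between formulations and software routines.

```latex
S = \sqrt{\frac{\sum_{i<j}\bigl(f(d_{ij}) - D_{ij}\bigr)^{2}}{\sum_{i<j} D_{ij}^{2}}}
```

A stress of zero means the rank order of the ordination distances matches the rank order of the original dissimilarities perfectly; larger values indicate a poorer fit.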
During nMDS analysis, all the data were first standardized to avoid distortion caused by wide differences in the scales (units and magnitudes) of the variables. A scree test was then performed to choose an appropriate number of dimensions. The goodness-of-fit of the mapping was assessed with Shepard diagrams. All procedures were performed using functions from the R package vegan.
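The study ran these steps in R, so the Python sketch below is only an illustration of the preprocessing idea: standardize each variable, then compute pairwise Bray-Curtis dissimilarities. It assumes min-max standardization (which keeps values non-negative, as Bray-Curtis expects); the source does not state which standardization was applied. The function names and site values are hypothetical.

```python
def min_max_standardize(rows):
    """Rescale every column of a sites-by-variables table to the range 0..1."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero range; divide by 1.0 to avoid a zero division.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)]
            for row in rows]

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative vectors."""
    denom = sum(a + b for a, b in zip(u, v))
    if denom == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / denom

def distance_matrix(rows):
    """Full symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(u, v) for v in rows] for u in rows]

# Hypothetical sites: [PoIS (%), health score, total organism count].
sites = [[10.0, 11.0, 220.0], [35.0, 8.0, 150.0], [60.0, 5.0, 80.0]]
D = distance_matrix(min_max_standardize(sites))
for row in D:
    print([round(x, 3) for x in row])
```

A matrix like `D` is the input that an nMDS routine would then try to reproduce as distances in a low-dimensional ordination.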
PERMANOVA Analysis
Permutational Multivariate Analysis of Variance (PERMANOVA) is used to test whether groups of objects are significantly different. It is popular in ecological studies because it makes less restrictive assumptions about the data [13]. The test statistic is a pseudo-F-ratio:
$$F = \frac{SS_A / (a - 1)}{SS_W / (N - a)}$$
where $SS_W$ is the sum of squared dissimilarities within groups, $SS_A$ is the sum of squared dissimilarities among groups, $a$ is the number of groups, and $N$ is the total number of objects. The significance of this ratio, assessed by permutation, is usually used to indicate the strength of dissimilarity, as in this study.
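To make the permutation logic concrete, here is a minimal Python sketch of a one-factor PERMANOVA computed directly from a distance matrix. It is not the vegan routine used in the study; the function names, the toy data, and the choice of 999 permutations are assumptions for illustration.

```python
import random
from itertools import combinations

def pseudo_f(dist, groups):
    """Pseudo-F ratio from a square distance matrix and one label per object."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: same idea, computed per group.
    ss_within = 0.0
    for g in labels:
        members = [i for i, lab in enumerate(groups) if lab == g]
        pairs = combinations(members, 2)
        ss_within += sum(dist[i][j] ** 2 for i, j in pairs) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, groups, permutations=999, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign group labels at random
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Toy example: six hypothetical sites on one axis, two well-separated groups.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(x - y) for y in values] for x in values]
groups = ["healthy"] * 3 + ["unhealthy"] * 3
f_stat, p_value = permanova(dist, groups)
print(f_stat, p_value)
```

With only six objects there are few distinct label arrangements, so the p-value cannot become very small no matter how well separated the groups are; larger samples give the test more resolution.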
Regression Analysis
Linear regression was used to model the relationship between PoIS versus each of the physio-chemical and biological stream variables.
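A minimal sketch of an ordinary least-squares fit of one stream variable against PoIS is shown below. The helper `fit_line` and the data points are hypothetical; they are not the study's measurements and only demonstrate how a slope, intercept, and correlation coefficient are obtained.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical data: PoIS (%) and health score for five imaginary sites.
pois = [10, 20, 30, 40, 50]
score = [11, 10, 9, 7, 6]
slope, intercept, r = fit_line(pois, score)
print(round(slope, 3), round(intercept, 3), round(r, 3))
```

A negative slope here corresponds to the pattern reported in the paper: the response variable declines as PoIS increases.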
Results
KNN Supervised Classification
The K-Nearest-Neighbors (KNN) supervised classification was evaluated using precision, recall, and accuracy. In the context of impervious surfaces, precision measures the percentage of pixels classified as impervious by KNN that are truly impervious. Recall, on the other hand, measures the percentage of truly impervious pixels that were correctly classified as impervious. Accuracy represents the percentage of all pixels that were classified correctly.
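These three measures can be computed from the four cells of a confusion matrix, as in the sketch below. The counts are hypothetical and are not taken from Table 2; "positive" here means "impervious".

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts.

    tp: impervious pixels classified as impervious
    fp: pervious pixels classified as impervious
    fn: impervious pixels classified as pervious
    tn: pervious pixels classified as pervious
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical counts for 100 checked pixels.
precision, recall, accuracy = classification_metrics(tp=45, fp=5, fn=10, tn=40)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Reporting precision and recall alongside accuracy matters when one class dominates an image, because accuracy alone can look high even if the rarer class is classified poorly.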
One hundred random pixels were chosen from each of the satellite images around the stream sites. These pixels covered various impervious and pervious surface types. We were able to achieve an overall classification accuracy of 88% (Table 2).
nMDS Analysis and PERMANOVA Analysis
The sample water quality parameters were analyzed using non-metric Multidimensional Scaling (nMDS) to find the relationship between impervious surfaces and the other variables. The dataset was organized into two sets, biological and physio-chemical. nMDS (stress = 0.16, Bray-Curtis as the distance measure) and Permutational Multivariate Analysis of Variance (PERMANOVA) were conducted for each set of data.
nMDS analysis (Figure 3) revealed several patterns. Impervious surface plotted close to the stream ecology number, indicating that the two were closely associated; this was confirmed by the regression analysis. The stream ecology number was located some distance from the total organism count, though both were based on macroinvertebrates. This result is reasonable because the stream ecology number is calculated with different weight factors for different groups of macroinvertebrates rather than from the count alone. When stream health category was used as the envfit variable, streams fell into two clusters, healthy and unhealthy, based on PoIS and other biological conditions. PERMANOVA showed that these two groups were significantly different from each other (p = 0.043).
When associated with physio-chemical parameters (Figure 4), impervious surface was close to most variables, such as dissolved oxygen (DO), pH, chloride, riffles, and transparency. Comparatively, it was not as close to alkalinity and water temperature. Alkalinity levels may be particularly affected by the presence of Dulles Airport on the west side of Fairfax County. Different physio-chemical conditions placed the stream sites into two separate groups, healthy and unhealthy. PERMANOVA showed that these two groups were significantly different (p = 0.038).
Regression Analysis
Impervious surfaces showed a relatively strong linear correlation with stream biological factors. Total organism count displayed a negative correlation with PoIS, with a coefficient of correlation of 48% and a p-value of 0.04631. A positive correlation was found between PoIS and algae growth as well as between PoIS and E. coli bacterial density. These results were consistent with the above nMDS analysis.
Though nMDS analysis indicated a closeness between PoIS and stream physio-chemical parameters, the relationship was not linear. Regressing these physio-chemical parameters on PoIS produced low coefficients of correlation and non-significant p-values.
Overall stream health had the most significant correlation with impervious surface percentage, confirming the closeness between impervious surface and stream health score in the prior nMDS analysis (Figure 5). This high correlation and significance suggest that the linear regression equation can be used to model the Fairfax County stream health score from measurements of PoIS. The equation is as follows:
Stream Health Score = -0.110242 * PoIS + 12.07060
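As a worked illustration, the equation above can be wrapped in a small Python function and combined with the VASOS categories given in the Methods. The coefficients come from the equation; the function names, the rounding of the continuous prediction to the nearest whole score, and the clamping to the 0–12 range are assumptions made here for the sketch, not part of the original model.

```python
SLOPE = -0.110242     # change in health score per percentage point of PoIS
INTERCEPT = 12.07060  # predicted score at 0% impervious surface

def predict_health_score(pois_percent):
    """Apply the fitted linear model to a PoIS value given in percent."""
    return SLOPE * pois_percent + INTERCEPT

def vasos_category(score):
    """Map a score to a VASOS category (rounding and clamping are assumptions)."""
    whole = max(0, min(12, round(score)))
    if whole >= 9:
        return "Acceptable Ecological Condition"
    if whole == 8:
        return "Ecological conditions cannot be determined at this time"
    return "Unacceptable Ecological Condition"

for pois in (30, 37, 50):  # hypothetical PoIS values
    predicted = predict_health_score(pois)
    print(pois, round(predicted, 3), vasos_category(predicted))
```

For example, a site with 30% impervious surface gives -0.110242 × 30 + 12.07060 ≈ 8.76. Predictions of this kind are best read as rough screening estimates, since the model was fitted to 21 sites in a single county and season.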
Discussion and Conclusion
In this study, we designed an effective and efficient ML algorithm to calculate the percentage of impervious surfaces (PoIS) from a satellite image. We then applied our K-Nearest-Neighbors (KNN) algorithm to Fairfax County streams and performed data analyses.
We found that PoIS showed a significant correlation with almost all biological factors, which is consistent with previous studies [14][15] and supports the theory that urban development is a major contributor to the degradation of stream ecosystems [16]. We also found a trend between PoIS and stream physio-chemical factors; the relationship was nonlinear and unclear, and it warrants further investigation.
Past research showed that chemical and physical features are impacted by a wide variety of factors related to catchment area, such as stream current, varied precipitation, and regional (not just local) urbanization [15]. As a result, physio-chemical data would need to be collected across multiple seasons and multiple years to reveal more information. Comparatively, macroinvertebrate diversity and the VASOS multi-metric index are more stable over time, which makes them better indicators of stream health.
A few potential sources of error could have limited the present study. Some water parameters were unavailable for a few collection sites, so the sample size was smaller than intended for some variables and thus may have carried less statistical weight. Furthermore, differences in timing and weather may have affected readings of some stream parameters such as temperature and transparency. In the future, repeated measurements of each stream parameter should be taken and averaged to reduce the influence of outliers. In addition, collecting samples across multiple seasons and years at larger spatial scales would reduce environmental and human-induced error in the data. Further exploration of machine learning algorithms could also improve performance in classifying impervious surfaces from satellite images.
Our study found that the VASOS stream health score had the most significant correlation with PoIS. Their closeness on the non-metric Multidimensional Scaling (nMDS) plots was confirmed by their high correlation in regression analysis, which also produced a linear model. The model could be used as a basis for a quick and inexpensive way to estimate stream health score from the PoIS in its surrounding area. The computer program developed in this study can be used to identify high-risk streams, or areas that should be monitored. Future research could look into its application as a mechanism for determining the impact of proposed construction projects on the environment.
Acknowledgements
This work was supported by the TJHSST STREAM TEAM teachers: Ms. Harris, Ms. Litchford, and Dr. Morrow. My biggest thanks to Dr. Morrow, who guided my research process and provided valuable feedback on my paper. Because of your support, I was able to learn so much while completing this research. My thanks also go to Dr. Smith and Mr. Stern for their insights.
I also want to thank TJ 2023 IBET students for their support, especially the members of Group 17 (Suraj N. Vaddi, Krisha Pahwa, Shahzad K. Sohail) for their assistance in the field.
References
[1] Center for Sustainable Systems, University of Michigan. (2020). U.S. Cities Factsheet. http://css.umich.edu/factsheets/us-cities-factsheet
[2] Obiakor, M. O., et al. (2012). Effects of vegetated and synthetic (impervious) surfaces on the microclimate of urban area. Journal of Applied Science & Environmental Management, 16(1), 85-94. https://search.proquest.com/docview/1347624562?accountid=34939
[3] McGrane, S. J. (2016). Impacts of urbanization on hydrological and water quality dynamics, and urban water management: a review. Hydrological Sciences Journal, 61(13), 2295-2311. https://doi.org/10.1080/02626667.2015.1128084
[4] Peipoch, M., et al. (2015). Ecological Simplification: Human Influences on Riverscape Complexity. BioScience, 65(11). https://doi.org/10.1093/biosci/biv120
[5] U.S. Census Bureau. (July 1, 2019). QuickFacts: Fairfax County, Virginia. https://www.census.gov/quickfacts/fairfaxcountyvirginia
[6] Jastram, J. (2014). Streamflow, Water Quality, and Aquatic Macroinvertebrates of Selected Streams in Fairfax County, Virginia, 2007–12. U.S. Geological Survey Scientific Investigations. http://dx.doi.org/10.3133/sir20145073.
[7] Parece, T., & Campbell, J. (2015). Identifying Urban Watershed Boundaries and Area, Fairfax County, Virginia. Photogrammetric Engineering & Remote Sensing, 81(5), 365-372. https://doi.org/10.14358/PERS.81.5.365
[8] Wu, M., et al. (2011). Investigation of Spatial and Temporal Trends in Water Quality in Daya Bay, South China Sea. International Journal of Environmental Research and Public Health, 8, 2352-2365. doi:10.3390/ijerph8062352
[9] Akbulut, M. (2010). Assessment of Surface Water Quality in the Atikhisar Reservoir and Sarıçay Creek. Ekoloji, 19(74), 139-149.
[10] Virginia Save Our Streams. (Spring 2020). Biological Monitoring Data Form for Muddy Bottom Streams. https://vasos.org/wp-content/uploads/Rocky-Example-Datasheet.pdf.
[11] Holland, S. (2019). Non-metric multidimensional scaling (NMS). http://strata.uga.edu/8370/lecturenotes/multidimensionalScaling.html.
[12] Letten, A. (2017). Multidimensional scaling. http://environmentalcomputing.net/
[13] Joshuaebner, V. (2018). Permutational Multivariate Analysis of Variance (PERMANOVA) in R. https://archetypalecology.wordpress.com/2018/02/21/permutational-multivariate-analysis-of-variance-permanova-in-r-preliminary/
[14] Jacobson, C. R. (2011). Identification and quantification of the hydrological impacts of imperviousness in urban catchments. Journal of Environmental Management, 92(6). https://doi.org/10.1016/j.jenvman.2011.01.018
[15] Sponseller, R. A., et al. (2001). Relationships between land use, spatial scale and stream macroinvertebrate communities. Freshwater Biology, 46(10). https://doi.org/10.1046/j.1365-2427.2001.00758.x
[16] Gaffield, S. J. et al. (2003). Public health effects of inadequately managed stormwater runoff. American Journal of Public Health, 93(9), 1527-1533. https://ajph.aphapublications.org/doi/pdf/10.2105/AJPH.93.9.1527
[17] Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1), 21-27.