Use of Machine Learning Algorithms in the Classification of Forest Species

Optimizing the management of forest resources calls for alternatives that facilitate data collection. One such alternative is spectroradiometry, which consists of measuring the spectral response of a target, that is, its response to incident radiation along the electromagnetic spectrum. Combined with machine learning and pre-selected models, these measurements make it possible to identify species. Given the above, the study aimed to use machine learning algorithms to classify species by vegetation indices derived from reflectance data. The study was developed at the Federal University of Santa Maria, working with the species Ficus benjamina, Inga marginata, Handroanthus chrysotrichus, Psidium cattleianum, Salix humboldtiana, Corymbia citriodora and Myrcianthes pungens. Spectral readings of the leaves were taken using the FieldSpec®3 spectroradiometer connected to an RTS-3ZC3 integrating sphere. The reflectance values covered wavelengths from 350 nm to 2,500 nm at a spectral resolution of 1 nm. The following vegetation indices were calculated in RStudio: NDVI, SAVI, RVI, GNDVI, NDWI, NDWI2, GEMI, DVI, TVI, RVI, MSAVI, WDVI. The machine learning algorithms used were Random Forest (RF), k-Nearest Neighbors (K-NN), Naive Bayes (NB) and Support Vector Machine (SVM). RF proved to be the most appropriate algorithm in validation, with a global accuracy of 85%, followed by SVM with 71%, K-NN with 64% and NB with 35%. The indices that best discriminated the species were NDWI and SAVI.


Introduction
Species identification is essential for the preservation of forests and for precise supervision of forest management, ensuring the maintenance of existing species and providing accurate forest inventories (Paula Filho 2013). Traditionally, this kind of activity is carried out in the field and demands considerable time, money and human resources. Moreover, it depends on the flowering and fruiting season of the species, among other factors that are observed to facilitate identification via morphological characteristics.
To reduce these disadvantages and optimize the management of forest resources, alternatives that allow these data to be obtained more quickly and at lower cost have been researched. Remote sensing techniques have been shown to be promising methodologies for the identification of forest species (Kovacs, Wang & Flores Verdugo 2005).
The constant technological and methodological evolution of remote sensing data processing contributes to the accuracy of vegetation analysis. It is possible to adopt, for example, spectroradiometry, which consists of measuring the spectral response in situ, that is, close to the target, in order to reduce the interference of environmental factors that are present in the readings of other sensors (Demarez & Gastellu-Etchegorry 2000).
The final product of the spectroradiometry approach is the curve of the target's response to the incident radiation along the electromagnetic spectrum. From it, a series of parameters on the general condition of the studied variable can be estimated, and vegetation indices can be calculated, making it possible to identify the indices most suitable for the type of data being analyzed.
However, recognizing the patterns present in spectroradiometry data requires further tools. Machine learning is one alternative: it investigates computational techniques for learning and obtaining knowledge, assuming that computers learn from models (samples) provided by the researcher. Machine learning models are divided into supervised and unsupervised, and what differentiates them is the presence of labels in the data (Mitchell 1997; Rezende 2003). Supervised machine learning is based on a set of real, or training, data for which the answer is provided. In other words, a classifier (prediction model) is built from previously labeled training examples and is then able to predict the label of a new example (Mitchell 1997). Each machine learning technique has unique properties. The k-Nearest Neighbors (K-NN) algorithm is instance-based: it calculates the Euclidean distance between a new example and each instance in the database.
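The instance-based idea described above can be illustrated with a minimal sketch. It is written in Python rather than the R packages used in the study, and the function names, the value of k and the index values are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training instances."""
    # Distance from the query to every stored instance, smallest first.
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (NDWI, SAVI) pairs for two species; illustrative values only.
train_X = [(0.10, 0.60), (0.12, 0.62), (0.11, 0.58),
           (0.30, 0.40), (0.32, 0.42), (0.29, 0.38)]
train_y = ["species_a"] * 3 + ["species_b"] * 3

prediction = knn_predict(train_X, train_y, (0.28, 0.41), k=3)
print(prediction)
```

Because no model is fitted in advance, all the work happens at prediction time, which is why the choice of distance and of k matters so much for this method.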
The Random Forest (RF) algorithm (Santacruz 2015) is a learning method that groups the input variables through several decision trees, built when the method is trained (TrainData) (Oshiro 2013). The algorithm creates multiple decision trees, each trained on a random selection of part of the data (about two thirds), while the rest is used to validate the generated tree (Breiman 2001). The final output of the classifier is the class returned by most of the trees in the forest (Tan, Steinbach & Kumar 2009). Random Forest thus uses the predictions of different decision trees, which arise from resampling the original data set, and aggregates them (Inza et al. 2010).
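The resampling and voting steps described above can be sketched as follows. This is not the implementation used in the study: it assumes bootstrap sampling with replacement, as in Breiman's formulation, and the helper names and votes are hypothetical. Tree construction itself is omitted.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; rows never drawn are 'out-of-bag'."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_samples = 1000
in_bag, out_of_bag = bootstrap_indices(n_samples, rng)

# Sampling with replacement leaves roughly one third of the rows out of each
# tree's training set (the expected fraction tends to 1/e for large n).
oob_fraction = len(out_of_bag) / n_samples

# Hypothetical votes from five trees for a single leaf reading.
votes = ["Inga marginata", "Ficus benjamina", "Inga marginata",
         "Inga marginata", "Salix humboldtiana"]
forest_prediction = majority_vote(votes)
print(round(oob_fraction, 2), forest_prediction)
```

The left-out rows give each tree a validation set it never saw during training, which is what allows the forest to estimate its own error without a separate hold-out sample.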
Support Vector Machines (SVM) (Vapnik 1995) were developed from a formulation that embodies the principle of structural risk minimization (SRM), which involves minimizing an upper bound on the generalization error. The technique draws on Statistical Learning Theory and builds a binary classifier from a set of patterns (training examples). Given pairs (Xi, Yi), where Xi is the input vector and Yi is the desired classification, the objective is to use the training examples so that test examples not used in training are classified correctly. Machine learning models based on the SRM principle therefore tend to generalize better to unobserved data, which is one of the main purposes of statistical learning (Vapnik 1995).
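To make the binary classifier more concrete, the sketch below evaluates the standard soft-margin objective of a linear SVM, which trades margin width against training errors. It assumes a linear kernel and hand-picked weights; the study does not state its kernel or parameters, so the data, w, b and C here are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses over the training examples."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_term = sum(
        max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)
    )
    return margin_term + C * hinge_term


def classify(w, b, x):
    """Binary prediction: +1 on or above the hyperplane, -1 below it."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical two-class training set with labels +1 and -1.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.0  # hand-picked separating hyperplane, not a fitted one

objective = soft_margin_objective(w, b, X, y)
print(objective, classify(w, b, (2.5, 2.5)), classify(w, b, (-0.5, -0.5)))
```

Every example here lies on or outside the margin, so the hinge term is zero and only the margin term contributes. Multi-class problems such as the seven species in this study are usually handled by combining several binary classifiers of this kind.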
Naive Bayes (NB) is a classification technique based on Bayes' theorem that completely disregards the correlation between variables (features). In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Each training example can decrease or increase the probability that a hypothesis is correct, using a probabilistic model to describe the data set (Santos 2016).
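The independence assumption described above can be sketched as follows. The example assumes Gaussian class-conditional densities for continuous features, which the source does not specify, and the function names and index values are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and each feature's mean and variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for `row`."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        # Features are treated as independent, so their log-densities add up.
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total

    return max(model, key=lambda label: log_posterior(*model[label]))


# Hypothetical (NDWI, SAVI) pairs for two species; illustrative values only.
X = [(0.10, 0.60), (0.12, 0.62), (0.11, 0.58),
     (0.30, 0.40), (0.32, 0.42), (0.29, 0.38)]
y = ["species_a"] * 3 + ["species_b"] * 3

model = fit_gaussian_nb(X, y)
prediction = predict_gaussian_nb(model, (0.31, 0.41))
print(prediction)
```

Vegetation indices computed from the same bands tend to be strongly correlated, so the independence assumption is a simplification for data of this kind.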
The aim of the study was to use machine learning algorithms to classify seven forest species commonly found in the study region by vegetation indices derived from reflectance data.

Methodology and Data
The study was conducted on the campus of the Federal University of Santa Maria, whose location is shown in Figure 1. According to the Köppen classification, the climate is humid subtropical (Cfa), with an average annual temperature of 19.2 °C and rainfall well distributed throughout the year, with average annual totals ranging from 1,400 to 1,900 mm (Alvares et al. 2013).
The campus is located in southern Brazil, in a transition zone between the Central Depression and the sandstone-basalt cliff of the Southern Brazilian Plateau, with an average altitude of 113 m (INMET 2018). Variations in soil classes are accentuated in the region, with Typic Hapludalf (USDA 2003) being the predominant class in the study area. The vegetation in the region is formed by clean fields and seasonal deciduous forest, escarpments of the Serra Geral and several testimony hills (Longhi et al. 2000).
The material was collected on August 15, 2018, between 7 am and 8 am. The temperature varied from 12 °C to 13 °C and the relative humidity of the air from 73% to 85% (INPE 2013). The choice of forest species was random. Adult leaves, visibly free from pests and diseases, were collected from seven species: Ficus benjamina, Inga marginata, Handroanthus chrysotrichus, Psidium cattleianum, Salix humboldtiana, Corymbia citriodora and Myrcianthes pungens. The methodological procedures are summarized in the flowchart (Figure 2).
The spectral readings of the leaves were performed in the Remote Sensing laboratory at UFSM using the FieldSpec®3 spectroradiometer connected to the RTS-3ZC3 integrating sphere. After the optimization and calibration of the sensor system with Spectralon plates, the samples (individual leaves) were positioned with the adaxial face toward the inside of the equipment's integrating sphere. Five leaves per species were read (one reading each), totaling 35 readings. The spectra were stored on the microcomputer and recorded in a text file for further processing.
The resulting data were reflectance values for wavelengths in the range of 350 nm to 2,500 nm, with a spectral resolution of 1 nm. These data, in ".txt" format, were converted into ".csv" so that they could be processed statistically in the R software through the RStudio interface (R Core Team 2014). The vegetation indices were calculated with an R script containing functions based on the equations in Table 1. As a result, a new table, "ML Indexes", was generated in ".csv" format. Its columns hold the 12 vegetation indices and the species analyzed, while its rows hold the data.
To develop the machine learning models and evaluate the efficiency of each vegetation index, the Random Forest, k-Nearest Neighbors, Naive Bayes and Support Vector Machine algorithms were used, as implemented in the R packages presented in Table 2. For the classification process, 70% of the data were drawn for training the classifiers and the remaining 30% were used for testing.
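The index calculation and the 70/30 split can be sketched as below, in Python rather than the R script used in the study. Since Table 1 is not reproduced here, the sketch uses the commonly cited forms of NDVI and SAVI; the band positions (670 nm and 800 nm), the soil factor of 0.5, the spectrum values, the seed and the function names are all assumptions made for illustration.

```python
import random


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)


def savi(nir, red, soil_factor=0.5):
    """Soil Adjusted Vegetation Index with soil adjustment factor L."""
    return (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical leaf spectrum: wavelength in nm -> reflectance (0 to 1).
spectrum = {550: 0.12, 670: 0.05, 800: 0.45}
RED_NM, NIR_NM = 670, 800  # assumed band positions, not taken from Table 1

index_row = {
    "NDVI": ndvi(spectrum[NIR_NM], spectrum[RED_NM]),
    "SAVI": savi(spectrum[NIR_NM], spectrum[RED_NM]),
}

# 35 placeholder rows stand in for the 35 readings of the study.
rows = [{"reading": i} for i in range(35)]
train_rows, test_rows = split_rows(rows)
print(index_row, len(train_rows), len(test_rows))
```

With only five readings per species, a purely random split can leave a species with very few test rows, so a split stratified by species may be worth considering.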
In the Random Forest method, an importance ranking graph was generated for the vegetation indices in the species classification. The machine learning methods were analyzed through the confusion matrix of the test data, which tabulates the real values against the values predicted by the classifier and indicates the amount of data classified correctly. Finally, the machine learning performance table was generated, demonstrating the capacity of the methods to learn automatically from the available data.
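The confusion matrix and the global accuracy derived from it can be sketched as follows. The labels and predictions are hypothetical and are not results of the study.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are real classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def global_accuracy(matrix):
    """Share of correctly classified samples: diagonal sum over total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical test labels and classifier outputs for three classes.
labels = ["A", "B", "C"]
actual = ["A", "A", "B", "B", "C", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "C", "A"]

matrix = confusion_matrix(actual, predicted, labels)
accuracy = global_accuracy(matrix)
for label, row in zip(labels, matrix):
    print(label, row)
print(round(accuracy, 3))
```

The off-diagonal cells show which classes are confused with each other, information that a single accuracy figure hides.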

Results
The indices that showed the best efficiency to distinguish forest species, when submitted to analysis by the Random Forest algorithm, were the Normalized Difference Water Index (NDWI) and the Soil Adjusted Vegetation Index (SAVI) (Figure 3).
In the validation with the test samples, the Random Forest algorithm obtained a global accuracy of 85%, making it the most appropriate among the tested algorithms (Table 3).
In a similar study (Gaiaad et al. 2017), Random Forest reached a global accuracy of 95.3%. In a multi-temporal classification of the dynamics of land use and occupation, the Random Forest algorithm also gave the best result (Monteiro 2015). However, when five different machine learning algorithms were compared for mapping three different coffee areas, the worst results were found with the Random Forest algorithm, with an overall accuracy of 76.7% (Souza et al. 2016).
Anu. Inst. Geociênc., 2023;46:50490
The evaluation of the SVM resulted in a total accuracy of 71%, a smaller error than that of K-NN, with 64%, and NB, which reached only 35% accuracy in identifying species. In the classification of two forest types from a forest inventory associated with bands 3, 4 and 5 of the Landsat 5 TM satellite through SVM (Gonçalves, De Sá & Ribeiro 2017), an overall accuracy of 86% was obtained.
In a similar study evaluating six decision tree algorithms (Gaiaad et al. 2017), the algorithm that obtained the best performance was SVM, with 98.3%. When the performance of two machine learning algorithms, SVM and the Multi Layer Perceptron (MLP), was evaluated for the classification of land use and land cover in the Caatinga biome (Souza et al. 2010), the global accuracy values were 86.03% and 82.14% respectively, demonstrating the good performance of machine learning methods.
When the Naive Bayes algorithm was implemented in the classification of 10 species from the Atlantic Forest leaf database (Souza & Kai 2014), the accuracy rate was 70.6%. When different algorithms were evaluated in three different municipalities (Souza et al. 2016), the algorithm with the best accuracy was SVM, with overall accuracies of 85.3%, 87% and 88.3% respectively. The worst results were found with the Random Forest (76.6%) and Naive Bayes (76% and 82%) algorithms.

Conclusions
The indices that had the best performance in distinguishing the species evaluated were the Normalized Difference Water Index and the Soil Adjusted Vegetation Index. Among the machine learning methods for species classification, Random Forest had the best performance (85%).
It should also be noted that one of the contributions of this work is to highlight the use of the RStudio software, whose license is made available free of charge, thus allowing users the freedom to study how its packages operate and to adapt them to their needs. This work is a precursor to research involving other species, not only in the biome addressed but also in other Brazilian biomes, and it points to the possibility of applying the tested algorithms to a greater number of samples in future research.

Figure 1 Location of the study area.

Figure 2

Figure 3 Efficiency of vegetation indexes.

Table 2 Machine learning algorithms used to classify forest species.

Table 3 Global accuracy of the algorithms.