Agriculture data analysis using parallel k-nearest neighbour classification algorithm

ABSTRACT


INTRODUCTION
To improve agricultural productivity, it is essential to update the system from time to time with data such as yield, crop type, and crop growth conditions, along with rainfall pattern data and weather-related information (such as pressure, humidity, and temperature). The agro data captured by field sensors is usually unstructured and is moved to a cloud environment through a gateway or the internet. For smart agro farming, an effective system is needed for storing and analysing such unstructured data on cloud platforms.
This research seeks to address these issues and proposes an effective categorization model (ECM) methodology. To categorise unstructured, multi-dimensional (high-dimensional) data into structured form, a priority-based k-nearest neighbour (KNN) algorithm is first developed. Additionally, a concurrent categorization approach using the Hadoop MapReduce (HMR) architecture is provided. Figure 1 illustrates the design of a quick and effective agro data classification algorithm for an agricultural management system.
The contributions of the proposed crop classification technique are as follows. First, a priority-based classification system for multi-dimensional, high-dimensional, unstructured agro data is developed. Next, a parallel classification approach using HMR is described. The proposed classification model can analyse real-time agro sensory data with good accuracy, reduced time, higher memory efficiency, and speedup.
Because it can analyse enormous volumes of data and extract crucial information, big data (machine learning and deep learning) is used in precision agriculture. For the purpose of monitoring environmental factors on a farm, this project uses internet of things (IoT) technology for intelligent agriculture. Three-dimensional cluster analysis (3D CA) was used to study the environmental factors impacting the farm. Hyperspectral series of images or videos accelerate the rate at which data is generated and the volume that is produced, which poses challenges for big data, especially in applications for agricultural remote sensing. We provide an overview of the IoT, big data, and artificial intelligence (AI), as well as how these technologies will impact the agri-food sector in the future [1]-[4]. We undertake an analysis of the most recent research on the application of intelligent data processing technologies in agriculture, particularly in the production of rice. We provide a unified vision for IoT technology, data processing, and practical analytics in digital agriculture. Owing to coronavirus disease-2019 (COVID-19), more people are now concerned about food safety, which is advantageous for the market share of smart agriculture. In contrast to existing solutions, the framework for integrating and analysing agricultural data from various sources provided in this research uses cloud computing (CC), which improves the solution's scalability, flexibility, affordability, and maintainability [5]-[8].
We thoroughly assess agriculture mobile crowd sensing (AMCS) and offer recommendations for approaches to agricultural data collection. Using a small quantity of ground truth data, this work offered Gaussian kernel regression for estimating rice yield from optical and synthetic aperture radar (SAR) imaging. We provide a unique joint federated learning (FL) model based on partial least squares (PLS) regression and neural networks (NN) (FL-NNPLS). This paper suggested a high-resolution spatiotemporal image fusion approach (HISTIF) made up of multiplicative modulation of temporal change (MMTC) and filtering for cross-scale spatial matching (FCSM). First, we evaluate the state of industrial agriculture and the takeaways from industrialized agricultural production patterns [9]-[12]. We start by suggesting an image compression method for data gathering. We analyse how close a drone using a long range (LoRa) radio must fly to the sensors in order to gather the data within a certain level of data quality [13]-[16].
In this study, a brand-new mechanism for automatically defining zones for variable rate application is proposed. In this work, we demonstrate an embedded system enhanced with AI that enables continuous analysis and on-site prediction of plant leaf growth dynamics. Data visualization analysis together with cluster analysis is used to identify the significant technologies for advancing intelligent agriculture that may enhance production efficiency and ensure the quality of agricultural yields [17], [18]. The paper is organized as follows. The second section presents the efficient classification methodology for analysing raw unstructured data. The penultimate section presents the experiments conducted to evaluate the accuracy of the classification model. The conclusion of the research and future work are given in the last section.

PRIORITY-BASED K-NEAREST NEIGHBOR CLASSIFICATION MODEL TO ANALYZE UNSTRUCTURED AGRO DATA
This research provides a quick and effective classification algorithm for analysing unstructured agricultural data and storing it at various cloud storage levels (providers). A priority KNN algorithm is first introduced to classify agriculture-related unstructured data into structured data. To speed up the classification process for relatively large data, a parallel classification model utilising the HMR framework is then given. Figure 2 shows the block architecture of the proposed classification model.

Figure 2. Block architecture of proposed classification model
For analysis or categorization in this work, crop-monitoring datasets gathered from [19] are used. The data consists of sensory readings acquired from various temperature, humidity, and gas sensors. This data is used to determine the conditions under which wine and banana fruits mature. The data comprises 11 attributes or dimensions, including id, time, R1, R2, R3, R4, R5, R6, R7, and R8, as well as temperature and humidity, and is made up of 919,438 data points that are dispersed throughout various locations and periods. The dataset used in this investigation is described in full in [19]. We categorised these data using priority clustering. K is set to 3 (i.e., we take into consideration three groups: not affected, averagely affected, and totally affected). K can be modified to meet the user's categorization criteria. Accordingly, we separate the data into three groups and store it in the cloud.
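As a rough illustration of the record layout described above, the following sketch parses one row into a typed record. It assumes, hypothetically, that each row is whitespace separated and lists the named fields in the order id, time, R1 to R8, temperature, humidity; the sample row is made up for illustration and is not taken from [19].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SensorReading:
    """One row of the crop-monitoring data (field layout is an assumption)."""
    record_id: int
    time: float
    gas: List[float]      # R1..R8 gas-sensor channels
    temperature: float
    humidity: float

    def features(self) -> List[float]:
        # Feature vector for classification: R1..R8, temperature, humidity.
        return self.gas + [self.temperature, self.humidity]


def parse_reading(line: str) -> SensorReading:
    """Parse a row of the form: id time R1..R8 temperature humidity."""
    parts = line.split()
    if len(parts) != 12:
        raise ValueError(f"expected 12 fields, got {len(parts)}")
    return SensorReading(
        record_id=int(parts[0]),
        time=float(parts[1]),
        gas=[float(v) for v in parts[2:10]],
        temperature=float(parts[10]),
        humidity=float(parts[11]),
    )


# Made-up sample row, for illustration only.
row = "1 0.5 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 25.0 60.0"
reading = parse_reading(row)
```

Keeping the identifier and timestamp apart from the feature vector means they do not influence the distance computations used later for clustering.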

Clustering model for classifying unstructured raw data into structured data
The suggested priority-based KNN classification model is constructed by using k-means clustering to divide the data points at each stage into L distinct regions. After clustering, the same procedure is applied iteratively to the data points in each region. The iterative calculation finishes when a region contains fewer than L data points. Algorithm 1 summarises this procedure. The parameter known as the diverging influence is the number of clusters L used when separating the data at each node, and choosing L is important for achieving a successful classification result. J_max, which represents the maximum number of clustering iterations, is another parameter of the priority-based KNN clustering method. Fewer iterations can speed up clustering at the expense of accuracy. Last but not least, the parameter Dstr governs the selection of the initial centres in the clustering algorithm. The suggested priority-based KNN clustering nevertheless achieves good convergence in minimal time. The raw input data used to perform classification is displayed in Figure 3. Figure 3 shows that the raw data is composed of 20-dimensional points, generated similarly to [19], [20]. The computational complexity depends mainly on the dimension size rather than on the size of the data (rows). Classification is carried out to identify least affected (i.e., class a), averagely affected (i.e., class b), and most affected (i.e., class c) data under the assumption described in Figure 4. The outcome of the classification model is shown in Figure 5.
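The recursive splitting described above can be sketched as a small hierarchical k-means tree. This is a minimal illustration under stated assumptions, not the authors' implementation: initial centres are drawn by uniform random sampling as a stand-in for the Dstr strategy, `max_iter` plays the role of the maximum iteration count, and the names `kmeans`, `build_tree`, and `count_points` are hypothetical.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points: List[Point], L: int, max_iter: int, rng: random.Random):
    """Plain k-means with randomly sampled initial centres (assumed strategy)."""
    centres = [list(p) for p in rng.sample(points, L)]
    groups: List[List[Point]] = [[] for _ in range(L)]
    for _ in range(max(1, max_iter)):
        groups = [[] for _ in range(L)]
        for p in points:
            nearest = min(range(L), key=lambda i: sq_dist(p, centres[i]))
            groups[nearest].append(p)
        new_centres = [mean(g) if g else centres[i] for i, g in enumerate(groups)]
        if new_centres == centres:   # converged before reaching max_iter
            break
        centres = new_centres
    return centres, groups


def build_tree(points: List[Point], L: int, max_iter: int, rng: random.Random):
    # Terminal node: fewer than L points remain in this region.
    if len(points) < L:
        return {"leaf": True, "points": points}
    centres, groups = kmeans(points, L, max_iter, rng)
    # Guard: if the split made no progress (e.g. identical points), stop here.
    if max(len(g) for g in groups) == len(points):
        return {"leaf": True, "points": points}
    children = [
        {"centre": c, "subtree": build_tree(g, L, max_iter, rng)}
        for c, g in zip(centres, groups) if g
    ]
    return {"leaf": False, "children": children}


def count_points(node) -> int:
    if node["leaf"]:
        return len(node["points"])
    return sum(count_points(ch["subtree"]) for ch in node["children"])


rng = random.Random(0)
data = [[rng.random() for _ in range(4)] for _ in range(60)]  # synthetic points
tree = build_tree(data, L=3, max_iter=10, rng=rng)
```

Every input point ends up in exactly one terminal node, so `count_points(tree)` equals the number of input points. Lowering `max_iter` shortens each split at the cost of less refined centres, which mirrors the speed and accuracy trade-off noted in the text.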

Parallelizing classification using HMR framework
Additionally, this paper proposes a parallel classification system that makes use of the HMR framework [20]. Figure 6 depicts the HMR framework's fundamental design. Since HMR follows the execute-once paradigm, all state data for iterative execution must be written to the distributed file system (DFS) and then read back in for each stage of the algorithm's calculation or evaluation. HMR is a widely used, publicly available (i.e., open-source) software model for MapReduce (MR) computations.
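The execute-once paradigm can be illustrated with a toy loop in which each stage is a separate job that loads its state from storage, updates it, and writes it back. This is only a sketch: a temporary local directory stands in for the DFS, and the state layout and the `run_stage` update rule are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def run_stage(state_path: Path, points: List[float]) -> None:
    """One execute-once job: read state, refine it, persist it again."""
    state = json.loads(state_path.read_text())        # read state back in
    data_mean = sum(points) / len(points)
    # Example update: move the centre halfway toward the data mean.
    new_state = {
        "centre": (state["centre"] + data_mean) / 2,
        "stage": state["stage"] + 1,
    }
    state_path.write_text(json.dumps(new_state))      # write state for next job


with tempfile.TemporaryDirectory() as dfs_dir:        # stand-in for the DFS
    state_path = Path(dfs_dir) / "state.json"
    state_path.write_text(json.dumps({"centre": 0.0, "stage": 0}))
    for _ in range(3):                                # three independent jobs
        run_stage(state_path, [2.0, 4.0, 6.0])
    final_state = json.loads(state_path.read_text())
```

Nothing survives in memory between stages, so the storage round trip is paid on every iteration; this is the overhead the paradigm implies for iterative algorithms such as clustering.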

Advantage of HMR
Hadoop [21] is a distributed computing framework written in the Java programming language for cloud-computing environments, and it supports the MR architecture as shown in Figure 7. HMR follows the execute-once paradigm, meaning that under an iterative execution strategy all state data must be written to the DFS and then read back in for each stage of the algorithm's calculation or evaluation. The system of Owen et al. [22] has been built to run on top of HMR and the Hadoop distributed file system [23], [24]. The Hadoop distributed file system (HDFS) is an implementation of the Google file system (GFS) in which a very large dataset is split into small blocks of equal length and a duplicate copy of each block is maintained (this process is known as data replication). When processing the data, the framework pushes computation to the virtual computing nodes that host these blocks, which increases data locality and shortens the algorithm's computation makespan. When HMR is used with HDFS, it can exploit data locality and push computation to the data it operates on, eliminating the network overhead that would otherwise be incurred when collecting data from HDFS. This may give an HMR-based implementation an edge in computing overhead compared with other distributed and parallel processing architectures.
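The ideas of block splitting, replication, and locality-aware scheduling can be sketched in a few lines. The code below is a simplified model, not HDFS itself: replica placement is plain round-robin (HDFS uses its own placement policy), and the function names, node names, and block size are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def split_into_blocks(records: List[str], block_size: int) -> List[List[str]]:
    """Cut the dataset into fixed-size blocks (the last one may be shorter)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def pick_local_node(block_id: int, placement: Dict[int, List[str]],
                    busy: Set[str]) -> Optional[str]:
    """Prefer an idle node that already stores the block (data locality)."""
    for node in placement[block_id]:
        if node not in busy:
            return node
    return None  # a real scheduler would fall back to a remote read


records = [f"row-{i}" for i in range(10)]
blocks = split_into_blocks(records, block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
chosen = pick_local_node(1, placement, busy={"n2"})
```

Because every block has more than one holder, a task can still run next to its data when the first holder is busy, which is how replication supports both fault tolerance and locality.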

Parallel classification algorithm for Hadoop MapReduce framework
HMR combines two important functions, known as map and reduce. The input dataset is divided into uniformly sized blocks of data, which are then distributed among the nodes of the Hadoop cluster. Applying a user-defined mapper function to the input of each map task produces intermediate output that serves as input to the reduce task. The reduce stage combines a two-phase shuffle with the reduction phase. The output of the map job is used as input to the shuffle phase, where the output of the already completed map tasks is shuffled and then sorted. The sorted data is then passed to the user-defined reduce function, and the output is written back into HDFS. A map stage involves several distinct map tasks.
The reduce stage is a combination of the shuffle/sort and reduce phases. In the reduce stage, the shuffle/sort phase starts only after the first map task has completed, and the shuffle phase finishes only after all map tasks have completed. Once the shuffle/sort work is over, the reduce task starts. The shuffle phase result obtained in the first cycle may differ from the result obtained in the second cycle, because the shuffle phase depends on the map cycle. The shuffle phase is therefore measured over two reduce cycles: one is called the initial shuffle and the other the typical shuffle. The reduce phase begins once the shuffle phase is finished; further details of HMR operation are given in [20]. The distributed key building technique on the Hadoop HDInsight cluster is shown in Algorithm 2. This work uses a distributed architecture to classify agricultural data, and our model achieves good accuracy, reduces computing time, and satisfies the real-time requirement, as empirically demonstrated in the next section.

ISSN: 2089-4864. Int J Reconfigurable & Embedded Syst, Vol. 13, No. 2, July 2024: 332-340
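The map, shuffle/sort, and reduce flow described above can be mimicked in a single process. The sketch below is not Hadoop code: the three stage functions are hypothetical, and the humidity thresholds used to assign the classes a, b, and c are invented purely for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_stage(blocks: List[List[float]],
              mapper: Callable[[float], Iterable[Pair]]) -> List[Pair]:
    """Run the mapper over every record; each block models one map task."""
    intermediate: List[Pair] = []
    for block in blocks:
        for record in block:
            intermediate.extend(mapper(record))
    return intermediate


def shuffle_sort(pairs: List[Pair]) -> Dict[str, List[int]]:
    """Group intermediate values by key, then order the keys."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_stage(grouped: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    return {key: reducer(key, values) for key, values in grouped.items()}


def label_mapper(humidity: float) -> Iterable[Pair]:
    # Hypothetical thresholds: class a (least), b (average), c (most affected).
    label = "a" if humidity < 50 else "b" if humidity < 75 else "c"
    yield (label, 1)


def count_reducer(key: str, values: List[int]) -> int:
    return sum(values)


blocks = [[35.0, 62.0], [81.0, 58.0, 90.0]]   # two input blocks
counts = reduce_stage(shuffle_sort(map_stage(blocks, label_mapper)), count_reducer)
```

Here the reducer only counts records per class, but the same skeleton applies when the mapper emits (cluster id, point) pairs and the reducer recomputes a cluster centre.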

Algorithm 2. Building distributed key on Hadoop HDInsight cluster
Input: Data, keyVal
Output: (…, …)
Step 1. Each worker reads its chunk of the data, with respect to the assigned function, using the Hadoop distributed file system.
Step 2. Construct the key in parallel on each worker from its data chunk and keyVal.
Step 3. Synchronize all workers.
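The three recoverable steps of Algorithm 2 (read a chunk, build keys in parallel, synchronise) can be sketched with a local thread pool. This is an assumption-laden illustration: threads stand in for cluster workers, `read_chunk` stands in for an HDFS read, and the bucketing rule in `build_keys` is a hypothetical use of keyVal, since the original key construction is not specified in the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

KeyedValue = Tuple[int, float]


def read_chunk(data: List[float], worker_id: int, num_workers: int) -> List[float]:
    """Stand-in for reading this worker's chunk from HDFS (strided split)."""
    return data[worker_id::num_workers]


def build_keys(chunk: List[float], key_val: float) -> List[KeyedValue]:
    """Hypothetical key rule: bucket each value by multiples of key_val."""
    return [(int(v // key_val), v) for v in chunk]


def worker_task(args) -> List[KeyedValue]:
    data, key_val, worker_id, num_workers = args
    return build_keys(read_chunk(data, worker_id, num_workers), key_val)


def build_distributed_keys(data: List[float], key_val: float,
                           num_workers: int = 4) -> List[KeyedValue]:
    tasks = [(data, key_val, w, num_workers) for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # list(...) waits for every worker to finish: the synchronisation step.
        parts = list(pool.map(worker_task, tasks))
    return sorted(pair for part in parts for pair in part)


keyed = build_distributed_keys([1.0, 12.0, 25.0, 7.0, 18.0], key_val=10.0)
```

The default of four workers mirrors the four slave worker nodes mentioned in the experimental setup; any other count works the same way.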

RESULT AND DISCUSSION
This section compares the proposed effective categorization model (ECM) approach with the current approach [25] and evaluates how well it performs in terms of speedup, accuracy, central processing unit (CPU) time, and memory overhead. The data is used to determine how temperature and humidity influence the effects of gases on wine and bananas. In general, spreading sensor devices around the agricultural area improves yield. The sensors monitor conditions such as temperature and humidity and support decisions based on them, such as whether to release water or apply pesticides. Additionally, agricultural production can be improved by monitoring the wind, which helps predict the onset of rain, cyclones, and other weather events in a specific location with less delay, so that the right decision can be made at the right moment with the least harm to the crops. To assess performance in terms of memory and time efficiency on the real-time agro-sensor dataset obtained from [19], such as Inspiral, this work is compared with the previous technique [26]. The experiments are carried out on the Windows 10 operating system (OS) with a 64-bit i7 processor. The machine has 16 GB of RAM and a dedicated 4 GB GPU with compute unified device architecture (CUDA) support. The HDInsight cluster, built on Azure HDInsight (A3), consists of one master worker node and four slave worker nodes.
An experiment was carried out to compare the performance of ECM with the existing models [25], [26] in terms of total CPU time, memory overhead, and accuracy attained in generating classification trees for turning unstructured input into structured data. Table 1 shows the comparison with several state-of-the-art approaches for developing a classification tree. Table 2 lists the results of this evaluation. The outcome demonstrates that the artificial neural network (ANN) performs better than a random categorization model. We therefore compare the performance improvement of the proposed model against the ANN classification model. The ECM-local classification model improves accuracy by 1.82% while decreasing overall CPU time and memory overhead by 32.85% and 55.07%, respectively. The ECM-Hadoop classification model obtains a speedup of 16, increases accuracy by 1.82%, and decreases overall CPU time and memory overhead by 95.86% and 84.05%, respectively. We also assessed how dimension size affects classification ability, as shown in Figure 8. As shown in Table 2, we set the dimension size to 4, 6, 8, and 10 and assessed the classification result in terms of total CPU time, accuracy, and memory overhead. The results of the experiment demonstrate that as dimension size rises, computation time and memory overhead also increase. Similarly, accuracy varies with dimension size: as the dimension size increases from 5 to 11, an accuracy of 2.17 is obtained. This makes it clear that the dimension size affects categorization accuracy. Overall, the outcome demonstrates the ECM model's scalable performance in comparison with state-of-the-art models.
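One way to relate the reported percentages to the reported speedup is the short calculation below. It assumes, as an interpretation rather than something stated explicitly in the text, that both CPU-time reductions are measured against the same ANN baseline; the variable names are illustrative.

```python
def remaining_fraction(reduction_percent: float) -> float:
    """Fraction of the baseline CPU time left after a percentage reduction."""
    return 1.0 - reduction_percent / 100.0


local_fraction = remaining_fraction(32.85)    # ECM-local vs. baseline
hadoop_fraction = remaining_fraction(95.86)   # ECM-Hadoop vs. baseline

# Speedup of ECM-Hadoop relative to ECM-local and relative to the baseline.
speedup_vs_local = local_fraction / hadoop_fraction
speedup_vs_baseline = 1.0 / hadoop_fraction

print(f"vs ECM-local: {speedup_vs_local:.2f}x")
print(f"vs baseline:  {speedup_vs_baseline:.2f}x")
```

Under this assumption the ratio against ECM-local comes out at roughly 16, which matches the reported speedup and suggests that the figure of 16 compares the Hadoop version with the local version rather than with the ANN baseline.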

CONCLUSION
This research establishes an efficient classification technique for analysing agro-related data in unstructured form. A priority-based KNN classification model is presented, which performs the analysis on multi-dimensional (high-dimensional) data. A distributed computing framework is adopted for the analysis, and a parallel clustering algorithm based on the Hadoop framework is developed to achieve scalable performance when analysing high-dimensional data. All experiments are carried out on real-time data collected from agro sensors. The results show that ECM-local reduces total CPU time and memory overhead by 32.85% and 55.07%, respectively, while accuracy improves by 1.82%. Likewise, the ECM-Hadoop classification model decreases total CPU time by 95.86% and memory overhead by 84.05%, improves accuracy by 1.82%, and achieves a speedup of 16. The overall results show the scalable performance of the developed ECM model when compared with several state-of-the-art approaches on parameters such as total CPU time, accuracy, memory efficiency, and speedup. Future research will evaluate the model on different datasets and aim to minimize storage and processing cost.

Figure 1. Accurate classification model's architectural design for a multi-level cloud storage concept

Figure 3. Raw input dataset used for performing classification operation

Figure 4. Classification input data for performing classification operation

Figure 5.

Figure 6. The architecture of HMR framework

Figure 8. Classification performance assessment considering different dimension size

Algorithm 1 presents the proposed priority-based KNN model.

Algorithm 1. Building priority-based KNN algorithm
Input: agriculture dataset, diverging influence L, maximum iterations J_max, centre selection strategy Dstr.
Output: structured (classified) data.
if the number of data points in the dataset is less than L then
    build terminal node with the feature points in the dataset
else
    choose L data points from the dataset as centres using Dstr
    for each chosen centre do
        build non-terminal node with that centre
        continuously apply the clustering method to the feature points assigned to that centre
    end for
end if

Table 1. Comparison with several state-of-the-art approaches for developing classification tree

Agriculture data analysis using parallel k-nearest neighbour classification … (Vimala Muninarayanappa)

Table 2. Classification performance assessment by considering different dimension size
Dimension size | Total CPU time (s) | Average accuracy | Memory overhead (kilobytes)