Vehicle detection and classification using three variations of the you only look once (YOLO) algorithm

ABSTRACT
This study compares three variations of the you only look once (YOLO) algorithm, YOLOv3, YOLOv4, and YOLOv5, for vehicle detection and classification. The models were trained on the COCO and open images datasets and evaluated on images of different vehicle types captured in Shah Alam and Kuala Lumpur, Malaysia, during the day and at night, including occluded vehicles and non-uniform illumination. The experimental results show that YOLOv4 achieved the best overall balance of detection accuracy and speed, while YOLOv5 detected vehicles successfully but was slower and more prone to misclassification, particularly for occluded vehicles under poor lighting. Directions for improving YOLOv5 and applying it to vehicle counting are outlined.


INTRODUCTION
Perceiving the surrounding environment is one of the fundamental needs of autonomous vehicles and vehicles with advanced driver assistance systems (ADAS). These vehicles use sensors including cameras, lidar, radar, and ultrasonic devices to gather environmental data. Compared with the other sensor types, cameras provide the highest resolution and the most detailed textural information about the vehicle's environment. However, interpreting camera-captured visual data is still a difficult task for computers, particularly when real-time performance is needed [1]. The task is made more challenging by fluctuating weather and illumination, complicated backgrounds, and object occlusions [1].
Solutions are already in place to address the automotive industry's traffic control issues; Toyota, for example, makes use of its own technology [2]. Mobileye [3], NVidia [4], and StradVision [5] are just a few of the companies developing object detection algorithms at the cutting edge of the technology. Various studies have been conducted to detect and classify objects for autonomous vehicles, and supervised machine learning (SML) is one of the most popular techniques. For SML to classify input images, the classifiers must first extract features from them; the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT) are among the most widely used handcrafted feature descriptors.
Although fast, precise, and adaptive, these methods do not perform well with large numbers of features or under occlusion [9]. Deep learning (DL) approaches are increasingly being used by researchers for automatic vehicle detection and classification. Artificial neural networks (ANNs) underlie the architecture of DL, with layers that can learn and make intelligent decisions on their own [10], [11]. Popular DL approaches for object detection and classification include the region-based convolutional neural network (R-CNN) [12], [13], Fast R-CNN [14], [15], Faster R-CNN [16], and you only look once (YOLO) [17]. These methods are adaptable, dependable, and precise [18]. Detecting vehicles of diverse shapes, colours, and types, however, remains a significant challenge.
Building a robust object detection and classification algorithm remains a challenge today. The magnitude of the problem is easily underestimated because human eyes effortlessly detect and classify vehicles of different shapes, colours, and types, such as cars, lorries, and buses. Designing a machine that mimics human vision, however, raises numerous challenges. One such problem is the disparity in vehicle heights and shapes [19]: vehicle sizes can vary in height and shape even within a single vehicle type. Conversely, when many vehicle entities share the same height and shape, it is hard for a computer system to detect and classify them by distinguishing their shapes [20].
Another complex problem in the detection and classification process is occlusion, which limits the amount of information available in an image [21]. In real-world scenarios, occlusions can occur between vehicles, such as cars, trucks, and buses, and between vehicles and other objects, such as humans, animals, or buildings. These occlusions can reduce the accuracy of detection and classification algorithms, particularly in a crowded environment [19]. A further challenge is that different lighting conditions can affect the visibility and even the appearance of a vehicle. Figure 1 shows some occlusion samples that degrade vehicle detection and classification performance; in this figure, the parked vehicles are occluded by other vehicles in a crowded car parking area. When combined with the poor lighting caused by fog, occlusion can be a major obstacle to accurate object detection and classification. Fast R-CNN suffers from misclassification errors, such as labelling background patches in an image as vehicles, because it cannot exploit the larger context [14]. Faster R-CNN has been used for vehicle detection and tracking; however, the training and testing images were captured only during the day and not at night [22], even though detecting and classifying vehicles at night is just as important. YOLO is faster than faster R-CNN, but it cannot detect small objects such as vehicles far from the camera [23], [24]. YOLOv2 was created to address this limitation and can also detect and recognise objects with significant differences between classes, such as humans and cars, but not different types of vehicles [25]. A performance comparison of YOLOv3 and YOLOv4 for small object detection in aerial images showed that YOLOv4 has better accuracy but not speed [26], [27]. YOLOv5 has demonstrated exemplary performance in detecting vehicle fronts, but not rears [28]. In detecting faulty unmanned aerial vehicles (UAVs), YOLOv5 outperforms YOLOv4 and YOLOv3 in accuracy but has a slightly slower inference speed [27]. Thus, this study compares the speed and detection accuracy of YOLOv3, YOLOv4, and YOLOv5 on videos of various vehicle types, captured during the day and at night and from various views using a mobile phone mounted on a car's dashboard.

THEORETICAL OVERVIEW
Due to its powerful learning capability and its advantages in coping with occlusion, scale transformation, and background changes, DL-based object detection has been a prevalent research topic in recent years. DL-based object detection systems must deal with sub-problems such as occlusion, clutter, and low resolution [29]. Existing DL-based detectors fall into two types. The first is the CNN-based two-stage detector, which has the advantage of accuracy but not speed: the first stage generates region proposals, while the second stage classifies and regresses those regions, as in faster R-CNN. The second type, exemplified by YOLO, performs classification and regression directly on dense anchor boxes without generating a sparse region of interest (RoI).
Earlier convolutional neural network (CNN) models required a fixed input size; AlexNet, for example, accepts only 224×224-pixel images. The key feature of SPP-Net is its spatial pyramid pooling (SPP) layer, which allows a CNN to produce a fixed-length representation regardless of the size of the image or region of interest, without rescaling [30]. When SPP-Net is used for object detection, the feature maps need to be computed only once for the entire image; fixed-length representations of arbitrary regions can then be used to train detectors, instead of repeatedly recomputing the convolutional features. Without losing detection accuracy (VOC07 mAP = 59.2%), SPP-Net is 20 times faster than R-CNN. Although SPP-Net successfully increased detection speed, certain limitations remain. First, training is still multi-stage. Second, SPP-Net fine-tunes only its fully connected layers and ignores all preceding layers [31].
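To make the idea concrete, the following is a minimal PyTorch sketch of an SPP layer; the pyramid levels (1, 2, 4) are illustrative rather than those of the original SPP-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pools a feature map over several fixed grids and concatenates the
    results, producing a fixed-length vector for any input size."""

    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels  # pooling grids: 1x1, 2x2 and 4x4

    def forward(self, x):  # x: (N, C, H, W) with arbitrary H and W
        pooled = [
            F.adaptive_max_pool2d(x, output_size=level).flatten(start_dim=1)
            for level in self.levels  # each entry: (N, C * level * level)
        ]
        return torch.cat(pooled, dim=1)  # (N, C * (1 + 4 + 16)), independent of H, W

# Feature maps of different spatial sizes yield vectors of identical length.
spp = SpatialPyramidPooling()
for h, w in [(13, 13), (19, 26)]:
    print(spp(torch.randn(1, 256, h, w)).shape)  # torch.Size([1, 5376]) both times
```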
Lin et al. [32] proposed feature pyramid networks (FPN) built on faster R-CNN. Before FPN, most DL-based detectors ran detection only on a network's top layer; while the deeper layers of a CNN are excellent at recognising categories, they are less suited to localising objects. FPN addresses this with a top-down architecture with lateral connections that constructs high-level semantics at all scales. Because a CNN's forward propagation naturally forms a feature pyramid, FPN demonstrates substantial performance gains in detecting objects of various scales. Used within a plain faster R-CNN, FPN achieves state-of-the-art single-model detection results on the Microsoft common objects in context (MSCOCO) dataset, and it is now a fundamental component of several modern detectors [31]. Girshick et al. [12] suggested a multi-stage, region-based classification method realised as the R-CNN. The framework comprises a region proposal component, a CNN feature extractor, and a support vector machine (SVM) classifier.
The CNN extracts feature vectors for both positive and negative ground-truth regions while keeping the training data compact. These feature vectors are then fed to the SVM for classification. The external region proposal stage uses selective search to generate about 2,000 category-independent, fixed-size candidate regions that may contain objects. The CNN extractor then converts every candidate region into a vector, which the SVM uses to classify the region. Finally, a linear regression model refines the bounding boxes, and an intersection over union (IoU) overlap test against the best-scoring box eliminates duplicate detections.
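Because IoU recurs throughout these detectors, both for matching detections to ground truth and for removing duplicates, a minimal Python sketch is given below; boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```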
In comparison with earlier techniques, R-CNN achieves higher detection precision. However, R-CNN's multi-stage pipeline has significant drawbacks. Generating region proposals and extracting features for each region is slow, which slows down the overall system and makes detection quality dependent on the region predictions. Moreover, the separate training stages are difficult to automate end to end; in particular, the CNN component cannot be updated while the SVM classifier is being trained [33]. Fast R-CNN [14] can be regarded as an extension of R-CNN and SPP-Net. It was created to allow end-to-end training and testing and, like SPP-Net, shares the convolutional computation across region proposals, so images of any scale can be processed. Fast R-CNN differs from SPP-Net's multi-scale pooling, however, in that it employs only a single RoI pooling layer.
This approach is advantageous for training because the loss gradients propagate back through the RoI pooling layer to update the convolutional layers. On each labelled RoI, Fast R-CNN combines a class score loss and a bounding box loss into a multi-task loss, so the entire network can be trained jointly. According to the reported results, these innovations simplify training and evaluation while enhancing detection accuracy. With the network itself accelerated, the slow region proposal stage is exposed as the remaining bottleneck [34].
Faster R-CNN [15] successfully solved the region proposal problem. It introduced a region proposal network (RPN) that generates proposals without external methods such as selective search, sharing the image's convolutional computation with the detection network. On top of the shared feature map, the RPN examines each position with anchors of nine different aspect ratios and scales and predicts the presence and location of objects. A Fast R-CNN detector is then applied to the RPN's high-scoring regions for further classification and bounding box refinement. Faster R-CNN forms the basis of many subsequent detection systems and locates objects accurately across multiple datasets. Liu et al. [34] introduced the single shot multibox detector (SSD), the second one-stage detector in computer vision. The fundamental contribution of SSD is the introduction of multi-reference and multi-resolution detection techniques, which considerably increase the detection precision of a single-stage detector, especially for small objects. SSD offers benefits in both detection speed and accuracy; the fundamental distinction from previous detectors is that SSD detects objects at various scales on different network layers, whereas earlier detectors detect only on their top layers [31].
RetinaNet was proposed by Lin et al. [35]. Its primary motivation is the extreme foreground-background class imbalance observed when training dense detectors. RetinaNet introduces a novel loss function known as "focal loss", which reshapes the conventional cross-entropy loss so that training concentrates on hard, misclassified examples. Thanks to the focal loss, one-stage detectors can retain extremely fast detection speeds while matching the accuracy of two-stage detectors [36].
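A short PyTorch sketch of the binary focal loss may make the idea concrete; alpha = 0.25 and gamma = 2 are the values reported for RetinaNet, while the example logits are illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma, so easy,
    well-classified examples contribute little to the total loss."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

# An easy, correctly scored background box is down-weighted almost to zero,
# while a hard, misclassified foreground box keeps most of its loss.
print(focal_loss(torch.tensor([-6.0]), torch.tensor([0.0])).item())  # ~1e-8
print(focal_loss(torch.tensor([-6.0]), torch.tensor([1.0])).item())  # ~1.5
```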
The YOLO algorithm is a family of detectors first developed in 2016 by Redmon et al. [17]; it divides the input image into an S × S grid and uses a one-stage network to detect and recognise objects simultaneously in real time. Thanks to its single unified architecture, YOLO's detection speed is roughly ten times faster than that of other cutting-edge systems. The YOLOv1 network comprises 24 convolutional layers followed by two fully connected layers; some convolutional layers use 1 × 1 convolutions to reduce the depth dimension of the feature maps. Fast YOLO employs only nine convolutional layers, at a cost in accuracy.
Redmon and Farhadi [23] proposed the subsequent YOLOv2 edition. A new Darknet-19 network configuration was generated by deleting the network's fully connected layers and applying batch normalisation to each layer. Instead of the hand-picked anchor boxes of the faster R-CNN anchor mechanism, K-means clustering over the training boxes is employed to select anchor priors, and box locations are obtained by direct prediction relative to the grid cell. YOLOv2 outperforms YOLOv1 in both precision and speed of target recognition; compared with two-stage detectors, however, it trades accuracy for speed, making it less suitable for safety-sensitive tasks such as security and autonomous vehicles. With YOLO, Redmon and Farhadi [23] pioneered merging feature extraction and object localisation into a single monolithic block.
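YOLOv2's anchor selection can be sketched as k-means over the training boxes' widths and heights, with 1 − IoU as the distance metric. The numpy sketch below uses synthetic box sizes and a mean update; published implementations often use the median instead.

```python
import numpy as np

def wh_iou(boxes, anchors):
    """IoU between (w, h) pairs, treating all boxes as centred at the origin."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]        # (N,)
    area_a = anchors[:, 0] * anchors[:, 1]    # (k,)
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k=5, iters=50, seed=0):
    """k-means over (w, h) pairs using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        nearest = wh_iou(boxes, anchors).argmax(axis=1)  # max IoU = min (1 - IoU)
        for j in range(k):
            members = boxes[nearest == j]
            if len(members):
                anchors[j] = members.mean(axis=0)
    return anchors

# Synthetic (w, h) pairs standing in for ground-truth box sizes.
boxes = np.random.default_rng(1).uniform(10.0, 300.0, size=(1000, 2))
print(kmeans_anchors(boxes))  # five (w, h) anchor priors
```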
The first difference between YOLOv3 and previous models is the use of multi-label classification rather than mutually exclusive labels. Instead of the mean squared error used in previous versions, binary cross-entropy is computed for each label as the classification loss: independent logistic classifiers estimate the likelihood of each class, rather than a softmax producing competing score probabilities. The second improvement concerns how bounding box predictions are assigned. YOLOv3 assigns an objectness score of 1 to the single bounding box anchor that best overlaps a ground truth object and ignores other anchors that overlap the ground truth by more than the implementation's chosen threshold of 0.7. As a result, YOLOv3 assigns exactly one bounding box anchor to each ground truth object. The third enhancement is feature-pyramid-style prediction across scales: YOLOv3 predicts boxes at three different scales and extracts features from each of those scales [37].
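The switch from softmax to independent logistic classifiers can be illustrated in a few lines of PyTorch; the four classes and logit values below are made up.

```python
import torch
import torch.nn.functional as F

# Class logits for one predicted box over four hypothetical classes,
# e.g. (car, truck, vehicle, person).
logits = torch.tensor([2.0, -1.0, 3.0, -2.0])

# Softmax (YOLOv2 and earlier): the scores compete and sum to 1, so the
# overlapping labels "car" and "vehicle" cannot both score highly.
print(F.softmax(logits, dim=0))

# Independent sigmoids (YOLOv3): each class is a separate yes/no decision,
# so "car" and "vehicle" can both be close to 1.
print(torch.sigmoid(logits))

# Per-class binary cross-entropy against a multi-label target.
target = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(F.binary_cross_entropy_with_logits(logits, target))
```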
More convolutional layers are added to the base feature extractor: using the Darknet-53 framework as the foundation of YOLOv3, the number of convolutional layers is increased to 53. Three bounding boxes are predicted at each scale, and multi-label class predictions are made for each bounding box. The network is a hybrid of YOLOv2 and Darknet-19 with 3 × 3 and 1 × 1 filters and shortcut connections. It provides better features earlier in the network, and predictions from later layers benefit from early computation [38]. Figure 2 shows the Darknet-53 network, and Table 1 compares the performance of different backbones. Bochkovskiy et al. [39] proposed the YOLOv4 algorithm with significant changes and improved accuracy over previous versions. YOLOv4 is a significant improvement over YOLOv3: by adopting a modern backbone architecture and neck improvements [40], mean average precision (mAP) and frames per second (FPS) increased by 10% and 12%, respectively, and the network can now be trained on a single GPU [39], [41]. It employs dense blocks to construct a deeper, more complex network and gain precision. Cross stage partial (CSP) connections divide a dense block's input feature maps into two parts: the first is routed directly to the next transition layer, while the second passes through the dense block. Because only one part traverses the dense block, the computing requirements are reduced. YOLOv4's feature extractor, CSPDarknet-53, is built by adding CSP connections to the Darknet-53 backbone of YOLOv3. In place of the FPN used in YOLOv3, a path aggregation network (PANet) is used.
YOLOv4 also contains more tools for constructing robust and reliable object detection models, and it allows anyone to train a fast and reliable object detector using a single 1080 Ti or 2080 Ti GPU. YOLOv4's training methodology features the so-called "bag of freebies", which improves accuracy without requiring additional hardware and increases the quality of the results while incurring no extra inference cost. The effects of cutting-edge bag-of-freebies and bag-of-specials object detection methods, such as cross-iteration batch normalisation (CBN), PANet, and the spatial attention module (SAM), were also examined during detector training; these methods were modified to improve their performance and applicability for single-GPU training [39].
There are further improvements for robust training on a single GPU. They include new data augmentation methods (CutMix, mosaic, and others), hyperparameter selection with genetic algorithms, and adjustments to existing methods to make the design best suited for detection training. For the YOLOv4 architecture, the CSPDarknet-53 backbone, an additional SPP module, a PANet neck, greedy non-maximum suppression (greedy NMS), and the anchor-based YOLOv3 head were chosen [39].
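Greedy NMS, the post-processing chosen here, keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds a threshold. torchvision's implementation illustrates the behaviour; the boxes and scores below are made up.

```python
import torch
from torchvision.ops import nms

# Three detections: two heavily overlapping boxes on the same vehicle
# and one separate box elsewhere in the image.
boxes = torch.tensor([[0.0, 0.0, 100.0, 100.0],
                      [5.0, 5.0, 105.0, 105.0],
                      [200.0, 200.0, 300.0, 300.0]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)  # greedy: keep best, drop IoU > 0.5
print(keep)  # tensor([0, 2]) - the duplicate box is suppressed
```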
ResNet, DenseNet, and VGG models are used as backbone feature extractors. They are pre-trained on classification datasets such as ImageNet and then fine-tuned on the detection dataset [42]. As such a network becomes deeper (gains more layers), it generates feature maps of increasingly high-level semantics, which are useful to the later sections of the object detection network [35], [43].
Between the backbone and the head sit additional layers, the neck. Neck networks include FPN, PANet, and Bi-FPN, to name a few; they are used to extract feature maps from various backbone levels. For example, YOLOv3 uses FPN to extract features from the backbone at various scales [39], [43]. The head is the network that predicts the bounding boxes (classification and regression). Anchor-based one-stage detectors, such as YOLOv4, SSD, and RetinaNet [39], [43], apply the head network to each anchor box. A single output may consist of four values representing the predicted bounding box (x, y, h, w) plus the probabilities of k classes + 1 (one extra for the background).
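For a YOLO-style anchor-based head specifically, the per-cell output comprises four box offsets, one objectness score, and k class scores for each anchor, as the short calculation below illustrates.

```python
# Output channels of a YOLO-style detection head: for each of the A anchors
# per grid cell, predict 4 box offsets (x, y, w, h), 1 objectness score,
# and k class scores.
def head_channels(num_anchors: int, num_classes: int) -> int:
    return num_anchors * (4 + 1 + num_classes)

# YOLOv3 predicts 3 anchors per cell at each of its 3 scales; with the
# 80 COCO classes this yields the familiar 255-channel output maps.
print(head_channels(num_anchors=3, num_classes=80))  # 255
```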
Bag of freebies (BoF) and bag of specials (BoS) are the two strategies utilised in YOLOv4 to improve the object detector's precision. BoF improves the detector's accuracy without increasing the inference cost; it only alters the training strategy or increases the training cost. BoF is used for data augmentation, improving the generalisability of the YOLOv4 model. Photometric distortions, such as adjusting brightness, saturation, contrast, or noise, are elements of BoF, as are geometric distortions of the image, such as rotating and cropping [39], [43].
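As an illustration of such BoF-style augmentations, torchvision's standard transforms can be composed as below; the parameter values are illustrative rather than those used by YOLOv4, and for detection training the geometric transforms would also have to be applied to the bounding boxes.

```python
from torchvision import transforms

# Photometric distortions (brightness, saturation, contrast) plus geometric
# ones (rotation, random crop). Applied at training time only, so the
# inference cost is unchanged.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=416, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # pil_image: a PIL.Image training sample
```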
BoS modestly raises the inference cost while significantly improving object detection accuracy. Examples of such modules and methods include attention mechanisms (squeeze-and-excitation and the spatial attention module), enlarging the model's receptive field, and improving feature integration [39], [43]. Figure 3 depicts the YOLOv4 architecture, and Figure 4 illustrates the YOLOv5 architecture. YOLOv5 uses CSPDarknet-53 as the backbone and PANet as the neck to boost the information flow; its head is the same as in YOLOv4 and YOLOv3.
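For reference, a pretrained YOLOv5 model can be loaded and run through PyTorch Hub in a few lines; the image path below is a placeholder, and the weights are downloaded on first use.

```python
import torch

# Load the small pretrained YOLOv5 model from the Ultralytics repository
# (requires an internet connection and the yolov5 dependencies).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on an image of a street scene (placeholder path).
results = model('street_scene.jpg')
results.print()  # per-class detection counts and inference speed
results.save()   # writes the annotated image under runs/detect/
```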

EXPERIMENTAL RESULTS AND ANALYSIS
The open images [44] dataset is one of the most extensive datasets of images annotated in many ways for training deep CNN computer vision models. As of version V6, it comprises 9M images with 36M annotated labels, 15.8M bounding boxes, 2.8M instance segmentations, and 391K visual relationships. Open images V6 represents critical qualitative and quantitative steps toward improving unified image recognition, object detection, visual relationship detection, and instance segmentation. Images were prepared in Roboflow [45], which organises, prepares, and augments training images and annotations to create higher-quality computer vision datasets, and which supports both object classification and detection models [45]. In this research, the YOLOv3, YOLOv4, and YOLOv5 models were trained with 5,000 images of the COCO dataset using the system's default settings. In addition, 30 images captured in several areas of Shah Alam and Kuala Lumpur (Figure 5) were used. These images were converted into joint photographic experts group (JPEG) format before being fed into YOLOv3, YOLOv4, and YOLOv5. All additional images were captured with a mobile phone camera under natural illumination: most under direct, bright sunlight, and some under bridges or highway overpasses, creating non-uniform illumination. While some vehicles are fully visible in the images, others are occluded. These variations were chosen to build a robust training dataset. Figure 6 displays three examples of the test images. Figure 6(a) shows two fully visible vehicles captured from a bird's eye view, with a motorcycle partially occluded by the red car. Figure 6(b), with four cars, was captured at night under low-intensity lighting. In Figure 6(c), there are four motorcycles and two cars; one motorcycle is partly hidden behind the blue car, and only a small part of another car, parked next to the blue car, is visible.

Table 2 shows a few examples of object detection and classification using YOLOv3, YOLOv4, and YOLOv5. For the first test image, image 1, YOLOv4 outperformed YOLOv3 in both classification and speed. YOLOv3 detected three objects and correctly classified only one as a car, whereas YOLOv4 detected five objects and correctly classified both cars. YOLOv4 was also more efficient, completing detection and classification in 33.42 ms compared with YOLOv3's 41.56 ms. YOLOv5 took the longest at 154 ms and detected six vehicles with four misclassifications. Furthermore, its confidence scores for the car classifications were 99% and 65%, compared with YOLOv4's 99% for both cars. For image 1, YOLOv4 thus outperformed YOLOv3 and YOLOv5 in detection, classification, and speed.
For image 2, YOLOv5 performed the worst of the three. It required 190 ms for detection and classification, compared with 41.53 ms for YOLOv3 and 54.72 ms for YOLOv4. YOLOv3 detected five cars; YOLOv4 detected eight objects, of which six were classified as cars and two as traffic lights; YOLOv5 detected only three objects, of which two were classified as cars and one as a potted plant. YOLOv3 thus performed better than YOLOv4, and at a faster rate, for image 2. It is also interesting that YOLOv4 classified two objects as traffic lights; upon closer inspection, we believe the two streetlights in image 2 were misclassified as traffic lights.
YOLOv4 clearly performed best on image 3. All six objects were detected, and all four motorcycles and two cars were classified correctly, with detection and classification completed in 54.72 ms. YOLOv3 also classified the four motorcycles and two cars successfully; however, it additionally produced a false positive, detecting and classifying a non-existent person, and took longer than YOLOv4 at 91.30 ms. YOLOv5 performed poorly, with the slowest speed of 180 ms.

DISCUSSION AND FUTURE WORK
This paper discusses vehicle detection and classification using DL techniques, showing preliminary results from three of them: YOLOv3, YOLOv4, and YOLOv5. Based on the experimental results, the three versions of YOLO performed differently depending on the size of the objects in the image and the illumination setting. In an earlier study on aerial images, Jeong et al. [18] found that YOLOv4 has better accuracy but slower speed than YOLOv3; the same conclusion cannot be drawn from our test images. While YOLOv3 performed faster under good illumination, its performance deteriorated under non-uniform lighting, as in images 2 and 3. YOLOv5 was shown to be slower in [21], and the same finding is reached in this research. YOLOv5 also produced unpredictable results, with misclassifications and a lower detection rate than YOLOv3 and YOLOv4, especially under non-uniform lighting and with occluded objects. Figure 6 also shows that YOLOv4 can detect more objects than YOLOv3 and YOLOv5. Future work will improve YOLOv5's efficiency and detection performance under occlusion and apply it to vehicle counting applications [47].

CONCLUSION
This study presents the results of vehicle detection and classification using DL approaches, showing preliminary findings from three DL techniques: YOLOv3, YOLOv4, and YOLOv5. Two separate datasets were used for training: COCO, trained with its default 80 classes, and open images, trained with only four classes. Experimental findings on thirty photographs of different vehicle types, captured in Kuala Lumpur and Shah Alam, Selangor, during the day and at night, show that YOLOv5 recognises vehicles quite successfully but at a slower speed than its predecessors. Recognising occluded vehicles with YOLOv5, however, still leaves room for improvement. Future work will focus on improving YOLOv5's efficiency and its detection performance under occlusion, and on applying it to vehicle counting applications.