Video saliency detection using modified high efficiency video coding and background modelling

ABSTRACT


INTRODUCTION
The human eye is a complex organ; the way it works with the brain to filter and analyse the necessary components in the image it sees has perplexed scientists for ages, and many have tried to replicate it using algorithms and computation. Modelling the process carried out in the brain, researchers have developed methods to pick out areas of interest from a given image, just as the human visual system does. With the inclusion of deep learning technology, there has been a significant rise in image saliency detection methods with remarkable accuracy when tested on large-scale static gaze datasets such as the SALICON dataset [1]. However, although a great deal of research has been done in the field of image saliency detection, it is quite challenging to produce the same quality of dynamic fixation prediction with moving images, or videos. Video saliency plays a great role in video compression, captioning, object segmentation, and so on. This has led to the classification of saliency into two models, namely salient object detection and human eye fixation prediction. The input is also of two types, giving static and dynamic saliency models: static models, as the name suggests, take images as their input, and likewise, dynamic models take video input.
The inspiration for this paper has stemmed from various research works in similar fields. The papers [2], [3] have had a significant impact on the world of saliency; both use the difference in features between surrounding and central patches to estimate visual saliency. Another line of research attempts to detect saliency by representing a combinational block, based on random walk models, of all neighbouring blocks [4]-[6]; among these is a unique technique named graph-based visual saliency, which includes the formation of activation maps on certain features followed by normalization and achieves an impressive receiver operating characteristic (ROC) value of about 98%. The authors of [7] take a similar approach, with the difference that they use random walk models on a graph to imitate eye movements: their first step is to extract intensity, colour, and compactness features and construct a fully connected graph, after which the proposed algorithm computes the stationary distribution of the Markov chain on the graph as a saliency map. Another method of saliency detection uses the spectral features of an image, exploiting features such as luminance and colour, which helps in reducing computational complexity and gives accurate results [8]-[10]. Other researchers have used a region-based approach to saliency detection, in which the input image is first segregated into different regions so that saliency levels can be applied to each of them: one method uses global contrast features with spatially weighted coherency, while [11]-[15] use robust background measures along with a principled optimization framework to integrate all low-level maps into a final clean and uniform saliency map. All the above algorithms are designed for still images, and they help in creating algorithms for video saliency detection; to modify them to accurately detect visual saliency in video, we need motion information to imitate the human eye's perception of movement. One approach uses a quaternion representation of image features such as colour, intensity, and motion, and employs the phase spectrum of the quaternion Fourier transform. Another methodology involves the discriminant center-surround hypothesis: it combines colour orientations with spatial and temporal saliency by taking the summation of the absolute differences between the temporal gradients of central and surrounding regions [16]. The authors of [17] make use of feature extraction from partially decoded data, using global and local spatiotemporal (GLST) features: the compressed video bitstream is partially decoded to obtain discrete cosine transform (DCT) coefficients and motion vectors, the GLST features are extracted, and the spatial and temporal maps are then generated and fused to produce the result. The work in [18] uses a random walk with restart methodology, computing the temporal saliency distribution from motion distinctiveness, abrupt change, and temporal consistency; this distribution is used as the restarting distribution, and the steady-state distribution gives the spatiotemporal saliency.
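To make the graph-based idea in [4]-[7] concrete, the following minimal numpy sketch treats patch intensities as node features, weights edges by dissimilarity, and reads the stationary distribution of the resulting Markov chain as saliency. It illustrates the cited concept only, not any specific published implementation.

```python
import numpy as np

def graph_saliency(features):
    """Toy graph-based saliency in the spirit of [4]-[7]: build a fully
    connected graph over patches, weight edges by feature dissimilarity,
    and use the stationary distribution of the resulting Markov chain
    as the saliency of each patch."""
    # Edge weight = feature dissimilarity between patches i and j.
    W = np.abs(features[:, None] - features[None, :])
    np.fill_diagonal(W, 0.0)
    # Row-normalize to obtain a Markov transition matrix.
    P = W / W.sum(axis=1, keepdims=True)
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return pi / pi.sum()

# A patch whose intensity differs from its neighbours attracts more of
# the chain's mass, i.e., it is flagged as salient.
patch_means = np.array([0.20, 0.22, 0.21, 0.90, 0.19])
print(graph_saliency(patch_means))  # the fourth patch stands out
```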
All these studies and experiments tell us that many state-of-the-art methods are available in the uncompressed domain. Since videos and images are generally transmitted in a compressed format, these conventional algorithms do not perform well in such situations; the only way for them to work effectively on the available data is to fully decode it, but this increases time consumption and the complexity of the code. There has been some research to solve this problem. The work in [19] tries to improve the DCT-domain transcoder, or deterministic discrete-time (DDT) transcoder, by proposing a fast extraction method for the partial low-frequency coefficients in the DCT-domain motion compensation operation (DCT-MC). Zhang et al. [20] redesign their method to exploit low-level compressed-domain features from the bitstream, using object recognition for fast saliency detection. Colour clustering and region merging based on spatiotemporal similarities, pixel edge extraction, and regional classification are employed in [21]. Similar video saliency detection methods appear in [22]-[25], which have explored several ways of improving the saliency area. One of them introduced the G-picture methodology, meaning that a reference, probably a second frame reference, is maintained to reduce the complexity of the high efficiency video coding (HEVC) algorithm; there is also the usage of a quantization parameter for quantizing the G-picture (ground), together with the employment of background reference prediction (BRP) and background difference prediction (BDP). Small coding blocks called coding units were also introduced to lower complexity and increase compression efficiency [26]-[30]. However, all these works were done at different times and in different areas of the field. We have tried to incorporate all these modifications to come up with a solution that not only reduces complexity but also provides input size flexibility with reduced time consumption and better compression precision, accuracy, and efficiency [31]-[33].
In this study, a modified version of the HEVC method is suggested that makes use of background modelling with a hierarchical prediction structure (HPS). It consists of two parts. The first is the modification of the reference frame used: making the G-picture the fourth reference frame rather than the second, as stated in other research papers, quantizing it with a relatively smaller-valued parameter, and adding coding blocks for less complexity. The division of each coding block into the F_G, B_G, and H_G classes is the second element; depending on the information contained in the G-picture, each of these classes is sped up in a unique way. To avoid further coding and calculation, another alteration is included in which the coding block partitioning is halted early.
There are a total of five sections in this paper. The introduction is covered in the first section, and the related works for this paper are surveyed in the second. The third section covers the mathematical and coding components of the suggested system, and the fourth section presents the outcomes of the tests done using the dynamic human fixation 1K (DHF1K) dataset. The paper is then concluded in the fifth section.

LITERATURE SURVEY
This section provides a quick overview of the numerous studies and tests that have aided in the development of our solution. We start with a better iteration of the HEVC method [34]-[37], in which the perceptual redundancy has been decreased for a higher compression value. With the use of a convolutional neural network, this suggested technique combines the motion estimation results from each block during the compression phase and employs adaptive dynamic fusion for the saliency map; the fundamental element of the algorithm is the application of a spatiotemporal model. Next is a survey that has assisted in grouping and selecting the appropriate database as well as the modification approach for our suggested solution: it provides an up-to-date overview of video compression research along with its milestones [38]-[40], covering conventional codec adaptation as well as learning-based end-to-end approaches with their advantages and disadvantages. Its conclusion identifies computational complexity as an issue that needs solving at the earliest available opportunity. Another survey examines the different saliency models available and the drawbacks that have led to insufficient accuracy and precision in compression [41].
The researchers have provided insight into the different ways the various saliency models can try to mimic the actual process of the human eye and brain; this has helped in making the right modifications to our algorithm with practical comparisons. The main dataset used for the proposed solution's experiments is also the dataset of the base reference used for comparison of our results [27], [42]. The dynamic human fixation 1K dataset, often known as DHF1K, is built for predicting fixations when viewing dynamic scenes, with 1,000 high-definition, diverse video clips viewed by 17 observers wearing eye trackers. The attentive convolutional neural network (ACLNet)-long short-term memory (LSTM) network, a cutting-edge video saliency approach, was also proposed there; it contrasted its findings with those of other techniques on various datasets, including Hollywood-2 and University of Central Florida (UCF) sports, and was one of the quickest approaches up to that point. The work in [43] has given us knowledge on hyper saliency: convolutional neural networks are trained using manual algorithmic annotations of smooth pursuits, and the findings are developed with the aid of 26 dynamic saliency models that are freely available online. Another study has aided in algorithm development: for prediction in dynamic scenarios, the authors devised a brand-new 3-dimensional (3D) convolutional encoder-decoder architecture [41]. The encoder has two subnetworks that separate the spatial and temporal components of each frame and then fuse them; the decoder then aggregates temporal data and enlarges the features in the spatial dimensions. It was trained end-to-end and tested on the DHF1K dataset. Finally, there is another survey of the various video saliency methods available today that employ deep learning and have tried to reach the human level of eye tracking movements and feature detection [44].
The work in [45] provides a no-reference bitstream human vision system (NRHVS) based video quality assessment (VQA). The saliency maps are generated by extracting features from the HEVC bitstream, and a visual memory model is then created using saliency map statistics; a support vector regression pipeline helps in learning the approximate video quality. DeepVS2.0 is a video saliency prediction approach based on deep neural networks [39]; it has aided in comparing our outcomes and evaluating how we did against other cutting-edge techniques. In order to create the intra-frame saliency map, it presents an object-to-motion convolutional neural network (OM-CNN) that learns spatiotemporal properties; then, using the OM-CNN extracted features, a convolutional LSTM network is created to enable inter-frame saliency. Our baseline reference is [46]: for different levels of the 3D convolutional backbone for video saliency mapping, it uses its spatio-temporal self-attention network (STSANet) model [47]-[50]. In order to integrate multiple levels with context in the semantic and spatiotemporal subspaces, attentional multi-scale fusion (AMSF) is used.

PROPOSED SYSTEM

Optimizing the low-delay hierarchical prediction structure efficiently
In this part, we briefly discuss the constituents of the low-delay HPS of the HEVC test model. It has two components: one is called hierarchical quantization (HQ), which uses the data of the last frame and other prioritized frames from the last three short groups of frames, and the other is called hierarchical reference (HR), where the quantization parameter of each important frame is two less than that of its next image, while the quantization parameter of the middle image in a short group of frames is one more than the important frame's value. To optimize this structure, we replace the fourth reference frame with the G-picture (generated using a generally running, less complex algorithm), which remains the long-term reference. For this, we use Lagrangian rate-distortion optimization, which helps in evaluating the rate-distortion (RD) cost C, where Q denotes the quality of the reconstructed video with respect to the original, η denotes the number of bits, and μ denotes the Lagrange multiplier. There will be m input frames; let C(I_i, p) represent the rate-distortion cost of encoding the i-th picture I_i, where p represents the coding units' quantization parameter in the cost function.

C = Q + μη (1)

where U_{i,r,p} represents the motion vectors and Q_{i,r,p} is the prediction data quantized with p. With a smaller p′, I_j provides a better reference for the following a images (I_{j+1} ~ I_{j+a}). Assume that I_{j+1} has n_{j+1} coding blocks, with indexes e(j+1, 1) ~ e(j+1, n_{j+1}), that obtain a better reference from I_j, while its other q_{j+1} coding blocks, with indexes t(j+1, 1) ~ t(j+1, q_{j+1}), cannot do so. The situation is similar for I_{j+2} ~ I_{j+a}: the n_{j+s} coding blocks indexed by e(j+s, 1) ~ e(j+s, n_{j+s}) obtain better prediction than the coding blocks indexed by t(j+s, 1) ~ t(j+s, q_{j+s}). The new costing equation comes out as shown in (3).
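For illustration, the minimal Python sketch below shows how the RD cost in (1) drives a coding decision; the distortion and bit figures are hypothetical values, not measurements from this work.

```python
def rd_cost(distortion, bits, mu):
    """Lagrangian rate-distortion cost C = Q + mu * eta, as in (1):
    distortion plays the role of Q, bits the role of eta."""
    return distortion + mu * bits

def best_mode(candidates, mu):
    """Pick the coding choice with minimal RD cost. `candidates`
    maps a mode name to its (distortion, bits) pair."""
    return min(candidates, key=lambda m: rd_cost(*candidates[m], mu))

# Hypothetical measurements for one coding unit: splitting lowers
# distortion but spends more bits, so the multiplier decides.
modes = {"2Nx2N": (120.0, 40), "NxN split": (90.0, 95)}
print(best_mode(modes, mu=0.6))  # 2Nx2N: cost 144.0 vs split: 147.0
```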
As can be deciphered from (4), T_1 gives the rate-distortion cost before I_j is used, T_2 is the rate-distortion cost after using p′ for encoding, T_3 is the cost for the coding blocks of I_{j+1} ~ I_{j+a}, T_4 is the combined rate-distortion cost of all the coding blocks for the modified I_j, and T_5 is the cost for I_{j+a}.
In T_4, the term Q_{i,e(i,l),p′} has a smaller quantization loss than the term Q_{i,e(i,l),p} (i = j+1 ~ j+a, l = 1 ~ n_i) in Y; because of this, the inequality shown in (8) is satisfied.
Thus, the conclusion can be stated as: for a large a, if the rate-distortion cost C′ satisfies Y − T_4 > T_2 − X, then C − C′ > 0. Several conclusions can be drawn from the analysis of the equations above. If we frequently choose an image as a reference for the next batch of pictures, then its quantization parameter must be selected so that its value is relatively smaller. Extending this conclusion, we can say that the G-picture, as it is taken as a long-term reference, must be quantized at a value smaller than the ordinary quantization parameter.
For the above conclusion to work well, a large number of pixels must be available that share the same features as the G-picture. Such groups of pictures can be collectively put into similar-background batches. Other groups of images are put under the general-background batch, and they work without the G-picture, as it does not hold any significant advantage for them. A similar-background batch needs to be reworked for better bit encoding, quality preservation, and compression. For this, we can have two quantization methodologies: one is to use the same value as the one used for the general-background batch (p), and the other is to use a different quantization parameter value (p′), which is not the same as in the first case and helps achieve better rate-distortion values. The general-background batch's valuable frames are denoted by G_B, and we must follow the analogy of (7). This means that, using the G-picture as a long-term reference, we can quantize any batch of frames with a larger-valued quantization parameter than the one used for the adjacent frames in the batch.
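A small sketch of this quantization policy follows; the numeric offset is purely illustrative, since the exact values are fixed by (7) and the later equations (14)-(16).

```python
def assign_qp(frame_kind, base_qp, g_picture_offset=4):
    """Quantization-parameter choice sketched from this section: the
    G-picture, as a long-term reference, gets a smaller QP (finer
    quantization), while frames in a similar-background batch can
    tolerate a larger QP because the G-picture already carries their
    background. The offset of 4 is an assumption for illustration."""
    if frame_kind == "g_picture":
        return base_qp - g_picture_offset
    if frame_kind == "similar_background_batch":
        return base_qp + 1  # coarser than the adjacent frames
    return base_qp          # general-background batch keeps the base QP

for kind in ("g_picture", "similar_background_batch",
             "general_background_batch"):
    print(kind, assign_qp(kind, base_qp=32))
```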

Speeding up the algorithm
The foreground units contain the basic coding blocks, which are 4 × 4 units in size, and each input coding block is classified based on the number of basic blocks present in the foreground. Taking K(f) as the input type for basic coding block f, and letting g_{i,j} be a pixel value of basic unit f while the corresponding G-picture value is G^B_{i,j}, the classification is as shown in (9).
Here, x is a predefined threshold valued at 80. Then, considering the basic blocks in a group of coding blocks (denoted by o), the class of each coding block is determined from the proportions of its foreground blocks (F_G), its background blocks (B_G), and its hybrid blocks (H_G). The size taken here is 2N × 2N, computed through (10).
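As an illustration of this classification, a minimal Python sketch follows; the SAD-based comparison and the all-or-nothing proportions are assumptions, since (9) and (10) are not reproduced in full here.

```python
import numpy as np

X = 80  # predefined threshold x from (9)

def classify_basic_unit(block, g_block):
    """Label a 4x4 basic unit by comparing it with the co-located
    G-picture block, in the spirit of (9): a large absolute
    difference marks foreground. The SAD form is an assumption."""
    sad = np.abs(block.astype(int) - g_block.astype(int)).sum()
    return "foreground" if sad > X else "background"

def classify_coding_block(cu, g_cu, unit=4):
    """Classify a 2N x 2N coding block as F_G, B_G, or H_G from the
    proportion of foreground basic units it contains, per (10); the
    all-or-nothing cut-offs here are illustrative."""
    labels = [
        classify_basic_unit(cu[r:r + unit, c:c + unit],
                            g_cu[r:r + unit, c:c + unit])
        for r in range(0, cu.shape[0], unit)
        for c in range(0, cu.shape[1], unit)
    ]
    fg = labels.count("foreground") / len(labels)
    if fg == 1.0:
        return "F_G"   # pure foreground
    if fg == 0.0:
        return "B_G"   # pure background
    return "H_G"       # hybrid
```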
In the traditional HEVC encoder, the encoding choice is made between a 2N × 2N coding block and four recursively-coded parts. To avoid this ambiguity and reduce the time consumed in calculating and comparing the rate-distortion costs, partition termination methods are needed in the HEVC test model. For this, a background that is static for a long time is used. Each input is considered a potential coding block and is segregated into the respective classes as in (10). B_G blocks with N > 8 will occupy a larger proportion than the other two classes, and these need early termination. So, whenever there are 16 × 16 (N = 8) coding blocks, the B_G block will be a pure version of the coding block and will not undergo further partition. There is also the issue of prediction pixels for coding blocks: for better accuracy, it has been decided that only 2N × 2N coding blocks must be used for B_G with N > 8, while the rest use them for N ≥ 8, and H_G blocks have no asymmetric motion partitions. In addition, the motion search range is set to 1 pixel for B_G blocks and left unchanged for H_G and F_G blocks.
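The speed-up rules above can be gathered into a small decision helper; this is a paraphrase of the text for illustration, not encoder code.

```python
def speed_up_rules(block_class, n):
    """Summarize the speed-up decisions described in this section for
    a coding block of class F_G / B_G / H_G and half-size N. All
    conditions are read directly from the prose above."""
    return {
        # 16x16 (N = 8) pure-background blocks stop partitioning early.
        "stop_partition_early": block_class == "B_G" and n == 8,
        # Only the 2Nx2N prediction is evaluated for B_G with N > 8.
        "only_2Nx2N_prediction": block_class == "B_G" and n > 8,
        # Motion search shrinks to 1 pixel for B_G; None = encoder default.
        "motion_search_range": 1 if block_class == "B_G" else None,
        # H_G blocks skip asymmetric motion partitions.
        "allow_asymmetric_partitions": block_class != "H_G",
    }

print(speed_up_rules("B_G", 8))
```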

Modelling the background and selection
We need to calculate the average of all background frames in a running fashion. J denotes the current frame in training, and M is the matrix of unsigned *-bit integers that represents the average result. Then M′, the updated average value, is given by (11).
The number of training frames, m, is indicated here. Only one multiply, one shift, one floor, one divide, and three extra operations are performed during this process. The first image, if it is large enough, will be spotted by the algorithm and can be thought of as a large group of frames for minimal time delay in the coding stage. Assume that this batch of frames has an HPS of size L, and that R(X, Y) = 1 or 0 indicates that X and Y have largely different or similar data proportions. Then O(J_m) of any input picture J_m, where m is denoted as l × L + i (0 ≤ i ≤ L − 1) and i = 0 represents the initial image, is determined by (12). For R(X, Y), a 1-pixel range is searched in Y for the basic units, collecting the matched set A(X, Y) as given in (13); Algorithm 1 describes this background modelling, in which each match updates A(X, Y) = A(X, Y) ∪ {(q, p)}. In addition, if we take the starting intra image in the hierarchical prediction algorithm and quantize it for each similar-background-patch, then the quantization value comes out as shown in (14).
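As a concrete reading of the running average in (11), the sketch below uses the standard incremental-mean form in integer arithmetic; this exact form is an assumption, because (11) itself is not reproduced in the extracted text.

```python
import numpy as np

def update_background(M, J, m):
    """Running-average update in the spirit of (11): M holds the mean
    of the first m-1 training frames and is refreshed with frame J.
    Integer division plays the role of the floor operation."""
    M32 = M.astype(np.int32)
    return ((M32 * (m - 1) + J.astype(np.int32)) // m).astype(np.uint8)

# Feed a short training sequence frame by frame:
frames = [np.full((4, 4), v, dtype=np.uint8) for v in (10, 20, 30)]
M = frames[0]
for m, J in enumerate(frames[1:], start=2):
    M = update_background(M, J, m)
print(M[0, 0])  # integer mean of 10, 20, 30 -> 20
```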
Next, to effectively calculate the quantization parameters for each general-background-patch frame, we can follow (15).
Finally, the G-picture must be quantized at a smaller value than the surrounding frames, as shown in (16).

EXPERIMENTS AND RESULTS
The entire work has been compared with Wang et al. [26] and uses the various saliency detection methods listed there for our evaluation. To maintain uniformity in comparison, we have used the same datasets as mentioned by Wang et al. [26]; this helps in evaluating our performance and accuracy against other state-of-the-art methods. The collection, referred to as DHF1K [27], has around 1,000 video clips with a frame rate of 30 and a resolution of 640 × 360, split into 600 training videos, 300 testing videos, and 100 validation videos. The data is collected from 17 observers using an eye tracker.

Results
The comparison among all the mentioned state-of-the-art methods is given in Table 1. As can be discerned from Table 1, the evaluation metrics for the proposed solution have outperformed almost all state-of-the-art methods; SalEMA [37] has done best in the SIM metric, while ViNet [31] has done best in sAUC. In the remaining metrics, the performance has been quite good, and the Kullback-Leibler divergence values of the base reference STSANet [26] and the proposed system are 1.344 and 1.297 respectively, which tells us that the dissimilarity for our proposed solution is lower, outperforming the baseline once again.

Table 1. Comparison of the evaluation metric values on DHF1K for all the state-of-the-art methods along with our proposed system

Method             CC     NSS    SIM    AUC    sAUC
TSFP-Net [29]      0.517  2.966  0.392  0.912  0.723
HD2S [30]          0.503  2.812  0.406  0.908  0.700
ViNet [31]         0.511  2.872  0.381  0.908  0.729
DeepVS [32]        0.344  1.911  0.256  0.856  0.583
Chen et al. [33]   0.476  2.685  0.353  0.900  0.680
ACLNet [27]        0.434  2.354  0.315  0.890  0.601
STRANet [34]       0.458  2.558  0.355  0.895  0.663
TASED-Net [35]     0.470  2.667  0.361  0.895  0.712
SalSAC [36]        0.479  2.673  0.357  0.896  0.697
SalEMA [37]        0.449  2.574  0.466  0.890  0.667
UNISAL [38]        0.490  2.776  0.390  0.901  0.691
STSANet [26]       0.529  3.010  0.383  0.913  0.723
Proposed system    0.547  3.109  0.407  0.933  0.701

The accuracy of the suggested solution is significantly superior to that of the other evaluated methods, as shown in Figures 1 and 2, because it is much more in line with the ground truth; this indicates that the proposed solution is the most accurate and precise of the alternatives in these terms. Figure 1 shows the comparison of all existing models with the proposed system, and Figure 2 shows the comparison of the ground truths with the proposed system and other state-of-the-art methods.
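For reference, the metrics in Table 1 follow their standard definitions in the saliency literature; minimal numpy versions of three of them (CC, NSS, and SIM) are sketched below. These are generic formulations, not code from the compared methods, and the AUC variants are omitted since they require ROC sweeps over fixation points.

```python
import numpy as np

def cc(pred, gt):
    """Linear correlation coefficient between two saliency maps."""
    return np.corrcoef(pred.ravel(), gt.ravel())[0, 1]

def nss(pred, fixations):
    """Normalized scanpath saliency: mean of the standardized
    prediction at the ground-truth fixation locations (a binary
    map of the same shape as `pred`)."""
    z = (pred - pred.mean()) / pred.std()
    return z[fixations.astype(bool)].mean()

def sim(pred, gt):
    """Similarity: histogram intersection of the two maps after
    each is normalized to sum to one."""
    p, g = pred / pred.sum(), gt / gt.sum()
    return np.minimum(p, g).sum()
```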

CONCLUSION
In this paper, a modified HEVC technique with spatiotemporal saliency encoding and background modelling was offered as a potential solution. The use of the G-picture in the fourth frame as a long-term reference is one of the two strategies that make this solution work; the other is the use of coding block classification for background segregation, allowing each frame to be quantized accordingly, along with quantization of the G-picture itself. This has led to a reduction in time consumption and coding complexity, along with an increase in efficiency and accuracy when the video is compressed. Even though the results display a good increase in almost every evaluation metric chosen for this paper, there is still room for improvement. We hope that this solution will act as a stepping-stone for other researchers to build future solutions that bring video saliency detection closer to the level of human eye and brain coordination.

Algorithm 1. Background modelling
Input: frame J_m of size h × w, where m = l × L + i
Output: O(J_m), the type of the frame (general-background-patch or similar-background-patch)
if O(J_m) == O(J_{m−i}) and i ≠ 0 then
    return
X = J_m; Y = G_B; A(X, Y) = ∅
for q = 1 to h
    for p = 1 to w
        if the basic unit of X at (q, p) matches a unit within the 1-pixel search range in Y then
            A(X, Y) = A(X, Y) ∪ {(q, p)}
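A possible Python reading of Algorithm 1 is sketched below; the SAD matching criterion, the threshold, and the area ratio are assumptions, since (12) and (13) are not reproduced in the extracted text.

```python
import numpy as np

def frame_type(J, G_B, unit=4, thresh=80, ratio=0.5):
    """Classify frame X = J against Y = G_B as in Algorithm 1: every
    4x4 basic unit of X is searched within a 1-pixel range in Y, and
    hits are recorded in A(X, Y). A frame whose matched area is large
    enough is a similar-background-patch. `thresh` and `ratio` are
    illustrative stand-ins for the criteria in (12)-(13)."""
    h, w = J.shape
    A = set()
    rows = range(0, h - unit + 1, unit)
    cols = range(0, w - unit + 1, unit)
    for q in rows:
        for p in cols:
            block = J[q:q + unit, p:p + unit].astype(int)
            # 1-pixel search range around the co-located unit in G_B.
            for dq in (-1, 0, 1):
                for dp in (-1, 0, 1):
                    r, c = q + dq, p + dp
                    if 0 <= r <= h - unit and 0 <= c <= w - unit:
                        ref = G_B[r:r + unit, c:c + unit].astype(int)
                        if np.abs(block - ref).sum() < thresh:
                            A.add((q, p))
                            break
                else:
                    continue
                break  # stop searching once this unit has matched
    total_units = len(rows) * len(cols)
    return ("similar-background-patch"
            if len(A) / total_units >= ratio
            else "general-background-patch")
```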