An efficient novel dual deep network architecture for video forgery detection

ABSTRACT


INTRODUCTION
The rapid growth and wide availability of image- and video-processing software such as Photoshop, Adobe Premiere, and Final Cut Pro make it possible to tamper with an original video without leaving obvious traces. Malicious tampering of videos can cause serious legal and social problems. For example, tampered videos and images may be presented as evidence in court, or may distort the truth in news reports. As multimedia content grows, detecting tampered video by human inspection becomes tedious, and because video manipulation is now common and video data is being produced and upgraded rapidly, researchers have recently concentrated their efforts on video forensics [1]. Typical alterations include frame deletion, frame insertion, and video compression. Digital forensic techniques are broadly divided into active and passive approaches; however, most passive forensic methods are designed for analyzing still images [2]. Recently, research has focused on video forensics, because video tampering becomes easier every day. Among the possible manipulations, copy-move forgery (CMF) is used to hide particular objects within the same video. Because the copied content is taken from the same video sequence, the forgery is convenient to perform and difficult to distinguish [3]. Video copy-move forgeries are classified into two categories, regional forgery and frame cloning, according to the operational domain. Similar to image copy-move forgery, which is a more mature research area, regional copy-move alters specific portions of individual frames.
Table 1 shows an original image and a CMF image. Video copy-move forgery (VCMF) produces homogeneous content and intricate modifications without distinct forgery traces. VCMF is divided into three types: inter-frame, intra-frame, and hybrid (inter/intra-frame). Video intra-frame forgery is comparable to image CMF: copied items are pasted within the same frame. Inter-frame forgery copies the contents of objects and pastes them into other frames of the same video. In terms of the modification, forged items are further divided into additive and occlusive classes. In additive forgery, the targeted items are added to the scene; in occlusive forgery, background information is copied to cover the target material. In both cases, the pasted content is taken from clips of the same video [4].
Int J Reconfigurable & Embedded Syst ISSN: 2089-4864  An efficient novel dual deep network architecture for video forgery detection … (Chandrakala) 459
VCMF can be classified into two categories: frame cloning and regional forgery. Similar to image CMF, a more mature field, regional CMF modifies specific portions of a frame. In frame CMF, frames are cloned and pasted at subsequent positions in the same video; such forgeries are imperceptible and difficult to detect, because the copied frames introduce no noticeable change in colour, shooting parameters, or illumination conditions [5]. However, duplication leads to an anomaly in the distribution of these parameters, namely a strong correlation between the original and duplicated frames. Methods designed to detect frame CMF fall into two groups: image-based and video-based. Image-based algorithms extract features from each frame to detect correlation, including grey values, image texture, noise features, and colour modes. Video-based techniques identify duplicated clips through their distinctive motion features, possibly combined with coding features, although the copy-move operation complicates such cues [6]. Figure 1 illustrates CMF.
Video tampering is increasing every day, and as tampered clips have come to light, public trust in digital video content has eroded. The main aim of video tampering detection is to authenticate a video against potential modifications and forgeries, i.e., to check whether a given clip has been tampered with. Localization identifies the forged area within a frame and, across adjacent frames, the positions of frame insertion, replacement, reordering, and deletion in a tampered video. Various approaches have been proposed to authenticate and localize tampering in images [5], [6]. However, these techniques cannot be applied directly to videos for the following reasons: i) because of the enormous amount of data, videos are compressed for storage and transmission when they are encoded into frames; ii) applying image techniques to video sequences generates a huge computational cost; and iii) temporal tampering in a video, such as frame insertion, deletion, duplication, and shuffling, cannot be detected by any image forgery detection mechanism.
The literature describes various techniques for detecting and localizing video tampering. VCMF requires a dedicated detection mechanism because its complicated modifications are of two types, inter-frame and intra-frame: intra-frame forgery pastes each copied object into the same frame, whereas inter-frame forgery copies an object from one frame and pastes it into subsequent frames. VCMF may mislead viewers by adding objects to frames, termed additive modification, or by hiding objects, termed occlusive modification. Carefully constructed inter/intra-frame forgeries are difficult to detect with existing machine learning techniques that rely on statistical measures: because the copied objects and the background into which they are pasted are shot by the same camera (for example, a surveillance camera), they exhibit similar statistical properties and are hence indistinguishable [7]-[10]. CMF therefore appears to be the most challenging problem in video forensics, and detection algorithms have consequently shifted towards video copy-move forgeries, which strongly influences current methodologies [11].
Pixel-correlation-based approaches generally suffer from high computational complexity, because videos contain far more data than still images. Techniques based on image features perform unstably under additive noise, secondary compression, and post-processing, all of which perturb texture, noise, and pixel grey values. Existing approaches also depend on sensitive parameter settings, which limits their robustness. Some techniques are restricted to a specific video format, specific tampered frames, or specific ways of tampering, which restricts their applicability in video forensics. A practical CMF detection mechanism therefore requires three basic properties: low computational complexity, high accuracy, and robust applicability. Video copy-move forgery detection (VCMFD) is a challenging task because of obstacles including the volume of video information, homogeneous forgery sources, rich forgery objects, and diverse types of forgery; these issues cause high false-positive rates in forged-video detection and a poor trade-off between efficiency and effectiveness. Motivated by these challenges, this research adopts deep learning to address them; the research contributions are as follows:
i) This research proposes a dual deep network (DDN) for efficient and effective video forgery detection. DDN comprises two networks: the first detection network (DetNet1) performs general feature extraction, whereas the second detection network (DetNet2) performs custom, deep feature extraction.
ii) DetNet1 and DetNet2 form an integrated model, as the output of DetNet1 is fed to DetNet2.
iii) The research also develops algorithms for frame detection, frame matching, and false-detection reduction.
iv) DDN is evaluated on the REWIND and video tampering dataset (VTD) datasets using accuracy, precision, recall, and F1-score, and is compared with various existing models.

RELATED WORK
Existing research on CMF detection examines an unintended consequence of the copy-move process, namely the feature correlation between the duplicated frames and the original frames, which arises through frame replacement or frame insertion. This section reviews existing VCMFD methods. Effective VCMFD techniques such as dense moment feature index and best match (DFMI-BM) [12], exponential Fourier moments (EFMs) [13], PatchMatch-2D (PM-2D) [14], and PM-2D (fast) [14] are carefully designed and share common concepts. The critical factor in their effectiveness is extracting robust features that are invariant to the geometric and post-processing operations applied to forged objects. In recent years, VCMFD methods have therefore extracted invariant moments from image blocks.
These invariant moments (such as the polar complex exponential transform (PCET) [15] for DFMI-BM, Zernike moments [16] for PM-2D and PM-2D (fast), and EFMs) are invariant to rotation and mirroring but not to scaling. Consequently, these methods cannot handle scaled forgeries, for example with scaling factors ranging from 50% to 150%. Different algorithms are used to match features: DFMI-BM proposes a batch matching algorithm, PM-2D uses PatchMatch, and EFMs use a fast matching algorithm that searches for potential matching block pairs. Filtering and morphological operations are used for post-processing. In summary, these VCMFD methods cannot resist scaling attacks, and their block-based matching computes block features for every pair, which is inefficient. In contrast, dense neural networks (DNNs) have been studied in depth and applied successfully to pattern classification and recognition. However, early DNN models such as DenseNet [17] are not fully effective for forgery detection because of the variety of forgery types and complex background content. A few image CMFD approaches, namely end-to-end Dense-InceptionNet (E-DIN) [18], a serial CMFD approach [19], and the dual-order attentive generative adversarial network (DOA-GAN) [20], enhance the detection capability of DNNs. These three methods use DenseNet, InceptionNet, VGG16, and VGG19 networks for feature extraction.
The main component of these models is an image CMFD feature-matching step embedded in the network, which acts as a hand-crafted procedure. E-DIN filters feature-match correspondences with a second nearest-neighbour (2NN) test to determine the best matches. Liu et al. [21] designed a two-stage framework specifically for copy-move forgery detection. The first stage is a self deep matching network that incorporates atrous convolution and skip matching to combine hierarchical features spatially, together with a self-correlation-based spatial attention mechanism that highlights similar regions. The second stage, a SuperGlue-based proposal stage, discards false-alarm regions and completes incompletely detected regions. Furthermore, [22] proposes an accurate convolutional neural network (CNN) architecture for the efficient detection of copy-move image tampering, in which the appropriate number of convolutional and pooling layers is determined computationally. According to Zhong and Pun [18], the end-to-end Dense-InceptionNet is a DNN with multidimensional dense-feature connections; it is the first DNN model that automatically localizes forgery snippets by feature matching, followed by hierarchical post-processing. Its PFE modules extract multi-dimensional and multi-scale features, and the features of each layer are extracted in order.
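A 2NN test of the kind mentioned for E-DIN accepts a match only when the nearest neighbour is much closer than the second-nearest one. The sketch below applies this idea to self-matching of block features within one image, as copy-move detection requires. It is illustrative only: the Euclidean distance, the 0.6 ratio, and the function names are assumptions, not E-DIN's exact settings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(features, ratio=0.6):
    """Self-matching with a 2NN (nearest / second-nearest) ratio test.

    Each feature is compared with every *other* feature of the same image,
    because in copy-move forgery the source and the copy live in one image.
    A match (i, j) is kept only if the nearest neighbour j is clearly closer
    than the second-nearest neighbour. The ratio value is hypothetical.
    """
    matches = []
    for i, f in enumerate(features):
        # Distances to all other features, sorted from nearest to farthest.
        dists = sorted(
            (euclidean(f, g), j) for j, g in enumerate(features) if j != i
        )
        if len(dists) < 2:
            continue
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d2 > 0 and d1 / d2 < ratio:
            matches.append((i, j1))
    return matches


# Blocks 0 and 2 are near-duplicates; blocks 1 and 3 are distinct.
features = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [5.0, 0.0]]
print(ratio_test_matches(features))  # [(0, 2), (2, 0)]
```

The test is symmetric here, so a duplicated pair is reported in both directions; a real pipeline would typically deduplicate such pairs before post-processing.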

PROPOSED METHOD
A video is considered forged if its content has been manipulated so that a viewer's judgment can be challenged and influenced. Forged video can mislead the public and is difficult to identify, especially in the case of copy-move forgery; thus, VCMFD has become a vital research area, and deep learning is widely used because it extracts deeper features than traditional approaches. This research adopts deep learning for forgery detection, and the main goal of the proposed model is CMF detection, i.e., distinguishing the original area from the tampered area in a digital video. This research introduces DDN for VCMFD; the DDN model detects both the tampered area and the original area. DDN comprises two detection networks: DetNet1, which is responsible for general feature extraction, and DetNet2, which is used for deep feature extraction. The proposed workflow is presented in Figure 2.

Efficient frame formation
Given a video of $S$ frames, the first phase extracts the individual frames and computes the optical flow between each pair of adjacent frames $x$ and $x+1$ ($x = 1, 2, \ldots, S-1$). The flow is represented by two matrices: $oa_x$ in the horizontal (x) direction and $oa_y$ in the vertical (y) direction. Their elements are summed, producing a sequence of $S-1$ total-flow values. This sequence indicates whether the $x$-th frame is tampered or original: a tampered area produces a sudden spike, together with local symmetry, in the sequence. The local mean over adjacent frames is computed by (1), where $S$ is the size used to find the adjacent frames. The shift $\alpha_x$, which measures the change at the $x$-th frame, is given by (2). If $\alpha_x$ exceeds threshold_A, a spike occurs in the total $oa_x$, and the adjacent $(x-1)$-th and $(x+1)$-th frames are examined to find the tampered area. The $x$-th frame is then identified as the symmetric centre of the total $oa_x$ sequence, which indicates CMF at that frame, provided the total $oa_x$ values before and after the symmetric centre agree. Algorithm 1 summarizes this detection process, which marks the $x$-th frame as a probable tampered frame.
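Equations (1) and (2) are not reproduced in this text, so the sketch below assumes a simple form of the spike test: the total flow of each frame pair is the sum of absolute values of $oa_x$ and $oa_y$, the local mean is taken over neighbouring pairs within a hypothetical half-window, and the shift $\alpha_x$ is the ratio of the total to that mean. A pair is flagged when $\alpha_x$ exceeds threshold_A. Computing the optical flow itself and the symmetric-centre check are outside the sketch, and all function names are illustrative.

```python
def total_flow(oa_x, oa_y):
    """Hypothetical total flow of one frame pair: sum of absolute horizontal
    (oa_x) and vertical (oa_y) optical-flow components."""
    return (sum(abs(v) for row in oa_x for v in row)
            + sum(abs(v) for row in oa_y for v in row))


def local_mean(totals, x, half_window):
    """Mean of the neighbouring totals around index x (x itself excluded)."""
    lo, hi = max(0, x - half_window), min(len(totals), x + half_window + 1)
    neighbours = [totals[i] for i in range(lo, hi) if i != x]
    if not neighbours:
        return 0.0
    return sum(neighbours) / len(neighbours)


def find_spikes(totals, threshold_a, half_window=2):
    """Indices whose relative shift alpha = total / local mean exceeds
    threshold_a (an assumed form of the shift alpha_x)."""
    spikes = []
    for x, value in enumerate(totals):
        mean = local_mean(totals, x, half_window)
        alpha = value / mean if mean > 0 else 0.0
        if alpha > threshold_a:
            spikes.append(x)
    return spikes


# S - 1 = 7 frame-pair totals with one abrupt jump at index 3.
totals = [1.0, 1.1, 0.9, 5.0, 1.0, 1.05, 0.95]
print(find_spikes(totals, threshold_a=2.0))  # [3]
```

Using a ratio rather than a difference makes the test insensitive to the overall amount of motion in a clip; whether the paper's (2) does the same is not stated.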

DetNet1
Pooling layers reduce the spatial resolution of the network's feature maps. To retain detail, DetNet1 extracts features from high-resolution feature maps, as shown in Figure 3. The CMF detection mechanism then separates the original area from the tampered area.
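Dilated convolution, named in the next subsection heading, is a standard way to enlarge the receptive field without pooling and therefore without losing resolution. The 1-D sketch below with "same" zero-padding illustrates the idea; the kernel size, dilation rates, and 2-D layout used in DetNet1 are not specified in the text, so these values and the function name are assumptions.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Same'-padded 1-D dilated convolution (cross-correlation, as in CNNs).

    The kernel taps are spaced `dilation` samples apart, so the receptive
    field grows to dilation * (len(kernel) - 1) + 1 while the output keeps the
    input length: no pooling, hence no loss of resolution.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("use an odd kernel size for symmetric padding")
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + [float(v) for v in signal] + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


impulse = [0, 0, 0, 1, 0, 0, 0]
print(dilated_conv1d(impulse, [1, 1, 1], dilation=1))  # [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(dilated_conv1d(impulse, [1, 1, 1], dilation=2))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

With the same three weights, dilation 2 spreads the response over five samples instead of three, which is how stacked dilated layers gather wider context at full resolution.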

Global feature extraction through dilated convolution
Applying the self-attention mechanism embeds a broad spectrum of contextual information in the features, which enhances the network's feature representation. The attention map $AM$ is computed as in (4), where $AM_{xy}$ measures the impact of the $x$-th pixel on the $y$-th pixel, and $U$ and $V$ are the feature maps obtained after convolution, normalization, and rectified linear unit (ReLU) activation. The self-attention feature map $F_P$ is given by (5), where $\beta$ is a learnable parameter initialised to 0 and $R$ is the feature map extracted after each convolution; $F_P$ and $F_t$ are shown in Figure 2. This transfer incurs some information loss regardless of the associated weights, so $F_P$ and $F_t$ are fused to capture relationships between features at different positions. The CMF detection module captures the context information represented by the convolution features, as given in (6), where $\tau$ and $\mu$ are the parameters of a Gaussian distribution that are learned during training.
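Equations (4) and (5) are not reproduced here. The description (a pixel-to-pixel attention map from $U$ and $V$, a learnable $\beta$ initialised to 0, and the feature map $R$) matches the common position-attention formulation, which this sketch assumes: $AM_{xy} = \exp(U_x \cdot V_y) / \sum_{x'} \exp(U_{x'} \cdot V_y)$ and $F_P(y) = \beta \sum_x AM_{xy} R_x + F(y)$, where the residual input $F$ is itself an assumption. Feature maps are lists of per-pixel channel vectors, and the function names are illustrative.

```python
import math


def attention_map(U, V):
    """AM[x][y]: softmax over source pixels x of the dot product U[x] . V[y],
    i.e. how strongly pixel x influences pixel y (each column sums to 1)."""
    n = len(U)
    am = [[0.0] * n for _ in range(n)]
    for y in range(n):
        scores = [sum(u * v for u, v in zip(U[x], V[y])) for x in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        for x in range(n):
            am[x][y] = exps[x] / total
    return am


def position_attention(U, V, R, F, beta=0.0):
    """F_P[y] = beta * sum_x AM[x][y] * R[x] + F[y] (assumed residual form)."""
    am = attention_map(U, V)
    n, c = len(F), len(F[0])
    out = []
    for y in range(n):
        context = [sum(am[x][y] * R[x][k] for x in range(n)) for k in range(c)]
        out.append([beta * context[k] + F[y][k] for k in range(c)])
    return out


# Three pixels with two channels each; U, V, R and F are toy feature maps.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(position_attention(feats, feats, feats, feats, beta=0.0) == feats)  # True
```

With $\beta = 0$ the module starts as an identity mapping and learns how much global context to mix in, which is why initialising $\beta$ to 0 is a common choice.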

Estimation of correlation
The main issue in estimating correlation features is that, for each forged frame, the corresponding original area in the frame must also be found; the tampered area is then mapped from the original area, which helps locate similar areas. The mapped features are denoted $L_3$, $L_4$, and $L_5$. The similarity $T^{v}_{a,b}$ between the mapped feature of the $a$-th patch, $L^{v}_{a}$, and that of the $b$-th patch, $L^{v}_{b}$, is given by (7). To discard irrelevant information, a sorting step selects the indices of the largest similarity values, $\mathrm{index}_v(X)$, as formulated in (8), where Peak_X_index denotes the peak values and $T^{v}$ is the similarity measure of the mapped feature $L^{v}$.
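Equations (7) and (8) are not reproduced here. Assuming (7) is a normalised (cosine) similarity between patch feature vectors and (8) keeps the indices of the highest similarities, self-correlation with peak selection can be sketched as follows. Each patch's similarity to itself is excluded because it is always maximal; `k` and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Normalised dot product of two patch feature vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def similarity_matrix(patches):
    """T[a][b]: similarity between the features of patches a and b."""
    return [[cosine(a, b) for b in patches] for a in patches]


def peak_indices(sim_row, self_index, k):
    """Indices of the k most similar *other* patches, best first."""
    candidates = [(s, j) for j, s in enumerate(sim_row) if j != self_index]
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [j for _, j in candidates[:k]]


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [-1.0, 0.0]]
T = similarity_matrix(patches)
print(peak_indices(T[0], 0, k=2))  # [2, 1]
```

Keeping only a few peaks per patch discards the many weak, irrelevant similarities while preserving the candidate source/copy pairs.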
The mapped features share the same spatial dimensions but have different numbers of channels. The matching result $L_{total}$ over these features is given by (9). Because the tampered region in CMF is often scaled, it is essential to apply the correlation mechanism across these features.

DetNet2
Existing methods can only detect the tampered area and do not refine the detection, which largely affects detection performance. DetNet2 comprises five components: an input layer, a down-sampling layer, an up-sampling layer, a bridge layer, and an output layer. The input layer comprises 64 filters with an activation function and batch normalization; bilinear interpolation is used for up-sampling and average pooling for down-sampling. A skip connection is introduced after up-sampling, followed by another activation function. Figure 4 displays the DetNet2 architecture.
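The text names the DetNet2 operations but not their exact configuration. The sketch below shows them on a single-channel 2-D list: 2x2 average pooling for down-sampling, bilinear up-sampling (with an align-corners coordinate mapping, an assumption), and a skip connection followed by ReLU. The additive form of the skip connection is also an assumption, since such networks may instead concatenate skip features; the 64-filter convolutions and batch normalization are omitted.

```python
def avg_pool2x2(img):
    """Down-sampling: average of each non-overlapping 2x2 block."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]


def bilinear_upsample(img, out_h, out_w):
    """Up-sampling by bilinear interpolation (align-corners mapping)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1, dy = min(y0 + 1, in_h - 1), y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1, dx = min(x0 + 1, in_w - 1), x - x0
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def skip_connection(decoder, encoder):
    """Additive skip connection followed by ReLU (assumed form)."""
    return [[max(0.0, d + e) for d, e in zip(dr, er)]
            for dr, er in zip(decoder, encoder)]


img = [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]]
pooled = avg_pool2x2(img)
print(pooled)  # [[1.0, 3.0], [5.0, 7.0]]
restored = bilinear_upsample(pooled, 4, 4)
fused = skip_connection(restored, img)
```

The skip connection reinjects the full-resolution encoder detail that pooling discarded, which is what lets the decoder sharpen the boundary of the tampered area.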

Frame matching algorithm
Algorithm 2 matches frames after duplication, determines the tampered area, and estimates the correlation coefficient. The input is first down-sampled to reduce the number of pixels, which lowers the computational cost of computing the correlation coefficients.
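Algorithm 2 itself is not reproduced in this text. A minimal sketch of the described idea, assuming grey-level frames stored as 2-D lists, down-sampling by simple striding, and the Pearson coefficient as the correlation measure, reports frame pairs whose coefficient reaches a hypothetical threshold as candidate duplicates. Pairs closer than `min_gap` frames are skipped because adjacent frames are naturally similar; all names and parameter values are illustrative.

```python
import math


def downsample(frame, factor):
    """Keep every `factor`-th row and column (simple striding)."""
    return [row[::factor] for row in frame[::factor]]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length pixel lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant frame carries no correlation information
    return cov / math.sqrt(va * vb)


def match_frames(frames, factor=2, threshold=0.98, min_gap=2):
    """Return (i, j, r) for frame pairs at least `min_gap` apart whose
    correlation r on down-sampled pixels reaches `threshold`."""
    small = [[p for row in downsample(f, factor) for p in row] for f in frames]
    pairs = []
    for i in range(len(small)):
        for j in range(i + min_gap, len(small)):
            r = pearson(small[i], small[j])
            if r >= threshold:
                pairs.append((i, j, r))
    return pairs


A = [[1, 2], [3, 4]]
B = [[4, 1], [2, 3]]
C = [[2, 4], [1, 3]]
print(match_frames([A, B, C, A], factor=1, threshold=0.99))  # [(0, 3, 1.0)]
```

Down-sampling by a factor f cuts the pixels per comparison by roughly f squared, which dominates the cost of the quadratic pairwise loop.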

False detection reduction
Algorithm 3 presents the optimization process for reducing false frame detections, which comprises several phases. In the first phase, a tampered frame is detected through an abnormal spike in the correlation coefficients, i.e., the maximum among the correlation coefficients. Coefficients significantly higher than threshold_A (set D1) determine the set of similar frames. Each tampered frame x is detected as a local symmetric centre. The while loop iterates over the copy-moved frames, and the algorithm outputs the start and end of the tampered frames. Copy-move forgeries produce abnormal behaviour in the sequence of sums. However, a tampered area is not the only cause of spikes; other factors can also produce spikes or local symmetric centres. In the correlation phase, adjacent frames with high similarity may also cause false detections.
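The text states that Algorithm 3 outputs the start and end of the tampered frames and must suppress spikes that are not caused by tampering. One simple way to do this, assumed here rather than taken from the paper, is to group consecutive flagged frames into segments and discard segments shorter than a hypothetical minimum length, since a copy-moved clip spans several frames whereas isolated spikes often have other causes.

```python
def tampered_segments(flags, min_length=3):
    """Group consecutive flagged frames into (start, end) segments and drop
    segments shorter than min_length, treating them as false detections."""
    segments = []
    start = None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    return segments


flags = [False, True, True, True, False, True, False, True, True, True, True]
print(tampered_segments(flags))  # [(1, 3), (7, 10)]
```

The isolated flag at frame 5 is discarded; raising `min_length` trades missed short forgeries for fewer false positives.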

Loss computation network
During training, the network minimizes a cross-entropy loss, since forgery detection is treated as a pixel-wise classification task. The cross-entropy is calculated as (10), where $P(x, y) \in \{0, 1\}$ denotes the value at pixel $(x, y)$ and $X$ denotes the tampered area. The loss is computed at each pixel, and the relationship between adjacent pixels at the boundary between the tampered and original areas is also considered. To preserve structural information, the total loss $L_{lf}$ sums $L_{cel}$, $L_{sl}$, and $L_{iu}$: $L_{cel}$ supports pixel-level segmentation and helps the model converge on all pixels, $L_{sl}$ is the similarity loss, and $L_{iu}$ is the intersection-over-union loss. $L_{cel}$ is the total loss over all pixels, whereas $L_{sl}$ fine-tunes the network to focus more on the tampered area. $L_{sl}$ is given by (12), where $\mu_P$ and $\mu_X$ are the means of $P$ and $X$, and $\sigma_P$, $\sigma_X$, and $\sigma_{PX}$ are their standard deviations and covariance. $L_{iu}$ is computed during training for object detection and segmentation. The three losses are combined into a hybrid loss, as depicted in (11).
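The text names the three components of the hybrid loss but its equations (10)-(12) are not reproduced. As an assumption, the sketch below uses widely used forms: mean binary cross-entropy, a structural-similarity (SSIM) loss built from means, variances, and covariance, and a soft IoU loss, summed with equal weights. For brevity SSIM is computed once over the whole map rather than over local windows, and c1, c2 are the common defaults for values in [0, 1]; none of this is confirmed as the paper's exact formulation.

```python
import math


def bce_loss(P, X, eps=1e-7):
    """Mean binary cross-entropy between predictions P and ground truth X."""
    total = 0.0
    for p, x in zip(P, X):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(x * math.log(p) + (1 - x) * math.log(1 - p))
    return total / len(P)


def ssim_loss(P, X, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM, computed once over the whole map (no sliding window)."""
    n = len(P)
    mp, mx = sum(P) / n, sum(X) / n
    vp = sum((p - mp) ** 2 for p in P) / n
    vx = sum((x - mx) ** 2 for x in X) / n
    cov = sum((p - mp) * (x - mx) for p, x in zip(P, X)) / n
    ssim = ((2 * mp * mx + c1) * (2 * cov + c2)) / (
        (mp ** 2 + mx ** 2 + c1) * (vp + vx + c2))
    return 1.0 - ssim


def iou_loss(P, X):
    """1 - soft intersection over union."""
    inter = sum(p * x for p, x in zip(P, X))
    union = sum(p + x - p * x for p, x in zip(P, X))
    return 1.0 - inter / union if union > 0 else 0.0


def hybrid_loss(P, X):
    """Equal-weight sum of the three terms (weights are an assumption)."""
    return bce_loss(P, X) + ssim_loss(P, X) + iou_loss(P, X)


mask = [0, 1, 1, 0]  # flattened ground-truth tampered area
good = hybrid_loss([0.0, 1.0, 1.0, 0.0], mask)
bad = hybrid_loss([0.9, 0.1, 0.2, 0.8], mask)
print(good < bad)  # True
```

The cross-entropy term acts per pixel, the SSIM term rewards matching local structure, and the IoU term acts on the region as a whole, which matches the roles the text assigns to $L_{cel}$, $L_{sl}$, and $L_{iu}$.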

PERFORMANCE EVALUATION
This section evaluates the proposed model. Experiments were run on a Windows 10 system with 16 GB of RAM and a CUDA-enabled NVIDIA GPU with 4 GB of memory. The model was implemented in Python using various deep learning libraries. The model is evaluated with different metrics, and its efficiency is demonstrated through comparison with state-of-the-art techniques and an existing model.

Dataset details
VCMF is a complex manipulation, so designing a dataset for it is complicated. This research uses two datasets, REWIND [23] and VTD [24]. Both benchmark datasets contain various CMF types, i.e., inter-frame and intra-frame, as discussed later.

Metrics evaluation

Accuracy
Accuracy describes how well a model performs across classes; here, it measures how correctly forged frames are predicted and is computed as (14).

Precision
Precision is the ratio of correctly classified forged frames to all samples predicted as positive (forged), given as (15).

Recall
Recall is the ratio of correctly classified positive samples to all actual positive samples, given as (16).
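Equations (14)-(16), and (17) for the F1-score, are not reproduced in this text. The sketch below uses the standard definitions based on confusion-matrix counts, with forged frames as the positive class (label 1); the F1-score is the harmonic mean of precision and recall. The function name is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score with label 1 = forged."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Here there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so every metric equals 0.75.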

Dataset 1 evaluation
The REWIND dataset is a benchmark dataset designed for video-based CMFD; it contains 10 genuine videos, 40 derived inter-frame forgeries, and 10 forged videos, and each sequence has a frame rate of 30 fps. Detection accuracy, false positives, and F1-score are evaluated and compared with E-DIN [18], serial-CMFD [19], PM-2D (fast) [14], PM-2D [14], DFMI-BM [12], and the existing novel-VCMFD model [4]. Table 2 presents sample non-forged and forged frames.

Detection accuracy
Figure 5 shows the number of forged videos detected correctly by each method; the x-axis lists the methods and the y-axis the number of forged videos. E-DIN correctly detects 5 forged videos, whereas serial-CMFD detects 6.

False positive comparison
Figure 6 presents the false positive comparison; the y-axis shows the number of false positives and the x-axis the methods. Beyond detecting that a video is forged, it is also important to identify the correct frames, because an incorrectly identified frame leads to misinterpretation.
Figure 7 shows the F1-score comparison on dataset 1: E-DIN and serial-CMFD have very low F1-scores of 16% and 19%, whereas PM-2D (fast), PM-2D, and DFMI-BM achieve above-average F1-scores of 79%, 84%, and 86%, respectively. The existing model achieves 87%, whereas the proposed model achieves 95%.
The VTD dataset [24] is another public forensic library covering different types of forgery, including CMF; it was updated in 2019, and each video has 720p resolution. Table 3 presents sample forged and non-forged frames. DDN is evaluated using accuracy, precision, recall, and F1-score and compared with various existing models: fast and robust [25], histogram of oriented gradients (HOG) and compression [26], adaptive over-segmentation [27], spatio-temporal context [28], inter-frame mechanism [29], local binary patterns (LBP)-detection [30], discrete Radon polar exponential transform (DRPCET) [31], fast and effective [32], and the existing model, video forgery detection using the histogram of second order gradients (VFDHSOG) [33].
Figure 8 shows the accuracy comparison of the existing models: fast and robust CMFD achieves an average accuracy of 69.7%, while HOG and compression, adaptive over-segmentation, and spatio-temporal context achieve good accuracies of 88.3%, 91.4%, and 93.1%, respectively. Similarly, the inter-frame mechanism achieves an accuracy of 96.

Comparative analysis and detection
This section discusses the improvement of DDN over the existing models across various parameters. On dataset 1, all 10 forged videos were detected correctly; the existing model has 1 false positive out of 10, whereas DDN-PS has none. On dataset 2, DDN improves accuracy by 2%, recall by 1.45%, precision by 1.97%, and F1-score by 3.50% over the best-performing existing model.

CONCLUSION
Tampering with digital video that may serve as evidence in court is a serious concern, and reliable digital video forensics is still in its early stages. Various video editing tools, such as Adobe Premiere and After Effects, GNU GIMP, and Vegas, are readily available and can be used to tamper with video content. Various techniques proposed in the literature detect tampered video content; however, these models suffer from limitations. Thus, this research develops DDN for video forgery detection; DDN comprises two networks, one for general feature extraction and one for deep custom feature extraction, to distinguish between original and tampered frames. DDN is an end-to-end approach for forgery detection in which the output of DetNet1 is fed into DetNet2 and the combined model is optimized; in addition, three algorithms, for probable tampered-frame detection, frame matching, and false-detection reduction, are introduced for efficient and effective forgery detection. DDN is evaluated on two benchmark datasets, REWIND and VTD, using various metrics; the comparative analysis shows that DDN outperforms the existing models by a modest margin, achieving fewer false positives and higher detection accuracy on the REWIND dataset and higher precision, recall, accuracy, and F1-score on dataset 2. Future work will focus on enhancing the system's ability to handle tampered videos with large static scenes and careful modifications, and on developing a more comprehensive approach to large-scale video forgery detection.


ISSN: 2089-4864 Int J Reconfigurable & Embedded Syst, Vol. 13, No. 2, July 2024: 458-471 460
In this paper, a new approach is designed that incorporates these three properties (low computational complexity, high accuracy, and robustness) into a unique technique for detecting CMF.

F1-score
F1-score combines the precision and recall of a classifier into a single metric by computing their harmonic mean, as given in (17).
In the detection accuracy comparison on dataset 1 (Figure 5), PM-2D (fast) and PM-2D each detect 9 forged videos, and DFMI-BM also detects 9; both the existing model and the proposed model detect all 10 forged videos.
In the false positive comparison (Figure 6), serial-CMFD, E-DIN, PM-2D (fast), and PM-2D misclassify 5, 3, 3, and 2 of the 10 videos, respectively; the existing novel-VCMFD model fails on 1, whereas the proposed model fails on none.

Figure 7. F1-score comparison

Table 1. Original and copy-move images

Table 2. Sample non-forged and forged frames

Table 3. Sample forged and non-forged frames from the VTD dataset

On the VTD dataset, VFDHSOG achieves an accuracy of 92.6%, and the proposed DDN achieves 98.3%. Figure 9 shows the recall comparison of the existing models: LBP-detection, DRPCET, and VFDHSOG achieve recall values of 82.7%, 92.7%, and 93.2%, respectively, while fast and effective CMFD achieves 95.8% and the dual deep network proposed system (DDN-PS) achieves 97.2%.