A novel ensemble deep network framework for scene text recognition

ABSTRACT


INTRODUCTION
Text carries information central to many applications of artificial intelligence, including image retrieval, navigation, and translation. Recognizing text in natural scenes has long been important to understanding the vision system, because text is a widely used and versatile means of communication. Problems arise from the properties of the text itself, including its readability. Reading begins with text detection, which locates text in images, followed by text recognition, which converts those regions into readable words. Scene text reading has many applications in daily life, including machine translation, which overcomes language restrictions by allowing text to be read and translated instantly. Combined with text-to-speech, visual aids can help visually impaired people read instructions on signs, automatic teller machines (ATMs), or books. Further applications include multimedia retrieval, object identification, and intelligent analysis [1].
Scene text recognition is classically treated as an optical character recognition (OCR) problem applied after text detection and segmentation. Studies in the literature have yielded steadily positive results in several areas. While text appears in many different fonts, complex backgrounds force researchers to unravel the complexity of the writing. Image distortion makes recognition very difficult and is aggravated by the limitations of the imaging medium, including blur, uneven brightness, and orientation. These factors give each image unique characteristics. Another challenge is identifying relevant features from a region of interest; recognition is then performed through fragment-to-fragment matching, guided by attention over natural text images. The attention-based encoder automatically focuses on the most relevant parts of the text. Lei et al. [15] treated sequence recognition as a convolutional problem, combining a CNN with an RNN; multiple variants of the sequence-tagging architecture were extracted and analyzed. Sheng et al. [16] used encoder and decoder models with stacked sequence-to-sequence self-attention; additionally, a transformation method can convert 2D images of natural scenes into 1D features.
Ahmed et al. [17] introduced a network architecture called the "transformer" for machine translation that contains no convolution or recurrence and is based entirely on an attention mechanism. By computing position pairs through a neural network, the behavior of each module can be linked to a specific function of the system, yielding a more parallelizable network. Dong et al. [18] adopted a transformer architecture to solve the speech recognition problem. Similarly, Yu et al. [19] designed a network that compensates for reading difficulties by combining convolutional techniques with a self-attention system; the transformer frame is inspired by this combination model. Dehghani et al. [20] based their architecture on a design called the "universal transformer" to process strings and other translations depending on the string length found at runtime. Building on the transformer model, [21] designed a sequence-to-sequence acyclic STR framework with self-attention as a core part of the encoder and decoder architecture. Yang et al. [22] proposed a stronger and easier-to-implement STR network based on holistic representation. Mu et al. [23] proposed a random blur unit (RBU) that divides the blur function into different classes; the pixels of a unit share similar characteristics, and the RBU provides additional samples to train the model and deliver more effective models. Zou et al. [24] implemented a bidirectional long short-term memory (Bi-LSTM) algorithm using the heat map and information about the behavior of the plates.

PROPOSED METHOD
The proposed framework has three stages: a customized CNN followed by a two-part autoencoder; the EDN architecture is shown in Figure 1. Figure 1 shows the full operation and design of the deep ensemble, which consists of three parts. The first part uses the CNN to rectify irregular text so that it resembles a regular input and to extract important text features. The second and third parts involve the encoding and decoding stages of the deep autoencoder. Each component and its mathematical model are discussed in the following subsections.

Customized convolutional neural network
The CNN used here is a lightweight network that transforms the input into a larger, more readable image. The two-stage autoencoder first edits the image and encodes the output into two representations; finally, the output is decoded by the autoencoder. The model works by detecting the gradient of the text image; from this stage the neural network learns the transformation, which comprises three components: the localized network, the grid generator, and the sampler. The process controls the content and output through a sampling engine: the original image is converted into a rectified image captured by the sampler.

Localized network
A transformation is evaluated across two sets of control points of the same size I, denoted G^a and G^q. Here G^a = [G_1^a, ..., G_I^a] ∈ A^(2×I) is the set of source control points, where G_i^a = [s_i^a, n_i^a]^T is the i-th point. G^q is the set of target points, of the same size as G^a; the points of G^q are placed at fixed positions along the top and bottom borders of the output image. The localized network regresses G^a from the input image through convolutional and pooling layers, followed by a fully connected layer of output dimension 2I. The network requires no control-point annotations and is trained end-to-end by back-propagation.
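As a minimal sketch of the regression head, assuming the final fully connected layer uses a tanh activation so that each coordinate lands in a normalized [-1, 1] space (the feature dimension and weights below are illustrative, not the paper's):

```python
import numpy as np

def predict_control_points(features, W, b, I):
    """Map a flattened CNN feature vector to 2*I control-point coordinates.

    tanh keeps each coordinate inside [-1, 1] (normalized image space).
    """
    out = np.tanh(features @ W + b)   # shape (2*I,)
    return out.reshape(I, 2)          # G^a: one (s, n) pair per control point

rng = np.random.default_rng(0)
I = 10                                # number of control points (assumed)
feat = rng.standard_normal(64)        # pooled CNN feature vector (assumed size)
W = rng.standard_normal((64, 2 * I)) * 0.01
b = np.zeros(2 * I)

G_a = predict_control_points(feat, W, b, I)
assert G_a.shape == (I, 2)
assert np.all(np.abs(G_a) <= 1.0)
```

Keeping the coordinates normalized is what lets the same grid generator serve input images of any resolution.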

Location mapping
Given G^a and G^q, a sampling grid is generated that maps each location of the rectified image back to the input image. The transformation is shown in (1), where x ∈ A^(2×1), y ∈ A^(2×2), l^q ∈ A^(I×2), and γ is a function calculated by (2) and (3); a linear model is formed with specific boundary conditions.
The parameters x, y, and w are evaluated; l_1 and l_2 are the first and second columns of l. G_q^s and G_q^t are obtained through the coordinates s and t of G^q. Equation (1) is then rewritten as (5), where V ∈ A^(I×I) is a matrix with elements V_(i,j) = γ(||G_i^q − G_j^q||).
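Assuming the standard thin-plate-spline form with kernel γ(d) = d² log d², the grid-mapping step can be sketched as follows; the control-point layout mirrors the top-and-bottom-border placement described above, while the distortion itself is synthetic:

```python
import numpy as np

def tps_kernel(d):
    # gamma(d) = d^2 log d^2, with gamma(0) = 0 (thin-plate radial basis)
    d2 = d ** 2
    return np.where(d2 == 0, 0.0, d2 * np.log(d2 + 1e-12))

def fit_tps(G_q, G_a):
    """Solve the linear system for TPS coefficients mapping G_q onto G_a."""
    I = G_q.shape[0]
    V = tps_kernel(np.linalg.norm(G_q[:, None] - G_q[None, :], axis=-1))  # (I, I)
    P = np.hstack([np.ones((I, 1)), G_q])                                  # affine part
    L = np.zeros((I + 3, I + 3))
    L[:I, :I] = V
    L[:I, I:] = P
    L[I:, :I] = P.T
    rhs = np.vstack([G_a, np.zeros((3, 2))])
    return np.linalg.solve(L, rhs)                                         # (I+3, 2)

def apply_tps(coef, G_q, pts):
    """Map output-image points back to input-image coordinates."""
    U = tps_kernel(np.linalg.norm(pts[:, None] - G_q[None, :], axis=-1))
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ coef[:G_q.shape[0]] + P @ coef[G_q.shape[0]:]

# target control points on the top and bottom borders of the output image
xs = np.linspace(0, 1, 5)
G_q = np.vstack([np.stack([xs, np.zeros(5)], 1), np.stack([xs, np.ones(5)], 1)])
G_a = G_q + 0.05 * np.sin(3 * G_q)        # a mildly distorted source layout
coef = fit_tps(G_q, G_a)
mapped = apply_tps(coef, G_q, G_q)
assert np.allclose(mapped, G_a, atol=1e-6)  # TPS interpolates the control points
```

By construction the fitted transform passes exactly through the control points, which is what the boundary conditions in (2) and (3) enforce.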

Interpolation of an input image
This step ensures that the pixel values of the rectified image are sampled from the input image. If a position falls outside the input image, its value is clipped to the image border. Bilinear interpolation is used to compute each pixel value of the rectified image from its nearest input pixels. Like the localized network, this module is differentiable, which allows the CNN to be trained with gradient-based algorithms.
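A minimal NumPy sketch of the sampling step, with out-of-image coordinates clipped to the border as described:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate img at continuous (x, y); coordinates outside
    the image are clipped to the border."""
    h, w = img.shape
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x1]
    bot = (1 - dx) * img[y1, x0] + dx * img[y1, x1]
    return (1 - dy) * top + dy * bot

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
assert bilinear_sample(img, 0.5, 0.5) == 1.5   # centre of the 2x2 patch
assert bilinear_sample(img, -3.0, 0.0) == 0.0  # out-of-range coordinate clipped
```

Because the interpolation weights are continuous in (x, y), gradients flow through the sampler back to the control-point regressor.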

Deep auto-encoder
A visual feature extraction mechanism is used in the autoencoder. Scene text images are biased: the attention mechanism created by the autoencoder suppresses the background and correspondingly emphasizes the foreground. The receptive field constrains the convolutional process, so the ensemble network here is responsible for elaborating the regional context behind the visual extraction mechanism. Conventional approaches expose the disadvantages of RNNs; both branches extract features from various receptive fields. In the first branch, the visual feature map of the evaluation network feeds the output of the decoder. The second branch processes the specific maps determined by the best combination before they are fed to the autoencoder.

Optimized attention residual
In the auto-encoder phase, an attention-based residual block is adopted to extract visual features; the architecture of the residual block is shown in Figure 1. An attention mechanism is applied before the trunk and shortcut branches are merged. The modules provide separate channel and spatial attention with negligible parameter overhead and are plugged into the different residual blocks, following (7): FM′ = AM_x(AM_g(FM) ⊗ FM) ⊗ (AM_g(FM) ⊗ FM), where ⊗ denotes broadcast multiplication; AM_g is applied to FM first and AM_x second, generating the attentional feature map. This model is capable of capturing long-term dependencies over the input feature sequence: the output at each timestamp depends on the present input and all previous inputs. The feature map, however, is processed in a single direction, and the capacity of a single layer is limited. Given an input FM′ ∈ A^(L′×Z′×G′), it is reshaped to (L′Z′)×U′, where U′ is the number of hidden units. Auto-encoder-based text recognition relies on two aspects: the inherent visual features of the text images and the semantic dependency between characters. Both aspects are evaluated by the auto-encoder incorporating the attention mechanism: it accommodates the visual feature map FM from the attention-based ensemble network, while the attention mechanism focuses on semantic context features, utilizing the output of the stacked auto-encoder. The two losses P_At and P_enc are weighted and added for back-propagation during training; the total loss is determined by P_tot in (8): P_tot = P_At + ∁·P_enc, where ∁ is a hyper-parameter set to a proximal value. The approach has several advantages, such as parallel training and parameter-free decoding. Scene-text recognition selects the most probable character sequence; the output dimension is the number of class symbols plus one, (K + 1), to accommodate the blank. For an input feature sequence s of length Q, a label sequence p can be produced by many different alignments, so the probability distribution over p is obtained by summing the probabilities over all possible alignments π. The probability ξ(p|s) of label p given s is determined as ξ(p|s) = Σ_{π ∈ Y⁻¹(p)} Π_t P_t(π_t|s), where P_t is the per-timestep probability and Y⁻¹(p) is the set of alignment sequences that Y maps to p.
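The marginalization over alignments can be verified by brute force on a toy example; the three-symbol alphabet (with '-' as the blank), the three timesteps, and the probability table below are illustrative, not taken from the paper:

```python
import itertools
from collections import defaultdict
import numpy as np

chars = ['a', 'b', '-']                  # '-' is the blank symbol
P = np.array([[0.6, 0.1, 0.3],           # P_t over chars, t = 1..3
              [0.2, 0.3, 0.5],
              [0.5, 0.2, 0.3]])

def collapse(path):
    """Y: merge adjacent repeats, then drop blanks ('aa-a' -> 'aa')."""
    out = []
    for c in path:
        if not out or c != out[-1]:
            out.append(c)
    return ''.join(c for c in out if c != '-')

def ctc_prob(label):
    """xi(p|s): sum path probabilities over every alignment in Y^-1(p)."""
    total = 0.0
    for path in itertools.product(range(3), repeat=3):
        if collapse(''.join(chars[i] for i in path)) == label:
            total += float(np.prod([P[t, path[t]] for t in range(3)]))
    return total

# accumulate the probability mass of every collapsed label
totals = defaultdict(float)
for path in itertools.product(range(3), repeat=3):
    lab = collapse(''.join(chars[i] for i in path))
    totals[lab] += float(np.prod([P[t, path[t]] for t in range(3)]))

assert abs(sum(totals.values()) - 1.0) < 1e-9   # alignments partition all paths
assert abs(ctc_prob('ab') - totals['ab']) < 1e-12
```

In practice the sum is computed with the forward-backward recursion rather than by enumeration, since the number of paths grows exponentially with Q.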
Assuming this, the most relevant labelling for the input sequence is obtained through (9)-(11). Exact decoding is exponential in Q, so a predictive (greedy) decoding mechanism is used. This sequence-based prediction translates the feature sequence into a character sequence of a different length through a separate mechanism. The attention mechanism takes the visual features into account to model the dependencies, capturing the output and the dependency simultaneously, and uses the previous output at each step. The process is repeated to create a sequence y = (y_1, y_2, ..., y_A) of length A.
For the o-th step, the output y_o of the sequence y = (y_1, y_2, ..., y_A) is predicted through (12)-(16), with the decoder state updated as v_o = f(v_{o−1}, y_{o−1}, c_o); here E_o, y_o, x, E, X, and y are all learnable parameters and v_o is the hidden state of the decoder at step o. f_an is the alignment model that scores the input at position a against the output at position n. SAM denotes the soft-arg-max function and HTF the hyperbolic tangent function.
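A single decoder step can be sketched as follows; the additive scoring form e_n = w^T·HTF(Wv + Uh_n) followed by SAM normalization is an assumption in the spirit of standard attention, and all shapes and weights are illustrative:

```python
import numpy as np

def softmax(z):
    """SAM: soft-arg-max over alignment scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(h, v, W, U, w):
    """Score each encoder position against the decoder state v, normalise
    the scores, and return the alignment weights and context vector."""
    scores = np.array([w @ np.tanh(W @ v + U @ h_n) for h_n in h])  # HTF = tanh
    alpha = softmax(scores)          # alignment weights over input positions
    context = alpha @ h              # weighted sum of encoder features
    return alpha, context

rng = np.random.default_rng(1)
h = rng.standard_normal((7, 16))     # 7 encoder positions, 16-d visual features
v = rng.standard_normal(8)           # decoder hidden state
W = rng.standard_normal((16, 8))
U = rng.standard_normal((16, 16))
w = rng.standard_normal(16)

alpha, ctx = attention_step(h, v, W, U, w)
assert abs(alpha.sum() - 1.0) < 1e-9
assert ctx.shape == (16,)
```

The context vector ctx is what feeds the recurrence v_o = f(v_{o−1}, y_{o−1}, c_o) at the next step.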

PERFORMANCE EVALUATION
This section evaluates the proposed framework on irregular and regular benchmark datasets. The datasets contain images with multilingual content and major irregularities. EDN-PS is implemented with deep learning libraries on a Windows system configured with a 4 GB CUDA-enabled graphics card and 16 GB of RAM.

Dataset details
Regular datasets
The IIIT5K dataset [9], collected from internet data, contains 3,000 cropped word test images. Each image is associated with a 50-word short lexicon and a 1,000-word long lexicon; each lexicon contains the ground-truth word together with randomly selected dictionary words. Most images in the ICDAR-13 dataset [25] are inherited from IC03, its predecessor. For a fair comparison, words containing non-alphanumeric characters are removed; the filtered test set consists of 1,015 cropped word images with no associated lexicons.

Irregular datasets
The ICDAR-15 dataset [26] contains 6,545 cropped text images, split into 4,468 training images and 2,077 testing images; no lexicon is associated with the words. Images in the ICDAR-15 dataset were captured using Google Glass without proper positioning or focus. The CUTE collection consists of high-resolution pictures taken in realistic locations and provides 288 cropped text images for testing [27]. Because most of its word images contain arbitrarily shaped letters, CUTE is the most challenging dataset to analyze; it is likewise unassociated with any lexicon.

Experimental analysis
In this section, we evaluate our proposed EDN-PS model on two types of datasets, regular and irregular. As regular datasets we consider the IIIT5K and ICDAR-13 datasets, and as irregular datasets the ICDAR-15 and CUTE datasets. The results for each are reported below to demonstrate the efficiency of the model.

IIIT5K dataset
The IIIT5K dataset, collected from internet data, contains 3,000 cropped word test images. Table 1 and Figure 2 compare the proposed EDN-PS model with existing state-of-the-art techniques in terms of accuracy on the IIIT5K dataset, with the results plotted as a graph. The proposed EDN-PS model outperforms the existing state-of-the-art techniques, achieving an accuracy of 98.3.

ICDAR-13 dataset
Most images in the ICDAR-13 dataset are inherited from IC03, its predecessor, with words containing non-alphanumeric characters removed. Table 2 and Figure 3 show the accuracy comparison between the proposed EDN-PS model and existing state-of-the-art techniques on the ICDAR-13 dataset, with the results plotted as a graph. The proposed EDN-PS model outperforms the existing techniques, achieving an accuracy of 98.42.

ICDAR-15 dataset
The ICDAR-15 dataset contains 6,545 cropped text images, split into 4,468 training images and 2,077 testing images, with no associated lexicon. Table 3 and Figure 4 show the accuracy comparison between the proposed EDN-PS model and existing state-of-the-art techniques on the irregular ICDAR-15 dataset, with the results plotted as a graph. The proposed EDN-PS model outperforms the existing techniques, achieving an accuracy of 90.72.

CUTE dataset
The CUTE collection consists of high-resolution pictures taken in realistic locations and provides 288 cropped text images for testing. Table 4 and Figure 5 show the accuracy comparison between the proposed EDN-PS model and existing state-of-the-art techniques on the irregular CUTE dataset, with the results plotted as a graph. The MORAN [27] approach gives the lowest accuracy, 77.4, and the AON [32] approach gives 76.8, whereas the SCATTER [30] and MASTER [37] methods both reach 87.5 and ES [41] achieves a high value of 91.3. The proposed EDN-PS model outperforms all of the existing techniques, achieving an accuracy of 98.96.

Comparative analysis
A comparative analysis is made between the existing and proposed methods, and the level of improvement is calculated for the regular and irregular datasets; the comparison is shown in Table 5. For the IIIT5K dataset, the accuracy of the best existing method is 97.7, while that of the proposed EDN-PS is 98.3, a 0.612% improvement. For the ICDAR-13 dataset, the accuracy of the existing system is 98, while that of the proposed EDN-PS is 98.42, a 0.42765% improvement. For the ICDAR-15 dataset, the accuracy of the existing method is 88.2, while that of the proposed EDN-PS is 90.72, a 2.8169% improvement. For the CUTE dataset, the accuracy of the existing method is 91.3, while that of the proposed EDN-PS is 98.96, an 8.05214% improvement. We conclude that our design outperforms the current state of the art.
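The improvement figures above appear to approximately follow the relative formula 100 × (proposed − existing) / existing; small discrepancies with the reported numbers presumably come from rounding. A quick check:

```python
# best existing accuracy vs. proposed EDN-PS accuracy, per dataset (from the text)
baseline = {'IIIT5K': 97.7, 'ICDAR-13': 98.0, 'ICDAR-15': 88.2, 'CUTE': 91.3}
edn_ps   = {'IIIT5K': 98.3, 'ICDAR-13': 98.42, 'ICDAR-15': 90.72, 'CUTE': 98.96}

# relative improvement in percent
improvement = {k: 100.0 * (edn_ps[k] - baseline[k]) / baseline[k] for k in baseline}

assert abs(improvement['IIIT5K'] - 0.614) < 0.01   # text reports 0.612%
assert abs(improvement['ICDAR-13'] - 0.429) < 0.01 # text reports 0.42765%
assert improvement['CUTE'] > 8.0                   # text reports 8.05214%
```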

CONCLUSION
This research develops an EDN for STR covering regular and irregular text. The EDN comprises a customized CNN for irregular-text feature extraction and a deep autoencoder that enhances text recognition accuracy by optimizing the cost. In the first step, the arbitrary image is converted into a more regular form, which reduces the complexity of the feature extraction process. In the second step, the feature extraction model builds a feature representation accommodating the sequential transformation of the rectified image. In the third step, the auto-encoder performs feature extraction and transformation simultaneously. The proposed EDN-PS achieves an improvement of 0.612% on the IIIT5K dataset, 0.42765% on ICDAR-13, 2.8169% on ICDAR-15, and 8.05214% on CUTE over the existing systems; hence we conclude that the proposed EDN-PS model outperforms the existing state-of-the-art techniques. Future work could focus on real-time video-based STR.

Figure 1
Figure 1. EDN attention modules: channel attention and spatial attention. From an intermediate feature map FM, a pooling operation computes the average descriptor H_g^avg and the peak descriptor H_g^max; these are evaluated in parallel to capture globally distinctive features. The channel attention mask AM_g(FM) is produced by a set of convolutional layers applied to these descriptors. Following AM_g, channel-wise max-pooling and average-pooling are applied to FM; the results are concatenated and processed by a convolutional layer to produce the spatial attention mask AM_x(FM).
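The mask computation described in this caption can be sketched as follows; the bottleneck MLP for AM_g and the two-weight mixing standing in for the spatial convolution in AM_x are simplifying assumptions, and all sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_mask(FM, W1, W2):
    """AM_g: avg- and max-pool over spatial dims in parallel, pass both
    descriptors through a shared bottleneck MLP, and fuse with a sigmoid."""
    avg = FM.mean(axis=(1, 2))                         # H_g^avg, shape (L,)
    mx = FM.max(axis=(1, 2))                           # H_g^max, shape (L,)
    mlp = lambda d: W2 @ np.maximum(W1 @ d, 0.0)       # shared 2-layer MLP
    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]  # (L, 1, 1)

def spatial_mask(FM, k):
    """AM_x: channel-wise avg- and max-pooling, combined in place of the
    concatenation + convolution for brevity."""
    avg = FM.mean(axis=0)
    mx = FM.max(axis=0)
    return sigmoid(k[0] * avg + k[1] * mx)[None, :, :]  # (1, Z, G)

rng = np.random.default_rng(2)
L, Z, G = 8, 4, 6
FM = rng.standard_normal((L, Z, G))
W1 = rng.standard_normal((2, L))                       # bottleneck of width 2
W2 = rng.standard_normal((L, 2))
k = rng.standard_normal(2)

FM_c = channel_mask(FM, W1, W2) * FM                   # channel attention first
FM_out = spatial_mask(FM_c, k) * FM_c                  # then spatial attention
assert FM_out.shape == FM.shape
```

Applying the channel mask before the spatial mask matches the ordering AM_g then AM_x described in the caption.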

Table 1. Accuracy comparison for IIIT5K dataset

Table 4. Accuracy comparison for CUTE dataset

Table 5. Comparative analysis