Telugu letters dataset and parallel deep convolutional neural network with a SGD optimizer model for TCR

ABSTRACT


INTRODUCTION
Character recognition, a subfield of pattern recognition, is one of the key modules in most optical character recognition (OCR) systems. The method typically starts with feature extraction and ends with classification [1]. Feature extraction converts a segmented character image into a real-valued feature vector that describes the character in the image more accurately. Features that capture the characteristics distinguishing the participating classes help classifiers build models with clearly defined decision boundaries [2]. The feature extraction algorithm therefore significantly influences the accuracy of the character recognition process. Segmented character images occasionally show the character in a slightly translated, rotated, or deformed state [3]. Accurate character recognition is achievable when document images are of high quality. In real-time applications, document quality is essential and greatly affects how well the recognition system works [4]. The key elements determining document image quality are the characteristics of the document medium, the contents of the document, and document deterioration. These characteristics subsequently affect recognition. Characters may appear broken or touching depending on the type of paper used, especially in printed documents [5].
Non-standard fonts produce uneven spacing between characters in printed documents and give rise to touching glyphs. In handwritten documents, the content is created by a human, so the uniformity of spacing and the smoothness of the strokes influence the document's quality [6]. Another problem with handwritten documents is that some writers simplify complex glyphs of the script, which reduces inter-class diversity and increases intra-class variability [7]. Beyond defects in the paper and flaws introduced during printing or writing, document aging and digitization may create many types of distortion. Because of these distortions and degradations, some characters may not be correctly recognized even by humans without contextual information [8]-[10]. In accuracy-sensitive applications, where recognition errors may be costly, adopting a verification technique can be beneficial. The verification procedure assesses the performance of the classifier or recognizer and produces a reliable acceptance or rejection of the input pattern [11], [12].
Even in the presence of distortions, the recognition system should identify the character correctly. To develop a powerful character recognizer, the character image must be represented efficiently in an invariant form. Deep learning (DL) approaches are the state of the art in most pattern recognition applications. Convolutional neural network (CNN) architectures based on DL can learn invariant feature descriptors through subsampling and trainable filter banks [13]-[15].
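As an illustration of how subsampling contributes to invariance, the following is a minimal sketch, assuming a 1-D list stands in for one row of a character image and using non-overlapping max pooling as the subsampling step. The function name `max_pool_1d` and the toy signals are hypothetical and are not taken from the cited works.

```python
def max_pool_1d(signal, size=2):
    """Non-overlapping max pooling (subsampling) over a 1-D list."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

# Toy "stroke": a single bright pixel in an otherwise empty row (assumed data)
stroke = [0, 0, 5, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0, 0, 0]      # same stroke moved right by one pixel
far_shifted = [0, 0, 0, 0, 5, 0, 0, 0]  # same stroke moved right by two pixels

pooled = max_pool_1d(stroke)
pooled_shifted = max_pool_1d(shifted)
pooled_far = max_pool_1d(far_shifted)

print(pooled, pooled_shifted, pooled_far)
```

The one-pixel shift stays inside the same pooling window, so the pooled representation is unchanged, whereas the two-pixel shift crosses a window boundary and changes it. Subsampling therefore gives tolerance to small translations only; larger invariances have to be learned by the filter banks.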

RELATED WORK
Historical texts usually contain a large number of dispersed characters, which makes it difficult to localize and distinguish them using proposal-based and regression-based techniques. Yang et al. [16] published an approach known as the recognition guided detector (RGD), which detects Chinese characters in old texts precisely. The RGD consists of two concurrently trained CNNs: a recognition network that provides contextual information about the text, and a detection network that uses this information to localize each character precisely. Two datasets with character-level annotations were constructed to train and test the method. The datasets consist of scanned copies of the Tripitaka. The approach is supported by a text recognition accuracy of 97.25%.
Although text classification based on natural language processing (NLP) has shown promising results and has a wide range of potential practical applications, including clinical medical value, NLP for Chinese electronic medical records (CEMRs) has received less attention than NLP for English records. Most of the currently accessible CEMRs are unstructured texts with loose grammar, low utilization rates, and a tendency to mix patient symptoms, prescriptions, diagnoses, and other critical information. The capsule network model of Zhang et al. [17] for electronic medical record classification uses a unique routing architecture and combines long short-term memory (LSTM) and gated recurrent unit (GRU) models to extract intricate features of medical text. The model outperforms the baseline models by at least 4.1% and achieves an F1 score of 73.51% on the CEMRs dataset.
Given the ubiquity of handwritten documents in interpersonal interactions, character recognition (CR) of documents has enormous practical utility. OCR allows many different types of images to be transformed into editable, searchable, and analysable data. Over the past ten years, researchers have been digitizing printed and handwritten texts using artificial intelligence (AI) and machine learning (ML) techniques. Memon et al. [18] surveyed character recognition research in order to make recommendations for future investigation. They adhered to a predetermined review protocol, searched commonly used electronic databases, and documented the study selection process in depth. In total, 176 articles were selected for this systematic literature review (SLR). In India, the bulk of the population speaks Hindi; however, most signboards are written in English. On a trip for business or pleasure, travellers become bewildered by the numerous English-language signboards. They can rely on mobile phones, which have grown in popularity in recent years, for assistance.
Arafat et al. [19] developed a mobile application that recognises the English text and symbols in a signboard image, translates them from English to Hindi, and shows the translated Hindi text on the phone's screen. The system uses an English-to-Hindi lexicon for translation, a pre-trained faster region-based CNN for object detection, and Tesseract OCR for text extraction. A piping and instrumentation diagram (P&ID) is a means of communicating decisions regarding the detailed design, procurement, construction, and commissioning of a plant.
Kim et al. [20] provide a solution to the piping and instrumentation diagram (P&ID) image text translation problem using DL technology. Pre-processing of P&ID images and storage of the recognition results are among the steps of the proposed methodology. The method addresses the identification of symbols of varying size and complexity in high-density images. When the trained model was tested on the dataset, it achieved precision and recall of 0.9718 and 0.9827 for symbols and 0.9386 and 0.9175 for text, respectively. The rapidly growing volume of financial tickets is putting financial accountants under increasing strain and consuming an unnecessary amount of personnel time.
Zhang et al. [21] propose an architecture for a financial ticket intelligent recognition system (FTIRS) that iteratively self-learns to address this problem. A functional financial accounting system must allow iterative updating and flexibility of the algorithm model, both of which are supported by this framework. To increase effectiveness and efficiency further, the authors developed an intelligent financial ticket data warehouse. The system can presently distinguish 482 types of financial tickets and has an autonomous iterative optimization process. As a result, the number of ticket types that the system can recognise, and its accuracy, will grow as the application continues to be used. The system's value in business has been established: it can greatly boost financial accounting efficiency while also lowering the cost of recruiting accounting staff. Besides Latin, other scripts such as Arabic, Chinese, and Hindi are written cursively. Urdu is one such script. Detecting and localizing individual Urdu ligatures in natural scene images is challenging.
Following the method of Arafat and Iqbal [22], Urdu ligatures are detected, their orientation is predicted, and they are recognised in outdoor images. For detection, a customised faster region-based CNN (FasterRCNN) was integrated with SqueezeNet, GoogLeNet, ResNet18, and ResNet50. To recognise ligatures, a two-stream deep neural network (TSDNN) was employed. The common learning environment (CLE) annotation text was used to produce five datasets containing 4.2K and 51K synthetic images with embedded Urdu text, which were used to evaluate a variety of tasks including ligature detection. Additionally, the TSDNN was tested on 1,094 real images containing more than 12K Urdu characters. All four detectors were evaluated and compared on their ability to locate Urdu text using average precision (AP). With an AP of 0.98, FasterRCNN based on ResNet50 features was the best detector.
Oliveira et al. [23] addressed the problem of vehicle identification across non-overlapping cameras. Their main contribution is the vehicle-rear dataset, a novel dataset for vehicle identification. Their two-stream CNN uses the car's exterior appearance and licence plate, two of the most distinctive and persistent attributes available. This work addresses a serious problem: false alarms caused by vehicles with identical designs or very similar plates [24]. In the first network stream, a Siamese CNN detects shape similarities by comparing two low-resolution vehicle patches captured by two cameras. In the second stream, a CNN processes two high-resolution licence plate patches. To reach a decision, a set of fully connected layers combines the features from both streams. OCR work representing the state of the art is listed in Table 1.

TELUGU LETTER DATASET
The Telugu language has a total of 645 letters, i.e., 18 Achus, 38 Hallus, 35 Othulu, 34×16 Guninthamulu, and 10 Ankelu, as shown in Figure 1. A dedicated dataset with 645 classes was designed for the Telugu language. Each category (Achus, Hallus, Othulu, and Guninthamulu) should contain a diverse range of examples, including different writing styles, sizes, fonts, and orientations. Consider

PARALLEL CONVOLUTIONAL NEURAL NETWORKS
Deep CNNs (DCNNs) are the current standard in computer vision (CV). CNNs consistently place first in object recognition competitions and have been used for various visual tasks, including pose estimation, segmentation, object detection and localization, and visual saliency [25]. Despite all the work put into creating complex convolutional structures, it remains unclear how distinct the best CNNs are from one another. CNNs are a framework from which numerous designs can be instantiated rather than a single design, which makes them less than ideal as a cognitive architecture. Rather than comparing CNNs by their recognition rates, we compare them by how similar the features used by their final classification layers to identify images are.

ResNet
In contrast, ResNet has a simpler, single-scale processing unit, many layers, and shortcut connections that allow data to skip across layers. Much of the CV work of the past five years can be viewed as a race between research teams to develop the most effective vision architecture within the deep network framework. Despite all this effort, it remains unclear how distinct the best architectures are from one another. ResNet was created at Microsoft in 2016 and subsequently improved upon [26]. More recent architectures, including PNASNet, have only marginally improved on its results.
Residual modules, which consist of two convolutional layers and a parameter-free shortcut connection that forwards the output of the previous layer unaltered, as shown in Figure 2, were introduced in 2015 by Ahmad et al. [27]. Deploying a 152-layer ResNet significantly improved the challenge's state-of-the-art performance and showed that adding layers consistently improves recognition accuracy; ResNet-v2 improved on this further in 2016 [28]. This straightforward structural element has been included successfully in numerous additional DCNNs, as listed in Table 2. Inception modules, by contrast, employ a concatenation of convolutional layers with various filter sizes and max pooling, all computed from the same input.
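The residual module described above (two convolutional layers plus a parameter-free shortcut) can be sketched as follows. This is a minimal illustration on 1-D signals in plain Python, assuming odd-length filters and omitting batch normalization and multiple channels; the helper names `conv1d_same`, `relu`, and `residual_block` are hypothetical and do not reproduce the layer configuration of the cited networks.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding so the output length equals len(x).

    Assumes an odd-length kernel.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(x))]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, kernel_a, kernel_b):
    """y = ReLU(F(x) + x), where F is conv -> ReLU -> conv and x is the shortcut."""
    residual = conv1d_same(relu(conv1d_same(x, kernel_a)), kernel_b)
    return relu([r + s for r, s in zip(residual, x)])

x = [1.0, 2.0, 3.0, 4.0]

# With all-zero filters F(x) = 0, so the block passes non-negative input through unchanged
identity_out = residual_block(x, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# With these toy filters F(x) = 0.5 * x, so the block outputs 1.5 * x
out = residual_block(x, [0.0, 1.0, 0.0], [0.0, 0.5, 0.0])
print(identity_out, out)
```

Because the shortcut carries the input forward unaltered, a block whose filters contribute nothing still behaves as an identity mapping, which is one common explanation for why very deep stacks of such blocks remain trainable.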

Inception
Inception is a family of CNNs that separates processing by scale, merges the results, and repeats. Inception's computational cost is also far lower than that of VGGNet or its more efficient successors. This has made it possible to use Inception networks, shown in Figure 3, in big-data scenarios where huge quantities of data must be processed at reasonable cost, or where memory capacity is inherently constrained, such as in mobile vision settings [30]. These concerns can be partially mitigated by using specialized methods that target memory utilization or by using computational heuristics to optimize the execution of specific operations. These techniques, however, increase complexity. Moreover, similar techniques could also be applied to optimize the Inception architecture, widening the efficiency gap again.
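The idea of separating processing by scale and merging the results can be sketched as a single multi-branch module. This is a simplified 1-D illustration in plain Python, assuming one filter per branch and omitting the 1×1 bottleneck convolutions used in real Inception networks; `conv1d_same`, `max_pool_same`, and `inception_module` are hypothetical helper names.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; assumes an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(x))]

def max_pool_same(x, size=3):
    """Stride-1 max pooling; windows are truncated at the edges so length is preserved."""
    half = size // 2
    return [max(x[max(0, i - half): i + half + 1]) for i in range(len(x))]

def inception_module(x, kernels):
    """Apply filters of several sizes plus max pooling to the SAME input.

    The branch outputs are concatenated along the channel axis, represented
    here as a list of channels.
    """
    branches = [conv1d_same(x, k) for k in kernels]
    branches.append(max_pool_same(x))
    return branches

x = [1.0, 3.0, 2.0, 5.0, 4.0]
# Toy filters of size 1, 3, and 5 (all-ones, chosen only for easy checking)
kernels = [[1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
channels = inception_module(x, kernels)

for channel in channels:
    print(channel)
```

Each branch sees the same input at a different receptive-field size, and the concatenated output keeps one channel per branch, so the following layer can weight fine and coarse responses against each other.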

OPTIMIZATION
Several gradient descent (GD) methods are available in the literature that can be used to systematically address the issue of local minima. This section describes the most frequently used approaches. Batch gradient descent (BGD) computes the error for each training sample, but the model is updated only after all the training samples have been evaluated. This entire cycle is called a training epoch [31]. BGD offers computational efficiency and stable convergence; its drawbacks are that it may converge to a local minimum and that each epoch requires access to the entire training dataset. In contrast to BGD, stochastic gradient descent (SGD) computes the error and updates the model for each training sample. In other words, it adjusts the parameters one sample at a time.
The main advantages of SGD are that it processes data sequentially, gives a detailed view of the rate of improvement, and is generally faster than BGD. Its drawbacks are that the frequent updates are computationally costly and introduce noise into the gradients, which hinders convergence. SGD has difficulty navigating ravines, which are regions where the surface slopes considerably more steeply in one dimension than in another and which are frequently found near local optima. GD is an optimization technique for finding a local minimum of a function [32]. In backpropagation, the weights are updated iteratively to reduce the error function. When training sets are large, the computation time of GD can increase dramatically, since each iteration requires computing the outputs, errors, and gradients for every sample. In neural networks, SGD is therefore almost always preferred over GD. Using SGD, the weights are updated as
$$w(\tau + 1) = w(\tau) + \Delta w(\tau),$$
where $\Delta w(\tau)$ is the weight update and $\tau$ is the GD iteration. The two most popular SGD variants are batch learning and online learning. In online learning, the weights are changed for every input $\alpha(n)$, i.e.,
$$\Delta w(\tau) = -\eta \frac{\partial E_n}{\partial w},$$
where $E_n$ is the error for input $\alpha(n)$ and $\eta$ is the learning rate, a hyperparameter of the network that controls how quickly the weights change. In batch learning, the weight gradients and the error are computed over a batch $A$ of $K$ inputs, and the averaged gradient is used:
$$\Delta w(\tau) = -\frac{\eta}{K} \sum_{k=1}^{K} \frac{\partial E_k}{\partial w}.$$
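The difference between online and batch learning described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a one-parameter linear model with squared error on toy data; the function names, the learning rate of 0.05, and the batch size of 2 are assumptions made for the example and are not the settings used in this work.

```python
def grad(w, x, y):
    """dE/dw for the per-sample error E = 0.5 * (w * x - y) ** 2."""
    return (w * x - y) * x

def online_epoch(w, data, eta):
    """Online learning: one weight update per input."""
    for x, y in data:
        w = w - eta * grad(w, x, y)
    return w

def batch_epoch(w, data, eta, batch_size):
    """Batch learning: one weight update per batch, using the averaged gradient."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        mean_grad = sum(grad(w, x, y) for x, y in batch) / len(batch)
        w = w - eta * mean_grad
    return w

# Toy data generated by y = 2 * x, so the optimal weight is 2 (assumed example)
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w_online = 0.0
w_batch = 0.0
for _ in range(200):
    w_online = online_epoch(w_online, data, eta=0.05)
    w_batch = batch_epoch(w_batch, data, eta=0.05, batch_size=2)

print(w_online, w_batch)
```

On this convex toy problem both variants reach the same weight. The practical difference is in the number and noisiness of the updates: online learning makes four noisy updates per epoch here, whereas batch learning makes two smoother ones.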
Becoming trapped in a local minimum far from the desired global minimum is a typical problem of gradient-based optimization techniques. Online learning helps to avoid such local minima because the stochastic error surface is noisy. Batch learning, by averaging the gradients, reduces this noise and is therefore more likely to become stuck in local minima. However, batch training is frequently preferred because it can be carried out efficiently on modern machines. Additionally, training techniques have been developed to avoid local minima, making online learning unnecessary. Hyperparameters were taken into account for SGD optimization [33].

IMPLEMENTATION
A parallel DCNN with an SGD optimizer (PDCNN-SGD) was used to recognize Telugu characters, as shown in Figure 4. Inception and ResNet both extract features from images using a convolutional architecture and then employ fully connected (FC) classifiers to assign labels to images based on the extracted features. The two systems differ in architecture and in the number of features extracted: ResNet produces 2,048 features per image, whereas Inception produces 1,536 [34].
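To make the feature dimensions concrete, the following sketch shows one way the outputs of the two parallel extractors could feed a single FC classifier over the 645 Telugu classes. It assumes that the two feature vectors are fused by concatenation, which the text does not state explicitly, and it uses random stand-in features and untrained weights; all names are hypothetical.

```python
import math
import random

RESNET_DIM, INCEPTION_DIM, NUM_CLASSES = 2048, 1536, 645

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fc_classifier(features, weights, biases):
    """Fully connected layer followed by softmax over the classes."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

rng = random.Random(0)

# Stand-ins for the per-image outputs of the two parallel feature extractors
resnet_features = [rng.random() for _ in range(RESNET_DIM)]
inception_features = [rng.random() for _ in range(INCEPTION_DIM)]

# Assumed fusion step: concatenate the two feature vectors
fused = resnet_features + inception_features

# Untrained, randomly initialised FC layer (for shape illustration only)
weights = [[rng.uniform(-0.01, 0.01) for _ in range(len(fused))]
           for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

probs = fc_classifier(fused, weights, biases)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
print(len(fused), len(probs), predicted_class)
```

Under the concatenation assumption the classifier sees 2,048 + 1,536 = 3,584 features per image. Other fusion schemes, such as averaging the class probabilities of two separate classifiers, would fit the description equally well.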
However, the features extracted by Inception and ResNet are extremely similar, in that an affine mapping of one predicts the other. This implies that, despite their structural differences, Inception and ResNet use essentially the same aspects of images. Although they may extract the same features, Inception seems to do so more robustly than ResNet. A further review of our results suggests that it might also extract a few additional properties. It is unexpected that affine transformations connect ResNet and Inception features. This affine relationship explains the two systems' identical performance and, in our opinion, has wider ramifications. It implies that the content of the training images, rather than the specifics of the Inception and ResNet architectures, drives the features that those systems extract. If this is the case, many complex CNNs should behave similarly. In our trials, we varied the number of epochs between 50 and 100, as shown in Figures 5 and 6. Combining ResNet and Inception, two prominent DL architectures, offers superior performance because of their complementary strengths. ResNet's ability to handle very deep networks and Inception's efficient multi-scale feature extraction make them an ideal pair. This ensemble reduces overfitting, enhances generalization, and increases accuracy, making it a powerful choice for various computer vision tasks, including image classification, object detection, and semantic segmentation. According to the experimental findings, the proposed method performs better and is appropriate for Telugu character recognition (TCR), as shown in Figure 7.
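The claim that an affine mapping of one feature set predicts the other can be checked with an ordinary least-squares fit. The sketch below is a plain-Python illustration on small synthetic features that are affinely related by construction; the dimensions (3 and 2 instead of 2,048 and 1,536), the data, and the helper names `solve`, `fit_affine`, and `apply_affine` are all assumptions made for the example.

```python
import random

def solve(a, b):
    """Solve a @ x = b (square a, matrix b) by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_affine(src, dst):
    """Least-squares fit of dst ~ src @ W + b; returns W stacked on top of b."""
    x = [row + [1.0] for row in src]  # append a constant column for the bias
    cols = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(cols)] for i in range(cols)]
    xty = [[sum(r[i] * d[k] for r, d in zip(x, dst)) for k in range(len(dst[0]))]
           for i in range(cols)]
    return solve(xtx, xty)

def apply_affine(src, params):
    """Map each source feature vector through the fitted affine transformation."""
    return [[sum(v * params[i][k] for i, v in enumerate(row + [1.0]))
             for k in range(len(params[0]))] for row in src]

rng = random.Random(0)
# Synthetic "network A" features and an affinely related "network B" (assumed data)
feats_a = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
true_w = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]]
true_b = [0.25, -1.0]
feats_b = [[sum(a[i] * true_w[i][k] for i in range(3)) + true_b[k] for k in range(2)]
           for a in feats_a]

params = fit_affine(feats_a, feats_b)
pred = apply_affine(feats_a, params)
max_err = max(abs(p - t) for pr, tr in zip(pred, feats_b) for p, t in zip(pr, tr))
print(max_err)
```

With real network features the fit would not be exact, and the residual error on held-out images would indicate how close the two feature spaces are to being affinely equivalent.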

CONCLUSION
The PDCNN-SGD model, designed for recognizing and classifying Telugu characters in handwritten text, is a complex DL architecture consisting of several stages of operations to achieve its objective effectively.The dataset comprises a total of 645 letters, including 18 Achus, 38 Hallus, 35 Othulu, 34×16 Guninthamulu, and 10 Ankelu.In the feature extraction stage, both ResNet and inception architectures are employed to extract rich and diverse feature representations from the input data.The SGD-based hyperparameter optimization process systematically tunes critical hyperparameters, such as learning rate, batch size, and weight decay, to find the optimal configuration for training the model, ensuring its readiness to achieve high accuracy in TCR tasks.

Figure 1. Telugu letters considered for the dataset

ISSN: 2089-4864
Telugu letters dataset and parallel deep convolutional neural network with a … (Josyula Siva Phaniram)
layers constantly improves recognition accuracy. It is possible to show that advances are still being made beyond 1,000 layers, which was previously impossible.

Figure 7. Accuracy of models with different DCNN approaches

Table 1. OCR as per state-of-art

Table 2. Layers in ResNet