Transcript

Robust Indonesian Digit Speech Recognition using Elman Recurrent Neural Network

Muhammad Fachrie
Faculty of Business and Information Technology, Universitas Teknologi Yogyakarta, Jombor, Ringroad Utara, Yogyakarta, Indonesia
[email protected]

Agus Harjoko
Computer Science and Electronics Major, Universitas Gadjah Mada, Sekip Utara, Bulaksumur, Yogyakarta, Indonesia
[email protected]

Abstract—Automatic Speech Recognition (ASR) has been a popular research topic for the past several years, with systems built to recognize speech in many languages such as English, Mandarin, Arabic, and Malay. Unfortunately, there are still few ASR studies conducted for Indonesian, especially on isolated digit recognition. This paper presents a simple implementation of ASR for the Indonesian language to recognize Indonesian spoken digits using an Elman Recurrent Neural Network (ERNN). The speech database consists of 1000 digit utterances collected from 20 native speakers. The system is an isolated word recognizer that was trained using 400 utterances, only two fifths of the whole data. From each utterance, 11 Mel Frequency Cepstral Coefficients (MFCC) combined with the natural logarithm of the Frame Energy (lnFE) were extracted as speech features and used as the input for the ERNN. The recognizer was tested in two modes, namely Multi Speaker mode and Speaker Independent mode, and it successfully achieved high accuracies of 99.30% and 95.17%, respectively.

Keywords—speech recognition; speech processing; signal processing; neural network; isolated speech recognition.

I. INTRODUCTION

Automatic Speech Recognition (ASR) is currently one of the most popular research areas in computer science. It recognizes human speech and translates it into text, or even into another spoken language, thus enabling speech communication between humans who use different languages.

Research on ASR has its own difficulties, since we work with speech signals that are sensitive to noise (sounds other than the human speech) and that vary in pronunciation between speakers due to speaking style, the sex of the speaker, the anatomy of the vocal tract, the speed of speech, and the effect of social dialects. Even for the same word, the lengths of the speech signals can differ strongly [1]. Hence, a speech recognizer should be robust to achieve good performance, yet efficient enough to keep the processing time low.

Researchers have developed various ASR systems for many languages, such as English, Arabic, Chinese, Malay, and Indonesian. However, compared to other languages, research on Indonesian speech recognition has fewer publications, especially on isolated digit recognition. Some recent papers that discussed isolated digit recognition for the Indonesian language did not give satisfying results, as in [2] and [3]. Therefore, in this research we propose a robust ASR system for recognizing Indonesian isolated digits that achieves a recognition rate above 95%.

II. RESEARCH METHOD

The research was conducted by first collecting a speech database from several speakers, which was used for training and testing the system. Several digital signal processing steps were applied to the speech database in order to extract Mel Frequency Cepstral Coefficients (MFCC). MFCC is currently regarded as one of the best speech features since its representation contains more perceptually relevant information than other features [4], [5]. The natural logarithm of the Frame Energy (lnFE) was also used in this research as an additional feature to improve the recognition accuracy [8]. These features were used as the input for the speech classifier. Following [6], an Elman Recurrent Neural Network (ERNN) was used in this research due to its robustness in recognizing isolated digit utterances. A simple time alignment method was also used, based on Alotaibi's work in [7]. The whole system built in this research is shown in Figure 1.
Figure 1. The overall design of the ASR system used in this research (speech signal -> speech preprocessing module: amplitude normalization, signal pre-emphasis, frame blocking, time alignment, signal windowing, Fourier transform, MFCC calculation -> MFCC data -> speech recognition -> recognized digit).

The speech database was collected from 20 native speakers, consisting of 10 males and 10 females, who were asked to utter the 10 Indonesian digits (0-9) 5 times each. Thus, there are 1000 speech data in the database. The speech was recorded with a sampling frequency of 16 kHz in a quiet room environment to minimize noise. Each spoken digit was saved in a separate file in '.wav' format. We used recording software (i.e., Audacity) to record, edit (separate each spoken digit), and store the speech data into individual files, where each file contains one spoken digit only.

A. Speech Preprocessing

The speech signal cannot be used directly to build the ASR system due to its variability in length and energy (amplitude). Speech features have to be extracted from the speech signal so that it can be well recognized by the classifier (i.e., the ERNN). To obtain the speech features from each spoken digit, sequential digital signal processing steps have to be conducted, consisting of: 1) amplitude normalization, 2) signal pre-emphasis, 3) frame blocking, 4) time alignment, 5) signal windowing, 6) Discrete Fourier Transform, and 7) MFCC calculation.

The amplitude of the speech signal produced by each speaker differs from speaker to speaker, even for the same words. This is caused by differences in voice loudness and in the distance between the speaker's mouth and the microphone. Hence, the amplitude of every speech signal should be normalized into the range [0,1] using (1).

\hat{s}(n) = \frac{s(n) - \min_{i} s(i)}{\max_{i} s(i) - \min_{i} s(i)}    (1)

where s is the original signal, n is the index of the signal sample, and N is the length of the signal (the number of signal samples), with 0 <= n <= N-1.
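As an illustration of this normalization step, a minimal Python/NumPy sketch is given below. The function name, the min-max form of the scaling, and the commented file-loading lines are our own assumptions for illustration; they are not code from the original work.

    import numpy as np

    def normalize_amplitude(s):
        """Scale a speech signal into the range [0, 1] as in (1) (min-max scaling assumed)."""
        s = np.asarray(s, dtype=float)
        s_min, s_max = s.min(), s.max()
        if s_max == s_min:            # guard against a constant (silent) signal
            return np.zeros_like(s)
        return (s - s_min) / (s_max - s_min)

    # Example usage on one recorded digit utterance (16 kHz mono .wav assumed):
    # from scipy.io import wavfile
    # fs, s = wavfile.read("digit_0_speaker01.wav")   # hypothetical file name
    # s_norm = normalize_amplitude(s)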
The speech signal might contain background noise that can affect the recognition accuracy; hence, pre-emphasis is applied to reduce this noise using (2).

\tilde{s}(n) = \hat{s}(n) - \alpha\,\hat{s}(n-1)    (2)

where \hat{s} is the normalized signal and \alpha is the pre-emphasis coefficient.

Human speech is a non-stationary signal that changes over time, whereas speech processing requires a (locally) stationary signal to obtain good MFCC features [5]. Therefore, frame blocking was used to separate a long speech signal into several frames, each containing a number of speech samples. Overlapping between frames is usually needed to avoid the loss of data between frames. In this research, we used a frame size of 32 milliseconds with an overlap of 16 milliseconds, based on its good performance in [8]. Since we used a 16 kHz sampling frequency during the speech recording, there are 16000 signal samples in one second (1000 milliseconds); hence a single frame contains 512 samples (obtained from 16000 x 0.032 second). Figure 2 illustrates the frame blocking process.

Figure 2. Illustration of the frame blocking process on a speech signal of length N, where the size of a single frame is A milliseconds with an overlap of B milliseconds between frames.

It is to be expected that the speech signals for a word uttered by one speaker, or by several speakers, will have different lengths. This can affect the system's accuracy. Hence, a time alignment process is needed to ensure that all investigated signals have the same length. We used a simple but robust time alignment method which was previously introduced by Alotaibi in [7]. The method simply takes a proportional subset of the frames to be investigated for MFCC extraction. For example, if a speech signal contains 15 frames, we first take the first and the last frames of the signal, and then proportionally take frames between them until we reach the number of frames (e.g., 5 frames) to be investigated, as illustrated in Figure 3.

Figure 3. Illustration of the time alignment method proposed by Alotaibi in [7] which was used in this research (e.g., the 1st, 4th, 8th, 12th, and 15th frames are selected from a 15-frame signal).

We used three different numbers of investigated frames for each speech signal, namely 13, 15, and 20; each of them was evaluated to find the number of frames that achieves the highest accuracy.

Every frame produced by the frame blocking process contains a discontinuous signal because of the forced segmentation. Windowing is needed to minimize the signal discontinuities at the beginning and end of each frame [9]. A common window function used in ASR is the Hamming window, given by (3).

w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1    (3)

where n is the index of the signal sample and N is the length of the signal (number of samples) in each frame. A windowed signal is obtained by multiplying each window value w(n) with the corresponding sample in each frame.

B. Feature Extraction

Mel Frequency Cepstral Coefficients (MFCC) are frequency-based features. Therefore, the Discrete Fourier Transform (DFT) was used to convert the time-domain signal of each frame into the frequency domain in order to obtain information about the frequencies present in the signal. The DFT formula is given by (4).

F(k) = \sum_{n=0}^{N-1} y'(n)\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1    (4)

where F is the spectrum value produced by the DFT, k is the index of the signal sample in the frequency domain, n is the index of the signal sample in the time domain, y'(n) is the windowed signal, and j is the imaginary unit. The result of the DFT is complex, hence it must be converted into real values (its magnitude). We used 256 DFT points in this research, as many researchers have done [4], [5], [8].

To obtain the MFCC features, several calculations have to be conducted. A filterbank is created first. Based on [4] and [6], the filter cut-off frequencies used in the filterbank were 100 Hz and 4800 Hz, with 20 triangular band-pass filters. The filterbank was created using (5) and (6).

M(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)    (5)

M^{-1}(m) = 700\left(10^{m/2595} - 1\right)    (6)

where M is the mel scale representing each band-pass filter's border from the lower to the upper frequency (100 Hz - 4800 Hz), f is the frequency in Hz within the filter cut-off frequencies, M^{-1} is the inverse of M, and m is the mel-scale value corresponding to M. By using (5) and (6), we obtained the band-pass filter borders shown in Figure 4.

Figure 4. Filterbank with 20 triangular band-pass filters with 256 points of DFT.
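To make the filterbank construction concrete, the sketch below builds 20 triangular band-pass filters between 100 Hz and 4800 Hz on a 256-point DFT grid, following (5) and (6). It is a minimal illustration under our own assumptions (function names, the use of the one-sided DFT grid, and the bin-placement rule), not the authors' code.

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)          # M(f), eq. (5)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)        # inverse mel scale, eq. (6)

    def triangular_filterbank(n_filters=20, nfft=256, fs=16000, f_low=100.0, f_high=4800.0):
        """Weights H[i, k] of triangular band-pass filters on the one-sided DFT grid."""
        # Filter border frequencies are spaced uniformly on the mel scale.
        mel_borders = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
        hz_borders = mel_to_hz(mel_borders)
        bins = np.floor((nfft + 1) * hz_borders / fs).astype(int)   # DFT bin of each border

        H = np.zeros((n_filters, nfft // 2 + 1))
        for i in range(n_filters):
            left, center, right = bins[i], bins[i + 1], bins[i + 2]
            for k in range(left, center):                  # rising edge of triangle i
                H[i, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):                 # falling edge of triangle i
                H[i, k] = (right - k) / max(right - center, 1)
        return H

    H = triangular_filterbank()    # shape (20, 129): 20 filters over the one-sided spectrum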
The MFCC values are obtained by summing the products of the DFT spectrum with each filterbank weight, as given by (7).

E_i = \log\!\left(\sum_{k} |F(k)|\, H_i(k)\right), \quad i = 1, 2, \ldots, 20    (7)

where E is the log of the sum of the products of the DFT spectrum with the filterbank weights, F is the value of the DFT samples, k is the index of the DFT samples, H is the filterbank weight, and i is the index of the filterbanks (1-20). The number of MFCC coefficients used as features is smaller than the number of filterbanks [8]; based on [6], we used the first 11 MFCC coefficients. Afterwards, the values of E are processed using the Discrete Cosine Transform (DCT), calculated using (8).

C_d = \sum_{k=1}^{K} E_k \cos\!\left[\frac{\pi d}{K}\left(k - \frac{1}{2}\right)\right], \quad d = 1, 2, \ldots, D    (8)

where d is the index of the MFCC coefficients, D is the number of MFCC coefficients taken (i.e., 11), k is the index of the filterbanks, and K is the number of filterbanks (i.e., 20). It should be noted that this MFCC calculation is conducted for each speech frame. So, if there are 10 frames and 11 MFCC coefficients are taken from each frame, a single speech signal yields 110 MFCC coefficients.

An additional feature, namely the natural logarithm of the frame energy (lnFE), was also included to improve the system accuracy [8]. It was also extracted from each frame, using (9).

\mathrm{lnFE} = \ln\!\left(\sum_{n=0}^{N-1} \hat{s}(n)^{2}\right)    (9)

where \hat{s}(n) is a sample of the normalized signal, n is the index of the sample, and N is the number of samples of the signal in a single frame.

C. Speech Classifier

The Elman Recurrent Neural Network (ERNN) is a neural network architecture that can be considered "partially recurrent", since feed-forward connections dominate the network and only a small part contains recurrent connections that receive feedback from the previous step [10]. The architecture of an ERNN with one hidden layer is presented in Figure 5.

Figure 5. Architecture of the Elman Recurrent Neural Network (ERNN) with one hidden layer.

In this research, six ERNN architectures were used, depending on the number of inputs, which corresponds to the number of features used. The MFCC coefficients produced from each frame were used as the input of the ERNN, so the number of frames determines the total number of MFCC coefficients. The number of neurons in the hidden layer was obtained from a separate trial-and-error observation to find the best amount. The output layer has 10 neurons in order to classify the 10 digits. The backpropagation method was used to train the ERNN. The details of the ERNN architectures are given in Table 1.

Table 1. Details of the ERNN architectures used in this research

Number of investigated frames      13          15          20
Number of MFCC coefficients        143         165         220
Number of lnFE coefficients        13          15          20
ERNN architecture (MFCC only)      143-50-10   165-65-10   220-87-10
ERNN architecture (MFCC + lnFE)    156-50-10   180-65-10   240-87-10

To evaluate the effect of the lnFE feature, the accuracies achieved by the system using MFCC + lnFE were compared to those of the system using MFCC only.
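As a concrete picture of the classifier described in section C, the sketch below performs a single forward pass of an Elman network with the 240-87-10 architecture of Table 1 (20 frames x (11 MFCC + 1 lnFE) inputs, 87 hidden/context units, 10 output digits). The sigmoid activation, the random weight initialization, the omission of bias terms, and the names are illustrative assumptions; backpropagation training is not shown.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class ElmanNetwork:
        """Minimal Elman RNN: the hidden layer receives the input plus its own previous activation."""
        def __init__(self, n_in=240, n_hidden=87, n_out=10, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))       # input -> hidden
            self.W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
            self.W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output
            self.context = np.zeros(n_hidden)                              # previous hidden state

        def forward(self, x):
            h = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
            self.context = h                       # hidden activation copied back to context units
            return sigmoid(self.W_out @ h)         # one score per digit 0-9

    net = ElmanNetwork()
    features = np.random.rand(240)                 # placeholder for 20 frames x (11 MFCC + 1 lnFE)
    scores = net.forward(features)
    predicted_digit = int(np.argmax(scores))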
D. Training and Testing Procedure

The system was trained and tested in two modes, namely Multi Speaker (MS) mode and Speaker Independent (SI) mode. In MS mode, 400 data were used as training data, obtained from the first and second repetitions of each digit uttered by all speakers (10 digits x 2 repetitions x 20 speakers). Then, all the data (i.e., 1000 data) were used in the testing phase; thus, the training data in MS mode are a subset of the testing data. In SI mode, 400 data were also used as training data, obtained from all repetitions of the digits uttered by 8 out of the 20 speakers, consisting of 4 males and 4 females (10 digits x 5 repetitions x 8 speakers). The remaining data (i.e., 600 data) were used in the testing phase, which means that the data used in the training phase are different from those used in the testing phase. In both modes, the training data amount to only two fifths of the whole data.

III. RESULTS AND ANALYSIS

The isolated digit speech recognition system built in this research performed very well, achieving accuracies above 90% in both testing modes. Table 2 shows the detailed results for Multi Speaker mode.

Table 2. Testing results for Multi Speaker mode

Number of frames               13        15        20
ERNN accuracy (MFCC only)      99.00%    99.10%    99.10%
ERNN accuracy (MFCC + lnFE)    99.20%    99.20%    99.30%

Table 2 shows that better results are obtained by the systems that use more speech frames, i.e., 15 and 20 frames, which both reach 99.10% accuracy with MFCC only. The additional feature, lnFE, improves the system accuracy by 0.1%-0.2%. From this first result, we see that the system achieved a very high accuracy of 99.30%, which means that there are only 7 errors out of 1000 data. This was achieved by the system using 20 speech frames with the combination of MFCC and lnFE as the features.

The confusion matrix for the MS mode result is given in Table 3. Errors occurred only for four digits: 2, 4, 5, and 6, with the worst accuracy, 97%, obtained for digit 5. On the other hand, digits 0, 1, 3, 7, 8, and 9 reached a 100% recognition rate.

Table 3. Confusion matrix of the Multi Speaker mode result.

The result of the Speaker Independent mode testing is presented in Table 4, where the highest recognition rate was also achieved by the system using 20 speech frames with the combination of MFCC and lnFE features.

Table 4. Testing results for Speaker Independent mode

Number of frames               13        15        20
ERNN accuracy (MFCC only)      93.00%    94.10%    94.83%
ERNN accuracy (MFCC + lnFE)    94.17%    94.33%    95.17%

The results in Tables 2 and 4 show that the highest accuracy is always achieved by the system using more speech frames with the combination of MFCC and lnFE as the features. The MS mode always achieved better results than the SI mode because all speakers were involved in the training data, so the system could recognize the speech style of every speaker very well. Table 5 shows the confusion matrix of the SI mode result, where the highest recognition rate is achieved by digit 4 (100%) and the worst accuracy occurs for digits 0 and 5, with only 90%. The total accuracy in SI mode is 95.17%, which means that the system failed to recognize 29 digits out of the 600 testing data. Nevertheless, this result can be regarded as a good achievement considering that the training data amount to only two fifths of the whole data, less than the amount of testing data.

Table 5. Confusion matrix of the Speaker Independent mode result.

Several digits were misclassified as other digits. Digit '5' has the most misclassifications across both testing modes, with a total of 9 errors: three errors in MS mode and six errors in SI mode. It was mostly misclassified as digit '3', as shown in Table 5. Table 6 gives the list of misclassified digits.

Table 6. List of misclassified digits

Digit    Confused with (Multi Speaker)    Confused with (Speaker Independent)
0        -                                2, 6
1        -                                8
2        6                                5
3        -                                5, 7
4        0                                -
5        3, 6, 9                          2, 3, 4, 7
6        2, 8                             0
7        -                                6, 8
8        -                                6
9        -                                5, 6, 7
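For completeness, the short sketch below shows how confusion matrices and per-digit accuracies such as those summarized in Tables 3, 5, and 6 can be computed from predicted and true labels. The array names and the randomly generated placeholder labels are our own illustrative assumptions, not data from this research.

    import numpy as np

    def confusion_matrix(true_digits, predicted_digits, n_classes=10):
        """cm[i, j] counts utterances of digit i recognized as digit j."""
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(true_digits, predicted_digits):
            cm[t, p] += 1
        return cm

    # Placeholder labels; in the experiments these would come from the ERNN outputs.
    true_digits = np.random.randint(0, 10, size=600)
    predicted_digits = true_digits.copy()
    cm = confusion_matrix(true_digits, predicted_digits)
    per_digit_accuracy = cm.diagonal() / cm.sum(axis=1)   # recognition rate of each digit
    overall_accuracy = cm.trace() / cm.sum()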
The misclassification of several digits is caused by the similarity of their features to those of another digit. This can be analyzed by comparing the spectrograms of the digits, as given in Figure 6, or by comparing their MFCC features. For example, an utterance of digit '3' that is misclassified as digit '5' has a similar spectrogram form; the same occurs for digit '0' with '6' and for digit '9' with '6'. On the other hand, if we compare digit '4', which obtained the highest recognition rate in both testing modes (99% in MS mode and 100% in SI mode), to the other digits, we find that its spectrogram is almost completely different from the others, as shown in Figure 7. Based on our observation, misclassified utterances of a digit tend to come from the same speaker.

Figure 6. Pairs of digit utterances that have similar spectrograms: digit '0' and '6' (top), digit '3' and '5' (middle), and digit '9' and '6' (bottom). These similarities cause ambiguity between the paired utterances.

Figure 7. Comparison between the spectrogram of digit '4' and those of the other nine digits from one speaker. The spectrogram of digit '4' is highly dissimilar to all the others.

In addition, the system built in this research compares favorably with several previous works, even the more complex systems in [11], [12], and [13], as shown in Table 7. Since several of those papers did not mention whether they used MS or SI mode, we only compare our Speaker Independent mode accuracy.

Table 7. Comparison of our system's performance to several previous researches

Parameter                 This paper    Ref. [11]    Ref. [12]    Ref. [13]
Accuracy                  95.17%        95%          85.24%       99.60%
Classifier                ERNN          GA+MLP       ANFIS        GA+FVQ+DHMM
Number of training data   400           300          420          500
Number of testing data    600           100          210          500
Features                  MFCC+lnFE     MFCC         MFCC         MFCC
Utterances                10 digits     10 digits    10 digits    10 digits
Language                  Indonesian    Chinese      Malay        Mandarin

The comparison presented in Table 7 shows that our system performs at a high level, even when compared to the more complex systems in [11] and [12], which used GA+MLP and ANFIS, respectively. Our system's accuracy is lower only than that of [13], which combined three methods, i.e., GA+FVQ+DHMM. It should be noted that we used only two fifths of the whole data as training data, whereas the other references used a larger portion of training data. However, our system is simpler than the systems described in [11], [12], and [13]. The parameters that we used in this research are adequate to reach the system's high accuracy. Moreover, using a larger portion of training data or more MFCC coefficients may give better results, but the efficiency of the system should be considered besides its effectiveness.
IV. CONCLUSION

This research has shown that the Elman Recurrent Neural Network (ERNN) is a robust isolated digit speech recognizer. The highest accuracy obtained is 99.30% for Multi Speaker mode and 95.17% for Speaker Independent mode. The ERNN was trained using only 400 training data, which is only two fifths of the whole data. The best ERNN architecture, which achieved the highest accuracy in each testing mode, is 240-87-10, using MFCC and lnFE features taken from 20 speech frames. Overall, the system built in this research is simple enough yet obtains good performance compared to previous researches.

REFERENCES

[1] M. Forsberg, Why is Speech Recognition Difficult?. Dept. of Computer Science, Chalmers University of Technology, 2003.
[2] N. R. Emilia and Suyanto, "Isolated word recognition using ergodic hidden Markov models and genetic algorithm," Telkomnika, vol. 10, pp. 129-136, March 2012.
[3] I. N. Dewi, F. Firdausillah, and C. Supriyanto, "Sphinx-4 Indonesian isolated digit speech recognition," Journal of Theoretical and Applied Information Technology, vol. 53, pp. 40-44, July 2013.
[4] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.
[5] N. A. Meseguer, Speech Analysis for Automatic Speech Recognition. Master Thesis, Trondheim: NTNU, 2009.
[6] Alotaibi, "Spoken Arabic digits recognizer using recurrent neural network," Proceedings of the Fourth IEEE International Symposium on Signal Processing and Information Technology, pp. 195-199, 2004.
[7] Alotaibi, "A simple time alignment algorithm for spoken Arabic digit recognition," Journal of King Abdulaziz University: Engineering Sciences, vol. 20, pp. 29-43, 2009.
[8] Z. Fang, Z. Guoliang, and S. Zhanjiang, "Comparison of different implementations of MFCC," Journal of Computer Science and Technology, vol. 16, pp. 582-589, November 2001.
[9] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. New Jersey: Prentice-Hall International, Inc., 1993.
[10] L. V. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. New York: Pearson, 1993.
[11] M. Lan, S. Pan, and C. Lai, "Using genetic algorithm to improve the performance of speech recognition based on artificial neural network," Proceedings of the First International Conference on Innovative Computing, Information and Control, Beijing, vol. 2, pp. 527-530, 2006.
[12] R. Sabah and R. N. Ainon, "Isolated digit speech recognition in Malay language using neuro-fuzzy approach," Proceedings of the 3rd Asia International Conference on Modelling & Simulation, Bali, pp. 336-340, 2009.
[13] S. Pan, C. Chen, and Y. Lee, "Genetic algorithm on fuzzy codebook training for speech recognition," Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xi'an, pp. 1552-1558, 2012.