|Year : 2015 | Volume : 1
| Issue : 2 | Page : 119-123
Forensic Automatic Speaker Recognition Based on Likelihood Ratio Using Acoustic-phonetic Features Measured Automatically
Huapeng Wang1, Cuiling Zhang2
1 Department of Forensic Science and Technology, National Police University of China, Shenyang, Liaoning, China
2 School of Criminal Investigation, Southwest University of Political Science and Law, No. 301, Baosheng Ave, Yubei District, Chongqing, China
|Date of Web Publication||27-Nov-2015|
Department of Forensic Science and Technology, National Police University of China, Shenyang, Liaoning - 110 854
Source of Support: None, Conflict of Interest: None
Forensic speaker recognition is experiencing a remarkable paradigm shift in terms of the evaluation framework and presentation of voice evidence. This paper proposes a new method of forensic automatic speaker recognition using the likelihood ratio framework to quantify the strength of voice evidence. The proposed method uses a reference database to calculate the within- and between-speaker variability. Some acoustic-phonetic features are extracted automatically using the software VoiceSauce. The effectiveness of the approach was tested using two Mandarin databases: a mobile telephone database and a landline database. The experimental results indicate that these acoustic-phonetic features have some discriminating potential and are worth using in speaker discrimination. The automatic acoustic-phonetic features have acceptable discriminative performance and can provide more reliable results in evidence analysis when fused with other kinds of voice features.
Keywords: Acoustic-phonetic speaker recognition, evidence evaluation, forensic speaker recognition, likelihood ratio
|How to cite this article:|
Wang H, Zhang C. Forensic Automatic Speaker Recognition Based on Likelihood Ratio Using Acoustic-phonetic Features Measured Automatically. J Forensic Sci Med 2015;1:119-23
|How to cite this URL:|
Wang H, Zhang C. Forensic Automatic Speaker Recognition Based on Likelihood Ratio Using Acoustic-phonetic Features Measured Automatically. J Forensic Sci Med [serial online] 2015 [cited 2018 Aug 21];1:119-23. Available from: http://www.jfsmonline.com/text.asp?2015/1/2/119/169617
| Introduction|| |
Today, forensic speaker recognition (FSR), first proposed by Champod and Meuwly, is experiencing a remarkable paradigm shift in the evaluation and presentation of voice evidence, which rests on the comparison of quantifiable properties between recordings of known and questioned origin. The likelihood ratio (LR) framework, based on Bayes' theorem, has been applied to both automatic speaker recognition and acoustic-phonetic speaker recognition. In the past, features in acoustic-phonetic FSR have been extracted manually. We propose a new method in which some automatically extracted acoustic-phonetic features are used in the framework of forensic automatic speaker recognition (FASR). In some countries, such as Spain and Australia, forensic voice evidence based on the LR framework has been accepted in court.
As currently practiced in FSR, the acoustic-phonetic approach and the automatic approach within the LR framework differ in several ways. The main difference is in the information extracted from the recordings. The acoustic-phonetic approach is essentially local in both the time and frequency domains: traditional acoustic-phonetic features, such as formant frequencies and fundamental frequency, are closely associated with linguistic units and are extracted from linguistically comparable items. In contrast, in FASR, the features used are mostly calculated automatically by computer systems, for example mel-frequency cepstral coefficients (MFCC), linear predictive coefficients, and linear predictive cepstral coefficients (LPCC). In addition to advantages such as robustness to channel and noise, traditional acoustic-phonetic features have much greater interpretability (a transparent acoustic meaning), which helps in explaining and justifying the methodology in court. Automatic features, on the other hand, are much more powerful as evidence: they will, on average, yield LRs that deviate much more from unity, but they lack the transparent physical meaning of acoustic-phonetic features. It is therefore difficult to explain these features and their meaning to lawyers and courts. Although acoustic-phonetic features have many advantages, they are usually extracted manually and cannot be calculated accurately by a computer, which makes them inconvenient to measure and apply. Forensic speech scientists first have to identify linguistically comparable units in the two recordings being compared and then measure the acoustic-phonetic features, which requires a great deal of manual effort. Worst of all, exact repeatability cannot be guaranteed because of the manual operations.
Recent advances in the automatic extraction of acoustic-phonetic features may combine the advantages of both methods, but they have yet to be fully evaluated. This paper investigates the possibility of an FASR system using acoustic-phonetic features automatically extracted with the software VoiceSauce.
| Methods|| |
VoiceSauce is an application implemented in Matlab that provides automatic voice measurements over audio recordings for all the files in one folder. VoiceSauce calculates many voice features, including those using corrections for formant frequencies and bandwidths. One of the critical measurements made by VoiceSauce is fundamental frequency (F0). VoiceSauce can measure F0 using any of three different programs: the STRAIGHT algorithm is used by default to estimate F0 at 1 ms intervals, while the Snack Sound Toolkit and Praat can also be used to estimate F0 at variable intervals, offering a choice of autocorrelation and cross-correlation algorithms. All the F0 detection algorithms rely on user specifications for maximum and minimum F0 to constrain their estimates. When Praat is used to estimate formants (or F0), the user can set all the parameters. Cepstral peak prominence (CPP) calculations are based on the algorithm described by Hillenbrand et al. A variable window length equal to five pitch periods is used for the calculations. CPP is used because it has been shown to be correlated with the degree of breathiness in a voice. The Snack Sound Toolkit is used to find the frequencies and bandwidths of the first four formants, using as defaults the covariance method, a pre-emphasis of 0.96, a window length of 25 ms, and a frame shift of 1 ms (to match STRAIGHT).
In the LR framework, likelihoods are explicitly represented by probability density models. The Gaussian mixture model (GMM) can be used to model both within- and between-speaker variability. Mathematically, H0, the hypothesis that the questioned voice and the suspect's voice come from the same speaker, is represented by a model that characterizes H0 in the feature space. H1, the alternative hypothesis that the questioned voice and the suspect's voice come from different speakers, is represented by a second model. The LR is given by Equation (1), which expresses the probability of obtaining the observed difference between two speech samples under the hypothesis that the samples were produced by the same speaker (H0) relative to the probability under the hypothesis that they were produced by different speakers (H1). In Equation (1), X represents the feature vectors.
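In the standard formulation, and consistent with the definitions of H0, H1, and X given above, Equation (1) can be written as follows. The symbols \lambda_{H_0} and \lambda_{H_1} for the two hypothesis models are notation assumed here for illustration.

```latex
\begin{equation}
\mathrm{LR} = \frac{p(X \mid H_0)}{p(X \mid H_1)}
            = \frac{p(X \mid \lambda_{H_0})}{p(X \mid \lambda_{H_1})}
\end{equation}
```

An LR greater than 1 supports the same-speaker hypothesis, and an LR smaller than 1 supports the different-speaker hypothesis.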
In the proposed FASR system, two kinds of GMM have to be established. One model (the reference background population model, or RBM) is trained with features extracted from a reference background population database to represent the distribution of voice features in the general population. The other model is derived from the RBM by maximum a posteriori (MAP) adaptation, using the features extracted from the suspect's recordings or the questioned recordings, and represents the distribution of the characteristics of the corresponding speaker.
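As a rough illustration of deriving a speaker model from a background model by MAP adaptation, the sketch below adapts only the component means of a one-dimensional GMM, which is a common variant of the adapted-GMM approach. The function names, the relevance factor of 16, and the toy numbers are assumptions for illustration, not the configuration used in this study.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D normal distribution with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D GMM (hypothetical helper)."""
    k = len(means)
    n = [0.0] * k  # soft counts per component
    s = [0.0] * k  # posterior-weighted sums of the data
    for x in data:
        likes = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for i in range(k):
            post = likes[i] / total
            n[i] += post
            s[i] += post * x
    adapted = []
    for i in range(k):
        alpha = n[i] / (n[i] + relevance)  # data-dependent adaptation weight
        data_mean = s[i] / n[i] if n[i] > 0 else means[i]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[i])
    return adapted

# Toy background model with two components and 32 made-up frames near 120.
background_means = [100.0, 200.0]
speaker_means = map_adapt_means([120.0] * 32, [0.5, 0.5],
                                background_means, [400.0, 400.0])
```

The component that explains the speaker's data moves toward it, while the component with almost no data stays close to the background model. This is why adaptation can work with limited speech material.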
The forensic voice evidence consists of the degree of similarity between the speaker-dependent features extracted from the questioned recording and those extracted from the suspect's recordings, which are represented by his/her GMM. The likelihood values are estimated by comparing the features of the questioned recording with the suspect's model (assuming the suspect's recordings are used to train the GMM). In the evaluation of forensic evidence within the LR framework, we must take into account not only the similarity but also the typicality of the selected features.
The flow chart in [Figure 1] illustrates the structure of the proposed FASR system. The approach proposed in this paper needs two databases for the calculation and interpretation of the evidence. One is the reference background population database, which includes 50 speakers whose background information (sex, age, career, accent, and transmission channel) is similar to that of the questioned recording; its function is to train the RBM. In this paper we ignore the effect of channel-mismatched conditions (where the suspect's recordings and the questioned recording are recorded over different channels) and focus on same-channel recordings. The other database is the reference population database, which includes 20 speakers who have the same background information as the speakers in the reference background population. The reference population database can be combined into 20 target (same-speaker) pairs and 190 nontarget (different-speaker) pairs. These pairs are compared in the GMM-universal background model (GMM-UBM) system, which produces the target and nontarget scores. The target scores are used to train a within-speaker model, which represents the within-speaker variability of the selected features, while the nontarget scores are used to train a between-speaker model, which represents the between-speaker variability of the selected features. Thus, the within-speaker model is approximated by the target score model of the 20 reference database speakers, while the between-speaker model is approximated by the cross-validated nontarget scores of the 20 reference database speakers. In previous studies, the suspect's within-speaker model was trained on the suspect's own speech material, but in forensic scenarios there is sometimes not enough of the suspect's speech to train his/her within-speaker model.
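The pair counts above follow from the number of speakers: each of the 20 speakers yields one same-speaker comparison, and every unordered pair of distinct speakers yields one different-speaker comparison. A minimal check, with speaker labels invented for illustration:

```python
from itertools import combinations

# Hypothetical labels for the 20 reference population speakers.
speakers = [f"spk{i:02d}" for i in range(1, 21)]

# One same-speaker (target) comparison per speaker.
target_pairs = [(s, s) for s in speakers]

# One different-speaker (nontarget) comparison per unordered pair of speakers.
nontarget_pairs = list(combinations(speakers, 2))

print(len(target_pairs), len(nontarget_pairs))
```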
The proposed system seems to work well as long as the questioned recording or the suspect's recording(s) meet the demands of the adaptive GMM algorithm, so that the corresponding model can be derived. As shown in [Figure 1], the suspect's model and the reference population models are derived from the RBM using the adaptive GMM algorithm. The comparative analysis part and the cross-comparative analysis part in [Figure 1] use the basic GMM-UBM speaker recognition system. The evidence (score) is derived from the comparison between the questioned recording and the suspect's recordings. Finally, the evidence is evaluated with the within-speaker model to obtain the numerator of the LR and with the between-speaker model to obtain the denominator of the LR. The within- and between-speaker models are both set as normal distributions, which means that only one mixture component is used in GMM training with the expectation maximization algorithm. In the future, we will test other distributions; at present, the normal distribution gives good discriminative performance.
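The final step described above, evaluating a score under a normal within-speaker model and a normal between-speaker model, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and all scores are invented, and the maximum-likelihood standard deviation (`pstdev`) is used because that is what a single-component fit by expectation maximization converges to.

```python
import math
from statistics import mean, pstdev

def fit_normal(scores):
    """Maximum-likelihood mean and standard deviation of a list of scores."""
    return mean(scores), pstdev(scores)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def score_to_lr(score, target_scores, nontarget_scores):
    """LR = within-speaker density / between-speaker density at the score."""
    mu_w, sd_w = fit_normal(target_scores)      # within-speaker model
    mu_b, sd_b = fit_normal(nontarget_scores)   # between-speaker model
    return normal_pdf(score, mu_w, sd_w) / normal_pdf(score, mu_b, sd_b)

# Made-up comparison scores, for illustration only.
target_scores = [1.8, 2.1, 2.4, 1.9, 2.3]
nontarget_scores = [-1.2, -0.8, -1.5, -0.3, -1.0, -0.6]

lr_high = score_to_lr(2.0, target_scores, nontarget_scores)
lr_low = score_to_lr(-1.0, target_scores, nontarget_scores)
```

A score typical of same-speaker comparisons gives an LR above 1, and a score typical of different-speaker comparisons gives an LR below 1.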
|Figure 1: Principal structure of the proposed forensic automatic speaker recognition system|
A log-LR cost function (Cllr) is used to evaluate the accuracy of the FASR system in this study. It is independent of prior probabilities and costs and has been adopted by the National Institute of Standards and Technology Speaker Recognition Evaluations. It is calculated using Equation (2).
In Equation (2), Nss and Nds are the numbers of same-speaker and different-speaker comparisons, and LRss and LRds are the LR values calculated from same-speaker and different-speaker comparisons. As a metric of the reliability of a speaker verification system, Cllr has previously been used in both FASR systems and acoustic-phonetic forensic voice comparison research. More reliable systems produce smaller Cllr values, and less reliable systems produce larger ones.
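The standard definition of the log-LR cost is Cllr = 1/2 [ (1/Nss) Σ log2(1 + 1/LRss) + (1/Nds) Σ log2(1 + LRds) ], with the sums taken over all same-speaker and all different-speaker comparisons. A minimal sketch of that computation follows; the function name and the example LR values are illustrative assumptions.

```python
import math

def cllr(lrs_same, lrs_diff):
    """Log-LR cost from same-speaker and different-speaker LR values."""
    # Penalty for same-speaker comparisons: large when the LR is small.
    cost_ss = sum(math.log2(1.0 + 1.0 / lr) for lr in lrs_same) / len(lrs_same)
    # Penalty for different-speaker comparisons: large when the LR is large.
    cost_ds = sum(math.log2(1.0 + lr) for lr in lrs_diff) / len(lrs_diff)
    return 0.5 * (cost_ss + cost_ds)

# A system that always outputs LR = 1 carries no information.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])

# Made-up LRs from a system that mostly points in the right direction.
informative = cllr([100.0, 1000.0], [0.01, 0.001])
```

An uninformative system that always reports an LR of 1 scores exactly 1, so values well below 1 indicate that the LRs carry useful information.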
| Results and Discussion|| |
Twelve features automatically extracted by VoiceSauce (strF0, sF0, pF0, sF1, pF1, sF2, pF2, sF3, pF3, sF4, pF4, and CPP) were used in this test. F0 is the fundamental frequency; F1, F2, F3, and F4 are the first four formants; and CPP is the cepstral peak prominence. The prefixes are "str", "s", and "p": "str" marks results calculated using the STRAIGHT algorithm, "s" marks results obtained from the Snack Sound Toolkit, and "p" marks results obtained from Praat. Thus, we have three fundamental frequency values calculated using different algorithms. The center frequencies of the first four formants were measured by Snack and Praat separately.
One mobile phone database and one landline telephone database containing data from the same 20 male Mandarin speakers were tested. Each database was recorded in two sessions. Speakers were asked to read different text materials. In the two databases, cross-validated LRs were calculated using data from all same-speaker and different-speaker pairs. The first half of each speaker's features (across the two sessions) was pooled to train a text-independent GMM, using the adaptive GMM algorithm, to represent the speaker's characteristics. The second half of the features in session 1 was used as test data, and the second half of those in session 2 as calibration data. Before calculation, the features described above were preprocessed: if any of the 12 features in a frame was 0 or NaN (not a number, as represented in Matlab), all the features of that frame were discarded.
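The frame-level preprocessing rule above can be sketched as a simple filter. The function name and the toy frames (three features instead of twelve) are assumptions for illustration.

```python
import math

def drop_invalid_frames(frames):
    """Keep only frames in which no feature is 0 or NaN."""
    return [
        frame for frame in frames
        if all(value != 0 and not math.isnan(value) for value in frame)
    ]

# Toy frames with three features each; the real system uses twelve.
frames = [
    [120.5, 510.0, 21.3],         # valid frame, kept
    [0.0, 495.0, 20.1],           # contains a 0, discarded
    [118.2, float("nan"), 19.8],  # contains a NaN, discarded
    [119.0, 502.0, 22.0],         # valid frame, kept
]
clean_frames = drop_invalid_frames(frames)
```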
In the discussion below, LRs are often expressed on a base-10 logarithmic scale. This is convenient because, on a logarithmic scale, large positive numbers provide greater support for the same-speaker hypothesis and large negative numbers provide greater support for the different-speaker hypothesis. For example, a log LR of +1 indicates that the evidence is 10 times more likely to be observed under the same-speaker hypothesis than under the different-speaker hypothesis, while a log LR of −1 indicates that the evidence is 10 times more likely to be observed under the different-speaker hypothesis than under the same-speaker hypothesis. [Figure 2] shows the Tippett plot of the cross-validated LRs using the acoustic-phonetic features from the mobile phone database. The red curve rising to the left represents the proportion of different-speaker comparisons with log10 LRs equal to or greater than the value indicated on the X-axis. The blue curve rising to the right represents the proportion of same-speaker comparisons with log10 LRs equal to or less than the value indicated on the X-axis. The vertical line is the threshold, which is zero on a base-10 logarithmic scale. All the Tippett plots shown below are presented in the same way. For the same-speaker pairs, the false negative rate was 0%, meaning that all the same-speaker pairs were correctly classified. For the different-speaker pairs, the false positive rate was 3.68%: seven of 190 different-speaker pairs were wrongly classified as same-speaker pairs. The Cllr value was 0.1455. The dotted lines in [Figure 3] represent the Tippett plot of the cross-validated LRs using the acoustic-phonetic features on the landline database. The Cllr value was 0.1144. For the same-speaker pairs, the false negative rate was 5%: only one of 20 same-speaker pairs was wrongly classified as a different-speaker pair.
For the different-speaker pairs, the false positive rate was 2.11%: four of 190 different-speaker pairs were wrongly classified as same-speaker pairs. The LRs that favored the untrue hypotheses were not large, which indicates that the wrong results did not lend much support to the wrong hypothesis; this is very important for evidence evaluation in forensic applications. When the results were calibrated using the Focal Toolbox, the Cllr value was reduced to 0.1092. The solid lines in [Figure 3] show the Tippett plot of the cross-validated LRs after calibration. [Figure 4] shows three Tippett plots of cross-validated LRs on the landline database, each labeled separately: the results of the acoustic-phonetic features described in this paper, the results of a baseline GMM-UBM recognition system using MFCC, and the fused results obtained by simple multiplication of the LRs from the acoustic-phonetic features and from MFCC (assuming that the two kinds of features are independent). The results in [Figure 4] show that although the acoustic-phonetic features do not perform as well as MFCC, the fused results yield stronger evidence, that is, they give more support to the conclusion.
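Fusion by simple multiplication of LRs is equivalent to adding log10 LRs, and the error rates reported above are the proportions of log10 LRs that fall on the wrong side of the zero threshold. A minimal sketch of both operations follows; the function names and the example values are invented for illustration, and the independence of the two feature sets is an assumption, as stated in the text.

```python
def fuse_log10_lrs(log_lrs_a, log_lrs_b):
    """Fuse two systems by multiplying LRs, i.e. adding log10 LRs pairwise."""
    return [a + b for a, b in zip(log_lrs_a, log_lrs_b)]

def false_negative_rate(log_lrs_same):
    """Share of same-speaker comparisons with log10 LR below 0."""
    return sum(1 for v in log_lrs_same if v < 0) / len(log_lrs_same)

def false_positive_rate(log_lrs_diff):
    """Share of different-speaker comparisons with log10 LR above 0."""
    return sum(1 for v in log_lrs_diff if v > 0) / len(log_lrs_diff)

# Made-up log10 LRs for two different-speaker comparisons from two systems.
acoustic_phonetic = [0.5, -0.2]
mfcc = [-1.5, -2.0]
fused = fuse_log10_lrs(acoustic_phonetic, mfcc)

fpr_before = false_positive_rate(acoustic_phonetic)
fpr_after = false_positive_rate(fused)
```

In this toy case the first comparison is misleading for the acoustic-phonetic system alone, and the fused log10 LR moves to the correct side of the threshold.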
|Figure 2: Tippett plot of the cross-validated likelihood ratios on the mobile phone database|
|Figure 3: Tippett plot of the cross-validated likelihood ratios on the landline database|
|Figure 4: Tippett plot of the fused cross-validated likelihood ratios on the landline database|
| Conclusion|| |
This paper proposes a new approach to calculating the strength of forensic voice evidence within an LR framework and applies automatically extracted acoustic-phonetic features to FASR. The results indicate that automatic acoustic-phonetic features have acceptable discriminative performance and can be of considerable help in evidence analysis.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
| References|| |
Champod C, Meuwly D. The Inference of identity in forensic speaker recognition. Speech Commun 2000;31:193-203.
Morrison GS. Forensic voice comparison and the paradigm shift. Sci Justice 2009;49:298-308.
Saks MJ, Koehler JJ. The coming paradigm shift in forensic identification science. Science 2005;309:892-5.
Drygajlo A. Automatic Speaker Recognition for Forensic Case Assessment and Interpretation. Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism. New York: Springer-Verlag; 2011. p. 21-2.
Alexander A, Dessimoz D, Botti F, Drygajlo A. Aural and automatic forensic speaker recognition in mismatched conditions. Int J Speech Lang Law 2005;12:214-34.
Castro DR. Forensic Evaluation of the Evidence Using Automatic Speaker Recognition Systems. Madrid: Autonomous University of Madrid; 2007.
Morrison GS, Zhang C, Rose P. An empirical estimate of the precision of likelihood ratios from a forensic-voice-comparison system. Forensic Sci Int 2011;208:59-65.
Rose P. Technical Forensic Speaker Recognition: Evaluation, Types and Testing of Evidence. The Speaker and Language Recognition Workshop Odyssey; 2004. p. 159-91.
Rose P. Forensic voice comparison with secular shibboleths – A hybrid fused GMM-multivariate likelihood ratio-based approach using alveolo-palatal fricative cepstral spectra. Prague, Czech Republic: ICASSP; 2011. p. 5900-3.
Kawahara H, de Cheveigné A, Patterson RD. An instantaneous-frequency-based pitch extraction method for high-quality speech transformation: Revised TEMPO in the STRAIGHT-suite. In Proc. 5th International Conference on Spoken Language Processing (ICSLP'96), Sydney; 1998.
Sjölander K. The Snack sound toolkit. Sweden: KTH Stockholm; 2004.
Shue YL, Chen G, Alwan A. On the Interdependencies Between Voice Quality, Glottal Gaps, and Voice-source Related Acoustic Measures. In Proceedings of Interspeech; 2010. p. 34-7.
Morrison GS. Likelihood-ratio forensic voice comparison using parametric representations of the formant trajectories of diphthongs. J Acoust Soc Am 2009;125:2387-97.
Reynolds DA, Quatieri TF, Dunn RB. Speaker verification using adapted Gaussian mixture models. Digit Signal Process 2000;10:19-41.
Wang H, Yang J, Wu M, Xu Y. A forensic automatic speaker recognition method based on improved GMM-UBM. J Univ Chin Acad Sci 2013;30:800-5.
Drygajlo A. Statistical Evaluation of Biometric Evidence in Forensic Automatic Speaker Recognition. In 3rd International Workshop on Computational Forensics, IWCF 2009, The Hague, Netherlands; 2009. p. 1-12.