Year: 2017 | Volume: 3 | Issue: 4 | Page: 217-222
Fundamental frequency statistics for young male speakers of Mandarin
Honglin Cao1, Yingjing Lei2
1 Collaborative Innovation Center of Judicial Civilization; Key Laboratory of Evidence Science (China University of Political Science and Law), Ministry of Education, Beijing 100088, China
2 Key Laboratory of Evidence Science (China University of Political Science and Law), Ministry of Education, Beijing 100088, China
Date of Web Publication: 11-Jan-2018
Dr. Honglin Cao
Key Laboratory of Evidence Science (China University of Political Science and Law), Ministry of Education, Beijing 100088
Source of Support: None, Conflict of Interest: None
In forensic speaker comparison (FSC), it is essential to evaluate not only the similarity between two (or more) samples but also the typicality of their features in the relevant population. Assessing typicality requires population statistics for the phonetic parameters concerned. This article presents the statistics for the fundamental frequency (F0) of 100 young Chinese male speakers producing both reading and spontaneous speech. Five descriptive statistics of long-term F0, namely mean, median, mode, standard deviation (SD), and coefficient of variation (CV = SD/mean), are shown in histograms and scatter diagrams. Results show that the distributions of the five statistics are near normal. The findings are compared with the literature and discussed with respect to forensic phonetic implications. This article concludes that the results for the F0 statistics in the present study can be used in FSC casework as reference data on F0 for the young male Chinese population.
Keywords: Forensic speaker comparison, fundamental frequency, Mandarin, population statistics
How to cite this article:
Cao H, Lei Y. Fundamental frequency statistics for young male speakers of Mandarin. J Forensic Sci Med 2017;3:217-22
Introduction
Forensic speaker comparison (FSC) is the central aspect of forensic phonetics. Its task is to compare the voice in an unknown recording of an offender with the voice in a known recording of a suspect. Several different approaches are used for the analysis of speech samples in FSC casework; however, in Europe and Mainland China, the most popular method is to combine the auditory-phonetic and acoustic-phonetic approaches. Many features (segmental, supra-segmental, nonlinguistic, etc.) are expected to be analyzed when this method is applied. One of the most commonly used features is fundamental frequency (F0), which represents the vibration rate of the vocal cords in speech. Although F0 shows inter- and intra-speaker variation, long-term F0 distribution measures such as the arithmetic mean and standard deviation (SD) are the most common variables in FSC casework. In addition, other descriptive measures such as median, mode, and coefficient of variation (CV) are often calculated as well. The average values (mean, mode, and median) indicate the pitch (high or low) of a voice, while measurements of range or variability (e.g., SD) indicate whether a voice is monotonous or “lively.” F0 is robust as a forensic feature because it is relatively undisturbed by background noise and unaffected by telephone transmission.
It is widely accepted that both the similarity between two (or more) samples and the typicality of the features in the target population should be evaluated in FSC casework. In order to quantify the typicality aspect, it is necessary to have access to statistics of the speech features of a certain type of population. For forensic purposes, several speech databases have been established, such as “Pool 2010,” “Swedia,” “DyViS,” “CIVIL,” “TIMIT,” the “PKU Physical-Phonetic” database, and other large-scale but unnamed databases (personal communication with Jingyang Li: there is a large-scale speech corpus at the Institute of Forensic Science of the Ministry of Public Security, containing more than 1000 speakers). To some extent, these databases can be used to calculate the statistics of the acoustic features for the relevant population (strictly speaking, the social class and age of speakers should be controlled). Unfortunately, only a few features have been used for the calculation of population statistics, including F0, articulation rate, and long-term formant distributions.
Since the measurement of F0 is relatively easy to handle, F0 statistics were widely used in previous studies. These studies involved three nontonal languages, namely German, English, and Czech, and only one tonal language, namely Swedish. In these studies, the number of speakers is either greater than 100 (Lindh and Künzel analyzed 109 and 105 male speakers, respectively) or equal to 100. Young male adult speakers (one age group) were recruited in the studies by Hudson et al. (18-25 years old) and Lindh (20-30 years old), whereas the speakers in the studies by Jessen et al., Künzel, and Skarnitzl and Vaňková have a larger age span (e.g., 21-63 years old in the study by Jessen et al.). Regarding the speech style, spontaneous speech was investigated in the studies by Hudson et al. and Lindh, and reading speech in the study by Künzel, while both styles were analyzed and compared in the studies by Jessen et al. and Skarnitzl and Vaňková. Female speakers were studied only in the study by Künzel.
To the best of our knowledge, there are still no Chinese (tonal language) population statistics for F0. This article therefore aims to explore the statistics for long-term F0 of 100 young Mandarin-speaking male speakers producing both reading and spontaneous speech, based on the PKU Physical-Phonetic database.
Methods
The PKU Mandarin Physical-Phonetic database is a large and still growing database. Many kinds of speech materials, such as sustained vowels, words, sentences, reading text, and free talk, were collected. Meanwhile, two types of physiological parameters, body size and vocal tract parameters, were also measured. To date, 407 young individuals (228 males and 179 females) aged 19-30 years have been recruited. All speakers self-identified as being of Han ethnicity. They were students (accounting for a large proportion), teachers, physicians, and public servants. None of them had any noticeable voice or speech disorders. All speakers were able to speak Standard Chinese fluently. A SONY ECM-44B microphone was used to record the materials in sound-attenuated rooms at Peking University and the Second Hospital of Dalian Medical University. All recordings were made at a sampling rate of 22 kHz and 16-bit depth.
A total of 100 male speakers (mean 24.1 years, SD 2.7 years) were selected randomly from the above database in this study. Both reading speech and spontaneous speech were used to calculate F0 parameters. Reading speech was based on the Chinese version of the text “The North Wind and the Sun.” Spontaneous speech was elicited by asking the speakers to talk about their hometowns, habits, majors, and friends or their attitudes on the weather, traffic, and so on. Alternatively, the speakers could also choose to describe a series of pictures showing similar topics listed above (e.g., some pictures of scenes of the rush hour in the Beijing subway). The speakers were required to speak or read the materials at their comfortable levels of pitch and loudness as naturally as possible. WaveSurfer (KTH, Stockholm, Sweden) was chosen to edit the materials and obtain F0 data. After eliminating the parts of silence, the duration of reading speech ranged from 27.0 to 55.5 s (mean 35.5 s, SD 4.8 s). The original (spontaneous) free talk lasted more than 2 min but less than 10 min. Only the first 3 min of speech from the free talk was extracted as the participants' spontaneous speech, of which the duration ranged from 123.0 to 180.0 s (mean 173.0 s, SD 13.3 s). After eliminating the experimenter's speech along with the silence and omitting laughter and coughs in participants' speech, the durations of valid spontaneous speech fell in the 57.0-138.3 s range (mean 88.6 s, SD 16.7 s).
F0 data extraction
The ESPS method in WaveSurfer was used to extract F0 data from the target speech, with a frame interval of 0.01 s and an extraction range of 60-400 Hz; all other settings were kept at the default values. The automatically extracted values were corrected or removed manually when they were obviously wrong (especially in stretches of creaky voice). The original F0 data were saved as txt files, which were then analyzed using a Matlab program to calculate the mean, mode, median, SD, and CV values of F0. In this article, the CV of F0 is derived by the formula CV = 100 × (F0_SD/F0_mean), where both F0_SD and F0_mean are given in Hz. Pearson's correlation analyses and paired t-tests were performed to estimate the relationship between different parameters using IBM SPSS Statistics (version 22).
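The five long-term statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Matlab program: the function name `f0_statistics` is hypothetical, the sample SD (n - 1 denominator) is assumed, and the mode is taken over F0 values rounded to 1 Hz bins because the article does not state how the mode was binned.

```python
import statistics
from collections import Counter


def f0_statistics(f0_hz, mode_bin_hz=1.0):
    """Return long-term F0 statistics for one speaker.

    f0_hz: voiced-frame F0 values in Hz (unvoiced frames already removed).
    mode_bin_hz: bin width used to locate the mode (an assumption; the
    article does not say how the mode was computed).
    """
    if len(f0_hz) < 2:
        raise ValueError("need at least two voiced frames")
    mean = statistics.fmean(f0_hz)
    median = statistics.median(f0_hz)
    sd = statistics.stdev(f0_hz)  # sample SD (n - 1); an assumption
    # Mode: centre of the most populated bin; ties go to the lowest bin.
    bins = Counter(round(v / mode_bin_hz) for v in f0_hz)
    best_bin = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))[0]
    mode = best_bin * mode_bin_hz
    cv = 100.0 * sd / mean  # CV in percent: 100 x (F0_SD / F0_mean)
    return {"mean": mean, "median": median, "mode": mode, "sd": sd, "cv": cv}
```

Calling `f0_statistics` once per speaker and per speaking style would yield the per-speaker values that the histograms and scatter diagrams summarize; a wider `mode_bin_hz` gives a more stable mode for short recordings.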
Results and Discussion
The results for the overall average values of five F0 statistics in both spontaneous speech and reading speech are shown in [Table 1]. In reading speech, the average values of F0 means, medians, and modes are 128.4, 127.0, and 120.2 Hz, respectively. In spontaneous speech, the three corresponding measures are 122.4, 119.6, and 113.2 Hz, respectively. The paired t-test results show that the mean, median, and mode values in spontaneous speech are significantly lower than those in reading (t = −6.499, P < 0.0001; t = −7.806, P < 0.0001; and t = −4.476, P < 0.0001, respectively). These results are well in line with the results of Skarnitzl and Vaňková and Hollien et al., but different from the findings of Jessen et al. (for neutral speech). For spontaneous speech, the average values in this article compare well to the findings of Lindh, where the average means and medians are 120.8 and 115.8 Hz, respectively, but are dissimilar to those of Hudson et al., in which the average means and medians were found to be 106 and 105 Hz, respectively. As the speakers in these three studies are all young male adults, one potential reason for the inconsistency may be the difference in intonation patterns of Chinese, Swedish, and English. Although Czech, like English, is not a tonal language, the average values shown in the study by Skarnitzl and Vaňková (e.g., the average values of F0 means are 117.3 and 129.3 Hz in spontaneous and reading speech, respectively) are higher than those in the study by Hudson et al. and very similar to those in our study. This implies that whether a language is tonal may not be the only or decisive factor; other factors such as age, body size, or emotional state may also play a part.
Table 1: Five F0 statistics of 100 young Mandarin male speakers in reading and spontaneous speech
The two variability measures were higher in reading than in spontaneous speech. A small but significant difference was found between F0 SD in reading and F0 SD in spontaneous speech (t = −2.037, P = 0.044 < 0.05); however, no significant difference was found for F0 CV (t = −0.249, P = 0.804; the CV is also called varco in the studies by Jessen et al. and Skarnitzl and Vaňková; it is a relative rather than an absolute form of SD). These results are also consistent with those of Skarnitzl and Vaňková for Czech.
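The paired comparisons reported here rest on the standard paired t statistic, which can be sketched as follows. The article's values came from IBM SPSS; the function name `paired_t` is hypothetical, and the argument order (spontaneous first, reading second) is an assumption inferred from the negative t values. The P value, which needs the t distribution with n - 1 degrees of freedom, is not computed in this sketch.

```python
import math
import statistics


def paired_t(x, y):
    """Paired t statistic for matched samples, computed on x - y.

    Returns (t, degrees_of_freedom).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be matched samples of length >= 2")
    diffs = [a - b for a, b in zip(x, y)]
    sd = statistics.stdev(diffs)  # sample SD of the per-speaker differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    standard_error = sd / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / standard_error, len(diffs) - 1
```

With per-speaker F0 means passed as `paired_t(spontaneous, reading)`, a negative t indicates lower values in spontaneous speech, which matches the sign convention of the results above.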
[Figure 1] shows six histograms of the three average values (mean, median, and mode) of F0 in Hz (quantized into 10 Hz bins), both in spontaneous speech and reading. All six distributions appear to be near normal, which is in agreement with previous studies. All six distributions are positively skewed (0.706 ≤ skewness value ≤ 1.138). The ranges of the six distributions are all within 70-210 Hz, although the ranges for reading speech are slightly wider than those for spontaneous speech. In general, the three average values in spontaneous speech are located further to the left on the x-axis than those in reading, which corresponds with the above-mentioned paired t-test results.
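The skewness values quoted for the histograms can be computed from the per-speaker statistics. The article does not say which skewness estimator was used; the sketch below assumes the adjusted Fisher-Pearson coefficient, which statistical packages commonly report, and the function name `sample_skewness` is hypothetical.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: n / ((n-1)(n-2)) * sum(z**3).

    z is each value standardized with the sample SD (n - 1 denominator).
    The choice of this estimator is an assumption.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    if variance == 0:
        raise ValueError("all values are identical; skewness is undefined")
    sd = math.sqrt(variance)
    cubed = sum(((v - mean) / sd) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed
```

A positive result, as reported for all six distributions, means the right tail (speakers with high F0) is longer than the left tail.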
Figure 1: Histograms of three average values of F0 in Hz among 100 Chinese young male speakers in spontaneous speech (left) and reading speech (right). Three average values, namely mean, median, and mode, are shown in the top, middle, and bottom of the figure, respectively. Intervals of F0 values are plotted on the x-axis
To be specific, for the F0 mean distribution in spontaneous speech, the distribution maximum (thirty speakers) is located in the 120-130 Hz interval. It can also be inferred from the distribution that 2%, 79%, and 3% of the 100 speakers are located in the ranges of 80-90 Hz, 100-140 Hz, and 160-190 Hz, respectively. Similarly, for the F0 median in reading speech, 25 speakers are found in the interval of the distribution maximum; 77% of all the 100 speakers are located in the 100-140 Hz range, while only one speaker (1%) is located in either the 80-90 Hz or 190-200 Hz range. Therefore, from a forensic point of view, F0 cannot discriminate well among the majority of speakers, who are located in the central areas of the distributions (e.g., 100-140 Hz). If, however, the F0 data of two neutral speech samples fall in the outlying, less typical ranges (below or above 100-140 Hz), the F0 data still have strong discriminatory power.
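The typicality reasoning above amounts to asking what share of the reference population falls in a given F0 interval. A minimal sketch follows; the function names `bin_counts` and `share_in_range` are hypothetical, and treating the intervals as half-open (lower edge included, upper edge excluded) is an assumption, since the article does not state its bin-edge convention.

```python
from collections import Counter


def bin_counts(values_hz, width_hz=10):
    """Count values per half-open bin [edge, edge + width), keyed by lower edge."""
    return Counter(int(v // width_hz) * width_hz for v in values_hz)


def share_in_range(values_hz, low_hz, high_hz):
    """Percentage of values in the half-open range [low_hz, high_hz)."""
    if not values_hz:
        raise ValueError("no values given")
    inside = sum(1 for v in values_hz if low_hz <= v < high_hz)
    return 100.0 * inside / len(values_hz)
```

Applied to the 100 per-speaker F0 means, `share_in_range(means, 100, 140)` would return the kind of percentage quoted above; a small share for the interval containing a questioned sample indicates an atypical, and therefore more discriminating, F0 value.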
[Figure 2] displays the histograms of the SD of F0 (displayed in 5 Hz bins) both in spontaneous speech and reading. It can be observed that the two distributions are both near normal, which is consistent with the findings in the studies by Jessen et al., Skarnitzl and Vaňková, and Gold. Both distributions are positively skewed (1.243 ≤ skewness value ≤ 1.484), with a range of 10-50 Hz. In spontaneous speech, forty speakers are found in the interval of the distribution maximum of F0 SD; 62% of all the 100 speakers are located in the range of 15-25 Hz, while only 4% (four speakers) are located in the range of 35-50 Hz. A similar distribution pattern can be observed in reading speech for F0 SD.
Figure 2: Histograms of F0 standard deviation among 100 Chinese young male speakers in spontaneous speech (top) and reading speech (bottom)
The histograms in [Figure 3] show how F0 variability, expressed as CV, is distributed among the 100 speakers. The two histograms give an indication of how speakers differ in F0 SD when F0 SD is normalized against F0 level. Speakers toward the low end of the distributions speak relatively monotonously, whereas speakers toward the high end speak with a relatively lively intonation.
Figure 3: Histograms of the F0 coefficient of variation (CV, or varco) among 100 Chinese young male speakers in spontaneous speech (top) and reading speech (bottom)
[Figure 4] and [Figure 5] (following Jessen et al.) show the behavioral differences between individual speakers in terms of F0 mean and F0 variability (SD and CV) at the same time.
Figure 4: Plot of F0 mean in Hz (x-axis) against standard deviation of F0 in Hz (y-axis) in spontaneous speech (red circles) and reading speech (blue squares)
Figure 5: Plot of F0 mean in Hz against coefficient of variation of F0 (F0 CV) in percentage in spontaneous speech (red circles) and reading speech (blue squares)
[Figure 4] illustrates how each of the 100 Chinese speakers behaves in spontaneous speech and reading speech. For each speaker, F0 mean and F0 SD are shown as they are produced in spontaneous style (red circles) and reading (blue squares). The lines that connect circles and squares show how F0 mean and F0 SD change from spontaneous to reading speech for each speaker. It can be observed from [Figure 4] that, for most of the 100 speakers (64), F0 SD in reading speech is higher than (56) or equal to (8) F0 SD in spontaneous speech. Meanwhile, for the majority of the 100 speakers (78), the F0 mean in reading is higher than (70) or equal to (8) the F0 mean in spontaneous speech. Furthermore, a general pattern can be seen in [Figure 4]: if the F0 mean is low/high, the F0 SD tends to be low/high as well. The results of Pearson's correlation analysis also show that F0 mean is significantly positively correlated with F0 SD both in reading (r = 0.690, P < 0.0001) and spontaneous speech (r = 0.654, P < 0.0001). These results are in line with the findings in the study by Jessen et al. According to Rose and Jessen et al., it is not desirable to treat two highly correlated F0 parameters as separate forensic-phonetic features.
Therefore, the relations between F0 mean and the relative SD (CV), instead of the absolute one, are shown in [Figure 5]. Compared to [Figure 4], a different pattern can be found in [Figure 5]: for just over half of the 100 speakers (54), F0 CV in reading speech is higher than or equal to that in spontaneous speech. The different pattern is also reflected in the correlations between F0 mean and F0 CV in reading (r = 0.273, P = 0.006) and spontaneous (r = 0.278, P = 0.005) speech, for which the degree of correlation decreases from strong (when absolute SD is considered) to very weak.
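The correlations between F0 mean and the two variability measures are Pearson product-moment coefficients. The article's values were obtained with IBM SPSS; the sketch below only illustrates the calculation, and the function name `pearson_r` is hypothetical. Significance testing is not included.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between paired lists x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired lists of length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Running `pearson_r` once on (F0 mean, F0 SD) pairs and once on (F0 mean, F0 CV) pairs for the same speakers reproduces the comparison made above: the coefficient is expected to drop when SD is replaced by CV, because dividing by the mean removes much of the dependence on F0 level.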
Two findings can be obtained from [Figure 4] and [Figure 5]. First, the relative measure CV, rather than the absolute SD, is recommended for expressing F0 variability in FSC casework. Second, for some speakers, the difference in F0 mean (and median and mode), F0 SD, or F0 CV between reading and spontaneous speech is evident, while for others it is small.
It is well known that F0 can be easily affected by many factors; however, for forensic purposes, as long as the physiological and psychological factors of the recordings under investigation are comparable, F0 still has discriminatory power, especially when the F0 value is atypical according to the F0 statistics found in the present and previous studies.
Conclusion
This study presents the statistics of the fundamental frequency (F0) of 100 young Mandarin-speaking male speakers producing both reading and spontaneous speech. Three average statistics of long-term F0 (mean, median, and mode) and two measures of F0 variability (SD and CV) are described using histograms and scatter diagrams. Separate histograms are produced for all statistics in the two speaking styles. All histograms are near normally distributed. In general, the mean, median, mode, and SD of F0 are significantly higher in reading speech than in spontaneous speech, while no significant difference is found for F0 CV. Specifically, it can be seen from the scatter diagrams that for some speakers the difference in F0 mean (and median and mode), F0 SD, or F0 CV between reading and spontaneous speech is evident, while for others it is small. It is also found that F0 CV is more independent of F0 mean than F0 SD is. This article concludes that the results for F0 statistics in the present study can be used in FSC casework as reference data on F0 for the young male Chinese population. In future studies, other forensic-phonetic variables that can be quantified should also be investigated. The exploration of adult female speakers will be interesting as well. Such investigations are currently being undertaken by the authors.
This research is supported by the Humanities and Social Science Research Projects at China University of Political Science and Law and the Open Project of Intelligent Speech Technology Key Laboratory of the Ministry of Public Security of China (Grant No. 2014ISTKFKT01). Considerable thanks to Yinghao Li for comments on the previous drafts of this article. We also thank two anonymous reviewers for their time and useful comments.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
References
Jessen M. Forensic phonetics. Lang Linguist Compass 2008;2:671-711.
Gold E, French P. International practices in forensic speaker comparison. Int J Speech Lang Law 2011;18:293-307.
Morrison GS, Sahito FH, Jardine G, Djokic D, Clavet S, Berghs S, et al. INTERPOL survey of the use of speaker identification by law enforcement agencies. Forensic Sci Int 2016;263:92-100.
Cao H, Li J, Wang Y, Kong J. On expert opinion of forensic speaker identification. Evid Sci 2013;21:605-24.
Foulkes P, French P. Forensic speaker comparison: A linguistic-acoustic perspective. In: Tiersma P, Solan L, editors. The Oxford Handbook of Language and Law. Oxford University Press; 2012. p. 557-73.
French P, Nolan F, Foulkes P, Harrison P, McDougall K. The UK position statement on forensic speaker comparison: A rejoinder to Rose and Morrison. Int J Speech Lang Law 2010;17:143-52.
Rose P. Forensic Speaker Identification. London and New York: CRC Press; 2002.
Eriksson A. Aural/acoustic vs. automatic methods in forensic phonetic case work. In: Neustein A, Patil HA, editors. Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism. New York: Springer; 2011. p. 41-69.
Hudson T, De Jong G, McDougall K, Harrison P, Nolan F. F0 statistics for 100 young male speakers of Standard Southern British English. Proc of 16th ICPhS. Saarbrücken, Germany; 2007. p. 1809-12.
Jessen M, Koster O, Gfroerer S. Influence of vocal effort on average and variability of fundamental frequency. Int J Speech Lang Law 2005;12:174-213.
Lindh J. Preliminary descriptive F0-statistics for young male speakers. Lund Work Pap Linguist 2006;52:89-92.
Nolan F, McDougall K, De Jong G, Hudson T. The DyViS database: Style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. Int J Speech Lang Law 2009;16:31-57.
Segundo ES, Alves H, Trinidad MF. CIVIL corpus: Voice quality for speaker forensic comparison. Procedia Soc Behav Sci 2013;95:587-93.
Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL, et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Philadelphia: Linguistic Data Consortium; 1993.
Cao H. Relationships between the Acoustic and Physical Characteristics in Mandarin. Unpublished PhD Thesis. Peking University; 2015.
Künzel HJ. How well does average fundamental frequency correlate with speaker height and weight? Phonetica 1989;46:117-25.
Hughes V, Foulkes P. The relevant population in forensic voice comparison: Effects of varying delimitations of social class and age. Speech Commun 2015;66:218-30.
Skarnitzl R, Vaňková J. Fundamental frequency statistics for male speakers of common Czech. Acta Univ Carol (Philol) 2017;3:7-17.
Gold E. Calculating Likelihood Ratios for Forensic Speaker Comparisons using Phonetic and Linguistic Parameters. Unpublished PhD Thesis. University of York; 2014.
Jessen M. Forensic reference data on articulation rate in German. Sci Justice 2007;47:50-67.
Cao H, Wang Y. A forensic aspect of articulation rate variation in Chinese. Proc of 17th ICPhS. Hong Kong, China; 2011. p. 396-9.
Sjölander K, Beskow J. WaveSurfer - An open source speech tool. Proc of 6th ICSLP. Beijing, China; 2000. p. 464-7.
Hollien H, Hollien PA, de Jong G. Effects of three parameters on speaking fundamental frequency. J Acoust Soc Am 1997;102:2984-92.