Summary of the articles found on the characteristics of emotions

From Control Systems Technology Group
Revision as of 14:20, 16 October 2014 by S121223 (talk | contribs)
Article: Emotions and Speech: Some Acoustical Correlates

Williams, C. E., & Stevens, K. N. (1972). Emotions and speech: Some acoustical correlates. The Journal of the Acoustical Society of America, 52(4B), 1238-1250.

Below, the fundamental frequency (F0) is given for the emotions anger, sorrow and fear. We will not make use of the neutral emotion. For each emotion, the most important features of an utterance are discussed, and the fundamental frequency is compared across the different emotions.

We will briefly explain each graph.

Anger: The contour shapes for utterances produced in anger situations showed an F0 that was generally higher throughout the utterances, suggesting that they were generated with greater emphasis. Furthermore, one or two syllables in each phrase were characterized by large peaks in F0, again indicating strong emphasis on these syllables. Although the excursions in F0 were quite large, there appeared to be a relatively smooth overall contour with one or two major peaks, but with no large discontinuities.

Sorrow: The contours for the utterances made in situations involving the emotion sorrow were relatively flat with few fluctuations, and the F0 was usually lower than that for neutral situations. For Voice B (fig. 1), there was a slowly falling contour during the first half of the utterance, and a more level contour towards the end.

Fear: The contours for utterances made in fear situations often departed from the prototype shape for neutral situations. Occasionally there were rapid up-and-down fluctuations within a voiced interval, as in cluster 4 for Voice B. Sometimes sharp discontinuities were noted from one syllable to the next.

The graph below shows the median fundamental frequency of the three emotions, based on the recordings of three voices. It can be seen as a recap of the article's results on the fundamental frequency.

In the next table the mean rate of articulation for each emotion is shown.

Some of the findings of the article were difficult to interpret because the figures were not always readable. However, the article ends with a summary of the findings for each emotion, which is discussed below.

Anger: The most consistent and striking acoustic manifestation of the emotion anger was a high F0 that persisted throughout the breath group. This increase was, on average, at least half an octave above the F0 for a neutral situation. The range of F0 observed for utterances spoken in anger situations was also considerably greater than the range for the neutral situations. Some syllables were produced with increased intensity or emphasis, and the vowels in these syllables had the highest fundamental frequency. These syllables also tended to have weak first formants, and were often generated with some voicing irregularity (i.e., irregular fluctuations from one glottal pulse to the next). The basic opening and closing articulatory gestures characteristic of the vowel-consonant alternation in speech appeared to be more extreme when a speaker was angry. The vowels tended to be produced with a more open vocal tract (and hence to have higher first-formant frequencies), and the consonants were generated with a more clearly defined closure.

Fear: The average F0 for fear was lower than that observed for anger, and for some voices it was close to that for utterances spoken in neutral situations. There were, however, occasional peaks in the F0 that were much higher than those encountered in a neutral situation. These peaks were interspersed with regions where the F0 was in a normal range. The pitch contours in the vicinity of the peaks sometimes had unusual shapes, and voicing irregularities were sometimes present. The duration of an utterance tended to be longer than in the case of anger or neutral situations. As was observed for anger, the vowels and consonants produced in fear situations were often more precisely articulated than they were in a neutral situation. Although these various characteristics were found for some of the utterances of some voices, observations of spectrograms revealed no clear and consistent correlate for the emotion fear.

Sorrow: The average F0 observed for the actors speaking in sorrow situations was considerably lower than that for neutral situations, and the range of F0 was usually quite narrow. This change in F0 was accompanied by a marked decrease in rate of articulation and an increase in the duration of an utterance. The increased duration resulted from longer vowels and consonants and from pauses that were often inserted in a sentence. Perhaps the most striking effect on the wide-band spectrogram was voicing irregularity. On occasion the voicing irregularity reduced simply to noise, i.e., the voiced sounds became whispered in effect.
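The half-octave rise reported for anger can be turned into a frequency target with a little arithmetic, since one octave corresponds to a doubling of frequency. The sketch below is only an illustration: the function names and the 120 Hz neutral F0 are our own assumptions and do not come from the article.

```python
import math


def shift_by_octaves(f0_hz: float, octaves: float) -> float:
    """Return f0_hz raised (or lowered, if negative) by a number of octaves."""
    # One octave up doubles the frequency, so n octaves scale it by 2**n.
    return f0_hz * 2.0 ** octaves


def interval_in_octaves(f0_hz: float, reference_hz: float) -> float:
    """Return how many octaves f0_hz lies above reference_hz."""
    return math.log2(f0_hz / reference_hz)


# Hypothetical neutral F0; the article does not prescribe this value.
NEUTRAL_F0_HZ = 120.0

# Anger: at least half an octave above neutral (a factor of about 1.41).
anger_f0_hz = shift_by_octaves(NEUTRAL_F0_HZ, 0.5)
print(f"anger target: at least {anger_f0_hz:.1f} Hz")
```

The same helper could be applied with a negative number of octaves for sorrow, but the article only states that the F0 is "considerably lower" there, so no specific value is suggested here.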

Article: Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information

Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C. M., Kazemzadeh, A., ... & Narayanan, S. (2004, October). Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the 6th international conference on Multimodal interfaces (pp. 205-211). ACM.

This article gives no specific measurements that could be used to program particular emotions. However, it does state which characteristics other researchers most often use to determine an emotion. We list these characteristics because it is useful to know which aspects of a sentence we should focus on.

Most researchers have used global suprasegmental/prosodic features as their acoustic cues for emotion recognition, in which utterance-level statistics are calculated. For example, the mean, standard deviation, maximum, and minimum of the pitch contour and energy in the utterances are widely used features in this regard. Dellaert et al. attempted to classify 4 human emotions by the use of pitch-related features. They implemented three different classifiers: Maximum Likelihood Bayes classifier (MLB), Kernel Regression (KR), and K-nearest Neighbors (KNN). The main limitation of those global-level acoustic features is that they cannot describe the dynamic variation along an utterance. To address this, dynamic variation in emotion in speech can be traced, for example, in spectral changes at a local segmental level, using short-term spectral features. In one study, 13 Mel-frequency cepstral coefficients (MFCC) were used to train a Hidden Markov Model (HMM) to recognize four emotions. Nwe et al. used 12 Mel-based speech signal power coefficients to train a Discrete Hidden Markov Model to classify the six archetypal emotions. The average accuracy in both approaches was between 70 and 75%. Finally, other approaches have used language and discourse information, exploiting the fact that some words are highly correlated with specific emotions.
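To make the notion of utterance-level statistics concrete, here is a minimal sketch that reduces one contour (pitch or energy) to the four global features named above. The function name, the convention that a value of 0 marks an unvoiced frame, and the example values are our own assumptions, not something taken from the article.

```python
from statistics import mean, pstdev


def utterance_level_stats(contour):
    """Summarise one contour (pitch or energy) with global statistics.

    Assumption: frames with a value of 0 are unvoiced or silent and are skipped.
    """
    voiced = [value for value in contour if value > 0]
    if not voiced:
        raise ValueError("contour contains no voiced frames")
    return {
        "mean": mean(voiced),
        "std": pstdev(voiced),  # population standard deviation
        "max": max(voiced),
        "min": min(voiced),
    }


# Made-up F0 values in Hz, one per analysis frame.
example_f0 = [0.0, 100.0, 110.0, 120.0, 0.0]
features = utterance_level_stats(example_f0)
print(features)
```

Because the four numbers describe the utterance as a whole, a rising contour and a falling contour with the same values produce identical features. This is the limitation of global features that the article points out.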

We tried to look up all the studies mentioned in the text above. Some of them we could not find or could not access. Others turned out to be unusable for our research. However, we did find some useful articles, for example the article discussed earlier.

Article: Emotive Qualities in Robot Speech

Breazeal, C. (2001). Emotive qualities in robot speech. In Intelligent Robots and Systems, 2001. Proceedings. 2001 IEEE/RSJ International Conference on (Vol. 3, pp. 1388-1394). IEEE.

Fearful speech: is very fast with a wide pitch contour, large pitch variance, very high mean pitch, and normal intensity. A slightly breathy quality was added to the voice, as people seem to associate it with a sense of trepidation.

Angry speech: is loud and slightly fast with a wide pitch range and high variance. A low mean pitch was purposefully implemented to give the voice a prohibiting, threatening quality.

Sad speech: has a slower speech rate, with longer pauses than normal. It has a low mean pitch, a narrow pitch range and low variance. It is softly spoken with a slight breathy quality (it gives the voice a tired quality). It has a pitch contour that falls at the end.

Happy speech: it is relatively fast, with a high mean pitch, wide range, and wide pitch variance. It is loud with smooth undulating inflections.
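The four descriptions above can be collected in a small lookup table, which may be a convenient starting point when programming the voice. This is only a sketch: the class and field names are our own, the values are the qualitative labels from the text rather than measured numbers, and `breathy` is set to True only where the text mentions a breathy quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionProfile:
    """Qualitative voice settings for one emotion (labels taken from the text)."""
    speech_rate: str
    mean_pitch: str
    pitch_range: str
    pitch_variance: str
    intensity: str
    breathy: bool


PROFILES = {
    # The text speaks of a "wide pitch contour" for fear; stored as pitch_range.
    "fearful": EmotionProfile("very fast", "very high", "wide", "large", "normal", True),
    "angry": EmotionProfile("slightly fast", "low", "wide", "high", "loud", False),
    "sad": EmotionProfile("slow", "low", "narrow", "low", "soft", True),
    "happy": EmotionProfile("relatively fast", "high", "wide", "wide", "loud", False),
}


def emotions_where(field_name, value):
    """Return the emotions whose profile has the given value, sorted by name."""
    return sorted(
        name
        for name, profile in PROFILES.items()
        if getattr(profile, field_name) == value
    )


print(emotions_where("mean_pitch", "low"))
```

The table does not capture everything in the descriptions, for example the longer pauses and the falling final contour of sad speech.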

Other emotions were discussed as well, but we are not going to implement them. Therefore they are not listed here.

Article: Emotions in the Voice: Humanising a Robotic Voice

Bowles, T., & Pauletto, S. (2010). Emotions in the voice: humanising a robotic voice. In Proceedings of the 7th Sound and Music Computing Conference, Barcelona, Spain.

This article describes three important characteristics, which are explained below.

Phrase duration: Figure 1 shows the average duration of each phrase. Higher bars indicate a longer duration and, when comparing the four emotional versions of the same phrase, a slower speech rate, whereas shorter bars indicate a shorter duration and, when considering different versions of the same phrase, a faster speech rate.

The authors looked at how each emotional phrase deviates from its neutral version and noted that the phrases involving only monosyllabic or short words saw the greatest reduction in duration for the angry and happy phrases (a reduction of 20% or more relative to the duration of the neutral phrase). In the cases of phrases 1, 3 and 8, the average angry and happy phrases were slower than their equivalent neutral versions. Half of the sad phrases saw an increase in duration of over 20%, whereas the short-word phrases (phrases 5, 6 and 7) saw an increase in duration of between 10% and 20%. The average length of pauses per phrase was also measured. The sad phrases saw the longest overall pauses, with 5 out of the 8 sad phrases having the longest phrase duration. Half of the happy phrases contained pauses that were longer than the angry equivalent, whereas only angry phrases 2 and 5 contained pauses that were longer than their happy equivalent.
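The deviation from the neutral version used in this comparison is a relative change in duration. A minimal sketch of that calculation follows; the function name and the example durations are hypothetical and are not taken from the article.

```python
def duration_change_percent(emotional_s: float, neutral_s: float) -> float:
    """Return the duration change of an emotional phrase relative to neutral, in percent.

    Positive values mean the emotional phrase is longer (slower speech rate),
    negative values mean it is shorter (faster speech rate).
    """
    if neutral_s <= 0:
        raise ValueError("neutral duration must be positive")
    return (emotional_s - neutral_s) / neutral_s * 100.0


# Hypothetical durations in seconds for one phrase.
neutral = 2.0
sad = 2.5
angry = 1.6

print(f"sad:   {duration_change_percent(sad, neutral):+.0f}%")
print(f"angry: {duration_change_percent(angry, neutral):+.0f}%")
```

With these made-up numbers the sad version falls in the "over 20% increase" group and the angry version in the "20% reduction" group described in the text.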

Pitch analysis: In order to investigate the emotional changes in pitch, the maximum peak in fundamental frequency (F0), the number of pitch contours (or pitch variations) per phrase and the direction of pitch contours were examined. Figure 2 shows the average maximum peak of F0 for each phrase.

For all phrases, anger and happiness have high F0 peaks, while sadness has low F0 peaks. With the exception of phrase 4, the angry phrases have the highest peak fundamental frequency of the 3 emotions. The average maximum peak frequency of the angry phrases lies between 246 Hz and 281 Hz, with 6 out of the 8 phrases averaging above 250 Hz. The happy phrases have a range of 225 Hz to 269 Hz, with 2 of the 8 phrases averaging above 250 Hz. The sad phrases have the lowest fundamental frequency peaks, within a range of 143 Hz to 186 Hz.

Overall, the variation in the number of pitch contours was dependent on the type of phrase. Some of the angry phrases saw the greatest increase in the number of pitch contours, while the happy phrases showed greater variation between increases and decreases from phrase to phrase. The sad phrases generally saw a decrease in the number of pitch contours, with two exceptions.

The average direction of pitch contours per phrase was calculated by counting every upward directed contour as a positive value (+1) and every downward directed contour as a negative value (-1). The result for each phrase was totaled and an average obtained. The majority of the neutral phrases contained downward directed pitch contours. The majority of the angry phrases contained more downward directed pitch contours, whereas the happy phrases varied between having upward directed or downward directed pitch contours. The majority of the sad phrases contained downward directed contours.
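The direction measure described above can be sketched in a few lines. Note the assumptions: the article does not say over what the average is taken, so averaging over the different performances of one phrase is our reading, and the function names and example data are hypothetical.

```python
def direction_score(directions):
    """Total one performance: +1 for each upward contour, -1 for each downward one."""
    total = 0
    for direction in directions:
        if direction == "up":
            total += 1
        elif direction == "down":
            total -= 1
        else:
            raise ValueError(f"unknown direction: {direction!r}")
    return total


def average_direction(performances):
    """Average the totals over several performances of the same phrase."""
    if not performances:
        raise ValueError("at least one performance is required")
    scores = [direction_score(directions) for directions in performances]
    return sum(scores) / len(scores)


# Hypothetical contour directions for two performances of one phrase.
performances = [
    ["up", "down", "down"],
    ["down", "down"],
]
print(average_direction(performances))
```

A negative result means the phrase is dominated by downward directed contours, as reported for most of the neutral, angry and sad phrases.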

Amplitude analysis: Figure 3 shows the average maximum amplitude peak for each of the 8 phrases based upon the actors’ performances.

For all the phrases, anger and happiness have high peaks, while sadness has low peaks. The sad phrases had the lowest maximum amplitude peaks, with all of the sad phrases peaking below 95 dB. All the angry phrases and 7 out of 8 happy phrases had peaks that exceeded 100 dB.
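To get a feeling for what the gap between these peak levels means, a level difference in decibels can be converted into an amplitude ratio. The sketch below assumes that the reported values are amplitude levels on the usual 20·log10 scale, which the summary above does not state explicitly; the function name is our own.

```python
def db_difference_to_amplitude_ratio(delta_db: float) -> float:
    """Convert a level difference in dB into an amplitude ratio.

    Assumption: levels are amplitude levels, i.e. L = 20 * log10(A / A_ref).
    """
    return 10.0 ** (delta_db / 20.0)


# Sad phrases peak below 95 dB and angry phrases above 100 dB,
# so the peaks differ by more than 5 dB.
min_gap_db = 100.0 - 95.0
ratio = db_difference_to_amplitude_ratio(min_gap_db)
print(f"a {min_gap_db:.0f} dB gap is an amplitude ratio of about {ratio:.2f}")
```

Under this assumption the angry peaks are more than roughly 1.8 times the amplitude of the sad peaks.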