Samenvatting

From Control Systems Technology Group

Revision as of 14:08, 16 October 2014

Back: PRE_Groep2


Perhaps also interesting for the presentation: http://www.euanmacdonaldcentre.com/about-the-centre/euans-story/

Introduction

Freedom is valuable to humans. It is so important that the right to freedom is protected by constitutions and several human rights organizations. Freedom relates to both the physical and the mental state, and it can be limited or taken away. Muscle diseases take away a person’s freedom on a physical level.

A muscular disease that recently gained more attention is ALS, mainly because of the ‘ice bucket challenge’ that spread across Facebook. The idea of the campaign was to raise awareness of the disease and to raise money for further research. In ALS, the neurons that are responsible for muscle movement die off over time (Foundation ALS, 2014). Groups of muscles lose their function because the responsible neurons can no longer send signals from the brain to the muscles. This process continues until vital muscles, such as the muscles that help a person breathe, stop functioning. During their illness, people who suffer from ALS feel like prisoners in their own body. Their freedom decreases in multiple ways: for example, at some point they may no longer be able to express themselves verbally, as the vocal cords are controlled by muscles that can stop functioning.

A team of researchers (EUAN MacDonald Centre, 2014) tries to help people with ALS by giving them back their voice. They record the voices of ALS patients before muscle failure sets in, so that those recordings can be used in speech technology. Instead of hearing a computerized sound, ALS patients can hear their own voice when their ability to speak is impaired. This creates a stronger emotional bond between the patient and the loved ones surrounding them. Although such technology exists, the team feels that it can be improved by adding emotions to the voice recordings.

Several studies have investigated the features of particular emotions (Williams & Stevens, 1972; Breazeal, 2001; Bowles & Pauletto, 2010). These findings could be implemented in a speech program to add an emotion to a sentence. Although some emotions were recognized based on those features, it was difficult for the sample groups to recognize certain emotions, such as sadness. Other studies offer an explanation for this problem: they state that emotion cannot be recognized by acoustic features alone. Brusso et al. (2004) argue that the combination of acoustic features (pitch, frequency, etc.) and anatomical features (facial expressions) is more effective for the recognition of emotions. Scherer, Ladd, & Silverman (1984) found that the combination of acoustic and grammatical features is effective. Both studies thus conclude that combining multiple characteristics of emotions leads to the correct recognition of a specific emotion. When implementing these findings in a speech program, it is not possible to include anatomical features. Nevertheless, it is important to find the best possible way for ALS patients to express themselves verbally, even though it is based only on acoustic and grammatical features, because it gives them an increased sense of freedom.

The goal of this research is not to look further into the characteristics of emotions, but to investigate the strength of the combination of just the acoustic and grammatical features. The research starts with sentences that are spoken without emotion but express emotion grammatically. The sentence ‘You did so great, I thought it was amazing’ expresses the emotion ‘happy’ because of the words that were chosen. But how large is the effect when acoustic features of emotions are also added? This leads to the research question:

What is the effect on human perception of adding acoustic features of emotions to a sentence with grammatical emotional features?

Based on prior research, the sentence with both acoustic and grammatical features should be more convincing. The hypothesis is that the recognition of emotion in a voice will have a positive effect on likeability, animacy, and persuasiveness (Yoo & Gretzel, 2011; Vosse, Ham, & Midden, 2010).

The outcome of this research can be used in speech technology for patients who suffer from ALS, but it has broader applications. It can be used whenever no anatomical features of emotions are available but it is still necessary to communicate a certain emotion. In persuasive technology, for example, there are devices designed to convince people to behave in a certain way, and some of these devices use only voice recordings instead of an avatar. In such cases, the use of acoustic and grammatical features could strengthen the persuasiveness of the device, leading to the desired change in behavior.

Method

  • Materials

For this research five different programs were used: Acapela Box, Audacity, Google Forms, Microsoft Office Excel, and Stata. The use of each program is explained below.

Acapela Box is an online text-to-speech generator that creates voice messages from written text using a set of prerecorded voices. The voice of ‘Will’ was used because it is English (US) and offers the functions ‘happy’, ‘sad’, and ‘neutral’ (Acapela Group, 2009). To generate the voice of Will, the speech of a person was recorded; those parameters are then applied to the written text. In this process, attention is paid to diphones, syllables, morphemes, words, phrases, and sentences (Acapela Group, 2014). Acapela Box also offers the possibility to change the speech rate and the voice shaping (Acapela Group, 2009), which is very useful for expressing different emotions. A sad sentence is spoken more slowly than a happy one, because of long breaks between words and slower pronunciation (Williams & Stevens, 1972). A person talks at 1.91 syllables per second when sad and at 4.15 syllables per second when angry (Williams & Stevens, 1972). Because the speech rates of happy and angry speech do not differ much, the value of 4.15 syllables per second was also used for happiness (Breazeal, 2001). Voice shaping changes the pitch of the voice: a happy voice has a high average pitch and a sad voice has a low average pitch (Liscombe, 2007), so Acapela Box was used to change the voice shape accordingly.
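The syllable rates above imply concrete target durations for each sentence. As a minimal illustrative sketch (these helper names are ours, not part of Acapela Box or the study's tooling):

```python
# Syllable rates cited above (Williams & Stevens, 1972; Breazeal, 2001).
# The happy rate reuses the angry rate of 4.15 syl/s, as in the text.
SYLLABLES_PER_SECOND = {
    "sad": 1.91,
    "happy": 4.15,
    "angry": 4.15,
}

def target_duration(syllable_count: int, emotion: str) -> float:
    """Return the spoken duration (seconds) implied by the emotion's rate."""
    return syllable_count / SYLLABLES_PER_SECOND[emotion]

# A 12-syllable sentence: sad speech takes more than twice as long as happy.
sad_seconds = target_duration(12, "sad")      # ~6.28 s
happy_seconds = target_duration(12, "happy")  # ~2.89 s
```

Such a target duration could guide how far the speech-rate slider in Acapela Box needs to be moved for a given sentence.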

Audacity is free, open-source, cross-platform software for recording and editing audio (Audacity, 2014). It offers many functions with which an audio fragment can be given an emotional tone. With the ‘Amplify’ function a new peak amplitude can be chosen; this was used to make the happy audio fragments louder than the sad fragments. The difference in amplitude between these two emotions is 7 dB on average, and the loudness of a neutral voice is close to that of a sad voice (Bowles & Pauletto, 2010). The pitch of a sad voice decreases at the beginning of the sentence and remains rather constant in the second half (Williams & Stevens, 1972); a sad voice can also be recognized by the pitch decreasing at the end of the sentence. In contrast, a happy voice shows a large variety of pitches (Breazeal, 2001). In Audacity, the pitch of an individual word can be changed by selecting it, choosing the ‘Change Pitch’ effect, and entering the desired percentage change. Another feature of a sad voice is that it has longer breaks between words compared to all other emotions (Bowles & Pauletto, 2010). By selecting a break between words and using the ‘Change Tempo’ function, the break can be lengthened. This function can also lengthen or shorten individual words, which is useful because words with only one syllable are pronounced faster in happy speech, while in sad speech long words are pronounced 20% slower and short words 10-20% slower than in a neutral voice (Bowles & Pauletto, 2010). With ‘Change Tempo’ the speed of the voice changes but the pitch does not, which is exactly what is needed for these emotions.
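The numeric adjustments described above can be made explicit. A hedged sketch (the helpers below are ours, not Audacity functionality; the 15% midpoint for short words is our assumption within the 10-20% range):

```python
def db_to_amplitude_ratio(db: float) -> float:
    """A gain of `db` decibels multiplies the amplitude by 10**(db/20)."""
    return 10 ** (db / 20)

def sad_word_duration(neutral_seconds: float, long_word: bool) -> float:
    """Slow words down per Bowles & Pauletto (2010): long words 20% slower,
    short words 10-20% slower (we assume 15% as a midpoint)."""
    slowdown = 0.20 if long_word else 0.15
    return neutral_seconds * (1 + slowdown)

# The 7 dB average difference means happy fragments have roughly
# 2.24 times the peak amplitude of sad ones.
ratio = db_to_amplitude_ratio(7)
```

The amplitude ratio shows why the ‘Amplify’ setting matters so much: a 7 dB step is far from subtle.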

(The specific adjustments made to the sentences can be found here: Extra zinnen maken)

With Google Forms a survey can be created collaboratively; it is a tool for collecting information. Audio fragments (via video) can be inserted and any type of question can be written. After enough responses have been received from the required participants, the information can be collected in a spreadsheet and exported to a .xlsx document for Microsoft Office Excel (Google Inc., 2014).

Microsoft Office Excel is a spreadsheet program. The spreadsheet can be imported into Stata, where it is treated as data in different variables. Stata is used to interpret and analyze the data.


  • Design

Before the experiment was executed, a power analysis was done to get an idea of how many participants might be needed. The power was set to 0.8 at a significance level of 0.05. For a moderate effect size, 64 participants per condition were needed for a two-tailed t-test and 51 participants per condition for a one-tailed t-test. A one-tailed t-test is suitable for this experiment because either no relation or a positive relation was expected; a negative relation was not. Since there is only one direction in which an effect could occur, a two-tailed t-test is not needed. However, the effect size is likely to be small rather than moderate. A second power analysis showed that for a small effect size, 394 participants per condition would be needed for a two-tailed t-test and 310 for a one-tailed t-test. Because the resources and time to recruit that many participants were not available, the experiment was executed with 51 participants per condition.
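The power analysis above can be approximated with only the standard library. This sketch uses the normal approximation rather than the exact noncentral t distribution that power software uses, so it lands one participant below the reported 64/51/394 values (and matches 310 exactly); it is meant to show the calculation, not to replace the study's tool:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05,
                power: float = 0.8, two_tailed: bool = True) -> int:
    """Participants per condition for a two-sample t-test (normal approx.):
    n = 2 * ((z_alpha + z_beta) / d) ** 2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

moderate_two = n_per_group(0.5, two_tailed=True)   # 63 (t-based: 64)
moderate_one = n_per_group(0.5, two_tailed=False)  # 50 (t-based: 51)
small_two = n_per_group(0.2, two_tailed=True)      # 393 (t-based: 394)
small_one = n_per_group(0.2, two_tailed=False)     # 310
```

Cohen's conventional effect sizes d = 0.5 (moderate) and d = 0.2 (small) are assumed here, since they reproduce the reported participant counts.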

For the experiment a between-subjects design was used. The two conditions were an emotionally loaded voice and a neutral voice; each group heard only one of the conditions. These two conditions make up the independent variable, which is therefore categorical. The dependent variables were likeability, animacy, and persuasiveness, each measured on a Likert scale from 1 to 5; all dependent variables are interval variables. The dependent variables were composed of several questions in the questionnaire: likeability of questions 27, 30, 35, 38, and 40; animacy of questions 28, 31, 32, and 34; and persuasiveness of questions 29, 33, 36, and 37.

Because the survey is about water consumption and persuadability, two covariates were included that could influence how participants responded to Will’s comments: how easily a participant is convinced and how much they care about the environment. Both covariates are composed of several questions in the questionnaire: how easily someone is convinced of questions 5, 8, 13, and 18; how much someone cares about the environment of questions 4, 6, 10, 11, 12, 15, 17, and 19.
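The scale composition above amounts to averaging each group of Likert items. A sketch under our own assumed data layout (a dict mapping question number to a participant's 1-5 response; the study itself did this in Stata):

```python
# Item groups as listed in the Design section.
SCALES = {
    "likeability":       [27, 30, 35, 38, 40],
    "animacy":           [28, 31, 32, 34],
    "persuasiveness":    [29, 33, 36, 37],
    "easily_convinced":  [5, 8, 13, 18],                   # covariate
    "cares_environment": [4, 6, 10, 11, 12, 15, 17, 19],   # covariate
}

def scale_scores(responses: dict) -> dict:
    """Average the Likert items belonging to each scale for one participant."""
    return {name: sum(responses[q] for q in items) / len(items)
            for name, items in SCALES.items()}
```

A participant answering 3 on every question would score exactly 3.0 on every scale, which is a quick sanity check for the item mapping.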

The questions of the questionnaire can be found at Opzet onderzoek 2.0


  • Procedure

Each participant received a questionnaire consisting of three parts. The first part was a general one asking for demographic information, along with questions about two personal characteristics; some filler questions were included so that participants would not immediately guess the research goal. The second part consisted of a simulation in which participants answered questions about their showering habits; after each answer, an audio fragment was played that gave either positive or negative feedback. The third part contained questions about how the voice was experienced. Finally, some questions were added to give an indication of general matters such as concentration and comprehensibility.


  • Participants

The participants (n = 101) were personally asked to fill in the questionnaire. They were recruited from our lists of Facebook friends. Facebook was used to avoid including elderly people who are unable to change their showering habits because they live in a care home. In addition, only participants older than 18 were asked, since younger people often do not pay their own energy bill.

From the original data set, three participants were removed because they submitted the questionnaire twice. One additional person was removed because this participant commented that he had not understood the questions about Will. Furthermore, participants who totally disagreed when asked whether they master the English language were removed, as were participants who were totally distracted while filling in the questionnaire. This left 94 participants.

Each participant received either the questionnaire with the neutral voice (n1 = 46) or the one with the emotionally loaded voice (n2 = 48). The condition with the neutral voice is called category 1; the other condition is called category 2.

The first category consisted of 15 men and 31 women, aged between 18 and 57 (M = 27.5, SD = 12.2). The second category consisted of 18 men and 30 women, aged between 18 and 57 (M = 29.6, SD = 13.3). The highest completed degree of education varied among primary school (0.0%; 4.2%), mavo (2.2%; 2.1%), havo (4.4%; 8.3%), vwo (39.1%; 31.3%), mbo (6.5%; 12.5%), hbo (15.2%; 22.9%), and university (32.6%; 18.8%), where the first value is for category 1 and the second for category 2.

Results

As stated in the introduction, three concepts of perception are used in this research: persuasion, likeability, and animacy. The scales for these concepts were tested on reliability by calculating Cronbach's alpha; the values were 0.85, 0.89, and 0.85 respectively. The concepts are discussed in that order.
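The reliability check above uses the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The study computed it in Stata; a pure-Python sketch of the same formula:

```python
from statistics import pvariance

def cronbach_alpha(items: list) -> float:
    """Cronbach's alpha for `items`, a list of k item-score lists
    (one list of participant responses per questionnaire item)."""
    k = len(items)
    item_var = sum(pvariance(scores) for scores in items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - item_var / pvariance(totals))

# Perfectly consistent items (every participant answers all items
# identically) yield the maximum alpha of 1.0:
consistent = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
```

Values of 0.85-0.89, as reported above, are conventionally taken to indicate good internal consistency.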

To test whether persuasion of the voice was perceived differently between the two conditions, an ANOVA was performed. This resulted in p = 0.96 and [math]\displaystyle{ \eta ^2 }[/math] = 0.00003. A graphical representation of this test can be seen in figure 1. A second test included only participants who said that they are willing to adjust their showering habits; in this case the p-value was 0.95 and [math]\displaystyle{ \eta ^2 }[/math] = 0.00005.

It was also tested whether persuasion differed between the conditions for participants who heard at least four positive audio fragments (p = 0.43; [math]\displaystyle{ \eta ^2 }[/math] = 0.03) or at least four negative audio fragments (p = 0.69; [math]\displaystyle{ \eta ^2 }[/math] = 0.02).

Figure 1: Persuasion per condition


The third test examined whether participants in the emotion condition rated the voice as more likeable than participants in the neutral condition. Since not all assumptions for an ANOVA were met (normality across both conditions was rejected), the non-parametric Kruskal-Wallis test was performed. This gave a p-value of 0.17 and [math]\displaystyle{ \eta ^2 }[/math] = 0.02. The effect size was calculated with the formula [math]\displaystyle{ \eta ^2=\frac{\chi ^2}{N-1} }[/math] (http://oak.ucc.nau.edu/rh232/courses/EPS625/Handouts/Nonparametric/The%20Kruskal-Wallis%20Test.pdf). Figure 2 shows these results.
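The effect-size formula quoted above is a one-liner. In the sketch below, the chi-square value is back-computed from the reported [math]\displaystyle{ \eta ^2 }[/math] = 0.02 with N = 94 purely for illustration; the study does not report the chi-square statistic itself:

```python
def kruskal_wallis_eta_squared(chi2: float, n: int) -> float:
    """Effect size for a Kruskal-Wallis test on n observations:
    eta squared = chi squared / (n - 1)."""
    return chi2 / (n - 1)

# chi2 = 1.86 with n = 94 recovers the reported eta squared of 0.02.
eta2 = kruskal_wallis_eta_squared(1.86, 94)
```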

As a follow-up, the Kruskal-Wallis test was executed twice more: once with only participants who heard at least four positive audio fragments (p = 0.02; [math]\displaystyle{ \eta ^2 }[/math] = 0.22), and once with only participants who heard at least four negative audio fragments (p = 0.73; [math]\displaystyle{ \eta ^2 }[/math] = 0.01). The effect of mainly hearing positive audio fragments on likeability is shown in figure 3.

Figure 2: Likeability per condition. Figure 3: Likeability per condition when the heard audio fragments were mostly positive


Finally, a Kruskal-Wallis test was done to test whether there is a difference in animacy between the two conditions; it was chosen because the ANOVA assumptions were not met (equal variance for both groups was rejected). The p-value found was 0.02 and [math]\displaystyle{ \eta ^2 }[/math] = 0.06. The difference can be seen in figure 4.

To test whether the type of emotion heard influences perceived animacy, participants who heard at least four positive fragments (p = 0.16; [math]\displaystyle{ \eta ^2 }[/math] = 0.08) were tested, as well as participants who heard at least four negative fragments (p = 0.17; [math]\displaystyle{ \eta ^2 }[/math] = 0.21).

Figure 4: Animacy per condition

Discussion and conclusion

Sources

Acapela Group. (2009, November 17). Acapela Box. Retrieved from Acapela: https://acapela-box.com/AcaBox/index.php

Acapela Group. (2014, September 26). How does it work? Retrieved from Acapela: http://www.acapela-group.com/voices/how-does-it-work/

Audacity. (2014, September 29). Audacity. Retrieved from Audacity: http://audacity.sourceforge.net/?lang=nl

Bowles, T., & Pauletto, S. (2010). Emotions in the voice: humanising a robotic voice. York: The University of York.

Breazeal, C. (2001). Emotive Qualities in Robot Speech. Cambridge, Massachusetts: MIT Media Lab.

Google Inc. (2014, April 14). Google Forms. Retrieved from Google: http://www.google.com/forms/about/

Liscombe, J. J. (2007). Prosody and Speaker State: Paralinguistics, Pragmatics, and Proficiency. Columbia: Columbia University.

Royakkers, L., Damen, F., Est, R. V., Besters, M., Brom, F., Dorren, G., & Smits, M. W. (2012). Overal robots: automatisering van de liefde tot de dood.

Scherer, K. R., Ladd, D. R., & Silverman, K. E. (1984). Vocal cues to speaker affect: Testing two models. The Journal of the Acoustical Society of America, 76(5), 1346-1356.

Williams, C. E., & Stevens, K. N. (1972). Emotions and Speech: Some Acoustical Correlates. Cambridge, Massachusetts: Massachusetts Institute of Technology.