PRE2017 4 Groep7

In the designed prototype, the only action the agent can take that affects the user is showing or hiding messages. However, it could prove advantageous to have the agent operate in more ways than that. If, for example, the user receives many messages about an upcoming group meeting and the agent has access to the user's timetable, the agent could easily group these messages into one category. Conversely, if a few group members schedule an appointment and invite the user over WhatsApp, the agent could add an event to the user's calendar, a useful tool for people who tend to forget to schedule planned meetings. These new ways of acting also have a downside: they add complexity to the agent. For example, each course a user takes has different relevant keywords, so the dataset would have to contain keywords for each course, which could slow the agent down.
This challenge will not be tackled in this study, but research into it could be useful for future studies.

Revision as of 11:22, 27 May 2018

0LAUK0 - Group 7

Group Members

- Bas Voermans | 0967153

- Julian Smits | 0995642

- Tijn Centen | 1006867

- Bart van Schooten | 0999971

- Jodi Grooteman | 1006743

- Emre Aydogan | 0902742

Planning

The planning can be found here:

Planning PRE2017 4 Groep7

Problem Statement

A personal assistant (PA) works closely with a person to provide administrative support, usually on a one-to-one basis. A PA helps a person make the best use of their time by limiting the time spent on secretarial and administrative tasks. Unfortunately, the luxury of a personal assistant is reserved for the rich and successful, because of the one-to-one nature of the work and the extensive knowledge usually required to perform PA tasks well. This study focuses on one aspect of a PA: scanning incoming messages and notifying the person only of noteworthy ones. The users in this “USE” study are defined as students in the Netherlands. Because the main means of communication between students is WhatsApp Messenger, it is a good starting point for relieving students of the growing expectation that they are always reachable. Currently, WhatsApp Messenger uses a notification system that lets you turn notifications on or off for a certain group or person. In most cases this is far from ideal: if a group has relatively few relevant messages, one would be inclined to switch off its notifications altogether, but a message in that group that is directly relevant to the user would then probably be missed. The goal of this study is to design a software agent that can distinguish which messages are relevant to a student's academic activities and notify the user accordingly. The student would effectively have a personal assistant whose role is to manage their WhatsApp.

Users

Who are the users?

This research is meant for users who have to weed through countless notifications while deciding what is important to them and what is not. Users who deal with many such notifications are therefore our main target. The research focuses mainly on students, which makes it easier to define the needs and requirements of the group, since the researchers are familiar with it.

Requirements of the users

  • The system should run on pre-owned devices.
  • The system should filter important information out of incoming messages.
  • The system should tune its intrusiveness based on the user's feedback.


USE Aspects

In this chapter we look at the potential impact of the product of our research. If our product fully works and solves the problem described in the problem statement, it can have a great impact on its users and on society as a whole. Below we describe the impact our product can have on users, society, possibly relevant enterprises and the economy.

Users

The users of the product will, as described above, primarily be students, but the group can be extended to anybody with a smartphone who receives more messages than desired yet does not want to miss any potentially important ones. When a person no longer has to spend time reading all seemingly unimportant messages, or scanning through them for important ones, they have more time for the things they want to spend it on. This is a positive effect of our product, as it allows the user to focus on their core business. However, our product might also have other effects on the user. Scanning text messages, or text in general, for relevant information can be a valuable skill, as it also applies in other scenarios, such as scanning scientific articles or reports for important information. When an AI takes care of this task, users might lose the skill. This might hinder them in those other scenarios, where the AI possibly cannot help them find the important information. Another negative consequence might occur when the AI does not work perfectly but the user trusts that it does. In this scenario the user might miss an important message, which can have serious consequences. In a work environment this can mean that the user is not informed about a (changed) deadline or meeting. In a social environment this can lead to irritation or even a quarrel.

Society

When we talk about society, we mean all people - users and non-users of the product - combined, and everything that comes with that. To assess the impact our product might have on society, we look at how relations between individuals change, as well as how society as a whole behaves. The consequences described above can be extended to the level of society. If people become more productive, society certainly benefits, as more can be accomplished. The possibility that people lose the ability to quickly scan text for important information can also have an impact on society. If an entire generation grows up like this, there will be nobody to teach the skill to younger generations, meaning that society as a whole will lose it. One can question how relevant such a skill will still be in a future society, but it is a loss nonetheless. Another thing that might occur when a large public uses our product is that nobody reads the seemingly unimportant messages any longer. If nobody reads them anymore, those who write them will probably stop doing so, removing the purpose of our product.

Enterprise

Possibly relevant enterprises are those interested in buying our product. This could be a company like WhatsApp itself, which would want to integrate the product into its own application, or a third party that wants to publish it as a standalone application. These companies, especially a third party, would want to make a profit from such an application. Companies like WhatsApp could offer it as a free service to make sure users keep using their application and possibly to attract new users. Third-party companies cannot do this and would need to find another way to make a profit from the application. An easy solution seems to be to charge for the application.

Economy

Our product will reduce costs for users. Many people do not have the time or the desire to filter out the most important information themselves, and could use a personal assistant for this task. Our software will be less expensive than a personal assistant, which saves money. A disadvantage is that personal assistants will have less work: if people use our software instead of a personal assistant for this particular task, personal assistants are no longer needed for it.

Approach

To start off, research into the state of the art will be done to acquire the knowledge needed for a good study of what the desired product should be. Next, an analysis will be made of the User, Society and Enterprise (USE) aspects with their associated advantages and disadvantages. At this point the description of the prototype will be worked out in detail and building the prototype will start. At the same time, research will be done to analyse the different approaches to filtering incoming messages and the impact they have. The results of the research will be implemented in the prototype. When the prototype is complete, the goal of the project will be reflected upon and some further improvements to the prototype can be made.


Used resources

The used resources can be found here: Used resources

Prototype description

To get a good understanding of what kind of prototype is required for the described problem and the given user, a concrete goal needs to be described that will fulfill a good selection of the user requirements described in the section above. After a concrete goal is described a prototype design needs to be created to solve the problem described in the problem statement.

Goal

The goal that the prototype should fulfill is dependent on the user requirements that have been described. Since it is not possible to create a prototype that is able to achieve all requirements in the current planning, a selection of important requirements will be chosen that are to be implemented in the prototype. The rest of the requirements are going to be analysed and researched in a written manner to still be able to give insights in their importance to the user.

The requirements that are chosen for the prototype are the following:

  • The system should run on pre-owned devices
  • The system should filter important information out of incoming messages
  • The system should tune its intrusiveness based on the user's feedback

The prototype will therefore be a software module that can be integrated into existing messaging applications, such as WhatsApp, Telegram or other applications used to send and receive messages within a large group of people. The module will output a binary value indicating whether a message is important or unimportant. To determine the importance, the system should base its reasoning on feedback that the user gives during setup or usage of the application.

Design

To achieve the goal described above, two prototype design variations will be created so that their effectiveness can be analysed. The first variation uses keyword-based filtering, which has the advantage of an understandable filtering process, since the keywords support the reasoning. The second variation uses machine learning in the form of a recurrent neural network (RNN), which is often used for text-based machine learning. These two subsystems will be integrated into a larger system that also removes clearly identifiable spam and couples closely related messages into threads.

Prototype structure

Input Output interface

The required input for the filtering module should be as abstract as possible to support as many different messaging applications as possible. However, there should be consistency in the input format. Not only the message itself is important, but also the metadata like the date and time, the sender, whether a message is a response to a different message and whether any media like images is coupled with the message. The prototype will not be able to analyse any coupled media but the information of media being present can still be useful for filtering. Messages are inputted in batches, just like they are for unread notifications. The messages in a batch should all come from the same group chat since the messages could be coupled with each other. The module will then process this batch without taking other batches into account. The output of the filtering module will be a boolean value indicating for every individual message, whether the message should be shown to the user or should be discarded.
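The interface described above could be sketched in Java roughly as follows. This is a minimal illustration, not the prototype's actual code: the names ChatMessage, MessageFilterModule and ShowNonEmptyModule, as well as the field names and types, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** One incoming message plus the metadata listed above (all names are illustrative). */
final class ChatMessage {
    final long timestamp;       // date and time of sending, e.g. as epoch milliseconds
    final String sender;
    final String text;
    final Integer replyToIndex; // index in the batch of the message replied to, or null
    final boolean hasMedia;     // the media itself is not analysed, only its presence

    ChatMessage(long timestamp, String sender, String text,
                Integer replyToIndex, boolean hasMedia) {
        this.timestamp = timestamp;
        this.sender = sender;
        this.text = text;
        this.replyToIndex = replyToIndex;
        this.hasMedia = hasMedia;
    }
}

/** Takes one batch from a single group chat and returns one boolean per message. */
interface MessageFilterModule {
    List<Boolean> filter(List<ChatMessage> batch);
}

/** Placeholder module: shows every message that has text or media attached. */
final class ShowNonEmptyModule implements MessageFilterModule {
    @Override
    public List<Boolean> filter(List<ChatMessage> batch) {
        List<Boolean> show = new ArrayList<>();
        for (ChatMessage m : batch) {
            show.add(!m.text.trim().isEmpty() || m.hasMedia);
        }
        return show;
    }
}
```

Keeping the output a plain list of booleans, aligned by position with the input batch, leaves the host messaging application free to decide how shown and hidden messages are presented.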

Spam filter

The first step in analyzing the messages is to filter out the spam. The purpose is to cut out messages that have no real influence on the context. Smileys, for example, are mostly unimportant, so a message that consists only of smileys can be categorized as spam and filtered out. In this step the program looks only at the literal text of a message. For example, a message with a strange combination of letters would be filtered out. The program thus pays no attention to the meaning of a message, only to its literal content. Filtering out spam before the analysis is important because it means that messages without any influence do not have to be analyzed at all.
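A content-only check of this kind could look like the sketch below. The two heuristics (no letters or digits at all; a single long token without vowels) and the class name SimpleSpamFilter are assumptions chosen for illustration, not the rules used in the prototype.

```java
import java.util.Locale;

final class SimpleSpamFilter {

    /** Returns true when the literal text suggests the message carries no content. */
    static boolean isSpam(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return true;
        }
        // Heuristic 1 (assumed): only emoji, smileys or punctuation, no letters or digits.
        boolean hasLetterOrDigit =
                trimmed.codePoints().anyMatch(Character::isLetterOrDigit);
        if (!hasLetterOrDigit) {
            return true;
        }
        // Heuristic 2 (assumed): one token longer than three characters without any vowel,
        // as a crude stand-in for "a strange combination of letters".
        boolean singleToken = !trimmed.contains(" ");
        boolean hasVowel = trimmed.toLowerCase(Locale.ROOT).matches(".*[aeiouy].*");
        return singleToken && trimmed.length() > 3 && !hasVowel;
    }
}
```

Text smileys that contain letters, such as ":D", pass the first heuristic, so a real filter would probably need an explicit smiley list as well.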

Categorization

The messages will also be categorized in groups that are concerned with the structural meaning of the sentence. These are for example questions, answers or announcements. By extracting this information from the messages the program can give the user even more options to filter the incoming messages. A certain user might only be interested in announcements and not in questions. This can be indicated and the appropriate messages can be shown or discarded without going through the next layers of the program. When a user does not give a preference for a certain category the messages will be propagated to the next layer, which is the thread layer.

Coupling of related messages

After the clearly identifiable spam messages have been discarded and the categories have been detected, the remaining messages can be coupled together in so-called threads. This is done to retain important information that could be spread over multiple messages. One factor that could indicate a thread is the time of sending, since messages sent within a short timespan will most likely concern the same subject. Another factor is the sender, since information usually comes from one person and is intended for all the others. The last factor is whether a message is a reply to a different message. Some messaging applications support this feature and link the message being replied to with the new message. Two messages linked in this way most likely need to be coupled together.

These coupled messages are then combined in such a way that the filtering in the next step will take the combined messages into account before determining the importance of the message.

Filtering

Now the program categorizes the coupled messages into two groups: important messages and unimportant messages. There are multiple ways of doing this, but the prototype will involve only two of them, namely keyword-based filtering and recurrent neural networks.

Keyword based filtering

The first method is keyword-based filtering. This method uses a predefined list of important keywords. Every message is checked and given a score based on how many important keywords it contains. When the score of a message exceeds a certain threshold, the message is placed in the group of important messages.

Messages are evaluated in a few steps. First, the program checks whether the message contains one of the following words: who, what, when, where. Checking for these words already yields a lot of information about the message. The next step is to analyze what kind of word follows the W word. For example, a sentence that ends with "who" might not be as important as a sentence that starts with "who", because the sentence ending with "who" is not a question and thus might not carry as much meaning as the other one. A message that ends with "who" is also not grammatically correct, which indicates a low priority. In addition, the length of the message is taken into account: most of the time, the longer the message, the more important it is.
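The scoring steps above can be illustrated with a small sketch. The keyword list, the weights and the length cap of 20 words are hypothetical values picked for this example, and KeywordScorer is an illustrative name.

```java
import java.util.Locale;
import java.util.Set;

final class KeywordScorer {
    // Hypothetical keyword list; a real list would be tuned to the user and their courses.
    static final Set<String> KEYWORDS = Set.of("deadline", "meeting", "exam", "lecture");
    static final Set<String> W_WORDS = Set.of("who", "what", "when", "where");

    /** One point per keyword, one point for an opening W word, plus a capped length bonus. */
    static double score(String message) {
        String[] words = message.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        double score = 0;
        int count = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token when the text starts with punctuation
            }
            if (count == 0 && W_WORDS.contains(word)) {
                score += 1.0; // a W word only counts when it opens the message
            }
            if (KEYWORDS.contains(word)) {
                score += 1.0;
            }
            count++;
        }
        score += Math.min(count, 20) / 20.0; // longer messages score slightly higher
        return score;
    }

    /** A message is important when its score exceeds the threshold. */
    static boolean isImportant(String message, double threshold) {
        return score(message) > threshold;
    }
}
```

With these assumed weights, "When is the exam?" scores well above a short message such as "lol ok", so a threshold between the two separates them.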

Recurrent neural networks

The second method is a recurrent neural network. This method learns to categorize messages and therefore needs training data. There are two ways of obtaining this data. The first is to label messages by hand and use them to train the neural network. The second is to give a set of messages to the user of our product and let the user categorize them. This creates personalized training data for each user, so a neural network trained on it is also personalized to that user. Combining the two sources of training data is the best option, because the neural network then has more data to train on and is not fully personalized. Not being fully personalized is an advantage, because the network would otherwise rely entirely on the user's own categorization, and if the user were unable to categorize the messages well, the program would perform badly. Using both sources of training data avoids this. A neural network categorizes a certain percentage of messages correctly. One option is to let the user specify a target percentage and have the network keep learning until this percentage is reached.

Both options of filtering the messages can be used separately or combined. An analysis will be performed when both filters are finished and based on that analysis the evaluation function will be created, which is explained in the next section.

Evaluation function

The evaluation subsystem evaluates the incoming messages using the results of the different filtering options. Depending on those results, a different evaluation function can be chosen. Some ideas for the evaluation function are: choosing the result of only one of the filters, taking the average, taking the maximum or minimum, or looking at the magnitude of the difference. The evaluation subsystem also allows for personalization, since users can indicate how many messages should be filtered out; this can be transformed into a threshold that is compared with the result of the evaluation function. Furthermore, personalization can be applied by asking for feedback. Users will most likely not want to give feedback on every filtered message, so the results of the two filtering options could be used to estimate how certain the system is about a message. If, for example, the difference between the two filtering options exceeds a value that can be set indirectly by the user, the program can show the message and ask whether it is useful.
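The candidate evaluation functions listed above can be sketched as follows. The sketch assumes that both filters produce a score normalised to the range 0 to 1, which the text does not state; the class, enum and method names are illustrative.

```java
final class EvaluationFunction {

    /** The ways of merging two filter results that are mentioned in the text. */
    enum Mode { KEYWORD_ONLY, NETWORK_ONLY, AVERAGE, MAXIMUM, MINIMUM }

    /** Merges the keyword score and the network score into one value. */
    static double combine(double keywordScore, double networkScore, Mode mode) {
        switch (mode) {
            case KEYWORD_ONLY: return keywordScore;
            case NETWORK_ONLY: return networkScore;
            case AVERAGE:      return (keywordScore + networkScore) / 2.0;
            case MAXIMUM:      return Math.max(keywordScore, networkScore);
            case MINIMUM:      return Math.min(keywordScore, networkScore);
            default: throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    /** The user's filtering preference is assumed to be expressed as a threshold. */
    static boolean isImportant(double combined, double userThreshold) {
        return combined >= userThreshold;
    }

    /** Ask for feedback when the two filters disagree by more than the allowed amount. */
    static boolean shouldAskFeedback(double keywordScore, double networkScore,
                                     double maxDisagreement) {
        return Math.abs(keywordScore - networkScore) > maxDisagreement;
    }
}
```

For instance, with a keyword score of 0.2 and a network score of 0.8, the average is 0.5, and with an allowed disagreement of 0.5 the message would be put to the user for feedback.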

Prototype progress

The following section will show the progress of the prototype over multiple iterations. Each iteration is approximately one week and will contain the actions done in bullet points as well as a written summary of the implementation with occasional images.

Iteration 1

Class diagram showing the created structure with layers
  • Created structure with class diagram
  • Implemented the base structure from the class diagram in Java
  • Started on question sentence detection for categorization
  • Started on thread layer with a hierarchical clustering algorithm
  • Started on UI

This iteration marks the start of building the prototype, so the first action was to create a structure flexible enough to change the order of filters and other layers later on. For this, a class diagram was made, inspired by the prototype structure from the prototype design. The class diagram shows that an abstract layer class is the parent of all layers in the prototype, which makes it possible to restructure the layers on the go when necessary. A layer class is very basic: it only has a child layer and some methods for processing the messages and propagating them to the child layer. The filter layer extends the layer class and has an extra ‘alternative layer’ that receives the messages that were filtered out. It has an abstract method that handles the filtering and is implemented by the subclasses, which for now are the spam filter and the categorization filter. Next, there is a thread layer that can use different clustering algorithms for coupling related messages. For now only the hierarchical clustering algorithm is implemented, using the properties time and sender, but other algorithms could be implemented to see which works best. The evaluation layer provides an abstract structure that supports multiple evaluation methods for determining the degree of importance. The keyword evaluation is the evaluation method that will be implemented next. After all evaluation methods have processed the messages, an evaluation function needs to merge the results into a single value and determine whether the message is important or not. The last layer is the output layer, which catches all messages that are output at the different layers, such as the spam filter or the evaluation layer, and returns the collected messages in order, together with an importance result.

The complete structure that can be seen in the image is already implemented in the programming language Java. This language is chosen since Java is also used for the Android operating system which is very open and could allow the prototype to be inserted and read from the incoming notifications. Furthermore Java is well known by the team. The prototype is already able to process messages created by hand, since dummy implementations have been made for all the layers. This allows implementation of some layers while the other layers might not work as intended yet. The layers that do have some implementation are the categorization filter and the thread layer.
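The layered structure described above could be sketched as below. This is an illustration under simplifying assumptions, not the prototype's source: messages are plain strings, only the layer, filter layer and output layer are shown, and all class and method names are guesses based on the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Base layer: processes messages and passes them on to its child layer. */
abstract class Layer {
    private Layer child;

    /** Sets the next layer and returns it, so that layers can be chained. */
    Layer setChild(Layer child) {
        this.child = child;
        return child;
    }

    abstract void process(List<String> messages);

    /** Forwards messages to the child layer, if there is one. */
    protected void propagate(List<String> messages) {
        if (child != null) {
            child.process(messages);
        }
    }
}

/** Filter layer: kept messages go to the child, filtered-out ones to the alternative layer. */
abstract class FilterLayer extends Layer {
    private Layer alternative;

    void setAlternative(Layer alternative) {
        this.alternative = alternative;
    }

    /** Implemented by concrete filters such as a spam or categorization filter. */
    abstract boolean keep(String message);

    @Override
    void process(List<String> messages) {
        List<String> kept = new ArrayList<>();
        List<String> dropped = new ArrayList<>();
        for (String message : messages) {
            if (keep(message)) {
                kept.add(message);
            } else {
                dropped.add(message);
            }
        }
        if (alternative != null) {
            alternative.process(dropped);
        }
        propagate(kept);
    }
}

/** Output layer: collects whatever reaches it. */
class OutputLayer extends Layer {
    final List<String> collected = new ArrayList<>();

    @Override
    void process(List<String> messages) {
        collected.addAll(messages);
    }
}
```

Because every layer only knows its child (and, for filters, its alternative), the order of the layers can be changed by rewiring setChild and setAlternative calls without touching the layer implementations.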

Categorization filter

For the categorization filter, work has started on detecting questions as the first category. For now, question categorization uses a list of words that often indicate a question when they are placed at the beginning of a sentence. Different words receive a different number of points, since some words always indicate a question and other words only occasionally. Furthermore, a message can contain multiple sentences of which only one is a question. To detect this, each sentence is processed individually, and a word indicating a question at the start of any sentence is detected. This works better than looking only at the first word of the message, since some questions are preceded by a short sentence that introduces them. An example can be seen in the results table for the message sent at time 6. The last feature that is detected is a question mark at the end of a sentence, which also gives the sentence some points, indicating a higher resemblance to a question. After detection, the points are compared with a threshold, and if the number of points is greater than the threshold, the sentence is classified as a question. In later iterations the classification will be improved to handle other forms of questions, and other categories will be added. Below are the results for fifteen messages, five of which are questions. Four out of five questions are correctly classified as a question.

Time | Sender | Message | Question categorization | Expected answer
0 | John | test | No | No
1 | John | spam | No | No
2 | Jane | this is spam | No | No
3 | Jan | real message | No | No
4 | Henk | real good message | No | No
5 | John | Is this good? | Yes | Yes
6 | Henk | hello! shall we go to the beach? | Yes | Yes
10 | John | Are you attending the lecture? | Yes | Yes
11 | Jane | Yes, I am! | No | No
12 | Henk | Yes, I am too! | No | No
14 | Jan | No, I am on holiday | No | No
16 | John | When will you be back? | Yes | Yes
19 | Jan | I will be back tomorrow | No | No
21 | Jane | Any of you know the answer to question 5? | No | Yes
30 | Jane | ??? | No | No
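The point-based question detection described in this section can be sketched as follows. The word lists, the point values and the threshold are assumptions chosen so that the sketch classifies the messages in the table above the same way; they are not the values used in the prototype.

```java
import java.util.Locale;
import java.util.Set;

final class QuestionDetector {
    // Assumed word lists: words that always open a question get more points
    // than words that only sometimes do.
    static final Set<String> STRONG_STARTS = Set.of("who", "what", "when", "where", "why", "how");
    static final Set<String> WEAK_STARTS = Set.of("is", "are", "shall", "do", "does", "can", "will");
    static final int THRESHOLD = 1; // a sentence needs more points than this

    /** Points for one sentence: opening word plus a trailing question mark. */
    static int points(String sentence) {
        String s = sentence.trim();
        if (s.isEmpty()) {
            return 0;
        }
        String first = s.split("\\s+")[0].toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        int points = 0;
        if (STRONG_STARTS.contains(first)) {
            points += 2;
        } else if (WEAK_STARTS.contains(first)) {
            points += 1;
        }
        if (s.endsWith("?")) {
            points += 1;
        }
        return points;
    }

    /** A message is a question when any of its sentences scores above the threshold. */
    static boolean isQuestion(String message) {
        // Split after sentence-ending punctuation while keeping the punctuation itself.
        for (String sentence : message.split("(?<=[.!?])\\s+")) {
            if (points(sentence) > THRESHOLD) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumed weights, "Any of you know the answer to question 5?" only earns the question-mark point and stays below the threshold, which mirrors the misclassification in the table; adding "any" to one of the word lists would be one way to catch it.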

Thread layer

Work has also started on the layer responsible for coupling related messages, with the addition of a clustering algorithm called hierarchical clustering. The hierarchical clustering algorithm starts with each message as a separate cluster, looks in each iteration for the clusters with the smallest ‘distance’ between them, and combines them into one cluster. This distance is determined by the Euclidean distance function over the properties time and sender. For the time property, the difference in time between each pair of messages is computed, while the sender distance is determined by the distance function described in Distance function for clustered categories. The hierarchical clustering algorithm expects a depth, which indicates the number of iterations used to cluster the messages. If this value is too low, very few messages will be clustered, which yields no extra information, while a high value will put many messages in the same cluster, which is practically the same as not clustering at all. A well-chosen depth is thus required: the entropy of the clustering is high only at intermediate depths and low when the depth is either too high or too low. Testing on the dataset below showed that expressing the depth as a fraction of the number of messages works better than using a fixed value, and that the depth worked best at three quarters of the number of messages. However, further refinement is required on different datasets and when extra properties are added to the distance function. The results in the table show that two threads were created. The thread with index 2 is particularly notable: its messages lie further apart in time than those of the other thread, which shows that the sender distance function also does its work.

Time | Sender | Message | Thread id
0 | John | test | 0
1 | John | spam | -1
2 | Jane | this is spam | -1
3 | Jan | real message | 0
4 | Henk | real good message | 0
5 | John | Is this good? | 0
6 | Henk | hello! shall we go to the beach? | 0
10 | John | Are you attending the lecture? | 2
11 | Jane | Yes, I am! | 2
12 | Henk | Yes, I am too! | 2
14 | Jan | No, I am on holiday | 2
16 | John | When will you be back? | 2
19 | Jan | I will be back tomorrow | 2
21 | Jane | Any of you know the answer to question 5? | -1
30 | Jane | ??? | -1
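The clustering loop described in this section can be sketched as follows. To keep the example short it clusters on the time property only, using the difference between cluster means; in the prototype the sender distance would enter the Euclidean distance as a second term. Rounding the three-quarters depth down and the name ThreadClustering are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class ThreadClustering {

    /** Depth as three quarters of the number of messages (rounded down, an assumption). */
    static int depthFor(int messageCount) {
        return messageCount * 3 / 4;
    }

    /** Starts with one cluster per timestamp and merges the two closest clusters 'depth' times. */
    static List<List<Double>> cluster(List<Double> times, int depth) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double t : times) {
            List<Double> single = new ArrayList<>();
            single.add(t);
            clusters.add(single);
        }
        for (int step = 0; step < depth && clusters.size() > 1; step++) {
            int bestA = 0;
            int bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = Math.abs(mean(clusters.get(a)) - mean(clusters.get(b)));
                    if (d < best) {
                        best = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // bestA < bestB, so removing bestB does not shift the index of bestA.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

For six messages sent at times 0, 1, 2, 10, 11 and 12, depthFor gives four merges, which leaves two clusters: the first three and the last three messages.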


GUI Design

To make for a pleasant way of interacting with the PA prototype a GUI was designed in parallel with the actual implementation of the PA.

The GUI when it has started up and imported some text file

For the first iteration the functionality of the GUI was kept fairly limited. The user can import chats as .txt files through the file manager of the OS, and the GUI then shows them in a text area. The user is then given the choice of which filters to apply to this chat. The last bit of interactivity the GUI offers is the button to run the PA on the imported chat with the selected filters. This has yet to be implemented in a future version of the prototype.

After the user has pressed the 'Run PA on chat' button

What follows then is a pop up dialog that notifies the user that the analysis has been completed successfully. Furthermore, a random score is generated and shown to the user to further give an idea of how the GUI should function.

Iteration 2

Iteration 3

Preprocessing

The input the program receives is usually very raw. When analyzing emails, the input is typically clean text without typos or strange, unimportant messages in between. In contrast, the program being built has to take into account that WhatsApp messages contain many typos and that many odd messages are sent that seem meaningless at first sight. A user who is used to WhatsApp language gets to know abbreviations that do not exist in normal spoken language. Preprocessing is therefore necessary to make it much easier for the program to “read” and interpret all the messages.

Removing words like “the”, “is”, “was” and “where” might be a good idea. Generally those words, called stopwords, are not important to the meaning of a message. This would be implemented with a list of stopwords, the stoplist: all messages are scanned for stopwords, which are then removed from the message. Removing stopwords would benefit our program because it would have less clutter and fewer unimportant words to analyze. However, it would make it harder to identify questions, as it removes one of the most important parts of a message that identify it as a question. This preprocessing step is therefore better performed after the messages have been categorized, making it a processing step somewhere in the middle of the program.
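A stoplist filter of this kind could be sketched as follows. The stoplist contains only the four example words from the text; the class name and the way punctuation is handled are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopwordFilter {
    // Only the example stopwords from the text; a real stoplist would be much longer.
    static final Set<String> STOPLIST = Set.of("the", "is", "was", "where");

    /** Removes every word whose lower-cased letters match an entry of the stoplist. */
    static String removeStopwords(String message) {
        List<String> kept = new ArrayList<>();
        for (String word : message.trim().split("\\s+")) {
            // Compare without case and punctuation, but keep the original word in the output.
            String bare = word.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (!STOPLIST.contains(bare)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }
}
```

Applied to "Where is the lecture?", the sketch leaves only "lecture?", which illustrates why the step is better run after question detection: the opening question word is gone.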

An addition to the preprocessing, suggested by other research papers, is to reduce all verbs to their root form. This is called stemming. In a sentence containing the word “talking”, for example, “talking” would be replaced with “talk”, and all other forms of the verb “talk” would be replaced with “talk” as well. The set of words that has to be checked then becomes a lot smaller, because the list only needs one word instead of five different forms of that word.
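As a rough illustration of stemming, the sketch below strips a few common English suffixes. It is a deliberately naive stand-in and not an established algorithm such as the Porter stemmer; the suffix list and the minimum stem length of three letters are assumptions.

```java
import java.util.Locale;

final class NaiveStemmer {
    // Assumed suffix list, checked in this order.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    /** Lower-cases the word and strips the first matching suffix, keeping at least three letters. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

With this sketch "talking", "talked" and "talks" all map to "talk", so a keyword list only needs the single entry "talk". Irregular or doubled forms are not handled: "running" becomes "runn", which is why a real implementation would use a proper stemming library.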

Threads

To get more out of single messages, multiple messages can be coupled into threads. This lets the program analyze a conversation instead of a single message. Within a conversation the topic does not change much, so a single message may concern a topic that is not literally stated in it. Looking at the context, that is, the other messages, usually reveals the topic the message addresses. This is why threads are a great tool for analyzing messages.

The next question is when messages should be coupled together in a thread. The most important property the program takes into account is time: messages that fall within the same time interval are coupled. This is a very basic but effective implementation. The second property to be implemented is the sender of a message. When the same person sends several messages right after each other, chances are high that those messages address the same topic, so they should be put into one thread.

The paper suggests using K-means or NMF to cluster the messages. K-means works well when the clusters are hyper-spherical, which is not the case when clustering messages. In addition, both K-means and NMF require the number of clusters to be predefined, whereas the number of message clusters is not known before clustering. A third clustering algorithm is hierarchical clustering. This algorithm starts by giving every instance its own cluster and then combines clusters until it converges. This works well for our implementation because hierarchical clustering does not need a predefined number of clusters. However, a depth limit has to be specified, which could be a disadvantage for our implementation.

Clustering based on time is fairly easy because every timestamp is a number and the program can cluster on the distance between messages using the euclidean distance function or another distance function. In which the distance is the difference in time for the mean of two clusters. Clustering based on who sent the message is harder. There is not a way to do clustering on contacts using an euclidean or other similar distance measures, thus therefore a different method needed to be implemented to work with categories instead of numbers, which is described below.
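A minimal sketch of bottom-up (hierarchical) clustering on timestamps, using the difference between cluster means as the distance. It is an illustration under assumptions, not the prototype's code: the function name `cluster_by_time` is hypothetical, and a distance threshold `max_distance` is used as the stopping criterion here instead of a depth limit.

```python
from statistics import mean


def cluster_by_time(timestamps, max_distance):
    """Agglomerative clustering of numeric timestamps.

    Every timestamp starts in its own cluster. The two closest clusters
    (smallest difference between their means) are merged repeatedly until
    the closest pair is further apart than `max_distance`.
    """
    clusters = [[t] for t in sorted(timestamps)]
    while len(clusters) > 1:
        # The data is one-dimensional and sorted, so the closest pair of
        # clusters is always a pair of neighbours in the list.
        gaps = [mean(clusters[i + 1]) - mean(clusters[i])
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_distance:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


result = cluster_by_time([0, 10, 20, 500, 510, 2000], max_distance=60)
```

The number of clusters is not given in advance; it follows from the data and the chosen threshold.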

Distance function for clustered categories

To compute a reasonable distance measure for categories that are uniformly distributed, meaning that any two distinct categories are equally different from each other, inspiration was taken from the Levenshtein distance. The Levenshtein distance itself is not quite what is desired for categories such as the sender, because it also takes order and length into account. If the order or length differs between two clusters containing the same two senders, their distance could be greater than the distance between two equal-length clusters with different senders.

The next step was to make a sketch of multiple clusters with different lengths and sender configurations. Some general rules were established to get an idea of which cluster pairs should receive a higher distance than others. For example, two equal clusters should have a distance of 0 and two completely different clusters should have a distance of 1. Then the cluster ABB was compared with the clusters AB, AC and C, with the desired order of increasing distance being AB, AC, C. To establish the distance, a fraction is formed. Its denominator is the total number of messages in the two clusters together. Its numerator is the number of messages in either cluster whose sender does not occur in the other cluster. As can be seen in the table below, this gives the desired result. To balance this distance function with the other possible distance functions, a multiplication factor has been added that can scale the distance depending on the importance of the clustering categories.

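clusterDistance(c1, c2):
	// #{c1 - c2}: number of messages in c1 whose sender does not occur in c2
	diffNumber = #{c1 - c2} + #{c2 - c1}
	// #{c}: number of messages in cluster c
	totalNumber = #{c1} + #{c2}
	return diffNumber / totalNumber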

     A    B    C    AB   AC   ABB
A    0    1    1    1/3  1/3  2/4
B    x    0    1    1/3  1    1/4
C    x    x    0    1    1/3  4/4
AB   x    x    x    0    2/4  0/5
AC   x    x    x    x    0    3/5
ABB  x    x    x    x    x    0
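
The pseudocode above can be written as a runnable sketch. The assumptions here are that a cluster is represented as a sequence of sender labels (one label per message, so the string "ABB" is one message from A and two from B) and that the multiplication factor mentioned in the text is passed as a parameter, here called `weight`. The names `cluster_distance` and `weight` are hypothetical.

```python
from fractions import Fraction


def cluster_distance(c1, c2, weight=1):
    """Distance between two clusters of sender labels.

    Numerator: messages in either cluster whose sender does not occur in
    the other cluster. Denominator: total number of messages in both.
    `weight` scales the result relative to other distance functions.
    """
    senders1, senders2 = set(c1), set(c2)
    diff = (sum(s not in senders2 for s in c1)
            + sum(s not in senders1 for s in c2))
    total = len(c1) + len(c2)
    if total == 0:
        return Fraction(0)  # two empty clusters are treated as identical
    return weight * Fraction(diff, total)


# The comparison described in the text: ABB against AB, AC and C.
d_ab = cluster_distance("ABB", "AB")
d_ac = cluster_distance("ABB", "AC")
d_c = cluster_distance("ABB", "C")
```

Exact fractions are used so the results can be compared directly with the hand-computed values. For the three comparisons the distances increase in the desired order AB, AC, C.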

Classic Naive Bayes

In order to design a new technique to classify the relevance of messages, it is necessary to first look at established techniques that approximate the goal of a WhatsApp spam filter. The first technique that comes to mind is the use of Bayes classifiers.

Naïve Bayes classifiers are a popular technique in use for e-mail filtering. Typically spam is filtered using a bag-of-words technique, where words are used as tokens to calculate, according to Bayes' theorem, the probability that an e-mail is spam or not spam (ham).

Words have a certain probability of occurring in either spam or ham. The filter does not know these probabilities in advance; it needs to be trained first so they can be built up. For instance, the spam probability of words like “Sex” or “Nigerian” is generally higher than that of the names of family members and friends. Once the agent is trained, the likelihood functions are used to compute the chance that an e-mail with a particular set of words belongs to either the spam or the ham class.

One of the biggest advantages of Bayesian spam filtering is that the filter can be trained for each user, creating a personal spam filter. This training is possible because the spam a user receives correlates with that user's activities. Eventually a Bayesian spam filter will assign probabilities based on the user’s patterns. This property makes a Bayesian classifier particularly attractive for WhatsApp spam filtering, as the types of messages received vary widely between users of the app. Bayesian classifiers might also assign accurate probabilities to messages received from different groups, as the group name can be used as a token as well.
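The training and classification steps can be sketched as follows. This is a minimal bag-of-words naïve Bayes classifier with add-one (Laplace) smoothing, written for illustration only. The class name, the whitespace tokenization and the toy training messages are assumptions, and the sketch assumes that at least one message of each class has been seen before classifying.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy bag-of-words naive Bayes classifier for 'spam' versus 'ham'."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = Counter()

    def train(self, text, label):
        """Count the words of one labelled message."""
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text, label):
        """Log of prior times likelihood, with add-one smoothing."""
        vocabulary = set().union(*self.word_counts.values())
        total_words = sum(self.word_counts[label].values())
        prior = self.message_counts[label] / sum(self.message_counts.values())
        score = math.log(prior)
        for word in text.lower().split():
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        return score

    def classify(self, text):
        """Return the label with the highest score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


spam_filter = NaiveBayesFilter()
spam_filter.train("win free prize now", "spam")
spam_filter.train("free money offer", "spam")
spam_filter.train("meeting at noon tomorrow", "ham")
spam_filter.train("lunch with mom tomorrow", "ham")
```

Logarithms are summed instead of multiplying probabilities, which avoids numerical underflow when a message contains many words.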

Problems

The accuracy of Bayes classifiers increases with the number of words in each message. This poses a problem for WhatsApp messages, because these messages are generally much shorter and lack most of the formalities typically seen in e-mail, so there is simply “less to work with”. In spite of this, SMS spam filtering using naïve Bayes has had a

Research questions

Feedback before using the prototype

The easiest and probably most obvious way to get personal data from the user is to give them a form to fill in. This form would consist of some general questions like: “Are you a student?”, “If so, where are you studying?”, “Do you consider positive enforcing messages important? (think of a confirmation or a compliment)” and so on. Giving the user such a form has the advantage that the program is already personalized when it starts doing its job. Also, while the program is improved during use, it would need less feedback from the user because it already received a lot. The disadvantage of such a form is that when a user answers a question, the program makes an assumption based on the answer. For example, when the user says that he or she has football as a hobby, the program takes a list of football keywords and notifies the user whenever one of those keywords is sent. But it might be that the user plays football as a hobby and is not interested in what happens in the Champions League at all. Another disadvantage of using a form is that the user might not want to take the time to fill it in. Furthermore, when the user indicates that he or she likes everything in the form, the program will not do anything. Considering this, the choice has been made not to include a form in the beginning.

Contact biases

Considering the contacts of the user is a very important aspect of making the program more personal. This can be done in a couple of different ways.

The first option is to label each contact with a tag, for example “Peer”, “Teacher” or “Brother”. In reality there would be a lot of different tags. When a contact does not fit any of the existing tags, the user would be able to create a new tag and give that tag an importance rating. The user would assign a tag to a sender upon receiving a message from him or her. This tagging would only be done once, namely the first time the user receives a message from that sender. The advantage is that the program can use the information from the tag to get a better view of the importance of a message. The biggest disadvantage is that the user would have to do a lot of tagging in the beginning.

The other option is to consider whether the user has the contact in his or her phone and to give the contact a priority accordingly. There are three priorities for a contact: low, medium and high. When the number of the contact is already in the phone of the user, the contact is set to medium priority. When the contact’s number is not in the phone of the user, the contact is set to low. The user is able to manually adjust the priority of a contact in the user interface. This solution has the advantage that the user does not have to do a lot of work in the beginning. In addition, the program would still be able to consider the contact when determining the importance of a message, and the user would be able to customize the priorities when desired.

After taking both options into consideration, the choice has been made to implement the second option. The advantages of being able to customize when desired and not having to do a lot of work up front are decisive.
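
The chosen option amounts to a small lookup rule, sketched below. This is an illustration only: the names `Priority` and `contact_priority`, the representation of the phone book as a set of numbers, and the `overrides` dictionary for the user's manual adjustments are all assumptions.

```python
from enum import Enum


class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def contact_priority(number, phone_contacts, overrides=None):
    """Return the priority of a sender.

    A manual setting by the user wins. Otherwise a number that is stored
    in the phone gets MEDIUM and an unknown number gets LOW.
    """
    if overrides and number in overrides:
        return overrides[number]
    return Priority.MEDIUM if number in phone_contacts else Priority.LOW


# Made-up numbers for the example.
phone_contacts = {"+31600000001", "+31600000002"}
overrides = {"+31600000002": Priority.HIGH}
```

A priority of high can only be reached through a manual adjustment in this sketch, which matches the description that contacts start at either low or medium.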

Communicate with the university infrastructure

In the designed prototype, the only action the agent can take that impacts the user is to show or hide messages. However, it could prove advantageous to have the agent operate in more ways than just that. If, for example, the user receives a lot of messages about an upcoming group meeting and the agent has access to the user's timetable, the agent could easily filter these messages into one category. The other way around, if a few group members schedule an appointment and invite the user over WhatsApp, the agent could create an event in the user's calendar. This would be a very useful tool for people who tend to forget to actually schedule planned meetings. These new ways to act could also have a downside, because they introduce complexity into the agent. For example, each course a user takes would have different relevant keywords, so the dataset would have to contain keywords for each course, which could slow the agent down. This challenge will not be tackled in this study, but research into it could be useful for future studies.