(Original post in Lifeanalytics)
A question one could ask is the following: how can we easily identify and extract novel information from the web? Although this "novelty detection" could be applied in many areas, for now I would like to discuss the idea of semi-automatically identifying novelty among posts on Twitter.
Let's take the iPhone as an example. Thousands of tweets are generated every day regarding the Apple iPhone. These tweets mainly discuss:
* Which new apps are available / used / liked.
* New accessories (cases, chargers, etc.)
* User experiences and sentiment (such as complaints about the iPhone's short battery life)
* Pros and cons of the iPhone vs. other similar devices
* Upgrading, hacking, etc.
So the problem is: how can we identify novel information among thousands of tweets? Some would argue that we should first define what "novelty" is, such as finding a new application or a new accessory for the famous mobile device. Others might argue that novelty is a customer idea that few iPhone users have thought of and that Apple would be interested in spotting among thousands of tweets. As an example, consider the following tweets:
A subset of users experiences problems with the automatic orientation of the iPhone. This subset of iPhone users is perhaps very small, but identifying these tweets could give Apple some ideas to work on.
Here is another subset of tweets that talk about the charger's cable length:
In the example shown above, notice that using just "iphone cable" as search terms would return a large number of tweets, making it hard to identify novelty among them.
Searching for novelty and identifying new ideas among tweets is not an easy task. The problem is that we do not know what we are looking for in the first place: we can define the general context (such as wanting to identify novelty in user experience), but then we come to a halt in terms of which techniques to use (cluster analysis being one exception).
The potential of semi-automatic novelty detection on Twitter and other websites (such as Delicious links) is considerable. Although this is still work in progress, the general methodology of novelty detection on Twitter could be to:
1) Collect a large subset of tweets mentioning the iPhone and a keyword that identifies context (such as the word "charger").
2) Identify keyword frequencies.
3) Generate search queries using a subset of keywords chosen in an "intelligent" way; otherwise the number of search queries would be practically impossible to evaluate.
4) Test these combinations of keywords by submitting them to Twitter search and evaluating the results.
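Steps (2) and (3) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tweets are invented toy examples, the stopword list is a tiny hypothetical one, and "intelligent" is replaced by the naive rule of combining the most frequent keywords. The function names are made up for this sketch.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "my", "for", "to", "of", "and",
             "i", "it", "so", "too", "this", "on", "in"}

def tokenize(tweet):
    """Lowercase a tweet and return its words, dropping stopwords."""
    return [w for w in re.findall(r"[a-z']+", tweet.lower()) if w not in STOPWORDS]

def keyword_frequencies(tweets):
    """Step 2: count in how many tweets each keyword appears."""
    counts = Counter()
    for tweet in tweets:
        counts.update(set(tokenize(tweet)))  # set(): count each word once per tweet
    return counts

def candidate_queries(counts, seed, top_n=3, extra_terms=2):
    """Step 3 (naive stand-in): combine the seed terms with the top_n most
    frequent other keywords, taken extra_terms at a time."""
    seed = tuple(seed)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    others = [word for word, _ in ranked if word not in seed][:top_n]
    return [" ".join(seed + combo) for combo in combinations(others, extra_terms)]

# Invented toy tweets, standing in for the collection gathered in step 1.
tweets = [
    "My iPhone charger cable is too short",
    "iPhone charger cable way too short for my desk",
    "Love the new iPhone charger",
    "Why is the iPhone cable so short",
]

counts = keyword_frequencies(tweets)
queries = candidate_queries(counts, seed=("iphone",))
for query in queries:
    print(query)
```

Limiting the candidates to the `top_n` most frequent keywords is what keeps the number of queries manageable: with 3 keywords taken 2 at a time there are only 3 queries, whereas using every keyword in the collection would quickly produce far too many to evaluate.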
Steps (3) and (4) shown above are, of course, the key to success. In our example about the iPhone cable being too short, results were returned because the combination of keywords submitted made sense. Trying out "iPhone, cable, snow" tells us that this keyword combination is not a valid one and hence not an "intelligent" keyword subset.
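The validity check in step (4) could look roughly like the sketch below. It makes two assumptions. First, the search function is a local stand-in that scans an in-memory list of invented tweets; it is not the Twitter search API, and any real search backend would have to be plugged in instead. Second, a combination counts as "valid" when it returns at least `min_hits` results, which is an arbitrary threshold chosen for illustration.

```python
import re

def make_local_search(tweets):
    """Return a stand-in search function over an in-memory list of tweets.
    A tweet matches when it contains every term of the query as a whole word."""
    def search(query):
        terms = query.lower().split()
        hits = []
        for tweet in tweets:
            words = set(re.findall(r"[a-z0-9']+", tweet.lower()))
            if all(term in words for term in terms):
                hits.append(tweet)
        return hits
    return search

def is_valid_combination(query, search, min_hits=2):
    """Step 4 (sketch): a keyword combination is kept only if the search
    returns at least min_hits results."""
    return len(search(query)) >= min_hits

# Invented toy tweets, used only to exercise the functions.
tweets = [
    "My iPhone charger cable is too short",
    "iPhone charger cable way too short for my desk",
    "Love the new iPhone charger",
    "Why is the iPhone cable so short",
]

search = make_local_search(tweets)
print(is_valid_combination("iphone cable short", search))  # matches three toy tweets
print(is_valid_combination("iphone cable snow", search))   # matches none
```

Passing the search function in as a parameter keeps the validity rule separate from wherever the results come from, so the same check could be run against a different source of tweets.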