What's in a Tweet?

The messages are hard for machines to interpret, but a new approach could help.

Researchers at the Palo Alto Research Center (PARC) are developing new ways to deal with the torrent of information flowing from social media sites like Twitter. They have developed a Twitter "topic browser" that extracts meaning from the posts in a user's timeline. This could help users scan through thousands of tweets quickly, and the underlying technology could also offer novel ways of mining Twitter for information or of creating targeted advertising.

The researchers' idea was to provide a way for users to deal with a large number of Twitter messages quickly. They found that many users wanted to be able to quickly catch up on what's been going on, without having to go through every single tweet in their timeline.

Ed Chi, area manager and principal scientist for the Augmented Social Cognition Research Group at PARC, says that the information coming through Twitter resembles a stream--users will dip into it from time to time, but they don't want to consume it all at once. His group's work is called the "Eddi Project" in reference to the idea of eddies in a stream.

The researchers developed two main ways of filtering Twitter content. The first, presented recently at the ACM Conference on Human Factors in Computing Systems in Atlanta, is a recommendation system that ranks the posts in a Twitter stream by how interesting a user is likely to find them, based on factors such as the contents of the posts and the user's interactions with other Twitter users. The second tool, the Twitter topic browser, summarizes the contents of a user's timeline so that the user can quickly survey what information has come through Twitter without having to read through every post.

To create this second tool, the researchers focused on identifying the topic of each tweet. Michael Bernstein, a researcher at the Computer Science and Artificial Intelligence Lab at MIT who is involved with the project, says the group found that Twitter users were interested in filtering posts relating to specific topics, and that they found existing methods lacking. "Hashtags"--user-generated annotations that categorize tweets--are perhaps the best current option, but most tweets don't have these tags. Bernstein notes that Twitter, Google, and other companies are developing ways to identify and categorize the most popular topics of discussion on Twitter--such as the Icelandic volcano. For those popular topics, the sheer volume of tweets provides a lot of information for algorithms to use; it's much harder, he says, to figure out the topic of tweets on less common subjects.

A key challenge in extracting meaning from a tweet is its length: no more than 140 characters. Chi says that most natural language processing technology relies on having a larger sample of text to work with. For example, some methods rely on people writing out associations between terms, which requires a lot of work to maintain and is not the best way to interpret real-time information.

The researchers realized, however, that search engines have been dealing with extracting meaning from a small number of words--in the form of search queries--for years.

"The essence of the approach is to coerce a tweet to look more like a search query and then get a search engine to tell us more," Bernstein says. The researchers first clean up a tweet by removing common terms, like the Twitter slang "RT," which means "retweet." Once their algorithms have focused on likely significant terms, they feed those into Yahoo's Build Your Own Search Service interface--a Web service that can be used to tap directly into Yahoo's search results.

The Web is the most up-to-date source of data, Bernstein says, and the pages that come up in search results give enough information for the researchers' algorithms to produce a list of topics related to the original tweet.
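
The article does not give the details of the researchers' algorithms, so the following Python sketch is only an illustration of the general idea: strip a tweet down to its likely significant terms, send them to a search engine, and count which terms recur across the returned pages. The word lists, the function names (tweet_to_query, topics_for_tweet), the page-frequency scoring, and the canned fake_search function that stands in for a real search service are all assumptions made for this example, not part of the Eddi Project.

```python
import re
from collections import Counter
from typing import Callable, List

# Hypothetical word lists; a real system would use much larger ones.
TWITTER_NOISE = {"rt", "via"}
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "is", "are", "at", "with", "this", "that", "it", "after"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_significant(token: str) -> bool:
    """Keep tokens that are not stopwords, Twitter slang, or very short."""
    return (len(token) >= 3
            and token not in STOPWORDS
            and token not in TWITTER_NOISE)


def tweet_to_query(tweet: str) -> List[str]:
    """Reduce a tweet to a short list of query terms."""
    # Drop URLs and @mentions, and keep hashtag words without the '#'.
    text = re.sub(r"https?://\S+|@\w+", " ", tweet).replace("#", " ")
    terms: List[str] = []
    for token in tokenize(text):
        if is_significant(token) and token not in terms:
            terms.append(token)
    return terms


def topics_for_tweet(tweet: str,
                     search: Callable[[str], List[str]],
                     max_topics: int = 3) -> List[str]:
    """Guess topics by counting terms across search-result pages.

    `search` is any function that takes a query string and returns a list
    of page texts; it stands in for a real search service.
    """
    query_terms = tweet_to_query(tweet)
    if not query_terms:
        return []
    page_counts: Counter = Counter()
    for page in search(" ".join(query_terms)):
        # Count each term once per page, so one long page cannot dominate.
        for token in set(tokenize(page)):
            if is_significant(token):
                page_counts[token] += 1
    # Sort by count (descending), then alphabetically for stable output.
    ranked = sorted(page_counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_topics]]


def fake_search(query: str) -> List[str]:
    """Canned results for demonstration only; ignores the query."""
    return [
        "Eyjafjallajokull volcano eruption sends ash across Europe",
        "Volcano ash closes airspace in Europe",
        "Iceland volcano eruption: ash cloud grounds flights",
    ]


if __name__ == "__main__":
    tweet = ("RT @bob: Flights grounded again after the Iceland volcano "
             "http://example.com #ashcloud")
    print(tweet_to_query(tweet))
    print(topics_for_tweet(tweet, fake_search))
```

Passing the search function in as a parameter keeps the sketch independent of any particular search service, which also fits Chi's point that the same idea could be applied to other repositories of information.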

A similar approach could be used with any repository of information, Chi notes, pointing out that companies could use the technology on an intranet to classify bits of information related to more specialized topics.

"Boosting the signal of a tweet by piping it through web search is an application of a well-established information-retrieval technique," says Daniel Tunkelang, an engineer at Google who is an expert on information retrieval. He compares it to using a thesaurus to set a word in a broader context.

However, Tunkelang says the PARC researchers will have to make sure that the tweet-as-search-query approach doesn't collide with search engines' increasing efforts to index tweets. It wouldn't be good for a tweet to return itself as a result.

Chi says that his team is working on a platform for managing various kinds of information streams. This summer, they plan to increase the scale of the Eddi Project so it can be placed on the live Web for testing. The longer-term goal, Chi says, is to build tools that can be optimized for enterprise customers.