Researchers develop tools to better leverage tweets in spotting trends

"" Enlarge

Want the latest unemployment figures? You’ll probably need to wait: standard economic measures like these have historically been produced only after time-consuming effort and long delays, and even then they are subject to later revision.

The process of nowcasting – the term is a contraction of “now” and “forecasting” – has attracted interest in economics in recent years as a way to generate informative data on economic activity more quickly. One nowcasting approach uses online user activity to predict ongoing real-world social phenomena, such as a flu outbreak, and there has been some experimentation with this approach for generating economic data. However, it has not yet led to widespread practical adoption or to improved accuracy over traditional methods.

Researchers including CSE Prof. Michael Cafarella and CSE graduate student Dolan Antenucci have been investigating barriers to using social data to reproduce and improve upon the generation of economic data. Their recent work has focused on two areas: feature selection and real-time measures of economic activity.

Simplified Feature Selection

In a paper entitled “Feature Selection For Easier Nowcasting,” authored with U-M Economics Profs. Margaret Levenstein and Matthew Shapiro and University of Wisconsin–Madison Computer Science Prof. Christopher Ré, and to be presented at the 16th International Workshop on the Web and Databases in New York on June 23, Antenucci and Cafarella propose a system for choosing a set of relevant social media objects, a task that is otherwise difficult, time-consuming, and may require a statistical background that some users lack. The system takes a single user input string (e.g., “unemployment”) and yields a number of relevant signals the user can use to build a nowcasting model.
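To make the seed-term-to-signals idea concrete, here is a minimal, hypothetical Python sketch in which a seed term ranks co-occurring tweet phrases and returns their weekly counts as candidate signals. The corpus format, the bigram phrases, and the co-occurrence ranking are all assumptions made for this illustration; they are not the algorithm from the paper.

```python
from collections import Counter, defaultdict

def select_signals(seed, tweets, num_weeks, top_k=10):
    """Toy signal selector (illustrative only, not the paper's algorithm).

    tweets    -- iterable of (week_index, text) pairs
    num_weeks -- number of weekly buckets in the corpus
    Returns {phrase: weekly_count_list} for the phrases that most often
    co-occur with the seed term (e.g., "unemployment").
    """
    cooccur = Counter()
    weekly = defaultdict(lambda: [0] * num_weeks)
    for week, text in tweets:
        tokens = text.lower().split()
        phrases = [" ".join(p) for p in zip(tokens, tokens[1:])]  # bigrams
        for phrase in phrases:
            weekly[phrase][week] += 1
            if seed in tokens:
                cooccur[phrase] += 1
    top = [p for p, _ in cooccur.most_common(top_k)]
    return {p: weekly[p] for p in top}

# Example (hypothetical corpus): select_signals("unemployment", tweet_stream, num_weeks=52)
```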

The authors evaluate their system using a corpus of almost 6 billion tweets and show that features chosen by the system are, on average, about 7% better than those chosen by a human, while significantly reducing the burden on the user. The researchers have also used the system in a collaboration with economists to produce real macroeconomic statistics.

Real-Time Measures of Economic Activity

In a second upcoming paper, the same set of investigators analyze data from individuals’ postings (“Tweets”) on the Twitter social media service to estimate labor market flows. They find that the prevalence of Tweets with phrases like “got fired” is correlated with official measures of new claims for unemployment insurance. Their aim is to develop such real-time measures of economic activity into economic indicators and to use the Twitter feed to study economic behavior.

The paper will present preliminary results based on a 10 percent sample of Tweets from July 2011 to March 2013. The analysis is based on selected signals computed as weekly counts of Tweet phrases, divided into two groups: those related to job loss and those related to unemployment. The raw counts of Tweets are divided by the total number of Tweets and normalized (demeaned and divided by the standard deviation). An index is created by regressing these normalized signals on the published data for initial claims for unemployment insurance for the first 52 weeks of data.

The Twitter index tracks broad trends and week-to-week swings in the official data during the “training period” and during the subsequent 35 out-of-sample weeks. The Twitter index also predicts a substantial proportion of the surprise in initial claims, measured as the difference between actual claims and the consensus forecast. Hence, the Twitter index carries information about initial claims for unemployment insurance that is not reflected in the expectations of market analysts just prior to the release of the official data.
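As a rough illustration of the construction described above (weekly phrase counts divided by total Tweet volume, demeaned and scaled to unit variance, then regressed on the first 52 weeks of published initial claims), here is a NumPy sketch. The array shapes, the use of ordinary least squares, and the full-sample normalization are assumptions made for the example; the paper’s exact procedure may differ.

```python
import numpy as np

def build_twitter_index(signal_counts, total_tweets, claims, train_weeks=52):
    """Sketch of the index construction described above (illustrative only).

    signal_counts -- shape (num_weeks, num_signals): raw weekly phrase counts
    total_tweets  -- shape (num_weeks,): total Tweets per week
    claims        -- shape (num_weeks,): official initial claims for UI
    Returns a fitted index for every week, in and out of sample.
    """
    rates = signal_counts / total_tweets[:, None]          # share of all Tweets
    z = (rates - rates.mean(axis=0)) / rates.std(axis=0)   # demean, unit variance
    X = np.column_stack([np.ones(len(z)), z])              # add an intercept
    # Fit regression weights on the first `train_weeks` weeks of official data
    beta, *_ = np.linalg.lstsq(X[:train_weeks], claims[:train_weeks], rcond=None)
    return X @ beta                                        # index for all weeks

# Out-of-sample check (assumed usage): compare index[train_weeks:] with claims[train_weeks:]
```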

Prof. Michael Cafarella received his PhD in Computer Science from the University of Washington in 2009 and joined the faculty at Michigan that year. His research interests include databases, information extraction, data integration, and data mining. He has published extensively in SIGMOD, VLDB, and other venues. He received the NSF CAREER award in 2011. In addition to his academic work, he co-founded (with Doug Cutting) the Hadoop open-source project, which is widely used at Facebook, Yahoo!, and elsewhere.