Abstract: In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be utilized to improve the forecasting power of social media.

I. INTRODUCTION

Social media has exploded as a category of online discourse where people create content, share it, bookmark it and network at a prodigious rate. Examples include Facebook, MySpace, Digg, Twitter and JISC listservs on the academic side. Because of its ease of use, speed and reach, social media is fast changing the public discourse in society and setting trends and agendas in topics that range from the environment and politics to technology and the entertainment industry.

Since social media can also be construed as a form of collective wisdom, we decided to investigate its power at predicting real-world outcomes. Surprisingly, we discovered that the chatter of a community can indeed be used to make quantitative predictions that outperform those of artificial markets. These information markets generally involve the trading of state-contingent securities, and if large enough and properly designed, they are usually more accurate than other techniques for extracting diffuse information, such as surveys and opinion polls. Specifically, the prices in these markets have been shown to have strong correlations with observed outcome frequencies, and thus are good indicators of future outcomes [4], [5].

In the case of social media, the enormity and high variance of the information that propagates through large user communities present an interesting opportunity for harnessing that data into a form that allows for specific predictions about particular outcomes, without having to institute market mechanisms. One can also build models to aggregate the opinions of the collective population and gain useful insights into their behavior, while predicting future trends. Moreover, gathering information on how people converse regarding particular products can be helpful when designing marketing and advertising campaigns [1], [3].

This paper reports on such a study. Specifically, we consider the task of predicting box-office revenues for movies using the chatter from Twitter, one of the fastest growing social networks on the Internet. Twitter¹, a micro-blogging network, has experienced a burst of popularity in recent months leading to a huge user-base, consisting of several tens of millions of users who actively participate in the creation and propagation of content.

¹ http://www.twitter.com

We have focused on movies in this study for two main reasons.
• The topic of movies is of considerable interest among the social media user community, characterized both by a large number of users discussing movies, as well as a substantial variance in their opinions.
• The real-world outcomes can be easily observed from box-office revenue for movies.

Our goals in this paper are as follows. First, we assess how buzz and attention are created for different movies and how they change over time. Movie producers spend a lot of effort and money in publicizing their movies, and have also embraced the Twitter medium for this purpose. We then focus on the mechanism of viral marketing and pre-release hype on Twitter, and the role that attention plays in forecasting real-world box-office performance. Our hypothesis is that movies that are well talked about will be well-watched.

Next, we study how sentiments are created, how positive and negative opinions propagate and how they influence people. For a bad movie, the initial reviews might be enough to discourage others from watching it, while on the other hand, it is possible for interest to be generated by positive reviews and opinions over time. For this purpose, we perform sentiment analysis on the data, using text classifiers to distinguish positively oriented tweets from negative ones.

Our chief conclusions are as follows:
• We show that social media feeds can be effective indicators of real-world performance.
• We discovered that the rate at which movie tweets are generated can be used to build a powerful model for predicting movie box-office revenue. Moreover, our predictions are consistently better than those produced by an information market such as the Hollywood Stock Exchange, the gold standard in the industry [4].
• Our analysis of the sentiment content in the tweets shows that they can improve box-office revenue predictions based on tweet rates only after the movies are released.

This paper is organized as follows. Next, we survey recent related work. We then provide a short introduction to Twitter and the dataset that we collected. In Section 5, we study how attention and popularity are created and how they evolve. We then discuss our study on using tweets from Twitter for predicting movie performance. In Section 6, we present our analysis on sentiments and their effects. We conclude in Section 7. We describe our prediction model in a general context in the Appendix.

II. RELATED WORK

Although Twitter has been very popular as a web service, there has not been considerable published research on it. Huberman and others [2] studied the social interactions on Twitter to reveal that the driving process for usage is a sparse hidden network underlying the friends and followers, while most of the links represent meaningless interactions. Java et al. [7] investigated community structure and isolated different types of user intentions on Twitter. Jansen and others [3] have examined Twitter as a mechanism for word-of-mouth advertising, and considered particular brands and products while examining the structure of the postings and the change in sentiments. However, the authors do not perform any analysis on the predictive aspect of Twitter.

There has been some prior work on analyzing the correlation between blog and review mentions and performance. Gruhl and others [9] showed how to generate automated queries for mining blogs in order to predict spikes in book sales. And while there has been research on predicting movie sales, almost all of it has used meta-data information on the movies themselves to perform the forecasting, such as the movie's genre, MPAA rating, running time, release date, the number of screens on which the movie debuted, and the presence of particular actors or actresses in the cast. Joshi and others [10] use linear regression from text and metadata features to predict earnings for movies. Sharda and Delen [8] have treated the prediction problem as a classification problem and used neural networks to classify movies into categories ranging from 'flop' to 'blockbuster'. Apart from the fact that they are predicting ranges rather than actual numbers, the best accuracy that their model can achieve is fairly low. Zhang and Skiena [6] have used a news aggregation model along with IMDB data to predict movie box-office numbers. We have shown how our model can generate better results when compared to their method.

III. TWITTER

Launched on July 13, 2006, Twitter² is an extremely popular online microblogging service. It has a very large user base, consisting of several millions of users (23M unique users in Jan³). It can be considered a directed social network, where each user has a set of subscribers known as followers. Each user submits periodic status updates, known as tweets, that consist of short messages of maximum size 140 characters. These updates typically consist of personal information about the users, news or links to content such as images, video and articles. The posts made by a user are displayed on the user's profile page, as well as shown to his/her followers. It is also possible to send a direct message to another user. Such messages are preceded by @userid indicating the intended destination.

² http://www.twitter.com
³ http://blog.compete.com/2010/02/24/compete-ranks-top-sites-for-january-2010/

A retweet is a post originally made by one user that is forwarded by another user. These retweets are a popular means of propagating interesting posts and links through the Twitter community.

Twitter has attracted lots of attention from corporations for the immense potential it provides for viral marketing. Due to its huge reach, Twitter is increasingly used by news organizations to filter news updates through the community. A number of businesses and organizations are using Twitter or similar micro-blogging services to advertise products and disseminate information to stakeholders.

IV. DATASET CHARACTERISTICS

The dataset that we used was obtained by crawling hourly feed data from Twitter.com. To ensure that we obtained all tweets referring to a movie, we used keywords present in the movie title as search arguments. We extracted tweets over frequent intervals using the Twitter Search API⁴, thereby ensuring we had the timestamp, author and tweet text for our analysis. We extracted 2.89 million tweets referring to 24 different movies released over a period of three months.

⁴ http://search.twitter.com/api/

Movies are typically released on Fridays, with the exception of a few which are released on Wednesday. Since an average of 2 new movies are released each week, we collected data over a time period of 3 months from November to February to have sufficient data to measure predictive behavior. For consistency, we only considered the movies released on a Friday and only those in wide release. For movies that were initially in limited release, we began collecting data from the time they went into wide release. For each movie, we define the critical period as the time from the week before it is released, when the promotional campaigns are in full swing, to two weeks after release, when its initial popularity fades and opinions from people have been disseminated.

Some details on the movies chosen and their release dates are provided in Table 1. Note that some movies that were released during the period considered were not used in this study, simply because it was difficult to correctly identify tweets that were relevant to those movies. For instance, for the movie 2012, it was impractical to segregate tweets talking about the movie from those referring to the year. We have taken care to ensure that the data we have used was
[Figures omitted; surviving labels: "Release weekend", "log(frequency)"]

Movie          Release Date
Armored        2009-12-04
Avatar         2009-12-18
Daybreakers    2010-01-08

TABLE I
NAMES AND RELEASE DATES FOR THE MOVIES WE CONSIDERED IN OUR ANALYSIS.
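As a minimal sketch of the critical period defined in Section IV (one week before release to two weeks after), the following Python computes the window for the release dates listed in Table I and filters tweets by timestamp. The function names, the inclusive endpoints, and the sample tweet records are assumptions made for illustration; they are not taken from the original study.

```python
from datetime import date, datetime, timedelta

# Release dates copied from the rows of Table I.
RELEASE_DATES = {
    "Armored": date(2009, 12, 4),
    "Avatar": date(2009, 12, 18),
    "Daybreakers": date(2010, 1, 8),
}


def critical_period(release):
    """Return (start, end) dates: one week before release to two weeks after."""
    return release - timedelta(weeks=1), release + timedelta(weeks=2)


def in_critical_period(timestamp, release):
    """Return True if a tweet timestamp falls inside the critical period.

    Treating both endpoints as inclusive is an assumption of this sketch.
    """
    start, end = critical_period(release)
    return start <= timestamp.date() <= end


# Hypothetical (movie, timestamp) records standing in for crawled tweets.
tweets = [
    ("Avatar", datetime(2009, 12, 10, 23, 59)),  # one day before the window
    ("Avatar", datetime(2009, 12, 11, 8, 30)),   # first day of the window
    ("Avatar", datetime(2010, 1, 2, 9, 0)),      # one day after the window
]
kept = [t for t in tweets if in_critical_period(t[1], RELEASE_DATES[t[0]])]
print(len(kept), "of", len(tweets), "tweets fall in the critical period")
```

Working with calendar dates rather than exact times keeps the window aligned with whole days, which seems a reasonable match for weekly release schedules.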
authors over the critical period. The X-axis shows the number of tweets in the log scale, while the Y-axis represents the corresponding frequency of authors in the log scale. We can
[Figure: Y-axis "Authors", X-axis "Number of Movies"]
Fig. 4. Distribution of total authors and the movies they comment on.

Features   Week 0   Week 1   Week 2
url        39.5     25.5     22.5
retweet    12.1     12.1     11.66

TABLE II
URL AND RETWEET PERCENTAGES FOR CRITICAL WEEK

promotional material) as well as retweets, which involve users forwarding tweet posts to everyone in their friend-list. Both these forms of tweets are important to disseminate information regarding movies being released.

First, we examine the distribution of such tweets for different movies, following which we examine their correlation with the performance of the movies.

there is a greater percentage of tweets containing urls in the week prior to release than afterwards. This is consistent with our expectation. In the case of retweets, we find the values to be similar across the 3 weeks considered. In all, we found the retweets to be a significant minority of the tweets on movies. One reason for this could be that people tend to describe their own expectations and experiences, which are not necessarily propaganda.

We want to determine whether movies that have greater publicity, in terms of linked urls on Twitter, perform better in the box office. When we examined the correlation between the urls and retweets with the box-office performance, we found the correlation to be moderately positive, as shown in Table 3. However, the adjusted R² value is quite low in both cases, indicating that these features are not very predictive of the relative performance of movies. This result is quite surprising since we would expect promotional material to contribute significantly to a movie's box-office income.

B. Prediction of first weekend Box-office revenues

Next, we investigate the power of social media in predicting real-world outcomes. Our goal is to observe if the knowledge that can be extracted from the tweets can lead to reasonably accurate prediction of future outcomes in the real world.

The problem that we wish to tackle can be framed as

[Figure: Y-axis "Tweets with urls (percentage)", X-axis "Movies"; legend: Week 0, Week 1, Week 2]
[Figure: legend: Tweet-rate, HSX; axis label "Actual revenue"]

TABLE IV
COEFFICIENT OF DETERMINATION (R²) VALUES USING DIFFERENT PREDICTORS FOR MOVIE BOX-OFFICE REVENUE FOR THE FIRST WEEKEND.
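The regressions in this paper are evaluated with the adjusted coefficient of determination. As a minimal sketch, the Python below fits a one-predictor least-squares line and computes adjusted R² with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function names and the sample numbers are made up for illustration and are not data from the study.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def adjusted_r2(x, y, intercept, slope, n_predictors=1):
    """Adjusted R^2 for a fitted line: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical data: a predictor value per movie (e.g. average tweets per
# hour) and an outcome (e.g. opening-weekend revenue in millions).
tweet_rate = [120.0, 340.0, 95.0, 610.0, 280.0, 45.0]
revenue = [14.0, 33.0, 9.5, 71.0, 25.0, 6.0]

intercept, slope = fit_simple_ols(tweet_rate, revenue)
print("slope:", round(slope, 4))
print("adjusted R^2:", round(adjusted_r2(tweet_rate, revenue, intercept, slope), 4))
```

Adjusted R² never exceeds plain R² and penalizes additional predictors, which matters when comparing models with different numbers of variables on only a few dozen movies.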
TABLE VI
PREDICTION OF HSX END OF OPENING WEEKEND PRICE.

TABLE VIII
PREDICTION OF SECOND WEEKEND BOX-OFFICE GROSS

Weekend     Adjusted R²
Jan 15-17   0.92
Jan 22-24   0.97
Jan 29-31   0.92
Feb 05-07   0.95

TABLE VII
COEFFICIENT OF DETERMINATION (R²) VALUES USING TWEET-RATE TIMESERIES FOR DIFFERENT WEEKENDS

all movies over a particular weekend. The Hollywood Stock Exchange de-lists movie stocks after 4 weeks of release, which means that there is no timeseries available for movies after 4 weeks. In the case of tweets, people continue to discuss movies long after they are released. Hence, we attempt to use the timeseries of tweet-rate, over 7 days before the weekend, to predict the box-office revenue for that particular weekend. Table 7 shows the results for 3 weekends in January and 1 in February. Note that some of the movies considered in this experiment were two months old. Apart from the time series, we used two additional variables: the theater count and the number of weeks the movie has been released.

We used the coefficient of determination (adjusted R²) to evaluate the regression models. From Table 7, we find that the tweets continue to be good predictors even in this case, with an adjusted R² consistently greater than 0.90. The results have shown that the buzz from social media can be an accurate indicator of future outcomes. The fact that a simple linear regression model considering only the rate of tweets on movies can perform better than artificial money markets illustrates the power of social media.

VI. SENTIMENT ANALYSIS

Next, we would like to investigate the importance of sentiments in predicting future outcomes. We have seen how efficient the attention can be in predicting opening weekend box-office values for movies. Hence we consider the problem of utilizing the sentiments prevalent in the discussion for forecasting.

Sentiment analysis is a well-studied problem in linguistics and machine learning, with different classifiers and language models employed in earlier work [13], [14]. It is common to express this as a classification problem where a given text needs to be labeled as Positive, Negative or Neutral. Here, we constructed a sentiment analysis classifier using the LingPipe linguistic analysis package⁶, which provides a set of open-source Java libraries for natural language processing tasks. We used the DynamicLMClassifier, which is a language model classifier that accepts training events of categorized character sequences. Training is based on a multivariate estimator for the category distribution and dynamic language models for the per-category character sequence estimators.

⁶ http://www.alias-i.com/lingpipe

To obtain labeled training data for the classifier, we utilized workers from the Amazon Mechanical Turk⁷. It has been shown that manual labeling from Amazon Turk can correlate well with experts [11]. We used thousands of workers to assign sentiments for a large random sample of tweets, ensuring that each tweet was labeled by three different people. We used only samples for which the vote was unanimous as training data. The samples were initially preprocessed in the following ways:

⁷ https://www.mturk.com/

• Elimination of stop-words
• Elimination of all special characters except exclamation marks, which were replaced by <EX>, and question marks, which were replaced by <QM>
• Removal of urls and user-ids
• Replacing the movie title with <MOV>

We used the pre-processed samples to train the classifier using an n-gram model. We chose n to be 8 in our experiments. The classifier was trained to predict three classes: Positive, Negative and Neutral. When we tested on the training-set with cross-validation, we obtained an accuracy of 98%. We then used the trained classifier to predict the sentiments for all the tweets in the critical period for all the movies considered.
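The preprocessing steps and the unanimous-vote filter described in this section can be sketched in Python as follows. This is an illustrative approximation, not the authors' pipeline: the stop-word list, the regular expressions, the order of the steps, the lowercasing, and the function names are all assumptions made for this example.

```python
import re

# A tiny illustrative stop-word list; the list used in the study is not given.
STOP_WORDS = {"a", "an", "the", "is", "was", "to", "of", "and", "in", "it"}
PLACEHOLDERS = {"<MOV>", "<EX>", "<QM>"}


def preprocess(tweet, movie_title):
    """Apply the four preprocessing steps to one tweet and return a string."""
    text = re.sub(r"https?://\S+", " ", tweet)  # remove urls
    text = re.sub(r"@\w+", " ", text)           # remove user-ids
    # Replace the movie title (case-insensitive) with <MOV>.
    text = re.sub(re.escape(movie_title), " <MOV> ", text, flags=re.IGNORECASE)
    # Keep exclamation and question marks as placeholder tokens.
    text = text.replace("!", " <EX> ").replace("?", " <QM> ")
    tokens = []
    for token in text.split():
        if token in PLACEHOLDERS:
            tokens.append(token)
            continue
        # Drop all remaining special characters, then lowercase (assumption).
        word = re.sub(r"[^A-Za-z0-9]", "", token).lower()
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return " ".join(tokens)


def unanimous_label(votes):
    """Return the label if all three workers agree, otherwise None."""
    if len(votes) == 3 and len(set(votes)) == 1:
        return votes[0]
    return None


# Hypothetical tweet and worker votes, for illustration only.
sample = "@bob Avatar was AMAZING!!! http://t.co/x is it worth it?"
print(preprocess(sample, "Avatar"))
print(unanimous_label(["Positive", "Positive", "Positive"]))
print(unanimous_label(["Positive", "Neutral", "Positive"]))
```

Urls and user-ids are removed before the punctuation handling so that a "?" inside a link or an "@" prefix is not mistaken for tweet content.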
[Figure: "Movie Subjectivity" plotted for movies 1 to 24]

Variable         p-value
(Intercept)      0.542
Avg Tweet-rate   2.05e-11 (***)
PNRatio          9.43e-06 (***)

TABLE IX
REGRESSION USING THE AVERAGE TWEET-RATE AND THE POLARITY (PNRATIO). THE SIGNIFICANCE LEVEL (*: 0.05, **: 0.01, ***: 0.001) IS ALSO SHOWN.

positive than negative tweets is likely to be successful.
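Table IX uses a polarity variable named PNRatio. This excerpt names the variable without defining it, so the sketch below assumes it is the number of tweets classified Positive divided by the number classified Negative for a movie, which agrees with the remark above about movies with more positive than negative tweets. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def pn_ratio(labels):
    """Ratio of Positive to Negative labels (assumed meaning of PNRatio).

    Neutral labels are ignored. Raises ValueError when there are no
    Negative labels, since the ratio is undefined in that case.
    """
    counts = Counter(labels)
    if counts["Negative"] == 0:
        raise ValueError("PN ratio is undefined without negative tweets")
    return counts["Positive"] / counts["Negative"]


# Hypothetical classifier output for one movie's tweets.
labels = ["Positive"] * 6 + ["Negative"] * 2 + ["Neutral"] * 4
print(pn_ratio(labels))
```

A value above 1 means positive tweets outnumber negative ones under this assumed definition.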
IX. ACKNOWLEDGEMENT
This material is based upon work supported by the National
Science Foundation under Grant # 0937060 to the Computing
Research Association for the CIFellows Project.
REFERENCES
[1] Jure Leskovec, Lada A. Adamic and Bernardo A. Huberman. The
dynamics of viral marketing. In Proceedings of the 7th ACM Conference
on Electronic Commerce, 2006.
[2] Bernardo A. Huberman, Daniel M. Romero, and Fang Wu. Social
networks that matter: Twitter under the microscope. First Monday, 14(1),
Jan 2009.
[3] B. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power:
Tweets as electronic word of mouth. Journal of the American Society
for Information Science and Technology, 2009.