Professional Documents
Culture Documents
ABSTRACT ing companies and celebrities eager to exploit this vast new
The ever-increasing amount of information flowing through medium. As a result, ideas, opinions, and products compete
Social Media forces the members of these networks to com- with all other content for the scarce attention of the user
pete for attention and influence by relying on other people community. In spite of the seemingly chaotic fashion with
to spread their message. A large study of information propa- which all these interactions take place, certain topics man-
gation within Twitter reveals that the majority of users act age to get an inordinate amount of attention, thus bubbling
as passive information consumers and do not forward the to the top in terms of popularity and contributing to new
content to the network. Therefore, in order for individuals trends and to the public agenda of the community. How
to become influential they must not only obtain attention this happens in a world where crowdsourcing dominates is
and thus be popular, but also overcome user passivity. We still an unresolved problem, but there is considerable consen-
propose an algorithm that determines the influence and pas- sus on the fact that two aspects of information transmission
sivity of users based on their information forwarding activ- seem to be important in determining which content receives
ity. An evaluation performed with a 2.5 million user dataset inordinate amounts of attention.
shows that our influence measure is a good predictor of URL One is the popularity and status of given members of these
clicks, outperforming several other measures that do not ex- social networks, which is measured by the level of atten-
plicitly take user passivity into account. We also explicitly tion they receive in the form of followers who create links
demonstrate that high popularity does not necessarily imply to their accounts to automatically receive the content they
high influence and vice-versa. generate. The other is the influence that these individuals
wield, which is determined by the actual propagation of their
content through the network. This influence is determined
1. INTRODUCTION by many factors, such as the novelty and resonance of their
The explosive growth of Social Media has provided mil- messages with those of their followers and the quality and
lions of people the opportunity to create and share content frequency of the content they generate. Equally important
on a scale barely imaginable a few years ago. Massive partic- is the passivity of members of the network which provides a
ipation in these social networks is reflected in the countless barrier to propagation that is often hard to overcome. Thus
number of opinions, news and product reviews that are con- gaining knowledge of the identity of influential and least pas-
stantly posted and discussed in social sites such as Facebook, sive people in a network can be extremely useful from the
Digg and Twitter, to name a few. Given this widespread perspectives of viral marketing, propagating one’s point of
generation and consumption of content, it is natural to tar- view, as well as setting which topics dominate the public
get one’s messages to highly connected people who will prop- agenda.
agate them further in the social network. This is particularly In this paper, we analyze the propagation of web links on
the case in Twitter, which is one of the fastest growing so- twitter over time to understand how attention to given users
cial networks on the Internet, and thus the focus of advertis- and their influence is determined. We devise a general model
for influence using the concept of passivity in a social net-
work and develop an efficient algorithm similar to the HITS
algorithm [14] to quantify the influence of all the users in the
network. Our influence measure utilizes both the structural
Permission to make digital or hard copies of all or part of this work for properties of the network as well as the diffusion behavior
personal or classroom use is granted without fee provided that copies are among users. The influence of a user thus depends on not
not made or distributed for profit or commercial advantage and that copies only the size of the influenced audience, but also on their
bear this notice and the full citation on the first page. To copy otherwise, to passivity. This differentiates it from earlier measures of in-
republish, to post on servers or to redistribute to lists, requires prior specific fluence which were primarily based on individual statistical
permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.
properties such as the number of followers or retweets [7].
We have shown through our extensive evaluation that
this influence model outperforms other measures of influ-
ence such as PageRank, H-index, the number of followers 3.1 Background on Twitter
and the number of retweets. In addition it has good predic- Twitter is an extremely popular online microblogging ser-
tive properties in that it can forecast in advance the upper vice, that has gained a very large user base, consisting of
bound on the number of clicks a URL can get. We have more than 105 million users (as of April 2010). The Twitter
also presented case studies showing the top influential users graph is a directed social network, where each user chooses
uncovered by our algorithm. An important conclusion from to follow certain other users. Each user submits periodic
the results is that the correlation between popularity and status updates, known as tweets, that consist of short mes-
influence is quite weak, with the most influential users not sages limited in size to 140 characters. These updates typi-
necessarily the ones with the highest popularity. Addition- cally consist of personal information about the users, news
ally, when we considered nodes with high passivity, we found or links to content such as images, video and articles. The
a majority of them to be spammers and robot users. This posts made by a user are automatically displayed on the
demonstrates an application of our algorithm in automatic user’s profile page, as well as shown to his followers.
user categorization and filtering of online content . A retweet is a post originally made by one user that is for-
warded by another user. Retweets are useful for propagating
2. RELATED WORK interesting posts and links through the Twitter community.
The study of information and influence propagation in Twitter has attracted lots of attention from corporations
social networks has been particularly active for a number for the immense potential it provides for viral marketing.
of years in fields as disparate as sociology, communication, Due to its huge reach, Twitter is increasingly used by news
marketing, political science and physics. Earlier work fo- organizations to disseminate news updates, which are then
cused on the effects that scale-free networks and the affinity filtered and commented on by the Twitter community. A
of their members for certain topics had on the propagation number of businesses and organizations are using Twitter
of information [6]. Others discussed the presence of key or similar micro-blogging services to advertise products and
influentials [12, 11, 8, 5, 10] in a social network, defined as disseminate information to stockholders.
those who are responsible for the overall information dis- 3.2 Dataset
semination in the network. This research highlighted the
value of highly connected individuals as key elements in the Twitter provides a Search API for extracting tweets con-
propagation of information through the network. taining particular keywords. To obtain the dataset for this
Huberman et al. [2] studied the social interactions on study, we continuously queried the Twitter Search API for
Twitter to reveal that the driving process for usage is a a period of 300 hours starting on 10 Sep 2009 for all tweets
sparse hidden network underlying the friends and follow- containing the string http. This allowed us to acquire a
ers, while most of the links represent meaningless interac- continuous stream of 22 million tweets with URLs, which
tions. Jansen et al. [3] have examined twitter as a mecha- we estimated to be 1/15th of the entire Twitter activity at
nism for word-of-mouth advertising. They considered par- that time. From each of the accumulated tweets, we ex-
ticular brands and products and examined the structure of tracted the URL mentions. Each of the unique 15 million
the postings and the change in sentiments. Galuba et al. [4] URLs in the data set was then checked for valid formatting
propose a propagation model that predicts, which users will and the URLs shortened via the services such as bit.ly or
tweet about which URL based on the history of past user tinyurl.com were expanded into their original form by fol-
activity. lowing the HTTP redirects. For each encountered unique
There have also been earlier studies focused on social in- user ID, we queried the Twitter API for metadata about
fluence and propagation. Agarwal et al. [8] have examined that user and in particular the user’s followers and followees.
the problem of identifying influential bloggers in the blogo- The end result was a dataset of timestamped URL mentions
sphere. They discovered that the most influential bloggers together with the complete social graph for the users con-
were not necessarily the most active. Aral et al [9] have cerned.
distinguished the effects of homophily from influence as mo- User graph. The user graph contains those users whose
tivators for propagation. As to the study of influence within tweets appeared in the stream, i.e., users that during the
twitter, Cha et al. [7] have performed a comparison of three 300 hour observation period posted at least one public tweet
different measures of influence - indegree, retweets and user containing a URL. The graph does not contain any users who
mentions. They discovered that while retweets and men- do not mention any URLs in their tweets or users that have
tions correlated well with each other, the indegree of users chosen to make their Twitter stream private.
did not correlate well with the other two measures. Based For each newly encountered user ID, the list of followed
on this, they hypothesized that the number of followers may users was only fetched once. Our data set does not capture
not a good measure of influence. On the other hand, Weng the changes occurring in the user graph over the observation
et al [5] have proposed a topic-sensitive PageRank measure period.
for influence in Twitter. Their measure is based on the fact
that they observed high reciprocity among follower relation- 4. THE IP ALGORITHM
ships in their dataset, which they attributed to homophily. Evidence for passivity. The users that receive informa-
However, other work [7] has shown that the reciprocity is tion from other users may never see it or choose to ignore
low overall in Twitter and contradicted the assumptions of it. We have quantified the degree to which this occurs on
this work. Twitter (Fig. 4). An average Twitter user retweets only one
in 318 URLs, which is a relatively low value. The retweeting
3. TWITTER rates vary widely across the users and the small number of
Algorithm 1: The Influence-Passivity (IP) algorithm
106
audience retweeting rate I0 ← (1, 1, . . . , 1) ∈ R|N | ;
105 user retweeting rate P0 ← (1, 1, . . . , 1) ∈ R|N | ;
for i = 1 to m do
104
Update Ri using operation (2) and the values Ii−1 ;
#users
but to a larger part of the network. It is often fairly easy of influence that user j accepted from user i normalized
to obtain information about the pairwise influence relation- by the total influence accepted by j from all users on the
ships between users. In Twitter, for example, one can mea- network. The acceptance rate can be viewed as the dedi-
sure how much influence user A has on user B by counting cation or loyalty user j has to user i. On the other hand,
the number of times B retweeted A. However, it is not very for every e = (j, i) ∈ E we define the rejection rate by
clear how to use the pairwise influence information to accu- 1 − wji
vji = X . Since the value 1 − wji is amount
rately obtain information about the relative influence each (1 − wjk )
user has on the whole network. To answer this question we k:(j,k)∈E
design an algorithm (IP) that assigns a relative influence of influence that user i rejected from j, then the value vji
score and a passivity score to every user. The passivity of represents the influence that user i rejected from user j nor-
a user is a measure of how difficult it is for other users to malized by the total influence rejected from j by all users in
influence him. We assume that the influence of a user de- the network.
pends on both the quantity and the quality of the audience The algorithm is based on the following operations:
she influences. In general, our model makes the following X
assumptions: Ii ← uij Pj (1)
j:(i,j)∈E
1. A user’s influence score depends on the number of peo- X
ple she influences as well as their passivity. Pi ← vji Ij (2)
j:(j,i)∈E
2. A user’s influence score depends on how dedicated the
people she influences are. Dedication is measured by Each term on the right hand side of the above operations
the amount of attention a user pays to a given one as corresponds to one of the listed assumptions. In operation
compared to everyone else. 1 the term Pj corresponds to assumption 1 and the term
3. A user’s passivity score depends on the influence of uij corresponds to assumption 2. In operation 2 the term Ij
those who she’s exposed to but not influenced by. corresponds to assumption 3 and the term vji corresponds
to assumption 4. The Influence-Passivity algorithm (Algo-
4. A user’s passivity score depends on how much she re- rithm 1) takes the graph G as the input and computes the
jects other user’s influence compared to everyone else. influence and passivity for each node in m iterations.
Generating the input graph. There are many ways of
defining the influence graph G = (N, E, W ). We choose to 101
construct it by taking into account retweets and the follower
graph in the following way: The nodes are users who tweeted 100
change in values
at least 3 URLs. The arc (i, j) exists if user j retweeted a
URL posted by user i at least once. The arc e = (i, j) has
weight we = Qij
S
where Qi is the number of URLs that i 10-1
ij
mentioned and Sij is the number of URLs mentioned by i
and retweeted by j. 10-2
5. EVALUATION 10-3 0 10 20 30 40 50
iteration
5.1 Computations
Based on the obtained dataset (§3.2) we generate the
weighted graph using the method described in §4. The
graph consists of approximately 450k nodes and 1 million Figure 2: IP-algorithm convergence. In each itera-
arcs with mean weight of 0.07, and we use it to compute tion we measure the sum of all the absolute changes
the PageRank, influence and passivity values for each node. of the computed influence and passivity values since
The Influence-Passivity algorithm (Algorithm §1) converges the previous iteration
to the final values in tens of iterations (Fig. 2).
PageRank. The PageRank algorithm has been widely
used to rank web pages as well as people based on their au- URL traffic correlation. Using the URL click data,
thority and their influence [13, 5]. In order to compare it we take several different user attributes and test how well
with the results from the IP algorithm, we compute PageR- they can predict the attention the URLs posted by the users
ank on the weighted graph G = (N, E, W ) with a small receive (Fig. 3). It is important to note that none of the
change. First, since the arcs e = (i, j) ∈ E indicate that influence measures are capable of predicting the exact num-
user i exerts some influence on user j then we invert all the ber of clicks. The main reason for this is that the amount of
arcs before running PageRank on the graph while leaving attention a URL gets is not only a function of the influence
the weights intact. In other words, we generate a new graph of the users mentioning it, but also of many other factors
G0 = (N 0 , E 0 , W 0 ) where N 0 = N , E 0 = {(i, j) : (j, i) ∈ E}, including the virality of the URL itself and more impor-
and for each (i, j) ∈ E 0 we define wij 0
= wji . This generates tantly, whether the URL was mentioned anywhere outside
a new graph G0 analogous to G but where the influenced of Twitter, which is likely to be the biggest source of unpre-
users point to their influencers. Second, since the graph dictability in the click data. The click data that we collected
G0 is weighted we assume that when the the random surfer represents the total clicks on the URLs.
of the PageRank algorithm is currently at the node i, she The wide range of factors potentially affecting the Bit.ly
0
wij
chooses to visit node j next with probability X . clicks may prevent us from predicting their number accu-
0
wik rately. However, the upper bound on that number can to
k:(i,k)∈E 0 a large degree be predicted. To eliminate the outlier cases,
The Hirsch Index. The Hirsch index (or H-index) is we examined how the 99.9th percentile of the clicks varied
used in the scientific community in order to measure the as the measure of influence increased.
productivity and impact of a scientist. A scientist has index Number of followers. The most readily available and
h if he has published h articles which have been cited at least often used by the Twitterers measure of influence is the num-
h times each. It has been shown that the H-index is a good ber of followers a user has. As the Figure 3(a) shows, the
indicator of whether a scientist has had high achievements number of followers of an average poster of a given URL is a
such as getting the Nobel prize [16]. Analogously, in Twitter, relatively weak predictor of the maximum number of clicks
a user has index h if h of his URL posts have been retweeted that the URL can receive.
at least h times each. Number of retweets. When users post URLs their
posts might be retweeted by other users. Each retweet ex-
5.2 Influence as a correlate of attention plicitly credits the original poster of the URL (or the user
Any measure of influence is necessarily a subjective one. from whom the retweeting user heard about the URL). The
However, in this case, a good measure of influence should number of times a user has been credited in a retweet has
have a high predictive power on how well the URLs men- been assumed to be a good measure of influence [7]. How-
tioned by the influential users attract attention and propa- ever, Figure 3(b) shows that the number of times a user has
gate in the social network. We would expect the URLs that been retweeted in the past is a poor predictor of the max-
highly influential users propagate to attract a lot of atten- imum number of clicks the URLs posted by that user can
tion and user clicks. Thus, a viable estimator of attention is get.
the number of times a URL has been accessed. PageRank. Figure 3(c) shows that the average PageR-
Click data. Bit.ly is a URL shortening service that for ank of those who tweet a certain URL does not correlate
each shortened URL keeps track of how many times it has well with the number of clicks the URL will get. One of the
been accessed. For the 3.2M Bit.ly URLs found in the tweets main differences between the IP algorithm and PageRank is
we have queried the Bit.ly API for the number of clicks the that the IP algorithm takes into account the passivity of the
service has registered on that URL. people a user influences and PageRank does not.IP-influence
107 107
10000
106 106
10000
105 1000 105
# bit.ly clicks
# bit.ly clicks
104 104 1000
#urls
#urls
103 100 103 100
102 102
10 10
101 101
100 −1 1 100 −3 1
10 100 101 102 103 104 105 106 10 10−2 10−1 100 101 102 103 104
average # of followers of the posting users average # times posting users were retweeted
(a) Average number of followers vs. number of clicks (b) Average number of times users were retweeted vs.
on URLs number of clicks on URLs
107
1e+05 107 1e+05
106 106
1e4 1e4
105 105
# bit.ly clicks
# bit.ly clicks
104 1000 104 1000
#urls
#urls
103 103 100
100
102 102
10 10
101 101
100 −7 1 100 0 1
10 10−6 10−5 10−4 10−3 10−2 10 101 102
average PageRank of the posting users average H-index of the posting users
(c) Average user PageRank vs. number of clicks on (d) Average user H-index vs. number of clicks on
URLs URLs
# bit.ly clicks
104 104
#urls
#urls
100 100
103 103
102 102
10 10
101 101
100 −9 100 −9
10 10−8 10−7 10−6 10−5 10−4 10−3 10−2 10 10−8 10−7 10−6 10−5 10−4 10−3
average IP-influence of the posting users average IP-influence of the posting users
(e) Average user IP-influence vs. number of clicks on (f) Average user IP-influence vs. number of clicks on
URLs, using the retweet graph as input URLs, using the co-mention graph as input
Figure 3: We consider several user attributes: the number of followers, the number of times a user has been
retweeted, the user’s PageRank, H-index and IP-influence. For each of the 3.2M Bit.ly URLs we compute
the average value of a user’s attribute among all the users that mentioned that URL. This value becomes the
x coordinate of the URL-point; the y coordinate is the number of clicks on the Bit.ly URL. The density of
the URL-points is then plotted for each of the four user attributes. The solid line in each figure represents
the 99.9th percentile of Bit.ly clicks at a given attribute value. The dotted line is the linear regression fit for
the solid line.
10−2 10−3
1000
1000
10−5 100
#users
#users
10−6
10−6
10 10−7 10
10−7
−8
10−8 10
10−9 0 10−9 −9
10 101 102 103 104 105 106 10 10−8 10−7 10−6 10−5 10−4 10−3 10−2
# of followers IP-influence (co-mention graph)
Figure 4: For each user we place a user-point with Figure 5: The correlation between the IP-influence
IP-influence as the y coordinate and the x coordinate values computed based on two inputs: the co-
set to the number of user’s followers. The density mention influence graph and the retweet influence
of user-points is represented in grayscale. graph.
Table 1: Users with the most IP-influence (with at Table 2: Users with the most IP-passivity
least 10 URLs posted in the period)
Table 4: Users with very few followers but high relative influence
measures over a fixed 300-hour window. However, the So- medium, regardless of their popularity.
cial Media are a rapidly changing, real-time communication
platform. There are several implications of this. First, the 10. REFERENCES
IP algorithm would need to be modified to take into ac- [1] Jure Leskovec, Lada A. Adamic and Bernardo A.
count the tweet timestamps. Second, the IP-influence it- Huberman. The dynamics of viral marketing. In
self changes over time, which brings a number of interesting Proceedings of the 7th ACM Conference on Electronic
questions about the dynamics of influence and attention. Commerce, 2006.
In particular, whether users with spikes of IP-influence are
[2] Bernardo A. Huberman, Daniel M. Romero, and Fang
overall more influential than users which can sustain their
Wu. Social networks that matter: Twitter under the
IP-influence over time is an open question.
microscope. First Monday, 14(1), Jan 2009.
[3] B. Jansen, M. Zhang, K. Sobel, and A. Chowdury.
9. CONCLUSION Twitter power: Tweets as electronic word of mouth.
Journal of the American Society for Information
Given the mushrooming popularity of Social Media, vast
Science and Technology, 2009.
efforts are devoted by individuals, governments and enter-
prises to getting attention to their ideas, policies, products, [4] Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty,
and commentary through social networks. But the very Zoran Despotovic, Wolfgang Kellerer Outtweeting the
large scale of the networks underlying Social Media makes Twitterers - Predicting Information Cascades in
it hard for any of these topics to get enough attention in Microblogs 3rd Workshop on Online Social Networks,
order to rise to the most trending ones. Given this con- WOSN, 2010)
straint, there has been a natural shift on the part of the [5] Jianshu Weng and Ee-Peng Lim and Jing Jiang and Qi
content generators towards targeting those individuals that He. TwitterRank: finding topic-sensitive influential
are perceived as influential because of their large number twitterers. WSDM, 2010.
of followers. This study shows that the correlation between [6] Fang Wu, Bernardo A. Huberman, Lada Adamic and
popularity and influence is weaker than it might be expected. Josh Tyler. Information Flow in Social Groups. Physica
This is a reflection of the fact that for information to prop- A, Vol 337, 327-335, 2004.
agate in a network, individuals need to forward it to the [7] Meeyoung Cha and Hamed Haddadi and Fabricio
other members, thus having to actively engage rather than Benevenuto and Krishna P. Gummadi. Measuring User
passively read it and cease to act on it. Moreover, since our Influence in Twitter: The Million Follower Fallacy. 4th
measure of influence is not specific to Twitter it is applicable International AAAI Conference on Weblogs and Social
to many other social networks. This opens the possibility Media (ICWSM), 2010.
of discovering influential individuals within a network which [8] Nitin Agarwal and Huan Liu and Lei Tang and Philip
can on average have a further reach than others in the same S. Yu. Identifying the Influential Bloggers in a
Community. WSDM, 2008.
[9] Sinan Aral, Lev Muchnik and Arun Sundararajan.
Distinguishing influence-based contagion from
homophily-driven diffusion in dynamic networks.
Proceedings of the National Academy of Sciences, Vol.
106 (51), pp. 21544-21549, 2009.
[10] Duncan J. Watts and Peter Sheridan Dodds.
Influentials, Networks, and Public Opinion Formation.
Journal of Consumer Research, Vol. 34 (4), pp.
441-458, 2007.
[11] Amit Goyal, Francesco Bonchi and Laks V.S.
Lakshmanan. Learning Influence Probabilities In Social
Networks. WSDM, 2010.
[12] P. Domingos and M. Richardson. Mining the network
value of customers. SIGKDD, 2001.
[13] S. Brin and L. Page. The anatomy of a large-scale
hypertextual Web search engine. Computer Networks
and ISDN Systems, Vol 30, 1-7, 1998.
[14] Jon Kleinberg. Authoritative sources in a hyperlinked
environment. Journal of the ACM 46 (5): 604 -632,
1999.
[15] Boyd Danah, Scott Golder, and Gilad Lotan. Tweet,
Tweet, Retweet: Conversational Aspects of Retweeting
on Twitter. HICSS-43. IEEE 2010.
[16] Jorge E. Hirsch. An index to quantify an individual’s
scientific research output. Proceedings of the National
Academy of Sciences 102(46): 16569 -16572, 2005.