This repository contains:
- The scripts to estimate user influence from Twitter information cascades (i.e. Cas.In);
- A small dataset of 20 cascades for testing Cas.In;
- A hands-on tutorial to walk you through running Cas.In on real cascades.
The algorithm was introduced in the paper:
Rizoiu, M.-A., Graham, T., Zhang, R., Zhang, Y., Ackland, R., & Xie, L. (2018). #DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 U.S. Presidential Debate. In Proc. International AAAI Conference on Web and Social Media (ICWSM ’18) (pp. 1–10). Stanford, CA, USA.
PDF on arXiv (with supplementary material): https://arxiv.org/abs/1802.09808
BibTeX:
@inproceedings{rizoiu2018debatenight,
address = {Stanford, CA, USA},
author = {Rizoiu, Marian-Andrei and Graham, Timothy and Zhang, Rui and Zhang, Yifei and Ackland, Robert and Xie, Lexing},
booktitle = {International AAAI Conference on Web and Social Media (ICWSM '18)},
title = {{{\#}DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 U.S. Presidential Debate}},
url = {https://arxiv.org/abs/1802.09808},
year = {2018}
}
Both dataset and code are distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, a copy of which can be obtained at https://creativecommons.org/licenses/by-nc/4.0/. If you require a different license, please contact Yifei Zhang, Marian-Andrei Rizoiu or Lexing Xie.
- python3
- numpy
- pandas
--cascade_path : the path of the cascade file (see the format below).
--time_decay : the value of the time decay coefficient (hyperparameter).
--save2csv : save the result to a csv file. Default: False
cd scripts
python3 influence.py --cascade_path path/to/file
We provide a toy dataset -- dubbed SMH -- for testing Cas.In. It was collected in 2017 by following the Twitter handle of the Sydney Morning Herald newspaper (tweets and retweets mentioning SMH or linking to an article from SMH).
The data contains 20 cascades (one file per cascade).
We anonymized the user_id (as per Twitter's ToS) by mapping original values to a sequence from 0 to n, while preserving the identity of users across cascades.
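The anonymization step described above can be illustrated with a minimal sketch. This is not the script used to build the SMH dataset: the function name `anonymize_user_ids` and the example ids are hypothetical, and the order in which integers are assigned (first appearance) is an assumption.

```python
from typing import Dict, Hashable, Iterable, List


def anonymize_user_ids(cascades: Iterable[List[Hashable]]) -> List[List[int]]:
    """Map original user ids to 0, 1, 2, ... in order of first appearance.

    A single mapping is shared by all cascades, so a user who appears in
    several cascades receives the same anonymous id everywhere.
    """
    mapping: Dict[Hashable, int] = {}
    anonymized = []
    for cascade in cascades:
        # setdefault assigns the next free integer the first time an id is seen
        anonymized.append([mapping.setdefault(uid, len(mapping)) for uid in cascade])
    return anonymized


# Hypothetical original ids, for illustration only
example = anonymize_user_ids([["alice", "bob", "alice"], ["carol", "bob"]])
print(example)  # [[0, 1, 0], [2, 1]]
```

Sharing one mapping across all cascades is what keeps a user's identity consistent when influence is later averaged over several cascades.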
- A csv file with 3 columns (time, magnitude, user_id), where each row is a tweet in the cascade:
  - time is the timestamp of the tweet. The first tweet is always at time zero; for each following retweet, it is the offset in seconds from the initial tweet;
  - magnitude is the local influence of the user (here, the number of followers);
  - user_id is the id of the user emitting the tweet (here anonymized).
- The rows in the file (i.e. the tweets) are sorted by timestamp.
For example:
time,magnitude,user_id
0,4674,"0"
321,1327,"1"
339,976,"2"
383,477,"3"
699,1209,"4"
824,119,"5"
835,1408,"6"
1049,896,"7"
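Before running Cas.In on your own data, it may help to check that a file follows the format above. The sketch below is not part of the repository: `check_cascade_csv` is a hypothetical helper that uses only the Python standard library and tests the three properties stated above (column names, first tweet at time zero, rows sorted by time).

```python
import csv
import io
from typing import List


def check_cascade_csv(text: str) -> List[str]:
    """Return a list of format problems found in a cascade CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["time", "magnitude", "user_id"]:
        return ["header must be exactly: time,magnitude,user_id"]
    times = [float(row["time"]) for row in reader]
    if not times:
        return ["cascade contains no tweets"]
    if times[0] != 0:
        problems.append("the first tweet must be at time zero")
    # Compare each timestamp with the next one to detect out-of-order rows
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        problems.append("rows must be sorted by timestamp")
    return problems


sample = 'time,magnitude,user_id\n0,4674,"0"\n321,1327,"1"\n339,976,"2"\n'
print(check_cascade_csv(sample))  # []
```

To check a file on disk, read it first, e.g. `check_cascade_csv(open(path).read())`, where `path` is the location of your cascade file.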
Next, we walk you through using Cas.In to estimate user influence, starting from a single cascade.
First, we load the packages required to compute cascade influence.
cd scripts
import pandas as pd
import numpy as np
from casIn.user_influence import P,influence
Load the first cascade in the SMH toy dataset:
cascade = pd.read_csv("../data/SMH/SMH-cascade-0.csv")
cascade.head()
| | time | magnitude | user_id |
|---|---|---|---|
| 0 | 0 | 991 | 419 |
| 1 | 127 | 1352 | 658 |
| 2 | 2149 | 2057 | 264 |
| 3 | 2465 | 1155 | 1016 |
| 4 | 2485 | 1917 | 790 |
We first need to compute the probabilities $p_{ij}$, where $p_{ij}$ is the probability that the $j$-th tweet is a direct retweet of the $i$-th tweet (see the paper for more details). We need to specify the hyper-parameter $r$, the time decay coefficient. Here we choose $r = -0.000068$.
p_ij = P(cascade,r = -0.000068)
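To build intuition for what these probabilities look like, here is a minimal pure-Python sketch. It is not the implementation of P() in this repository. It assumes that an earlier tweet attracts a later one with a weight equal to its magnitude times an exponential time decay, normalized over all earlier tweets; the paper's exact formula may include further parameters. The name `retweet_probabilities` is hypothetical.

```python
import math
from typing import List, Sequence


def retweet_probabilities(
    times: Sequence[float], magnitudes: Sequence[float], r: float
) -> List[List[float]]:
    """p[i][j] = assumed probability that tweet j is a direct retweet of tweet i.

    ASSUMPTION: an earlier tweet i attracts tweet j with weight
    magnitudes[i] * exp(r * (times[j] - times[i])), normalized over all i < j.
    """
    n = len(times)
    p = [[0.0] * n for _ in range(n)]
    for j in range(1, n):
        weights = [magnitudes[i] * math.exp(r * (times[j] - times[i])) for i in range(j)]
        total = sum(weights)
        for i in range(j):
            p[i][j] = weights[i] / total
    return p


# First rows of the example cascade shown in the format description
times = [0, 321, 339, 383]
magnitudes = [4674, 1327, 976, 477]
p = retweet_probabilities(times, magnitudes, r=-0.000068)
# Every retweet has exactly one parent, so each column j >= 1 sums to 1
column_sums = [sum(p[i][j] for i in range(len(times))) for j in range(1, len(times))]
print(column_sums)
```

Under this assumption, a negative r makes older tweets less likely to be the direct parent of a new retweet, while a larger magnitude (more followers) makes a tweet more likely to be the parent.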
The function influence() returns an array of influences, one per tweet (and thus per emitting user) in the cascade, and the matrix $m_{ij}$, where $m_{ij}$ is the influence (direct and indirect) of the $i$-th tweet over the $j$-th tweet.
inf, m_ij = influence(p_ij)
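The relation between the direct-retweet probabilities and the influence values can be sketched as follows. This is an illustration, not the code of influence(): it assumes that each tweet has influence 1 over itself, that the influence of tweet i over a later tweet j accumulates over all intermediate tweets k as the sum of m[i][k] * p[k][j], and that the total influence of a tweet is the sum of its row. The name `tweet_influence` is hypothetical.

```python
from typing import List, Sequence, Tuple


def tweet_influence(p: Sequence[Sequence[float]]) -> Tuple[List[float], List[List[float]]]:
    """Derive pairwise and total influence from direct-retweet probabilities.

    ASSUMPTION: m[i][i] = 1 and, for i < j,
    m[i][j] = sum of m[i][k] * p[k][j] over i <= k < j.
    """
    n = len(p)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
        for j in range(i + 1, n):
            m[i][j] = sum(m[i][k] * p[k][j] for k in range(i, j))
    # Total influence of tweet i: itself plus every tweet it reaches
    influences = [sum(row) for row in m]
    return influences, m


# A chain of three tweets: tweet 1 retweets tweet 0, tweet 2 retweets tweet 1
chain = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
influences, m = tweet_influence(chain)
print(influences)  # [3.0, 2.0, 1.0]
```

In the chain example the first tweet indirectly reaches the third one through the second, so its influence covers the whole cascade.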
Now, we add the computed user influence back to the pandas data structure.
cascade["influence"] = pd.Series(inf)
cascade.head()
| | time | magnitude | user_id | influence |
|---|---|---|---|---|
| 0 | 0 | 991 | 419 | 60.000000 |
| 1 | 127 | 1352 | 658 | 34.590370 |
| 2 | 2149 | 2057 | 264 | 29.656122 |
| 3 | 2465 | 1155 | 1016 | 13.535845 |
| 4 | 2485 | 1917 | 790 | 15.913873 |
The function casIn() computes the influence in a single cascade; it wraps all the steps described above.
from casIn.user_influence import casIn
# use a name that does not shadow the influence() function imported earlier
cascade_influence = casIn(cascade_path="../data/SMH/SMH-cascade-0.csv", time_decay=-0.000068)
cascade_influence.head()
| | time | magnitude | user_id | influence |
|---|---|---|---|---|
| 0 | 0 | 991 | 419 | 60.000000 |
| 1 | 127 | 1352 | 658 | 34.590370 |
| 2 | 2149 | 2057 | 264 | 29.656122 |
| 3 | 2465 | 1155 | 1016 | 13.535845 |
| 4 | 2485 | 1917 | 790 | 15.913873 |
The SMH toy dataset contains 20 cascades for testing out Cas.In. Let's compute the influence in each of them and concatenate the results:
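cascades = []
for i in range(20):
    # compute the influence of every tweet in cascade i
    inf = casIn(cascade_path="../data/SMH/SMH-cascade-%d.csv" % i, time_decay=-0.000068)
    cascades.append(inf)
# stack the 20 per-cascade results into a single DataFrame
cascades = pd.concat(cascades)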
The influence of a user is by definition the mean influence of the tweets they emit. We compute the user influence as follows:
result = cascades.groupby("user_id").agg({"influence" : "mean"})
result.sort_values("influence",ascending=False).head()
| user_id | influence |
|---|---|
| 734 | 214.000000 |
| 1225 | 205.000000 |
| 755 | 190.554571 |
| 60 | 189.557461 |
| 581 | 141.033129 |
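To make the aggregation concrete, the sketch below computes the same kind of per-user mean without pandas. The helper `mean_influence_per_user` is hypothetical and the numbers are made up; they are not taken from the SMH dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_influence_per_user(tweets: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average the influence of all tweets emitted by each user."""
    grouped: Dict[int, List[float]] = defaultdict(list)
    for user_id, influence in tweets:
        grouped[user_id].append(influence)
    return {user_id: sum(values) / len(values) for user_id, values in grouped.items()}


# Made-up (user_id, influence) pairs: user 7 tweeted in two cascades
toy = [(7, 10.0), (3, 4.0), (7, 2.0)]
print(mean_influence_per_user(toy))  # {7: 6.0, 3: 4.0}
```

Note that a user who appears in only one cascade keeps the influence of that single tweet, so a high score can come from one large cascade.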