Analyzing links in tweets
This is a draft document to explain how I plan to analyze links in tweets.
I've used twarc2 search and then flattened the tweets with twarc2 flatten.
I started by reporting that I needed a link unshortener for twarc2 (https://github.com/DocNow/twarc/issues/538). In the process of reporting and testing, we realized the problem was something else.
In the early days tweets had one link and it was simpler to analyze them. I remember @yourcel's project that analyzed the links in #occupy tweets back in 2011 (https://occupybostonlinks.tirl.org/1/count/). Now the situation is a bit more complex.
A tweet can:
- have no link
- be a RT: has 1 link (the link to the RT tweet)
- be a RT of a tweet with a link in it: has 2 links (the link to the RT tweet and the link in the RT tweet)
- quote a tweet with a link in it and include a link: has 3 links
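The cases above can be told apart by counting the URLs at each level of a flattened tweet. This is a minimal sketch, assuming the flattened twarc2 structure used later on this page (entities$urls and referenced_tweets nested inside each tweet). count_urls and link_profile are hypothetical helper names, and the sample tweet is made up for illustration:

```r
# Number of URL entities attached directly to one tweet object (0 if none).
count_urls <- function(tweet) {
  length(tweet$entities$urls)
}

# Relationship type plus URL counts for a tweet and the tweet it references.
link_profile <- function(tweet) {
  refs <- tweet$referenced_tweets
  ref <- if (length(refs) > 0) refs[[1]] else NULL
  list(
    type       = if (is.null(ref$type)) NA_character_ else ref$type,
    own        = count_urls(tweet),
    referenced = if (is.null(ref)) 0L else count_urls(ref)
  )
}

# Made-up example: a quote tweet that adds its own link to a tweet with a link.
quote_tweet <- list(
  id = "2",
  entities = list(urls = list(
    list(expanded_url = "https://twitter.com/someone/status/1"),
    list(expanded_url = "https://example.org/my-link")
  )),
  referenced_tweets = list(list(
    type = "quoted",
    id = "1",
    entities = list(urls = list(
      list(expanded_url = "https://example.org/original")
    ))
  ))
)

profile <- link_profile(quote_tweet)
```

Here the quote tweet carries two URLs of its own and the quoted tweet carries one, which gives the three links of the last case in the list.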
If you explore the downloaded json file you can see different examples. If you convert the file to a csv (there is an open issue about that: https://github.com/DocNow/twarc-csv/issues/33) more questions arise.
I've created an R script to do this: extract all the different links in the tweets and analyze them. Code available at https://code.montera34.com/numeroteca/tuits-analysis/-/blob/master/analysis/links-in-tweets.R
There are many things to improve, but so far the code is working. The method is to iterate through all the tweets and extract the URLs as follows:
# json is a large list of tweets in JSON format (see code for more details)
# get_in() walks a nested list and returns NULL as soon as a level is missing,
# so try() is no longer needed for tweets with fewer than two URLs
get_in <- function(x, ...) {
  for (key in list(...)) {
    if (is.null(x) || (is.numeric(key) && key > length(x))) return(NULL)
    x <- x[[key]]
  }
  x
}
# chr() turns a missing value into NA and anything else into a character string
chr <- function(x) if (is.null(x)) NA_character_ else as.character(x)

n <- length(json)
# tweets is assumed to be a data frame with n rows and one character column per variable
for (i in seq_len(n)) {
  tw   <- json[[i]]
  ref  <- get_in(tw, "referenced_tweets", 1)   # tweet referenced by this tweet
  ref2 <- get_in(ref, "referenced_tweets", 1)  # tweet referenced by the referenced tweet
  tweets$created_at[i] <- chr(tw$created_at)
  tweets$author[i]     <- chr(get_in(tw, "author", "username"))
  tweets$text[i]       <- chr(tw$text)
  tweets$type[i]       <- chr(ref$type)
  tweets$text_ref[i]   <- chr(ref$text)
  tweets$text_RT[i]    <- chr(ref2$text)
  tweets$url0[i] <- chr(get_in(tw,   "entities", "urls", 1, "expanded_url"))
  tweets$url1[i] <- chr(get_in(ref2, "entities", "urls", 1, "expanded_url"))
  tweets$url2[i] <- chr(get_in(ref2, "entities", "urls", 2, "expanded_url"))
  tweets$url3[i] <- chr(get_in(ref,  "entities", "urls", 2, "expanded_url"))
  tweets$id[i]   <- chr(tw$id)
}
What each variable is:
- id, created_at, author: self-explanatory.
- text: text of the tweet. Retweets are truncated.
- type: type of relationship between this tweet and the tweet in text_ref. Can be quoted, replied_to or retweeted (if none of those applies, "normal" is inserted).
- text_ref: full text of the referenced tweet. It allows full-text analysis of the tweets (as RT texts are truncated). To do: verify this hypothesis.
- url0: URL of the original tweet (the first entry of its entities$urls). If it is a native RT, the link to the RT tweet is included.
- url1: 1st URL of the referenced tweet. The loop reads it one level deeper, from the tweet that the referenced tweet itself references (referenced_tweets[[1]]$referenced_tweets[[1]]).
- url2: 2nd URL of the referenced tweet, read from that same second level.
- url3: 3rd URL of the referenced tweet, if it is a quoted tweet. The loop reads it as the 2nd URL of the directly referenced tweet (referenced_tweets[[1]]).
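Because the loop reads fixed positions (the first or second URL at a given level), a tweet with more URLs than that loses some of them. One possible alternative is to gather every expanded_url at each level in long format, one row per link. This is only a sketch under the same assumptions about the flattened json structure; expanded_urls, urls_long and the level labels are hypothetical names, and the sample tweet is made up:

```r
# Every expanded_url attached directly to one tweet object (character(0) if none).
expanded_urls <- function(tweet) {
  urls <- tweet$entities$urls
  if (length(urls) == 0) return(character(0))
  out <- vapply(urls, function(u) {
    if (is.null(u$expanded_url)) NA_character_ else as.character(u$expanded_url)
  }, character(1))
  out[!is.na(out)]
}

# One row per URL, tagged with the level at which it was found.
urls_long <- function(json) {
  rows <- lapply(json, function(tw) {
    ref  <- if (length(tw$referenced_tweets) > 0) tw$referenced_tweets[[1]] else NULL
    ref2 <- if (length(ref$referenced_tweets) > 0) ref$referenced_tweets[[1]] else NULL
    found <- list(
      own          = expanded_urls(tw),
      referenced   = expanded_urls(ref),
      referenced_2 = expanded_urls(ref2)
    )
    data.frame(
      id    = rep(as.character(tw$id), sum(lengths(found))),
      level = rep(names(found), lengths(found)),
      url   = as.character(unlist(found, use.names = FALSE)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Made-up input: one quote tweet with two own URLs, quoting a tweet with one URL.
json <- list(list(
  id = "2",
  entities = list(urls = list(
    list(expanded_url = "https://twitter.com/someone/status/1"),
    list(expanded_url = "https://example.org/my-link")
  )),
  referenced_tweets = list(list(
    type = "quoted",
    id = "1",
    entities = list(urls = list(
      list(expanded_url = "https://example.org/original")
    ))
  ))
))

links <- urls_long(json)
```

The long format keeps the tweet id next to each link, so the result can still be joined back to the tweets table, and the level column records where each URL came from instead of encoding it in a column name.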
It is not very clean and might miss some URLs stored in other places of the json structure. All in all I think it gathers a good number of URLs (I'd need to quantify how many) and it's possible to create plots like this one, where every point is not a tweet but a link in a tweet (1 tweet can have many URLs in it):
I created the chart above without the Twitter links, because if Twitter URLs are included, this is the result:
There are many more analyses that I am going to explore.
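The difference between the two charts comes down to whether links pointing to twitter.com are dropped before counting. A rough sketch of that filter, assuming a data frame with one row per link; the links data frame, its url column and the sample URLs are all made up for illustration:

```r
# Made-up links table: one row per link found in the tweets.
links <- data.frame(
  url = c(
    "https://twitter.com/someone/status/1",
    "https://twitter.com/someone/status/2",
    "https://twitter.com/other/status/3",
    "https://www.eldiario.es/politica/some-story",
    "https://www.eldiario.es/sociedad/another-story",
    "https://youtu.be/abc123"
  ),
  stringsAsFactors = FALSE
)

# Host of each URL, lower-cased and without a leading "www.".
host <- sub("^https?://([^/?#:]+).*$", "\\1", links$url)
links$domain <- sub("^www\\.", "", tolower(host))

# TRUE for twitter.com and any of its subdomains (e.g. cards.twitter.com).
is_twitter <- grepl("(^|\\.)twitter\\.com$", links$domain)

# Link counts per domain, with and without the Twitter links.
counts_all        <- sort(table(links$domain), decreasing = TRUE)
counts_no_twitter <- sort(table(links$domain[!is_twitter]), decreasing = TRUE)
```

Since retweets and quotes carry a link back to a tweet, Twitter links are very common in this kind of data, which is why they are filtered out here before counting.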
To classify the domains I use this manual classification (based on the Spanish news media ecosystem):
news_media <- c("eldiario.es","m.eldiario.es","publico.es","elplural.com","elboletin.com","elespanol.com","eljueves.es","huffingtonpost.es", "nuevatribuna.es","atres.red","elindependiente.com","okdiario.com","okdario.com","cadenser.com","elconfidencial.com","elmundo.es","estrelladigital.es","infolibre.es","elpolitiko.com","vozpopuli.com","rac1.cat","europapress.es","digitalsevilla.es","elpais.com","periodistadigital.com","libertadigital.com","lrzn.es","diario.es","lavanguardia.com","elperiodico.com","blogs.publico.es","abc-es","lamarea.com","esdiario.es","a.msn.com","e24diari.es","rrss.abc.es","lasexta.com","ondace.ro","Eldiario.es","m.publico.es","cronicaglobal.elespanol.com","elnacional.cat","ElDiario.es","elPeriodi.co")
socialmedia <- c("meneame.net","change.org","twitter.com","facebook.com","cards.twitter.com","flip.it","klinews.com","paper.li")
video <- c("youtu.be","youtube.com","pspc.tv")
shorteners <- c("buff.ly","bit.ly","dlvr.it","ow.ly","goo.gl","trib.al","tinyurl.com","ift.tt")
society <- c("mats-sanidad.com")
other <- c("google.com","google.es")
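To apply these lists, each URL first has to be reduced to its domain and then looked up in the vectors. The following is a minimal sketch of one way to do it, not the code from the script: url_domain and classify_domain are hypothetical helpers, the category list is a short excerpt of the vectors above, and the URLs are made up:

```r
# Host part of a URL, lower-cased and without a leading "www.".
url_domain <- function(url) {
  host <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                    # drop path, query, fragment
  host <- sub("^[^@]*@", "", host)                     # drop any userinfo
  host <- sub(":[0-9]+$", "", host)                    # drop the port
  sub("^www\\.", "", tolower(host))
}

# Category name for each domain, or "unclassified" when it is in no list.
classify_domain <- function(domain, categories) {
  lookup <- unlist(lapply(names(categories), function(nm) {
    setNames(rep(nm, length(categories[[nm]])), tolower(categories[[nm]]))
  }))
  out <- unname(lookup[domain])
  out[is.na(out)] <- "unclassified"
  out
}

# Excerpt of the manual classification, for illustration only.
categories <- list(
  news_media  = c("eldiario.es", "publico.es", "elpais.com"),
  socialmedia = c("meneame.net", "twitter.com", "facebook.com"),
  video       = c("youtu.be", "youtube.com"),
  shorteners  = c("buff.ly", "bit.ly")
)

urls <- c(
  "https://www.ElDiario.es/politica/some-story",
  "https://twitter.com/someone/status/1",
  "https://bit.ly/abc123",
  "https://example.org/page"
)

domains <- url_domain(urls)
labels  <- classify_domain(domains, categories)
```

With the full vectors, the list would be built as list(news_media = news_media, socialmedia = socialmedia, ...). Lower-casing the domain before the lookup is a design choice of this sketch: it would make entries such as "Eldiario.es" and "ElDiario.es" redundant, whereas the lists above keep them as separate entries.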
More charts can be seen in the last commit: https://code.montera34.com/numeroteca/tuits-analysis/-/commit/e756aae2abb0880e59e970d98546d95f93e96a2c
One particular thing I want to clarify is what url0, url1, url2 and url3 exactly mean.