Analyzing links in tweets

This is a draft document to explain how I plan to analyze links in tweets.

1. Get the tweets

I've used twarc2 search and then flattened the tweets with twarc2 flatten.

2. Where are the links?

I started by reporting that I needed the link unshortener for twarc2: https://github.com/DocNow/twarc/issues/538. In the process of reporting and testing, we realized the problem was something else.

In the early days tweets had one link and it was simpler to analyze them. I remember @yourcel's project that analyzed the links in #occupy tweets back in 2011 (https://occupybostonlinks.tirl.org/1/count/). Now the situation is a bit more complex.

A tweet can:

have no link
be a RT: has 1 link (the link to the RT tweet)
be a RT of a tweet with a link in it: has 2 links (the link to the RT tweet and the link in the RT tweet)
quote a tweet with a link in it and include a link: has 3 links

If you explore the downloaded JSON file you can see different examples. If you convert the file to a CSV (there is an open issue about that: https://github.com/DocNow/twarc-csv/issues/33) more questions arise.

3. Extract all relevant links

I've created an R script to do this: extract all the different links in the tweets and analyze them. Code available at https://code.montera34.com/numeroteca/tuits-analysis/-/blob/master/analysis/links-in-tweets.R

There are many things to improve, but so far the code is working. The method is to iterate through all the lines and extract the URLs as follows:

# json is a large list of tweets in json format (see code for more details)
# Return a field as character, or NA when the field is missing (NULL)
or_na <- function(x) if (is.null(x)) NA_character_ else as.character(x)

for (i in seq_len(n)) {
  print(i)
  tw   <- json[[i]]
  ref  <- tw$referenced_tweets[[1]]   # first referenced tweet (NULL if there is none)
  ref2 <- ref$referenced_tweets[[1]]  # first tweet referenced by the referenced tweet
  tweets$created_at[i] <- or_na(tw$created_at)
  tweets$author[i]     <- or_na(tw$author$username)
  tweets$text[i]       <- or_na(tw$text)
  tweets$type[i]       <- or_na(ref$type)
  tweets$text_ref[i]   <- or_na(ref$text)
  tweets$text_RT[i]    <- or_na(ref2$text)
  tweets$url0[i]       <- or_na(tw$entities$urls[[1]]$expanded_url)
  tweets$url1[i]       <- or_na(ref2$entities$urls[[1]]$expanded_url)
  # urls[[2]] fails with "subscript out of bounds" when there is a single URL,
  # so try() is used to avoid stopping on that error
  try(tweets$url2[i] <- or_na(ref2$entities$urls[[2]]$expanded_url), silent = TRUE)
  try(tweets$url3[i] <- or_na(ref$entities$urls[[2]]$expanded_url), silent = TRUE)
  tweets$id[i] <- or_na(tw$id)
}
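
The loop above reads only fixed positions (the first or second entry of urls at given nesting levels), so URLs sitting anywhere else are not captured. One possible cross-check, sketched here under assumptions, is to walk each tweet recursively and collect every expanded_url it contains. The helper collect_expanded_urls and the mock tweet (with made-up URLs) are hypothetical and not part of the linked script; the sketch assumes tweets are parsed into nested R lists, as in the loop.

```r
# Hypothetical helper (not in the original script): walk a nested list and
# return every character value stored under the name "expanded_url".
collect_expanded_urls <- function(x) {
  if (!is.list(x)) return(character(0))
  found <- character(0)
  nms <- names(x)
  for (k in seq_along(x)) {
    item <- x[[k]]
    if (!is.null(nms) && identical(nms[k], "expanded_url") && is.character(item)) {
      found <- c(found, item)
    } else if (is.list(item)) {
      # Descend into nested objects (entities, referenced_tweets, ...)
      found <- c(found, collect_expanded_urls(item))
    }
  }
  found
}

# Mock tweet with made-up URLs: it references a tweet that carries two links
tweet <- list(
  id = "1",
  entities = list(urls = list(list(expanded_url = "https://example.org/a"))),
  referenced_tweets = list(list(
    type = "retweeted",
    entities = list(urls = list(
      list(expanded_url = "https://example.org/a"),
      list(expanded_url = "https://example.org/b")
    ))
  ))
)

# unique() removes the link that appears both in the tweet and in its reference
urls <- unique(collect_expanded_urls(tweet))
print(urls)
```

Comparing the number of URLs found this way with the number stored in url0 to url3 would be one way to quantify how many links the fixed-position approach misses.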

What is each variable:

the ones that are clear are: id, created_at, author
text: text of the tweet. Retweets are truncated
type: type of relationship of this tweet with the tweet at text_ref. Can be quoted, replied_to or retweeted (if it is none of those, "normal" is inserted)
text_ref: full text of the referenced tweet. It allows full text analysis of the tweets (as the texts of RTs are truncated).

Verify this hypothesis

url0: URL of the original tweet. If it is a native RT the link to the RT tweet is included
url1: 1st URL of the referenced tweet
url2: 2nd URL of the referenced tweet
url3: 3rd URL of the referenced tweet, if it is a quoted tweet

It is not very clean and might miss some URLs that sit in other places of the JSON structure. All in all I think it gathers a good amount of URLs (I'd need to quantify how many) and it's possible to create plots like this one, where every point is not a tweet but a link in a tweet (1 tweet can have many URLs in it):

I created the chart above without the Twitter links, because if Twitter URLs are included, this is the result:

There are many more analyses that I am going to explore.

To classify the domains I use this manual classification (based on the Spanish news media ecosystem):

news_media <- c("eldiario.es","m.eldiario.es","publico.es","elplural.com","elboletin.com","elespanol.com","eljueves.es","huffingtonpost.es",    "nuevatribuna.es","atres.red","elindependiente.com","okdiario.com","okdario.com","cadenser.com","elconfidencial.com","elmundo.es","estrelladigital.es","infolibre.es","elpolitiko.com","vozpopuli.com","rac1.cat","europapress.es","digitalsevilla.es","elpais.com","periodistadigital.com","libertadigital.com","lrzn.es","diario.es","lavanguardia.com","elperiodico.com","blogs.publico.es","abc-es","lamarea.com","esdiario.es","a.msn.com","e24diari.es","rrss.abc.es","lasexta.com","ondace.ro","Eldiario.es","m.publico.es","cronicaglobal.elespanol.com","elnacional.cat","ElDiario.es","elPeriodi.co")
socialmedia <- c("meneame.net","change.org","twitter.com","facebook.com","cards.twitter.com","flip.it","klinews.com","paper.li")
video <- c("youtu.be","youtube.com","pspc.tv")
shorteners <- c("buff.ly","bit.ly","dlvr.it","ow.ly","goo.gl","trib.al","tinyurl.com","ift.tt")
society <- c("mats-sanidad.com")
other <- c("google.com","google.es")
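
As a minimal sketch of how vectors like these could be applied, the example below extracts the domain of each URL and looks it up in the classification. The helpers get_domain and classify_domain, the "unclassified" label and the example URLs are assumptions for illustration, not taken from the linked script; the vectors are short excerpts repeated so the example runs on its own.

```r
# Hypothetical helper: keep the host part of a URL and drop a leading "www."
get_domain <- function(url) {
  host <- sub("^[a-zA-Z]+://", "", url)  # remove the scheme (http://, https://)
  host <- sub("[/?#:].*$", "", host)     # remove port, path, query and fragment
  sub("^www\\.", "", host)
}

# Short excerpts of the classification vectors
news_media  <- c("eldiario.es", "publico.es", "elpais.com")
socialmedia <- c("twitter.com", "facebook.com", "cards.twitter.com")
video       <- c("youtu.be", "youtube.com")
shorteners  <- c("bit.ly", "buff.ly")

# Lookup table: one row per domain with its category
categories <- list(news_media = news_media, socialmedia = socialmedia,
                   video = video, shorteners = shorteners)
lookup <- data.frame(
  domain   = unlist(categories, use.names = FALSE),
  category = rep(names(categories), lengths(categories)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: category of each domain, "unclassified" when not listed
classify_domain <- function(domain) {
  category <- lookup$category[match(domain, lookup$domain)]
  ifelse(is.na(category), "unclassified", category)
}

# Made-up example URLs
urls <- c("https://www.eldiario.es/politica/some-article",
          "https://twitter.com/user/status/123",
          "https://youtu.be/abc",
          "https://example.org/page")
links <- data.frame(url = urls, domain = get_domain(urls), stringsAsFactors = FALSE)
links$category <- classify_domain(links$domain)

# Drop Twitter's own links (twitter.com and its subdomains) before plotting
is_twitter <- grepl("(^|\\.)twitter\\.com$", links$domain)
links_no_twitter <- links[!is_twitter, ]
print(table(links_no_twitter$category))
```

The match here is case-sensitive, which is presumably why variants such as "Eldiario.es" and "ElDiario.es" appear in the list; applying tolower() to the domains first would be one way to make those variants unnecessary.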

By type of domain

By number of mentions to unique links

By type of tweet

More charts can be seen in the last commit: https://code.montera34.com/numeroteca/tuits-analysis/-/commit/e756aae2abb0880e59e970d98546d95f93e96a2c

One particular thing I want to clarify is what url0, url1, url2 and url3 exactly mean.
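
One way to check this is to run the four paths that the loop reads against a mock tweet whose URLs are labelled by position. The sketch below does that; the mock tweet and its URLs are made up, and whether real flattened tweets nest referenced tweets two levels deep is an assumption that would need checking against the data.

```r
# Mock tweet (made-up URLs). Each URL label says where it sits in the structure.
tweet <- list(
  entities = list(urls = list(
    list(expanded_url = "https://example.org/own-1")
  )),
  referenced_tweets = list(list(
    type = "quoted",
    entities = list(urls = list(
      list(expanded_url = "https://example.org/ref-1"),
      list(expanded_url = "https://example.org/ref-2")
    )),
    # Tweet referenced by the referenced tweet (second nesting level)
    referenced_tweets = list(list(
      entities = list(urls = list(
        list(expanded_url = "https://example.org/ref-of-ref-1"),
        list(expanded_url = "https://example.org/ref-of-ref-2")
      ))
    ))
  ))
)

ref  <- tweet$referenced_tweets[[1]]  # referenced tweet
ref2 <- ref$referenced_tweets[[1]]    # tweet referenced by the referenced tweet

# The same four paths used in the extraction loop
paths <- c(
  url0 = tweet$entities$urls[[1]]$expanded_url,
  url1 = ref2$entities$urls[[1]]$expanded_url,
  url2 = ref2$entities$urls[[2]]$expanded_url,
  url3 = ref$entities$urls[[2]]$expanded_url
)
print(paths)
```

With this mock tweet, url0 is the tweet's own first URL, url1 and url2 are the first and second URLs of the tweet referenced by the referenced tweet, and url3 is the second URL of the referenced tweet. The first URL of the referenced tweet ("ref-1") is not read by any of the four paths.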
